#data-science
2023-10-30
geraldodev08:10:00

Hi, I'm trying to export a DBF (XBase) file as a Parquet file using the parquet/ds->parquet function. I'm passing a vector of maps, but the order of the fields in the Parquet file is not preserved. I think this is because maps do not guarantee order. Initially, I tried using a vector of vectors, but I got an error message about MapEntry. I then realized that the parquet/ds->parquet function requires a vector of maps. Is it possible to create a Parquet file with the order of the fields preserved? Bard says that I need to provide a schema, which is created inside the ds-seq->parquet function https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1110.

otfrom10:10:01

the parquet file will be written from the dataset objects, not the vector of maps. The vector of maps would be a good way to make the datasets, and then you can use various functions to re-order, re-name, or change the types of the columns. The Parquet file should be written out according to what you've defined in the first dataset

otfrom11:10:48

tech.v3.dataset/select-columns should force the ordering you want

👍 1
❤️ 1
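A minimal sketch of the flow otfrom describes: build the dataset from a vector of maps, then force the column ordering with `select-columns` before writing. The column names and file path here are hypothetical; the requires assume the usual `tech.v3.dataset` / `tech.v3.libs.parquet` aliases.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Map key order is not guaranteed, so the initial column order may vary.
(def raw (ds/->dataset [{:name "a" :qty 1 :price 2.5M}
                        {:name "b" :qty 2 :price 3.0M}]))

;; select-columns pins the column ordering you want, then write.
(-> raw
    (ds/select-columns [:name :qty :price])
    (parquet/ds->parquet "out.parquet"))
```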
otfrom11:10:09

parquet/ds->parquet actually wants a vector of datasets

otfrom11:10:57

(datasets can be created from seqs/vectors of maps, but are sort of maps of columns rather than vectors of rows)

☝️ 1
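To illustrate the column-major point above, a small sketch (assuming only `tech.v3.dataset`; the column names are hypothetical):

```clojure
(require '[tech.v3.dataset :as ds])

;; A dataset is a map of column-name -> column,
;; even when constructed from a seq of row maps.
(def d (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

(ds/column-names d)  ;; column order may vary when built from maps
(ds/column d :a)     ;; fetch a whole column, not a row
(ds/rows d)          ;; row-oriented views are still available
```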
geraldodev10:10:13

I'm getting Execution error at tech.v3.libs.parquet/column->field (parquet.clj:888). Unsupported datatype for parquet writing: :decimal with:

(-> (ds/->dataset [{:dc (bigdec 3)}])
    (parquet/ds->parquet "teste.parquet"))
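One possible workaround sketch until :decimal is supported on write, assuming `tech.v3.dataset/column-cast` is available: cast the BigDecimal column to :float64 first. This is lossy, since a double cannot represent every BigDecimal value exactly.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

;; Cast the :decimal column to :float64 before writing, sidestepping
;; the unsupported-datatype error at the cost of decimal semantics.
(-> (ds/->dataset [{:dc (bigdec 3)}])
    (ds/column-cast :dc :float64)
    (parquet/ds->parquet "teste.parquet"))
```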

Ben Kamphaus14:10:32

different parquet libs handle smuggling more complex types through the base types differently, IIRC. Looks like it's not here (where the error hits): https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L879-L909 though I do see it here: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L712-L734 and in the base apache lib: https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.8.1/org/apache/parquet/schema/OriginalType.html I'd be curious to see the stack trace, as it looks like it might just be a bit of wiring missing… (I guess I could take a stab at reproducing it from your minimal example to poke at it)

geraldodev07:10:04

#error {
 :cause "Unsupported datatype for parquet writing: :decimal"
 :via
 [{:type java.lang.Exception
   :message "Unsupported datatype for parquet writing: :decimal"
   :at [tech.v3.libs.parquet$column__GT_field invokeStatic "parquet.clj" 888]}]
 :trace
 [[tech.v3.libs.parquet$column__GT_field invokeStatic "parquet.clj" 888]
  [tech.v3.libs.parquet$column__GT_field invoke "parquet.clj" 863]
  [clojure.core$mapv$fn__8535 invoke "core.clj" 6979]
  [clojure.core.protocols$iter_reduce invokeStatic "protocols.clj" 49]
  [clojure.core.protocols$fn__8238 invokeStatic "protocols.clj" 75]
  [clojure.core.protocols$fn__8238 invoke "protocols.clj" 75]
  [clojure.core.protocols$fn__8178$G__8173__8191 invoke "protocols.clj" 13]
  [clojure.core$reduce invokeStatic "core.clj" 6886]
  [clojure.core$mapv invokeStatic "core.clj" 6970]
  [clojure.core$mapv invoke "core.clj" 6970]
  [tech.v3.libs.parquet$ds__GT_schema invokeStatic "parquet.clj" 915]
  [tech.v3.libs.parquet$ds__GT_schema invoke "parquet.clj" 912]
  [tech.v3.libs.parquet$ds_seq__GT_parquet invokeStatic "parquet.clj" 1110]
  [tech.v3.libs.parquet$ds_seq__GT_parquet invoke "parquet.clj" 1086]
  [tech.v3.libs.parquet$ds__GT_parquet invokeStatic "parquet.clj" 1158]
  [tech.v3.libs.parquet$ds__GT_parquet invoke "parquet.clj" 1152]
  [tech.v3.libs.parquet$ds__GT_parquet invokeStatic "parquet.clj" 1160]
  [tech.v3.libs.parquet$ds__GT_parquet invoke "parquet.clj" 1152]
  [sihsus.core$eval27995 invokeStatic "NO_SOURCE_FILE" 91]
  [sihsus.core$eval27995 invoke "NO_SOURCE_FILE" 90]
  [clojure.lang.Compiler eval "Compiler.java" 7194]
  [clojure.lang.Compiler eval "Compiler.java" 7149]
  [clojure.core$eval invokeStatic "core.clj" 3215]
  [clojure.core$eval invoke "core.clj" 3211]
  [nrepl.middleware.interruptible_eval$evaluate$fn__1275$fn__1276 invoke "interruptible_eval.clj" 87]
  [clojure.lang.AFn applyToHelper "AFn.java" 152]
  [clojure.lang.AFn applyTo "AFn.java" 144]
  [clojure.core$apply invokeStatic "core.clj" 667]
  [clojure.core$with_bindings_STAR_ invokeStatic "core.clj" 1990]
  [clojure.core$with_bindings_STAR_ doInvoke "core.clj" 1990]
  [clojure.lang.RestFn invoke "RestFn.java" 425]
  [nrepl.middleware.interruptible_eval$evaluate$fn__1275 invoke "interruptible_eval.clj" 87]
  [clojure.main$repl$read_eval_print__9206$fn__9209 invoke "main.clj" 437]
  [clojure.main$repl$read_eval_print__9206 invoke "main.clj" 437]
  [clojure.main$repl$fn__9215 invoke "main.clj" 458]
  [clojure.main$repl invokeStatic "main.clj" 458]
  [clojure.main$repl doInvoke "main.clj" 368]
  [clojure.lang.RestFn invoke "RestFn.java" 1523]
  [nrepl.middleware.interruptible_eval$evaluate invokeStatic "interruptible_eval.clj" 84]
  [nrepl.middleware.interruptible_eval$evaluate invoke "interruptible_eval.clj" 56]
  [nrepl.middleware.interruptible_eval$interruptible_eval$fn__1308$fn__1312 invoke "interruptible_eval.clj" 152]
  [clojure.lang.AFn run "AFn.java" 22]
  [nrepl.middleware.session$session_exec$main_loop__1378$fn__1382 invoke "session.clj" 218]
  [nrepl.middleware.session$session_exec$main_loop__1378 invoke "session.clj" 217]
  [clojure.lang.AFn run "AFn.java" 22]
  [java.lang.Thread run "Thread.java" 829]]}

Ben Kamphaus16:10:00

My understanding of parquet is that it has a very limited set of primitive base types (as listed there), and then logical/original data types that can be stored using the primitive types. Essentially an abstraction over the primitives to enable some kind of handshake w/ other parquet libs/implementers, e.g. that something in the int64 slot is really a decimal.

My read on the error is that this is either a bug or a deliberate omission in the tech v3 parquet impl: it doesn't extract the corresponding primitive type for the decimal logical type and list it as an original type in the arg to the apache java lib PrimitiveType constructor. My read on the fix required is that decimal would need to be added to the appropriate primitive data type set here: https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L852-L855C1 and then given an original type case clause here: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L892-L907

There appear to be different options for encoding according to the apache spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal — so it may be that, from tech v3 dataset's PoV, the encoding choice just hasn't been explicitly made yet? Encoding choices might have to be made by taking into consideration how other impls across languages encode decimals, and/or inspecting the size of the decimal w/r/t possible representations.

@UDRJMEFSN @UJ7RSSWDU ^ for confirmation or other discussion. If this analysis sounds correct and it's not something y'all would prioritize soon otherwise, I could take a stab at a PR. But it looks like there are some nontrivial details re: the decimal encoding that would require further research and a consensus.

geraldodev20:10:33

How I ended up in this spot: I'm using a Java library that reads XBase files as java.math.BigDecimal. The objective is for DuckDB to read the parquet file, so a custom format is not useful. I started with Python code using pyarrow, and it appears to convert to float, because in VisiData the money fields have % on the header.

geraldodev21:10:32

I'm inspecting the schema with pyarrow.parquet.ParquetDataset and I see double type and int64 type.

Ben Kamphaus22:10:22

it looks like duckdb has very particular Decimal parquet encoding requirements: https://duckdb.org/docs/sql/data_types/numeric.html#fixed-point-decimals I wonder if you’d be better off making the table directly via duckdb create table, w/an explicit duckdb supported decimal(width, scale) type, insert all the entries, then use duckdb’s https://duckdb.org/docs/guides/import/parquet_export.html copy to parquet.
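A sketch of that route, using DuckDB's documented CREATE TABLE and COPY … TO (FORMAT PARQUET) syntax. The table and column names here are hypothetical:

```sql
-- Create a table with an explicit fixed-point decimal type.
CREATE TABLE money (sp_valato DECIMAL(14, 2));
INSERT INTO money VALUES (3.00), (14.25);

-- Export with DuckDB's own decimal parquet encoding.
COPY money TO 'money.parquet' (FORMAT PARQUET);
```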

geraldodev22:10:18

I suspect that is about how it stores decimals internally. I've asked on their Discord about parquet, decimal and DuckDB. Meanwhile I'm trying to generate a parquet file with a decimal type and see if DuckDB reads it properly. (I think it will)

geraldodev23:10:45

@U06GLTD17 hawkfish (from their labs) pointed me to the parquet decimal documentation, as you did earlier. It looks like (at least from the point of view of handling parquet files with DuckDB) the way to go is to export java.math.BigDecimal as the parquet Decimal type.

geraldodev09:11:47

What I did to get a proper parquet file: exported the DBF to Postgres with pgdbf, connected Postgres with DuckDB, and used COPY to export a parquet file to the file system. I want to highlight the non-string fields that DuckDB mapped from Postgres (output from a script that uses pyarrow.parquet to show the schema):
sp_qtd_ato: decimal128(4, 0)
sp_valato: decimal128(14, 2)
sequencia: decimal128(9, 0)
sp_qt_proc: decimal128(4, 0)

chrisn12:11:48

Sorry I haven't responded on this thread, but it seems like you have solved your issue. Currently there is read-only support for that type in tmd-parquet and none in the tmd duckdb bindings. Adding support for that type is possible, of course.

geraldodev15:11:34

I'm experimenting with parquet/DuckDB at home. For my work I'd like to use Clojure to move JDBC data from Oracle to parquet. It would be nice if TMD supported mapping BigDecimals to the parquet Decimal type.

Ben Kamphaus15:11:34

yeah, I spelunked through the dataset code a bit more and explored other implementers' libs for decimal support. I think this would have to be a more significant refactor, allowing options to set column-specific parquet encodings, to enable any kind of generic decimal support.

To match JVM BigDecimal semantics, you'd have to use a BYTE_ARRAY encoding (arbitrary precision). But DuckDB, as the docs linked up there show, only supports int32/64/128 encodings. So while other readers could decode the BigDecimal as a bytearray, any default behavior could only support DuckDB-compatible decimal logical types by violating BigDecimal semantics. Which, DuckDB decimal as a target is a kind of weird opinion to have in a default, and that opens up API use complexity and surprise around how the encoding happens for a generic lib like dataset. Does it silently infer the minimal necessary precision to store all decimals in a field, by scanning the values and fitting a range and scale? Should it throw or truncate? These smell to me like behaviors that should only happen when explicitly set in options, and even then I think it surpasses the complexity you'd want for conditions on optional args, and probably belongs in "supply your own encoders for additional types" land.

For instance, in the postgres export you highlight above, the inference behavior seems very surprising to me, and overly broad and narrow at the same time! Three of the types have a decimal scale of 0 (zero digits after the decimal point), and they're all decimal128 (so in int128 according to the DuckDB docs), but the sufficient ranges for those widths in the DuckDB docs are: int16 for the width-of-4 cases, int32 for the width-of-9 case, and int64 for the width-of-14 case. So the default export from postgres w/ DuckDB -> parquet seems really suboptimal re: value packing, while also resulting in extremely restrictive decimal types (three of which might as well not even be decimals).
A fixed precision decimal encoded in 128 bits with values between 0-9999 and no values allowed after the decimal place definitely falls into “thanks I hate it” territory for me re: default numeric type inference behavior 🙂

geraldodev15:11:21

Those fields came from postgres by using the copy command to a parquet file:
sp_qtd_ato numeric(4) NULL,
sp_valato numeric(14, 2) NULL,
sequencia numeric(9) NULL,
sp_qt_proc numeric(4) NULL

geraldodev15:11:44

From the parquet decimal doc: "If not specified, the scale is 0. Scale must be zero or a positive integer less than or equal to the precision. Precision is required and must be a non-zero positive integer. A precision too large for the underlying type (see below) is an error." In databases we have this information. JDBC drivers read numbers as BigDecimals, so I think it would be nice if we could pass this information to the parquet generation process. About the DuckDB compatibility: it would be more a matter of parquet compatibility. If possible, I will be using TMD + DuckDB exclusively.
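The precision/scale information mentioned above is indeed exposed by JDBC through the standard `java.sql.ResultSetMetaData` methods `getPrecision` and `getScale`. A hypothetical sketch of pulling it (the connection `conn`, table, and column name are assumptions):

```clojure
;; Assumes an open java.sql.Connection bound to `conn`.
;; Column indexes in JDBC metadata are 1-based.
(with-open [stmt (.createStatement conn)
            rs   (.executeQuery stmt "SELECT sp_valato FROM sih LIMIT 1")]
  (let [md (.getMetaData rs)]
    {:precision (.getPrecision md 1)   ;; e.g. 14 for numeric(14, 2)
     :scale     (.getScale md 1)}))    ;; e.g. 2
```

This is exactly the (precision, scale) pair the parquet Decimal logical type requires, so a write path could in principle take it from column metadata rather than scanning values.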

chrisn15:11:52

Write support would mean, I guess, a scanning process for the column to pick out those two numbers (precision and scale) and then store them in the column metadata. Read support is much easier, as someone else has already stored those numbers.

chrisn15:11:07

My assumption here is they are fixed per-column and can't change per record.

chrisn15:11:55

There is some overlap here with supporting map, struct, and list types, where you need to fix some more in-depth column schema information and ensure the entire column does in fact adhere to those conventions.

chrisn15:11:35

This is coming up though - people are asking for various overlapping subsets of these features. I would be happy to do the work but I want to get paid for it.

👍 2