This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-10-30
Channels
- # announcements (2)
- # babashka (37)
- # beginners (64)
- # biff (27)
- # cherry (7)
- # cider (19)
- # clj-kondo (10)
- # clojure-austin (4)
- # clojure-doc (18)
- # clojure-europe (72)
- # clojure-nl (1)
- # clojure-norway (13)
- # clojure-uk (5)
- # clojurescript (18)
- # data-science (28)
- # events (5)
- # graalvm (32)
- # hyperfiddle (6)
- # introduce-yourself (1)
- # jobs (4)
- # joyride (16)
- # juxt (6)
- # malli (7)
- # missionary (3)
- # off-topic (18)
- # pathom (15)
- # portal (14)
- # re-frame (14)
- # reitit (5)
- # releases (1)
- # rum (2)
- # sci (1)
- # shadow-cljs (102)
- # spacemacs (3)
- # sql (6)
- # web-security (2)
- # xtdb (10)
Hi,
I'm trying to export a DBF (XBase) file as a Parquet file using the parquet/ds->parquet function. I'm passing a vector of maps, but the order of the fields in the Parquet file is not preserved. I think this is because maps do not guarantee order.
Initially, I tried using a vector of vectors, but I got an error message about MapEntry. I then realized that the parquet/ds->parquet function requires a vector of maps.
Is it possible to create a Parquet file with the order of the fields preserved? Bard says that I need to provide a schema, which is created inside the ds-seq->parquet function: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L1110
the parquet file will be written from the dataset objects, not the vector of maps. The vector of maps is a good way to make the datasets, and then you can use various functions to re-order, rename, or change the types of the columns. The parquet should be written out according to what you've defined in the first dataset
(datasets can be created from seqs/vectors of maps, but they are sort of maps of columns rather than vectors of rows)
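A minimal sketch of that flow (the column names here are made up, not from the thread): ds/select-columns should return the columns in the order given, and that order is what ds->parquet writes out.

(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.parquet :as parquet])

(def rows [{:id 1 :name "a" :value 2.5}
           {:id 2 :name "b" :value 3.5}])

(-> (ds/->dataset rows)
    ;; maps don't guarantee key order, so fix the column order explicitly
    (ds/select-columns [:id :name :value])
    (parquet/ds->parquet "ordered.parquet"))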
Thank you @U0525KG62
I'm getting Execution error at tech.v3.libs.parquet/column->field (parquet.clj:888). Unsupported datatype for parquet writing: :decimal with:
(-> (ds/->dataset [{:dc (bigdec 3)}])
    (parquet/ds->parquet "teste.parquet"))
different parquet libs handle smuggling more complex types through the base types differently, iirc. looks like it's not here (where the error hits): https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L879-L909 though I do see it here: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L712-L734 and in the base apache lib: https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.8.1/org/apache/parquet/schema/OriginalType.html
I'd be curious to see the stack trace, as it looks like it might just be a bit of wiring missing… (I guess I could take a stab at reproducing it from your minimal example to poke at it)
#error {
:cause "Unsupported datatype for parquet writing: :decimal"
:via
[{:type java.lang.Exception
:message "Unsupported datatype for parquet writing: :decimal"
:at [tech.v3.libs.parquet$column__GT_field invokeStatic "parquet.clj" 888]}]
:trace
[[tech.v3.libs.parquet$column__GT_field invokeStatic "parquet.clj" 888]
[tech.v3.libs.parquet$column__GT_field invoke "parquet.clj" 863]
[clojure.core$mapv$fn__8535 invoke "core.clj" 6979]
[clojure.core.protocols$iter_reduce invokeStatic "protocols.clj" 49]
[clojure.core.protocols$fn__8238 invokeStatic "protocols.clj" 75]
[clojure.core.protocols$fn__8238 invoke "protocols.clj" 75]
[clojure.core.protocols$fn__8178$G__8173__8191 invoke "protocols.clj" 13]
[clojure.core$reduce invokeStatic "core.clj" 6886]
[clojure.core$mapv invokeStatic "core.clj" 6970]
[clojure.core$mapv invoke "core.clj" 6970]
[tech.v3.libs.parquet$ds__GT_schema invokeStatic "parquet.clj" 915]
[tech.v3.libs.parquet$ds__GT_schema invoke "parquet.clj" 912]
[tech.v3.libs.parquet$ds_seq__GT_parquet invokeStatic "parquet.clj" 1110]
[tech.v3.libs.parquet$ds_seq__GT_parquet invoke "parquet.clj" 1086]
[tech.v3.libs.parquet$ds__GT_parquet invokeStatic "parquet.clj" 1158]
[tech.v3.libs.parquet$ds__GT_parquet invoke "parquet.clj" 1152]
[tech.v3.libs.parquet$ds__GT_parquet invokeStatic "parquet.clj" 1160]
[tech.v3.libs.parquet$ds__GT_parquet invoke "parquet.clj" 1152]
[sihsus.core$eval27995 invokeStatic "NO_SOURCE_FILE" 91]
[sihsus.core$eval27995 invoke "NO_SOURCE_FILE" 90]
[clojure.lang.Compiler eval "Compiler.java" 7194]
[clojure.lang.Compiler eval "Compiler.java" 7149]
[clojure.core$eval invokeStatic "core.clj" 3215]
[clojure.core$eval invoke "core.clj" 3211]
[nrepl.middleware.interruptible_eval$evaluate$fn__1275$fn__1276 invoke "interruptible_eval.clj" 87]
[clojure.lang.AFn applyToHelper "AFn.java" 152]
[clojure.lang.AFn applyTo "AFn.java" 144]
[clojure.core$apply invokeStatic "core.clj" 667]
[clojure.core$with_bindings_STAR_ invokeStatic "core.clj" 1990]
[clojure.core$with_bindings_STAR_ doInvoke "core.clj" 1990]
[clojure.lang.RestFn invoke "RestFn.java" 425]
[nrepl.middleware.interruptible_eval$evaluate$fn__1275 invoke "interruptible_eval.clj" 87]
[clojure.main$repl$read_eval_print__9206$fn__9209 invoke "main.clj" 437]
[clojure.main$repl$read_eval_print__9206 invoke "main.clj" 437]
[clojure.main$repl$fn__9215 invoke "main.clj" 458]
[clojure.main$repl invokeStatic "main.clj" 458]
[clojure.main$repl doInvoke "main.clj" 368]
[clojure.lang.RestFn invoke "RestFn.java" 1523]
[nrepl.middleware.interruptible_eval$evaluate invokeStatic "interruptible_eval.clj" 84]
[nrepl.middleware.interruptible_eval$evaluate invoke "interruptible_eval.clj" 56]
[nrepl.middleware.interruptible_eval$interruptible_eval$fn__1308$fn__1312 invoke "interruptible_eval.clj" 152]
[clojure.lang.AFn run "AFn.java" 22]
[nrepl.middleware.session$session_exec$main_loop__1378$fn__1382 invoke "session.clj" 218]
[nrepl.middleware.session$session_exec$main_loop__1378 invoke "session.clj" 217]
[clojure.lang.AFn run "AFn.java" 22]
[java.lang.Thread run "Thread.java" 829]]}
Where is the :decimal handled in https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L888 ?
My understanding of parquet is that it has a very limited set of primitive base types (as listed there), and then logical/original data types that can be stored using the primitive types. Essentially an abstraction over the primitives to enable some kind of handshake w/other parquet libs/implementers, e.g. that something in the int64 slot is really a decimal.
My read on the error is that this is either a bug or a deliberate omission in the tech v3 parquet impl, which doesn't extract the corresponding primitive type for the decimal logical type and list it as an original type in the arg to the apache java lib PrimitiveType constructor. My read on the fix required is that decimal would need to be added to the appropriate primitive data type set here: https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L852-L855C1 and then given an original type case clause here: https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/parquet.clj#L892-L907
There appear to be different options for encoding according to the apache spec: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal so it may be that, from tech v3 dataset's PoV, the encoding choice just hasn't been explicitly made yet? Encoding choices might have to be made by taking into consideration how other impls across languages encode decimals, and/or inspecting the size of the decimal w/r/t possible representations.
@UDRJMEFSN @UJ7RSSWDU ^ -- for confirmation or other discussion. If this analysis sounds correct and it's not something y'all would prioritize soon otherwise, I could take a stab at a PR. But it looks like there are some nontrivial details re: the decimal encoding that would require further research and a consensus.
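To make the primitive-plus-logical-type pairing described above concrete, here is a small sketch against the apache parquet schema API directly (not tech.ml.dataset's API; the field name and precision/scale are made up, and it assumes parquet-column is on the classpath). The base type is a plain binary slot, and DECIMAL(precision, scale) is the annotation that tells other readers how to interpret it.

(import '[org.apache.parquet.schema MessageTypeParser])

;; "binary" is the primitive base type; DECIMAL(14,2) is the logical
;; annotation layered on top of it.
(def decimal-schema
  (MessageTypeParser/parseMessageType
    "message example { optional binary sp_valato (DECIMAL(14,2)); }"))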
On the read side, there is some prior art:
• https://github.com/techascent/tech.ml.dataset/issues/296
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L252-L254
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L296-L299
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L332-L339
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L341-L347
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L715
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/src/tech/v3/libs/parquet.clj#L763-L765
And this test:
• https://github.com/techascent/tech.ml.dataset/blob/8ceafe535ecd506e5241a0815d1bb61378948a04/test/tech/v3/libs/parquet_test.clj#L118-L121
Which references this data file:
• https://github.com/techascent/tech.ml.dataset/blob/master/test/data/decimaltable.parquet
I do concur that the write side is unsupported. I haven't personally looked into it.
How I ended up in this spot: I'm using a java library that reads the XBase numeric fields as java.math.BigDecimal. The objective is for DuckDb to read the parquet file, so a custom format is not useful. I started with python code using pyarrow, and it appears to convert to float, because in visidata the money fields have % on the header.
I'm inspecting the schema with pyarrow.parquet.ParquetDataset and I see double and int64 types.
it looks like duckdb has very particular Decimal parquet encoding requirements: https://duckdb.org/docs/sql/data_types/numeric.html#fixed-point-decimals I wonder if you'd be better off making the table directly via a duckdb CREATE TABLE with an explicit duckdb-supported decimal(width, scale) type, inserting all the entries, then using duckdb's copy to parquet: https://duckdb.org/docs/guides/import/parquet_export.html
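For illustration, that route might look roughly like this from Clojure, assuming the DuckDB JDBC driver and next.jdbc are on the classpath (table, column, and file names are made up):

(require '[next.jdbc :as jdbc])

;; "jdbc:duckdb:" opens an in-memory database
(with-open [conn (jdbc/get-connection "jdbc:duckdb:")]
  ;; declare the decimal type explicitly, within duckdb's supported widths
  (jdbc/execute! conn ["CREATE TABLE atos (sp_valato DECIMAL(14,2))"])
  (jdbc/execute! conn ["INSERT INTO atos VALUES (123.45)"])
  ;; duckdb writes the parquet file using its own decimal encoding
  (jdbc/execute! conn ["COPY atos TO 'atos.parquet' (FORMAT PARQUET)"]))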
I suspect that is about how it stores decimals internally. I've asked on their Discord about parquet, decimal, and duckdb. Meanwhile I'm trying to generate a parquet file with a decimal type to see if DuckDb reads it properly. (I think it will.)
@U06GLTD17 hawkfish (from their labs) pointed me to the parquet decimal documentation, as you did earlier. It looks like (at least from the point of view of handling parquet files with duckdb) the way to go is to export java.math.BigDecimal as the parquet Decimal type.
What I did to get a proper parquet file: exported the dbf to postgres with pgdbf, connected postgres with duckdb, and used COPY to export a parquet file on the file system. I want to highlight the non-string fields that duckdb mapped from postgres (output from a script that uses pyarrow.parquet to show the schema):
sp_qtd_ato: decimal128(4, 0)
sp_valato: decimal128(14, 2)
sequencia: decimal128(9, 0)
sp_qt_proc: decimal128(4, 0)
Sorry I haven't responded on this thread, but it seems like you have solved your issue. Currently there is read-only support for that type in tmd-parquet and none in the tmd duckdb bindings; adding support for that type is possible, of course.
I'm experimenting with parquet/duckdb at home. For my work I'd like to use clojure to move jdbc data from oracle to parquet. It would be nice if TMD supported mapping BigDecimals to the parquet Decimal type.
yeah, I spelunked through the dataset code a bit more and explored other implementers' libs for decimal support. I think this would have to be a more significant refactor, allowing options to set column-specific parquet encodings, to enable any kind of generic decimal support. To match jvm BigDecimal semantics, you'd have to use a BYTE_ARRAY encoding (arbitrary precision). But duckdb, as the docs linked up there show, only supports int32/64/128 encodings. So while other readers could decode the BigDecimal as a byte array, this means that any default behavior could only support duckdb-compatible decimal logical types by violating BigDecimal semantics. And duckdb decimal as a target is a kind of weird opinion to have in a default, and that opens up api use complexity and surprise around how the encoding happens for a generic lib like dataset. Does it silently infer the minimal necessary precision to store all decimals in a field by scanning the values and fitting a range and scale? Should it throw or truncate? These smell to me like behaviors that should only happen when explicitly set in options, and even then I think it surpasses the complexity you'd want for conditions on optional args, and probably belongs in "supply your own encoders for additional types" land.
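Concretely, the BYTE_ARRAY route maps straight onto BigDecimal's own representation: per the parquet spec, the unscaled value is stored as big-endian two's-complement bytes, with precision/scale carried in the schema. A quick REPL illustration:

(let [d (bigdec "12345.67")]
  {:scale     (.scale d)                                ; 2
   :precision (.precision d)                            ; 7
   ;; unscaled value 1234567 as big-endian two's-complement bytes
   :bytes     (vec (.toByteArray (.unscaledValue d)))})
;; => {:scale 2, :precision 7, :bytes [18 -42 -121]}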
For instance, in the postgres export you highlight above, the inference behavior seems very surprising to me, and overly broad and narrow at the same time! Three of the types have a decimal scale of 0 (zero digits after the decimal point), and they're all decimal128 (so in int128 according to the duckdb docs), but the sufficient ranges for those widths in the duckdb docs are: int16 for the width of 4, int32 for the width of 9 case, and int64 for the width of 14 case. So the default export case from postgres w/duckdb -> parquet seems really suboptimal re: value packing while also resulting in extremely restrictive decimal types (three of which might as well not even be decimals).
A fixed precision decimal encoded in 128 bits with values between 0-9999 and no values allowed after the decimal place definitely falls into "thanks I hate it" territory for me re: default numeric type inference behavior.
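For reference, the width-to-storage table from the duckdb docs linked above, written out as a small helper; it reproduces the arithmetic in the previous message (width 4 fits int16, width 9 fits int32, width 14 fits int64):

(defn duckdb-decimal-storage
  "Minimal internal storage duckdb needs for DECIMAL(width, _), per its docs."
  [width]
  (cond
    (<= 1 width 4)  :int16
    (<= width 9)    :int32
    (<= width 18)   :int64
    (<= width 38)   :int128
    :else (throw (ex-info "duckdb's DECIMAL width maxes out at 38" {:width width}))))

(mapv duckdb-decimal-storage [4 9 14]) ;; => [:int16 :int32 :int64]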
Those fields came from postgres by using the COPY command to a parquet file:
sp_qtd_ato numeric(4) NULL,
sp_valato numeric(14, 2) NULL,
sequencia numeric(9) NULL,
sp_qt_proc numeric(4) NULL,
From the parquet decimal doc: "If not specified, the scale is 0. Scale must be zero or a positive integer less than or equal to the precision. Precision is required and must be a non-zero positive integer. A precision too large for the underlying type (see below) is an error." In databases we have this information. JDBC drivers read numbers as BigDecimals, so I think it would be nice if we could pass this information to the parquet generation process. About the duckdb compatibility: it would be more about parquet compatibility. If possible, I will be using TMD + DuckDB exclusively.
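For what it's worth, on the JDBC side that information is already exposed per column via java.sql.ResultSetMetaData, so a writer could in principle pick DECIMAL(precision, scale) without scanning values. A sketch (the ResultSet is assumed to come from whatever query was run):

(import '[java.sql ResultSet ResultSetMetaData])

(defn column-decimal-info
  "Pull per-column precision/scale from a live ResultSet's metadata."
  [^ResultSet rs]
  (let [^ResultSetMetaData md (.getMetaData rs)]
    (vec (for [i (range 1 (inc (.getColumnCount md)))]
           {:name      (.getColumnName md i)
            :precision (.getPrecision md i)
            :scale     (.getScale md i)}))))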
Write support would mean, I guess, a scanning process for the column to pick out those two numbers and then store them in the column metadata. Read support is much easier, as someone else has already stored those numbers.
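That scanning process, sketched (note that taking max precision and max scale independently can under-fit: 1000 at scale 0 rescaled to scale 3 needs 7 digits, so this combines max integer digits with max scale):

(defn scan-decimal-column
  "Find a DECIMAL(precision, scale) wide enough for every BigDecimal in the seq."
  [bigdecs]
  (let [fit (reduce (fn [acc ^java.math.BigDecimal d]
                      (-> acc
                          (update :int-digits max (- (.precision d) (.scale d)))
                          (update :scale max (.scale d))))
                    {:int-digits 0 :scale 0}
                    bigdecs)]
    {:precision (+ (:int-digits fit) (:scale fit))
     :scale     (:scale fit)}))

(scan-decimal-column [12.5M 3.141M 1000M])
;; => {:precision 7, :scale 3}   ; 1000.000 needs 4 integer + 3 fraction digits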
There is some overlap here with supporting map, struct, and list types, where you need to fix some more in-depth column schema information and ensure the entire column does in fact adhere to those conventions.
This is coming up though - people are asking for various overlapping subsets of these features. I would be happy to do the work but I want to get paid for it.