#data-science
2022-08-21
kwladyka11:08:01

What is the correct way of preparing data from a DB with https://github.com/scicloj? I have a table with rows like this:

#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
                           :b_close 28900.93000000M,
                           :b_min 28887.95000000M,
                           :trades 436,
                           :b_open 28942.95000000M,
                           :b_max 28964.34000000M,
                           :open_time #time/instant"2021-01-01T00:09:00Z",
                           :b_volume 649748.33443327M,
                           :a_volume 22.47045700M,
                           :a "BTC",
                           :b "BUSD",
                           :taker_buy_a_volume 5.72019700M}
I am doing (postgres/query ["SELECT * FROM binance_candlesticks_1m WHERE a='BTC' AND b='BUSD' AND open_time BETWEEN '2021-01-01' AND '2021-12-31'"]) and want to prepare this data for AI. As I understand it, I have to use [scicloj.ml.dataset :as ds], which is a "proxy" to https://scicloj.github.io/tablecloth/ - is that correct? Should I get all the data from Postgres, prepare it, and then use ds/dataset, or something like that? Or is there some magic function like ds/magic-postgres table_name? What is the best practice?
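A minimal sketch, assuming postgres/query returns the sequence of maps shown above (both ds/dataset and tablecloth's dataset accept a sequence of maps directly):

(require '[scicloj.ml.dataset :as ds])

;; assumption: postgres/query is the user's own query function from above,
;; returning a sequence of maps
(def candles
  (ds/dataset
   (postgres/query
    ["SELECT * FROM binance_candlesticks_1m
      WHERE a='BTC' AND b='BUSD'
      AND open_time BETWEEN '2021-01-01' AND '2021-12-31'"])))

;; from here on the usual dataset/tablecloth functions apply
(ds/head candles)
(ds/column-names candles)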

kwladyka11:08:40

PS I have no experience with AI or scicloj, so it is not at all obvious to me how things should be done.

kwladyka11:08:57

I am trying to figure this out from the tutorials, but it doesn't help.

kwladyka11:08:01

*What I am doing right now: I did the Titanic tutorial and I want to achieve some simple success with my own data instead. So I just want to prepare the data, run AI on it, and use it for prediction. I don't care too much about accuracy or choosing an efficient model. I just want to get through the whole process, but with my data.

kwladyka11:08:02

But there is so much unknown that I don't know where to start to prepare the data following best practice. I will have to normalise it and add extra data like indicators etc.

kwladyka12:08:02

I can load everything into memory, but then I need a machine with a lot of memory to handle all the data at once. Does ds have a lazy way to build a dataset from a database, so it doesn't keep everything in memory at once?

kwladyka12:08:03

When you do (ds/dataset) from an SQL table, do you use as-unqualified-lower-maps? (i.e. `:binance_candlesticks_1m/b_open` vs b_open column names in the dataset table)
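For reference, as-unqualified-lower-maps comes from next.jdbc; a sketch of wiring it in, assuming the query goes through next.jdbc (db-spec is illustrative, not from the thread):

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs]
         '[tablecloth.api :as tc])

;; db-spec stands for whatever connectable the app already uses
(def rows
  (jdbc/execute! db-spec
                 ["SELECT * FROM binance_candlesticks_1m LIMIT 100"]
                 {:builder-fn rs/as-unqualified-lower-maps}))

;; column names come out as plain :b_open, :b_close, ...
;; instead of :binance_candlesticks_1m/b_open
(tc/dataset rows)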

kwladyka12:08:10

Do you build and normalise all the data in memory and run learning on it? Or do you split it up: build the data, save the prepared dataset to a file / DB, and later load it and learn? What are the best practices to make it easy to debug etc.?

kwladyka15:08:03

Why [scicloj.ml.dataset :as ds] instead of tablecloth directly? What is the advantage?

Mario Trost15:08:30

Isn't it the other way round and tablecloth uses dataset? (just curious, cannot answer your questions myself)

kwladyka15:08:18

As I understand it, https://scicloj.github.io/scicloj.ml-tutorials/userguide-intro.html teaches using [scicloj.ml.dataset :as ds]

Mario Trost15:08:55

from there:

As it was organically growing over time, it's [i.e. dataset's] API is functional and complete, but lacks consistency in some parts.

This was addressed by an other library, layering on top of it, called tablecloth. It is available here

kwladyka15:08:55

and tablecloth is used in [scicloj.ml.dataset :as ds]

kwladyka15:08:49

But hmm, the sentence can be interpreted both ways: either scicloj.ml.dataset is legacy and uses tablecloth under the hood, or it is the right one to use and uses tablecloth under the hood.

Mario Trost15:08:22

dataset is used in tablecloth

kwladyka15:08:25

Since the tutorial examples use scicloj.ml.dataset, I assumed it is the right one to use, because it is in the tutorial. Maybe that is a wrong assumption.

kwladyka15:08:39

but scicloj.ml.dataset is not [used in tablecloth]

kwladyka15:08:55

scicloj.ml.dataset requires tablecloth.api

kwladyka15:08:13

tablecloth has a dataset function, but that is a different story

kwladyka15:08:58

If I should use tablecloth.api directly and scicloj.ml.dataset is a legacy thing, then the tutorial does a really bad job by explaining and teaching legacy usage.

kwladyka15:08:18

but maybe it is like you suggested. No idea.

kwladyka15:08:44

Is it? Can someone answer this with confidence?

Mario Trost15:08:47

No, the other way round: tablecloth uses scicloj.ml.dataset and gives you a more consistent API that is similar to R's

kwladyka15:08:19

I literally see in scicloj.ml.dataset

(:require [tech.v3.datatype.export-symbols :as exporter]
            [scicloj.ml.utils :refer :all]
            [tech.v3.dataset.modelling]
            [tech.v3.dataset.column-filters]
            [tech.v3.dataset.categorical]

            [tablecloth.api] ; <<<<<<<<<<<<<
            )

🙌 1
kwladyka15:08:49

Because of that I don’t think your assumption is correct

Mario Trost15:08:32

oh sorry: I guess I confused scicloj.ml.dataset with tech.ml.dataset

👍 1
kwladyka15:08:12

My main question still stands: use [scicloj.ml.dataset :as ds] or tablecloth.api directly?

Carsten Behring21:09:03

The namespace scicloj.ml.dataset contains all the functions of tablecloth.api and some more.

Carsten Behring21:09:04

The "some more" are machine-learning related and come from tech.ml.dataset. The functions are re-exported into a new namespace, so they are exactly the same. The reason for that is to reduce the number of namespaces to be used (to 4); the original code is spread over more than 20 namespaces.
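A tiny illustration of the re-export, assuming both libraries are on the classpath and reusing the candles dataset sketched earlier:

(require '[scicloj.ml.dataset :as ds]
         '[tablecloth.api :as tc])

;; same-named functions, same behaviour; only the namespace differs
(ds/select-columns candles [:b_open :b_close :trades])
(tc/select-columns candles [:b_open :b_close :trades])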

Carsten Behring21:09:53

So in practice it does not matter which one you use.

Carsten Behring22:09:42

What all functions in the ns scicloj.ml.dataset have in common is that the dataset is the first argument, and they mostly return a dataset.

Carsten Behring22:09:16

Keep in mind as well that the ns scicloj.ml.metamorph has functions with the same names as the ones in scicloj.ml.dataset,

Carsten Behring22:09:52

but those take the metamorph context as the first argument, so they are meant to be used inside pipelines.
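A rough sketch of the two flavours, loosely following the scicloj.ml userguide (the column names come from the candlestick table above; the concrete steps are illustrative):

(require '[scicloj.ml.core :as ml]
         '[scicloj.ml.dataset :as ds]
         '[scicloj.ml.metamorph :as mm])

;; dataset-first: dataset in, dataset out
(ds/select-columns candles [:b_open :b_close :trades])

;; context-first: same-named functions, meant for metamorph pipelines
(def pipe-fn
  (ml/pipeline
   (mm/select-columns [:b_open :b_close :trades])
   (mm/set-inference-target :b_close)))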

kwladyka16:08:41

Actually, why use dataset at all when I can just use pure Clojure? I can do (postgres/query ["SELECT * FROM binance_candlesticks_1m WHERE a='BTC' AND b='BUSD' AND open_time BETWEEN '2021-01-01' AND '2021-03-31'"]) and add extra values to each map

#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
                           :b_close 28900.93000000M,
                           :b_min 28887.95000000M,
                           :trades 436,
                           :b_open 28942.95000000M,
                           :b_max 28964.34000000M,
                           :open_time #time/instant"2021-01-01T00:09:00Z",
                           :b_volume 649748.33443327M,
                           :a_volume 22.47045700M,
                           :a "BTC",
                           :b "BUSD",
                           :taker_buy_a_volume 5.72019700M}
with the map function. What is the advantage of using the dataset fn here? Is this a good rule: if it is equally simple or simpler to do in pure Clojure, then use pure Clojure instead of dataset?

Mario Trost17:08:24

I think this question gets answered in this podcast interview. The point that I remember is that performance will be much better for large datasets (when you get to >100k elements/rows). It is a very good and recent interview with the creator of tech.ml.dataset (and more), and I imagine it can help clarify some of your questions, together with the follow-up episode: https://podcasts.apple.com/us/podcast/e72-data-processing-with-chris-nuernberger-part-1/id1461500416?i=1000565597470

👍 1
kwladyka18:08:28

Do you know what the difference in performance is? Any specific numbers?

kwladyka18:08:10

I strongly prefer simplicity over performance. I don’t think it matters too much if it will take 30 sec. or 50 sec.

kwladyka18:08:47

or even 1 minute vs 5 minutes. I mean, Clojure itself is very powerful when it comes to performance. I don't expect a big difference.

kwladyka18:08:59

simplicity always wins for me

kwladyka18:08:21

I am curious how you see this. Why do you prefer to use dataset instead of pure Clojure?

Mario Trost18:08:03

I don't have much practical experience in Clojure yet and am in no way able to answer with any authority here. That being said: my understanding is that dataset and tablecloth also simplify manipulating table-like datasets, where you do a lot of selecting columns/fields, grouping and aggregating. In the tablecloth documentation you find many examples comparing functionality from R's dplyr, tidyr and data.table, which I found instructive while skimming it some weeks ago. But again: I have never used it or Clojure in anger. I hope someone else can answer your questions in more detail.

👍 1
Mario Trost11:08:41

I just remembered that the Clojure data science community is way more active on Zulip than here on Slack. Perhaps there is a better chance to get answers there; here is the beginners stream: https://clojurians.zulipchat.com/#narrow/stream/151763-beginners

👍 1
kwladyka11:08:28

yes, I found Zulip to be much more active

Carsten Behring22:09:04

dataset does not only give better performance (speed), it also uses "far, far less memory" and results in far less garbage collection. The reason is that the standard Clojure representation of tabular data, a sequence of maps, is pretty much the worst case: it produces a lot of small objects, while dataset stores the data in native Java arrays. A 1 GB CSV file of data can easily require 30 or more GB of heap when stored as a sequence of maps; when stored as a tech.ml.dataset, it's maybe 2 GB of heap. -> Huge difference
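A small way to see that columnar layout, with a couple of illustrative rows trimmed down from the example above:

(require '[tablecloth.api :as tc])

;; sequence of maps: one persistent map plus boxed values per row
(def rows
  [{:b_open 28942.95 :b_close 28900.93 :trades 436}
   {:b_open 28900.93 :b_close 28887.95 :trades 401}])

;; dataset: the same data stored column-wise in typed buffers
(def small-ds (tc/dataset rows))

;; tc/info reports one datatype per column, reflecting the columnar storage
(tc/info small-ds)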

👍 1
kwladyka16:08:38

ok so let's summarise my questions:
1) Is it a good rule to modify data in pure Clojure before converting to a dataset, rather than converting to a dataset first and then modifying it? (As long as pure Clojure is at least as simple, or simpler.)
2) Is it recommended to use [tablecloth.api :as tc] directly instead of [scicloj.ml.dataset :as ds]? It would be better for me to use [tablecloth.api :as tc], because then I can easily jump to the source of a function and read the code to learn how to use it.
3) What is the best practice for preparing a dataset from PostgreSQL?
4) Which is the better practice:
a) PostgreSQL query -> convert to dataset -> do AI things
b) PostgreSQL query -> convert to dataset -> write to file -> read from file (probably in a different instance) -> do AI things
I see an advantage of b) for debugging. What is your practice?

kwladyka16:08:20

5) What is the most efficient way to normalise data for this use case? I have a value which changes like 100 -> 99.578934, but sometimes the change is bigger, 100 -> 91.5345. If I compute the % of change, like 100 / 99.578934, then the value will always be very close to 1, in this case 1.00422846463. What values should I use:
a) raw values, 100 and 99.578934
b) normalised values like 1.00422846463, not caring that most of them are around 1
c) is there some magic good practice for normalising values in such a use case?
d) or maybe the change in raw value instead of some kind of % of change? (- 100 99.578934)
*use case: these values are prices on a crypto market

#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
                           :b_close 28900.93000000M,
                           :b_min 28887.95000000M,
                           :trades 436,
                           :b_open 28942.95000000M,
                           :b_max 28964.34000000M,
                           :open_time #time/instant"2021-01-01T00:09:00Z",
                           :b_volume 649748.33443327M,
                           :a_volume 22.47045700M,
                           :a "BTC",
                           :b "BUSD",
                           :taker_buy_a_volume 5.72019700M}
for example b_open values
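Just to make the options concrete (a sketch of the arithmetic only, not a recommendation; the numbers are the ones from the question):

;; a few consecutive prices, as in the question
(def prices [100.0 99.578934 91.5345])

;; option d): change in raw value between consecutive rows
(map - (rest prices) prices)
;; ≈ (-0.421066 -8.044434)

;; option b): ratio of consecutive values, which clusters around 1
(map / prices (rest prices))
;; first pair ≈ 1.00422846463, as in the question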

kwladyka16:08:56

Any brave soul who wants to discuss these things with me?

genmeblog20:08:21

I can support you regarding tablecloth, but Tuesday or Wednesday at the earliest. I wrote you a PM.

👍 1
Carsten Behring08:09:18

Regarding 2): this is a "tooling" problem. It seems that Cursive cannot handle the namespaces dynamically created by http://scicloj.ml well, and so jumping to code does not work. But it is valid and common practice to do this in Clojure. Emacs/CIDER does not have this problem (or at least less so).

Carsten Behring08:09:15

1) It is a question of data size. Dataset needs far less memory and requires the JVM to do less garbage collection than the standard Clojure sequence-of-maps. So when you have tabular data in files in some form, I would suggest reading it via the dataset functions right from the start.

Carsten Behring08:09:26

3) and 4): I think I would want to have a "physical file in the middle". So I would do the minimal thing in SQL, i.e. "extract" the subset of data I want to work with, and write that to a file. One reason for doing that is that I am sure my file cannot change "under my feet", while a DB could. So I would do 4b). I worked with R in the past, and there I would do the same; the R universe still assumes / is tailored to having your data as files on disk. If the "data to work with" is larger than RAM, the story could change. Any DB is made to work with data larger than RAM, and so is Postgres, while dataset works best with data that fits in RAM (it supports larger-than-RAM data as well, by using Arrow files or operating on "several data files", but that is more advanced). Personally, for larger-than-RAM data I would go to "big data tools" like Spark (usable from Clojure via Geni). But I never was a "SQL guy".
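A sketch of that 4b) workflow under the same assumptions as before (postgres/query is the user's own query function; the file name is illustrative, and .nippy support may need the nippy dependency on the classpath - a plain .csv also works but loses column types such as instants and decimals):

(require '[tablecloth.api :as tc]
         '[tech.v3.dataset :as tmd])

;; extract once: minimal SQL, then freeze the subset to disk
(-> (postgres/query
     ["SELECT * FROM binance_candlesticks_1m
       WHERE a='BTC' AND b='BUSD'
       AND open_time BETWEEN '2021-01-01' AND '2021-03-31'"])
    tc/dataset
    (tmd/write! "btc-busd-2021q1.nippy"))

;; later, possibly on another machine: load the file and carry on with ML
(def candles (tc/dataset "btc-busd-2021q1.nippy"))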

kwladyka10:09:14

thank you for all your comments