2022-08-21
What is the correct way of preparing data from a DB with https://github.com/scicloj ? I have a table with rows like this:
#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
:b_close 28900.93000000M,
:b_min 28887.95000000M,
:trades 436,
:b_open 28942.95000000M,
:b_max 28964.34000000M,
:open_time #time/instant"2021-01-01T00:09:00Z",
:b_volume 649748.33443327M,
:a_volume 22.47045700M,
:a "BTC",
:b "BUSD",
:taker_buy_a_volume 5.72019700M}
I am doing (postgres/query ["SELECT * FROM binance_candlesticks_1m WHERE a='BTC' AND b='BUSD' AND open_time BETWEEN '2021-01-01' AND '2021-12-31'"])
and want to prepare this data for AI. As I understand it, I have to use [scicloj.ml.dataset :as ds],
which is a “proxy” to https://scicloj.github.io/tablecloth/ - is that correct?
Should I get all the data from postgres, prepare it and then use ds/dataset, or something like that? Should I use some magic function like ds/magic-postgres table_name
or …. what is the best practice? PS I have no experience with AI and scicloj, so it is not at all obvious to me how things should be done.
*What I am doing right now: I did a tutorial with the Titanic dataset and I want to achieve some simple success with my own data instead. So I just want to prepare the data, run AI on it and use it for prediction. I don’t care too much about accuracy or choosing an efficient model. I just want to get through the whole process, but with my data.
But there is so much unknown that I don’t know where to start to prepare the data with best practice. I will have to normalise it and add extra data like indicators etc.
I can load everything into memory, but then I need a machine with very high memory to handle all the data at once. Does ds
have a lazy way to build a dataset from a database, so it doesn't keep everything in memory at once?
When doing (ds/dataset) from an SQL table, do you use as-unqualified-lower-maps? (`:binance_candlesticks_1m/b_open` vs `b_open` names in the dataset table)
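For illustration, a minimal sketch of the two key shapes, assuming the query goes through next.jdbc (which is where as-unqualified-lower-maps comes from); the `db-spec` map is a placeholder for the real connection details:

```clojure
(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

;; placeholder connection details, adjust to your setup
(def db-spec {:dbtype "postgresql" :dbname "mydb" :user "me" :password "secret"})

;; default builder: table-qualified keys such as :binance_candlesticks_1m/b_open
(jdbc/execute! db-spec ["SELECT * FROM binance_candlesticks_1m LIMIT 1"])

;; as-unqualified-lower-maps: plain keys such as :b_open
(jdbc/execute! db-spec ["SELECT * FROM binance_candlesticks_1m LIMIT 1"]
               {:builder-fn rs/as-unqualified-lower-maps})
```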
Do you build and normalise all the data in memory and run learning on it? Or do you split it: build the data, save the prepared dataset to a file / db, and later load and learn? What are the best practices to make it easy to debug etc.?
Why [scicloj.ml.dataset :as ds] instead of tablecloth directly? What is the advantage?
Isn't it the other way round, and tablecloth uses dataset?
(just curious, cannot answer your questions myself)
As I understand it, https://scicloj.github.io/scicloj.ml-tutorials/userguide-intro.html teaches to use [scicloj.ml.dataset :as ds]. From there:
As it was organically growing over time, it's [i.e. dataset's] API is functional and complete, but lacks consistency in some parts.
This was addressed by an other library, layering on top of it, called tablecloth. It is available here
But hmm, the sentence can be interpreted both ways: either scicloj.ml.dataset is legacy and uses tablecloth under the hood, or it is the right one to use and uses tablecloth under the hood.
dataset is used in tablecloth
Since the tutorial examples use scicloj.ml.dataset, I assumed it is the right one to use, because it is in the tutorial. Maybe that is a wrong assumption.
If I should use tablecloth.api directly and scicloj.ml.dataset is a legacy thing, then the tutorial does a really bad job explaining it and teaches legacy usage.
No, other way round: tablecloth uses scicloj.ml.dataset and gives you a more consistent API that is similar to R's.
I literally see in scicloj.ml.dataset
(:require [tech.v3.datatype.export-symbols :as exporter]
[scicloj.ml.utils :refer :all]
[tech.v3.dataset.modelling]
[tech.v3.dataset.column-filters]
[tech.v3.dataset.categorical]
[tablecloth.api] ; <<<<<<<<<<<<<
)
My main question is still up to date: use [scicloj.ml.dataset :as ds] or tablecloth.api directly?
The namespace scicloj.ml.dataset contains all functions of tablecloth.api and some more.
The "some more" are machine learning related and come from tech.ml.dataset.
The functions are re-exported in a new namespace, so they are exactly the same.
The reason for that is to reduce the number of namespaces to be used (to 4). The original code is placed in > 20 namespaces.
So in practice it does not matter which one you use.
All methods in the ns scicloj.ml.dataset have in common that the dataset is the first argument, and they mostly return a dataset.
Keep in mind as well that the ns scicloj.ml.metamorph has methods with the same names as the ones in scicloj.ml.dataset,
but those take the metamorph context as the first argument, so they are meant to be used inside pipelines.
See the clojuredoc here: https://scicloj.github.io/scicloj.ml/
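A minimal sketch of that difference, assuming select-columns is available in both namespaces and pipeline in scicloj.ml.core, as in the scicloj.ml tutorials:

```clojure
(require '[scicloj.ml.dataset :as ds]
         '[scicloj.ml.metamorph :as mm]
         '[scicloj.ml.core :as ml])

(def d (ds/dataset {:b_open [28942.95 28900.93] :trades [436 410]}))

;; scicloj.ml.dataset: the dataset is the first argument, a dataset comes back
(ds/select-columns d [:b_open])

;; scicloj.ml.metamorph: the same operation as a pipeline step over a context map
(def pipe (ml/pipeline (mm/select-columns [:b_open])))
(:metamorph/data (pipe {:metamorph/data d :metamorph/mode :fit}))
```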
Actually, why use dataset at all when I can just use pure Clojure?
I can use (postgres/query ["SELECT * FROM binance_candlesticks_1m WHERE a='BTC' AND b='BUSD' AND open_time BETWEEN '2021-01-01' AND '2021-03-31'"])
and add extra values to each map
#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
:b_close 28900.93000000M,
:b_min 28887.95000000M,
:trades 436,
:b_open 28942.95000000M,
:b_max 28964.34000000M,
:open_time #time/instant"2021-01-01T00:09:00Z",
:b_volume 649748.33443327M,
:a_volume 22.47045700M,
:a "BTC",
:b "BUSD",
:taker_buy_a_volume 5.72019700M}
with the map function. What is the advantage of using the dataset functions here?
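For concreteness, a sketch of the two styles side by side; the sample rows stand in for the query result (unqualified keys assumed) and the :spread column is just a made-up derived value:

```clojure
(require '[tablecloth.api :as tc])

;; two sample rows standing in for the postgres query result
(def rows
  [{:b_open 28942.95M :b_max 28964.34M :b_min 28887.95M :trades 436}
   {:b_open 28900.93M :b_max 28950.10M :b_min 28880.00M :trades 410}])

;; pure Clojure: add a derived value to every row map
(def rows-with-spread
  (map (fn [row] (assoc row :spread (- (:b_max row) (:b_min row)))) rows))

;; dataset: add the same derived column on the dataset as a whole
(def d (tc/dataset rows))
(def d-with-spread
  (tc/add-column d :spread (fn [ds] (map - (ds :b_max) (ds :b_min)))))
```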
Is it a good rule that if it is the same simplicity or simpler to do something in pure Clojure, then use pure Clojure instead of dataset?
I think this question gets answered in this podcast interview.
The point that I remember is that performance will be much better for large datasets (when you get to >100k elements/rows).
It is a very good and recent interview with the creator of tech.ml.dataset (and more), and I imagine it can help clarify some of your questions, together with the follow-up episode:
https://podcasts.apple.com/us/podcast/e72-data-processing-with-chris-nuernberger-part-1/id1461500416?i=1000565597470
I strongly prefer simplicity over performance. I don’t think it matters too much if it takes 30 sec. or 50 sec., or even 1 minute vs 5 minutes. I mean, Clojure itself is very powerful in the context of performance. I don’t expect a big difference.
I am curious how you see this. Why do you prefer to use dataset instead of pure Clojure?
I don’t have much practical experience in Clojure yet and am in no way able to answer with any authority here.
That being said: my understanding is that dataset and tablecloth also simplify manipulating table-like data sets, where you do a lot of selecting columns/fields, grouping and aggregating (see the sketch below).
In the tablecloth documentation you find many examples comparing functionality with R’s dplyr, tidyr and data.table, which I found instructive while skimming it some weeks ago.
But again: I never used it or Clojure in anger.
I hope someone else can answer your questions in more detail.
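For example, a grouped aggregation over the candlesticks could look roughly like this (a sketch only; it assumes unqualified keys and a pre-derived :day column, standing in for the real data):

```clojure
(require '[tablecloth.api :as tc])

;; sample rows standing in for the query result, with a :day column for brevity
(def rows
  [{:day "2021-01-01" :b_close 28900.93M :trades 436}
   {:day "2021-01-01" :b_close 28910.10M :trades 410}
   {:day "2021-01-02" :b_close 29050.00M :trades 390}])

;; group by day and aggregate, roughly like dplyr's group_by + summarise
(-> (tc/dataset rows)
    (tc/group-by :day)
    (tc/aggregate {:avg-close  (fn [ds] (/ (double (reduce + (ds :b_close)))
                                           (tc/row-count ds)))
                   :sum-trades (fn [ds] (reduce + (ds :trades)))}))
```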
I just remembered that the Clojure data science community is way more active on Zulip than here on Slack. Perhaps a better chance to get answers there, here is the beginners Stream: https://clojurians.zulipchat.com/#narrow/stream/151763-beginners
dataset does not only give better performance (as speed), but it uses "far, far less memory" and results in far less garbage collection. The reason being that the standard Clojure representation of tabular data, a sequence-of-maps, is kind of the worst case. It produces a lot of small objects, while dataset stores the data in native Java arrays. A CSV file of 1 GB of data can easily require 30 or more GB of heap when stored as a sequence-of-maps. When stored as a tech.ml.dataset, it's maybe 2 GB of heap. -> Huge difference
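A small sketch of the contrast on made-up data (tc/info reports the per-column datatypes and element counts of a dataset):

```clojure
(require '[tablecloth.api :as tc])

;; 100k rows of made-up candles
(def fake-rows
  (repeatedly 100000
              (fn [] {:b_open (rand 30000.0) :trades (rand-int 500)})))

;; sequence-of-maps: one map object per row, every value boxed
(def as-maps (doall fake-rows))

;; dataset: column-oriented, values packed into typed buffers
(def as-dataset (tc/dataset fake-rows))

(tc/info as-dataset) ;; per-column datatype and element counts
```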
ok so let’s summarise my questions:
1) Is it the best rule to use pure Clojure to modify data before converting to a dataset, instead of converting to a dataset and then modifying? As long as using pure Clojure is at least the same simplicity or simpler.
2) Is it recommended to use [tablecloth.api :as tc] directly instead of [scicloj.ml.dataset :as ds]? It would be better for me to use [tablecloth.api :as tc], because I can easily jump to the source of a function and read its code to learn how to use it.
3) What is the best practice to prepare a dataset from PostgreSQL?
4) Which is the better practice:
a) PostgreSQL query -> convert to dataset -> do AI things
b) PostgreSQL query -> convert to dataset -> write to file -> read from file (probably in a different instance) -> do AI things
I see the advantage of b) for debugging. What is your practice?
5) What is the most efficient way to normalise data for this use case:
I have a value which changes like 100 -> 99.578934, but sometimes also has a bigger change, 100 -> 91.5345. If I compute the % of change, like 100 / 99.578934, then the value will always be very close to 1, in this case 1.00422846463.
What values should I use:
a) raw values, 100 and 99.578934
b) normalised values like 1.00422846463, not caring that most of them are around 1
c) is there some magic good practice to normalise values for such a use case???
d) or maybe the change in raw value instead of some kind of % of change? (- 100 99.578934)
*use case - these values are prices on the crypto market
#:binance_candlesticks_1m{:taker_buy_b_volume 165429.65820248M,
:b_close 28900.93000000M,
:b_min 28887.95000000M,
:trades 436,
:b_open 28942.95000000M,
:b_max 28964.34000000M,
:open_time #time/instant"2021-01-01T00:09:00Z",
:b_volume 649748.33443327M,
:a_volume 22.47045700M,
:a "BTC",
:b "BUSD",
:taker_buy_a_volume 5.72019700M}
for example the b_open values
I can support you regarding tablecloth, but earliest on Tuesday or Wednesday. Wrote a PM.
Regarding 2) This is a "tooling" problem. It seems that Cursive cannot handle the "dynamically created namespaces" of http://scicloj.ml well, and so jumping to code does not work. But it is valid and common practice to do this in Clojure. Emacs/CIDER does not have this problem (or at least less so).
1) It is a question of data size. Dataset needs far less memory and requires the JVM to do less garbage collection than standard Clojure sequences-of-maps. So when you have tabular data in files in some form, I would suggest reading it via the dataset functions right from the start.
3) and 4) I think I would want to have a "physical file in the middle". So I would do the minimal thing in SQL, i.e. "extract" the subset of data I want to work with and write that to a file. One reason for doing that is that I am sure my file cannot change "under my feet", while a DB could. So I would do 4b). I worked in the past with R, and there I would do the same. The R universe is still assuming / tailored to having your data as files on disk. If the "data to work with" is larger than RAM, the story could change. Any DB is made to work with data larger than RAM, and so is postgres, while dataset works best with data which fits in RAM. (It does support data larger than RAM as well, by using arrow files or operating on "several data files", but this is more advanced.) Me personally, for larger-than-RAM data I would go to "big data tools" like Spark (usable from Clojure via Geni). But I never was a "SQL guy".
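A rough sketch of that 4b) workflow; next.jdbc, tablecloth and the file name are assumptions for illustration, not a prescription:

```clojure
(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs]
         '[tablecloth.api :as tc])

;; placeholder connection details
(def db-spec {:dbtype "postgresql" :dbname "mydb" :user "me" :password "secret"})

;; 1. minimal SQL: extract just the subset you want to work with
(def rows
  (jdbc/execute! db-spec
                 ["SELECT * FROM binance_candlesticks_1m
                   WHERE a='BTC' AND b='BUSD'
                     AND open_time BETWEEN '2021-01-01' AND '2021-12-31'"]
                 {:builder-fn rs/as-unqualified-lower-maps}))

;; 2. turn the sequence of maps into a dataset and snapshot it to disk
(def d (tc/dataset rows))
(tc/write! d "btc-busd-2021.csv.gz")

;; 3. later (possibly on another machine), start from the fixed file
;;    note: column types (e.g. the instant column) may need a parser hint on re-read
(def d2 (tc/dataset "btc-busd-2021.csv.gz" {:key-fn keyword}))
```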