2022-04-29
Hi everyone, I'm using next.jdbc and Snowflake to query time-series data and store it all in memory for faster access later, but I got this error: Execution error (OutOfMemoryError) at net.snowflake.client.jdbc.internal.apache.arrow.vector.VarCharVector/get (VarCharVector.java:113). The total size of the data in the table is less than 2GB. I'm not sure if this is even possible, or if there is something wrong with the way I'm doing it:
(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(def data-cache (atom []))

(defn load-data! []
  ;; realize every row as a map and hold them all in the atom
  (reset! data-cache
          (into []
                (map #(rs/datafiable-row % db/ds {}))
                (jdbc/plan
                 db/ds
                 ["SELECT DATE, TICKER, PRICE FROM price_table"]
                 {:builder-fn rs/as-unqualified-lower-maps}))))
@william260 not sure how you're measuring 2GB of data in the DB, but bear in mind that when it's expanded into Clojure data structures it's going to be much larger.
Also, using plan like that is a bit pointless: you might as well just call jdbc/execute!. plan is intended for large data sets where you don't hold it all in memory, because you reduce away the result set instead of turning it into data.
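(For comparison, a minimal sketch of the jdbc/execute! version of the snippet above, reusing the same data-cache atom, db/ds datasource, and requires; the function name here is just for illustration:)

(defn load-data-via-execute! []
  ;; execute! realizes the whole result set in one call, as a vector
  ;; of maps with unqualified lower-case keys
  (reset! data-cache
          (jdbc/execute! db/ds
                         ["SELECT DATE, TICKER, PRICE FROM price_table"]
                         {:builder-fn rs/as-unqualified-lower-maps})))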
Yes, I want to realize all the data right away to speed up later access, if that makes sense, so reduce and plan won't help in this case.
Maybe the data is actually much larger. Is caching to a file a good idea?
Thanks @seancorfield
Sorry, got dragged away to other things... Databases are generally pretty fast and pretty good at dealing with large datasets, so the question is really "what are you doing with this data?"
If you need to do queries on that whole dataset, maybe do those via SQL? You might need additional indices in the DB to make those run faster.
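(For instance, a hedged sketch of pushing an aggregation into the database via next.jdbc; the table and column names are taken from the question, and whether this is fast enough depends on the database:)

;; the database does the grouping; only the small summary comes back
(jdbc/execute! db/ds
               ["SELECT TICKER, AVG(PRICE) AS AVG_PRICE
                 FROM price_table
                 GROUP BY TICKER"])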
If you can "summarize" the dataset in some way to produce a smaller representation of it that fits in memory, a combination of aggregate queries and (self) joins along with plan and reduce might get it small enough to live in memory for certain types of exploration.
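(A minimal sketch of that plan-plus-reduce idea, computing a per-ticker average price as an illustrative summary; the column names are from the question:)

(defn avg-price-per-ticker []
  (let [totals (reduce (fn [acc row]
                         ;; plan rows support keyword column access
                         ;; without being realized into full maps
                         (let [[n sum] (get acc (:ticker row) [0 0.0])]
                           (assoc acc (:ticker row) [(inc n) (+ sum (:price row))])))
                       {}
                       (jdbc/plan db/ds ["SELECT TICKER, PRICE FROM price_table"]))]
    ;; only this small summary map ends up living in memory
    (into {} (map (fn [[ticker [n sum]]] [ticker (/ sum n)])) totals)))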
Or, find a machine with enough JVM heap to keep that all in memory...
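(If you go that route, a sketch of raising the JVM heap via a deps.edn alias; the 8g figure is just an example, size it to the realized data:)

{:aliases
 {:big-heap {:jvm-opts ["-Xmx8g"]}}}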
Np, thank you @seancorfield
The dataset is pretty simple: datetime, name, price (feature 1), size (feature 2).
After querying the table and processing, it should be a map of time series like so:
{ {:feature :price :name "A"} {19000 1.0 19002 1.1 ...},
{:feature :price :name "B"} {19000 2.0 19001 2.1 ...},
{:feature :size :name "A"} { ... }, ... }
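(A minimal sketch of building that shape with plan and reduce, assuming columns named datetime, name, price, and size as described, and that the datetime values are already usable as map keys:)

(defn build-series-map []
  (reduce (fn [acc row]
            (let [t (:datetime row)
                  n (:name row)]
              (-> acc
                  (assoc-in [{:feature :price :name n} t] (:price row))
                  (assoc-in [{:feature :size  :name n} t] (:size row)))))
          {}
          (jdbc/plan db/ds ["SELECT DATETIME, NAME, PRICE, SIZE FROM price_table"])))

Note this still holds the whole processed map in memory, so it only helps if the processed form is much smaller than the raw rows.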
Since this is a server I'm working on, I expect frequent access to any part of this table, so I thought I'd cache it (in memory, or to files) instead of querying every time. I'm trying out caching to files keyed by :name. I'll also try indexing in the DB, to see which is faster.
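(A hedged sketch of that per-:name file cache using EDN; the cache directory and file-naming scheme here are made up for illustration:)

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

(def cache-dir "series-cache")

(defn- cache-file ^java.io.File [series-name]
  (io/file cache-dir (str series-name ".edn")))

(defn save-series! [series-name series]
  ;; write the series for one :name as readable EDN
  (io/make-parents (cache-file series-name))
  (spit (cache-file series-name) (pr-str series)))

(defn read-series [series-name]
  ;; nil when nothing has been cached for this :name yet
  (let [f (cache-file series-name)]
    (when (.exists f)
      (edn/read-string (slurp f)))))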