#sql
2022-04-29
Wei Lian Liu 17:04:01

Hi everyone, I'm using next.jdbc and Snowflake to query time-series data and store it all in memory for faster access later, but I got this error: Execution error (OutOfMemoryError) at net.snowflake.client.jdbc.internal.apache.arrow.vector.VarCharVector/get (VarCharVector.java:113). The total size of the data in the table is less than 2 GB. I'm not sure if this is possible or if there's something wrong with the way I'm doing it:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

;; db/ds is the application's datasource, defined elsewhere
(def data-cache (atom []))

(defn load-data! []
  (reset! data-cache
          (into []
                ;; realize each streamed row as a Clojure map
                (map #(rs/datafiable-row % db/ds {}))
                (jdbc/plan
                 db/ds
                 ["SELECT DATE, TICKER, PRICE FROM price_table"]
                 {:builder-fn rs/as-unqualified-lower-maps}))))

seancorfield 17:04:10

@william260 Not sure how you're measuring 2 GB of data in the DB, but bear in mind that when it's expanded into a Clojure data structure it's going to be much larger.

seancorfield 17:04:52

Also, using plan like that is a bit pointless: you might as well just call jdbc/execute!.

seancorfield 17:04:01

plan is intended for large data sets where you don't hold it all in memory because you reduce away the result set instead of turning it into data.
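
A minimal sketch of the contrast, using the db/ds datasource and price_table columns from the snippet above (the total-value-per-ticker reduction is just an illustration):

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

;; execute! realizes the whole result set as a vector of maps in memory:
(jdbc/execute! db/ds
               ["SELECT DATE, TICKER, PRICE FROM price_table"]
               {:builder-fn rs/as-unqualified-lower-maps})

;; plan + reduce streams the rows; only the accumulator is retained:
(reduce (fn [acc row]
          (update acc (:ticker row) (fnil + 0.0) (:price row)))
        {}
        (jdbc/plan db/ds ["SELECT TICKER, PRICE FROM price_table"]))

The keyword lookups inside the reducing function read columns straight off the ResultSet, so no row maps are ever built.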

Wei Lian Liu 17:04:51

Yes, I want to realize all the data right away to speed up later access, if that makes sense, so reduce and plan won't help in this case. Maybe the data is actually much larger than I thought. Is caching to a file a good idea? Thanks @seancorfield

seancorfield 20:04:42

Sorry, got dragged away to other things... Databases are generally pretty fast and pretty good at dealing with large datasets so the question is really "what are you doing with this data?"

seancorfield 20:04:58

If you need to do queries on that whole dataset, maybe do those via SQL? You might need additional indices in the DB to make those run faster.
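
For example, a per-request query that pushes the filtering down to the database instead of scanning a cached copy in the JVM (the function name and parameters are hypothetical; same jdbc/rs aliases as above):

(defn prices-for
  "Fetch one ticker's prices over a date range, letting the DB do the filtering."
  [ds ticker from-date to-date]
  (jdbc/execute! ds
                 ["SELECT DATE, PRICE FROM price_table
                   WHERE TICKER = ? AND DATE BETWEEN ? AND ?
                   ORDER BY DATE"
                  ticker from-date to-date]
                 {:builder-fn rs/as-unqualified-lower-maps}))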

seancorfield 20:04:18

If you can "summarize" the dataset in some way to produce a smaller representation of it that fits in memory, a combination of aggregate queries and (self) joins along with plan and reduce might get it small enough to live in memory for certain types of exploration.

🙏 1
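
One way that combination might look, again with the aliases above; the daily-average aggregate is just an illustration of "summarizing" into something small:

(defn load-daily-averages
  "Reduce an aggregate query into a small nested map {ticker {date avg-price}}."
  [ds]
  (reduce (fn [acc row]
            (assoc-in acc [(:ticker row) (:date row)] (:avg_price row)))
          {}
          (jdbc/plan ds
                     ["SELECT TICKER, DATE, AVG(PRICE) AS AVG_PRICE
                       FROM price_table
                       GROUP BY TICKER, DATE"])))
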
seancorfield 20:04:53

Or, find a machine with enough JVM heap to keep that all in memory...
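
The heap limit is the JVM's -Xmx option; with the Clojure CLI it can live in a deps.edn alias (the alias name and the 8g figure are arbitrary, and your.server.main is a placeholder):

;; deps.edn
{:aliases
 {:big-heap {:jvm-opts ["-Xmx8g"]}}}

;; start the server with: clojure -M:big-heap -m your.server.main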

Wei Lian Liu 20:04:16

Np, thank you @seancorfield. The dataset is pretty simple: datetime, name, price (feature 1), size (feature 2). After querying the table and processing, it should be a map of time series like so:

{ {:feature :price :name "A"} {19000 1.0 19002 1.1 ...}, 
  {:feature :price :name "B"} {19000 2.0 19001 2.1 ...},
  {:feature :size :name "A"} { ... }, ... }
Since this is a server I'm working on, I expect frequent access to any part of this table, so I thought I'd cache it (in memory, or to a file) instead of querying every time. I'm trying out caching to files by :name. I will also try indexing in the DB, to see which is faster.
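
If the processed map does fit in memory, here is a single-pass sketch of building that shape with plan and reduce; the column names (DATETIME, NAME, PRICE, SIZE) and the use of raw DATETIME values as the inner keys are assumptions based on the description above:

(defn load-cache
  "Build {{:feature .. :name ..} {time value}} in one pass over the table."
  [ds]
  (reduce (fn [acc row]
            (let [t  (:datetime row)
                  nm (:name row)]
              (-> acc
                  (assoc-in [{:feature :price :name nm} t] (:price row))
                  (assoc-in [{:feature :size  :name nm} t] (:size row)))))
          {}
          (jdbc/plan ds ["SELECT DATETIME, NAME, PRICE, SIZE FROM price_table"])))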