2022-12-08
Channels
- # adventofcode (49)
- # announcements (2)
- # architecture (4)
- # babashka (48)
- # babashka-sci-dev (4)
- # beginners (7)
- # biff (1)
- # calva (14)
- # cider (6)
- # clj-kondo (1)
- # clj-yaml (1)
- # cljsrn (3)
- # clojure (14)
- # clojure-art (12)
- # clojure-europe (62)
- # clojure-nl (1)
- # clojure-norway (35)
- # clojure-uk (5)
- # clojurescript (18)
- # clr (4)
- # community-development (9)
- # conjure (2)
- # core-async (3)
- # cursive (2)
- # datomic (2)
- # emacs (8)
- # events (3)
- # graalvm (1)
- # helix (6)
- # holy-lambda (3)
- # jobs (1)
- # off-topic (16)
- # polylith (30)
- # practicalli (11)
- # reitit (5)
- # shadow-cljs (14)
- # slack-help (10)
- # xtdb (6)
I have a postgres table that contains aggregated data at various levels, and a hierarchical algorithm that processes that data in a top-down fashion (general to specific). I'm trying to design this system and would like to tick the following boxes:
• The data is loaded lazily (the dataset can be larger than memory, so I need to ensure scalability)
• The process might take a long time, so don't use the same db connection throughout
• Keep the code as pure as possible (minimize side effects and the need for with-redefs and mocking in the tests)
Can I eat the cake and have it too? If not, which of the above should I sacrifice? (I suppose the last one)
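One way to read the third point: if the per-row transformation is a pure function of plain data, the tests can call it directly with literal maps and never touch the database. A minimal sketch, where process-data and the keys it uses are made-up placeholders, not anything from this thread:

(ns myapp.process-test
  (:require [clojure.test :refer [deftest is]]))

;; Hypothetical pure function: plain map in, plain map out.
;; All DB reads/writes live outside of it.
(defn process-data [row]
  (assoc row :detail-total (* (:count row) (:avg row))))

(deftest process-data-test
  ;; No connection, no with-redefs, no mocking - just data in, data out.
  (is (= {:count 4 :avg 10 :detail-total 40}
         (process-data {:count 4 :avg 10}))))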
Nice question.
• DB connections are pretty fast to create and destroy, so don't worry about the overhead of opening one now and then.
• DB connections are pretty stable, especially within the same datacenter - they can often stay connected for hours/days/months.
• DBs can often read/write over 1,000 rows per second, so 1,000,000 rows can often be read/written in about 17 minutes.
There are many possible solutions; here are two that might help.
Option 1: If your data is smallish (e.g. less than 100 million rows) and quick to process, then you can probably do the whole lot in one go with a single lazy sequence (so the whole dataset is never in memory):
;; read-data, process-data and write-data are placeholders for your own functions
(->> (read-data db-conn-or-db-credentials)
     (map process-data)   ;; maybe (cpl/upmap 8 process-data) for parallelism
     (write-data db-conn-or-db-credentials))
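For the lazy read above, here is a rough sketch of one way to stream rows out of Postgres with next.jdbc - the aggregates table, process-data and write-row! are hypothetical placeholders, and note that the pg JDBC driver only streams when auto-commit is off and a fetch size is set:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(def db-spec {:dbtype "postgresql" :dbname "mydb" :user "me" :password "secret"})

;; Stream rows from the server without realising the whole result set in memory.
(with-open [conn (jdbc/get-connection db-spec {:auto-commit false})]
  (run! (fn [row]
          (-> (rs/datafiable-row row conn {})  ;; realise one row as a plain map
              process-data                     ;; pure transformation
              (write-row! db-spec)))           ;; hypothetical write fn, own connection
        (jdbc/plan conn
                   ["select * from aggregates order by level, id"]
                   {:fetch-size 1000})))

Because write-row! is handed the db-spec rather than the open read connection, each write gets its own short-lived connection, which lines up with the "don't reuse one connection throughout" requirement.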
Option 2: If your data is large and slow to process, then you can use batches (each batch can still be a lazy sequence / larger than memory); see the sketch below.
• You can write the data in batches and use an upsert (in Postgres, INSERT ... ON CONFLICT ... DO UPDATE) to make the process idempotent (e.g. reruns of the same batch won't be rejected by the DB).
• When your process restarts you can query for the highest batch id and then continue from there.
• Each batch can open and close its own connection (this saves trying to reuse connection objects that could be broken).
• If a batch fails, then you may need to rewrite the whole batch - but this should not be an issue due to idempotence.
Thanks! The data is several orders of magnitude smaller (like 5-digit figures) but the machine it runs on is weak, so solution one might be ok. I'll ask ChatGPT if it agrees and get back to you (jk 😅, but I will hammock-time this a bit)
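A rough sketch of the batch approach from Option 2, assuming next.jdbc; the table names (aggregates, results), the batch_id column and process-data are all made-up placeholders, not from the thread:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.sql :as sql])

(defn last-finished-batch [db-spec]
  ;; Resume point: highest batch id already written to the output table.
  (:max (jdbc/execute-one! db-spec ["select max(batch_id) as max from results"])))

(defn read-batch [db-spec batch-id]
  (sql/query db-spec ["select * from aggregates where batch_id = ?" batch-id]))

(defn write-batch! [db-spec batch-id rows]
  ;; Upsert keyed on id, so rerunning the same batch is idempotent.
  (jdbc/with-transaction [tx db-spec]
    (doseq [row rows]
      (jdbc/execute-one! tx
        ["insert into results (id, batch_id, value)
          values (?, ?, ?)
          on conflict (id) do update set batch_id = excluded.batch_id,
                                         value    = excluded.value"
         (:aggregates/id row) batch-id (:value (process-data row))]))))

(defn run-batches! [db-spec batch-ids]
  ;; Skip batches that already finished, then process the rest in order.
  (doseq [batch-id (drop-while #(<= % (or (last-finished-batch db-spec) -1))
                               batch-ids)]
    ;; Passing the db-spec map (not a long-lived Connection) means each call
    ;; opens and closes its own connection, so a slow batch can't leave a
    ;; broken connection around for the next one.
    (write-batch! db-spec batch-id (read-batch db-spec batch-id))))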
> I'll ask ChatGPT if it agrees and get back to you
This is the first time I've heard this said to me, but I suspect it won't be the last! 😄