
I have a Postgres table that contains aggregated data at various levels, and a hierarchical algorithm that processes that data top-down (general to specific). I'm trying to design this system and would like to tick the following boxes:
• The data is loaded lazily (the dataset can be larger than memory, so I need to ensure scalability).
• The process might take a long time, so don't hold the same DB connection throughout.
• Keep the code as pure as possible (minimize side effects and the need for with-redefs and mocking in the tests).
Can I eat the cake and have it too? If not, which of the above should I sacrifice? (I suppose the last one.)

Rupert (All Street) 10:12:33

Nice question.
• DB connections are pretty fast to create and destroy, so don't worry about the overhead of doing this now and then.
• DB connections are also quite stable, especially within the same datacenter; they can often stay connected for hours/days/months.
• DBs can often read/write over 1,000 rows per second, so 1,000,000 rows can often be read/written in around 17 minutes.
There are many possible solutions; here are two that might help.
Option 1: If your data is smallish (e.g. less than 100 million rows) and quick to process, then you can probably do the whole lot in one go with a single lazy sequence (so the whole dataset is never in memory):

(->> (read-data db-conn-or-db-credentials)
     (map process-data) ;; or claypoole's (cpl/upmap 8 process-data) for parallelism
     (write-data db-conn-or-db-credentials))
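As a minimal sketch of what read-data/write-data could amount to, here's one way to do it with next.jdbc (the table agg_data, its columns, the results table, and process-data are all hypothetical stand-ins). One subtlety: next.jdbc's plan gives you a reducible rather than a lazy seq, but it serves the same goal, because with auto-commit off and a :fetch-size set, the Postgres driver streams rows through a cursor instead of loading the whole result set:

(require '[next.jdbc :as jdbc]
         '[next.jdbc.sql :as sql])

;; Sketch only: agg_data, results, and process-data are stand-ins.
(defn process-all! [db-spec]
  (with-open [read-conn  (jdbc/get-connection db-spec {:auto-commit false})
              write-conn (jdbc/get-connection db-spec)]
    (run! (fn [row]
            ;; process-data stays pure; the insert is the only side effect
            (sql/insert! write-conn :results (process-data row)))
          ;; :fetch-size + auto-commit off => Postgres streams via a cursor
          (jdbc/plan read-conn
                     ["SELECT id, level, payload FROM agg_data ORDER BY level"]
                     {:fetch-size 1000}))))

Using a separate connection for the writes keeps the streaming read cursor undisturbed, and keeps the side effects at the edge of the program.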
Option 2: If your data is large and slow to process, then you can use batches (each batch can still be a lazy sequence / larger than memory). See the sketch after this list.
• You can write the data in batches and use an upsert (in Postgres, INSERT ... ON CONFLICT ... DO UPDATE) to make the process idempotent (i.e. reruns of the same batch won't be rejected by the DB).
• When your process restarts, you can query for the highest batch id and continue from there.
• Each batch can open and close its own connection (which saves trying to reuse connection objects that could be broken).
• If a batch fails, you may need to re-write the whole batch, but that's not a problem thanks to the idempotence.
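A rough sketch of that batch loop, again assuming next.jdbc (the batch_id column, the table names, and process-data's output shape are hypothetical). Each batch opens a fresh connection, the ON CONFLICT upsert makes reruns harmless, and the restart query picks up where the last run left off:

(require '[next.jdbc :as jdbc])

;; Sketch only: table/column names and process-data are stand-ins.
(defn last-finished-batch [db-spec]
  ;; returns -1 when nothing has been written yet
  (:last (jdbc/execute-one!
          db-spec
          ["SELECT COALESCE(MAX(batch_id), -1) AS last FROM results"])))

(defn run-batch! [db-spec batch-id]
  ;; fresh connection per batch, closed when the batch is done
  (with-open [conn (jdbc/get-connection db-spec)]
    (doseq [row (jdbc/execute!
                 conn
                 ["SELECT id, payload FROM agg_data WHERE batch_id = ?" batch-id])]
      (let [{:keys [id value]} (process-data row)] ;; pure core
        ;; the upsert makes re-running the same batch idempotent
        (jdbc/execute-one!
         conn
         ["INSERT INTO results (id, batch_id, value) VALUES (?, ?, ?)
           ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value"
          id batch-id value])))))

;; On restart, only run the batches after the last one that finished:
;; (doseq [b all-batch-ids :when (> b (last-finished-batch db-spec))]
;;   (run-batch! db-spec b))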


Thanks! The data is several orders of magnitude smaller (5-digit row counts), but the machine it runs on is weak, so Option 1 might be OK. I'll ask ChatGPT if it agrees and get back to you (jk 😅, but I will hammock-time this a bit)

👍 1
😂 2
Rupert (All Street) 11:12:19

> I'll ask ChatGPT if it agrees and get back to you
This is the first time I've heard this said to me, but I suspect it won't be the last! 😄

😄 1