#data-science
2020-03-17
Dustin Getz 15:03:27

Hey, I need to process a 300M-line CSV-like file. At a wild guess that is like 80GB. Does that sound reasonable? I am being told to use Spark, but my iPhone could process that, right?

mrchance 15:03:14

Yes, in my experience Spark is frequently overkill, and it definitely sounds like that's the case here. It depends quite a bit on what kind of analysis you need though

mrchance 15:03:09

Also, depending on your row size, 80GB sounds pretty generous; 300M lines at 80GB works out to roughly 270 bytes per line.

Eddie 15:03:49

My suggestion will differ based on how many passes you need to do over the data, and whether you need to store intermediate results. The simplest use case would be mapping a function (and/or filtering) over each row and writing the result to a new file. You only need one row in memory at a time. Clojure lazy sequences over a stream of data from the file will be fine; Spark will probably be more setup than it's worth. If you are doing analytics on the file and thus require aggregate computation, multiple passes, enrichment with other datasets, etc., it can start to get difficult to manage your resources. It is unlikely a single machine (including your iPhone 🙂) has 80GB of RAM, so putting the entire dataset into memory for efficient reuse will not be an option. You could end up thrashing to and from virtual memory, or you might OOM. Spark would be a wonderful solution to that problem.

👍 4
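
A minimal sketch of the lazy map/filter approach described above, assuming org.clojure/data.csv is on the classpath; the paths and the keep-row?/transform-row functions are hypothetical placeholders:

```clojure
(ns example.csv-stream
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

;; Lazily read rows, filter and transform them, and write the results
;; out, so only a handful of rows are in memory at any time.
(defn process-file
  [in-path out-path keep-row? transform-row]
  (with-open [reader (io/reader in-path)
              writer (io/writer out-path)]
    (->> (csv/read-csv reader)
         (filter keep-row?)
         (map transform-row)
         (csv/write-csv writer))))
```
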
kenny 16:03:53

We process large CSV files, though smaller than yours (up to ~30GB). We do just as @U7XR2PZFW says: a simple map/filter over lazily read gzipped CSV data, with output to a much smaller CSV file. Though @U7XR2PZFW, it would take far more than 80GB of memory to fit an 80GB CSV file in memory.

👍 4
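
For the gzipped input kenny mentions, the same lazy pipeline works if the reader is wrapped in a GZIPInputStream; a sketch with a hypothetical file name:

```clojure
(ns example.gzip-csv
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io])
  (:import [java.util.zip GZIPInputStream]))

;; Lazily read a gzipped CSV without decompressing it to disk first.
(with-open [reader (-> "data.csv.gz"      ; hypothetical path
                       io/input-stream
                       GZIPInputStream.
                       io/reader)]
  ;; doall/take only to make the example self-contained; in practice
  ;; you would stream the rows straight into csv/write-csv as above.
  (doall (take 5 (csv/read-csv reader))))
```
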
kenny 16:03:50

The Java process to do this uses around 3GB of memory. A new iPhone has 4GB of memory, so it could probably do it haha.

Eddie 16:03:01

Good point.

kenny 16:03:48

BTW, if you're going to take this approach @U09K620SG, ensure locals clearing is enabled. I typically run a Cursive REPL in debugger mode, which disables locals clearing and easily causes OOMs.

👍 4
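
Locals clearing is a Clojure compiler option, and debugger REPLs typically turn it off via the clojure.compiler.disable-locals-clearing system property; a quick, hedged way to check your REPL:

```clojure
;; Locals clearing is on by default. With it disabled (e.g. a REPL
;; started with -Dclojure.compiler.disable-locals-clearing=true, as
;; debuggers tend to do), a local holding the head of the lazy row seq
;; stays reachable for the whole pass, which is what causes the OOMs.
;; nil or "false" here means clearing is still active.
(System/getProperty "clojure.compiler.disable-locals-clearing")
```
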
kenny 16:03:29

Depending on the analysis you need to do, you can somewhat easily parallelize this processing by reading the CSV in chunks.
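
A hedged sketch of that chunked parallelism using partition-all and pmap over the lazy row seq; the chunk size and keep-row? predicate are arbitrary placeholders, and the reduction here is a simple count:

```clojure
(ns example.parallel-csv
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

;; Count matching rows by handing chunks of the lazy row seq to worker
;; threads; only a few chunks are realized in memory at a time.
(defn count-matching
  [in-path keep-row?]
  (with-open [reader (io/reader in-path)]
    (->> (csv/read-csv reader)
         (partition-all 10000)                 ; chunk size is arbitrary
         (pmap #(count (filter keep-row? %)))  ; one chunk per worker
         (reduce +))))
```
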

mrchance 16:03:40

Nice thread! I really should check out tech.ml.dataset, and the locals clearing tip is a good hint too. I also realized I should have put what Eddie said in my answer; it's more immediately helpful than what I said 😉

jsa-aerial 18:03:03

There has been a tremendous amount of great new work on the tech.ml.dataset stack recently (as in daily over the last week or so) on just this sort of large-scale loading and processing. The discussion has been over on Zulip under #data-science > tech.ml.dataset. @UDRJMEFSN can say more, or head over there for much more info!

metasoarous 17:03:13

Most of the bases are covered above. It depends on the use case, in particular whether you need all the data in memory at once or can process it in a single (or small number of) scan(s). If you don't need it in memory, you could use https://github.com/metasoarous/semantic-csv
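
A hedged sketch of how semantic-csv composes with clojure.data.csv for a single scan; the file path, the :amount column, and the summing are assumptions for illustration:

```clojure
(ns example.semantic
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [semantic-csv.core :as sc]))

;; Stream the file once: turn each row vector into a map keyed by the
;; header row, cast one (assumed) numeric column, and reduce to a total.
(with-open [reader (io/reader "data.csv")]     ; hypothetical path
  (->> (csv/read-csv reader)
       sc/mappify                              ; row vectors -> maps via header
       (sc/cast-with {:amount #(Double/parseDouble %)})
       (map :amount)
       (reduce +)))
```
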

chrisn 21:03:43

Just saw this thread. I agree that if you don't want to have it in memory, @U05100J3V’s library is a good option. If you do use tech.ml.dataset, you can filter columns and take a max number of rows to avoid processing any data you don't have to. Aside from that, with tech.ml.dataset the data will be in memory, and 30GB+ CSV files in memory will not be ideal, I think, unless there are a lot of repeated categorical values.

👍 4
✔️ 4
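
A hedged sketch of what chrisn describes with tech.ml.dataset; the option names are as I recall them from the library's README around this time (check the current docs), and the file/column names are placeholders:

```clojure
(ns example.tmd
  (:require [tech.ml.dataset :as ds]))

;; Load only the columns you need and cap the row count, so exploring
;; the file does not mean parsing all 300M rows into memory.
(def sample
  (ds/->dataset "data.csv"
                {:column-whitelist ["id" "amount" "category"]
                 :num-rows 100000}))
```
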
Daniel Slutsky 20:03:32

We are organizing an online #Clojure hackathon (https://twitter.com/hashtag/Clojure?src=hashtag_click) for studying COVID-19 data. Please mark your preferred dates: https://twitter.com/scicloj/status/1240010550555353088

👍 12