#data-science
2023-02-07
Wanishing 06:02:58

Hi all, beginner question: I’ve started playing with tech.ml.dataset, with the purpose of comparing it to the Spark workloads currently used in my workplace. My question is quite simple: how would you recommend consuming data from S3? Of course I can download the data and read it locally, but the decision of which data should be consumed is coupled to the logic of my query (for example, when the data is partitioned by date and my query filters by dates, only the requested partitions need to be processed). Moreover, and I have little understanding of this, I assume that the fact that your application processes X terabytes of data doesn’t necessarily mean it uses X terabytes of disk space. I’d be happy to hear any thoughts and suggestions! Thanks 🙏

Rupert (All Street) 09:02:45

You can stream data from S3 in memory using https://github.com/cognitect-labs/aws-api (it's similar to https://github.com/mcohen01/amazonica - but much smaller/lighter weight). If you use InputStreams you can even process data that is too large to fit in memory. Another option is using an HTTP client like https://github.com/dakrone/clj-http to fetch the data, or https://clojuredocs.org/clojure.java.shell/sh to shell out to a command-line S3 client.

Whilst it's cool to process data without touching the disk, note that disk (even an old spinning disk) is much faster than downloading from S3 - so there is no real downside to downloading a file locally, processing it and then deleting it (the extra delay is often negligible).

S3 is very primitive (i.e. a key-value store) - so ideally your keys should be meaningful so that you can fetch just the data you need. You can fetch data for ranges of keys (they are returned in sorted order by key).

> application processes X Terabytes of data doesn’t necessarily mean it uses X Terabytes of disk space.
It doesn't have to use X terabytes of disk space if you stream the files from S3 (so they never touch disk) or just delete the files as you go.
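[Editor's note: as a minimal sketch of fetching only the keys under a date partition prefix with aws-api; the bucket name and prefix here are hypothetical.]

(require '[cognitect.aws.client.api :as aws])

(def s3 (aws/client {:api :s3}))

;; List only the keys under a date partition prefix (bucket/prefix are hypothetical).
(->> (aws/invoke s3 {:op :ListObjectsV2
                     :request {:Bucket "my-bucket"
                               :Prefix "events/date=2023-02-07/"}})
     :Contents
     (map :Key))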

Wanishing 12:02:42

Thanks @UJVEQPAKS! Appreciate your response. I'll try streaming the data using https://github.com/cognitect-labs/aws-api. I just wonder what size of input I can process on my local machine without it crashing.

Rupert (All Street) 12:02:50

aws-api fetches each object from S3 as an InputStream.

;; assumes (require '[cognitect.aws.client.api :as aws])
;;         (def s3 (aws/client {:api :s3}))
;; :Body in the response is a blob type, which always comes back as an InputStream
(aws/invoke s3 {:op :GetObject :request {:Bucket bucket-name :Key "hello.txt"}})
InputStreams only take up a few KB of memory at a time, even when you are processing a large file (e.g. 10TB). This means that you can process unlimited-sized objects in S3 with very small amounts of memory (RAM) and zero access to disk (HDD/SSD).
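[Editor's note: a minimal sketch of consuming that InputStream line by line, so only a small buffer is ever held in memory; the bucket and key names are hypothetical.]

(require '[cognitect.aws.client.api :as aws]
         '[clojure.java.io :as io])

(def s3 (aws/client {:api :s3}))

;; io/reader wraps the InputStream; line-seq pulls one line at a time.
(with-open [rdr (io/reader (:Body (aws/invoke s3 {:op :GetObject
                                                  :request {:Bucket "my-bucket"
                                                            :Key "big-file.csv"}})))]
  (count (line-seq rdr)))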

Rupert (All Street) 12:02:02

Do you have (A) large objects in S3 (e.g. each object is 1TB) or (B) lots of objects (e.g. 1 billion small 1KB objects)? If you are dealing with (A) then streaming can help; if you are dealing with (B) then streaming doesn't matter (since they are small anyway). Either way you don't need to fit the objects on disk or in memory.

Wanishing 14:02:40

Thanks for the snippet. Basically the data input varies, but let's say a total size of 5TB with approximately 1GB per file.

Rupert (All Street) 14:02:34

Sounds doable on most computers. You may find your bottleneck is CPU and bandwidth rather than disk space or memory.

Wanishing 14:02:13

I'll give it a try. Thanks a lot for your attention!

👍 2
chrisn 01:02:03

The exact file type of the data also matters. I am hoping it's parquet or arrow and not CSV. Both parquet and arrow store their data in blocks, so the file may be 1GB but be composed of ~400MB record sets in parquet's case. In my opinion parquet is by far the best answer here for storage, as its compression helps over-the-wire times and the metadata on parquet files often allows you to skip the entire file.

Aside from that I agree with the above. Just remember that parquet has parquet->ds-seq and arrow has an analogous method. The CSV loader also has csv->ds-seq, and then you can transduce or use eduction to efficiently map ds methods across those dataset sequences. Finally there is a reductions namespace that is built for large sequences of datasets.

You will probably have to keep row limits in mind and re-aggregate things every once in a while to keep enough data in memory to hit ideal parallelism constraints - yes, I know Spark does this for you, and we could build a very efficient Spark out of tmd, but tmd in that regard is a building block, not the final answer.

👍 2
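[Editor's note: a minimal sketch of that transduce-over-dataset-sequences pattern, assuming tech.v3.libs.parquet and tech.v3.dataset are on the classpath; the file path, column name and filter value are hypothetical.]

(require '[tech.v3.libs.parquet :as parquet]
         '[tech.v3.dataset :as ds])

;; parquet->ds-seq returns a lazy sequence of datasets, one per record set,
;; so only one ~400MB block needs to be realized at a time.
(transduce
 (comp (map #(ds/filter-column % "event-date" #(= % "2023-02-07")))
       (map ds/row-count))
 + 0
 (parquet/parquet->ds-seq "data/part-0001.parquet"))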
chrisn 01:02:00

At some point I hope to get funding to build a big data processing system to really go toe to toe with Spark. Until that point the user does have to do some manual memory management.

Wanishing 09:02:04

Thanks a lot for your response Chris! In this specific use case the file type is parquet, with the data written in delta format (I'm not entirely sure what I lose performance-wise by reading the parquet files directly and bypassing the delta layer). Thanks for your remarks, I will definitely keep them in mind. And generally, thanks a lot for your awesome work; I love listening to your talks and the way you reason about things (I especially remember the "Rows vs Columns" slide 😎)

chrisn 14:02:59

You are welcome, and I am super glad you have listened to the talks! One more detail I forgot to mention - the metadata in parquet and arrow files comes at the end of the file, so @UJVEQPAKS’s earlier point about downloading the file locally before reading it is a great one to keep in mind. Unless you have metadata encoded externally in the dataset structure (such as dates or something), you will want to first scan the metadata and then optionally scan the file - so a local and potentially caching download step just makes a lot of sense.

👍 2
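[Editor's note: a hedged sketch of that local download step with aws-api; the helper name, bucket and key are hypothetical.]

(require '[cognitect.aws.client.api :as aws]
         '[clojure.java.io :as io])

(def s3 (aws/client {:api :s3}))

(defn fetch-to-local
  "Download an S3 object to a local file and return that file.
  Parquet readers need the footer metadata at the end of the file,
  so a complete local copy is simpler than streaming."
  [bucket key local-path]
  (let [f (io/file local-path)]
    (with-open [in (:Body (aws/invoke s3 {:op :GetObject
                                          :request {:Bucket bucket :Key key}}))]
      (io/copy in f))
    f))

;; (fetch-to-local "my-bucket" "events/part-0001.parquet" "/tmp/part-0001.parquet")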
Wanishing 04:02:51

Indeed an important detail. Thanks again to both of you 🙏

👍 2