#clojure-uk
2019-12-14
dharrigan08:12:35

Good Morning!

otfrom12:12:37

cats are evil and must be obeyed

dharrigan13:12:07

Interesting stuff around core.async etc. It's something I'm still trying to work out. Say, for example, you had a function that downloaded files from S3 and that could be run in parallel (i.e., one call fetches files A and another fetches files B): would you launch each call in its own thread? (I think I read that go blocks are not for IO or other blocking operations?)

rickmoynihan09:12:39

pipeline / pipeline-blocking / pipeline-async

rickmoynihan09:12:54

actually you probably don’t care about the ordering

otfrom13:12:37

yeah, I wouldn't use go blocks for blocking things (as you'd exhaust the pool). go blocks and things built on them (reduce/into/etc) are for things that want to use all the CPU
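
A minimal sketch of the point above, assuming core.async is on the classpath: blocking work goes in `thread` rather than `go`, so it runs on a real thread outside the go-block pool. `fetch-files` and `fetch-all` are hypothetical names, and the sleep stands in for a blocking S3 download.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical stand-in for a blocking S3 download (not a real client call).
(defn fetch-files [bucket]
  (Thread/sleep 100)
  {:bucket bucket :files [(str bucket "/a") (str bucket "/b")]})

(defn fetch-all
  "Runs one blocking fetch per bucket, each on its own thread, and
  collects the results in whatever order they finish."
  [buckets]
  (->> buckets
       (mapv (fn [b] (a/thread (fetch-files b)))) ; each call returns a channel
       a/merge                                    ; combine the result channels
       (a/into [])                                ; gather everything into one vector
       a/<!!))                                    ; block the caller until done
```

Calling `(fetch-all ["bucket-a" "bucket-b"])` should take roughly as long as one fetch rather than two. Note that `thread` does not cap how many threads are started.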

otfrom13:12:31

claypoole (mentioned above) might be a way to solve some of that as at least it would be a different thread pool

otfrom13:12:27

(that is an ill informed guess btw, I'd want to read up more on how I'd do that first)

dharrigan13:12:11

What I have can definitely be parallelised, for fetching files from S3 - same function can download from different buckets

dharrigan13:12:33

How many threads can a modern cpu launch?

otfrom13:12:55

threads are an issue of memory rather than CPU

otfrom13:12:24

tho if you have too many at the same time then you might have trouble if they are all trying to do work and the switching overhead gets you

otfrom13:12:00

but if you are downloading a lot from S3 you might create too many threads at once and exhaust memory (possibly by each thread holding a lot of data from s3 at the same time)

dharrigan13:12:28

I wouldn't be launching a thread per key at the lowest level, i.e. if the buckets are organised as year=2019/month=12/day=15/hour=13, I would launch a thread at the day level, then that function would pull back all the files contained in the hour bucket.

dharrigan13:12:39

so I suppose, max about 30/31 threads.
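
A rough sketch of the day-level idea, assuming a recent core.async (one that has `to-chan!`) and using `pipeline-blocking` to cap how many fetches run at once. `day-prefixes`, `fetch-day` and `fetch-month` are hypothetical names, the key layout is only guessed from the example above, and the sleep stands in for the real blocking download.

```clojure
(require '[clojure.core.async :as a])

;; Builds one key prefix per day; the exact key layout is an assumption.
(defn day-prefixes [year month days]
  (for [d (range 1 (inc days))]
    (str "year=" year "/month=" month "/day=" d)))

;; Hypothetical stand-in for "pull back every file under this day".
(defn fetch-day [prefix]
  (Thread/sleep 10)
  {:prefix prefix})

(defn fetch-month
  "Fetches every prefix with at most `parallelism` blocking fetches in flight."
  [parallelism prefixes]
  (let [out (a/chan)]
    ;; pipeline-blocking runs the transducer on dedicated threads and
    ;; closes `out` once the input channel is exhausted.
    (a/pipeline-blocking parallelism out (map fetch-day) (a/to-chan! prefixes))
    (a/<!! (a/into [] out))))
```

For example, `(fetch-month 8 (day-prefixes 2019 12 31))` fetches 31 days with 8 at a time. `pipeline-blocking` returns results in input order, which is harmless when order does not matter.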

dharrigan13:12:44

I'll have a play, and do some benchmarking around memory/cpu and see what I can discover 🙂

dharrigan13:12:23

are claypoole and manifold competitors?

dharrigan13:12:28

(in the friendly sense)

mccraigmccraig13:12:34

ideally you want to do async i/o (rather than threaded i/o)... then you don't much care about how many threads you are using

dharrigan13:12:09

I see - is there an approach/framework for that in Clojure?

mccraigmccraig13:12:41

e.g. aleph client does async requests and returns a promise of the result... it won't block and when a response (or error) is received the promise will be resolved (or rejected)

mccraigmccraig13:12:50

you generally don't need to care much about which thread the response will be processed on

mccraigmccraig13:12:51

until you do, but when you do manifold lets you control threadpools

mccraigmccraig13:12:17

yes @dharrigan, aleph and manifold will do it

mccraigmccraig13:12:53

there are some core.async libs too, tho i haven't looked at core.async for ages

dharrigan13:12:22

I think aleph is too low level, I'm using the cognitect aws library (wonderful!) to connect and retrieve objects

dharrigan13:12:32

aleph seems to handle doing udp/tcp etc..

dharrigan13:12:49

I think I just need to wrap the cognitect aws library calls - in async?

dharrigan13:12:32

I guess manifold can wrap the function

dharrigan13:12:39

and return a promise

mccraigmccraig13:12:45

ah, you are talking about s3 specifically... the newer aws Java libs do async properly (callback based, rather than cheaty futures) , so wrapping those with manifold is def an option

mccraigmccraig13:12:55

but it's a bit of a rabbit hole

dharrigan13:12:01

actually, just looking at their official example

dharrigan13:12:05

at the bottom, shows async

mccraigmccraig13:12:57

unless it's core to what you are doing, or you are just playing for learning, i would use an existing clj s3 client if you just want to get stuff done

mccraigmccraig13:12:01

ah, cool - does the cognitect aws client do async properly now?

dharrigan13:12:09

yes, I'm using the aws-api and it's working fantastically - but sequentially - just playing around to see if I can make it faster by doing things which can be done in parallel - like downloading from multiple buckets (the order in which the files are received/processed is not important)

dharrigan13:12:24

@mccraigmccraig I'll soon find out - I'll have a play 🙂

dharrigan13:12:14

I always find reasoning about parallel stuff hard

mccraigmccraig13:12:04

one of the things i like about async stream-of-promises stuff is it makes reasoning about concurrency very explicit... operations are values, concurrency is a buffer size

mccraigmccraig13:12:29

what's the cognitect aws api using as its http client?

dharrigan13:12:30

I'm not sure...looking

dharrigan13:12:57

I think Jetty

mccraigmccraig14:12:24

looks like the http client is pluggable https://github.com/cognitect-labs/aws-api/blob/master/src/cognitect/aws/http.clj tho i haven't found the default impl yet

mccraigmccraig14:12:47

cognitect.http-client but i can't find the source for it

dharrigan14:12:08

Yeah, you looked in the same places as me

dharrigan14:12:28

I think it's a hidden library, but I suppose it can be downloaded and exploded

rickmoynihan10:12:17

Yeah weird that it’s not in a repo; I have the jar in my local .m2 repo though, and took a look… It looks like it’s Apache licensed. And it looks like it’s async:
- It’s built on jetty’s client with the non-blocking interface: https://www.eclipse.org/jetty/documentation/current/http-client-api.html#http-client-async
- It optionally takes a core.async channel, and always returns one which contains the response and headers or an error.
The client looks quite good actually, not sure why it’s not in a repo somewhere; might be worth adding an issue to the aws lib to ask them to publish it too.

rickmoynihan10:12:55

Though I’m guessing this is deliberate, that they don’t want it to be widely used outside of the aws lib; as they probably use it internally and want to evolve it slowly.

rickmoynihan10:12:39

i.e. it’s open source but they’re kinda simulating private by not publishing a repo for it.