#clojure-uk
2018-11-06
guy10:11:45

Morning!

maleghast12:11:42

Morning All!

maleghast12:11:07

My 2 Questions for the day: 1. What are all the cool kids using to do Spark with Clojure? 2. Does anyone have experience using Spark to distribute high volumes of what amount to image transforms? (cropping portions out of big Raster images using GDAL and Shapefiles)

maleghast12:11:48

TLDR; I may need a good Spark primer and I want to keep it all Clojure.

alexlynham12:11:22

I ended up just using scala for spark

mccraigmccraig12:11:59

@maleghast being contrarian, and depending on where your data is coming from and going to, if it's basically a work-distribution problem, i might be tempted to do that as a kafka-streams app (with images or paths shoved onto a topic) , 'cos when i've fiddled with spark before the pain has all been about submitting your job to the spark cluster and the kafka-streams "i'm just a library" approach gets rid of all that ceremony

mccraigmccraig12:11:48

if it's about something else, like having workers near to a shard, then spark may be a much better fit though

maleghast12:11:38

I need machine-level distribution, to LOTS of machines... Need to cut days' worth of single-machine time down to hours or preferably minutes of job-time through distribution.

alexlynham12:11:49

spark is great for analysis and shuffling multi-terabyte database ops etc, but for streaming @mccraigmccraig is right, just use kafka

alexlynham12:11:57

so it is more of a data science workload?

maleghast12:11:58

Can I do that with Kafka?

alexlynham12:11:22

if you're moving data -> kafka if you're doing ML or stats w/data -> spark

mccraigmccraig12:11:36

yes @maleghast you can distribute jobs to lots of machines with kafka

maleghast12:11:06

Thousands of Shapefile crops of hundreds of Raster files

rickmoynihan10:11:34

Is this really a big data job? Can’t you do this in parallel on a single machine with a fold and a reducer? Or a partition-all to split into batches and execute each on a future, with a (map deref ,,,) to wait for all to finish?
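
A rough sketch of that partition-all + future idea in plain Clojure - `crop-raster!` here is a placeholder for whatever fn performs a single crop:

```clojure
;; split the jobs into batches, run each batch on its own future,
;; then wait for them all - crop-raster! is a hypothetical one-crop fn
(defn crop-all! [crop-raster! jobs batch-size]
  (->> (partition-all batch-size jobs)
       (mapv (fn [batch] (future (run! crop-raster! batch)))) ; mapv is eager, so every future starts now
       (run! deref)))                                          ; block until every batch has finished
```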

rickmoynihan10:11:00

or perhaps gnu parallel?

rickmoynihan10:11:51

sorry just scrolled down and seen other replies suggesting a fat machine too

mccraigmccraig12:11:06

if you need to do general (full-table, not streaming) joins on outputs then spark is much more your beast though

👍 4
maleghast12:11:11

OK, so answer to #1 is "Use Kafka instead of Spark" I think...

maleghast12:11:09

I need to get my head around how the job distribution angle happens though...

maleghast12:11:01

I mean if I have all the jobs going into a Kafka queue, how do I do the work in a distributed fashion?

mccraigmccraig12:11:17

@maleghast you have lots of partitions on your kafka topic

maleghast12:11:54

It's literally a command line operation per message and I want to distribute that work to MANY machines

mccraigmccraig12:11:18

then you run lots of consumer (kafka-streams app probly) processes
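
Roughly: every worker process joins the same consumer group, and Kafka hands each worker a share of the topic's partitions - that's what spreads the work across machines. A minimal interop sketch, assuming a hypothetical `crop-jobs` topic and `process-job!` fn, with commit/failure handling glossed over:

```clojure
(ns crop-worker.consumer
  (:import (java.time Duration)
           (java.util Properties)
           (org.apache.kafka.clients.consumer KafkaConsumer)))

(defn run-worker!
  "Poll the crop-jobs topic forever, handing each message to process-job!.
   Every worker uses group.id \"crop-workers\", so the partitions get split between them."
  [process-job!]
  (let [props (doto (Properties.)
                (.put "bootstrap.servers" "localhost:9092")
                (.put "group.id" "crop-workers")
                (.put "key.deserializer" "org.apache.kafka.common.serialization.StringDeserializer")
                (.put "value.deserializer" "org.apache.kafka.common.serialization.StringDeserializer"))
        consumer (doto (KafkaConsumer. props)
                   (.subscribe ["crop-jobs"]))]
    (while true
      (doseq [record (.poll consumer (Duration/ofMillis 1000))]
        ;; each message value is assumed to describe one crop job (e.g. a path or EDN spec)
        (process-job! (.value record))))))
```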

mccraigmccraig12:11:01

do you have a way of starting lots of instances of a container ? k8s or dc/os etc ?

maleghast12:11:29

My thinking re Spark was to use AWS EMR

maleghast12:11:44

so spin up X instances -> chug through work -> outputs into S3 -> kill instances

mccraigmccraig13:11:00

if you don't have a familiar way of starting a lot of instances of a container then the pragmatic aspects of doing that may influence your solution... EMR may be sensible - although ECS or EKS may also be sensible with kafka - i'm not familiar enough to be very helpful

maleghast13:11:52

That's ok - lots of helpful food for thought already, thanks 🙂

maleghast13:11:45

What I am sensing is that if I loaded all of the jobs into a Kafka topic and then unleashed a swarm of consumers on that topic..?

maleghast13:11:56

(In broad terms, yeah?)

mccraigmccraig13:11:41

yep, that would do what you want. and if you use kafka-streams for the consumer then your life will be simpler (you won't have to think so much about low-level consumer details, failures, retries etc)
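
A minimal kafka-streams worker along those lines might look something like this - the `crop-jobs` topic, the `run-crop!` fn and the string serdes are all assumptions, and error handling is left out:

```clojure
(ns crop-worker.stream
  (:import (java.util Properties)
           (org.apache.kafka.streams KafkaStreams StreamsBuilder StreamsConfig)
           (org.apache.kafka.streams.kstream ForeachAction)))

(defn start-worker! [run-crop!]
  (let [builder (StreamsBuilder.)]
    ;; every message on crop-jobs is assumed to describe one crop job
    (.foreach (.stream builder "crop-jobs")
              (reify ForeachAction
                (apply [_ _k v] (run-crop! v))))
    (let [props (doto (Properties.)
                  (.put StreamsConfig/APPLICATION_ID_CONFIG "crop-worker")
                  (.put StreamsConfig/BOOTSTRAP_SERVERS_CONFIG "localhost:9092")
                  (.put StreamsConfig/DEFAULT_KEY_SERDE_CLASS_CONFIG
                        "org.apache.kafka.common.serialization.Serdes$StringSerde")
                  (.put StreamsConfig/DEFAULT_VALUE_SERDE_CLASS_CONFIG
                        "org.apache.kafka.common.serialization.Serdes$StringSerde"))]
      ;; kafka-streams is "just a library": start as many of these processes as you like
      ;; and partitions of crop-jobs get shared between them
      (doto (KafkaStreams. (.build builder) props)
        (.start)))))
```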

jasonbell13:11:26

In times gone by I would have used Spark for all sorts of things but tend to hold off as it’s usually a sledgehammer to crack a peanut.

jasonbell13:11:05

And Spark on AWS, never used EMR always did it on spot instances.

jasonbell13:11:10

As long as the message size < 16mb then I’d push it through Kafka to a streaming app as @mccraigmccraig has pretty much explained.

maleghast13:11:12

Thanks chaps 🙂

otfrom13:11:26

I'd have a look at powderkeg for spark stuff too if that still ends up being the right way to handle it

otfrom13:11:46

for me : if it is streaming and you need it right away -> kafka + (streams/other)

otfrom13:11:01

if it is batch: spark (or on a single large box if you can do it)

otfrom13:11:35

@maleghast there are some pretty massive machines you can get out there for short periods of time and very cheap on spot instances. Have you thought about lots of threads on a single box?

maleghast13:11:38

@otfrom - Thanks for that, it's certainly an option,

maleghast13:11:09

I need to figure out how to do the MANY threads thing if I go that route...

maleghast13:11:05

Basically it's just a command-line operation - could I just use Clojure Parallelism and something like Conch..?

jasonbell14:11:44

Suppose the first question, @maleghast, is “How much data are we talking about?”

maleghast14:11:09

@jasonbell - hundreds, growing to thousands, of Raster images on multiple bands, each needing up to hundreds of specific polygons cropped out

maleghast14:11:36

I didn't see that you'd replied, sorry...

maleghast14:11:55

The initial workload we have has a "back of the fag packet" calculation of about 3 days running on one of the Data Scientists' work laptops

mccraigmccraig15:11:21

how often are you going to run similar workloads ?

maleghast15:11:26

one of the reasons that I want to get the total run-time down is that we are going to be doing a lot of this stuff, and we can't wait days for workloads to complete, but also I may need to run several workloads in parallel in the medium term so it can't be labour intensive; I need to automate a lot and wisely.

mccraigmccraig15:11:25

assuming it's mostly a CPU-bound job - do it on a machine with an SSD, and use pmap, or for more control: stuff your jobs into a manifold stream, set a buffer-size equal to the number of hardware threads on the machine, map over it processing each job in a manifold.deferred/future, reduce to get a result... adjust the buffer-size up if you aren't thrashing all your cores
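
Sketched with manifold, that might look roughly like this - `process-job` is a placeholder for the single-crop fn and `n-threads` would be set to the machine's hardware thread count:

```clojure
(require '[manifold.stream :as s]
         '[manifold.deferred :as d])

(defn process-all
  "Run process-job over jobs with at most n-threads in flight; returns a vector of results."
  [process-job n-threads jobs]
  @(->> (s/->source jobs)
        (s/map (fn [job] (d/future (process-job job)))) ; each job runs in a manifold.deferred/future
        (s/buffer n-threads)                            ; let n-threads jobs be in flight at once
        (s/realize-each)                                ; wait for each deferred as it's consumed
        (s/reduce conj [])))                            ; collect the results into a vector
```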

maleghast15:11:08

OK, that all sounds promising

maleghast15:11:08

There are Java libraries for GDAL but they are notoriously hard to work with and the operations that we want to do are "easy" on the command-line...

mccraigmccraig15:11:09

ah, yeah, i forgot about the shell-out bit

mccraigmccraig15:11:01

well depending on what GDAL does itself with threads you may or may not have a good time running it concurrently on the same machine

maleghast15:11:04

I could use Conch to write a function that could be applied with pmap or run on a manifold.deferred queue
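
A sketch of that shell-out fn using clojure.java.shell rather than Conch - the gdalwarp flags here are illustrative and would really come from the Data Scientist's existing spike:

```clojure
(require '[clojure.java.shell :refer [sh]])

(defn crop! [{:keys [raster shapefile output]}]
  ;; illustrative gdalwarp call - exact flags/paths are assumptions
  (let [{:keys [exit err]} (sh "gdalwarp" "-cutline" shapefile "-crop_to_cutline" raster output)]
    (when-not (zero? exit)
      (throw (ex-info "gdalwarp failed" {:raster raster :err err})))))

;; pmap gives a fixed, modest level of parallelism (roughly cores + 2)
(defn crop-all! [jobs]
  (dorun (pmap crop! jobs)))
```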

maleghast15:11:58

Or I can _try_ with the Java Libs...

maleghast15:11:25

How many h/w threads per core..?

maleghast15:11:46

(I want to say it's 4, don't know why)

mccraigmccraig15:11:10

if GDAL is doing its own threading then you may be best off letting it do its thing and running jobs serially on a big fat machine or many machines

mccraigmccraig15:11:38

dunno about threads per core - my i7 has 2 per core i think... but i dunno if it's sometimes 4

maleghast15:11:42

Well, yeah, my original thought was distributing the job to MANY machines

maleghast15:11:49

hence starting out asking about Spark

mccraigmccraig15:11:55

if GDAL runs single threaded then you may be able to get a 16x speedup just by running on a fat machine, but if it's already saturating your CPUs on a 4-core laptop then your fat-machine speedup may be much less

maleghast15:11:27

'cos Spark /Hadoop would do the distribution for me, and I am lazy like that...

maleghast15:11:21

Data Scientist spiking the job is using Python 3 and the GDAL libs for Python.

maleghast15:11:20

So, the Python interpreter is thread-locked, so his time estimates are based on single-threaded behaviour on a 4 core Mac Laptop

mccraigmccraig15:11:07

python is also kinda slow - you might get a significant speedup just by using a faster runtime

otfrom15:11:51

the GDAL stuff doesn't have to be bad (I did some work with it recently), but you just want to be careful around resources hanging around. What is your source input?

otfrom15:11:29

(not sure if it has the right tools to help you or not)

otfrom15:11:10

depending on what you are doing you might want to do something with postgis as a solution too

mccraigmccraig15:11:31

postgis ftw for geospatial object processing - not sure about rasters tho

dotemacs15:11:46

Jumping into this discussion a bit late, but if you feel that you need to use some sort of queue, maybe look at Redis before using Kafka, and in particular Redis 5, since it just got streams added and the carmine library already supports them. So you can have the same functionality as you would with Kafka, in the sense that you’d have a stream to work on. It’s also easier to maintain than Kafka + Zookeeper(s).

👍 4
mario-star 4
maleghast16:11:28

@otfrom - We are using postgis to manage our polygons, but I don't want to put Raster images (100s of MBs) into a DB...

maleghast16:11:00

I am going to look at Factual/geo now - if it supports GDAL clip/crop we will have a winner...

maleghast16:11:08

@mccraigmccraig - yeah it's not a lot of help with image manipulation, but it's a great tool for storing geo-polygons and reasoning about them.

maleghast16:11:28

@dotemacs - I will definitely look at that, thanks 🙂

maleghast16:11:58

@otfrom - Thanks for the heads-up about https://github.com/Factual/geo 🙂 It's not what I need for _this_ thing, but it's absolutely going to come in handy!

jasonbell19:11:38

@robert.g.jones 👋 Did see you join earlier but never said hello, apologies for that. Welcome!