2018-11-06
Channels
- # announcements (2)
- # beginners (97)
- # boot (3)
- # cider (23)
- # clara (9)
- # cljs-dev (40)
- # cljsrn (6)
- # clojure (107)
- # clojure-finland (2)
- # clojure-india (3)
- # clojure-italy (15)
- # clojure-nl (2)
- # clojure-spec (107)
- # clojure-uk (91)
- # clojurescript (28)
- # cursive (10)
- # data-science (4)
- # datomic (26)
- # duct (1)
- # emacs (6)
- # events (9)
- # figwheel-main (4)
- # fulcro (4)
- # graphql (2)
- # jobs (3)
- # jobs-discuss (12)
- # juxt (7)
- # kaocha (6)
- # off-topic (8)
- # onyx (2)
- # parinfer (13)
- # pedestal (32)
- # portkey (1)
- # re-frame (58)
- # reagent (17)
- # reitit (21)
- # ring-swagger (3)
- # shadow-cljs (35)
- # spacemacs (1)
- # tools-deps (33)
- # yada (13)
morning
morning
måning
My two questions for the day: 1. What are all the cool kids using to do Spark with Clojure? 2. Does anyone have experience using Spark to distribute high volumes of what amount to image transforms? (cropping portions out of big raster images using GDAL and shapefiles)
I ended up just using scala for spark
@maleghast being contrarian, and depending on where your data is coming from and going to: if it's basically a work-distribution problem, i might be tempted to do that as a kafka-streams app (with images or paths shoved onto a topic), 'cos when i've fiddled with spark before the pain has all been about submitting your job to the spark cluster, and the kafka-streams "i'm just a library" approach gets rid of all that ceremony
if it's about something else, like having workers near to a shard, then spark may be a much better fit though
I need machine-level distribution, to LOTS of machines... Need to cut days' worth of single-machine time down to hours or preferably minutes of job-time through distribution.
spark is great for analysis and shuffling multi-terabyte database ops etc, but for streaming @mccraigmccraig is right, just use kafka
so it is more of a data science workload?
if you're moving data -> kafka; if you're doing ML or stats w/ data -> spark
(broadly)
yes @maleghast you can distribute jobs to lots of machines with kafka
Is this really a big data job? Can’t you do this in parallel on a single machine with a fold and a reducer? Or a partition-all
to split into batches and execute each on a future, with a (map deref ,,,)
to wait for all to finish?
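A minimal sketch of that single-machine batching idea, assuming a hypothetical process-job function for the per-image work:

```clojure
;; Sketch only: split jobs into batches, start one future per batch,
;; then deref them all to block until everything has finished.
(defn run-all! [process-job jobs]
  (let [batches (partition-all 100 jobs)                        ; split into batches
        workers (mapv #(future (run! process-job %)) batches)]  ; mapv is eager, so every future starts now
    (run! deref workers)))                                      ; wait for all to finish
```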
or perhaps gnu parallel?
sorry just scrolled down and seen other replies suggesting a fat machine too
if you need to do general (full-table, not streaming) joins on outputs then spark is much more your beast though
I mean If I have all the jobs going into a Kafka queue, how do I do the work in a distributed fashion?
@maleghast you have lots of partitions on your kafka topic
It's literally a command line operation per message and I want to distribute that work to MANY machines
then you run lots of consumer processes (kafka-streams apps, probably)
do you have a way of starting lots of instances of a container? k8s or dc/os etc.?
if you don't have a familiar way of starting a lot of instances of a container then the pragmatic aspects of doing that may influence your solution... EMR may be sensible - although ECS or EKS may also be sensible with kafka - i'm not familiar enough to be very helpful
What I am sensing is that if I loaded all of the jobs into a Kafka topic and then unleashed a swarm of consumers on that topic...?
yep, that would do what you want. and if you use kafka-streams for the consumer then your life will be simpler (you won't have to think so much about low-level consumer details, failures, retries etc)
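For illustration only, a bare-bones interop sketch of the topic-plus-consumer-swarm idea; the "crop-jobs" topic, "crop-workers" group, and localhost broker are assumptions, and a kafka-streams app would add the retry/failure handling mentioned above:

```clojure
(import '(org.apache.kafka.clients.producer KafkaProducer ProducerRecord)
        '(org.apache.kafka.clients.consumer KafkaConsumer)
        '(java.time Duration))

;; producer: one message per image path; the topic's partition count bounds
;; how many consumers can work in parallel
(defn enqueue-jobs! [paths]
  (with-open [p (KafkaProducer. {"bootstrap.servers" "localhost:9092"
                                 "key.serializer"    "org.apache.kafka.common.serialization.StringSerializer"
                                 "value.serializer"  "org.apache.kafka.common.serialization.StringSerializer"})]
    (doseq [path paths]
      (.send p (ProducerRecord. "crop-jobs" path path)))))

;; consumer: run one of these processes per machine; Kafka spreads the
;; partitions across every member of the "crop-workers" group
(defn consume-jobs! [process-fn]
  (with-open [c (KafkaConsumer. {"bootstrap.servers"  "localhost:9092"
                                 "group.id"           "crop-workers"
                                 "key.deserializer"   "org.apache.kafka.common.serialization.StringDeserializer"
                                 "value.deserializer" "org.apache.kafka.common.serialization.StringDeserializer"})]
    (.subscribe c ["crop-jobs"])
    (loop []
      (doseq [record (.poll c (Duration/ofSeconds 1))]
        (process-fn (.value record)))
      (recur))))
```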
In times gone by I would have used Spark for all sorts of things but tend to hold off as it’s usually a sledgehammer to crack a peanut.
As long as the message size < 16mb then I’d push it through Kafka to a streaming app as @mccraigmccraig has pretty much explained.
I'd have a look at powderkeg for spark stuff too if that still ends up being the right way to handle it
@maleghast there are some pretty massive machines you can get out there for short periods of time and very cheap on spot instances. Have you thought about lots of threads on a single box?
Basically it's just a command-line operation - could I just use Clojure parallelism and something like Conch...?
I suppose the first question @maleghast is “How much data are we talking about?”
@jasonbell - hundreds, growing to thousands of raster images across multiple bands, each needing up to hundreds of specific polygons cropped out
The initial workload we have has a "back of the fag packet" calculation of about 3 days running on one of the Data Scientists' work laptops
how often are you going to run similar workloads?
one of the reasons that I want to get the total run-time down is that we are going to be doing a lot of this stuff, and we can't wait days for workloads to complete, but also I may need to run several workloads in parallel in the medium term so it can't be labour intensive; I need to automate a lot and wisely.
assuming it's mostly a CPU-bound job - do it on a machine with an SSD, and use pmap, or for more control: stuff your jobs into a manifold stream, set a buffer-size equal to the number of hardware threads on the machine, map over it processing each job in a manifold.deferred/future, reduce to get a result... adjust the buffer-size up if you aren't thrashing all your cores
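Roughly what that manifold pipeline could look like, with a hypothetical process-job standing in for the per-image work (a sketch under those assumptions, not a tuned implementation):

```clojure
(require '[manifold.stream :as s]
         '[manifold.deferred :as d])

(defn run-jobs!
  "Sketch: run process-job over jobs with ~one in-flight job per hardware thread."
  [process-job jobs]
  (let [threads (.availableProcessors (Runtime/getRuntime))]
    @(->> (s/->source jobs)
          (s/map #(d/future (process-job %)))  ; each job runs on a future
          (s/buffer threads)                   ; cap how many are in flight at once
          (s/realize-each)                     ; yield results as the futures complete
          (s/reduce conj []))))                ; deferred of all results; deref to wait
```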
There are Java libraries for GDAL but they are notoriously hard to work with and the operations that we want to do are "easy" on the command-line...
ah, yeah, i forgot about the shell-out bit
well depending on what GDAL does itself with threads you may or may not have a good time running it concurrently on the same machine
I could use Conch to write a function that could be applied with pmap or run on a manifold.deferred queue
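Something along these lines, perhaps, using me.raynes/conch to shell out; the gdalwarp cutline flags and the job-map shape are assumptions, and pmap's fixed pool is crude but may be enough here:

```clojure
(require '[me.raynes.conch :refer [programs]])

;; exposes the gdalwarp binary (must be on PATH) as a Clojure function
(programs gdalwarp)

(defn crop-raster!
  "Sketch: crop one raster to a polygon cutline taken from a shapefile."
  [{:keys [raster cutline output]}]
  (gdalwarp "-cutline" cutline "-crop_to_cutline" raster output))

;; naive parallelism: pmap's pool size is derived from the core count
(doall (pmap crop-raster! jobs))
```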
if GDAL is doing its own threading then you may be best off letting it do its thing and running jobs serially on a big fat machine or many machines
dunno about threads per core - my i7 has 2 per core i think... but i dunno if it's sometimes 4
if GDAL runs single threaded then you may be able to get a 16x speedup just by running on a fat machine, but if it's already saturating your CPUs on a 4-core laptop then your fat-machine speedup may be much less
So, the Python interpreter is thread-locked, so his time estimates are based on single-threaded behaviour on a 4-core Mac laptop
python is also kinda slow - you might get a significant speedup just by using a faster runtime
the GDAL stuff doesn't have to be bad (I did some work with it recently), but you just want to be careful around resources hanging around. What is your source input?
@maleghast you've seen this? https://github.com/Factual/geo
depending on what you are doing you might want to do something with postgis as a solution too
postgis ftw for geospatial object processing - not sure about rasters tho
Jumping into this discussion a bit late, but if you feel that you need to use some sort of queue, maybe look at Redis before using Kafka, and in particular Redis 5, since it just got streams added and the carmine library supports it already. So you can have the same functionality as you would with Kafka, in the sense that you’d have a stream to work on. It’s easier to maintain than Kafka + Zookeeper(s)
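A tiny sketch of that Redis Streams route via carmine, assuming carmine exposes the new stream commands the way it exposes other Redis commands, and a hypothetical "crop-jobs" stream:

```clojure
(require '[taoensso.carmine :as car])

(def conn {:pool {} :spec {:uri "redis://localhost:6379"}})

;; producer: append one entry per image path
(car/wcar conn
  (car/xadd "crop-jobs" "*" "path" "/data/rasters/scene-0001.tif"))

;; consumer: block for up to 5s waiting for entries newer than the last id seen
(car/wcar conn
  (car/xread "BLOCK" 5000 "STREAMS" "crop-jobs" "$"))
```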
@otfrom - We are using postgis to manage our polygons, but I don't want to put Raster images (100s of MBs) into a DB...
I am going to look at Factual/geo now - if it supports GDAL clip/crop we will have a winner...
@mccraigmccraig - yeah it's not a lot of help with image manipulation, but it's a great tool for storing geo-polygons and reasoning about them.
@otfrom - Thanks for the heads-up about https://github.com/Factual/geo 🙂 It's not what I need for _this_ thing, but it's absolutely going to come in handy!
@robert.g.jones 👋 Did see you join earlier but never said hello, apologies for that. Welcome!