#clojure-uk
2017-03-09
agile_geek08:03:07

@mccraigmccraig nothing wrong with 'stealing' ideas... @weavejester admitted 'stealing' from @luke who probably 'stole' them from someone else!

otfrom08:03:33

buon giorno (good morning)

yogidevbear08:03:25

Morning 😄 Thanks for the reminder @otfrom 👍

yogidevbear08:03:52

The team involved in organising all of these meetups and recordings rock. You guys are awesome 🎉

otfrom08:03:04

the recordings and hosting for the talks are thanks to skillsmatter. The organisation is done by many members of the community (a number of whom are here)

thomas08:03:52

@mccraigmccraig Just remember: good artists copy, great artists steal!!!

otfrom08:03:30

thomas I think of myself as distinctly mediocre and I steal all the time

thomas09:03:56

@otfrom maybe we just try and aspire to be great.

thomas09:03:19

morning btw

otfrom09:03:11

I'm all about the trying and the aspiration

mccraigmccraig11:03:19

@otfrom dyu use sparkling at MC ?

otfrom12:03:07

mccraigmccraig yes

otfrom12:03:10

tho not the latest

otfrom12:03:19

marginally prefer it to flambo

mccraigmccraig12:03:50

have you used the spark-sql support ?

mccraigmccraig12:03:45

we've got some jobs in vanilla sparkling, and they are getting more complex and it would be a lot easier to write them in a sql dialect

otfrom12:03:56

tho if I was going to do sql on stuff in s3 I'd probably look at Amazon Athena https://aws.amazon.com/athena/

otfrom12:03:26

most of my spark stuff is dealing with things that don't fit well into sql style

mccraigmccraig12:03:31

it's out of c* so we are kind of tied to spark without getting more complicated

otfrom12:03:57

which version of c* and the c* driver?

mccraigmccraig12:03:43

DSE 5, which is c* 3.0 and spark 1.6 iirc

mccraigmccraig12:03:41

i'm not particularly wed to DSE except that the hive metastore impl gives us tableau interactivity

mccraigmccraig12:03:50

which has proven to be very convenient

otfrom12:03:27

yeah, having a way of doing sql and jdbc connectors makes a lot of things easier

otfrom12:03:11

mccraigmccraig things seem to be moving away from that a bit w/flambo, sparkling, parkour and powderkeg

otfrom12:03:35

powderkeg is the one that interests me the most atm. I like the idea of transducer-like stuff rather than ->>-like stuff

otfrom12:03:50

but this might be a function of the kind of thing I'm doing
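[A minimal sketch of the difference on toy data; xf is just an illustrative name. The ->> style threads a concrete collection through each step, while the transducer style composes the steps into a standalone value that can be applied to any source:]

;; ->> style: thread a concrete collection through each step
(->> (range 10)
     (map inc)
     (filter even?))
;; => (2 4 6 8 10)

;; transducer style: compose the steps first, apply to a source later
(def xf (comp (map inc) (filter even?)))
(into [] xf (range 10))
;; => [2 4 6 8 10]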

mccraigmccraig12:03:06

powderkeg looks nice

otfrom12:03:10

cascalog is pretty dormant now too 😕

mccraigmccraig12:03:40

yeah, i don't want anything to do with deploying hadoop either... and cascalog had many shortcomings - the macros were pretty horrible

mccraigmccraig12:03:23

but the underlying set of cascading ops was pretty neat... and composition was fairly straightforward

otfrom12:03:04

I'm hoping onyx comes through. I'd like to have something clojure all the way down. I'm not into the idea of writing much java and the idea of fixing some scala fills me w/dread

otfrom12:03:41

cascading is cool, though even that seems to be stalling after the parent company got acquired (I think)

lsnape13:03:00

I really like using cascalog when all I need to do is unions and joins. I've found it can get messy beyond that

rickmoynihan14:03:01

otfrom: what’s your impression of onyx? Every time I look at it I hear lots of talk of zookeeper / scheduling / cool distributed systems stuff etc… and almost no talk of data transformation.

otfrom14:03:54

rickmoynihan we're finding it good for streaming ETL, but it is very early days for us still (and we're using it a bit out of its comfort zone)

otfrom14:03:12

the crew on the #onyx channel are really responsive and helpful though

rickmoynihan14:03:41

by streaming ETL - do you mean push data capture?

otfrom14:03:29

we're doing some small bits of analytics atm, but nothing exciting yet. More trying to get used to how to do the ops side of things on some simple flows (archiving from kafka and a bit of analysis)

jasonbell14:03:43

@rickmoynihan Onyx is excellent, I really like it. I've got a few more blog posts in the pipeline to do especially on payload calculations and other interesting things we've learned along the way.

yogidevbear14:03:33

I have a Clojure-related question. I'm doing something at work (non-Clojure) and massaging some data into a format I'd like and this is a little tedious using our current language. So I created a gist to very loosely show a similar example of the initial data structure I have and what I'd like to convert it to. I've used JSON notation for the data structures in the gist. I'd be really interested to see how this manipulation could be achieved using Clojure (if anyone has some time and is interested in taking a look). https://gist.github.com/yogidevbear/4b386f10c63ba008d3f7b49524262cf0

yogidevbear14:03:29

We have a lot of code in our system that does similar types of things where the data is retrieved relatively flat and then looped over in many differing levels so I figure it might be a good case study for the powers that be in the company to see how much easier / more efficiently the code could be written to achieve the same end result

yogidevbear14:03:59

Plus I get to learn some Clojure in the process 🙂

lsnape14:03:13

Yeah something like (reduce (fn [acc elem] …) [] (group-by :judges)) is how I’d go about it
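[A minimal sketch of that reduce-over-group-by shape, on illustrative rows rather than the gist's exact data; the output shape here is a guess:]

(def rows [{:entry "E1" :judge "J1" :comment "c1"}
           {:entry "E1" :judge "J1" :comment "c2"}
           {:entry "E1" :judge "J2" :comment "c3"}])

;; collapse each judge's rows into one map of collected comments
(reduce (fn [acc [judge rs]]
          (conj acc {:judge judge :comments (mapv :comment rs)}))
        []
        (group-by :judge rows))
;; => [{:judge "J1", :comments ["c1" "c2"]}
;;     {:judge "J2", :comments ["c3"]}]   (group order is unspecified)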

yogidevbear14:03:24

So for the option_2.txt in that gist, it would be a reduce with a group-by on :comments that is encapsulated within an outer reduce with a group-by on the :judges (or something along those lines)?

yogidevbear14:03:25

I'm guessing it might be a little more complicated than that

lsnape14:03:35

Sounds about right! You could also experiment with creating some intermediate maps to help with building the final structure

mccraigmccraig14:03:12

for opt1 you can group-by with (juxt :entry :judge), then for opt2 you can take the output of opt1 and group-by with :entry
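[For context, (juxt :entry :judge) returns a function that applies both keywords to a record and collects the results, which is what makes it usable as a compound grouping key:]

((juxt :entry :judge) {:entry "E1", :judge "J2", :comment "c"})
;; => ["E1" "J2"]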

yogidevbear14:03:52

If you had to guesstimate, how quick/efficient would these types of solutions be with datasets of e.g. tens of thousands of rows?

yogidevbear15:03:24

I realise that question might be similar to "How long is a piece of string?"

mccraigmccraig15:03:32

dunno... benchmark it... where are your rows coming from? fewer than millions of rows will all fit into not too much memory though, so it should be pretty fast... probably considerably less than the time it takes to read from the disk or network
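[One way to benchmark such a transformation; criterium is a common choice, though the dependency version below is an assumption and rows is the illustrative data from the earlier sketch:]

;; project dependency (assumed version): [criterium "0.4.6"]
(require '[criterium.core :refer [quick-bench]])

;; statistically sound timing of the grouping step over sample rows
(quick-bench (group-by (juxt :entry :judge) rows))

;; or a quick one-off measurement with clojure.core/time
(time (group-by (juxt :entry :judge) rows))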

yogidevbear15:03:42

Rows would be coming from a database (in this particular case, MS SQL Server)

yogidevbear15:03:01

I'm a strong believer in trying to do a lot of the grunt work in the initial SQL commands and letting SQL do what it's best at, but sometimes these queries get particularly complex and hard for all members on the team to maintain, so it would be good to assess the alternatives from the perspective of Clojure functions 🙂

lsnape15:03:50

@yogidevbear I agree, although writing the transformations in pure clojure functions makes for much easier testing

yogidevbear15:03:15

So if I specify:

(def initial_data [
  { :entry "E1", :judge "J1", :comment "E1J1C1" },
  { :entry "E1", :judge "J2", :comment "E1J2C1" },
  { :entry "E1", :judge "J1", :comment "E1J1C2" },
  { :entry "E2", :judge "J1", :comment "" },
  { :entry "E2", :judge "J2", :comment "" },
  { :entry "E3", :judge "J1", :comment "E3J1C1" },
  { :entry "E3", :judge "J1", :comment "E3J1C2" },
  { :entry "E3", :judge "J2", :comment "" }
])

yogidevbear15:03:25

I have initial_data in my repl now

yogidevbear15:03:15

How would I specify this, for example, using this approach: https://clojurians.slack.com/archives/clojure-uk/p1489071432499317

yogidevbear15:03:39

Like so? (group-by (juxt :entry :judge) initial_data)

yogidevbear15:03:08

So that creates a vector(?) with the paired key/index of [:entry :judge] values and the corresponding data that matches that. Is that correct?

yogidevbear15:03:40

@mccraigmccraig with that example, is this what I'm aiming at? (group-by :entry (group-by (juxt :entry :judge) initial_data))

mccraigmccraig15:03:04

it creates a map where the keys are [:entry :judge] vectors and the values are vectors of the matching records

mccraigmccraig15:03:45

you will then need to process that map of vectors to get your option-1 sequence, and then a couple more steps required to get to your option-2 sequence
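[A hedged guess at that processing step, since the gist's exact option-1 shape isn't reproduced here: collapse each [entry judge] group into one map with its comments collected:]

(for [[[entry judge] rs] (group-by (juxt :entry :judge) initial_data)]
  {:entry entry :judge judge :comments (mapv :comment rs)})
;; => ({:entry "E1", :judge "J1", :comments ["E1J1C1" "E1J1C2"]}
;;     {:entry "E1", :judge "J2", :comments ["E1J2C1"]}
;;     ...)   (group order is unspecified)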

yogidevbear15:03:17

Still feels like a lot less effort than the hoops I'm currently jumping through

mccraigmccraig15:03:39

you will probably want to use -> , ->> and maybe as-> to make your code flow nicely @yogidevbear

yogidevbear15:03:56

I've got my homework cut out for me 🙂

mccraigmccraig15:03:14

e.g. (->> initial_data (group-by (juxt :entry :judge)) (make-option-1) (group-by :entry) (make-option-2))

mccraigmccraig15:03:34

or

(->> initial_data
  (group-by (juxt :entry :judge))
  (make-option-1)
  (group-by :entry)
  (make-option-2))

yogidevbear15:03:35

With make-option-1 and make-option-2, I'm guessing this is placeholder text for some functionality that I'd still need to define?

mccraigmccraig15:03:02

something that takes a {group-by-key [record...]} map and outputs one of your options
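[Putting the thread together, a sketch of what the two placeholders might look like; both target shapes are guesses at the gist's options, not its exact format:]

(defn make-option-1
  "Collapse a {[entry judge] [record ...]} map into one map per pair."
  [grouped]
  (for [[[entry judge] rs] grouped]
    {:entry entry :judge judge :comments (mapv :comment rs)}))

(defn make-option-2
  "Nest each entry's per-judge maps under a single entry map."
  [grouped]
  (for [[entry ms] grouped]
    {:entry entry :judges (mapv #(dissoc % :entry) ms)}))

(->> initial_data
     (group-by (juxt :entry :judge))
     (make-option-1)
     (group-by :entry)
     (make-option-2))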

glenjamin17:03:57

terraform users, you may like this if you haven’t heard of it already: https://github.com/coinbase/terraform-landscape

jonpither20:03:30

@glenjamin we use ClojureScript to manipulate and derive Terraform... https://github.com/juxt/roll, but it's a work in progress 🙂

jonpither20:03:19

landscape looks very helpful - thanks for posting

glenjamin20:03:21

I have a pull request that's been open for over a year to add a basic JSON output to plan :(

jonpither20:03:23

hmm good idea 🙂