#clojure-uk
2016-08-20
chrisjd07:08:50

@otfrom Nice, are you eagerly reading the entire input file into memory with vec intentionally? I guess if it's a small amount of processing on small files, you'll be IO bound, so eager reading makes sense.

otfrom07:08:56

chrisjd actually, I should remove that vec given that I'm pushing each line onto the channel

otfrom07:08:59

thx for catching that as I want this to work over larger files
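
A minimal sketch of the lazy version being discussed, assuming core.async is on the classpath. The helper name lines->chan! is hypothetical and not from otfrom's code. It walks line-seq inside with-open rather than realising the file with vec, so only the current line needs to be held in memory.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.java.io :as io])

(defn lines->chan!
  "Puts each line of the file at `path` onto channel `ch`, then closes `ch`.
  Hypothetical helper; blocks the calling thread while it writes."
  [path ch]
  (try
    (with-open [rdr (io/reader path)]
      ;; line-seq is lazy, so the file is never fully realised in memory
      (doseq [line (line-seq rdr)]
        (a/>!! ch line)))
    (finally
      ;; close the channel even if reading fails, so consumers can finish
      (a/close! ch))))
```

Because >!! blocks when the channel is full, this would normally run on its own thread, for example (a/thread (lines->chan! "some-file.csv" ch)), where the file name is a placeholder.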

otfrom07:08:15

I'm thinking about changing it to pipeline-blocking to work over a seq of files with onto-chan, but I worry a bit about the resource handling without with-open (closing on exceptions, and having (map #(close-that-stream %)) feels weird)

otfrom07:08:09

I know that pipeline-blocking can take an exception handler that I can probably use to handle some of the things that can go wrong
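
One way this could look, as a sketch rather than otfrom's actual code: keep with-open inside the per-file job so each reader is closed even when the job throws, and pass an exception handler as the last argument to pipeline-blocking. The names count-lines and process-files are hypothetical, and the line count stands in for whatever the real processing is.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.java.io :as io])

(defn count-lines
  "Hypothetical per-file job: opens, processes and closes one file."
  [path]
  (with-open [rdr (io/reader path)]
    {:path path :lines (count (line-seq rdr))}))

(defn process-files
  "Runs count-lines over `paths` on up to `n` threads.
  Returns a channel of results in input order."
  [n paths]
  (let [in  (a/chan)
        out (a/chan n)]
    ;; onto-chan puts every path on `in` and then closes it
    (a/onto-chan in paths)
    (a/pipeline-blocking n out (map count-lines) in true
                         ;; a non-nil return value from the handler is
                         ;; placed on `out` instead of a result
                         (fn [ex] {:error (.getMessage ^Throwable ex)}))
    out))
```

With this shape a missing or unreadable file turns into an {:error ...} map on the output channel instead of killing the pipeline, and there is no separate step that maps a close function over streams.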

chrisjd08:08:21

Didn’t realise the NHS published so much data like that. It must be rewarding to work on that sort of data — things that can have a real-world benefit to people.

agile_geek09:08:18

@otfrom: some of that dataset is my old friend. Some of that data is still loaded and extracted using COBOL programs I wrote in 1989-91

agile_geek10:08:29

I can still name the DB2 tables that data is held in.

agile_geek10:08:26

The shocking thing about that data is I was producing reports via COBOL using a pair of car-sized laser printers that were sent to every practice in the UK in the early 90s. Themes changed every quarter and included prescribing of statins and generic vs proprietary drug prescribing. The cost savings and habit changes identified then are still being highlighted now.

agile_geek10:08:56

GPs have not drastically altered prescribing habits in all that time

agile_geek10:08:53

I have no evidence but suspect increased admin, pressure from partially educated patients (internet in its negative aspect) and targets focussed on simple one-dimensional metrics (like patient waiting lists) mean GPs don't have time to make a convincing case for alternatives or to explain to patients, so they take the easy route to keep up throughput.

agile_geek10:08:12

Btw it was the reports not the laser printers that were put in the post! 😜

agile_geek10:08:59

TIL this month: spent a little time learning about spec. Re-found keep-indexed and map-indexed (had forgotten all about them)
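
For anyone else who had forgotten them, a quick illustration of the two functions using only clojure.core. The sample data is made up.

```clojure
;; map-indexed calls (f index item) and keeps every result
(map-indexed (fn [i x] [i x]) [:a :b :c])
;; => ([0 :a] [1 :b] [2 :c])

;; keep-indexed does the same but drops nil results
(keep-indexed (fn [i x] (when (even? i) x)) [:a :b :c :d])
;; => (:a :c)
```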

agile_geek10:08:07

Looking like the SoW I submitted to a client a few weeks back is not going to bear fruit until October or possibly November. Oh well, that's how it goes.

otfrom10:08:51

agile_geek I'll keep a look out

otfrom10:08:40

agile_geek the only thing we saw that changed GP prescribing behaviour was a Primary Care Trust (PCT) bullying the GPs to do the right thing according to the NICE guidelines and omg did they complain about that a lot

agile_geek10:08:14

Seems things have not changed since I left NHS in 2000!

agile_geek10:08:03

I've been back as a consultant since (2012 and 2014) and it's like The Land That Time Forgot

agile_geek10:08:21

In terms of data analysis it was pretty cutting edge in late eighties/early nineties

agile_geek11:08:07

Is it just me that finds core.async really hard to reason about? I struggle with how to 'unpack' values from channels within go blocks when not using the blocking take <!!

agile_geek11:08:22

For example:

(let [c (chan)]
  (go (>! c "hello"))
  (go (let [res (<! c)] (println res)))
  (close! c)) ;; randomly prints nil and hello? why?

agile_geek11:08:27

I am guessing cos println relies on side effects and is evaluated at some point after the take has parked?

agile_geek11:08:17

but if so what's best way to grab a value from core.async and write it somewhere without using a blocking take?

chrisjd12:08:14

In that instance, isn’t it just a race between <! and close!? If you allow <! to win with Thread/sleep before close! then it works fine.

agile_geek12:08:24

ahh could be

mccraigmccraig13:08:30

@agile_geek: i think you have a race between close! and >! ... if close! happens before >! nothing will get put on the channel... it shouldn't matter if close! is called before <! ... close! doesn't prevent takes, just puts

mccraigmccraig13:08:34

(let [c (chan)]
  (go (>! c "hello") (close! c))
  (go (let [res (<! c)] (println res))))

mccraigmccraig13:08:54

should be ok, though i don't have an editor to hand, so parens may be all wrong
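
The point that close! stops puts but not takes can be checked at the REPL. This is a small sketch using the blocking operations and a buffered channel so the result is deterministic. The function name close-demo is made up for illustration.

```clojure
(require '[clojure.core.async :as a])

(defn close-demo
  "Returns what each operation yields around a close! call."
  []
  (let [c           (a/chan 1)          ; buffer of 1 so the put does not block
        put-before  (a/>!! c "hello")   ; accepted: returns true
        _           (a/close! c)
        first-take  (a/<!! c)           ; buffered value is still delivered
        second-take (a/<!! c)           ; closed and drained: nil
        put-after   (a/>!! c "again")]  ; rejected: returns false
    [put-before first-take second-take put-after]))

(close-demo)
;; => [true "hello" nil false]
```

In the original snippet the channel is unbuffered and the put happens in a separate go block, so whether "hello" or nil is printed depends on whether the put and take pair up before close! runs.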

mccraigmccraig13:08:41

i generally prefer to use promises rather than core.async for any async stuff which isn't about a stream of values though... core.async doesn't help you out with any error handling and if you are thinking of a promise-chan you might as well go all the way and use a promise with built-in error handling

mccraigmccraig13:08:24

the go block stuff is pretty darn neat though

otfrom14:08:47

mccraigmccraig what I'm really after is the parallelism in pipeline. I often find myself with a seq of files (containing seqs of lines) that I want to do some mapping and then reducing over that are smaller than something I'd do in spark and I'm trying to find good ways of doing that from a performance and clarity pov

mccraigmccraig14:08:35

@otfrom: if you use manifold-streams to represent those seqs, and manifold's map/reduce operations, then that should efficiently soak up all your CPU resources while providing you with the coordination operations you need

mccraigmccraig14:08:28

also manifold streams convert straightforwardly to/from core.async channels and have error handling too
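
A rough sketch of what that conversion might look like, assuming both manifold and core.async are on the classpath. The function sum-of-squares is invented for illustration. It wraps a core.async channel as a manifold source with ->source, then uses manifold's own map and reduce.

```clojure
(require '[manifold.stream :as s]
         '[clojure.core.async :as a])

(defn sum-of-squares
  "Takes a core.async channel of numbers and returns a manifold deferred
  that yields the sum of their squares once the channel is closed."
  [ch]
  (->> (s/->source ch)        ; view the channel as a manifold source
       (s/map #(* % %))
       (s/reduce + 0)))

;; deref blocks until the channel has been drained and closed
@(sum-of-squares (a/to-chan [1 2 3]))
```

Going the other way, manifold's stream docs describe (s/connect stream chan) for feeding a stream into a core.async channel.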

otfrom14:08:49

mccraigmccraig I'll have to look at manifold again (again again)

mccraigmccraig14:08:13

@otfrom: https://github.com/ztellman/manifold/blob/master/docs/stream.md is a good intro, though there are no more detailed docs afaik - browsing the fn names in the source is instructive

otfrom14:08:16

I might look at doing more or less the same thing w/manifold and see how that goes

mccraigmccraig14:08:03

manifold's conversion capabilities are quite convincing (to me anyway) - i'm using it for everything async on the backend (which is just everything) and it makes it easy to pick up a core.async lib and plug it in, or convert from lazy-seqs etc

otfrom14:08:10

anything interacting w/the file system or remote resources?

mccraigmccraig14:08:00

actually, i tell a white lie - we interact with /tmp synchronously... everything else is async

otfrom14:08:01

(handling the resources is all a bit of a pain I'm looking at this https://github.com/pjstadig/reducible-stream/ )

otfrom14:08:25

mccraigmccraig still using transducers and things like that?

otfrom14:08:03

w/manifold that is

mccraigmccraig14:08:10

not transducers - monads and applicatives all the way baby 🙂

otfrom14:08:07

was thinking I'd do something monadic w/the transducers (but my thinking on this is all pretty early)

otfrom14:08:29

mccraigmccraig do you find you get enough help w/o the haskell compiler or using something like clojure.spec when doing monadic stuff?

mccraigmccraig14:08:29

it was very much like starting off with lazy-seqs - there are some difficult-to-grok errors at first, which are difficult to relate to their cause, but you quickly learn that there are only a very few types of errors like that and get used to how to diagnose and trace them

mccraigmccraig14:08:20

in particular, failing to wrap return values from a monadic function and not establishing the context (the lack of static types means you often have to give cats some help in identifying the monadic type)

mccraigmccraig14:08:44

i've got plenty of example code @otfrom should you want to see some

otfrom14:08:20

mccraigmccraig that would be lovely if you can point me at some

otfrom14:08:56

I'm trying to figure out what my general approach to "annoying size" data is (stuff too small for spark/hadoop but too big to do w/o thinking about it)

mccraigmccraig14:08:26

hmm... actually, i have a tonne of monadic promise-based code (e.g. https://github.com/employeerepublic/er-cassandra/blob/master/src/er_cassandra/model/select.clj ), but you really want stuff for processing streams... and i don't have any of that

otfrom14:08:58

mccraigmccraig that will at least give me a start. thx!

mccraigmccraig14:08:15

i did just get alia's manifold-stream based queries working for cassandra, for pretty much the reason you outlined (too little for spark, too big for a single query) ... so if you have stuff in cassandra that will help you get it out https://github.com/mpenet/alia/blob/master/modules/alia-manifold/src/qbits/alia/manifold.clj#L62

otfrom14:08:16

that's handy. this data size in cassie is one of the things we do a lot of in kixi.hecuba and probably more in kixi.workspaces and kixi.datastore (see various on http://github.com/mastodonc)

otfrom14:08:33

which would be handy if I use pipeline

mccraigmccraig14:08:57

you have many projects @otfrom 🙂

otfrom14:08:20

I'm a tyrant and I work people hard (and we do almost everything as open source)

otfrom14:08:48

luckily I've hired people much smarter than me

mccraigmccraig14:08:39

there's a carrot and a stick, right there