#core-async
2017-04-09
lxsameer12:04:58

Hi, I want to create a web crawler application using core.async, but it's not clear to me how exactly I should use core.async. I mean, what should my design look like?

lxsameer12:04:43

here is my current version, which is not working: http://dpaste.com/03BQJ5S

lxsameer12:04:04

I would be so happy if someone could help me understand my mistakes in this code

noisesmith15:04:27

@lxsameer for starters, on line 41 it looks like you map a series of >! to a channel, before anybody has a chance to start reading from it

noisesmith15:04:28

you seem to address this with a buffer, but a more elegant solution is to have the out channel be an argument - then the reader can park before that function even runs
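A minimal sketch of the "out channel as an argument" idea, not the code from the paste: `emit-urls!` and the example URLs are hypothetical names made up for illustration. Because the caller owns the channel, the consumer can be started first and park on it, so an unbuffered channel is enough.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical producer: the caller supplies `out`, so the function does
;; not need to create (or buffer) a channel of its own.
(defn emit-urls!
  [out urls]
  (a/go
    (doseq [u urls]
      (a/>! out u))      ; parks until a reader takes the value
    (a/close! out)))

(def out (a/chan))       ; unbuffered on purpose

;; The consumer is started first and parks on the empty channel.
(def collected
  (a/go-loop [acc []]
    (if-some [u (a/<! out)]
      (recur (conj acc u))
      acc)))             ; channel closed: return everything seen

(emit-urls! out ["http://a.example" "http://b.example"])

(def result (a/<!! collected))
```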

noisesmith15:04:28

also, does fetch> do io? if so it could easily starve your go blocks

lxsameer17:04:50

@noisesmith good points, thanks. Anything else ?

noisesmith17:04:49

those are the only things that really stand out - you could easily make the function using fetch> just use blocking takes on the channel without hurting performance (and probably improving it by not stealing a thread from core.async)
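One way to read this suggestion, as a sketch only: `fetch-blocking` is a stand-in for whatever `fetch>` does in the paste (assumed here to be blocking I/O), and `start-fetcher` is a hypothetical name. The loop runs inside `a/thread` and uses the blocking operations `<!!` and `>!!`, so it never occupies a thread that go blocks depend on.

```clojure
(require '[clojure.core.async :as a])

;; Stand-in for the real fetch (assumption: it blocks on network I/O).
(defn fetch-blocking [url]
  (Thread/sleep 10)
  {:url url :status 200})

(defn start-fetcher
  "Takes urls from `in`, puts fetched pages on `out`, closes `out` when
  `in` is closed. Runs on its own thread, so blocking here is fine."
  [in out]
  (a/thread
    (loop []
      (if-some [url (a/<!! in)]          ; blocking take
        (do (a/>!! out (fetch-blocking url))
            (recur))
        (a/close! out)))))

(def in  (a/chan 10))
(def out (a/chan 10))
(start-fetcher in out)

(a/>!! in "http://a.example")
(a/close! in)

(def page (a/<!! out))
(def after-close (a/<!! out))            ; nil once `out` is closed
```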

noisesmith17:04:41

also, your flow doesn't seem to actually go concurrent anywhere - everything is just input to output?

lxsameer17:04:26

@noisesmith So the basic idea is to read from channels in go blocks before writing to them, right?

lxsameer17:04:40

@noisesmith and how should I make them concurrent ?

noisesmith17:04:30

well - that's a design question - I just don't really know how core.async is helping at all in that code, since it looks like a one way flow of data without concurrency

noisesmith17:04:45

which means it would perform better without core.async

noisesmith17:04:02

but maybe it needs concurrency somewhere, I don't know your design well enough to answer that

lxsameer17:04:04

aha. Let me describe my understanding of core.async, and correct me if I'm wrong (which I am)

lxsameer17:04:44

Using <! inside a go block will read from a channel, and if there is no value to read, it will park and give control to other go blocks
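A small sketch of the parking behaviour described here. The numbers are arbitrary: 100 go blocks all wait on one empty channel, and another go block still gets to run, which would not be possible if every waiting block held on to a thread from a small pool.

```clojure
(require '[clojure.core.async :as a])

(def c (a/chan))

;; 100 go blocks, each parked on a take from the same empty channel.
(def waiters (doall (repeatedly 100 #(a/go (a/<! c)))))

;; Another go block can still run while all of those are parked.
(def other (a/<!! (a/go :ran)))

;; Hand one value to each parked block, then collect their results.
(dotimes [i 100] (a/>!! c i))
(def total (reduce + (map a/<!! waiters)))
```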

noisesmith17:04:17

yes, but in this code, due to the linear pipeline, that saves you exactly 1 thread

noisesmith17:04:37

and all the channel juggling is more expensive than putting all the work in one (occasionally blocked) thread would be

lxsameer17:04:07

I can't understand this one, because I thought using core.async means having a thread pool with CPU_CORES + 2 threads in it

noisesmith17:04:32

right, and you are using CPU_CORES+2 threads to do 1 thread worth of work

noisesmith17:04:47

plus useless context switching / state management

noisesmith17:04:03

if I'm reading the code correctly at least...

noisesmith17:04:59

OK - I need to back off on that because the pipeline nature means that after enough steps, all those blocks can be running

noisesmith17:04:07

so it does go parallel in that sense

lxsameer17:04:26

I've read a lot about core.async but still have a fuzzy understanding of it. Do you know any video or article that shows how to use core.async in action instead of simple examples?

noisesmith18:04:23

tbaldridge's talk at clojure/west was good

lxsameer18:04:22

I already watched that

noisesmith18:04:52

@lxsameer what I've found useful is if I think I want to use core.async, make a diagram on paper of what the communication flow should be, and see what should serialize, what needs to be parallel, (maybe even launching N go blocks depending on how much I need to fan out) and whether there's a simpler pattern that would do the same thing faster without core.async

noisesmith18:04:40

also separating out i/o so that it happens in thread blocks instead of go blocks (if you don't do that, everything suffers)

noisesmith18:04:44

like in your case, I imagine every step is very fast except the io, which I would put into thread calls, and also have N loops running all reading from the same input and writing to the same output
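A sketch of the "N loops reading from the same input and writing to the same output" shape, under stated assumptions: `slow-fetch` stands in for the crawler's real I/O, and `start-workers` is a hypothetical helper. Each worker runs in `a/thread`; a small go block waits for all of them to finish and then closes the shared output channel.

```clojure
(require '[clojure.core.async :as a])

;; Stand-in for the blocking I/O step (assumption, not the real fetch).
(defn slow-fetch [url]
  (Thread/sleep 20)
  (str "body of " url))

(defn start-workers
  "Starts n threads that all take from `in` and put onto `out`.
  Closes `out` once `in` is closed and every worker has drained it."
  [n in out]
  (let [workers (doall
                  (for [_ (range n)]
                    (a/thread
                      (loop []
                        (when-some [url (a/<!! in)]
                          (a/>!! out (slow-fetch url))
                          (recur))))))]
    (a/go
      (doseq [w workers] (a/<! w))   ; each worker channel closes on exit
      (a/close! out))))

(def urls (map #(str "http://example.com/" %) (range 6)))
(def in  (a/chan 10))
(def out (a/chan 10))
(start-workers 3 in out)

(doseq [u urls] (a/>!! in u))
(a/close! in)

;; Results arrive in whatever order the workers finish, so collect a set.
(def bodies (a/<!! (a/into #{} out)))
```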

lxsameer18:04:27

good points, I'll keep them in mind

lxsameer18:04:35

thank you so much 😉

noisesmith18:04:53

hope it helps

lxsameer19:04:57

is it wise to use core.async/thread inside the function given to core.async/map ?

noisesmith19:04:27

you aren't using core.async/map though?

lxsameer19:04:49

yeah i changed my code a lot

noisesmith19:04:48

you could - but then you would also need to read off of each channel that the thread calls produce

noisesmith19:04:03

for something like that, pipeline-blocking is a better match

lxsameer19:04:35

pipeline-blocking ?

lxsameer19:04:01

cool thanks

lxsameer19:04:13

@noisesmith can you give me an example of it ? just the syntax

noisesmith20:04:08

(pipeline-blocking 12 out (map foo) in)

noisesmith20:04:33

where (map foo) could be any transducer instead, and 12 is the parallelism (max thread usage) and out and in are your channels
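A runnable version of that call, as a sketch: `fetch-length` is a made-up stand-in for a blocking function, and the parallelism of 4 is arbitrary. By default `pipeline-blocking` closes the output channel when the input channel closes, and results come out in the same order as the inputs.

```clojure
(require '[clojure.core.async :as a])

;; Stand-in for a blocking call (assumption): sleeps, then returns a value.
(defn fetch-length [url]
  (Thread/sleep 10)
  (count url))

(def in  (a/chan 10))
(def out (a/chan 10))

;; Up to 4 threads apply the transducer to values taken from `in`.
(a/pipeline-blocking 4 out (map fetch-length) in)

(doseq [u ["http://a.example" "http://bb.example"]]
  (a/>!! in u))
(a/close! in)

(def lengths (a/<!! (a/into [] out)))
```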

lxsameer20:04:36

@noisesmith cool. but I don't know the optimal number of threads for the host platform, is there any facility to find out?

noisesmith20:04:37

(pipeline has the same syntax, but I fixed my example to show the one you would want here)

noisesmith20:04:01

I don't know - it depends on what the threads are doing and how your OS handles threads to some degree...

noisesmith20:04:28

if they are doing IO the count could be pretty high, if doing CPU you want to keep it closer to the CPU count of the machine
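As a starting point only, the JVM can report the processor count through `Runtime.availableProcessors`. The sketch below derives two parallelism values from it; the names and the multiplier of 4 for I/O-bound work are arbitrary placeholders, not recommendations, and the right number would have to come from measurement.

```clojure
;; Number of processors the JVM reports as available.
(def cpu-count (.availableProcessors (Runtime/getRuntime)))

;; CPU-bound work: stay close to the processor count.
(def cpu-parallelism cpu-count)

;; I/O-bound work: threads mostly wait, so the count can be higher.
;; The factor 4 is a placeholder assumption to be tuned by measurement.
(def io-parallelism (* 4 cpu-count))
```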