Fork me on GitHub
#core-async
<
2017-07-04
>
pawandubey18:07:37

Hello all 👋 So I asked this question a long time ago, but it got lost in a wave of posts, so asking again because I am still curious. Reference: https://github.com/pawandubey/crawljer/blob/master/src/crawljer/core.clj#L107 When the order of the two function calls in that function is reversed, my code (which is expected to start crawling the URLs provided) didn’t work - instead it immediately returned an instance of ManyToManyChannel. However, in this order, it does work. This has been most surprising to me. It seems adding urls to the channel before starting the go-loop doesn’t work as expected. Instead, I have to start the go-loop and then start adding URLs to the channel. What is the insight that I am missing here which would help me understand why this is the case?

noisesmith19:07:09

if something starts a go loop, you can expect it to immediately return a ManyToManyChannel

noisesmith19:07:18

that doesn't have anything to do with whether it works or not

pawandubey19:07:09

I see. But that doesn’t explain why it works just on switching the order.

noisesmith19:07:51

I don't fully comprehend the code yet, but another issue here is that you are doing blocking IO inside go blocks, which can easily starve core.async

pawandubey19:07:33

Do you mean the write-doc function? I added it later. When I first encountered the problem, it was just a no-op. You can ignore its effects and the question is still valid.

noisesmith19:07:48

no, I mean the call to http-get which is called inside read-url which is called inside a go block in read-urls

noisesmith19:07:22

that's a blocking read and if it were parallelized it could block up the entirety of core.async

noisesmith19:07:55

I don't think it's the core issue here, but it is a problem

pawandubey19:07:35

So you suggest not putting blocking I/O inside go-loops? I thought that was the best way of improving I/O performance? Whether disk or network.

noisesmith19:07:54

no, go blocks are not made for IO at all

noisesmith19:07:25

core.async is for coordinating things that are asynchronous, go blocks are made for simplifying coordination code, not for running tasks, and core.async can lock up and fail if you do blocking tasks inside go

pawandubey19:07:32

Interesting. What is the best way of parallelizing I/O in clojure then?

noisesmith19:07:50

there's async/thread that is meant for blocking or CPU intensive code

noisesmith19:07:25

there's also futures and ThreadPools if you don't need the sophisticated coordination core.async provides - if you are just doing a map/reduce type job

pawandubey19:07:27

My main intension was to maintain two channels - one for reading in the web-pages and another for storing the urls left to visit - both acting as producers and consumers for each other. core.async and go-loop seemed perfect for this use case so I ran with it.

noisesmith19:07:49

sure - async/thread returns a channel that a go block can park on

noisesmith19:07:12

(<! (async/thread ...)) - returns the return value of the thread

pawandubey19:07:26

Nice. Thanks!

pawandubey19:07:40

But my original question remains unanswered 😅

noisesmith19:07:02

so

(defn read-urls
  []
  (async/go-loop [current-url (async/<! urls-chan)]
    (when (read-url current-url)
      (recur (async/<! urls-chan)))))
would become
(defn read-urls
  []
  (async/go-loop [current-url (async/<! urls-chan)]
    (when (async/<! (async/thread (read-url current-url)))
      (recur (async/<! urls-chan)))))

noisesmith19:07:43

also tbaldridge's talk at clojure/west had some very good advice about this kind of code https://www.youtube.com/watch?v=096pIlA3GDo&amp;list=PLAMHgQX0SWwrMLtv3V9z8NF3QIrm-6NLi

noisesmith19:07:08

specifically about how to simplify it, separate the implementation details from the logic properly, etc.

pawandubey19:07:30

Will give it a watch. Thanks for all the help 😄

noisesmith19:07:26

I'll take a few more looks at that code - but my hunch is that reorganizing it in the way tbaldridge recommends will either eliminate the issue or make it much simpler to fix... I think his talk will make it more clear why

noisesmith19:07:06

one more small thing - you can simplify the code around line 75 by replacing ->> with ->

pawandubey19:07:42

You are right! Will refactor