#beginners
2017-09-13
irasantiago14:09:44

@didibus @noisesmith @ghadi @john thank you all for your time and input!! @noisesmith makes a good point, I should have called out what the goal of this is. 1. I would like each string in the list-of-strings to be processed in parallel. 2. I would like process-a and process-b to happen in parallel. I'm still wrapping my head around the concepts of promises and futures and how they act based on how they're used 🙂 In the (doall (mapcat... example, can someone explain to me what doall's purpose is?

noisesmith16:09:06

just to add to what @U064X3EF3 said about doall - if you do a deref of each element of a lazy sequence before reading the next item, you don’t actually start the next future until the previous completes. This is because a deref doesn’t complete until the future exits, and the lazy seq doesn’t realize the next item (causing the future to start) until it is accessed.

irasantiago17:09:15

thanks @noisesmith. so does that mean that in order for it to truly be parallel, it's the derefing that has to happen at the same time? or would trying to parallelize the derefing not truly work because it will still be blocking?

noisesmith17:09:19

what the doall accomplishes is making sure all the futures are started in parallel (or nearly enough, just enough time to make a new future between the start of each one)

noisesmith17:09:46

then, when you do the derefs, it should only take as long as the longest future, because even if you wait on one of the ones that take longer, it doesn’t slow down the others
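A minimal sketch of the difference described above, assuming a hypothetical `slow-identity` task and an illustrative `elapsed-ms` helper (neither is from the original code). The input is deliberately unchunked: vectors and `range` are realized in chunks of 32, which would hide the effect on a collection this small.

```clojure
(defn slow-identity
  "Hypothetical stand-in for real work: sleeps 200 ms, then returns x."
  [x]
  (Thread/sleep 200)
  x)

(defn elapsed-ms
  "Illustrative helper: runs the zero-arg fn f and returns elapsed milliseconds."
  [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

;; iterate yields an unchunked seq, so map realizes one element at a time.
(def inputs (take 4 (iterate inc 0)))

(defn start-futures
  "Lazy seq of futures; each future starts only when its element is realized."
  []
  (map (fn [x] (future (slow-identity x))) inputs))

;; Without doall: each future is created just before it is dereferenced,
;; so the four 200 ms tasks run one after another.
(def lazy-ms
  (elapsed-ms #(dorun (map deref (start-futures)))))

;; With doall: all four futures are started first, then awaited.
(def eager-ms
  (elapsed-ms #(let [futs (doall (start-futures))]
                 (dorun (map deref futs)))))

(println "without doall:" lazy-ms "ms, with doall:" eager-ms "ms")
```

The first figure should come out near the sum of the task times and the second near the longest single task, though exact numbers depend on the machine.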

noisesmith17:09:10

finally, NB - be careful about this sort of code that goes through a collection and starts a future for each item in it, it’s easy to feed it a collection large enough that the whole vm sputters to a near halt just managing the resources and switching between the threads. This is why we have various utilities for making thread pools with fixed size and queues for input (one higher level tool for this that someone already mentioned is pmap)
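One way to cap the number of threads is a fixed-size pool from `java.util.concurrent`. The sketch below is only an illustration of that idea; `bounded-pmap` is a hypothetical helper name, not something from this conversation or from clojure.core.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn bounded-pmap
  "Hypothetical helper: applies f to every item of coll using at most
  n-threads worker threads, returning the results in input order."
  [n-threads f coll]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)]
    (try
      ;; Clojure fns implement Callable, so they can be handed to invokeAll.
      (let [tasks (mapv (fn [x] (fn [] (f x))) coll)]
        ;; invokeAll blocks until every task is done; extra tasks wait in
        ;; the pool's queue instead of each getting a thread of their own.
        (mapv (fn [^Future fut] (.get fut)) (.invokeAll pool tasks)))
      (finally
        (.shutdown pool)))))

(bounded-pmap 2 inc (range 5))
```

Creating a pool per call keeps the sketch self-contained; a long-running app would more likely share one pool so that concurrent callers cannot multiply the thread count.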

irasantiago18:09:42

@noisesmith wow thanks so much! that was a really helpful explanation. the function will never have more than about 10 items so far but i'll definitely keep that in mind if anything changes and it has to scale

noisesmith18:09:07

also be sure to think about what happens if the function gets called in parallel (multiplying complexity) - we had a lot of bugs caused by this kind of behavior in my app, and because we waited so long to address it we had to spend a long time digging up the roots and fixing it. Also, the symptoms are never localized to the code causing this problem - all you see is that the entire app is sluggish (sometimes) without anything in the error messages that lets you know where the problem is. Or worse yet, no errors at all, just an unusable app because of the performance drag… long story short, it’s a good thing to nip in the bud early for everyone’s sanity’s sake

didibus15:09:27

(pmap
  #(let [a (future (process-a %))
         b (future (process-b %))]
     (and @a @b))
  ["1" "2" "3"])

didibus18:09:07

This is what you want. What you need to know is the future starts executing in the background once you call future, but once you deref, it blocks the foreground thread, effectively making it as if it wasn't running in a future. So you want to deref at the very last moment, only when there's nothing else left to do in your code and everything is waiting on the result.
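A small sketch of "deref at the very last moment". The bodies of `process-a` and `process-b` here are hypothetical stand-ins (a 200 ms sleep plus a string suffix), since the real functions are not shown in the conversation.

```clojure
(defn process-a
  "Hypothetical stand-in for the real process-a."
  [s]
  (Thread/sleep 200)
  (str s "-a"))

(defn process-b
  "Hypothetical stand-in for the real process-b."
  [s]
  (Thread/sleep 200)
  (str s "-b"))

(defn process-both
  "Starts both computations, then waits for both results."
  [s]
  (let [a (future (process-a s))   ; starts running immediately
        b (future (process-b s))]  ; starts while a is still running
    ;; Only now do we block; both futures are already in flight.
    [@a @b]))

(time (process-both "x"))
```

Writing `@(future (process-a s))` inline instead would block before the second future is even created, which makes the two calls run back to back.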

didibus18:09:14

Now you must also understand that parallelization adds an overhead. The future takes longer to start than just calling process-a. So parallelism often actually makes your code slower. Never do it unless you hit real performance issues. And always benchmark, because it might not help performance.
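A rough way to see that overhead, assuming a trivially cheap task (`inc`) and an illustrative `elapsed-ms` helper that is not part of the original code. This is a crude wall-clock comparison, not a rigorous benchmark; a dedicated library such as criterium is a better fit for real measurements.

```clojure
(defn elapsed-ms
  "Illustrative helper: runs the zero-arg fn f and returns elapsed milliseconds."
  [f]
  (let [start (System/nanoTime)]
    (f)
    (/ (- (System/nanoTime) start) 1e6)))

(def xs (vec (range 1000)))

(defn plain
  "Does the cheap work directly on the calling thread."
  []
  (mapv inc xs))

(defn with-futures
  "Does the same work, but wraps every call in its own future."
  []
  (mapv deref (mapv #(future (inc %)) xs)))

(println "plain:       " (elapsed-ms plain) "ms")
(println "with futures:" (elapsed-ms with-futures) "ms")
```

Both functions return the same vector; only the cost of getting there differs, and for work this cheap the thread hand-off is likely to dominate.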

didibus18:09:22

Also, when you parallelize, you generally want to parallelize as much as possible, and to target the operations that are the most expensive, i.e. the ones that take the longest.

didibus18:09:20

So in your case, you probably want to do this:

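(time
 (let [futures (doall  ; start every future before waiting on any of them
                (map #(vector (future (process-a %))
                              (future (process-b %)))
                     (range 100)))]
   ;; doall again so the derefs happen inside time, not when the result prints
   (doall (map (fn [[a b]] (and @a @b)) futures))))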

didibus08:09:48

I was on my phone before. Here's a better example.

(time
 (dorun (pmap
         #(let [a (future (Thread/sleep 1000) %)
                b (future (Thread/sleep 1000) %)]
            (and @a @b))
         (range 100))))

This one is what you were trying to do. Over a range of 100, it takes ~4 seconds on my box.

(time
 (let [futures (doall (map
                       #(vector (future (Thread/sleep 1000) %)
                                (future (Thread/sleep 1000) %))
                       (range 1000)))]
   (dorun (map (fn [[a b]] (and @a @b)) futures))))

This one, on the other hand, even over a range of 1000, takes only ~1 second. The difference is that the first one processes the list number of cores at a time. Even though all the processes (in my case a 1 second thread sleep) could happen in parallel at the same time, it's not doing it that way. The second example is doing just that: all processes are happening at the same time, so even though you've got a thousand things in the list, they all get processed in one go and "anded" at the end. It's called a map-reduce. The mapping is all parallel, and the reduce is a synchronous reduce over the results.

didibus08:09:58

Now if you do:

(time
 (let [futures (doall (map
                       #(vector (future (Thread/sleep 1000) %)
                                (future (Thread/sleep 1000) %))
                       (range 100000)))]
   (dorun (map (fn [[a b]] (and @a @b)) futures))))

Where the range is now 100 000, you'll encounter what noisesmith was talking about. You're creating so many parallel threads that you run out of memory. future has no bounds, so it's up to you to make sure this doesn't happen. The first example, while it is slower, would not have this problem, since its parallelism does not depend on the length of the input collection; pmap only keeps roughly the number of cores plus 2 tasks in flight.

didibus08:09:08

To safely use the second example, either you need to bound the size of the collection, or you need to make sure that the rate at which futures complete is faster than the rate at which they are created. So here, if I make the process shorter, sleeping only 10ms, we can run it over a 100 000 range.

(time
 (let [futures (doall (map
                       #(vector (future (Thread/sleep 10) %)
                                (future (Thread/sleep 10) %))
                       (range 100000)))]
   (dorun (map (fn [[a b]] (and @a @b)) futures))))

Which takes ~2 seconds to run on my box.

didibus08:09:46

Even a million:

(time
 (let [futures (doall (map
                       #(vector (future (Thread/sleep 10) %)
                                (future (Thread/sleep 10) %))
                       (range 1000000)))]
   (dorun (map (fn [[a b]] (and @a @b)) futures))))

Runs in only ~17 seconds!
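For the other option mentioned above, bounding how many futures exist at once, one possible sketch is to work through the input in fixed-size batches. `batched-futures` is a hypothetical helper name introduced here for illustration.

```clojure
(defn batched-futures
  "Hypothetical helper: applies f to each item of coll, running at most
  batch-size futures at a time. Returns a vector of results in input order."
  [batch-size f coll]
  (into []
        (mapcat (fn [batch]
                  ;; mapv is eager, so every future in the batch starts now...
                  (let [futs (mapv #(future (f %)) batch)]
                    ;; ...and the whole batch is awaited before the next begins.
                    (mapv deref futs))))
        (partition-all batch-size coll)))

(batched-futures 3 inc (range 10))
```

The trade-off is that each batch waits for its slowest member, so this gives up some throughput in exchange for a hard cap on the number of live futures.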

Alex Miller (Clojure team)14:09:53

future = background execution on another thread
promise = one-time delivery of a value from one thread to another
both can wait for result/delivery via deref

Alex Miller (Clojure team)14:09:12

doall forces realization of a lazy sequence
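A short illustration of those definitions using only clojure.core; the var names (`answer`, `slot`, and so on) are made up for the example.

```clojure
;; future: runs the body on another thread; deref (@) blocks until it finishes.
(def answer (future (Thread/sleep 100) (+ 1 2)))

;; promise: an empty slot that can be filled exactly once with deliver.
(def slot (promise))
(future (deliver slot :delivered)) ; another thread delivers the value

;; Both are read the same way, with deref / @.
(def results [@answer @slot])

;; deref with a timeout returns a fallback instead of blocking forever;
;; this promise is never delivered, so the fallback is used.
(def fallback (deref (promise) 50 :timed-out))

;; doall: map is lazy, so nothing is computed until the seq is walked.
(def lazy-incs (map inc [1 2 3]))
(def realized-before (realized? lazy-incs))
(doall lazy-incs)
(def realized-after (realized? lazy-incs))
```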

vuuvi15:09:08

Is it wrong to use a file as an arg in a unit test?

dpsutton15:09:28

is there a function that uses the contents of the file that you could test?

vuuvi15:09:30

like I have a function that’s taking in a .csv as an arg to parse

vuuvi15:09:16

and I want to test it in a predictable way but I don’t know how / if I should write in a fake .csv in my tests file OR if I should use a .csv in a different directory

dpsutton15:09:07

does it need to be a file? can you just have some representative text in your test file?

vuuvi15:09:39

the function I am passing it to is expecting / will only work with a .csv file

vuuvi15:09:59

It reads it in via clojure.java.io as a raw stream and then parses the contents of the .csv into a series of maps

dpsutton15:09:22

can you remove the guts of that function into another one and test that one? the original one will take a file, and call the new function with the input stream from the file?

vuuvi15:09:47

the function handles all the input streaming but that’s not what I want to test - I want to check the output of the function to make sure that it received the right kind of file and it’s parsing input correctly

skarmeta15:09:17

Can someone tell me why this function just keeps returning an empty string instead of converting the characters?

(ns rna-transcription)

;; Purpose: convert each DNA nucleotide in a sequence to its
;; RNA complement (GCTA -> CGAU)
(defn to-rna [dna-seq]
  (let [complements {:G "C", :C "G", :T "A", :A "U"}]
    (apply str (map (fn [x] (complements x)) (seq dna-seq)))
    )
)

skarmeta15:09:10

The problem is definitely in the part (complements x) -- if I replace that with a different operation in the anonymous function, I get behavior I expect

dpsutton15:09:04

i think you need (map complements dna-seq). your call to map there doesn't have enough args

dpsutton15:09:16

nevermind. misread

dpsutton15:09:38

can you paste the argument you are calling this function on?

skarmeta15:09:43

If I replace (complements x) with say (str x x), I get the doubled input string as expected

skarmeta15:09:55

FAIL in (it-transcribes-all-nucleotides) (rna_transcription_test.clj:6)                        
expected: (= "UGCACCAGAAUU" (rna-transcription/to-rna "ACGTGGTCTTAA"))                         
  actual: (not (= "UGCACCAGAAUU" ""))    

dpsutton15:09:16

one problem will be that (complements "G") = nil

dpsutton15:09:28

the keys are keywords and you are asking for the entry for a char

skarmeta15:09:38

Ah yeah, that must be the issue.

skarmeta15:09:26

Not sure how to fix it though. If I try {:"G" "C", [...]} I get a syntax error

dpsutton15:09:56

don't use a :. that doesn't mark a map entry, it starts a keyword literal

dpsutton15:09:06

{"G" "C", ...}

chris15:09:15

use chars as the key instead of keywords

dpsutton15:09:38

(keyword "G") => :G

chris15:09:47

{\G "C", \C "G", \T "A", \A "U"}

chris15:09:04

saves you from having to map keyword over your seq

dpsutton15:09:16

^. when you seq a string you get a sequence of characters, not single character strings

chris15:09:44

lol yeah, you'd have to map str and then keyword

dpsutton15:09:46

and one other note, when you map over something, map will call seq on the collection so you can remove your explicit call to that

skarmeta15:09:50

Ah thanks. I didn't realize keywords were their own type.

skarmeta15:09:54

It works now!
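Putting the suggestions above together, the working version presumably looks something like the sketch below (character keys, and no explicit `seq` since `map` already calls it); the exact code skarmeta ended up with is not shown in the log.

```clojure
(defn to-rna
  "Converts a DNA string to its RNA complement, e.g. \"GCTA\" -> \"CGAU\"."
  [dna-seq]
  ;; Keys are characters because seq-ing a string yields characters.
  (let [complements {\G "C", \C "G", \T "A", \A "U"}]
    ;; A map is also a function of its keys, so it can be passed to map directly.
    (apply str (map complements dna-seq))))

(seq "GC")      ; a seq of the characters \G and \C, not the strings "G" and "C"
(keyword "G")   ; the keyword :G, a different type from both "G" and \G
(to-rna "ACGTGGTCTTAA")
```

With keyword keys every lookup returned nil, and `(apply str ...)` over nils yields the empty string, which matches the failing test output.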

john17:09:49

@irasantiago No prob. You might want to investigate this:

(time 
  (doall 
    (pmap 
      #(let [a (future (Thread/sleep 3000) (inc %) true) 
             b (future (Thread/sleep 4000) (dec %) true)] 
         (and @a @b)) 
      [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19])))

john17:09:23

You'll notice that the whole thing takes only as long as the longest thread: 4+ seconds

vuuvi18:09:52

Is it possible to mock a .csv in a clojure file?

john18:09:45

Depends what you mean. Maybe something like (def my-csv "one,two\nthree, four\n")

noisesmith18:09:18

why not make a csv file, or use one?

noisesmith18:09:41

also, I’d only mock or use a csv file if you are making a new csv reader, if you are using a library that consumes csvs, just mock the data that it returns (that is, make a data literal of the data returned from the library you use)

john18:09:38

Which would usually look something like [["one", "two"], ["three", "four"]]

vuuvi18:09:40

I need to write a unit test for a function that streams in a .csv using clojure.java.io/reader

Alex Miller (Clojure team)18:09:25

There is java.io.StringReader that takes a string and lets you read from it

vuuvi18:09:07

and then it breaks apart the values into a list of maps

dpsutton18:09:59

i'm gonna renew my suggestion to break the logic part of that function that deals with the input into another function and test that. you don't really need to test getting the text out of a csv file

vuuvi18:09:00

the functionality I added that I want to test is skipping the first two whitespaces of the .csv

vuuvi18:09:29

but the two things aren’t really breakapartable afaik

dpsutton18:09:36

(is (= (process "no, spaces") (process " no,spaces")))

vuuvi18:09:53

because to skip the two lines in the .csv I just use the .readLine class

noisesmith18:09:21

you can also use a StringReader to get something that you can read from without needing to do file IO

noisesmith18:09:36

and bonus, the literal string can be right there in the test to make clear what you are testing

vuuvi18:09:39

so I have two sample .csv files that are ready to go I just don’t know that much about unit testing to know the best way to do this

dpsutton18:09:15

i'm guessing somewhere in that file there's a line that goes (with-open-reader [io stuff] (body goes here)) and you just want to break body goes here into a function that deals with the text

dpsutton18:09:33

then you just copy and paste the contents of your sample csv files into your test file

dpsutton18:09:54

or even better, come up with the minimally reproducible version that your test works on

noisesmith18:09:50

@dpsutton my understanding of “to skip lines in the .csv I just use the .readLine class” was that there’s usage of the input stream itself that should be tested. I do 100% agree though, that tests are much better if you separate the IO parts from the data processing, but I also think you can test the input stream handling without file IO by making a stream directly from a string and testing with that

dpsutton18:09:13

that's fine. you can get an inputstream out of a string, right?

noisesmith18:09:46

right, StringReader to get a reader, or some messing with bytes if you really need an InputStream iirc

dpsutton18:09:48

the argument (to the function) can be inputstream but the csv container seems unnecessary for this is my point i guess

noisesmith18:09:57

yeah, agreed

vuuvi19:09:20

@dpsutton you’re saying create a new file that deals with that stuff or simply create a def that has all that information in it?

dpsutton19:09:39

(ns csv-parsing-tests)

(def text-with-spaces
" some csv text here ")

dpsutton19:09:49

just put it in your test file

vuuvi19:09:36

at the start of the file so that I can refer to it in my tests instead of the real mccoy

dpsutton19:09:53

yeah that's what i had in mind

vuuvi19:09:41

so the problem seems to be that I get a file not found exception when I try to pass the text as an arg instead of the file

dpsutton19:09:09

did you change the function to not work on a file but on the text of the file?

john19:09:51

Maybe...

(def text-with-spaces-reader
  (java.io.StringReader.
" some csv text here "))

vuuvi19:09:38

The problem I am running into now with breaking out the logic is I skipped the two lines by using (.readLine rdr) (.readLine rdr)

noisesmith19:09:08

right, @john’s example makes a reader you can use (via the method I suggested earlier)

noisesmith19:09:45

so you can abstract the step that gets the reader from the file, and use the StringReader instead in the test if needed

noisesmith19:09:15

it’s a judgment call how to isolate these things, but the more you can segregate IO flavored things from data processing things the easier and more reliable your testing will get
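A sketch of that separation, under stated assumptions: `parse-rows` and `parse-csv-file` are hypothetical names, and the parsing (skip two lines, split the rest on commas) is a guess at the shape of vuuvi's function rather than the real code. The reader-consuming part takes any `BufferedReader`, so the test can build one from a string and never touch the file system.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str]
         '[clojure.test :refer [deftest is run-tests]])
(import '(java.io BufferedReader StringReader))

(defn parse-rows
  "Hypothetical parser: skips the first two lines of rdr, then splits
  every remaining line on commas."
  [^BufferedReader rdr]
  (.readLine rdr)
  (.readLine rdr)
  ;; mapv is eager, so all lines are read before the reader is closed.
  (mapv #(str/split % #",") (line-seq rdr)))

(defn parse-csv-file
  "Thin IO wrapper: opens the file at path and delegates to parse-rows."
  [path]
  (with-open [rdr (io/reader path)]
    (parse-rows rdr)))

(deftest skips-first-two-lines
  ;; The literal string sits right in the test, with no file on disk.
  (let [rdr (BufferedReader. (StringReader. "junk\nmore junk\na,b\nc,d\n"))]
    (is (= [["a" "b"] ["c" "d"]] (parse-rows rdr)))))

(run-tests)
```

Passing the CSV text itself to `io/reader` would not work, because a string argument is treated as a URL or file path; that is the likely source of the file-not-found exception mentioned earlier.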

noisesmith19:09:29

also it might be time to move all this discussion to #testing

vuuvi19:09:09

okay sounds good

noisesmith19:09:52

wait, won’t that just return a file object that would map to a file by that name from PWD?

noisesmith19:09:14

that is, not reflecting those contents at all

john19:09:09

That would be correct... deleting suggestion 🙂