This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-09-13
Channels
- # aleph (3)
- # aws (1)
- # beginners (97)
- # boot (41)
- # cider (7)
- # clara (105)
- # cljs-dev (4)
- # cljsrn (66)
- # clojure (185)
- # clojure-argentina (2)
- # clojure-colombia (15)
- # clojure-czech (1)
- # clojure-dusseldorf (8)
- # clojure-greece (2)
- # clojure-italy (5)
- # clojure-russia (33)
- # clojure-spec (14)
- # clojure-uk (9)
- # clojurescript (75)
- # cursive (6)
- # data-science (1)
- # datomic (12)
- # emacs (2)
- # fulcro (71)
- # funcool (1)
- # jobs (6)
- # jobs-discuss (62)
- # juxt (21)
- # lein-figwheel (1)
- # luminus (9)
- # lumo (41)
- # off-topic (39)
- # om (12)
- # onyx (1)
- # portkey (2)
- # protorepl (4)
- # re-frame (14)
- # reagent (50)
- # ring (3)
- # shadow-cljs (6)
- # spacemacs (38)
- # specter (8)
- # test-check (14)
- # testing (52)
- # unrepl (2)
@didibus @noisesmith @ghadi @john thank you all for your time and input!!
@noisesmith makes a good point, I should have called out what the goal of this is.
1. I would like each string in the list-of-strings
to be processed in parallel
2. I would like process-a
and process-b
to happen in parallel.
I'm still wrapping my head around the concepts of promises, futures and how they act based on how they're used 🙂
In the (doall (mapcat... example, can someone explain to me what doall's purpose is?
just to add to what @U064X3EF3 said about doall - if you do a deref of each element of a lazy sequence before reading the next item, you don’t actually start the next future until the previous completes. This is because a deref doesn’t complete until the future exits, and the lazy seq doesn’t realize the next item (causing the future to start) until it is accessed.
thanks @noisesmith. so does that mean that in order for it to truly be parallel, it's the derefing that has to happen at the same time? or would trying to parallelize the derefing not truly work because it will still be blocking?
what the doall accomplishes is making sure all the futures are started in parallel (or nearly enough, just enough time to make a new future between the start of each one)
then, when you do the derefs, it should only take as long as the longest future, because even if you wait on one of the ones that take longer, it doesn’t slow down the others
finally, NB - be careful about this sort of code that goes through a collection and starts futures for each item in them, it’s easy to feed it a collection large enough that the whole vm sputters to a near halt just managing the resources and switching between the threads. This is why we have various utilities for making thread pools with fixed size and queues for input (one higher level tool for this that someone already mentioned is pmap)
@noisesmith wow thanks so much! that was a really helpful explanation. the function will never have more than about 10 items so far but i'll definitely keep that in mind if anything changes and it has to scale
also be sure to think about what happens if the function gets called in parallel (multiplying complexity) - we had a lot of bugs caused by this kind of behavior in my app, and because we waited so long to address it we had to spend a long time digging up the roots and fixing it. Also, the symptoms are never localized to the code causing this problem - all you see is that the entire app is sluggish (sometimes) without anything in the error messages that lets you know where the problem is. Or worse yet, no errors at all, just an unusable app because of the performance drag… long story short, it’s a good thing to nip in the bud early for everyone’s sanity’s sake
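As an illustration of that fixed-size-pool idea (this sketch is not from the thread; process-a and bounded-pmap are hypothetical names), Clojure functions are Callable, so java.util.concurrent can do the bounding:

```clojure
(import '(java.util.concurrent Executors))

(defn process-a [s]                     ; hypothetical stand-in for real work
  (str s "-done"))

(defn bounded-pmap
  "Run f over coll on a fixed pool of n threads, returning results in order."
  [n f coll]
  (let [pool (Executors/newFixedThreadPool n)]   ; never more than n threads
    (try
      ;; invokeAll blocks until every task finishes and preserves input order;
      ;; deref works on the java.util.concurrent.Futures it returns
      (mapv deref (.invokeAll pool (mapv (fn [x] #(f x)) coll)))
      (finally (.shutdown pool)))))

(bounded-pmap 4 process-a ["1" "2" "3"])
;; => ["1-done" "2-done" "3-done"]
```

However large the input collection, only n threads run at once, which avoids the resource-thrashing problem described above.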
(pmap
#(let [a (future (process-a %))
b (future (process-b %))]
(and @a @b))
["1" "2" "3"])
This is what you want. What you need to know is that the future starts executing in the background as soon as you call future, but once you deref, it blocks the foreground thread, effectively making it as if it wasn't running in a future. So you want to deref at the very last moment, only when there's nothing else left to do in your code and everything is waiting on the result.
Now you must also understand that parallelization adds overhead. The future takes longer to start than just calling process-a directly, so parallelism often actually makes your code slower. Never do it unless you hit real performance issues, and always benchmark, because it might not help performance.
Also, when you parallelize, you generally want to parallelize as much as possible, and the operations that are the most expensive, aka take longer.
So in your case, you probably want to do this:
(time
(let [futures
(map
#(vector
(future (process-a %)) (future (process-b %)))
(range 100))]
(map (fn [[a b]] (and @a @b)) futures)))
I was on my phone before. Here's a better example.
(time
(dorun (pmap
#(let [a (future (Thread/sleep 1000) %)
b (future (Thread/sleep 1000) %)]
(and @a @b))
(range 100))))
This one is what you were trying to do. Over a range of 100, it takes ~4 seconds on my box.
(time
(let [futures (doall (map
#(vector (future (Thread/sleep 1000) %)
(future (Thread/sleep 1000) %))
(range 1000)))]
(dorun (map (fn [[a b]] (and @a @b)) futures))))
This one, on the other hand, takes only ~1 second even over a range of 1000.
The difference is that the first one processes the list a number-of-cores at a time. Even though all the processes (in my case a 1 second thread sleep) could happen in parallel at the same time, it's not doing it that way. The second example is doing just that: all processes are happening at the same time, so even though you've got a thousand things in the list, they all get processed in one go and "anded" at the end. It's called a map-reduce. The mapping is all parallel, and the reduce is a synchronous reduce over the results. Now if you do:
(time
(let [futures (doall (map
#(vector (future (Thread/sleep 1000) %)
(future (Thread/sleep 1000) %))
(range 100000)))]
(dorun (map (fn [[a b]] (and @a @b)) futures))))
Where the range is now 100 000, you'll encounter what noisesmith was talking about: you're creating so many parallel threads that we run out of memory. future has no bounds, so it's up to you to make sure this doesn't happen. The first example, while slower, would not have this problem, since its parallelism is not dependent on the length of the input collection; it's always just doing number of cores times 2. To safely use the second example, either you need to bound the size of the collection, or you need to make sure that the rate at which futures complete is faster than the rate at which they are created. So here, if I make the process shorter, sleeping only 10ms, we can run it over a 100 000 range.
(time
(let [futures (doall (map
#(vector (future (Thread/sleep 10) %)
(future (Thread/sleep 10) %))
(range 100000)))]
(dorun (map (fn [[a b]] (and @a @b)) futures))))
Which takes ~2 seconds to run on my box. Even a million:
(time
(let [futures (doall (map
#(vector (future (Thread/sleep 10) %)
(future (Thread/sleep 10) %))
(range 1000000)))]
(dorun (map (fn [[a b]] (and @a @b)) futures))))
Runs in only ~17 seconds!
future = background execution on another thread
promise = one-time delivery of a value from one thread to another
both can wait for result/delivery via deref
doall forces realization of a lazy sequence
thanks @U064X3EF3!
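A tiny sketch of that future/promise difference (hypothetical vars, using only core future, promise, and deliver):

```clojure
;; future: starts running its body on another thread immediately
(def f (future (+ 1 2)))

;; promise: an empty slot; one thread delivers, another derefs
(def p (promise))
(future (deliver p 42))     ; deliver from a background thread

;; deref blocks until the value is ready / delivered
@f   ;; => 3
@p   ;; => 42
```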
and I want to test it in a predictable way but I don’t know how / if I should write a fake .csv in my tests file OR if I should use a .csv in a different directory
does it need to be a file? can you just have some representative text in your test file?
It reads it in via clojure.java.io as a raw stream and then parses the contents of the .csv into a series of maps
can you move the guts of that function into another one and test that one? the original one will take a file and call the new function with the input stream from the file?
the function handles all the input streaming but that’s not what I want to test - I want to check the output of the function to make sure that it received the right kind of file and it’s parsing input correctly
Can someone tell me why this function just keeps returning an empty string instead of converting the characters?
(ns rna-transcription)
;; Purpose: convert each DNA nucleotide in a sequence to its
;; RNA complement (GCTA -> CGAU)
(defn to-rna [dna-seq]
  (let [complements {:G "C", :C "G", :T "A", :A "U"}]
    (apply str (map (fn [x] (complements x)) (seq dna-seq)))))
The problem is definitely in the part (complements x)
-- if I replace that with a different operation in the anonymous function, I get behavior I expect
i think you need (map complements dna-seq)
. your call to map there doesn't have enough args
If I replace (complements x)
with say (str x x)
, I get the doubled input string as expected
FAIL in (it-transcribes-all-nucleotides) (rna_transcription_test.clj:6)
expected: (= "UGCACCAGAAUU" (rna-transcription/to-rna "ACGTGGTCTTAA"))
actual: (not (= "UGCACCAGAAUU" ""))
^. when you seq a string you get a sequence of characters, not single character strings
and one other note, when you map over something, map will call seq
on the collection so you can remove your explicit call to that
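Putting those fixes together (character keys instead of keywords, complements passed straight to map, and no explicit seq), a corrected version might look like:

```clojure
(defn to-rna [dna-seq]
  ;; seq-ing a string yields characters, so the map keys must be chars
  (let [complements {\G "C", \C "G", \T "A", \A "U"}]
    ;; a map is itself a function of its keys, so it can be mapped directly
    (apply str (map complements dna-seq))))

(to-rna "ACGTGGTCTTAA")
;; => "UGCACCAGAAUU"
```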
@irasantiago No prob. You might want to investigate this:
(time
(doall
(pmap
#(let [a (future (Thread/sleep 3000) (inc %) true)
b (future (Thread/sleep 4000) (dec %) true)]
(and @a @b))
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19])))
why not make a csv file, or use one?
also, I’d only mock or use a csv file if you are making a new csv reader, if you are using a library that consumes csvs, just mock the data that it returns (that is, make a data literal of the data returned from the library you use)
I need to write a unit test for a function that streams in a .csv using clojure.java.io/reader
There is java.io.StringReader that takes a string and lets you read from it
i'm gonna renew my suggestion to break the logic part of that function that deals with the input into another function and test that. you don't really need to test getting the text out of a csv file
the functionality I added that I want to test is skipping the first two whitespaces of the .csv
you can also use a StringReader to get something that you can read from without needing to do file IO
and bonus, the literal string can be right there in the test to make clear what you are testing
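As a sketch of that approach (parse-csv and the namespace are hypothetical names; it assumes the line-skipping logic has been pulled out into a function that takes a reader):

```clojure
(ns my.csv-test                                  ; hypothetical test namespace
  (:require [clojure.string :as str]
            [clojure.test :refer [deftest is]])
  (:import (java.io BufferedReader StringReader)))

(defn parse-csv
  "Hypothetical extracted logic: skip the first two lines, split the rest."
  [^BufferedReader rdr]
  (.readLine rdr)
  (.readLine rdr)
  (mapv #(str/split % #",") (line-seq rdr)))

(deftest parses-without-file-io
  ;; the literal string stands in for the sample .csv - no file IO needed
  (let [rdr (BufferedReader. (StringReader. "skip me\nme too\na,b\nc,d"))]
    (is (= [["a" "b"] ["c" "d"]] (parse-csv rdr)))))
```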
so I have two sample .csv files that are ready to go I just don’t know that much about unit testing to know the best way to do this
i'm guessing somewhere in that file there's a line that goes (with-open-reader [io stuff] (body goes here))
and you just want to break body goes here
into a function that deals with the text
then you just copy and paste the contents of your sample csv files into your test file
or even better, come up with the minimally reproducible version that your test works on
@dpsutton my understanding of “to skip lines in the .csv I just use the .readLine class” was that there’s usage of the input stream itself that should be tested. I do 100% agree though, that tests are much better if you separate the IO parts from the data processing, but I also think you can test the input stream handling without file IO by making a stream directly from a string and testing with that
right, StringReader to get a reader, or some messing with bytes if you really need an InputStream iirc
the argument (to the function) can be an inputstream but the csv container seems unnecessary for this, is my point i guess
yeah, agreed
@dpsutton you’re saying create a new file that deals with that stuff or simply create a def that has all that information in it?
at the start of the file so that I can refer to it in my tests instead of the real mccoy
so the problem seems to be that I get a file not found exception when I try to pass the text as an arg instead of the file
The problem I am running into now with breaking out the logic is I skipped the two lines by using (.readLine rdr) (.readLine rdr)
right, @john’s example makes a reader you can use (via the method I suggested earlier)
so you can abstract the step that gets the reader from the file, and use the StringReader instead in the test if needed
it’s a judgment call how to isolate these things, but the more you can segregate IO flavored things from data processing things the easier and more reliable your testing will get
also it might be time to move all this discussion to #testing
wait, won’t that just return a file object that would map to a file by that name from PWD?
that is, not reflecting those contents at all