This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2021-07-18
Channels
- # announcements (35)
- # babashka (14)
- # beginners (23)
- # calva (5)
- # cljsrn (3)
- # clojure (154)
- # clojure-europe (12)
- # clojure-losangeles (2)
- # clojure-uk (5)
- # clojurescript (42)
- # conjure (3)
- # cursive (10)
- # datomic (3)
- # emacs (6)
- # events (1)
- # graalvm (1)
- # helix (1)
- # honeysql (1)
- # hyperfiddle (1)
- # jobs-discuss (1)
- # lsp (8)
- # malli (54)
- # meander (1)
- # membrane (1)
- # off-topic (246)
- # polylith (4)
- # practicalli (1)
- # re-frame (14)
- # releases (1)
- # shadow-cljs (21)
- # sql (58)
- # vim (1)
- # vrac (2)
In next.jdbc
is there a way to handle nested jsonb values elegantly? Using rs/as-unqualified-lower-maps
only converts the result set rows to maps with unqualified lower-case keys, but the PGobjects stay:
{:oauth_id "google-someoauthzssz",
:user_id "spaghettimaster98",
:card
#object[org.postgresql.util.PGobject 0x63332fc "{\"type\": \"ContactCard\", \"links\": [{\"url\": \"\", \"type\": \"Link.Instagram\", \"title\": \"jessicawaltz\"}, {\"url\": \"\", \"type\": \"Link.Twitter\", \"title\": \"jessicawaltz\"}, {\"url\": \"\", \"type\": \"Link.LinkedIn\", \"title\": \"cunningham74\"}, {\"url\": \"tel:+15555555555\", \"type\": \"Link.Phone\", \"title\": \"Cell\"}, {\"url\": \"\", \"type\": \"Link.Email\", \"title\": \"\"}], \"style\": {\"colors\": [\"#3F2B96\", \"#A8C0FF\"], \"gradient\": \"Vertical\"}, \"header\": {\"name\": \"Jessica Walker\", \"company\": \"Roku TV\", \"portrait\": \"jessica.jpg\", \"description\": \"...\", \"jobPosition\": \"Chief Financial Officer\", \"companyLocation\": \"Seattle, WA\"}, \"$schema\": \"\", \"version\": \"0.1\"}"],
:inserted_at #inst "2021-03-01T07:24:59.000000000-00:00",
:updated_at #inst "2021-03-01T07:24:59.000000000-00:00"}
Scratch that, I found the tips guide: https://cljdoc.org/d/seancorfield/next.jdbc/1.2.659/doc/getting-started/tips-tricks
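For anyone landing here later, the approach from that tips guide looks roughly like this: extend next.jdbc's ReadableColumn protocol so json/jsonb PGobject columns are parsed into Clojure data automatically. This is a hedged sketch: it assumes cheshire for JSON parsing and the PostgreSQL JDBC driver on the classpath, and the helper name pgobject->clj is just illustrative.

```clojure
(require '[cheshire.core :as json]
         '[next.jdbc.result-set :as rs])
(import '(org.postgresql.util PGobject))

(defn pgobject->clj
  "Parse a json/jsonb PGobject into Clojure data; pass others through."
  [^PGobject v]
  (if (#{"json" "jsonb"} (.getType v))
    (json/parse-string (.getValue v) true) ; true => keywordize keys
    v))

;; Hook into next.jdbc's column reading so result-set maps come back
;; with nested Clojure data instead of raw PGobjects.
(extend-protocol rs/ReadableColumn
  PGobject
  (read-column-by-label [v _] (pgobject->clj v))
  (read-column-by-index [v _ _] (pgobject->clj v)))
```

With this in place, the :card value from the example above would come back as a nested map instead of a PGobject.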
Also, #sql exists where folks will be more likely to answer your next.jdbc
/PostgreSQL/JSON column questions.
How can I make sure (and do I need to?) that the head of a lazy sequence passed to a function isn't being held on to? It roughly does something like:
(defn f
[xs]
(g (first xs))
(h (rest xs)))
(defn h
[xs]
(doseq [x xs] (foo x)))
It’s not usually an issue.
In your first example, the g
function gets the value from the head of the seq, but not the seq itself, so it’s OK. The h
function uses doseq
and the doc for that function explicitly says:
> Does not retain the head of the sequence.
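To make that answer concrete, here's a minimal runnable sketch of the shape from the question; the per-element work is replaced by a stand-in counter, which is an illustrative substitution.

```clojure
(def seen (atom 0))

(defn h [xs]
  (doseq [x xs]          ; doseq does not retain the head of the seq
    (swap! seen inc)))   ; stand-in for the real per-element work

(defn f [xs]
  (let [head (first xs)] ; realizes only the first element
    (h (rest xs))        ; locals clearing drops `xs` after its last use
    head))

;; Walks a million elements in constant memory:
(f (range 1000000)) ;; => 0
```

Because neither `first` nor `doseq` pins the whole sequence, the garbage collector can reclaim elements as they are consumed.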
What do A and I refer to in AFn, APersistentVector
and IFn, IPersistentCollection,
which serve as base classes/interfaces for Clojure data structures?
how do you configure the thread pools clojure uses for futures and go blocks and similar?
(I’ve previously wanted to configure them, but upon reflection, decided it wasn’t necessary.)
The point of goblocks is that they’re lighter-weight than threads. If you devolve into spinning up a thread-per-goblock, you’ve tossed all the gains.
Not necessarily. The nature of goblocks is that they should be parked a lot of the time.
If that’s not true, you probably want to toss some CPU intensive bit or I/O intensive bit in a thread
But of course, there are limits to that theory. I would assume that most of the time you don't want your goblock pool to exceed the number of cores on the machine.
There are def reasons to up the threadpool. I’m saying that setting it to 100 on a 16 core machine is almost certainly a mistake.
If that’s the situation you’re in, you are literally better off using thread
everywhere instead of go
Meaning: “I have a code base that used go
blocks poorly. How can I fix it now?” => Okay yeah, up the threadpool 😄
What I've done before is pushing CPU-intensive work into a thread
(from within a go block) and parking the go block until the thread is done.
I'm not sure if that is commonly done, but it felt to me like it's the way to go, as go blocks should be parked most of the time and hard work done outside of the go block thread pool
@U6JS7B99S That’s exactly the pattern.
core.async expects you, the dev, to be vigilant about cpu/io and push it into thread
blocks
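That pattern can be sketched in a few lines (assuming core.async is on the classpath; `process` and the summing work are stand-ins for real code):

```clojure
(require '[clojure.core.async :as a])

;; A go block delegates CPU-heavy work to `a/thread` (a real thread,
;; outside the go dispatch pool) and parks on the result channel with
;; <!, freeing the dispatch thread for other channel work meanwhile.
(defn process [n]
  (a/go
    (let [result (a/<! (a/thread
                         ;; stand-in for CPU-intensive work
                         (reduce + (range n))))]
      result)))

;; Blocking take at the edge of the system:
(a/<!! (process 1000)) ;; => 499500
```

The go block parks at the `<!` until the worker thread delivers its result on the channel returned by `a/thread`.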
(But to Cora’s point, if you have a big ol’ codebase where that didn’t happen, either you wander the codebase fixing it, or you do something drastic.)
I guess that's one of the reasons why https://github.com/clj-commons/manifold is a thing 😅 (the mashup)
@U8QBZBHGD set-agent-executor-threadpool!
there is also this https://github.com/TheClimateCorporation/claypoole
biased opinion of the young, but I think using futures and blocking operations is a better investment than go blocks atm
Going to share a post that I consider a "must read" if you're doing core.async work. In my opinion it's usually a sign that you're misusing go block threads if you need to increase the number above 8. Any blocking work that is being throttled by only having 8 go block threads is really work that should live on separate dedicated threads in the first place. https://eli.thegreenplace.net/2017/clojure-concurrency-and-blocking-with-coreasync/
agree, looking forward to loom 🙂
The threads behind GO blocks are meant for parallel computation, so it always sets them to the number of cores-ish that your computer has. At least I think, I might have assumed that
I think that's correct. There's no advantage to spreading CPU bound work onto more threads(ish) than number of CPUs. So that's one scenario that makes sense to change core.async's dispatch pool size. It's interesting to me, however, that it defaults exactly to 8 and not to ncpu+2 like the other pools in clojure.core for cpu bound work. Anyways, my primary point was sometimes people end up doing blocking work on those threads and then looking to that setting as an escape hatch for a misconceived async system.
The way I came to view go blocks is as a "thread for async work", which in core.async's context is taking and putting on channels. They just let me write code which looks synchronous so I won't have to lose my mind in callback hell. My rule of thumb, and my advice to any new Clojure devs I happen to mentor at work, is to avoid (read: minimize) CPU in go blocks
To me, they seem specifically designed for concurrent computation: they use cooperative multitasking, so it's up to you to decide when to yield a large computation. I can see that if you need some responsiveness, you don't want people not realizing their computation can choke other processes, but to actually make things faster and parallel they work great too
I guess you could argue, for any large computation, the overhead of a real thread won't matter, so maybe it is a good practice what you say so you avoid ever stalling
This is slightly tangential, but I highly recommend you read this overview: https://webtide.com/do-looms-claims-stack-up-part-1/ https://webtide.com/do-looms-claims-stack-up-part-2/ Virtual threads aren't free, either. The abstraction is less "leaky" than go blocks, but there's always a price.
Go blocks are not for cooperative multitasking. They are for Communicating Sequential Processes. The way they're designed (with a global thread pool), you want to use them for the communicating property. They have no runtime that can tell them "hey, suspend this task and go do something else" like virtual threads or goroutines. They can only "release" themselves when they park. If you don't let them park, you choke the thread pool.
Another way to view it: CPU-bound tasks are no different from I/O-bound tasks—they both dominate a thread. The core.async threadpool is primarily meant for processes to wait for communications. (i.e. What Ben said.)
i think the gist with cpu bound work is that "it doesn't really matter what thread it's on, it's still going to impact everything about the same" whereas i/o bound DOES matter what thread it's on because it can "get in the way" of other work that could still be done while the machine is waiting for the i/o bound stuff to complete. Moving cpu bound work off of the go-block dispatch threads won't really change anything because the dispatch threads will still be competing with the other cpu bound worker threads for cpu time. But I don't see harm in separating them and keeping the original purpose of the dispatch threads clear.
It does matter because they are managed by the operating system which can schedule different threads
so even if your CPU bound threads are heating up the CPU, at some point you can let in your go-pool, shuffle data around between channels, then go back to doing CPU
So it makes lots of sense separating them, because only one concurrency abstraction can give you the synchronous facade over essentially async operations (blocking on channels is actually callbacks when using >!
)
yeah that's fair. probably that is most important in a system that uses core.async to orchestrate a mix of cpu and blocking tasks? i guess i'm imagining if you had a core.async system whose sole purpose was to do computation you wouldn't benefit a lot from separating them
Let dedicated threads worry about crunching data, go blocks worry about shuffling data around. It leaks, but best case scenario is when the go blocks pool isn't busy doing things you don't want it to
thanks, i didn't know :compute pipelines spawned dedicated threads
Using a small, dedicated threadpool for small bits here and there scales like whoa. However, if you mess it up, you hose the whole system for everyone.
Hum, I think you've convinced me. I didn't really realize that core.async doesn't actually have fibers, and has nothing similar to Go's Gosched to force yielding a process, and also doesn't yield on loop, or on system calls, IO, etc. That does make it quite different from Go and Erlang in that way, so I can see, since it only yields on >!
and <!
that it really does act simply as a callback rewriting scheme. So GO is for async, THREAD is for blocking or compute
And in ClojureScript you just don't run heavy compute :rolling_on_the_floor_laughing: Though actually, I checked, and it looks like some people use generators to force yielding of long computations. I think you could do the same in ClojureScript and just use a chan as a way to yield throughout your computation, like so:
(do
(let [yield (a/chan)]
(a/go-loop []
(when (a/>! yield 1)
(recur)))
(a/go-loop [i 10]
(a/<! yield)
(print i)
(if (pos? i)
(recur (dec i))
(a/close! yield))))
(let [yield (a/chan)]
(a/go-loop []
(when (a/>! yield 1)
(recur)))
(a/go-loop [i -10]
(a/<! yield)
(print i)
(if (neg? i)
(recur (inc i))
(a/close! yield)))))
With that pattern you get proper cooperative multitasking, and can choose when to yield inside your compute
So you are correct in that it depends. But that behooves us to find what it depends on
I came to exactly the opposite conclusion from our discussion 😂 The go block pool is too big!
I thought the +2 came from experimentation; the OS won't always perfectly allocate all threads to cores, so having a little extra probably helps get a turn on the CPU
So, maybe we're missing another macro, like go-async
. Does pipeline share the same compute pool? Or each call to it makes a new pool?
It would do what go
currently does. So then go
could be for compute, go-async
to check on channels for values or put values on them.
Where go
has num-of-cores threads, and go-async has like 4 fixed threads (or the 8 we currently have)
And thread
would be used for blocking IO as it is now. I believe it already uses an unbounded cached pool?
spawn thread
in an unbounded manner for blocking IO, pipeline
for compute, go
for async
Ya, I guess, but pipeline for compute has the issue that your compute must be well contained. What if it's spread across a lot of places, but adds up?
I think it also all depends on whether you want to optimize for throughput or latency, no? If you want to optimize for throughput, it seems better to do compute inside GO blocks, and only move blocking IO to threads. That way you don't get the overhead of context switching all the time to check on async IO or new requests, etc. If you want to optimize for latency, then move your compute to pipeline or pipeline-async, and keep GO only for coordinating. With blocking IO still done by thread or pipeline-blocking
I didn't try it, but I feel, logically, it would prioritize compute while making sure all cores are doing work. Unlike with threads, context switching between them is cheaper (or so I assume). And because they don't yield in the middle of their compute, there's less overhead. But this is all me trying to reason about performance, which I know is hard. I'm thinking like: put all requests in an input channel. Take only as many requests from it as you have cores. Process them in a GO block; when blocking IO is needed as part of the processing, send the blocking IO to a thread, and have the go block park on the result. Make sure you configure the GO pool to the number of cores you have.
So process only as many things as you have cores at a time, except if you need to block, park that, and start another request in the meantime.
Now you can selectively treat some requests: if they are edge cases that require some super long compute, and you'd rather in that case give others a chance, well, for those you can send them to a thread as well, or use a pipeline.
Let's work through this logically. We are considering the options go blocks vs threads for compute. Your claim is that go blocks will be better because they will achieve higher throughput. Why? Cheaper context switch.
• Go blocks represent logical processes multiplexed over a real thread pool. So you can have TWO kinds of context switches - logical and OS level
• Logical switches only happen when you park. If you occupy the thread with CPU, they will not happen
• OS level switches happen just like for regular threads
So, if you're doing compute, you're not benefitting from go blocks being "virtual" threads, and you're keeping correctly written code from doing so as well, by blocking a finite pool. Leaving the question: is compute faster in a go block or a thread? Even considered in isolation, computation is faster in a thread, because go blocks rewrite your code to a state machine and threads don't. They add overhead all on their own.
> OS level switches happen just like for regular threads I think this is what logically I would assume it would happen less often, since there won't be as many "other threads" needing to be scheduled, I'd assume each current thread would be allowed to run for longer before switching
But, I'm thinking this is where logic probably fails us 😛, with all my assumptions. Would need to try it and benchmark I guess
No, logic is sufficient. Go blocks yield only when blocking on channels. You have plenty of OS and process threads floating about anyway. Context switching adds overhead when you have thousands of threads, not 16 vs 24
The reason OS level switches in go blocks happen just like in threads is because they are running in real threads, in the end. They're just "suspended" execution which is picked up by a real thread in a thread pool. If that thread gets suspended, same situation
Maybe what it comes down to is a tradeoff between not hurting other parts of the app by starving message-passing and maximal performance for compute
@U6JS7B99S your analysis is correct in my opinion, which is why I concluded: • if the app has lots of compute, no need to put it on the go pool • if the app has lots of message passing, doing compute on the go pool will starve it Therefore, leave the go pool to message passing
It won't starve it, it will delay it until one of the Go blocks parks. But I'm talking about a throughput-optimized case. So yes, it will possibly take longer for a new request to be started, but it will be faster for each request to complete once it does. I don't see what you mean that it is logical because Go blocks yield only on a channel? I know that, but that's actually why I think they could be more optimal, because they will only yield when they are truly waiting on something else. (Though I get your point that the underlying thread might be yielded by the OS, so whether this happens less often or not is what I can't reason about)
I'm curious, how do you set things up then? Do you have one thread queue incoming requests and then you have a pipeline over that? Or do you instead allocate N threads, with an arbitrarily tuned N, for some N requests? And then you create more threads for each blocking IO? And only use GO for callbacks on that blocking IO? How do you coordinate the result to the request with that? Do you block the request thread using <!!?
Let's take a very simple example, read from Kafka, deserialize JSON, serialize back, write back to Kafka. How would you handle it?
I'd have one input channel buffered to num of cores, I'd have one thread read from Kafka and >!!
on the channel. I'd have num of cores go blocks <!
from the input channel that deserialize the JSON, transforms it however we want, serialize it back to JSON, and >!
on an output channel with a large buffer n, sized by how much I could buffer in-memory before running out. I'd have another thread <!!
on the output chan and write back to Kafka.
Possibly I'd add a few reader threads from Kafka or writer threads to Kafka in case I see that my GO are waiting on them a lot.
And I'd set the thread pool of GO to num of cores as well
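That topology could be sketched like this. The Kafka consumer/producer are replaced by plain functions (read-msg!, transform, write-msg!) since no client API is shown here, so those names are stand-ins; this is a hedged sketch, not a production design.

```clojure
(require '[clojure.core.async :as a])

(defn start-topology
  "One real reader thread -> num-of-cores go workers -> one real
  writer thread. `read-msg!` returns nil when exhausted."
  [read-msg! transform write-msg!]
  (let [n   (.availableProcessors (Runtime/getRuntime))
        in  (a/chan n)      ; input buffered to num of cores
        out (a/chan 1024)]  ; large output buffer
    ;; one real thread feeding the input channel (blocking put is fine here)
    (a/thread
      (loop []
        (if-let [msg (read-msg!)]
          (do (a/>!! in msg) (recur))
          (a/close! in))))
    ;; num-of-cores go blocks doing the ser/de + transform
    (dotimes [_ n]
      (a/go-loop []
        (when-let [msg (a/<! in)]
          (a/>! out (transform msg))
          (recur))))
    ;; one real thread draining the output channel
    ;; (out stays open: in the real service this runs indefinitely)
    (a/thread
      (loop []
        (when-let [msg (a/<!! out)]
          (write-msg! msg)
          (recur))))))
```

The blocking Kafka calls live on real threads at the edges, while the go blocks in the middle only shuffle and transform data between channels.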
Well, in this case not much. But if you added a DB query in the middle, deserialize JSON, query db, serialize back. Now your threads would block on IO. While they are blocked you'd want to process other messages to go faster.
So with using GO, you spawn a thread to do the DB query and the go block then <!
on the thread result chan. That will yield the GO thread underneath so you can reuse it to start processing another msg from Kafka's input chan
But then if you do CPU on the go pool you might find you can't "release" the queries fast enough, no?
I find that the more things "happen" in the program the more I want to move data between channels. If I want to move data efficiently between channels, my only option is go blocks. So I want those threads running around moving data between the threads which will block and do the actual work
(assume 8 cores) Well, I guess it depends what fast enough is, like 8 go are processing 8 msg at first, one of them now waits for IO, so it yields and a new GO picks up another message, now 8 GO are processing and 1 is waiting, maybe the query is done now, and ya, the result won't be processed immediately, it'll be processed only the next time one of the 8 GO either >! on the output chan, <! on the input or they <! on another thread blocking IO
So the query result will wait until one of the active GO are done "computing" whatever they were currently processing. But this isn't "wasted" work, so you're not waiting idle. But the particular msg whose query result it is will take a bit longer to complete.
You can simulate it pretty easily - take malli
, use it to generate interesting data, do the ser/de regularly, instead of a query just Thread/sleep
I might try to mess with that tonight, now I'm curious. So how would you have done it? So I could compare?
Regular threads I can do it. With pipeline I'm actually a bit confused how to do it. Like where would you plug the blocking IO in the middle?
And by regular threads, you mean just swap the GO blocks for an a/thread, correct? Cause I can also go Executor and ThreadPool, but now that has a whole set of options to it as well, and it's not really using core.async at all
although you'll notice pipeline-blocking and pipeline have the same implementation under the hood (for now)
Ya, so what's weird is that the way pipeline works is almost the design I described: it queues jobs in a channel, then processes N at a time, either on the GO thread pool or on another N threads, and puts the results on a result chan.
And I guess where things can get weird with what I described is if that's not all you are doing in your service. Like if you process Kafka msgs, but somewhere else you also process incoming Tomcat requests, and somewhere else you process events from a GUI, etc. If you did that, the management of the GO blocks could get messy, so I can see having pipeline :compute spawn threads just to avoid that, so if you use multiple pipelines at the same time they don't get weird with each other.
But... I think there is one difference: it seems pipeline cannot go faster when blocking. Maybe I'm not fully getting the implementation, but it seems like it will never go beyond N concurrency. What I'm thinking is, you spawn another GO once one of them parks on a blocking IO. That's how you get to go faster
Clearly there were learnings from the core team in moving compute off to threads, as it used to run in go blocks as well; see: https://github.com/clojure/core.async/commit/3429e3e1f1d49403bf9608b36dbd6715ffe4dd4f So my guess is you might be right, or at least, maybe not in terms of absolute performance (that's TBD still), but in terms of not shooting yourself in the foot, which could ruin your performance much more easily, it's probably still a best practice for core.async
> as it used to run in go blocks as well I know, I was subtly nudging you to find it previously, did you notice?
Lol, well I already knew about the change haha, but I always attributed it to people using GO blocks not knowing what they're doing and ending up doing blocking IO even in a "compute" task
Which I guess makes sense. Like, probably spawning threads doesn't degrade performance at all, or much at all, and then if you mistakenly do some blocking IO, or have some infinite thread in there, you don't hurt your latencies in doing so. But at the same time, it raises the question: why use GO at all? You can just use a/thread all the time and >!! and <!! You said GO is the only efficient way to communicate, can you speak more to that? Is (go (<!)) faster than just <!!?
When you use go blocks to shuffle data between channels, they can be busy only moving stuff around while threads just do work. That way you can multiplex more work onto the pool
go blocks have more chances to get more work done before the OS tells the thread it's nap time
I'm not sure I follow the whole multiplexing? Most advice I saw says to actually avoid GO for that, and to use put!, take! and poll! instead as they have less overhead.
oof, that's tucked away: https://github.com/clojure/core.async/blob/edc3e16c034106f06e861ffbf91ba0ea87107208/src/main/clojure/clojure/core/async/impl/exec/threadpool.clj#L17
If you would like to set it dynamically instead of passing the system property to the JVM, you can try (System/setProperty "clojure.core.async.pool-size" "<num>") before any of the core.async functions are called, since there are delays protecting evaluation of this system property's value until the first use of a core.async facility.
Hum, I guess I was wrong. That's weird: since both pmap and agents use the same processors + 2, I'd have thought core.async would do the same.
Interesting that it has this: https://github.com/clojure/core.async/blob/edc3e16c034106f06e861ffbf91ba0ea87107208/src/main/clojure/clojure/core/async/impl/concurrent.clj#L38 but it is never used anywhere
Interesting, this is the commit that ended up making it default to 8: https://github.com/clojure/core.async/commit/a690c4f3b7bf9ae9e7bdc899c030955d5933042d#diff-df2b18760355fb977cc2720a5b3fece009ba26aec07a04e1b09537a1bb32fd90 It's a rare instance of a contribution from someone who didn't seem to be involved with core or core.async that got in. I don't know, I feel like it'd be good to change it back to number of processors + 2. At first it looked like they were hoping to make it big or growing, so if people blocked in a GO it wouldn't choke; then it looks like they decided that, whatever, people shouldn't do that. But a default of 8 is kind of strange. Or maybe at least make it the max of number of processors + 2 OR 8
Using a magic number in this case does seem like an arbitrary choice. There should have been a comment about it if there was some empirical evidence for that choice.