
Saw a tweet the other day that an ARM chip has an instruction to make JavaScript numbers work a bit faster. I thought I remembered that x86 had some optimizations to make C compilers emit better code. Does this ring a bell with anyone?


yes, I remember it has an awful acronym


@dpsutton there are a few really wacky string instructions built into x86


Honestly I was hoping you would respond @tbaldridge


I think I've heard this in discussions of Lisp machines and why they failed. It was pointed out that CPUs were optimized to make C and ALGOL-family languages fast, and if they had instead optimized for Lisp things would have been different.


They're really strange, but they let you do a strcmp in about 2 assembly instructions.
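As an editorial sketch of what's being described: the `REPE CMPSB` string instruction compares two buffers in essentially one instruction plus flag reads. This is a hypothetical illustration using GCC/Clang inline asm, with a portable fallback off x86-64 (the `n == 0` case is left to the caller, since `REPE` with a zero count leaves flags untouched):

```c
#include <stddef.h>
#include <string.h>

/* Compare n bytes; returns -1, 0, or 1 like a sign-normalized memcmp.
 * On x86-64 this is basically REPE CMPSB plus two flag reads. */
int cmp_repe(const void *a, const void *b, size_t n) {
#if defined(__x86_64__)
    unsigned char below, above;
    __asm__ volatile(
        "repe cmpsb\n\t"   /* compare bytes while equal, rcx times */
        "setb %0\n\t"      /* was [rsi] < [rdi] at the mismatch?    */
        "seta %1"          /* was [rsi] > [rdi] at the mismatch?    */
        : "=q"(below), "=q"(above), "+S"(a), "+D"(b), "+c"(n)
        :
        : "cc", "memory");
    return (int)above - (int)below;
#else
    /* portable fallback so the sketch builds on non-x86 machines */
    int r = memcmp(a, b, n);
    return (r > 0) - (r < 0);
#endif
}
```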


So yeah, on that subject. I've also wondered how possible it would be to get good performance out of modern CPUs if we dumped the C calling conventions and did something different.


Never heard of that one. I guess that's what @mpenet was talking about?


Garbage collection is a hard problem that doesn't get much easier with hardware support, and perhaps hardware support isn't needed, as Azul's GCs are pretty dang fast, and don't require hardware support (although they're even faster with specialized hardware I'm told).


Other things like TCO are hard to implement in a way that is compatible with existing calling conventions, but if the whole system uses something that's TCO friendly, perhaps that wouldn't be a problem?
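One TCO-friendly scheme that works without any hardware or calling-convention support is a trampoline; here's an editorial sketch in C (names hypothetical): instead of a tail call, which would grow the stack, each step returns a description of the next call and a driver loop runs it in constant stack space.

```c
#include <stddef.h>

/* A thunk describes "the next call to make" instead of making it. */
typedef struct thunk {
    struct thunk (*fn)(long long n, long long acc, long long *out);
    long long n, acc;
    int done;
} thunk;

/* sum 1..n, written with the recursive call in tail position */
static thunk step(long long n, long long acc, long long *out) {
    if (n == 0) {
        *out = acc;
        return (thunk){NULL, 0, 0, 1};
    }
    /* instead of `return step(n - 1, acc + n, out);` */
    return (thunk){step, n - 1, acc + n, 0};
}

long long trampoline(long long n) {
    long long out = 0;
    thunk t = step(n, 0, &out);
    while (!t.done)            /* constant stack, however deep the "recursion" */
        t = t.fn(t.n, t.acc, &out);
    return out;
}
```

With a depth of 100,000 a naive recursive version would risk blowing the C stack; the trampoline doesn't.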


Do you know if the JVM has special instructions for garbage collection? Maybe analogous to something a real CPU could offer?


No, the only hooks the JVM offers are diagnostic hooks.


For a GC you need a few things:
1) A way to register roots (in Clojure this would be namespaces)
2) A way to scan the call stack and find pointers
3) A way to pause your program during allocation
4) A way to pause your program at a semi-regular interval


So the general idea of a simplistic GC is:
1) Try to allocate memory; if there is none left,
2) Pause the current thread, then pause all threads
3) Mark all pointers that can be reached from the roots or the call stacks of the paused threads
4) Assume everything else is garbage and mark it as freed
5) Resume all threads
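The steps above are the classic stop-the-world mark-and-sweep. An editorial sketch on a toy heap of fixed-size objects (the pausing steps are elided since this is single-threaded, and the `roots` array stands in for both roots and stack references):

```c
#include <stddef.h>

#define HEAP_SIZE 8

typedef struct obj {
    struct obj *a, *b;  /* outgoing references */
    int marked;
    int live;           /* allocated and not yet swept */
} obj;

static obj heap[HEAP_SIZE];

/* step 3: mark everything reachable */
static void mark(obj *o) {
    if (!o || o->marked) return;
    o->marked = 1;
    mark(o->a);
    mark(o->b);
}

/* step 4: anything live but unmarked is garbage */
static int sweep(void) {
    int freed = 0;
    for (int i = 0; i < HEAP_SIZE; i++) {
        if (heap[i].live && !heap[i].marked) { heap[i].live = 0; freed++; }
        heap[i].marked = 0;   /* reset for the next cycle */
    }
    return freed;
}

/* steps 1-2 and 5 (pausing/resuming threads) are elided in this toy */
int collect(obj **roots, int nroots) {
    for (int i = 0; i < nroots; i++) mark(roots[i]);
    return sweep();
}
```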


So it's not so much that you need support from the JVM as that you have to write your program in a way that stays within the JVM's constraints. The JVM doesn't allow you to mess with the call stack, and that's a pain, but you get accurate GC. You can't access raw pointers to objects on the JVM, but that means you don't have to mark every pointer in your classes.


There is one fancy hack that the JVM uses though


That bit about "pause all threads" is really hard to implement correctly, since you don't want the other threads burning time checking some lock every so often. It's a balance: perform the check often enough that you get quick GC pauses, but not so often that it slows down your program.


So whenever there's a back-jump (recur in Clojure), the JVM's JIT emits a read from a specific page of memory. Normally that page is marked readable and you only pay the cost of a cached read (really fast). But when a GC is needed the JVM marks that page as protected, which invalidates the relevant TLB entries. The next time a thread tries to read from that page the CPU raises a fault (a SEGFAULT, for accessing protected memory), and the JVM catches it and pauses the thread.
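That polling-page trick can be sketched in userland C on Linux/POSIX. This is an editorial illustration, not how HotSpot is actually written: `safepoint_poll()` is the read the JIT would emit, `safepoint_arm()` revokes read access so the next poll faults, and the SIGSEGV handler stands in for "pause this thread for the GC". (Calling `mprotect` from a signal handler is not strictly async-signal-safe; fine for a demo, not production.)

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static char *poll_page;
static size_t page_size;
static volatile sig_atomic_t paused_for_gc;

static void on_segv(int sig) {
    (void)sig;
    paused_for_gc = 1;                          /* "pause" the thread */
    mprotect(poll_page, page_size, PROT_READ);  /* disarm; the retried load succeeds */
}

void safepoint_init(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    poll_page = mmap(NULL, page_size, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    signal(SIGSEGV, on_segv);
}

void safepoint_arm(void) {                      /* GC wants the world stopped */
    mprotect(poll_page, page_size, PROT_NONE);
}

void safepoint_poll(void) {                     /* what the JIT emits at back-jumps */
    *(volatile char *)poll_page;
}
```

While the safepoint is unarmed, the poll is just a read that hits cache; only when it's armed does the thread take the (expensive, but rare) fault path.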


It's a really cool hack, and I'm not aware of any other VMs that use that approach.


So in theory could you write code that would not obey the request to halt for a GC?


Yes, that's quite possible: if you were to sit in a busy loop that never read from that protected page, your thread would spin forever and block GCs from happening.


But the JVM is really smart about that, and I don't think it's possible to get into that state without using native interop. And even then there might be protections in JNI around that.


And just offhand, do you happen to know what info is in that page of memory? I'm assuming it's not just wasted or garbage.


I don't, but pages are pretty small, and all threads most likely share the same page


On the subject of Lisp machines though, the main problem is that x86 is just "really" fast. It's hard to beat 40 years of optimization. If your problem is really different there's room for a new processor (GPUs did that), but for relatively normal code, an x86 processor is going to be light-years ahead of anything produced from scratch.


So you won't see this extra load in the JVM byte code, I guess, only in JIT-produced native machine code? Neat. I recall Cliff Click mentioning in a talk on the Azul JVM that one thing they found, unrelated to GC, was that lots of Java code does current time in milliseconds calls so often that they created an optimized implementation for it where the value was stored in RAM in its own page of memory, and threads would load it from there when they made the call -- I forget if they had a dedicated thread for updating that value in memory, or whether the callers somehow did it when needed.
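The cached-clock idea described there can be sketched as follows. This is an editorial guess at the shape of it, not Azul's actual implementation: a dedicated updater thread refreshes a shared timestamp, and hot-path callers do a single load instead of a system call.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/time.h>
#include <unistd.h>

static _Atomic int64_t cached_millis;
static _Atomic int clock_running = 1;

/* the "slow" path: an actual syscall-backed wall-clock read */
static int64_t now_millis(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

/* dedicated thread refreshing the shared value about every millisecond */
static void *clock_updater(void *arg) {
    (void)arg;
    while (atomic_load(&clock_running)) {
        atomic_store(&cached_millis, now_millis());
        usleep(1000);
    }
    return NULL;
}

/* what the hot path calls: a plain load, no syscall */
static int64_t current_time_millis(void) {
    return atomic_load(&cached_millis);
}
```

The trade-off is obvious: callers get a stale-by-a-millisecond value, which for most "how long did this take, roughly" uses of currentTimeMillis is fine.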


For highly branchy code on general-purpose CPUs, AMD and Intel engineers have definitely been busy on the problem for decades.


Yeah, I remember Cliff Click being a bit pissed that they lost a few awesome features when moving away from custom hardware.


> It's hard to beat 40 years of optimization.
I think this was the argument I had seen. I guess C and the von Neumann architecture are just aligned?


Lisp emits code that is just "different" from what C does?


Not directly; it's more that Lisp code, even Clojure, tends to be really branchy.


or did lisp machines not run on x86


oh no, the old Lisp machines ran on custom hardware


Some companies like Symbolics had their own hardware engineers designing their own CPUs.


And back in that day it was sometimes faster to reboot than run the GC


befuddled branch predictor. band name, called it


I work at Cisco Systems, not designing CPUs, but have done some hardware design -- it is a lot of effort to go above somewhere around 1.5 GHz nowadays. It boggles my mind the effort Intel/AMD must go to, to get working 3GHz designs.


So if you had ideas on making a processor that was better for Lisp, you would have to figure out ways that it was better by at least a factor of 2 per clock cycle, if not more, because you would be giving up that clock speed advantage (or hire a team of engineers as big as Intel does for one chip).


And that's what NVidia/AMD do for GPUs. They sacrifice branch performance in order to get 4000 cores on a single die.


what's the tradeoff between branch performance and core count?


General-purpose CPUs often devote significant chip area and power to branch-prediction logic, to guess correctly which way a branch will go far more often than flipping a coin would.


bigger cores means fewer of them on the same chip


Exactly, and GPUs (at least NVidia's) go a step further. 32 cores run in lock-step. If a branch occurs and all 32 cores take the same side, great. But if there's a split, then half of the cores are paused while one side runs, then those pause while the GPU runs the other side up to the merge point.
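That lock-step behavior can be modeled in plain C as an editorial sketch (a toy SIMT model, not real GPU code): every lane conceptually runs both sides of the branch, and a per-lane active mask decides which lanes commit results on each side.

```c
#define WARP 32

/* Toy SIMT model: compute abs() per lane. Both branch sides are
 * "executed"; the mask decides which lanes commit on each side. */
void warp_abs(const int in[WARP], int out[WARP]) {
    int taken[WARP];                      /* mask: lanes taking the branch */
    for (int lane = 0; lane < WARP; lane++)
        taken[lane] = in[lane] < 0;

    /* "then" side: only masked-in lanes commit */
    for (int lane = 0; lane < WARP; lane++)
        if (taken[lane]) out[lane] = -in[lane];

    /* "else" side: the remaining lanes commit. Whenever the warp
     * diverges, the hardware pays for both passes. */
    for (int lane = 0; lane < WARP; lane++)
        if (!taken[lane]) out[lane] = in[lane];
}
```

If every lane takes the same side, one of the two passes is a no-op, which is why uniform branches are cheap on GPUs and divergent ones aren't.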


sorry, that is the qualitative difference -- not sure if you were looking for actual numbers (which I don't have handy)


i didn't know branch prediction took a significant amount of silicon real estate


This is known as a "warp", and it works just fine for AI or graphics computing. But it was a key reason why raytracing performance is kind of crappy on GPUs, and why NVidia's new cards add logic to perform ray tests in hardware.


Thanks for the mention of the 9900K at 5 GHz -- I hadn't seen that announcement yet. It makes sense that the article mentions that it was tested on many high end games. Gaming is definitely an area where $ and power consumption are less of an issue than data centers.


although I'm sure the big compute buyers like Google/Amazon/etc. test everything that comes out looking for whatever new thing has an edge.


Heh, yeah. I've been watching that space as I want to build a new system. Sadly I may have to save a bit more as my hardware lust seems at odds with what the market is at right now.


I can't find it now, but IIRC about 70% of the non-cache transistors in a CPU are devoted to branch prediction and out-of-order execution.


wow. had no idea


If your software's machine code is highly branchy, as a lot of code often turns out to be the way we write it, you get very low instructions-per-cycle if you don't do good enough branch prediction.


I haven't followed up on whatever software folks are doing to avoid Spectre attacks, but that is one reason why it is so scary -- to avoid Spectre 100%, you have to forgo a lot of the performance-gain techniques that exist in modern CPUs.


or you just run on a physically separate CPU core or box entirely.


And that's why Spectre exists on almost every CPU created in the past 10 years (or more): if it does speculative execution, it probably has Spectre.


It boggles my mind that the 6502 was hand-drawn


Ivan Godard did an interesting talk at Strange Loop about Spectre (and why the Mill isn't affected by it)


@alexmiller nice, I'd love to hear more about that. Mill was pretty cool when I looked into it about 5 years ago; haven't heard much since.


Going to put that talk on my pile of "Watch later" videos while I am thinking about it ...


is it on the net somewhere?


I guess technically I may have to wait for the recent Strange Loop talk videos to be published first 🙂


Videos should be out soonish


A hearty "thank you!" to you and the people who make Strange Loop run, from all us freeloaders out here 🙂


It's one of my favorite events each year. Time to settle in with a nice hot cup of coffee and some intriguing talks.