
@mfikes: so I got it to link to a debug JavaScriptCore (I think) with the help of one of the webkit devs (whose bug was actually responsible for the WTFCrash I pasted above and which was fixed in the last 2 days).


[so, my hacked together libtool-joined mega-.o file approach was giving the same result as dynamic linking-- I was just hitting a bug in webkit]


anyway, now that their bug is fixed, this is happening:

Jonathans-MacBook-Pro:planck-c jonathan$ ./planck 
Planck 2.0
ClojureScript 1.9.89
    Docs: (doc function-name-here)
          (find-doc "part-of-name-here")
  Source: (source function-name-here)
    Exit: Control+D or :cljs/quit or exit or quit
 Results: Stored in vars *1, *2, *3, an exception in *e

cljs.user=> (require '[])
Could not require cljs.source-map.base64-vlq
Maximum call stack size exceeded.


any idea on that?


[oh, there was one other tip the webkit dev gave me which is crucial to making this work: you need to set an env var DYLD_FRAMEWORK_PATH to point to the path where your debug JavaScriptCore.framework is located before starting the executable]


[essentially, since this involves dynamic linking, you can do the compile/link time stuff with either copy of the framework, and that env var above determines which one will actually be used at runtime]
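A minimal sketch of that tip as a tiny C launcher. The helper names `set_framework_path` and `exec_with_framework_path` are hypothetical, and so is any directory you pass in; the only things taken from the discussion are the variable name `DYLD_FRAMEWORK_PATH` and the requirement that it be set before the executable starts.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: point the dynamic loader at the directory that
 * contains the debug JavaScriptCore.framework.
 * Returns 0 on success, -1 on failure. */
int set_framework_path(const char *framework_dir) {
    assert(framework_dir != NULL);
    if (strlen(framework_dir) == 0) {
        return -1; /* an empty path would not select any framework */
    }
    /* Third argument 1 = overwrite any value that is already set. */
    return setenv("DYLD_FRAMEWORK_PATH", framework_dir, 1);
}

/* Hypothetical helper: set the variable, then replace this process with
 * the target program so the loader sees the variable at startup.
 * Only returns (with -1) if something failed. */
int exec_with_framework_path(const char *framework_dir, char *const argv[]) {
    if (set_framework_path(framework_dir) != 0) {
        return -1;
    }
    execv(argv[0], argv);
    return -1; /* execv only returns on error */
}
```

In a shell the same effect comes from putting the variable assignment in front of the `./planck` command; the launcher form just makes the ordering (set first, then start) explicit.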


[my best guess on the `Could not require cljs.source-map.base64-vlq` / `Maximum call stack size exceeded` error is that the behavior of JavaScriptCore has changed and we (planck or CLJS) were depending on specific now-obsolete behavior. I have no idea whether that was a bug fixed, an interface/spec/behavior change, or a bug introduced, but it seems good that we are hitting this now rather than later when JSC is next released.]


And, btw, that error happens regardless of which command I enter first into the REPL (or even whether I enter a command at all).


Oooh, but you can continue from it-- it's just a slightly annoying artifact which causes your REPL to hang while it is thinking about it.


And here's the stack trace for my original bug with symbols!!!

cljs.user=> ( "ls" :dir "/Users/jonathan/Documents" :env {"blah" "blaz"} (fn [_] 9))
==63655==ERROR: AddressSanitizer: SEGV on unknown address 0x003100000011 (pc 0x00010e879a9c bp 0x700000093ac0 sp 0x700000093ac0 T20)
    #0 0x10e879a9b in WTF::StringImpl::bufferOwnership() const StringImpl.h:854
    #1 0x10ecd56c8 in WTF::StringImpl::requiresCopy() const StringImpl.h:797
    #2 0x10ecd5226 in WTF::StringImpl::isolatedCopy() const StringImpl.h:1117
    #3 0x10f7bd746 in WTF::String::isolatedCopy() const & WTFString.cpp:684
    #4 0x10f3f934f in OpaqueJSString::string() const OpaqueJSString.cpp:61
    #5 0x10f1c0647 in JSEvaluateScript JSBase.cpp:65
    #6 0x10e27881b in wait_for_child shell.c:135
    #7 0x10e278904 in thread_proc shell.c:144
    #8 0x7fff9dbc499c in _pthread_body (libsystem_pthread.dylib+0x399c)
    #9 0x7fff9dbc4919 in _pthread_start (libsystem_pthread.dylib+0x3919)
    #10 0x7fff9dbc2350 in thread_start (libsystem_pthread.dylib+0x1350)


The problem was that I was [in essence] casting JSStringRef to JSValueRef and back again (due to c_string_to_value returning a JSValueRef and JSEvaluateScript accepting a JSStringRef). The fix was to call JSStringCreateWithUTF8CString directly instead of c_string_to_value.
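A sketch of that kind of handle mix-up, using stand-in types rather than the real JavaScriptCore headers. Everything prefixed `demo_` is hypothetical and exists only for illustration; the kind tag is an assumption added so the mistake can be detected instead of crashing the way the real one did.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in types for illustration only; NOT the JavaScriptCore definitions.
 * Each starts with a kind tag so a mix-up can be detected. */
enum demo_kind { DEMO_STRING = 1, DEMO_VALUE = 2 };

typedef struct demo_string { enum demo_kind kind; char *chars; } demo_string;
typedef struct demo_value  { enum demo_kind kind; demo_string *str; } demo_value;

/* Plays the role of JSStringCreateWithUTF8CString: makes a string handle. */
demo_string *demo_string_create(const char *utf8) {
    demo_string *s = malloc(sizeof *s);
    assert(s != NULL);
    s->kind = DEMO_STRING;
    s->chars = malloc(strlen(utf8) + 1);
    assert(s->chars != NULL);
    strcpy(s->chars, utf8);
    return s;
}

/* Plays the role of c_string_to_value: wraps the text in a *value* handle. */
demo_value *demo_c_string_to_value(const char *utf8) {
    demo_value *v = malloc(sizeof *v);
    assert(v != NULL);
    v->kind = DEMO_VALUE;
    v->str = demo_string_create(utf8);
    return v;
}

/* Reads the kind tag; a pointer to a struct may be converted to a pointer
 * to its first member, so this is valid for both stand-in types. */
static enum demo_kind kind_of(const void *handle) {
    return *(const enum demo_kind *)handle;
}

/* Plays the role of JSEvaluateScript, which expects a string handle.
 * Returns the script length, or -1 when handed the wrong kind of handle. */
int demo_evaluate(const demo_string *script) {
    if (kind_of(script) != DEMO_STRING) {
        return -1;
    }
    return (int)strlen(script->chars);
}
```

Passing the result of `demo_c_string_to_value` to `demo_evaluate` only compiles with a cast, and the tag check then rejects it; passing the result of `demo_string_create` works. The real API has no such tag, which is why the original mistake showed up as a SEGV deep inside `StringImpl`.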


@mfikes: So, I'm afraid your JSGlobalContextCreateInGroup fix isn't doing what we want here-- it seems that it creates a new context beside the original one in the same group (whatever that means). Unfortunately the vars that I'm putting the callbacks into on the CLJS side aren't copied into that new context and only exist in the original one.


Hmm, I may be able to manually copy over the vars in question from the original context into the new one. I will try that next.


@johanatan: Wow. Interesting stuff. I haven’t had time to look into any of it recently 😞


@mfikes: you wouldn't happen to know where we obtain our original JSContext from would you?


the webkit devs said that as long as we created it ourselves and didn't get it from a WebView (webkit context) then it should be fine to hit it from a bkg thread


[however, it's clearly crashing for us from bkg thread (repeatably) so I'm starting to suspect that we might have a webkit/webview ctx]


but i'd imagine that a REPL doesn't need the DOM/view stuff etc so that sounds strange


mm, that definitely doesn't look like a WebView-provided one


so, i did notice a bit of heisenbug-ness about this issue (i.e., using the original ctx from the bkg thread) and that is: sometimes the crash happens during the first call on the context (JSValueMakeNumber) and sometimes on the second call to the ctx (c_string_to_value) [both inside result_to_object_ref]


and given the other heisen-stuff I was seeing before it leads me to think there's something unstable in the underlying global setup of planck itself (i.e., not in my shell.c code [which doesn't run until you actually issue a sh or sh-async command]).


@johanatan: I wonder if we can repro what you are seeing with Planck master...


@johanatan: You are saying that you see it by simply requiring the namespace?


well there's two heisen-issues


1) the sh-async crash with the main thread's ctx happens on different operations on the ctx [not always the first, but seems to be always one of the first two]


2) [I would have to scroll back through Slack's history to find the other one I mentioned, but it either happened on planck startup itself or on requiring; I don't remember which]


[I hope we have that much Slack history lol]


here's the original heisenbug


so, yes, looks like that one was on require


Yeah. Planck master doesn’t do that. Perhaps there is something interesting in the ClojureScript code in your branch.


If you can, you can also do script/build and then build/Release/planck to see if 1.x crashes in the same way.


Well, the thing is I haven't seen that particular heisenbug in a few days. It started out life with extremely low frequency, then started appearing somewhat regularly, then went away completely. I'm sure it's probably connected to some other factor.


But the existence of both of these occurrences of non-determinism is a bit troubling (one of which continues to this day).


@mfikes: here's an interesting finding: the ctx at process_line is different from the one at my function_shellexec.


[`function_shellexec` is called as a callback and there's a lot of JavaScriptCore stack frames in between those two frames on the stack]


[also with a debugger attached to the main thread, the bkg thread can continue successfully with the context provided-- i.e., the issue here is a race (which also explains the non-determinism I observed around exactly which operation on the context would fail)].


so, it looks like there is some sort of 'local' context that is provided to each of the hooked functions and that it doesn't survive as long as we thought it did


[as you can see, this stuff is proving to be vastly easier to figure out with symbols :)]


WOO! it works!


I violated encapsulation a bit to pull it off


but basically referring to the global_ctx (defined in repl.c) from shell.c directly instead of using the one passed in works.


the problem is that JSObjectCallAsFunction creates a local context


and copies into it all of the things from the outer context


Good sleuthing!


and when our main thread exits, that context goes away
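A rough sketch of that lifetime problem with pthreads and a stand-in context type (not the JavaScriptCore one). All `demo_` names, `callback_ctx`, `invoke_callback`, and `run_in_background` are hypothetical; only `global_ctx` and `thread_proc` echo names from the discussion. The real failure was a race, so the sketch forces the unlucky ordering (context retired before the thread runs) to keep the outcome deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for a JS context; `alive` models whether it may still be used. */
typedef struct demo_ctx { int alive; } demo_ctx;

/* Long-lived context, analogous to the global_ctx defined in repl.c. */
static demo_ctx global_ctx = { 1 };

/* Call-scoped context handed to a native callback. It has static storage
 * here only so the sketch can inspect it safely after it is retired. */
static demo_ctx callback_ctx = { 0 };

/* Work handed to the background thread. */
typedef struct demo_job { demo_ctx *ctx; int ok; } demo_job;

static void *thread_proc(void *arg) {
    demo_job *job = arg;
    job->ok = job->ctx->alive; /* record whether the context was usable */
    return NULL;
}

/* Models the runtime invoking a native callback: the passed-in context is
 * valid only while the callback runs. The callback stores either that
 * context or the global one for later use on another thread. */
demo_job invoke_callback(int use_global) {
    demo_job job;
    callback_ctx.alive = 1;  /* valid for the duration of the call */
    job.ctx = use_global ? &global_ctx : &callback_ctx;
    job.ok = 0;
    callback_ctx.alive = 0;  /* callback returned: context retired */
    return job;
}

/* Runs the job on a background thread. Returns 1 if the context was still
 * usable, 0 if it was not, -1 if the thread could not be run. */
int run_in_background(demo_job *job) {
    pthread_t tid;
    if (pthread_create(&tid, NULL, thread_proc, job) != 0) {
        return -1;
    }
    if (pthread_join(tid, NULL) != 0) {
        return -1;
    }
    return job->ok;
}
```

With `invoke_callback(0)` the background thread finds a retired context; with `invoke_callback(1)` it uses the long-lived one and succeeds, which mirrors the workaround of referring to `global_ctx` from `shell.c`.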