2024-02-26
@neumann It’s crazy — over the last couple of weeks, I have been writing an ever-growing monster function that is like a mirror of what you described in your latest series:
My goal: I often take screenshots of YouTube videos or podcasts when I hear something interesting. The problem is that those screenshots just go into my phone, and then die. I try to remember what was being said, so I can do something about it (e.g., write about it, take notes, etc.). But it’s too much work to re-establish the content.
I want to take those screenshots and turn them into links to the actual video/podcast, get a transcript and know exactly what was being said. That way, I can take an action on it.
The majority of the work is being done in one giant function, with a long let block.
I just spent the last 30m studying the piece that does the majority of the work, trying to figure out how to add a step in the middle, and it’s no longer debatable that it’s time to change it from the big “let block” pattern. It takes too long to run (45 seconds), I can’t start in the middle, and I can’t easily modify it anymore. I am jumping in right now to rewrite it to match the pattern you described in the last series.
The steps:
1. Retrieve a photo from xtdb
2. Resize it and turn it into a stream
3. Send a first prompt to local llava (yuck) or OpenAI (so much better — once again, I discovered that if you value your own time, this is the route to go)
4. Do a second run through GPT-4 to analyze the responses and turn them into an EDN map
5. Add the new photo-summary to xtdb
I need to add a step before #4 to do manual analysis of the image, and to compute the YouTube progress bar position myself for images taken on an iPhone.
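For context, the shape of that big let block is roughly this (a sketch only; the helper names below are stand-ins, not the real ones from the gist):
;; Sketch only -- helper names are hypothetical stand-ins for the real ones in the gist.
(defn process-photo [db photo-id]
  (let [photo        (fetch-photo db photo-id)               ;; 1. retrieve from xtdb
        image-stream (resize->stream photo)                  ;; 2. resize, turn into a stream
        pass1-prose  (prompt-vision-llm image-stream)        ;; 3. first prompt: llava or OpenAI
        summary-edn  (prose->edn pass1-prose)                ;; 4. second run: prose -> EDN map
        _            (store-photo-summary! db photo-id summary-edn)] ;; 5. write back to xtdb
    summary-edn))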
Problems so far:
• Last night, I found that 45 seconds is too long to iterate quickly — the first 30 seconds is the first LLM run, which remained unchanged. I could have reused the same result over and over, skipping those steps.
• An exception is escaping from somewhere (some LLM run not returning correct JSON) and wreaking havoc — I need to write tests to isolate and contain this. Not easy right now.
• Right now, the workflow is run primarily from a Fulcro client, with state of its own — the result is janky, unreliable machinery that gets out of sync with this server code.
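One way out of the slow loop would be to cache the expensive pass-1 result at the REPL and iterate only on the later steps. A sketch, with hypothetical fn names (run-pass1 and pass2->edn are stand-ins):
;; Sketch: cache the ~30s pass-1 result once, then iterate only on the cheaper downstream steps.
;; run-pass1 and pass2->edn are hypothetical names.
(defonce pass1-cache (atom {}))

(defn pass1-cached [db photo-id]
  (or (get @pass1-cache photo-id)
      (let [result (run-pass1 db photo-id)]   ;; the slow LLM call
        (swap! pass1-cache assoc photo-id result)
        result)))

(comment
  ;; re-run just the later steps against the cached result
  (-> (pass1-cached db photo-id)
      (pass2->edn)))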
For historical purposes, here is the ugly code for your sympathy and ridicule: https://gist.github.com/realgenekim/17f9a7ae48aaf2e03df3cc80326a5094
Wish me luck! 🙂 I’ll keep you posted. (I’m even recording this right now, so you can see the process it underwent. 🙂)



How are you using llava? I've found Google's Vision API for OCR to be unreasonably effective when feeding into LLMs.
Here's some example code: https://github.com/phronmophobic/slackcommands/blob/7c6db9a8775a5559a3173ddbdb0d033c0e317954/src/slackcommands/ai/vision.clj#L266
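Roughly, the OCR step boils down to one images:annotate request. A sketch using clj-http and cheshire (not the code from the link above; api-key and b64-image are assumed inputs):
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

;; Sketch of a TEXT_DETECTION request against the Cloud Vision REST endpoint;
;; api-key and b64-image are assumed to come from the caller.
(defn ocr-image [api-key b64-image]
  (-> (http/post (str "https://vision.googleapis.com/v1/images:annotate?key=" api-key)
                 {:content-type :json
                  :body (json/generate-string
                          {:requests [{:image    {:content b64-image}
                                       :features [{:type "TEXT_DETECTION"}]}]})})
      :body
      (json/parse-string true)
      (get-in [:responses 0 :fullTextAnnotation :text])))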
I found llava to be amazing, but super unreliable — it was sort of fun learning to make it work using ollama, but when I got a bit frustrated with the poor results after a couple of days, I went back to GPT-4-vision, and was startled at how good it was. Will write more soon, and always amazed at your work!
PS: I’m 3 steps into the rewrite, and am amazed at how much cleaner it is already!!!!
@U6VPZS1EK I look forward to seeing the rewrite!
I haven't used llava that much, but I've found it mostly not that useful. Would be interested to hear what it's good at.
I'll probably update llama.clj with llava support at some point since I think llama.cpp already supports it.
OMG, I’m so ridiculously happy. Here’s what it generated! I was able to add a step that detected whether the image matched the dimensions of an iPhone, and counted the red pixels below the YouTube player window (something I spent 2 hours trying to get gpt-4-vision to do). It was amazing to be able to work on each of these steps in isolation from each other.
Here are some images:
• the YouTube video screenshot being analyzed
• the summary of it being put into the database
• the bag of data being displayed in Portal
Here’s the rewritten version: it’s not what I expected it to be — I didn’t want to deal with having to write function names, so I just kept it in a big long function. But I was amazed at how easy it was to change the steps. I even introduced a new one, to write the 2nd LLM prompt as a pure function, pulling it out of the 2nd LLM stage. It made the process of writing and understanding the code 100x easier and so much more fun! THANK YOU, @neumann and @U0510902N!
The revised code is in a comment in the gist: https://gist.github.com/realgenekim/17f9a7ae48aaf2e03df3cc80326a5094?permalink_comment_id=4943675#gistcomment-4943675
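The pixel-counting step itself is plain Java interop; roughly this idea (the scan row and color thresholds below are illustrative guesses, not the real values):
(import '(javax.imageio ImageIO)
        '(java.io File))

;; Sketch: estimate progress by counting "red enough" pixels along one horizontal
;; scan line just below the video area. scan-y and the thresholds are illustrative.
(defn progress-percentage [path scan-y]
  (let [img  (ImageIO/read (File. path))
        w    (.getWidth img)
        red? (fn [x]
               (let [rgb (.getRGB img x scan-y)
                     r   (bit-and (bit-shift-right rgb 16) 0xFF)
                     g   (bit-and (bit-shift-right rgb 8) 0xFF)
                     b   (bit-and rgb 0xFF)]
                 (and (> r 180) (< g 80) (< b 80))))]
    (double (* 100 (/ (count (filter red? (range w))) w)))))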
• This big bag of “vars” inside the bag: I thought it would be confusing to have so many of them, but it wasn’t a problem. (I kept thinking I should group them by {:pass1 {:db _ :photo-id _}, :pass2 {:photo-b64-string _ …} …}, but I didn’t want the overhead of having to number and renumber them. So, big ol’ flat map it is!)
• At the bottom is the comment block manually saving the states and passing them to the next step. It was super easy to test, revise, and run/re-run.
• I have yet to make the wrapper, which will just recur until the state is done.
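To picture it, that comment block looks roughly like this (key and fn names are illustrative, not the exact ones in the gist):
;; Illustrative shape of the flat "bag" plus the REPL comment block.
(comment
  (def bag-1 (load-photo-from-db {:db db :photo-id photo-id}))
  (def bag-2 (url->b64 bag-1))                 ;; adds :photo-b64-string
  (def bag-3 (prompt-vision-llm-pass1 bag-2))  ;; adds :pass1-summary
  ;; inspect any intermediate bag, then re-run only the step that changed
  (keys bag-3))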
This is just night and day compared to before. Thank you!
@U7RJTCH6J My initial reaction to llava was “amazing!” But I was constantly surprised by how poor the quality of the interpretations was, and how toxic some of its confabulations/hallucinations were — my reaction was often, “my goodness, what have you been trained on!” 😆
In contrast, see the GPT4-vision interpretation here (the picture with the arrow in it): https://clojurians.slack.com/archives/CKKPVDX53/p1708999673659119?thread_ts=1708991492.816989&cid=CKKPVDX53 I am just amazed at how many relevant details it can pull out. Aside from the iPhone YouTube mobile app, it can reliably pull out the current playtime from almost any screenshot I give it.
I can’t wait to try running your code — or better yet, wanna get together sometime in the next week or so? I’d love to get a demo of what you’ve built. This looks incredible!
I'm also pretty impressed that chatgpt can extract the approximate percent played. I'll have to try it out sometime.
Check this one out — gpt4-vision extracted the current play position with no help from me. The prompt was as follows (I will take out the stuff asking it to estimate the red progress bar — I just counted red pixels instead for iPhone portrait-mode pics, which it couldn’t handle):
> You are a helpful AI assistant. I am trying to remember what was said in a video or podcast, so I can write an article about it.
>
> Describe the contents of the provided image in detail. The image is a paused video on a screen or a screenshot. Mention the apparent video player interface, the progress bar status, any visible text or titles, the individual visible in the video frame, and any discernible actions or expressions. Also, note any visible tabs or browser elements that could indicate the context or content of the video. Provide a clear and coherent description of all elements without speculating about the content beyond what is visible.
>
> I want to know whether this is a YouTube video or a podcast. I want you to extract all visible text.
>
>
> Right below the YouTube video window, which is at the very top of the screen, you will find the episode name (e.g., podcast name), with the channel name right below it.
>
> If the current playback time is not available, estimate the percentage viewed instead by looking at the length of the red progress indicator positioned at the bottom of the main video window in relation to the full width of the screen. Assume the full red progress bar represents a 100% completion. Measure or make a visual estimation of the red part of the progress bar. Then, divide this length by the total width of the screen to obtain a rough percentage of the video completion. Provide this percentage as the progress indicator value.
>
> Do not use code interpreter -- only analyze the images.
>
> Analyze the screenshot and extract all the relevant information, and put it into
> a Clojure EDN map. Here is an example:
>
>
> {:date-of-photo "2023-12-30"
> :podcast-app "Overcast, screenshot of mobile interface"
> :time "5:08"
> :podcast-episode "Episode 161 — 'There I was!': Real Life Stories from the Cockpit with the Mitchell Institute Part II"
> :podcast-name "THE AEROSPACE ADVANTAGE"
> :length "58:06"
> :current-playback-time "34:11"
> :percentage-viewed "92%, based on the red bar progress indicator"
> :name-of-podcast-app "Apple Podcasts"
> :notes "12/30 was not extracted from the screenshot, but was added manually; and any other comments go here. Man with green shirt standing in front of a presentation. The interface suggests that the video is possibly part of a series or playlist, as the text 'Mix - PapersWeLove' is visible, indicating more content from this channel is available"
> :all-extracted-text "Comments are turned off."}
>
> Do not use code interpreter -- only analyze the images.
I had the first LLM pass use gpt-4-vision, which returns prose — it doesn’t support function calling yet. It takes less than 15s to complete.
The second pass was GPT-4 using function calling to extract a nice JSON map. (I think I tried gpt-3.5-turbo, and it worked most of the time. I can’t remember why I switched to gpt-4; I’ll try that again, since my pixel counting now works.) It takes less than 10s to complete.

Interesting. I've found the vision tools that do OCR to be pretty effective and most other vision tools to be kinda mediocre.
> I just counted red pixels instead for iPhone portrait mode pics, which it couldn’t handle.
Does that mean the LLMs couldn't count the red pixels? I would be surprised if they could.
Late to the party and stuff, so sorry about that.
Looking at your doit fn, I’d deffo split that thing into a fn for each of the case statements, be that a multimethod or just normal fns:
(defn load-photo-from-db [...]...)
(defn url->b64 [...] ...)
;; etc
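Or, going the multimethod route, roughly this (a sketch; the dispatch key and state names are guesses, reusing the fn names above):
;; Sketch of the multimethod flavour: dispatch each step on the bag's
;; "next state" key. The key name and state keywords here are guesses.
(defmulti run-step :next-state)

(defmethod run-step :load-photo [bag]
  (-> bag
      (assoc :photo (load-photo-from-db bag))
      (assoc :next-state :url->b64)))

(defmethod run-step :url->b64 [bag]
  (-> bag
      (assoc :photo-b64-string (url->b64 bag))
      (assoc :next-state :prompt-pass1)))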
Also, being slightly obsessed with monoids, I’d suggest defining your empty-state (empty-bag) somewhere, and perhaps add a spec for that bag. IME, these bags tend to grow over time if you don’t take care:
(def empty-bag {:photo-b64-string "" ;; or should it be nil?
                ...})
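A spec for that bag could be as small as this (a sketch; the key set is illustrative, not exhaustive):
(require '[clojure.spec.alpha :as s])

;; Sketch: spec the bag's known keys so its growth over time stays visible.
(s/def ::photo-b64-string (s/nilable string?))
(s/def ::pass1-summary (s/nilable string?))
(s/def ::bag (s/keys :opt-un [::photo-b64-string ::pass1-summary]))

(s/valid? ::bag empty-bag) ;; => true, once empty-bag is a real map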
Also, due to my obsession with monoids, I’d prefer merge over assoc with a bunch of keys, as merge can potentially be an operator in a monoid, and works better in a reduction system.
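i.e. something like this (a sketch with made-up keys):
;; assoc with a bunch of keys...
(assoc bag :pass1-summary summary :pass1-ms elapsed)

;; ...vs merging in a map, which composes nicely:
(merge bag {:pass1-summary summary :pass1-ms elapsed})

;; merge with {} as the identity behaves like a monoid, so a sequence of
;; step results can simply be folded together:
(reduce merge {} [{:photo-id 1} {:photo-b64-string "..."} {:pass1-summary "..."}])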
I see also that, e.g. in your url->b64 case, your comment is lying, i.e. you say that it’s a thing from url -> string, but in fact the case statement returns a new bag. I would separate those two things.
In fact, the comments you have before each case statement are perfect doc-strings for the fns I suggested you write earlier in this message :)

You will most likely also see that a bunch of these fns (if you choose to extract them) are nice, pure fns, and just a few of them are side-effecting.
pass2-summary-to-edn could also reasonably be split into several fns.
One which creates
{:summary (-> pass1-summary :summary)
:prompt pass2-prompt
:is-screenshot? is-screenshot?
:youtube-percentage youtube-percentage}
Another, which sends this to gpt.
A third which knows how to figure out if it’s a success or not.

I’ve somewhat come to my end now with what I see as fruitful refactors, but I must admit, there is more to be done.
I think it’s fair to point out that your interaction with the LLMs could be modeled more strongly. Like, the prompt and prompt-photo fns seem quite similar, and I’d be curious if we couldn’t have just a prompt data structure to describe the different prompts we’d need.
So rather than specializing on what kind of prompt you’re sending, I’d have fns like:
(defn ->photo-prompt [....] ...)       ;; returns data that describes a photo prompt
(defn ->screenshot-prompt [....] ...)  ;; returns data that describes a screenshot prompt
and then
(gpt/prompt (->photo-prompt ...))
and
(gpt/prompt (->screenshot-prompt ...))
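Fleshed out just a little, that could look like this (a sketch; the keys and the model name are illustrative):
;; Sketch only: prompts as plain data, with one generic gpt/prompt entry point.
(defn ->photo-prompt [photo-b64 instructions]
  {:model  "gpt-4-vision-preview"
   :images [photo-b64]
   :text   instructions})

(defn ->screenshot-prompt [photo-b64]
  (->photo-prompt photo-b64 "Describe this screenshot and extract all visible text."))

;; gpt/prompt then only needs to know how to turn that map into an API call.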
I’d then also perhaps introduce a data structure like:
{:prompt :summary :success}
Which I’ve almost done, to record the interaction with the LLM.

@U04V5VAUN I’m in complete awe of what you did — thank you so much! I suspect I’ll be studying and replicating this form of code for the next decade. I will be copying this into my codebase later today, but here are some reactions (more reflections coming later).
When I was reading your note, I was extremely skeptical that even the first refactoring would be worth it — as @U0609QE83SS noted in another thread, pulling the code out of the case statements runs the risk of now having to look in two places instead of one, decreasing coherence, and increasing the risk of the pieces no longer fitting together in the end. But wow, there is no doubt that your version is much, much better than mine, on so many dimensions: readability, testability, and getting rid of the stupid “next state” in the case state machine is just 🔥
And the explicit shift to a monoid pattern is also 🔥 — I wrote about how I loved @U050P0ACR’s use of monoids and “algebraic thinking” here: https://clojurians.slack.com/archives/CKKPVDX53/p1700849580265389
I wish I had been able to watch you refactor this code, because I have a potentially silly-sounding question, which I think is subconsciously deterring me from doing what you did:
• Wasn’t it a lot of typing? It sounds laughable, but the idea of having to give each one of those tiny functions a name and type it all out could seem like a lot of work!
I’d love to know your thought process and response to that question! (Rest assured, I have no doubt that it’s worth it — since I started this thread, I’ve gone on to create 5+ more of these “monster let block” functions, which I now must struggle with. I think seeing your code pattern gives me something that I can build or refactor toward much earlier!)
Thank you again! I am in awe of what you did!
This was actually not a lot of work. As for the fn-names, you already had them spelled out in the case statement. Basically, the three/four comments are the iterations that I took. I am a heavy user of paredit, so that makes stuff so much simpler.
Wow, I watched that whole 5.5m video — thank you so much for posting it! (I’m so glad you recorded it!!)
The notion that “refactoring would take a LOT of typing” is clearly false. And the increase in clarity is stunning.
I also love the fact that the arguments are now explicit — instead of a “bag of values in a map,” they’re explicitly defined, which makes understanding the function so much easier.
In 5.5 minutes, you refactored two functions — so, what a huge payback for 2.5m of work per function!
I also LOVE that you turned the doit function into this:
(defn doit [db photo-id]
  (-> (xtp/photo-xtdb-fresh-url-uuid db photo-id)
      (url->b64)
      (write-to-file "tmp/decoded.jpg")
      (prompt-vision-llm-pass1 default-prompt)
      (store-summary :pass1)
      (analyze-screenshot "tmp/decoded.jpg")
      (pass2-generate-prompt)
      (pass2-summary-to-edn)
      (write-to-database db photo-id)))
I love everything you did, but this might be my favorite part — that you’ve coupled the steps together so they can be read from top to bottom, but also allowed each step to be run independently (i.e., I could execute everything except the side-effecting function that stores to the database).
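e.g., at the REPL I can stop the pipeline anywhere and skip the database write (a sketch, reusing the fn names above):
;; Sketch: run everything except the final side-effecting write, then persist later.
(comment
  (def dry-run
    (-> (xtp/photo-xtdb-fresh-url-uuid db photo-id)
        (url->b64)
        (write-to-file "tmp/decoded.jpg")
        (prompt-vision-llm-pass1 default-prompt)
        (store-summary :pass1)
        (analyze-screenshot "tmp/decoded.jpg")
        (pass2-generate-prompt)
        (pass2-summary-to-edn)))   ;; everything except (write-to-database db photo-id)

  ;; inspect dry-run in Portal, then persist when happy:
  (write-to-database dry-run db photo-id))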
Thank you again — this is so lovely!

You make me blush 🙂 Two things:
One, I think you’ve picked up on most of the things I was looking to achieve. There are still some things to be done, but I think we’ve gotten to at least an 80% improvement.
Two, I could never have written the thing that you wrote, as in I would never be able to realize that I store a bunch of screenshots of talks on my phone (which I used to do), and that there is a way to solve the problem with LLMs and whatnot. And even if I had realized that it was a solvable problem, I would never have gotten around to even trying to solve it, because there is just too much icky stuff to deal with. So well done!