2024-02-26
@neumann It’s crazy — over the last couple of weeks, I have been writing an ever-growing monster function that is like a mirror of what you described in your latest series:
My goal: I often take screenshots of YouTube videos or podcasts when I hear something interesting. The problem is that those screenshots just go into my phone, and then die. I try to remember what was being said, so I can do something about it (e.g., write about it, take notes, etc.). But it’s too much work to re-establish the content.
I want to take those screenshots and turn them into links to the actual video/podcast, get a transcript and know exactly what was being said. That way, I can take an action on it.
The majority of the work is being done in one giant function, with a long let block.
I just spent the last 30m studying the piece that does the majority of the work, trying to figure out how to add a step in the middle, and it’s no longer debatable that it’s time to change it from the big “let block” pattern. It takes too long to run (45 seconds), I can’t start in the middle, and I can’t easily modify it anymore. I am jumping in right now to rewrite it to match the pattern you described in the last series.
The steps:
1. Retrieve a photo from xtdb
2. Resize it and turn it into a stream
3. Send a first prompt to local llava (yuck) or OpenAI (so much better — once again, I discovered that if you value your own time, this is the route to go)
4. Do a second run through GPT-4 to analyze the responses and turn them into an EDN map
5. Add the new photo-summary to xtdb
I need to add a step before #4 to do manual analysis of the image, and to compute the YouTube progress bar position myself for images taken on an iPhone.
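For context, the shape of that big let block is roughly this (a sketch only; the helper names below are stand-ins, not the real ones from the gist):
;; Sketch only -- helper names are hypothetical stand-ins for the real ones in the gist.
(defn process-photo [db photo-id]
  (let [photo        (fetch-photo db photo-id)               ;; 1. retrieve from xtdb
        image-stream (resize->stream photo)                  ;; 2. resize, turn into a stream
        pass1-prose  (prompt-vision-llm image-stream)        ;; 3. first prompt: llava or OpenAI
        summary-edn  (prose->edn pass1-prose)                ;; 4. second run: prose -> EDN map
        _            (store-photo-summary! db photo-id summary-edn)] ;; 5. write back to xtdb
    summary-edn))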
Problems so far:
• Last night, I found that 45 seconds is too long to iterate quickly — the first 30 seconds is the first LLM run, which remained unchanged. I could have reused the same result over and over, skipping those steps.
• An exception is escaping from somewhere (some LLM run not returning correct JSON) and wreaking havoc — I need to write tests to isolate and contain this. Not easy right now.
• Right now, the workflow is run primarily from a Fulcro client, with state of its own — the result is janky, unreliable machinery that gets out of sync with this server code.
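One way out of the slow loop would be to cache the expensive pass-1 result at the REPL and iterate only on the later steps. A sketch, with hypothetical fn names (run-pass1 and pass2->edn are stand-ins):
;; Sketch: cache the ~30s pass-1 result once, then iterate only on the cheaper downstream steps.
;; run-pass1 and pass2->edn are hypothetical names.
(defonce pass1-cache (atom {}))

(defn pass1-cached [db photo-id]
  (or (get @pass1-cache photo-id)
      (let [result (run-pass1 db photo-id)]   ;; the slow LLM call
        (swap! pass1-cache assoc photo-id result)
        result)))

(comment
  ;; re-run just the later steps against the cached result
  (-> (pass1-cached db photo-id)
      (pass2->edn)))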
For historical purposes, here is the ugly code for your sympathy and ridicule: https://gist.github.com/realgenekim/17f9a7ae48aaf2e03df3cc80326a5094
Wish me luck! 🙂 I’ll keep you posted. (I’m even recording this right now, so you can see the process it underwent. 🙂)



How are you using llava? I've found Google's Vision API for OCR to be unreasonably effective when feeding into LLMs.
Here's some example code: https://github.com/phronmophobic/slackcommands/blob/7c6db9a8775a5559a3173ddbdb0d033c0e317954/src/slackcommands/ai/vision.clj#L266
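Roughly, the OCR step boils down to one images:annotate request. A sketch using clj-http and cheshire (not the code from the link above; api-key and b64-image are assumed inputs):
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

;; Sketch of a TEXT_DETECTION request against the Cloud Vision REST endpoint;
;; api-key and b64-image are assumed to come from the caller.
(defn ocr-image [api-key b64-image]
  (-> (http/post (str "https://vision.googleapis.com/v1/images:annotate?key=" api-key)
                 {:content-type :json
                  :body (json/generate-string
                          {:requests [{:image    {:content b64-image}
                                       :features [{:type "TEXT_DETECTION"}]}]})})
      :body
      (json/parse-string true)
      (get-in [:responses 0 :fullTextAnnotation :text])))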
I found llava to be amazing, but super unreliable — it was sort of fun learning to make it work using ollama, but when I got a bit frustrated with the poor results after a couple of days, I went back to GPT-4-vision, and was startled at how good it was. Will write more soon, and always amazed at your work!
PS: I’m 3 steps into the rewrite, and am amazed at how much cleaner it is already!!!!
@U6VPZS1EK I look forward to seeing the rewrite!
I haven't used llava that much, but I've found it mostly not that useful. Would be interested to hear what it's good at.
I'll probably update llama.clj with llava support at some point since I think llama.cpp already supports it.
OMG, I’m so ridiculously happy. Here’s what it generated! I was able to add a step that detected whether the image matched the dimensions of an iPhone, and counted the red pixels below the YouTube player window (something I spent 2 hours trying to get gpt-4-vision to do). It was amazing to be able to work on each of these steps in isolation from each other.
Here are some images:
• the YouTube video screenshot being analyzed
• the summary of it being put into the database
• the bag of data being displayed in Portal
Here’s the rewritten version: it’s not what I expected it to be — I didn’t want to deal with having to write function names, so I just kept it in a big long function. But I was amazed at how easy it was to change the steps. I even introduced a new one, to write the 2nd LLM prompt as a pure function, pulling it out of the 2nd LLM stage. It made the process of writing and understanding the code 100x easier and so much more fun! THANK YOU, @neumann and @U0510902N!
The revised code is in a comment in the gist: https://gist.github.com/realgenekim/17f9a7ae48aaf2e03df3cc80326a5094?permalink_comment_id=4943675#gistcomment-4943675
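The pixel-counting step itself is plain Java interop; roughly this idea (the scan row and color thresholds below are illustrative guesses, not the real values):
(import '(javax.imageio ImageIO)
        '(java.io File))

;; Sketch: estimate progress by counting "red enough" pixels along one horizontal
;; scan line just below the video area. scan-y and the thresholds are illustrative.
(defn progress-percentage [path scan-y]
  (let [img  (ImageIO/read (File. path))
        w    (.getWidth img)
        red? (fn [x]
               (let [rgb (.getRGB img x scan-y)
                     r   (bit-and (bit-shift-right rgb 16) 0xFF)
                     g   (bit-and (bit-shift-right rgb 8) 0xFF)
                     b   (bit-and rgb 0xFF)]
                 (and (> r 180) (< g 80) (< b 80))))]
    (double (* 100 (/ (count (filter red? (range w))) w)))))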
• This big bag of “vars” inside the bag: I thought it would be confusing to have so many of them, but it wasn’t a problem. (I kept thinking I should group them by {:pass1 {:db _ :photo-id _}, :pass2 {:photo-b64-string _ …} …}, but I didn’t want the overhead of having to number and renumber them. So, big ol’ flat map it is!)
• At the bottom is the comment block manually saving the states and passing them to the next step. It was super easy to test, revise, and run/re-run.
• I have yet to make the wrapper, which will just recur until the state is done.
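To picture it, that comment block looks roughly like this (key and fn names are illustrative, not the exact ones in the gist):
;; Illustrative shape of the flat "bag" plus the REPL comment block.
(comment
  (def bag-1 (load-photo-from-db {:db db :photo-id photo-id}))
  (def bag-2 (url->b64 bag-1))                 ;; adds :photo-b64-string
  (def bag-3 (prompt-vision-llm-pass1 bag-2))  ;; adds :pass1-summary
  ;; inspect any intermediate bag, then re-run only the step that changed
  (keys bag-3))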
This is just night and day compared to before. Thank you!
@U7RJTCH6J My initial reaction to llava was “amazing!” But I was constantly surprised by how poor the quality of the interpretations was, and how toxic some of its confabulations/hallucinations were — my reaction was often, “my goodness, what have you been trained on!” 😆
In contrast, see the GPT4-vision interpretation here (the picture with the arrow in it): https://clojurians.slack.com/archives/CKKPVDX53/p1708999673659119?thread_ts=1708991492.816989&cid=CKKPVDX53 I am just amazed at how many relevant details it can pull out. Aside from the iPhone YouTube mobile app, it can reliably pull out the current playtime from almost any screenshot I give it.
I can’t wait to try running your code — or better yet, wanna get together sometime in the next week or so? I’d love to get a demo of what you’ve built. This looks incredible!
I'm also pretty impressed that chatgpt can extract the approximate percent played. I'll have to try it out sometime.
Check this one out — gpt4-vision extracted the current play position with no help from me. The prompt was as follows (I will take out the stuff asking it to estimate the red progress bar — I just counted red pixels instead for iPhone portrait-mode pics, which it couldn’t handle):
> You are a helpful AI assistant. I am trying to remember what was said in a video or podcast, so I can write an article about it.
>
> Describe the contents of the provided image in detail. The image is a paused video on a screen or a screenshot. Mention the apparent video player interface, the progress bar status, any visible text or titles, the individual visible in the video frame, and any discernible actions or expressions. Also, note any visible tabs or browser elements that could indicate the context or content of the video. Provide a clear and coherent description of all elements without speculating about the content beyond what is visible.
>
> I want to know whether this is a YouTube video or a podcast. I want you to extract all visible text.
>
>
> Right below the YouTube video window, which is at the very top of the screen, you will find the episode name (e.g., podcast name), with the channel name right below it.
>
> If the current playback time is not available, estimate the percentage viewed instead by looking at the length of the red progress indicator positioned at the bottom of the main video window in relation to the full width of the screen. Assume the full red progress bar represents a 100% completion. Measure or make a visual estimation of the red part of the progress bar. Then, divide this length by the total width of the screen to obtain a rough percentage of the video completion. Provide this percentage as the progress indicator value.
>
> Do not use code interpreter -- only analyze the images.
>
> Analyze the screenshot and extract all the relevant information, and put it into
> a Clojure EDN map. Here is an example:
>
>
> {:date-of-photo "2023-12-30"
> :podcast-app "Overcast, screenshot of mobile interface"
> :time "5:08"
> :podcast-episode "Episode 161 — 'There I was!': Real Life Stories from the Cockpit with the Mitchell Institute Part II"
> :podcast-name "THE AEROSPACE ADVANTAGE"
> :length "58:06"
> :current-playback-time "34:11"
> :percentage-viewed "92%, based on the red bar progress indicator"
> :name-of-podcast-app "Apple Podcasts"
> :notes "12/30 was not extracted from the screenshot, but was added manually; and any other comments go here. Man with green shirt standing in front of a presentation. The interface suggests that the video is possibly part of a series or playlist, as the text 'Mix - PapersWeLove' is visible, indicating more content from this channel is available"
> :all-extracted-text "Comments are turned off."}
>
> Do not use code interpreter -- only analyze the images.
I had the first LLM pass use gpt-4-vision, which returns prose — it doesn’t support function calling yet. It takes less than 15s to complete.
The second pass was GPT-4 using function calling to extract a nice JSON map. (I think I tried gpt-3.5-turbo, and it worked most of the time. I can’t remember why I switched to gpt-4; I’ll try that again, since my pixel counting now works.) It takes less than 10s to complete.

Interesting. I've found the vision tools that do OCR to be pretty effective and most other vision tools to be kinda mediocre.
> I just counted red pixels instead for iPhone portrait mode pics, which it couldn’t handle.
Does that mean the LLMs couldn't count the red pixels? I would be surprised if they could.
Late to the party and stuff, so sorry about that.
Looking at your doit fn, I’d deffo split that thing into a fn for each of the case statements, be that a multimethod or just normal fns:
(defn load-photo-from-db [...]...)
(defn url->b64 [...] ...)
;; etc
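Or, going the multimethod route, roughly this (a sketch; the dispatch key and state names are guesses, reusing the fn names above):
;; Sketch of the multimethod flavour: dispatch each step on the bag's
;; "next state" key. The key name and state keywords here are guesses.
(defmulti run-step :next-state)

(defmethod run-step :load-photo [bag]
  (-> bag
      (assoc :photo (load-photo-from-db bag))
      (assoc :next-state :url->b64)))

(defmethod run-step :url->b64 [bag]
  (-> bag
      (assoc :photo-b64-string (url->b64 bag))
      (assoc :next-state :prompt-pass1)))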
Also, being slightly obsessed with monoids, I’d suggest defining your empty-state (empty-bag) somewhere, and perhaps add a spec for that bag. IME, these bags tend to grow over time if you don’t take care:
(def empty-bag {:photo-b64-string "" ;; or should it be nil?
                ...})
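A spec for that bag could be as small as this (a sketch; the key set is illustrative, not exhaustive):
(require '[clojure.spec.alpha :as s])

;; Sketch: spec the bag's known keys so its growth over time stays visible.
(s/def ::photo-b64-string (s/nilable string?))
(s/def ::pass1-summary (s/nilable string?))
(s/def ::bag (s/keys :opt-un [::photo-b64-string ::pass1-summary]))

(s/valid? ::bag empty-bag) ;; => true, once empty-bag is a real map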
Also, due to my obsession with monoids, I’d prefer merge over assoc with a bunch of keys, as merge can potentially be an operator in a monoid, and works better in a reduction system.
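i.e. something like this (a sketch with made-up keys):
;; assoc with a bunch of keys...
(assoc bag :pass1-summary summary :pass1-ms elapsed)

;; ...vs merging in a map, which composes nicely:
(merge bag {:pass1-summary summary :pass1-ms elapsed})

;; merge with {} as the identity behaves like a monoid, so a sequence of
;; step results can simply be folded together:
(reduce merge {} [{:photo-id 1} {:photo-b64-string "..."} {:pass1-summary "..."}])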
I see also that, e.g. in your url->b64 case, your comment is lying, i.e. you say that it’s a thing from url -> string, but in fact the case statement returns a new bag. I would separate those two things.
In fact, the comments you have before each case statement are perfect doc-strings for the fns I suggested you write earlier in this message :)

You will most likely also see that a bunch of these fns (if you choose to extract them) are nice, pure fns, and just a few of them are side-effecting.
pass2-summary-to-edn could also reasonably be split into several fns.
One which creates
{:summary (-> pass1-summary :summary)
:prompt pass2-prompt
:is-screenshot? is-screenshot?
:youtube-percentage youtube-percentage}
Another, which sends this to gpt.
A third which knows how to figure out if it’s a success or not.

I’ve somewhat come to my end now with what I see as fruitful refactors, but I must admit, there is more to be done.
I think it’s fair to point out that your interaction with the LLMs could be modeled more strongly. Like, the prompt and prompt-photo fns seem quite similar, and I’d be curious if we couldn’t have just a prompt data structure to describe the different prompts we’d need.
So rather than specializing on what kind of prompt you’re sending, I’d have fns like:
(defn ->photo-prompt [....] ...)       ;; returns data that describes a photo prompt
(defn ->screenshot-prompt [....] ...)  ;; returns data that describes a screenshot prompt
and then
(gpt/prompt (->photo-prompt ...))
and
(gpt/prompt (->screenshot-prompt ...))
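Fleshed out just a little, that could look like this (a sketch; the keys and the model name are illustrative):
;; Sketch only: prompts as plain data, with one generic gpt/prompt entry point.
(defn ->photo-prompt [photo-b64 instructions]
  {:model  "gpt-4-vision-preview"
   :images [photo-b64]
   :text   instructions})

(defn ->screenshot-prompt [photo-b64]
  (->photo-prompt photo-b64 "Describe this screenshot and extract all visible text."))

;; gpt/prompt then only needs to know how to turn that map into an API call.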
I’d then also perhaps introduce a data structure like:
{:prompt :summary :success}
Which I’ve almost done, to record the interaction with the LLM.

@U04V5VAUN I’m in complete awe of what you did — thank you so much! I suspect I’ll be studying and replicating this form of code for the next decade. I will be copying this into my codebase later today, but here are some reactions (more reflections coming later).
When I was reading your note, I was extremely skeptical that even the first refactoring would be worth it — as @U0609QE83SS noted in another thread, pulling the code out of the case statements runs the risk of now having to look in two places instead of one, decreasing coherence, and increasing the risk of the pieces no longer fitting together in the end. But wow, there is no doubt that your version is much, much better than mine, on so many dimensions: readability, testability, and getting rid of the stupid “next state” in the case state machine is just 🔥
And the explicit shift to a monoid pattern is also 🔥 — I wrote about how I loved @U050P0ACR’s use of monoids and “algebraic thinking” here: https://clojurians.slack.com/archives/CKKPVDX53/p1700849580265389
I wish I had been able to watch you refactor this code, because I have a potentially silly-sounding question, which I think is subconsciously deterring me from doing what you did:
• Wasn’t it a lot of typing? It sounds laughable, but the idea of having to give each one of those tiny functions a name and type it all out could seem like a lot of work!
I’d love to know your thought process and response to that question! (Rest assured, I have no doubt that it’s worth it — since I started this thread, I’ve gone on to create 5+ more of these “monster let block” functions, which I now must struggle with. I think seeing your code pattern gives me something that I can build or refactor toward much earlier!)
Thank you again! I am in awe of what you did!
This was actually not a lot of work. As for the fn-names, you already had them spelled out in the case statement. Basically, the three/four comments are the iterations that I took. I am a heavy user of paredit, so that makes stuff so much simpler.
Wow, I watched that whole 5.5m video — thank you so much for posting it! (I’m so glad you recorded it!!)
The notion that “refactoring would take a LOT of typing” is clearly false. And the increase in clarity is stunning.
I also love the fact that the arguments are now explicit — instead of a “bag of values in a map,” they’re explicitly defined, which makes understanding the function so much easier.
In 5.5 minutes, you refactored two functions — so, what a huge payback for 2.5m of work per function!
I also LOVE that you turned the doit function into this:
(defn doit [db photo-id]
  (-> (xtp/photo-xtdb-fresh-url-uuid db photo-id)
      (url->b64)
      (write-to-file "tmp/decoded.jpg")
      (prompt-vision-llm-pass1 default-prompt)
      (store-summary :pass1)
      (analyze-screenshot "tmp/decoded.jpg")
      (pass2-generate-prompt)
      (pass2-summary-to-edn)
      (write-to-database db photo-id)))
I love everything you did, but this might be my favorite part — that you’ve coupled the steps together so they can be read from top to bottom, but also allowed each step to be run independently (i.e., I could execute everything except the side-effecting function that stores to the database).
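e.g., at the REPL I can stop the pipeline anywhere and skip the database write (a sketch, reusing the fn names above):
;; Sketch: run everything except the final side-effecting write, then persist later.
(comment
  (def dry-run
    (-> (xtp/photo-xtdb-fresh-url-uuid db photo-id)
        (url->b64)
        (write-to-file "tmp/decoded.jpg")
        (prompt-vision-llm-pass1 default-prompt)
        (store-summary :pass1)
        (analyze-screenshot "tmp/decoded.jpg")
        (pass2-generate-prompt)
        (pass2-summary-to-edn)))   ;; everything except (write-to-database db photo-id)

  ;; inspect dry-run in Portal, then persist when happy:
  (write-to-database dry-run db photo-id))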
Thank you again — this is so lovely!

You make me blush 🙂 Two things:
One, I think you’ve picked up on most of the things I was looking to achieve. There are still some things to be done, but I think we’ve gotten to at least an 80% improvement.
Two, I could never have written the thing that you wrote, as in I would never be able to realize that I store a bunch of screenshots of talks on my phone (which I used to do), and that there is a way to solve the problem with LLMs and whatnot. And even if I had realized that it was a solvable problem, I would never have gotten around to even trying to solve it, because there is just too much icky stuff to deal with. So well done!