https://github.com/askonomm/dompa - A zero-dependency, runtime-agnostic HTML parser and builder.
• Turns HTML strings into a tree of nodes, and makes it easy to traverse and modify said tree
• Turns a tree of nodes back into HTML, and provides convenience utilities if you want to do so in a templating-like way
• Built to be runtime-agnostic and should in theory work in all Clojure runtimes, though it is currently tested and proven to work only in Clojure, ClojureScript and Babashka, with Jank support hopefully coming soon.
Very nice Asko. Since the structure is uniform, a suggestion would be to add a zipper function (to get the zipper CRUD API for free).
Perhaps implement traverse with zippers as well. I don’t think HTML is generally deep enough to be liable to blow up the stack, but using a zipper would guarantee that it can’t happen as opposed to recursion.
Thank you @henrik! I'm ashamed to say I have very little experience with zippers in Clojure, but from quickly reading the docs it does seem like a very convenient way to navigate a tree structure. I've made an issue for it so I won't forget to give it a go: https://github.com/askonomm/dompa/issues/9.
I don’t think you have to be ashamed of that, I think that lands you among the majority. I wouldn’t have known about them if someone hadn’t directed my attention to them, because you get so far with just the standard stuff in Clojure. They’re cool when you encounter data structures with a repeating pattern, since you can just write one “adapter” and then the actual API for manipulating the structure is the same, regardless. We have some zipper usage in our codebase. For example,
(require '[clojure.zip :as zip])

(defn- not-end?
  [loc]
  (not (zip/end? loc)))

(defn find-node-locations
  "Given zipper `loc` find node locations matching the `pred`.
  Optionally takes a transducer, eg. `(map zip/node)`."
  ([loc pred]
   (sequence
    (comp
     (take-while not-end?)
     (filter #(pred (zip/node %))))
    (iterate zip/next loc)))
  ([loc xf pred]
   (sequence
    (comp
     (take-while not-end?)
     (filter #(pred (zip/node %)))
     xf)
    (iterate zip/next loc))))
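As a small usage sketch of the function above: the same code works on any zipper, here the built-in `zip/vector-zip` over nested vectors. The sample tree and the `evens` name are made up for illustration; the function is repeated in compact form only so the snippet runs on its own.

```clojure
(require '[clojure.zip :as zip])

(defn- not-end? [loc]
  (not (zip/end? loc)))

(defn find-node-locations
  ([loc pred]
   (find-node-locations loc identity pred))
  ([loc xf pred]
   (sequence
    (comp (take-while not-end?)
          (filter #(pred (zip/node %)))
          xf)
    (iterate zip/next loc))))

;; Hypothetical sample data: a nested vector instead of HTML nodes.
(def tree (zip/vector-zip [1 [2 3] [4 [5 6]]]))

;; Branch nodes are vectors, so guard with number? before calling even?.
;; The (map zip/node) transducer turns matching locations into plain nodes.
(def evens
  (into []
        (find-node-locations tree
                             (map zip/node)
                             #(and (number? %) (even? %)))))
```

Because the walk is driven by `iterate zip/next`, it visits nodes depth-first without recursing, which is the stack-safety property mentioned earlier in the thread.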
If I were to write a zipper for Dompa, maybe something like:
(zip/zipper map?
            (fn get-children [node]
              (get node :node/children))
            (fn create-node [node children]
              (assoc node :node/children (vec children)))
            html-root)
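A minimal sketch of what editing through such a zipper might look like. Only `:node/children` is taken from the snippet above; the `:node/name` and `:node/value` keys, the sample `html-root` tree, and the `node-zip` and `rename-nodes` helpers are assumptions made for illustration and are not part of Dompa's API.

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical tree. Only :node/children is known from the thread;
;; :node/name and :node/value are assumed keys.
(def html-root
  {:node/name :div
   :node/children [{:node/name :p
                    :node/children [{:node/name :text
                                     :node/value "hello"}]}
                   {:node/name :hr}]})

(defn node-zip
  "Wrap a node tree in a zipper, as in the snippet above."
  [root]
  (zip/zipper map?
              :node/children
              (fn [node children]
                (assoc node :node/children (vec children)))
              root))

(defn rename-nodes
  "Walk the whole tree with loop/recur (no recursion), renaming every
  node whose :node/name is `from` to `to`. Returns the rebuilt root."
  [root from to]
  (loop [loc (node-zip root)]
    (if (zip/end? loc)
      (zip/root loc)
      (recur (zip/next
              (if (= from (:node/name (zip/node loc)))
                (zip/edit loc assoc :node/name to)
                loc))))))

(def renamed (rename-nodes html-root :p :section))
```

The original `html-root` is left untouched; `zip/root` rebuilds only the path from the edited node up to the root.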
The find-node-locations function would work, even though it isn’t written with Dompa in mind.
Gave it a test drive in bb. hr elements (and other self-closing elts) can be written as:
<hr/>
or
<hr />
both crash in dompa now in different ways
This also crashes with a ClassCastException:
(require '[dompa.html :as html]
         '[dompa.nodes :as nodes])
(prn (nodes/->html (html/->nodes (slurp ""))))
this is really neat
i might use this to replace Jsoup in flower
@borkdude oh man, that's not good. I'll get those fixed asap.
(Jsoup has an imperative API and this has a nice clojure-native traversal API for edits)
this doesn't seem to be on clojars currently, is that right?
That's right, just on GitHub. Do you rely on Leiningen? I was wondering whether, and how much, Leiningen is still a thing, so I didn't go ahead with Clojars just yet, but I can get it up there in a bit.
I don't use leiningen, I can use git deps if clojars is a pain. mostly I wanted to look at the generated API docs.
Ah right, Clojars does that! Clojars isn't really a pain other than having to have the build.clj (which I don't find very user-friendly when all you want is to publish a library).
does clojars generate api docs?
you probably mean cljdoc
i do mean cljdoc
i think cljdoc pulls from clojars though?
why doesn't cljdoc support git-tagged deps. cc @lee
(probably it does)
(it amuses me quite a lot how similar clojure's infra is to rust's in this respect, even though to my knowledge there wasn't a lot of overlap between the communities at the time any of it was built)
We did some thinking on cljdoc supporting git tagged deps, but did not move forward with it yet
does cljdoc have the same problem that it needs to run build.clj in order to generate API docs? or is it enough to parse the defns without executing require statements?
@jyn514 cljdoc uses runtime analysis so it has to execute the code. my lite documentation solution quickdoc solely uses static analysis. https://github.com/borkdude/quickdoc
(based on clj-kondo)
The issues discovered by @borkdude should be fixed now in v1.0.1.
I'll go ahead and integrate quickdoc as well later today, to get automatic API docs going @jyn514.
@asko304 Thank you. Now I can parse my (unupdated for a long time) homepage!
(require '[babashka.deps :as deps])
(deps/add-deps '{:deps {askonomm/dompa {:git/url ""
                                        :git/tag "v1.0.1"
                                        :git/sha "35de9bc8aaaa165ec3f2efb04691bdca3dd5e446"}}})
(require '[dompa.html :as html]
         '[dompa.nodes :as nodes])
(spit "/tmp/html1.html" (slurp " "))
(spit "/tmp/html2.html" (nodes/->html (html/->nodes (slurp " "))))
(babashka.process/shell {:continue true} "diff" "/tmp/html1.html" "/tmp/html2.html")
I do see differences in the parsed HTML and the generated HTML but maybe it's just whitespace.
At the end it appears that some divs are missing:
}(document, "script", "twitter-wjs"));</script></p></div></div></div></body></html>
vs
}(document, "script", "twitter-wjs"));</script></p></div></body></html>
could also be bad HTML in my homepage ;)
Dompa lacks support for HTML healing that browsers have (such as that if you forgot to close a div, it would do it for you). But I wouldn't immediately presume it's that, it could easily be a bug in my code. I can't debug this at the moment, but will give it a go later today to see if I can track down where the difference comes from.
I've tracked down the issue with the missing tags, and it is definitely an issue on my side, your HTML is just fine @borkdude. I've pushed out a fix in v1.0.2 for that.
I've also added quickdoc generated docs for the API now: https://github.com/askonomm/dompa/blob/main/API.md re @jyn514
@asko304 Awesome. I tried again and noticed that round-tripping loses the doctype:
(require '[babashka.deps :as deps])
(deps/add-deps '{:deps {askonomm/dompa {:git/url ""
                                        :git/tag "v1.0.2"
                                        :git/sha "497a7dc"}}})
(require '[dompa.html :as html]
         '[dompa.nodes :as nodes])
(spit "/tmp/html1.html" (slurp " "))
(spit "/tmp/html2.html" (nodes/->html (html/->nodes (slurp " "))))
(babashka.process/shell {:continue true} "diff" "/tmp/html1.html" "/tmp/html2.html")
As always, fix one bug, another appears. I've now fixed this issue as well in v1.0.3, and I've added https://github.com/askonomm/dompa/blob/main/test/dompa/round_trip_test.clj (I hope you don't mind). I figure it makes sense to start testing against whole sites as opposed to only bits and pieces.
From the README, I gather that the tree-of-nodes is the same idea & shape as that of clojure.xml and clojure.data.xml, but with different keys, so not directly interoperable with libraries that work with those?
I've actually never used clojure.xml or clojure.data.xml, so I've no idea. I just picked naming that I thought would make sense to me. Would interoperability with those be important? Dompa isn't meant to work with XML (though it may be possible with some tweaks).
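For context on the question above: clojure.xml represents an element as a map with `:tag`, `:attrs` and `:content`, with text as plain strings inside `:content`. If interop were wanted, a small adapter might be enough. The sketch below assumes Dompa nodes use `:node/name`, `:node/attrs`, `:node/children`, and `:node/value` for text; of these only `:node/children` appears in this thread, so the rest are guesses, and `node->xml` is a hypothetical helper rather than anything Dompa provides.

```clojure
(defn node->xml
  "Convert an (assumed) Dompa-style node into the
  {:tag :attrs :content} shape used by clojure.xml.
  Nodes carrying :node/value are treated as text and become strings."
  [node]
  (if (contains? node :node/value)
    (:node/value node)
    {:tag     (:node/name node)
     :attrs   (:node/attrs node)
     :content (mapv node->xml (:node/children node))}))

;; Hypothetical input, using the assumed keys.
(def converted
  (node->xml {:node/name :p
              :node/attrs {:class "intro"}
              :node/children [{:node/name :text
                               :node/value "hi"}]}))
```

One difference to keep in mind: this sketch always emits a vector for `:content`, whereas clojure.xml uses `nil` for elements without content, so a stricter adapter would need to handle that case.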