I'm playing around with a little web-crawler that is limited to the domain of the site you start with so it doesn't sprawl out into the whole web. Here's what I came up with, but I'm thinking it could be improved a bit.
You can keep downloading and parsing on the same thread, but I think it's important that when a download or a parse fails, it's really obvious which one it was, so I would do these in different functions. Also, (.get (Jsoup/connect url)) gives you very little control; http-clj gives way more options (e.g. headers and timeouts).
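To make the "different functions" idea concrete, here is a rough sketch, not the code from the thread. The names download, parse, fetch-fn and parse-fn are made up for illustration: fetch-fn stands for whatever turns a URL into raw HTML, and parse-fn for whatever turns HTML into a document. Each step tags its own failure, so a result map always says which stage broke.

```clojure
(defn download
  "Calls fetch-fn (hypothetical: url -> raw HTML string) and tags the outcome."
  [fetch-fn url]
  (try
    {:url url :status :downloaded :body (fetch-fn url)}
    (catch Exception e
      {:url url :status :download-failed :error (.getMessage e)})))

(defn parse
  "Calls parse-fn (hypothetical: HTML string -> parsed document) on a
  successful download. Failed downloads pass through unchanged."
  [parse-fn {:keys [status body] :as page}]
  (if (not= status :downloaded)
    page
    (try
      (-> page
          (dissoc :body)
          (assoc :status :parsed :doc (parse-fn body)))
      (catch Exception e
        (assoc page :status :parse-failed :error (.getMessage e))))))

;; Usage with stand-in functions: the "fetch" returns a fixed string and
;; the "parse" just counts characters.
(parse count (download (fn [_] "<html></html>") "https://example.com"))
```

Because both steps are plain functions of their input, they can be run on the same thread or on different ones without changing how errors are reported.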
One option for decoupling: when you download data you can either keep it as a byte-array or write it to a file. You can then run downloading in one future and parsing in another (possibly with additional multithreading inside each) and have a queue between them.
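A minimal sketch of that two-futures-and-a-queue layout, under some assumptions: run-pipeline, fetch-fn and parse-fn are hypothetical names, a java.util.concurrent.LinkedBlockingQueue is just one possible queue, and the URL list is fixed up front (a real crawler would feed newly found links back in).

```clojure
(import '(java.util.concurrent LinkedBlockingQueue))

(defn run-pipeline
  "Downloads on one future and parses on another, with a queue in between.
  fetch-fn (url -> body) and parse-fn (body -> result) are stand-ins for
  the real download and parse code."
  [fetch-fn parse-fn urls]
  (let [queue      (LinkedBlockingQueue.)
        downloader (future
                     (try
                       (doseq [url urls]
                         (.put queue [url (fetch-fn url)]))
                       (finally
                         ;; sentinel: tells the parser nothing more is coming,
                         ;; even if a download threw
                         (.put queue ::done))))
        parser     (future
                     (loop [results []]
                       (let [item (.take queue)] ; blocks until something arrives
                         (if (= item ::done)
                           results
                           (let [[url body] item]
                             (recur (conj results [url (parse-fn body)])))))))]
    @downloader ; rethrows if a download failed
    @parser))

;; Usage with stand-ins: "downloading" returns the URL itself and
;; "parsing" counts its characters.
(run-pipeline identity count ["a" "bb"])
```

The downloader never waits for the parser and the parser never knows where the bodies came from, which is the decoupling being suggested.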
I think that makes sense. I'll take a stab at these suggestions. Thanks for the help!
From a parsing and page-processing point of view, here is some code: https://github.com/adityaathalye/clojure-by-example/blob/master/src/clojure_by_example/fun/inspect_nasa_planets.clj (I didn't know of Jsoup then). It runs on a single page of interest, to pull out planetary data and then crunch it. The key feature for me was to declare DOM paths of interest as Clojure data, which enlive understands. For scraping across sites, this kind of method may help to declare general patterns of interest, as well as site-specific (or even site-and-page-specific) special patterns for more surgical data extraction. As the code warns, it is far from professional grade (and I don't know if this method would perform well), but it may throw some ideas your way.
Another idea could be to navigate around DOM trees using zippers... e.g. https://github.com/adityaathalye/clojure-by-example/blob/master/src/clojure_by_example/fun/workshop_fmt.clj processes and rewrites some Clojure source, but just like Lisp code, the DOM is just a tree :) Gut feeling... zippers might challenge the garbage collector if a lot of ops are done on a big in-memory structure, and will probably challenge the brain to grok, but the fun might be well worth the trouble. 🖖 🤓
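For a feel of what zipper navigation looks like, here is a small sketch using clojure.zip from the standard library. The tree is hand-written, made-up data in the {:tag :attrs :content} shape, and upcase-second-item is a hypothetical name; it is not taken from the linked file.

```clojure
(require '[clojure.string :as str]
         '[clojure.zip :as zip])

;; A tiny made-up tree in the {:tag :attrs :content} shape.
(def dom
  {:tag :ul :attrs {}
   :content [{:tag :li :attrs {} :content ["Mercury"]}
             {:tag :li :attrs {} :content ["Venus"]}]})

(defn upcase-second-item
  "Walks to the text of the root's second child, upper-cases it, and
  returns the rebuilt tree. The input tree is not modified."
  [tree]
  (-> (zip/xml-zip tree)
      zip/down                  ; first :li
      zip/right                 ; second :li
      zip/down                  ; its text node
      (zip/edit str/upper-case)
      zip/root))

(upcase-second-item dom)
```

Every step is relative to the current location, which is what makes zippers handy for local rewrites and fiddly for exploring a page whose shape is unknown.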
Zippers are very low level and fiddly. For parsing something you've never seen before, enlive or even (tree-seq) is going to be much simpler.
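As a quick illustration of the (tree-seq) route: the sketch below flattens a nested {:tag :attrs :content} tree into a sequence of nodes and picks out whatever it needs. The page data and the all-hrefs name are invented for the example.

```clojure
;; Made-up page data in the {:tag :attrs :content} shape.
(def page
  {:tag :div :attrs {}
   :content [{:tag :a :attrs {:href "/about"} :content ["About"]}
             {:tag :p :attrs {}
              :content ["See "
                        {:tag :a :attrs {:href "/docs"} :content ["docs"]}]}]})

(defn all-hrefs
  "Collects every :href attribute found anywhere in the tree, depth first."
  [tree]
  (->> (tree-seq map? :content tree) ; maps are branches, strings are leaves
       (keep #(get-in % [:attrs :href]))))

(all-hrefs page)
;; => ("/about" "/docs")
```

No knowledge of the page layout is needed beyond "nodes are maps with :content", which is why this works well on unfamiliar documents.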
Clojure can garbage collect millions of objects per second, so that shouldn't be a problem for it.
Yeah, I thought of zippers if one wants to over-engineer for flair points :) A fast HTML parser would be the right choice, hands down.
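
(ns crawler.core ; placeholder namespace name
  (:require [clojure.java.io :refer [as-url]]
            [clojure.set :as set]
            [clojure.string :as str])
  (:import (org.jsoup Jsoup)))

(defn- get-page
  "Starts downloading url on another thread. Deref yields the Document, or nil on failure."
  [url]
  (future
    (try
      (println "Getting page for url" url)
      (.get (Jsoup/connect url))
      (catch Exception e
        (println "Failed to open url" url "due to" (.getMessage e))))))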

(defn- remove-last-slash [url]
  (if (str/ends-with? url "/")
    (subs url 0 (dec (count url)))
    url))

(defn- normalize-url [url]
  (remove-last-slash (str/trim url)))

(defn- of-domain? [domain url]
  ;; Literal substring match. Building a regex from the domain would let
  ;; every "." in it match any character.
  (let [domain (str/replace domain "https://" "")]
    (str/includes? url domain)))
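
(defn- get-hrefs
  "Returns the absolute URL of every element in doc that has an href attribute."
  [^org.jsoup.nodes.Document doc]
  (when doc
    ;; .select returns an Elements collection, not a single Element
    (let [^org.jsoup.select.Elements hrefs (.select doc "[href]")]
      (map (fn [^org.jsoup.nodes.Element href]
             (.attr href "abs:href"))
           hrefs))))

(defn- get-domain [url]
  (.getHost ^java.net.URL (as-url url)))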
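
(defn- next-hrefs
  "Returns the set of links in doc that pass a filter and are not in saved-pages."
  ([domain ^org.jsoup.nodes.Document doc saved-pages]
   (next-hrefs domain doc saved-pages {}))
  ([domain doc saved-pages {:keys [inc-filters] :or {inc-filters []}}]
   (let [standard-filters [(partial of-domain? domain)]
         hrefs (->> (get-hrefs doc)
                    ;; some-fn keeps a link when ANY filter accepts it
                    (filter (apply some-fn (concat standard-filters inc-filters)))
                    ;; normalize so a trailing slash does not create a "new" URL
                    (map normalize-url)
                    set)]
     (set/difference hrefs (set saved-pages)))))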

(defn scan
  "Crawls from url, staying on its domain. Returns [url page-future] pairs."
  ([url] (scan url {}))
  ([url opts]
   (let [domain     (get-domain url)
         first-page [(normalize-url url) (get-page url)]]
     (loop [[page & rst :as pages] [first-page]
            saved-pages            [first-page]]
       (if-not (seq pages)
         saved-pages
         (let [[_ doc]    page
               saved-urls (map first saved-pages)
               hrefs      (next-hrefs domain @doc saved-urls opts)
               ;; mapv is eager, so every download future starts right away
               next-pages (mapv (fn [href] [href (get-page href)]) hrefs)]
           ;; into avoids the deeply nested lazy seqs that repeated concat builds
           (recur (into (vec rst) next-pages)
                  (into saved-pages next-pages))))))))

Nice start. Scraping is a major rabbit hole: the first 80% is easy, the last 5% is thousands of hours of work. I'd probably disconnect downloading from parsing and also make it multithreaded. You also want timeouts; it's not ideal when the scraper is stuck on the same URL for 8 hours.
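On the timeout point, one cheap guard when downloads run in futures is the three-argument form of deref, which gives up after a set time. This is only a sketch: await-page is a hypothetical helper name, and it complements rather than replaces a timeout set on the HTTP connection itself.

```clojure
(defn await-page
  "Waits up to timeout-ms for a download future. On timeout it cancels the
  future and returns nil so the crawl can move on to the next URL."
  [page-future timeout-ms]
  (let [result (deref page-future timeout-ms ::timed-out)]
    (if (= result ::timed-out)
      (do (future-cancel page-future) nil)
      result)))

;; Usage: a slow "download" is abandoned, a fast one comes back normally.
(await-page (future (Thread/sleep 5000) :doc) 50)  ; => nil
(await-page (future :doc) 1000)                    ; => :doc
```

Returning nil on timeout matches how a failed download is already represented in the posted code, so the caller needs no extra branch.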
Thanks for the feedback. I think there may have already been a timeout under the hood of Jsoup, but I explicitly added one in the download-page function and renamed a few things. Any tips on how I can separate the download/parsing more? I tried to put each of those in its own function, but everything gets coupled back together in the loop. That's also the problem I have with the multithreading: each download happens in its own thread, but it all comes back together as it iterates through the loop.