
What is your favorite/recommended webscraping tool in Clojure? Something practical and comprehensive (like Scrapy in Python if you know it)?

practicalli-johnny 18:05:17

is a very useful library that is often used to scrape the web and transform it into useful data structures in Clojure


Thanks, @U05254DQM! Will check it out now.


jsoup via whatever means


Enlive and jsoup are useful for scraping static pages


You may want to check out htmlunit for more complex "modern" pages


But even htmlunit has its limits to how much JavaScript it can handle


playwright is a useful library for puppeting real browsers, and seems less fussy than setting up selenium


You mean being able to navigate and click on links before scraping?


depends on what you are scraping, html pages or full on react apps


I mean the main advanced part I want is the ability to go to sub-pages by clicking on links, all done by the scraper.


Yeah, enlive and/or jsoup are not that, they are basically just html parsers


Htmlunit is a sort of headless browser written in java, it can execute JavaScript and you can programmatically click things etc


a little surprised this didn’t come up

Jon Olick 20:05:43

@hiredman how do you suppose you would implement the format command in a C/C++ hosted language without re-implementing all of printf?


Are you trying to avoid using an existing C/C++ implementation of printf?


@U036UA9LZSQ Is this for jo_lisp? I think I would expect it to behave like the C/C++ formatting when there are differences...


The core behavior is similar enough I think? (been a while but all the basics are the same, right?)


I think it would be nice to omit printf-like formatting, and maybe go straight for something like this: ?


@U0K064KQV Not relevant to what Jon is working on I suspect -- implementing Clojure in C++ as a hosted language there.


I think he's concerned with balancing compatibility with Clojure vs amount of work involved. I wonder how ClojureCLR deals with format?


Oh, I thought he was only inspired by Clojure and diverging. Arguably, if that JEP does happen, Clojure might inadvertently support it from Java 😛

Jon Olick 23:05:45

What Sean said


Yeah, I'd expect interop in jo_lisp to open up whatever C/C++ has to offer in terms of other string formatting systems 🙂

Jon Olick 23:05:58

Mostly, C doesn't have a mechanism to construct a variable argument list to pass to printf

Jon Olick 23:05:09

So, I can't really use it easily


In that case, I consider format one of the things I'd expect to come from the host. But I'd expect cl-format to be consistent.


Is cl-format widely used? I think I've only ever encountered one code base that used it?


Probably not widely, but I use it. It's a great template language once you know it, and very powerful. In CL it also supports inline variables, which I wish Clojure's implementation had too. Apparently they're open to it, but no one ever finished it.


@U036UA9LZSQ You could write a dispatch on the number of arguments, up to a reasonable maximum, and support just that "subset" of available calls?

Jon Olick 23:05:08

Yeah that's the alternative in some respects - though limited by type
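The dispatch-on-count idea can be sketched roughly as below. This is only an illustration, not jo_lisp code: `format_longs` is a hypothetical helper, the maximum of 4 is arbitrary, and it assumes every argument is a `long`, which is exactly the "limited by type" restriction mentioned above. Mixed argument types would need one case per type combination.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: forwards up to 4 long arguments to snprintf by
   switching on the runtime count. Every conversion in fmt must be %ld.
   Returns snprintf's result, or -1 if n exceeds the supported maximum. */
int format_longs(char *out, size_t cap, const char *fmt,
                 const long *a, size_t n)
{
    switch (n) {
    case 0: return snprintf(out, cap, fmt);
    case 1: return snprintf(out, cap, fmt, a[0]);
    case 2: return snprintf(out, cap, fmt, a[0], a[1]);
    case 3: return snprintf(out, cap, fmt, a[0], a[1], a[2]);
    case 4: return snprintf(out, cap, fmt, a[0], a[1], a[2], a[3]);
    default: return -1;  /* more arguments than this sketch supports */
    }
}
```

Called with the format `"%ld-%ld"` and the two values 1 and 2, this writes `1-2` into the buffer.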


Would you not have this issue everywhere? Of not supporting var-arg?

Jon Olick 23:05:08

Not really no


You can support it for other functions but not printf?

Jon Olick 23:05:25

Arguments are passed to native functions in a persistent list


Even though it's implemented in Clojure itself?

(defn printf
  "Prints formatted output, as per format"
  {:added "1.0"
   :static true}
  [fmt & args]
  (print (apply format fmt args)))

Jon Olick 23:05:51

So the issue only occurs when you need to forward variable arguments to a C function

Jon Olick 23:05:01

Only case that has come up so far is printf


Oh I see, you'd use the C format, and since that's not var-arg.


How does the C format get the values to substitute?

Jon Olick 23:05:52

I can just implement it, but was hoping @hiredman had a clever workaround he knew


My google does show that C's format is variadic though 😛, but my familiarity with C is low, and I have no clue how you're implementing Clojure to compile to C


Hum, ok I think I understand the issue now having looked more at the C printf. Sean's idea seems maybe best. But I wonder if you could use vprintf instead. The use of vararg means you'd statically know the length, though I'm not sure how you'd apply over it.

Jon Olick 01:05:31

yeah, the problem with that approach, from my research, is there is no way to construct a va_list.

Jon Olick 01:05:38

(from scratch that is)
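For context on the va_list point, here is a minimal sketch of the one portable way to get a va_list in standard C: calling `va_start` inside a function that is itself variadic. `format_into` is a hypothetical name. The arguments have to be written out at the call site, so this does not help when they only exist as a runtime list.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper: formats into buf by forwarding its own
   variadic arguments to vsnprintf. */
int format_into(char *buf, size_t cap, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);                 /* ap refers to this call's own arguments */
    int n = vsnprintf(buf, cap, fmt, ap);
    va_end(ap);
    return n;
}
```

A call such as `format_into(buf, sizeof buf, "%d:%s", 7, "x")` works because the compiler sees both arguments at the call site. There is no standard counterpart that fills `ap` from an array or list built at runtime.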


I'm sure you've looked into that, va_start and va_end, not sure if they work or not. Just FYI in case you didn't explore those.

Jon Olick 01:05:12

yup, pretty sure I covered all the bases. There are non-portable, non-standard, architecture-specific, OS-specific ways of creating a va_list… but f-that
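One possible middle ground between forwarding to printf and re-implementing all of it, sketched under assumptions: walk the format string, and hand each conversion to `snprintf` on its own with exactly one argument, so no va_list is ever built. Nothing here is from jo_lisp. The `value` type, the `V_*` tags and `format_values` are hypothetical stand-ins for however the runtime represents its arguments, and only `%d`, `%f`, `%s` and `%%` are handled.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical tagged value standing in for one runtime argument. */
typedef enum { V_LONG, V_DOUBLE, V_STR } vkind;
typedef struct {
    vkind kind;
    union { long l; double d; const char *s; } u;
} value;

/* Formats fmt into out using args[0..n). Each conversion is copied into a
   small spec buffer and passed to snprintf with exactly one argument.
   Handles flags, width and precision digits with the conversions d, f, s,
   plus %%. No '*' widths or length modifiers. Returns the number of
   characters written, or -1 on any error. */
int format_values(char *out, size_t cap, const char *fmt,
                  const value *args, size_t n)
{
    size_t used = 0, next = 0;
    if (cap == 0) return -1;
    out[0] = '\0';
    while (*fmt) {
        if (*fmt != '%' || fmt[1] == '%') {     /* literal char or escaped %% */
            if (used + 1 >= cap) return -1;
            out[used++] = *fmt;
            out[used] = '\0';
            fmt += (*fmt == '%') ? 2 : 1;
            continue;
        }
        /* len covers '%' plus any flags, width and precision characters. */
        char spec[32];
        size_t len = strspn(fmt + 1, "-+ #0123456789.") + 1;
        char conv = fmt[len];
        if (len + 3 > sizeof spec || next >= n) return -1;
        memcpy(spec, fmt, len);

        const value *v = &args[next++];
        size_t room = cap - used;
        int w;
        if (conv == 'd' && v->kind == V_LONG) {
            strcpy(spec + len, "ld");           /* longs need the l modifier */
            w = snprintf(out + used, room, spec, v->u.l);
        } else if (conv == 'f' && v->kind == V_DOUBLE) {
            strcpy(spec + len, "f");
            w = snprintf(out + used, room, spec, v->u.d);
        } else if (conv == 's' && v->kind == V_STR) {
            strcpy(spec + len, "s");
            w = snprintf(out + used, room, spec, v->u.s);
        } else {
            return -1;                          /* unsupported or mismatched type */
        }
        if (w < 0 || (size_t)w >= room) return -1;  /* error or truncated */
        used += (size_t)w;
        fmt += len + 1;
    }
    return (int)used;
}
```

The host's formatter still does the hard parts (padding, precision, float rendering). The cost is a small parser for the conversion syntax and an explicit type check per argument.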