#beginners
2022-05-21
Baye17:05:38

What is your favorite/recommended webscraping tool in Clojure? Something practical and comprehensive (like Scrapy in Python if you know it)?

practicalli-johnny18:05:17

https://github.com/cgrand/enlive is a very useful library that is often used to scrape the web and transform it into useful data structures in Clojure

👍 1
Baye19:05:49

Thanks, @practicalli-johnny! Will check it out now.

emccue19:05:36

jsoup via whatever means

hiredman20:05:28

Enlive and jsoup are useful for scraping static pages

👍 1
hiredman20:05:03

You may want to check out htmlunit for more complex "modern" pages

👍 1
hiredman20:05:23

But even htmlunit has its limits to how much JavaScript it can handle

hiredman20:05:33

playwright is a useful library for puppeting real browsers, and seems less fussy than setting up selenium

Baye20:05:09

You mean being able to navigate and click on links before scraping?

hiredman21:05:17

depends on what you are scraping, html pages or full on react apps

Baye22:05:26

I mean the main advanced feature I want is the ability to go to sub-pages by clicking on links, all done by the scraper.

hiredman22:05:01

Yeah, enlive and/or jsoup are not that, they are basically just html parsers

hiredman22:05:01

Htmlunit is a sort of headless browser written in java, it can execute JavaScript and you can programmatically click things etc

👍 1
devn02:05:10

a little surprised this didn’t come up

Jon Olick20:05:43

@hiredman how do you suppose you would implement the format command in a C/C++ hosted language without re-implementing all of printf?

andy.fingerhut21:05:36

Are you trying to avoid using an existing C/C++ implementation of printf?

seancorfield23:05:48

@Jon Olick Is this for jo_lisp? I think I would expect it to behave like the C/C++ formatting when there are differences...

seancorfield23:05:19

The core behavior is similar enough I think? (been a while but all the basics are the same, right?)

didibus23:05:18

I think it'd be nice to omit printf-like formatting, and maybe go straight for something like this: https://openjdk.java.net/jeps/8273943 ?

seancorfield23:05:26

@didibus Not relevant to what Jon is working on I suspect -- implementing Clojure in C++ as a hosted language there.

seancorfield23:05:49

I think he's concerned with balancing compatibility with Clojure vs amount of work involved. I wonder how ClojureCLR deals with format?

didibus23:05:00

Oh, I thought he was only inspired by Clojure and diverging. Arguably, if that JEP does happen, Clojure might inadvertently support it from Java 😛

Jon Olick23:05:45

What Sean said

seancorfield23:05:37

Yeah, I'd expect interop in jo_lisp to open up whatever C/C++ has to offer in terms of other string formatting systems 🙂

Jon Olick23:05:58

Mostly, C doesn't have a mechanism to construct a variable argument list to pass to printf

Jon Olick23:05:09

So, I can't really use it easily

didibus23:05:23

In that case, I consider format one of the things I'd expect to be of the host. But I'd expect cl-format to be consistent.

seancorfield23:05:00

Is cl-format widely used? I think I've only ever encountered one code base that used it?

didibus23:05:05

Probably not widely. I use it though, it's a great template language once you know it, very powerful. In CL, it also supports inline variables, which I wish Clojure's implementation also had. Apparently they're open to it, just no one ever finished it.

seancorfield23:05:30

@Jon Olick You could write a dispatch on the number of arguments, up to a reasonable maximum, and support just that "subset" of available calls?
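
The dispatch-on-argument-count idea can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `format_strings` and `MAX_FORMAT_ARGS` are hypothetical names, and every argument is assumed to be a C string, so only `%s` conversions are safe. Supporting mixed types this way would need one case per type combination, which grows quickly.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical upper bound on how many arguments the dispatch supports. */
#define MAX_FORMAT_ARGS 4

/* Forwards argc string arguments to snprintf by switching on the count.
   Assumption: every argument is a C string, so fmt may only use %s.
   Returns what snprintf returns, or -1 if argc exceeds the maximum. */
int format_strings(char *out, size_t cap, const char *fmt,
                   size_t argc, const char *const *argv)
{
    if (argc > MAX_FORMAT_ARGS)
        return -1;

    switch (argc) {
    case 0:
        return snprintf(out, cap, fmt);
    case 1:
        return snprintf(out, cap, fmt, argv[0]);
    case 2:
        return snprintf(out, cap, fmt, argv[0], argv[1]);
    case 3:
        return snprintf(out, cap, fmt, argv[0], argv[1], argv[2]);
    case 4:
        return snprintf(out, cap, fmt, argv[0], argv[1], argv[2], argv[3]);
    default:
        return -1;
    }
}
```

A call such as `format_strings(buf, sizeof buf, "%s runs on %s", 2, args)` reaches the two-argument case; anything beyond `MAX_FORMAT_ARGS` is rejected rather than truncated.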

Jon Olick23:05:08

Yeah that's the alternative in some respects - though limited by type

didibus23:05:44

Would you not have this issue everywhere? Of not supporting var-arg?

Jon Olick23:05:08

Not really no

didibus23:05:31

You can support it for other functions but not printf?

Jon Olick23:05:25

Arguments are passed to native functions in a persistent list

didibus23:05:30

Even though it's implemented in Clojure itself?

(defn printf
  "Prints formatted output, as per format"
  {:added "1.0"
   :static true}
  [fmt & args]
  (print (apply format fmt args)))

Jon Olick23:05:51

So the issue only occurs when you need to forward variable arguments to a C function

Jon Olick23:05:01

Only case that has come up so far is printf

didibus23:05:40

Oh I see, you'd use the C format, and since that's not var-arg.

didibus23:05:49

How does the C format get the values to substitute?

Jon Olick23:05:52

I can just implement it, but was hoping @hiredman had a clever workaround he knew
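
One middle ground, not proposed in the thread, is to scan the format string yourself but hand each individual conversion, with its single argument, to `snprintf`. The runtime then only needs to find where each `%...` specification ends, not reimplement the number formatting. The sketch below assumes a hypothetical tagged `value` type standing in for whatever the interpreter stores, and supports only `%d`, `%f`, `%s` and `%%` with flags, width and precision (no `*`, no length modifiers).

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical tagged value; a real runtime would have its own representation. */
typedef enum { VAL_INT, VAL_DOUBLE, VAL_STR } val_tag;
typedef struct {
    val_tag tag;
    union { long i; double d; const char *s; } as;
} value;

/* Formats fmt into out, consuming one element of argv per conversion.
   Returns the length written, or -1 on overflow, a type mismatch,
   too few arguments, or an unsupported conversion. */
int format_values(char *out, size_t cap, const char *fmt,
                  size_t argc, const value *argv)
{
    size_t len = 0, next = 0;
    if (cap == 0)
        return -1;
    out[0] = '\0';

    while (*fmt) {
        /* Literal character, or "%%" collapsing to a single '%'. */
        if (fmt[0] != '%' || fmt[1] == '%') {
            if (len + 1 >= cap)
                return -1;
            out[len++] = *fmt;
            out[len] = '\0';
            fmt += (fmt[0] == '%') ? 2 : 1;
            continue;
        }

        /* Copy "%" plus flags/width/precision into spec, then append the
           conversion letter that matches the argument's C type. */
        const char *end = fmt + 1 + strspn(fmt + 1, "-+ #0123456789.");
        size_t n = (size_t)(end - fmt);
        char spec[32];
        if (*end == '\0' || n + 3 > sizeof spec || next >= argc)
            return -1;
        memcpy(spec, fmt, n);

        const value *v = &argv[next++];
        int w;
        switch (*end) {
        case 'd':
            if (v->tag != VAL_INT) return -1;
            spec[n] = 'l'; spec[n + 1] = 'd'; spec[n + 2] = '\0';
            w = snprintf(out + len, cap - len, spec, v->as.i);
            break;
        case 'f':
            if (v->tag != VAL_DOUBLE) return -1;
            spec[n] = 'f'; spec[n + 1] = '\0';
            w = snprintf(out + len, cap - len, spec, v->as.d);
            break;
        case 's':
            if (v->tag != VAL_STR) return -1;
            spec[n] = 's'; spec[n + 1] = '\0';
            w = snprintf(out + len, cap - len, spec, v->as.s);
            break;
        default:
            return -1;
        }
        if (w < 0 || (size_t)w >= cap - len)
            return -1;
        len += (size_t)w;
        fmt = end + 1;
    }
    return (int)len;
}
```

Each `snprintf` call here has a fixed, known argument list, so nothing variadic has to be constructed at run time. The cost is that the supported conversions must be enumerated by hand.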

didibus23:05:47

My google does show that C's format is variadic though 😛, but my familiarity with C is low, and I have no clue how you're implementing Clojure to compile to C

didibus23:05:53

Hum, ok I think I understand the issue now having looked more at the C printf. Sean's idea seems maybe best. But I wonder if you could use vprintf instead. The use of vararg means you'd statically know the length, though I'm not sure how you'd apply over it.

Jon Olick01:05:31

yeah, the problem with that approach, from my research, is there is no way to construct a va_list.

Jon Olick01:05:38

(from scratch that is)

didibus01:05:34

I'm sure you've looked into that, va_start and va_end, not sure if they work or not. Just FYI in case you didn't explore those.

Jon Olick01:05:12

yup, pretty sure I covered all the bases. There are non-portable non-standard, architecture specific, OS specific ways of creating a va_list… but f-that
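
For context, a minimal sketch of what `va_start`/`va_end` do allow in standard C: a function that is itself variadic can forward its own arguments to the `v*printf` family. `format_into` is a hypothetical name. The `va_list` can only be initialized from the `...` of the enclosing call, which is why this does not help when the arguments arrive in a runtime list instead.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical wrapper: formats into out like snprintf. It is variadic
   itself, so va_start has a named parameter (fmt) to anchor on. */
int format_into(char *out, size_t cap, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);                    /* valid only inside a variadic function */
    int n = vsnprintf(out, cap, fmt, ap); /* forward the caller's arguments */
    va_end(ap);
    return n;
}
```

The caller still has to write out the arguments at the call site, e.g. `format_into(buf, sizeof buf, "%d-%s", 7, "x")`, so the number and types of arguments are fixed at compile time.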