#clojure-nlp
2023-03-21
simongray 08:03:58

When scraping a website for document-like data to build a dataset, I wonder how much of the HTML structure it really makes sense to keep. Could converting just the relevant portion into Markdown be a good idea…?

schaueho 09:03:28

I guess it depends on what you want to use the dataset for and whether any semantic markup would be beneficial for further analysis (I mean, if most of the content is in divs, then there aren't many semantics you can extract anyway). On the flip side, for some use cases it might be beneficial to exploit the HTML structure to throw away some content (e.g. tables or code blocks).
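To make that idea concrete, here is a minimal sketch (in Python, standard library only) of keeping a little semantic structure as Markdown while throwing away tables and code blocks. The tag sets and the names `MarkdownExtractor` and `html_to_markdown` are assumptions made for illustration, not anything proposed in the thread, and real pages would likely need a proper HTML library and more careful handling of malformed markup.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch: which elements to drop entirely
# and which block-level elements to carry over into Markdown.
SKIP_TAGS = {"table", "pre", "script", "style"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li"} | set(HEADING_PREFIX)


class MarkdownExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks, in document order
        self.current = []    # text fragments of the block being built
        self.prefix = ""     # Markdown prefix for the current block
        self.skip_depth = 0  # > 0 while inside a skipped element

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> or <a> are ignored; only their text is kept.
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block element
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown` on a scraped fragment returns headings, paragraphs and list items as Markdown, with everything inside the skipped elements left out. Whether dropping tables and code blocks is the right call depends on the dataset's purpose, as noted above.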

curtosis 14:03:28

Yeah, it’s probably highly dependent on the specific content and context. Some sites may have semantic elements or attributes that would be useful, but in others it’ll just be noise. Tables are a great example: it could really help to know that these text elements are cells in a tabular presentation of data, but the table could also be there purely for (visual) layout purposes.
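One rough way to tell the two kinds of table apart is to look for markup that only makes sense for tabular data. The sketch below (Python, standard library only) labels a top-level table as "data" if it contains a `th` or `caption` element and is not marked `role="presentation"` or `role="none"`, and as "layout" otherwise. This is an assumed heuristic for illustration, and the names `TableClassifier` and `classify_tables` are hypothetical; plenty of real sites will break it in both directions.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.results = []  # one "data"/"layout" label per top-level table
        self.depth = 0     # current nesting depth of <table> elements
        self.has_header = False
        self.is_presentational = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth == 0:
                # Starting a new top-level table: reset the evidence.
                self.has_header = False
                role = dict(attrs).get("role")
                self.is_presentational = role in ("presentation", "none")
            self.depth += 1
        elif self.depth == 1 and tag in ("th", "caption"):
            # Only count header markup that belongs to the top-level table,
            # not to tables nested inside it.
            self.has_header = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                is_data = self.has_header and not self.is_presentational
                self.results.append("data" if is_data else "layout")


def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.results
```

A scraper could use the labels to keep "data" tables (for example as Markdown tables) and flatten or discard "layout" tables.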