This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-01-07
Channels
- # announcements (1)
- # babashka (38)
- # beginners (21)
- # calva (1)
- # cider (6)
- # cljsrn (1)
- # clojure-austin (3)
- # clojure-dev (23)
- # clojure-europe (51)
- # clojurescript (2)
- # clr (100)
- # conjure (3)
- # core-typed (3)
- # data-science (2)
- # fulcro (21)
- # joker (1)
- # joyride (1)
- # lsp (7)
- # malli (4)
- # nbb (5)
- # reagent (1)
- # releases (1)
- # shadow-cljs (5)
- # spacemacs (5)
- # squint (5)
- # xtdb (16)
Beginner questions. First task. I want to create a dataset composed of text in PDF files linked on a web page: is there a way to save the data to a database, given it might be cumbersome to save a large number of PDF files on my computer, so I can do NLP training on it? Second task. The same context applies to my second task. In both cases I would like to go to a specific website, crawl page by page, click on the links, download the text from each link, and save the text to a database. Note I am a total beginner. Thanks
Good questions. Clojure is great for these kinds of tasks.

You may want to actually store the PDF/HTML files: extracting data out of HTML and PDF is often not clean. A PDF may contain tables or incorrect paragraph breaks, and HTML may contain menu links, advertisements, etc. So it is very likely that you will want to keep copies of the original PDF/HTML so you can see, understand, and debug issues in the text you have extracted. You can always delete the files at any time as part of your process. Some PDF/HTML parsers can work entirely in memory without the file hitting the disk, but that is not essential: if you have an SSD, then your network connection (downloading the files) and your CPU (parsing the PDF/HTML) will likely be the bottlenecks, not writing/reading/deleting files on your disk.

You will probably want to build a three-step process (following ETL):
1. E - Extract: download the data and spider the links.
2. T - Transform: parse the PDF/HTML and extract the text.
3. L - Load: insert the data into the database.

For Extract, you may want to write custom code using something like https://github.com/dakrone/clj-http to fetch the files, or use a complete scraping library like https://github.com/nathell/skyscraper. For Transform, there are several libraries for extracting text from HTML/PDF; consider Java libraries or command-line tools too, which you can call from Clojure. If you are focused on NLP, you may even want to skip the database entirely and just save the extracted text to your disk, then run your machine learning on it from there.
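The three ETL steps above can be sketched in Clojure. This is a minimal, illustrative sketch, assuming clj-http, Jsoup (for finding PDF links in the page), and Apache PDFBox 2.x (for PDF text extraction) are on the classpath; the function names (`pdf-links`, `run-etl!`, etc.) and the "save text files to disk" Load step are my own choices, not from a specific library:

```clojure
(ns scrape.etl
  (:require [clj-http.client :as http]
            [clojure.java.io :as io])
  (:import (org.jsoup Jsoup)
           (org.apache.pdfbox.pdmodel PDDocument)
           (org.apache.pdfbox.text PDFTextStripper)))

;; E - Extract: fetch the page and collect absolute URLs of PDF links.
(defn pdf-links [page-url]
  (let [doc (.get (Jsoup/connect page-url))]
    (map #(.attr % "abs:href")
         (.select doc "a[href$=.pdf]"))))

;; Download a PDF as bytes and keep a copy on disk for debugging.
(defn download-pdf! [url dest-file]
  (let [pdf-bytes (:body (http/get url {:as :byte-array}))]
    (io/copy pdf-bytes (io/file dest-file))
    pdf-bytes))

;; T - Transform: extract plain text from PDF bytes with PDFBox.
(defn pdf->text [^bytes pdf-bytes]
  (with-open [doc (PDDocument/load pdf-bytes)]
    (.getText (PDFTextStripper.) doc)))

;; L - Load: here we just write one .txt file per PDF; swap in a
;; database insert (e.g. via next.jdbc) if you do want a database.
(defn save-text! [text url out-dir]
  (spit (io/file out-dir (str (hash url) ".txt")) text))

(defn run-etl! [page-url out-dir]
  (doseq [url (pdf-links page-url)]
    (-> (download-pdf! url (io/file out-dir (str (hash url) ".pdf")))
        pdf->text
        (save-text! url out-dir))))
```

Keeping the raw `.pdf` alongside the extracted `.txt` (as the answer suggests) makes it easy to inspect the original file whenever the extracted text looks wrong.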