This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-02-26
Channels
- # aws-lambda (2)
- # beginners (10)
- # boot (17)
- # cider (19)
- # clara (1)
- # cljs-dev (13)
- # cljsjs (22)
- # cljsrn (1)
- # clojure (132)
- # clojure-austin (2)
- # clojure-berlin (2)
- # clojure-dusseldorf (1)
- # clojure-germany (2)
- # clojure-italy (7)
- # clojure-spec (6)
- # clojure-uk (5)
- # clojurescript (45)
- # core-matrix (3)
- # cursive (4)
- # datomic (8)
- # emacs (3)
- # keechma (3)
- # lein-figwheel (1)
- # leiningen (2)
- # lumo (24)
- # nyc (1)
- # off-topic (29)
- # om (68)
- # onyx (5)
- # perun (50)
- # planck (5)
- # protorepl (5)
- # re-frame (128)
- # reagent (10)
- # remote-jobs (1)
- # ring (4)
- # rum (41)
- # untangled (28)
- # yada (4)
someone here was saying that they tried to parse wikipedia and were curious if it's possible... I don't remember who
@ashnur: I was definitely complaining about parsing wikipedia; but it may have been someone else; thanks for sharing this!
but the wtf_wikipedia parser works; it just requires a lot of time and patience to figure out which pages use which templates and to give meaning to the structure
so it becomes: (1) download the docker image, (2) download the data inside the docker image, and (3) have a full local mirror of wikipedia
i was thinking of writing some learning algorithm for it, because i don't want to do it myself 😄 https://www.youtube.com/watch?v=_PwhiWxHK8o
oh man, the README at https://github.com/spencermountain/wtf_wikipedia is great -- I thought it was "wf_wikipedia", but the README makes it clear it's wtf_wikipedia
it's a huge pain on windows, and many people use windows because of enterprise, games, and convenience
would https://www.mediawiki.org/wiki/Parsoid work for you?
yeah, and the fact that it works makes it genius (at least it seems to work: i parsed some of the wikiquote dump and the data is there; you just need to write a third, recursive layer of parser over it, not a big deal 😞)
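A minimal sketch of what that recursive "third layer" might look like, assuming the parser hands back nested JSON-style data; the key names here are placeholders, not wtf_wikipedia's actual schema:

```python
# Hypothetical sketch: walk the nested structure a wikitext parser emits and
# collect every value stored under a given key, recursing through dicts/lists.
def collect(node, key):
    found = []
    if isinstance(node, dict):
        if key in node:
            found.append(node[key])
        for value in node.values():
            found.extend(collect(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect(item, key))
    return found

# toy input shaped like a parsed page (invented for illustration)
parsed = {"sections": [{"text": "quote one", "children": [{"text": "quote two"}]}]}
print(collect(parsed, "text"))   # -> ['quote one', 'quote two']
```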
parsoid sounds interesting from the description, since it gives you the DOM; has it been verified not to work with wikiquote?
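For reference, a hedged sketch of fetching Parsoid-rendered HTML for a Wikiquote page; it assumes the Wikimedia REST endpoint /api/rest_v1/page/html/{title} is exposed for en.wikiquote.org and uses only the Python standard library:

```python
# Sketch: fetch Parsoid-rendered HTML (a DOM you can then walk) for a Wikiquote page.
# Assumption: the /api/rest_v1/page/html/{title} REST endpoint serves Parsoid output.
import urllib.parse
import urllib.request

def fetch_parsoid_html(title: str, wiki: str = "en.wikiquote.org") -> str:
    slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = f"https://{wiki}/api/rest_v1/page/html/{slug}"
    req = urllib.request.Request(url, headers={"User-Agent": "wikiquote-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    html = fetch_parsoid_html("Stargate SG-1")
    print(html[:300])
```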
my issue is that there is no flag or anything to signify parts of the dump as "authors" or "quotes"
even when you look at the rendered page, it's sometimes hard to understand what's going on. there are dialogues, background notes, references, it's just a big pile of copypasta 😄
now, if someone has the patience, they can theoretically just make a list of the different templates, write up their features, and write a search over the data to extract the content when those features are present.
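One way that catalogue-and-search idea could look, with template and parameter names invented purely for illustration:

```python
# Sketch: list the templates you care about, note which parameters ("features")
# identify them, then scan parsed template invocations and pull the content
# whenever those features are present. Names below are hypothetical.
TEMPLATE_FEATURES = {
    "quote": {"text", "author"},
    "cite":  {"title", "year"},
}

def extract(templates):
    hits = []
    for tpl in templates:                      # each tpl: {"name": ..., "params": {...}}
        wanted = TEMPLATE_FEATURES.get(tpl["name"])
        if wanted and wanted <= tpl["params"].keys():
            hits.append({k: tpl["params"][k] for k in wanted})
    return hits

sample = [{"name": "quote", "params": {"text": "Indeed.", "author": "Teal'c"}}]
print(extract(sample))
```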
but no, some quotes are not like that: https://en.wikiquote.org/wiki/Stargate_SG-1#8.10
good thing we have these "semantic" html tags like "description list" and "description definition" or whatever dd stands for
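If you do end up walking the rendered DOM, quotes and dialogue lines often sit inside those <dd> elements; here's a small stdlib-only sketch of scooping them up (the sample markup is made up):

```python
# Sketch: collect the text of every <dd> (description definition) element from
# rendered Wikiquote-style HTML using only Python's built-in HTML parser.
from html.parser import HTMLParser

class DdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "dd" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:                    # only keep text that sits inside a <dd>
            self.items[-1] += data

parser = DdCollector()
parser.feed("<dl><dd>O'Neill: Indeed.</dd><dd>Teal'c: That is my line.</dd></dl>")
print(parser.items)
```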
regarding extracting data from wiki DOM, I’ve seen https://github.com/molybdenum-99/infoboxer which is in ruby but might be helpful