This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-02-26
Channels
- # aws-lambda (2)
- # beginners (10)
- # boot (17)
- # cider (19)
- # clara (1)
- # cljs-dev (13)
- # cljsjs (22)
- # cljsrn (1)
- # clojure (132)
- # clojure-austin (2)
- # clojure-berlin (2)
- # clojure-dusseldorf (1)
- # clojure-germany (2)
- # clojure-italy (7)
- # clojure-spec (6)
- # clojure-uk (5)
- # clojurescript (45)
- # core-matrix (3)
- # cursive (4)
- # datomic (8)
- # emacs (3)
- # keechma (3)
- # lein-figwheel (1)
- # leiningen (2)
- # lumo (24)
- # nyc (1)
- # off-topic (29)
- # om (68)
- # onyx (5)
- # perun (50)
- # planck (5)
- # protorepl (5)
- # re-frame (128)
- # reagent (10)
- # remote-jobs (1)
- # ring (4)
- # rum (41)
- # untangled (28)
- # yada (4)
someone here was saying that they tried to parse wikipedia and were curious if it's possible... I don't remember who
@ashnur: I was definitely complaining about parsing wikipedia; but it may have been someone else; thanks for sharing this!
but the wtf_wikipedia parser works; it just requires a lot of time and patience to figure out which pages use which templates and to give meaning to the structure
so it becomes: (1) download the docker image, (2) download the data inside the docker image, and (3) have a full local mirror of wikipedia
i was thinking of writing some learning algorithm for it, because i don't want to do it myself 😄 https://www.youtube.com/watch?v=_PwhiWxHK8o
oh man, the README at https://github.com/spencermountain/wtf_wikipedia is great -- I thought it was "wf_wikipedia", but the README makes it clear it's wtf_wikipedia
it's a huge pain on windows, and many people use windows because of enterprise, games, and convenience
would https://www.mediawiki.org/wiki/Parsoid work for you?
yeah, and the fact that it works makes it genius (at least it seems to work: i parsed some of the wikiquote dump and the data is there; you just need to write a third, recursive layer of parser over it, not a big deal 😞)
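A minimal sketch of what that recursive "third layer" might look like, assuming the parser hands back nested JSON-style data; the key names here are placeholders, not wtf_wikipedia's actual schema:

```python
# Hypothetical sketch: walk the nested structure a wikitext parser emits and
# collect every value stored under a given key, recursing through dicts/lists.
def collect(node, key):
    found = []
    if isinstance(node, dict):
        if key in node:
            found.append(node[key])
        for value in node.values():
            found.extend(collect(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect(item, key))
    return found

# toy input shaped like a parsed page (invented for illustration)
parsed = {"sections": [{"text": "quote one", "children": [{"text": "quote two"}]}]}
print(collect(parsed, "text"))   # -> ['quote one', 'quote two']
```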
parsoid sounds interesting from the description, since it gives you the DOM; has it been verified not to work with wikiquote?
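For reference, a hedged sketch of fetching Parsoid-rendered HTML for a Wikiquote page; it assumes the Wikimedia REST endpoint /api/rest_v1/page/html/{title} is exposed for en.wikiquote.org and uses only the Python standard library:

```python
# Sketch: fetch Parsoid-rendered HTML (a DOM you can then walk) for a Wikiquote page.
# Assumption: the /api/rest_v1/page/html/{title} REST endpoint serves Parsoid output.
import urllib.parse
import urllib.request

def fetch_parsoid_html(title: str, wiki: str = "en.wikiquote.org") -> str:
    slug = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = f"https://{wiki}/api/rest_v1/page/html/{slug}"
    req = urllib.request.Request(url, headers={"User-Agent": "wikiquote-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    html = fetch_parsoid_html("Stargate SG-1")
    print(html[:300])
```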
my issue is that there is no flag or anything to signify parts of the dump as "authors" or "quotes"
even when you look at the rendered page, it's sometimes hard to understand what's going on. there are dialogues, background notes, references, it's just a big pile of copypasta 😄
now, if someone has the patience, they can theoretically just make a list of the different templates, write up their features, and write a search over the data to extract the content when those features are present.
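One way that catalogue-and-search idea could look, with template and parameter names invented purely for illustration:

```python
# Sketch: list the templates you care about, note which parameters ("features")
# identify them, then scan parsed template invocations and pull the content
# whenever those features are present. Names below are hypothetical.
TEMPLATE_FEATURES = {
    "quote": {"text", "author"},
    "cite":  {"title", "year"},
}

def extract(templates):
    hits = []
    for tpl in templates:                      # each tpl: {"name": ..., "params": {...}}
        wanted = TEMPLATE_FEATURES.get(tpl["name"])
        if wanted and wanted <= tpl["params"].keys():
            hits.append({k: tpl["params"][k] for k in wanted})
    return hits

sample = [{"name": "quote", "params": {"text": "Indeed.", "author": "Teal'c"}}]
print(extract(sample))
```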
but no, some quotes are not like that: https://en.wikiquote.org/wiki/Stargate_SG-1#8.10
good thing we have these "semantic" html tags like "description list" and "description definition" or whatever dd stands for
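If you do end up walking the rendered DOM, quotes and dialogue lines often sit inside those <dd> elements; here's a small stdlib-only sketch of scooping them up (the sample markup is made up):

```python
# Sketch: collect the text of every <dd> (description definition) element from
# rendered Wikiquote-style HTML using only Python's built-in HTML parser.
from html.parser import HTMLParser

class DdCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "dd":
            self.depth += 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "dd" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:                    # only keep text that sits inside a <dd>
            self.items[-1] += data

parser = DdCollector()
parser.feed("<dl><dd>O'Neill: Indeed.</dd><dd>Teal'c: That is my line.</dd></dl>")
print(parser.items)
```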
regarding extracting data from wiki DOM, I’ve seen https://github.com/molybdenum-99/infoboxer which is in ruby but might be helpful