off-topic 2018-04-09 | Slack Archive

coo-dah?

It's frustrating how pronunciation information is lost when reading text in another language. A bit like textual source code information is lost in Clojure, when manipulating code forms as data, and having lost track of the text that generated them

3Jane10:04:21

tsoo-dah (in Polish; I’d expect the English version to be pronounced like the second part of word barracuda)

3Jane10:04:44

information is lost; otoh bilingual puns are the best

3Jane10:04:27

(a minute of silence to the memory of Osram)

vemv15:04:00

Q: Imagine users are uploading text content that is clearly following a template, maybe with some variations added by hand to "beat the system". Example:
- User A submits text content: Hello, my name is John and I'm here to have fun!
- User B submits text content: Hello, my name is Jim and I'm here to have fun!
- Both texts follow a similar/identical template, so a flag should be raised, so to speak.
- We cannot know every possible template in advance: every day a new malicious user could appear.
Dunno if this is a generally tractable problem? If yes, which tools would be the best ones for the job?

sundarj15:04:55

Levenshtein distance?

vemv15:04:50

Does it work well for comparing two paragraphs? Let's say with ~300 words each

sundarj15:04:39

well, there's no limit to the number of words you can pass to the distance calculation function, but i guess you'd have to see if it works for your use-case

sundarj15:04:09

the only thing you get is the number of changes between the two texts

sundarj15:04:20

distance("Hello, my name is John and I'm here to have fun!", "Hello, my name is Jim and I'm here to have fun!") would return 3
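For reference, the `distance` call above can be sketched as a plain dynamic-programming Levenshtein function. This is a minimal illustration, not a specific library's API; the names `levenshtein` and `normalized_distance` are made up here, and the normalization by the longer length is an assumption about how one might compare paragraphs of different sizes.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or substitutions
    needed to turn a into b."""
    # prev[j] is the distance between the prefix of a processed so far
    # (before the current character) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i characters of a into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance into [0, 1] by the longer length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


a = "Hello, my name is John and I'm here to have fun!"
b = "Hello, my name is Jim and I'm here to have fun!"
print(levenshtein(a, b))  # 3, matching the message above
```

Note that this runs in time proportional to the product of the two lengths, so it gets expensive when every new text must be compared against many stored ones.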

dpsutton15:04:59

i swear i've seen a cool article about this where you take the dot product of the two paragraphs. you turn each paragraph into a vector (i forget the exact transformation; i think it's just the ordered union of words and their counts), take the dot product of the two vectors, and that gives you a notion of how similar the vectors are

dpsutton15:04:23

so then you have all your templates and you see which of the templates are most in the same direction as the text that comes in

dpsutton15:04:35

https://nlp.stanford.edu/IR-book/html/htmledition/dot-products-1.html

dpsutton15:04:47

not what i was remembering but explains it a bit

fellshard15:04:09

A dot product between two unit-scaled vectors will give a value between -1 and 1; parallel in the same direction gives 1, parallel in the opposing direction gives -1, perpendicular gives 0. This is done with mathematical vectors as basically (reduce + (map * v1 v2))

fellshard15:04:02

So there's two questions involved in the paragraph conversion: One is how to perform that 'multiply' operation with words / chunks, and the other is how to ensure the components are / can be scaled to a unit vector

dpsutton15:04:39

i've seen it with word frequencies

dpsutton15:04:49

just union the set of words together and then dot product their word counts

fellshard15:04:17

If it's not unit-scaled, you'll get some arbitrary value that you won't really be able to interpret well. Frequencies / counts should work, yeah

dpsutton15:04:28

"bob has glasses" "greg has glasses too"

fellshard15:04:56

<bob, greg, has, glasses, too>
<1, 0, 1, 1, 0> * 1/√3
<0, 1, 1, 1, 1> * 1/2
Dot Product: 0 + 0 + 1/2√3 + 1/2√3 + 0 = 1/√3
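The worked example above can be reproduced with a short word-count sketch. It assumes whitespace tokenization and lowercasing, which nobody in the thread specified; `word_counts` and `cosine_similarity` are illustrative names, not an existing API.

```python
import math
from collections import Counter


def word_counts(text: str) -> Counter:
    # Assumption: lowercase and split on whitespace; punctuation is left as-is.
    return Counter(text.lower().split())


def cosine_similarity(a: str, b: str) -> float:
    """Dot product of the two word-count vectors after scaling each to unit length."""
    ca, cb = word_counts(a), word_counts(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(n * n for n in ca.values()))
    norm_b = math.sqrt(sum(n * n for n in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


score = cosine_similarity("bob has glasses", "greg has glasses too")
print(score, 1 / math.sqrt(3))  # both are about 0.577
```

Because counts are never negative, scores from this sketch fall between 0 and 1 rather than the full -1 to 1 range.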

fellshard15:04:48

So using that method and flagging anything that scores above a chosen threshold would also catch any sentence that's just a reordering of the same words: bob glasses has greg too, glasses bob too greg has has, etc.

fellshard15:04:12

If order is important, you'd have to find some other technique...

fellshard15:04:32

or mix and match multiple types of detection

vemv15:04:43

Interesting answers so far, appreciated! One nuance to keep in mind is that the solution should work in a webapp context. Say there are 10000 texts to compare against: executing a massive DB read + executing the compare(a, b) function 10k times on each tentative insert doesn't sound efficient

vemv15:04:50

Wondering if there's something akin to a hash of a paragraph. Or, something like Elasticsearch providing this feature OOTB

dpsutton15:04:55

that's more or less what we are computing

dpsutton15:04:10

a traditional hash is no good though since it doesn't preserve the metric

dpsutton15:04:17

close sources hash to close hashes

vemv15:04:44

> that's more or less what we are computing
perfect. Wanted to make sure before diving into the impl details 🙂

joelsanchez15:04:22

alternatively, https://marcobonzanini.com/2015/02/09/phrase-match-and-proximity-search-in-elasticsearch/ "Within-Sentence Proximity Search"

👌 4

sveri15:04:48

Ha, that's what I just wanted to say, AFAIK elasticsearch supports these use cases more or less

justinlee16:04:49

I can’t remember what it is called, but another technique is to compute the set of adjacent two-word pairs in each text, and then compare the distributions of those pairs. Computationally much faster and pretty robust with noisy data.

👌 4

justinlee16:04:44

In computational biology, there are n^2 algorithms for perfect string alignment, but they are too slow, so people tend to start with heuristics like the above.

akiroz16:04:04

Ngram Statistics? I remember doing something like this in homework using Hadoop MapReduce 😆

justinlee16:04:13

oh yea! ngrams. yes i probably had to do homework on this too, but it was in the 90s, so it’s getting quite hazy
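A rough sketch of the adjacent-word-pair idea, for illustration only. It compares the sets of word bigrams with Jaccard overlap, which is a simplification swapped in here; comparing full frequency distributions, as described above, would need counts rather than sets. The function names are hypothetical.

```python
def word_bigrams(text: str) -> set:
    # Assumption: lowercase and split on whitespace, then pair each word
    # with the word that follows it.
    words = text.lower().split()
    return set(zip(words, words[1:]))


def bigram_jaccard(a: str, b: str) -> float:
    """Share of distinct bigrams that the two texts have in common."""
    ga, gb = word_bigrams(a), word_bigrams(b)
    union = ga | gb
    if not union:
        return 0.0  # neither text has two words to pair up
    return len(ga & gb) / len(union)


a = "Hello, my name is John and I'm here to have fun!"
b = "Hello, my name is Jim and I'm here to have fun!"
print(bigram_jaccard(a, b))  # 8 shared bigrams out of 12 distinct ones
```

Unlike the word-count dot product, this is sensitive to order: "bob has glasses" and "glasses has bob" share no bigrams at all.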

vemv16:04:55

cool stuff suggested over here, knew Clojurians was the place to go 😜 thanks all!

fellshard16:04:32

Ahh yes, ngrams. Elasticsearch should have ngram analysis as well, iirc?

justinlee16:04:05

yes it does, but i believe vemv wanted a browser-side solution (?)

vemv17:04:54

server-side

justinlee17:04:07

oh well yea then you could probably just stick everything into an elasticsearch index and run the full text of the document as a query, then look at the similarity scores

👍 8

fellshard17:04:57

Elasticsearch is like sprinkling magic on these types of problems, hah

jakemcc17:04:22

@vemv I’ve had to do this before. Ended up implementing simhash. http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf , http://www.wwwconference.org/www2007/papers/paper215.pdf , and http://matpalm.com/resemblance/simhash/ for some reference

🍻 4
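To give a feel for the simhash idea, here is a simplified sketch, not the exact scheme from the linked papers. It assumes unweighted whitespace tokens and uses the first 8 bytes of MD5 as a stable 64-bit token hash; both choices are assumptions made for illustration, as are the function names.

```python
import hashlib

BITS = 64


def token_hash(token: str) -> int:
    # Assumption: first 8 bytes of MD5, used only as a stable 64-bit hash.
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Fingerprint where similar token sets tend to yield similar bit patterns."""
    weights = [0] * BITS
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:  # the majority vote decides the bit
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(x ^ y).count("1")


a = simhash("Hello, my name is John and I'm here to have fun!")
b = simhash("Hello, my name is Jim and I'm here to have fun!")
print(hamming_distance(a, b))  # near-duplicates usually differ in few bits
```

The appeal for the webapp case is that each stored text needs only one 64-bit fingerprint, so a new submission is hashed once and compared by Hamming distance instead of rerunning a pairwise text comparison against every stored document.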

mpenet17:04:43

Bonus points if you use spandex with elasticsearch ;)

👍 4

fellshard17:04:53

Unrelated - how have I never heard of spacemacs before??

Alex Miller (Clojure team)17:04:30

dang it, we have been trying to hide it from you for years!

fellshard17:04:42

:fist-shake:

Alex Miller (Clojure team)17:04:56

but seriously, if there’s some place you could have seen it but didn’t, perhaps a request to add it is in order there

fellshard17:04:07

Naw, I'd just heard it mentioned but never bothered to look into it. I noticed it come up in CIDER issues and since I wanted to try CIDER but didn't want to dive into straight emacs, this seems like a good compromise. So it's a weird route to approach it from 😅

mv19:04:17

@seancorfield you mentioned you work in online dating, are you working for a big name or a new thing?

dominicm19:04:15

World Singles is the company, it's big from what I know

mv19:04:50

Never heard of it

mv19:04:11

Wonder how it is different

dominicm19:04:40

I believe the unique selling point is that they have a lot of websites, each tailored for different markets. e.g. one for uniforms, one for jazz lovers, etc.

justinlee19:04:07

immutable singles, monadic singles…

😀 8

fellshard20:04:54

newtype Couple a = Single a ⨯ Single a

seancorfield22:04:51

@mv Indeed, as @dominicm said, we are World Singles (technically World Singles Networks) and we have about 100 dating sites focused (mostly) on ethnic verticals (http://soulsingles.com, http://italianosingles.com). So the sites (brands) are small and niche but there are a lot of them 🙂

seancorfield22:04:38

Those two sites are React.js on the front and Clojure REST API on the back (as are several other sites in our portfolio). We still have some of our larger sites such as http://arablounge.com and http://eligiblegreeks.com on our legacy platform (which is, essentially, CFML for the View/Controller and Clojure for the Model).

seancorfield22:04:38

We have just under 60K lines of production Clojure and just under 20K lines of test code currently.

mv22:04:16

Interesting. How many engineers to manage that?
