


It's frustrating how pronunciation information is lost when reading text in another language. A bit like how textual source-code information is lost in Clojure when you manipulate code forms as data and lose track of the text that generated them


tsoo-dah (in Polish; I’d expect the English version to be pronounced like the second part of the word barracuda)


information is lost; otoh bilingual puns are the best


(a minute of silence to the memory of Osram)


Q: Imagine users are uploading text content that clearly follows a template, maybe with some variations added by hand to "beat the system". Example:
- User A submits text content Hello, my name is John and I'm here to have fun!
- User B submits text content Hello, my name is Jim and I'm here to have fun!
- Both texts follow a similar/identical template, so a flag should be raised, so to speak.
- We cannot know every possible template in advance - every day a new malicious user could appear.
Dunno if this is a generally tractable problem? If yes, which tools would be the best ones for the job?


Levenshtein distance?


Does it work well for comparing two paragraphs? Let's say with ~300 words each


well, there's no limit to the number of words you can pass to the distance calculation function, but i guess you'd have to see if it works for your use-case


the only thing you get is the number of changes between the two texts


distance("Hello, my name is John and I'm here to have fun!", "Hello, my name is Jim and I'm here to have fun!") would return 3
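
In Clojure that distance function could be sketched like this (a minimal dynamic-programming version, not optimized for long texts; the name `levenshtein` is my own):

```clojure
(defn levenshtein
  "Number of single-character insertions, deletions and substitutions
  needed to turn string a into string b. O(m*n) time."
  [a b]
  (let [b (vec b)]
    (peek
     (reduce
      (fn [prev x]
        (reduce
         (fn [row j]
           (conj row (min (inc (peek row))              ; insertion
                          (inc (nth prev (inc j)))      ; deletion
                          (+ (nth prev j)               ; substitution
                             (if (= x (nth b j)) 0 1)))))
         [(inc (first prev))]
         (range (count b))))
      (vec (range (inc (count b))))
      a))))
```

With the two example sentences it returns 3, since only John → Jim differs.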


i swear i've seen a cool article about this where you take the dot product of the two paragraphs. a dot product of two vectors gives you a notion of how similar the vectors are (i forget the exact paragraph-to-vector transformation; i think it's just the ordered union of words and their counts)


so then you have all your templates and you see which of the templates are most in the same direction as the text that comes in


not what i was remembering but explains it a bit


A dot product between two unit-scaled vectors will give a value between -1 and 1; parallel in the same direction gives 1, parallel in the opposing direction gives -1, perpendicular gives 0. This is done with mathematical vectors as basically (reduce + (map * v1 v2))


So there are two questions involved in the paragraph conversion: one is how to perform that 'multiply' operation with words / chunks, and the other is how to ensure the components are / can be scaled to a unit vector


i've seen it with word frequencies


just union the set of words together and then dot product their word counts


If it's not unit-scaled, you'll get some arbitrary value that you won't really be able to interpret well. Frequencies / counts should work, yeah


"bob has glasses" "greg has glasses too"


<bob, greg, has, glasses, too>
<1, 0, 1, 1, 0> * 1/√3
<0, 1, 1, 1, 1> * 1/2
Dot Product: 0 + 0 + 1/2√3 + 1/2√3 + 0 = 1/√3
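
That worked example can be sketched in Clojure like so (`word-counts` and `cosine-similarity` are names of my own choosing; this is a sketch that assumes non-empty, whitespace-separated text):

```clojure
(require '[clojure.string :as str])

(defn word-counts
  "Bag-of-words vector: word -> count."
  [text]
  (frequencies (str/split (str/lower-case text) #"\s+")))

(defn cosine-similarity
  "Dot product of the two unit-scaled word-count vectors: 1.0 means
  identical word distributions, 0.0 means no words in common."
  [a b]
  (let [ca    (word-counts a)
        cb    (word-counts b)
        words (into (set (keys ca)) (keys cb))   ; union of both vocabularies
        va    (map #(get ca % 0) words)
        vb    (map #(get cb % 0) words)
        dot   (reduce + (map * va vb))
        norm  (fn [v] (Math/sqrt (reduce + (map * v v))))]
    (/ dot (* (norm va) (norm vb)))))
```

`(cosine-similarity "bob has glasses" "greg has glasses too")` gives 1/√3 ≈ 0.577, matching the hand calculation above.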


So using that method and flagging anything above a score threshold would catch any sentence that's similarly scrambled: bob glasses has greg too, glasses bob too greg has has, etc.


If order is important, you'd have to find some other technique...


or mix and match multiple types of detection


Interesting answers so far, appreciated! One nuance to keep in mind is that the solution should work in a webapp context. Say there are 10000 texts to compare against; executing a massive DB read + running the compare(a, b) function 10k times on each tentative insert doesn't sound efficient


Wondering if there's something akin to a hash of a paragraph. Or, something like Elasticsearch providing this feature OOTB


that's more or less what we are computing


a traditional hash is no good though since it doesn't preserve the metric


close sources hash to close hashes
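
That property — similar inputs hashing to similar outputs — is what locality-sensitive hashing gives you. A minimal MinHash sketch (all names are mine, and Clojure's built-in `hash` stands in for a proper hash family, so treat this as illustrative rather than production-grade):

```clojure
(require '[clojure.string :as str])

(def seeds (range 64)) ; 64 simulated hash functions; more seeds = better estimate

(defn shingles
  "Set of adjacent word pairs in the text."
  [text]
  (set (partition 2 1 (str/split (str/lower-case text) #"\s+"))))

(defn signature
  "Fixed-size MinHash signature: for each seed, the minimum hash over
  all shingles. Similar texts end up sharing many of these minima."
  [text]
  (mapv (fn [seed]
          (reduce min (map #(hash [seed %]) (shingles text))))
        seeds))

(defn similarity
  "Fraction of matching signature slots; approximates the Jaccard
  similarity of the two shingle sets."
  [sig-a sig-b]
  (double (/ (count (filter true? (map = sig-a sig-b)))
             (count sig-a))))
```

The point for the webapp case: you store only the fixed-size signatures, so each incoming text is compared against small vectors rather than re-reading and re-scanning full documents.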


> that's more or less what we are computing

perfect. Wanted to make sure before diving into the impl details 🙂


Ha, that's what I just wanted to say. AFAIK elasticsearch supports these use cases more or less


I can’t remember what it is called, but another technique is to compute the set of pairs of adjacent words in the text, and then compare that distribution of tuples. Computationally much faster and pretty robust on noisy data.
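
A sketch of that tuple-set comparison (the thread later identifies the pairs as word bigrams; comparing the sets via Jaccard overlap is my choice here, not necessarily the article's):

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn word-bigrams
  "Set of adjacent word pairs, e.g. \"a b c\" -> #{(a b) (b c)}."
  [text]
  (set (partition 2 1 (str/split (str/lower-case text) #"\s+"))))

(defn jaccard
  "Shared bigrams divided by total distinct bigrams, in [0, 1]."
  [a b]
  (let [sa (word-bigrams a)
        sb (word-bigrams b)]
    (double (/ (count (set/intersection sa sb))
               (count (set/union sa sb))))))
```

Because it only looks at local word pairs, a one-word edit in a long template only disturbs the two bigrams touching that word, which is why it holds up well on noisy data.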

👌 4

In computational biology, there are n^2 algorithms for perfect string alignment, but they are too slow, so people tend to start with heuristics like the above.


Ngram Statistics? I remember doing something like this in homework using Hadoop MapReduce 😆


oh yea! ngrams. yes i probably had to do homework on this too, but it was in the 90s, so it’s getting quite hazy


cool stuff suggested over here, knew Clojurians was the place to go 😜 thanks all!


Ahh yes, ngrams. Elasticsearch should have ngram analysis as well, iirc?


yes it does, but i believe vemv wanted a browser-side solution (?)




oh well yea then you could probably just stick everything into an elasticsearch index, run the full text of the document as a query, then look at the similarity scores

👍 8

Elasticsearch is like sprinkling magic on these types of problems, hah


Bonus points if you use spandex with elasticsearch ;)
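
A sketch of what that could look like with spandex, using Elasticsearch's built-in more_like_this query. The index name "texts", the "body" field, and the local host URL are all assumptions for illustration, not anything from this thread:

```clojure
(require '[qbits.spandex :as s])

;; Assumes a local Elasticsearch node and an index named "texts"
;; whose documents have a "body" field (both made up for this sketch).
(def client (s/client {:hosts ["http://localhost:9200"]}))

(defn similar-texts
  "Ask ES for documents resembling `text` via its more_like_this
  query, returning the scored hits."
  [text]
  (-> (s/request client
                 {:url    [:texts :_search]
                  :method :get
                  :body   {:query {:more_like_this
                                   {:fields        ["body"]
                                    :like          text
                                    :min_term_freq 1
                                    :min_doc_freq  1}}}})
      :body :hits :hits))
```

Each hit comes back with a relevance score, so "raise a flag" becomes "flag anything whose top score exceeds a threshold".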

👍 4

Unrelated - how have I never heard of spacemacs before??

Alex Miller (Clojure team) 17:04:30

dang it, we have been trying to hide it from you for years!



Alex Miller (Clojure team) 17:04:56

but seriously, if there’s some place you could have seen it but didn’t, perhaps a request to add it is in order there


Naw, I'd just heard it mentioned but never bothered to look into it. I noticed it come up in CIDER issues and since I wanted to try CIDER but didn't want to dive into straight emacs, this seems like a good compromise. So it's a weird route to approach it from 😅


@seancorfield you mentioned you work in online dating, are you working for a big name or a new thing?


World Singles is the company, it's big from what I know


Never heard of it


Wonder how it is different


I believe the unique selling point is that they have a lot of websites, each tailored to a different market, e.g. one for uniforms, one for jazz lovers, etc.


immutable singles, monadic singles…

😀 8

newtype Couple a = Single a ⨯ Single a


@mv Indeed, as @dominicm said, we are World Singles (technically World Singles Networks) and we have about 100 dating sites focused (mostly) on ethnic verticals. So the sites (brands) are small and niche but there are a lot of them 🙂


Those two sites are React.js on the front and a Clojure REST API on the back (as are several other sites in our portfolio). We still have some of our larger sites on our legacy platform (which is, essentially, CFML for the View/Controller and Clojure for the Model).


We have just under 60K lines of production Clojure and just under 20K lines of test code currently.


Interesting. How many engineers to manage that?