This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2018-04-09
It's frustrating how pronunciation information is lost when reading text in another language. A bit like how textual source code information is lost in Clojure when you manipulate code forms as data and have lost track of the text that generated them
tsoo-dah (in Polish; I’d expect the English version to be pronounced like the second part of word barracuda)
Q: Imagine users are uploading text content that is clearly following a template, maybe with some variations added by hand to "beat the system". Example:
- User A submits text content Hello, my name is John and I'm here to have fun!
- User B submits text content Hello, my name is Jim and I'm here to have fun!
- both texts follow a similar/identical template, so a flag should be raised, so to speak.
- We cannot know every possible template in advance - every day a new malicious user could appear.
Dunno if this is a generally tractable problem? If yes, which tools would be the best ones for the job?
well, there's no limit to the number of words you can pass to the distance calculation function, but i guess you'd have to see if it works for your use-case
distance("Hello, my name is John and I'm here to have fun!", "Hello, my name is Jim and I'm here to have fun!") would return 3
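The `distance` function mentioned above is not named in the chat; assuming it is a character-level edit distance (Levenshtein), a minimal sketch that reproduces the quoted result could look like this. The function name `levenshtein` is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


text_a = "Hello, my name is John and I'm here to have fun!"
text_b = "Hello, my name is Jim and I'm here to have fun!"
print(levenshtein(text_a, text_b))  # 3 ("ohn" -> "im")
```

The raw number depends on text length, so a filter would probably want to compare it against the length of the texts rather than use a fixed cutoff.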
i swear i've seen a cool article about this where you take the dot product of the two paragraphs. you turn each paragraph into a vector (i forget the exact transformation, i think it's just the ordered union of words and their counts) and the dot product of the two vectors gives you a notion of how similar they are
so then you have all your templates and you see which of the templates are most in the same direction as the text that comes in
A dot product between two unit-scaled vectors will give a value between -1 and 1; parallel in the same direction gives 1, parallel in the opposing direction gives -1, perpendicular gives 0. This is done with mathematical vectors as basically (reduce + (map * v1 v2))
So there's two questions involved in the paragraph conversion: One is how to perform that 'multiply' operation with words / chunks, and the other is how to ensure the components are / can be scaled to a unit vector
If it's not unit-scaled, you'll get some arbitrary value that you won't really be able to interpret well. Frequencies / counts should work, yeah
<bob, greg, has, glasses, too>
<1, 0, 1, 1, 0> * 1/√3
<0, 1, 1, 1, 1> * 1/2
Dot Product: 0 + 0 + 1/(2√3) + 1/(2√3) + 0 = 1/√3
So using that method and flagging anything that scores above a threshold would also catch any sentence that's similarly scrambled, since word order doesn't affect the score: bob glasses has greg too, glasses bob too greg has has, etc.
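A minimal sketch of the unit-scaled dot product (cosine similarity) over word counts, as described above. The two sentences are an assumption reconstructed from the example vectors ("bob has glasses" and "greg has glasses too"), and the helper names are hypothetical.

```python
import math
from collections import Counter


def word_counts(text: str) -> Counter:
    """Bag-of-words vector: each distinct word mapped to its count."""
    return Counter(text.lower().split())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Dot product of the two count vectors after scaling each to unit length."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Matches the worked example: 2 / (sqrt(3) * 2) = 1 / sqrt(3), about 0.577.
print(cosine_similarity("bob has glasses", "greg has glasses too"))
```

Because counts are never negative, scores here fall between 0 and 1, and a reordering of the same words scores 1 (up to floating-point rounding).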
Interesting answers so far, appreciated!
One nuance to keep in mind is that the solution should work in a webapp context. Say there are 10000 texts to compare against
executing a massive DB read + executing the compare(a, b) function 10k times on each tentative insert doesn't sound efficient
Wondering if there's something akin to a hash of a paragraph. Or, something like Elasticsearch providing this feature OOTB
> that's more or less what we are computing
perfect. Wanted to make sure before diving into the impl details 🙂
alternatively, https://marcobonzanini.com/2015/02/09/phrase-match-and-proximity-search-in-elasticsearch/ "Within-Sentence Proximity Search"
Ha, that's what I just wanted to say. AFAIK Elasticsearch supports these use cases more or less
I can’t remember what it is called, but another technique is to compute the set of two-word pairs for each adjacent word in the text, and then compare that distribution of tuples. Computationally much faster and pretty robust in noisy data.
In computational biology, there are n^2 algorithms for perfect string alignment, but they are too slow, so people tend to start with heuristics like the above.
Ngram Statistics? I remember doing something like this in homework using Hadoop MapReduce 😆
oh yea! ngrams. yes i probably had to do homework on this too, but it was in the 90s, so it’s getting quite hazy
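The adjacent-word-pair idea can be sketched as follows. The chat talks about comparing the distribution of pairs; this sketch makes the simplifying assumption of comparing the sets of pairs with Jaccard similarity instead, and the function names are hypothetical.

```python
def word_bigrams(text: str) -> set:
    """Set of adjacent two-word pairs in the text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def bigram_jaccard(text_a: str, text_b: str) -> float:
    """Shared bigrams divided by all distinct bigrams across both texts."""
    bigrams_a, bigrams_b = word_bigrams(text_a), word_bigrams(text_b)
    if not bigrams_a and not bigrams_b:
        return 1.0  # two texts with no bigrams are treated as identical
    return len(bigrams_a & bigrams_b) / len(bigrams_a | bigrams_b)


text_a = "Hello, my name is John and I'm here to have fun!"
text_b = "Hello, my name is Jim and I'm here to have fun!"
# Swapping one name changes two of the ten bigrams: 8 shared / 12 distinct.
print(bigram_jaccard(text_a, text_b))
```

Unlike the bag-of-words dot product, bigrams keep some word-order information, so a fully scrambled sentence scores low here.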
oh well yea then you could probably just stick everything into an elasticsearch index and run the full text of the document as a query, then look at the similarity scores
@vemv I’ve had to do this before. Ended up implementing simhash. http://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf , http://www.wwwconference.org/www2007/papers/paper215.pdf , and http://matpalm.com/resemblance/simhash/ for some reference
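For readers unfamiliar with simhash, here is a rough sketch of the general idea, not the implementation referred to above. It assumes unweighted whitespace-separated words as features and the first 64 bits of an MD5 digest as the per-word hash; both choices, and the function names, are illustrative only.

```python
import hashlib

FINGERPRINT_BITS = 64


def simhash(text: str) -> int:
    """64-bit fingerprint: each word votes +1 or -1 on every bit position."""
    votes = [0] * FINGERPRINT_BITS
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += 1 if (word_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:  # a majority of words had this bit set
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(fingerprint_a: int, fingerprint_b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(fingerprint_a ^ fingerprint_b).count("1")


text_a = "Hello, my name is John and I'm here to have fun!"
text_b = "Hello, my name is Jim and I'm here to have fun!"
print(hamming_distance(simhash(text_a), simhash(text_b)))
```

Texts that share most of their words tend to produce fingerprints that differ in only a few bits, so one fixed-size integer per text can be stored and indexed instead of comparing full texts pairwise on every insert.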
dang it, we have been trying to hide it from you for years!
but seriously, if there’s some place you could have seen it but didn’t, perhaps a request to add it is in order there
Naw, I'd just heard it mentioned but never bothered to look into it. I noticed it come up in CIDER issues and since I wanted to try CIDER but didn't want to dive into straight emacs, this seems like a good compromise. So it's a weird route to approach it from 😅
@seancorfield you mentioned you work in online dating, are you working for a big name or a new thing?
I believe the unique selling point is that they have a lot of websites, each tailored for different markets. e.g. one for uniforms, one for jazz lovers, etc.
@mv Indeed, as @dominicm said, we are World Singles (technically World Singles Networks) and we have about 100 dating sites focused (mostly) on ethnic verticals (http://soulsingles.com, http://italianosingles.com). So the sites (brands) are small and niche but there are a lot of them 🙂
Those two sites are React.js on the front and Clojure REST API on the back (as are several other sites in our portfolio). We still have some of our larger sites such as http://arablounge.com and http://eligiblegreeks.com on our legacy platform (which is, essentially, CFML for the View/Controller and Clojure for the Model).
We have just under 60K lines of production Clojure and just under 20K lines of test code currently.