
I’m working on entity resolution in an enterprise (300,000 person entities). Any advice or literature? The primary goal is identifying people, and the secondary goal is identifying departments (n ≈ 500).


@bocaj I have worked on a couple of entity resolution products at a similar scale. Depending on your requirements, the system will range from a simple ML model with a few graph computations to something much more complex. If you are looking for an introductory example that would suit a simple use case, you can check out this talk from the 2018 Spark Summit where my former colleague describes the system we built together. It isn't Clojure and no code was open sourced, but hopefully it is still useful.


The TL;DR is:
1. Develop a binary classifier that predicts whether two records refer to the same entity. Your features will probably be derived from comparisons between the records (edit distance, cosine similarity of embeddings, etc.).
2. Use locality-sensitive hashing to bucket your data points into large partitions, so that similar data points are likely to land in the same partition.
3. Score all intra-partition pairs with the classifier and create a graph with one node per record and an edge between each pair of records that the classifier predicted as positive.
4. Use a graph computation algorithm to identify the connected components of the graph. Each one can be thought of as a "linked" or "resolved" entity. That is, all records in the same connected component can be treated as the same entity.
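As a rough illustration of how those four steps fit together, here is a minimal single-machine Python sketch. It is not the system from the talk: the `same_entity` threshold rule is only a stand-in for a trained binary classifier, the MinHash bucketing over character trigrams is one simple form of locality-sensitive hashing, and the record layout (a dict with a `name` field) and all function names are assumptions made for the example.

```python
import hashlib
from difflib import SequenceMatcher
from itertools import combinations


def shingles(text, n=3):
    """Character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_keys(text, num_hashes=4):
    """One bucket key per seeded hash: the minimum hash over all shingles.
    Records with similar shingle sets are likely to share at least one key."""
    return [
        (seed, min(hashlib.md5(f"{seed}:{s}".encode()).hexdigest()
                   for s in shingles(text)))
        for seed in range(num_hashes)
    ]


def same_entity(a, b, threshold=0.8):
    """Stand-in for step 1. A real system would use a trained classifier
    over several comparison features; this uses one string similarity."""
    sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return sim >= threshold


def resolve(records):
    # Step 2: bucket records so only likely matches get compared.
    buckets = {}
    for idx, rec in enumerate(records):
        for key in minhash_keys(rec["name"]):
            buckets.setdefault(key, set()).add(idx)

    # Union-find structure that holds the graph's connected components.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 3: score each intra-bucket pair once; a positive is an edge.
    scored = set()
    for members in buckets.values():
        for i, j in combinations(sorted(members), 2):
            if (i, j) in scored:
                continue
            scored.add((i, j))
            if same_entity(records[i], records[j]):
                parent[find(i)] = find(j)

    # Step 4: each connected component is one resolved entity.
    components = {}
    for idx in range(len(records)):
        components.setdefault(find(idx), []).append(idx)
    return sorted(components.values())


records = [
    {"name": "Jonathan Smith"},
    {"name": "jonathan smith"},
    {"name": "Maria Garcia"},
]
print(resolve(records))
```

Because connected components are transitive, a single false-positive edge can merge two otherwise separate entities, so the classifier threshold deserves careful tuning.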


Entity resolution is a very popular subject at the Spark Summit. Many companies with large tech teams have presented very sophisticated architectures there, if you are into that sort of thing. I would recommend watching a few to see if you spot a design that sounds like it might fit your needs. There may even be some Scala tech that you could easily call from Clojure. All the talks from previous years are available on YouTube, if I recall correctly. It's a really cool area that I enjoy working in. Have fun!


@erp12 thanks for the summary! I'm partway through a simple version of the connected components approach you mention, but I've skipped steps 1 to 3 by relying strictly on enterprise unique IDs (email, employee ID, etc.). I've hit edge cases, of course, so I'm redesigning a bit. Thanks for the advice, I see the next area to explore now.
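For readers following along, the identifier-only variant described here could look roughly like the Python sketch below: records that share any exact identifier value are unioned into one component, with no classifier or hashing involved. The field names `email` and `employee_id`, the normalization, and the function name are assumptions for illustration, not the poster's actual code.

```python
from collections import defaultdict


def link_by_identifiers(records, id_fields=("email", "employee_id")):
    """Group record indexes that share any identifier value."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # (field, normalized value) -> first record index
    for idx, rec in enumerate(records):
        for field in id_fields:
            value = rec.get(field)
            if not value:
                continue  # missing identifiers never link records
            key = (field, str(value).strip().lower())
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


records = [
    {"email": "a@example.com", "employee_id": None},
    {"email": "A@example.com", "employee_id": 17},
    {"email": None, "employee_id": 17},
    {"email": "b@example.com"},
]
print(link_by_identifiers(records))  # [[0, 1, 2], [3]]
```

One kind of edge case a sketch like this is prone to: any identifier value that is not truly unique to a person (a shared mailbox, say) would pull unrelated records into a single component.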


@erp12 I watched the talk, and it was very helpful. When using the linked records, did you maintain a history of global IDs for users, for example when two connected components became one because the model learned more compared to the previous day’s run? Maybe this isn’t necessary for analytics users to know.
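To make the merge scenario in this question concrete, here is a small Python sketch of one possible bookkeeping scheme; it is not something described in the thread or the talk. Each of today's components inherits the smallest global ID its members held yesterday, and any other previous IDs are logged as retired. The function name, the `(retired_id, surviving_id)` log format, and the integer IDs are all hypothetical, and the sketch deliberately ignores the opposite case where a component splits.

```python
def carry_forward_ids(previous, components, next_id):
    """previous: dict of record id -> yesterday's global id.
    components: iterable of sets of record ids from today's run.
    next_id: first unused global id.
    Returns (today's assignment, list of (retired_id, surviving_id))."""
    assignment = {}
    merges = []
    # Sort for deterministic output regardless of input order.
    for component in sorted(components, key=lambda c: sorted(c)):
        old_ids = sorted({previous[r] for r in component if r in previous})
        if old_ids:
            keep = old_ids[0]  # the smallest previous id survives
        else:
            keep = next_id     # a brand-new entity gets a fresh id
            next_id += 1
        for record in component:
            assignment[record] = keep
        for retired in old_ids[1:]:
            merges.append((retired, keep))
    return assignment, merges


yesterday = {"r1": 1, "r2": 1, "r3": 2}
today = [{"r1", "r2", "r3"}, {"r4"}]
assignment, merges = carry_forward_ids(yesterday, today, next_id=3)
print(assignment)
print(merges)  # [(2, 1)]
```

Persisting the `merges` log per run would let downstream users translate an old global ID to its current one, if that turns out to matter for analytics.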