clojure-nlp

simongray 2022-05-03T11:52:56.474829Z

simongray 2022-05-04T07:39:38.259689Z

The only real performance advantage of using transducers—when possible—is the lack of intermediate collections during the transformations.
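A minimal sketch of that difference, assuming a toy token list and a made-up stop-word set (`tokens`, `stop-words`, `term-counts-seq` and `term-counts-xf` are illustrative names, not taken from any of the libraries mentioned in this thread):

```clojure
(require '[clojure.string :as str])

;; Hypothetical input and stop-word set, for illustration only.
(def tokens ["The" "cat" "the" "hat" "a" "cat"])
(def stop-words #{"the" "a"})

;; Sequence version: `map` and `remove` each build an intermediate lazy seq
;; before `frequencies` consumes the result.
(defn term-counts-seq [tokens]
  (->> tokens
       (map str/lower-case)
       (remove stop-words)
       (frequencies)))

;; Transducer version: the same steps composed into a single transformation.
;; Each token flows through both steps and straight into the accumulator map,
;; so no intermediate collection is built.
(def xf
  (comp (map str/lower-case)
        (remove stop-words)))

(defn term-counts-xf [tokens]
  (transduce xf
             (completing (fn [acc term] (update acc term (fnil inc 0))))
             {}
             tokens))

(term-counts-seq tokens) ;; => {"cat" 2, "hat" 1}
(term-counts-xf tokens)  ;; => {"cat" 2, "hat" 1}
```

Both versions return the same counts; the only difference is whether the in-between sequences get allocated.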

2022-05-04T08:34:11.333689Z

Yeah. I was hoping to be able to allow for incremental updates to the TF-IDF calculation without going over the entire collection again but couldn't find a way to do it.

simongray 2022-05-04T08:43:15.587709Z

I don’t think that’s possible as the document frequencies will change whenever your corpus changes and every tf-idf score is a product of the document frequency table.
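A rough sketch of why, assuming the plain textbook definition idf(t) = log(N / df(t)) with no smoothing (the libraries linked in this thread may well use a different variant, and `document-frequencies`, `idf` and `corpus` are illustrative names only):

```clojure
;; Number of documents each term occurs in.
;; A document is modelled here as a vector of tokens.
(defn document-frequencies [docs]
  (frequencies (mapcat distinct docs)))

;; Assumed idf variant: log(N / df), without smoothing.
(defn idf [docs]
  (let [n (count docs)]
    (into {}
          (map (fn [[term df]]
                 [term (Math/log (double (/ n df)))]))
          (document-frequencies docs))))

;; Hypothetical two-document corpus.
(def corpus [["cat" "hat"] ["cat" "dog"]])

(idf corpus)
;; "cat" occurs in both documents, so its idf is log(2/2) = 0.0;
;; "hat" and "dog" each get log(2/1).

(idf (conj corpus ["bird"]))
;; N is now 3, so "cat" moves to log(3/2) and "hat"/"dog" to log(3/1),
;; even though none of those terms occur in the new document.
```

Since N appears in every idf value, adding a single document shifts the weights of all terms, and with them the tf-idf scores of every existing document.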

2022-05-04T08:43:51.989639Z

yes, exactly

2022-11-08T08:13:52.475839Z

Sorry to revive this old thread, but I put up my (by now pretty old) implementation on codeberg: https://codeberg.org/schaueho/tfidf

2022-11-08T10:28:13.115699Z

Just to summarize: it seems we have (at least) 3 implementations of TF-IDF in Clojure: https://github.com/kuhumcst/tf-idf https://codeberg.org/schaueho/tfidf https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281 The last one, mine, is very slow compared to at least the first. I now have a use case where mine is "too slow" while the first one would be "fast enough", so I will come back to it.

2022-10-19T16:55:23.603819Z

I am just working on improving my own TF-IDF implementation: https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281 To be released soon. I would definitely re-use something existing, so I will have a look.

2022-10-19T17:00:29.313489Z

Let's discuss here: https://github.com/kuhumcst/tf-idf/issues/1

2022-05-04T06:41:00.260499Z

I'll take a close look at what you're doing differently to make the transducer performant -- my take on this certainly wasn't (it's not on MS GitHub anymore, sorry).