#clojure-nlp
2022-05-03
schaueho 06:05:00

I'll take a deep look at what you're doing differently to make the transducer performant -- my take on this certainly wasn't (not on MS github anymore, sorry).

simongray 07:05:38

The only real performance advantage of using transducers—when possible—is the lack of intermediate collections during the transformations.
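For illustration (a minimal sketch, not code from either implementation), here is the same term-counting step written with the threading macro versus a transducer; the threaded version realizes an intermediate lazy sequence at every step, while the transducer fuses the steps into a single pass:

```
(require '[clojure.string :as str])

(def docs [["Clojure" "NLP" "Clojure"] ["tfidf" "NLP"]])

;; threaded version: mapcat and map each produce an intermediate lazy seq
(->> docs
     (mapcat identity)
     (map str/lower-case)
     (frequencies))
;; => {"clojure" 2, "nlp" 2, "tfidf" 1}

;; transducer version: one composed transformation, no intermediate collections
(transduce (comp (mapcat identity)
                 (map str/lower-case))
           (completing (fn [acc term] (update acc term (fnil inc 0))))
           {}
           docs)
;; => {"clojure" 2, "nlp" 2, "tfidf" 1}
```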

schaueho 08:05:11

Yeah. I was hoping to be able to allow for incremental updates to the TF-IDF calculation without going over the entire collection again but couldn't find a way to do it.

simongray 08:05:15

I don’t think that’s possible as the document frequencies will change whenever your corpus changes and every tf-idf score is a product of the document frequency table.
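To spell that out with a small sketch (illustrative only, not taken from either library): the corpus size and the document-frequency table themselves can be maintained incrementally, but idf(t) = log(N / df(t)) depends on both, so every previously computed tf-idf score goes stale as soon as a document is added.

```
(defn add-doc
  "Incrementally update the corpus size n and the document-frequency
   table df with the distinct terms of one new document."
  [{:keys [n df]} doc-terms]
  {:n  (inc n)
   :df (reduce (fn [m t] (update m t (fnil inc 0)))
               df
               (distinct doc-terms))})

(defn idf
  "idf(t) = log(n / df(t)); changes whenever n or (df t) changes.
   The default of 1 only guards against unseen terms."
  [{:keys [n df]} term]
  (Math/log (/ n (double (get df term 1)))))

(def state  (add-doc {:n 0 :df {}} ["clojure" "nlp"]))
(def state' (add-doc state ["clojure" "tfidf"]))

(idf state  "nlp") ;; => 0.0   (log 1/1)
(idf state' "nlp") ;; => ~0.69 (log 2/1), so any score cached with the old idf is stale
```

Only the n/df bookkeeping is incremental here; the derived idf values, and hence all tf-idf scores, still have to be recomputed over the whole corpus.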

schaueho 08:05:51

yes, exactly

Carsten Behring 16:10:23

I am just working on improving my own TFIDF implementation: https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281 To be released soon. I would definitely re-use something existing, so I will have a look.

schaueho 08:11:52

Sorry to revive this old thread, but I put up my (by now pretty old) implementation on codeberg: https://codeberg.org/schaueho/tfidf

Carsten Behring 10:11:13

Just to summarize: it seems we have (at least) three implementations of TF-IDF in Clojure:
https://github.com/kuhumcst/tf-idf
https://codeberg.org/schaueho/tfidf
https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281
The last one, mine, is very slow compared to at least the first. I now have a use case where mine is "too slow" while the first one would be "fast enough", so I will come back to it.
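One way to make the "too slow" / "fast enough" comparison concrete is a small criterium harness; the calls inside the comment block below use hypothetical names, not the real entry points of the kuhumcst or schaueho libraries, so they would need to be swapped for whatever those libraries actually expose:

```
(require '[criterium.core :as crit])

;; toy corpus: 1000 documents of 200 pseudo-random terms each
(def corpus
  (vec (repeatedly 1000
                   #(vec (repeatedly 200
                                     (fn [] (str "term" (rand-int 500))))))))

(comment
  ;; `kuhumcst-tf-idf` and `schaueho-tf-idf` are hypothetical placeholders;
  ;; substitute the real entry points of the respective libraries.
  (crit/quick-bench (doall (kuhumcst-tf-idf corpus)))
  (crit/quick-bench (doall (schaueho-tf-idf corpus))))
```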