The only real performance advantage of using transducers—when possible—is the lack of intermediate collections during the transformations.
Yeah. I was hoping to be able to allow for incremental updates to the TF-IDF calculation without going over the entire collection again but couldn't find a way to do it.
I don’t think that’s possible: the document frequencies change whenever your corpus changes, and every tf-idf score depends on the document frequency table, so adding a single document can shift the scores of all the existing ones.
yes, exactly
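To make the point above concrete, here is a minimal sketch in Python (not taken from any of the Clojure implementations discussed in this thread). It assumes the plain `tf * log(N / df)` weighting with length-normalised term frequency. The function names `document_frequencies` and `tfidf` are hypothetical, and real libraries often use smoothed variants. It shows that adding one document changes the scores of a document that was already in the corpus.

```python
import math
from collections import Counter

def document_frequencies(corpus):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # set(): each document counts a term at most once
    return df

def tfidf(corpus):
    """Return one {term: score} dict per document (assumed unsmoothed formula)."""
    n = len(corpus)
    df = document_frequencies(corpus)
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return scores

corpus = [["cat", "sat"], ["dog", "sat"]]
before = tfidf(corpus)
after = tfidf(corpus + [["cat", "ran"]])  # one new document arrives

# The first document itself is unchanged, but its scores are not the same,
# because both N and the document frequencies changed.
print(before[0])
print(after[0])
```

The counts themselves (`N` and the document frequency table) could be maintained incrementally, but under this formula every stored score that depends on them would still have to be recomputed.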
Sorry to revive this old thread, but I put up my (by now pretty old) implementation on codeberg: https://codeberg.org/schaueho/tfidf
Just to summarize: it seems we have (at least) 3 implementations of TF-IDF in Clojure: https://github.com/kuhumcst/tf-idf https://codeberg.org/schaueho/tfidf https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281 The last one, mine, is very slow compared to at least the first. I now have a use case where mine is "too slow", while the first one would be "fast enough", so I will come back to it.
I am just working on improving my own TF-IDF implementation: https://github.com/scicloj/scicloj.ml.smile/blob/d70c7e3caff93935d05ab81ed6b2d1e4846ad42b/src/scicloj/ml/smile/nlp.clj#L281 To be released soon. I would definitely re-use something existing, so I will have a look.
Let's discuss here: https://github.com/kuhumcst/tf-idf/issues/1
I'll take a deep look at what you're doing differently to make the transducer performant -- my take on this certainly wasn't (not on MS github anymore, sorry).