#biff
2023-09-12
ianjones17:09:02

has anyone ever built an “auto-tagger” feature? something like: given a block of text and a vector of tag names, return a vector of the tag names that are similar to the block of text

Jacob O'Bryant19:09:32

I've done keyword extraction, i.e. given a block of text, pick a few words from that text that are representative of the content (with tf-idf). not sure if that's exactly what you're describing here, but tf-idf might be a good first step. fastText is pretty convenient for getting word embeddings and may be useful too.
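
For reference, the tf-idf keyword extraction described above could look roughly like the sketch below. It is not the script used in this thread: the function names `tokenize` and `top_keywords`, the regex tokenizer, and the smoothed idf formula are all assumptions made for illustration, and it uses only the Python standard library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_keywords(doc, corpus, k=3):
    """Return the k words of `doc` with the highest tf-idf score.

    `corpus` is a list of reference documents used to compute document
    frequencies. (Hypothetical helper, not from the thread.)
    """
    corpus_tokens = [tokenize(d) for d in corpus]
    n_docs = len(corpus_tokens)

    # Document frequency: in how many corpus documents each word appears.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))

    tokens = tokenize(doc)
    tf = Counter(tokens)

    scores = {}
    for word, count in tf.items():
        # Smoothed idf, so words missing from the corpus do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df[word])) + 1
        scores[word] = (count / len(tokens)) * idf

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

Words that are frequent in the given text but rare in the rest of the corpus score highest, so common words like "the" drop out without needing a stop-word list. For the auto-tagger question, one possible next step would be to compare these keywords against the tag names, for example with word embeddings.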

ianjones21:09:45

is there a library you’ve used?

Jacob O'Bryant00:09:06

Just python actually

Jacob O'Bryant00:09:29

I generate a csv with clojure and then call the python script as a subprocess. then I also call fastText as a subprocess
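
The csv + subprocess pattern can be sketched as below. The thread describes a Clojure caller; this sketch shows both sides in Python only to keep it self-contained. `WORKER_SOURCE` and `run_worker` are hypothetical names, and the inline word-count worker merely stands in for a real analysis script.

```python
import csv
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: reads a CSV of (id, text) rows from the path given as
# its first argument and prints one "id,word_count" line per row.
WORKER_SOURCE = """
import csv, sys
with open(sys.argv[1], newline="") as f:
    for doc_id, text in csv.reader(f):
        print(f"{doc_id},{len(text.split())}")
"""


def run_worker(rows):
    """Write rows to a temporary CSV, run the worker on it, parse its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "input.csv")
        with open(csv_path, "w", newline="") as f:
            csv.writer(f).writerows(rows)

        # check=True raises CalledProcessError if the worker exits non-zero.
        result = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE, csv_path],
            capture_output=True,
            text=True,
            check=True,
        )

    # Each output line is "id,count"; split it back into a tuple of strings.
    return [tuple(line.split(",")) for line in result.stdout.splitlines()]
```

The CSV file is the only contract between the two processes, which is what makes it easy to swap the caller for another language.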

ianjones01:09:58

oh very cool

ianjones01:09:40

probably overkill to use libpython-clj?

Jacob O'Bryant02:09:30

when I tried libpython-clj several years ago, it ran about 30% slower than calling python as a subprocess 🤷 in general I find csv + subprocess is pretty convenient

Jacob O'Bryant02:09:42

looks like spark mllib can do tf-idf, I would definitely check that out first: https://spark.apache.org/docs/latest/mllib-feature-extraction.html I've been using spark mllib for its collaborative filtering algorithm, after previously using python for that, and it's awesome