Total newb in anything ML here. I have a corpus of natural language documents that I’d like to a) automatically extract themes from, b) apply these themes to new documents as they enter the system, and c) sort by similarity (so we can show “related documents” for each document, and sort lists so that similar documents appear close together). I had a look at AWS Comprehend, but at least the key phrases functionality didn’t seem to be what I want. I’ll try topic modeling next, but I’m also interested in having full control of the model so it can be helped by humans to get more accurate results.


The amount of data I have is relatively small (hundreds of documents), so hopefully I don’t need to go into big data or anything crazy. Any suggestions on where to start? Ideally Clojure with a robust/mainstream underlying library/framework.


Oh there’s also a ton of conj talks about ML, any pointers on where to start?


Many thanks 🙏

👀 4

I don't know of a Clojure library to handle that. AWS SageMaker looks more like what you want if you need to do this at large scale


MXNet is under the hood there - you may be able to interface with it through Clojure-MXNet. But it might be easier to deal with it via interop. Also, these tasks are most likely not going to be deep learning anyway


There is also an LDA example which might be relevant


Once you have your documents in their final vector form - you can compare them by cosine similarity


(or some other distance measure)
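To make the cosine similarity suggestion concrete, here is a minimal sketch in plain Python. It assumes the simplest possible "vector form", raw word counts per document, which is not something anyone in the thread specified; a real setup would more likely use TF-IDF weights or topic-model output. The names `term_counts`, `cosine_similarity` and `related_documents` are made up for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count word occurrences (a crude bag-of-words vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def related_documents(docs, doc_id):
    """Return the ids of all other documents, most similar first."""
    vectors = {name: term_counts(text) for name, text in docs.items()}
    target = vectors[doc_id]
    scores = [
        (cosine_similarity(target, vec), name)
        for name, vec in vectors.items()
        if name != doc_id
    ]
    return [name for _, name in sorted(scores, key=lambda p: p[0], reverse=True)]
```

With a few hundred documents, comparing every pair this way should be cheap enough that no index or big-data tooling is needed.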


or honestly - you might just be able to use python sklearn with the python interop


But take all with a grain of salt - I've never done that before. Maybe someone else has a better idea. Just telling you what direction I personally would look in


Thanks @gigasquid — already my head is spinning. Libpython-clj sure does seem to open the door to many powerful libraries, there also seem to be some Java NLP libraries out there. I’ll continue my search and report back if I find something interesting.

👍 4

@orestis: I am also an ML noob here. But I implemented something to find similarity between “Government Forms” a few years back. I used the “Jaccard Similarity Index”. It is a metric for measuring similarity between sets. And I used something called “Shingling” (similar to n-grams) to create sets from these documents. It worked reasonably well for our use case. Jaccard: Shingling:


Thanks - I’ll have a look at those. Did you use any particular library for this?


Back then I implemented it in Python, not in Clojure. But even in Python I wrote a custom implementation without any library. It was not hard to implement.
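For reference, a library-free version of shingling plus Jaccard similarity can be sketched in a few lines of Python. This is not the implementation described above, only an illustration under assumptions: shingles are taken over words rather than characters, the window size `k=3` is arbitrary, and the function names are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a short text becomes a single shingle; empty text gives no shingles.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def document_similarity(text_a, text_b, k=3):
    """Jaccard similarity between the shingle sets of two documents."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Smaller `k` makes the measure more forgiving of reordering, larger `k` rewards long shared passages, which is probably why it suited near-identical forms.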


I have seen the Jaccard distance somewhere in some Apache project.

👍 4

@orestis may I shamelessly plug if you’d like to do training on GPUs? You can currently use a GPU on the free plan and @kommen and I are here to help if you need anything


@chris441 - I took libpython-clj out for a spin in python mxnet - worked great!

clj 20

Tutorials inbound? 😊