#data-science
2022-09-01
jumar 10:09:35

How should I do sentiment analysis with Clojure? I only found a few older examples and they seem to be using CoreNLP. E.g. this gist: https://gist.github.com/shayanjm/9ff17cf7b3f72b102407

respatialized 13:09:46

CoreNLP is still a good choice; recently a more idiomatic Clojure API for it has been developed that makes it much easier to extract Datalog-style triples from text annotations. https://github.com/simongray/datalinguist

Carsten Behring 22:09:58

How do you represent your texts? Bag-of-words or embeddings?

Carsten Behring 22:09:59

Sentiment analysis is often / always handled as a supervised text classification problem. So one key question is how to represent your text in a table with a fixed number of columns. has support for supervised classification.

Carsten Behring 22:09:40

I am of course assuming that you want to train your own model. Or are you looking for a "trained sentiment analysis model"?

Carsten Behring 22:09:36

has some functions to create bag-of-words (BOW) and TF-IDF representations from text. This means your features for classification become the presence / absence of tokens (or counts of tokens) in the text. CoreNLP via datalinguist could be used to get "other features", more on the grammatical level.
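
To make the "table with a fixed number of columns" idea concrete, here is a minimal sketch in plain Python (standard library only, not the Clojure library discussed here). The function names `tokenize`, `bag_of_words` and `tf_idf` are made up for illustration, the tokenizer is deliberately naive, and the TF-IDF weighting shown (count times log(N / document frequency)) is just one common variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count rows); every row has one column per vocabulary token."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


def tf_idf(rows):
    """Weight raw counts by log(N / document frequency), one common TF-IDF variant."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
        for row in rows
    ]


docs = ["good movie", "bad movie", "good good plot"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

print(vocab)   # ['bad', 'good', 'movie', 'plot']
print(counts)  # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
```

Each row of `counts` (or `weights`) is then one training example for a supervised classifier, with the sentiment label as the target column.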

Carsten Behring 22:09:40

scicloj.ml.clj-djl integrates with fastText, which is an embedding / deep-learning based approach to text classification. There is an example of using it for sentiment analysis here:

Carsten Behring 22:09:11

chapter "Fastext text classification from DJL"

Carsten Behring 22:09:39

The gist you posted is done in Clojure, isn't it? Just using Java interop. A valid way to do these things is to do the "modelling" (= the part of your code where it is hard to find native Clojure libraries) in Java via Clojure-Java interop, and to convert to Clojure data structures for the input and output of that part. So some isolated part of your code will use whatever Java library you find via Clojure-Java interop. The same goes for Python / R interop via libpython-clj or clojisr.

jumar 05:09:29

I'm mostly looking for a "trained sentiment analysis model". This isn't for anything production-ready. I'm really a noob in data science and was simply looking for an alternative to the NLTK sentiment analysis module in Python (we used that to analyse sentiment in a Wikipedia article).

Carsten Behring 08:09:04

There are a few options in Java, then. Some of them might have more-or-less complete wrappers in Clojure, datalinguist being one. Otherwise you can do the Java interop yourself, or go to Python / R using libpython-clj and clojisr. All three interop solutions work well. I suggest that anybody doing data science in Clojure become familiar with interop towards Java, Python and R. There is such a large number of models available in these 3 ecosystems that nobody will ever rewrite them in Clojure. Using DVC (https://dvc.org/) with pipelines written in any programming language can be another form of "interop". Some steps of these pipelines can be written in Clojure, others in Java, Python or R (and they communicate via data files on disk).

Carsten Behring 08:09:28

So in-process interop vs out-of-process interop.
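
A rough sketch of the out-of-process style: two pipeline steps that only share a data file on disk. Both steps are in Python here purely to keep the example short; in practice either one could be a Clojure, Java or R program. The step names, the `texts.jsonl` file name and the JSON-lines format are assumptions for illustration, and nothing here uses DVC itself, which would only orchestrate steps like these.

```python
import json
import tempfile
from pathlib import Path


def step_prepare(texts, out_path):
    """Step 1: write whitespace-trimmed texts to disk, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({"text": t.strip()}) + "\n")


def step_count(in_path):
    """Step 2: read the file written by step 1 and count tokens per text."""
    with open(in_path, encoding="utf-8") as f:
        return [len(json.loads(line)["text"].split()) for line in f]


# The two steps share nothing but the file path.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "texts.jsonl"  # hypothetical file name

step_prepare(["  a fine day ", "terrible"], data_file)
lengths = step_count(data_file)
print(lengths)  # [3, 1]
```

Because the only contract between the steps is the file format, each step can be swapped for an implementation in another language without touching the other.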

Carsten Behring 08:09:35

There is a template which sets up the interop using Docker: https://github.com/behrica/clj-py-r-template

Rupert (All Street) 16:09:25

@U06BE1L6T - most off-the-shelf sentiment algorithms give unreliable results when they are run on data that differs from their training data. For example, scientific papers will often be detected as negative by a naive sentiment algorithm if the paper has a "Risks" or "Challenges" section. Tweets and news articles also use different language, etc. So the solution is often to train a new supervised learning sentiment detector on the actual data that you are interested in analysing.
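
A toy illustration of that failure mode, assuming a naive word-list scorer. The `POSITIVE` / `NEGATIVE` word lists and the `lexicon_score` function are made up for this sketch and are not taken from any real sentiment library.

```python
# Hypothetical word lists; a real lexicon would be far larger.
POSITIVE = {"good", "great", "excellent", "promising"}
NEGATIVE = {"risk", "risks", "challenge", "challenges", "failure", "bad"}


def lexicon_score(text):
    """Return (#positive tokens - #negative tokens); below zero reads as 'negative'."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


review = "Great film, excellent cast."
paper = "Risks and challenges: the main risk is sensor failure."

print(lexicon_score(review))  # 2
print(lexicon_score(paper))   # -4
```

The second sentence is neutral technical writing, yet it scores as strongly negative because the word lists reflect review-style language rather than scientific prose.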

aaelony 22:09:29

Heck, try labels of positive and negative with Gigasquid's zeroshot example. Maybe that will work... http://gigasquidsoftware.com/blog/2021/03/15/breakfast-with-zero-shot-nlp/
