This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-09-01
How should I do sentiment analysis with Clojure? I only found a few older examples and they seem to be using CoreNLP. E.g. this gist: https://gist.github.com/shayanjm/9ff17cf7b3f72b102407
CoreNLP is still a good choice; recently a more idiomatic Clojure API for it has been developed that makes it much easier to extract Datalog-style triples from text annotations. https://github.com/simongray/datalinguist
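Whether via datalinguist or raw interop, calling CoreNLP's sentiment annotator from Clojure is short. A minimal sketch using plain Java interop, assuming `edu.stanford.nlp/stanford-corenlp` and its English models artifact are on the classpath:

```clojure
(import '(edu.stanford.nlp.pipeline StanfordCoreNLP)
        '(edu.stanford.nlp.ling CoreAnnotations$SentencesAnnotation)
        '(edu.stanford.nlp.sentiment SentimentCoreAnnotations$SentimentClass)
        '(java.util Properties))

(defn sentiment-pipeline
  "Builds a CoreNLP pipeline with the sentiment annotator enabled."
  []
  (let [props (doto (Properties.)
                (.setProperty "annotators" "tokenize,ssplit,parse,sentiment"))]
    (StanfordCoreNLP. props)))

(defn sentence-sentiments
  "Returns one sentiment label (e.g. \"Positive\") per sentence in text."
  [^StanfordCoreNLP pipeline text]
  (let [annotation (.process pipeline text)]
    (for [sentence (.get annotation CoreAnnotations$SentencesAnnotation)]
      (.get sentence SentimentCoreAnnotations$SentimentClass))))
```

The pipeline is expensive to construct, so build it once and reuse it across calls.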
How do you represent your texts? Bag-of-words or embeddings?
Sentiment analysis is almost always handled as a supervised text classification problem.
So one key question is how to represent your text as a table with a fixed number of columns.
has support for supervised classification.
I am of course assuming that you want to train your own model. Or are you looking for a "trained sentiment analysis model"?
has some functions to create bag-of-words (BOW) and TFIDF representations from text.
This means your features for classification become the presence/absence of tokens (or counts of tokens) in the text.
CoreNLP via datalinguist could be used to get other features, more on the grammatical level.
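The bag-of-words and TF-IDF idea can be sketched in a few lines of plain Clojure (the tokenizer regex and function names here are illustrative, not from any particular library):

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Naive tokenizer: lower-cases and keeps alphanumeric runs."
  [text]
  (re-seq #"[a-z0-9']+" (str/lower-case text)))

(defn bag-of-words
  "Token -> count map for one document."
  [text]
  (frequencies (tokenize text)))

(defn tf-idf
  "TF-IDF weights for one document's bag-of-words, given the
   bags-of-words of the whole corpus."
  [doc-bow corpus-bows]
  (let [n-docs (count corpus-bows)
        df     (fn [token] (count (filter #(contains? % token) corpus-bows)))]
    (into {}
          (for [[token tf] doc-bow]
            [token (* tf (Math/log (/ n-docs (double (df token)))))]))))
```

Each document then becomes one row, with one column per token in the vocabulary, which is exactly the fixed-width table a classifier needs.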
scicloj.ml.clj-djl
integrates with fastText,
which is an embedding / deep-learning based approach to text classification.
There is an example of using it for sentiment analysis here:
chapter "fastText text classification from DJL"
The gist you posted is done in Clojure, isn't it?
Just using Java interop.
A valid way to do these things is to do the "modelling" (the part of your code where it is hard to find native Clojure libraries) in Java via Clojure-Java interop, converting to and from Clojure data structures at the input and output of that part.
So some isolated part of your code will use whatever Java library you find, via Clojure-Java interop.
The same works for Python / R interop, via libpython-clj
or clojisr.
I'm mostly looking for a "trained sentiment analysis model". This isn't for anything production-ready. I'm really a noob in data science and was simply looking for an alternative to NLTK's sentiment analysis module in Python (we used that to analyse sentiment in a Wikipedia article).
There are a few options then in Java. Some of them might have more-or-less complete wrappers in Clojure, datalinguist being one. Otherwise you can do the Java interop yourself, or go to Python / R using libpython-clj and clojisr. All three interop solutions work well. I suggest that anybody doing data science in Clojure become familiar with the interop towards Java, Python and R. There is such a large amount of models available in these three ecosystems that nobody will ever rewrite them in Clojure. Using DVC (https://dvc.org/), with pipelines whose steps can be written in any programming language, can be another form of "interop": some steps of these pipelines can be written in Clojure, others in Java, Python or R, and they communicate via data files on disk.
So in-process interop vs out-of-process interop.
There is a template which sets up the interop using Docker: https://github.com/behrica/clj-py-r-template
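As a concrete instance of the Python-interop route matching the NLTK use case above, here is a hedged sketch of calling NLTK's VADER analyzer through libpython-clj. It assumes libpython-clj 2.x on the classpath and a Python environment with `nltk` installed and the `vader_lexicon` resource downloaded:

```clojure
;; deps.edn (assumption): clj-python/libpython-clj {:mvn/version "2.025"}
(require '[libpython-clj2.python :as py])

(py/initialize!)  ; locate and bind the local Python installation

(defn vader-scores
  "Polarity scores for a piece of text, as a JVM map
   with keys \"neg\", \"neu\", \"pos\" and \"compound\"."
  [text]
  (let [vader    (py/import-module "nltk.sentiment.vader")
        analyzer (py/call-attr vader "SentimentIntensityAnalyzer")
        scores   (py/call-attr analyzer "polarity_scores" text)]
    (py/->jvm scores)))
```

This gives you the pre-trained model you asked for, at the cost of a Python runtime dependency; the Docker template above is one way to pin that environment down.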
@U06BE1L6T - most off-the-shelf sentiment algorithms give unreliable results when run on data that differs from their training data, e.g. scientific papers will often be detected as negative by a naive sentiment algorithm if the paper has a "Risks" or "Challenges" section; tweets vs. news articles also use different language, etc. So the solution is often to train a new supervised sentiment detector on the actual data that you are interested in analysing.
Heck, try the labels "positive" and "negative" with Gigasquid's zero-shot example. Maybe that will work... http://gigasquidsoftware.com/blog/2021/03/15/breakfast-with-zero-shot-nlp/