
So I have a problem I'm solving that I'd appreciate any advice on. I'm an ML newb. I run a curated newsletter of technical deep-dives on pretty much any subject that might fall under software engineering. I'm pretty sick of spending hours curating quality content and want to build an assistant curator using machine learning. I have a history of 2k articles with click data (unique and total), written summaries, and the extracted article content. I also have an RSS feed that constantly updates with new articles from ~700 technical sources. I'm currently in the data processing stage of the project. I've gathered all of the data I mentioned before for the articles (clicks, content), but am now looking for other potential data points I can use as parameters for the model. I was thinking of using the bag-of-words technique to represent the article content. Does anyone have any suggestions on what else I can explore, data-wise? Also taking any suggestions on what actual models to use. Someone recommended the Google Cloud AutoML Natural Language tool for me to try out, does anyone have experience with that? My naive approach is to have a model that can take in the articles from my RSS feed and label them either "check it out" or "don't waste your time".
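For what it's worth, bag-of-words just turns a piece of text into a vector of word counts over a fixed vocabulary. A minimal stdlib sketch (the vocabulary and article text here are made up):

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["kubernetes", "database", "latency", "recipe"]
article = "Reducing database latency: how we tuned our database indexes."
print(bag_of_words(article, vocab))  # -> [0, 2, 1, 0]
```

In practice you'd build the vocabulary from your 2k-article corpus rather than hand-pick it, and a library like scikit-learn's `CountVectorizer` does this for you.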


Sounds like you are building a binary classifier. I'm sure folks here can highlight lots of approaches for binary classification.


Thanks! I'll take a look at this as well.


Unless I misunderstood, that would imply building a labeled dataset. I'd probably go a different route: build a recommendation engine (with tf-idf?) and tune it manually.
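tf-idf (term frequency-inverse document frequency) weights each word by how often it appears in a document, discounted by how many documents contain it, so ubiquitous words count for less. A minimal stdlib sketch (the example documents are made up):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (tf[t] / len(tokens)) * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["rust memory safety", "rust async runtime", "gardening tips"]
vecs = tfidf_vectors(docs)
# "memory" outweighs "rust" in doc 0, since "rust" appears in two of the three docs
```

"Tuning manually" would then mean comparing new articles to your 2k accepted ones (e.g. by cosine similarity of these vectors) and adjusting the similarity threshold by hand until the recommendations look right.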


For the binary approach, would I have to label articles that shouldn't go into my curation? Or could I get away with just labeling the 2k I have as "should" go into curation and find matches on that?


What is tf idf and what does manual tuning entail?

Rupert (All Street) 10:09:21

For most ML classifiers you need both positive and negative examples. Some algorithms are sensitive to the ratio of positive and negatives in your training set too.


Damn, that's not really a feasible option for me then.

Rupert (All Street) 10:09:41

Can't you just use the articles in your feed that you rejected?

Rupert (All Street) 10:09:17

Often a good approach to building a classifier is to start with simple ML algorithms, then slowly increase the complexity once you can get no further value out of the current one. e.g.
• manual keywords that you make up yourself (no AI)
• Naive Bayes
• Paragraph2Vec
• Neural network
• etc.
You will want to evaluate how your model performs; typically you can trivially calculate Precision, Recall, and F1 scores to see how well it is doing.
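Those three metrics are easy to compute by hand on the accept/reject task (the article ids below are made up):

```python
def precision_recall_f1(predicted, actual):
    """predicted/actual are sets of article ids labeled 'check it out'."""
    tp = len(predicted & actual)                     # true positives
    precision = tp / len(predicted) if predicted else 0.0  # how many picks were right
    recall = tp / len(actual) if actual else 0.0           # how many good ones we found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(predicted={1, 2, 3, 4}, actual={2, 3, 5})
print(round(p, 2), round(r, 2), round(f, 2))  # -> 0.5 0.67 0.57
```

Precision matters most if you hate wading through bad recommendations; recall matters most if you hate missing great articles.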


I don't keep track of every article I've rejected, and not every article in the feed that isn't in my accepted list should be considered rejected.


Any automated way I can think of to gather a bunch of "rejected" articles would be biased.

Rupert (All Street) 10:09:10

You can probably build a basic classifier with about 500 to 1,000 positive and negative examples. So you could go through your negatives and filter them by hand.

Rupert (All Street) 10:09:33

One issue you may hit is that the criteria you give the model to train on are too broad and span too many domains. e.g. if you couldn't explain them concisely to a friend such that they would make exactly the same accepts/rejects as you, then your target is not clear enough. One way around this is to build multiple classifiers (e.g. one classifier just to find interesting ML scientific papers that are at least 10 pages long and cover NLP; this example is less subjective).


Hmm... perhaps. Not too thrilled about doing that manual labeling. I'm going to try a non-ML approach first by filtering the articles in my feed by length. Since 99% of my sources are technical anyway, it might be "good enough" for now.

Rupert (All Street) 10:09:08

Agree - starting with an algorithm and heuristics sounds like a good plan. You can add some keyword matching in there and give a score based on the number of keywords that match.
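That keyword-scoring heuristic could look something like this (the keyword list and threshold are invented for illustration; you'd tune both by hand):

```python
def keyword_score(text, keywords):
    """Score = number of hand-picked keywords found in the article text (no ML)."""
    words = set(text.lower().split())
    return sum(1 for kw in keywords if kw in words)

KEYWORDS = ["latency", "compiler", "distributed", "profiling"]  # made-up list
article = "a deep dive into compiler internals and profiling"
score = keyword_score(article, KEYWORDS)
print(score)  # -> 2
accept = score >= 2  # threshold tuned by eyeballing results
```

Combined with the length filter, this gives a transparent baseline to beat before reaching for any model.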

👍 1

Do you have any recommendations for algorithms for gauging the "quality" of a piece of text? Maybe something that measures the number of "filler" words or something along those lines?
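One crude way to sketch the filler-word idea, assuming an entirely made-up word list (established readability metrics like Flesch-Kincaid instead score sentence and word length):

```python
# Illustrative only: a real list would be curated from your own corpus.
FILLER = {"very", "really", "just", "basically", "actually", "stuff", "things"}

def filler_ratio(text):
    """Fraction of tokens that are 'filler' words - a crude proxy for writing quality."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in FILLER for t in tokens) / len(tokens)

print(filler_ratio("this is really just basically some stuff"))  # high ratio -> likely fluff
```

A lower ratio would weakly suggest denser, more deliberate writing, though it's easy to game and worth validating against articles you've already judged.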


For folks interested in how Bayesian inference works in R and Stan, I came across a well-produced set of videos here:

🙏 2