clojars 2022-04-15 | Slack Archive

lread16:04:27

Heya @tcrawley I just updated cljdoc artifact searching. The clojars-web search code gave me many Lucene inspirations and insights. Much thanks! ❤️

tcrawley18:04:12

Great! I'm glad you found it useful!

lread17:04:50

It was awesomely useful! Some highlights of minor things I chose to do differently, if you are interested:
• boost by clojars download counts over the last year (or so) instead of all time, to give weight only to currently popular artifacts
• used the ngram tokenizer at index time to allow substring matches anywhere in a token
• used the ICU term folding filter to also normalize accents (facade = façade)
• not supporting Lucene query syntax at all (not yet anyway, and maybe never, dunno)
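To illustrate the accent normalization point (facade = façade), here is a minimal sketch in Python using only the standard library. It is not the ICU folding filter mentioned above and covers far fewer cases; the function name `fold` is hypothetical and only approximates the idea by case folding and stripping combining marks.

```python
import unicodedata


def fold(text: str) -> str:
    """Rough stand-in for term folding (hypothetical helper, not ICU).

    Case-folds the text, decomposes accented characters into a base
    character plus combining marks, then drops the combining marks.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings fold to the same indexed term.
print(fold("façade") == fold("Facade"))
```

Applying the same folding at index time and at query time is what lets either spelling find the other.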

tcrawley20:04:15

> boost by clojars download counts over the last year (or so) instead of all time to give weight only to currently popular artifacts
That's a good idea. I was wondering how you got that data, then realized we publish daily download stats. I had forgotten :)
> used the ngram tokenizer at index time to allow substring matches anywhere in a token
Another good idea. I think that would help with https://github.com/clojars/clojars-web/issues/719

lread22:04:30

> then realized we publish daily download stats. I had forgotten :)
Yes, thanks for all the data!
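As a rough sketch of boosting by recent rather than all-time downloads: the Python below sums daily counts over a trailing window and turns the total into a multiplier. The data shape (a mapping from date to per-artifact counts), the function names, and the logarithmic boost are all assumptions for illustration, not the format of the published stats or the formula cljdoc uses.

```python
import math
from datetime import date, timedelta


def recent_downloads(daily_stats, artifact, today, window_days=365):
    """Sum downloads for `artifact` over the trailing window.

    `daily_stats` is assumed to map a date to {artifact-id: count}.
    """
    cutoff = today - timedelta(days=window_days)
    return sum(
        counts.get(artifact, 0)
        for day, counts in daily_stats.items()
        if cutoff < day <= today
    )


def boost(downloads):
    """Hypothetical boost: grows slowly so huge counts do not swamp relevance."""
    return 1.0 + math.log1p(downloads)


daily_stats = {
    date(2022, 4, 14): {"ring/ring": 100},
    date(2020, 1, 1): {"ring/ring": 5000},  # outside the window, ignored
}
recent = recent_downloads(daily_stats, "ring/ring", today=date(2022, 4, 15))
print(recent, boost(recent))
```

The old spike in 2020 contributes nothing, which is the point of weighting only currently popular artifacts.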

lread23:04:28

The ngram tokenizer does increase Lucene index size. I think we went from ~6mb to ~20mb, to give you an idea. And indexing takes a wee bit longer, but not by any significant amount. I felt corfield should find all seancorfield libraries. We previously only did prefix searching. If you want to index for prefix searching only, you could employ the edge ngram tokenizer.
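The difference between the two tokenizers can be sketched in plain Python. This is a simplified model, not Lucene's implementation: the function names and the gram sizes (2 to 4) are assumptions, and matching is reduced to set membership.

```python
def ngrams(token, min_n=2, max_n=4):
    """All substrings of length min_n..max_n, from any position in the token."""
    return {
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_n=2, max_n=4):
    """Only the prefixes of length min_n..max_n."""
    return {token[:n] for n in range(min_n, min(max_n, len(token)) + 1)}


full = ngrams("seancorfield")
edge = edge_ngrams("seancorfield")

print("corf" in full)  # substring from the middle of the token
print("corf" in edge)  # not a prefix
print("sean" in edge)  # a prefix
print(len(full), len(edge))
```

Indexing every interior gram is what makes mid-token matches possible, and it is also why the index grows; edge grams keep the index small at the cost of supporting prefixes only.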

lread23:04:11

Another tidbit that might interest you: because we don’t support any explicit query syntax, I decided to make some assumptions on when to apply an exact match artifact search. If the user types a single term x, we attempt an exact match for artifact x/x. With two terms, x/y (or x y), we attempt an exact match for x/y. These exact matches are weighted heavily so that they appear first in the results. This seems to be working well enough.
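The exact-match rule described above can be sketched as follows. This is an illustrative Python version, not the cljdoc code (which is Clojure); the function name and the choice to return `None` for any other number of terms are assumptions.

```python
import re


def exact_match_candidate(query):
    """Return the (group, artifact) pair to try as an exact match, or None.

    One term x      -> (x, x)
    Two terms x/y   -> (x, y), where the separator is "/" or whitespace
    Anything else   -> None (assumed: no exact match attempted)
    """
    terms = [t for t in re.split(r"[\s/]+", query.strip()) if t]
    if len(terms) == 1:
        return (terms[0], terms[0])
    if len(terms) == 2:
        return (terms[0], terms[1])
    return None


for q in ["ring", "ring/ring-core", "ring ring-core", "a b c"]:
    print(q, "->", exact_match_candidate(q))
```

A search would then add a heavily weighted clause for the returned pair alongside the regular fuzzy clauses, so an exact hit sorts first without hiding the other results.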

lread23:04:34

Anyway, the source is https://github.com/cljdoc/cljdoc/blob/master/src/cljdoc/server/search/search.clj if you are interested. (And some of it will look VERY familiar 🙂).
