This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-04-15
Heya @tcrawley I just updated cljdoc artifact searching. The clojars-web search code gave me many Lucene inspirations and insights. Much thanks! ❤️
It was awesomely useful! Some highlights of minor things I chose to do differently, if you are interested, are:
• boost by Clojars download counts over the last year (or so) instead of all time, to give weight only to currently popular artifacts
• used the ngram tokenizer at index time to allow substring matches anywhere in a token
• used the ICU term folding filter to also normalize accents (facade = façade)
• not supporting Lucene query syntax at all (not yet anyway, and maybe never, dunno)
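To illustrate the accent normalization point (facade = façade), here is a rough Python sketch of the idea. It is not the ICU folding filter, which handles far more cases; the `fold` function is a hypothetical helper that only decomposes characters and drops combining marks.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate accent folding (hypothetical helper, not ICU itself)."""
    # NFKD splits an accented character into a base character plus
    # combining marks, e.g. "ç" becomes "c" + U+0327 (combining cedilla).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that matching is also case-insensitive.
    return stripped.casefold()
```

Applying the same fold to both the indexed terms and the query is what lets a search for facade find façade.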
> boost by clojars download counts over the last year (or so) instead of all time to give weight only to currently popular artifacts
That's a good idea. I was wondering how you got that data, then realized we publish daily download stats. I had forgotten :)
> used the ngram tokenizer at index time to allow substring matches anywhere in a token
Another good idea. That would help with https://github.com/clojars/clojars-web/issues/719, I think.
> then realized we publish daily download stats. I had forgotten :)
Yes, thanks for all the data!
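For anyone curious what "downloads over the last year (or so)" could look like, here is a minimal Python sketch. The shape of `daily_counts` (a date-to-count mapping for one artifact) and the log-damped `popularity_boost` are assumptions made for illustration; they are not the Clojars stats format or the formula cljdoc uses.

```python
import math
from datetime import date, timedelta


def recent_downloads(daily_counts: dict, today: date, window_days: int = 365) -> int:
    """Sum downloads for days inside the trailing window (hypothetical data shape)."""
    cutoff = today - timedelta(days=window_days)
    return sum(n for day, n in daily_counts.items() if cutoff < day <= today)


def popularity_boost(downloads: int) -> float:
    """Assumed damping: grows slowly with downloads, 1.0 when there are none."""
    return 1.0 + math.log1p(downloads)
```

With a window like this, an artifact that was popular years ago but is rarely downloaded now gets almost no boost.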
The ngram tokenizer does increase Lucene index size. I think we went from 6 MB to 20 MB, to give you an idea. And indexing takes a wee bit longer, but not by any significant amount.
I felt corfield should find all seancorfield libraries. We previously only did prefix searching.
If you want to index for prefix searching only, you could employ the edge ngram tokenizer.
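The difference between the two tokenizers can be sketched in plain Python. This is an illustration of the concept rather than Lucene's implementation, and the gram sizes used below are arbitrary assumptions, not cljdoc's settings.

```python
def ngrams(token: str, min_n: int, max_n: int) -> list:
    """All substrings of length min_n..max_n, starting at any position."""
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


def edge_ngrams(token: str, min_n: int, max_n: int) -> list:
    """Only the substrings anchored at the start of the token (prefixes)."""
    return [token[:n] for n in range(min_n, min(max_n, len(token)) + 1)]


# Substring match: corfield is one of the indexed grams of seancorfield.
print("corfield" in ngrams("seancorfield", 2, 8))
# Prefix-only index: corfield is absent, sean is present.
print("corfield" in edge_ngrams("seancorfield", 2, 8))
print("sean" in edge_ngrams("seancorfield", 2, 8))
```

Plain ngrams emit many more terms per token than edge ngrams, which is consistent with the larger index size mentioned above.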
Another tidbit that might interest you:
Because we don’t support any explicit query syntax, I decided to make some assumptions on when to apply an exact match artifact search.
If the user types a single term x, we attempt an exact match for artifact x/x.
For two terms, x/y (or x y), an exact match for x/y is attempted.
These exact matches are weighted heavily so that they appear first in the results.
This seems to be working well enough.
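The exact-match assumptions described above can be sketched roughly as follows. This is a Python illustration of the rule as stated in this thread, not a port of the cljdoc source; `exact_match_candidate` is a hypothetical name, and returning nothing for three or more terms is an assumption.

```python
import re


def exact_match_candidate(query: str):
    """Return the group/artifact id to try as an exact match, or None."""
    # Treat both whitespace and "/" as separators between terms.
    terms = [t for t in re.split(r"[\s/]+", query.strip()) if t]
    if len(terms) == 1:
        # Single term x: try artifact x/x.
        return f"{terms[0]}/{terms[0]}"
    if len(terms) == 2:
        # Two terms, typed as x/y or x y: try artifact x/y.
        return f"{terms[0]}/{terms[1]}"
    # Assumption: with any other number of terms, skip the exact match.
    return None
```

The candidate would then be added to the query as a heavily weighted exact-match clause so that a hit sorts first.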
Anyway, the source is https://github.com/cljdoc/cljdoc/blob/master/src/cljdoc/server/search/search.clj if you are interested. (And some of it will look VERY familiar 🙂).