#cljdoc
2022-03-31
Cora (she/her)00:03:45

@lee suppose I want to make a test project in the cljdoc repo, to have it analyzed during tests so I can make a test search docset... I guess I'm not sure where to put that or how to set that up using deps.edn

Cora (she/her)00:03:14

do you have a minute to talk about it?

lread00:03:33

I have done something similar for visual inspection. There is the stand-alone https://github.com/cljdoc/cljdoc-exerciser

lread00:03:56

Is that what you mean by test project? A test lib/repo?

Cora (she/her)00:03:55

yes, essentially

Cora (she/her)00:03:25

but I'd like this automated and within cljdoc itself, so I can exercise generating a cache bundle and transforming that into a searchset

Cora (she/her)00:03:25

if that seems like a good idea

lread00:03:34

Do you want to go through full integration? Like have your test project analyzed by cljdoc-analyzer?

Cora (she/her)00:03:36

right now I have stuck an edn cache bundle in a file

Cora (she/her)00:03:30

it feels like something that we should have in general?

Cora (she/her)00:03:15

but I need to do things like have a markdown file and an asciidoc file and multiple namespaces and such so I can confirm that my code is breaking those down into chunks correctly (for searching)
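A minimal sketch of the chunking idea mentioned above, written in Python rather than cljdoc's Clojure. It splits a markdown document into one chunk per heading so that each chunk can be indexed separately. The function name `chunk_markdown` and the chunk shape are assumptions for illustration, not cljdoc's actual code, and it ignores asciidoc and `#` characters inside code fences.

```python
import re

# Matches ATX-style markdown headings such as "# Title" or "### Sub".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text):
    """Split markdown text into chunks, one per heading.

    Hypothetical helper: returns a list of {"title", "body"} dicts.
    Text before the first heading becomes a chunk with title None.
    """
    chunks = []
    title, lines = None, []

    def flush():
        # Only emit a chunk if it has a heading or some non-blank text.
        if title is not None or any(l.strip() for l in lines):
            chunks.append({"title": title, "body": "\n".join(lines).strip()})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


# Made-up sample document, for illustration only.
sample = "# Intro\nHello there.\n\n## Usage\nCall `foo`.\n"
chunks = chunk_markdown(sample)
```

A test docset with a markdown file, an asciidoc file, and several namespaces would let assertions check that each source produces the expected chunk titles.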

lread00:03:50

And you’d prefer not to use a stand-alone github repo because you want it more tightly coupled to cljdoc source base, yeah?

Cora (she/her)00:03:34

I mean, stop me if I'm doing something ridiculous

Cora (she/her)00:03:39

or something ill-advised

lread00:03:59

Sounds reasonable if you want to do a full integration test.

lread00:03:59

Our current integration test already ingests a project. But one from the wild.

Cora (she/her)00:03:01

I'm less focused on the analysis portion and exercising that fully than on generating the cache bundle input to this searchset creation

Cora (she/her)00:03:20

(for client-side search)

Cora (she/her)00:03:38

I don't want to just stick static edn in the project for this since that could break without us knowing

lread00:03:52

The ingest cycle is not very speedy, I’m guessing you are ok with that for this particular test.

Cora (she/her)00:03:24

does it take a lot to analyze a tiny already-local project?

lread00:03:36

Well… you can get an idea by trying an ingest from the command line. But assuming you are ok with speed, I don’t see a problem with creating some sort of cljdoc test project. Ingest only works from jars, so you’d have to jar it up.

Cora (she/her)01:03:05

ahhh there's the rub

Cora (she/her)01:03:13

maybe I'll stick with static for the moment

Cora (she/her)01:03:22

and let people report it if it breaks

lread01:03:24

rub=too slow?

lread01:03:41

ya, the feedback loop might drive you a bit mad there…

lread01:03:32

btw, I am working on search on the server side. So might have some searchy questions for you sometime soon! 🙂

Cora (she/her)01:03:19

oh! well, this is only search within an individual :group-id/:artifact-id/:version, and only client-side

Cora (she/her)01:03:40

something I've had on the backburner forever

Cora (she/her)01:03:49

but the server-side is mostly done

lread01:03:16

Ya, but your brain is wired on search and I might have general ideas to bounce.

Cora (she/her)01:03:33

I'm definitely up for it

lread01:03:44

(by server side search I mean the lucene stuff btw)

phronmophobic01:03:28

👋 I've recently been thinking about problems and ideas that it seems like cljdoc either already solves or is in the process of solving. I've been trying to get up to speed on all the cool things cljdoc is up to. Just wanted to say hi and say that cljdoc is really neat!

phronmophobic01:03:55

hope it's not a dumb question. what is client side search?

lread01:03:00

Hey buddy! Nice to see you here!

👋 1
lread01:03:05

I’ll let @corasaurus-hex answer that one. Here’s the PR https://github.com/cljdoc/cljdoc/pull/466 which has a demo link.

🆒 1
lread01:03:41

@smith.adriane I was thinking your https://github.com/phronmophobic/dewey might come in handy for cljdoc someday/somehow.

👍 1
lread01:03:40

Gotta run for now…

👍 1
Cora (she/her)01:03:40

oh, hi 👋

👋 1
Cora (she/her)01:03:57

sorry, had to step away, and now I cut my left pointer so typing is fun

🙁 1
Cora (she/her)01:03:39

I actually am nixing that branch from the PR, @smith.adriane, but it has the beginnings of it

Cora (she/her)01:03:55

it was just too out of date and rebasing was too painful

phronmophobic01:03:22

I was just curious what was considered client-side since I'm still getting acquainted with cljdoc

Cora (she/her)01:03:24

the idea is to feed the docs for a given :group-id/:artifact-id/:version into the browser and populate an in-browser full-text search engine with it. then you can search within just that docset

Cora (she/her)01:03:36

you can try searching here to see what I mean https://corasaurus-hex.github.io/cljdoc-search/

Cora (she/her)01:03:40

that's all client-side
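A rough sketch of the idea described above: load the docs for one docset into an in-memory full-text index, then search only within that docset. This is a Python stand-in for what would run in the browser, not the code from the PR. The names `build_index` and `search`, and the sample docset, are assumptions made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: token -> set of doc ids.

    docs is a {doc_id: text} mapping for a single docset.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of docs that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up docset for one hypothetical group-id/artifact-id/version.
docset = {
    "ns/example.core": "Core functions for parsing and printing",
    "doc/getting-started": "Getting started: parsing your first file",
}
index = build_index(docset)
hits = search(index, "parsing file")
```

Because the index is built on demand from one docset, nothing has to be indexed or stored server-side for versions that are rarely searched.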

phronmophobic01:03:54

I've been working on "client-side" library search with "client-side" meaning the developer's computer. https://github.com/phronmophobic/add-deps

phronmophobic01:03:18

Yea, I was checking out that link from the PR. Looks cool!

phronmophobic01:03:20

Both the search and the static/dynamic analysis that cljdoc does are really interesting

Cora (she/her)01:03:31

it definitely is

Cora (she/her)01:03:19

I'm trying to avoid having to index every docset server-side, especially when some may only be searched a handful of times before there's a new version (and therefore new docset) and then never searched again

Cora (she/her)01:03:28

and so having the client index it, cheaply and quickly, seems like a good trade-off maintenance-wise

👍 1
Cora (she/her)01:03:39

just a heads-up, I'm not sure how up to date the specs are in the project

Cora (she/her)01:03:01

the cache-bundle doesn't seem to be up to date

Cora (she/her)01:03:57

or perhaps there are two things called cache-bundle in the project

Cora (she/her)04:03:38

defining specs for some of these huge data structures is paaaaaainful

lread13:03:24

Yeah, https://github.com/cljdoc/cljdoc/issues/532. I started describing data structures in docstrings, whenever I was scratching my head.

lread13:03:38

So @corasaurus-hex, I’m learning how server-side search currently works. There’s details around how text is tokenized, but basically it seems like we have prefix searching. So a search for "thi" will match "this" and "thing" but not "rethink". I’m gonna guess that client-side search is more along the lines of simple character matching?
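To make the distinction concrete, here is a small Python sketch contrasting prefix matching on tokens with plain substring matching. It only illustrates the behaviour described in the message above. Lucene's real tokenization and query handling are more involved, and the function names here are assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prefix_match(query, text):
    """True if any token in the text starts with the query."""
    q = query.lower()
    return any(token.startswith(q) for token in tokenize(text))


def substring_match(query, text):
    """True if the query appears anywhere in the text."""
    return query.lower() in text.lower()
```

With these definitions, `prefix_match("thi", "rethink")` is false while `substring_match("thi", "rethink")` is true, which is the difference being asked about.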

Cora (she/her)13:03:52

no, it's as full as we want

Cora (she/her)13:03:36

I'll bet the server-side things can do this as well?

lread13:03:13

Thanks! The search technology we are using server side (lucene) is geared for speed. It can be tweaked programmatically in tons of ways. It has evolved over decades… and still actively maintained… and widely used. Powerful but not trivial to use. At first glance flexsearch seems a whole lot more end-user-focused. 🙂 But anyway, just wondering out loud how consistent our client vs server side matching techniques are or should be.

Cora (she/her)13:03:47

the server-side is mostly for matching project names?

lread14:03:17

Yeah, it searches on artifact-id group-id and description (the one from the pom).

phronmophobic16:03:42

Github repos include a list of tags/topics. It might also be possible to check poms or deps.edn files for tags. That might be an interesting addition to search at some point.

lread16:03:27

Interesting. Right now we only document libs from clojars, but we’ve been thinking about how to include source-based libs (hosted only on github for example) https://github.com/cljdoc/cljdoc/issues/459. One open question I have on that is how we might rank search results. For clojars we are about to do so by download count, but for a git repo, not sure. Maybe github stars?

phronmophobic16:03:02

Stars and follows seem reasonable. Since many clojure libraries are available on clojars, I was thinking it would be interesting to cross reference downloads with stars and see if there's a useful correlation between the two that would allow us to convert between the two for ranking.

lread16:03:59

Also interesting.

phronmophobic21:04:35

it's very noisy

phronmophobic22:04:15

Ah, so one thing that throws things off is that if a "popular" library depends on an "unpopular" library, you get things like https://clojars.org/crypto-equality with only 20 stars but 14,949,372 Downloads 🤣

lread22:04:10

Ah… right. The unsung heroes!

phronmophobic22:04:21

I wonder what the data looks like if you somehow credit "unpopular" libraries with the downloads of their dependents

phronmophobic22:04:08

so crypto-equality's star count would have an effective star count of all its dependents
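One possible reading of this idea, sketched in Python: a library's effective star count is its own stars plus the stars of every library that depends on it, directly or indirectly, each counted once. The library names and numbers below are made up for illustration and are not real clojars or GitHub data. The function names are also assumptions.

```python
def transitive_dependents(lib, dependents):
    """Return all libs that depend on lib, directly or indirectly.

    dependents maps a lib to the list of libs that depend on it directly.
    """
    seen = set()
    stack = [lib]
    while stack:
        current = stack.pop()
        for dependent in dependents.get(current, ()):
            # The seen set also guards against dependency cycles.
            if dependent not in seen and dependent != lib:
                seen.add(dependent)
                stack.append(dependent)
    return seen


def effective_stars(lib, stars, dependents):
    """Own stars plus the stars of every transitive dependent."""
    return stars.get(lib, 0) + sum(
        stars.get(d, 0) for d in transitive_dependents(lib, dependents)
    )


# Hypothetical toy data: web-lib depends on tiny-util,
# and app-framework depends on web-lib.
stars = {"tiny-util": 20, "web-lib": 3000, "app-framework": 500}
dependents = {"tiny-util": ["web-lib"], "web-lib": ["app-framework"]}
```

Here `effective_stars("tiny-util", stars, dependents)` credits the small utility with the stars of both libraries built on top of it. Whether summing, averaging, or weighting by depth is the right choice is an open question.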

lread22:04:07

Huh, hadn’t noticed the https://clojars.org/rewrite-clj/dependents page on clojars until just… now.

lread02:04:31

Yeah… part of google’s ranking system is how many other pages refer to a page, right? So how many libs use a lib would be an interesting indicator?

👍 1
Cora (she/her)14:03:01

is it worth investigating alternatives? those needs seem super light

lread14:03:07

It will soon rank results by clojars download count.

lread14:03:53

Maybe. But it is super-fast. Which is nice for suggest typing.

lread14:03:12

And there are probably enough lucene geeks out there who understand the tech well… I’m just not one of them… yet!

lread14:03:34

Clojars also uses lucene but exposes its query syntax to users: https://github.com/clojars/clojars-web/wiki/Search-Query-Syntax. I was considering offering that syntax, and while it makes sense for clojars, I decided it probably adds more end-user complexity than value for cljdoc.

lread14:03:44

Related tangent: We search on pom description but don’t display it in results. I personally find that kind of confusing. Any opinion?

Cora (she/her)14:03:42

so a lot of search tools offer returning the relevant section of text with the matches highlighted

Cora (she/her)14:03:49

which is a nice feature

Cora (she/her)14:03:29

in general I like it to feel obvious why something matched, that way I can assess relevance without having to click through
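A small Python sketch of the "show why it matched" idea: return a short excerpt around the first match with the matched text marked. This is a hand-rolled illustration, not how Lucene or any particular search library does highlighting. The function name, the `**` marker, and the sample description are assumptions.

```python
import re


def highlight_snippet(text, query, context=20, mark="**"):
    """Return an excerpt around the first case-insensitive match.

    The matched text is wrapped in mark. Returns None when there is no match.
    """
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    # Show an ellipsis only where the excerpt was actually cut.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return (
        prefix
        + text[start:match.start()]
        + mark + match.group(0) + mark
        + text[match.end():end]
        + suffix
    )


# Made-up description text, for illustration only.
description = "A library for rewriting code while preserving whitespace"
snippet = highlight_snippet(description, "rewriting", context=10)
```

Showing a snippet like this next to each result lets a reader judge relevance without clicking through.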

Cora (she/her)14:03:49

in this case, though, are people searching mostly for projects they already know about?

Cora (she/her)14:03:05

or is this more about project discovery?

Cora (she/her)14:03:47

it feels like the former to me but the latter definitely has some value (however someone needs to have generated docs for the description to be searched?)

Cora (she/her)14:03:30

sorry, I realize that I'm not offering a concrete opinion

Cora (she/her)14:03:44

if I had to do it myself I'd just make it match artifact and group id. I'd want to hear a use case and get some inspiration before taking it further

lread14:03:07

Yeah, I am muddling through myself, thanks for muddling with me! I feel the same way about what has matched being obvious. And I also like highlighting (which we don’t yet do but can do).

lread14:03:10

I’m guessing that we search the description for discovery. Which seems like an ok thing to do. The description comes from clojars so no need for it to have been already built by cljdoc yet. But if we are searching on it, I feel we should show it in results.
