#announcements
2021-01-14
Dainius Jocas11:01:10

https://github.com/dainiusjocas/lucene-grep Lucene-based grep-like utility compiled with GraalVM native-image. Grab a binary and tell me what you think. Cheers!

💯 6
๐Ÿ˜ 1
borkdude11:01:01

wow, awesome :)

borkdude11:01:43

If I do this in a Clojure repo:

lmgrep "select-keys" .
should this work? it turns up empty

borkdude11:01:46

I would expect it to search the dir recursively

Dainius Jocas11:01:00

the problem is with the . at the end

Dainius Jocas11:01:22

as of now the file pattern is GLOB

tvaughan11:01:37

Super cool! Would it be possible to output the "score" associated with each match?

borkdude11:01:46

the problem with glob is always: is it recursive or not? this is always different per platform

Dainius Jocas11:01:26

@U04V15CAJ if you specify ** then it is recursive
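To illustrate why "is it recursive or not?" is a fair question, here is a minimal Python sketch. It says nothing about how lmgrep itself resolves globs; it only shows that the same `**/*` pattern can match different files depending on whether the glob implementation treats `**` as recursive. The directory layout and file names are made up for the demonstration.

```python
import glob
import os
import tempfile

def make_tree(root):
    """Create a tiny hypothetical source tree for the demonstration."""
    os.makedirs(os.path.join(root, "src", "core"))
    for rel in ("a.clj",
                os.path.join("src", "b.clj"),
                os.path.join("src", "core", "c.clj")):
        with open(os.path.join(root, rel), "w") as f:
            f.write("(select-keys m [:a])\n")

def expand(root, pattern, recursive):
    """Expand a glob pattern under root and return sorted relative paths."""
    hits = glob.glob(os.path.join(root, pattern), recursive=recursive)
    return sorted(os.path.relpath(h, root) for h in hits)

with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    # With recursive=False, Python treats "**" like a plain "*".
    flat = expand(root, "**/*", recursive=False)
    # With recursive=True, "**" matches zero or more directory levels.
    deep = expand(root, "**/*", recursive=True)

print("non-recursive:", flat)
print("recursive:    ", deep)
```

In the non-recursive case only entries exactly one directory deep are matched, so the top-level file and the deeply nested file are both missed.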

borkdude11:01:37

this also doesn't return anything for me:

lmgrep  "keys" **/*

borkdude11:01:04

Oh I see:

lmgrep  "keys" "**/*"
I should quote the glob pattern

borkdude11:01:20

yes, that works, perfect

Dainius Jocas11:01:00

@U04V15CAJ yeah, put the GLOB in double quotes 😉
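The quoting issue can be reproduced without lmgrep at all. The sketch below, which assumes a POSIX `sh` is available, uses `printf` as a stand-in program and prints the arguments it actually receives. Unquoted, the shell expands the pattern before the program starts; quoted, the program gets the pattern itself and can interpret it on its own. The file names are hypothetical.

```python
import os
import subprocess
import tempfile

def argv_seen(command, cwd):
    """Run command through the system shell and return what printf received, one argument per line."""
    result = subprocess.run(command, shell=True, cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "src"))
    for rel in ("a.clj", os.path.join("src", "b.clj")):
        open(os.path.join(root, rel), "w").close()

    # Unquoted: the shell replaces the pattern with matching paths.
    unquoted = argv_seen(r"printf '%s\n' **/*", root)
    # Quoted: the pattern reaches the program untouched.
    quoted = argv_seen(r"""printf '%s\n' "**/*" """, root)

print("unquoted:", unquoted)
print("quoted:  ", quoted)
```

With the unquoted form, a tool that expects a single glob argument would instead see a list of already expanded paths, which would explain the empty result.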

Dainius Jocas11:01:04

@U0P7ZBZCK as of now it is not supported, but there is a Class in Lucene that does just that, so it is possible

๐Ÿ‘ 1
Dainius Jocas12:01:35

@U04V15CAJ for code search I'd suggest to specify the letter tokenizer, because the default analyzer doesn't split text on ., which is a bit unexpected IMO, e.g. lmgrep --tokenizer=letter "select-keys" "**.*"
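A rough sketch of what a letter tokenizer does, written in plain Python rather than with Lucene: a token is a maximal run of letters, so `.`, `/` and `-` all act as separators. The regular expression is an approximation of Lucene's letter test, and the whitespace tokenizer is included only as a contrast; neither function is lmgrep's or Lucene's actual implementation.

```python
import re

# A run of Unicode letters: word characters minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")

def letter_tokens(text):
    """Split text into maximal runs of letters, dropping everything else."""
    return LETTERS.findall(text)

def whitespace_tokens(text):
    """Split on whitespace only, for comparison."""
    return text.split()

code = "(clojure.core/select-keys m [:a])"
print(letter_tokens(code))
print(whitespace_tokens(code))
```

With letter tokens, a query for `select-keys` becomes the two terms `select` and `keys`, which can then match inside a qualified name such as `clojure.core/select-keys`.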

borkdude12:01:51

yeah. it would be cool if the score was returned as @U0P7ZBZCK suggests and EDN output would also be nice, so you could sort the results (e.g. pipe the results to babashka and then do some processing)

Dainius Jocas12:01:05

@U0P7ZBZCK, @U04V15CAJ, I agree that it would be nice to sort on score, but could you give me a hint of how you would like the output to look?

borkdude12:01:20

probably just maps with :file, :line, :column, the line :text (optionally) and :score?

borkdude12:01:51

I would just output the maps on the fly, streaming, not wrapped inside a collection

borkdude12:01:06

maybe one map on each line
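To make the suggestion concrete, here is a small Python sketch of "one EDN map per line". The keys come from the message above; the values, the helper names and the exact formatting are assumptions for illustration only, not lmgrep's actual output.

```python
import json

def edn_value(value):
    """Render a string or number as an EDN literal."""
    if isinstance(value, str):
        # JSON string escaping is close to EDN's for ordinary text.
        return json.dumps(value, ensure_ascii=False)
    return repr(value)

def edn_line(match):
    """Render one match as a single-line EDN map with keyword keys."""
    pairs = (f":{key} {edn_value(val)}" for key, val in match.items())
    return "{" + ", ".join(pairs) + "}"

# Hypothetical matches; a real tool would produce these while scanning files.
matches = [
    {"file": "src/core.clj", "line": 12, "column": 4, "score": 0.25,
     "text": "(select-keys m [:a])"},
    {"file": "src/util.clj", "line": 3, "column": 1, "score": 0.5,
     "text": "(keys m)"},
]

for m in matches:
    # Each map is printed as soon as it is available, not wrapped in a vector.
    print(edn_line(m))
```

Because each line is a complete map, a downstream script can read the stream line by line and sort or filter on `:score` without waiting for a closing bracket.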

Dainius Jocas12:01:07

Got it. So I imagine it will be something like lmgrep --with-score "query" GLOB, i.e. under a flag

tvaughan12:01:10

Assuming compatibility with grep isn't a concern and results are sorted by score: [SCORE]:[FILE_PATH]:[LINE_NUMBER]:[LINE_WITH_A_COLORED_HIGHLIGHT] . I personally don't have much of a preference. I could awk/cut the output easily enough. As @U04V15CAJ suggests, edn output would be super helpful

๐Ÿ‘ 1
borkdude12:01:55

@U0FT43GKV Maybe you can make this more flexible by allowing a --columns argument with a comma separated list of options, which also determines the order

borkdude12:01:57

or even better, a template:

--template "{{score}}:{{file}},{{line}}:{{column}}:{{text}}"
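A template option like this could be rendered with a few lines of code. The Python sketch below is only an illustration of the idea: the placeholder syntax is taken from the message above, while the `render` helper and the sample match are hypothetical.

```python
import re

# Matches placeholders such as {{score}} or {{colored-text}}.
PLACEHOLDER = re.compile(r"\{\{([a-z-]+)\}\}")

def render(template, match):
    """Replace each {{name}} in template with the corresponding field of match."""
    return PLACEHOLDER.sub(lambda m: str(match[m.group(1)]), template)

# A made-up match for demonstration purposes.
match = {"score": 0.25, "file": "src/core.clj", "line": 12, "column": 4,
         "text": "(select-keys m [:a])"}

# The template string exactly as written in the message above.
template = "{{score}}:{{file}},{{line}}:{{column}}:{{text}}"
print(render(template, match))
```

In this sketch an unknown placeholder raises a `KeyError`, which surfaces typos in the template early; a real tool might prefer a friendlier error message.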

borkdude12:01:29

and you can have {{text}} or {{colored-text}} if you want one or the other

borkdude12:01:48

or maybe --no-colors should just be an option

Dainius Jocas12:01:50

Yeah, I was thinking about a template or a pattern as an option 👍 I left it out for the first iteration

borkdude12:01:47

I support something similar in clj-kondo

Dainius Jocas12:01:46

Nice! I'll shamelessly copy it as much as possible 😄

Dainius Jocas14:01:55

It was not complicated 😄

Dainius Jocas14:01:58

However, the issue is that with the default Lucene I can get either scoring or highlighting :facepalm:

Dainius Jocas14:01:46

I have to implement a class that does both 🙂

Dainius Jocas14:01:46

@U0P7ZBZCK the feedback is welcome 😉

tvaughan15:01:04

🙂 Scoring is more important to me. And normalization of scores too. From my prior experience with Elasticsearch, I remember that scores across indices were not comparable. I'm hoping scores across different files are 🤞 Again, thanks for taking the time to create lmgrep, @U0FT43GKV

Dainius Jocas15:01:33

Yeah, with Elasticsearch scores are not comparable, not only between indices but also between fields within an index 🙂

Dainius Jocas15:01:22

Scoring in lmgrep comes with gotchas. As of now, every line is scored separately. Every line is treated as a document with one field. A temporary index is created with that one document. Then the query is run against that temporary index.
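The per-line approach described here can be sketched in a few lines of Python. This is a toy model, not lmgrep's code and not Lucene's scoring formula: the "temporary index" is just a term counter for a single line, and the score is the share of the line's tokens that match a query term. All function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def score_line(query, line):
    """Index one line on its own, then score the query against that one-document index."""
    index = Counter(tokenize(line))   # the temporary single-document "index"
    length = sum(index.values())
    if length == 0:
        return 0.0
    # Toy score: fraction of the line's tokens that are query terms.
    return sum(index[term] for term in tokenize(query)) / length

def grep_scored(query, lines):
    """Score every line independently and return matches, best score first."""
    hits = []
    for number, line in enumerate(lines, start=1):
        score = score_line(query, line)
        if score > 0:
            hits.append((score, number, line))
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))

lines = ["(select-keys m [:a :b])", '(println "hello")', "keys keys"]
for score, number, line in grep_scored("keys", lines):
    print(f"{score:.2f}:{number}:{line}")
```

One consequence of indexing each line in isolation is that the score can only depend on that line and the query; statistics about the rest of the file or corpus are not available to it.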

tvaughan15:01:32

Interesting. Thanks for sharing these details

Dainius Jocas11:01:57

I plan to write a blog post on the details in the coming week

alexmiller17:01:08

It's that time of year again - the survey at https://www.surveymonkey.com/r/clojure2021 is now open! We would love to get feedback from all Clojure/ClojureScript/ClojureCLR users. It takes < 10 minutes and we release all the data. Please share with your colleagues who might not be seeing it in forums like these.

party-corgi 27
🎉 18
✅ 11
📜 1
mynomoto17:01:04

I wonder if babashka should be a dialect option.

💯 4
alexmiller17:01:05

I will make a note to consider for next year

๐Ÿ‘ 9
dharrigan17:01:13

May I suggest something too, or would you prefer another way of suggesting an addition?

dharrigan17:01:23

Could you add "Insurance" as a sector/industry for next year?

dharrigan17:01:34

huuuge area 🙂

โ˜๏ธ 1
alexmiller17:01:25

please add that as an Other response - I look at those every year and anything with high responses I add for the next year

alexmiller17:01:01

Other for that particular question that is

alexmiller17:01:08

I review all of those from prior year

alexmiller17:01:47

Insurance was only mentioned 8 times last year in the other responses

otfrom17:01:44

if only there was some way you could have been ready for the risk and claimed some kind of compensation (sry, sry)

trollface 7
pez18:01:46

Need better tutorials / guides. For me it is rather “Need more tutorials / guides”.

p-himik05:01:25

Seems like I don't know something but it's hard to find out about it. Q24 lists both "Browsers" and "Chromium". Is there something named "Chromium" that's not a browser? Or was the intention to figure out how many people target Chromium specifically? If it's the latter, then why is knowing that important?

dgb2310:01:01

I also think babashka should be in there! Maybe even clojerl?

alexmiller13:01:09

as mentioned above, added babashka for consideration next year. I don't think anyone is actually using clojerl in anger.

alexmiller13:01:10

@U2FRKM4TW I think Chromium can be used independently as a component? David Nolen requested that, can't remember now why

p-himik15:01:43

@U050B88UR Could you please comment on the above? I'm genuinely interested but can't find any information.