#datomic
2017-10-24
Matt Butler14:10:26

Are there any security/injection concerns when using the fulltext feature of Datomic?

souenzzo18:10:50

be careful with those characters #{"\t" "\n" "\r" " " "!" "\"" "(" ")" "*" "+" "-" ":" "?" "[" "\\" "]" "^" "{" "}" "~"}

souenzzo18:10:17

They may cause a ParseException: '*' or '?' not allowed as first character in WildcardQuery com.datomic.lucene.queryParser.QueryParser.getWildcardQuery (QueryParser.java:982)

souenzzo18:10:47

Is there some place in the docs that says that #{"\t" "\n" "\r" " " "!" "\"" "(" ")" "*" "+" "-" ":" "?" "[" "\\" "]" "^" "{" "}" "~"} could break fulltext searches? It's in my code.

favila18:10:05

@mbutler The string given to the fulltext function is actually Lucene query syntax: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

favila18:10:48

there's no injection danger, but it is a minilanguage and may not match a user-facing expectation

favila18:10:01

I also don't know a reliable way of escaping characters in that syntax

favila18:10:46

ah, this doc says just prefix with backslash

favila18:10:51

(last section)
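
A minimal sketch of the backslash-prefix approach, in case it helps. The helper name `escape-fulltext` and the var `lucene-special-chars` are hypothetical, and the character set is simply the non-whitespace characters from the list quoted earlier in this thread, not an official Datomic or Lucene list (the Lucene doc also treats `&&` and `||` as special, which this sketch does not cover).

```clojure
;; Characters to escape. Assumption: taken from the list quoted in this
;; thread (whitespace left out); not an authoritative list.
(def lucene-special-chars
  #{\! \" \( \) \* \+ \- \: \? \[ \\ \] \^ \{ \} \~})

(defn escape-fulltext
  "Hypothetical helper: returns s with a backslash placed before every
  character found in lucene-special-chars."
  [s]
  (apply str
         (mapcat (fn [c]
                   (if (lucene-special-chars c)
                     [\\ c]   ; backslash, then the original character
                     [c]))    ; ordinary characters pass through unchanged
                 s)))
```

The escaped string would then be passed as the search argument to fulltext instead of the raw user input.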

Matt Butler18:10:17

Thanks, @souenzzo @favila. I knew about the Lucene query syntax; I rely on it to handle searching of emails, as the tokenizer seems to split on @ (e.g. I turn " " => "+user +"). Just wanted to check that the worst thing that can happen is a ParseException, rather than any security concern. As I understand it, it's still not possible to change the tokenizer's settings or use a different tokenizer, but do you know if it's possible to escape the input so that I can get the full string "" into the index?

favila18:10:26

All I can think is munge it to something that tokenizes the way you want

favila18:10:57

however, what kind of query are you doing? Sounds like exact-match? in which case why use fulltext at all?

Matt Butler18:10:51

I was just using the exact match as the most clear example

favila18:10:43

you could make your datomic query try exact match (normal indexed field), and use that to boost scores

Matt Butler18:10:12

yeah, I considered/was doing that at some point. Gets a bit messy since I allow a variable number of fields to constrain the search, and since I do that it's not a big deal that I have to treat email a bit oddly

Matt Butler18:10:36

Was just hoping that there was some easy answer to the tokenizer problem 🙂

Matt Butler18:10:45

Also considered doing as you said and storing a "normalised" version of email, but it seemed like more tech debt than it was worth. If the current implementation proves too poor a UX, I'll probably move to doing that.

Matt Butler18:10:53

Thanks for the advice btw 🙂

favila18:10:01

don't forget about query rules to abstract some of this. e.g.:

'[[(email-search [?email] ?e ?score)
   [?e :email-attr ?email]
   [(ground 2.0) ?score]]
  [(email-search [?email] ?e ?score)
   [(fulltext $ :email-attr ?v ?score) [[?e ?v _ ?score]]]]]

favila18:10:17

(this is the "score-boosting" approach I was talking about)

favila18:10:38

You can also compare ?v to the original search and infer something

Matt Butler18:10:05

Trying to model the behaviour of the query in this case. If you invoke this rule once, it's a logical or, right? So in the case where the exact match returned, it would bind that ?e and "exit early"?

favila18:10:09

It would still try both, but you can aggregate :find (max ?score) ?e to dedup and make exact matches float higher
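
A rough sketch of that aggregation. The query is shown only as plain data: running it needs a Datomic database plus the rules above, and the var name `email-search-query` is made up for illustration. The function `max-score-per-entity` is a hypothetical pure-Clojure stand-in that mimics what `(max ?score)` does per `?e`, so the dedup behaviour can be seen without a database.

```clojure
;; Query as data (assumption: the rules above are passed in as %).
(def email-search-query
  '[:find ?e (max ?score)
    :in $ % ?email
    :where (email-search ?email ?e ?score)])

(defn max-score-per-entity
  "Hypothetical illustration: given [entity score] rows, possibly with the
  same entity coming from both rule branches, keeps the highest score per
  entity."
  [rows]
  (reduce (fn [acc [e score]]
            ;; first sighting of e stores score; later ones keep the max
            (update acc e (fnil max score) score))
          {}
          rows))
```

With rows such as `[[1 2.0] [1 0.7] [2 0.4]]` (made-up entity ids and scores), entity 1 appears once with the exact-match score 2.0 rather than twice.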

Matt Butler18:10:30

and not try to do the fulltext.

dominicm18:10:57

I thought lucene was an implementation detail though. It would be nice if fulltext escaping was provided by datomic, so you didn't have to depend on this.

souenzzo19:10:04

awesome @dominicm

(com.datomic.lucene.queryParser.QueryParser/escape "|&&|")
=> "\\|\\&\\&\\|"
Maybe datomic.api could wrap this function (d/fulltext-escape s)

dominicm20:10:36

@souenzzo more or less what I was thinking, yup

dominicm20:10:00

No specification of the engine necessary, but a stable API.

alexisvincent23:10:41

How are folks achieving ordering on cardinality many attributes?