Matt Butler 14:10:26

Are there any security/injection concerns when using the fulltext feature of Datomic?


Be careful with these characters: #{"\t" "\n" "\r" " " "!" "\"" "(" ")" "*" "+" "-" ":" "?" "[" "\\" "]" "^" "{" "}" "~"}


They may cause a ParseException: '*' or '?' not allowed as first character in WildcardQuery (thrown from com.datomic.lucene.queryParser.QueryParser.getWildcardQuery)


Is there some place in the docs that says that #{"\t" "\n" "\r" " " "!" "\"" "(" ")" "*" "+" "-" ":" "?" "[" "\\" "]" "^" "{" "}" "~"} could break fulltext searches? It's in my code.


@mbutler The string given to the fulltext function is actually Lucene query syntax:


there's no injection danger, but it is a mini-language and may not match user-facing expectations


I also don't know a reliable way of escaping characters in that syntax


ah, this doc says just prefix with backslash


(last section)
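A minimal sketch of that backslash-prefixing approach in Clojure (the helper name and the exact character set are illustrative assumptions, not a Datomic API):

```clojure
;; Hypothetical helper: prefix each Lucene special character with a
;; backslash before handing the string to fulltext. Illustrative only;
;; the set below approximates Lucene's documented special characters.
(def lucene-special-chars
  #{\\ \+ \- \! \( \) \: \^ \[ \] \" \{ \} \~ \* \? \| \& \/})

(defn escape-lucene
  "Backslash-escape Lucene query-syntax special characters in s."
  [s]
  (apply str (mapcat (fn [c]
                       (if (lucene-special-chars c) [\\ c] [c]))
                     s)))

;; (escape-lucene "a+b") ;=> "a\\+b"
```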

Matt Butler 18:10:17

Thanks, @souenzzo @favila. I knew about the Lucene query syntax; I rely on it to handle searching of emails, as the tokenizer seems to split on @ (e.g. I turn " " => "+user +"). Just wanted to check that the worst thing that can happen is a ParseException, rather than any security concern. As I understand it, it's still not possible to change the settings of, or use a different, tokenizer, but do you know if it's possible to escape the input so that I can get the full string "" into the index?


All I can think of is to munge it into something that tokenizes the way you want
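For example, a hedged sketch of that munging idea — the replacement marker is an arbitrary choice, and it would have to be applied consistently at both write time and query time:

```clojure
(require '[clojure.string :as str])

;; Replace the character the tokenizer splits on with a marker that
;; survives tokenization, so an email address stays one token.
;; "_AT_" is purely illustrative, not a Datomic convention.
(defn munge-email [email]
  (str/replace email "@" "_AT_"))

;; (munge-email "user@example.com") ;=> "user_AT_example.com"
```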


However, what kind of query are you doing? Sounds like exact match? In which case, why use fulltext at all?

Matt Butler 18:10:51

I was just using the exact match as the clearest example


you could make your datomic query try exact match (normal indexed field), and use that to boost scores

Matt Butler 18:10:12

Yeah, I considered/was doing that at some point. It gets a bit messy since I allow a variable number of fields to constrain the search, and since I do that it's not a big deal that I have to treat email a bit oddly

Matt Butler 18:10:36

Was just hoping that there was some easy answer to the tokenizer problem 🙂

Matt Butler 18:10:45

Also considered doing as you said and storing a "normalised" version of the email, but it seemed like more tech debt than it was worth. If the current implementation proves too poor a UX, I'll probably move to doing that.

Matt Butler 18:10:53

Thanks for the advice btw 🙂


don't forget about query rules to abstract some of this. e.g.:

'[[(email-search [?email] ?e ?score)
   [?e :email-attr ?email]
   [(ground 2.0) ?score]]
  [(email-search [?email] ?e ?score)
   [(fulltext $ :email-attr ?email) [[?e ?v _ ?score]]]]]


(this is the "score-boosting" approach I was talking about)


You can also compare ?v to the original search and infer something

Matt Butler 18:10:05

Trying to model the behaviour of the query in this case. If you invoke this rule once, it's a logical OR, right? So in the case where the exact match returned, it would bind that ?e and "exit early"?


It would still try both, but you can aggregate with :find (max ?score) ?e to dedupe and make exact matches float higher
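Putting that together, a sketch of the aggregating query that invokes the email-search rule quoted earlier (:email-attr and the bound values are illustrative, not verified against a real schema):

```clojure
(require '[datomic.api :as d])

;; Sketch: run the email-search rule and keep only the best score per
;; entity, so the 2.0 exact-match score floats above fulltext scores.
;; `rules` holds the quoted rule set from the example above.
(d/q '[:find ?e (max ?score)
       :in $ % ?email
       :where (email-search ?email ?e ?score)]
     db rules "user@example.com")
```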

Matt Butler 18:10:30

and not try to do the fulltext.


I thought lucene was an implementation detail though. It would be nice if fulltext escaping was provided by datomic, so you didn't have to depend on this.


awesome @dominicm

(com.datomic.lucene.queryParser.QueryParser/escape "|&&|")
=> "\\|\\&\\&\\|"
Maybe datomic.api could wrap this function: (d/fulltext-escape s)


@souenzzo more or less what I was thinking, yup


No specification of the engine necessary, but a stable API.


How are folks achieving ordering on cardinality-many attributes?