This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2020-12-09
Are there any major drawbacks to primarily using transaction functions to modify data? I'm coming from an attribute-accumulating system instead of a document-based one, and it would make my life simpler to emulate that via a `:merge` or `:patch` function
Hi 🙂 in general the main drawback is in creating small bottlenecks that accumulate, as each invocation will add a small end-to-end latency & throughput cost vs just doing everything via `:crux.tx/put`. Also, there will be increased churn and network back-and-forth to the doc store due to how the implementation works (this also means the latency increase will be sensitive to which doc store you end up using). If it's of any interest, I'd be happy to assist with analysing the performance for your specific scenario/data.
As for the desire to use `merge` or `patch` semantics - it's certainly a plausible option to use transaction functions ubiquitously like that, but I think you must be conscious of what it then means to apply such operations retroactively, as you need to consider whether it's important in your application for a merge to be able to cascade and update all "future" versions in an entity's timeline. I believe such a retroactive/cascading merge or patch would need to maintain additional metadata inside the document, but it's been a while since I thought about it last and I've not yet seen anyone attempt an implementation. Again, I'd be happy to help break new ground if you're interested to try!
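For context, a non-cascading `merge` built on transaction functions might be sketched roughly like this (a minimal sketch against the Crux ~1.12 transaction-function API; the `:merge` function name, entity id, and attribute values are all illustrative):

```clojure
(require '[crux.api :as crux])

;; Install a transaction function as a document. Inside the fn,
;; ctx gives access to the db as-of the transaction, and the
;; return value is a vector of further tx ops to apply.
(crux/submit-tx node
  [[:crux.tx/put
    {:crux.db/id :merge
     :crux.db/fn '(fn [ctx eid m]
                    (let [db (crux.api/db ctx)
                          entity (crux.api/entity db eid)]
                      ;; merge onto the current version (or create it)
                      [[:crux.tx/put (merge {:crux.db/id eid} entity m)]]))}]])

;; Usage: accumulate attributes onto an entity, Datomic-style
(crux/submit-tx node [[:crux.tx/fn :merge :my-entity {:new-attr 42}]])
```

Note this only patches the "current" version; the retroactive/cascading behaviour discussed above would need extra book-keeping on top.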
Yeah, I’m definitely down to try it out. I would want to handle retroactive edits correctly, though for now I’ve just ignored it because it seemed like a rabbit hole.
I assume the extra doc store latency comes from having to fetch/modify/insert the patched document which makes sense. Would something like the s3 doc store’s cache help minimize it?
Additionally, the latency increase is only on write and index build time, but not query time correct?
We already have a doc store cache in place that should definitely help for all these concerns, yep, though the process of getting it warmed up via s3 might be less than ideal for your needs, it's hard to say without getting into real-world analysis.
> Additionally, the latency increase is only on write and index build time, but not query time correct?
For the most part this is correct, however certain places within a query may require fetching the source document (which, again, is behind a cache), and these especially include `eql/project` and any predicates that require decoding large raw values (e.g. large strings or byte arrays), as the indexes only hold small values
What were your previous thoughts about cascading edits? I assume you'd perform the cascade inside the transaction context itself, but I feel like there's a lot of hard choices on things like timestamps or encountering a `put` operation instead of a `patch`
> I assume you'd perform the cascade inside the transaction context itself
Yep definitely. I remember thinking that you would need to keep track of which `patch` operations updated each and every attribute, something like maintaining a map of `{[patch-vt-time patch-tt-time] attr-name ...}` within every version of the document. Attempting to mix `put` operations as well might work but I haven't yet considered it 🙂
Can you tell me a little about the use-case? e.g. are you expecting to routinely have thousands of updates to each entity?
> Attempting to mix `put` operations as well might work
I think you can kinda split documents and their updates into "generations" like this, where a `put` would terminate a generation and start one anew.
Why would having a map of attribute timestamps be necessary? Wouldn't the update process just be rewriting history from the update point by applying successive merges, then `put`ting that updated document in under the same valid time?
Hmm. Thinking about it fresh, maybe `patch` just needs to be able to accept an optional valid-time-start and valid-time-end (like `put`/`delete`), so the user can be very explicit and there is no book-keeping / generation marker required. In the extreme case you could have a single function updating multiple attributes with different regions of the timeline, like `[:crux.tx/fn :multi-patch :my-entity [:foo vt-start vt-end] [:bar vt-end] [:baz]]`
The last time I was reflecting on this topic deeply was about a year ago, in a busy few days, before we had the current incarnation of transaction functions available, and the start/end semantics for `put` & `delete` were a little different... I may well have over-complicated this in my head 🙂
You might also be interested in these datom-like transaction functions I was working on a few months ago: https://gist.github.com/refset/a00be06443bc03ccc84a2874af3cdb8a
I'll see what I can put together regarding `patch`; I think I've got at least a reasonable idea of how it could work.
That's a super cool gist as well. Has there been any action around adding datom-style (or even just patch as a tx type) interfaces to crux itself? I'm sure it would be quite the undertaking.
Cool, please do keep us updated!
> Has there been any action around adding datom-style (or even just patch as a tx type) interfaces to crux itself?
Mostly no, but `:crux.tx/patch` is a maybe: https://github.com/juxt/crux/issues/462 🙂 (feel free to leave a comment if you like!)
We're not particularly motivated to emulate complete datom-style semantics in the core API, for a variety of reasons:
• it's not clear to us that datoms represent "the best possible model" so we don't want to push users in that direction without deeper consideration
• I think you can build everything needed in userspace easily enough, thanks to transaction functions
• the document-datom impedance mismatch has a non-trivial space & performance cost, and a first-class datom layer would probably be inappropriate as a default choice for many high-throughput use-cases
• as you alluded to, it's not something we would design & build in an afternoon, sadly, and we have higher engineering priorities for the foreseeable future
That said, I'd be up for revisiting the code in that gist soon and turning it into something more consumable (I even wrote a few tests to go with it!).
In the future we would certainly like to provide more opinionated "built-in" layers for modelling on top of Crux. I have been semi-casually researching the possibility of using some combination of malli + Alloy to model & generate an enforceable schema. This might be of interest: https://www.hillelwayne.com/post/formally-modeling-migrations/
I'm trying to follow the crux + confluent cloud blog, but it seems a bit dated. I've got the API key and secret and a configuration file from the "Tools & client config" tab. But I'm getting `Timed out waiting for a node assignment.`
Does this look okay?
(defonce node
  (crux/start-node
   {:crux/index-store {:kv-store {:crux/module 'crux.lmdb/->kv-store
                                  :db-dir "/tmp/lmdb"}}
    :crux/document-store {:crux/module 'crux.kafka/->document-store
                          :kafka-config {:kafka-properties-file "/my-project/kafka.properties"}
                          :doc-topic-opts {:topic-name "crux-docs"}}
    :crux/tx-log {:crux/module 'crux.kafka/->tx-log
                  :kafka-config {:kafka-properties-file "/my-project/kafka.properties"}
                  :tx-topic-opts {:topic-name "crux-transaction-log"}}}))
@U09MR0T5Y ah so the post on http://juxt.pro is old, and our blog section on http://opencrux.com looks like it hasn't deployed properly but here's the asciidoc source with the latest config you need (I checked < 2 weeks ago) 🙂
Fixed, it's available up here, in full technicolor, again now: https://www.opencrux.com/blog/crux-confluent-cloud.html
Awesome! It works, seems my queries are timing out though. Maybe it's still indexing, but htop seems to say it's pretty quiet.
For my app I add a dictionary to the database (literally a dictionary, with words and meanings as entities). It's around 60MB.
You can increase the default query timeout as a query parameter like this: https://github.com/juxt/crux/blob/master/crux-test/test/crux/query_test.clj#L3403-L3406
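For reference, the timeout can be passed directly in the query map (a sketch — `:timeout` is in milliseconds, and the `:find`/`:where` clauses here are just illustrative):

```clojure
;; allow up to 60s before the query is cancelled
(crux/q (crux/db node)
        '{:find [?uid]
          :where [[?uid :material-overview/id]]
          :timeout 60000})
```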
I am trying to run my api on a 2vCPU 4GB ram gce machine, but I'm guessing that's a bit too small. I could barely fit the ingesting after playing with the heapsize for a while (would run out of heap) and by splitting up the ingesting into smaller chunks and awaiting the transaction.
Is the "60MB" when compressed as a `.zip` or something? Or is that raw edn? If that's raw edn then I'm a little surprised it exhausted the JVM memory so easily
As for LMDB, do you remember the last thing you ran that might have triggered the segfault? Usually that kind of thing only happens when you fail to run `(.close node)` or something. If it happened when you attempted to call `(c/db` or `(c/q` I'd quite like to figure out why
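As a sketch of that shutdown pattern: a Crux node implements `java.io.Closeable`, so `with-open` guarantees `.close` runs even if the body throws (the empty config here starts an in-memory node; a real config map would go in its place):

```clojure
(require '[crux.api :as crux])

;; with-open calls (.close node) on exit, even on exceptions,
;; which lets a KV backend like LMDB/RocksDB shut down cleanly
(with-open [node (crux/start-node {})]
  (crux/submit-tx node [[:crux.tx/put {:crux.db/id :foo}]])
  (crux/sync node))
```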
Double checked, it's even smaller actually - the raw json is only 23MB. But that gets made more bulky with edn and dividing up into words -> meanings (120,000 entities/documents? with 4-5 fields)
ah hmm, okay, well in any case Rocks is the right solution for anything larger than my emacs buffer can handle 🙂
the commands I'm using are `start-node`, `db`, `q`, `entity`, `submit-tx`, and in my query I use `eql/project`
Ugh, emacs buffers can be so frustrating. I use `cider-eval-last-sexp-and-replace` quite often, and when that gets too big emacs will just freeze and be annoying with the savefile after I force the restart. Usually I open up sublime text just to remove the offending line!
hmm, trying RocksDB now, and although the node starts, and `c/sync` returns a timestamp, the query still times out. I'll try with a new `tx-topic` and `doc-topic` and reingesting..
I deeply sympathise. I was learning about `pkill -USR2 emacs` for the first time yesterday!
can you post the query? it might be scanning over everything and generating Cartesian products
yeah I can post the query, but on my local machine with everything in mem it only took a couple ms.
{:find
['(eql/project
?uid
[#:material{:own-material
[:material/id
:material/name
#:material{:image
[:file/id
:file/uploaded-on
:file/sha
:file.sha/url
:file.sha/filename
:file/js-file]}
#:material{:extraction
[:extraction/id
:extraction/source
#:extraction{:sentences
[:sentence/from
:sentence/to
:sentence/source
#:sentence{:words
[:lex/lex
:lex/id
:lex/begin
:lex/length
#:lex{:dict-entries
[:dict-entry/id
#:dict-entry{:meanings
[:dict-entry.meaning/id
:dict-entry.meaning/english-short
:dict-entry.meaning/english-expansive]}
:dict-entry/korean-description
:dict-entry/english-expansive
:dict-entry/type
:dict-entry/hanja
:dict-entry/korean
:dict-entry/english-short
:dict-entry/word-type]}
#:lex{:ranked-meanings
[:ranked-meaning/id
#:ranked-meaning{:meaning
[:dict-entry.meaning/id
:dict-entry.meaning/english-short
:dict-entry.meaning/english-expansive]}
:ranked-meaning/rank]}
#:lex{:morphs
[:morph/id
:morph/lex
:morph/tag
:morph/begin
:morph/length]}]}
:sentence/id]}]}]}])],
:where '[[?uid :material-overview/id]],
:args [{'?uid uid}]}
wow, thanks 🙂 will digest this in a sec. Can you show me the response from `(c/status node)` on the gce machine? I want to confirm Rocks is looking healthy
{:crux.doc-log/consumer-state nil, :crux.tx-log/consumer-state {"doc-4-0" {:next-offset 120003}}, :crux.version/version "20.09-1.12.1-beta", :crux.index/index-version 13, :crux.version/revision nil, :crux.kv/estimate-num-keys 3477218, :crux.zk/zk-active? true, :crux.kv/kv-store "crux.rocksdb.RocksKv", :crux.kv/size 138220186}
I'm very excited to finally get a version of my project online and I think crux is a great fit.
(def ^crux.api.ICruxAPI node
  (crux/start-node
   {:crux.kafka/kafka-config {:bootstrap-servers "pkc-l6ojq.asia-northeast1.gcp.confluent.cloud:9092"
                              :properties-file "/home/baruchberger/kr/kafka.properties"} ; replace with the path of your properties file
    :crux/tx-log {:crux/module 'crux.kafka/->tx-log
                  :kafka-config :crux.kafka/kafka-config
                  :tx-topic-opts {:topic-name "tx-4"
                                  :replication-factor 3}}
    :crux/document-store {:crux/module 'crux.kafka/->document-store
                          :kafka-config :crux.kafka/kafka-config
                          :doc-topic-opts {:topic-name "doc-4" ; choose your document-topic name
                                           :replication-factor 3}}
    :crux/index-store {:kv-store {:crux/module 'crux.rocksdb/->kv-store
                                  :db-dir "/tmp/rocksdb"}}}))
ah okay, so by default kafka uses an in-memory "view" to pretend that it's a document store
yeah I think so, your gce machine just doesn't have enough memory for this to work without using Rocks here also. The `eql/project` makes heavy use of the document store (with an in-memory cache in front of it also)
using local-document-store, `du -h doc-store` is 71M; however even increasing the query timeout to 2 minutes isn't enough. htop and iotop show nearly no load (cpu/ram/disk)
I will try to put together a reproduction case for the lmdb stuff tomorrow (getting too late here!) thanks again for being so helpful.
> htop and iotop show nearly no load (cpu/ram/disk)
That's very surprising. Could you confirm that this simple query works as expected with that setup:
{:find '[?uid]
:where '[[?uid :material-overview/id]]
:limit 1}
And then this one:
{:find '[(eql/project ?uid [*])]
:where '[[?uid :material-overview/id]]
:limit 1}
doing
:where '[[?uid :material-overview/id] [?uid :material/own-material ?exists]]
did give the correct id for `?exists`.
I've been messing about with a new server because the other one was giving me weird timeouts on the ssh. I am in Seoul, using a Seoul GCE instance. But somehow my dedicated server in Germany is a lot faster and more consistent than this Seoul box..
I'm now using a machine with non-shared vCPU and 8GB of RAM. Using in memory my app works, the query in question takes around 20ms to complete.
Using this `start-node` config it still times out sadly:
{:crux.kafka/kafka-config {:bootstrap-servers "pkc-l6ojq.asia-northeast1.gcp.confluent.cloud:9092"
                           :properties-file "/home/baruchberger/kr/kafka.properties"}
 :crux/tx-log {:crux/module 'crux.kafka/->tx-log
               :kafka-config :crux.kafka/kafka-config
               :tx-topic-opts {:topic-name "tx-7"
                               :replication-factor 3}}
 :crux/document-store {:crux/module 'crux.kafka/->document-store
                       :kafka-config :crux.kafka/kafka-config
                       :doc-topic-opts {:topic-name "doc-7"
                                        :replication-factor 3}
                       :local-document-store {:kv-store {:crux/module `rocks/->kv-store
                                                         :db-dir "/home/baruchberger/kr/doc-store"}}}
 :crux/index-store {:kv-store {:crux/module 'crux.rocksdb/->kv-store
                               :db-dir "/tmp/rocksdb"}}}
@U899JBRPF anything I can test? I thought maybe I could just use the in-memory db now, since I just want my data persisted, but if I remove the `:crux/document-store` and `:crux/index-store` all seems to work until I restart the process and try to connect again. Am I right to assume that once something is transacted it should be automatically indexed next time I start a node with the same `:crux/tx-log` config?
Hi again 🙂
> This first one did work, the second one didn't.
So that's interesting, as it suggests the document store is somehow broken. Does the `(c/entity ...)` API work at least?
> if remove the `:crux/document-store` and `:crux/index-store` all seems to work until I restart the process and try to connect again
Ah, you mustn't delete the document-store, as the tx-log doesn't contain any actual data, only hashes 🙂. But yes, the index-store should be automatically re-built the next time you start a node with the same tx-log AND document-store config
Did you create these topics yourself, or did you let Crux do it for you (I'm wondering about retention settings)? Also, if it's not too much trouble, please could you check that a RocksDB-backed document-store works as expected with `entity` and `eql/project` and replay at least (i.e. forget the local-document-store and get rid of Kafka from the document-store part of the config altogether)
Reviewing your Kafka-backed document-store configuration again, though, I still can't see anything that would obviously explain the issue. Although I did notice your status map shows `:crux.doc-log/consumer-state nil`, which might be indicative of something, but I'm not sure. I'll have a think and get back to you. The next step is probably for me to try to reproduce the setup.
/cc @U050V1N74 - fyi, long thread, feel free to skip the first 50+ messages
I let crux do it. retention settings on confluent.cloud for tx, 604800000 for doc.
Thanks for the retention settings. Can you confirm roughly how long ago you submitted the transactions to those specific Confluent Cloud topics? It's possible that the low `retention.ms` config for the doc log is at fault here, we're reviewing it now.
I did start my Confluent Cloud account a long time ago, around the time the crux blog about it came out. Recently started back up.
working on the reproduction, hoping that thing happens where I figure out the issue while doing that..
Okay, in my repro case `c/entity` and `(eql/project ?uid [*])` are not failing (I did earlier test that and it failed, but not sure what happened). However I do have a clear case of the timeout (tested up to 5m) on the query when using Kafka + RocksDB that finishes very quickly with mem-node.