#datascript
2022-08-12
Athan 10:08:57

Measuring Transaction Throughput with the Datomic in-memory database and the Datalevin storage engine

Hi, I thought you might find this interesting. First, see the thread https://clojurians.slack.com/archives/C03RZMDSH/p1660170818216469?thread_ts=1660170818216469&cid=C03RZMDSH below for the experiment and the results.

1. It's worth explaining quickly why I think the Datomic in-memory client performs so badly; I expect analogous results with https://github.com/tonsky/datascript. It comes down to the data structures: to maximize in-memory write speed you need different memory management and data structures, similar to those used in pyarrow and numpy. So how about building an in-memory Datalog query engine on top of pyarrow?

2. It was relatively easy to deploy, configure and test the transaction throughput of the key-value storage engine (LMDB) behind https://github.com/juji-io/datalevin. I would expect a Datomic transactor on AWS DynamoDB Local, or https://docs.xtdb.com/storage/rocksdb/, to show similar performance. The results:

;; 5.2 sec for 23 cols x 10000 rows
;; 3.2MB books.csv file
;; Elapsed time: 5.2 secs
;; datoms inserted: 229,956
;; transaction throughput: 229,956 / 5.2 sec = 44,222 datoms/sec
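
For anyone curious, here is a minimal sketch of how such a measurement could look (not the exact benchmark code behind the numbers above; it assumes Datalevin's datascript-style API, a books.csv with a header row, and a placeholder database path, and it transacts one entity per CSV row):

(require '[datalevin.core :as d]
         '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(with-open [r (io/reader "books.csv")]
  (let [[header & rows] (csv/read-csv r)
        attrs (mapv keyword header)           ; CSV columns become attributes
        conn  (d/get-conn "/tmp/books-db" {}) ; LMDB-backed connection
        start (System/nanoTime)]
    (doseq [row rows]
      (d/transact! conn [(zipmap attrs row)])) ; one entity map per row
    (let [secs   (/ (- (System/nanoTime) start) 1e9)
          datoms (* (count attrs) (count rows))]
      (println (format "Elapsed: %.1f secs, %d datoms, %.0f datoms/sec"
                       secs datoms (/ datoms secs))))
    (d/close conn)))

With 23 columns and 10,000 rows this yields roughly the 230,000 datoms reported above.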
I have run into similar problems in the past when testing the write performance of the Redis KV storage engine. All these KV engines (Redis, DynamoDB, LMDB) are very good at point queries, but they perform really badly when you want to write (import) a big volume of data. You may argue that write performance is not critical for a transactional (OLTP) DBMS, but it becomes super important when you want to import your data from another system/project, integrate a big volume of data from other sources, or do analytics without adding another storage engine.

In fact, what we are discussing here is the price you pay for a flexible, universal data model based on EAV/RDF triples. It's a similar story when you try to build a relational, tuple-based data model on top of a KV storage engine or an object-like in-memory structure (Python/Clojure). The physical layout must be appropriate for such a data model, and the best candidate I have found in my own research and experiments is a columnar layout. So why not add a Datalog query engine on top of a columnar database engine based on LSM-tree data structures, such as ClickHouse or SingleStore (MemSQL)?
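
To make the columnar idea concrete, here is a tiny illustrative sketch in plain Clojure (no real engine, just the layout): the same entity maps decomposed into three parallel :e/:a/:v vectors, which is the shape a columnar store can scan and compress cheaply:

(defn rows->eav-columns
  "Decompose entity maps into parallel :e/:a/:v column vectors."
  [rows]
  (reduce (fn [cols [eid row]]
            (reduce-kv (fn [c a v]
                         (-> c
                             (update :e conj eid)
                             (update :a conj a)
                             (update :v conj v)))
                       cols
                       row))
          {:e [] :a [] :v []}
          (map-indexed vector rows)))

;; (rows->eav-columns [{:title "Dune" :year 1965}])
;; => {:e [0 0], :a [:title :year], :v ["Dune" 1965]}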

👀 1