2022-08-10
Measuring Transaction Throughput with a Datomic in-memory database
I am experimenting with the goodbooks-10k dataset (https://github.com/cognitect-labs/day-of-datomic-cloud/tree/master/datasets/goodbooks-10k) from the day-of-datomic-cloud repository, and I am trying to measure the memory allocation and the time to issue transactions, i.e. to find the maximal throughput measured in datoms/sec. These are my results running the process on a local machine:
• For the 3.2MB books.csv file
time: 2.9 sec for 23 cols x 10000 rows
Memory Allocated: 239MB (mem difference from Linux system monitor for the java process)
datoms inserted: 229,980
transactions throughput: 229,980 / 2.9 sec = 79,303 datoms/sec
But this measurement also includes the time to read rows from the CSV file:
(time (d/transact conn (with-open [f (io/reader (io/file repo-dir "books.csv"))]
                         (let [rows (drop 1 (csv/read-csv f))]
                           (mapv #(row->entity % book-schema) rows)))))
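(The row->entity helper and book-schema aren't shown in the thread; a hypothetical version consistent with the numbers above, roughly one datom per CSV column and 23 cols x 10,000 rows ≈ 229,980 datoms, might look like this. Both names and the attribute idents are assumptions.)
;; hypothetical: book-schema is a vector of attribute idents, one per CSV column,
;; e.g. [:book/id :book/title :book/authors ...]
(defn row->entity
  "Zip one CSV row (a vector of strings) with the per-column attribute keywords."
  [row book-schema]
  (zipmap book-schema row))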
• For the 69MB ratings.csv file
;; 5,976,480 rows x 3 columns
;; 69MB ratings.csv file
;;
;; Transform the csv rows into 5.9 million rating entities in memory
;; Elapsed time: 11.6 secs
;; Memory Allocated: 3GB (mem difference from Linux system monitor for the java process)
;; The data is processed asynchronously: open the file and read all rows into memory
(future (time (def rating-entities
                (with-open [f (io/reader (io/file repo-dir "ratings.csv"))]
                  (let [rows (drop 1 (csv/read-csv f))]
                    (mapv row->rating rows))))))
OK, now that we have the transaction data in memory, we can measure the throughput:
;; Elapsed time: 5m 47sec (347 sec)
;; Memory Allocated: 4.1 GB (mem difference from Linux system monitor for the java process)
;; datoms inserted: 23,905,976
;; transactions throughput: 23,905,976 / 347 sec = 68,893 datoms/sec
(time (doseq [chunk (partition-all 100000 rating-entities)]
        ;; if you want to be nice to other users on a shared system
        ;; (Thread/sleep 2000)
        (d/transact conn chunk)))
For this setup, and with datoms only being written in memory, I reached a limit of about 70,000 datoms/sec, and it took more than 5 minutes for a 69MB file with a 4GB memory footprint!!! I wouldn't dare to use a transactor with Postgres; I expect it to take a really long time and to produce a large database. And that seems to be a problem when one has to port tables from a relational database with millions of rows and dozens of columns.
Generally speaking, it's already tough to port a medium-to-large production SQL (relational) database into Datomic, taking into consideration authorization/authentication, triggers, functions, constraints and data modeling, and it becomes even harder when you also have to think a lot about resources and available time.
Anyway, I am curious how to run pipelined transactions (https://docs.datomic.com/cloud/best.html#pipeline-transactions) and see what difference it makes. Can you share some code on how to use them for this example dataset?
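(For reference, a rough sketch of what pipelining could look like for this dataset, in the spirit of that doc but using the on-prem peer API: core.async's pipeline-blocking with d/transact-async, reusing the conn and rating-entities vars from the snippets above. The chunk size and the concurrency of 20 are arbitrary assumptions, not tuned values.)
(require '[clojure.core.async :as a]
         '[datomic.api :as d])

;; Keep up to `concurrency` transactions in flight: submit each chunk with
;; d/transact-async and deref the returned future on a pipeline thread.
(defn pipeline-transact [conn concurrency chunks]
  (let [in  (a/to-chan! chunks)   ; core.async >= 1.3; otherwise a/to-chan
        out (a/chan concurrency)]
    (a/pipeline-blocking concurrency
                         out
                         (map (fn [tx-data] @(d/transact-async conn tx-data)))
                         in)
    ;; drain the results, summing the datoms reported in each tx result
    (a/<!! (a/reduce (fn [n res] (+ n (count (:tx-data res)))) 0 out))))

;; e.g. (time (pipeline-transact conn 20 (partition-all 10000 rating-entities)))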
PS: The machine is a workstation with an Intel Xeon E5-1650 v4 @ 3.60GHz × 6 cores.
The REPL of the peer is started from inside IntelliJ IDEA, and when it finished processing,
Total Memory Allocated: 7.3 GB (Linux system monitor for the java process)
jvm-opts ["-Dvlaaad.reveal.prefs={:theme,:light} -Xm2g"]
The peer is using datomic.api, and the database is created in memory:
(require '[datomic.api :as d])
(def db-uri "datomic:mem://<DB-NAME>")
(d/create-database db-uri)
(def conn (d/connect db-uri))
The JVM process is spawned from a REPL that is started from inside IntelliJ IDEA.
When it finished processing, Total Memory Allocated: 7.3 GB (Linux system monitor for the java process).
Are you referring to some other memory setting for the peer, and which one exactly? How can I view/set it?
I might be wrong, but I recall reading somewhere that the in-memory database does not have the same performance characteristics as an actual transactor + storage db. So you might want to set up a proper storage db if you want to do benchmarks. (EDIT: here's something: https://docs.datomic.com/on-prem/getting-started/dev-setup.html)
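(For anyone following along, a minimal sketch of the dev-storage setup that link describes, assuming a datomic-pro download; the db name is a placeholder:)
;; in a shell, start the bundled dev transactor (H2-backed dev storage):
;;   bin/transactor config/samples/dev-transactor-template.properties
;; then, from the peer REPL, point at dev storage instead of mem:
(require '[datomic.api :as d])
(def db-uri "datomic:dev://localhost:4334/<DB-NAME>")
(d/create-database db-uri)
(def conn (d/connect db-uri))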
Also, did you specify the maximum heap size? If not, I think the JVM sets a default limit of 1/4 of system RAM. I'm not sure how much useful information you can gather by just looking at the java memory usage from the system's point of view, because that doesn't really tell you how the heap is utilized (e.g. is the JVM constantly running out of memory and thus doing GC all the time, or not at all). There are probably other people here who can give better tips regarding this.
Definitely agree with the above. You need better metrics. Also check that there aren't other competing processes. If you have nothing else, jcmd is a great tool.
The in-memory DB definitely gets slower the more data you put into it - I don't know why.
It doesn't seem to matter what kind of data, either, and queries that shouldn't be affected start to slow down when you add unrelated datoms.
@U03ET6PDHCK the heap size; you can set it by passing -Xmx4096m, for example
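(If those jvm-opts come from a deps.edn alias, here is a sketch of how the heap settings might look; the :dev alias name is just an example, each JVM option goes in its own string, and -Xmx/-Xms set the maximum/initial heap:)
{:aliases
 {:dev {:jvm-opts ["-Xms2g"
                   "-Xmx4g"
                   "-Dvlaaad.reveal.prefs={:theme,:light}"]}}}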
Thanks all for your answers. It seems I have to run a transactor to draw some safe conclusions about transaction-writing performance, and I guess the best viable solution is DynamoDB. But the Datomic setup for this storage engine is another fruitcake 🙂. If I am successful in configuring and running it, I will post results. It's also important to explain quickly why I think there is such poor performance with the Datomic in-memory client: it's because of the data structures. To maximize the speed of writing in memory, one has to use different memory management and data structures, similar to those used in pyarrow and numpy. So, how about building an in-memory Datalog query engine on top of pyarrow?
The DynamoDB Local storage engine works fine on Linux, but the configuration for Datomic proves to be a pain in.... and although I managed to make it start without a problem, it fails after some time:
Launching with Java options -server -Xms4g -Xmx4g -Ddatomic.printConnectionInfo=true
Starting datomic: <DB-NAME> ...
System started datomic: <DB-NAME>
and this is what I got when trying to connect from the Datomic Java shell:
Datomic Java Shell
Type Shell.help(); for help.
datomic % uri = "datomic:";
<datomic:>
datomic % Peer.createDatabase(uri);
// Error: // Uncaught Exception: bsh.TargetError: Method Invocation Peer.createDatabase : at Line: 2 : in file: <unknown file> : Peer .createDatabase ( uri )
Target exception: clojure.lang.ExceptionInfo: Error communicating with HOST localhost on PORT 8031 {:alt-host nil, :peer-version 2, :password "<redacted>", :username "1cW+SaWdseBnbsJieDkd0NCY0MdBVfrEipe+0GsXH4Y=", :port 8031, :host "localhost", :version "1.0.6397", :timestamp 1660264442167, :encrypt-channel false}
clojure.lang.ExceptionInfo: Error communicating with HOST localhost on PORT 8031 {:alt-host nil, :peer-version 2, :password "<redacted>", :username "1cW+SaWdseBnbsJieDkd0NCY0MdBVfrEipe+0GsXH4Y=", :port 8031, :host "localhost", :version "1.0.6397", :timestamp 1660264442167, :encrypt-channel false}
It was far easier to deploy, configure, and test the transaction throughput of a similar key-value storage engine (LMDB) used by the Datalevin Datalog DBMS (https://github.com/juji-io/datalevin):
;; 5.2 sec for 23 cols x 10000 rows
;; 3.2MB books.csv file
;; Elapsed time: 5.2 secs
;; datoms inserted: 229,956
;; transactions throughput: 229,956/5.2 datoms/sec = 44222 datoms/sec
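(A sketch of how such a Datalevin measurement could look; get-conn and transact! are datalevin.core functions, while the storage path and a book-entities var built from books.csv, as in the earlier Datomic snippets, are assumptions:)
(require '[datalevin.core :as dtlv])

;; open (or create) an LMDB-backed database at the given path
(def dtlv-conn (dtlv/get-conn "/tmp/datalevin/goodbooks"))

(time
 (doseq [chunk (partition-all 100000 book-entities)]
   (dtlv/transact! dtlv-conn chunk)))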
So I would expect a Datomic transactor on AWS DynamoDB Local to have similar performance. Which means that one has to plan DynamoDB capacity (https://docs.datomic.com/on-prem/operation/capacity.html#dynamodb) and configure peers/pipelining accordingly, etc... And sooner or later you realize that this is the price you pay for having the components of a DBMS separated.
I have met similar problems in the past when I was testing the write performance of the Redis KV storage engine. All these KV engines (Redis, DynamoDB, LMDB) are very good at point queries, but they perform really badly when you want to write (import) a big volume of data. You may argue that write performance is not critical for a transactional (OLTP) DBMS, but it becomes super important when you want to import your data from another system/project, or you want to integrate a big volume of data from other sources, or you want to do analytics without adding another storage engine.
In fact, what we are discussing here is the price you pay for having a flexible, universal data model based on EAV/RDF triples. It's a similar case when you try to construct a relational, tuple-based data model on top of a KV storage engine or object-like memory structures (Python/Clojure). The physical layout must be appropriate for such a data model, and the best candidate I found from my personal research and experiments is a columnar layout.
Why not add support for truly columnar database engines, such as ClickHouse or SingleStore (MemSQL), to serve as Datomic storage engines?