#architecture
2023-04-05
1macias104:04:55

Hi Clojurists! I’ve been focusing on data engineering for a while now… With the modern data stack, let me clarify what I mean with an example implementation stack:
• Extract - data extraction - Airflow
• Load - S3 - Parquet - store, in this case a https://www.databricks.com/glossary/medallion-architecture with a data model like Data Vault for more structure and cleaner data
• Transform - https://www.getdbt.com - transformation with templated SQL
• Expose the data with a data warehouse like Snowflake
My questions for the group:
• What would you imagine a Clojure stack would look like for a data stack?
• A modeling mechanism like https://en.m.wikipedia.org/wiki/Data_vault_modeling seems really close to the universal schema discussed in https://docs.datomic.com/cloud/whatis/data-model.html#universal, but simpler, which makes me wonder if Datomic is a better approach for a staging layer in a modern data stack.
• Is something like Datomic or XTDB a possible alternative to a data warehouse in some cases?
Thanks in advance, I hope to read your thoughts!

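To make the Data Vault / Datomic parallel concrete: a Data Vault satellite row carries roughly (business key, attribute values, load timestamp), which lines up closely with Datomic's universal `[entity attribute value transaction added?]` datom shape. A minimal illustrative sketch (the entity ids and attributes here are made up, not from any real schema):

```clojure
;; Data Vault satellite row (conceptually):
;;   hub-key | attribute   | value  | load-ts
;;   42      | :user/email | "a@x"  | 2023-04-05T10:00Z
;;
;; Datomic's universal schema stores the same information as datoms:
;;   [e  a            v                 tx    added?]
[[42 :user/name    "Ada"             1010  true]
 [42 :user/email   "ada@example.com" 1010  true]
 ;; a later retraction + assertion models an update, preserving history
 [42 :user/email   "ada@example.com" 1042  false]
 [42 :user/email   "ada@new.example" 1042  true]]
```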
Rupert (All Street)11:04:10

We do a lot of data engineering at our company (on multi-terabyte datasets) and on distributed machines. In other programming languages, particularly strongly typed, non-dynamic ones (e.g. Java/C#/Scala), the language is too verbose or not expressive enough to do the transformations, so they often delegate data transformation to other languages (e.g. SQL or NoSQL). Other languages like Python/JavaScript are too slow and not multithreaded - so again they delegate processing away to languages like SQL and libraries like pandas. However, Clojure is a fantastic language for data processing - it's fast, concise, dynamic, functional, etc. Even just using the built-in sequence functions (`map/filter/reduce` etc.) works well. So we do our data processing/transformations in Clojure. This means our data stack is very simple: pure Clojure applications kicked off and managed by systemd.

👀 2
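A minimal sketch of the kind of pure-Clojure transformation this describes, using only core sequence functions (the record shape and the validation rules are illustrative, not from any real pipeline):

```clojure
;; A plain-Clojure transformation pipeline over a seq of records.
(defn clean-events [events]
  (->> events
       (filter #(some? (:user-id %)))                ; drop records missing a key
       (map #(update % :amount (fnil double 0.0)))   ; normalise types, default nils
       (remove #(neg? (:amount %)))                  ; basic validation
       (group-by :user-id)                           ; aggregate per user
       (map (fn [[user-id es]]
              {:user-id user-id
               :total   (transduce (map :amount) + es)}))))

(clean-events [{:user-id 1 :amount 10} {:user-id 1 :amount 5} {:user-id nil}])
;; => ({:user-id 1, :total 15.0})
```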
Laurence Chen04:04:02

I think you can use SQL directly on XTDB, so it can probably serve as the data warehouse while also being the operational database. I have some experience with Datomic plus a SQL data warehouse. The official Datomic website suggests using Presto/Trino so you can leverage SQL to query the Datomic database, but in our use cases the performance was not good enough. After much trial and error (and some programming battles), we developed a not-yet-open-source library called Plenish, which can read the Datomic log and translate it into SQL commands. With Plenish we implemented a streaming system that syncs every transaction from Datomic to a SQL data warehouse (currently Postgres) in real time. I believe:
• The modern data analytics stack is great. For analytical use cases, SQL, especially Jinja-enhanced SQL, is a great tool.
• For the corresponding operational system, you probably have a lot of API integrations, algorithms, write operations, etc. Datomic + Clojure are powerful tools there.
By the way, if you are interested in Plenish, feel free to reach out to us - me, or my boss, Arne Brasseur.

🙌 2
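Plenish itself isn't open source (as of this message), but the core idea of tailing the Datomic log is available through the public API. A rough sketch of the approach, not Plenish's actual implementation - `apply-tx-to-warehouse!` is a hypothetical callback, and the datom-to-SQL translation (the hard part) is elided:

```clojure
(require '[datomic.api :as d])

;; Read every transaction from the Datomic log and feed it to a sink.
(defn sync-log->sql! [conn apply-tx-to-warehouse!]
  (let [log (d/log conn)]
    (doseq [{:keys [t data]} (d/tx-range log nil nil)] ; nil nil = whole log
      ;; `data` is the seq of datoms asserted/retracted in transaction t;
      ;; a real sync would translate them into SQL upserts/deletes here.
      (apply-tx-to-warehouse! t data))))
```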
refset17:04:42

Hey @U07SECFA9 - compared with mainstream off-the-shelf data warehousing tools, I would not recommend XTDB as a cost-effective system today if you have non-trivial volumes of data (>TB), due to both the limitations of the index design (prioritising point lookups over fast scans) and the costs of local SSD storage (the RocksDB index is monolithic). However, you'll probably be very interested to follow developments on our upcoming columnar engine, which looks a lot more like Snowflake / Apache Iceberg etc. under the hood: https://github.com/xtdb/core2 (TBD on when this is going to be ready for production - happy to chat about that though). This is probably also worth looking at, if you've not seen it already: https://techascent.github.io/tech.ml.dataset/walkthrough.html + https://github.com/scicloj/tablecloth

❤️ 2
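For a feel of the tech.ml.dataset API linked above (the file path is illustrative; the functions are from the linked walkthrough):

```clojure
(require '[tech.v3.dataset :as ds])

;; Load a CSV file into a columnar, in-memory dataset.
(def events (ds/->dataset "events.csv"))

(ds/head events)               ;; first rows, printed as a table
(ds/descriptive-stats events)  ;; per-column min/max/mean etc.

;; Column-wise selection and filtering stay dataset->dataset:
(-> events
    (ds/select-columns [:user-id :amount])
    (ds/filter-column :amount pos?))
```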
refset17:04:12

To answer your question about the hypothetical fitness of XT's features for Data Vault from the thread https://clojurians.slack.com/archives/CG3AM2F7V/p1681417842524409: I think bitemporality is really common and important (if under-supported) for regular data warehousing users, and the fully dynamic nature of storage and querying is also very common to how those communities work with data. In some ways you can look at XT as trying to bring these "big data" benefits to transactional workloads 🙂 The future of XT is https://en.wikipedia.org/wiki/Hybrid_transactional/analytical_processing

🙌 2
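For context on what that bitemporality looks like in practice, a minimal sketch against the XTDB 1.x API (assuming a started `node`; the entity and dates are illustrative):

```clojure
(require '[xtdb.api :as xt])

;; Assert a document with an explicit valid-time (business time),
;; independent of the transaction time XT records automatically.
(xt/submit-tx node
  [[::xt/put
    {:xt/id :customer/42 :status :active}
    #inst "2023-01-01"]])                ; valid-from

;; Query the database "as of" a past valid time:
(xt/q (xt/db node {::xt/valid-time #inst "2023-02-01"})
      '{:find  [?status]
        :where [[:customer/42 :status ?status]]})
```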