2024-11-25 datalevin | Clojure Slack Archive

datalevin 2024-11-25

Jeremy 2024-11-25T18:15:51.226899Z

Hi again, I'm new to datalog/datomic/etc.. up until now, I've been using datalevin as a kvs, but want to also use it for some time-series data. I have an attribute that is a map of sorted maps (which represents price levels and sizes). can an attribute be a sorted map? and is it possible to query it with datalog? or does datalevin store it as blobs which you can only operate on after reading.

2024-11-26T11:13:34.767309Z

> e.g. so you can turn your keys into a single long value, saving lots of storage space.

Is this different from blob storage? What's the trade-off? If you just want a sequence/vector of values attached to an entity. Also had no idea you can have both kv and datalog in the same db, that's awesome.

Jeremy 2024-11-26T14:45:59.313089Z

> Added two KV query functions to get first n key values in a range. So it might be helpful when you go the KV route.

That was quick lol. awesome.

Jeremy 2024-11-26T15:49:16.246169Z

> Is this different from blob storage? What's the trade-off? If you just want a sequence/vector of values attached to an entity.

It saves you from having to deserialize the whole sequence, if you just want the first few items

🤯 1

🤦‍♂️ 1

Huahai 2024-11-26T16:32:52.030359Z

Correct. With these stored in lists, you can use various range query functions on them, and these are very efficient, basically the workhorse of this database. So you don't have to deserialize the whole collection. In particular, you don't want to deserialize Clojure immutable collections; these are expensive to construct. In my experience, even a Java hash map is too expensive. Only special collection types such as bitmaps, compressed integer arrays, etc. are worthwhile. We used these special collection blobs to build our fulltext search engine.

Huahai 2024-11-26T16:37:34.619459Z

Of course, if extreme performance is not a concern, e.g. you are not trying to beat Lucene or Postgres, then these considerations are not important; blobs are good enough for application programming.

👍 2

Huahai 2024-11-26T16:39:21.883789Z

So it all depends on what you are trying to do. Datalevin is flexible. Experiment and measure.

💯 1

Huahai 2024-11-25T18:19:51.921909Z

You can normally map whatever data structure into entity-attribute relationships. So your sorted map can probably be considered an entity of its own.

Huahai 2024-11-25T18:21:03.067569Z

I would recommend following a standard ER data model when using Datalog.

Jeremy 2024-11-25T18:21:10.339919Z

I see. would i be able to run queries like getting highest key in the map?

Huahai 2024-11-25T18:21:17.941029Z

Don't store blobs.

Huahai 2024-11-25T18:22:03.301609Z

Of course.

Huahai 2024-11-25T18:22:19.158489Z

Datalog is just a more ergonomic SQL.

Huahai 2024-11-25T18:22:49.111469Z

:find (max ?whatever)...
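
A minimal sketch of such an aggregate query, assuming the `datalevin.core` namespace is on the classpath. The attribute names are borrowed from the schema proposed later in this thread; the temporary directory and the sample values are made up for illustration.

```clojure
(require '[datalevin.core :as d])

;; Attribute names follow the schema suggested in this thread;
;; the value types are an assumption (prices and volumes as doubles).
(def schema
  {:price-snapshot/price  {:db/valueType :db.type/double}
   :price-snapshot/volume {:db/valueType :db.type/double}})

;; Hypothetical throwaway directory, unique per run.
(def db-dir
  (str (System/getProperty "java.io.tmpdir")
       "/dtlv-max-demo-" (java.util.UUID/randomUUID)))

(def conn (d/get-conn db-dir schema))

;; Two made-up price levels.
(d/transact! conn
             [{:price-snapshot/price 2.1 :price-snapshot/volume 30.0}
              {:price-snapshot/price 3.0 :price-snapshot/volume 40.0}])

;; The trailing dot asks for a single scalar instead of a set of tuples.
(def highest-price
  (d/q '[:find (max ?p) .
         :where [_ :price-snapshot/price ?p]]
       (d/db conn)))

(d/close conn)
```

With the two sample levels above, `highest-price` should be the larger of the two prices.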

Jeremy 2024-11-25T18:23:06.494139Z

great. thanks.. an entity of mine has a variable price level property, so i'm unsure of how else to store it other than as sorted map in maps

Huahai 2024-11-25T18:24:32.478769Z

Datalog works well with normalized data. The point of triple store is to store data in their smallest form possible, i.e. datoms.

Huahai 2024-11-25T18:25:26.966669Z

Show us what your data looks like, we can show you a schema to store it.

Jeremy 2024-11-25T18:26:41.236529Z

Alright, kindly allow me a few mins to type it out

Jeremy 2024-11-25T18:36:46.963879Z

a market snapshot has 10-20 entity snapshots. each entity looks like:

entity snapshot:
+ valid-from (time snapshot was taken) (timestamp)
+ entity-id (int)
+ price-data (map)
  + price-kind-1 (sorted-map) {2.1 30, 3.0 40 ...}
  + price-kind-2 {900.5 30, ...}
  + price-kind-3 {...}

I can normalize the market by giving each entity market-id attribute. but unsure of how to represent price-data except to store it as is

Huahai 2024-11-25T18:40:25.040909Z

what is the sorted map? map of what to what?

Jeremy 2024-11-25T18:41:14.436869Z

double to double (price to volume)

Huahai 2024-11-25T18:46:29.877229Z

One possible representation:

Huahai 2024-11-25T18:50:18.803609Z

{:market-snapshot/valid-from {...}
 :market-snapshot/entity {...}
 :market-snapshot/price-kind {...}
 :price-snapshot/price {...}
 :price-snapshot/volume {...}
 :price-snapshot/market-snapshot {:db/valueType :db.type/ref}}
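
To make the mapping concrete, here is a rough pure-Clojure sketch that flattens one entity snapshot (in the nested shape described earlier) into transaction maps using the attribute names above. The function name, the sample data, and the use of negative integers as temporary ids are illustrative assumptions, not something stated in the thread.

```clojure
(defn snapshot->tx-data
  "Flattens one nested entity snapshot into a seq of entity maps:
   one parent map per price kind, plus one map per price level that
   points back at its parent through a temporary id."
  [{:keys [valid-from entity-id price-data]}]
  (apply concat
         (map-indexed
          (fn [i [kind levels]]
            (let [tempid (- (inc i))] ; -1, -2, ... as temporary ids
              (cons {:db/id                      tempid
                     :market-snapshot/valid-from valid-from
                     :market-snapshot/entity     entity-id
                     :market-snapshot/price-kind kind}
                    (for [[price volume] levels]
                      {:price-snapshot/price           price
                       :price-snapshot/volume          volume
                       :price-snapshot/market-snapshot tempid}))))
          ;; sort the kinds so the output order is deterministic
          (sort-by key price-data))))

;; Made-up sample in the shape described in the thread.
(def sample-snapshot
  {:valid-from #inst "2024-11-14T01:01:00Z"
   :entity-id  3
   :price-data {"price-kind-1" (sorted-map 2.1 30.0, 3.0 40.0)
                "price-kind-2" (sorted-map 900.5 30.0)}})

(def tx-data (snapshot->tx-data sample-snapshot))
```

Each price level becomes its own small entity, which is what lets a Datalog query filter or aggregate on individual levels.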

Huahai 2024-11-25T18:53:28.069739Z

You can further separate price-kind from market if you want, as we still have some redundancy there.

Huahai 2024-11-25T19:00:35.990069Z

Or you can turn the reference around, have cardinality many reference to prices instead.

Huahai 2024-11-25T19:01:48.880129Z

maybe that feels more natural to you.

Jeremy 2024-11-25T19:02:19.726939Z

ahh, very interesting.. I think this is the way to go. Only issue is each sorted map has around 20-50 keys, multiplied by 3 price kinds and average of 10 selections and 10k snapshots per market, which is around 16 mil. I'd have to investigate the disk space consumed in practice. if it's too much, I only need to store the top 5 keys per kind, and can archive the whole price-snapshot as blobs.
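
The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted in the message; the function name is hypothetical.

```clojure
(defn level-count
  "Price levels stored for one market:
   keys per sorted map x 3 price kinds x 10 selections x 10k snapshots."
  [keys-per-map]
  (* keys-per-map 3 10 10000))

(level-count 20) ;; => 6000000
(level-count 50) ;; => 15000000
```

So the range is roughly 6 to 15 million price levels per market. In the schema proposed above, each level entity carries three attributes (price, volume, and the reference), so the datom count would be about three times that.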

Jeremy 2024-11-25T19:03:04.016929Z

> Or you can turn the reference around, have cardinality many reference to prices instead.

do you mind elaborating on this? I don't quite get it

Huahai 2024-11-25T19:04:19.911139Z

`:market-snapshot/price-snapshot {:db/valueType :db.type/ref :db/cardinality :db.cardinality/many :db/isComponent true}`
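
With the reference turned around, the parent holds its price levels directly. A sketch of what the transaction data might look like, assuming nested maps under a ref attribute are expanded into child entities (DataScript-style behavior that should be verified against the Datalevin docs); all values are made up.

```clojure
;; One parent entity per (entity, time, price kind); its price levels are
;; nested under the cardinality-many component attribute.
(def nested-tx
  [{:market-snapshot/valid-from #inst "2024-11-14T01:01:00Z"
    :market-snapshot/entity     3
    :market-snapshot/price-kind "price-kind-1"
    :market-snapshot/price-snapshot
    [{:price-snapshot/price 2.1 :price-snapshot/volume 30.0}
     {:price-snapshot/price 3.0 :price-snapshot/volume 40.0}]}])

(defn level-maps
  "Returns the nested price-level maps of one parent map."
  [parent]
  (:market-snapshot/price-snapshot parent))
```

Marking the attribute as a component ties the price levels to their parent, so they are treated as part of it rather than as free-standing entities.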

Huahai 2024-11-25T19:05:35.240639Z

Of course, if your goal is just to store these as time series, storing them as blob is fine, as you are not going to finely slice and dice them in ad-hoc queries.

Jeremy 2024-11-25T19:10:57.742669Z

I'd perform lots of queries regarding the top 5 levels, so your normalized approach is perfect. I also wouldn't want to just discard the remaining price levels, so I'd archive them as blobs for safe-keeping. thank you very much @huahaiy

Huahai 2024-11-25T19:11:50.949619Z

You don't even need to use Datalog, Datalevin KV store can store lists

Huahai 2024-11-25T19:12:42.888009Z

so you can have a market-snapshot map as the key, and price-volume tuple as the values.

Huahai 2024-11-25T19:12:54.410769Z

That would be much faster

Huahai 2024-11-25T19:14:38.916709Z

{:entity-id xxx :valid-from xxx ... :price-kind "price-kind-1"} would be the key

Huahai 2024-11-25T19:15:16.422299Z

[price volume] tuple would be the value

Huahai 2024-11-25T19:16:14.343159Z

open-list-dbi for this DBI, that would be the sorted map you want

Huahai 2024-11-25T19:18:25.858119Z

a list DBI basically is a sorted map of sorted map: keys are sorted, values can be a list, also sorted.
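
The "sorted map of sorted lists" idea can be modeled in plain Clojure to see what it buys you. This is only an in-memory analogy: none of the function names below are Datalevin functions, and the keys and values are made up in the shapes suggested in this thread.

```clojure
;; In-memory model only: a sorted map whose values are sorted sets.
(defn put-items
  "Adds items to the sorted value list stored under key k."
  [dbi k items]
  (update dbi k (fnil into (sorted-set)) items))

(defn first-n
  "First n values under k, in sorted order."
  [dbi k n]
  (take n (get dbi k)))

(defn keys-between
  "Entries whose keys fall in the closed range [lo, hi]."
  [dbi lo hi]
  (subseq dbi >= lo <= hi))

(def t1 #inst "2024-11-14T01:01:00Z")
(def t2 #inst "2024-11-14T01:02:00Z")

;; Keys are [entity-id valid-from price-kind]; values are [price volume].
(def dbi
  (-> (sorted-map)
      (put-items [3 t1 "price-kind-1"] [[3.5 42] [1.5 10]])
      (put-items [3 t2 "price-kind-1"] [[3.6 40]])
      (put-items [4 t1 "price-kind-1"] [[9.0 5]])))

(first-n dbi [3 t1 "price-kind-1"] 1)              ;; => ([1.5 10])
(map key (keys-between dbi [3 t1 ""] [3 t2 "~"]))  ;; both keys for entity 3
```

Because both levels are kept sorted, "lowest price for this snapshot" and "all snapshots of entity 3 in a time window" are range lookups rather than full scans.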

Huahai 2024-11-25T19:19:28.003899Z

For time series data (assuming high write rate), I would go with the KV store and bypass the expensive machinery of Datalog transactions.

Huahai 2024-11-25T19:21:20.129029Z

I am working on enhancing write throughput with async transactions, that should handle very high write rate.

Jeremy 2024-11-25T19:21:24.670319Z

another interesting approach. Idk how none of this came to mind. I spent days searching for various dbs 😭.

Huahai 2024-11-25T19:22:17.379919Z

we don't have great documentation yet

Jeremy 2024-11-25T19:22:26.876599Z

I don't have need for high write rate (i store in memory for real-time data), but fast and flexible queries are what i'm looking for (for historical data). I'd have to weigh both these approaches

Huahai 2024-11-25T19:23:44.234019Z

your data seems to be simple enough for a KV solution, to be honest

Huahai 2024-11-25T19:24:34.156269Z

for the keys, you don't need to use a map blob; a heterogeneous tuple looks good enough, and it supports range queries, so you are not missing anything from the Datalog store.

Jeremy 2024-11-25T19:26:45.418689Z

waitt.. you mean kv store supports queries? I think i missed that on the documentation

Jeremy 2024-11-25T19:27:40.942519Z

I'd have a look. if so, I guess that's what I need for price data. I'd still store market info in datalog store

Huahai 2024-11-25T19:27:45.782919Z

[3 #inst "2024-11-14:0101" "price-kind-1"] as the key, [3.5 42]... as the values.

👍 1

Huahai 2024-11-25T19:29:13.099439Z

there are so many KV query functions, list-range, list-range-count, list-range-filter, list-range-first, list-range-keep, etc.

💜 1

Huahai 2024-11-25T19:30:45.495809Z

Datalog store is built on top of KV store, so it's by definition slower, particularly, the transaction logic is expensive, because of the complex semantics of Datalog that we support. It's a cost that can be avoided if the use cases do not call for it.

Huahai 2024-11-25T19:33:04.189809Z

I mean you have a very simple data model that does not demand lots of complex arbitrary joins, so I would go with a KV solution.

Huahai 2024-11-25T19:33:44.512539Z

The query patterns are pretty predictable, it's time series data.

Jeremy 2024-11-25T19:34:35.127299Z

I see. I've mostly just used kv stores as persistent caches, so i'm currently flooded with new possibilities.

Jeremy 2024-11-25T19:34:56.395179Z

Thanks a lot once again

Huahai 2024-11-25T19:40:44.455289Z

you can also use the KV store and Datalog in the same DB file, so you can have the time series part in the KV store and the other relationships in the Datalog store.

Huahai 2024-11-25T19:41:21.176169Z

Datalevin is versatile. It's in our slogan 😀

Huahai 2024-11-25T19:42:50.380409Z

e.g. so you can turn your keys into a single long value, saving lots of storage space.

Huahai 2024-11-25T19:44:34.866409Z

key 29239 values [3.5 42] [1.5 10]... where 29239 is an entity id from the Datalog store.
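
A rough sketch of that layout with the KV API, assuming `datalevin.core` is on the classpath. The DBI name and temporary directory are hypothetical, and the `[:double :double]` tuple type for the values is an assumption about how a two-element tuple is declared; check it against the Datalevin documentation before relying on it.

```clojure
(require '[datalevin.core :as d])

;; Hypothetical throwaway directory, unique per run.
(def kv-dir
  (str (System/getProperty "java.io.tmpdir")
       "/dtlv-levels-demo-" (java.util.UUID/randomUUID)))

(def kv (d/open-kv kv-dir))
(def levels "price-levels") ; hypothetical DBI name
(d/open-list-dbi kv levels)

;; Key: an entity id taken from the Datalog store (29239 as in the thread).
;; Values: [price volume] tuples; the tuple type spec is an assumption.
(d/put-list-items kv levels 29239
                  [[3.5 42.0] [1.5 10.0]]
                  :long [:double :double])

;; All levels stored under the key, realized before the store is closed.
(def all-levels
  (vec (d/get-list kv levels 29239 :long [:double :double])))

;; Only the levels whose tuples sort between [1.0 0.0] and [2.0 0.0].
(def in-band
  (vec (d/list-range kv levels
                     [:closed 29239 29239] :long
                     [:closed [1.0 0.0] [2.0 0.0]] [:double :double])))

(d/close-kv kv)
```

The key stays a single long, and everything else about the entity (time, price kind, market) lives on the Datalog side under that entity id.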

Huahai 2024-11-25T19:46:00.741609Z

so you can expand your ER model arbitrarily on the Datalog side.

Huahai 2024-11-26T04:28:05.166769Z

Added two KV query functions to get first n key values in a range. So it might be helpful when you go the KV route.
