2024-11-05 datahike | Clojure Slack Archive

datahike

2024-11-05T15:27:29.922459Z

What is the current status of the :jdbc backend? The docs say it is supported, but it is not working for me. Moreover, in the code (for example store.clj) there are no methods for :jdbc, only for :mem and :file ...

alekcz 2024-11-05T15:28:53.256249Z

@sasha_bogdanov_dev I'm using them in production with 3M+ datoms

whilo 2024-11-09T19:19:17.917589Z

The S3 backend could give you subsecond latency, but I haven't tried it out inside of an AWS data center yet. As long as you run in a single VM the read caches should also be hot in general, so once your app is up, read latencies should not be affected that much by the latency of the store. It depends on your data set and use case though.

👀 1

whilo 2024-11-09T19:21:19.269859Z

If you do not need to transact things in a strict order, use transact! to transact asynchronously. The writer will then batch automatically and the latency of the store will not compound (i.e. in the ideal case you will wait a bit longer than one roundtrip to the store), even if you have many concurrent transactions.
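
A rough sketch of that pattern, assuming an in-memory store and a made-up :event/id attribute. It relies on transact! returning something that can be dereferenced like a future, and the store keys shown here may differ between datahike versions, so treat it as an illustration rather than the canonical usage:

```clojure
(require '[datahike.api :as d])

;; In-memory store with schema-on-read, purely for illustration.
;; The :id value and the :event/id attribute are hypothetical.
(def cfg {:store {:backend :mem :id "async-demo"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

;; Submit many transactions without waiting for each one in turn,
;; so the writer has the chance to batch them.
(def pending
  (mapv (fn [i] (d/transact! conn {:tx-data [{:event/id i}]}))
        (range 100)))

;; Block only once, after everything has been submitted.
(run! deref pending)
```

With the synchronous transact, each call would wait for its own roundtrip to the store before the next one is submitted.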

whilo 2024-11-09T19:22:21.498659Z

Alternatively EFS inside of an EC2 instance with the file backend might be faster and cheaper.

whilo 2024-11-09T19:23:06.577969Z

The advantage of S3 is that you can give outside readers read access to it and they can join and query with your datahike database freely. No coordination or additional infrastructure needed.

2024-11-09T19:23:11.066669Z

I will check different approaches, yes

2024-11-09T19:23:17.260539Z

Thank you for suggestions

whilo 2024-11-09T19:23:45.114459Z

Cool. Keep me posted on what works and doesn't work for you.

👌 1

whilo 2024-11-09T19:25:32.283579Z

If dynamo is worth it, I am happy to support it. I just wanted to first cover wider and cheaper ground with S3, and the file backend should also work well in many cases.

2024-11-09T19:32:49.625449Z

About DynamoDB: I am now trying it with the Datascript IStorage protocol (because it is much simpler), and if it works reasonably well, then we can implement it in Datahike. File storage could be nice, but EFS from Lambda is not as robust as expected. Datahike hangs on attempts to persist data, Datalevin sometimes had "resource not Available" errors, etc...

whilo 2024-11-09T19:33:31.869259Z

I see.

whilo 2024-11-09T19:33:40.586549Z

Why do you need lambdas?

whilo 2024-11-09T19:34:28.642839Z

If you have a persistent distributed database like datomic or datahike you can just scale EC2 instances in front of it if needed.

whilo 2024-11-09T19:35:02.796069Z

In my experience lambdas are great for developers who do not have a good distributed programming model.

whilo 2024-11-09T19:35:46.736699Z

It should be much cheaper to scale up a single EC2 instance first unless you need extreme unmanaged blitz scaling.

whilo 2024-11-09T19:35:55.488649Z

Which might also not work with lambdas.

whilo 2024-11-09T19:38:39.208189Z

I am not claiming I understand your requirements, I am just curious.

2024-11-09T19:43:01.700409Z

I have multiple small applications (there will be many if all goes well), each with very low load, some almost completely idle. So it's really a case for a serverless setup. And I want one storage for them all (with different databases/tables/etc., of course). I tried Datomic Cloud. It could be great, but CloudFormation somehow refused to deploy the official templates and I gave up on it))) Maybe I am wrong, but it was really annoying.

whilo 2024-11-09T19:43:52.891129Z

Right, I think somehow they made Datomic difficult to get running in general.

whilo 2024-11-09T19:44:03.343639Z

I haven't used it in a long time to be honest.

2024-11-09T19:44:15.396659Z

And it's mostly a hobby project, so it's a good place to try things

whilo 2024-11-09T19:44:24.337519Z

Ok, that makes sense.

whilo 2024-11-09T19:44:43.929989Z

Did you try datahike's S3 backend?

whilo 2024-11-09T19:45:11.488069Z

I kind of made it for the lambda use case. I am just not so convinced about lambdas myself anymore.

whilo 2024-11-09T19:45:45.757339Z

It is easy to deploy multiple small apps on a single EC2 instance and you can even keep REPLs to them open over SSH if you want.

2024-11-09T19:46:01.835489Z

Not yet. It was too hard to believe that such storage can be fast enough for real-time applications

whilo 2024-11-09T19:46:42.270809Z

I had latency of around 400ms from my laptop here in Canada to the nearest US AWS data center.

whilo 2024-11-09T19:47:02.116909Z

I think it will be lower inside of AWS.

whilo 2024-11-09T19:47:21.296259Z

But I did not have enough time to try things out. I get that lambdas and EFS are not a very good fit.

whilo 2024-11-09T19:47:32.914509Z

(Although that is on Amazon)

2024-11-09T19:48:29.558239Z

> It is easy to deploy multiple small apps on a single EC2 instance and you can even keep REPLs to them open over SSH if you want.
Yes, I can, but honestly I am happy with SQS queues between the API and Lambdas. So, for now I will stay on Lambdas))

2024-11-09T19:49:44.569399Z

> But I did not have enough time to try things out. I get that lambdas and EFS are not a very good fit.
For databases I think it's not good; as file storage it could be nice. And maybe I did something wrong. This landscape is relatively new for me (around a year)

whilo 2024-11-09T19:50:17.067309Z

Ok. It should be easy.

👍 1

whilo 2024-11-09T19:51:40.599499Z

@viesti’s lambda template for datahike is using S3 https://github.com/viesti/clj-lambda-datahike/blob/main/src/clj_lambda_datahike/core.clj

whilo 2024-11-09T19:52:12.209689Z

But I get that it might be too slow for you. Do you need real-time writes or just fast reads?

whilo 2024-11-09T19:57:46.132749Z

It is not native compiled yet though. @viesti mentioned that there are warm start JVM options now that also reduce lambda warmup time. I am not an expert in this.

2024-11-09T19:59:57.679279Z

> But I get that it might be too slow for you. Do you need real-time writes or just fast reads?
Read speed is critical, writes less so

whilo 2024-11-09T20:00:29.706979Z

Can you put a number on it?

whilo 2024-11-09T20:00:35.581179Z

In terms of milliseconds.

2024-11-09T20:01:00.343749Z

I have concerns about those "new" warm start options, but I need to try them as well

2024-11-09T20:01:28.754679Z

150-200ms is definitely the limit

whilo 2024-11-09T20:01:42.487039Z

Also, is your database written to a lot? If it is only written to sporadically then most reads can be cached and you will only need one roundtrip to S3 on read.

whilo 2024-11-09T20:02:05.355079Z

I see, ok then S3 is maybe too slow.

whilo 2024-11-09T20:02:37.716019Z

EC2 instance with file backend should be fine though.

2024-11-09T20:02:45.495729Z

I need to try it anyway, but right now I am pursuing another approach

whilo 2024-11-09T20:03:09.623519Z

Cool, thanks for contextualizing 🙂

👍 1

2024-11-09T20:03:28.376059Z

No problem!

viesti 2024-11-09T20:26:25.086799Z

> It is not native compiled yet though. @viesti mentioned that there are warm start JVM options now that also reduce lambda warmup time. I am not an expert in this.
The AWS Lambda Java runtime has support for creating a VM snapshot at publish time of a Lambda version, which helps with cold starts, but that comes with its own caveats too (publish takes a couple of minutes, the VM snapshots are cached up to two weeks, rarely used lambdas get evicted from the snapstart cache so cold start times might deviate without, for example, an hourly ping) https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html

👀 1

whilo 2024-11-09T07:29:43.066239Z

Maybe this would work https://github.com/passren/DynamoDB-JDBC

whilo 2024-11-09T07:29:52.145909Z

It also wouldn't be super hard to add a konserve-dynamodb backend.

whilo 2024-11-09T07:31:02.660019Z

But it would make sense to know why it is really needed.

alekcz 2024-11-09T07:41:05.334919Z

@whilo @sasha_bogdanov_dev I explored this a while back but I got nervous that DynamoDB is not strongly consistent by default. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html

alekcz 2024-11-09T07:41:30.742839Z

Strongly consistent reads are double the cost of eventually consistent reads.

alekcz 2024-11-09T07:41:56.366699Z

There's probably a workaround, but that's why I steered clear of it.

whilo 2024-11-09T07:42:38.403579Z

Most values we write only exist once, i.e. either you read the right one or there is none.

whilo 2024-11-09T07:43:01.533029Z

Exceptions are the root entries for the db, they are overwritten.

whilo 2024-11-09T07:43:20.296149Z

In principle reading an old value of those would just mean you didn't fetch the latest snapshot.

2024-11-09T07:46:35.345489Z

Hi, thank you for your investigations! Yes, I know about consistency, but it is not clear what they mean by saying “writes can be not available immediately to read”. If it is a couple of milliseconds of latency, then it is ok for my case; if it is seconds, then no. I need to try and find out.

2024-11-05T15:29:51.902089Z

How so? I don't understand...

alekcz 2024-11-05T15:30:45.802039Z

So you include datahike-jdbc. I sent the link in the chat.

2024-11-05T15:30:58.715939Z

Oh okay thank you

alekcz 2024-11-05T15:31:07.307299Z

Including it allows :jdbc to work
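
For reference, a minimal sketch of what such a setup might look like, assuming a PostgreSQL database. The namespace name, host, credentials and database name are placeholders, and the exact store keys may differ between datahike-jdbc versions, so check the library's README for the release in use:

```clojure
(ns datahike-sandbox.jdbc-example   ; hypothetical namespace name
  (:require [datahike.api :as d]
            ;; Loading this namespace is what makes the :jdbc backend
            ;; available to datahike.
            [datahike-jdbc.core]))

;; Placeholder connection details; replace with your own.
(def cfg {:store {:backend  :jdbc
                  :dbtype   "postgresql"
                  :host     "localhost"
                  :port     5432
                  :user     "datahike"
                  :password "change-me"
                  :dbname   "datahike_example"}})

;; Create the database on first use, then connect with the config map.
(when-not (d/database-exists? cfg)
  (d/create-database cfg))
(def conn (d/connect cfg))
```

Without the datahike-jdbc dependency on the classpath and its namespace required, datahike itself only ships the built-in backends such as :mem and :file.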

alekcz 2024-11-05T15:31:27.043199Z

I'm currently walking. Will send you an example config a bit later

2024-11-05T15:31:48.359969Z

I think I am okay with further steps, do not worry

alekcz 2024-11-05T15:31:58.200809Z

Cool beans 💪

2024-11-05T15:42:02.854129Z

Oh No. 😀😞

1. Unhandled java.lang.IllegalArgumentException
   No implementation of method: :-connect of protocol:
   #'datahike.connector/PConnector found for class:
   clojure.lang.PersistentHashSet

          core_deftype.clj:  584  clojure.core/-cache-protocol-fn
          core_deftype.clj:  576  clojure.core/-cache-protocol-fn
            connector.cljc:   18  datahike.connector$eval47679$fn__47680$G__47670__47685/invoke
            connector.cljc:  201  datahike.connector$connect/invokeStatic
            connector.cljc:  197  datahike.connector$connect/invoke
                      REPL:   33  datahike-sandbox.core/-main
                      REPL:   11  datahike-sandbox.core/-main
               RestFn.java:  397  clojure.lang.RestFn/invoke
                      REPL:   41  datahike-sandbox.core/eval62264

alekcz 2024-11-05T15:43:11.649719Z

What does your setup code look like?

2024-11-05T15:44:05.730729Z

Oh wait. It's ok, my mistake

2024-11-05T18:18:36.702179Z

It would be nice to support DynamoDB (it has a JDBC driver). I had a look at the code and I do not think I can create a PR quickly. Is it feasible to implement?

2024-11-05T18:27:09.645139Z

As far as I can see, changes are needed not only in konserve-jdbc, correct?

whilo 2024-11-05T20:17:02.872469Z

I think it should be enough to fix konserve-jdbc to connect to dynamo. If you can provide a PR for that it should make datahike work with dynamo.

👌 1

2024-11-05T23:02:58.645519Z

Looks like it is not enough. next.jdbc needs to be patched, at least in this function: https://github.com/seancorfield/next-jdbc/blob/218cf8263727ce662483fbf26ab08bd9cf22cfad/src/next/jdbc/connection.clj#L140 The problem is that the DynamoDB driver (I used CData's one) uses uncommon keys in the connection URL or spec.

2024-11-05T23:03:43.465149Z

And if we pass a complete connection URL, then there is no classname parameter...

whilo 2024-11-05T23:04:42.846249Z

So can you use the URL and just make sure it supports the classname? It seems the spec translation is an additional step to translate it into a URL first, right?

whilo 2024-11-05T23:06:39.860969Z

https://github.com/seancorfield/next-jdbc/blob/218cf8263727ce662483fbf26ab08bd9cf22cfad/src/next/jdbc/connection.clj#L18

whilo 2024-11-05T23:08:16.865779Z

I think this mapping should be a multimethod and not a closed map that cannot be extended from the outside to new SQL types, but it is what it is. Probably easiest to add dynamo there and open a PR.

2024-11-06T00:14:54.597769Z

Yes, that is definitely the easiest way.

2024-11-06T10:43:01.251469Z

But there is another problem here: "Could not find a valid license for using CData JDBC Driver for Amazon DynamoDB 2024 on this system." It looks like it is better to give up on DynamoDB with JDBC. I will take a look at direct integration with taoensso/faraday later.

alekcz 2024-11-05T15:29:59.448899Z

There's a separate library you load.

alekcz 2024-11-05T15:30:00.734929Z

https://github.com/replikativ/datahike-jdbc
