2022-12-16
Hey everyone! I have a beginner's question. Is there a way to run a singleton service for AWS lambdas? I am thinking about hosting the Datahike transactor for a group of functions that can run their queries locally inside the lambda, but need to coordinate their transactions with the transactor for strong consistency.
I understand your point, but I think this is not really true. Most operations typically just query the database, and then it is a very good fit for lambda functions, because they provide horizontal read scaling, a model that is very compatible with Datomic/Datahike's decoupled, scalable readers. The point is nonetheless that sometimes you will need to update the database and transact into it, and might want to do this from your lambdas. In this case I guess it would be ideal to have a service running that would be reachable from the lambdas and well integrated. I could just deploy such a transactor into AWS, but I was wondering whether there was already a notion of such services in lambda land.
For instance, if you just want to query a static Datahike database in lambdas, this would make perfect sense I think, because queries need zero coordination and can be executed in milliseconds with minimal reads from the index.
agreed. reads can be done in parallel, so suitable for lambdas. I just can't think of a way to maintain a singleton for keeping writes in serial. maybe ec2 for writes only? or build a disk/storage format which can reconcile writes, i.e. one that doesn't need a singleton writer
@U1C36HC6N Lambda has concurrency configuration, so you could have a transactor lambda with reserved concurrency of 1 and then reader lambdas, with unlimited concurrency > • Reserved concurrency – Reserved concurrency guarantees the maximum number of concurrent instances for the function. When a function has reserved concurrency, no other function can use that concurrency. There is no charge for configuring reserved concurrency for a function. https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
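for example, setting that could look roughly like this via babashka (the function name "transactor" is an assumption):
(require '[babashka.process :refer [shell]])

;; sketch: pin the transactor lambda to a single concurrent instance
;; ("transactor" is an illustrative function name)
(shell "aws" "lambda" "put-function-concurrency"
       "--function-name" "transactor"
       "--reserved-concurrent-executions" "1")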
I have actually been thinking about this same thing for DataHike, but haven’t had the time/energy to look into it 😄
I think the thing that set me off was that I didn't figure out if DataHike could be backed by just S3 & DynamoDB (there were some old trials of persistence layers that used S3 and DynamoDB that I looked into in the summer of last year, but those weren't up to date with the latest DataHike at that time, if I recall correctly)
IIRC, DataHike supports a SQL database as backing store, so Aurora Serverless v1 could be an option, but that has a cold start on the order of ~30 seconds, which is annoying
there’s also Serverless PostgreSQL options with better cold start (and more recent PostgreSQL versions), like https://neon.tech/
but to me, for a Datalog database, throwing all that querying capability of a SQL database out the window and using it only as a triple store feels wrong :D
I haven't wrapped my head around whether the querying lambdas would need to build some kind of query index in their memory, or whether this query index could reside in the memory of the transactor lambda. With provisioned concurrency, one could even keep such a transactor process always running, although that incurs a cost
Thank you for the contextualization. Yes, using concurrency of one could work. The S3 backend needs to be ported, but we simplified our backend, you only need to implement this protocol https://github.com/replikativ/konserve/blob/main/src/konserve/filestore.clj#L95 and not all methods are needed https://github.com/replikativ/konserve/blob/main/doc/backend.org#backing-store-protocols. So porting the old backend into a reliable backend should be an effort of a few hours max, hopefully. I don't have experience with AWS unfortunately, but I would be down to do a pairing session and make it happen.
My take on caching would be to leave it to AWS and just wrap services with different service qualities and then pick the respective backend for your project. Datahike has native image support now, so lambdas should already be fairly fast to fire up. I would look into holy-lambda more to prepackage Datahike, but maybe it is just good enough to add it as a dependency to a project actually.
@U06QSF3BK What would be a good test case application in your mind?
I would probably opt for S3 first and speculate that many simple applications can cope with its latency. Maybe DynamoDB as an alternative is then a good combination for apps where you are willing to pay for the latency. But I think some experimentation with simple setups would be a good start.
> I would probably opt for S3 first and speculate that many simple applications can cope with its latency. This sounds like a good rationale. I didn't actually have a good grounding to talk about DynamoDB, just that I have seen it come up with Datomic 😄 S3 has some interesting properties, like https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/.
Just a few weeks ago, the JVM AWS Lambda runtime (Java 11 currently) got support for creating a VM-level snapshot of the Lambda process after deployment; on invoke, it loads this snapshot, which avoids the slow cold start of a JVM process (in my trial of a Reitit Ring app, cold start went from ~7 seconds to 500 milliseconds). This is called https://aws.amazon.com/blogs/aws/new-accelerate-your-lambda-functions-with-lambda-snapstart/.
So, native-image support isn't strictly needed for fast cold start, although I think it is good to keep the code so that it is supported.
I'm not familiar with Datahike, but doesn't native-image then prevent use of clojure.core/eval, where one could evaluate code to use in a query, for example, for explorative purposes? Not sure if this would be a use-case though; maybe one does explorative queries in some other way than running against a Lambda-based infra.
> I would be down to do a pairing session and make it happen. I'd be interested, I just have to get better at my time management :D
> What would be a good test case application in your mind? I don't actually have experience with Datalog databases, but probably something that deals with aspects where Datalog is a very good choice? I guess for Lambda + S3, something that fleshes out the compute and the persistence parts, but still is not pathological in that sense.
Btw. we don't even use SQL as a triple store, just as a blob store. That is also why I think it is not a good default backend; it is very wasteful.
and that is just the cold start; when the Lambda has the process running, response times are lower
I don't think we need eval, I think almost all Clojure applications using Datahike can be natively compiled.
Ok, it is very cool to have options for sure. I just want to aim for the most simple setup that is resource efficient, but maybe not fastest.
yup, I think I was going a bit too far, I haven't used Datalog databases, so was wondering how people do explorative queries, but I guess that happens at the repl, not in the deployed app 🙂
Yes. I am a fan of JIT compilers and interactive setups, but native image compilation provides interesting options to scale out like this.
but I guess the interesting thing is whether a transactor lambda with reserved concurrency would fit the singleton transactor requirement
@U0510KXTU I see.
Also, not sure how important this is, but I have not tried AOT compiling Datahike lately on the JVM.
I'm expecting Snapstart to be available for other runtimes when they figure out how to offer stable random numbers that don't get frozen
However, SSL connections are slow the first time due to the handshake. Snapstart can't fix networking
https://docs.aws.amazon.com/lambda/latest/dg/snapstart-uniqueness.html, there's a scanner that operates on the bytecode level to check for patterns that one would want to avoid with Snapstart
I'd think this Snapstart would be great for, say, ML stuff, where you'd load a model into memory, then freeze the process, then thaw it upon the first request and do inference
It's excellent for CPU-bound tasks. Still figuring out how to use it for SSL calls, i.e. AWS APIs
Deep learning requires a lot of GPU memory, just loading this will always be slow in current stacks.
Ideally I would like to have an API that can also be used asynchronously, e.g. with callbacks for the http requests.
not sure if it's necessary here, but that Java API has more support for things like multipart download and efficient syncing of large data, which we don't need here. Generally I think it tracks new S3 features well, and has pluggable HTTP client support (AWS has their "common runtime", which is an optimized C library I think)
https://github.com/FieryCod/holy-lambda/tree/master/examples/bb/native/aws-interop-v2.example
those aws java sdk libs ship with native-image configurations; haven't looked into how much they matter, but they make an effort to keep the libraries GraalVM native-image compatible
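for the async/callback angle mentioned above, a small interop sketch with the v2 SDK's S3AsyncClient (bucket and key are placeholders):
(import '(software.amazon.awssdk.services.s3 S3AsyncClient)
        '(software.amazon.awssdk.services.s3.model GetObjectRequest)
        '(software.amazon.awssdk.core.async AsyncResponseTransformer)
        '(java.util.function BiConsumer))

(def client (S3AsyncClient/create))

;; getObject returns a CompletableFuture, so we can register a callback
;; instead of blocking the handler thread
(-> (.getObject client
                (-> (GetObjectRequest/builder)
                    (.bucket "my-bucket") ;; placeholder
                    (.key "some/key")     ;; placeholder
                    (.build))
                (AsyncResponseTransformer/toBytes))
    (.whenComplete
     (reify BiConsumer
       (accept [_ response throwable]
         (if throwable
           (println "get failed:" (.getMessage throwable))
           (println "got" (alength (.asByteArray response)) "bytes"))))))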
Apropos, when developing, you can use for example MinIO via a Docker image; it has good support for the S3 API, so one can use the AWS Java SDK against MinIO. https://min.io/docs/minio/container/index.html Continuing that thought, an S3 backend for Datahike would allow using any other object store that implements the S3 API, which I think is interesting.
I think I'd be interested in a pairing session, I just don't know when, and I might be the slower one that benefits most :D
I have implemented https://github.com/replikativ/konserve-s3 and https://github.com/replikativ/datahike-s3 taking inspiration from @U0510KXTU’s link above. I still need to figure out how to release the two projects with our deployment pipeline, but you can just use the github SHAs in deps.edn for now. @U06QSF3BK I would be down to pair and see how you would wire it up with lambda if you have some time. I don't have enough time myself right now to get into holy-lambda and the AWS stack unfortunately. Latency is as expected higher than with local storage, but if you do not write a lot, caches can stay warm and queries would perform with one roundtrip only (checking whether the DB has changed). Maybe latency is also much better if you access from AWS directly.
Both repos are released now, so you can just depend on datahike-s3 and datahike and develop against that.
I would be in particular interested to have an MVP example with one lambda covering the transact function to S3 (which is guaranteed to only run once at a time) and some query in another lambda. If you can help me set up an example project for that I would be very happy.
With this PR we now have close to optimal latency and automatic transaction batching under backpressure https://github.com/replikativ/datahike/pull/618 (I still need to clean it up a bit for it to be merged). This should benefit the S3 backend the most, as it has high latency but can handle high throughput. Any suggestions for how I could build a demo project as simple as possible, ideally as a starter template for holy-lambda?
a starter template is a neat idea. I think that if the template is serverless, then it should contain the single-writer setup (which I haven't yet gotten around to trying out), which is in significant part about creating the infra with some tool (I prefer terraform)
also, although holy-lambda does great things to make life easier when using native-image, now that https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html is around, the actual making of a lambda could be simpler: just use the jvm11 runtime and make a class that implements the lambda entrypoint required by that runtime. so in a template, holy-lambda is kind of optional even. but if the demo would be an app with a frontend and, say, a REST API, then the ring adapter in holy-lambda is really useful
the S3 backend is also useful outside lambda I think, but it should definitely be tried out in a Lambda :)
yes, it is interesting for us, because we cannot easily offer a hosted service right now, but might get contracts and support by offering datahike on lambda (that is just a guess by me), and it is a good starting point to then offer the writer as an EC2 instance
i am fine also with snapstart, honestly i am a n00b on aws, my mind is mostly on distributed persistent data structures
so i would be down to take pointers if you are super busy, or pair if you have some time at some point
basically, aws's own jvm11 runtime (suggest jvm11 over jvm8) takes an uberjar with a class that implements the com.amazonaws.services.lambda.runtime.RequestStreamHandler interface, found in com.amazonaws/aws-lambda-java-core {:mvn/version "1.2.1"}, which you need to include in the uberjar
make a lambda on aws console, upload the jar, then name the handler class, sounds a bit minimalistic, but that's the start 😄
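for reference, a minimal sketch of such a handler class in Clojure (namespace name is illustrative; it just echoes the event back):
(ns demo.handler
  (:gen-class
   :implements [com.amazonaws.services.lambda.runtime.RequestStreamHandler])
  (:require [clojure.java.io :as io]))

;; implements RequestStreamHandler's handleRequest(InputStream, OutputStream, Context)
(defn -handleRequest [_this in out _ctx]
  ;; a real handler would parse the event JSON and dispatch on it;
  ;; this just echoes the event back
  (io/copy in out))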
not sure if it helps, but my mind was focused on hacking on a bit different thing. there are some bits that you could steal from this, if it helps and terraform suits you :) https://github.com/viesti/clj-lambda-sideloader/tree/main/example
been incrementing a slack reminder for a couple of weeks, for a weekend to look into this datahike thing 😄
to get started with datahike it should be enough to copy this snippet and use it in a project with your S3 settings https://github.com/replikativ/datahike-s3#run-datahike-in-your-repl
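roughly along these lines, I think (the bucket name is a placeholder, and :schema-flexibility :read is an assumption so the example can transact without defining a schema first):
(require '[datahike.api :as d])

(def cfg {:store {:backend :s3
                  :bucket "my-datahike-bucket" ;; placeholder: your bucket
                  :region "us-west-1"}
          :schema-flexibility :read})          ;; assumption: schema-on-read

(d/create-database cfg)
(def conn (d/connect cfg))
(d/transact conn [{:name "Alice" :age 32}])
(d/q '[:find ?e ?n ?a
       :where [?e :name ?n] [?e :age ?a]]
     @conn)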
Hmm, tried it out a bit. d/delete-database seems to delete the whole bucket, which was a bit unexpected I think 🙂
I'm thinking: could there be a prefix in the store configuration? the prefix would then be used to "name" a database, so you could remove all files under a prefix if needed
I think S3 buckets are quite long-lasting things; re-creating a bucket with the same name (if you deleted it accidentally) can take some time, since AWS also reserves a DNS name for the bucket
but anyway, managed a hello world in lambda, yay. can put the code & terraform on github soon
ah, the other thing: deleting a bucket is quite a, hmm, heavy operation; one would not want to grant that to a backend (though I failed to limit delete access in my test 😄)
whipped out something quick and dirty https://github.com/viesti/clj-lambda-datahike
how would you carve out the singleton lambda for transact? i think splitting the example into a transact lambda and two different query lambdas would be a good starting point for a template
we could also have the same lambda source code, but say an environment variable that toggles whether the deployed instance works as a transactor or a query node
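something like this sketch, maybe (the env var name and the two handler fns are hypothetical):
;; sketch: one artifact, role picked at startup from an env var (name is hypothetical)
(declare handle-write handle-read) ;; hypothetical fns, defined elsewhere in the project

(def role (or (System/getenv "DATAHIKE_ROLE") "reader"))

(defn handle-event [event]
  (case role
    "transactor" (handle-write event) ;; wraps d/transact; deploy with reserved concurrency 1
    "reader"     (handle-read event)))  ;; wraps d/q; deploy with unlimited concurrency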
Took a look. I think I forgot to say that it might be neat to be able to specify a prefix, so you could have multiple databases in a single bucket, something like
{:store {:backend :s3
:bucket "datahike-s3-instance"
:prefix "my-db-1"
:region "us-west-1"}}
tried it out with separate writer and reader lambdas, but what happens is a bit interesting
0% bb run write '{"data": [{"name": "Alice", "age": 32}]}'
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":"ok","status":"ok"}
0% bb run read
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":[[3,"Alice",32]],"status":"ok"}
0% bb run write '{"data": [{"name": "Bob", "age": 42}]}'
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":"ok","status":"ok"}
0% bb run read
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":[[3,"Alice",32]],"status":"ok"}
so what is going on here is that the reader lambda has a stale db reference, since it doesn't show the write that the writer did
is there a way to tell datahike to "go refresh caches from the persistent store"
injected this for the query connection before using it:
(swap! (:wrapped-atom conn)
       ;; drop the :streaming? flag so reads refetch from the underlying store
       (fn [db] (update db :writer #(assoc % :streaming? false))))
that forces the connection to refetch from the underlying store every time you access it
that would be here https://github.com/viesti/clj-lambda-datahike/blob/main/src/clj_lambda_datahike/core.clj#L31, before the call to d/q
How long has that option been around? I think I was looking for something like that maybe 1-2 years ago
well, and at that time, s3 backend wasn’t around, which was the thing that I was actually looking for 🙂
well I’ll be damned, I guess it worked!
0% bb run write '{"data": [{"name": "Pedro jr", "age": 15}]}'
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":"ok","status":"ok"}
0% bb run read
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":[[6,"Pedro jr",15],[5,"Pablo",55],[4,"Bob",42],[3,"Alice",32]],"status":"ok"}
0% bb run write '{"data": [{"name": "Pedro", "age": 59}]}'
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":"ok","status":"ok"}
0% bb run read
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
{"result":[[6,"Pedro jr",15],[5,"Pablo",55],[4,"Bob",42],[3,"Alice",32],[7,"Pedro",59]],"status":"ok"}
both Pedros visible after each read. i have a PR for datahike that significantly reduces write latency btw., and could do auto batching in case we run the transact calls async in the lambda https://github.com/replikativ/datahike/pull/618
walking the dog outside, -2 and fingers freezing. still a bit bewildered and glad that I could help. thinking that I would need to do some demo with a frontend, say a todo list :) then also thinking about a perf suite and a snapstart setup for eliminating cold starts
to really have a serverless database, even datalog style, for Clojure, is just wicked :)
fortunately freezing stopped here in vancouver already 🙂 my partner in montreal is still freezing though
what you say makes sense. i would be super grateful for anything you can help with, as i am thinly stretched atm., also with my AI research (which hopefully i can integrate into Datahike as probabilistic inference)
i also need to do sales again as soon as there is something interesting to sell 🙂 atm. we do not make a lot of revenue with datahike and that slows its development
but i also needed to first get the distributed use case done before i wanted to go out and pitch it
i would write a blog post as soon as we have a project template that people can use to build prototypes and small apps
is it possible to fetch and process multiple requests in lambda that return asynchronously?
lambda is event by event, although there is async invoke to dispatch without waiting but then the event size is quite limited
the server can process multiple requests in parallel and batch them, which gives you better scale on S3
i wonder about putting write requests to, say, sqs or another queue supported by lambda and then batching off the queue
i think you want tx responses though, so the client needs to be notified only after the tx call
> By default, Lambda polls up to 10 messages in your queue at once and sends that batch to your function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to poll messages from the SQS standard queue until the batch window expires, the invocation payload size quota is reached, or the configured maximum batch size is reached. > https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html it's been some time since I did these things; I remember there being some batch-size and window-size tunables with stuff like Kinesis Firehose, which allowed putting event processing into lambda when going through Kinesis
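a hypothetical sketch of the batching side (assumes the standard SQS Lambda event shape, and that each message body carries a :data vector of tx data):
(require '[clojure.data.json :as json]
         '[datahike.api :as d])

;; one invocation can carry a whole batch of SQS records;
;; fold them all into a single transact call
(defn handle-sqs-batch [conn event]
  (let [tx-data (mapcat #(-> (get % "body")
                             (json/read-str :key-fn keyword)
                             :data)
                        (get event "Records"))]
    (d/transact conn {:tx-data (vec tx-data)})))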
these serverless things are kind of, well... you don't configure a single server, but a host of services :)
when inside aws, lambda talks quite fast to nearby aws services, but yeah, some kind of benchmark would be neat
it might be nonetheless reasonable to just run an EC2 instance for transact then, not sure how the prices of the services compare
if you have enough traffic that the lambda is kept running all the time, then ec2 is cheaper, but it gets more complex, since nowadays you can even buy compute capacity for lambda upfront and benefit from discounts the same way as with reserved instances of ec2s or databases
for on-and-off traffic, this kind of setup with Lambda doing writes, with a fast enough cold start, is appealing
will proceed to bed now :D, but by other lambda runtimes you probably mean, say, GCP Cloud Run, since the other JVM option in AWS would be a custom runtime. I tried GCP Cloud Run when it came out; it has probably advanced since. I think it even has an option to keep the compute that runs the process "warm" without throttling, as opposed to Lambda, where the process runs only when an event is processed and is otherwise frozen, so you can't do background processing, only execute while handling an event (though there is an upper processing limit in Cloud Run too I think). I'd want to set up Snapstart for the AWS lambda next; not sure what after that. some kind of write and read benchmark would probably be neat. read-side scaling is interesting, but I'd have to figure out a suitable benchmark scenario. does datahike have benchmarks available?
we have https://github.com/replikativ/datahike/blob/main/doc/benchmarking.md, but this probably needs to be adjusted a bit