Regarding the new vector embedding behavior: https://github.com/datalevin/datalevin/blob/master/doc/vector.md This seems to work with either a local or a remote API call to an embedding model, autofilling the embedding data as needed. My concern is having this behavior inside Datalevin as opposed to a separate process (docs->chunks->embeddings->add-datalevin). A separate process can fail and be restarted without worrying about the integrity and durability of the db. I would appreciate some understanding of where this is going.
I would still use a separate process, as you described, for a production document processing workflow. The embedding service in Datalevin is a convenience for lightweight embedded use cases that do not need sophisticated workflows, e.g. a personal agent. As to your concern about the integrity and durability of the DB: calling an embedding service does not degrade it, because the embedding happens before the write commit. If embedding fails, the transaction fails. That's exactly what a lightweight use case expects: if anything fails, fail fast and don't write any broken data.
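To illustrate the "embed before commit" ordering described above, here is a minimal sketch in Python. It is not Datalevin's implementation: `transact`, `fake_embed`, `EmbeddingError`, and the dict standing in for the database are all hypothetical names chosen for illustration. It only shows that when every embedding is computed before the commit point, a failed embedding leaves the store untouched.

```python
class EmbeddingError(Exception):
    """Raised when the (hypothetical) embedding call fails."""


def transact(store, docs, embed):
    """Embed every doc first, then commit all of them at once.

    `store` is a plain dict standing in for the database and `embed` is any
    callable mapping text to a vector. Both are illustrative assumptions.
    """
    staged = {}
    for doc_id, text in docs.items():
        try:
            staged[doc_id] = {"text": text, "embedding": embed(text)}
        except Exception as exc:
            # Fail fast: abort before anything has been written.
            raise EmbeddingError(f"embedding failed for {doc_id!r}") from exc
    # Commit point: reached only if every embedding succeeded.
    store.update(staged)
    return len(staged)


def fake_embed(text):
    """Toy embedder: fails on empty text, otherwise returns a 2-d vector."""
    if not text:
        raise ValueError("empty text")
    return [float(len(text)), float(text.count(" "))]
```

Calling `transact(store, {"b": "ok", "c": ""}, fake_embed)` raises `EmbeddingError`, and neither `"b"` nor `"c"` ends up in `store`, even though the embedding for `"b"` succeeded.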
That said, I can add an option to build the secondary indices (fulltext and vector) asynchronously. This would handle use cases with higher ingestion throughput requirements and less demand for "read-your-write" on the secondary indices, i.e. cases where users do not expect search to be available immediately after the write.
This async secondary indexing option should be useful for a data processing pipeline, where you want retries, back pressure, rate limiting, audit, process isolation, and so on. Sure. I will add this.
Added :indexing-mode :async option for fulltext, vector and embedding.
It supports retries, backoff, and reclaiming of work.
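For readers unfamiliar with these terms, here is a rough Python sketch of how retries, backoff, and work reclaim commonly fit together in an async indexing queue. It is an assumption-laden illustration, not Datalevin's actual mechanism: `Job`, `claim`, `finish`, `backoff_delay`, the lease length, and the attempt limit are all hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    attempts: int = 0
    not_before: float = 0.0   # earliest time the job may be claimed again
    lease_until: float = 0.0  # while now < lease_until, some worker owns the job
    done: bool = False
    failed: bool = False


def backoff_delay(attempts, base=1.0, cap=60.0):
    """Exponential backoff: base * 2^(attempts - 1), capped at `cap` seconds."""
    return min(cap, base * 2 ** (attempts - 1))


def claim(jobs, now, lease=30.0):
    """Claim the first runnable job.

    A job whose lease has expired (e.g. its worker crashed) is claimable
    again, which is the "reclaim" part.
    """
    for job in jobs:
        if not job.done and now >= job.not_before and now >= job.lease_until:
            job.lease_until = now + lease
            job.attempts += 1
            return job
    return None


def finish(job, now, ok, max_attempts=5):
    """Record the outcome of one attempt and schedule a retry if needed."""
    job.lease_until = 0.0
    if ok:
        job.done = True
    elif job.attempts >= max_attempts:
        job.done = True
        job.failed = True  # give up after too many attempts
    else:
        job.not_before = now + backoff_delay(job.attempts)
```

In this sketch a worker that dies mid-job simply never calls `finish`, and once the lease runs out another worker's `claim` picks the job up again. A failed attempt pushes `not_before` into the future by an exponentially growing delay.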