#xtdb
2023-04-11
John Sullivan 00:04:23

hey all, looking for a bit of domain modelling advice. I'm trying to make a little metadata tracker for a large fileshare we have onsite. So all files would be referred to by a sha256 hash sum, and multiple processes (tika extraction, manual data entry, archival records, etc) can assert attributes/metadata about a given file. I'd like to keep all these assertions in xtdb so I can make more sophisticated search/aggregations on top. I feel like, given xtdb's design, it would make the most sense to collect any set of assertions as individual documents (each with an attribute for the file hash, like ::describes-hash). This way there can be multiple parties saying something about a file without making what's already been said invalid. I know xtdb does keep the transaction log of all previously asserted documents, but I'd like to have a way to project multiple perspectives of the same file as information becomes available. Or am I off base, and should I just make updates to documents identified by file hash instead?

tatut 04:04:59

there are trade-offs either way… if you have independent small docs that say something about a file hash, then you can update them in smaller pieces. if you have one large doc, then you need to read it and write a new version of the whole doc anytime anything changes

tatut 04:04:54

but with the one large doc it is easier to see the whole history of that file by looking at the entity history of that doc

tatut 04:04:00

I’d say both are valid ways of looking at the problem; which is more convenient will depend on how you use the data
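
To make the trade-off concrete, here is a minimal sketch of the two document shapes; ::describes-hash is taken from the question above, while the other attribute names and values are illustrative assumptions:

```clojure
;; Approach 1: many small assertion docs, one per (source, file) claim.
;; Each doc points at the file through its content hash.
{:xt/id "assert/tika/9f86d081..."   ; hypothetical id scheme
 ::describes-hash "9f86d081..."     ; sha256 of the file
 ::source :tika-extraction
 ::asserted {:mime-type "application/pdf"
             :page-count 12}}

;; Approach 2: one large doc per file, keyed by the hash itself.
;; Any change rewrites the whole doc, but the entity history of
;; that one id then shows the file's complete timeline in one place.
{:xt/id "9f86d081..."
 :file/tika {:mime-type "application/pdf"
             :page-count 12}
 :file/manual-entry {:title "Quarterly report"}}
```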

John Sullivan 05:04:54

@U11SJ6Q0K thanks for the perspective! I was thinking about it a little more and I think the smaller documents approach gives a little more utility, in that the sources of changes can be more easily distinguished. I'm putting this together as the driver of an archival system, so I think it would be good to keep as much info as possible on how the system changes.

John Sullivan 05:04:34

Good point on the document history, though; that does give a more logical change-over-time metaphor (**logical? consistent? not sure lol)

rickheere 19:04:48

Hey John, I'm making almost exactly the same thing as you: an archiving system. Good to think about this. I went with the one-document version without considering doing it another way

💯 2
rickheere 19:04:16

Is it possible to make specialized nodes? In other words: I expect to get a lot of load on writes, so I was thinking of having multiple nodes handling the write part of my application. I'm thinking that I don't need an index of the database if I only want to do writes. For reading from the database I will have other nodes that have an index. The question is: is it possible to disable indexes for a node? Also, if this question makes no sense and I'm thinking about it in the wrong way, let me know.

refset 20:04:20

Hey @U4XT72NNT, if those writes are purely async then yes, you can do this with new-submit-client, which can only write to the tx-log and doc-store, doesn't store indexes, and can't perform regular reads (i.e. it's not really a node at all, but just borrows a subset of the relevant APIs): https://docs.xtdb.com/clients/clojure/#_new_submit_client
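
A minimal sketch of what that could look like, assuming a Kafka tx-log and an S3 document store; the module options below are illustrative rather than a verified config, so check the config docs for your deployment:

```clojure
(require '[xtdb.api :as xt])

;; A submit client is configured with just the shared tx-log and
;; document store; there is no index store, so it cannot be queried.
;; NOTE: the :kafka-config and :bucket values are placeholders.
(with-open [client (xt/new-submit-client
                    {:xtdb/tx-log {:xtdb/module 'xtdb.kafka/->tx-log
                                   :kafka-config {:xtdb/module 'xtdb.kafka/->kafka-config
                                                  :bootstrap-servers "localhost:9092"}}
                     :xtdb/document-store {:xtdb/module 'xtdb.s3/->document-store
                                           :bucket "my-doc-store"}})]
  ;; writes work as usual, but there is no (xt/db client) to read from
  (xt/submit-tx client [[::xt/put {:xt/id "9f86d081..."
                                   :file/size-bytes 123456}]]))
```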

rickheere 20:04:19

Yeah, that sounds exactly like what I need. Thank you!

🙏 2
Martynas Maciulevičius 04:04:08

What will you do if you want to check the current version of the doc? If there is no index then you won't be able to check...? Also, do tx fns run on this kind of node?

rickheere 06:04:05

That is a good question. I'm not sure yet if it will work, but this is the plan. I have an endpoint that receives one or more files with some additional data. I do want to do some additional processing on the data, but the load can at times be so high that I have to defer that until later. So I store the files in some S3 layer. Then I store the additional data in the db, somehow linking those together, probably based on a hash of the file. Then I put an event on Kafka saying a file was added so I can do some processing later. I will have other nodes with an index that will do the processing, but those are not in too much of a hurry. I think the processing will just leave the documents as they are and add more documents, all "linked" together based on the hash.
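
One way the hash-linked documents in that plan might look; every attribute name here is a hypothetical choice, with the shared :file/sha256 value doing the linking:

```clojure
;; Written at ingest time by a submit-only node:
{:xt/id "upload/9f86d081..."   ; id derived from the file hash
 :file/sha256 "9f86d081..."
 :file/s3-key "files/9f86d081..."
 :file/received-at #inst "2023-04-11T06:00:00Z"}

;; Added later by an indexing node doing the deferred processing,
;; joinable on the same :file/sha256 value:
{:xt/id "tika/9f86d081..."
 :file/sha256 "9f86d081..."
 :file/mime-type "application/pdf"}
```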

Martynas Maciulevičius 06:04:56

Also, there is one more problem: from what I understand, when you submit data to a node, other indexing nodes won't receive the update until you sync them. So if you have a client that submits an update to a writer node but reads from a full node, then if your full nodes have even a little lag you'll get an inconsistency in your system.

rickheere 06:04:41

Then I'm in luck, because the clients that send the information don't care about what happens to the documents. They only want to know the document was delivered.

Martynas Maciulevičius 07:04:16

This problem doesn't come from the nodes themselves but from your architecture; it's not an error in XTDB. I just wanted to clear that up.

rickheere 07:04:25

I'm very grateful you're asking these questions; it's a big project, and if I go with the wrong architecture that would be a big mistake.

rickheere 07:04:47

I made a quick and ugly overview of the architecture. Have a look.

Martynas Maciulevičius 07:04:37

Well, based on the xtdb-kafka library, you don't really submit your data directly to a Kafka queue. Instead you submit it to a node that has XTDB running, and then you wait until it's complete for that node. So your diagram doesn't actually show your decision to split the input node and the full node, i.e. I think you will have the consistency problem when you do it. You could probably call sync on every read that comes after a write when you interact with two nodes at the same time, but then the back-end would have to handle that. So you don't really gain performance from splitting your nodes for your backend if you have to sync constantly. Obviously you'll get performance for simply submitting, but clients will have to sync all the time, i.e. when a client wants to read their own write, they won't be able to every time unless you call sync, or write through the same node that you read through and wait until it completes the tx.
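
A sketch of that read-your-writes pattern, assuming a submit-only client for writes and a separate full node (query-node below) for reads:

```clojure
;; submit-tx returns a tx receipt; await-tx blocks until the full
;; node has indexed at least that transaction, so the subsequent
;; read sees the write, at the cost of waiting out the indexing lag.
(let [tx (xt/submit-tx submit-client
                       [[::xt/put {:xt/id "9f86d081..."
                                   :file/status :received}]])]
  (xt/await-tx query-node tx)
  (xt/entity (xt/db query-node) "9f86d081..."))
```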

rickheere 08:04:27

Alright, I updated the design; it now shows where the full and write nodes are used. XTDB is hosted on S3 for the documents and on Kafka for the log. I also use another S3 bucket to keep the files, and I use another Kafka topic to buffer "file inserted" events. Then, while processing the events, I only have to add static information about the file to the database, which I can add in another document "joinable" by the hash. But maybe later we'll find out I do need information out of xtdb, so I'll have to run a full node; because this process is buffered in Kafka, though, we are not in a hurry. There will also be a full node for when users want to query xtdb to search for documents, but it is not a problem if that system is a bit out of sync. Have a look, much appreciated.
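
Given documents linked that way, a full node could later gather everything asserted about one file in a single query; :file/sha256 is the assumed link attribute from the sketches above:

```clojure
;; Pull every document that mentions a given file hash.
(xt/q (xt/db query-node)
      '{:find [(pull ?doc [*])]
        :in [?hash]
        :where [[?doc :file/sha256 ?hash]]}
      "9f86d081...")
```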

Martynas Maciulevičius 11:04:51

I'm not going to look because I'm not paid by you. I have other things to do and I just saw one caveat and pointed it out. I gave you a clue and if you're not sure about your own design then you're out of luck. You ask for too much.

rickheere 11:04:41

No hard feelings. I was under the impression I was helping you understand the design, which I'm pretty sure about. Thanks anyway.

refset 12:04:52

+1, thanks for chiming in and contributing to the discussion in the first instance @U028ART884X. Certainly these kinds of designs do require some careful, dedicated thought to get right!
> do tx fns run on this kind of node?
Nope, tx fns only run during indexing on a proper node that performs indexing (i.e. not a new-submit-client)
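
To illustrate: a transaction function is itself just a document, so a submit client can submit and invoke one, but the body only ever executes on a node that indexes the transaction. A minimal sketch, with hypothetical names:

```clojure
;; Define the tx fn (a doc whose :xt/fn holds a quoted fn of ctx + args):
(xt/submit-tx node
  [[::xt/put {:xt/id :set-attr
              :xt/fn '(fn [ctx eid k v]
                        (let [e (xtdb.api/entity (xtdb.api/db ctx) eid)]
                          [[:xtdb.api/put (assoc e k v)]]))}]])

;; Invoke it by id; the body runs during indexing, not at submit time:
(xt/submit-tx node [[::xt/fn :set-attr "9f86d081..." :file/status :archived]])
```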