Fork me on GitHub
#off-topic
<
2020-08-14
>
p-himik13:08:08

A question about a good storage mechanism and indexing. I have 31000 of unique text IDs with length ranging from, say, 3 to 20 characters. For each unordered pair of IDs, there's a float value (most of the values are 0). And while there are almost half a billion such pairs, due to the nature of the values there are only about 12 million unique values. So many pairs have the same value assigned to them. The data is not static - in the future, new IDs will appear, so the 31001st ID will bring in 31000 new pairs and, hence, 31000 values. I need to be able to store it persistently in such a way so that extracting the value for a random pair of IDs is as quick as possible. Seems like in terms of data structures, a trie would be ideal here. But I may be wrong. Do you have any suggestions?

andy.fingerhut13:08:49

What about a Clojure map, where the keys are sets of two text IDs, and the associated values are the numbers? If the map does not contain a particular key, then it has the 'default' common value, which I believe you say is 0.

andy.fingerhut13:08:50

When you say store persistently, you mean in a database? Or you mean persistent as in persistent data structure in memory, like Clojure's persistent data structures?

andy.fingerhut13:08:59

A variation on my first suggestion is to represent a pair of text IDs as a single string. If there is some character that you know will never be in any of text IDs, e.g. a comma character, you can represent a "set of 2 text IDs" by sorting the two text ID strings, then concatenating them together, separated by a comma. For example the pair of text IDs "375" and "A28B7" could be represented uniquely by the string "375,A28B7", assuming "375" sorts less than "A28B7"

andy.fingerhut13:08:59

A variation on my first suggestion is to represent a pair of text IDs as a single string. If there is some character that you know will never be in any of text IDs, e.g. a comma character, you can represent a "set of 2 text IDs" by sorting the two text ID strings, then concatenating them together, separated by a comma. For example the pair of text IDs "375" and "A28B7" could be represented uniquely by the string "375,A28B7", assuming "375" sorts less than "A28B7"

p-himik17:08:27

By "persistently" I mean persisting on disk in any way. Even skipping that, I'm not sure about just using a hash table. Just the pointers to the strings will take up 4GB of memory.

p-himik17:08:11

And combining the IDs together will turn 31000 unique strings into half a billion unique strings. String pool explodes, RAM explodes.

p-himik18:08:24

Do you have any experience with SQLite storing billions of records?

p-himik18:08:13

To the best of my knowledge, even the more "enterprise" RDBMS suffer with that amount of rows. Maybe it's a good area to try NoSQL, but I have zero experience with it.

andy.fingerhut18:08:35

There is no reason to explicitly store the pairs of strings that have the associated "default" value of 0, is there?

p-himik18:08:15

I think that's right. But I cannot guarantee that "most values are 0" will hold forever. The data deals with pedigree. A simple introduction of a common parent can replace all those zeros with some other values.

andy.fingerhut18:08:51

I agree that for 31,000 text IDs, there are "31,000 choose 2" or about half a billion unordered pairs, but then you said: "due to the nature of the values there are only about 12 million unique values." I do not understand what you mean by unique values? Do you mean only 12 million of those half a billion pairs are associated with a non-0 numeric value?

andy.fingerhut18:08:43

Or do you mean that there are only 12 million unique numeric values, but which of the half a billion pairs are associated with each of those 12 million numeric values is pretty arbitrarily 'spread around' the pairs of IDs?

andy.fingerhut18:08:47

If you have only 12 million ID pairs associated with non-0 numeric values, then you only need to explicitly store those 12 million pairs, and their associated numeric values, and there is no need to explicitly store half a billion pairs explicitly anywhere at all.

p-himik18:08:09

All the values are computed by applying some simple arithmetic to integer values. So I end up with values like 0, 1, 0.5, 0.75, 0.125, and so on. Although here I have no idea if this can be changed dramatically by introducing new data.

p-himik18:08:22

So many pairs have the same float value associated with them.

andy.fingerhut18:08:19

Are the numeric values calculated from the contents of the text ID strings themselves? Or from some other data associated with the text IDs and stored / represented elsewhere?

andy.fingerhut19:08:34

One technique to try to reduce storage space is to pick some size of integer, e.g. a 32-bit integer, and create a unique 32-bit integer id associated with each of the current 31,000 text strings, and use pairs of those 32-bit integer ids as keys in a map from the unordered pairs to the numeric values. Then you only need to store each ID string once.

andy.fingerhut19:08:45

The 32-bit integer IDs will need to be stored many times, but as long as they are significantly smaller, on average, than the text ID strings, you've saved a lot of bits.

p-himik19:08:40

The come from some other data, yes. Each ID represents a dog. A dog has parents. It creates huge pedigree graphs, based on which the values (kinship) are calculated. > One technique [...] It's a pretty common approach, I thought about that a bit. That's basically what JVM already does for me for short strings with its string pool. The thing is (and I realize I probably wasn't clear enough initially), I'd like to have a more or less off-the-shelf solution, akin to RDBMS, that would all that thinking for me. My musings on the tries were mainly caused by my attempts to find something that already supports tries.

p-himik19:08:48

I actually did a small experiment with the numbers approach, only in Python. Well, it ate 60GB of RAM when it was halfway there... Just out of interest I will experiment with Java as well.

andy.fingerhut00:08:30

The JVM and Clojure use 'boxed numbers' in many places if you do not somehow force them to be primitives. Boxed numbers take 12 to 16 bytes per number, versus the 32-bit or 64-bit numeric value itself, because of JVM object overhead. It is possible to create data structures that force them to be primitives, but it is unlikely to be the default in Clojure.

thom11:08:35

Sounds like life would be much easier with this all in a graph database. Nodes are dogs, relations can be things like parent/child etc. That way you can very naturally query and update hierarchies of any arbitrary depth.

p-himik12:08:50

@UTF99QP7V But that wasn't the question. I don't need to query or update hierarchies.

thom12:08:19

Sounded like once you got new pedigree info you need to propagate that to descendants?

p-himik12:08:24

I don't know what you mean by "propagate", but even if I do need all that, that wasn't the question. :)

thom12:08:50

You’re maintaining family tree info for dogs and calculating values for purebrededness or mix of breeds, presumably? New info can come in about ancestors and descendants, and so you need to be able to trace the genealogy and update those values. You want to persist this data safely. Graph databases live and breathe for these exact use cases. If you’ve got your own data structures and algorithms in mind that’s great, just seems like more work.

p-himik12:08:54

It all can be (and is) perfectly modeled using a relational database. I don't need to traverse the entire pedigree to update/query some dog because all operations on dogs are done using dog IDs. The question was about data that's not hierarchical at all, it's just a matrix. I don't have any problems with maintaining the pedigree.

thom12:08:21

Okay, apologies, have fun.

Mark Gerard18:08:59

Is there a term for that feeling you have when you write something in clojure, then wonder if it is idiomatic enough? I see it cropping up a lot. I propose clojure beginner syndrome

jamesleonis18:08:08

Often I get that feeling, but usually I'm asking "Is there a function for that?"

borkdude18:08:44

@jamesleonis Maybe https://borkdude.github.io/re-find.web/ can help you with that

seancorfield18:08:53

@cattabanks That affects folks who've been using Clojure for years too! 🙂

phronmophobic18:08:13

it's not just beginners. what's the feeling when you just tried to answer "is this idiomatic?" and immediately see "alexmiller is typing..." 😰

seancorfield18:08:31

Even after a decade of production use, I'll still sometimes post a bit of code to #clojure and ask for opinions/suggestions.

Mark Gerard18:08:53

@seancorfield tbh that is reassuring

jamesleonis18:08:58

I often have the Clojure Cheatsheet open just for this reason 😆

jamesleonis18:08:32

@borkdude Oh my god that's awesome!

jamesleonis18:08:04

(reset! days-since-last-impressed 0)

seancorfield18:08:18

@cattabanks It's rare at this point that someone suggests a function I'm not aware of but they often suggest a combination of functions that I hadn't considered 🙂

walterl19:08:17

In precisely that way, I've wondered if Clojure deliberately shares one of Larry Wall's aims with Perl: the basics are dead simple (you don't need to learn a lot to get started), but the more you learn, the more you can do with it. Just like with natural languages.