Fork me on GitHub
#asami
<
2020-12-03
>
whilo08:12:34

@quoll I have done some quick experiments, but I probably do something wrong. My results are here: https://github.com/whilo/asami/blob/benchmark-comparison-hitchhiker-tree/test/asami/durable/pool_test.cljc#L230

whilo08:12:13

Maybe you can have a look before our meeting later. I realized that the book contains a lot of double words and decided to only insert a set of those, but performance seems similar otherwise as well.

quoll12:12:02

Even if performance seems similar, only inserting a set of the words is not going to be realistic usage. The purpose of this is to get a single, consistent ID value for every string that is passed in. Which means that it needs to search for them as well as store the ones it doesn’t find. Of course, caching will matter a lot for that, and 13k words will easily fit in memory, which will be why the performance looks similar at this size.

quoll13:12:37

I haven’t looked at the loading yet, though reading is being done completely differently. In the HH case you’re iterating over the data, while for Asami you’re doing an individual lookup (i.e. search the tree) for each word. A more direct comparison there will be to use a tree iterator node-seq

quoll13:12:26

That would be:

(time (doall (node-seq (:index bpool) (first-node bpool))))

quoll13:12:03

However, that will only count the words that were stored. Short words aren’t stored!

quoll13:12:33

Strings of 7 characters or fewer are turned into a number. It’s only strings of 8 characters or more that are saved in the tree

quoll13:12:44

(I say “characters”. It’s really “bytes”. But I use UTF-8, and that text is all in the ASCII range, so it’s the same)

quoll13:12:08

One difference I see in loading is in what is being done for each insert. Without knowing the hh code, it appears that you’re loading everything in a batch, then flushing. I presume that the load operation is ordering the data, right?

quoll13:12:03

The Asami code is doing a 2 stage operation for each string: • Does the string exist in the pool? If so, return its ID. • Insert the string, returning the new ID. The result is a seq of IDs that represent the original text. (Called coded in my example).

quoll13:12:55

The second step of my test was to use the pool to map the sequence of numbers in coded back into their original strings, using the index.

quoll13:12:33

That said, I fully expect the HH code to be faster at loading, though not by as much as your earlier test suggests

quoll14:12:59

This machine (apparently fast than yours) encodes the strings into numbers in about 7 seconds (7436ms on my last run. I’m seeing faster and slower) However, that’s only the strings that are longer than 7 bytes, which 7051. You’ll likely be faster again when you restrict yourself to just those. Iterating over that takes 220ms

whilo15:12:42

The data is not presorted before it is inserted into the hitchhiker-tree. Every string is inserted in sequence and the sorting happens internally, partially lazily, because the hitchhiker-tree is fractal.

whilo15:12:12

Ok, with (time (doall (node-seq (:index bpool) (first-node bpool)))) bulk reading takes half the time for me than before. I have not benchmarked random accesses yet. For this it might be sensible to run a more careful benchmark. The hitchhiker-tree has a fairly nice benchmarking suite with reporting to an Excel sheet, albeit only for random insertions and deletes so far.

whilo15:12:20

I am not sure abou the pool internals with the string ids that you are mentioning. Are the ids important for the user of the pool or are they just internal?

quoll16:12:36

The IDs are the point of the pool

quoll16:12:56

The purpose of the pool is to map data to IDs, and to map IDs back to data

quoll14:12:10

You had originally stated 11am last week. When I said I couldn’t do that time you suggested today and that you were going to ask around for who wants to join… but then nothing more was said!

quoll14:12:20

I can manage 11am though. That’s 96 minutes from now

whilo15:12:52

I am here, yes, sorry for not confirming.

whilo15:12:17

(if this works, otherwise we can also go somewhere else)

quoll16:12:32

Hi! I’m really sorry… I was out on an errand and couldn’t respond

quoll16:12:13

I have to apologize. I was in a rush this morning, and didn’t think it through. I have a department meeting right now!

quoll16:12:18

It’s every Thursday at this time

whilo16:12:48

Ok, no problem.

quoll16:12:50

Oh, wait.

quoll16:12:53

We have a new boss

quoll16:12:01

and over the break she changed the time

quoll16:12:06

So I can meet now!

whilo16:12:11

Cool, let me know whether jitsi works for you.

quoll16:12:35

As being discussed offline, this is the program that Naga was running:

sibling(fred, barney).
parent(fred, mary).
sibling(mary, george).
gender(george, male).

parent(B, C) :- sibling(A, B), parent(A, C).
brother(A, B) :- sibling(A, B), gender(B, male).
uncle(A, C) :- parent(A, B), brother(B, C).
sibling(A, B) :- parent(A, P), parent(B, P), A != B.
gender(F, male) :- father(A, F).
parent(A, F) :- father(A, F).