#asami
2021-03-20
quoll00:03:51

Asami alpha 6 is out now. Changes are:
• Inserting entities that have no :db/id will no longer report their own ID in the :tempids from the transaction (see the sketch below)
• No longer dependent on Cheshire, which means no more dependency on Jackson XML
• Fixed reflections in the durable code, with about a 30% speedup
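
A rough sketch of what the first change means in practice, using Asami's Datomic-style API; the connection URI and entity data here are made up for illustration:

(require '[asami.core :as d])

(def conn (d/connect "asami:mem://example"))

;; one entity with an explicit temporary :db/id, one without
(def tx-result @(d/transact conn {:tx-data [{:db/id -1 :name "a"}
                                            {:name "b"}]}))

;; as of alpha 6, only the entity that supplied a :db/id appears
;; in the :tempids map of the transaction result
(:tempids tx-result)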

Ⓜ️ 3
💪 12
quoll00:03:26

I’m doing training next week, so there won’t be any Asami development until the week after

Craig Brozefsky13:03:35

Just did the same load test with alpha 6 and can confirm that speedup estimate

👍 3
quoll15:03:46

I know I have work to do to improve load speed, but at the same time there’s a tradeoff between use cases:
• Regular updates and modifications (current design)
• Load once and analyze

quoll15:03:34

The current design is specifically to allow regular updates without great expense, while also trying to keep querying fast

quoll15:03:19

I have another design which is optimized for fast loading. But it can’t update anything in place. Instead, it would manage updates as another pair of graphs (additions and retractions), which get dynamically merged in with the main graph during queries. Then in the background, those changes would be merged into a single index again. This makes updates possible, but it’s not going to do lots of modifications quickly.
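
A minimal sketch of that query-time merge; the protocol and record names here are illustrative, not Asami's actual internals:

;; main: the write-once index; additions/retractions: small overlay graphs
(defprotocol PatternSource
  (resolve-pattern [g pattern] "Seq of triples matching pattern"))

(defrecord MergedGraph [main additions retractions]
  PatternSource
  (resolve-pattern [_ pattern]
    (let [retracted (set (resolve-pattern retractions pattern))]
      (concat
       ;; everything in the main index that has not been retracted...
       (remove retracted (resolve-pattern main pattern))
       ;; ...plus anything added since the last background merge
       (resolve-pattern additions pattern)))))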

quoll15:03:22

The thing I just implemented is actually a hybrid between the old Mulgara and this write-once design. It mostly follows what Mulgara did, but with a few exceptions.

quoll15:03:52

I knew that Clojure was going to bring some overhead, but I’m still a bit disappointed in the speed. Hoping I can improve it more

Craig Brozefsky20:03:37

Well, from what I know of the problem space Cisco is dealing with...

Craig Brozefsky20:03:53

My hunch would be fast load, minimal modification relative to loaded data set...

Craig Brozefsky20:03:19

aka, investigation/analysis vs progressive development of a stable and persistent knowledge set

quoll20:03:24

It depends on who’s using it. Right now my team is modifying things a LOT

Craig Brozefsky20:03:02

enrichment loads?

👍 3
quoll20:03:07

But I agree. I think the real value of these systems is in providing a new view for analysis

Craig Brozefsky20:03:28

or the synthesis of higher order models like "target" etc...

Craig Brozefsky20:03:49

Ok, cause to me, neither of those is modification

Craig Brozefsky20:03:08

but apparently those are modifications in terms of what you were talking about above

Craig Brozefsky20:03:26

I think "update" when I here modification, not addiing some triples

quoll20:03:44

They’ve recently started understanding what they get out of it, and they’re starting to use it for everything. Apparently it’s both easier for them, and the resulting code is faster than what they were doing before (which I’m surprised at, amused by, and grateful to learn)

Craig Brozefsky20:03:14

well, at some point, munging and mapping/zipping around in javascript data objects is just tedious 8^)

quoll20:03:19

They have lots of entities where they want to change state of values, not simply add new ones

Craig Brozefsky20:03:41

yah, so those "synthesized" entities, like Target

Craig Brozefsky20:03:54

or node for the graph...

Craig Brozefsky20:03:57

well, to put this in perspective. I wager my test was way more data than they are expecting to deal with

Craig Brozefsky20:03:12

and they are primarily using the in-mem store

Craig Brozefsky20:03:12

My intuition is that it's really the update speed on the durable storage that is the bottleneck

Craig Brozefsky20:03:22

err, I mean, "adding triples"

Craig Brozefsky20:03:58

but I wonder how much durable storage use cases and in-mem use cases overlap

Craig Brozefsky20:03:21

I mean, I think Mario's work is doing snapshots...

Craig Brozefsky20:03:41

durable storage is more about having a working set that is significantly larger than expected available RAM

Craig Brozefsky20:03:11

it's not so much about persistence -- as the current snapshot dump/load mechanism for an in-mem db is sufficient for those use cases

Craig Brozefsky20:03:01

if I can shrink my working set down to something that can fit in like 32g of ram...

Craig Brozefsky20:03:08

current asami is gold

Craig Brozefsky20:03:00

it's when I need a working data set where the indexes might fit in 32g of ram, but my actual data is MUCH larger... that's when durable stores win... But maybe I'm not grokking the resource constraints of durable asami...

quoll20:03:22

Well, for now they’re only just starting to hit the durable store, so I think they’re OK. My next step is indexing entities, so they don’t get rebuilt when they’re retrieved. That’s a little tricky, because updates to triples can be updating entities, and I need to find those. But I think I can get most of the low-hanging fruit reasonably well. After that, I can start looking at the triples indexes again.

quoll20:03:47

Fortunately, a lot of it can be built on the same infrastructure that the current indexes use

quoll20:03:02

Also, I’d like to do another tree type, and not just AVL

quoll20:03:24

AVL is perfect for the triples, but it’s not ideal for the data pool (i.e. strings, URIs, etc)

quoll20:03:57

e.g. if you load the data twice, the second time is several times faster. That’s because the data pool already has all the strings and keywords, and they’re cached without having to traverse the tree
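
Presumably that caching looks something like the following sketch, with a map in front of the on-disk tree (the names, and the tree operation, are hypothetical, not Asami's internals):

(def value->id (atom {}))   ; in-memory cache over the data pool

(defn pool-id!
  "Return the data-pool id for value, touching the tree only on a miss."
  [pool value]
  (or (get @value->id value)
      (let [id (tree-find-or-insert! pool value)]   ; hypothetical tree op
        (swap! value->id assoc value id)
        id)))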

quoll20:03:46

Thinking about Hitchhiker trees or skiplists here

quoll20:03:33

Meanwhile, people would like pull syntax, and I need to figure out storage in IndexedDB… it feels like a lot for one person

quoll20:03:55

Oh… and Naga needs some love too 🙂

Craig Brozefsky20:03:33

Seems like pretty classic problem of having a single abstraction interface for vastly different resources

Craig Brozefsky20:03:10

mem vs. write to file...

quoll20:03:25

That’s true

Craig Brozefsky20:03:46

interesting to see the file on disk growing iteratively within a single transaction boundary too

quoll20:03:19

But Naga was always supposed to talk to external graph DBs. I never thought of it being in-memory until you asked me to make it

Craig Brozefsky20:03:22

not what I would have intuited for a mem-mapped file block

Craig Brozefsky20:03:42

it's a marvelous little problem domain

quoll20:03:59

It depends on the index. If it’s the data pool (data.bin and idx.bin) then the data.bin file is just appended to with standard .write operations. When you read from it, it checks if the offset is within a currently mapped region of the file. If so, then you get it. If not, then it will extend the mapping, then you get it
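
A sketch of that read path in Clojure/NIO terms, assuming a state map holding the channel and the current mapping (illustrative, not the actual asami durable-storage code):

(import '(java.nio.channels FileChannel FileChannel$MapMode))

(defn ensure-mapped
  "Return state whose :mapped region covers [offset, offset+len)."
  [{:keys [^FileChannel channel ^java.nio.MappedByteBuffer mapped] :as state}
   offset len]
  (if (and mapped (<= (+ offset len) (.capacity mapped)))
    state   ; already covered: read straight out of the existing mapping
    ;; extend: remap up to the file's current size (data.bin only grows by appends)
    (assoc state :mapped
           (.map channel FileChannel$MapMode/READ_ONLY 0 (.size channel)))))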

Craig Brozefsky20:03:00

I would guess that in-mem DB with fast snapshot dump/load is still what will support CTR/SX the best

Craig Brozefsky20:03:40

outsider guess of course

quoll20:03:44

I’m thinking so for now

Craig Brozefsky20:03:22

durable speed is already sufficient for those data sets, and even for persisting the entirety of public CTIA, I would guess

Craig Brozefsky20:03:35

so it's pretty solidly hitting all your paying use cases 8^)

quoll20:03:41

Right now, the only pain point is the entity rebuilding. So next on the list is indexing those directly

Craig Brozefsky20:03:53

yah, well at some point, you have to recognize entity rebuilding is providing an entity abstraction on top of a more efficient query engine -- aka, maybe users should be making the queries they need directly, instead of snarfling whole entities

quoll20:03:25

in memory is easy. The durable version needs me to serialize. I’ve ummed and ahh-ed over using transit, but because I already have most of the serialization I need, I’m going to extend my code a little more. The space overhead is similar, and mine is better in some cases

Craig Brozefsky20:03:30

Maybe consider something like the ES approach

Craig Brozefsky20:03:48

that entity cache... you already are considering it, nvm

quoll20:03:58

heh. Yes. That

quoll20:03:39

It was built inside the engine, as a layer over the DB. I’m talking about shifting it into the DB. It also makes other features possible then

Craig Brozefsky20:03:17

yah, it's where you pay the cost of triple store

Craig Brozefsky20:03:20

vs ES document model

Craig Brozefsky20:03:33

in that I can add/modify/delete from an entity

Craig Brozefsky20:03:38

via manipulating triples

Craig Brozefsky20:03:59

can't do that in ES -- only document level operations, so "entity level"

Craig Brozefsky20:03:10

but gotta have that for naga...

quoll20:03:10

yes. But if I can see the entity those triples are connected to, then I can update the entity as well

Craig Brozefsky20:03:14

I think it's worth the cost

Craig Brozefsky20:03:25

or just invalidate...

Craig Brozefsky20:03:34

don't pay the cost until read

Craig Brozefsky20:03:23

hmmm, subentity ids containing some pointer to the parent entity?

quoll20:03:34

No, but can

quoll20:03:35

Also, it’s only in-memory graphs that use keywords. On-disk graphs use Internal Nodes (serialized as: #a/n "1234")
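
For illustration, a node record plus a #a/n data reader might look like this sketch (the names are guesses, not Asami's actual implementation):

(defrecord InternalNode [^long id])

(defn node-reader
  "Reader for #a/n literals, e.g. #a/n \"1234\"."
  [id-str]
  (->InternalNode (Long/parseLong id-str)))

;; registered in data_readers.clj as {a/n some.ns/node-reader}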

Craig Brozefsky20:03:36

so :tg-12314-121 ...

quoll20:03:22

I’m better with the upfront cost of writing. That’s because writing always happens in another thread anyway, and it’s usually queries that we want to make fast

Craig Brozefsky20:03:50

so yes, in your current use case, or any use case where your data set is... constrained to fit in mem

Craig Brozefsky20:03:34

So, you mention merged graphs.. tell me more about multigraph

quoll20:03:18

Multigraph isn’t merged. That’s where you have multiple edges between nodes. It gives you a weight

Craig Brozefsky20:03:29

ok, I was thinking of merged DBs...

Craig Brozefsky20:03:38

give me two graph DBs, make them behave as one...

quoll20:03:44

OK… now THAT is coming soon

quoll20:03:56

I need it so I can do as

Craig Brozefsky20:03:07

that opens up a LOT of use cases to manage write constraints ...

Craig Brozefsky20:03:31

especially if it makes "n" graphs behave as one on query/update ...

quoll20:03:02

Have an initial graph (either in memory or on disk), and then you do speculative transactions against it. This becomes a pair of graphs: the fixed one, and an in-memory one. Queries against this will return the concatenation of the results from both
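
In other words, something like this tiny sketch, where run-query is a hypothetical helper that executes a query against a single graph:

(defn q-pair [query fixed-graph mem-graph]
  (concat (run-query query fixed-graph)   ; the durable (or initial) index
          (run-query query mem-graph)))   ; the in-memory speculative overlay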

Craig Brozefsky20:03:24

a graphdb variant of WALs

quoll20:03:31

Datomic has this, and people have asked for it.

Craig Brozefsky20:03:36

which is where you end up going to get write speeds approaching RDBMSs

Craig Brozefsky20:03:05

yah, and the power of having a single query doing joins across those graphs..

Craig Brozefsky20:03:47

means easier to build inference/queries/logic/processes across a larger set of knowledge (global threat context vs. my org scope)

quoll20:03:48

Yes, you can do everything in memory, and then when you’re done you send a replay off to the index. Queries get done against the pair, until the index has finished its transaction, and then you swap over to the index again.

quoll20:03:14

This is actually how Naga operates against Datomic

quoll21:03:27

Datomic’s only notion of “transaction” is grouping write operations. But Naga needs a transaction to be writes and reads interspersed, with the reads returning data that includes the results of the writes.

quoll21:03:34

I managed it by using an as-of database for executing against, and accumulating write operations until the end of the rules. At that point, I replay all of the writes into the main DB.

quoll21:03:37

Sorry… that was a mistake. I meant with

quoll21:03:49

as-of is the operation of going back in time
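
A sketch of that accumulate-and-replay pattern against Datomic, where each rule step is assumed to be a function of a database value that returns tx-data:

(require '[datomic.api :as d])

(defn run-rules [conn rule-steps]
  (let [[_db writes]
        (reduce (fn [[db acc] step]
                  (let [tx-data (step db)]                ; reads see earlier writes
                    [(:db-after (d/with db tx-data))      ; apply speculatively
                     (into acc tx-data)]))
                [(d/db conn) []]
                rule-steps)]
    ;; replay all of the accumulated writes into the real DB at the end
    @(d/transact conn writes)))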

Craig Brozefsky21:03:50

Stepping back for a moment

Craig Brozefsky21:03:34

I feel like the need to scale and get "coverage" in data sets

Craig Brozefsky21:03:42

has been over-prioritized

Craig Brozefsky21:03:05

at the expense of making the tools that experts and analysts can use to make richer explorations and inferences about data

Craig Brozefsky21:03:37

aka, security is too focused on completeness, breadth, or scaling (mongodb is cloud scale... splunk is cloud scale...)

Craig Brozefsky21:03:57

I feel like there is a need for a way to get a subset of your data, a working set you build from querying all those sources (the idea of your SIEM being that one source of data is a myth...)

Craig Brozefsky21:03:20

and layer increasingly sophisticated abstractions on top of that

Craig Brozefsky21:03:20

that you can't do with SQL, or Splunk's query language, or ES, or these pipe-based "event" query systems

quoll21:03:10

One thing I’d like to do is provide some graph algorithms to all of this. That’s one reason I integrated with Loom. It’s also why I support transitive attributes, and subgraph identification. I feel like we can use some graph algorithms to get some data that we’re not looking for yet
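
For example, transitive attributes let a single pattern walk a chain of edges: suffixing + (one or more steps) or * (zero or more) to the attribute makes it transitive. The :follows attribute and the db here are made up for illustration:

(require '[asami.core :as d])

(d/q '[:find ?person ?reachable
       :where [?person :follows+ ?reachable]]
     (d/db conn))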

quoll21:03:01

So… I’ve made a start on this. But you’ve already heard lots of other “priorities” that I’m working on at the same time 🤣

quoll21:03:29

However, the other day, Jesse made the suggestion of looking for someone to help me. That would help a lot

Craig Brozefsky21:03:07

So that whole problem of breadth vs richness of data representation

Craig Brozefsky21:03:17

is what I'm thinking about -- it's what drove the SX/CTR architecture too

quoll21:03:39

Also, I got my first significant external PR the other day. I hope I get to see more 🙂

quoll21:03:05

This requires that the project keep moving, meeting people’s needs, and appearing responsive

Craig Brozefsky21:03:44

we're never going to get all the data in one place, so let's make a protocol for asking all the different data sources to give us relevant subsets of their data to put into a richer representation/tool

Craig Brozefsky21:03:56

you should consider how to handle copyright assignment BTW

quoll21:03:52

Oh yeah… that’s a very good point. We ran into that with Mulgara when the previous system was purchased

Craig Brozefsky21:03:50

n-graph combination is gold

Craig Brozefsky21:03:36

even if it's something that requires all mods to go to a single new graph...

Craig Brozefsky21:03:35

I imagine there must be some research on this already

quoll21:03:25

I imagine you’re right 🙂

Craig Brozefsky21:03:53

just thinking that my intuitions could certainly use some correction via a proper survey, vocabulary and theory of distributed graph stores and their interaction with rule engines and the assumptions naga makes around open world/negation/aggregation etc

Craig Brozefsky21:03:24

not even distributed.. layered? composed?

Craig Brozefsky21:03:13

scoping entity ids, namespacing relations, and handling negation/modification in appropriate store...

quoll21:03:46

Question for you… Do you see value in allowing entities with string keys? (i.e. like the ones you tried to load yesterday)

quoll21:03:00

I’ve been looking at it, and I recall now. It’s because Chris had graphs where his displayable edges were strings. I allowed these in as attributes, but filtered them out of entities, since entities were supposed to be constructed with keywords. It let him keep his string edges, and gave us an easy way to tell the difference between the UI things that Chris was doing, and properties for the entities that he was connecting. … but … that’s not happening anymore (I believe). So I can let entities use non-keywords as keys now if I want. That would let your data from yesterday show up as entities

quoll21:03:25

It already loads just fine. It’s just using the entity function to retrieve the nested object

Craig Brozefsky21:03:35

well, strings are better IMO

Craig Brozefsky21:03:45

keywords means we are restricted to EDN really

Craig Brozefsky21:03:04

strings means, it's more generic, makes no assumption about EDN/clojure etc...

Craig Brozefsky21:03:25

so yah, I think it SHOULD support strings as keys

quoll21:03:37

OK. It does

quoll21:03:04

This is my repl right now:

user=> (d/entity d n)
{"layers"
 {"ip" {"ip.checksum.status" "2", "ip.dst_host" "152.195.33.40", "ip.host" "152.195.33.40", "ip.dsfield" "0x00000000", "ip.version" "4", "ip.len" "40", "ip.src" "192.168.1.122", "ip.addr" "152.195.33.40", "ip.frag_offset" "0", "ip.dsfield_tree" {"ip.dsfield.dscp" "0", "ip.dsfield.ecn" "0"}, "ip.ttl" "64", "ip.checksum" "0x0000bec2", "ip.id" "0x00000000", "ip.proto" "6", "ip.flags_tree" {"ip.flags.rb" "0", "ip.flags.df" "1", "ip.flags.mf" "0"}, "ip.hdr_len" "20", "ip.dst" "152.195.33.40", "ip.src_host" "192.168.1.122", "ip.flags" "0x00000040"},
  "eth" {"eth.dst" "22:4e:7f:74:55:8d", "eth.src" "46:eb:d7:d5:2b:c8", "eth.dst_tree" {"eth.dst.oui" "2248319", "eth.addr" "22:4e:7f:74:55:8d", "eth.dst_resolved" "22:4e:7f:74:55:8d", "eth.dst.ig" "0", "eth.ig" "0", "eth.lg" "1", "eth.addr_resolved" "22:4e:7f:74:55:8d", "eth.dst.lg" "1", "eth.addr.oui" "2248319"}, "eth.src_tree" {"eth.addr" "46:eb:d7:d5:2b:c8", "eth.ig" "0", "eth.lg" "1", "eth.src.oui" "4647895", "eth.addr_resolved" "46:eb:d7:d5:2b:c8", "eth.src.lg" "1", "eth.src.ig" "0", "eth.addr.oui" "4647895", "eth.src_resolved" "46:eb:d7:d5:2b:c8"}, "eth.type" "0x00000800"},
  "tcp" {"tcp.srcport" "57836", "tcp.seq" "314", "tcp.window_size" "65535", "tcp.dstport" "443", "tcp.urgent_pointer" "0", "tcp.nxtseq" "314", "tcp.ack_raw" "2807365467", "tcp.stream" "42", "tcp.hdr_len" "20", "tcp.seq_raw" "3486482781", "tcp.checksum" "0x00005841", "tcp.port" "443", "tcp.ack" "24941", "Timestamps" {"tcp.time_relative" "0.112280000", "tcp.time_delta" "0.000135000"}, "tcp.window_size_scalefactor" "-1", "tcp.checksum.status" "2", "tcp.flags" "0x00000010", "tcp.window_size_value" "65535", "tcp.len" "0", "tcp.flags_tree" {"tcp.flags.ecn" "0", "tcp.flags.res" "0", "tcp.flags.cwr" "0", "tcp.flags.syn" "0", "tcp.flags.urg" "0", "tcp.flags.fin" "0", "tcp.flags.push" "0", "tcp.flags.str" "·······A····", "tcp.flags.reset" "0", "tcp.flags.ns" "0", "tcp.flags.ack" "1"}, "tcp.analysis" {"tcp.analysis.acks_frame" "2181", "tcp.analysis.ack_rtt" "0.023370000"}},
  "frame" {"frame.protocols" "eth:ethertype:ip:tcp", "frame.cap_len" "54", "frame.marked" "0", "frame.offset_shift" "0.000000000", "frame.time_delta_displayed" "0.000068000", "frame.time_relative" "9.977223000", "frame.time_delta" "0.000068000", "frame.time_epoch" "1612222675.179957000", "frame.time" "Feb  1, 2021 18:37:55.179957000 EST", "frame.encap_type" "1", "frame.len" "54", "frame.number" "2273", "frame.ignored" "0"}}}

quoll21:03:12

no. On my localhost

quoll21:03:19

I can make it alpha7 if you want 🙂

quoll21:03:37

Check the thread above

Craig Brozefsky21:03:45

I would say we should not assume keywords anywhere eh?

Craig Brozefsky21:03:59

I need to think more 8^)

quoll21:03:19

I’ve been removing that assumption in general. You’ll note that keywords are no longer required in the “entity” position of a triple

quoll21:03:41

and strings were ALWAYS allowed as attributes

quoll21:03:49

they were just filtered out of entities

quoll22:03:08

It should work on an existing store, so just connect to it

quoll22:03:04

I need to update my project files to do a full release for me. I’m doing a lot manually right now