#onyx
2016-10-08
zamaterian09:10:03

did anyone here know that Cognitect support recommends running only one operational database per transactor?

zamaterian09:10:24

I know it's a bit off-topic 🙂

lucasbradstreet09:10:08

Huh. Really. We are actually about to create a database per customer, so that is kinda crazy. That's pretty weird because they push being able to query over multiple DBs at a time

zamaterian09:10:29

the problem we ran into was that a single database among all the fine databases got corrupted - this made the transactor fail on the indexing job and thereby not start the HornetQ server, so all the databases were unavailable.

zamaterian09:10:02

you could probably query across multiple DBs even if they are located on different storages (eg separate transactors). What the license impact of this would be is yet to be determined

lucasbradstreet09:10:58

I guess that’s not so bad for us unless increasing the number of DBs increases the corruption risk. I think if we’re spreading the data out, the corruption risk is spread instead.

zamaterian09:10:29

If you spread the databases out among several transactors and one gets corrupted, only one database is down and the rest are fine. But if all your databases use the same transactor and one database gets corrupted, then all your databases are down.

lucasbradstreet09:10:27

I guess the question to me is: if you have 200 DBs, each with 1/200th of the data, is the risk of a DB being corrupted 1/200th of the risk of one big DB with all the data?

lucasbradstreet09:10:33

If so, I don’t think it matters so much

zamaterian09:10:46

if I understand your statement correctly: between 200 DBs and a single DB on the same transactor, the risk is the same. Whether one small DB gets corrupted or the big one, the result is that you're down!

lucasbradstreet09:10:06

Right, but if the overall probability of 200 databases being corrupted is the same as the overall probability of 1 database, then I think you don’t gain much by going to one database instead

lucasbradstreet09:10:18

Since your DB is down either way

lucasbradstreet09:10:53

But if, say, the overall probability of the 200 smaller DBs being corrupted were twice the chance of the one big DB being corrupted, then maybe the big DB is better

lucasbradstreet09:10:59

My feeling is that every write comes with a chance of corruption, so if the total number of writes is the same whether it's 1 DB or 200 DBs, then I don't see why the recommendation is a good one, since the end result is still the same
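A quick sanity check of that reasoning in code, assuming each write independently corrupts with some small probability p: splitting the same total number of writes across 200 DBs behind one transactor leaves the overall corruption probability unchanged.

```
;; If each write independently corrupts with probability p, the chance
;; that at least one corruption occurs across W total writes is
;; 1 - (1 - p)^W, no matter how the writes are split across databases
;; behind a single transactor.
(defn p-any-corruption [p writes]
  (- 1.0 (Math/pow (- 1.0 p) writes)))

;; One big DB taking 1,000,000 writes:
(p-any-corruption 1e-9 1000000)
;; => ~9.995E-4

;; 200 small DBs taking 5,000 writes each (same 1,000,000 total):
(- 1.0 (Math/pow (- 1.0 (p-any-corruption 1e-9 5000)) 200))
;; => ~9.995E-4, identical, since ((1-p)^5000)^200 = (1-p)^1000000
```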

zamaterian09:10:14

unless you run each of your 200 DBs on their own storage and transactor; then you're only down for that one DB

lucasbradstreet09:10:28

Right, I’m talking strictly about sharing a transactor atm

lucasbradstreet09:10:36

Anyway, that’s good to know.

lucasbradstreet09:10:45

transactors use a license, so using multiple transactors sounds like a no-go for us

zamaterian09:10:07

Then you're right that it doesn't really matter whether it's one or multiple DBs under the same transactor, with regards to corrupted data :)

zamaterian09:10:06

in my opinion it's a design fault in the transactor that a corrupt DB can deny access to the rest of the DBs.

lucasbradstreet10:10:40

I doubt they intended for it to be that way

Drew Verlee12:10:05

What would be the reason for pulling data from a column store like HBase or Cassandra into Kafka and then into a stream processing framework like Onyx, rather than just pulling directly from the column store into the stream processing framework? I have seen the former approach taken in a couple of designs and the reason never seemed clear to me.

lucasbradstreet12:10:35

If you want to perform multiple runs over your data it might make sense

Drew Verlee12:10:21

@lucasbradstreet Hmm. By multiple runs I assume you mean re-reading the log. I could see doing this because you wanted to perform the same computation with a different argument. An example in Onyx might be increasing a window size. However, why does reading the data from Kafka help with multiple runs? Or put another way, what prevents making multiple runs over the data in HBase? Additionally, this is a concern because we're using HBase and I was curious why there aren't any Cassandra or HBase plugins. This could just have to do with no one needing them yet, but I was also considering that it might be because people always pull through Kafka for reasons I don't fully understand.

lucasbradstreet12:10:00

Well you could, but maybe doing it in Kafka is more efficient in some way

lucasbradstreet12:10:14

I think people typically put a lot of data in Kafka from the get-go

lucasbradstreet12:10:33

And it’s different kinds of data, where you might want the full history, but where you are updating in place in your DB
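For concreteness, "pulling through Kafka" in Onyx means an onyx-kafka input task along these lines. This is only a sketch: the task name, topic, deserializer fn, and ZooKeeper address are made up, and the exact :kafka/* option keys vary by plugin version.

```
;; Rough sketch of an onyx-kafka input catalog entry (hypothetical
;; names; exact :kafka/* options depend on the plugin version):
{:onyx/name :read-events
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "events"                      ; hypothetical topic
 :kafka/zookeeper "127.0.0.1:2181"          ; hypothetical address
 :kafka/deserializer-fn :my.app/deserialize ; hypothetical fn
 :onyx/batch-size 100}
```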

Drew Verlee16:10:21

@lucasbradstreet if I need to keep hold of some state between aggregations, either as a result of a previous aggregation or just some metadata that the input stream needs to use in order to make decisions about joins, what mechanism do I use and what storage does Onyx use? Here is a small example. Say you have two agents feeding data into Onyx: {key: a, value: 1} {key: b, value: 1}. Now there's a rule that key b and key a should be grouped together: {key: c, value: [a, b]}. Where would I store the rule that key a and key b should be grouped together? I see RocksDB used for this in Kafka Streams and Samza. Onyx uses RocksDB to detect duplicate keys, but I'm curious if I can tap into it using the event map and if this is the correct way to go about it.

lucasbradstreet16:10:46

If I understand you correctly, you can use onyx/group-by-fn?
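For reference, grouping in a catalog entry looks roughly like this. The task and function names are made up, and the flux-policy/min-peers settings are only one plausible choice:

```
;; Hypothetical grouped-task sketch: segments are routed to peers by
;; the value returned from :my.app/route-key, so all segments for the
;; same key land on the same peer and aggregate together.
{:onyx/name :group-values
 :onyx/fn :my.app/collect            ; hypothetical task fn
 :onyx/type :function
 :onyx/group-by-fn :my.app/route-key ; (fn [segment] ...) -> grouping value
 :onyx/flux-policy :recover          ; what to do when peers join/leave
 :onyx/min-peers 1
 :onyx/batch-size 20}
```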

Drew Verlee16:10:27

can I update the arguments to the group-by-fn in real time?

lucasbradstreet16:10:10

Possibly, though that part is trickier. You'd have to think about how to do it right

lucasbradstreet16:10:32

Easiest / hackiest way would be to refer to / maintain an atom somewhere
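A minimal sketch of that atom approach, with hypothetical names throughout: the merge rules live in an atom that the group-by fn dereferences, so swapping the atom changes routing in real time.

```
;; Hacky sketch: grouping rules live in an atom that can be updated
;; at runtime. All names here are made up.
(def grouping-rules (atom {:a :c, :b :c})) ; keys :a and :b group under :c

(defn route-key
  "Group-by fn: look the segment's key up in the current rules,
   falling back to the key itself."
  [segment]
  (get @grouping-rules (:key segment) (:key segment)))

;; Updating the rules in real time:
(swap! grouping-rules assoc :d :c)
```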

Drew Verlee16:10:39

Does the question indicate I'm thinking about the problem wrong? It seems very common that streaming systems need to make calls out to a database to get more information. As a result people build local caches next to the streaming process to speed this up. However, if the cache grows stale (due to a long partition or the streaming process dying) then it will cause a strain on the DB as it refreshes. As an improvement it's suggested that the in-memory cache is built from a log (Kafka) that contains just the updates it needs. Samza uses RocksDB for the in-memory cache. As Onyx is often coupled with Kafka, and already uses RocksDB, I was curious if it might/could share a similar role. Or maybe I'm missing some other way to solve this problem. Another way would be to have the incoming segments just contain all the information they need, but I can see where that would get burdensome too. I suppose that an atom would work just fine as long as it didn't need to store much information, and in the use cases I have we don't need to store much information at all. If the process crashed we could still call out to Kafka to rebuild the atom. So I suppose that would work.

lucasbradstreet16:10:51

If the process crashes it will replay the aggregation to get the state back, so it could be fine. We definitely need a disk-based cache like RocksDB though.
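For context, that replayed state would live in a window aggregation. A rough sketch using Onyx's built-in conj aggregation, with the window id hypothetical and the task name carried over from the grouped-task sketch above:

```
;; Sketch of a window whose state Onyx rebuilds on restart by
;; replaying inputs into the aggregation:
{:window/id :collected-values        ; hypothetical id
 :window/task :group-values          ; the grouped task from above
 :window/type :global
 :window/aggregation :onyx.windowing.aggregation/conj}
```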