#clojure-uk
2016-04-26
yogidevbear09:04:23

Anyone here using Cassandra?

otfrom09:04:52

alia and hayt are the libs we use

yogidevbear09:04:14

Seems to be a very in-demand skill to have

yogidevbear09:04:10

And Apache Spark?

mccraigmccraig09:04:27

i use alia+hayt plus our own lib on top of them for higher-level stuff https://github.com/employeerepublic/er-cassandra

yogidevbear09:04:32

Is it quite different from traditional rdbms?

mccraigmccraig09:04:49

yes - data modelling is very different

yogidevbear09:04:44

Is the data modeling similar across other NoSQL dbs?

yogidevbear09:04:13

In relation to other NoSQL dbs

mccraigmccraig09:04:48

i've only really used elasticsearch and hadoop/cascalog to any great extent - and the modelling is quite different to those

mccraigmccraig09:04:12

you mostly model by the queries you want to do, rather than attempting to discover a suitable natural structure in the data

mccraigmccraig09:04:45

e.g. you might have a table of users, with an id primary key... if you want to be able to retrieve users by their email too then you will need to denormalize to another table, users_by_email or something, with an email primary key
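
A minimal sketch of the two-table idea above, assuming plain Python dicts as stand-ins for the Cassandra tables. The function names are hypothetical and are not taken from alia, hayt or er-cassandra; the point is only that each query pattern gets its own table, and every write goes to both.

```python
# In-memory stand-ins for two Cassandra tables (illustrative only).
users = {}            # primary key: id
users_by_email = {}   # primary key: email


def create_user(user_id, email, name):
    """Write the same record to both tables, one per query pattern."""
    record = {"id": user_id, "email": email, "name": name}
    users[user_id] = record
    users_by_email[email] = dict(record)  # denormalized copy


def get_user_by_id(user_id):
    """Lookup served by the users table."""
    return users.get(user_id)


def get_user_by_email(email):
    """Lookup served by the users_by_email table."""
    return users_by_email.get(email)


create_user(1, "ada@example.com", "Ada")
```

The cost of this approach is that the application has to keep both copies in step on every write, update and delete.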

yogidevbear09:04:45

That is quite different

yogidevbear09:04:06

Thanks for the example

mccraigmccraig09:04:52

also you are very limited on sorting and filtering... primary keys in cassandra are divided into two parts, called partition and clustering keys. the partition key columns determine which partition(s) a record will live on, and can't be used for sorting or filtering (beyond equality or an IN query), while the clustering key columns can be used for sorting and filtering (and map to the wide-row concept which is kinda sorta hidden beneath the CQL table concept these days)
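
A rough sketch of the partition/clustering split described above. It assumes a fixed number of partitions and an MD5-modulo hash purely for illustration (Cassandra itself places data using tokens on a ring), and all names here are hypothetical. The partition key only decides where rows live; the clustering key keeps rows ordered inside a partition, which is what makes range filters cheap.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # assumption: fixed partition count, for illustration only


def partition_for(partition_key):
    """Hash the partition key to pick a partition (placement only, no ordering)."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


# partition number -> partition key -> rows kept sorted by clustering key
partitions = {n: {} for n in range(NUM_PARTITIONS)}


def insert(partition_key, clustering_key, value):
    """Store a row in its partition, keeping clustering order."""
    rows = partitions[partition_for(partition_key)].setdefault(partition_key, [])
    bisect.insort(rows, (clustering_key, value))


def select(partition_key, start=None, end=None):
    """Read one partition's rows in clustering order, optionally range-filtered."""
    rows = partitions[partition_for(partition_key)].get(partition_key, [])
    return [
        (ck, v)
        for ck, v in rows
        if (start is None or ck >= start) and (end is None or ck <= end)
    ]


insert("user-1", 3, "c")
insert("user-1", 1, "a")
insert("user-1", 2, "b")
```

Here `select("user-1")` returns the rows in clustering order and `select("user-1", start=2)` returns only the rows with clustering key 2 or higher. A query that does not name the partition key has no single partition to go to, which is why such queries are restricted.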

benedek09:04:45

i enjoyed reading http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence-ebook/dp/B0090J3SYW/ref=mt_kindle?_encoding=UTF8&me= quite a few years back as an intro to nosql. it could be quite dated now tho (not aware that there is an updated edition)

mccraigmccraig09:04:10

@yogidevbear: if your requirements don't include one of ["must be nukeproof" "must scale a long long way" "i wanna understand this thing"] then you may well have an easier time with postgresql. if your requirements do include one of those things, then go for it - i've found it relatively straightforward so far, though it took me a little while to get a good feel for different modelling approaches

yogidevbear10:04:16

Cool, thanks again. I'm definitely going to be investing time in getting up and running properly with postgresql as I'm very comfortable working with rdbms, but it's always good to know about alternative options like NoSQL and where/when/how to use them

martintrojer11:04:50

having said that, postgres scales pretty ridiculously nowadays.

martintrojer11:04:00

and you have JSON columns, BRIN indices etc

martintrojer11:04:43

also, I avoid Cass* like the plague

yogidevbear11:04:47

Totally unrelated, but there are blue skies, snow and hail going on around my house right now

yogidevbear11:04:31

@martintrojer: What are your reservations around Cassandra?

martintrojer11:04:13

Lots of ops issues, easy to lose data, frequent downtime, doesn’t really work on a dynamic infrastructure (without lots of blood, sweat and tears)

yogidevbear11:04:49

I've heard loss of data mentioned about a few different NoSQL db options

yogidevbear11:04:09

That's what makes me a little hesitant to use them

martintrojer11:04:27

if you think about using cass, set up a large (i.e. expensive) cluster with lots and lots of redundancy

martintrojer11:04:31

that’s the way to do it.

martintrojer11:04:55

If you’re on AWS, just use Dynamo. Scales with your needs and 0 ops issues (and stop worrying about downtime and/or data loss)

thomas12:04:39

@yogidevbear: we have had just a little bit of snow… weird

mccraigmccraig12:04:35

@martintrojer: were you doing anything in particular to c* to cause it to bork so ? what size instances were you running it on ?

martintrojer12:04:02

I had some Cass-dudes look at it, they couldn’t find anything wrong.

martintrojer12:04:21

My current looking-back-conclusion is that the cluster was way way too small.

otfrom12:04:25

foo.large seems like a smallish instance

otfrom12:04:10

I found it really needed a box w/16GB even if it didn't use it all

mccraigmccraig12:04:32

what was happening to it ? did nodes fall over, or start losing data or performing badly ?

martintrojer13:04:35

@mccraigmccraig: When EC2 decided to kill some of the nodes, and new ones joined, the entire cluster went down

martintrojer13:04:19

also, running Cass* on dynamic IPs is a mess, you need a discovery service on the side, and when provisioning update the config file before starting Cass*

mccraigmccraig13:04:00

ah, well i haven't encountered either of those two situations yet, though at some point i will doubtless be encountering the dead node problem...

mccraigmccraig13:04:37

why did you run on dynamic ips tho ? it makes lots of things painful, surely ?

martintrojer14:04:35

@mccraigmccraig: I want to automate everything. No human hands should ever touch the VMs.

martintrojer14:04:51

I want to scale the cluster by just changing a number in the auto scaling group

martintrojer14:04:56

This works perfectly with for instance Elasticsearch (without any service discovery thing)

mccraigmccraig14:04:04

ah, i see - i agree about no-human-hands - though i have taken a differing approach - my config mgmt tool distributes the ips of created instances to config files, so it effectively pre-empts discovery at converge time and ips are static (until an instance dies and needs to be replaced)
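
A hedged sketch of what "distributing the ips to config files" could look like at converge time. The config management tool used in the chat is not named, so this is plain Python with a hypothetical `render_seed_provider` helper; it renders a `seed_provider` fragment in the shape used by `cassandra.yaml` from a list of instance IPs that is assumed to be already known.

```python
def render_seed_provider(node_ips, seed_count=2):
    """Render a seed_provider fragment from the known instance IPs.

    IPs are sorted as plain strings so every node renders the same seed
    list; seed_count is an illustrative default, not a recommendation.
    """
    seeds = ",".join(sorted(node_ips)[:seed_count])
    return (
        "seed_provider:\n"
        "  - class_name: org.apache.cassandra.locator.SimpleSeedProvider\n"
        "    parameters:\n"
        f'      - seeds: "{seeds}"\n'
    )


# Hypothetical private IPs handed over by the provisioning step.
fragment = render_seed_provider(["10.0.0.12", "10.0.0.10", "10.0.0.11"])
```

Because the fragment is written before the node starts, no separate discovery service is involved, but a replaced instance means re-rendering the file on the remaining nodes.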

otfrom16:04:34

👋 tcoupland

otfrom17:04:02

that parrot is made of party :upside_down_parrot: