#ldnclj
2015-11-27
agile_geek07:11:09

Anyone used Cascalog on a very recent version of Hadoop? I’m using Hadoop 2.7.1 and my Map-Reduce job runs fine on a local Hadoop instance for that version but not on the cluster. I get a Class Not Found Exception for cascading.tap.hadoop.io.MultiInputSplit. See https://www.refheap.com/112124

agile_geek07:11:07

Oh yeah, Good Morning everyone

mccraigmccraig10:11:23

@agile_geek: does your uberjar contain the cascading classes ?

agile_geek10:11:31

@mccraigmccraig: I’ve not specifically included them but not sure if they are a transitive dependency of Cascalog and I’m not including hadoop-client (referenced as :provided to allow compilation). Should I be including them?

mccraigmccraig10:11:56

i'm guessing - i don't have specific experience, but either they will need to be in the uberjar, or you will need to install them on the cluster nodes

mccraigmccraig10:11:34

if they aren't being transitively included, then you could try adding them to your lein project

mccraigmccraig10:11:27

that's assuming that they aren't already installed on your cluster, which it looks like they aren't

mccraigmccraig10:11:49

or perhaps you have some terrible jar-hell problem

mccraigmccraig10:11:19

i think i like the node approach to dependencies better than the jvm approach

Pablo Fernandez10:11:13

What are cascading classes?

mccraigmccraig10:11:55

@pupeno: cascading is the high-level hadoop interface on which cascalog builds

mccraigmccraig10:11:36

it lets you model your map-reduce ops as more familiar joins, aggregations etc

agile_geek11:11:21

@mccraigmccraig: i thought similar myself but reading around the Cascalog docs it is clear about not including any hadoop jars and the examples don’t include anything other than cascalog. I have a feeling it’s something to do with the Hortonworks distribution not having Cascading on it.

mccraigmccraig11:11:43

that would make sense... i have no idea whether it's even possible to run with cascading in the uberjar - i presume there are some gnarly ClassLoader hierarchies inside hadoop, and components like cascading might have to be installed at a higher-level than the app

agile_geek11:11:11

I assumed that too

agile_geek11:11:35

I need to read around how cascading gets installed

agile_geek11:11:56

I think I’ll start by looking at the transitive dependencies for Cascalog and unpacking my uberjar

mccraigmccraig11:11:17

agile_geek: lein deps :tree is your friend 🙂

agile_geek11:11:01

Along with piping its output to a file so I can search.
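A minimal sketch of the approach just described: dump the dependency tree to a file, then search it. The file name `deps-tree.txt` and the helper `search_deps` are illustrative assumptions, not something from the conversation.

```shell
# search_deps is a hypothetical helper: case-insensitive search of a saved
# dependency tree, printing matching lines with their line numbers.
# The default file name deps-tree.txt is an assumption.
search_deps() {
  grep -n -i -- "$1" "${2:-deps-tree.txt}"
}

# Only dump the tree when Leiningen is installed and we are in a project directory.
if command -v lein >/dev/null 2>&1 && [ -f project.clj ]; then
  lein deps :tree > deps-tree.txt
  search_deps cascading || echo "no cascading artifacts in the tree"
fi
```

Saving the tree once and searching the file avoids re-resolving dependencies for every lookup.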

agile_geek13:11:06

@mccraigmccraig: hmm, that class is in the uberjar…as it’s a transitive dep of Cascalog. It’s an older version (2.5.3 instead of 3.0.2) of it but it’s there. I wonder if version is causing an issue.
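One way to make this kind of check without unpacking the whole uberjar is to search the archive listing for the class file. This is only a sketch: the helper `class_in_listing` and the `UBERJAR` variable are hypothetical names, and it assumes `unzip` is available.

```shell
# class_in_listing is a hypothetical helper: it reads an archive listing on
# stdin and looks for the .class entry of a fully qualified class name.
class_in_listing() {
  # Turn a.b.C into a/b/C.class, the path the class has inside a jar.
  class_path="$(printf '%s' "$1" | tr '.' '/').class"
  grep -F -- "$class_path"
}

# UBERJAR is an assumed variable pointing at the built jar; skip when unset.
if [ -f "${UBERJAR:-}" ]; then
  unzip -l "$UBERJAR" | class_in_listing cascading.tap.hadoop.io.MultiInputSplit
fi
```

A match only shows that the class is packaged. It says nothing about which classpath the cluster actually uses when the job runs.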

mccraigmccraig13:11:48

@agile_geek: so u have a newer version of cascading on hadoop, and an older version in your uberjar ? exclude cascading from your uberjar and pray 🙏

agile_geek14:11:19

@mccraigmccraig: I’ll try it but I’m a bit confused. The stack trace is that this class is missing which suggests it’s not on the cluster OR in my uberjar. In all the examples of Hadoop-Cascading-Cascalog the Cascading jar needed to be jar’ed up and deployed, which it is - I’ve unpacked my uberjar and it’s there. Admittedly, it’s a slightly older version. I’ve tried excluding the older version of cascading and building on a newer one but I get the same error. I’ll try excluding altogether but can’t see how that can work as the class is definitely missing then!

agile_geek14:11:44

As suspected excluding the cascading lib altogether means the job fails to even compile (eval) when the cascalog functions try to resolve any references to cascading. Previously it failed when it hit the cluster.

mccraigmccraig14:11:54

@agile_geek: can you pre-compile your sources, then exclude cascading from the uberjar ? then, if you are lucky and they are api compatible, your .classes will perhaps link to the cascading classes on the hadoop cluster
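A rough sketch of what that suggestion could look like in a Leiningen `project.clj`. The project name, versions, and artifact coordinates below are assumptions for illustration (check them against your own `lein deps :tree` output), and nothing here guarantees the compiled classes will link against whatever Cascading version the cluster provides.

```clojure
;; Sketch only: project name, versions and coordinates are assumptions.
(defproject myjob "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.7.0"]
                 ;; Keep Cascalog, but stop it pulling Cascading into the uberjar.
                 [cascalog/cascalog-core "2.1.1"
                  :exclusions [cascading/cascading-core
                               cascading/cascading-hadoop]]]
  ;; :provided dependencies are on the classpath for compilation
  ;; but are left out of the uberjar.
  :profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-client "2.7.1"]
                                       [cascading/cascading-core "2.5.3"]
                                       [cascading/cascading-hadoop "2.5.3"]]}}
  ;; Pre-compile all namespaces so no Cascading classes are needed at build-eval time on the cluster.
  :aot :all)
```

With this layout `lein uberjar` should still compile, because the `:provided` profile supplies Cascading at build time, while the resulting jar relies on the cluster to supply those classes at run time.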

mccraigmccraig14:11:25

if that fails, then can i suggest spark on mesos 😉

agile_geek14:11:54

That’s what I did. AOT all on uberjar but it fails as soon as I submit to hadoop

agile_geek14:11:50

@mccraigmccraig: Unfortunately it took 5 years for the client to get Hortonworks distro of Hadoop approved! Not sure Spark and Mesos will take less than 10!

mccraigmccraig14:11:15

so you can't run against an EMR cluster instead of the one you are using ?

mccraigmccraig14:11:14

and presumably the hadoop distro you have is deeply frozen and there's no chance of getting anything on to or off of the node classpaths ?

agile_geek14:11:34

You guessed it!

agile_geek14:11:50

This job runs ok locally on same version!

agile_geek14:11:18

I’m going to give up and write it in Java! Ouch!

mccraigmccraig14:11:20

you mean same version of hadoop or same version of same hortonworks distro ?

agile_geek14:11:32

version of Hadoop

mccraigmccraig14:11:26

i've not used it, but it looks interesting as a nice interface to vanilla hadoop

agile_geek14:11:15

Unfortunately the only reason I got to do this bit in Clojure is I said it would be faster but as I’ve lost 2 days to this problem I think I’ve burnt my ‘goodwill’ and I will be forced back to Java.

mccraigmccraig14:11:26

ha, i guess the argument that "jar-hell is not peculiar to clojure and can burn any attempt to use just about anything on a fixed platform" won't melt much ice, huh ?

agile_geek14:11:31

Nope. The ppl I talk to would hear Charlie Brown’s teacher “whah, whah, whah Clojure whah, whah, whah, doesn’t work whah whah…”

mccraigmccraig14:11:29

i shall not complain. this is the mechanism through which large organisations get their lunch eaten by smaller organisations. without it the world would still be dominated by feudal organisations which have been around for thousands of years. oh, wait...

malcolmsparks16:11:11

@mccraigmccraig: So that's why Windows exists? I'd never thought about it that way!

thattommyhall22:11:57

Hello you lovelies

thattommyhall22:11:14

I've not been hanging here much, but should be less busy in 2016

thattommyhall22:11:35

anyone going (or submitting) to http://www.clojured.de/ ?