#data-science
2017-11-10
stathissideris15:11:56

@gigasquid Thanks for the cats and dogs example, it looks very interesting

stathissideris15:11:04

I tried running it but I get this:

stathissideris15:11:13

> (train)
...
ExceptionInfo Batch size is not commensurate with epoch size  clojure.core/ex-info (core.clj:4725)
cats-dogs-cortex-redux.core> *e
#error {
 :cause "Batch size is not commensurate with epoch size"
 :data {:epoch-size 21250, :batch-size 32}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message "Batch size is not commensurate with epoch size"
   :data {:epoch-size 21250, :batch-size 32}
   :at [clojure.core$ex_info invokeStatic "core.clj" 4725]}]
 :trace
 [[clojure.core$ex_info invokeStatic "core.clj" 4725]
  [clojure.core$ex_info invoke "core.clj" 4725]
  [cats_dogs_cortex_redux.core$train_ds invokeStatic "core.clj" 256]

stathissideris15:11:59

a batch size of 50 results in a different error

gigasquid16:11:00

checking …

gigasquid16:11:44

the epoch size was originally 4096 in the resnet-retrain example

stathissideris16:11:54

wait, looks like my data folder is empty!

gigasquid16:11:57

but I bumped it to cover all the examples…

stathissideris16:11:57

ignore me please, I’ll redo the first steps and get back to you, not sure what happened with the files 🙂

gigasquid16:11:26

happy to help. If you run into any trouble or anything missing, please let me know

stathissideris16:11:49

@gigasquid made some progress (ran out of memory) but calling (train) as is will always cause an exception, because (not= 0 (rem 21250 32)) (the defaults!)

stathissideris16:11:23

checked on line 254

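[Editor's note: the failing check here boils down to the default epoch size not being a multiple of the default batch size. A minimal sketch of the arithmetic, in plain Clojure and independent of cortex:]

```clojure
;; The default epoch size (21250) doesn't divide evenly by the
;; default batch size (32), so the "commensurate" check throws:
(rem 21250 32)           ;=> 2
(not= 0 (rem 21250 32))  ;=> true, i.e. the sizes are not commensurate
```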
gigasquid16:11:28

can you try with 21248?

gigasquid16:11:18

if that doesn’t work try with the original 4096 and I can revert it back to that

gigasquid16:11:38

I thought I had regened the uberjar and run it after I changed it, but obviously not

gigasquid16:11:02

I know it definitely worked with 4096

stathissideris16:11:36

it’s your code that checks for this btw

gigasquid16:11:14

yes - I took that bit directly from the resnet-retrain example in the cortex project

stathissideris16:11:27

same exception because it’s not an exact division, but I could just define a batch-size of 21250, right?

stathissideris16:11:30

#error {
 :cause "Batch size is not commensurate with epoch size"
 :data {:epoch-size 21250, :batch-size 21248}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message "Batch size is not commensurate with epoch size"
   :data {:epoch-size 21250, :batch-size 21248}
   :at [clojure.core$ex_info invokeStatic "core.clj" 4725]}]
 :trace

gigasquid16:11:00

sorry I wasn’t more clear - I meant changing the epoch size

gigasquid16:11:03

not the batch size

gigasquid16:11:34

(rem (long 21248) (long 32)) ;=> 0

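[Editor's note: more generally, a workable epoch size can be derived by rounding the dataset size down to the nearest multiple of the batch size. A sketch; `commensurate-epoch-size` is a hypothetical helper, not part of the project:]

```clojure
(defn commensurate-epoch-size
  "Round dataset-size down to the nearest multiple of batch-size."
  [dataset-size batch-size]
  (* batch-size (quot dataset-size batch-size)))

(commensurate-epoch-size 21250 32) ;=> 21248
```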
stathissideris16:11:01

yep, should have realized that’s what you meant

stathissideris16:11:12

not sure if it’s significant that I’m running this on the CPU instead of CUDA, but I got this:

stathissideris16:11:26

IllegalArgumentException No implementation of method: :->view-impl of protocol: #'think.datatype.base/PView found for class: clojure.lang.ExceptionInfo  clojure.core/-cache-protocol-fn (core_deftype.clj:583)
cats-dogs-cortex-redux.core> *e
#error {
 :cause "No implementation of method: :->view-impl of protocol: #'think.datatype.base/PView found for class: clojure.lang.ExceptionInfo"
 :via
 [{:type java.lang.RuntimeException
   :message "Error during queued sequence execution:"
   :at [think.parallel.core$queued_sequence$fn__33350 invoke "core.clj" 229]}
  {:type java.lang.IllegalArgumentException
   :message "No implementation of method: :->view-impl of protocol: #'think.datatype.base/PView found for class: clojure.lang.ExceptionInfo"
   :at [clojure.core$_cache_protocol_fn invokeStatic "core_deftype.clj" 583]}]
 :trace
 [[clojure.core$_cache_protocol_fn invokeStatic "core_deftype.clj" 583]
  [clojure.core$_cache_protocol_fn invoke "core_deftype.clj" 575]
  [think.datatype.base$eval13577$fn__13578$G__13568__13587 invoke "base.cljc" 170]
  [think.datatype.base$__GT_view invokeStatic "base.cljc" 180]
  [think.datatype.base$__GT_view invokePrim "base.cljc" -1]
  [think.datatype.base$__GT_view invokeStatic "base.cljc" 182]
  [think.datatype.base$__GT_view invoke "base.cljc" 174]
  [think.datatype.base$make_view invokeStatic "base.cljc" 187]
  [think.datatype.base$make_view invoke "base.cljc" 185]
  [think.datatype.core$make_view invokeStatic "core.clj" 56]
  [think.datatype.core$make_view invoke "core.clj" 54]
  [cortex.compute.cpu.driver$eval33762$fn__33767 invoke "driver.clj" 259]
  [cortex.compute.driver$eval14282$fn__14318$G__14271__14327 invoke "driver.clj" 74]
  [cortex.compute.driver$allocate_device_buffer invokeStatic "driver.clj" 159]
  [cortex.compute.driver$allocate_device_buffer doInvoke "driver.clj" 156]
  [clojure.lang.RestFn invoke "RestFn.java" 464]
  [cortex.tensor$new_tensor invokeStatic "tensor.clj" 456]
  [cortex.tensor$new_tensor doInvoke "tensor.clj" 448]
  [clojure.lang.RestFn invoke "RestFn.java" 410]
  [cats_dogs_cortex_redux.core$src_ds_item__GT_net_input invokeStatic "core.clj" 200]
  [cats_dogs_cortex_redux.core$src_ds_item__GT_net_input invoke "core.clj" 170]
  [clojure.lang.AFn applyToHelper "AFn.java" 154]
  [clojure.lang.AFn applyTo "AFn.java" 144]
  [clojure.core$apply invokeStatic "core.clj" 657]
  [clojure.core$apply invoke "core.clj" 652]
  [think.parallel.core$wrap_thread_bindings$fn__33318 doInvoke "core.clj" 120]
  [clojure.lang.RestFn applyTo "RestFn.java" 137]
  [clojure.core$apply invokeStatic "core.clj" 657]
  [clojure.core$apply invoke "core.clj" 652]
  [think.parallel.core$queued_sequence$process_fn__33336$fn__33337 invoke "core.clj" 215]
  [think.parallel.core$queued_sequence$process_fn__33336 invoke "core.clj" 209]
  [clojure.lang.AFn call "AFn.java" 18]
  [java.util.concurrent.ForkJoinTask$AdaptedCallable exec "ForkJoinTask.java" 1424]
  [java.util.concurrent.ForkJoinTask doExec "ForkJoinTask.java" 289]
  [java.util.concurrent.ForkJoinPool$WorkQueue runTask "ForkJoinPool.java" 1056]
  [java.util.concurrent.ForkJoinPool runWorker "ForkJoinPool.java" 1689]
  [java.util.concurrent.ForkJoinWorkerThread run "ForkJoinWorkerThread.java" 157]]}

gigasquid16:11:43

are you running the uberjar?

stathissideris16:11:55

sorry for the huge paste

gigasquid16:11:12

try uber-jar … I’m retrying it too as we speak

gigasquid16:11:15

ran out of memory

gigasquid16:11:26

oh well, I’ll put back down to 4096 then

gigasquid16:11:00

thanks for helping to catch that

stathissideris16:11:00

there may be something wrong with my setup, but I get some weirdness with uberjar too:

stathissideris16:11:13

➜  cats-dogs-cortex-redux lein uberjar
Warning: specified :main without including it in :aot.
Implicit AOT of :main will be removed in Leiningen 3.0.0.
If you only need AOT for your uberjar, consider adding :aot :all into your
:uberjar profile instead.
Compiling cats-dogs-cortex-redux.core
Nov 10, 2017 6:30:44 PM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/f_/0_rfxz496k520hd35q8v68940000gn/T/jniloader2330751718951129222netlib-native_system-osx-x86_64.jnilib
Reflection warning, cognitect/transit.clj:142:19 - call to static method writer on com.cognitect.transit.TransitFactory can't be resolved (argument types: unknown, java.io.OutputStream, unknown).
Created /Users/sideris/devel/cats-dogs-cortex-redux/target/cats-dogs-cortex-redux-0.1.0-SNAPSHOT.jar
Created /Users/sideris/devel/cats-dogs-cortex-redux/target/cats-dogs-cortex-redux.jar
➜  cats-dogs-cortex-redux java -jar target/cats-dogs-cortex-redux-0.1.0-SNAPSHOT.jar
Exception in thread "main" java.lang.NoClassDefFoundError: clojure/lang/Var
	at cats_dogs_cortex_redux.core.<clinit>(Unknown Source)
Caused by: java.lang.ClassNotFoundException: clojure.lang.Var
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 1 more

gigasquid16:11:50

looks like there is something in the main clojure file that’s not compiling

gigasquid16:11:20

I just pushed a fix commit want to check out a fresh pull?

stathissideris16:11:44

ok, pulled and running again

stathissideris16:11:25

I don’t run out of memory but I get the IllegalArgumentException No implementation of method: :->view-impl of protocol: #'think.datatype.base/PView exception again

stathissideris16:11:51

did you run it with with the CUDA backend?

stathissideris16:11:02

because in my case I see this:

training using batch size of 32
CUDA backend creation failed, reverting to CPU

gigasquid16:11:04

hmmm mine finished

gigasquid16:11:10

I run with CUDA

gigasquid16:11:46

IllegalArgumentException No implementation of method: :->view-impl of protocol: #'think.datatype.base/PView is one I think happens when there are memory problems …

stathissideris16:11:54

the stacktrace contains this line which makes me think that it may be CPU-specific:

[cortex.compute.cpu.driver$eval33762$fn__33767 invoke "driver.clj" 259]

gigasquid16:11:26

As an experiment you might want to try decreasing the batch size and see if it helps

stathissideris16:11:28

oh ok, I’m brand new to all of this (your blog motivated me to look into it!)

gigasquid16:11:48

sorry I haven’t tried the cpu only - I’ll look into it later on

gigasquid16:11:07

try batch size = 8

gigasquid16:11:17

and see if that helps

gigasquid16:11:00

also you could check out the MNIST example in the cortex project and make sure you can run that ok first

gigasquid16:11:45

If you are serious about doing some big data stuff - you can try getting an AWS P2 compute instance with an NVIDIA GPU

stathissideris16:11:57

trying with 8. Yeah that would be a good sanity check for my setup I guess

gigasquid16:11:57

those are $0.90 an hour

stathissideris16:11:13

that’s the recommended setup in that course you linked

gigasquid16:11:18

Just remember to turn them off!

stathissideris16:11:29

batch size 8 results in the same problem, but I’m setting up a local machine with a decent NVIDIA GPU, so I’ll see if that works better and maybe even try the P2 compute instance at some point

stathissideris16:11:49

in any case, thanks for making all of this a bit more approachable!

gigasquid16:11:11

no problem. the more people getting into this the merrier :party-corgi:

gigasquid19:11:11

I just tried it without cuda and got the same error when trying to allocate the device buffer for the layers. My guess is that the RESNET50 network is just too big to do as CPU only

gigasquid19:11:07

at least on my poor little laptop

stathissideris19:11:04

Thanks a lot for trying! So we know it’s not just me :) I’ll give it a try on my desktop GPU after I install Windows and let you know how it goes