#data-science
2018-06-07
rustam.gilaztdinov 10:06:25

Hello again, @gigasquid 🙂 Here are the steps I took to get the examples running on GPU with Ubuntu 16.04 and a GTX 1080 -- maybe it helps someone. First of all, I built OpenCV 3.4 from source. Then I tried to build a shared mxnet library from scratch, and it failed with this error:

src/io/image_aug_default.cc: In member function 'virtual cv::Mat mxnet::io::DefaultImageAugmenter::Process(const cv::Mat&, std::vector<float>*, mxnet::common::RANDOM_ENGINE*)':
src/io/image_aug_default.cc:318:26: error: 'CV_BGR2HLS' was not declared in this scope
       cvtColor(res, res, CV_BGR2HLS);
                          ^
src/io/image_aug_default.cc:334:26: error: 'CV_HLS2BGR' was not declared in this scope
       cvtColor(res, res, CV_HLS2BGR);
                          ^
src/io/image_io.cc: In function 'void mxnet::io::ImdecodeImpl(int, bool, void*, size_t, mxnet::NDArray*)':
src/io/image_io.cc:175:28: error: 'CV_BGR2RGB' was not declared in this scope
     cv::cvtColor(dst, dst, CV_BGR2RGB);
                            ^
src/io/image_det_aug_default.cc: In member function 'virtual cv::Mat mxnet::io::DefaultImageDetAugmenter::Process(const cv::Mat&, std::vector<float>*, mxnet::common::RANDOM_ENGINE*)':
src/io/image_det_aug_default.cc:550:32: error: 'CV_BGR2HLS' was not declared in this scope
         cv::cvtColor(res, res, CV_BGR2HLS);
                                ^
src/io/image_det_aug_default.cc:561:32: error: 'CV_HLS2BGR' was not declared in this scope
         cv::cvtColor(res, res, CV_HLS2BGR);
Well, this is not related to clojure-mxnet, but maybe someone knows how to fix it? After that, this issue https://github.com/gigasquid/clojure-mxnet/issues/2 helped me install OpenCV, but without my custom configuration (Eigen, BLAS, and Intel IPP). That's not so bad, but still disappointing. Now I can run lein test without errors and can install clojure-mxnet. Time for the examples. Of course, I ran the GAN example, because GANs are cool 🙂 Running nvidia-smi to watch utilization shows me this:
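For reference, those CV_* names are OpenCV's legacy C-API color-conversion constants, which newer OpenCV header layouts no longer pull in by default. The thread never confirms a fix, but a local patch along these lines (or including opencv2/imgproc/imgproc_c.h) usually gets the build past this point -- treat it as a sketch, not mxnet's official change:

// standalone check of the replacement constants; in the failing mxnet files the
// edit would be swapping CV_BGR2HLS / CV_HLS2BGR / CV_BGR2RGB for cv::COLOR_* twins
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

int main() {
  cv::Mat img(8, 8, CV_8UC3, cv::Scalar(10, 20, 30));
  cv::cvtColor(img, img, cv::COLOR_BGR2HLS);   // was CV_BGR2HLS in image_aug_default.cc
  cv::cvtColor(img, img, cv::COLOR_HLS2BGR);   // was CV_HLS2BGR
  cv::cvtColor(img, img, cv::COLOR_BGR2RGB);   // was CV_BGR2RGB in image_io.cc
  return 0;
}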

rustam.gilaztdinov 10:06:40

2018/06/07 09:22:23.277, GeForce GTX 1080, 17 %, 4 %, 8119 MiB, 400 MiB, 7719 MiB
2018/06/07 09:22:25.278, GeForce GTX 1080, 16 %, 4 %, 8119 MiB, 276 MiB, 7843 MiB
2018/06/07 09:22:27.279, GeForce GTX 1080, 0 %, 0 %, 8119 MiB, 232 MiB, 7887 MiB
2018/06/07 09:22:29.281, GeForce GTX 1080, 0 %, 0 %, 8119 MiB, 232 MiB, 7887 MiB
2018/06/07 09:22:31.282, GeForce GTX 1080, 13 %, 3 %, 8119 MiB, 170 MiB, 7949 MiB
2018/06/07 09:22:33.283, GeForce GTX 1080, 15 %, 3 %, 8119 MiB, 58 MiB, 8061 MiB
2018/06/07 09:22:35.284, GeForce GTX 1080, 6 %, 1 %, 8119 MiB, 377 MiB, 7742 MiB
At the end of this log -- the last three columns are total, free, and used memory -- you can see I have only about 377 MiB of free memory and very poor utilization. Then the example failed with cudaMalloc failed: out of memory.
iteration =  5 number =  0
Exception in thread "main" org.apache.mxnet.MXNetError: [09:22:34] src/storage/./pooled_storage_manager.h:108: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /tmp/mxnet8710311801860523221/mxnet-scala(dmlc::StackTrace[abi:cxx11]()+0x1bc) [0x7f2361eb5b5c]
[bt] (1) /tmp/mxnet8710311801860523221/mxnet-scala(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f2361eb6d78]
[bt] (2) /tmp/mxnet8710311801860523221/mxnet-scala(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x154) [0x7f23648dda44]
[bt] (3) /tmp/mxnet8710311801860523221/mxnet-scala(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x5d) [0x7f23648dfa0d]
[bt] (4) /tmp/mxnet8710311801860523221/mxnet-scala(mxnet::NDArray::CheckAndAlloc() const+0x238) [0x7f236202d318]
[bt] (5) /tmp/mxnet8710311801860523221/mxnet-scala(+0x3169a02) [0x7f236447fa02]
[bt] (6) /tmp/mxnet8710311801860523221/mxnet-scala(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x1be) [0x7f236449eebe]
[bt] (7) /tmp/mxnet8710311801860523221/mxnet-scala(+0x35b060b) [0x7f23648c660b]
[bt] (8) /tmp/mxnet8710311801860523221/mxnet-scala(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f23648c1a45]
[bt] (9) /tmp/mxnet8710311801860523221/mxnet-scala(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f23648d829b]
A 1080 is not a weak card -- what can we do to fix this? It's just MNIST, and we're failing with OOM; that's awful. Can we somehow clean up a context, for example?

gigasquid 13:06:03

@rustam.gilaztdinov thanks for the update! There is someone else trying to get Ubuntu running, so if you wouldn't mind chiming in with your experience on the issue, it would be helpful: https://github.com/gigasquid/clojure-mxnet/issues/2

gigasquid 13:06:43

It would also be great if you could open an issue there about the memory problems in the GAN. I suspect it comes from how my code handles the NDArrays. I made a custom data iterator for the random noise, but I don't think I managed the lifecycle of the NDArrays properly (you need to call destroy on them when you are done). So it might be a simple code change to fix. An issue would be nice, so we can track whether the fix is correct.
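A minimal sketch of that lifecycle idea (not the GAN example's actual iterator code -- in the Scala package the method is dispose(), reachable from Clojure via interop, and feed-to-generator below is a hypothetical stand-in for whatever consumes the batch):

(require '[org.apache.clojure-mxnet.ndarray :as ndarray])

(defn feed-to-generator
  "Hypothetical stand-in for the GAN's forward/backward step on one batch."
  [batch]
  batch)

;; one noise batch produced by the custom iterator
(let [noise (ndarray/ones [100 100])]
  (try
    (feed-to-generator noise)
    (finally
      ;; free the native (GPU) buffer right away rather than waiting for the
      ;; JVM finalizer -- skipping this on every batch is how GPU memory leaks
      (.dispose noise))))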

rustam.gilaztdinov 13:06:59

Now I’m running the examples. Well, most of them run without a hassle. But in neural-style I get this error:

Exception in thread "main" org.apache.mxnet.MXNetError: [12:01:19] src/executor/graph_executor.cc:329: Check failed: x == default_ctx Input array is in cpu(0) while binding with ctx=gpu(0). All arguments must be in global context (gpu(0)) unless group2ctx is specified for cross-device graph.
I think that’s it -- https://github.com/gigasquid/clojure-mxnet/blob/master/examples/neural-style/src/neural_style/core.clj#L199
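The error itself is just a device mismatch: the executor is bound on gpu(0) while at least one input NDArray was created in the default cpu(0) context. A rough sketch of the kind of change that usually resolves it, assuming the clojure-mxnet API of that era (the option keys and copy-to shown here are my best guess, not the example's actual code):

(require '[org.apache.clojure-mxnet.ndarray :as ndarray]
         '[org.apache.clojure-mxnet.context :as context])

(def dev (context/gpu 0))

;; create an input directly in the device context you will bind with ...
(def content-input (ndarray/zeros [1 3 224 224] {:ctx dev}))   ;; illustrative shape

;; ... or move an array that started life in the default cpu(0) context
(def cpu-input (ndarray/ones [1 3 224 224]))
(def gpu-input (ndarray/copy-to cpu-input dev))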

gigasquid 13:06:40

@rustam.gilaztdinov Yes, that example has some trouble. I put it in the README, but it only runs on CPU (it needs the contexts of the ndarrays to not default to cpu everywhere). It also generally needs some help. I spent way too many weekends banging my head against that one and then decided to punt and move on to other things.

gigasquid 13:06:00

You can try it against the CPU with lein run. It kinda works, but not all the way. It's transferring style, but the content is supposed to be recreated from noise by taking information from the higher relu levels like relu4-2, and that part doesn't seem to work: https://github.com/gigasquid/clojure-mxnet/blob/master/examples/neural-style/src/neural_style/model_vgg_19.clj#L54

gigasquid 13:06:27

It only works against the first layer content level.

rustam.gilaztdinov 13:06:07

That’s the right approach, I understand you. And yes, you did a really great job -- many, many thanks to you! I hope the community will support this library. It will be super useful to work in one environment, and what's essential -- a Clojure environment.

🙂 24
gigasquid 22:06:37

@rustam.gilaztdinov Someone is working on a memory leak in the Scala NDArray iterator -- that may be where the memory problems in the GAN came from for you. We'll see in the next release, which should happen in the next few days: https://github.com/apache/incubator-mxnet/issues/10436

🔥 4