This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2015-09-16
Channels
- # admin-announcements (27)
- # beginners (17)
- # boot (216)
- # cider (13)
- # cljs-dev (4)
- # clojure (103)
- # clojure-berlin (2)
- # clojure-dev (18)
- # clojure-italy (14)
- # clojure-japan (1)
- # clojure-nl (4)
- # clojure-norway (1)
- # clojure-russia (8)
- # clojurescript (291)
- # clojurex (12)
- # datomic (31)
- # editors (1)
- # events (16)
- # hoplon (60)
- # jobs (1)
- # ldnclj (85)
- # luminus (15)
- # onyx (2)
- # re-frame (18)
- # reagent (36)
- # remote-jobs (3)
- # yada (3)
Hey guys. We’re running a ring server (on jetty) and we’re running into a “too many open files” exception. Doing `lsof | grep our-uberjar | wc -l` shows *162075* open handles on its own jar.
I can’t tell whether this is a known problem somewhere in our stack or whether we messed something up.
This is on 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
with ring-jetty-adapter 1.3.0 / org.eclipse.jetty/jetty-server "7.6.13.v20130916"
Any ideas what may be going on? My first guess is that response streams for bundled resources aren’t being closed properly.
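A minimal sketch of the pattern being suspected (hypothetical helper, not code from the app in question): when you open a stream on a resource bundled inside the jar yourself, close it deterministically with `with-open`; descriptors left to the garbage collector show up in `lsof` as lingering open handles on the jar file.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical sketch, not from the actual app.
;; with-open calls .close on the stream even if the body throws,
;; so the descriptor on the jar is released immediately rather
;; than whenever the GC gets around to finalizing the stream.
(defn read-resource
  "Read a classpath resource to a string, closing the stream."
  [path]
  (with-open [in (io/input-stream (io/resource path))]
    (slurp in)))
```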
@joost-diepenmaat: have you tried `lsof -p {pid}` to see which fds it's owning?
`watch -n 1 "lsof -p 8662 | grep $our-jar.jar | wc -l"`
shows an increasing number
going quickly from ~ 100 to 811 right now
and then at some point it drops again.
doesn’t look like the ridiculously high numbers from before
now it’s back to 303
and climbing
`lsof | grep $pid | grep $jar` shows much higher numbers than `lsof -p $pid | grep $jar`
maybe it’s opening and closing many FDs quickly and you’d get overlapping sets in lsof without the -p
@ragge < 3000 open FDs seems a lot more reasonable than > 160000
but what are the actual fds it's holding onto? lots of regular files? which ones? sockets?
bunch of sockets and lots of open handles on the running uberjar
~ 580 TCP connections, 930 jars, 90 other
which FDs are changing? let me see..
oh, just saw your first message, ~160k open file handles on its *own* jar sounds suspicious
yeah but I don’t actually get 160K open handles using `lsof -p`
that’s when I do `lsof | grep $pid`
the `lsof -p` option lists a varying number of FDs on the uberjar, between 100 and 2500
and the number seems to be increasing quickly and then drops back to 100 after reaching 3000
I’m not sure what that is about. Looks like it’s only closing the FDs when it’s running out of handles.
most of the static assets in the site - stylesheets, js, css, are served using ring-resource middleware
we do serve all the “large” assets from a separate server, but the basic styling assets and frontend code is served from the jar
Right now, I’m not sure if there’s a large problem; we’ve been running like this for over a year and this is the first time I’ve seen this exception.
Occurred about an hour ago, and it’s not been back since.
I’m assuming the 160K open FDs are an artifact of running lsof without the -p option against a process that opens and closes FDs continuously
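One way to sanity-check the per-process count without lsof's table quirks is to read /proc directly (Linux-only sketch; `$$` here stands in for the server pid). Grepping the full `lsof` table for a pid can also match the number in unrelated columns and in child processes, which inflates the count.

```shell
#!/bin/sh
# Linux-only sketch: /proc/<pid>/fd lists exactly the descriptors
# the process itself owns, so its count should roughly track
# `lsof -p <pid>` (minus lsof's header line).
pid=$$                                # substitute the server's pid here
count=$(ls "/proc/$pid/fd" | wc -l)
echo "process $pid has $count open fds"
```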
i think lsof -p will exclude child processes too, could also be the reason you're seeing different numbers
but sometimes, under some circumstances (lots of requests, for instance), you could leak more than you have fds
i fixed a bug in boot recently that had a similar issue: https://github.com/boot-clj/boot/pull/228
I think I’ll give strace a shot.
thanks @ragge! I’ll take a closer look
is there something more compact than this to get the values for a set of keys from a map? either in core or in another library?
user=> (vals (select-keys {:a :b :c :d :e :f :g :h} [:a :c :g]))
(:h :d :b)
using map and get, not that much more compact, but avoids building an intermediate map just to get vals:
boot.user=> (def m {:a :b :c :d :e :f :g :h})
#'boot.user/m
boot.user=> m
{:a :b, :c :d, :e :f, :g :h}
boot.user=> (map (partial get m) [:a :c :g])
(:b :d :h)
@timvisher: juxt is your friend: `user> ((juxt :a :c :e) m)` => `[:b :d :f]`
anybody with experience with riemann here? I'm trying to receive events through graphite transport and get errors like that - WARN [2015-09-16 14:56:35,174] nioEventLoopGroup-3-1 - riemann.transport.graphite - Graphite server received malformed message (java.lang.ClassCastException: io.netty.channel.socket.DatagramPacket cannot be cast to java.lang.CharSequence): #<DatagramPacket DatagramPacket(/127.0.0.1:65318 => /0:0:0:0:0:0:0:0:2003, SimpleLeakAwareByteBuf(UnpooledUnsafeDirectByteBuf(ridx: 0, widx: 105, cap: 2048)))>
@timvisher @ordnungswidrig nice with juxt, can simplify my example to also just `(map m [:a :c :g])`, which works where keys are not functions too
@joost-diepenmaat: not much to add, but we are seeing the same problem occasionally, I'd be curious to know if you find anything
@ordnungswidrig: man! that’s awesome
@profil: There is not. In general, if an operation would have to wait until the sequence ends before producing any output, there is no transducer for it, since transducers are meant to process elements incrementally.
I am transforming a big list with transduce
using drop
and map second
, then using conj
as the reducing function. After this I want it sorted then calculate the mean and some quantiles, it feels kinda ugly not doing it in a sequence
@crankyadmin: that is clj-refactor telling you you have a broken namespace
Is there a Clojure library that supports XML element namespaces? This is a requirement for interfacing with an external service and data.xml does not support this feature (although there’s a seemingly dead proposal to fix this: http://dev.clojure.org/display/DXML/Namespaced+XML).
@profil: you should benchmark using eduction instead of transduce in this scenario: `(sort (eduction (drop 1) (map second) coll))`
it might be faster
there are several competing effects going on here so it's hard for me to predict which is better
eduction is going to effectively build an iterator and sort will build a chunked seq over that iterator
if you're doing transduce with conj you should really just use into (which automatically uses transients): `(sort (into [] (comp (drop 1) (map second)) coll))`
that will conj into a vector with transients, then use a seq that traverses the vector directly (which is not chunked, but vector seqs are using the underlying data structure, so generally very fast)
if you do benchmark those, I'd be curious to know the input size and timing results :)
kind of an interesting reducing function would be one that inserted results into a sorted collection as they arrived
for example `(into (sorted-set) (comp (drop 1) (map second)) coll)`, if the data has no dupes
then you traverse the input exactly once, extract the right data, and drop it directly into a sorted output collection
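With some made-up sample data (assumed shape: pairs whose second element is the measurement, with a leading element to drop), the two variants above can be compared directly:

```clojure
;; Made-up sample data: a leading element to drop, then [index value] pairs.
(def coll [[:header :ignored] [1 10] [2 3] [3 7]])

;; into a vector (transients under the hood), then sort:
(sort (into [] (comp (drop 1) (map second)) coll))
;; => (3 7 10)

;; directly into a sorted collection (a set dedupes,
;; so only appropriate when duplicates don't matter):
(into (sorted-set) (comp (drop 1) (map second)) coll)
;; => #{3 7 10}
```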
Ahh, cool. The data can contain duplicates, but that’s not very likely. I need to get this assignment done; I will try to bench it over the weekend 😄 The data set is a simple neural network consisting of 1000 neurons and 500 patterns, sampled for 50000 time steps, so it’s not huge, but it still takes a few hours
@alexmiller: How would I bench it? Using `time`, or something more sophisticated?
well if it takes hours, then `time` should be sufficient :)
Anyone got a nice alternative to this without flatten? `(reduce + (flatten [[1 2] [3 4]]))`
`reduce into` has the nice property of almost looking like English words that mean what it's doing
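A few concrete alternatives to flatten for this, as a sketch (`cat` here is the concatenating transducer from core, available since Clojure 1.7):

```clojure
;; sum nested vectors without flatten
(reduce + (reduce into [] [[1 2] [3 4]]))   ;; => 10
(transduce cat + [[1 2] [3 4]])             ;; => 10
(reduce + (mapcat identity [[1 2] [3 4]]))  ;; => 10
```

Unlike `flatten`, these only concatenate one level of nesting, which is all this data needs.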
@mocker: if your needs are light, a Java library may be a better option
@danielcompton: Agreed, it might also be a good opportunity for me to learn how to write a Clojure library that wraps a Java library.