#clojure
2015-10-14
nicholasf02:10:24

I have to parse a 2G xml file. Large but not huge by big data standards

nicholasf02:10:08

I’m parsing it using clojure.data.xml, which I thought would be lazy, but it’s chewing up huge amounts of memory

nicholasf02:10:28

starting the jvm up with 12G means I end up using all of it on the parse process and it doesnt return

nicholasf02:10:39

what’s the best way to parse an xml file of this size?

ghadi02:10:56

First of all, I'm so sorry for your plight 😉

ghadi03:10:08

I can't speak to the Clojure xml libs, but there are certainly streaming apis in Java that are more than performant and memory friendly.

ghadi03:10:13

Regarding data.xml, are you using the lazy marked functions? http://clojure.github.io/data.xml/

nicholasf03:10:00

how do I tell if something is marked as lazy?

nicholasf03:10:26

Parses the source, which can be an
InputStream or Reader, and returns a lazy tree of Element records

nicholasf03:10:38

should be ok. And, btw, thanks for your help here

nicholasf03:10:43

here is the code loading the file

nicholasf03:10:59

(defn load-feed
    "Loads the given filename"
    [filename]
    (-> filename
        
        
        clojure.data.xml/parse))

ghadi03:10:39

thanks, that helps. What are you doing to cause the memory symptom? Are you printing to repl (that won't be lazy)

nicholasf03:10:50

ha, yeh, I am actually

nicholasf03:10:02

Im sitting there in lein repl and running the function

nicholasf03:10:09

sorry, totally me being stupid

ghadi03:10:14

Not at all, it happens.

ghadi03:10:27

Try to def the result and watch mem consumption
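
A minimal sketch of the difference between binding and printing a lazy result, using only clojure.core. The `realized-count` atom and the `tracked-seq` helper are made up for illustration and are not part of data.xml. Binding with `def` computes nothing, while printing the value at the REPL would walk every element.

```clojure
;; Hypothetical counter: how many elements have actually been computed.
(def realized-count (atom 0))

(defn tracked-seq
  "Returns an unchunked lazy seq of the integers 0..n-1, bumping
  realized-count each time an element is computed."
  [n]
  (letfn [(step [i]
            (lazy-seq
              (when (< i n)
                (swap! realized-count inc)
                (cons i (step (inc i))))))]
    (step 0)))

;; def stores the unrealized seq; no element is computed here.
(def result (tracked-seq 1000000))
(def count-after-def @realized-count)

;; Taking a few elements realizes only those elements.
(def first-five (doall (take 5 result)))
(def count-after-take @realized-count)
```

One caveat, stated as an assumption about the use case rather than about data.xml: a var such as `result` keeps the head of the sequence reachable, so whatever has been realized stays in memory for as long as the var points at it.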

nicholasf03:10:29

thanks, now I can just test sending it downstream into firebase

nicholasf03:10:51

I am turning it into a zipper

nicholasf03:10:04

and then parsing it into a representation of lists and sets

nicholasf03:10:11

but we’ll see if that’s a problem

ghadi03:10:08

Let us know. You should be good: (-> "<foo><bar baz=\"42\" /></foo>" x/parse-str :content class) gives back clojure.lang.LazySeq

ghadi03:10:37

I admit I'm not super familiar with the peculiarities of the library. (But there's always Java, yippee!)

nicholasf03:10:16

yeh, I dont want to rewrite this

nicholasf03:10:32

Im just building my clojure service now and will see if it can stream the data

nicholasf03:10:32

hrm, testing loading my large XML file in an exploded war is leading to java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessClassInPackage.com.sun.xml.internal.stream")

nicholasf03:10:55

I have the xml file in my /resources directory. When I build it into a war it ends up in WEB-INF/classes

nicholasf03:10:15

shouldnt I be able to load it using ?

nicholasf04:10:35

hi, Im still trying to identify my bottleneck here. I am loading a 2 gig xml file, building a zipper on it, then attempting to take the first 5 elements from it and println a token

nicholasf04:10:51

just to represent that I am not loading the full 2 gigs into memory anywhere

nicholasf04:10:57

but somehow I am

nowprovision06:10:47

Reading logging and debugging from https://github.com/Day8/re-frame#logging-and-debugging and Talking To A Server section. With regards to replay, if I log anything sent to dispatch such as [:initalize], which is handled by a http call to setup initial db state, then dispatches a [:populate-db], if I replay I'm going to see two [:populate-db] one caused by log replay and one caused by http call of initialize.

nowprovision06:10:42

nicholasf, I dont think clojure.data.xml/parse is SAX style

nicholasf06:10:43

yeh, I just switched to golang and did what I wanted to do in 5 minutes after not thinking about it for 6 months

nicholasf06:10:49

me just sucking at clojure I guess

nowprovision06:10:17

the documentation for parse is confusing; it does mention something about SAX, but I guess XML gets less attention nowadays given the uncoolness

nicholasf06:10:26

I spent days understanding how to work with large xml in clojure

nicholasf06:10:41

I learnt a lot about clojure but reached zero results, really

nicholasf06:10:45

can parse smaller files

nicholasf07:10:21

Im pretty sure if Id just broken out a sax parser and used java interop I would have been fine
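
A rough sketch of the Java-interop route, under stated assumptions: it uses the JDK's StAX pull API (`javax.xml.stream`) rather than SAX proper, because a pull loop is shorter to show, and the `count-elements` function is hypothetical. It walks the events one at a time and never builds a tree.

```clojure
(import '[javax.xml.stream XMLInputFactory XMLStreamConstants XMLStreamReader]
        '[java.io Reader StringReader])

(defn count-elements
  "Streams over XML from a java.io.Reader and counts the start tags whose
  local name equals tag-name. Only the current event is held in memory."
  [^Reader reader tag-name]
  (let [factory (XMLInputFactory/newInstance)
        ^XMLStreamReader xml (.createXMLStreamReader factory reader)]
    (try
      (loop [n 0]
        (if (.hasNext xml)
          (let [event (.next xml)]
            (recur (if (and (= event XMLStreamConstants/START_ELEMENT)
                            (= tag-name (.getLocalName xml)))
                     (inc n)
                     n)))
          n))
      (finally
        (.close xml)))))

;; Small in-memory usage; for a large file, pass a reader over the file instead.
(def bar-count
  (count-elements (StringReader. "<foo><bar/><bar baz=\"42\"/></foo>") "bar"))
```

For a real feed the reader would come from something like `clojure.java.io/reader` on the file, and the loop body would emit records instead of counting.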

escherize08:10:10

The transit github made it to #1 on HN right now! Is there a reason why it's topical today?

danielcompton08:10:17

Is there any way to get a project's version while running code inside project.clj?

danielcompton08:10:47

I know about the System/getProperty trick, but that doesn’t seem to be set at the time I’m running code inside an :alias in project.clj, I think that only applies when you’re evaluating code in the context of your app.

korny09:10:12

@nicholasf: dealing with laziness can be tricky - but I still far prefer to use a lazy language to parse xml than sax! It really depends on the XML though. Wikipedia is an overly-simple example, but it was handy as it’s huge and easily available.

sander09:10:19

@danielcompton: what about something like (-> "project.clj" slurp read-string)?
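
A small sketch of that idea, assuming the file starts with a conventional `(defproject name "version" ...)` form. The `project-version` helper is hypothetical. It reads the form as data without evaluating it, so unquoted expressions inside project.clj stay as unevaluated lists.

```clojure
(defn project-version
  "Reads the first form of project.clj-style text and returns the version,
  which is the third element of (defproject name version ...)."
  [project-clj-text]
  (let [[_defproject _name version] (read-string project-clj-text)]
    version))

;; Usage against a real file (the relative path is an assumption):
;; (project-version (slurp "project.clj"))
```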

nicholasf09:10:42

@korny: I think a lot of the documentation was misleading

danielcompton09:10:45

@sander I was trying to avoid that route

nicholasf09:10:54

unfortunately Im back working in golang again

sander09:10:07

@danielcompton: why? the edn structure of project.clj is well-defined

danielcompton09:10:19

it’s not really edn

korny09:10:26

@nicholasf: True - or just very terse. I got far more value browsing the code than the docs. And it’s a pain as it’s split over several libraries.

thheller09:10:21

@cfleming: no worries, at this point I'm fighting more with nrepl than cursive. 😉

cfleming09:10:35

@thheller: Yeah, for CLJS I think that’s pretty common!

cfleming09:10:52

I’m planning to provide direct CLJS REPL integration with no nREPL in sight.

thheller09:10:51

oh nice, that is what I was thinking as nrepl is pretty JVM specific with all the bindings and stuff

thheller09:10:59

which CLJS just doesn't need

thheller09:10:11

but I still want "Run test ... in REPL" for CLJS 🙂

cfleming09:10:20

You’ll get it 🙂

cfleming09:10:27

Along with a test runner.

cfleming09:10:49

And a debugger, at some point.

cfleming09:10:09

No promises on that one, though 🙂

cfleming09:10:37

Your custom REPL is going to be cljs.repl compatible, right?

thheller09:10:52

would it be possible to get a webview in intellij? that would be sweet too

thheller09:10:18

cfleming kind of yes and no

cfleming09:10:21

So it’s possible, but it’s complicated and involves a commercial component. But I’m planning to try it out.

thheller09:10:32

the cljs.repl is modeled after the clojure one (ie. streaming)

cfleming09:10:35

It embeds Chromium and presents it in a Swing view.

thheller09:10:05

I don't think that is the best approach and even things like piggieback have to work around tricks to get it going

thheller09:10:34

yeah figured it was tough to get a native component into swing

cfleming09:10:12

Ok. It’s a long time since I looked at this, I’ll try to look soon and refresh my memory.

thheller09:10:19

you can have them side by side, probably too much work to deal with devtools and other stuff

cfleming09:10:35

I’d like to sort out the CLJS REPL soon since it’s a pain point for a lot of people.

cfleming09:10:01

There’s an OSS one but the price on that is pretty reasonable and people speak highly of it.

thheller09:10:08

interesting

cfleming09:10:27

You could do some really nice things with that, though.

cfleming09:10:01

In general it’s much easier to make a zero-pain option when it’s all integrated.

thheller09:10:36

yeah, would definitely be cool

thheller09:10:59

very curious what you come up with for CLJS

thheller09:10:09

I was just playing around with my REPL implementation which is now at figwheel level and working quite nicely

cfleming09:10:34

Nice, is the code open?

cfleming09:10:39

I’d like to take a look.

thheller09:10:58

yes, just a sec

cfleming09:10:29

I’d be interested if you have ideas about what an interface to a CLJS REPL should look like. Ideally I’d provide an interface that a REPL implementation would provide an implementation of, so that Cursive could fire up different impls.

cfleming09:10:56

Although I’m not sure how much I need to make it generic, since there’s probably not a lot more custom REPLs out there.

cfleming09:10:30

That said, there’s already Figwheel, cljs.repl and yours so there’s clearly scope for different takes on tooling.

thheller09:10:20

I split the repl/figwheel'ish parts into its own project

thheller09:10:33

since they bring in some deps I didn't want in shadow-build

cfleming09:10:45

Nice, thanks.

cfleming09:10:54

I’ll take a look, it sounds very interesting.

thheller09:10:07

run a clojure.main repl

thheller09:10:25

(require 'build) and (build/browser-dev)

thheller09:10:35

that will run the browser repl

cfleming09:10:57

Cool, thanks

thheller09:10:09

open test-project/public/index.html in a browser to connect the repl

thheller09:10:41

actual implementation bits are still very much in flux

cfleming09:10:17

Ok. It’ll probably be a while before I can look at it seriously, definitely post-conj.

cfleming09:10:29

But like I say, it’s something I want to sort out soon.

thheller09:10:52

the interface to the compiler is pretty much final though

thheller09:10:01

none of it deals with actually running stuff

thheller09:10:16

you just feed it strings of cljs code

thheller09:10:28

it gives you back the required actions to run in the client

thheller09:10:39

that are the compile bits

thheller09:10:01

I outlined how that works in my email

thheller09:10:07

that is unchanged

cfleming09:10:08

Yes, I remember

cfleming09:10:16

So you get back a JS snippet?

thheller09:10:31

js snippet or information yes

thheller09:10:45

eg (require ...) tells you which files need to be loaded

thheller09:10:59

but doesn't have any actual js to eval

cfleming09:10:19

That sounds reasonable.

cfleming09:10:35

Ok, I have to get to bed, but I’ll copy all this and look at it when I get a chance.

cfleming09:10:13

Thanks for the info!

thheller09:10:23

those are the client repl bits

thheller09:10:36

they will look pretty different for the node repl

thheller09:10:01

@cfleming cool cool, sleep well

cfleming09:10:15

Thanks, seeya

Pablo Fernandez12:10:42

Any ideas in compojure-api, why does PUT* ignore :body while POST* reads it? Both seem to read :body-params though.

sidrero12:10:00

I am new to clojure, and I have a question I have not been able to find a definitive answer to on google

sidrero12:10:00

I work with a big java desktop application, where I have managed to embed jython to provide a scripting framework

sidrero12:10:47

it works well, but I hate java scientific libraries, and jython cannot use numpy and co., so I got interested in Incanter instead.

sidrero12:10:07

Now I am wondering if I can also embed clojure into the java app in the same way as I did for jython.

sidrero12:10:15

What I do with jython is, on the java side, create an interpreter and dynamically set paths to the different modules I need

sidrero12:10:35

then I put a 2D matrix (a list of lists) into the interpreter

sidrero12:10:16

and then I load the script I want and execute a function that will always be in every script by convention, and which takes that 2D matrix I put into the interpreter

sidrero12:10:33

the function returns another 2D matrix, which then I can recover from java side and cast to java types

sidrero12:10:21

and then some other small things like redirecting jython stdout and stderr to java so that I can print nicely on the java app, etc

sidrero12:10:51

Would it be possible with clojure? Somehow I got the feeling it is not designed for this purpose.

Pablo Fernandez12:10:45

@sidrero: I don’t have the final answer on this because I’m not very familiar with the environment, but my understanding is that since Clojure compiles to JVM code, mixing Clojure and Java is much simpler than what you had to do with Jython.

Pablo Fernandez12:10:13

@sidrero: I don’t know how you make a Clojure / Java hybrid project, but Clojure can use Java libraries and vice versa.

sidrero12:10:44

makes sense

sidrero12:10:20

I guess the only problem is that for jython they provide very clear examples on how to do this, while for clojure the documentation is not there, or at least I have not been able to find it

Pablo Fernandez12:10:36

@sidrero: when I was working at Google, as a 20% project I wanted to figure out whether we could use Clojure to build apps using Google’s infrastructure. I don’t remember building a mixed project, but long story short, I was done within a day, having built sample apps that touched on the various internal databases, deployed them to our staging infrastructure and so on.

joelkuiper12:10:49

the Clojure->Java interop is pretty straight forward and I’ve made many mixed projects (requires some fiddling with the build process though). The other way around (calling Clojure from Java) I’ve not much experience with, but should be possible. Maybe this helps? https://stackoverflow.com/questions/2181774/calling-clojure-from-java

Pablo Fernandez12:10:17

sidrero: how do you build your java apps? what build tool do you use?

joelkuiper12:10:31

also for scientific computing we pretty much decided to use Python and call it from Clojure using IPC (over ZeroMQ usually)

Pablo Fernandez12:10:45

joelkuiper: I think more than the actual code, he needs the build tool chain.

joelkuiper12:10:54

ah I see, sorry!

sidrero12:10:33

I have to do an eclipse plugin, because the java app is an eclipse app

Pablo Fernandez12:10:49

sidrero: I think this might help, if you are using Maven: http://alexott.net/en/clojure/ClojureMaven.html

Pablo Fernandez12:10:03

sidrero: and this might help if you want to have java in a lein (clojure) project: https://github.com/technomancy/leiningen/blob/master/doc/MIXED_PROJECTS.md

sidrero12:10:09

but for the scripting, the idea is not to compile the script as in the stackoverflow example, but to load the script and interpret it on the fly

sidrero12:10:21

so that the user can make changes to the script and see the changes immediately

joelkuiper12:10:23

eval should do

Pablo Fernandez12:10:49

sidrero: in that case, yeah, just add the clojure jar to your project and call its eval from java.

Pablo Fernandez12:10:03

As joelkuiper said.

sidrero12:10:09

sounds promising

sidrero12:10:51

does clojure have a mechanism to load a whole script and then pass it to eval? or I should read the script with java, build a string, then pass to clojure eval

sidrero13:10:10

I mean, in jython the only thing I have to do is to import the script

sidrero13:10:19

no need to run it manually

Pablo Fernandez13:10:25

sidrero: it’s definitely possible and I would bet the results are much more satisfying than Jython (not due to Lisp being great, but due to Clojure targeting Java).

sidrero13:10:46

I am willing to try

ul13:10:53

import clojure.java.api.Clojure;
import clojure.lang.IFn;
IFn loadFile = Clojure.var("clojure.core", "load-file");
loadFile.invoke("my-script.clj");

ul13:10:04

smth like that should help

sidrero13:10:07

that looks great

Pablo Fernandez13:10:12

Anyway, I have to go now.

sidrero13:10:18

thanks pupeno

joelkuiper13:10:30

just thinking out loud here (no idea what your project actually is …) but if it’s some sort of scientific project you can also try and start the iPython notebook as a subprocess and call that over WebSockets 😛 as Python science libraries are still the best out there imho

sidrero13:10:03

no, you need to have a server with python etc, and I don't have that

sidrero13:10:19

well, yes, but the story is complicated ... not a real option anyway

sidrero13:10:41

if I do what ul suggests

sidrero13:10:48

which is more or less what I want

sidrero13:10:17

except that I need to pass a 2D array as argument to the clojure function I want to run

sidrero13:10:36

how do I pass this argument, and how do I cast java types to clojure types?

ul13:10:56

cast them in clojure code
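
A minimal sketch of doing the conversion on the Clojure side, using only clojure.core. The `process` entry point and its transpose step are hypothetical stand-ins for whatever the script really does, and the sketch assumes a non-empty, rectangular table. Script authors would only ever see nested vectors.

```clojure
(defn rows->vectors
  "Converts a Java Object[][] (or any seq of seqs) into a vector of vectors."
  [rows]
  (mapv vec rows))

(defn process
  "Hypothetical script entry point: receives an Object[][] from Java,
  works on plain Clojure vectors, and hands an Object[][] back."
  [table]
  (let [data       (rows->vectors table)
        transposed (apply mapv vector data)] ; stand-in for the real work
    (to-array-2d transposed)))

;; Simulating the Java caller with an Object[][] built in Clojure.
(def out (process (to-array-2d [[1 "a"] [2 "b"]])))
```

From Java, the function could then be looked up with `Clojure.var` (using whichever namespace the script declares) and called with `invoke(table)`, casting the result to `Object[][]`.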

ul13:10:47

Incanter should have appropriate funcs

sidrero13:10:17

and if I would like to hide this complexity from my users? they should see only clojure types in and out

sidrero13:10:45

can I cast a java Object[][] to a core.matrix?

joelkuiper13:10:01

probably, or otherwise with a straight forward loop

joelkuiper13:10:14

wouldn’t starting a Clojure REPL be an option?

mikera13:10:16

core.matrix works just fine on Object[][]. And on nested Clojure vectors like [[1 2] [3 4]]

mikera13:10:42

All the different types are handled by protocols, so the translation is pretty seamless

sidrero13:10:54

sounds good

mikera13:10:25

But you can convert if you want, something like (coerce :persistent-vector (object-array [1 2 3])) => [1 2 3]

joelkuiper13:10:51

@mikera: I have a project somewhere on the todo to write an APL interpreter using core.matrix (including all the funky symbols) so I’ll take this moment as an opportunity to say thank you for the excellent library!

sidrero13:10:52

this scripturian thing looks really cool

mikera13:10:54

And the reverse works, e.g. (coerce :object-array [1 2 3]) => an Object[] array containing the longs 1, 2, 3

mikera13:10:16

Haha no probs. APL was a big inspiration for me!

mikera13:10:06

Should be a fun project to do Instaparse + core.matrix => APL on the JVM

joelkuiper13:10:14

my thoughts exactly!

sidrero13:10:43

mikera: regarding matrix, what if I have an Object[][] where sometimes I have numbers, sometimes strings and sometimes dates, would matrix cope with that?

mikera13:10:48

Yup. The following round-trip works just fine: (coerce :persistent-vector (coerce :object-array [[:a "Hello"] [1 nil]]))

mikera13:10:40

Also check out the "dataset" functionality, which allows named columns etc. Useful if you are importing DB tables / CSV files etc

sidrero13:10:07

that's great!

sidrero14:10:36

thanks for all the suggestions by the way, it is very cool to see such a lively community ... as comparison jython chat is mostly dead and it is difficult to get any help

jstew15:10:40

Question about components: Let's say that I have a bunch of components that access a database, I have the connection set up as a component in the system. That's all fine, but I have all of my database access code as a function in some namespace. I can require this wherever I want, but the purpose of components seems to be to keep things like that from happening (having my database functions wrapped around all my other code like tentacles). Thoughts on how I might clean that up?

jstew15:10:04

Also, kind of related, let's say I have a component with a bunch of workers. These workers get assoc on to the component and may need restarting, or to be removed. Is it a bad idea to change what's in a component after it's assoc'ed on to the system?

swizzard15:10:43

@jstew: (:require [my-db-ns :refer [my-get-fn my-upload-fn]]) is too messy?

jstew16:10:44

swizzard: Not really. I've been spoiled by my last project. Everything was pretty self contained, passing data back and forth between core.async channels, and not using too many things from other namespaces.

ghadi17:10:16

Let it be known: untyped PHMs are now known as hashbags

hlship18:10:02

@jstew There's a lot of options there; for example you could have a component that owns the database connection. It may even store the connection in an Atom (as part of, say, a failover strategy). So inside that namespace are functions that access the database, expecting (by convention, as the first parameter) the component ... from which is extracted the active connection. If this really irritates you, you could define a kind of derived component, that is just a map of partial functions. But still, if you were using Java and a DI container, you'd be invoking dbComponent.writeToDatabase(myRecord) and here it's just (write-to-database db-component my-record).

hlship18:10:54

map of partial functions: i.e., {:write-to-database (partial write-to-database db-component)} ... seems like a lot of work.
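
The two styles described above can be sketched with plain maps and atoms. Everything here is hypothetical: the component is just a map, and the "write" returns a description instead of touching a real database.

```clojure
(defn new-db-component
  "Hypothetical component: a map holding the active connection in an atom,
  so a failover strategy could swap it later."
  [initial-conn]
  {:conn (atom initial-conn)})

(defn write-to-database
  "Takes the component as the first parameter, by convention, and extracts
  the active connection from it at call time."
  [db-component record]
  {:via @(:conn db-component) :wrote record})

(defn db-api
  "Derived component: a map of partial functions closed over db-component."
  [db-component]
  {:write-to-database (partial write-to-database db-component)})

(def db  (new-db-component :primary))
(def api (db-api db))

;; Direct style and derived style give the same result.
(def direct  (write-to-database db {:id 1}))
(def derived ((:write-to-database api) {:id 1}))
```

Because the partial closes over the component rather than the connection, swapping the atom's value is visible through both styles.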

hlship18:10:53

@jstew: it is valid for components to encapsulate some local state, typically stored in Atoms (or some other mutable ref type). I've used that for things like caches, tracking failover state, even some orderly shutdown code. There is a debate about whether that works best as one-global-atom-to-rule-them-all, or distributed state in Atoms inside components. I favor the latter in my experience, but there's something compelling about being able to see all application state in one place, as a kind of aid to debugging.

jstew18:10:57

@hlship: Thanks! That's exactly the type of input I was hoping for. I'm going to try going with local state encapsulated by components.

bozhidar19:10:26

let’s give some love to the idea of proper deprecation mechanism in Clojure

bozhidar19:10:50

^^ upvote this ticket to influence the future for the better 🙂

ghadi19:10:40

that was the first Clojure ticket I ever worked on :tear:

Nicolas Boskovic20:10:15

I have a small question regarding clj-http. How can I emulate a curl post that does this? curl -X POST 'https://website.com/endpoint' -d 'request_body={ ... }'

kenny20:10:52

I have a question for those using Pedestal: Is there a way to prevent a route handler from being called? For example, I am writing an authentication interceptor that will assoc :status 401 to the context if the user is not authenticated but the problem is that the handler still gets called. To reduce load on the db/server, I do not want the handler to be called unless the user is authenticated.

ghadi20:10:47

nnbosko: should be a lot of examples on the clj-http readme

jstew21:10:14

@nnbosko: (clj-http.client/post "" {:query-params {...}})

Nicolas Boskovic21:10:39

Nothing with that case in particular @ghadi

Nicolas Boskovic21:10:54

and query-params doesn't do it either @jstew

jstew21:10:01

@nnbosko: Obviously you'll have to change the accept headers if it's json.

Nicolas Boskovic21:10:21

It's expecting -d 'request_body={ ... }' which is where the problem is

Nicolas Boskovic21:10:15

not -d '{"request_body" : {...}}' but -d 'request_body={"data1" : "val1" ... }'

pbostrom21:10:46

@nnbosko: I'm assuming you tried (http/post "" {:body "request_body={\"data1\" : \"val1\"}"})

ghadi21:10:55

what pbostrom said

jstew21:10:05

If you really have to go low level you can use clj-http/request and set the body yourself.

Nicolas Boskovic21:10:41

hmm I'll try clj-http/request

jstew21:10:31

Either way you're setting the body the same way as the 2 other examples though, so I would expect the same results.

jstew21:10:21

As long as it's not https, you could use something like charles proxy to see what the difference is between the curl post and clj-http. That might give you more insight into what's going on with your request body.

gusbicalho21:10:36

@nnbosko: curl sets the content-type to application/x-www-form-urlencoded automatically, clj-http doesn't
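
A sketch of what that implies for the request, assuming the goal is to mirror `curl -d` exactly: the body is sent as-is and only the content type is added. The `form-post-options` helper is hypothetical, and the clj-http call is left in a comment because it needs the clj-http dependency and a live endpoint.

```clojure
(defn form-post-options
  "Builds a request options map whose body is sent unchanged with the
  form content type, which is what curl -d does by default."
  [raw-body]
  {:headers {"Content-Type" "application/x-www-form-urlencoded"}
   :body    raw-body})

(def opts (form-post-options "request_body={\"data1\" : \"val1\"}"))

;; Assumed usage with clj-http (the URL is the placeholder from the question):
;; (require '[clj-http.client :as http])
;; (http/post "https://website.com/endpoint" opts)
```

clj-http also accepts a `:form-params` map, which URL-encodes the values for you; whether the server accepts the encoded form is something to check against the endpoint.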

ericnormand22:10:11

does anyone know of a good way to set up a local maven proxy?

ericnormand22:10:30

I'm building docker containers and each one downloads the same jars over again

ericnormand22:10:35

each time it's built!!!

ericnormand22:10:59

maybe it's as simple as mounting the .m2 directory (hoping)

lfn322:10:45

We usually build our jars on our machines and have docker build using the uberjar that’s generated from that for local dev.

ericnormand23:10:51

that's interesting

ericnormand23:10:55

do you have a build server do the same?

ericnormand23:10:25

and that probably keeps the container leaner

nberger23:10:46

@ericnormand: we mount the local .m2 directory