This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-08-01
Anyone have much experience with working with spatial data in Clojure/Java? I'm currently using Postgis to do a lot of spatial manipulation (intersections of thousands of large shapes) and was wondering if there would be any advantage to moving it into Clojure code. I'm mostly concerned with performance. Postgis uses C libraries, so I'm guessing it'll be faster than anything in Java land. However, I think (I could be wrong though) Postgres (and maybe Postgis?) have poor parallelization, soooo maybe there's room for optimization? Anyway, would love to hear about someone's experience if they've done a similar switch before.
I don't have a library to offer you, but I can offer a basic idea of what you may need to consider. Intersection of shapes seems like a very parallelizable problem, but if you want to download the data then you go over the network (assuming Postgres is the DB) and that's not too good. If you move your data into Java land then you still have to store it somewhere. So will it be HDD again or RAM? If it's RAM it will be lightning fast and you will use your whole CPU. But you know, you can't go into a store and buy a terabyte of RAM. I'd first consider whether you want to optimize by moving your data into RAM. Then I would also think about whether you do disk IO when your Postgres loads those shapes from disk. Maybe you didn't create some kind of spatial/GIS index and that's the actual problem? (I haven't tried PostGIS but well.. it's a DB and these are things that DBs should do) It's also possible that your database's HDD is too slow, and this is why it could look like the DB's CPU is underutilized. Maybe it's not how they parallelized the CPU work but that they stored something on disk and there is no other way around it. One solution would be RAID1 (mirror disks for read speed) and another one would be to add more DBs that use different physical disks.
There are several wrappers for Java Topology Suite but most of them are incomplete and out of date. You're better off using JTS directly. PostGIS is excellent and very fully featured, but it depends on your exact workload. Modern versions of Postgres will parallelise queries sensibly, and I believe PostGIS started marking its stuff parallel safe several years ago. If you're on recent versions you should be getting good performance.
The script would run on the server with the database. No network download necessary. Having everything run in RAM wouldn't be a problem and would be better than disk IO, since the hard drive isn't SSD. I have indexes in all the right places and have even gotten a script that took ~6 hours to run in 16 minutes using some optimizations covered in this talk https://www.youtube.com/watch?v=uCSYp_m8A9o
@UTF99QP7V yeah I found https://github.com/Factual/geo which looked decent. Also a couple of more listed here: https://scicloj.github.io/docs/resources/libs/#geospatial-processing Yeah I guess it should be running in parallel. Maybe I just need to tune the database config params a bit more.
You also have the option (if appropriate to your problem) of simplifying geometries a bit (ST_Simplify) which can often speed things up.
You really are better off using JTS directly from Clojure, the wrappers aren't great and if you really need to tune for performance you might start caring about type hinting things aggressively or using mutable state more. You can also control prepared geometries better.
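A minimal sketch of what "JTS directly, with type hints and a prepared geometry" could look like from Clojure. It assumes org.locationtech.jts/jts-core is on the classpath; the helper names read-wkt and intersections are made up for illustration, and pmap is just one simple way to spread the work across cores, not a tuned setup.

```clojure
;; Assumption: org.locationtech.jts/jts-core is on the classpath.
(import '(org.locationtech.jts.geom Geometry)
        '(org.locationtech.jts.geom.prep PreparedGeometry PreparedGeometryFactory)
        '(org.locationtech.jts.io WKTReader))

;; Surface any reflective interop calls so they can be type hinted away.
(set! *warn-on-reflection* true)

(defn read-wkt
  "Hypothetical helper: parse a WKT string into a JTS Geometry."
  ^Geometry [^String wkt]
  (.read (WKTReader.) wkt))

(defn intersections
  "Hypothetical helper: intersect `zone` with every candidate that overlaps it.
  The prepared geometry makes the repeated `intersects` pre-check cheap;
  the more expensive `intersection` calls are then run via pmap."
  [^Geometry zone candidates]
  (let [^PreparedGeometry prepared (PreparedGeometryFactory/prepare zone)
        hits (filterv (fn [^Geometry g] (.intersects prepared g)) candidates)]
    (doall (pmap (fn [^Geometry g] (.intersection zone g)) hits))))
```

Calling (intersections zone shapes) with one large shape and a collection of candidate shapes returns only the non-empty overlaps. Whether this beats PostGIS depends on the workload, so it is worth measuring both on real data.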
> The script would run on the server with the database
If it's not IN the database's process it's still network. Even if it's on the same machine. It's faster than through wire but it's still a hop. But I think your bottleneck is HDD.
@UTF99QP7V ok thanks for the tip, I'll go with that then. Not excited to wrangle java
@U028ART884X yeah that's fair
Wrangling Java’s a big part of what Clojure’s for. Better to embrace it and get interesting stuff done than worry about purity, I think. :)
Yes, happy to discuss geo etc. It’s correct that geo is more useful if you’re trying to switch between JTS or other libraries. In terms of out of date, if you don’t mind using something via a git dep, you might consider https://github.com/willcohen/aurelius/blob/master/src/aurelius/jts.clj. (It’s taken a pause as I’ve been working on getting https://github.com/OSGeo/PROJ wrapped natively on the JVM, which has been hampering my ability to shift more fully from PostGIS to JVM land.) That said, aurelius and ovid (https://github.com/willcohen/ovid), which it builds upon, create a Featurelike protocol which should substantially reduce some of your JTS friction and allow you to switch between geometries, features (where the geometry and attributes stay together), and pure attributes alike. You’ll also see in the jts ns that it includes JTS’s preparable geometries, so that things like intersection checks can happen in a more optimized way. The protocol does hint as efficiently as possible, so this may reduce your need for full postgis indices.
To echo above, just interacting with postgis is your easy bet, my incomplete efforts here are not quite ready to go. I keep thinking it’s a small problem to solve, and then I get one step further down the processing chain and hit a bump. I’m pretty sure, though, that solving PROJ is the last primary component. My roadmap from there will be to get the new PROJ wrapper integrated into geo, at which point I can then finish the clojurescript port of geo, at which point I can get a full initial alpha release of ovid and aurelius done.
That said, if you want to run some of this stuff through its paces, I’m MORE than open to feedback!
Finally — just note that postgis uses GEOS, which as you note is a C library, but it’s actually just a downstream port of JTS, so I don’t think you should be too concerned about its performance relative to JTS. Broadly speaking, I see no reason why panama/jna PROJ + JTS should ultimately be worse than smartly called GEOS, and it’s certainly less interop if the PROJ wrapper is performant
Oh wow thanks for weighing in. Ok I'll keep all of these things in mind. We're working with a lot of shape data for my startup and I'd like to eventually move most of our processing from JavaScript + Postgis to primarily Clojure if I can (because I like working with it). It'll probably be a gradual thing since I'm still just experimenting, I'll probably consult you again at some point if you don't mind?
of course! the projects i’m working on definitely aren’t dead — I’ve just ducked underwater for a while to address some underlying stuff before pushing them forward. i’ve just been getting by with them in WIP for my own internal purposes and didn’t want to start with a half-baked API that i knew was likely to change!
also — as you can see, the ONE place where i see using a clj solution as strictly better than just JTS is in the jvm/js parity. the thing that’s got me hung up right now with projections is making sure that my clojure wrapper works equally on the JVM and on JS — I’ve mostly got PROJ compiling to wasm, which would mean that spatial transformations would be the same in clojure as clojurescript. THAT would then mean that the story for spatial stuff on clojure will suddenly all work in parallel on clojurescript as well, which would be nice. i thought it’d be like a 2 month process to port but it’s been more like half a year and will probably take a few more months
I led a team whose major effort was satellite imagery ingestion (L8/S2) for the ag industry. We processed both raster and vector data (hundreds of TB). For its spatial processing component, we used GeoTrellis (https://geotrellis.io/). I'm not too sure what interop looks like between Clojure<->Scala even though they both sit on the same JVM, but perhaps it's simple enough that GeoTrellis might be a benefit.
@U4DQ68204 oh very cool, I'll take a look. Thank you!
i’ve always been totally intrigued by geotrellis but never had the pleasure of making use of it!
@U2B2YS5V0 their business support is top notch. we had multiple in-person trainings circa 2017-2018 with their staff and they always went above and beyond even down to basic language stuff. It helped jumpstart the team in a way that probably saved 6mo work. Happy to share more in DM. It's probably too off-topic even for off-topic 🙂