2023-12-11
Channels
- # adventofcode (33)
- # babashka (1)
- # beginners (11)
- # biff (3)
- # calva (2)
- # cider (24)
- # clj-kondo (9)
- # cljfx (5)
- # clojure (39)
- # clojure-austin (2)
- # clojure-europe (11)
- # clojure-nl (1)
- # clojure-norway (22)
- # clojure-uk (10)
- # community-development (18)
- # data-science (24)
- # datahike (3)
- # events (3)
- # hyperfiddle (11)
- # lsp (22)
- # malli (3)
- # matrix (1)
- # off-topic (24)
- # other-languages (3)
- # overtone (7)
- # pathom (5)
- # reitit (2)
- # shadow-cljs (34)
- # sql (20)
- # squint (13)
Good morning dear folks. I wonder if there are people who do heavy geospatial data processing here. For example, we do things like simulating 100,000 years of wildland fire all over the US; as you can imagine, that entails a geospatial Big Data situation. Still scientific computing, but somewhat different from the usual data science / ML data engineering workflows, I believe. I wonder if other teams deal with similar situations and want to talk about it.
Hi @U06GS6P1N! I've done some geography in the past and would be curious to hear about your current needs. Would you like to chat about it? I'm available for a call most days after Wednesday afternoon. BTW there are some existing Clojure libraries for geospatial processing (https://scicloj.github.io/docs/resources/libs/#geospatial-processing), mostly by @U2B2YS5V0, who may be more available on the Scicloj chat (https://scicloj.github.io/docs/community/chat/).
Thanks @U066L8B18, it would be my pleasure. Could you describe your availability with time zones? 🙂
Thanks, I'm at https://time.is/UTC+2 time zone. Most hours are flexible for a call 🙏
hi @U06GS6P1N! this is my background (geography phd, worked for geospatial companies my first four years in industry), and while I mostly did more numerics & ML stuff for remote sensing in pre-Clojure days, I’ve kept up w/some research groups, including this one here in Boulder that’s part of Earth Lab at CU: https://earthlab.colorado.edu/our-work/extremes-natural-hazards/fire-regimes-and-how-they-are-changing — I’d be interested to chat or join a larger discussion w/@U066L8B18 and/or others. I’m in MT time zone w/some flexibility. Have been working on getting my scheduling workflow under control so I can just pass calendly links around. 😛 Hope to be there soon!
@U06GLTD17 ha, amazing that we've worked on such similar subjects, what are the odds?
I also have a background in geography and do (some) spatial analysis for my day job. If you're interested in potentially "scaling up" rather than "scaling out" you might want to have a look at DuckDB's spatial extension (https://duckdb.org/docs/extensions/spatial.html). It probably won't help with the most compute-intensive parts of the simulation but may be useful for some aspects of the workflow.
Thanks @UFTRLDZEW! I'm already using PostGIS a lot for my geospatial data processing, not always very happily because I find it hard to make fast. Do you have reasons to believe that DuckDB would be naturally faster for this sort of workload?
It's specifically designed for analytical workloads with a column-oriented architecture, whereas PostGIS is still built atop a database designed for row-oriented transactional workloads.
I wonder if that makes much of a difference in this case...
It will depend on the particulars of your architecture and workload, but the Postgres scanner (https://duckdb.org/2022/09/30/postgres-scanner) may make it easy to test whether DuckDB results in a speedup without needing to invest significant resources in architectural changes.
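To make that concrete, here's a minimal, untested sketch of that experiment from Clojure, assuming the DuckDB JDBC driver and next.jdbc are on the classpath; the connection string, schema, and table name are all hypothetical:
```clojure
;; deps (assumed coordinates): org.duckdb/duckdb_jdbc, com.github.seancorfield/next.jdbc
(require '[next.jdbc :as jdbc])

;; "jdbc:duckdb:" with no path gives an in-memory, in-process database
(def ds (jdbc/get-datasource "jdbc:duckdb:"))

;; load the scanner, then read a PostGIS-side table directly from DuckDB
(jdbc/execute! ds ["INSTALL postgres_scanner"])
(jdbc/execute! ds ["LOAD postgres_scanner"])

;; postgres_scan takes a libpq connection string, schema, and table
(jdbc/execute! ds
  ["SELECT count(*) AS n
    FROM postgres_scan('dbname=fires host=localhost', 'public', 'fire_perimeters')"])
```
If the analytical queries come back meaningfully faster over the same data, that's a cheap signal before committing to any migration.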
The primary missing piece for me the last few years has been the lack of any good projection libraries beyond proj4j, which is stuck at PROJ version 4. What was going to be a small project that'd take a few months has grown into an ever-shifting multiyear kind of crazy thing combining dtype-next, emscripten, graalwasm, and cherry-cljs. I've finally got a build of PROJ 9 working natively with FFI and clj, which can fall back to a wasm build via graalwasm (with a few remaining memory-handling bugs), and my last step is to finish a version of it that works on cherry-cljs. New job has given me a little less time to get it out the door but I'm perennially hopeful!
For me, without a good projection library that can handle ESRI shapefile definitions etc., it's always going to be hard to get this working on the JVM to solve the use cases that QGIS + PostGIS can solve. In the meantime, PostGIS gets me 80% of the way there if I accept doing it all via Postgres!
I wonder if a library like clong or #coffi would be a viable option for building a modern Clojure projection library atop a more recent version of proj.
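For illustration, a rough, untested sketch of what a coffi binding over a couple of PROJ 9 C functions might look like. The PROJ function names are real; treating PJ_CONTEXT*/PJ* as opaque pointers, and using mem/null for the nullable PJ_AREA* argument, are my assumptions about the coffi API:
```clojure
;; dep: org.suskalo/coffi
(require '[coffi.mem :as mem]
         '[coffi.ffi :as ffi :refer [defcfn]])

;; assumes a system-installed libproj (PROJ 9)
(ffi/load-system-library "proj")

;; PJ_CONTEXT* and PJ* are opaque pointers as far as this sketch cares
(defcfn proj-context-create
  proj_context_create [] ::mem/pointer)

(defcfn proj-create-crs-to-crs
  proj_create_crs_to_crs
  [::mem/pointer ::mem/c-string ::mem/c-string ::mem/pointer]
  ::mem/pointer)

(defcfn proj-destroy
  proj_destroy [::mem/pointer] ::mem/void)

(comment
  ;; e.g. a WGS84 -> Web Mercator transformation object; the last
  ;; argument is a nullable PJ_AREA* (mem/null is assumed here)
  (let [ctx (proj-context-create)
        pj  (proj-create-crs-to-crs ctx "EPSG:4326" "EPSG:3857" mem/null)]
    ;; actually transforming points via proj_trans would additionally
    ;; need a PJ_COORD struct layout, which coffi can describe
    (proj-destroy pj)))
```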
Also, DuckDB has a spatial extension (https://duckdb.org/2023/04/28/spatial.html), so that's another potential in-process option with a relatively small dependency footprint.
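As a small, untested sketch of that option from Clojure (ST_Read and ST_Area per the DuckDB spatial docs; the shapefile name is made up):
```clojure
(require '[next.jdbc :as jdbc])

(def ds (jdbc/get-datasource "jdbc:duckdb:"))

(jdbc/execute! ds ["INSTALL spatial"])
(jdbc/execute! ds ["LOAD spatial"])

;; ST_Read goes through the bundled GDAL, so shapefiles, GeoPackages,
;; etc. load directly; 'perimeters.shp' is a hypothetical file
(jdbc/execute! ds
  ["CREATE TABLE perimeters AS SELECT * FROM ST_Read('perimeters.shp')"])

;; ST_Read exposes the geometry column as 'geom'
(jdbc/execute! ds
  ["SELECT ST_Area(geom) AS area FROM perimeters LIMIT 5"])
```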
I think dtype-next is the way to go and is working well for performant FFI. The part that's been a pain for me has been ensuring that even when a natively built lib isn't available, there's something else that users can fall back to.
Graalwasm is now on Maven Central so that works, but there's been a little wackiness dealing with thread handling on a GraalJS context.
re: projection libraries, I’m actually not up to date here at all. I have some ancient knowledge re: the refractions research libs — they were the first devs of postgis, and from 2008-2013 I was aware of several open source projects that used their geotools java lib, landing there after trial and error re: coordinate system conversions. i.e., it was the one library that they could use in open source land (at the time) and end up w/accuracy comparable to going through esri tools. They built uDig basically as a GUI consumer of geotools, postgis, and the various OGC mapping service specs. As always, any shapefile support back then (and probably still elsewhere?) was/is spotty, due to the fact that esri has a bunch of extra special proprietary stuff w/different conventions than open. Caveat: that I’m not fully caught up on the state of things after the back-colonization from arcpy to the python data science ecosystem and geopandas etc 🙂 Re: big data solutions, there’s some geospatial stuff in Trino, which also provides a path for eg integrating datomic layers via analytics metaschema w/geospatial query capabilities. Or there’s the Athena presto/trino fork available as an AWS service. But I guess the question is what specific things do you need to do for the simulation, etc? How much of this is data integration and conversion b/t a bunch of different coordinate systems and resolutions of data to align and re-sample things into a common representation for modeling? how much is query re: e.g. standard geometry/polygon topology stuff? how much is more like numerical simulation or cellular automata-ish rules on rasters etc.?
If you’re disappointed by PostGIS performance then things like BigQuery, Athena or Redshift can do vast amounts of GIS processing quite quickly. But if you’re doing detailed simulations that might not be the best fit - it’s possible you want something like Java Topology Suite with Clojure to orchestrate that on a massive multicore instance on AWS.
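For instance, a minimal sketch of driving JTS from Clojure interop; the coordinates and polygon are made up, and the pmap at the end just gestures at the big-multicore-instance point:
```clojure
;; dep: org.locationtech.jts/jts-core
(import '(org.locationtech.jts.geom GeometryFactory Coordinate)
        '(org.locationtech.jts.io WKTReader))

(def gf (GeometryFactory.))

;; a point and a polygon in some projected CRS (values are arbitrary)
(def pt (.createPoint gf (Coordinate. 500000.0 4400000.0)))
(def zone
  (.read (WKTReader. gf)
         "POLYGON ((499000 4399000, 501500 4399000, 501500 4401000,
                    499000 4401000, 499000 4399000))"))

;; buffer the point by 1000 units and clip it to the zone
(def burned (.intersection (.buffer pt 1000.0) zone))
(.getArea burned)

;; independent geometry ops parallelize trivially across cores
(pmap (fn [g] (.getArea (.buffer g 1000.0))) [pt zone])
```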
Did you ever use Torch CUDA and hit a CUDA error at the REPL? I want to debug the issue, but PyTorch's CUDA APIs all check for the sticky error and refuse to proceed. I want to reset my context so I can keep debugging.
I'm looking for a durable (storage-backed) data structure that is a time-indexed log: 20 transactions per second, where the key is a time bucket and the value is a record. The log must support efficient lookup by time bucket as well as O(1) relative scan forward/backward for playback. File-system storage is fine, as is rotating files daily if needed. Suggestions?
Sharper statement of requirements:
• timeseries records
• random access
• sequential access
• realtime tailing
• 20 appends per second
• simple storage
• Java client
Since no one else has answered: I think it would be difficult to find a db that couldn't handle this, even on cheap cloud instances. There are also dozens of logging solutions that will do all of the above. If you're already familiar with a particular db, that should be fine. If you're running in the cloud, then your provider might already have a solution that does 90% of what you want out of the box (e.g. AWS CloudWatch). Some differentiating factors might be:
• what kinds of queries you're interested in
• structured logging (i.e. data) or unstructured logging (i.e. text)
• durability (i.e. if an instance crashes, how much loss is acceptable: a few seconds, minutes, hours?)
• the size of your messages (<1kb, <10kb, <1mb, etc.)
I agree, most SQL (e.g. SQLite, Postgres) or NoSQL (e.g. Mongo) databases could handle these requirements.
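As a sketch of how little is needed, here is an untested SQLite variant with next.jdbc; the schema and helper names are made up to match the requirements list:
```clojure
;; deps: org.xerial/sqlite-jdbc, com.github.seancorfield/next.jdbc
(require '[next.jdbc :as jdbc])

(def ds (jdbc/get-datasource "jdbc:sqlite:events.db"))

;; time-bucketed log: the indexed bucket column gives random access by
;; time, and the autoincrementing id gives cheap ordered scans for playback
(jdbc/execute! ds
  ["CREATE TABLE IF NOT EXISTS log (
      id     INTEGER PRIMARY KEY,
      bucket INTEGER NOT NULL,
      record BLOB    NOT NULL)"])
(jdbc/execute! ds
  ["CREATE INDEX IF NOT EXISTS log_bucket ON log (bucket)"])

;; 20 appends/sec is well within range even with an fsync per commit
(defn append! [bucket record]
  (jdbc/execute! ds
    ["INSERT INTO log (bucket, record) VALUES (?, ?)" bucket record]))

;; random access by time bucket
(defn by-bucket [bucket]
  (jdbc/execute! ds ["SELECT * FROM log WHERE bucket = ?" bucket]))

;; relative scan for playback: the next n records after a given id
(defn scan [from-id n]
  (jdbc/execute! ds
    ["SELECT * FROM log WHERE id > ? ORDER BY id LIMIT ?" from-id n]))
```
Realtime tailing is then just polling scan from the last seen id, which is comfortably cheap at 20 appends per second.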