data-science

rgm 2025-02-03T22:59:48.671159Z

Hi everybody, I hope this is the right place for a tablecloth/tech.ml.dataset question. The attached image is fairly typical of the data I receive from structural and building envelope engineers. Are there any "higher-order" ways to extract data out of this with tablecloth or similar? What I'm looking for is help recognizing that the shape of these tables is such that: • every cell is effectively seen as a map, ie. one row in a CSV table = 1 cell in these tables, with a huge amount of data duplication for cells in the same table • each table gets unrolled into a vector of maps in some data-driven way, like (extract "B1" "H40" opts) ;; => [{:material "Steel" :orientation "Horizontal" :insulation "Exterior" :spacing [406 610] :insul-thickness 203 :nominal-R 5.90 :tape true :effective-R 5.24} ,,,] I've tended to homebrew various ways to do this with docjure and tech.ml.dataset. I'm just wondering if anyone knew of any good ideas around this kind of unrolling. It seems to be a favourite format of professional engineers because it matches closely to how engineers and specifiers look things up, but a big rectangle is better for what I need to do with it. Even just a name of what this kind of format is would help. It's isomorphic to a big CSV rectangle, so I'm just guessing I'm not the first person to notice and write something clever to cope with it.

Gent Krasniqi 2025-02-04T11:44:56.125909Z

I think the easiest way would be to prepare the data in Excel first, meaning prep each into a proper individual table. Powerquery (M language) is appropriate for this sort of task (whatever its called in current versions of Excel).

rgm 2025-02-04T16:39:11.877699Z

yeah, that's a good suggestion, thanks. Hadn't thought of Powerquery. I suspect the best thing is to just assume 8 copy-pastes to pull these apart into worksheets, then TMD/tablecloth can go through it like a buzzsaw

2025-02-05T22:33:19.563119Z

if you can detect the variability in the "constant-ish" as a first pass (maybe with an llm), you can calibrate coordinates such that the remainder is constant with the code you already have that reads the spreadsheet

rgm 2025-02-03T23:12:04.327009Z

(and of course just to keep it interesting, I've found building engineers to be big fans of merging cells, which I should probably just deal with as a pre-processing pass)

🙃 1
2025-02-04T01:32:13.098399Z

If the data you receive is in spreadsheet format, perhaps you could read the spreadsheet directly first?

rgm 2025-02-04T01:33:57.055209Z

sorry, I could have been clearer ... yes, it's a spreadsheet, but formatted more for distribution as a PDF than as easily consumed rectangular data

rgm 2025-02-04T01:35:24.712759Z

I can generate a bunch of (docjure/select-columns ,,,) calls but I'm mainly wondering if this kind of transformation has a common name.

2025-02-04T01:42:06.032969Z

others can weigh in as well, but if the format is constant, one approach might be to read it in as a spreadsheet. From there it can be converted ( e.g. tablecloth, etc) for analysis

Harold 2025-02-04T01:42:16.885639Z

I came across this paper a while ago: https://hannes.muehleisen.org/publications/ssdbm2017-muehleisen-csvs.pdf I think an implementation of those ideas that produced sequences of maps (sort of as you describe) would be a potentially valuable contribution to the community. For one, those sequences of maps could be fed to TMD, but doing it in Clojure data strikes me as the right level to enable more varied uses (than hard-wiring that kind of functionality into TMD).

🙏 1
rgm 2025-02-04T01:56:34.440139Z

thanks, both! the basic issue I seem to run into over and over is that the format is "constant-ish" ... doesn't seem to matter how often I ask anyone to not change the shape, it always seems to shift around just enough to break my spreadsheet import code's expectations, and it's been hard to specify things just loosely enough to dodge this. Scanning that paper and I'm now musing about a function that producing 'hypotheses' as data given a TMD dataset, eg. one hypothesis is that this dataset contains 8 "tables", here are their corners, here are the attributes we identified. Maybe it's something like (form-hypotheses ds). Then we feed confirmed hypothesis data into a second fn (ext ract-rows ds hypothesis) that spits out the seq of same-keyed maps. if the hypotheses are even 70-80% close, I can just tweak them using my eyes after hypothesizing and before extraction.

Harold 2025-02-04T02:27:34.302179Z

glad to have shared a helpful idea - yours is a good one too - the extra structure present after feeding it to TMD could be helpful in identifying to what extent it parsed as expected.