clojure-europe 2022-12-15

Hey guys. I have a 150MB SQL dump from some random database. I have no idea how the tables are set up or even what’s really in this data. How do I get from hugedump.sql to an overview of what’s in this data? My colleague has been working on linking two datasets for several months in 2021, but she never bothered to check if she could export the data from the tool she was using. Now I have this dump, but most of it just says _binary and then lists a bunch of binary data.

javahippie09:12:45

So it’s a dump as SQL DDL and Insert statements, but most of the data is serialized binaries?

simongray09:12:07

I think so.

simongray09:12:19

I think I need to visualise the schemas and get some specific data out (if it’s in there).

simongray09:12:41

Looks like this

javahippie09:12:05

Ah, so no DDL and the inserts depend on column order?

simongray09:12:03

not sure if I know enough SQL to answer that. It has a bunch of CREATE TABLE statements in the beginning.

simongray09:12:25

javahippie09:12:38

In this case, I’d just try and run the whole script against a dockerized, uninitialized MySQL Database and see what happens

💯 1

javahippie09:12:37

Seeing a Flyway reference in there is worrying, it’s like migratus in the Clojure world and hints that there might be additional DDL files in a programs source code somewhere

simongray09:12:50

Uh oh

javahippie09:12:13

Ah, nevermind, if it’s a dump it shouldn’t matter

simongray09:12:18

ok, phew

javahippie09:12:21

sorry 😄

simongray09:12:02

Do you think it needs to be MySQL or can I perhaps use SQLite?

javahippie09:12:07

I’d use MySQL to be sure that the collation et al work, especially the binary format might not work in another DB. The bigger question might be, what the binary stores represent. If they are serialized Java Objects and you don’t have the source code anymore it might get tricky

🤯 1

mccraigmccraig09:12:19

it's referencing InnoDB in the DDL - so it's MySQL

🙏 1

simongray09:12:04

I’ve been trying to get her to provide me with the export for 2 years… she finally delivers and it can’t be used. Now all I have is this dump 😛 will be interesting to see whether her work is completely wasted.

javahippie09:12:48

After a quick search, there seem to be tools that can restore deserialized java objects even if the class files are missing. Archeology time!

😅 1

mccraigmccraig09:12:23

tried a couple of the leading bytes from the binaries against - https://en.wikipedia.org/wiki/List_of_file_signatures - no matches... which hopefully rules out serialized java objects (which start with AC ED)

simongray09:12:39

thanks for the detective work, guys

simongray09:12:26

I’m in. Now I just need to figure out where the magic happens.

Ben Sless09:12:11

Maybe db-xray will help explore this from repl, too

reefersleep10:12:46

This is the stuff that they don’t tell you about when they’re trying to recruit you with fancy tutorials for using pure functions and making beautiful GUIs in few keystrokes.

😂 3

emilaasa10:12:21

This is more thrilling than most netflix productions!

😁 1

emilaasa10:12:52

For some reason I always want to find out what’s in a binary data representation if I see one 🙂

simongray10:12:11

So I managed to poke around the data for a bit and as feared all of the IDs I was looking for were NULL. So my colleagues have altogether spent probably a year wasting their time...

reefersleep10:12:37

They can’t re-export in a different manner?

simongray10:12:27

No this is the raw data. The data simply isn't there. If they had exported a sample of the stuff in 2020 like I asked, this wouldn't have been an issue...

reefersleep10:12:28

Or, reshape the data in a manner that ensures that you get what you want upon export.

reefersleep10:12:48

Welp, that’s terrible.

emilaasa10:12:57

Oops

mccraigmccraig10:12:26

eek - how do you work for a year without ever checking that you are recording what needs to be recorded ?

simongray10:12:04

By ignoring the guy asking for a sample multiple times a year for 3 years.

simongray10:12:31

Unfortunately. I had a feeling this would happen.

emilaasa10:12:52

Maybe there are some relations? Have you tried looking at an ERD of the db?

emilaasa10:12:33

Won’t help if all the fields are null of course 🙂

simongray10:12:26

the work they have done is that of manually linking two wordnets, but the software they used uses its own binary IDS to link the wordnet entries. There is a table that translates this makeshift binary ID to one we know for one wordnet dataset (the English one) but all of the IDS for the Danish data are NULL. We would have discovered this early if they had produced a sample, but this sample never materialised for 3 years...

rip

Indeed

👀

Does the original data with that software still exist before export? Are the ids there correct or were they corrupted from the get go?

orestis11:12:03

BTW I'd use a GUI to explore this. I personally use TablePlus on Mac - sometimes a GUI helps you see data you'd otherwise miss.

emilaasa12:12:07

DBeaver for lnx

simongray13:12:44

@UK0810AQ2 There was never any data corruption, AFAIK, just a lack of due diligence.

😟 1

orestis14:12:43

Better ask Jørgen.

😁 1

orestis14:12:41

What a sucky situation @U4P4NREBY - hope my joke works for you.

simongray16:12:03

I chuckled 😉

lemontea16:12:07

that’s a rather classic case study… my condolences to the data loss

simongray16:12:33

Thank you @U6L5FFL95. 🥲

mccraigmccraig09:12:28

well this job turned out to be bigger than i guesstimated 🙈

simongray09:12:30

You mean my SQL woes?

mccraigmccraig09:12:47

no, i've been refactoring a library which has accreted function over 7 years, to give it a nice clean interface...

simongray09:12:12

doing good work

mccraigmccraig09:12:48

hopefully doing good work - it's open-sourced, and a distillation of all the learnings we've had working with a streams+promises system

robert-stuttaford10:12:04

remember the rule of spring cleaning - empty the drawer on to the kitchen floor and put things back in that you want to keep, rather than trying to remove things you don't like from the drawer i.e. in code: make a new namespace, copy stuff over bit by bit, rather than trying to cut stuff away from existing code it doesn't always work out but i find it's a better way to start. it also has the benefit that you can do multiple iterations and compare

👀 1

mccraigmccraig10:12:10

i did exactly that on the main namespace @U0509NKGK , then ditched the old namespace once done

otfrom10:12:50

@U0509NKGK that is very good advice

👍 2

robert-stuttaford10:12:13

thank you 😊

jasonbell11:12:39

Morning

agile_geek11:12:06

Bore da