#data-science
2022-12-15
jumar09:12:00

How do people here document data? We have a bunch of devs working on a system that produces analytics events and sends them to a destination where the events are processed further and then consumed via the AWS Athena interface. Every event is a separate JSON document, but the content can differ a lot, and over time the structure evolves. At the same time, the data analysts, who primarily use the Athena SQL interface, need to know what kind of data properties they can expect and use in their queries. We are just starting with this effort, but it has already been a challenge to communicate what properties are available and when they were introduced. So we need a way to communicate this "data schema" somehow. I don't have any great ideas so far, mostly just to write it down somewhere like a wiki page or an Excel sheet. Do you know any good tools to capture this information? Ideally free / open source. Any specific suggestions on how to design this documentation process?

jeroenvandijk11:12:23

In Athena you can convert data from JSON to something with a schema, e.g. Parquet. This gives queries a big performance benefit: the data gets smaller and parsing is faster, which reduces the bills and speeds up results. This pays off if you access the data more than a few times, of course. So adding a (Parquet) schema would give you these benefits, and introducing this schema might be the start of a formal documentation process?
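For example, a CTAS (CREATE TABLE AS SELECT) query in Athena can rewrite the raw JSON table into Parquet. This is just a sketch; the table, column, and bucket names below are made up for illustration:

```sql
-- Sketch only: rewrite the raw JSON events table into Parquet.
-- Table, column, and bucket names are hypothetical.
CREATE TABLE analytics_events_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://your-bucket/events-parquet/',
    partitioned_by = ARRAY['event_type']
) AS
SELECT
    server_time,
    event_time,
    event_properties,
    user_id,
    user_properties,
    event_type          -- partition columns must come last in a CTAS SELECT
FROM analytics_events_raw;
```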

jumar08:12:36

Bear with me since I don't have much knowledge of Parquet and have never used it (or a similar columnar format). I agree that Parquet could give us some performance benefits, but I'm not sure how exactly it would help with the documentation part. Currently, we have defined tables with a pretty generic structure because the event format evolves:

server-time bigint
event-type string
event-time bigint
event-properties string
user-id string
user-properties string
The content of these event/user properties is of course what is often interesting, but it's also the part that will change, and it also depends on the system in question (we have two different apps generating these events at the moment). On top of these tables, we have defined views to make them more user-friendly, e.g.
user struct<id:string,account:struct<org_name:string,email_domain:string,...>>

event struct<type:string,time:timestamp>
...
The view is defined with the help of the json_extract_scalar function. But even that is far from complete and lacks metadata such as a description, when the field was introduced, etc. I guess what I'm after is really something a bit more generic, or just a description of a process that other people use to satisfy similar requirements, that is:
1. Complete: to have a (as much as possible) complete list of all the fields in various events
2. Easily accessible: various people, including devs, data analysts and product managers, can easily discover and explore the data. Not all of them will be familiar with Athena (or even have access to it)
3. Metadata support: again, things like description, created date, whether it's deprecated, links to relevant resources, etc.
4. Maintainability: it must be relatively easy to keep it in sync with the real state of things, notably when developers change the logic or introduce new events and properties
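For context, a minimal sketch of what such a json_extract_scalar view might look like. The column names, JSON paths, and the epoch-milliseconds conversion are all assumptions, not our actual schema:

```sql
-- Sketch of a friendlier view over the generic events table.
-- Column names, JSON paths, and the millisecond epoch are assumptions.
CREATE OR REPLACE VIEW events_friendly AS
SELECT
    from_unixtime(event_time / 1000)                                AS event_time,
    event_type,
    user_id,
    json_extract_scalar(user_properties, '$.account.org_name')     AS org_name,
    json_extract_scalar(user_properties, '$.account.email_domain') AS email_domain
FROM analytics_events_raw;
```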

jumar08:12:56

I did a quick Google search but haven't found much. Pretty much the only relevant link was this: https://www.datalogz.io/post/data-documentation-tools That is a proprietary tool that doesn't seem to have a free account at all, but they at least enumerate common options: Word, Google docs, http://Dbdocs.io, ApexSQL, and Datalogz itself. Of those, http://Dbdocs.io is perhaps interesting, but I haven't had time to look at it yet. As one event destination we currently use https://amplitude.com/. They have something similar available at https://data.amplitude.com/ (screenshots show one event and a property). It's sort of nice and has additional features such as showing in what events the property is used, but it's again proprietary and maybe not appropriate for all the purposes and stakeholders. I'm also not sure if we'll keep using it in the long run.

jeroenvandijk09:12:27

Haven't used it myself, but I remember Confluent's schema registry. Maybe it has some useful insights as well: https://docs.confluent.io/platform/current/schema-registry/index.html#schemas-subjects-and-topics

jumar10:12:49

I found https://atlan.com/p/data-catalogs-are-dead/ which sounds really nice except that it's a commercial product.

aaelony03:12:20

Most databases have a table named information_schema.columns or something similar. Create a new table with your documentation of columns that is joinable with information_schema.columns; that way your documentation is queryable and lives near your data as well.
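A sketch of what that could look like in Athena. The documentation table's layout, names, and S3 location are made up for illustration:

```sql
-- Hypothetical documentation table, maintained e.g. as a CSV file in S3.
CREATE EXTERNAL TABLE column_docs (
    table_name    string,
    column_name   string,
    description   string,
    introduced_on string,   -- ISO date the field first appeared
    deprecated    boolean
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/column-docs/';

-- Queryable documentation: join it onto the engine's own column metadata.
SELECT c.table_name, c.column_name, c.data_type,
       d.description, d.introduced_on, d.deprecated
FROM information_schema.columns c
LEFT JOIN column_docs d
       ON c.table_name  = d.table_name
      AND c.column_name = d.column_name
WHERE c.table_schema = 'default';
```

One nice property of this approach is that "undocumented column" reports fall out of the same join (rows where d.description is null), which helps with the maintainability requirement.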