2023-11-03
What methods / systems do people use to log steps / transformations in data pipelines? I'm currently pulling spreadsheets from online sources, parsing them using tech.v3.dataset, transforming, and loading into a DB. I'm handling this using scripts that are executed on a monthly basis. I've mostly just been using println statements for different stages, but would love to hear about any more robust solutions people have implemented.
https://github.com/BrunoBonacci/mulog is excellent for this; the ability to attach arbitrary structured data to any logging operation makes logging metrics directly from the call site super simple.
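For instance, with mulog a structured event is a single form — key/value pairs go straight onto the event (the event name and keys below are made up for illustration):

```clojure
;; deps: {com.brunobonacci/mulog {:mvn/version "0.9.0"}}
(require '[com.brunobonacci.mulog :as u])

;; log a metric with arbitrary structured data at the call site
(u/log ::rows-loaded :table "prices" :row-count 1234 :elapsed-ms 87)
```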
I use `u/trace` for every important operation — the `:capture` option makes it easy to (asynchronously) grab metrics from the output of that operation. I've gotten pretty far just dumping the EDN logs to local disk and reading them back in (often with `tablecloth`) for any analysis I might need to do. Having a ton of empirical data on performance really helps when thinking about a potential change to a data pipeline.
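A minimal sketch of that pattern — `u/trace` wrapping a pipeline step, with `:capture` recording metrics from the step's result, and the `:simple-file` publisher writing EDN events to local disk. The step function `parse-spreadsheet`, the event names, and the file path are hypothetical:

```clojure
;; deps: {com.brunobonacci/mulog   {:mvn/version "0.9.0"}
;;        techascent/tech.ml.dataset {:mvn/version "7.021"}}
(require '[com.brunobonacci.mulog :as u]
         '[tech.v3.dataset :as ds])

;; write each event as an EDN line to local disk
(def stop-publisher!
  (u/start-publisher! {:type :simple-file
                       :filename "/tmp/pipeline/events.log"}))

(defn parse-step [file]
  (u/trace ::parse-spreadsheet
    {:pairs   [:file file]
     ;; :capture is called with the traced expression's return value;
     ;; the map it returns is merged into the trace event
     :capture (fn [dataset] {:row-count (ds/row-count dataset)})}
    (parse-spreadsheet file)))  ; parse-spreadsheet is hypothetical
```

Each trace event also carries timing, so the dumped logs double as a performance record for every run.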
If you already have a logging stack set up you can also convert to JSON and easily dump the same data to your existing monitoring system.
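One way to do that conversion, as a sketch: read the EDN lines that mulog's `:simple-file` publisher writes (one event map per line) and re-emit them as JSON lines for an existing log shipper. Assumes `org.clojure/data.json` is on the classpath; the file path is hypothetical, and mulog's flake ids are read here as plain strings via a tagged-literal reader:

```clojure
;; deps: {org.clojure/data.json {:mvn/version "2.5.0"}}
(require '[clojure.edn :as edn]
         '[clojure.data.json :as json]
         '[clojure.java.io :as io])

(with-open [r (io/reader "/tmp/pipeline/events.log")]
  (doseq [line (line-seq r)]
    ;; mulog ids appear as #mulog/flake tagged literals;
    ;; read them as strings so plain EDN parsing succeeds
    (-> (edn/read-string {:readers {'mulog/flake str}} line)
        json/write-str
        println)))
```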