#data-science
2023-10-21
vonadz14:10:37

Hey, had some general data engineering questions, would love to hear any opinions from people who've set up / manage data pipelines. My company does a lot of work where we get data from a number of different APIs, CSVs, Excel docs, and web scraping. A lot of the data is incomplete or inconsistently formatted (like a column that's mostly numeric, but then might have periods for null values or something), so it usually requires a decent amount of wrangling. After we've wrangled and transformed it into something useful, we enrich it manually or through other sources. We then publish the enriched data and write content for it on our websites. Most of the data sources are updated on either a monthly or yearly basis and so need to have tasks scheduled to update them. So far the way I've handled it is by basically writing scripts (I used JavaScript/TypeScript, but I don't mind using Python or Clojure or anything else) for each of these data sources and executing them manually whenever an update is necessary. The biggest issue with this is that it's a documentation nightmare. I'd love to have it be fairly easy to see where data has come from and where it's used on our websites, so other people can easily take over the process. Generally it's not a huge amount of data (sub 100GB per "site", stored in 1 Postgres DB). As of right now, I've looked at setting up Airflow (https://github.com/apache/airflow) to orchestrate things, and dbt to handle transformations (https://github.com/dbt-labs/dbt-core). It seems like I'm still missing a good solution for the Extraction part of ETL. Anyone have any recommendations for tools or processes or anything? I'd personally love to build something using Clojure, but I think this is definitely a solved problem and there are open-source tools out there that far exceed whatever I'd be able to put together by myself.
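
For illustration, the "mostly numeric column with periods for null values" case might be handled along these lines. This is a minimal Python sketch, not taken from the poster's scripts; the set of null markers and the names `parse_numeric` and `clean_column` are assumptions made up for this example.

```python
from typing import List, Optional

# Placeholder strings treated as missing values.
# This set is an assumption; each data source will need its own list.
NULL_MARKERS = {"", ".", "..", "-", "n/a", "na", "null"}


def parse_numeric(raw: Optional[str]) -> Optional[float]:
    """Convert one raw cell to a float, or None for a null placeholder.

    Raises ValueError for anything else that is not a number, so that
    unexpected values surface instead of being silently dropped.
    """
    if raw is None:
        return None
    text = raw.strip()
    if text.lower() in NULL_MARKERS:
        return None
    # Remove thousands separators such as "1,234.5" before parsing.
    return float(text.replace(",", ""))


def clean_column(cells: List[Optional[str]]) -> List[Optional[float]]:
    """Apply parse_numeric to every cell of a column."""
    return [parse_numeric(cell) for cell in cells]
```

Keeping the null markers in one named set per source also documents, in code, what each source uses for missing data.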

bocaj16:10:52

It sounds like you might like a visual, like a graph or boxes and lines to document the flow of data?

vonadz17:10:10

Priority is definitely functionality that would make it easier for a developer to take over and manage (like auto-generated docs, visuals, etc). My end goal would be that non-technical people could also use the system to be able to use the data in the content they write, but that's v2.

bocaj18:10:45

In Clojure, I like the techascent stack with tablecloth. This takes care of extract, and is succinct. Plus you can do any work needed in Clojure. I usually skip loading to a typed model or database and query source files stored in an object store directly using tablecloth. This is efficient for millions of rows. For documentation I would look at notebook style (the clay library, https://github.com/scicloj). You could leverage malli as well, for providing documented APIs. I also like Metabase but haven’t used it for a while: https://github.com/metabase/metabase

vonadz10:10:54

Thanks @U068BQFJ9! I'll look into using these. I like the idea of using notebooks for documenting things.

Chris Herron13:10:56

@U03QQS7341W At Crossbeam, we have been moving our home-rolled Data Pipeline to https://temporal.io/ (which has a nice Clojure SDK btw). I'm working towards having ELT capabilities within our product stack. We are already using dbt for our analytics stack. For extract, we have a variety of things in use - https://www.stitchdata.com and https://www.singer.io/ Taps, plus various home-rolled things. We are looking at https://airbyte.com/ as a candidate for a more generalized approach to managing data sources. Since you are interested in Airflow, that's probably already on your radar, but wanted to flag it in case you weren't aware.

vonadz16:10:12

@U05EF0987B7 so are you setting up the ETL in Temporal as well? What does that look like? Does it play nice with dbt? That's another library I'm thinking of using for some transformations. Am I correct in saying that Temporal is an alternative to Airflow? Yeah I'm aware of Airbyte, even set up a local instance to mess around with it. Looks pretty cool.

Chris Herron16:10:44

@U03QQS7341W Haven't gotten to that stage yet but that's the hope. I imagine triggering dbt as a concluding step in a data processing workflow.

Chris Herron16:10:26

We looked at Airflow, Airbyte, Overseer and Temporal. Not quite 1:1 equivalents but it was a learning process. We settled on Temporal as a nice generic 'distributed cron' with DAG workflows. The Clojure-friendliness was a bonus.
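
The "DAG workflow" idea itself can be sketched without any orchestrator: tasks declare what they depend on, and a topological sort gives a valid run order. The Python below uses only the standard library (`graphlib`, Python 3.9+). It is not Temporal's or Airflow's API; `run_pipeline` and the task names are hypothetical, and a real orchestrator adds scheduling, retries, and distribution on top of this.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    dependencies: Dict[str, Set[str]],
) -> List[str]:
    """Run every task after the tasks it depends on.

    tasks maps a task name to a zero-argument callable.
    dependencies maps a task name to the names that must run first.
    Returns the order in which the tasks were executed.
    """
    # Tasks without an entry in `dependencies` have no prerequisites.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    executed = []
    # static_order() yields each node only after all of its predecessors
    # and raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(graph).static_order():
        tasks[name]()
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Hypothetical three-step pipeline; each step only prints its name.
    steps = {
        "extract": lambda: print("extract"),
        "transform": lambda: print("transform"),
        "publish": lambda: print("publish"),
    }
    deps = {"transform": {"extract"}, "publish": {"transform"}}
    run_pipeline(steps, deps)
```

Because the dependency map is plain data, it can also be rendered as a diagram, which is one way to get the "where did this data come from" documentation asked about earlier in the thread.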

Chris Herron16:10:47

Slightly circular but this is interesting: Airbyte uses Temporal and dbt: https://airbyte.com/blog/scale-workflow-orchestration-with-temporal

😆 1
vonadz13:10:16

@U05EF0987B7 thanks for all the info! I'll be spending today comparing Temporal and Airflow. It's nice that Temporal has a Clojure SDK.

genmeblog19:10:57

Integration and derivatives: today I added the fastmath.calculus namespace to the latest SNAPSHOT of fastmath (`2.2.2-SNAPSHOT`), which includes various integration algorithms from Apache Commons Math for R->R functions. I also implemented three new algorithms, two for multivariate functions: VEGAS+ and h-Cubature (Genz-Malik), and one for 1d: Gauss-Kronrod (the base algorithm in R, Julia, and Python (SciPy)) in an h-adaptive version. For derivatives, the finite differences method is used. There are three functions: derivative, gradient and hessian. Links to the algorithms are in the thread. https://generateme.github.io/fastmath/fastmath.calculus.html
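
As a rough illustration of the finite differences idea, here is a minimal Python sketch using second-order central differences. It is not fastmath's implementation or API; the function names merely mirror the ones mentioned above, and the step size `h` is an assumed default.

```python
from typing import Callable, List, Sequence


def derivative(f: Callable[[float], float], x: float, h: float = 1e-5) -> float:
    """Central difference approximation of f'(x), with error O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def gradient(
    f: Callable[[Sequence[float]], float],
    point: Sequence[float],
    h: float = 1e-5,
) -> List[float]:
    """Central difference approximation of the gradient of f at point."""
    grad = []
    for i in range(len(point)):
        # Perturb only coordinate i, leaving the others fixed.
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad
```

A hessian can be built the same way by differencing the gradient, at the cost of more function evaluations and more sensitivity to the choice of `h`.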

🎉 5
Sam Ritchie21:10:17

The numerical differentiation follows a Sussman paper and uses Richardson extrapolation to get the numerical derivatives to converge very quickly. It’s a nice read if you’re interested in this sort of algorithm https://github.com/mentat-collective/emmy/blob/v0.31.0/src/emmy/numerical/derivative.cljc
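
For readers who have not met Richardson extrapolation, the basic scheme can be sketched in Python as below. This is a textbook-style table with step halving, not the emmy implementation linked above; `richardson_derivative` and its defaults are assumptions for this example.

```python
import math
from typing import Callable


def richardson_derivative(
    f: Callable[[float], float],
    x: float,
    h: float = 0.1,
    levels: int = 4,
) -> float:
    """Estimate f'(x) by Richardson extrapolation of central differences.

    The central difference has an error series in even powers of the step
    (h^2, h^4, ...). Combining estimates at step sizes h, h/2, h/4, ...
    cancels one more term of that series in each column of the table.
    """
    # Column 0: plain central differences with successively halved steps.
    table = []
    for i in range(levels):
        step = h / 2 ** i
        table.append([(f(x + step) - f(x - step)) / (2.0 * step)])
    # Column k removes the step^(2k) error term from column k - 1.
    for i in range(1, levels):
        for k in range(1, i + 1):
            factor = 4.0 ** k
            better = (factor * table[i][k - 1] - table[i - 1][k - 1]) / (factor - 1.0)
            table[i].append(better)
    return table[-1][-1]


if __name__ == "__main__":
    # Compare against the exact derivative of sin at x = 1.
    print(richardson_derivative(math.sin, 1.0), math.cos(1.0))
```

The relatively large starting step is deliberate: extrapolation does the work of shrinking the truncation error, so the step never has to get small enough for floating-point cancellation to dominate.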

genmeblog22:10:06

Great to know! I'll take a look at this soon. There is probably some duplication between emmy and fastmath.

genmeblog12:10:55

I really like how you write the code! I definitely need to change my habits...

genmeblog18:10:35

btw. regarding Richardson, I've implemented several orders of accuracy, which is an explicit unrolled version of the Richardson method (though you can't say what error you will get for arbitrary accuracy and x)
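
One way to read "explicit unrolled version" is that a single Richardson step applied to the central difference collapses into a fixed higher-order stencil. The worked equation below shows this for one step; whether fastmath uses exactly these formulas is an assumption, not something stated in the thread.

```latex
% Second-order central difference and its leading error term
D(h) = \frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{h^2}{6} f'''(x) + O(h^4)

% One Richardson step cancels the h^2 term and yields the five-point stencil
\frac{4\,D(h) - D(2h)}{3}
  = \frac{-f(x+2h) + 8 f(x+h) - 8 f(x-h) + f(x-2h)}{12h}
  = f'(x) + O(h^4)
```

Each fixed stencil has a known order of accuracy, but the actual error still depends on the higher derivatives of f at x, which is consistent with the caveat about not knowing the error in advance.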

👍 1