pathom 2024-04-29 | Slack Archive

kpassapk19:04:09

I'm playing around with Pathom v3. I would like to combine it with tablecloth (https://github.com/scicloj/tablecloth), so that resolvers return columns (https://scicloj.github.io/tablecloth/#column-api). I would like users to be able to specify a list of columns in a final dataset, and work backwards to data sources (could be CSV files or API calls) that return a column in a raw format. Perhaps apply cleanup as part of the process, so that, for example, a phone number is translated to standard MSISDN format. I have a pretty basic question: in this scenario, most of my data sources will provide the same map key. For example, I may have two source CSV files that both have a phone column. In my quick tests, if I have two resolvers providing the same key, Pathom silently ignores one of them. Is there a way to get Pathom to fail instead, telling me "you have two resolvers providing :phone, I'm not sure which one I should choose"?

wilkerlucio19:04:27

hello, in Pathom the attributes are the main building block of the data model, so when you provide the same attribute via different resolvers it considers those alternative paths for the same thing. the simplest recommendation here is to give a different namespace for each of your data sets, so you can avoid ambiguity on their names. this is also important to be able to later extend the API and define computed properties on these specific data

👍 1
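
A minimal sketch of the namespacing suggestion above, assuming two hypothetical sources named users1 and users2. The attribute names (:users1/raw-phone, :users2/telefono), the digits-only helper, and the sample values are made up for illustration and do not come from the linked repository. Because each source has its own namespace, both cleaned phone attributes can be requested in one query with no ambiguity about which resolver applies.

```clojure
(ns example.namespaced-sources
  (:require [clojure.string :as str]
            [com.wsscode.pathom3.connect.indexes :as pci]
            [com.wsscode.pathom3.connect.operation :as pco]
            [com.wsscode.pathom3.interface.eql :as p.eql]))

;; Hypothetical cleanup helper: strips every non-digit character.
;; Real MSISDN normalization would need country-code handling as well.
(defn digits-only [s]
  (str/replace s #"\D" ""))

;; Input is inferred from the destructuring, output from the returned map.
(pco/defresolver users1-phone [{:users1/keys [raw-phone]}]
  {:users1/phone (digits-only raw-phone)})

;; Same idea for the second source, under its own namespace.
(pco/defresolver users2-phone [{:users2/keys [telefono]}]
  {:users2/phone (digits-only telefono)})

(def env
  (pci/register [users1-phone users2-phone]))

(def result
  (p.eql/process env
                 {:users1/raw-phone "+1 (555) 010-0000"
                  :users2/telefono  "555 010 0001"}
                 [:users1/phone :users2/phone]))
;; result => {:users1/phone "15550100000", :users2/phone "5550100001"}
```

A later resolver can then take :users1/phone and :users2/phone as inputs and decide explicitly how to merge them into a final column.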

kpassapk20:04:06

Here is the code: https://github.com/kpassapk/pathom3-toys/blob/main/src/pathom3/datasets/toy.clj. Data cleaning code is often a real pain to write. I'm interested in Pathom because resolvers could potentially be written by AI, as they have a simple structure. Ideally I could provide a dataset, then extract a few fields and have a GPT write resolvers that would give me a cleaned-up version of that dataset with known column names (e.g. "phone" instead of "telefono").

Following the idea of "every dataset is a namespace", each source dataset and its recipe for cleaning lives in a separate namespace, and the final step is to join them together (also a resolver). I'm left with three indices, and a resolver that joins up columns.

I ran into something with EQL. In https://github.com/kpassapk/pathom3-toys/blob/main/src/pathom3/datasets/users1.clj#L63, this didn't work:

(p.eql/process index {::raw dataset} [::clean-name])

but the closest equivalent smart map did

(-> (psm/smart-map index {::raw dataset})
    (psm/sm-touch! [::clean-name]))

The EQL processor fails with

class tech.v3.dataset.impl.column.Column cannot be cast to class clojure.lang.IPersistentVector

The smart map produces the results I want, so I went with that.
