data-science

lvh 2024-09-19T09:26:40.588989Z

I am trying to analyze the costs of running some infrastructure (a Panther instance). This is a combination of a fixed license fee, an AWS amount, and a Snowflake amount. I'm trying to predict the cost based on e.g. the number of bytes ingested. 🧵

lvh 2024-09-19T09:27:25.354939Z

I have the following tables:

=> aws-costs.nippy [3093 4]:

|      :date | :aws-account-id |       :codename |       :cost |
|------------|-----------------|-----------------|------------:|
| 2024-06-19 |    149026138499 |             e40 | 18.91663618 |
| 2024-06-19 |    156035833056 |      dougefresh | 12.85524032 |
| 2024-06-19 |    167118260314 |     buddhabrand | 24.13205988 |
| 2024-06-19 |    211125573087 |          jdilla | 17.24708290 |
| 2024-06-19 |    365922198879 |      wutangclan | 20.00561634 |
...
(-> @snowflake-metrics
    (expand-snowflake-warehouse-credit-usage-metrics)
    (tc/rename-columns {:cost :snowflake-cost}))

=>


| :credits-usage |      :date |  :codename | :snowflake-cost |
|---------------:|------------|------------|----------------:|
|     0.00009250 | 2024-09-17 | vanillaice |      0.00018500 |
|     0.00007083 | 2024-09-16 | vanillaice |      0.00014167 |
|     0.00006889 | 2024-09-09 | vanillaice |      0.00013778 |
|     0.00006750 | 2024-09-08 | vanillaice |      0.00013500 |
|            ... |        ... |        ... |             ... |
|     0.18011333 | 2024-07-12 |         az |      0.36022666 |
|     0.16381972 | 2024-07-11 |         az |      0.32763944 |
|     0.15400111 | 2024-07-06 |         az |      0.30800222 |
|     0.15395972 | 2024-07-05 |         az |      0.30791944 |
...
(-> @panther-metrics
        (expand-panther-bytes-processed-metrics))

|      :date |       :codename | :bytes-processed |
|------------|-----------------|-----------------:|
| 2024-07-19 |      vanillaice |      26593521516 |
| 2024-06-19 |      vanillaice |      22760548397 |
| 2024-08-18 |      vanillaice |           184057 |
| 2024-07-19 |          xzibit |         54339544 |
| 2024-08-18 |          xzibit |         67950856 |
| 2024-07-19 |          xzibit |       3240448894 |
|        ... |             ... |              ... |
| 2024-06-19 |          biggie |                0 |
| 2024-08-18 |          biggie |          3596260 |
| 2024-07-19 |          biggie |                0 |
| 2024-06-19 |          biggie |                0 |
...
I'm combining the data as follows:
clojure
(def all-data
  (->
   (tc/bind
    (-> @aws-costs
        (tc/rename-columns {:cost :aws-cost}))
    (-> @snowflake-metrics
        (expand-snowflake-warehouse-credit-usage-metrics)
        (tc/rename-columns {:cost :snowflake-cost}))
    (-> @panther-metrics
        (expand-panther-bytes-processed-metrics)))
   (tc/unique-by [:date :codename] {:strategy (partial reduce +)})))
I'd love for someone to check me on this. I think what this does is merge the records with the same date and codename, and when there are matching fields, they get summed. (That's the correct behavior: e.g. there are bytes processed records for different tables but I'm interested in the total number of bytes processed.) I tried to do this:
clojure
  (def simple-aws-model
    (ml/pipeline
     (mm/set-inference-target :aws-cost)
     (mm/drop-columns [:aws-account-id :snowflake-cost :credits-usage])
     {:metamorph/id :model}
     (mm/model {:model-type :smile.regression/ordinary-least-square})))

  (def aws-ctx
    (simple-aws-model
     {:metamorph/data all-data
      :metamorph/mode :fit}))
But it didn't work because I'm on an M3 laptop:
2. Unhandled java.lang.NoClassDefFoundError
   Could not initialize class org.bytedeco.mkl.global.mkl_rt

1. Caused by java.lang.ExceptionInInitializerError
   Exception java.lang.UnsatisfiedLinkError: Platform "macosx-arm64" not
   supported by class org.bytedeco.mkl.global.mkl_rt [in thread
   "nREPL-session-683557d4-ccf9-4a5f-81da-f16107d6fbc4"]
This seems strange because presumably I'm not the only person using a recent MacBook? Here are my deps; should I be using git versions?
scicloj/scicloj.ml {:mvn/version "0.3"}
scicloj/tablecloth {:mvn/version "7.029.2"}

lvh 2024-09-19T09:27:35.582369Z

I would like to try to do least squares manually since that's not very complex. Can you help me predict :aws-cost from :bytes-processed using tablecloth to create columns like :delta-aws-cost-sq et cetera? I tried writing this:

(let [avgs (->>
              (for [col [:aws-cost :snowflake-cost :bytes-processed]]
                [col (tc/mean all-data [col])])
              (into {}))]
    (-> all-data
        (tc/add-columns
         {:delta-aws-cost #(dfn/- (:aws-cost %) (avgs :aws-cost))})))
The averages work but the added columns don't; I got the following error because I don't really know how to use tablecloth:
1. Unhandled java.lang.ClassCastException
   class clojure.lang.MapEntry cannot be cast to class java.lang.Number
   (clojure.lang.MapEntry is in unnamed module of loader 'app'; java.lang.Number
   is in module java.base of loader 'bootstrap')
Any help would be much appreciated.
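
For reference, the closed-form least-squares fit for a single predictor can be written directly in terms of the deviation ("delta") columns mentioned above. This is the standard textbook result rather than anything specific to tablecloth; the symbols are chosen here for illustration, with $x$ standing for :bytes-processed and $y$ for :aws-cost.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                     {\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i
```

Note that the denominator uses the squared deltas of the predictor (:bytes-processed). A squared-delta column for :aws-cost is only needed for goodness-of-fit measures such as $R^2 = \left(\sum (x_i - \bar{x})(y_i - \bar{y})\right)^2 / \left(\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2\right)$.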

lvh 2024-09-19T09:35:04.771599Z

Hm, maybe it's because avgs is a dataset and not a number?

lvh 2024-09-19T09:58:57.606739Z

Yeah, I have no idea if this is idiomatic but I was able to make some progress by using the columns ns for tablecloth, and:

lvh 2024-09-19T09:59:04.288439Z

(def avgs
    (->>
     (for [col [:aws-cost :snowflake-cost :bytes-processed]]
       [col (-> all-data (tc/mean col) (get-in ["summary" 0]))])
     (into {})))
to just get numbers

Daniel Slutsky 2024-09-19T10:18:37.664029Z

Hi, interesting, I'll look a little later.

Regarding the mkl dependency: indeed, scicloj.ml has the mkl dependency, as it relies on Smile (https://haifengl.github.io/). One alternative stack of dependencies that we are working on these days is Noj (https://scicloj.github.io/noj/). It is at alpha stage, but the relevant parts for your analysis are actually stable and tested. Noj includes a few of the underlying libraries of scicloj.ml, but it does not offer any namespaces of its own, so we use the underlying libraries more directly. Instead of Smile, you can use the linear regression (ordinary least squares) from Fastmath, as demonstrated here: https://scicloj.github.io/noj/noj_book.interactions_ols.html

ryrobes 2024-09-19T19:17:31.739769Z

going to be a very unpopular answer :) - but you could just sqlize the tables, join them, calculate the bytes->cost value, and then do a simple forecast on that once you have a hard unified cost calc. But I could be misunderstanding the func reqs.

I put your data examples in small REPLs and joined them as an example in Rabbit. I created a parameter for the cost basis of bytes and added it to a subsequent query, which could then be used to forecast over days, for bytes or dollars (didn't do this step in the screenshot - but with date+value you could do a linear regression with core.matrix or whatever in a subsequent downstream REPL block, sqlize that, and then draw a timeseries or whatever you need to visualize it).

Edit: many would argue this is overkill when there is likely a nice and clean tablecloth + scicloj threading solution. But our minds work in different ways; I sometimes want to "see" all the steps, esp if we are mutating a data pipeline, even if that... you know, requires more "stepping". ;)

Daniel Slutsky 2024-09-19T21:41:10.940199Z

Those workflows demonstrated by @ryan.robitaille are always inspiring to me.

Daniel Slutsky 2024-09-19T21:44:05.293509Z

Coming back to the Tablecloth processing, it looks right to me, but I'd always recommend verifying that it actually works correctly with a small toy example that is similar to your dataset in terms of the datatypes (e.g., the type of the date column), the presence of missing data, etc. BTW, instead of (partial reduce +) you may use tcc/sum, where tcc is the usual alias to tablecloth.column.api.
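
One way to set up such a toy check is to write down the intended semantics in plain Clojure first, and then compare the tablecloth result against it. The sketch below is not the tablecloth implementation; it only encodes the behaviour described earlier in the thread (rows with the same :date and :codename are merged, and matching fields are summed). The rows and the helper name `merge-by-key` are made up for illustration.

```clojure
;; Toy rows mimicking the three stacked tables: each row carries only the
;; columns of the table it came from. All values here are invented.
(def toy-rows
  [{:date "2024-06-19" :codename "e40" :aws-cost 10.0}
   {:date "2024-06-19" :codename "e40" :snowflake-cost 2.0}
   {:date "2024-06-19" :codename "e40" :bytes-processed 100}
   {:date "2024-06-19" :codename "e40" :bytes-processed 50}
   {:date "2024-06-20" :codename "e40" :aws-cost 12.0}])

(defn merge-by-key
  "Groups rows by [:date :codename] and sums every other field that is present."
  [rows]
  (for [[[date codename] group] (group-by (juxt :date :codename) rows)]
    (->> group
         (map #(dissoc % :date :codename))   ; keep only the value fields
         (apply merge-with +)                ; sum fields that occur more than once
         (merge {:date date :codename codename}))))

(def merged (merge-by-key toy-rows))
```

Running the `tc/bind` + `tc/unique-by` expression on the same toy rows and comparing it with `merged` should show whether the two agree, in particular for columns that are missing in some of the stacked rows.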

Daniel Slutsky 2024-09-19T21:52:07.942959Z

With Noj rather than scicloj.ml, the pipeline will be:

(mm/pipeline
   (mm/lift dsmod/set-inference-target :aws-cost)
   (mm/lift tc/drop-columns [:aws-account-id :snowflake-cost :credits-usage])
   {:metamorph/id :model}
   (ml/model {:model-type :fastmath/ols}))
Here, we use the aliases:
(require '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc]
         '[scicloj.metamorph.core :as mm]
         '[scicloj.metamorph.ml :as ml]
         '[tech.v3.dataset.modelling :as dsmod])
The first two operations do not have to be part of the pipeline, unless you wish to vary them across multiple pipelines you are comparing. So, it is recommended to do them in the preprocessing stage.
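
A rough sketch of what moving those two steps into preprocessing might look like, assuming Noj (or its underlying libraries) is on the classpath. The toy dataset and the names `prepared`, `ols-pipeline` and `fitted-ctx` are made up for illustration, and the extra require of `scicloj.metamorph.ml.regression` follows what is reported later in this thread.

```clojure
(require '[tablecloth.api :as tc]
         '[scicloj.metamorph.core :as mm]
         '[scicloj.metamorph.ml :as ml]
         '[scicloj.metamorph.ml.regression] ; registers :fastmath/ols
         '[tech.v3.dataset.modelling :as dsmod])

;; Invented toy data standing in for all-data.
(def toy-costs
  (tc/dataset {:bytes-processed [1.0 2.0 3.0 4.0 5.0 6.0]
               :aws-cost        [3.1 4.9 7.2 8.8 11.1 12.9]
               :snowflake-cost  [0.1 0.2 0.1 0.3 0.2 0.1]}))

;; Preprocessing happens once, outside the pipeline.
(def prepared
  (-> toy-costs
      (tc/drop-columns [:snowflake-cost])
      (dsmod/set-inference-target :aws-cost)))

;; The pipeline now only contains the model step.
(def ols-pipeline
  (mm/pipeline
   {:metamorph/id :model}
   (ml/model {:model-type :fastmath/ols})))

;; fit-pipe runs the pipeline in :fit mode and returns the context map;
;; the trained model is stored under the :model id.
(def fitted-ctx (mm/fit-pipe prepared ols-pipeline))
```

Keeping the column selection outside the pipeline means it runs once, while the pipeline itself stays small enough to swap model types when comparing alternatives.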

Daniel Slutsky 2024-09-19T21:53:51.638619Z

If you wish, it'd be nice to create a small toy namespace with fake data. This way we may explore it together here.

Daniel Slutsky 2024-09-19T21:59:18.874039Z

The code with the avgs looks nice to me. Here is an alternative way to do it, using the above tcc (https://scicloj.github.io/tablecloth/#column-api) and the fact that a dataset is actually a map:

(def toy-data
  (tc/dataset {:x (range 9)
               :y [1 1 1 1 0 -1 -1 -1 -1]}))

(-> toy-data
    (update-vals tcc/mean))

;; => {:x 4.0, :y 0.0}
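
Building on that, here is a minimal sketch of how the delta columns asked about earlier in the thread might be added with the column API, and how slope and intercept of a one-predictor least-squares fit follow from them. It assumes `tcc/-`, `tcc/*` and `tcc/sum` from `tablecloth.column.api`; the toy data and the column names :delta-bytes, :delta-bytes-sq and :delta-cross are invented for illustration.

```clojure
(require '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc])

;; Made-up toy data: cost is exactly 2 * bytes + 1.
(def cost-toy
  (tc/dataset {:bytes-processed [1.0 2.0 3.0 4.0]
               :aws-cost        [3.0 5.0 7.0 9.0]}))

;; Column means as plain numbers, using the update-vals trick.
(def avgs (update-vals cost-toy tcc/mean))

;; Each added column is a function of the dataset built so far.
(def with-deltas
  (-> cost-toy
      (tc/add-column :delta-bytes
                     #(tcc/- (:bytes-processed %) (avgs :bytes-processed)))
      (tc/add-column :delta-aws-cost
                     #(tcc/- (:aws-cost %) (avgs :aws-cost)))
      (tc/add-column :delta-bytes-sq
                     #(tcc/* (:delta-bytes %) (:delta-bytes %)))
      (tc/add-column :delta-cross
                     #(tcc/* (:delta-bytes %) (:delta-aws-cost %)))))

;; slope = sum(dx * dy) / sum(dx^2); intercept = mean(y) - slope * mean(x)
(def slope
  (/ (tcc/sum (:delta-cross with-deltas))
     (tcc/sum (:delta-bytes-sq with-deltas))))

(def intercept
  (- (avgs :aws-cost) (* slope (avgs :bytes-processed))))
```

Because `avgs` holds plain numbers rather than datasets, the subtraction broadcasts a scalar over the column, which sidesteps the MapEntry cast error from the earlier attempt.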

lvh 2024-09-20T14:13:56.075779Z

Whoa, thanks @daslu! I like the new averages code. However, when I run your code I get:

Failed to find model :fastmath/ols. Is a require missing?
with
org.scicloj/noj {:git/url ""
                   :git/tag "2-alpha6"
                   :git/sha "c7a7240"}
(and also with current main)

lvh 2024-09-20T14:15:57.162709Z

ah sorry forgot to require (require '[scicloj.metamorph.ml.regression])

lvh 2024-09-20T14:26:18.094769Z

cool, that made it work, though:

 :cause "Near rank-deficient model matrix"
 :data {:data #object["[D" 0x44de22f3 "[D@44de22f3"]}
 
which is super surprising, but now I have something to investigate 🙂

Daniel Slutsky 2024-09-20T14:45:24.912909Z

> oh sorry forgot to require (require '[scicloj.metamorph.ml.regression])

Ahh right. Thanks

Daniel Slutsky 2024-09-20T14:48:09.903859Z

Regarding the error, let us maybe check when each of the relevant columns is actually zero, and whether we have any rows where a few of them are nonzero. Something seems degenerate. Maybe a toy example of toy data will teach us something.
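
A small plain-Clojure sketch of that kind of check: it counts how often each combination of missing / zero / nonzero values occurs across the columns of interest. The rows are invented; with the real data, row maps could presumably be obtained with something like `(tc/rows all-data :as-maps)`. Treating both nil and NaN as "missing" is an assumption about how missing values might show up, and the helper names are hypothetical.

```clojure
;; Invented rows standing in for the merged dataset.
(def check-rows
  [{:aws-cost 10.0 :snowflake-cost 0.0 :bytes-processed 0}
   {:aws-cost 12.0 :snowflake-cost 0.3 :bytes-processed 100}
   {:aws-cost nil  :snowflake-cost 0.2 :bytes-processed 0}
   {:aws-cost 11.0 :snowflake-cost 0.0 :bytes-processed 0}])

(defn value-kind
  "Classifies a value as :missing (nil or NaN), :zero, or :nonzero."
  [v]
  (cond
    (or (nil? v) (and (number? v) (Double/isNaN (double v)))) :missing
    (zero? v) :zero
    :else :nonzero))

(defn kind-patterns
  "Counts how often each missing/zero/nonzero combination occurs across cols."
  [rows cols]
  (frequencies
   (for [row rows]
     (into {} (for [c cols] [c (value-kind (get row c))])))))

(def patterns
  (kind-patterns check-rows [:aws-cost :bytes-processed]))
```

If almost every row ends up in a pattern where the predictor is :zero or the target is :missing, a single-predictor fit has very little variation to work with, which would be one place to start looking.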

Daniel Slutsky 2024-09-19T22:17:09.952069Z

Following the discussion at https://clojurians.slack.com/archives/C0BQDEJ8M/p1726773451739769?thread_ts=1726738000.588989&cid=C0BQDEJ8M: @ryan.robitaille, it'd be great to discuss how the underlying libraries of https://scicloj.github.io/noj/ (Tablecloth, metamorph.ml, Fastmath, Hanamicloth, etc.) can integrate into Rvbbit workflows. If we could just take a namespace such as https://scicloj.github.io/noj/noj_book.automl.html or this https://scicloj.github.io/hanamicloth/hanamicloth_book.plotlycloth_walkthrough.html and transform it into one of your dataflows, I believe it would create interesting opportunities. Some of us still like to edit our code as namespaces in the editor, and we are creating lots of tutorials, docs, and analyses these days. Possibly, they can all be converted into inspiring Rvbbit examples.

ryrobes 2024-09-19T22:30:27.134249Z

For sure, let me take a look and try to adapt your examples and see what works. Rabbit supports external editing, but the paradigm is currently more focused on each REPL block being its own CLJ file - but still I think there are some opportunities there. Even if all the code was in the same namespace block, Rabbit also introspects the namespaces - so defs and atoms can be dragged out as their own viewable data structures on the canvas as a child dep - so either way, there could be some interesting interplay. To be continued! 🙂

Daniel Slutsky 2024-09-19T22:52:03.543949Z

Very nice 🙏