I am trying to analyze the costs of running some infrastructure (a Panther instance). This is a combination of a fixed license fee, an AWS amount, and a Snowflake amount. I'm trying to predict the cost based on e.g. the number of bytes ingested. 🧵
I have the following tables:
=> aws-costs.nippy [3093 4]:
| :date | :aws-account-id | :codename | :cost |
|------------|-----------------|-----------------|------------:|
| 2024-06-19 | 149026138499 | e40 | 18.91663618 |
| 2024-06-19 | 156035833056 | dougefresh | 12.85524032 |
| 2024-06-19 | 167118260314 | buddhabrand | 24.13205988 |
| 2024-06-19 | 211125573087 | jdilla | 17.24708290 |
| 2024-06-19 | 365922198879 | wutangclan | 20.00561634 |
...
(-> @snowflake-metrics
    (expand-snowflake-warehouse-credit-usage-metrics)
    (tc/rename-columns {:cost :snowflake-cost}))
=>
| :credits-usage | :date | :codename | :snowflake-cost |
|---------------:|------------|------------|----------------:|
| 0.00009250 | 2024-09-17 | vanillaice | 0.00018500 |
| 0.00007083 | 2024-09-16 | vanillaice | 0.00014167 |
| 0.00006889 | 2024-09-09 | vanillaice | 0.00013778 |
| 0.00006750 | 2024-09-08 | vanillaice | 0.00013500 |
| ... | ... | ... | ... |
| 0.18011333 | 2024-07-12 | az | 0.36022666 |
| 0.16381972 | 2024-07-11 | az | 0.32763944 |
| 0.15400111 | 2024-07-06 | az | 0.30800222 |
| 0.15395972 | 2024-07-05 | az | 0.30791944 |
...
(-> @panther-metrics
    (expand-panther-bytes-processed-metrics))
| :date | :codename | :bytes-processed |
|------------|-----------------|-----------------:|
| 2024-07-19 | vanillaice | 26593521516 |
| 2024-06-19 | vanillaice | 22760548397 |
| 2024-08-18 | vanillaice | 184057 |
| 2024-07-19 | xzibit | 54339544 |
| 2024-08-18 | xzibit | 67950856 |
| 2024-07-19 | xzibit | 3240448894 |
| ... | ... | ... |
| 2024-06-19 | biggie | 0 |
| 2024-08-18 | biggie | 3596260 |
| 2024-07-19 | biggie | 0 |
| 2024-06-19 | biggie | 0 |
...
I'm combining the data as follows:
```clojure
(def all-data
  (->
   (tc/bind
    (-> @aws-costs
        (tc/rename-columns {:cost :aws-cost}))
    (-> @snowflake-metrics
        (expand-snowflake-warehouse-credit-usage-metrics)
        (tc/rename-columns {:cost :snowflake-cost}))
    (-> @panther-metrics
        (expand-panther-bytes-processed-metrics)))
   (tc/unique-by [:date :codename] {:strategy (partial reduce +)})))
```
I'd love for someone to check me on this. I think what this does is merge the records with the same date and codename, and when there are matching fields, they get summed. (That's the correct behavior: e.g. there are bytes processed records for different tables but I'm interested in the total number of bytes processed.)
I tried to do this:
```clojure
(def simple-aws-model
  (ml/pipeline
   (mm/set-inference-target :aws-cost)
   (mm/drop-columns [:aws-account-id :snowflake-cost :credits-usage])
   {:metamorph/id :model}
   (mm/model {:model-type :smile.regression/ordinary-least-square})))

(def aws-ctx
  (simple-aws-model
   {:metamorph/data all-data
    :metamorph/mode :fit}))
```
But it didn't work because I'm on an M3 laptop:
2. Unhandled java.lang.NoClassDefFoundError
Could not initialize class org.bytedeco.mkl.global.mkl_rt
1. Caused by java.lang.ExceptionInInitializerError
Exception java.lang.UnsatisfiedLinkError: Platform "macosx-arm64" not
supported by class org.bytedeco.mkl.global.mkl_rt [in thread
"nREPL-session-683557d4-ccf9-4a5f-81da-f16107d6fbc4"]
This seems strange because presumably I'm not the only person using a recent MacBook? Here are my deps. Should I be using git versions?
scicloj/scicloj.ml {:mvn/version "0.3"}
scicloj/tablecloth {:mvn/version "7.029.2"}
I would like to try to do least squares manually since that's not very complex. Can you help me predict :aws-cost from :bytes-processed using tablecloth to create columns like :delta-aws-cost-sq et cetera? I tried writing this:
(let [avgs (->>
            (for [col [:aws-cost :snowflake-cost :bytes-processed]]
              [col (tc/mean all-data [col])])
            (into {}))]
  (-> all-data
      (tc/add-columns
       {:delta-aws-cost #(dfn/- (:aws-cost %) (avgs :aws-cost))})))
The averages work but the added columns don't; I got the following error because I don't really know how to use tablecloth:
1. Unhandled java.lang.ClassCastException
class clojure.lang.MapEntry cannot be cast to class java.lang.Number
(clojure.lang.MapEntry is in unnamed module of loader 'app'; java.lang.Number
is in module java.base of loader 'bootstrap')
Any help would be much appreciated.
Hm, maybe it's because avgs is a dataset and not a number?
Yeah, I have no idea if this is idiomatic but I was able to make some progress by using the columns ns for tablecloth, and:
(def avgs
  (->>
   (for [col [:aws-cost :snowflake-cost :bytes-processed]]
     [col (-> all-data (tc/mean col) (get-in ["summary" 0]))])
   (into {})))
to just get numbers
Hi, interesting, I'll look a little later.
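For the "least squares manually" part, here is a minimal sketch in plain Clojure (no Tablecloth), assuming a single predictor. The names `toy-rows`, `mean`, `simple-ols`, and `predict` are hypothetical, and the numbers are made up so that cost = 2 * bytes + 1. The deltas correspond to the :delta-... columns mentioned above: slope = sum(dx * dy) / sum(dx^2), intercept = mean(y) - slope * mean(x).

```clojure
;; Made-up rows standing in for all-data (cost = 2 * bytes + 1).
(def toy-rows
  [{:bytes-processed 1.0 :aws-cost 3.0}
   {:bytes-processed 2.0 :aws-cost 5.0}
   {:bytes-processed 3.0 :aws-cost 7.0}
   {:bytes-processed 4.0 :aws-cost 9.0}])

(defn mean
  "Arithmetic mean of a non-empty sequence of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn simple-ols
  "Fits y = intercept + slope * x by least squares for one predictor.
  If every x is identical, sum(dx^2) is 0.0 and the slope is not a usable number."
  [rows x-key y-key]
  (let [xs     (map x-key rows)
        ys     (map y-key rows)
        mean-x (mean xs)
        mean-y (mean ys)
        dxs    (map #(- % mean-x) xs)      ; like a :delta-bytes-processed column
        dys    (map #(- % mean-y) ys)      ; like a :delta-aws-cost column
        sxy    (reduce + (map * dxs dys))  ; sum of delta products
        sxx    (reduce + (map * dxs dxs))  ; sum of squared x deltas
        slope  (/ sxy sxx)]
    {:slope     slope
     :intercept (- mean-y (* slope mean-x))}))

(defn predict
  "Predicted y for a given x under a fitted model map."
  [{:keys [slope intercept]} x]
  (+ intercept (* slope x)))

(def toy-model (simple-ols toy-rows :bytes-processed :aws-cost))
;; toy-model => {:slope 2.0, :intercept 1.0}
;; (predict toy-model 10.0) => 21.0
```

With a real dataset, the same function should work on a sequence of row maps, provided rows with missing values in either column are removed first.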
-----------
Regarding the mkl dependency:
Indeed, scicloj.ml has the MKL dependency, as it relies on Smile (https://haifengl.github.io/).
One alternative stack of dependencies, that we are working on these days, is https://scicloj.github.io/noj/. It is alpha stage, but the relevant parts for your analysis are actually stable and tested.
Noj includes a few of the underlying libraries of scicloj.ml, but it does not offer any namespaces of its own, so we use those underlying libraries more directly.
Instead of Smile, you can use the linear regression (ordinary-least-squares) from Fastmath, as demonstrated here:
https://scicloj.github.io/noj/noj_book.interactions_ols.html
Going to be a very unpopular answer :) - but you could just sqlize the tables, join them, calculate the bytes->cost value, and then do a simple forecast on that once you have a hard unified cost calc. But I could be misunderstanding the functional requirements. I put your data examples in small REPLs and joined them as an example in Rabbit. I created a parameter for the cost basis of bytes and added it to a subsequent query, which could then be used to forecast over days - for bytes or dollars (I didn't do this step in the screenshot - but with date+value you could do a linear regression with core.matrix or whatever in a subsequent downstream REPL block - sqlize that and then draw a timeseries or whatever you need to visualize it). Edit: many would argue this is overkill when there is likely a nice and clean tablecloth + scicloj threading solution. But our minds work in different ways; sometimes I want to "see" all the steps, esp. if we are mutating a data pipeline, even if that... you know, requires more "stepping". ;)
Those workflows demonstrated by @ryan.robitaille are always inspiring to me.
Coming back to the Tablecloth processing, it looks right to me, but I'd always recommend verifying that it actually works correctly with a small toy example that is similar to your dataset in terms of the datatypes (e.g., the type of the date column), the presence of missing data, etc.
BTW, instead of (partial reduce +) you may use tcc/sum, where tcc is the usual alias to tablecloth.column.api.
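One way to follow the toy-example advice is to keep a plain-Clojure reference for the intended merge semantics and compare the Tablecloth result against it. This is only a sketch: `toy-rows` and `merge-by-key` are hypothetical names, the byte counts are taken from the xzibit rows above, and the :aws-cost value is made up. It groups rows by [:date :codename] and sums any fields the rows share.

```clojure
;; Toy rows shaped like the bound dataset: each source contributes only
;; some of the fields. The :aws-cost value is made up.
(def toy-rows
  [{:date "2024-07-19" :codename "xzibit" :bytes-processed 54339544}
   {:date "2024-07-19" :codename "xzibit" :bytes-processed 3240448894}
   {:date "2024-07-19" :codename "xzibit" :aws-cost 12.5}
   {:date "2024-08-18" :codename "xzibit" :bytes-processed 67950856}])

(defn merge-by-key
  "Merges rows sharing the same [:date :codename], summing the fields
  that appear in more than one row of a group."
  [rows]
  (->> rows
       (group-by (juxt :date :codename))
       (map (fn [[[date codename] group]]
              (assoc (apply merge-with + (map #(dissoc % :date :codename) group))
                     :date date
                     :codename codename)))))

(def merged (merge-by-key toy-rows))

;; Total bytes for xzibit on 2024-07-19:
(->> merged
     (filter #(= "2024-07-19" (:date %)))
     first
     :bytes-processed)
;; => 3294788438
```

If `tc/unique-by` with the summing strategy gives the same totals on an equivalent toy dataset, including rows where a field is absent, that is decent evidence the real merge does what is intended.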
With Noj rather than scicloj.ml, the pipeline will be:
(mm/pipeline
 (mm/lift dsmod/set-inference-target :aws-cost)
 (mm/lift tc/drop-columns [:aws-account-id :snowflake-cost :credits-usage])
 {:metamorph/id :model}
 (ml/model {:model-type :fastmath/ols}))
Here, we use the aliases:
(require '[tablecloth.api :as tc]
         '[tablecloth.column.api :as tcc]
         '[scicloj.metamorph.core :as mm]
         '[scicloj.metamorph.ml :as ml]
         '[tech.v3.dataset.modelling :as dsmod])
The first two operations do not have to be part of the pipeline, unless you wish to vary them across multiple pipelines you are comparing. So, it is recommended to do them in the preprocessing stage.
If you wish, it'd be nice to create a small toy namespace with fake data. This way we may explore it together here.
The code with the avgs looks nice to me. Here is an alternative way to do it, using the above tcc (https://scicloj.github.io/tablecloth/#column-api) and the fact that a dataset is actually a map:
(def toy-data
  (tc/dataset {:x (range 9)
               :y [1 1 1 1 0 -1 -1 -1 -1]}))
(-> toy-data
    (update-vals tcc/mean))
;; => {:x 4.0, :y 0.0}
Whoa, thanks @daslu! I like the new averages code. However, when I run your code I get:
Failed to find model :fastmath/ols. Is a require missing?
with
org.scicloj/noj {:git/url ""
:git/tag "2-alpha6"
:git/sha "c7a7240"}
(and also with current main)
ah sorry forgot to require (require '[scicloj.metamorph.ml.regression])
cool, that made it work, though:
:cause "Near rank-deficient model matrix"
:data {:data #object["[D" 0x44de22f3 "[D@44de22f3"]}
which is super surprising, but now I have something to investigate
> oh sorry forgot to require (require '[scicloj.metamorph.ml.regression])
Ahh right. Thanks
Regarding the error, let us maybe check when each of the relevant columns is actually zero, and whether we have any rows where a few of them are nonzero. Something seems degenerate. Maybe a toy example will teach us something.
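A quick way to look for degenerate columns might be a per-column count of missing, zero, and distinct values. The sketch below is plain Clojure over row maps; `column-summary` and the toy rows are hypothetical, and with the real dataset the rows could presumably be obtained with something like `(tc/rows all-data :as-maps)`. A predictor that is all zeros, constant, or mostly missing is a typical reason for a rank-deficient model matrix.

```clojure
(defn column-summary
  "For each key in ks, counts missing (nil) values, zero values, and
  distinct non-missing values across rows."
  [rows ks]
  (into {}
        (for [k ks
              :let [vs (map k rows)]]
          [k {:n-missing  (count (filter nil? vs))
              :n-zero     (count (filter #(and (number? %) (zero? %)) vs))
              :n-distinct (count (distinct (remove nil? vs)))}])))

;; Made-up rows: bytes are always zero, and one cost is missing.
(def degenerate-rows
  [{:bytes-processed 0 :aws-cost 18.9}
   {:bytes-processed 0 :aws-cost 12.8}
   {:bytes-processed 0 :aws-cost nil}])

(column-summary degenerate-rows [:bytes-processed :aws-cost])
;; => {:bytes-processed {:n-missing 0, :n-zero 3, :n-distinct 1},
;;     :aws-cost        {:n-missing 1, :n-zero 0, :n-distinct 2}}
```

Here :bytes-processed has a single distinct value, so it carries no information for a regression; a similar pattern in the real data would be worth ruling out first.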
Following the discussion at https://clojurians.slack.com/archives/C0BQDEJ8M/p1726773451739769?thread_ts=1726738000.588989&cid=C0BQDEJ8M -- @ryan.robitaille, it'd be great to discuss how the underlying libraries of https://scicloj.github.io/noj/ (Tablecloth, http://metamorph.ml, Fastmath, Hanamicloth, etc.) can integrate into Rvbbit workflows. If we could just take a namespace such as https://scicloj.github.io/noj/noj_book.automl.html or https://scicloj.github.io/hanamicloth/hanamicloth_book.plotlycloth_walkthrough.html and transform it into one of your dataflows, I believe it would create interesting opportunities. Some of us still like to edit our code as namespaces in the editor, and we are creating lots of tutorials, docs, and analyses these days. Possibly, they can all be converted into inspiring Rvbbit examples.
For sure, let me take a look and try to adapt your examples and see what works. Rabbit supports external editing, but the paradigm is currently more focused on each REPL block being its own CLJ file - but still I think there are some opportunities there. Even if all the code was in the same namespace block, Rabbit also introspects the namespaces - so defs and atoms can be dragged out as their own viewable data structures on the canvas as a child dep - so either way, there could be some interesting interplay. To be continued!
Very nice