Fork me on GitHub
#data-science
<
2022-05-19
>
maxt19:05:27

tablecloth (and tech.ml.dataset) has a surprising behavior (to me) on join. Why would I want the right.id? Isn't it guaranteed to be equal to id?

(def ds1 (tc/dataset {:id [1 2] :name [:foo :bar]}))
  (def ds2 (tc/dataset {:id [2 3] :value [100 200]}))
  (tc/full-join ds1 ds2 :id)
full-join [3 4]:

| :id | :name | :right.id | :value |
|----:|-------|----------:|-------:|
|   2 |  :bar |         2 |    100 |
|   1 |  :foo |           |        |
|     |       |         3 |    200 |

genmeblog19:05:11

full-join is left-join and right-join combined together.

genmeblog19:05:52

There is no id=1 in the ds2 and no id=3 in the ds1. This produces missing values for each case.

genmeblog19:05:42

I believe it works exactly the same as in SQL.

maxt20:05:32

Right, yes, it does make sense in full-join. I originally found the problem with left-join, where it's not as clear

left-outer-join [2 4]:

| :id | :name | :right.id | :value |
|----:|-------|----------:|-------:|
|   2 |  :bar |         2 |    100 |
|   1 |  :foo |           |        |

maxt20:05:01

But yeah even there I guess that keeping it allow me to see which values where missing

maxt20:05:04

Inner-join does not keep it.

inner-join [1 3]:

| :id | :name | :value |
|----:|-------|-------:|
|   2 |  :bar |    100 |

genmeblog11:05:44

yes, this works (again) the same as in SQL. inner-join is intersection of two sets, outer (`left-join` and right-join) keeps all rows from left/right ds and appends corresponding rows from the other ds (producing missing values if there is no such).

☝️ 1