#data-science
2022-05-20
Nom Nom Mousse 09:05:45

Can you show me an example of subsetting with ds/tc?

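;; setup (assuming the usual aliases: ds = tech.v3.dataset, tc = tablecloth.api)
(require '[tech.v3.dataset :as ds]
         '[tablecloth.api :as tc])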
(def d '({:sample "H3K27me3_IMR90_Cell_Line_0",
  :file "EpigenomeAtlas/Current-Release/experiment-sample/Histone_H3K27me3/IMR90_Cell_Line/UCSD.IMR90.H3K27me3.LL223.bed.gz"}
 {:sample "H3K27me3_IMR90_Cell_Line_1",
  :file "EpigenomeAtlas/Current-Release/experiment-sample/Histone_H3K27me3/IMR90_Cell_Line/UCSD.IMR90.H3K27me3.SK05.bed.gz"}
 {:sample "H3K27me3_ES-WA7_Cell_Line_0",
  :file "EpigenomeAtlas/Current-Release/experiment-sample/Histone_H3K27me3/ES-WA7_Cell_Line/BI.ES-WA7.H3K27me3.Solexa-12609.bed.gz"}))
(def ss (ds/->dataset d))
I'd like to get all values from the file column where the sample column equals "H3K27me3_IMR90_Cell_Line_1".

Nom Nom Mousse 09:05:18

In pandas I'd do

ss[ss["sample"] == "H3K27me3_IMR90_Cell_Line_1"].file.iloc[0]

Nom Nom Mousse 09:05:07

There is a select-rows, but it uses indexes:

(ds/select-rows ss (map #(= "H3K27me3_IMR90_Cell_Line_1" %) (ss "sample")))
;; fails: the mask holds booleans, but select-rows expects numeric indexes
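
For reference, one way around the index requirement is to turn the boolean test into row positions before calling select-rows. The sketch below assumes `ds` is an alias for tech.v3.dataset; the toy dataset and the helper name `matching-indexes` are made up for illustration and are not part of the library.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical toy data with string column names, mirroring the thread.
(def toy
  (ds/->dataset [{"sample" "s0" "file" "s0.bed.gz"}
                 {"sample" "s1" "file" "s1.bed.gz"}]))

(defn matching-indexes
  "Returns the positions of the values in `column` equal to `wanted`.
  Illustrative helper, not a library function."
  [column wanted]
  (keep-indexed (fn [idx value] (when (= wanted value) idx)) column))

;; Select by position, then take the first value of the "file" column.
(def first-file
  (-> (ds/select-rows toy (matching-indexes (toy "sample") "s1"))
      (ds/column "file")
      first))
```

Here `keep-indexed` keeps only the positions where the comparison succeeds, so select-rows receives numbers rather than booleans.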

Nom Nom Mousse 09:05:41

(def c
  (-> (ds/filter-column ss "sample" #(= "H3K27me3_IMR90_Cell_Line_0" %))
      (ds/select-columns ["file"])))
(-> c vals first first)
;; "EpigenomeAtlas/Current-Release/experiment-sample/Histone_H3K27me3/IMR90_Cell_Line/UCSD.IMR90.H3K27me3.LL223.bed.gz"
Doesn't feel like the most pithy solution.

genmeblog 09:05:18

TC (tablecloth) accepts any predicate as a row selector.

genmeblog 09:05:06

(tc/select-rows ss (comp #{"..."} :sample))

Nom Nom Mousse 10:05:08

(tc/select-rows ss (comp #{"H3K27me3_IMR90_Cell_Line_0"} #(get % "sample"))) worked. It would have been easier if the columns were keywords.

genmeblog 11:05:33

Right. The above can be rewritten as:

(tc/select-rows ss #(= "H3K27me3_IMR90_Cell_Line_0" (% "sample")))
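
Putting the pieces together, a rough tablecloth equivalent of the pandas one-liner from earlier in the thread might look like the sketch below. It assumes `tc` is an alias for tablecloth.api and that the column names are strings, as they were for the dataset discussed here; the toy data and the function name `file-for-sample` are invented for illustration.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical toy data with string column names, as in the thread.
(def toy
  (tc/dataset [{"sample" "s0" "file" "s0.bed.gz"}
               {"sample" "s1" "file" "s1.bed.gz"}]))

(defn file-for-sample
  "Returns the first \"file\" value among rows whose \"sample\" equals
  `sample-name`. Illustrative helper, not part of tablecloth."
  [dataset sample-name]
  (-> dataset
      ;; the predicate receives each row as a map keyed by column name
      (tc/select-rows #(= sample-name (% "sample")))
      (tc/column "file")
      first))

(file-for-sample toy "s1")
```

The final `first` plays the role of `.iloc[0]` in the pandas version. Dropping it should leave the whole filtered column.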

metasoarous 15:05:06

Good news nerds! GitHub finally accepts LaTeX in their Markdown! https://github.blog/changelog/2022-05-19-render-mathematical-expressions-in-markdown/
