#data-science
2022-04-26
Nom Nom Mousse12:04:05

Does anyone here use a workflow management system like https://snakemake.readthedocs.io/en/stable/ or https://www.nextflow.io/? These are mostly used in bioinformatics AFAIK, but would be suitable for all branches of science and writing any kind of advanced data science workflow.

Nom Nom Mousse12:04:54

I'm asking because I'm writing something similar in Clojure and have an alpha ready, which I'll make public in the not too distant future. I'm open to co-authors.

metasoarous23:05:08

I'd be interested in talking with you more about this.

Nom Nom Mousse13:05:23

Feel free to ping me or ask questions :)

Nom Nom Mousse13:05:50

I'll have docs and the software ready in a month or so, might be easier to discuss after making it public 🙂

Nom Nom Mousse12:04:50

Anyone used Clojure for genomics? Anyone implementing something like GenomicRanges? I'm considering it.

genmeblog13:04:06

@jsa-aerial works in bioinformatics afaik

😄 1
Nom Nom Mousse13:04:19

genme sounds like the name of a biotech startup XD

genmeblog13:04:44

it's related to generative art

😎 1
otfrom13:04:12

@endrebak how big is the data you are working on? Is it something that could be done on a single large-ish box?

otfrom13:04:04

why not just code and some as-> or -> with tablecloth or tech.ml.dataset?

Nom Nom Mousse13:04:14

They do not purport to offer the same functionality as Make 🙂 Those are more a Clojure version of the Python Science Stack as far as I know 🙂

jsa-aerial14:04:25

@endrebak We don't use either of those. For pipelines we use an extensible streaming server that uses services connected into DAGs to form jobs. This is all for various forms of high throughput sequencing (HTS). Stuff like RNA-Seq, Tn-Seq, Term-Seq, PETRI-Seq, et al. This was homegrown. Was also used by labs at Tufts and Northeastern. Biologists like it because the input is simple spreadsheet-generated CSV files that describe the reads and experiment design. It could use a major rewrite, in particular the program graph analysis, expansion and DAG instantiation, which were thrown together for expediency. Should be redone with Specter.
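To make the idea above concrete, here is a minimal sketch in Python of expanding a CSV sample sheet into a per-sample DAG of services and ordering it for execution. It is not aerobio's code: the column names, the service names (`align`, `count`, `report`) and the `build_job` helper are all hypothetical, and ordering uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
import csv
import io
from graphlib import TopologicalSorter

# Hypothetical sample sheet; the column names are illustrative only.
SAMPLE_SHEET = """sample,reads,condition
s1,s1.fastq.gz,control
s2,s2.fastq.gz,treated
"""

# Hypothetical service graph: each service maps to the services it depends on.
SERVICE_DEPS = {
    "align": set(),
    "count": {"align"},
    "report": {"count"},
}


def build_job(sample_sheet_text, service_deps):
    """Expand the service graph once per sample and return the
    (sample, service) tasks in an order that respects dependencies."""
    rows = list(csv.DictReader(io.StringIO(sample_sheet_text)))
    graph = {}
    for row in rows:
        for service, deps in service_deps.items():
            node = (row["sample"], service)
            # Each task depends on the same sample's upstream services.
            graph[node] = {(row["sample"], dep) for dep in deps}
    return list(TopologicalSorter(graph).static_order())


order = build_job(SAMPLE_SHEET, SERVICE_DEPS)
for sample, service in order:
    print(f"{sample}: {service}")
```

A real streaming server would run independent tasks concurrently rather than in one flat order; the sketch only shows how a sample sheet plus a service graph is enough to derive the job.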

👍 1
Nom Nom Mousse16:04:37

Is it open source somewhere?

jsa-aerial17:04:17

https://github.com/jsa-aerial/aerobio There is a fair amount of internal documentation that I should put out there as well. There were a couple of times when one of the PIs asked about packaging it up and publishing a paper. The problem (as with most of this stuff) is the large number of diverse dependencies. Not just libs but entire tool chains (e.g., samtools).

jsa-aerial17:04:49

Like I mentioned, it works nice for what it is. Biologists put their read structures and experiment setups, along with various comparison requirements, together in Excel, then push them to a canonical place on our servers and then 'push the button'. A few hours to a few days later they get an email describing the results and the canonical locations for the output.

Nom Nom Mousse19:04:24

Cool. I'll look at it for inspiration 😄

jsa-aerial14:04:39

There are loads of these sort of things out there and loads of things like Snakemake and Nextflow. They never seem to catch on in general because of the great diversity and specificity of individual lab experiments and workflows. Companies come and go in this space.

Nom Nom Mousse16:04:41

Yes, I wanted one for me specifically. If others like it, I'm happy, but I'm making it for myself. I actually have the same idea: the input is a sample sheet.

Nom Nom Mousse16:04:20

But Nextflow and Snakemake did take off though 😄

jsa-aerial17:04:29

Nobody I know in many labs across the country uses them. Anecdotal, but I still would not say 'take off' here. Same with Galaxy - lots of people 'try it' and some even (kind of sort of) use it. But many abandon it due to high impedance mismatch. We used it originally, but it was way too cumbersome. Also, at an 'odd abstraction' level. Too low level for typical biologist, but too high level for easy use across use cases.

Nom Nom Mousse19:04:47

I'm just looking at the cites here. I used Snakemake though, but wanted something different

Nom Nom Mousse05:04:29

But Snakemake and Nextflow are different in that they are DSLs and execution environments that can be used to write any workflow. They do not bundle any software.
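As a rough illustration of what "a DSL plus an execution environment" means here, the following Python sketch declares Make-style rules by their inputs and outputs and works out which rules must run to produce a target. It is not how Snakemake or Nextflow are implemented; the `Rule` class, the `plan` function and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    output: str
    inputs: list = field(default_factory=list)


def plan(target, rules, existing):
    """Return the names of the rules to run, in order, to produce `target`.
    Files listed in `existing` are treated as already present."""
    by_output = {rule.output: rule for rule in rules}
    ordered, seen = [], set()

    def visit(path):
        if path in existing or path in seen:
            return
        if path not in by_output:
            raise KeyError(f"no rule produces {path!r}")
        seen.add(path)
        rule = by_output[path]
        for dep in rule.inputs:
            visit(dep)  # schedule upstream rules first
        ordered.append(rule.name)

    visit(target)
    return ordered


# The rules only describe data flow; the tools they would invoke are not bundled.
RULES = [
    Rule("align", "s1.bam", ["s1.fastq.gz"]),
    Rule("count", "s1.counts.tsv", ["s1.bam"]),
]

steps = plan("s1.counts.tsv", RULES, existing={"s1.fastq.gz"})
print(steps)
```

The sketch does no cycle detection or timestamp checking, both of which a real workflow engine would need.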

jsa-aerial14:04:40

Once we have the base data (BAMs, DGE and fitness matrices, etc.), a mix of R, tmd, tc, and Neanderthal is used for various other post-processing pipeline analyses.