2022-04-26
Does anyone here use a workflow management system like https://snakemake.readthedocs.io/en/stable/ or https://www.nextflow.io/? These are mostly used in bioinformatics AFAIK, but would be suitable for all branches of science and writing any kind of advanced data science workflow.
I'm asking because I'm writing something similar in Clojure and have an alpha ready which I'll make public in the not-too-distant future. I'm open to co-authors.
I'd be interested in talking with you more about this.
Feel free to ping me or ask questions :)
I'll have docs and the software ready in a month or so, might be easier to discuss after making it public 🙂
Anyone used Clojure for genomics? Anyone implementing something like GenomicRanges? I'm considering it.
genme sounds like the name of a biotech startup XD
@endrebak how big is the data you are working on? Is it something that could be done on a single large-ish box?
They do not purport to offer the same functionality as Make 🙂 Those are more a Clojure version of the Python Science Stack as far as I know 🙂
@endrebak We don't use either of those. For pipelines we use an extensible streaming server that uses services connected into DAGs to form jobs. This is all for various forms of high throughput sequencing (HTS). Stuff like RNA-Seq, Tn-Seq, Term-Seq, PETRI-Seq, et al. This was homegrown. It was also used by labs at Tufts and Northeastern. Biologists like it because the input is simple spreadsheet-generated CSV files that describe the reads and experiment design. It could use a major rewrite, in particular the program graph analysis, expansion and DAG instantiation, which were thrown together for expediency. Should be redone with Specter.
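To make the "sample sheet in, DAG of steps out" idea concrete, here is a minimal sketch in Python. It is not the aerobio code and it is not streaming: the sheet columns, the step names, and the `run_job` helper are all assumptions made up for illustration. It only shows a CSV sample sheet being parsed and a small dependency graph being walked in topological order.

```python
import csv
import io
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical sample sheet; the column names are assumptions,
# not the format any real tool expects.
SHEET = """sample,condition,reads
s1,control,s1.fastq
s2,treated,s2.fastq
"""

# Hypothetical job graph: each step maps to the set of steps it depends on.
DAG = {
    "align": set(),
    "count": {"align"},
    "compare": {"count"},
    "report": {"compare"},
}


def read_samples(text):
    """Parse the sample sheet into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def run_job(samples, dag):
    """Walk the steps in dependency order and record what would run.

    A real system would dispatch each (step, sample) pair to a service;
    here we only collect the pairs so the ordering is visible.
    """
    log = []
    for step in TopologicalSorter(dag).static_order():
        for row in samples:
            log.append((step, row["sample"]))
    return log


samples = read_samples(SHEET)
log = run_job(samples, DAG)
for step, sample in log:
    print(step, sample)
```

The point of keeping the graph as plain data is that the experiment description (the sheet) and the pipeline shape (the DAG) can change independently.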
Is it open source somewhere?
https://github.com/jsa-aerial/aerobio There is a fair amount of internal documentation that I should put out there as well. There were a couple of times when one of the PIs asked about packaging it up and publishing a paper. The problem (as with most of this stuff) is the large number of diverse dependencies. Not just libs but entire tool chains (e.g., samtools).
Like I mentioned, it works nice for what it is. Biologists put their read structures and experiment setups, along with various comparison requirements, together in Excel, then push them to a canonical place on our servers and then 'push the button'. A few hours to a few days later they get an email describing the results and the canonical locations for the output.
Cool. I'll look at it for inspiration 😄
There are loads of these sorts of things out there and loads of things like Snakemake and Nextflow. They never seem to catch on in general because of the great diversity and specificity of individual lab experiments and workflows. Companies come and go in this space.
Yes, I wanted one for me specifically. If others like it, I'm happy, but I'm making it for myself. I actually have the same idea: the input is a sample sheet.
But Nextflow and Snakemake did take off though 😄
Nobody I know in many labs across the country uses them. Anecdotal, but I still would not say 'take off' here. Same with Galaxy - lots of people 'try it' and some even (kind of sort of) use it. But many abandon it due to high impedance mismatch. We used it originally, but it was way too cumbersome. Also, it sits at an 'odd abstraction' level. Too low level for the typical biologist, but too high level for easy use across use cases.
I'm just looking at the cites here. I used Snakemake though, but wanted something different
But Snakemake and Nextflow are different in that they are DSLs and execution environments that can be used to write any workflow. They do not bundle any software.
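As a rough illustration of what "a DSL plus an execution environment" means for a Make-style tool: rules declare an output pattern and their inputs, and the engine works backward from a requested target to the steps needed. The sketch below is Python and is not Snakemake's implementation or API. The rule names, the single `{sample}` wildcard, and the `match` and `plan` helpers are assumptions for illustration only.

```python
import re

# Hypothetical rules. Each output pattern uses the {sample} wildcard exactly once.
RULES = [
    {"name": "align", "output": "{sample}.bam", "inputs": ["{sample}.fastq"]},
    {"name": "count", "output": "{sample}.counts.tsv", "inputs": ["{sample}.bam"]},
]


def match(pattern, target):
    """Return {'sample': ...} if target fits the pattern, else None."""
    parts = [re.escape(p) for p in pattern.split("{sample}")]
    regex = r"(?P<sample>[^./]+)".join(parts)
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None


def plan(target, existing):
    """Return the ordered (rule, output) steps needed to build target.

    Files listed in `existing` are treated as already present, so no
    steps are planned for them.
    """
    if target in existing:
        return []
    for rule in RULES:
        wildcards = match(rule["output"], target)
        if wildcards is None:
            continue
        steps = []
        for inp in rule["inputs"]:
            steps += plan(inp.format(**wildcards), existing)
        return steps + [(rule["name"], target)]
    raise ValueError(f"no rule produces {target}")


steps = plan("s1.counts.tsv", existing={"s1.fastq"})
print(steps)
```

Nothing in the rules names a specific tool, which matches the point above: the engine only resolves and orders steps, and the workflow author supplies whatever software each step runs.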
Once we have the base data (BAMs, DGE and fitness matrices, etc.), a mix of R, tmd, tc, and Neanderthal is used for various other post-processing pipeline analyses.