2025-01-19 data-science | Clojure Slack Archive

data-science 2025-01-19

2025-01-19T16:11:54.662659Z

FWIW, Tidyverse is fine, but not necessary for R.

👍 1

Stephan Renatus 2025-01-19T19:29:23.098369Z

Help me out here, please: from my dated understanding of the R ecosystem, that’s where the tidyverse came up and lives, isn’t it?

2025-01-19T19:55:13.433569Z

Could you rephrase the question? The Tidyverse, written by the amazing and prolific Hadley Wickham, is a set of R packages that enhances the functionality of base R for data processing. It includes ggplot2, lubridate, dplyr and quite a few other packages. IMHO, ggplot2 (which actually pre-dates the notion of a tidyverse) is the most useful, and offers in-depth composable plotting capabilities. The original concept of a data frame comes from base R, but some users found it confusing to use, and Hadley wrote several new packages oriented around data analysis for tables (tibbles) of data with an easier (i.e. more verbose) syntax, which has become popular and part of the "tidyverse". I myself prefer another library named data.table that is more similar in syntax to base R and also more performant, though less verbose. Like Clojure, the syntax of data.table has a learning curve, but your code will be the same regardless of package versions. Tidyverse functions, on the other hand, are great but often less performant and may require syntax tweaks across versions over time.
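To illustrate the "similar in syntax to base R" point, here is a minimal sketch that puts base R and data.table side by side. The data frame `df` and its columns `symbol` and `price` are made-up illustration data, not anything from this thread.

```r
library(data.table)

# Hypothetical toy data, for illustration only
df <- data.frame(symbol = c("A", "A", "B"), price = c(10, 12, 20))
dt <- as.data.table(df)

# Base R: bracket indexing, with columns referenced through df$
base_rows <- df[df$price > 10, c("symbol", "price")]

# data.table: same bracket shape, but columns are visible by name inside [ ]
dt_rows <- dt[price > 10, .(symbol, price)]

# Grouped mean: base R reaches for a helper such as aggregate() ...
base_means <- aggregate(price ~ symbol, data = df, FUN = mean)

# ... while data.table keeps the grouping inside the brackets
dt_means <- dt[, .(price = mean(price)), by = symbol]
```

Both pairs should hold the same values; only the way the rows, columns, and groups are spelled differs.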

Stephan Renatus 2025-01-19T19:56:57.952249Z

ah. now I understand your message. I read it as “it’s not necessary for R” as if tidyverse was available (as-is) in another language. but you mean it’s “not necessary for R”, as in, you can do without it, in R. thanks for clarifying!

👍 1

2025-01-19T20:01:30.346209Z

Everyone is different, and when some people refer to R itself they mean R + Tidyverse. I rarely use Tidyverse unless I am adding to someone else's code that already uses it. I prefer data.table. YMMV. https://github.com/Rdatatable/data.table

👍 1

Daniel Slutsky 2025-01-19T22:22:41.244449Z

Thanks for this discussion. Regarding the lecture miniseries we are organizing, do you find any topics or packages in the R ecosystem (tidy or not) that you think would inspire us to build Clojure equivalents?

2025-01-20T02:08:26.574229Z

I do. The data.table paradigm is quite powerful and is worth emulating:
# FROM[WHERE, SELECT, GROUP BY]
# DT  [i,     j,      by]

2025-01-20T02:11:54.312339Z

could become: (from where select groupby)

2025-01-20T02:13:07.525609Z

where the input is a datatable and so is the output, a different datatable
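To make the DT[i, j, by] shape concrete, here is a minimal sketch in R with data.table. The table `dt` and its columns `symbol`, `year`, and `price` are made-up illustration data, not the datasets from the walkthrough.

```r
library(data.table)

# Hypothetical toy data, for illustration only
dt <- data.table(
  symbol = c("A", "A", "A", "B", "B", "B"),
  year   = c(2023, 2024, 2024, 2023, 2024, 2024),
  price  = c(10, 12, 14, 20, 22, 30)
)

# FROM dt  WHERE year == 2024  SELECT count, mean(price)  GROUP BY symbol
res <- dt[year == 2024,                          # i: which rows
          .(n = .N, mean_price = mean(price)),   # j: what to compute
          by = symbol]                           # by: per group

# The result is itself a data.table, so another [i, j, by] can be chained onto it
top <- res[order(-mean_price)][1]
```

Because every step takes a table and returns a table, the same three slots compose by chaining, which is the property a Clojure `(from where select groupby)` form would presumably want to keep.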

2025-01-20T02:16:23.395149Z

neatRanges might be interesting to implement as it has useful methods commonly needed when working with date ranges https://github.com/arg0naut91/neatRanges

Daniel Slutsky 2025-01-20T07:49:18.371979Z

Nice. Would you be interested in discussing data.table in a meetup?

2025-01-20T15:01:42.471089Z

Sure. Depending on when it is

Daniel Slutsky 2025-01-20T16:50:38.744909Z

Wonderful. The best time will be around the current hour, but on a Friday a few weeks away. That is the main time we are assigning to the R4Clj meetings. But if this does not work, we can always set up another hour for a special meeting with your presentation.

2025-01-20T17:27:00.884839Z

Okay, lmk when you have a hard calendar date.

Daniel Slutsky 2025-01-20T17:30:48.412209Z

Thanks. Maybe Feb 14 or 28? Or later? Would you like to give a long or short talk?

Daniel Slutsky 2025-01-23T14:17:59.611969Z

Nice, many thanks. Yes, let us leave the date open for now and talk again a little later, as the series of meetups continues.

2025-01-23T15:08:46.972059Z

There are also several excellent Matt Dowle talks... e.g. I like this one but there are more recent ones too https://m.youtube.com/watch?v=qLrdYhizEMg

👍 1

2025-01-22T23:29:43.502219Z

Here are a few examples of using data.table. I used the same datasets used in the 100-walkthrough for tech.ml.dataset:

## R
options(width = 300)
library(data.table)
## install.packages("remotes")
## remotes::install_github("HenrikBengtsson/R.utils")
## Let's try to mirror the analysis for tmd: 
d <- data.table::fread("")
head(d)
str(d)
mycols <- c("SalePrice", "1stFlrSF", "2ndFlrSF")
d[1:5, mycols, with=FALSE ]


## remotes::install_github("ycphs/openxlsx")
library(openxlsx)
d_xls <- as.data.table(openxlsx::read.xlsx(""))
class(d_xls)

str(d_xls)

data.table::setnames(d_xls, old= c("Date"), new= c("Date_string"))
d_xls[, date := as.Date(Date_string)]
str(d_xls)


d[, c("Id", "OverallQual", "SalePrice")]


d_stocks <- data.table::fread("")

## MSFT price moments
d_stocks[symbol == 'MSFT', .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price))]
## Same moments, grouped by symbol (a single group here because of the filter)
d_stocks[symbol == 'MSFT', .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = "symbol"]
## Grouped by symbol and date
d_stocks[symbol == 'MSFT', .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = c("symbol", "date")]

## MSFT moments per year; the filter returns a new table, so year is not added to d_stocks here
d_stocks[symbol == 'MSFT'][, year := year(as.Date(date, c('%b %d %Y') ))][, .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = c("symbol", "year")]
## Moments per symbol and year; this adds the year column to d_stocks by reference
d_stocks[, year := year(as.Date(date, c('%b %d %Y') ))][, .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = c("symbol", "year")]

data.table::fwrite(d_stocks, file = "the-stocks.tsv.gz", sep="\t", compress = "gzip")


## Same moments for all symbols

d_stocks[, .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = "symbol"]

## Sorting
d_stocks[, .(N = .N, min_price = min(price), mean_price = mean(price), median_price = median(price),  max_price = max(price)), by = "symbol"][order(-mean_price)]

2025-01-22T23:31:07.384679Z

Hope that is helpful

2025-01-22T00:20:05.274239Z

I can give a short explanation of how to use it. It is a great package if you use R. Lately, I am using Python though, because employers demand it. Your original question was "Would you be interested in discussing data.table in a meetup". Feb 14 or 28 are okay, but in my time zone these are work hours, so I won't be sure until the dates get closer. Evenings or weekends have less risk of being preempted. I would also need to get set up on whatever platforms you are using, which may or may not work on my Pop!_OS system.
