#data-science
2020-12-27
David Pham 23:12:02

Does anyone know whether the data frames in Geni (Spark) are typed or untyped?

Anthony Khong 05:12:36

Hi David, author of Geni here. It uses Spark Datasets (see https://www.baeldung.com/java-spark-dataframe-dataset-rdd for a discussion), so it’s a typed view of DataFrames. However, the type information only becomes available when the schema is loaded, so you’ll get type errors at run time.
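
To illustrate the "types arrive with the schema, errors arrive at run time" idea, here is a minimal sketch in plain Python. It does not use Geni or Spark at all; `SCHEMA`, `check_row`, and `add_years` are hypothetical names made up for this example, and the schema stands in for whatever would be read from a real data source.

```python
# Hypothetical schema, as if it had been read from the data source at load time.
SCHEMA = {"name": str, "age": int}


def check_row(row, schema):
    """Validate one row against the schema; raise TypeError on a mismatch."""
    for column, expected in schema.items():
        value = row[column]
        if not isinstance(value, expected):
            raise TypeError(
                f"column {column!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return row


def add_years(row, years):
    """Return a copy of the row with `age` increased, after checking its types."""
    checked = check_row(row, SCHEMA)
    return {**checked, "age": checked["age"] + years}
```

Nothing is checked when the code is written: `add_years({"name": "Ada", "age": "36"}, 1)` only fails once it actually runs, and it fails with a `TypeError` rather than silently coercing the string.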

David Pham 08:12:40

Does it have an impact when you use datasets? Do you feel the burden of types in comparison to handling a collection of open Clojure maps?

David Pham 08:12:55

Thanks for your answer and the library!

Anthony Khong 08:12:58

> Do you feel the burden of types in comparison to handling a collection of open Clojure maps?

Not really. To me, it still feels like a dynamic language (or library, in this case), because everything happens at run time. But, just like Clojure, it’s strongly typed, so you get type errors at run time. Also, I wouldn’t compare it to handling Clojure maps; Geni is for a different use case. If your data is small enough, using a collection of maps is probably better, because the reader of your code doesn’t have to learn Spark. But once you’re dealing with millions or billions of rows, you’d want to use Spark or a similar library.

David Pham 22:12:45

Thanks a lot for your explanations!

👍 1