#beginners
2024-01-21
James Amberger00:01:26

regarding deps.edn, what if I want an alias that will run more than one -main fn?

hiredman00:01:04

Make a new function that runs each main and use that as your main

James Amberger00:01:51

Make a new function somewhere in my path. Okay
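A minimal sketch of hiredman's suggestion, assuming two hypothetical namespaces app.one and app.two that each already have a -main (all names here are made up):

```clojure
;; src/app/run_all.clj -- hypothetical wrapper namespace
(ns app.run-all
  (:require [app.one]
            [app.two]))

(defn -main [& args]
  ;; run each existing -main in turn; how to split args between them is up to you
  (apply app.one/-main args)
  (apply app.two/-main args))

;; deps.edn -- then point an alias at the wrapper:
;; {:aliases {:run-all {:main-opts ["-m" "app.run-all"]}}}
```

Invoked as `clj -M:run-all`, this runs both mains sequentially in one JVM.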

James Amberger00:01:29

Also, when using -X, is it right to say you’re basically limited to fns whose first parameter is a map (or 0-arity)?

hiredman00:01:48

You need to watch out, because some practices in main functions that improve the experience at the command line make them hard to compose (e.g. calling shutdown-agents at the end)
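For reference, clj -X invokes a fully qualified function with exactly one argument: a map assembled from :exec-args in the alias merged with key/value pairs given on the command line. A hypothetical sketch (the namespace and key names are made up):

```clojure
(ns my.tasks)

(defn greet
  "Invoke with: clj -X my.tasks/greet :name '\"Rich\"'"
  [{:keys [name] :or {name "world"}}]
  ;; -X passes a single map, so destructure options from it
  (println (str "Hello, " name "!")))
```

Any function with a one-map arity works as an -X target; no special registration is needed.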

Aaron Cooley03:01:27

I’m writing an app to parse .csv files of my financial (bank, credit card) statements to track my spending. I ran into a snag that I haven’t figured out how to work around. I use tablecloth/dataset to load the .csv files into my app, then I standardize the column headers based on a file-by-file mapping, e.g. the headers “Date” in one file and “Transaction Date” in another both map to :transaction-date in the final combined table containing all my financial transactions across all my accounts. Here’s where it gets tricky: all but one of the files are encoded as UTF-8; the last one is UTF-8 with BOM (byte order mark). When I go to match my tables to headers, Clojure won’t recognize the string “Date” from the UTF-8-BOM-encoded file as “Date”. TL;DR: is there a way to force text strings read from files with different encodings into a single encoding scheme, e.g. UTF-8?

hiredman04:01:55

Clojure strings are java strings and java strings are all utf-16

hiredman04:01:27

(they behave as utf-16 or ucs2 or whatever it is called, although newer jvms will sometimes actually store the data as some more efficient encoding)

hiredman04:01:42

So if you have the string "Date" and the string "Date" and they are not comparing equal, it is not due to encoding, it may be a 0 width character, or some other Unicode weirdness

Aaron Cooley04:01:23

Interesting. The reason I assumed it was the encoding was that when I opened the .csv in a text editor and re-saved it as UTF-8, it fixed the problem. But yes, you’re exactly right. If I bind the column name to column-name, I get this insanity:

column-name ;; => "Date"
(= column-name "Date") ;; => false

Aaron Cooley04:01:26

Furthermore, Date is the first column in the .csv, so if the byte order mark, which I understand sits at the beginning of the file, is somehow stealthily making it into my string, that’s exactly the field where you’d expect to find it.

Aaron Cooley05:01:39

Update: fixed it by creating a function to remove the BOM and then applying it to all column names using tc/column-rename

(defn remove-bom [s]
  (let [bom "\uFEFF"]
    (if (.startsWith ^String s bom)
      (subs s (count bom))
      s)))
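A quick REPL check of the helper, on strings with and without the leading BOM:

```clojure
(remove-bom "\uFEFFDate") ;; => "Date"
(remove-bom "Date")       ;; => "Date" (unchanged)
```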

Aaron Cooley05:01:15

Now my only questions are:
• Was this user error, or should I create an issue?
• If the latter, should I create a tablecloth issue, a tech.v3.dataset issue, or a Clojure issue?

Marcelo Feodrippe05:01:08

Seems like an "operational error", where the data is not provided as requested (UTF-8). If the product + interface contract says it should accept both UTF-8 and UTF-8-with-BOM, then it could be an issue. It's like sending a byte format instead of a text format: you need to know what to decode, in my opinion

Marcelo Feodrippe06:01:43

Things tend to work out better when solved at the root of the problem

phill12:01:40

The BOM is quite an irritant: an extra UTF codepoint at the beginning of the file to inform you that the rest of the file is in UTF-8 or whatever. So, indeed, the first word you read from the file (with a Java Reader) is corrupt. Of course the BOM is a great idea if you really truly need help figuring out whether a file is UTF-8 or UTF-16... Apache Commons has a solution at https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/BOMInputStream.html ... basically an input stream that notices the BOM! Why this isn't built into Java is very curious.
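If you'd rather stay dependency-free, the same idea can be sketched with a plain-JDK PushbackReader that peeks at the first char and drops it only when it is a BOM (the helper name skip-bom! is made up):

```clojure
(ns example.bom
  (:import [java.io PushbackReader StringReader]))

(defn skip-bom!
  "Reads one char from r; pushes it back unless it is the BOM (U+FEFF).
   Returns r, positioned after any leading BOM."
  [^PushbackReader r]
  (let [c (.read r)]
    (when (and (not= c -1) (not= c 0xFEFF))
      (.unread r c))
    r))

;; usage:
(slurp (skip-bom! (PushbackReader. (StringReader. "\uFEFFDate,Amount"))))
;; => "Date,Amount"
```

Wrapping the reader once, before handing it to the CSV parser, fixes every downstream string at the source.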

James Amberger16:01:25

You may also be interested in https://github.com/cantino/reckon which I’ve been using happily for several years now

Bailey Kocin22:01:34

I see that when I fetch the first item from a lazy sequence it realizes 32 items at a time. Is there any way to realize only N items at a time? I would like a user to be able to pass me a lazy sequence of operations (with side effects) that I evaluate one at a time; because the functions have side effects, if N are run at once I need total control of that. My first thought was to wrap the side-effecting function in a delay so the chunking would not run them when I grabbed each item, but I don't want the user to have to do that... Any ideas? Example:

(my-api-func (map #(do-side-effects %) coll) process-batch-size)
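The 32-at-a-time behavior is easy to observe at the REPL: chunked sources like range realize a whole chunk even when you ask for a single element.

```clojure
(def realized (atom 0))

;; count how many items the map fn has actually been called on
(def s (map (fn [x] (swap! realized inc) x) (range 100)))

(first s)
@realized ;; => 32: the whole first chunk was realized, not just one item
```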

Alex Miller (Clojure team)22:01:26

Clojure makes no guarantees about when or how many items in a lazy sequence are evaluated. If you want full control over that, you should not use lazy sequences.

Bailey Kocin22:01:54

That makes sense, but then I am still doing what I described with the delay, just using mapv, a loop, or a transducer, for example. I guess the question I really need to ask is: is this good design for a function, or should I have the user just pass options in and call the function myself in the non-lazy forms above?

Alex Miller (Clojure team)22:01:53

loop is a good way to control the process. what are you putting in the delay? could the user just pass you a function that you evaluate when you want to?
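A sketch of that loop-based shape, assuming the user passes plain zero-arg functions (thunks) rather than a pre-mapped lazy seq; run-batches! is a made-up name:

```clojure
(defn run-batches!
  "Runs the thunks batch-size at a time, waiting for each batch to
   finish before starting the next. Returns all results in order."
  [thunks batch-size]
  (loop [remaining thunks
         results   []]
    (if (empty? remaining)
      results
      (let [[batch more] (split-at batch-size remaining)
            ;; start the whole batch, then block until all of it is done
            futs (mapv #(future (%)) batch)]
        (recur more (into results (map deref futs)))))))
```

Chunking is harmless here: realizing a chunk only creates thunks, and nothing runs until the loop explicitly invokes them.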

Bailey Kocin22:01:44

I am at a boundary between Clojure and Java code, with an SDK for a library. Essentially the Clojure library creates a promise, and I am grabbing the Java promise handle to use a function the Java SDK has exclusively for waiting on a group of promises. When the Clojure library creates the promise it also starts something, so I need to create the promise, get the Java ref handle, and then wait on it to complete, but I want to do this in batches. Before, I was invoking the function with that map call, so when it realized the chunks it was invoking it for N chunks at once (and I am doing all this to avoid starting too many at once). Since I need to do it in groups at a time, yeah, the granular control of a loop over a collection of arguments might be best

Bailey Kocin22:01:19

Thanks for helping me rubber duck!

Marcelo Feodrippe00:01:34

I'm a beginner in Clojure too, but what about trying channels for this? Check out org.clojure/core.async, in particular chan (https://clojuredocs.org/clojure.core.async/chan). You could create a queue that processes the first N items by putting them on a channel for execution. pmap can also process N items in parallel. Take each batch of N items, push them through a channel (or channels), and that might do the trick. Again, I'm a beginner; this is just a brainstorm on the topic.

Nim Sadeh02:01:19

Yea, like Marcelo said, this seems channel-shaped. I have written several data engineering scripts that share a common structure:
1. Generate a queue of work to be done
2. Hand off the work to worker "threads", keeping track of some global data that includes operational data (connections to a common resource, done-ness metadata)
3. Control the parallelism of the system and the total usage of some resource
I have found channels to be exactly the mechanism to do that with

adi06:01:41

Generally speaking, https://www.evalapply.org/posts/n-ways-to-fizzbuzz-in-clojure/index.html#le-fizzbuzz-classique-est-mort-%C3%A0-clojure.-d%C3%A9sol%C3%A9, given that lazy constructs intentionally decouple time-of-definition from time-of-use, whereas effects couple the two. That said, you could wrap the effect-fn in a lambda over each collection item, without triggering side effects (viz. thunking), like so:

(map (fn [item] #(do-side-effects item)) coll)
Then my-api-func would expect a collection of lambdas that it would execute sequentially, to cause the effects. It would be a "dumb" processor (doesn't know or care what it is told to evaluate), only supplying process control (e.g. batch size). This is the poor man's streaming architecture. The comments up-thread hint at alternate, likely better, design choices!

adi07:01:46

"When the Clojure library creates the promise it also starts something so I need to both create the promise, get the Java ref handle and then wait on it to complete but I want to do this in batches."
I am squinting at this and it looks awfully clojure-ants-demo-shaped to me :) It sounds like effects are async and uncoordinated within each batch (each effect is an "ant"), and the next batch can begin as soon as the ongoing batch completes (the "world"/"board" is repopulated with new ants as soon as all the ants are "done" in the current batch). Might be fun to play with agents: https://github.com/juliangamble/clojure-ants-simulation/blob/master/src/ants.clj#L316

Bailey Kocin16:01:22

All of this advice is great! It is a bit weird because I am writing code definitions for a service, and the service fulfills promises, but it can only start so many at once, so I had to batch them up. I was waiting on the batches with something like promise/all, then doing the next batch. I could only use functions the service had, so I did not block the main loop; adding a channel to consume things would have been tough since the code was tied to the definition of the service and the way it functioned. I figured it out by just passing in the function object and the arguments, then invoking them manually so as not to go over the gRPC limit the service had. Then, using a promise primitive, I can await while not blocking.
