#beginners
2020-07-19
vachichng12:07:40

hi, I have a big csv, is it possible to lazily read the lines in reverse order, from tail to head?

jaihindhreddy13:07:10

You can reverse the file with tac file.csv > rev_file.csv and then read rev_file.csv normally. Definitely an option if the file isn't too large.
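A JVM-only equivalent of that tac approach could look like this. This is a hedged sketch, assuming the file fits comfortably in memory; the class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ReverseFile {
    // Same idea as `tac in > out`: read everything, reverse, write back.
    // Only sensible when the whole file fits comfortably in memory.
    public static void reverse(Path in, Path out) throws IOException {
        List<String> lines = Files.readAllLines(in);
        Collections.reverse(lines);
        Files.write(out, lines);
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("orig", ".csv");
        Path out = Files.createTempFile("rev", ".csv");
        Files.write(in, List.of("c,3", "b,2", "a,1"));
        reverse(in, out);
        System.out.println(Files.readAllLines(out)); // [a,1, b,2, c,3]
    }
}
```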

noisesmith14:07:06

surely someone has made a "from end buffered line reader" on top of mmap

noisesmith14:07:53

it's inefficient (requires consuming the whole thing and working backward) but doable

noisesmith16:07:29

actually, mmap might be the one way to do this that doesn't use heap space inefficiently (if I'm remembering the API correctly)

noisesmith16:07:17

you can use the memory mapped API to do this without putting the whole contents in heap, it even lets you skip to the end and work toward the front, without consuming what's in between https://howtodoinjava.com/java7/nio/memory-mapped-files-mappedbytebuffer/
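A rough sketch of that idea (names are invented; it assumes UTF-8 content, no `\r\n` handling, and a file under 2 GB, since a single mapping is indexed by `int` — a real reader would map windows): map the file read-only, then scan backward from the end for newline bytes, decoding each line as it is found. Only the decoded lines are allocated on the heap; the file's bytes stay off-heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class BackwardLineReader {
    // Scan a memory-mapped file from the end, emitting lines last-first.
    public static List<String> linesReversed(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            List<String> lines = new ArrayList<>();
            int end = buf.limit();                           // exclusive end of current line
            if (end > 0 && buf.get(end - 1) == '\n') end--;  // ignore trailing newline
            for (int i = end - 1; i >= 0; i--) {
                if (buf.get(i) == '\n') {
                    lines.add(decode(buf, i + 1, end));
                    end = i;
                }
            }
            lines.add(decode(buf, 0, end));                  // first line of the file
            return lines;
        }
    }

    private static String decode(MappedByteBuffer buf, int from, int to) {
        byte[] bytes = new byte[to - from];
        for (int j = 0; j < bytes.length; j++) bytes[j] = buf.get(from + j);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".csv");
        Files.write(tmp, List.of("first", "second", "third"));
        System.out.println(linesReversed(tmp)); // [third, second, first]
    }
}
```

This returns all the lines at once for simplicity; to keep it lazy, the same backward scan could back an `Iterator<String>` that decodes one line per `next()`.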

drewverlee16:07:44

I'm curious why that would be useful. @vachichng

alexmiller16:07:28

I assume you have something accumulated in time order and you want to process from new to old

noisesmith16:07:45

next question is whether anybody has hooked that up to read lines in backward order yet

noisesmith16:07:21

clever usage of tac and ProcessBuilder with output to an iostream instead of file might allow similar via the OS(?)
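Something along these lines could work. It's a sketch, not portable: it assumes GNU coreutils tac is on the PATH (Linux), and the class and method names are invented:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.stream.Stream;

public class TacStream {
    // Spawn `tac` and lazily stream its stdout line by line, so the
    // reversed file never has to be materialized on disk.
    public static Stream<String> reversedLines(String path) throws IOException {
        Process p = new ProcessBuilder("tac", path).start();
        BufferedReader r =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
        return r.lines().onClose(() -> {
            try {
                r.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            p.destroy();
        });
    }
}
```

Closing the stream (e.g. via try-with-resources) tears down the reader and the child process.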

vachichng16:07:52

@drewverlee exactly as Alex said, I have a big csv of timestamped data that I have to process windowed by timeframe, and the original source has it ordered recent to old, but the window function needs old to recent

smith.adriane16:07:25

> Reads lines in a file reversely (similar to a BufferedReader, but starting at the last line). Useful for e.g. searching in log files.

noisesmith16:07:40

nice, I'd hoped someone made that

drewverlee16:07:30

But if you have to process the whole thing then the output will be the same, no? The window-closing logic should handle any order of timestamped events.

noisesmith16:07:01

many reducing processes aren't associative, ordering can matter a lot

drewverlee16:07:38

Yes, but I think, for instance, Onyx's semantics express the window-closing trigger this way.

vachichng16:07:58

@drewverlee no, ordering matters because the window function needs to know the oldest timestamp to create the sequence

vachichng16:07:53

@drewverlee the window function looks like: every element between an initial time and an hour after it

drewverlee16:07:27

starting a thread so we don't lock up #beginners. Is the csv ordered by timestamp?

drewverlee16:07:03

> no, ordering matters because the window function needs to know the oldest timestamp to create sequence

What does "create a sequence" mean?

vachichng16:07:43

a sequence of grouped data created by the window function

vachichng16:07:03

yeah, it is ordered by new to old

vachichng16:07:21

but, I have to process it from old to new

drewverlee16:07:47

is there a side effect as part of that processing?

drewverlee16:07:05

that those need to be ordered?

vachichng16:07:28

yes, I have to send it to a Kafka topic for further processing

drewverlee16:07:10

Gotcha.

1. It's weird to have ordered the data this way if it's not how it's used. It means readers always pay a performance penalty. This is the main issue: the data is ordered in such a way that readers pay for it.
2. Out-of-order data is unavoidable and has to be accounted for.
3. Streaming frameworks with windowing semantics can read out-of-order data. Say we had data from 3 to 5, so it's basically ordered in your csv like 5:01, 4:01, 3:01. It would read 5:01 first, then 4:01, and the window trigger would close on 5-6 because we saw 4:01, releasing that data to Kafka (or whatever). The thing reading Kafka can and should also account for out-of-order timestamps (because this is unavoidable). E.g. let's say you do reverse the csv and send 3-4 first; what if that network call fails (and has to retry) but 4-5 succeeds? Then Kafka gets 4-5, 2-3, 5-6 regardless.
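The bucketing behind point 3 can be sketched order-independently: floor each timestamp to its hour, regardless of arrival order, and let a sorted map hand the windows back oldest-first. The names and the millisecond timestamps below are assumptions for illustration, not anything from the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class HourlyWindows {
    static final long HOUR_MS = 3_600_000L;

    // Assign each timestamp to the one-hour window containing it.
    // Arrival order is irrelevant: an event at time ts always lands in
    // the window starting at floor(ts / hour) * hour.
    public static TreeMap<Long, List<Long>> window(List<Long> timestamps) {
        TreeMap<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            long start = (ts / HOUR_MS) * HOUR_MS;
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return windows; // TreeMap iterates oldest window first
    }

    public static void main(String[] args) {
        // Events arrive newest-first, like the csv: 5:01, 4:01, 3:01.
        List<Long> arrivals = List.of(5 * HOUR_MS + 60_000,
                                      4 * HOUR_MS + 60_000,
                                      3 * HOUR_MS + 60_000);
        System.out.println(window(arrivals).keySet()); // [10800000, 14400000, 18000000]
    }
}
```

A late event simply lands in its (already-seen) bucket; a real streaming framework adds a trigger that decides when a bucket is closed and emitted.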

drewverlee16:07:19

If that's more or less all well understood, then the solutions suggested by others about reverse-reading a csv are likely good ones. I don't have any particular insight there 🙂

vachichng19:07:49

@drewverlee thanks, I'm not familiar with Kafka, will have a look at that. I was doing the transformations with transducers to feed the Kafka topic, because it turns out that the csv is not ordered properly, so a naive reverse of the csv is still not ordered the way I need.

tjb20:07:21

hello again everyone! I have a question: is there a preferred structure in a repo for having both a client (cljs) and an API (clj)?

tjb21:07:33

I currently used lein new shadow-cljs <app-name> +reagent but am curious if there is another way to generate both the client template and a plain lein project? Maybe I should just lein new my-app and then lein new shadow-cljs inside of that app?

nfedyashev23:07:12

The Luminus template has really good defaults and a good overall structure (clj/cljs/cljc): lein new luminus new-app +cljs +shadow-cljs

tjb23:07:47

@ -- thank you so much! I'll take a peek

nfedyashev12:07:04

haha, sorry mate. I recommended giving Luminus a try, and it looks like you experienced exactly the same issue I was fighting in the recent Luminus template 😄 (an infinite "build completed"/"compiling" loop). Referenced issue: https://github.com/luminus-framework/luminus/issues/270 You may need to upgrade your shadow-cljs to 2.10.16, or use a ~1-month-old Luminus template as a workaround.

tjb15:07:06

@ thank you a ton! I got it resolved when you responded to me in #shadow-cljs. I appreciate your support and guidance on this!