#beginners
2020-07-19
vachichng12:07:40

hi, I have a big csv, is it possible to lazily read the lines in reverse order, from tail to head?

jaihindhreddy13:07:10

You can reverse the file with tac file.csv > rev_file.csv and then read rev_file.csv normally. Definitely an option if the file isn't too large.
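A JVM-only equivalent of that tac approach could look like this. This is a hedged sketch, assuming the file fits comfortably in memory; the class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ReverseFile {
    // Same idea as `tac in > out`: read everything, reverse, write back.
    // Only sensible when the whole file fits comfortably in memory.
    public static void reverse(Path in, Path out) throws IOException {
        List<String> lines = Files.readAllLines(in);
        Collections.reverse(lines);
        Files.write(out, lines);
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("orig", ".csv");
        Path out = Files.createTempFile("rev", ".csv");
        Files.write(in, List.of("c,3", "b,2", "a,1"));
        reverse(in, out);
        System.out.println(Files.readAllLines(out)); // [a,1, b,2, c,3]
    }
}
```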

noisesmith14:07:06

surely someone has made a "from end buffered line reader" on top of mmap

noisesmith14:07:53

it's inefficient (requires consuming the whole thing and working backward) but doable

noisesmith16:07:29

actually, mmap might be the one way to do this that doesn't use heap space inefficiently (if I'm remembering the API correctly)

noisesmith16:07:17

you can use the memory mapped API to do this without putting the whole contents in heap, it even lets you skip to the end and work toward the front, without consuming what's in between https://howtodoinjava.com/java7/nio/memory-mapped-files-mappedbytebuffer/
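A rough sketch of that idea (names are invented; it assumes UTF-8 content, no `\r\n` handling, and a file under 2 GB, since a single mapping is indexed by `int` — a real reader would map windows): map the file read-only, then scan backward from the end for newline bytes, decoding each line as it is found. Only the decoded lines are allocated on the heap; the file's bytes stay off-heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class BackwardLineReader {
    // Scan a memory-mapped file from the end, emitting lines last-first.
    public static List<String> linesReversed(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            List<String> lines = new ArrayList<>();
            int end = buf.limit();                           // exclusive end of current line
            if (end > 0 && buf.get(end - 1) == '\n') end--;  // ignore trailing newline
            for (int i = end - 1; i >= 0; i--) {
                if (buf.get(i) == '\n') {
                    lines.add(decode(buf, i + 1, end));
                    end = i;
                }
            }
            lines.add(decode(buf, 0, end));                  // first line of the file
            return lines;
        }
    }

    private static String decode(MappedByteBuffer buf, int from, int to) {
        byte[] bytes = new byte[to - from];
        for (int j = 0; j < bytes.length; j++) bytes[j] = buf.get(from + j);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".csv");
        Files.write(tmp, List.of("first", "second", "third"));
        System.out.println(linesReversed(tmp)); // [third, second, first]
    }
}
```

This returns all the lines at once for simplicity; to keep it lazy, the same backward scan could back an `Iterator<String>` that decodes one line per `next()`.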

drewverlee16:07:44

I'm curious why that would be useful. @vachichng

alexmiller16:07:28

I assume you have something accumulated in time order and you want to process from new to old

noisesmith16:07:45

next question is whether anybody has hooked that up to read lines in backward order yet

noisesmith16:07:21

clever usage of tac and ProcessBuilder with output to an iostream instead of file might allow similar via the OS(?)
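Something along these lines could work. It's a sketch, not portable: it assumes GNU coreutils tac is on the PATH (Linux), and the class and method names are invented:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.stream.Stream;

public class TacStream {
    // Spawn `tac` and lazily stream its stdout line by line, so the
    // reversed file never has to be materialized on disk.
    public static Stream<String> reversedLines(String path) throws IOException {
        Process p = new ProcessBuilder("tac", path).start();
        BufferedReader r =
                new BufferedReader(new InputStreamReader(p.getInputStream()));
        return r.lines().onClose(() -> {
            try {
                r.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            p.destroy();
        });
    }
}
```

Closing the stream (e.g. via try-with-resources) tears down the reader and the child process.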

vachichng16:07:52

@drewverlee exactly as Alex said, I have a big csv of timestamped data that I have to process windowed by timeframe, and the original source has it ordered recent to old, but the window function needs old to recent

smith.adriane16:07:25

> Reads lines in a file reversely (similar to a BufferedReader, but starting at the last line). Useful for e.g. searching in log files.

noisesmith16:07:40

nice, I'd hoped someone made that

drewverlee16:07:30

But if you have to process the whole thing then the output will be the same, no? The window-closing logic should handle any order of timestamped events.

noisesmith16:07:01

many reducing processes aren't associative, ordering can matter a lot

drewverlee16:07:38

Yes, but I think, for instance, Onyx's semantics express the window-closing trigger this way.

vachichng16:07:58

@drewverlee no, ordering matters because the window function needs to know the oldest timestamp to create the sequence

vachichng16:07:53

@drewverlee the window function looks like: every element between an initial time and an hour after it

drewverlee16:07:27

starting a thread so we don't lock up #beginners. Is the csv ordered by timestamp?

drewverlee16:07:03

> no, ordering matters because the window function needs to know the oldest timestamp to create sequence

What does "create a sequence" mean?

vachichng16:07:43

a sequence of grouped data created by the window function

vachichng16:07:03

yeah, it is ordered by new to old

vachichng16:07:21

but, I have to process it from old to new

drewverlee16:07:47

is there a side effect as part of that processing?

drewverlee16:07:05

that those need to be ordered?

vachichng16:07:28

yes, I have to send it to a Kafka topic for further processing

drewverlee16:07:10

Gotcha.

1. It's weird to have ordered the data this way if it's not how it's used. It means readers always pay a performance penalty. This is the main issue: the data is ordered in such a way that readers pay for it.
2. Out-of-order data is unavoidable and has to be accounted for.
3. Streaming frameworks with windowing semantics can read out-of-order data. Say we had data from 3 to 5, so it's basically ordered in your csv like 5:01, 4:01, 3:01. It would read 5:01 first, then 4:01, and the window trigger would close on 5-6 because we saw 4:01, releasing that data to Kafka (or whatever). The thing reading Kafka can and should also account for out-of-order timestamps (because this is unavoidable). E.g. let's say you do reverse the csv and send 3-4 first; what if that network call fails (and has to retry) but 4-5 succeeds? Then Kafka gets 4-5, 2-3, 5-6 regardless.
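The bucketing behind point 3 can be sketched order-independently: floor each timestamp to its hour, regardless of arrival order, and let a sorted map hand the windows back oldest-first. The names and the millisecond timestamps below are assumptions for illustration, not anything from the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class HourlyWindows {
    static final long HOUR_MS = 3_600_000L;

    // Assign each timestamp to the one-hour window containing it.
    // Arrival order is irrelevant: an event at time ts always lands in
    // the window starting at floor(ts / hour) * hour.
    public static TreeMap<Long, List<Long>> window(List<Long> timestamps) {
        TreeMap<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            long start = (ts / HOUR_MS) * HOUR_MS;
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(ts);
        }
        return windows; // TreeMap iterates oldest window first
    }

    public static void main(String[] args) {
        // Events arrive newest-first, like the csv: 5:01, 4:01, 3:01.
        List<Long> arrivals = List.of(5 * HOUR_MS + 60_000,
                                      4 * HOUR_MS + 60_000,
                                      3 * HOUR_MS + 60_000);
        System.out.println(window(arrivals).keySet()); // [10800000, 14400000, 18000000]
    }
}
```

A late event simply lands in its (already-seen) bucket; a real streaming framework adds a trigger that decides when a bucket is closed and emitted.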

drewverlee16:07:19

If that's more or less all well understood, then the solutions suggested by others about reverse-reading a csv are likely good ones. I don't have any particular insight there 🙂

vachichng19:07:49

@drewverlee thanks, I'm not familiar with Kafka, will have a look at that. I was doing the transformations with transducers to feed the Kafka topic, because it turns out that the csv is not ordered properly, so a naive reverse of the csv is still not ordered the way I need.

tjb20:07:21

hello again everyone! I have a question: is there a preferred structure in a repo for having both a client (cljs) and an API (clj)?

tjb21:07:33

I currently used lein new shadow-cljs <app-name> +reagent but am curious if there is another way to generate both the client template and a plain lein project? Maybe I should just lein new my-app and then lein new shadow-cljs inside of that app?

nfedyashev23:07:12

The Luminus template has really good defaults and a good overall structure (clj/cljs/cljc): lein new luminus new-app +cljs +shadow-cljs

tjb23:07:47

@ -- thank you so much! I'll take a peek

nfedyashev12:07:04

haha, sorry mate. I recommended giving Luminus a try, and it looks like you experienced exactly the same issue I was fighting in the recent Luminus template 😄 (an infinite "build completed"/"compiling" loop). Referenced issue: https://github.com/luminus-framework/luminus/issues/270 You may need to upgrade your shadow-cljs to 2.10.16, or use a ~1-month-old Luminus template as a workaround.

tjb15:07:06

@ thank you a ton! I got it resolved when you responded to me in #shadow-cljs. I appreciate your support and guidance on this!