#joker
2020-12-26
mobileink 13:12:09

Hi. I’m using file-seq to get all the files whose names match a few simple patterns (e.g. “*.ml”). Unfortunately the tree includes 64K items, so I need a version of file-seq that supports a filter function, so I can tell it to ignore files whose names do not match a list of patterns. Since file-seq calls Go’s

func Walk(root string, walkFn WalkFunc) error
that should be relatively easy, even for somebody who doesn’t know much Go. My question is whether there is a way to do that by providing a Go file in my app code so I don’t have to mess with the Joker source code. Well, I guess the first question is whether this would speed things up significantly.

👍 3
jcburley 19:12:58

Sounds like it might speed things up at least somewhat, but it’s hard to say; it depends on the implementation. A general Joker implementation would be to (e.g.) allow an optional filter-fn arg to file-seq that is called for (and with) each path and returns a Boolean indicating whether the path (and its info) is added to the resulting Vector. But that’d probably be about as expensive (CPU-wise, though not memory-wise) as just building the map, adding it to the resulting Vector, and letting the caller decide whether to use it. A faster implementation would be custom Go code that your own file-seq variant would call. It’d do whatever matching it wanted (including against info, which would not be so easily and efficiently passed to the filter-fn proposed above). By not calling into Joker code, it would avoid that overhead for each path returned by filepath.Walk. If you’re willing to consider rebuilding Joker for this purpose, you might find it easier than you think to add your own custom file-seq to filepath.joke and filepath/filepath_native.go in Joker’s std/ subdirectory. See DEVELOPER.md and other docs on the internals, such as how the standard-library stuff works, and let us know if you’d like to try this approach. (BTW it’s nice to see Joker getting such “heavy” use!!)
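
As a rough illustration of the custom-Go-code idea above, here is a minimal standalone sketch. It is not Joker's actual code: it assumes plain Go types in place of Joker's Vector and file-info map, and the names entry, keepFunc, filteredWalk, and hasSuffixIn are hypothetical, as is the ".ml"/".mli" suffix list. The predicate receives both the path and the os.FileInfo, so the matching can look at info without calling into Joker code for each path.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// entry is a hypothetical stand-in for the per-path map that file-seq builds.
type entry struct {
	Path  string
	Name  string
	IsDir bool
	Size  int64
}

// keepFunc decides whether a walked path belongs in the result.
type keepFunc func(path string, info os.FileInfo) bool

// filteredWalk walks root and collects only the entries accepted by keep.
// Rejected paths never have an entry built for them.
func filteredWalk(root string, keep keepFunc) ([]entry, error) {
	var res []entry
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err // stop the walk and report the error to the caller
		}
		if keep(path, info) {
			res = append(res, entry{
				Path:  path,
				Name:  info.Name(),
				IsDir: info.IsDir(),
				Size:  info.Size(),
			})
		}
		return nil
	})
	return res, err
}

// hasSuffixIn reports whether name ends with any of the given suffixes.
func hasSuffixIn(name string, suffixes []string) bool {
	for _, s := range suffixes {
		if strings.HasSuffix(name, s) {
			return true
		}
	}
	return false
}

func main() {
	suffixes := []string{".ml", ".mli"} // example suffixes only
	files, err := filteredWalk(".", func(path string, info os.FileInfo) bool {
		return !info.IsDir() && hasSuffixIn(info.Name(), suffixes)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, f := range files {
		fmt.Println(f.Path)
	}
}
```

Note that filepath.Walk still visits every item in the tree; the saving in this sketch comes only from skipping result construction for rejected paths.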

jcburley 19:12:39

Well, I got curious, and decided to do a little testing, and the custom-Go-code approach does seem to save a little time:

craig@pony:~/go$ xtime ./src/github.com/candid82/joker/joker -e '(doseq [f (joker.filepath/my-file-seq ".")] (println (:name f)))' | wc -l
u=1.14 s=0.29 r=0.95 cpu=150% kBresavg=0 kBresmax=44736 kBundata=0 kBunstack=0 kBtext=0 Bpagsiz=4096 kBavgtot=0
  fsin=0 fsout=0 sockrcv=0 socksnt=0 pfmaj=0 pfmin=11339 vol=178 invol=2806 signals=320 swaps=0
  rc=0 ./src/github.com/candid82/joker/joker -e (doseq [f (joker.filepath/my-file-seq ".")] (println (:name f)))
   18290
craig@pony:~/go$ xtime joker -e '(doseq [f (joker.filepath/file-seq ".")] (or (joker.string/ends-with? (:name f) ".go") (println (:name f))))' | wc -l
u=1.29 s=0.32 r=1.15 cpu=140% kBresavg=0 kBresmax=55424 kBundata=0 kBunstack=0 kBtext=0 Bpagsiz=4096 kBavgtot=0
  fsin=0 fsout=0 sockrcv=0 socksnt=0 pfmaj=2051 pfmin=11983 vol=611 invol=3866 signals=311 swaps=0
  rc=0 joker -e (doseq [f (joker.filepath/file-seq ".")] (or (joker.string/ends-with? (:name f) ".go") (println (:name f))))
   18290
craig@pony:~/go$ xtime find . \! -name "*.go" | wc -l
u=0.06 s=0.23 r=0.30 cpu=99% kBresavg=0 kBresmax=1008 kBundata=0 kBunstack=0 kBtext=0 Bpagsiz=4096 kBavgtot=0
  fsin=0 fsout=0 sockrcv=0 socksnt=0 pfmaj=9 pfmin=377 vol=0 invol=120 signals=0 swaps=0
  rc=0 find . ! -name *.go
   18290
craig@pony:~/go$

jcburley 19:12:09

So the custom approach takes about 1.14s of user CPU time (0.95s elapsed), while the vanilla Joker approach takes 1.29s (1.15s elapsed), versus find at 0.06s (0.30s elapsed).

jcburley 19:12:31

Here are the two pertinent diffs (a third appears after building via ./run.sh due to autogeneration of Go code):

diff --git a/std/filepath.joke b/std/filepath.joke
index d6a860d1..4f3f9995 100644
--- a/std/filepath.joke
+++ b/std/filepath.joke
@@ -9,6 +9,13 @@
   :go "fileSeq(root)"}
   [^String root])
 
+(defn my-file-seq
+  "Returns a seq of maps with info about files or directories under root, exception
+  for those with names ending in '*.go'."
+  {:added "1.0"
+  :go "myFileSeq(root)"}
+  [^String root])
+
 (defn ^String abs
   "Returns an absolute representation of path. If the path is not absolute it will be
   joined with the current working directory to turn it into an absolute path.
diff --git a/std/filepath/filepath_native.go b/std/filepath/filepath_native.go
index 0af7f449..416b75b8 100644
--- a/std/filepath/filepath_native.go
+++ b/std/filepath/filepath_native.go
@@ -3,6 +3,7 @@ package filepath
 import (
        "os"
        "path/filepath"
+       "strings"
 
        . ""
 )
@@ -17,3 +18,17 @@ func fileSeq(root string) *Vector {
        })
        return res
 }
+
+func myFileSeq(root string) *Vector {
+       res := EmptyVector()
+       filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
+               PanicOnErr(err)
+               if strings.HasSuffix(path, ".go") {
+                       return nil
+               }
+               m := FileInfoMap(path, info)
+               res = res.Conjoin(m)
+               return nil
+       })
+       return res
+}
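
To keep only names that match a list of patterns (such as "*.ml") instead of skipping ".go" files, the suffix test in myFileSeq could be swapped for glob matching. The following is a hedged, standalone sketch under some assumptions: matchesAny and globFileSeq are hypothetical names, the result is a plain slice of paths instead of a Joker Vector, and errors are returned to the caller instead of going through PanicOnErr.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// matchesAny reports whether name matches at least one shell-style glob
// pattern, using filepath.Match from the Go standard library.
func matchesAny(name string, patterns []string) (bool, error) {
	for _, p := range patterns {
		ok, err := filepath.Match(p, name)
		if err != nil {
			return false, err // malformed pattern
		}
		if ok {
			return true, nil
		}
	}
	return false, nil
}

// globFileSeq walks root and returns the paths of non-directory entries
// whose base name matches any of the patterns.
func globFileSeq(root string, patterns []string) ([]string, error) {
	var res []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil // descend into directories but do not report them
		}
		ok, err := matchesAny(info.Name(), patterns)
		if err != nil {
			return err
		}
		if ok {
			res = append(res, path)
		}
		return nil
	})
	return res, err
}

func main() {
	paths, err := globFileSeq(".", []string{"*.ml"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Matching against info.Name() rather than the full path keeps patterns like "*.ml" simple, since "*" in filepath.Match does not cross path separators.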
