#pathom
2023-04-05
Ferdinand Beyer 11:04:22

Hello! We are using Pathom in a probably slightly unorthodox way: As a dependency graph with on-demand computation. Think of it as a build tool, similar to a Makefile: When files change, other files derived from those need to be recomputed. While this works, it is not particularly fast. We have a huge graph with several thousand entities. Often, an attribute that we are interested in is already cached and up-to-date, but Pathom still needs to walk the graph and check the cache for each resolver, which is quite slow. I’m wondering if other people use Pathom in a similar way, and if there are some recommendations for this scenario?

wilkerlucio 17:04:30

hello Ferdinand, Pathom does have some extra overhead when dealing with large collections because, as you said, it still has to scan to make sure that each entity has all the data. Even when it's cached, it still needs to pull from the cache, merge, etc... one thing you can do is use the ^::pco/final meta to take control away from Pathom and handle that part yourself. This is usually done for some expensive part of the query, while still delegating the rest to Pathom. Does that sound like something that could work for your case?
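A minimal sketch of that suggestion, assuming Pathom 3 and made-up attribute names (`:repo/dir`, `:repo/files`, `:file/path`, `:file/content`): the resolver does the expensive work itself and attaches `::pco/final` metadata to the returned collection, so Pathom takes the value as-is instead of scanning and sub-processing each item.

```clojure
(ns example.final-sketch
  (:require [clojure.java.io :as io]
            [com.wsscode.pathom3.connect.operation :as pco]))

;; Hypothetical resolver: builds the whole collection itself and marks it
;; final, so Pathom skips per-item processing for this branch of the query.
(pco/defresolver repo-files-final [{:repo/keys [dir]}]
  {::pco/output [:repo/files]}
  {:repo/files
   (with-meta
     (->> (file-seq (io/file dir))
          (filter #(.isFile ^java.io.File %))
          (mapv (fn [f]
                  {:file/path    (.getPath ^java.io.File f)
                   ;; expensive derived data is computed here, not delegated
                   :file/content (slurp f)})))
     {::pco/final true})})
```

The trade-off is the one mentioned above: inside a final value you are on your own, so anything the query needs for those items has to already be there.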

Ferdinand Beyer 07:04:13

Hi Wilker, thanks for your reply! I'm not sure if I understand ::pco/final correctly. It is used to mark maps/collections returned from a resolver as final, so they are not sub-processed, right? In our case, we do want Pathom to process the collections. For example, one resolver will return a large sequence of file names, and depending on the query these should be sub-processed: read file contents, parse, etc. We do have some other maps that are indeed final. However, they are never sub-queried. Does it still make sense to mark those as final?

wilkerlucio 14:04:19

gotcha, seems like in your case marking that front list as final would defeat the purpose of using Pathom for it

wilkerlucio 14:04:00

I think it will be important to clarify in the Pathom docs the base overhead when processing large collections, because it is noticeable. But I also like to keep an eye out for opportunities to optimize. What creates this overhead is that Pathom still needs to scan every item to make sure all requested attributes are there; not many shortcuts are possible here if we want to keep the "I'll try as much as I can" behavior

wilkerlucio 14:04:51

can you tell me more about your case, and whether you think there could be some way to optimize it to reduce the overhead?

Ferdinand Beyer 14:04:28

Sure! First I want to emphasize that I don’t think this is “Pathom’s fault”; I think we are using it in a way that Pathom was not originally built for.

We have a system where we want to analyze Git repositories. You can think of it as a software project: There are a lot of source files in a repository, and we are providing analyses for these. Depending on the task at hand, we need to read all files or just a few of them. In order to process files, we need to load configuration files that provide context/parameters. We are now using Pathom to describe the relationships between files in this scenario. The goal is that we can load on demand and reuse a lot of code. We also want to be able to invalidate certain results when files change, but ideally only derived values that directly or indirectly depend on the changed file. So it is very similar to an incremental software build process.

In our case, we are in total control over the inputs of the system. In other words: When querying for data, we could stop immediately if we find it in the cache, without having to check whether an entity’s dependencies have changed. On the other hand, when something changes, we would need to update or invalidate everything depending on it.

We are somewhat naive in the way we model this system with Pathom. There’s an entity representing the repository, and a resolver that finds all files in the repository. Each file is an entity, identified by its file path. The contents of a file are another attribute that can be resolved from the file path, so we only read files on demand. Parsing a file is yet another resolver, taking the file contents as input and resolving the parsed result. I figure we could change this somehow, so that a parsed result is cached based on the file path, not the actual resolver input. Then we could short-circuit: If a parsed result for a file path is cached, use it. There would be no need to check whether the file contents are cached.
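For concreteness, a rough sketch of the modeling described above, with made-up names (`:repo/dir`, `:repo/files`, `:file/path`, `:file/content`, `:file/parsed`) and a stub parser. The last resolver shows one possible way to get the "cache the parsed result by file path" short-circuit: it keys a hand-rolled cache on the path, so a cached parse never has to touch the content resolver at all. This is a sketch under those assumptions, not the actual codebase.

```clojure
(ns example.repo-graph
  (:require [clojure.java.io :as io]
            [clojure.string :as str]
            [com.wsscode.pathom3.connect.indexes :as pci]
            [com.wsscode.pathom3.connect.operation :as pco]
            [com.wsscode.pathom3.interface.eql :as p.eql]))

;; Stub for the real parser.
(defn parse-file [content]
  {:line-count (count (str/split-lines content))})

;; One entity per file in the repository, identified by its path.
(pco/defresolver repo-files [{:repo/keys [dir]}]
  {::pco/output [{:repo/files [:file/path]}]}
  {:repo/files (->> (file-seq (io/file dir))
                    (filter #(.isFile ^java.io.File %))
                    (mapv (fn [f] {:file/path (.getPath ^java.io.File f)})))})

;; Read file contents on demand from the path.
(pco/defresolver file-content [{:file/keys [path]}]
  {:file/content (slurp path)})

;; Naive version: parsing depends on :file/content, so Pathom has to
;; resolve (or pull from cache) the content before it can parse.
(pco/defresolver file-parsed [{:file/keys [content]}]
  {:file/parsed (parse-file content)})

;; Path-keyed variant: the resolver takes :file/path as input and keeps its
;; own cache, so a cached parse short-circuits without checking the content.
(defonce parsed-cache (atom {}))

(pco/defresolver file-parsed-by-path [{:file/keys [path]}]
  {:file/parsed
   (or (get @parsed-cache path)
       (let [result (parse-file (slurp path))]
         (swap! parsed-cache assoc path result)
         result))})

(def env (pci/register [repo-files file-content file-parsed]))

(comment
  ;; Only the files reached by the query are read and parsed.
  (p.eql/process env
                 {:repo/dir "."}
                 [{:repo/files [:file/path :file/parsed]}]))
```

The path-keyed variant would of course need its cache invalidated when the underlying file changes, which matches the "invalidate only derived values that depend on the changed file" requirement described above.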