#clara
2018-06-05
jdt12:06:03

Any tips on analyzing where I'm spending time in rules? (for example, which rules).

mikerod13:06:08

@dave.tenny should be possible to use Clara tracing to take a look at something like that. Although it’s exploratory to find what you want given the data structures it returns. You could get something like count of times things were evaluated that way.

mikerod13:06:47

It’d be nice if Clara had more supporting functions for this purpose though.

jdt13:06:57

There's also some minimum number of queries in addition to the above rules.

jdt13:06:36

Between each fire-rules the dispatcher will select one job per job type (so 15 in this numbers case), dispatch it, update the ActiveUserJobCount and WorkerResource facts, and retract the dispatched JobRequest facts.

jdt13:06:10

So I'm averaging about 1-3 seconds per job dispatch, and whacking the hell out of my CPU. (Memory footprint is good however... surprise!)

jdt13:06:35

I had originally hoped to do the fact maintenance for active user job counts (used for round-robin eligibility consideration) and worker resource stats (to track remaining worker capacity) in the RHS of rules. But I gave that up early in the process because it was a side-effect-oriented process with a bunch of [:not A] => A scenarios. That might have been doable, but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually "in parallel" and not sequential with respect to cause and effect for a given rule.
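[Editor's note: for concreteness, here is a minimal sketch of the [:not A] => A shape being described. The Dispatched fact type and the rule itself are invented for illustration; only JobRequest comes from the discussion.]

(require '[clara.rules :refer [defrule insert!]])

;; Hypothetical [:not A] => A rule: the LHS requires the absence of the very
;; fact the RHS inserts, so the outcome hinges on when the engine re-evaluates
;; the :not relative to the RHS effects -- the cause-and-effect timing
;; described above.
(defrule dispatch-when-not-dispatched
  [JobRequest (= ?id job-id)]
  [:not [Dispatched (= ?id job-id)]]
  =>
  (insert! (->Dispatched ?id)))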

jdt13:06:40

So I compute some eligible things, then do the dispatching and accounting between calls to fire-rules, then do it all over again (always saving and continuing from the updated session).
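[Editor's note: that outer loop might look roughly like the sketch below. initial-session, eligible-jobs-q, dispatch-one-per-job-type!, retract-dispatched, and update-accounting are all hypothetical placeholders for the queries and bookkeeping described, not actual code from the module.]

(require '[clara.rules :refer [fire-rules query]])

;; Sketch of the dispatch loop: fire rules, read off eligible jobs, do the
;; dispatching/accounting outside the rules, then continue from the updated
;; session. All helper names here are invented placeholders.
(loop [session (fire-rules initial-session)]
  (let [jobs (map :?job (query session eligible-jobs-q))]
    (when (seq jobs)
      (let [dispatched (dispatch-one-per-job-type! jobs)]
        (recur (-> session
                   (retract-dispatched dispatched) ; retract dispatched JobRequest facts
                   (update-accounting dispatched)  ; ActiveUserJobCount / WorkerResource
                   fire-rules))))))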

mikerod13:06:01

> but it was really the timing of the rule LHS evaluations that killed it, since LHS evaluations are perceptually "in parallel" and not sequential with respect to cause and effect for a given rule.

Only the case for insert-unconditional! (and possibly even a defect), but yeah. If you have to extract these facts after they are "done", it may make the most sense for it to be an external thing anyways.

mikerod13:06:26

@dave.tenny I can look at your rules from above some; that sounds too slow

jdt13:06:15

Happy to share the whole module, nothing but some pretty rules and ugly code to do the bookkeeping and setup mock data and such

mikerod13:06:05

when something takes long, in that range of seconds (like 10+), I tend to do a CPU sampling

jdt13:06:09

Job dispatch and completion is all about accounting then trying again with updated rules.

mikerod13:06:20

For a profiler, I tend to use VisualVM because it is good enough for this situation

mikerod13:06:47

sometimes it gives quicker leads to what is causing issues. Sometimes it is too opaque unless you know the rule engine internals, but not necessarily always

jdt13:06:57

Yeah, I'm having problems with VisualVM on my Linux system because of some JNI/Jar problem I can't figure out, and HPROF sampling is usually useless

mikerod13:06:11

but also, just doing some blunt counting of the times that certain conditions in rules and/or rule RHSes were fired can sometimes give you the outliers as well

mikerod13:06:38

Yeah, VisualVM with “cpu sampler” is what I have used. If things are whacky there, not sure, hah

jdt13:06:40

Yeah, I'm looking at instrumenting. Is it possible to capture the time spent in each LHS condition via instrumentation?

mikerod13:06:58

well, time will obviously go to the instrumentation/tracing you do

mikerod13:06:05

but if you do something like just gather counts

mikerod13:06:11

that is not time-sensitive anyways

jdt13:06:35

Re the rules, any obvious stupid hash-join failures or things that might be better done in predicates or with accumulators?

mikerod13:06:36

when I’ve tried “counting condition/rhs” firings before in bad performance cases, there are often outliers

mikerod13:06:48

Like most things evaluate perhaps no more than 1k times, for example

mikerod13:06:56

but then you find something that evaluates like 1 mil times

mikerod13:06:06

typically it’s something like that

jdt13:06:16

In my case I'm worried it's the tests in the conditions that are firing a lot, but I'll get more data.

mikerod13:06:55

So even your rule like worker-viable-jobs may be pretty heavy

mikerod13:06:06

I’m just looking over what you had and your sort of fact counts

jdt13:06:17

For example, if I'm doing N^2 firings on the worker-viable-jobs conditions for the 100k job requests and 45 worker-resource facts.

mikerod13:06:26

you have 100K JobRequest facts and 45 workers, and you also have a :not in that rule

mikerod13:06:35

so you are in something like 100K x 100K x 45 territory
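[Editor's note: back-of-envelope on those numbers.]

;; rough upper bound on constraint evaluations when a rule joins the 100K
;; JobRequest facts against themselves (via the :not) for each of 45 workers
(* 100000 100000 45)
;; => 450000000000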

jdt13:06:24

Right, is there a way to achieve the semantics there without the combinatorics?

mikerod13:06:34

probably just doing millions of comparisons with clj = and < etc and it adds up

mikerod13:06:43

yes, I think there is a way out of it

jdt13:06:18

This is where, if I could update the worker resources in the RHS and immediately prune the possibilities for the next firing of worker-viable-job, it would have been a win. But that doesn't work, because all the worker-viable-jobs are going to fire regardless of whether I update the worker resources, due to that seemingly parallel LHS evaluation protocol.

mikerod13:06:45

Extract a rule: find the oldest jobs first, then only bring those into the join with the WorkerResource rule here

mikerod13:06:21

in general, you have a lot of JobRequest facts to deal with. You want to avoid any rule that may do a join across that set of facts with itself

mikerod13:06:41

I’m not sure how memory and threads factor into an “oldest job” situation

mikerod13:06:47

so hard to give you an example

jdt13:06:47

Except we'll still need to potentially consider the next oldest, and so on, until we find one that fits the available worker resources, so does an extra rule really help?

mikerod13:06:03

[:not [JobRequest (< job-id ?job-id)
                  (= ?job-type job-type)
                  (= ?user-id user-id)
                  (<= threads ?threads)
                  (<= memory ?memory)]]
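[Editor's note: one way the suggested refactor might look. The OldestJob record, the WorkerViableJob result fact, and the exact constraints are assumed from context rather than taken from the real rules. The JobRequest self-join still happens in the first rule, but it is no longer multiplied across the 45 WorkerResource facts.]

(defrule oldest-viable-job
  ;; pick a JobRequest with no older (lower job-id) job of the same
  ;; user/job-type needing no more resources -- mirroring the :not above
  [JobRequest (= ?job-id job-id)
              (= ?job-type job-type)
              (= ?user-id user-id)
              (= ?threads threads)
              (= ?memory memory)]
  [:not [JobRequest (< job-id ?job-id)
                    (= ?job-type job-type)
                    (= ?user-id user-id)
                    (<= threads ?threads)
                    (<= memory ?memory)]]
  =>
  (insert! (->OldestJob ?job-id ?job-type ?user-id ?threads ?memory)))

(defrule worker-viable-jobs
  ;; only the much smaller OldestJob set joins against WorkerResource
  [OldestJob (= ?job-id job-id) (= ?threads threads) (= ?memory memory)]
  [WorkerResource (<= ?threads threads) (<= ?memory memory)]
  =>
  (insert! (->WorkerViableJob ?job-id)))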

jdt13:06:21

The memory and thread tests are to test only for jobs that can actually execute on the worker resources, i.e. for which enough resources exist.

jdt13:06:54

That's the "viable" part of the semantics

mikerod13:06:01

the rule can help, just have to figure out what the correct rule is

mikerod13:06:30

thinking about it

jdt13:06:57

Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.

jdt13:06:13

re: the "oldest job" stuff, I was wondering if accumulators would in any way help, I have no idea how they're implemented w.r..t incremental fact maintenance.

mikerod14:06:28

> Any advice for a beginner on how to effectively use the tracing API here? The one time I tried it there was too much data to process, and that was on the simplest most minimal amount of facts.

I haven’t used it as much as I’d expect. I was used to rolling my own stuff prior to when the tracing stuff was introduced. However, for counting, I believe you can do something like:

(let [traced (-> (clara.rules/mk-session <your rules>)
                 (clara.tools.tracing/with-tracing)
                 (insert <your facts>)
                 fire-rules
                 clara.tools.tracing/get-trace)]
  ;; count how many times each Rete node appears in the trace
  (frequencies (map :node-id traced)))

mikerod14:06:50

or perhaps better sorted:

(let [tr (-> (clara.rules/mk-session [temp-rule])
             (t/with-tracing)
             (insert (->Temperature 10 "MCI")
                     (->Temperature 20 "MCI"))
             (fire-rules)
             (t/get-trace))]
  (->> (map :node-id tr)
       frequencies
       (sort-by val)
       reverse))

mikerod14:06:10

once you know the highest-count :node-ids, you can look them up in the rulebase associated with the session

jdt14:06:35

excellent, thanks

mikerod14:06:32

(let [session (-> (clara.rules/mk-session <your rules>)
                  (t/with-tracing)
                  (insert <your facts>)
                  (fire-rules))

      trace (t/get-trace session)
      node-id <whatever node id in question from `trace`>

      {:keys [rulebase]} (clara.rules.engine/components session)
      {:keys [id-to-node]} rulebase]
  (get id-to-node node-id))

This is how you could look up the node-id to try to find what node in the engine it is.

mikerod14:06:52

a node will be a defrecord of stuff, not all that readable to you, but you should be able to recognize aspects of it and align it back to your rules typically

jdt15:06:13

Thanks Mike, I'll have a look

jdt16:06:05

@mikerod as I am still new at accumulators, I'm trying to discern the purpose of the ?user-id binding on line 30 in the above snippet. Is it used so that the rule will fire once for each distinct user id? Does it work given that ?user-id is not otherwise bound?

mikerod16:06:58

@dave.tenny
> Is it used so that the rule will fire once for each distinct user id?

Yes, that’s its purpose in that example

mikerod16:06:54

which makes me realize I had a typo there, it should have been (= ?user-id (-> this :worker-viable-job :user-id)) since it was nested one level lower, will update. Sorry if that caused confusion.

mikerod16:06:43

Accumulator behavior with field-level bindings like that is explained more at http://www.clara-rules.org/docs/accumulators/
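[Editor's note: a minimal sketch of that grouping behavior, with the OldestUserJob fact invented for illustration. Because ?user-id is bound inside the accumulated condition, the accumulator runs once per distinct user-id and the rule fires once per group.]

(require '[clara.rules.accumulators :as acc])

;; With :returns-fact true, ?oldest is the whole JobRequest fact with the
;; smallest :job-id within each ?user-id group.
(defrule oldest-job-per-user
  [?oldest <- (acc/min :job-id :returns-fact true)
   :from [JobRequest (= ?user-id user-id)]]
  =>
  (insert! (->OldestUserJob ?user-id (:job-id ?oldest))))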

jdt16:06:05

I'm getting null pointer exceptions on calls to < that I think are from the accumulator, but maybe from something else; unfortunately the stack trace doesn't clue me in beyond that it's in (fire-rules), pretty much. However, acc/min isn't documented to accept an :initial-value argument, so it's probably something else. Just in case something obvious is missing above.

jdt16:06:04

Ah wait, maybe it's because I removed an explicit binding of :user-id in the candidate rule.

mikerod16:06:44

@dave.tenny more typos because I forgot the thing was nested

mikerod17:06:00

(acc/min (comp :job-id :worker-viable-job) :returns-fact true); I updated the snippet above to reflect that

mikerod17:06:18

if :job-id may be nil though, would have to defend against it
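[Editor's note: one way to defend against that, sketched with a hypothetical CandidateJob fact wrapping a :worker-viable-job map: constrain the accumulated condition so facts with a nil nested :job-id never reach acc/min's comparison.]

(defrule pick-oldest-candidate
  ;; the (some? ...) constraint filters out nil job-ids before accumulation
  [?oldest <- (acc/min (comp :job-id :worker-viable-job) :returns-fact true)
   :from [CandidateJob (some? (-> this :worker-viable-job :job-id))]]
  =>
  (insert! (->OldestCandidate ?oldest)))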