Fork me on GitHub
#datalog
<
2023-08-01
>
vlad_poh19:08:39

Hi I’m trying to get a better feel for datalog and trying to create the Northwind traders database. I went through each column and defined a fairly simple schema. I’m now loading data and i realize i may have gone overboard with global attributes. Suppliers look like this and my question is are the any good directions on schema design. Dos and donts? Querying for suppliers i have to come up with some adhoc combination of attributes or creating a surrogate identifier

{:global/address "Brovallavägen 231",
  :global/companyname "Svensk Sjöföda AB",
  :global/contacttitle "Sales Representative",
  :global/city "Stockholm",
  :global/postalcode "S-123 45",
  :global/phone "08-123 45 67",
  :global/country "Sweden",
  :db/id  -60017,
  :global/contactname "Michael Björn"}
;suppliers
  (clojure.pprint/print-table
   (map first (d/q '[:find   (pull ?e [*])
                      :where
                      ;[?e :global/contactname]
                      [?e :global/contacttitle]
                      ;[?e :global/companyname]
                             ] db)))

Björn Ebbinghaus22:08:34

There is a "Best practices" page in the Datomic docs: https://docs.datomic.com/pro/best-practices.html You can model your data just like in the RDB. Just use the table name (in singular) as your namespace.

{:supplier/id 42    ; not necessary for joins! 
 :supplier/name "Svensk Sjöföda AB"
 :supplier/city "Stockholm"}
You don't need to model the IDs for joins! Nevertheless, it can be useful to have unique IDs as part of your schema. (to use them externally, for example, or to have an attribute that is guaranteed to be present). You model relationships the same with as you would do in an RDB (although you could omit many-to-many tables). Product.SupplierID just becomes :product/supplier or, in case of many-to-many, you would use :product/suppliers.
#:product{:name "Whatever" 
          :supplier #:supplier{:name "Svensk Sjöföda AB"
                               :city "Stockholm"}
          :category {:category/name "Something", :category/description "Lorem dolor ipsum..."}}

; with many categories
#:product{:name "Whatever" 
          :categories [{:category/name "Something", :category/description "Lorem dolor ipsum..."}]}
(if you don't know: #:namespace{:a 1, b 2}is short for: {:namespace/a 1, :namespace/b 2}) The schema would look like this:
{:db/ident :product/category
 :db/type :db.type/ref
 :db/cardinality :db.cardinality/one} ; <- or many if you have many-to-many (of course you could add an entity as well)
Add data:
(d/transact conn {:tx-data [{:supplier/id 42
                             :supplier/name "Svensk Sjöföda AB"
                             :supplier/city "Stockholm"}]})

(d/transact conn 
  {:tx-data [#:category{:db/id "my-temp-id", 
                        :name "Something"
                        :description "Lorem dolor ipsum..."}
             #:product{:name "Whatever", 
                       :supplier [:supplier/id 42], 
                       :category "my-temp-id"}]})
Example of find & pull separation.
(def all-suppliers-query
  '[:find (pull ?supplier pattern)
    :in $ pattern 
    :where [?supplier :supplier/id]])  ; every supplier has a :supplier/id

(d/q all-suppliers-query db 
  [:supplier/id 
   :supplier/name 
   {:product/_supplier [:product/name
                        {:product/category [:category/name]}]}])

; :product/_supplier is a _reverse pull_. Do not model relationships in two directions!

; => 
[{:supplier/id 42
  :supplier/name "Svensk Sjöföda AB"
  :supplier/city "Stockholm"
  :producct/_supplier [#:product{:name "Whatever"
                                 :category {:category/name "Something"}}]}]

vlad_poh01:08:03

@U4VT24ZM3 thanks for that thorough response. Still struggling to see why i would include the supplier id from the original database when i don’t plan to generate new supplierids. Also if supplier/name shows up elsewhere as companyname do i replace companyname with a supplier ref?

Björn Ebbinghaus12:08:16

Sorry for the late reply. What you need or don't need is up to you. Often you want a unique key, that identifies an entity as "something", a supplier in this case, and that you can use outside your database. (Using autoincremented IDs in public is sometimes seen as bad practice, but that discussion is not the point here.)

👍 2
Björn Ebbinghaus12:08:24

I'm not sure if I understand your second point. > Also if supplier/name shows up elsewhere as companyname do i replace companyname with a supplier ref? Do you mean like an alias? Or do you want companies that can fulfil multiple roles? I used this ER diagram for the Northwind Database. https://en.wikiversity.org/wiki/Database_Examples/Northwind#/media/File:Northwind_E-R_Diagram.png

👍 2
Björn Ebbinghaus14:08:29

It's a thing of thinking about what is its own entity and what is not. Customer, Shipper and Supplier could be three totally disjunct things. (like in the ER diagram above) Or they could all share attributes of a Company, like a name. Consider following attributes:

:company/name

:supplier/id
:shipper/id
:customer/id
(Of course, these don't have to be id attributes.) You can mix and match them together in a single entity.
{:company/name "My company"}  ; Company on it's own.

{:company/name "My shipping company"
 :shipper/id 42}

{:company/name "My supplier company"
 :supplier/id 100}    

{:company/name "My supplier company that also ships orders."
 :shipper/id 42
 :supplier/id 200} 

👍 2
vlad_poh21:08:04

@U4VT24ZM3 thanks very helpful!

vlad_poh22:08:48

This is what my https://github.com/kbosompem/northwind-datalevin/blob/main/src/northwind/schema/schema.edn currently looks like and i’m trying to create a query equivalent to the following https://neo4j.com/graphgists/northwind-recommendation-engine/ any guidelines?

Björn Ebbinghaus11:08:13

The equivalent queries to the "Popular Products" would be something like:

[:find ?company ?product (count ?order)
 :where 
 [?order :order/customer ?company]
 [?order-detail :orderdetail/order ?order]
 [?order-detail :orderdetail/product ?product]]

; =>
; #{[42 123 4], [99 567 6]} ; (of course you could `pull` on ?company and ?product to get the fields company and productName)

[:find (pull ?company [:customer/company-name]) (pull ?product [:product/name]) (count ?order)
 :keys :company :product :orders 
 ...]

; =>
; #{{:company {:customer/name "Company 42"}, :product {:product/name "Toothbrush"}, :orders 10000}
    {:company {:customer/name "Company 99"}, :product {:product/name "Box"}, :orders 7654}}
(Note that Relationships in neo4j can have attributes, so while the example just uses :PRODUCT, we have to go through the orderdetail)

vlad_poh11:08:20

Got that one. The second one is hard can’t figure it out match (c:Customer)-[:PURCHASED]->(o:Order)-[:PRODUCT]->(p:Product) <-[:PRODUCT]-(o2:Order)-[:PRODUCT]->(p2:Product)-[:PART_OF]->(:Category)<-[:PART_OF]-(p) WHERE c.customerID = 'ANTON' and NOT( (c)-[:PURCHASED]->(:Order)-[:PRODUCT]->(p2) ) return c.companyName, p.productName as has_purchased, p2.productName as has_also_purchased, count(DISTINCT o2) as occurrences order by occurrences desc limit 5

Björn Ebbinghaus12:08:36

(We are getting into territory where I'm not quite sure, if it's right, without testing it. 🙂) This can get unwieldy (just like the Cypher query) and you might want to start to write some rules:

(def rules
  '[[(purchased ?customer ?product)
     (purchased ?customer ?product _)]

    [(purchased ?customer ?product ?no-of-orders)
     [?order :order/customer ?customer]
     [?order-detail :orderdetail/order ?order]
     [?order-detail :orderdetail/product ?product]
     [(count ?order) ?no-of-orders]]])

[:find ?customer ?product-we-purchased ?product-others-purchased (sum ?no-of-orders)
 :in $ % ?customer        ; $ = database, % = ruleset
 :where 
 (purchased ?customer ?product-we-purchased)             ; get all products purchased by our customer
 (purchased ?other-customer ?product-we-purchased)       ; get all customers who purchased these products
 [(not= ?customer ?other-customer)]                      ; exlude our own customer
 (purchased ?other-customer ?product-others-purchased ?no-of-orders)   

 [?product-we-purchased :product/category ?category]     ; select categories of products we purchased
 [?product-others-purchased :product/category ?category] ; select products others purchased with the category

 (not (purchased ?customer ?product-others-purchased))]  ; filter out all products that we already purchased

; input: db rules [:customer/id "ANTON"]

Björn Ebbinghaus13:08:32

Rules are where the real Datalog shows its teeth. If you ever worked with Prolog the rules should look similar to you.

[(same-category ?p1 ?p2 ?category)
 [?p1 :product/category ?category]
 [?p2 :product/category ?category]]
You can use this rule with as much information as you like:
[:find ?p1 ?p2
 :where
 [?category :category/name "bathroom"]
 (same-category ?p1 ?p2 ?category)]

; returns Cartesian product of all products in the category "bathroom". 
=> #{["toothbrush" "toothbrush"], ["toothbrush" "dental floss"], ["dental floss" "toothbrush"], ...}
You can use rules in rules, even recursively. (`_` is a "blank" and means "we don't care")
[(same-category ?p1 ?p2)
 (same-category ?p1 ?p2 _)] 

[(others-in-same-category ?p1 ?p2)
 (same-category ?p1 ?p2)
 [(not= ?p1 ?p2)]]

vlad_poh20:08:34

Interesting!! How performant are rules? There is also a sense/question that i am struggling to articulate. It goes something like this : a sql table is a known quantity and i can get a count of rows and be certain that there are exactly x rows whereas a datalog count of entities that use a single entity feels wrong as any entity can use that attribute.

Björn Ebbinghaus21:08:57

Performance can be highly dependent on the implementation. In Datomic for example you don’t have to write a single large query, like you would with other databases. You can consider data to be local already. I don’t know specifics about rules, but when you use it for recursive queries it will (naturally) become slow. Query performance in general can be highly dependent on the data you have. And the order of the where-clauses can have a large impact on memory/speed.