2024-10-10 biff | Clojure Slack Archive

biff 2024-10-10

Biut 2024-10-10T15:03:10.113629Z

I am trying to deploy to EC2 and keep getting "protocol version mismatch" from rsync. Trying different things to upgrade rsync on an Intel Mac running macOS 12.7 was a waste of a day :man-facepalming: and downgrading rsync on EC2 didn't help either. Any suggestions? (I am thinking about modifying the biff tasks to disable rsync and just push prod with git.)

2024-10-11T23:37:10.476579Z

awesome! just tagged a release, so you can do :git/tag "v1.8.23", :git/sha "4b85074" now

2024-10-10T15:06:34.230539Z

Yeah, if you can't get rsync to work easily, deploying with git would be a good approach. I'd be happy to add a config option later today to use git for deploys even if rsync is on the path.

2024-10-10T16:14:01.955309Z

I just pushed a commit to the dev branch. You can update your com.biffweb/tasks dependency (i.e. the one under :aliases, not under :deps) to :git/sha "4b850744b9548232cc428a7e0199974920f18e69". Then add :biff.tasks/deploy-with :git to resources/config.edn. I've tested it on my end; let me know if it works on your end and then I'll merge it into master.
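
For readers following along, the two changes described above might look roughly like the fragments below. This is only a sketch: the alias name (:dev), the :git/url, and the :deps/root value are assumptions based on a typical Biff project layout and are not stated in the thread. Keep whatever your project already has and change only the :git/sha.

```clojure
;; deps.edn (fragment). The alias name, :git/url and :deps/root are assumptions;
;; only the :git/sha value comes from the message above.
{:aliases
 {:dev {:extra-deps
        {com.biffweb/tasks {:git/url   "https://github.com/jacobobryant/biff"
                            :git/sha   "4b850744b9548232cc428a7e0199974920f18e69"
                            :deps/root "libs/tasks"}}}}}

;; resources/config.edn (fragment): add this key alongside the existing ones.
{:biff.tasks/deploy-with :git}
```

With that key set, the deploy task should use git even when rsync is on the path, per the description above.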

Biut 2024-10-11T04:47:09.364099Z

Needed some more tinkering on the AWS side, but now it deploys as intended. Thank you!

🙌 1

Dallas Surewood 2024-10-10T18:36:58.337399Z

Reading through https://biffweb.com/p/indexes-2/ and somewhat confused about what problem RocksDB solved. It looks like there are two bottlenecks identified here:
• We don't want to store entire RSS feeds in XTDB because of the lack of structural sharing
• Making an index from your current Yakread architecture is hard because you have to query XTDB instead of getting an entity, which means indexers slow down the write speed to XTDB.
I see how RocksDB solves the first problem but not the second. Are you not still querying from XTDB when updating the RocksDB index?

2024-10-11T17:24:23.542379Z

yeah, currently I append it to a set and then count the set size to update the # of total posts. The code is this specifically:

(defn rss-indexer [{:biff.index/keys [index-get op doc]}]
  (cond
    (and (= op ::xt/put) (:item.rss/feed-url doc)) ; An RSS post document was created/updated
    (let [{item-id :xt/id :keys [item.rss/feed-url item/fetched-at]} doc
          feed-items-id  [:feed-items feed-url]
          old-feed-items (or (index-get feed-items-id) #{}) ; Get the current set of post IDs for the associated RSS feed
          new-feed-items (conj old-feed-items item-id)]
      (merge (when (not= old-feed-items new-feed-items)
               {feed-items-id    new-feed-items ; if this was a new post, update the set in RocksDB
                [:feed feed-url] {:feed/last-published fetched-at
                                  :feed/n-items (count new-feed-items)}}) ; Also update :feed/n-items for the given feed
             (when-not (index-get [:item-feed-url item-id])
               {[:item-feed-url item-id] feed-url})))
    ...

2024-10-11T01:27:17.150559Z

I don't think I wrote that post very clearly, because even I'm scratching my head as I re-read part of it... Anyway, after reviewing the code a little, the issue is that the index needs to keep track of how many total items an RSS feed has, and to do that it needs to maintain a mapping from RSS feeds to sets of post IDs. Then when the indexer gets a new XT transaction that includes a ::xt/put operation on a post document, it can use that mapping to find out if that post is new (since ::xt/put doesn't tell us if we're creating or updating a document) and update the # of total items accordingly.

So then the problem is: how exactly should that mapping be stored? If we store it in a single document, we'll potentially use a ton of disk space due to the lack of structural sharing in XTDB. If we store it across many documents (e.g. a document for each post, like {:xt/id ..., :post/id "abc", :feed/id 123}), then the only way for the indexer to ask "given a feed ID, what is the set of post IDs belonging to that feed?" is to do an xt/q call, which would be too slow to do during indexing. So RocksDB lets us do the first approach of storing the entire mapping in a single document without making our disk usage balloon.

(While typing this up, I realized that for this specific case, since the only thing I'm actually using the feed->posts mapping for is to find out if transacted posts are being created or updated, I don't actually need the entire mapping at once. I could've just created a separate document for each post, then in future transactions done an xt/entity call to see if a post exists, and incremented the total post # if not. I'm still happy I migrated the index implementation to RocksDB though; there will likely be other cases where you really do need to store a large mapping in one document, and even in the other cases, the RocksDB index is likely faster and uses less disk space.)
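
To make the "one marker per post" idea from that last parenthetical concrete, here is a minimal sketch. It is not Biff's or Yakread's actual code: the index is modeled as a plain in-memory map, index-get is just a lookup closure over it, and count-indexer, apply-put and final-index are hypothetical names. The point is only that a per-post existence check is enough to tell a create from an update, so the count can be incremented without holding the whole feed->posts set.

```clojure
;; Hypothetical sketch: the "index" is a plain map, not RocksDB.
(defn count-indexer
  "Returns a map of index writes for one put of an RSS item doc,
  or nil when the item was already seen (i.e. this put is an update)."
  [{:keys [index-get doc]}]
  (let [{item-id :xt/id :keys [item.rss/feed-url]} doc
        seen-key [:item-feed-url item-id]
        feed-key [:feed feed-url]]
    (when (and feed-url (nil? (index-get seen-key)))
      ;; First time this item is seen: mark it and bump the feed's count.
      {seen-key feed-url
       feed-key (update (or (index-get feed-key) {:feed/n-items 0})
                        :feed/n-items inc)})))

(defn apply-put
  "Applies one put to the in-memory index map and returns the new map."
  [index doc]
  (merge index (count-indexer {:index-get #(get index %) :doc doc})))

;; Two distinct items plus one repeat of item 1 (an update, not a create).
(def final-index
  (reduce apply-put {}
          [{:xt/id 1 :item.rss/feed-url "https://example.com/feed"}
           {:xt/id 2 :item.rss/feed-url "https://example.com/feed"}
           {:xt/id 1 :item.rss/feed-url "https://example.com/feed"}]))
```

Here the repeated put of item 1 produces no writes, so the feed's :feed/n-items ends at 2 rather than 3.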

Dallas Surewood 2024-10-11T04:10:27.912709Z

So in this case all the indexer would be doing is checking if the new post is already in the RocksDB database, and if not, it appends it?
