#clojuredesign-podcast
2024-01-25
neumann19:01:06

How do you test code that is littered with I/O? Aside from the I/O, is there anything left worth testing? Can the REPL and tests work together? In our latest episode, we start testing our code only to discover we need the whole world running first! https://clojuredesign.club/episode/108-testify/

neumann19:01:33

How would you describe the difference between REPL-driven and Test-driven development?

neumann20:01:50

@U05254DQM Here's the episode on testing that you asked for! 😁

JR21:01:24

It seems that there's a similarity between the extractor methods + the data they create and the DDD idea of an anti-corruption layer. Both protect you from changes in the services you're consuming by reshaping the response so the data you're working with is closer to your problem domain. Do I have that right?

neumann23:01:18

@U02PB3ZMAHH I'd say they share the same goal of decoupling your application logic from the schema of the external system. In a practical sense, the recommendations I've seen for an anti-corruption layer treat it more like a proxy service in a microservice environment. The internal services call the anti-corruption proxy instead of calling the external API directly. It's outside the scope of Sportify! (a monolith), but in general, I'm not a fan of proxy services. I believe a service should integrate directly with its external dependencies but limit that surface area as much as possible through a clear "ingestion transform". If you have a number of services that need data from an external system, at some point it may make sense to create an internally shared view of that external data. If so, I would recommend a journal-oriented dataflow (not microservice) architecture. That's a whole different conversation.
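As a rough illustration of what such an "ingestion transform" might look like in Clojure (the external response shape and field names here are made up, not from Sportify!):
```clojure
;; Sketch: the only place that knows the external API's schema.
;; "matchId", "startTimeUtc", etc. are hypothetical external field names.
(defn match->game
  "Transform one raw match map from the external API into internal game data."
  [raw-match]
  {:game/id         (get raw-match "matchId")
   :game/started-at (get raw-match "startTimeUtc")
   :game/home-team  (get-in raw-match ["teams" "home" "name"])
   :game/away-team  (get-in raw-match ["teams" "away" "name"])})

;; Everything downstream works with :game/* keys; only this namespace
;; has to change when the external API changes.
(defn ingest-matches [api-response]
  (mapv match->game (get api-response "matches")))
```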

👍 2
neumann23:01:41

For what it's worth, I think journal-based approaches are the way to go for integrations involving lots of data. If you've never read it, an influential and formative article in this space is: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying by Jay Kreps (co-creator of Kafka).

neumann23:01:37

When working for "Big Esports", @U0510902N and I created services that would poll microservice APIs for changes and generate changelogs of data. When the engineers of one such microservice couldn't figure out what happened to their data, we provided them with changelogs of their own data, which they then used to debug their own service. That led them to add more logging to their service for the future, but I can't help but believe that the microservice backed by a big, mutable database is a broken paradigm. For what it's worth, that experience contributed to the thinking we express in https://clojuredesign.club/episode/029-problem-unknown-log-lines/.

phronmophobic07:02:16

I think the secret to these types of pipelines is that you can reify them as workflows. The workflow is composed of an acyclic graph of steps, and each step can be characterized by:
• the steps it depends on,
• the inputs it requires,
• the outputs it produces.
You can then have a workflow runner that does the dirty work of running each step and keeping track of the inputs and outputs. You can guarantee that steps will either:
• succeed,
• fail, or
• timeout.
If steps are executed in process, you may have to worry about them toppling over the whole program by going into an infinite loop, gobbling up all the program's memory, or otherwise, but that's not always a big issue. An alternative is to run steps out of process or on another machine, but that's often overkill. Anyway, the key idea is that it's not hard to build trivial steps that always succeed, fail, or timeout, and you can then build your workflow runner to handle these 3 cases. Once you know your workflow runner can handle these three cases, you're free to plug in any real-world steps you want. You can then slowly add in logging, automatic retries, pausing, resuming, partial reruns, manual recovery, progress tracking, resource monitoring, etc. as needed.
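A minimal in-process sketch of such a workflow runner, in Clojure (the step map shape, result tags, and timeout mechanism are illustrative assumptions, not any particular library's API):
```clojure
(defn run-step
  "Run one step's function against the results so far; tag the outcome
   as :success, :failure, or :timeout."
  [{:keys [f timeout-ms] :or {timeout-ms 5000}} results]
  (let [fut    (future
                 (try {:status :success :output (f results)}
                      (catch Exception e {:status :failure :error e})))
        result (deref fut timeout-ms ::timeout)]
    (if (= result ::timeout)
      (do (future-cancel fut) {:status :timeout})
      result)))

(defn runnable?
  "A step may run once every step it depends on has succeeded."
  [results {:keys [deps]}]
  (every? #(= :success (get-in results [% :status])) deps))

(defn run-workflow
  "steps is a map of step-name -> {:deps [...], :f (fn [results] ...), :timeout-ms n}.
   Runs each runnable step in turn; returns step-name -> tagged result.
   Steps whose dependencies never succeed are simply left unrun."
  [steps]
  (loop [pending steps
         results {}]
    (if-let [[nm step] (some (fn [[n s]] (when (runnable? results s) [n s])) pending)]
      (recur (dissoc pending nm)
             (assoc results nm (run-step step results)))
      results)))

;; Example:
(run-workflow
 {:fetch {:deps [] :f (fn [_] [1 2 3])}
  :sum   {:deps [:fetch]
          :f    (fn [results] (reduce + (get-in results [:fetch :output])))}})
;; => {:fetch {:status :success, :output [1 2 3]}
;;     :sum   {:status :success, :output 6}}
```
Logging, retries, pausing, and the rest would layer on top of the three tagged outcomes.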

Nick23:02:08

@U7RJTCH6J Is there a library you use for the "workflow runner" or do you write your own for each project/application based on it's requirements?

phronmophobic23:02:17

I've used a few libraries and frameworks that handle workflows in the large (e.g. AWS Data Pipeline, Apache Storm). I've written some ad hoc in-process workflow runners for various projects, but haven't taken the time to wrap them in a library. I really wish there was a good in-process workflow library, but I'm not aware of one.

Nick23:02:48

I wish for the same. I've used "plumbing" from the Prismatic guys, and it was nice in parts. But what you wrote above (a library that supports the three cases and then elegantly allows you to add in the other things as needed) would be awesome.

phronmophobic23:02:12

Yea, I would also love it to support showing the workflow state, so you can make a simple UI that lets you cancel, pause, and resume tasks.

Nick23:02:03

Yes, that's a great callout as well. In the plumbing stuff we did, since the return value is a graph that we could parse, we were able to generate Graphviz diagrams of the code, which are helpful for getting oriented and understanding what's going on (especially if you didn't write the code). Having a UI for cancel, pause, and resume would be a great next step.
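A rough sketch of that idea, assuming the step dependencies have already been pulled out of the graph into a plain map (this is not plumbing's actual API, just an illustration of emitting Graphviz DOT text):
```clojure
(defn deps->dot
  "Turn a map of step-name -> [dependency ...] into Graphviz DOT text."
  [deps]
  (str "digraph workflow {\n"
       (apply str
              (for [[step ds] deps
                    d ds]
                (str "  \"" (name d) "\" -> \"" (name step) "\";\n")))
       "}\n"))

(println (deps->dot {:fetch [] :sum [:fetch] :report [:sum :fetch]}))
;; digraph workflow {
;;   "fetch" -> "sum";
;;   "sum" -> "report";
;;   "fetch" -> "report";
;; }
```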

phronmophobic23:02:45

There are a couple libraries in the space though:
• https://github.com/nubank/nodely
• various stuff from https://twitter.com/ryrobes including https://github.com/ryrobes/flowmaps
• and more listed at https://clojurians.slack.com/archives/CQT1NFF4L/p1657482280025899
When I've looked into it previously, they all seemed to be missing some feature I was looking for.

neumann22:02:24

@U7RJTCH6J Thanks for sharing that list! I need to go try some of these out!