
Hello, I'm writing a program that fetches specified files from several FTP servers our counterparts provide, decrypts them, identifies them, validates them, does some very basic ETL, then puts the data somewhere. I'm right at the beginning of this pipeline and have a choice of two designs.
In the first, I have a component (an AWS Lambda) which periodically polls a counterpart FTP and determines whether any new files have been put on it. It emits one event for every new file and puts that event on a message queue. This triggers a state machine instance (probably an AWS Step Function) to spin up, which checks the file name from the event against a list of files we're interested in. If we are interested in it, the state machine downloads the file and does all the other stuff I mentioned above.
In the second design, I have a component which also does the looking on the counterpart FTP, but here it is this component which has to decide whether or not to download the file. The file is downloaded to an S3 bucket, and it's this action which triggers the event and the state machine, which does everything I mentioned above except, obviously, the downloading.
The second design has the benefit of conserving the connection to the FTP and keeping all the FTP actions in one place, doing it all in one swoop. But it also bifurcates the process into two parts: one scoped to the FTP, covering multiple files, and one scoped to the file itself. The first design maintains the same tight scope to the file throughout the process, which I feel has a nice cohesion, and keeps the 'story' of the file processing in a single place - one file has one state machine instance all the way through. I think this will have benefits in being able to stitch everything together, especially when it comes to presenting that story to users.
With the first design I also have the option to plug into the event that effectively says "Hey, x sent you a file", which will be useful - for example if I have another component which is monitoring for late files and chasing the counterpart. I also like the separation of responsibilities here, with the first component narrowly scoped to "This file was just sent", though I think the second approach is also reasonable. I'd be interested in hearing people's reactions to these two approaches, or how I should approach the decision. Thanks!


I’ve built and operated one of these. It was an in-datacenter solution, built in Python using Tornado, and it ran as a single state machine process. The process also exposed the state machines via a web UI that allowed for visual monitoring and manual re-fetching/re-processing. We also had the notions of a “counterparty FTP host” and “files from an FTP”, so we could manage the state machine at a file level but still have a single connection to an FTP for all the files that came from it.


I'd have one thing poll the FTP servers, filter their files down to the ones you care about, download those, and store them in S3.
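
A minimal sketch of the "filter down to the files you care about" step, assuming the list of interesting files can be expressed as glob patterns. The pattern names and the `files_of_interest` helper are hypothetical, not something from this thread:

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; the real list of wanted files would come from config.
WANTED_PATTERNS = ["trades_*.csv.pgp", "positions_*.csv.pgp"]


def files_of_interest(listing, patterns=WANTED_PATTERNS):
    """Return the names from an FTP listing that match any wanted pattern."""
    return [
        name
        for name in listing
        # fnmatchcase is case-sensitive on every platform.
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
```

Only the names this returns would then be downloaded and written to the bucket.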


And I'd then use S3 events to trigger something else to process the file.


I think it's better to think of it this way: some S3 bucket is where the files you want to process get put. You can then test the whole thing yourself: just upload a file to S3 manually and see if everything works.
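
A rough sketch of what the S3-triggered side could look like, assuming the standard S3 event notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The `handler` entry point and the `process` callback are hypothetical stand-ins for the decrypt/identify/validate/ETL steps:

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications (spaces as '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def handler(event, context=None, process=print):
    """Lambda-style entry point; `process` is a placeholder for the real pipeline."""
    for bucket, key in objects_from_s3_event(event):
        process(bucket, key)
```

Because the handler only depends on the event payload, it can be exercised locally with a hand-written event as well as by dropping a file in the bucket.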


The way I'd justify it is like... That FTP server is like an impure side effect. Your S3 bucket -> event -> process is like a pure function with a well-defined input and output. So you can now reuse it. For example, maybe one day you tell your counterpart the whole FTP thing is stupid and just have them push the files directly to your S3. Or they themselves decide to switch to S3 for serving files. In all these cases, your "pure" function isn't broken and doesn't need to change.


Thanks both! @U0K064KQV appreciate the justification and thought process. I’m still attracted to the idea of having the whole thing, from download to end, in a single state machine for cohesion - but the benefits of separation based on the hinges of where the process is likely to change, and the potential reuse, probably outweigh it. I like the purity analogy too - it brings the usual benefit of purity that it’s much easier to test too - just dunk something in a bucket.


Something else to consider: how will you know that a file on the FTP has already been processed? Say you poll the FTP every 10 min, see the file, and publish a message. Then you poll again and see the same file. How will you avoid publishing another message? You have the same issue with the S3 solution, except you could rely on S3 to remember what was already downloaded. So it can serve as a deduplication mechanism as well.
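
A small sketch of using the bucket as the deduplication record, assuming file names map one-to-one to object keys. `already_stored` is a hypothetical callable; in real use it might wrap an S3 existence check, while here it is just an in-memory set:

```python
def new_files(listing, already_stored):
    """Return names from the FTP listing that are not yet in the bucket.

    `already_stored(name)` should return True when the file was already
    downloaded (for example, an object with that key exists in S3).
    """
    return [name for name in listing if not already_stored(name)]


# Stand-in for the bucket contents, for illustration only.
stored = {"report_2024_01.csv"}
to_fetch = new_files(["report_2024_01.csv", "report_2024_02.csv"], stored.__contains__)
```

Note that a name-only check will not notice a file that was re-sent under the same name with different contents.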


We’re doing an ls on the FTP and diffing the result with the previous result to figure out adds, deletes, and modifications.


Sort of manufacturing events
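
One way the "diff the ls and manufacture events" idea could be sketched, assuming each listing is reduced to a dict of file name to some comparable fingerprint such as `(size, modified_time)`. The function name and event shape are hypothetical:

```python
def diff_listings(previous, current):
    """Compare two FTP listings and return a list of (kind, name) events.

    Both arguments map file name -> fingerprint, e.g. (size, modified_time).
    """
    events = []
    for name in sorted(current.keys() - previous.keys()):
        events.append(("added", name))
    for name in sorted(previous.keys() - current.keys()):
        events.append(("deleted", name))
    for name in sorted(current.keys() & previous.keys()):
        # Same name but a different fingerprint counts as a modification.
        if current[name] != previous[name]:
            events.append(("modified", name))
    return events
```

Each returned tuple could then be published as one message, with the current listing saved as the baseline for the next poll.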