At Banked, we offer real-time payments to our customers, which involves interfacing directly with all major UK banks. It’s important for us to know the health/service level of the APIs at each bank we interact with, so we needed a way to monitor all of that traffic.
Before we get to the specifics of our solution, a quick general intro to event sourcing is probably useful. If you’re familiar with all this already, you might want to skip ahead.
The basic principle of ES is that we capture every change of state to our application as events, then store these events in the order they were produced. That log, or sequence of events, then forms our single source of truth for the state of our application.
That’s it really.
Contrast this with the more common approach of a relational database, which will generally store the current state of the world, i.e. where we got to after all those events occurred. We actually throw away a lot of information with this approach — we know where we are, but not how we got there.
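To make the contrast concrete, here is a minimal sketch in plain Python using a toy account balance. The event names (`deposited`, `withdrawn`) and the `Event` shape are made up for illustration and are not taken from our system. The current balance is derived by replaying the log, and because the log is kept, any earlier state can be derived too.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    sequence: int  # position in the log
    kind: str      # "deposited" or "withdrawn" (hypothetical event names)
    amount: int    # amount in pence


def replay(events: Iterable[Event]) -> int:
    """Fold the events, in log order, into a balance."""
    balance = 0
    for event in sorted(events, key=lambda e: e.sequence):
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


log: List[Event] = [
    Event(1, "deposited", 10_000),
    Event(2, "withdrawn", 2_500),
    Event(3, "deposited", 500),
]

# A current-state table would only hold this single number.
current_balance = replay(log)        # 8000

# This is only recoverable because the history was kept.
balance_after_two = replay(log[:2])  # 7500
```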
Event sourcing is not a widely used approach in software engineering (and that’s probably as it should be…) but as a general system it is more common than you might think. In fact, I can almost guarantee you use an event sourced system every single day: your bank’s ledger is an event sourced record of every change you’ve ever made to your balance. No bank would store just your current balance and throw away the changes. Accountants had perfected event sourcing well before software engineering was around to coin the term; the first recorded ledgers were found in Mesopotamia, in today’s Iraq, and date back around 7000 years, so it’s a fairly well trodden path. In the development world, the most common example of software built with ES is probably Git, and the benefits there are pretty obvious.
That said, there’s a good reason that we all know how to use relational databases, but the same can’t be said for ES. It adds complexity, is much easier to get wrong and is arguably a more challenging paradigm for developers to operate within. For many use cases, particularly where the history of our state is irrelevant, that’s just a bad trade. For the right use case though, it’s an elegant solution.
A lot more has been said on ES (plus associated patterns and techniques like CQRS) by people much smarter than me, so if you want to explore the fundamentals a bit more, there are plenty of talks and videos on the subject that are worth a look.
So what’s the general shape of a use case that fits? Here’s where it might make sense to get into the specifics of the problem we were trying to solve.
At Banked, we offer real-time payments to our customers, which involves interfacing directly with all major UK banks (we’re working on the rest of Europe…). It’s important for us to know the health/service level of the APIs at each bank we interact with, so we needed a way to monitor all of that traffic. Some form of event sourcing looked like a good fit, because the question is, in essence, a temporal one, i.e. how have the responses from each API changed over time. This felt like the key to a suitable use case for ES — it has to match the reality of the world you are modelling. If that fits, everything else tends to fall into place.
So we began to explore how we might put the theory into practice. There’s a lot of choice when it comes to the technology available. Tools like Kafka and EventStore are powerful and seem to be common choices, but the steep learning curve requires a non-trivial investment in developer time, and they also involve a reasonable commitment in ongoing maintenance. We wanted something managed, quick to get going, and easy to explain to other devs. It needed to be scalable too, because the number of events we’d be chucking at it is significant.
Enter Google’s Firestore.
Firestore is a NoSQL document database, with a few handy features that make it a pretty great fit. The basic structure of Firestore is alternating documents and collections — with a document being a single blob of data. Each document is part of a collection alongside other documents, and each document can have sub-collections, which each contain further documents and so on, ad infinitum. Whilst a single document has a maximum size limit, the size of a collection is essentially unbounded. So this is starting to look like a good match — each of our events (a small blob of data) maps nicely to a document, and we can group these events into unbounded collections, which is ideal as the history of our application keeps growing.
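As a rough illustration of that mapping, the sketch below builds the collection path and document body for a single event. The layout (`banks/{bank_id}/events`) and the field names are assumptions made for the example, not our actual schema.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def event_document(
    bank_id: str, status_code: int, latency_ms: int, occurred_at: datetime
) -> Tuple[str, Dict[str, Any]]:
    """Return (collection path, document data) for one API response event.

    Firestore paths alternate collection/document/collection/..., so a
    collection path always has an odd number of segments. Here the layout is
    banks (collection) / <bank_id> (document) / events (sub-collection).
    """
    collection_path = f"banks/{bank_id}/events"  # hypothetical layout
    data = {
        "bank_id": bank_id,
        "status_code": status_code,
        "latency_ms": latency_ms,
        "timestamp": occurred_at,  # used later to order events
    }
    return collection_path, data


path, data = event_document(
    "bank-a", 200, 120, datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc)
)
# With the google-cloud-firestore client, appending the event would then be
# something like: client.collection(path).add(data)
```

Each event stays a small document, and the `events` sub-collection for a bank can keep growing without any single document getting bigger.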
One obvious issue here is reading back the data. Given such a big collection of events, we definitely don’t want to end up searching through the whole collection to get the information we need. This is where some careful structuring of the data makes all the difference. We want to group documents together in such a way that the collection contains all the events we are interested in (but no more than that) for a given operation, and then make sure we have an index for every field we need to query. Firestore automatically sets up an index for each field in our document, so as long as we include a timestamp for each event, we get a pretty simple way to very quickly retrieve the latest N events for whichever collection we are interested in. Different problems will require their own unique structure.
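A sketch of that "latest N events" read might look like the following. It assumes the google-cloud-firestore Python client is passed in as `db`, that each event document has a `timestamp` field, and that the collection path follows a hypothetical `banks/{bank_id}/events` layout. The direction is given as the string `"DESCENDING"`, which is the value of the client library's `Query.DESCENDING` constant.

```python
from typing import Any, Dict, List


def latest_events(db: Any, collection_path: str, n: int) -> List[Dict[str, Any]]:
    """Fetch the n most recent events from one collection, newest first.

    `db` is expected to be a google.cloud.firestore.Client (or anything with
    the same collection/order_by/limit/stream methods).
    """
    query = (
        db.collection(collection_path)
        .order_by("timestamp", direction="DESCENDING")  # served by the single-field index
        .limit(n)  # never read more of the collection than we need
    )
    return [snapshot.to_dict() for snapshot in query.stream()]


# Example usage, assuming credentials are configured:
#   from google.cloud import firestore
#   recent = latest_events(firestore.Client(), "banks/bank-a/events", 100)
```

Because the query is ordered and limited, the cost of the read depends on N rather than on the total size of the collection.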
Ideally, this sequence of events should also be immutable; we should never be going back and altering our history of events. This allows us to reliably recreate/replay our entire history, which is a handy feature for any software (debugging, regression testing, audit, etc.). We can do a pretty good job of ensuring this with IAM policies, although this could be improved if Google offered IAM permissions for Firestore at a more granular level, on a per-collection basis.
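Alongside IAM, the writer itself can be kept append-only. The sketch below is one way to do that, not a description of our setup. It assumes the google-cloud-firestore Python client is passed in as `db` and that each event has its own unique ID. It relies on `DocumentReference.create()`, which fails with a conflict error if the document already exists, rather than `set()`, which would overwrite it.

```python
from typing import Any, Dict


def append_event(
    db: Any, collection_path: str, event_id: str, data: Dict[str, Any]
) -> None:
    """Write a new event document, refusing to replace an existing one.

    `db` is expected to be a google.cloud.firestore.Client. create() raises
    if a document with this ID is already present, so a retry or a buggy
    caller cannot silently rewrite history.
    """
    db.collection(collection_path).document(event_id).create(data)


# Example usage, assuming credentials are configured:
#   from google.cloud import firestore
#   append_event(firestore.Client(), "banks/bank-a/events", "evt-123",
#                {"status_code": 200})
```

This only guards against accidental overwrites from well-behaved code; anything with write permission could still call `set()` or `delete()`, which is why the IAM policies still matter.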
So far so good.
Next we needed to decide how to build our logic around Firestore. Cloud Functions were an obvious choice here. Our events were already being published to PubSub, so it was simple to set up a Cloud Function that consumed these events and appended them to the appropriate collection in Firestore. Firestore then fires handy triggers whenever a new document is created. This allows us to hook up as many different Cloud Functions as we want to listen for new events being appended, and gives us a nice way to keep each unit of work simple and self-contained. Each function is responsible for calculating a given metric we are interested in, and can persist the new metric wherever we like. This also means the rest of our system needs to know nothing about event sourcing: each function can update a certain metric and push the updated information to PubSub (or wherever you like), to be consumed by any other part of the system that’s interested. So it’s totally decoupled, and doesn’t slow anyone else down if ES is not their thing. It’s also massively scalable with pretty much zero effort.
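The two halves of that pipeline can be sketched roughly as follows. This is an illustration under assumptions rather than our production code: the payload fields (`bank_id`, `status_code`), the `banks/{bank_id}/events` layout and the error-rate metric are all hypothetical. The one thing taken from the platform is that a PubSub-triggered background function receives the message body base64-encoded under the `data` key.

```python
import base64
import json
from typing import Any, Callable, Dict, Iterable


def decode_pubsub_event(event: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the base64-encoded JSON body of a PubSub-triggered event."""
    return json.loads(base64.b64decode(event["data"]).decode("utf-8"))


def make_ingest_handler(db: Any) -> Callable[[Dict[str, Any], Any], None]:
    """Build the (event, context) handler that appends events to Firestore.

    `db` is expected to be a google.cloud.firestore.Client. It is injected
    here so the handler is easy to test; a deployed function would normally
    create the client once at module level.
    """

    def ingest(event: Dict[str, Any], context: Any) -> None:
        payload = decode_pubsub_event(event)
        # Hypothetical layout: one events sub-collection per bank.
        db.collection(f"banks/{payload['bank_id']}/events").add(payload)

    return ingest


def error_rate(events: Iterable[Dict[str, Any]]) -> float:
    """Example metric: share of recent responses that were server errors.

    A function triggered by a new event document could fetch the latest
    events for that bank, call this, and publish or persist the result.
    """
    events = list(events)
    if not events:
        return 0.0
    failures = sum(1 for e in events if e["status_code"] >= 500)
    return failures / len(events)
```

Keeping the metric as a pure function over a list of events means each Cloud Function is a thin wrapper: read the relevant events, compute, publish.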
All in all, it’s been a very successful approach so far, but there are a few pitfalls lurking down the line that require a bit of forethought to avoid. The biggest of these is versioning of events. If you are going to rely on an immutable sequence of events for your application state, you can’t just run a database migration whenever you want to change some fields around. There are a few approaches to this, but we will probably save the solution we came up with for a future blog post.
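To give a flavour of the problem, one commonly described technique is "upcasting": stored events are never modified, and old versions are converted to the current shape as they are read. The sketch below is a generic illustration of that idea and not the solution referred to above. The `version` field and the change from `latency` (seconds) to `latency_ms` are invented for the example.

```python
from typing import Any, Callable, Dict

CURRENT_VERSION = 2


def _v1_to_v2(event: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical change: v1 stored `latency` in seconds, v2 stores `latency_ms`."""
    upgraded = {k: v for k, v in event.items() if k != "latency"}
    upgraded["latency_ms"] = round(event["latency"] * 1000)
    upgraded["version"] = 2
    return upgraded


# One upcaster per old version, each moving an event forward by one version.
UPCASTERS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    1: _v1_to_v2,
}


def upcast(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event in the current shape; the stored event is untouched."""
    current = dict(event)
    # Events written before versioning existed are treated as version 1.
    while current.get("version", 1) < CURRENT_VERSION:
        current = UPCASTERS[current.get("version", 1)](current)
    return current


stored = {"version": 1, "status_code": 200, "latency": 0.25}
readable = upcast(stored)
```

The code that calculates metrics then only ever has to understand the latest version, while the log itself stays immutable.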
Eventual consistency is another common characteristic of an ES system. Again, there are a variety of solutions that ultimately depend on your use case — we were happy with eventual consistency for our particular problem, so we neatly sidestepped that one.
So, there you have it. At the risk of a summary full of useless platitudes: ES is not a one-size-fits-all pattern but, for the right situation, it can be a powerful tool. A combination of Firestore, PubSub, and Cloud Functions can be a simple and scalable implementation strategy.
© Banked : 2021