CQRS and ES for an Online Auction Platform

In this article, I describe our team’s decision to apply CQRS and event sourcing (ES) to an online auction project. I also cover the results, our conclusions, and the hidden pitfalls you should avoid.

Background

Let's start with some business background. A client came to us with a platform for so-called timed auctions that was already in production. The client wanted us to build a platform for live auctions.

An auction is a sale of items (lots) in which buyers (bidders) place bids. The bidder who offers the highest bid becomes the new owner of the lot. In a timed auction, each lot has a predetermined closing time: bidders place bids until the lot closes. It is similar to eBay.

The timed platform was built in the classic way, using CRUD. Lots were closed by a separate application that ran on a schedule. It worked unreliably: some bids were lost, some were recorded as if placed by a different bidder, and some lots were never closed or were closed several times.

A live auction lets bidders take part remotely, via the Internet, in a real offline auction. There is a physical venue (we call it a ‘room’) with an auctioneer holding a hammer and an audience. There is also a so-called clerk with a laptop who relays the course of the auction to the Internet by pressing buttons in the interface. Bidders connected to the auction see the offline bids and can place their own.

Both platforms work in real time. In a timed auction, however, all bidders are on an equal footing, whereas in a live auction it is extremely important that online bidders can compete successfully with those in the room. This means that the system must be very fast and reliable. Our negative experience with the timed platform made it clear that classic CRUD would not suit us.

We had no experience of our own with CQRS & ES, so we consulted colleagues who did (ours is a big company), described our business realities to them, and jointly concluded that CQRS & ES should suit us.

Some more specifics of online auctions:

  • Many users simultaneously try to act on the same object in the system, i.e. the current lot. Bidders place their bids; the clerk enters the bids of bidders in the room, closes a lot, and opens the next one. At any given time, only a bid of one specific value can be placed, for example, 5 rubles, and only one user can place it.
  • It is necessary to store the entire history of operations on system objects so that, if required, you can check who made a particular bid.
  • The response time of the system should be very short. The online version of the auction should correspond to the offline version, and users should learn right away whether their attempts to place a bid succeeded.
  • Users should quickly get information about all changes during an auction, not just about the results of their own actions.
  • The solution must be scalable so that several auctions can be held simultaneously.

A short review of the CQRS & ES approach

I will not dwell on the CQRS & ES approach; there is a lot of information about it on the Internet. However, I will briefly recap the main points:

  • The most important thing in event sourcing: the system stores not the data itself but the history of its changes, i.e. events. The current state of the system is obtained by applying the events in sequence.
  • The domain model is divided into entities called aggregates. An aggregate has a version. Events are applied to aggregates. Applying an event to an aggregate increments its version.
  • Events are stored in the write database. The events of all system aggregates are stored in one table in the order in which they occur.
  • Changes in the system are initiated by commands. A command is applied to one aggregate, and always to its latest (current) version. To get that version, the aggregate is built by sequentially applying all the events that belong to it. This process is called rehydration.
  • In order not to rehydrate from the very beginning every time, some versions of an aggregate (usually every Nth version) can be stored in the system ready-made. These ‘shots’ of an aggregate are called snapshots. Then, to obtain the latest version of an aggregate during rehydration, only the events that occurred after the most recent snapshot are applied to it.
  • The command is processed by the business logic of the system, and, as a result, one or more events are produced and saved in the write database.
  • In addition to the write database, the system may also have a read database that stores data in a form convenient for the system’s clients. One read database entity does not have to match one aggregate in the system. The read database is updated by event handlers.
  • Thus, we get the separation of commands and queries to the system (Command Query Responsibility Segregation (CQRS)). Commands that change the state of the system are processed by the write part; queries that do not change the state of the system are processed by the read part.
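
The points about versions, rehydration and snapshots can be illustrated with a minimal sketch. It is not Chinchilla code: the names BidPlaced, LotSnapshot and LotAggregate are made up for this example, and the event history is just an in-memory list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event: a bid of the given amount was accepted.
public sealed class BidPlaced
{
    public BidPlaced(decimal amount) { Amount = amount; }
    public decimal Amount { get; }
}

// Hypothetical snapshot: the saved state of the aggregate at some version.
public sealed class LotSnapshot
{
    public LotSnapshot(int version, decimal highestBid)
    {
        Version = version;
        HighestBid = highestBid;
    }
    public int Version { get; }
    public decimal HighestBid { get; }
}

public sealed class LotAggregate
{
    public int Version { get; private set; }
    public decimal HighestBid { get; private set; }

    // Applying an event changes the state and increments the version.
    public void Apply(BidPlaced e)
    {
        HighestBid = e.Amount;
        Version++;
    }

    public LotSnapshot TakeSnapshot() => new LotSnapshot(Version, HighestBid);

    // Rehydration: start from a snapshot (or from scratch) and replay newer events.
    // 'history' is the full ordered event list of this aggregate;
    // the event at index i produced version i + 1.
    public static LotAggregate Rehydrate(IReadOnlyList<BidPlaced> history, LotSnapshot snapshot = null)
    {
        var lot = new LotAggregate();
        if (snapshot != null)
        {
            lot.Version = snapshot.Version;
            lot.HighestBid = snapshot.HighestBid;
        }
        // Skip the events that are already reflected in the snapshot.
        foreach (var e in history.Skip(lot.Version))
            lot.Apply(e);
        return lot;
    }
}
```

Rehydrating from the full history and rehydrating from a snapshot plus the events after it should produce the same version and state; the snapshot only shortens the replay.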

Figure: The scheme of CQRS architecture

Implementation. Subtleties and difficulties

Choosing a framework

In order to save time, and because we lacked specific experience, we decided to use a framework for CQRS & ES.

In general, our technology stack is Microsoft: .NET and C#, with Microsoft SQL Server as the database. Everything is hosted in Azure. The timed platform was built on this stack, so it was logical to build the live platform on it as well.

At that time, Chinchilla was almost the only option suitable for our technology stack, so we decided to use it.

Why do we need a CQRS & ES framework at all? Out of the box, it handles such aspects of the implementation as:

  • Aggregate entities, commands, events, aggregate versioning, rehydration, snapshot mechanism.
  • Interfaces for working with different DBMS. Saving/loading events and snapshots of aggregates to/from the write database (event store).
  • Interfaces for working with queues. Sending commands and events to the appropriate queues, reading commands and events from a queue.
  • An interface for working with web sockets.

Thus, with Chinchilla in mind, we added the following to our stack:

  • Azure Service Bus as the command and event bus; Chinchilla supports it out of the box;
  • Microsoft SQL Server for both the write and the read database, i.e. both are SQL databases. This was not a conscious choice; it happened for historical reasons.

The frontend is built with Angular.

As I have already said, one of the requirements is that users learn the results of their own actions and of other users’ actions as quickly as possible; this applies to both bidders and the clerk. Therefore, we use SignalR and web sockets to update frontend data quickly. Chinchilla supports SignalR integration.

Choosing aggregates

One of the first things to do when implementing the CQRS & ES approach is to determine in what way a domain model will be divided into aggregates.

In our case, the domain model consists of several main entities, like this:

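using System;
using System.Collections.Generic;

// AuctionState and LotState are enums defined elsewhere in the domain model.

// Aggregate: the auction as a whole.
public class Auction
{
     public AuctionState State { get; private set; }
     public Guid? CurrentLotId { get; private set; }   // the lot being sold right now, if any
     public List<Guid> Lots { get; } = new List<Guid>();
}

// Aggregate: a single lot together with its bids.
public class Lot
{
     public Guid? AuctionId { get; private set; }
     public LotState State { get; private set; }
     public decimal NextBid { get; private set; }       // the only amount that can be bid at the moment
     public Stack<Bid> Bids { get; } = new Stack<Bid>(); // the latest bid is on top
}

// A bid is part of the Lot aggregate.
public class Bid
{
     public decimal Amount { get; set; }
     public Guid? BidderId { get; set; }
}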

We ended up with two aggregates: Auction and Lot (with Bids). In general, this is logical, but we overlooked one thing: with this division, the state of the system is spread over two aggregates, and in some cases, to maintain consistency, we must change both aggregates rather than just one. For example, an auction can be paused. While the auction is paused, no bid can be placed on a lot. It would be possible to pause the lot itself, but the paused auction must also reject every command except ‘unpause’.

Alternatively, we could have made a single Auction aggregate containing all lots and bids. But such an object would be quite heavy, because an auction can have thousands of lots and dozens of bids per lot. Over the lifetime of an auction, such an aggregate would accumulate a great many versions, and rehydrating it (sequentially applying all its events) would take quite a long time without snapshots, which is unacceptable in our situation. With snapshots (we do use them), the snapshots themselves would be very large.

On the other hand, to ensure that changes are applied to two aggregates within the processing of a single user action, you must either change both aggregates within the same command using a transaction or execute two commands within one transaction. Both can be counted as a violation of the architecture.

Such circumstances must be taken into account when breaking down the domain model into aggregates.

At this stage of the project development, we use two aggregates (Auction and Lot), and we violate the architecture by changing both aggregates within some commands.

Applying a command to a specific version of an aggregate

If several bidders place a bid on the same lot at the same time, that is, they each send a ‘place a bid’ command to the system, only one of the bids will be successful. A lot is an aggregate, so it has a version. When the command is processed, events are generated, and each event increments the version of the aggregate. There are two options:

  • Send a command that indicates the version of the aggregate we want to apply it to. Then the command handler can immediately compare the version in the command with the current version of the aggregate and stop if they do not match.
  • Do not specify the aggregate version in the command. Then the aggregate is rehydrated at some version, the corresponding business logic is executed, and events are generated. Only when the events are saved can an error surface saying that this version of the aggregate already exists, because someone else got there first.

We use the second option. In this case, commands are more likely to be executed, because in the part of the application that sends commands (in our case, the frontend), the known version of the aggregate will probably lag behind the actual version on the backend, especially when many commands are sent and the aggregate version changes frequently.
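
A minimal sketch of the second option, assuming a simplified in-memory event store: the version the aggregate had at rehydration is checked only at the moment the new events are saved. InMemoryEventStore and ConcurrencyException are hypothetical names for this example, not Chinchilla types.

```csharp
using System;
using System.Collections.Generic;

// Thrown when another command has already written the version we tried to write.
public sealed class ConcurrencyException : Exception
{
    public ConcurrencyException(string message) : base(message) { }
}

// Hypothetical in-memory event store; each aggregate id has its own ordered event list.
public sealed class InMemoryEventStore
{
    private readonly Dictionary<Guid, List<object>> _streams = new Dictionary<Guid, List<object>>();
    private readonly object _gate = new object();

    public int CurrentVersion(Guid aggregateId)
    {
        lock (_gate)
            return _streams.TryGetValue(aggregateId, out var stream) ? stream.Count : 0;
    }

    // expectedVersion is the version the aggregate had when it was rehydrated.
    public void Append(Guid aggregateId, int expectedVersion, IReadOnlyCollection<object> newEvents)
    {
        lock (_gate)
        {
            if (!_streams.TryGetValue(aggregateId, out var stream))
                _streams[aggregateId] = stream = new List<object>();

            // Someone else saved events after we rehydrated: reject the whole batch.
            if (stream.Count != expectedVersion)
                throw new ConcurrencyException(
                    $"Expected version {expectedVersion}, but the aggregate is at version {stream.Count}.");

            stream.AddRange(newEvents);
        }
    }
}
```

If two command handlers rehydrate the same lot at version 0 and both try to save a bid, the first Append succeeds and the second throws. In a SQL write database the same effect is typically achieved with a unique constraint on the aggregate id and version, which is an assumption about the storage layout rather than a description of our schema.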

Errors when executing a command using a queue

In our implementation, which depends heavily on Chinchilla, the command handler reads commands from a queue (Microsoft Azure Service Bus). We clearly distinguish between commands that fail for technical reasons (timeouts, errors connecting to the queue or the database) and commands that fail for business reasons (for example, an attempt to place a bid of an amount that has already been accepted for the lot). In the first case, the attempt to execute the command is repeated until the number of retries specified in the queue settings is reached, after which the command is moved to the Dead Letter Queue (a separate sub-queue for unprocessed messages in Azure Service Bus). In the case of business reasons, the command is sent to the Dead Letter Queue immediately.

Figure: Command queue error handling
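
The retry rule can be sketched as follows. This is a simplified model that does not use the Azure Service Bus SDK: the loop stands in for the redelivery performed by the queue, and BusinessRuleException, Outcome and CommandProcessor are hypothetical names.

```csharp
using System;

// Hypothetical marker for business-rule violations (e.g. the bid amount is already taken).
public sealed class BusinessRuleException : Exception
{
    public BusinessRuleException(string message) : base(message) { }
}

public enum Outcome { Completed, DeadLettered }

public static class CommandProcessor
{
    // Runs 'handle' up to maxDeliveryCount times, imitating the redelivery loop of a queue.
    public static Outcome Process(Action handle, int maxDeliveryCount)
    {
        for (int attempt = 1; attempt <= maxDeliveryCount; attempt++)
        {
            try
            {
                handle();
                return Outcome.Completed;
            }
            catch (BusinessRuleException)
            {
                // Retrying cannot help: the command itself is invalid.
                return Outcome.DeadLettered;
            }
            catch (Exception)
            {
                // Technical failure (timeout, lost connection): try again.
            }
        }
        // All retries are used up.
        return Outcome.DeadLettered;
    }
}
```

The important design choice is that a business failure is never retried: the state that caused the rejection will not change just because the same command is delivered again.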

Errors when processing events using a queue

Depending on the implementation, events generated as a result of a command execution can also be sent to a queue and taken from the queue by event handlers. When processing events, errors can also occur.

In comparison with an unexecuted command, the situation is more complicated in this case. The command may have been executed and its events written to the write database, yet the handlers may fail to process the events. If one of those handlers is responsible for updating the read database, the read database is not updated and ends up in an inconsistent state. Thanks to the retry mechanism, the read database is usually updated eventually, but there is a risk that it is still broken after all the attempts have been exhausted.

Figure: Errors when processing events using a queue

We have faced this problem. To a large extent, it happened because we had some business logic in the event processing that can fail from time to time under intense bidding. Unfortunately, we realized this too late, and it was not possible to change the implementation quickly and simply.

As a result, we decided to temporarily stop using Azure Service Bus to transfer events from the write part of the application to the read part. Instead, we use the so-called In-Memory Bus, which allows the command and its events to be processed in one transaction and, in the case of failure, the whole transaction to be rolled back.

Figure: Event handler exception scheme
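
The all-or-nothing behaviour can be sketched like this. It is only a model: staged copies of two in-memory lists play the role of a database transaction, and the InMemoryBus class here is a hypothetical stand-in, not the framework's implementation.

```csharp
using System;
using System.Collections.Generic;

public sealed class InMemoryBus
{
    public List<string> EventStore { get; private set; } = new List<string>();
    public List<string> ReadModel { get; private set; } = new List<string>();

    // Saves the events and runs the read-side handler as one all-or-nothing unit.
    public bool Execute(IEnumerable<string> events, Action<string, List<string>> readHandler)
    {
        // Work on copies; they stand in for an open database transaction.
        var stagedStore = new List<string>(EventStore);
        var stagedRead = new List<string>(ReadModel);
        try
        {
            foreach (var e in events)
            {
                stagedStore.Add(e);
                readHandler(e, stagedRead);
            }
        }
        catch (Exception)
        {
            // "Rollback": the staged copies are simply discarded.
            return false;
        }

        // "Commit": both sides become visible together.
        EventStore = stagedStore;
        ReadModel = stagedRead;
        return true;
    }
}
```

If the handler throws on any event of the command, neither the write side nor the read side changes, so the two can never diverge. The price is that the command is not finished until every handler has run.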

Such a solution does not contribute to scalability, but it rules out situations in which the read database breaks, which in turn breaks the frontends and makes it impossible to continue the auction without re-creating the read database by replaying all events.

Sending a command in response to an event

A command may be sent in response to an event, but only when a failure to execute this second command does not break the state of the system.

Processing multiple events of one command

In general, executing one command produces several events. Sometimes we need to make a change in the read database for each of them. In some cases the order of the events also matters, and if the order is wrong, the events will not be processed correctly. This means that we cannot read the queue and process the events of one command independently, for example with different instances of the code that reads messages from the queue. In addition, we need to be sure that events are read from the queue in the same order in which they were sent. Otherwise, we have to accept that some events of the command will not be processed successfully on the first attempt.

Figure: Processing multiple events of one command
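
One way to live with out-of-order delivery is to let the read side refuse an event that arrives too early, so that it is retried later. The sketch below assumes that every event carries the aggregate version it produced; LotProjection and TryApply are hypothetical names, and this is an illustration rather than what our handlers actually do.

```csharp
using System;

// Hypothetical read-side projection of a lot.
public sealed class LotProjection
{
    public int LastAppliedVersion { get; private set; }
    public decimal CurrentBid { get; private set; }

    // Returns true if the event was applied (or had been applied before),
    // false if it arrived too early and should be retried later.
    public bool TryApply(int eventVersion, decimal amount)
    {
        // Duplicate delivery of an event that is already reflected in the projection.
        if (eventVersion <= LastAppliedVersion) return true;

        // A preceding event has not been processed yet.
        if (eventVersion != LastAppliedVersion + 1) return false;

        CurrentBid = amount;
        LastAppliedVersion = eventVersion;
        return true;
    }
}
```

With this check, an event that overtakes its predecessor fails on the first attempt and succeeds on a later one, which is exactly the trade-off described above.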

Processing one event with multiple handlers

If the system needs to perform several different actions in response to one event, several handlers are usually used to process it. They can work in parallel or sequentially. In the sequential case, if one of the handlers fails, the entire sequence is restarted (this is how Chinchilla works). With such an implementation, it is important that the handlers are idempotent, so that a handler that has already run successfully does not fail when it runs again. Otherwise, when the second handler in the chain fails, the chain can never complete, because the first handler will fail on the second (and every subsequent) attempt.

For example, an event handler adds a bid of 5 rubles on a lot to the read database. The first attempt succeeds, but the second fails because of a constraint in the database.

Figure: The event handler in the read database adds a 5-ruble bid
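
The difference between a non-idempotent and an idempotent handler can be shown with a small sketch. BidReadModel is a hypothetical in-memory stand-in for a read-database table with a uniqueness constraint on the lot and the bid amount.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical read-model table; the HashSet plays the role of a unique constraint.
public sealed class BidReadModel
{
    private readonly HashSet<(Guid, decimal)> _rows = new HashSet<(Guid, decimal)>();

    public int Count => _rows.Count;

    // Not idempotent: a second delivery of the same event violates the constraint.
    public void InsertStrict(Guid lotId, decimal amount)
    {
        if (!_rows.Add((lotId, amount)))
            throw new InvalidOperationException("Duplicate bid row.");
    }

    // Idempotent: if the row already exists, the event was handled before, so do nothing.
    public void InsertIfMissing(Guid lotId, decimal amount) => _rows.Add((lotId, amount));
}
```

When the chain of handlers is restarted, the strict version throws on the repeated event and blocks the chain, while the idempotent version leaves a single row and lets the remaining handlers run.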

Summary/Conclusion

Our project is now at the stage where, as it seems to us, we have already encountered most of the hidden pitfalls relevant to our business specifics. Overall, we consider our experience quite successful: CQRS & ES is well suited to our subject area. We would like to stop using Chinchilla and choose another framework that gives more flexibility for further development of the project. However, we may also stop using a framework altogether. We will also probably make some changes to find a balance between reliability on the one hand and the speed and scalability of the solution on the other.

As for the business component, there are still some open questions, for example, dividing a domain model into aggregates.

I hope that our experience will be useful and will help you save time and avoid hidden pitfalls.
