GDA’s Data Mini-Series: Part 1
Architecture and Normalisation
This article is the first part of a mini-series we are publishing about our data services. In this part, we cover the updated architecture of our data services and explain how and why we are doing data normalisation. If you would prefer to watch rather than read, a video covering the same content is embedded below.
A typical L3 data setup or architecture for quant research and trading is quite crude and simple: a single WebSocket connection that streams L3 data directly into the quant's code. Our architecture is much more sophisticated and addresses a number of problems with this simple approach. There are two sides to our architecture: the data lake and the data warehouse. Each of these components is discussed separately below.
The data for the data lake is streamed in by a number of microservices running in Amazon ECS. By running multiple redundant microservices, we can guarantee the accuracy and uptime of the data feeds. We call these microservices content delivery networks (CDNs). If all of the CDNs report the same data, the incoming data is verified as correct. If some agree but a few differ, we need a way of determining which feeds are correct, so we have created a federated consensus algorithm that makes this decision.
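The federated consensus algorithm itself is not detailed here, but the basic idea of resolving disagreement between redundant feeds can be sketched as a simple majority vote. The function name and message format below are illustrative assumptions, not our actual implementation:

```python
from collections import Counter

def resolve_feeds(feed_messages):
    """Pick the message reported by a strict majority of redundant feeds.

    feed_messages: one message per CDN for the same event.
    Returns the majority message, or None if no strict majority exists.
    """
    counts = Counter(feed_messages)
    message, votes = counts.most_common(1)[0]
    if votes > len(feed_messages) // 2:
        return message
    return None

# Three feeds agree and one is corrupted: the majority value wins.
print(resolve_feeds(["bid@100.5", "bid@100.5", "bid@100.5", "bid@100.4"]))
```

A real federated scheme would also weight feeds by historical reliability; a bare majority vote is only the starting point.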
The data lake has two important characteristics. Firstly, we don't stream the data straight into the Amazon S3 bucket. Each microservice fills a buffer, and every minute that buffer is flushed to S3. Once the data is in the bucket, if we want to do some research on it, we can use Amazon SageMaker and create an EMR cluster, which is an easy way to build ML models and do research. Secondly, the data the microservices retrieve is normalised and enhanced in real time. The term normalisation means different things depending on the domain of study you're in. When we say normalisation, we are referring to the normalisation of data within a database schema. Database normalisation involves a number of steps, starting at first normal form (1NF) and going up to fifth normal form (5NF). Our normalisation implementation consists of the first three normal forms:
- 1NF: all values in the database are made atomic.
- 2NF: partial dependencies are decomposed into new tables, so that every attribute is functionally dependent on the whole primary key.
- 3NF: transitive dependencies are decomposed into new tables, so that non-key attributes depend only on the primary key and not on other non-key attributes.
The key reason we are doing this normalisation is to make inserting, updating and removing data, as well as aggregating it, a much simpler task.
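As a toy illustration of the first step, consider a raw feed message that bundles several order book actions into one record. That record is not atomic; converting it to 1NF means emitting one row per action. The field names here are assumptions for illustration, not our actual message format:

```python
# A raw feed message bundling several order book actions in one record
# (non-atomic: the "actions" field holds a list of values).
raw_message = {
    "timestamp": 1700000000,
    "actions": [
        {"type": "insert", "side": "bid", "price": 100.5, "size": 2.0},
        {"type": "update", "side": "ask", "price": 100.7, "size": 1.5},
    ],
}

# First normal form: one atomic row per action, every value indivisible.
rows_1nf = [
    {"timestamp": raw_message["timestamp"], **action}
    for action in raw_message["actions"]
]

for row in rows_1nf:
    print(row)
```

Each resulting row can now be inserted, updated or deleted independently, which is exactly what makes the later aggregation steps simpler.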
Normalising order book data
With the raw data feeds that come from the CDNs, we rebuild the order book. We do this to help us with the enhancement task, but also to provide separate metrics that we can aggregate with the data. The order book we construct uses the limit order events streamed over the CDNs and the market order trades which are matched against the limit order book. The structure of the data that we use to build the order book is not normalised because it is not atomic: in the image below, for example, you can see that there are multiple data points for each order book action.
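The rebuild step itself can be sketched as maintaining a price-level book and applying each event in sequence. The event fields (action, side, price, size) are assumed names for illustration:

```python
def apply_event(book, event):
    """Apply one limit order event to a price-level order book.

    book: {"bid": {price: size}, "ask": {price: size}}
    event: {"action": "insert"|"update"|"delete", "side": ..., "price": ..., "size": ...}
    """
    levels = book[event["side"]]
    if event["action"] == "delete":
        levels.pop(event["price"], None)   # remove the price level if present
    else:
        levels[event["price"]] = event["size"]  # insert/update set the level's size

book = {"bid": {}, "ask": {}}
events = [
    {"action": "insert", "side": "bid", "price": 100.5, "size": 2.0},
    {"action": "insert", "side": "ask", "price": 100.7, "size": 1.0},
    {"action": "update", "side": "bid", "price": 100.5, "size": 3.0},
    {"action": "delete", "side": "ask", "price": 100.7, "size": 0.0},
]
for e in events:
    apply_event(book, e)

print(book)  # {'bid': {100.5: 3.0}, 'ask': {}}
```

A production book would also keep per-order queues and sequence numbers to detect gaps in the feed; the price-level view is the minimum needed for the metrics discussed below.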
By converting the data to first normal form, each row in the data is made atomic. The next step, second normal form, makes every value in the table fully dependent on a primary key, which is a unique ID. In the case of a limit order book, each event such as a bid or ask is given a unique ID (i.e. 0, 1, etc.). In the image above, the order book actions delete, update and insert would be converted into the unique IDs 0, 1 and 2 respectively, and a new table created to match these IDs to the actions. Finally, the database is reduced to its third normal form by applying a similar concept from 2NF to transitive dependencies. A database in this normalised form can be interrogated and manipulated much more easily.
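The ID-mapping step described above can be sketched with a small in-memory schema: an actions lookup table (0 = delete, 1 = update, 2 = insert) and an events table that references it by ID. The table and column names are illustrative, not our actual warehouse schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup table produced by the decomposition: actions get surrogate IDs.
cur.execute("CREATE TABLE actions (id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO actions VALUES (?, ?)",
                [(0, "delete"), (1, "update"), (2, "insert")])

# Events reference actions by ID instead of repeating the action text.
cur.execute("""CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    action_id INTEGER REFERENCES actions(id),
    side TEXT, price REAL, size REAL)""")
cur.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                [(0, 2, "bid", 100.5, 2.0), (1, 1, "bid", 100.5, 3.0)])

# A join recovers the readable form whenever it is needed.
rows = cur.execute("""SELECT e.id, a.name, e.side, e.price, e.size
                      FROM events e JOIN actions a ON e.action_id = a.id
                      ORDER BY e.id""").fetchall()
print(rows)
```

Renaming an action, or adding a new one, now touches a single row in the lookup table rather than every event row, which is the practical payoff of the decomposition.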
Doing this normalisation drastically simplifies the extraction of data metrics from the order book, especially where those metrics aggregate data. For example, metrics such as mid-price, micro-price, spread and volume imbalance can be extracted from the order book easily. In total there are around 30 different metrics that can be calculated.
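A few of the named metrics can be computed directly from the top of the book. These are the standard formulations, with the micro-price weighting each side by the opposite queue's size:

```python
def book_metrics(best_bid, bid_size, best_ask, ask_size):
    """Compute common top-of-book metrics from the best bid/ask quotes."""
    mid = (best_bid + best_ask) / 2                     # midpoint of the quotes
    spread = best_ask - best_bid                        # quoted spread
    imbalance = (bid_size - ask_size) / (bid_size + ask_size)  # volume imbalance
    # Micro-price: size-weighted mid, leaning toward the thinner side's quote.
    micro = (best_bid * ask_size + best_ask * bid_size) / (bid_size + ask_size)
    return {"mid": mid, "spread": spread, "imbalance": imbalance, "micro": micro}

print(book_metrics(100.5, 3.0, 100.7, 1.0))
```

With the book held in normalised tables, each of these reduces to a small aggregate query over the best bid and ask rows, which is why the roughly 30 metrics stay cheap to compute in real time.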
A data warehouse is not as big as a data lake, but it is bigger than a typical relational database. The design of the data warehouse is based on star schemas and snowflake schemas. Our architecture is set up so that the microservices mentioned earlier, which stream limit order book events and market order/trade events, feed into a Pulsar message bus with certain "topics" that route the data to adapters. There will be an HTTPS adapter, a WebSocket adapter and a Kafka Pulsar adapter. These adapters can be connected to Amazon EC2 instances that we will have set up for ourselves.
Part of our infrastructure will also include Airflow as a microservice. This will be responsible for pushing the data to Chainlink to make it accessible to on-chain entities.
This is a summary of the architecture of the data lake and data warehouse. The normalisation of the data, which is done in real time, is incredibly important because it makes later data analysis and aggregation much simpler. In the next part of this mini-series, we will talk about what we are doing regarding market quality, trade and order flow imbalances, and liquidity. Stay tuned for that.