Project L3 Atom (Open-source Crypto Data Initiative)

The world’s first open-source crypto data initiative. Using advanced mathematics, data science, high-performance computation, and careful engineering, we have solved one of the biggest problems in the crypto industry: unreliable, noisy data. We recreate the clean L3 order book from its atomic structure in real time, enabling a new generation of quantitative finance and algorithmic trading applications in financial engineering.

Project L3 Atom is the largest data innovation project in crypto to date.

Contents

The Problem
The Solution
Our Innovation
Use Cases
Types of Data
Users
High-Level Architecture
Data Collection and Warehousing
Data Lake
Data Warehouse
Data Normalisation and Enhancement
Creating Rich Data — Aggregation Tasks
Data Dissemination and Streaming — Open Data Streaming Suite (ODSS)
Security Management
New Approach to Delivering Data to Financial Engineers
Conclusion
References

The Problem

Current data infrastructure in the crypto space is extremely unreliable for a number of reasons. This is especially so when comparing crypto data to traditional financial markets. The crypto industry’s data has the potential to surpass that of traditional finance, but at the moment this is not the case. The current nature of the crypto industry permits poor market behaviour and manipulation of data.

Lack of regulation in the crypto space is partly to blame for the cultivation of this attitude. As a result, big market participants can manipulate data, spoof trades, and do other malicious acts to the detriment of data quality. CEXs can and do throttle user connections to limit their data capabilities — there are no incentives for them to deliver data of a high quality. Additionally, because they are outside the scope of regulations, they have no legal obligation, as a fiduciary or otherwise, to do so.

The immense growth of crypto in recent years has seen an exponential increase in data points and demand, and supply hasn’t kept up. Data providers that do service this space fail to collect all relevant data points and are never free. These vendors issue disclaimers stating they do not guarantee the quality or consistency of their data. Data that is collected is plagued with noise, missing data points and is still affected by technical glitches and spoofing.

If a user does opt to collect their own data from trading venues, they are made to use technologies such as REST APIs and WebSockets. When using these services, users are given warnings similar to — “Due to network complexity, you may get disconnected at any time”. This is made worse in periods of high market activity, meaning these technologies can be unreliable.

These problems in the crypto space mean there is no foundation for quants and other traders, as well as institutional investors, to build strategies and engage with the ecosystem as a financial platform.

The Solution

The solution to these problems is our new initiative to democratise crypto data and provide it at almost zero cost — the Open Crypto Data Initiative. As a data provider, this initiative will enable users to connect to an institutional-grade, reliable, and fast stream of financial data across a wide range of exchanges, including decentralised protocols. All of the data collection, storage, and streaming will be handled by this initiative, so that users only need to worry about what is important to them. The data that we provide will also be reliable and accurate at an unprecedented level in the crypto industry.

Our Innovation

In crypto, data is king — whoever has access to the most exhaustive, reliable, and fastest data is poised to come out on top. Using high-performance computing (HPC) techniques and 3rd generation Intel Xeon Scalable processors (code-named Ice Lake), we have addressed crypto’s biggest problem — unreliable, dirty data. This allows us to reverse engineer the limit order book at an atomic level in real-time, opening up limitless possibilities.

As mentioned earlier, there are problems associated with accessing reliable data in crypto. Unlike traditional markets, where institutions and hedge funds pay to receive data, centralised exchanges face no further incentives or regulatory oversight motivating them to deliver high-quality data.

Throttling mechanisms are employed by these exchanges to limit the number of server calls users can make and the number of data points available. To provide a next-generation order book for users, GDA has developed primary and replica microservices across different Content Delivery Networks (CDNs), ensuring that if a connection is lost, data is still collected from a replica microservice.

Some of the innovative technologies we use to overcome these mechanisms include:

Querying large, multidimensional datasets is made easier via the implementation of OLAP Cubes, multi-dimensional arrays which allow data to be mapped and aggregated across several axes. This can be used to greatly speed up querying.

Primary and replica microservices allow for constant data streams in the event of a broken WebSocket connection. A federated consensus algorithm is used to pick the correct message coming from the WebSocket feed.

Microservices normalise the data from the WebSocket messages and rebuild the limit order book in real time from every single atomic event. Outliers, which may result from exchange errors or price manipulation, are removed by the consensus algorithm.
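
To make the idea concrete, here is a minimal Python sketch of maintaining an in-memory order book from atomic insert/cancel/trade events. The event field names are illustrative assumptions for the example, not our production schema.

```python
# Illustrative sketch only: rebuild an in-memory limit order book from
# atomic events (insert / cancel / trade). Field names are assumptions.
from collections import defaultdict

class LimitOrderBook:
    def __init__(self):
        # price -> total resting size, kept separately for bids and asks
        self.bids = defaultdict(float)
        self.asks = defaultdict(float)

    def apply(self, event):
        side = self.bids if event["side"] == "bid" else self.asks
        if event["type"] == "insert":
            side[event["price"]] += event["size"]
        elif event["type"] in ("cancel", "trade"):
            side[event["price"]] -= event["size"]
            if side[event["price"]] <= 0:
                del side[event["price"]]

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

book = LimitOrderBook()
book.apply({"type": "insert", "side": "bid", "price": 57269.98, "size": 0.4})
book.apply({"type": "insert", "side": "ask", "price": 57269.99, "size": 0.2})
print(book.best_bid(), book.best_ask())  # 57269.98 57269.99
```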

Any aspects of our system which require Python code are optimised using NumPy, a library that adds support for large, multi-dimensional arrays and matrices, as well as a collection of high-level mathematical functions for performing linear algebra over those arrays. We make use of ndarrays, array reshaping, ufuncs, aggregate methods, and SciPy for statistical analytics.
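
As a small, hedged illustration of this vectorised style, here is a toy example on synthetic quote data (not our real feeds):

```python
# Minimal sketch of the vectorised work described above: ufuncs and SciPy
# statistics applied to arrays of quotes. The data here is synthetic.
import numpy as np
from scipy import stats

bids = np.array([57269.98, 57270.10, 57268.50, 57271.00])
asks = np.array([57269.99, 57270.40, 57269.10, 57271.30])

mid = (bids + asks) / 2.0        # element-wise ufunc, no Python loop
spread = asks - bids             # vectorised subtraction
z = stats.zscore(spread)         # SciPy statistical analytics

print(mid.reshape(2, 2))         # reshaping the ndarray
print(spread.max(), spread.sum(), z)
```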

To get the best out of our AWS c6i.metal EC2 instances, we use Intel's C++ compilers and the OpenMP (OMP) and MPI libraries for distributed computing, because these libraries are optimised specifically for Intel architectures.

To further optimise the already relatively efficient C++ aspects of our system, we follow the pragma directive framework. A pragma is a language construct for providing additional information to the compiler, specifying how the compiler should process the code that follows.

These methods future-proof the code, as they will apply to future architectures as well. Techniques used include vectorisation, SIMD-enabled functions, vector-dependence analysis, strip mining, and locality of memory access in space and time.

To get the most out of our EC2 instances, we select virtual machines that support 512-bit Advanced Vector Extensions (AVX-512). The significance of 512-bit registers lies in the amount of data that can pass through the arithmetic logic units (ALUs) in a single operation. Common 32-bit and 64-bit processors operate directly on values up to 32 and 64 bits respectively, whereas a 512-bit register can perform a single operation across multiple data points, vectorising sixteen 32-bit values at once and processing 512 bits per operation. The instances offer an L1 bandwidth of 100 Gb/s on a 10 nm architecture.

The above technologies are expanded upon in later sections.

Data is taken from exchanges and normalised through our services. Because the atomic order book events are universal across exchanges (provided we receive the data messages from the exchange), we can rebuild the limit order book in memory and produce identical, comparable metrics and analytics.

This was designed for interoperability between exchanges and to ensure that the data is standardised on our end during processing. Our team has also custom-built proprietary filters to remove noise and errors.

As a result, GDA gains deeper insight into the order book and liquidity, and can map them at the granular, or ‘atomic’, level required for L3 data. We are also able to determine the following:

  • Mapping of participant market impact to market microstructure correlations
  • Participant market impact
  • Participant fill rate
  • Value areas, liquidity & volume clusters mapping
  • Shapeshifting phenomenon detection

From the above insights, advanced users such as quants, financial engineers, and programmers could conduct the following use cases and tailor them to their own strategies.

Use Cases

GDA’s Open Crypto Data Initiative will be an essential part of crypto’s expansion into more institutional research, backtesting, and optimisation activities such as:

  • Advanced quantitative research and backtesting, insights into market microstructure and order book dynamics, designing market behaviour models, data-driven contract insights, liquidity cluster heat-map development
  • Liquidity cluster shapeshifting phenomenon, market microstructure correlations, market strength and weakness indices, trend development, identifying trading ranges and trend distribution, estimating fluctuations and volatility
  • Signal contact strategy development, multi-asset accelerated signals, cross-strategy design and development (generating alpha by combining strategies across CEXs and DEXs)
  • Enhancing trade execution (CEX and DEX), smart order routing, deeper liquidity cost optimisation, HFT load balancing, pricing and fee optimisation, cost-effective market entry
  • Liquidity aggregation, portfolio construction, composition and analytics, rebalancing on a reliable and lightning-fast data stream
  • Advanced risk model development, tail hedging, liquidation management

Traditional Finance Use Cases

GDA’s services could also be used in more traditional finance activities such as:

Statistical arbitrage is a group of trading methods that make use of mean reversion models to invest in broadly diverse assets held for a relatively short period of time. This can be only for a few seconds or up to a couple of days.

Statistical arbitrage will match assets by type with the goal being to reduce exposure to beta and other risk factors. This is done across two phases: “scoring” and “risk reduction”. Scoring assigns a rank to each asset based on desirability whilst risk reduction will combine desirable assets into a portfolio to lower risk.

The strategies used in statistical arbitrage are described as market-neutral as they involve “Opening a long and short position simultaneously to take advantage of inefficient pricing” in correlated assets. As an example, if a hedge fund manager believes Stock A is undervalued and Stock B is overvalued, a long position would be opened in Stock A with a short position being opened with Stock B. This is known as pairs trading.
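
For illustration only, a toy mean-reversion pairs signal might look like the sketch below; the window and entry threshold are arbitrary example values, and the data is synthetic, so this is not a recommended strategy.

```python
# Illustrative pairs-trading signal: track the log-price spread between two
# correlated assets and trade when it deviates from its recent mean.
import numpy as np

def pairs_signal(prices_a, prices_b, window=20, entry_z=2.0):
    spread = np.log(prices_a) - np.log(prices_b)
    mean = spread[-window:].mean()
    std = spread[-window:].std()
    z = (spread[-1] - mean) / std
    if z > entry_z:
        return "short A / long B"   # spread unusually wide
    if z < -entry_z:
        return "long A / short B"   # spread unusually narrow
    return "no trade"

rng = np.random.default_rng(0)
a = 100 + np.cumsum(rng.normal(0, 1, 200))
b = a + rng.normal(0, 0.5, 200)     # cointegrated-by-construction pair
print(pairs_signal(a, b))
```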

A market maker is a firm or individual that stands ready to buy and sell a particular stock on a regular and continuous basis at a publicly quoted price.

There is significant overlap between market-making and high-frequency trading (HFT) firms, but a crucial distinction is that true market makers are committed not to exit the market at their discretion, whereas HFT firms are under no such commitment. Many HFT firms are characterised as market makers, running a set of strategies that involve placing limit orders to sell (offers) and limit orders to buy (bids) in order to earn the bid-ask spread. By doing so, market makers provide a counterpart to incoming market orders. This role was traditionally carried out by specialist firms but is now performed by a wide range of investors thanks to the broad adoption of direct market access. This renewed competition among liquidity providers reduces effective market spreads, and subsequently the indirect costs for financial investors.

Implied volatility (IV) is a metric investors use to estimate future price fluctuations of an asset based on certain predictive factors. IV can be thought of as a proxy for market risk and is “Commonly expressed using percentages and standard deviations over a specified time horizon”.

In traditional finance:

  • A bearish market leads IV to increase, as investors believe asset prices will decline
  • A bullish market leads IV to decrease, as investors believe asset prices will increase
  • High volatility implies large price swings (either up or down)
  • Low volatility suggests prices will not make unpredictable moves

Types of Data

Data is a term that can refer to a number of different types of information in crypto and markets generally. Market data is the trading information from trading venues such as exchanges. Each trade is a packet of information that consists of its amount, price, and timestamp. A trade can only occur once a limit order is matched against a market order.

This section will cover all the types of data GDA provides users with.

Limit Order — Basic Explanation

A limit order is an order for a certain amount of one asset, at a set price. Because there can be millions of these orders at any one time, exchanges use a “limit order book” to keep track of them. A limit order book is a table of all limit orders aggregated together by price. For example, the BTC-USDT limit order book looks like this.

Figure 1. An example limit order book (Binance)

The price of the order is on the left and is how all orders are aggregated. The middle column shows the amount at that price level, i.e. the sum of the orders at that level. The amount is displayed in the base asset, in this case BTC. The rightmost column shows that amount in the quote asset, in this case USDT. As an example, if you place a limit order for 0.5 BTC at $57,261, your order will be added to the order book at that price level, and the new amount at that level in the limit order book above would become 0.93114. The orders in red are called asks and are sell orders of the base asset (BTC). The orders in green are called bids and are buy orders of the base asset.

The size of the order book or the volume of orders on it is referred to as liquidity. Typically, the more liquidity the better. As more trading venues are introduced, liquidity across all trading venues becomes more fragmented.

Market Order — Basic Explanation

A market order is an order designed to be executed immediately. Market orders only require an amount to be bought/sold and do not have a price parameter. That is because they will be matched with the best available price at that time up to the amount indicated. In the above limit order book, a market order to buy 0.5 BTC would be matched against the best ask price which is $57,269.99 and the remaining 0.29183 is matched against the next best price of $57,270.00. Therefore, even though the buy price of BTC was $57,269.99, you would have paid slightly more for your entire order. This extra price is called slippage. More liquidity in the market helps minimise slippage.

When market orders are relatively large, they are affected by slippage. This phenomenon occurs when you place an order for the best price, but the size of your order is greater than the amount of that asset at the best price. To fill your order, it is then matched against the next best price. This continues until it is filled.
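
Using the quantities from the example above (the liquidity at the best ask is taken to be 0.20817 BTC, the amount filled before the order walks to the next level), here is a minimal sketch of how slippage can be computed by walking the ask side of the book:

```python
# Sketch of computing the average fill price (and hence slippage) of a
# market buy order by walking the ask side of the book.
def fill_market_buy(asks, qty):
    """asks: list of (price, size) sorted from the best (lowest) price."""
    remaining, cost = qty, 0.0
    for price, size in asks:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    return cost / (qty - remaining)   # average price actually paid

asks = [(57269.99, 0.20817), (57270.00, 1.5)]
avg = fill_market_buy(asks, 0.5)
print(avg)                 # slightly above the best ask of 57269.99
print(avg - 57269.99)      # the slippage per BTC
```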

LOB data

The limit order book is a rich data source for traders wanting a better understanding of the market. There are different levels of granularity for this data; L1, L2, and L3. L1 data is the simplest form of the data and only consists of the best bid and ask price of an asset. In the LOB snapshot above, L1 data is $57,269.98 as the best bid and $57,269.99 as the best ask. L2 data has more detail and consists of aggregated orders at each price level, as seen in the LOB above. L3 data is more granular than L2 because it does not aggregate the orders together at each price level. Individual orders are ascertainable in L3 data within every price level.

Financial market data

This includes data that is sometimes not accessible from exchanges, such as funding rates, which traders find important for understanding the liquidation concentration of an asset. It also includes maker and taker fees: fees charged depending on whether an order adds liquidity to the order book (maker) or removes it by matching an existing order (taker). Tick/volume bars of an asset describe market activity.

Other financial market data includes:

  • Bid/ask price
  • Last, Index and Mark price
  • Open interest/value
  • Cross sequential
  • Delivery fee rate
  • Predicted delivery price
  • Predicted funding rate

Additional market data

Administrative data is usually only provided over social media platforms or on an exchange website. This contains information such as API downtimes or server maintenance which affects trading. Also, this is not available as a constant data stream.

Enhanced Data

GDA will provide users with the following ‘enhanced’ data:

  • AOPV: trades that have been established but not yet closed out with an opposing trade, considered as the aggregate value per contract for each exchange
  • Taker Buy: within a set period, the total of buy orders filled by takers
  • Taker Sell: within a set period, the total of sell orders filled by takers
  • Basis: the difference between the futures price and the price index at a given time
  • Basis = Futures Price - Price Index
  • Basis Rate = (Futures Price - Price Index) / Price Index
  • WebSocket expected errors and output stability

Customisable Data

Below shows a few of the data points customisable by users:

  • Funding rates (as a percentage) are used by centralised exchanges for perpetual contracts. NFR is the normalised funding rate across exchanges.
  • Open interest is the total number of futures contracts held by market participants at the end of a desired trading period. “It is used as an indicator to determine market sentiment and the strength behind price trends”.
  • It can also be defined as the “Amount of open positions (including both long and short positions) currently on a derivative exchange(s)”.
  • On traditional futures exchanges that list crypto products, such as CME, open interest is calculated by looking at the total number of futures contracts held by market participants at the end of the trading day. EOI is a harmonised open interest across exchanges.
  • The total liquidity flowing into or out of an exchange, divided by the total liquidity transferred on the whole blockchain (BTC and ETH).
  • In the event of liquidation, Auto-Deleveraging (ADL) systems automatically deleverage an opposing position from a selected trader.
  • If this order is closed at a price worse than the bankruptcy price, exchanges use the balance of the Insurance Fund to cover the gap.
  • If the Insurance Fund is insufficient, Auto-Deleveraging is triggered. Monitoring the relationship between the Insurance Fund and ADL is important for understanding volatility and designing risk models for traders.
  • As mentioned before, maker and taker fees are valuable data for arbitrage use cases and for medium- to high-frequency traders. GLP helps traders and quants understand how wide artificial markets are.

Users

Users within the Lite class are granted lifetime free access to our services. They can retrieve L2 and L3 data pulled directly from APIs and WebSockets in real time, linked to a number of on-chain and off-chain exchanges and then normalised.

In addition to this, they are able to receive both market data and administrative data which is sometimes not provided by exchanges.

Overall, we anticipate the Lite tier to largely consist of retail traders who may not need the full services provided by GDA’s Data Lake. Up to 100 000 users will be supported.

Users within the Advanced class have all the same capabilities as Lite users but can customise the feeds from which the data is sourced. Customisation allows for bespoke data feeds, filtering out unnecessary data points, aggregating data by specific timesteps, retrieving historical data, custom pairs/symbols per CDN, and so on. For example, a user can access raw L2 data from any exchange instead of being provided with normalised, aggregated data coming from GDA’s downstream Content Delivery Networks (CDNs).

For a particular case in high-frequency trading, a user may want to test their execution strategies in different regions. To do this, they would select which CDN to source their data from. This is important: the shorter the path from a user’s trade execution engine to a CDN or exchange connection point, the faster they are able to execute and the greater their front-running opportunity over others.

Users in this class are able to use the Open Data Streaming Suite (ODSS), discussed later in this article, but with limitations on which adapters and plug-ins can be utilised. Further information about these restrictions will be released soon.

Altogether, 50 000 users will have access to this tier.

Max Users have complete access to our Data Dissemination Suite and are able to customise any L2, L3 and additional data without restrictions. We expect most users in this class to be involved in advanced quantitative analytics.

At the current time, we expect only 10 000 users will have access to this class but this number may increase in the future.

Shown below is a table containing user features available to different classes. This may change with future updates.

Table 1. User tiers (GDA)

High-Level Architecture

The architecture of our solution incorporates the following design principles to ensure it achieves its purpose.

We address the reliability and availability problems by combining all the available data in the market with a set of carefully tailored design principles:

Figure 2. GDA has a data stream from hundreds of different networks (GDA)

Our solution connects individual microservices to multiple CDNs. Private CDNs allow us to duplicate the same stream across multiple pathways per product per exchange. For example, if one CDN goes down in the US, the other CDN in Germany will still be up and running, ensuring an uninterrupted data stream.

As we plan to build multiple data streams per exchange per product, we will reach hundreds of streams from numerous crypto exchanges for different markets (e.g., futures and spot markets). We aim to generate one aggregated and reliable value for each product or coin using all these sources.

As we will aggregate data from multiple sources, one issue to be addressed is to set the ideal valid price for a product. In the example below, our data stream provides us with four different BTC/USD prices:

Figure 3. Price for the same product can quickly change across different exchanges (GDA)

We have the task of deciding a valid price point by detecting and eliminating the source-specific price issues. The federated consensus algorithm aggregates the values and finds the actual value. The example above shows that our most frequent price for BTC is USD 64,876.00, and therefore, our federated consensus algorithm will know that the other prices have source-specific issues such as a potential glitch.

Figure 4. GDA creates private CDN networks and utilizes a federated consensus algorithm to find the valid price (GDA)
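
A simplified sketch of this idea is shown below; GDA's actual federated consensus algorithm is more sophisticated than a plain majority vote, so treat this only as an illustration of picking the modal price and flagging the rest as source-specific glitches.

```python
# Simplified consensus illustration: the most frequent price across replica
# feeds is treated as valid; the rest are flagged as source-specific issues.
from collections import Counter

def consensus_price(feed_prices):
    counts = Counter(feed_prices.values())
    valid, _ = counts.most_common(1)[0]
    outliers = {src: p for src, p in feed_prices.items() if p != valid}
    return valid, outliers

prices = {
    "cdn_us": 64876.00,
    "cdn_de": 64876.00,
    "cdn_sg": 64876.00,
    "cdn_jp": 61120.50,   # glitched source
}
valid, outliers = consensus_price(prices)
print(valid, outliers)    # 64876.0 {'cdn_jp': 61120.5}
```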

After we aggregate the data from multiple data streams, apply the federated consensus algorithm, and store the processed data in a cluster, advanced users can request a custom data dissemination method. Our engineering team can write a simple yet flexible script to ensure the immediate availability of the data in the requested dissemination format.

High-quality and reliable data does not require complex and hard-to-comprehend data structures. Our R&D team adopts NoSQL object formats and keeps the data as flat as possible, meaning no deeply nested data structures. With this simplicity in mind, any trader can take full advantage of GDA’s advanced data environment.

An advanced data environment does not mean a complex data structure.
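
As a purely illustrative sketch (the field names are assumptions, not our published schema), a flat trade record might look like this:

```python
# Hypothetical example of a flat, non-nested record: one trade as a single
# document. Field names are illustrative only.
trade = {
    "exchange": "binance",
    "symbol": "BTC-USDT",
    "price": 64876.00,
    "size": 0.015,
    "side": "buy",
    "timestamp": 1637049600123,   # milliseconds since epoch
}
# No nested objects: every field can be read, filtered, or aggregated directly.
print(trade["price"] * trade["size"])
```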

By utilizing the design principles above, we provide L2 and L3 grade data points like funding rates that are not accessible by the REST APIs on the market. L2 and L3 grade data is already very scarce in the market. However, our researchers can gather all the available information in the market on top of our own CDN network streams and provide the most advanced and detailed crypto data in the market.

To showcase how vital these previously inaccessible data points can be, here is a simple example:

Figure 6. Funding Rate is a critical price determinant (Bybit)

Here is a funding rate jump around 8:00 am on Nov 16, 2021. The funding rate is an L2 grade data point that cannot be accessed through the REST APIs provided by crypto exchanges. It is a hugely significant piece of information, and as a trader you don’t have access to it. Here is how it affects the price:

Figure 7. How the funding rate affects the price immediately after (Bybit)

As you can see, there is a price jump that the funding rate signal anticipates, and the REST API does not provide this information.

If you have access to this data, you can place limit orders using these signals so that you are protected from losses.

Microservices are an architectural style that arranges an application as a collection of loosely coupled services. They are highly maintainable and testable, and can be organised around business capabilities.

GDA uses the following finite state machines to represent the health of our services.

Figure 8. Microservice state machine (GDA)
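
A hedged sketch of such a state machine is below; the state names and transitions are illustrative assumptions rather than the exact machine shown in Figure 8.

```python
# Illustrative health state machine for a data-collection microservice.
# States and transitions here are assumptions for the example.
from enum import Enum, auto

class ServiceState(Enum):
    STARTING = auto()
    HEALTHY = auto()
    DEGRADED = auto()
    FAILED = auto()

# Allowed transitions: a service can only move along these edges.
TRANSITIONS = {
    ServiceState.STARTING: {ServiceState.HEALTHY, ServiceState.FAILED},
    ServiceState.HEALTHY: {ServiceState.DEGRADED, ServiceState.FAILED},
    ServiceState.DEGRADED: {ServiceState.HEALTHY, ServiceState.FAILED},
    ServiceState.FAILED: {ServiceState.STARTING},   # restart
}

def transition(current, target):
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = ServiceState.STARTING
state = transition(state, ServiceState.HEALTHY)
print(state)
```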

Monitoring of our infrastructure is handled through Amazon CloudWatch and OpenSearch dashboards. Alarms are carefully set up to report any extremes in price, data usage, error messages, and costs, with all data volumes being logged.

Figure 9. Dashboard (GDA)

Data Collection and Warehousing

For data collection, GDA has designed microservices that connect to the data source and stream the data to our data storage. We use the Libwebsockets library in our primary and replica microservices to extract the data. Our system’s functionality extends well beyond data ingestion, normalisation, and persistence:

  • Running dedicated primary and replica services for data validation
  • Caching the most recent trade and order book for every market
  • Computing and storing price metrics
  • Investigating market quality and liquidity, and other services that help safeguard data accuracy
  • Using HPC techniques to normalise data feeds

Below are two options for configuring the services:

Figure 10. Data warehousing configuration options (GDA)

For data storage, we have designed two separate components: a data lake and a data warehouse. Data lakes are great at storing very large amounts of data, and data warehouses are great at storing large amounts of aggregated data. That is why we have combined the two to provide the best possible solution.

We minimise the latency of GDA’s trading systems by using the Libwebsockets library: a “Lightweight, pure C library for implementing modern network protocols easily with a tiny footprint, using a non-blocking event loop”. It connects to the Bybit WebSocket server and monitors transactions with an eye on low latency.

Figure 11. LibWebSockets

Current testing shows a message rate of approximately 70 messages per second, with the messages on the wire being roughly 70-byte packets containing small Transport Layer Security (TLS) records. As a TLS record cannot be authenticated until all of its packets are received, the larger the packets of data, the longer the first item in a packet must wait before being sent.

The permessage-deflate extension overcomes this issue, reducing packet size before TLS by applying compression and giving options for how LWS manages conflicting requirements. Additionally, the LCCSCF_PRIORITIZE_READS flag helps the incoming data stream prioritise any RX not “Just at the SSL layer but in one event loop trip”.
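
The production collectors are written in C against Libwebsockets; purely as an illustration of the flow (connect with per-message compression, then consume messages from a non-blocking loop), here is a rough Python equivalent using the third-party websockets package. The URL is a placeholder, not a real GDA or exchange endpoint.

```python
# Rough Python stand-in for the C collector: connect with permessage-deflate
# compression enabled and read messages as they arrive.
import asyncio
import websockets

async def stream(url="wss://example.invalid/ws/trades"):
    # compression="deflate" enables the permessage-deflate extension,
    # shrinking each frame before it reaches the TLS layer.
    async with websockets.connect(url, compression="deflate") as ws:
        async for message in ws:
            print(message)   # hand off to normalisation / storage here

if __name__ == "__main__":
    asyncio.run(stream())
```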

Data for the data lake is streamed via a number of microservices. Having multiple microservices guarantees the accuracy and uptime of our data feeds. These services connect to the CDNs. Together, the microservices are responsible for verifying the incoming data and correcting outlying data points. If all microservices report the same values, the incoming data is correct. If some agree but a few differ, we need a way of determining which feeds are correct; we have created a federated consensus algorithm that makes this decision. The microservices that stream the data also implement normalisation in real time, which is discussed in the next section.

The data lake utilises an Amazon S3 bucket to store the data. Data feeds stream into a buffer which flushes every minute into the data lake. This setup makes it easy to do research with the data using Amazon SageMaker and EMR clusters.

Figure 12. Amazon SageMaker

Amazon SageMaker is a fully managed machine learning service. With SageMaker, “Data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment”.

GDA’s Data Lake also utilises Apache Hive — a data system allowing for massive-scale analytics.

A data warehouse is a type of data storage infrastructure that is smaller than a data lake but larger than a typical relational database. GDA’s warehouse design is based on star schemas — a common database organisational structure optimised for use in data warehousing to store transactional and measured data. Snowflake schemas are also used.

The microservices that stream into the data lake also stream data to a message bus on Apache Pulsar. The message bus sorts the data into topics that our users can connect to. As well as live, real-time data, the data warehouse will also stream historical data replays.

Figure 13. Data warehouse (GDA)

The design of the data warehouse is based on an OLAP Cube. As stated by Microsoft, Online Analytical Processing refers to a “Data structure that overcomes the limitations of relational databases by providing rapid analysis of data. Cubes can display and sum large amounts of data while also providing users with searchable access to any data points”.

The OLAP cube design for our data warehouse can be seen below:

Table 2. OLAP Cube design (GDA)

Data Normalisation and Enhancement

Because of the nature of the data and the data analytics that we will be doing, incoming data that GDA collects needs to be normalised. When we say normalisation, we are referring to the normalisation of data within a database schema.

Unnormalised data has multiple values within a single field, as well as redundant information in the worst case. WebSocket data arrives in multiple unnormalised, aggregated, and disparate feeds.

There are benefits to normalisation, such as updates becoming easier and aggregation functions like MIN, MAX, COUNT, and SUM becoming simple to perform.

Database normalisation has a number of steps involved starting at the first normal form (1NF) and going up to the fifth normal form (5NF). Our normalisation implementation consists of the first three normal forms:

  • First normal form — 1NF
  • Second normal form — 2NF
  • Third normal form — 3NF

A 3NF database is significantly easier to insert, remove and update from as well as aggregate.

1NF reduces each row of the database to its atomic form. 2NF introduces a primary key as a unique ID for each row such that every value is dependent on that key. 3NF moves transitively dependent values into new tables and makes all non-key attributes depend only on the primary key.

Doing this normalisation drastically simplifies the extraction of data metrics from the order book, especially where these metrics are aggregating data.
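
As an illustrative sketch only (the field names are assumptions, not the schemas in the tables below), here is how a single unnormalised WebSocket message can be flattened into atomic, key-dependent rows:

```python
# Illustrative only: flatten one unnormalised exchange message into
# normalised rows keyed by (symbol, side, price).
raw = {
    "symbol": "BTC-USDT",
    "bids": [[57269.98, 0.43114], [57269.50, 1.2]],   # multiple values per field
    "asks": [[57269.99, 0.20817]],
    "ts": 1637049600123,
}

# 1NF: one atomic row per price level; 2NF/3NF: non-key attributes depend
# only on the row's key.
levels = [
    {"symbol": raw["symbol"], "side": side, "price": price, "size": size, "ts": raw["ts"]}
    for side, book in (("bid", raw["bids"]), ("ask", raw["asks"]))
    for price, size in book
]

# Aggregation now becomes trivial, e.g. total bid liquidity:
print(sum(r["size"] for r in levels if r["side"] == "bid"))
```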

All Limit Order Book data normalised by GDA will be collected in a standardised format. Shown below is a data schema explaining these formats:

Table 3. Limit order book data
Table 4. Timestamp data
Table 5. Order details data
Table 7. Market orders data

Creating Rich Data — Aggregation Tasks

Order flow analysis is a data aggregation task that is made possible by GDA’s normalisation. It refers to order flow imbalance (OFI) and trade flow imbalance (TFI), where OFI is a measurement of supply and demand inequalities in a Limit Order Book during a given time frame (Cont et al, 2014).

As mentioned in Order Flow Analysis of Cryptocurrency Markets, OFI is based on the concept that any event which alters the state of a LOB will either change the existing demand or supply within a LOB. In particular:

  • The arrival of a limit bid order signals an increase in demand
  • The cancellation of a full/partial limit bid order, or the arrival of a market sell order, signals a decrease in demand
  • The arrival of a limit ask order signals an increase in supply
  • The cancellation of a full/partial limit ask order, or the arrival of a market buy order, signals a decrease in supply

We have eight clear events here but are only provided with three LOB events by exchanges. These events are grouped together on the data feeds to reduce the amount of data in the feeds.

OFI is an aggregate of the impacts e over the events that occur during time period t, where N(t) is the number of events occurring at level 1 of the LOB during the interval [0, t].

The dependent variable is the simultaneous mid-price change, measured in ticks over the same time period as the OFI, where MP stands for the midpoint between the bid and ask price. A standard formulation of these quantities is given below.
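
In the notation of Cont et al. (2014), where P^B and P^A are the best bid and ask prices, q^B and q^A the sizes quoted at those prices, and δ the tick size, these quantities are commonly written as:

```latex
e_n =  \mathbf{1}_{\{P_n^{B} \ge P_{n-1}^{B}\}}\, q_n^{B}
     - \mathbf{1}_{\{P_n^{B} \le P_{n-1}^{B}\}}\, q_{n-1}^{B}
     - \mathbf{1}_{\{P_n^{A} \le P_{n-1}^{A}\}}\, q_n^{A}
     + \mathbf{1}_{\{P_n^{A} \ge P_{n-1}^{A}\}}\, q_{n-1}^{A}

\mathrm{OFI}_k = \sum_{n = N(t_{k-1}) + 1}^{N(t_k)} e_n ,
\qquad
\Delta \mathrm{MP}_k = \frac{\mathrm{MP}_k - \mathrm{MP}_{k-1}}{\delta} ,
\qquad
\mathrm{MP} = \frac{P^{B} + P^{A}}{2}
```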

Section 4 of the aforementioned article explains why OFI provides a good estimate for realised mid-price change and the reasons why it cannot provide a better fit. The analysis provided shows that the impact of TFI on prices is stronger than that of OFI due to reasons both macrostructural and microstructural.

In crypto, spoofing is a term attributed to whales who seek to gain profits through market manipulation. They employ a strategy of placing a number of visible orders into a LOB in an attempt to skew the perception of supply and demand. This sends out a signal across the market, causing others to invest in an asset and resulting in an increase in price.

Unlike traditional finance, which employs anti-spoofing policies, cryptocurrency markets have no regulatory equivalent. Spoofing degrades the quality of OFI as a statistical indicator of mid-price change, so TFI in crypto markets is often the better mid-price change predictor. Where TFI is, for most time frames, a stronger predictor of mid-price change than OFI (its R² is higher), this indicates a lack of market quality (i.e. spoofing).

Data Dissemination and Streaming — Open Data Streaming Suite (ODSS)

Our Open Data Streaming Suite (ODSS) is how market data will be disseminated to users. The suite consists of market data adapters: interfaces through which clients can connect and subscribe to real-time as well as historical data streams in a variety of ways. Users will be able to connect over HTTPS to request market data, with real-time and historical data disseminated over WebSocket. Anyone who wants to connect to and query our data just needs to send an HTTPS request containing their user details. If the details are valid, a WebSocket connection is established to stream the data until the user disconnects.

The ODSS also allows users who support Apache Pulsar or Kafka to connect to our message bus directly and subscribe to the topics. Streaming data this way will have low latency for users.

Figure 14. Pulsar cluster (GDA)
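
As an illustration of that path, a Python consumer using the pulsar-client package might look like the sketch below; the broker address and topic name are placeholders, not GDA's actual endpoints.

```python
# Illustrative Pulsar consumer; broker URL and topic are placeholders.
import pulsar

client = pulsar.Client("pulsar://broker.example.invalid:6650")
consumer = client.subscribe(
    "persistent://public/default/btc-usdt-trades",   # hypothetical topic
    subscription_name="my-research-sub",
)

try:
    while True:
        msg = consumer.receive()        # blocks until a payload arrives
        print(msg.data())               # raw bytes of the normalised record
        consumer.acknowledge(msg)       # tell the broker it can be discarded
finally:
    client.close()
```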

Users wanting historical data can make an HTTPS request that includes the time window for which market data should be replayed. They will then receive a replay of historical data for that window. If they wish, users can also include a flag indicating that they want to join the live data stream after the replay. Historical data snapshots are also available to users who provide a start timestamp and a stop timestamp.

The following shows the available connections users have access to:

REST API (HTTPS)

  • Clients can connect to the assigned URL
  • The client has to be authenticated with JWT before streaming data, using client_id and password or API token
  • The HTTP connection is upgraded to a WebSocket and real-time data is streamed over the WebSocket interface
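
A hypothetical walk-through of this flow is sketched below; the endpoint URLs, field names, and token handling are placeholders rather than GDA's published API.

```python
# Hypothetical connection flow: authenticate over HTTPS, receive a JWT,
# then stream real-time data over WebSocket. All URLs are placeholders.
import asyncio
import requests
import websockets

async def subscribe(client_id, api_token):
    # 1. Authenticate over HTTPS and receive a JWT.
    resp = requests.post(
        "https://data.example.invalid/auth",          # placeholder URL
        json={"client_id": client_id, "api_token": api_token},
        timeout=10,
    )
    jwt = resp.json()["token"]

    # 2. The connection is upgraded to WebSocket and data streams in real time.
    url = f"wss://data.example.invalid/stream?token={jwt}"
    async with websockets.connect(url) as ws:
        async for message in ws:
            print(message)

# asyncio.run(subscribe("my-client", "my-token"))
```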

Apache Pulsar

  • The client can join our native Pulsar messaging system and stream payloads in realtime

Apache Kafka

  • Clients can join our native Kafka infrastructure to stream real-time data

Additional adapters include:

  • Redis (Streams and Sorted Sets)
  • Arctic
  • ZeroMQ
  • RabbitMQ
  • PostgreSQL
  • PubNub

Security Management

Because the Open Crypto Data Initiative will be free and available from public endpoints, we will implement throttling controls and duplicate-IP restrictions to avoid any adverse back pressure on our market data dissemination system. Each time a connection comes in, the client ID is identified and used to look up the user’s details and permissions.

The connection’s source IP and browser information are ascertained, and if many requests come from the same IP and/or the same browser, the connection can be throttled according to the user’s permissions. It is necessary to determine a user’s permissions first because each user tier has different limitations. Max users will not be subject to throttling and will be able to access low-latency, high-speed market data with the best quality of service.
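
In the spirit of the description above, a per-client token bucket is one simple way such throttling could be implemented; the tier limits below are illustrative assumptions, not GDA's actual quotas.

```python
# Hedged sketch of per-client throttling: a token bucket keyed by client ID,
# with limits depending on the user's tier. Limits are illustrative.
import time

TIER_LIMITS = {"lite": 10, "advanced": 100, "max": None}   # requests/second

class Throttle:
    def __init__(self):
        self.buckets = {}   # client_id -> (tokens, last_refill)

    def allow(self, client_id, tier):
        rate = TIER_LIMITS[tier]
        if rate is None:            # Max users are not throttled
            return True
        now = time.monotonic()
        tokens, last = self.buckets.get(client_id, (rate, now))
        tokens = min(rate, tokens + (now - last) * rate)   # refill
        if tokens < 1:
            self.buckets[client_id] = (tokens, now)
            return False
        self.buckets[client_id] = (tokens - 1, now)
        return True

t = Throttle()
print(t.allow("client-42", "lite"))
```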

New Approach to Delivering Data to Financial Engineers

An advantage of GDA’s approach over NumPy (which uses vectorisation) is that our data is already aggregated and is therefore faster to disseminate.

GDA also utilises the technology of OLAP Cubes. Briefly mentioned in an earlier section, OLAP is a way to quickly “Respond to multi-dimensional analytical (MDA) queries in computing”.

OLAP tools allow users to interact with and analyse data from different viewpoints. They are composed of three basic analytic operations: consolidation (roll-up), drill-down (moving from L2 to L3 data or vice versa), and slicing and dicing.

As expressed clearly in this article, a roll-up “Involves the aggregation of data that can be accumulated and computed in one or more dimensions”. For example, supply and demand in a LOB can be aggregated in the time dimension.

Conversely, the drill-down operation is a technique that allows users to navigate through the details of the LOB events. Through slicing and dicing, users are able to “slice” particular sets of data from the OLAP and see them from different viewpoints (dicing). These viewpoints can be referred to as dimensions (such as contrasting different CDNs with the same feed).

In comparison to Online Transaction Processing (OLTP), which is slower at processing analytical queries, OLAP databases use a multidimensional model, allowing for fast execution times, and are mainly optimised for read queries. In general, OLAP is more comprehensible and easier to understand for non-IT professionals.

An OLAP cube’s use of aggregation allows it to achieve high query execution speeds. These aggregations are built from a fact table: an aggregate function changes the granularity on specific dimensions of the data and totals it up along the required dimensions. The number of possible aggregations is determined by every possible combination of dimension granularities. This, combined with the base data, “Contains the answers to every query which can be answered from the data”.

Since there are many aggregations that could be computed, only a predetermined number is fully calculated, with the rest being computed on request. This problem of deciding which aggregations (viewpoints) to compute is called the view selection problem. With GDA’s services, each customer is limited in the number of data points they can view by the level of the time hierarchy they are able to access. Lite users would be able to see 365*24*60 data points (one per minute over a year), Advanced users 365*24*60*60 (one per second), and Max users 365*24*60*60*10 (one every 100 ms).

This is achieved by controlling the levels of the time dimension that users are able to access. These are all data points per metric. Note: this strategy was discussed in the ODSS section, where users have the choice of WebSocket, REST API, or pagination.

Some functions can be calculated efficiently through the application of a divide-and-conquer algorithm, e.g. the sum of a roll-up is the sum of the sub-sums in each cell. These are called self-decomposable aggregation functions and include COUNT, MAX, MIN, and SUM. OFI and TFI are good examples of self-decomposable aggregation functions.
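
As a small illustration, rolling per-minute sub-sums up into coarser cells never requires revisiting the raw events; the numbers below are toy values:

```python
# Self-decomposable aggregation: per-minute sub-sums of traded volume can be
# rolled up into coarser totals directly, which is what makes OLAP roll-ups cheap.
minute_volumes = [12.4, 8.1, 15.0, 9.7, 11.2, 7.6]   # toy per-minute sub-sums

def roll_up(sub_sums, group_size):
    """SUM is self-decomposable: the sum of a group is the sum of its sub-sums."""
    return [
        sum(sub_sums[i:i + group_size])
        for i in range(0, len(sub_sums), group_size)
    ]

print(roll_up(minute_volumes, 3))   # roll six minutes up into two 3-minute cells
```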

Different from the roll-up example given above, some aggregate functions are calculated via “Computing auxiliary numbers for cells, aggregating these auxiliary numbers, and finally computing the overall number at the end”. AVERAGE and RANGE are examples of this. Additionally, some functions can only be calculated by analysing an entire set of data at once. Examples of this are DISTINCT COUNT, MEDIAN, and MODE.

Conclusion

There are several issues with accessing high-quality data from centralised exchanges. The use of throttling mechanisms limits the types of data offered and makes the feeds unreliable in some cases. An alternative way for a user to obtain quality data is to seek the services of traditional exchanges, institutions, or hedge funds; however, this can be expensive.

As mentioned throughout this article, GDA is launching our Open Data Initiative to combat these issues. For the first time in crypto, institutional-grade, reliable, and fast streaming financial data will be given to the public for free. This is achieved through our use of various technologies and innovative approaches to handling data.

For updates on GDA’s work, press ‘follow’ or subscribe to our emailing list.

References

[1] “Apache Hive TM.” https://hive.apache.org/ (accessed Dec. 05, 2021).

[2] “Apache Kafka,” Apache Kafka. https://kafka.apache.org/ (accessed Dec. 05, 2021).

[3] Arctic TimeSeries and Tick store. Man Group, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://github.com/man-group/arctic

[5] “Beta (finance),” Wikipedia. Nov. 11, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Beta_(finance)&oldid=1054719731

[6] “Blockchain Oracles for Hybrid Smart Contracts | Chainlink.” https://chain.link/ (accessed Dec. 05, 2021).

[7] Minewiskan, “Create a Date type Dimension.” https://docs.microsoft.com/en-us/analysis-services/multidimensional-models/database-dimensions-create-a-date-type-dimension (accessed Dec. 05, 2021).

[8] “Divide-and-conquer algorithm,” Wikipedia. Nov. 22, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Divide-and-conquer_algorithm&oldid=1056538407

[9] “Fact table,” Wikipedia. Feb. 02, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Fact_table&oldid=1004435405

[11] “FAQ | Hyblockcapital.” https://hyblockcapital.com/faq/ (accessed Dec. 05, 2021).

[12] “Home,” Apache Airflow. https://airflow.apache.org/ (accessed Dec. 05, 2021).

[13] “libwebsockets.org lightweight and flexible C networking library.” https://libwebsockets.org/ (accessed Dec. 05, 2021).

[14] “Messaging that just works — RabbitMQ.” https://www.rabbitmq.com/ (accessed Dec. 05, 2021).

[15] “Microservices,” Wikipedia. Nov. 29, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Microservices&oldid=1057745936

[16] “Online analytical processing,” Wikipedia. Jul. 20, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Online_analytical_processing&oldid=1034587128

[18] “Open Interest.” https://dataguide.cryptoquant.com/market-data-indicators/open-interest (accessed Dec. 05, 2021).

[19] “Open Interest — CME Group.” https://www.cmegroup.com/content/cmegroup/en/education/courses/introduction-to-futures/open-interest.html (accessed Dec. 05, 2021).

[20] E. Silantyev, “Order Flow Analysis of Cryptocurrency Markets,” Medium, May 04, 2018. https://medium.com/@eliquinox/order-flow-analysis-of-cryptocurrency-markets-b479a0216ad8 (accessed Dec. 05, 2021).

[21] P. G. D. Group, “PostgreSQL,” PostgreSQL, Dec. 05, 2021. https://www.postgresql.org/ (accessed Dec. 05, 2021).

[22] “Real-Time In-App Chat and Communication Platform,” PubNub. https://www.pubnub.com/ (accessed Dec. 05, 2021).

[23] “Selection algorithm,” Wikipedia. Oct. 26, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Selection_algorithm&oldid=1051949422

[24] “Snowflake schema,” Wikipedia. Nov. 05, 2021. Accessed: Dec. 05, 2021. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Snowflake_schema&oldid=1053757563

[25] “Understanding Airflow ETL: 2 Easy Methods,” Learn | Hevo. https://hevodata.com/learn/airflow-etl-guide/ (accessed Dec. 05, 2021).

[26] “What is a star schema and how does it work?,” SearchDataManagement. https://searchdatamanagement.techtarget.com/definition/star-schema (accessed Dec. 05, 2021).

[27] “What Is Amazon SageMaker? — Amazon SageMaker.” https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html (accessed Dec. 05, 2021).

[28] “What Is Auto-Deleveraging (ADL)?,” Bybit Learn, Nov. 10, 2021. https://learn.bybit.com/trading/what-is-auto-deleveraging-adl/ (accessed Dec. 05, 2021).

[29] “What is insurance fund?,” Bybit Official Help. https://help.bybit.com/hc/en-us/articles/900000037786-What-is-insurance-fund- (accessed Dec. 05, 2021).

[30] “ZeroMQ.” https://zeromq.org/ (accessed Dec. 05, 2021).

GDA Holdings Ltd is a crypto fund. GDA is developing a new generation of quant workflow environment and a sophisticated trading engine for the crypto market.
