Addressing some of the biggest data problems in crypto

Nov 19, 2021

Problems With Existing Solutions

Current data infrastructure in the crypto space is inherently unreliable. Compare data collection practices in crypto with those in traditional financial markets and you realise how lacking the technology behind these services is, in both the reliability and the accountability of the data sources. The problem is compounded by the importance of data to many trading strategies, especially for institutional traders. 2021 has seen the beginning of institutional adoption of crypto and DeFi; however, this unreliability significantly hinders many institutional participants from entering the crypto markets.

Unreliable Technology

Across all crypto exchanges, API technology is limited to REST APIs or, at best, WebSockets. Anyone in the industry will know that these technologies are incredibly unreliable and will often fail when there is a high volume of traffic, which is typically when you need them the most¹. Compared with the suite of technology available in traditional finance, the crypto space is criminally inadequate. While Forex and equity exchanges provide dedicated connections to market participants, crypto traders at all levels are stuck with unreliable APIs.

Of course, traditional exchanges charge expensive fees for a licence to use these data connections. Crypto exchanges provide their API connections for free and lack any incentive to improve their data streaming technology. Their main users are retail traders, so crypto exchanges devote most of their resources to their UI instead. Crypto exchanges are further discouraged from improving their technology suite because they can take advantage of its weaknesses to abuse and rip off their users. Missing trades due to failed API calls and outages is an all too familiar story, which often results in large losses for the trader and more profits for the exchange². Crypto exchanges will use the unreliability of their current technology as a scapegoat to rob users of their money. They keep getting away with it because of the lack of regulation in the space. If regulators are not going to step up and solve the problem, then something must be done.

Unreliable Data

Even when the REST APIs and WebSockets are working, there is no guarantee that their data is accurate. Data provided directly by crypto exchanges frequently contains glitches and spoofed data. In a report released in October 2021, the International Monetary Fund stated that exchanges are incentivised to manipulate their data and that the data they do report is limited, fragmented and unreliable³. Spoofed data, where the asset price might swing 50%, will ruin many trading strategies that automate profit taking and stop losses. Again, crypto exchanges have no accountability and regulators are doing nothing about this abusive activity.

Missing Data

The data currently provided through exchange APIs is limited to order book data. However, many trading strategies rely on additional data points. For example, the funding rate, which changes every 8 hours, is an important parameter that can tell traders about the circulation of liquidity and where there is “collateral liquidation concentration”. Where traders are holding margin positions, the funding rate also tells them the cost of maintaining the position. Additionally, there is no data stream of administrative data, such as API downtime and scheduled platform maintenance. This information is important because it directly affects trade execution. Currently, it is only available over Twitter, through emails, or on the exchange website directly.

Funding rate data from Bybit (15/11/21–19/11/21)
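To make the cost of maintaining a position concrete, here is a minimal Python sketch of the usual funding arithmetic: the payment per interval is the position's notional value multiplied by the funding rate. The function name, the example figures and the assumption that the rate stays constant across intervals are all illustrative, and the sign convention (a positive rate means longs pay) is the common one rather than something specified in this article.

```python
from decimal import Decimal


def funding_cost(notional: Decimal, funding_rate: Decimal, intervals: int = 1) -> Decimal:
    """Funding paid by a long position over a number of funding intervals.

    Assumes the same rate applies to every interval, which is a
    simplification: in practice the rate is reset each interval.
    A negative result means the long position receives funding.
    """
    return notional * funding_rate * intervals


# Hypothetical example: a 50,000 USDT long position, a funding rate of
# 0.01% per interval, held for 3 intervals (24 hours at one per 8 hours).
cost = funding_cost(Decimal("50000"), Decimal("0.0001"), intervals=3)
print(f"Funding paid over 24 hours: {cost} USDT")
```

Decimal is used instead of float so that small rates do not pick up binary rounding noise.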

Inadequate Data Providers

Services have been established to outsource the provisioning of data from crypto exchanges. These paid solutions offer the data provided from exchange APIs, as well as historical data. However, they often suffer from the same problems already mentioned. API failures interrupt the data stream from paid providers and their feeds are plagued with unreliable data. Furthermore, their data feeds pass through their own infrastructure before reaching you as a customer, slowing down the feed and introducing a single point of failure.

Expensive Setup

If an institutional participant wants to improve this reliability, they need to invest vast amounts of time and money to build relationships with individual exchanges and establish better data infrastructure. For most market participants, the benefit does not justify the cost.

Revolutionising Crypto Data

GDA’s Data Lake is designed to solve these problems and push the crypto space into the next generation of crypto data that will be open-source, democratised and accessible* (because of operational limitations, access will be limited to 200,000 users). As a data provider, GDA’s services will include data collection, data normalisation, data streaming, and data warehousing.

This project will enable users to connect to an institutional-grade, reliable, fast stream of financial data across a wide range of exchanges for free. What is more exciting is that the list of exchanges will not be limited to centralised entities but will extend to decentralised protocols as well. This data lake will solve all the problems mentioned above by implementing the following design principles.

Design Principles

Content Delivery Networks: Infrastructural redundancy to ensure uptime

The reliability of GDA’s Data Lake will be guaranteed through connections to exchanges via content delivery networks (CDNs), in addition to existing REST APIs and WebSockets.

The purpose of building these CDNs is to provide several streams of data, per exchange, per product. CDNs are geographically distributed servers that are closely integrated with the exchange's execution engine, ideally five per exchange. They carry intentionally redundant copies of the data feeds. If one CDN goes down, the remaining data feeds are unaffected, ensuring an uninterrupted stream of data. This design principle will ensure the reliability of data feed connections to the exchanges.
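The failover behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each redundant feed is modelled as a hypothetical callable that returns the latest message or raises ConnectionError, and none of the names below come from GDA's actual implementation.

```python
from typing import Callable, Iterable, Optional

# A feed is modelled as a callable returning the latest message,
# or None if it has nothing to report. This is a hypothetical interface.
Feed = Callable[[], Optional[dict]]


def first_available(feeds: Iterable[Feed]) -> Optional[dict]:
    """Return the first message obtained from a list of redundant feeds.

    Feeds that raise ConnectionError are skipped, so one failed feed
    does not interrupt the stream as long as another one is up.
    """
    for feed in feeds:
        try:
            message = feed()
        except ConnectionError:
            continue  # this feed is down, try the next redundant copy
        if message is not None:
            return message
    return None  # every feed was down or empty


def down_feed() -> Optional[dict]:
    raise ConnectionError("feed unavailable")


def healthy_feed() -> Optional[dict]:
    return {"pair": "btcusdt", "last": 15839.20}


result = first_available([down_feed, healthy_feed])
print(result)
```

A production system would more likely read all feeds concurrently rather than in sequence; the sequential loop is used here only to keep the idea visible.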

Federated Consensus: Data Normalization

History has shown us that exchanges can go down, or experience extreme phantom price changes. To handle this, GDA’s Data Lake will implement data normalisation via a federated consensus algorithm. Data is aggregated from CDNs, where normalisation algorithms are able to detect and differentiate outlying data from legitimate impulse movements. When exchanges go down for prolonged periods of time, the normalisation will fill these gaps. In these cases, the normalised data may not be as good as the real data, but it is much better than no data at all. Raw data, including these edge cases, will be stored and available to users as well. This design principle will ensure the reliability of the data coming from the data feeds.
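The article does not specify the consensus algorithm, so the following Python sketch swaps in a deliberately simple stand-in: take the median across redundant feeds, reject any quote that deviates from it by more than a threshold, and report the median of what remains. The feed names, the 5% threshold and the function name are assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List, Tuple


def consensus_price(quotes: Dict[str, float], max_deviation: float = 0.05) -> Tuple[float, List[str]]:
    """Return a consensus price and the list of feeds rejected as outliers.

    A quote is rejected when it differs from the cross-feed median by more
    than max_deviation (as a fraction). A legitimate market move shifts
    every feed, and therefore the median, so it is not rejected.
    Assumes at least one quote and strictly positive prices.
    """
    mid = median(quotes.values())
    accepted = {
        source: price
        for source, price in quotes.items()
        if abs(price - mid) / mid <= max_deviation
    }
    rejected = sorted(set(quotes) - set(accepted))
    return median(accepted.values()), rejected


# Hypothetical snapshot: feed-c reports a phantom 50% drop.
quotes = {
    "feed-a": 15839.20,
    "feed-b": 15840.10,
    "feed-c": 7919.60,
    "feed-d": 15838.90,
    "feed-e": 15839.70,
}
price, rejected = consensus_price(quotes)
print(price, rejected)
```

Because the raw quotes are left untouched, the rejected values can still be stored alongside the normalised price, in line with keeping raw data available to users.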

Direct Connection From Exchange to User

GDA’s live data feeds will be a direct connection from exchange to user, meaning they will not pass through the data lake first. This design principle improves live data feed latency for low-latency activities such as high frequency trading (HFT) and ensures there isn’t a single point of failure for users. Data is normalised before it is stored and is pushed to users as a quality of service measure. This design principle will improve the quality of the data feeds by minimising their latency.

Systematic and Quantitative-Ready Approach and APIs

Each exchange is represented by a JSON object stored in our data lake, while long-term storage will be kept in Amazon Glacier. This makes accessing data stored in the data lake simple for systematic and quantitative traders. Clusters of data are available to advanced users, who can write their own scripts to push indexed data through appropriate dissemination channels. This design principle will make the GDA data feeds easy to use and quick to implement.

{
  "amount": 0.0541,
  "exchange": "Binance",
  "high": 15845.56,
  "low": 15802.01,
  "open": 15821.98,
  "last": 15839.20,
  "pair": "btcusdt",
  "route": "smart",
  "source": "ticker-info",
  "timestamp": "2020-10-12T16:20:33.505Z"
}
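As a rough illustration of how a quantitative user might consume an object like the one above, the Python sketch below parses it into a typed record. The field names are taken from the sample; the Ticker class, the parse_ticker function and the assumption that timestamps are always UTC with fractional seconds are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Ticker:
    exchange: str
    pair: str
    open: float
    high: float
    low: float
    last: float
    amount: float
    timestamp: datetime


def parse_ticker(raw: str) -> Ticker:
    """Parse one ticker JSON object into a typed, immutable record."""
    data = json.loads(raw)
    # The trailing "Z" marks UTC; attach the timezone explicitly.
    ts = datetime.strptime(data["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return Ticker(
        exchange=data["exchange"],
        pair=data["pair"],
        open=float(data["open"]),
        high=float(data["high"]),
        low=float(data["low"]),
        last=float(data["last"]),
        amount=float(data["amount"]),
        timestamp=ts.replace(tzinfo=timezone.utc),
    )


sample = """{
  "amount": 0.0541, "exchange": "Binance", "high": 15845.56,
  "low": 15802.01, "open": 15821.98, "last": 15839.20,
  "pair": "btcusdt", "route": "smart", "source": "ticker-info",
  "timestamp": "2020-10-12T16:20:33.505Z"
}"""
ticker = parse_ticker(sample)
print(ticker.exchange, ticker.pair, ticker.last, ticker.timestamp.isoformat())
```

The "route" and "source" fields are ignored here only to keep the sketch short; a fuller record would carry them too.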

Open Data Streaming Suite (ODSS)

Users will have a variety of different adapters to connect to data feeds depending on their use case. This suite will include REST API and WebSocket connectivity, and FPGA implementation is also being investigated. On-chain users can connect through a custom web 3.0 adapter. This will enable dissemination of GDA data feeds through oracle services on-chain. This design principle will improve the reliability of GDA’s data feeds, which is a foundational quality of this project.

Long Term Storage and Compression

Historical data will be stored long term in Amazon Glacier to keep the data lake running efficiently. Data will be indexed and accessible the same way as it is in the data lake via JSON object representation. Efficient data compression will keep this storage lightweight while maintaining data quality. This design principle will improve the efficiency of the data lake while maintaining access to high-quality historical data.
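The compression scheme is not named in this article. As one possible reading, the Python sketch below packs JSON records into newline-delimited JSON and compresses them with gzip from the standard library, which is lossless, so data quality is preserved on the way back out. The function names and the sample records are hypothetical, and uploading to Amazon Glacier is out of scope here.

```python
import gzip
import json
from typing import Dict, List


def compress_records(records: List[Dict]) -> bytes:
    """Serialise records as newline-delimited JSON and gzip the result."""
    lines = (json.dumps(r, separators=(",", ":"), sort_keys=True) for r in records)
    return gzip.compress("\n".join(lines).encode("utf-8"))


def decompress_records(blob: bytes) -> List[Dict]:
    """Inverse of compress_records: gunzip and parse each line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical batch of similar ticker records, as historical data tends to be.
records = [
    {"exchange": "Binance", "pair": "btcusdt", "last": 15839.20 + i, "seq": i}
    for i in range(1000)
]
blob = compress_records(records)
raw_size = len(json.dumps(records).encode("utf-8"))
print(f"raw: {raw_size} bytes, compressed: {len(blob)} bytes")
restored = decompress_records(blob)
```

Repetitive market data compresses well because field names and most values recur from one record to the next.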

Normalized Historical Data

Our advanced data normalisation will not be applied to current data alone. Historical exchange data will be purchased from existing data providers, processed using our normalisation methodologies, and then stored in our infrastructure. This will give users access to high-quality historical data as well.

Additional Data Collection

The GDA Data Lake will provide not only high-definition, granular market microstructure data but also additional data points such as funding rates, maker and taker fees, and administrative data such as maintenance information. This will give users access to a full range of relevant trading data from a single resource: GDA.

User Tiers

This project will require significant resources to establish the appropriate infrastructure, and even more to maintain it. The team is aware of this and is creating a sustainable user model that can guarantee the project is maintained. Importantly, data is expensive to provide. Nevertheless, GDA wants to keep its resources free and open source. This user model will balance the need for sustainability with keeping the resource free. GDA will offer three tiers of users that have access to different levels of data.

Firstly, there will be Lite users. These users will have access to L1 + L2 data and additional data (such as funding rates, maker and taker fees, etc.) through REST APIs and WebSockets. There will be resources provisioned for 100,000 Lite users and access will be free forever.

The next tier will be Advanced users. These users will have access to the same data as Lite users but will have the ability to customise their data feeds. Advanced users will also have access to FPGA adapters to connect to GDA’s Data Lake. The number of Advanced users will be limited to 50,000 at any time, but access will be free forever.

Finally, there will be Max users. These users will have access to all available data, including L3 data, which is important for HFT, and will have the ability to customise their data feeds. They can connect via any adapter technology offered in our adapter suite. The number of Max users will be limited to 50,000. While this tier will be free initially, it will transition to a paid tier in the future.
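The three tiers can be summarised as a small data structure. The Python sketch below only restates the figures given above; the Tier class and its field names are hypothetical, and the additional data (funding rates, fees and so on) available to every tier is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    data_levels: Tuple[str, ...]  # order book depth levels available
    customisable_feeds: bool
    max_users: int
    free_forever: bool


TIERS = (
    Tier("Lite", ("L1", "L2"), customisable_feeds=False, max_users=100_000, free_forever=True),
    Tier("Advanced", ("L1", "L2"), customisable_feeds=True, max_users=50_000, free_forever=True),
    # Max is free initially and is planned to become a paid tier.
    Tier("Max", ("L1", "L2", "L3"), customisable_feeds=True, max_users=50_000, free_forever=False),
)


def tiers_with_level(level: str) -> List[str]:
    """Names of the tiers that include a given data level."""
    return [tier.name for tier in TIERS if level in tier.data_levels]


total_capacity = sum(tier.max_users for tier in TIERS)
print(total_capacity, tiers_with_level("L3"))
```

The tier limits add up to the 200,000-user cap mentioned earlier in the article.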