L3 Atom MVP V2
In February, we launched the first version of our L3 Atom Project: an open source data lake for Web 3.0 that provides high-quality, real-time data across many fields and sources, completely free of charge.
This was a great first step in revolutionizing the nature of data in crypto, but it was just the beginning. We only collected data from 13 sources: 11 centralised exchanges, and 2 decentralised exchanges. While this may be sufficient for a trader looking to make a quick buck on a simple arbitrage strategy, it’s far from the institutional data lake we’re aiming to build.
If we want to construct a data lake that is scalable, easy to maintain, and appropriate for a larger team as we continue our aggressive expansion, we need to create a solid framework that enforces correctness, quality, and speed. As a result, we’ve begun a complete overhaul of our codebase, refactoring everything into a solution better suited to an institutional-grade data lake.
The back-end infrastructure and overall codebase of the first version of our MVP are not good enough to build on: they were designed quickly and with a small team in mind. Testing was done as exchanges were implemented, rather than beforehand. This is fine for a proof of concept, but as the scope of the project expands to more exchanges and sources of data, significant changes are needed. New developers and contributors to the project shouldn’t be forced to wade through poorly maintained and poorly structured code.
Our current cloud hosting is also not suitable for future scale: we already hit bottlenecks when hosting multiple exchanges and symbols concurrently. Moving to a more scalable platform such as AWS will greatly benefit us as our user base and data collection needs grow.
There are a few other, more specific issues that we would like to address in future iterations of the MVP:
- Data collection uses callback-based websockets, while data dissemination uses asynchronous websockets.
- The microservice architecture is hard to maintain with a team of only two relatively inexperienced developers.
- A significant amount of code is duplicated across every exchange.
- Connections to the Aiven Kafka topics aren’t always consistent; it sometimes takes multiple instantiations for a connection to be established.
- Confluent Kafka’s Python client doesn’t support cooperative multitasking; it has a threaded design.
- It is relatively difficult to deploy tasks on multiple systems, since this requires manually setting up the environment and running the bash scripts.
Goals and Changes
There are design choices made in the first iteration that we would like to abandon in future iterations. Namely, the microservice architecture meant that code was often duplicated and hard to maintain. Abandoning it makes the codebase maintainable by only a few people and lets us cut its bulk significantly.
- Exchanges will share as much code as possible — particularly for websocket connections, data processing (normalisation will be its own module), and data dissemination.
- Account for edge case data sources by allowing websocket functions to return a variable number of elements.
- Create better output logs for exchanges:
  - Log server status upon receiving SIGUSR1 / every 5 minutes.
  - Record output to stderr in a separate log.
- Dockerise into separate instances:
  - Data collection and normalisation
  - Data dissemination
- Make web connections consistent throughout the entire codebase — use asyncio instead of callbacks to handle data collection.
- Look into transitioning some asynchronous aspects of the MVP to Rust.
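To make a few of these goals concrete, here is a minimal sketch of what a shared, asyncio-based collection loop could look like. It is not our actual code: every name in it (`collect`, `Normaliser`, `log_status`, the fake feeds and message shapes) is hypothetical. It assumes each exchange supplies only a normalisation function that may return any number of records per raw message, and that server status means a simple per-exchange record count.

```python
import asyncio
import logging
import signal
from collections import Counter
from typing import AsyncIterator, Callable, Iterable

log = logging.getLogger("collector")

# Hypothetical type: a normaliser maps one raw message to zero or more
# records, which is how "a variable number of elements" is modelled here.
Normaliser = Callable[[dict], Iterable[dict]]


async def collect(name: str, feed: AsyncIterator[dict], normalise: Normaliser,
                  out: asyncio.Queue, stats: Counter) -> None:
    """Shared loop: every exchange reuses this and supplies only `normalise`."""
    async for raw in feed:
        for record in normalise(raw):
            await out.put({"exchange": name, **record})
            stats[name] += 1


def log_status(stats: Counter) -> None:
    log.info("records collected so far: %s", dict(stats))


async def log_status_periodically(stats: Counter, interval: float = 300.0) -> None:
    """Log status every `interval` seconds (300 s = 5 minutes)."""
    while True:
        await asyncio.sleep(interval)
        log_status(stats)


def log_status_on_sigusr1(stats: Counter) -> None:
    """Unix only; must be called from inside a running event loop."""
    asyncio.get_running_loop().add_signal_handler(signal.SIGUSR1, log_status, stats)


# --- Illustrative stand-ins below, not real exchange formats ---
def normalise_single(raw: dict) -> list[dict]:
    return [{"price": raw["p"], "size": raw["s"]}]


def normalise_batch(raw: dict) -> list[dict]:
    # An edge-case source that packs several trades into one message.
    return [{"price": p, "size": s} for p, s in raw["trades"]]


async def fake_feed(messages: list[dict]) -> AsyncIterator[dict]:
    # Stands in for an asynchronous websocket connection.
    for message in messages:
        yield message


async def demo() -> tuple[list[dict], Counter]:
    out: asyncio.Queue = asyncio.Queue()
    stats: Counter = Counter()
    await asyncio.gather(
        collect("a", fake_feed([{"p": 100.0, "s": 1.0}]),
                normalise_single, out, stats),
        collect("b", fake_feed([{"trades": [(101.0, 2.0), (102.0, 3.0)]}]),
                normalise_batch, out, stats),
    )
    records = [out.get_nowait() for _ in range(out.qsize())]
    return records, stats
```

Running `asyncio.run(demo())` pushes one record from the first fake exchange and two from the second through the same loop. In a real service, `fake_feed` would be replaced by an asynchronous websocket client, and `log_status_on_sigusr1` and `log_status_periodically` would be started alongside the collectors.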
We’ve made the decision to switch from using Apache Kafka for our message brokering to using AWS ElastiCache. We found that Kafka was overkill for our purposes, and that using an in-memory solution like Redis would give us the performance we want.
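As a rough illustration of how little code the Redis side might need, the sketch below publishes a normalised record to a channel. It assumes we use Redis pub/sub, which is only one of several Redis features that could fill this role, and the channel naming scheme and record fields are made up for the example. The `Publisher` protocol mirrors the shape of the `publish` coroutine on redis-py's asyncio client, so the function can be exercised here with an in-memory stand-in instead of a live server.

```python
import asyncio
import json
from typing import Protocol


class Publisher(Protocol):
    # Same shape as `publish` on redis-py's asyncio client.
    async def publish(self, channel: str, message: str) -> int: ...


def channel_for(exchange: str, symbol: str, feed: str) -> str:
    # Hypothetical naming scheme: one channel per exchange, symbol and feed.
    return f"{exchange}:{symbol}:{feed}"


async def publish_record(client: Publisher, record: dict) -> int:
    """Serialise a normalised record and publish it to its channel."""
    channel = channel_for(record["exchange"], record["symbol"], record["feed"])
    return await client.publish(channel, json.dumps(record))


class InMemoryPublisher:
    """Stand-in that records what would have been published."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    async def publish(self, channel: str, message: str) -> int:
        self.sent.append((channel, message))
        return 0  # a real client reports how many subscribers received it


if __name__ == "__main__":
    fake = InMemoryPublisher()
    example = {"exchange": "example_exchange", "symbol": "BTC-USD",
               "feed": "trades", "price": 100.0}
    asyncio.run(publish_record(fake, example))
    print(fake.sent)
```

Against a real deployment, the stand-in would be swapped for a redis-py asyncio client pointed at the ElastiCache endpoint; nothing else in `publish_record` would need to change.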
We have several goals for version 2 of the MVP:
- Codebase is refactored to be more modular.
- Application is dockerised and ready to deploy onto AWS services.
- Performance metrics will be implemented and monitored.
- Data will be collected for all streamable currency pairs on the exchanges supported in MVP V1, rather than just BTC/USD.
- Client connections will be load balanced.
- Benchmarking and correctness test cases will exist for all code in the codebase.
The current scope is 6 weeks, after which we plan to have a product usable by the public. Version 1 of the MVP is still available to connect to and pull data from, and we have plans to implement small features that will make V1 more useful. These include:
- Sending out snapshots of the limit order book when first connecting to the websocket
- Collecting and sending useful indicators provided by the exchanges (e.g., extended open interest, funding rate)
- Registering a domain for users to connect to, instead of just an IP address
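The first of these features can be sketched as follows. This is an illustration under stated assumptions, not the V1 implementation: it assumes the server keeps a book aggregated by price level, that an update with size 0 removes a level, and that a new client should receive one snapshot message before any live updates. The class, function, and message field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OrderBook:
    # price -> total size resting at that price (floats keep the sketch short)
    bids: dict[float, float] = field(default_factory=dict)
    asks: dict[float, float] = field(default_factory=dict)

    def apply(self, side: str, price: float, size: float) -> None:
        """Apply one update; a size of 0 removes the price level."""
        levels = self.bids if side == "bid" else self.asks
        if size == 0:
            levels.pop(price, None)
        else:
            levels[price] = size

    def snapshot(self, depth: Optional[int] = None) -> dict:
        """Best `depth` levels per side (all levels when depth is None)."""
        return {
            "type": "snapshot",
            "bids": sorted(self.bids.items(), reverse=True)[:depth],
            "asks": sorted(self.asks.items())[:depth],
        }


def messages_for_new_client(book: OrderBook, live_updates: list[dict]) -> list[dict]:
    """A new subscriber gets the current book first, then the live updates."""
    return [book.snapshot(), *live_updates]
```

Sending the snapshot first means a client can start from a complete book and apply later updates on top of it, rather than waiting for enough updates to arrive to reconstruct the book itself.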
As with the previous version of the MVP, we will be making all of our code and software open source and free to use.