Hello everyone, 

ByConity high-level design:


ByConity is an open-source data warehouse maintained by ByteDance, the parent company of TikTok. The initiative started when teams at ByteDance used ClickHouse for internal analytics but ran into scalability issues (such as tightly coupled storage and compute) and the complexity of manual sharding as data grew. They decided to modify ClickHouse into a version with separated storage and compute, similar to Snowflake's architecture.

ByConity's introduction, as stated on their GitHub page: "Our key innovations include the introduction of a compute-storage separation architecture, a state-of-the-art query optimizer, multiple stateless workers, and a shared-storage framework. These enhancements, inspired by both ClickHouse's strength and Snowflake's innovative approach, offer substantial performance and scalability improvements".

We have been experimenting with ByConity for the last couple of months and we are surprised by its performance and flexibility. The Helm chart makes it easier to deploy and configure for scale. It supports S3 or HDFS as storage options.

We will not go too deep into the architecture; you can read about it here.

Performance benchmark: ByConity vs. plain vanilla ClickHouse

Configurations

  1. We are running the whole ByConity cluster on Azure `Standard_D16as_v5` (16 cores, 64 GB RAM, AMD processor).
    1. We run ClickHouse on the same machine as well.
  2. We are running ByConity in a fully storage-compute separated mode on Kubernetes, storing data in Azure Blob Storage (with MinIO on top of it for S3 compatibility).
  3. We have configured 3 virtual warehouses (VWs) for reading and 3 for writing in the ByConity Helm values. Each has a 50 GB volume attached.
  4. The ClickHouse instance has a 256 GB SSD, and all of its data stays there.
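
For anyone who wants to reproduce a comparison like this, a minimal timing harness might look like the sketch below. It is not the script we used. `run_query` is a hypothetical callable that you would wire to whatever client you use for each system, and running each query three times is an assumption, not something fixed by the setup above.

```python
import time
from typing import Callable, Dict, List


def time_queries(
    run_query: Callable[[str], object],
    queries: List[str],
    tries: int = 3,
) -> Dict[str, List[float]]:
    """Run each query `tries` times and record wall-clock seconds per run."""
    timings: Dict[str, List[float]] = {}
    for sql in queries:
        runs: List[float] = []
        for _ in range(tries):
            start = time.perf_counter()
            run_query(sql)  # hypothetical: send the SQL to ClickHouse or ByConity
            runs.append(time.perf_counter() - start)
        timings[sql] = runs
    return timings


def best_of(timings: Dict[str, List[float]]) -> Dict[str, float]:
    """Keep the fastest run per query, which reduces cold-cache noise."""
    return {sql: min(runs) for sql, runs in timings.items()}
```

Running the same query list through `time_queries` once per system gives two comparable timing tables. Whether you keep the fastest run or all runs is a choice worth stating next to any result.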


Results

When we ran the benchmark, ClickHouse with SSD-only storage appeared to be about 6x faster on average than ByConity with purely S3-based storage.
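
To make an "on average" figure like this concrete, here is a small sketch of how per-query timings can be reduced to a single slowdown factor. The timings in it are made up for illustration and are NOT our measured results. The function name and the choice to report both an arithmetic and a geometric mean are assumptions as well.

```python
import math
from typing import Dict


def speedup_summary(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Summarize how many times slower `candidate` is than `baseline` per query."""
    # Per-query ratio: candidate seconds divided by baseline seconds.
    ratios = [candidate[q] / baseline[q] for q in baseline]
    arithmetic = sum(ratios) / len(ratios)
    # The geometric mean is less dominated by a few extreme queries.
    geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return {"arithmetic_mean": arithmetic, "geometric_mean": geometric}


# Hypothetical timings in seconds, for illustration only.
clickhouse = {"q1": 0.10, "q2": 0.50, "q3": 2.00}
byconity = {"q1": 0.80, "q2": 2.00, "q3": 8.00}

summary = speedup_summary(clickhouse, byconity)
print(summary)
```

The two means can differ noticeably when a handful of queries are much slower than the rest, so it helps to say which one a headline number refers to.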


Conclusion:

* Although ClickHouse performs well in a ClickBench test, the major issue is that, short of manual sharding, you can only scale it vertically, whereas ByConity, with full storage-compute separation, can be scaled out horizontally and scaled back down quickly.



Reference: the ByConity Helm chart values.yaml I modified: https://gist.github.com/shubham19may/6c531134b0993c7c5a8f4e1c19da1133


You can check out https://datazip.io for a managed, no-code data engineering stack, so that even a data analyst can create data pipelines without depending on data engineers.