Handling ClickHouse Out of Memory (OOM) in a Kubernetes Cluster

Introduction

Out of Memory (OOM) refers to a state where a system or application exhausts its available memory resources, leading to degradation in performance, application crashes, or system failures.

This condition can occur due to memory leaks, excessive memory usage, or poorly optimized applications.

Out of Memory (OOM) in the Context of Kubernetes

Kubernetes manages containerized applications across multiple nodes in a cluster. Its memory management can lead to OOM conditions under several circumstances:

  1. Exceeded Pod Memory Limits: Each container in a Kubernetes Pod can specify memory requests and limits. If a container tries to consume more memory than its limit, the OOM killer terminates the container to free up memory (see the sketch after this list).

  2. Node Resource Pressure: When a Kubernetes node is under memory pressure, it may invoke the system-level OOM killer to terminate processes and free up system memory. Kubernetes may also evict less critical Pods to alleviate the pressure.
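
As a quick illustration, here is a minimal sketch (using the official kubernetes Python client; the namespace is just an example) that lists containers whose last restart was caused by the OOM killer, together with the memory limit they were given:

```python
# Sketch: list containers that were last terminated by the OOM killer.
# Assumes the official `kubernetes` Python client and a reachable cluster;
# the "default" namespace is only an example.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("default").items:
    for cs in pod.status.container_statuses or []:
        terminated = cs.last_state.terminated
        if terminated and terminated.reason == "OOMKilled":
            # Look up the memory limit the container was given, if any
            spec = next(c for c in pod.spec.containers if c.name == cs.name)
            limits = (spec.resources.limits or {}) if spec.resources else {}
            print(f"{pod.metadata.name}/{cs.name}: OOMKilled "
                  f"(restarts={cs.restart_count}, "
                  f"memory limit={limits.get('memory', 'not set')})")
```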

Ways OOM Incidents Can Occur in a Kubernetes Cluster

  1. Incorrect Resource Requests and Limits: Failing to set appropriate memory requests and limits for Pods can result in unpredictable memory usage and OOM conditions.

  2. Memory Leaks: Applications with memory leaks can continuously consume memory until the system runs out.

  3. High Memory Usage: Workloads that are not optimized for memory can consume far more than expected, leading to OOM.

Strategies to Prevent OOM Occurrences in Kubernetes

  1. Set Appropriate Resource Requests and Limits: Define memory requests and limits for each Pod to ensure that Kubernetes can manage resources effectively.

  2. Use Liveness and Readiness Probes: These probes help in monitoring the health of applications and can restart containers to clear memory leaks or other transient issues.

  3. Monitor Memory Usage: Use monitoring tools like Prometheus and Grafana to track memory usage and detect potential issues early.

  4. Use Quality of Service (QoS) Classes: Utilize Guaranteed, Burstable, and BestEffort QoS classes to prioritize resource allocation depending on the application's needs (a sketch of how these classes are derived follows this list).

  5. Automated Resource Management Tools: Tools like Fairwinds Insights can help enforce resource best practices across clusters.
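
As a rough illustration of how the QoS classes in point 4 are assigned, the sketch below approximates Kubernetes' classification rules from each container's requests and limits (simplified: request defaulting from limits is not modelled, and the real decision is made per Pod by the API server):

```python
# Sketch: approximate Kubernetes QoS classification for a Pod.
# Guaranteed  - every container has CPU and memory limits equal to its requests.
# BestEffort  - no container sets any requests or limits.
# Burstable   - everything in between.
def qos_class(containers: list[dict]) -> str:
    resources = ("cpu", "memory")
    all_guaranteed = True
    any_set = False

    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for r in resources:
            if limits.get(r) is None or limits.get(r) != requests.get(r):
                all_guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{"requests": {"cpu": "1", "memory": "8Gi"},
                  "limits":   {"cpu": "1", "memory": "8Gi"}}]))   # Guaranteed
print(qos_class([{"limits": {"memory": "8Gi"}}]))                 # Burstable
print(qos_class([{}]))                                            # BestEffort
```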

OOM in the Context of ClickHouse

ClickHouse is a columnar database management system optimized for OLAP queries. OOM in ClickHouse can occur due to:

  1. Large Aggregations and Joins: Memory-intensive operations such as large joins or aggregations can consume excessive memory (the sketch after this list shows how to spot such queries in system.query_log).

  2. Poor Configuration Settings: Inadequate memory settings can result in inefficient memory utilization, leading to OOM.
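
One practical way to spot such memory-hungry queries is ClickHouse's own system.query_log table, which records per-query memory usage. A minimal sketch using the clickhouse-driver Python package (host and credentials are placeholders) could look like this:

```python
# Sketch: find the most memory-hungry queries from system.query_log.
# Assumes query_log is enabled (it is by default); host/credentials are placeholders.
from clickhouse_driver import Client

client = Client(host="clickhouse.example.internal", user="default", password="")

rows = client.execute("""
    SELECT
        formatReadableSize(memory_usage) AS peak_memory,
        query_duration_ms,
        substring(query, 1, 120)         AS query_snippet
    FROM system.query_log
    WHERE type = 'QueryFinish'
      AND event_time > now() - INTERVAL 1 DAY
    ORDER BY memory_usage DESC
    LIMIT 10
""")

for peak_memory, duration_ms, query_snippet in rows:
    print(f"{peak_memory:>10}  {duration_ms:>8} ms  {query_snippet}")
```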

OOM in ClickHouse and Kubernetes

When deploying ClickHouse on Kubernetes, OOM conditions can be exacerbated by the combined resource demands of the database and everything else running on the node. Improper configuration can cause Kubernetes to OOM-kill or evict ClickHouse Pods due to excessive memory usage.

ClickHouse's Memory Tracker

To avoid OOMs, ClickHouse tries not to request too much RAM. It tracks how much memory it uses and avoids allocating more than the max_server_memory_usage setting allows.

max_server_memory_usage is zero by default. In that case, ClickHouse derives the limit from the amount of RAM available on the node and another setting, max_server_memory_usage_to_ram_ratio (default 0.9).

So with the default configuration, ClickHouse assumes it is safe to allocate up to 90% of the physical RAM, as the small sketch below illustrates. Usually, this is enough to avoid a rendezvous with the OOM killer.
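
The resulting cap can be reproduced with a few lines of arithmetic (a sketch assuming the default ratio of 0.9 and, purely for illustration, a 16 GiB node):

```python
# Sketch: how ClickHouse derives its server-wide memory cap by default.
GiB = 1024 ** 3

max_server_memory_usage = 0                 # default: "derive from RAM"
max_server_memory_usage_to_ram_ratio = 0.9  # default ratio
node_ram = 16 * GiB                         # example node size

if max_server_memory_usage == 0:
    effective_cap = node_ram * max_server_memory_usage_to_ram_ratio
else:
    effective_cap = max_server_memory_usage

print(f"Effective server memory cap: {effective_cap / GiB:.1f} GiB")  # 14.4 GiB
```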

However, we still see ClickHouse getting OOM-killed, either because its settings are not configured properly or because other applications running on the same node create a resource crunch. Let's see how to avoid this.

Avoiding OOM for ClickHouse in Kubernetes

  1. Query Optimization: Optimize your queries to avoid memory-intensive operations and ensure efficient memory usage. Techniques like partitioning and indexing can significantly reduce memory consumption. You can go through this blog on optimising ClickHouse queries (solving memory limit errors).

    While optimising queries is a good first step, analysts will occasionally want to run ad-hoc queries that need a lot of RAM and would otherwise require a machine upgrade.

    This is why we are developing Open Engine (in private beta right now; reach out to us if you want a sneak peek), an innovative approach to running resource-intensive queries without upgrading the machine.

  2. Optimize Resource Limits: Set appropriate memory requests and limits for ClickHouse Pods so that Kubernetes does not OOM-kill or evict them due to excessive memory consumption.

    Here it is important to take note of the memory actually available to Pods on a node registered with Kubernetes. At Datazip, we dynamically size Pod memory based on this allocatable capacity (see the first sketch after this list).

    E.g., a 16 GB AWS EC2 instance registered with Kubernetes may have only about 14.2 GB allocatable for Pods, and this needs to be considered when applying Pod limits and requests. The same figure is also crucial when configuring ClickHouse's own settings.

  3. Memory Configuration: Adjust ClickHouse settings such as max_memory_usage to control the amount of memory used by queries and avoid exceeding the memory limits.
    max_memory_usage is the maximum amount of RAM that a single query can use. Setting it higher than the memory actually available on the node can cause OOMs on the Kubernetes node or machine running ClickHouse (the second sketch after this list shows it applied per query).

  4. Monitoring and Alerts: Implement monitoring solutions to track memory consumption and set up alerts for abnormal memory usage patterns.

  5. Spill to Disk: Use settings like max_bytes_before_external_group_by and max_bytes_before_external_sort to allow ClickHouse to spill large memory operations to disk, thus preventing OOM (see the second sketch below).
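
Below are two short sketches for the points above. The first (using the official kubernetes Python client; the node name and the 85% headroom factor are illustrative assumptions, not recommendations) reads a node's allocatable memory, i.e. what is actually left for Pods after system and kubelet reservations, and derives a candidate memory limit for a ClickHouse Pod:

```python
# Sketch: read a node's allocatable memory and derive a Pod memory limit from it.
# The node name and the 85% headroom factor are illustrative assumptions.
from kubernetes import client, config

GiB = 1024 ** 3

def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity like '14858056Ki' or '15Gi' to bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
             "K": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)

config.load_kube_config()
node = client.CoreV1Api().read_node("my-clickhouse-node")  # hypothetical node name

capacity    = to_bytes(node.status.capacity["memory"])
allocatable = to_bytes(node.status.allocatable["memory"])
pod_limit   = int(allocatable * 0.85)  # leave headroom for other Pods/DaemonSets

print(f"Capacity:    {capacity / GiB:.1f} GiB")
print(f"Allocatable: {allocatable / GiB:.1f} GiB")  # e.g. ~14.2 GiB on a 16 GiB node
print(f"Candidate ClickHouse Pod memory limit: {pod_limit / GiB:.1f} GiB")
```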
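
The second sketch (using the clickhouse-driver package; the host, numbers, and table name are placeholders, not tuning advice) applies a per-query cap with max_memory_usage and spill-to-disk thresholds with max_bytes_before_external_group_by and max_bytes_before_external_sort, so that a heavy aggregation slows down instead of being killed:

```python
# Sketch: run a heavy aggregation with an explicit per-query memory cap and
# spill-to-disk thresholds. Numbers and the table name are placeholders.
from clickhouse_driver import Client

GiB = 1024 ** 3

client = Client(host="clickhouse.example.internal")

settings = {
    # Hard cap for this query; keep it well below the Pod's memory limit.
    "max_memory_usage": 8 * GiB,
    # Start spilling GROUP BY / ORDER BY state to disk at roughly half the cap,
    # so the query degrades gracefully instead of hitting the cap.
    "max_bytes_before_external_group_by": 4 * GiB,
    "max_bytes_before_external_sort": 4 * GiB,
}

rows = client.execute(
    """
    SELECT user_id, count() AS events, uniqExact(session_id) AS sessions
    FROM events            -- placeholder table
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 100
    """,
    settings=settings,
)
print(rows[:5])
```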

By implementing these strategies and configurations, you can effectively manage and mitigate OOM risks for Clickhouse running on a Kubernetes cluster.

Author Bio

Pavan Kalyan Chiluka is a Founding Software Engineer at Datazip. With a passion for innovation and a deep understanding of ClickHouse and containerized deployments, he is dedicated to helping organizations overcome their data challenges.

About Datazip

Datazip is an AI-powered data platform as a service that provides the entire data infrastructure, from ELT (ingestion and transformation framework) and storage/warehouse to BI, in a scalable and reliable manner, making data engineering and analytics teams 2-3x more productive.

Ready to revolutionize your data warehousing strategy? Contact us at [email protected] to learn more about how Datazip can benefit your organization.

FAQ

1. How can ClickHouse settings be optimized to prevent OOM issues in Kubernetes environments?

Optimize ClickHouse by setting max_memory_usage to stay within the Pod's memory limit (typically no more than about 70% of the allocated memory) to avoid triggering the OOM killer.

2. What are effective strategies to manage memory usage in Kubernetes to support database operations without facing OOM?

Implement precise memory requests and limits, employ Kubernetes' QoS classes, and utilize Prometheus for continuous monitoring to effectively manage ClickHouse memory demands and avert OOM scenarios.

3. Can you elaborate on the role of monitoring tools in identifying and mitigating OOM risks in Kubernetes clusters running ClickHouse?

Leverage Prometheus to monitor ClickHouse memory metrics continuously, setting critical threshold alerts to preemptively address rising memory consumption before reaching OOM conditions.

4. What are the implications of node resource pressure in Kubernetes, and how does it lead to OOM conditions?

Node resource pressure, when unchecked, forces Kubernetes to terminate or evict ClickHouse Pods to free up memory, often leading to performance degradation or unexpected downtime.