Why Proxmox is not production-ready for larger deployments (Part 1: Corosync)

The Silent Killer: Corosync Configuration Issues

Ahhhhh, YouTube and its endless homelab videos about the miracles of Proxmox clusters running on everything from huge rack-mounted servers down to 3 electric toothbrushes (well, not really).

Awesome – as long as there are 3 of them.

What is Corosync and Why Does It Matter?

Proxmox VE relies on Corosync for maintaining node communication and cluster integrity. It handles crucial tasks like:

  • Determining which nodes are active in the cluster
  • Managing quorum (the minimum number of nodes required for cluster operations)
  • Facilitating communication between cluster nodes
  • Supporting cluster resource management
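
For intuition, quorum is just a strict majority of the votes; a quick sketch (plain shell arithmetic, nothing Proxmox-specific):

```shell
# Quorum is floor(total_votes / 2) + 1; with the default of one vote
# per node, a 5-node cluster stays operational with 3 nodes alive.
nodes=5
quorum=$(( nodes / 2 + 1 ))
echo "$quorum"    # prints 3
```

This is also why even node counts buy you nothing: 6 nodes need 4 alive, which tolerates the same two failures as a 5-node cluster.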

In theory, it’s a robust foundation for high-availability clusters. In practice, especially at scale, the worst case scenario might be the end of your company (depending on who your clients are and how good your lawyer is).

1. Configuration Synchronization Issues

Proxmox stores Corosync configurations in /etc/pve/corosync.conf. When making changes to cluster membership or configuration, this file must be properly synchronized across all nodes. However, in larger deployments (typically exceeding 8-10 nodes), this synchronization process becomes increasingly unreliable.

A simple network hiccup can (we lived it!) leave the configuration file inconsistent across the cluster, with some nodes holding partial or corrupt configurations. Since there’s no robust verification mechanism, the inconsistency isn’t detected until it’s too late.
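A crude way to catch this early is to compare what each node actually has on disk. A sketch below — the node names and SSH access are assumptions about your environment; the helper only compares hashes, the gathering is up to you:

```shell
# check_consistent: succeed only if every hash passed in is identical.
check_consistent() {
    [ "$(printf '%s\n' "$@" | sort -u | wc -l)" -eq 1 ]
}

# On a real cluster you would gather the hashes over SSH, e.g.:
#   h1=$(ssh root@pve1 sha256sum /etc/pve/corosync.conf | cut -d' ' -f1)
# and then compare: check_consistent "$h1" "$h2" "$h3"
check_consistent abc abc abc && echo "configs match"      # prints "configs match"
check_consistent abc def abc || echo "MISMATCH detected"  # prints "MISMATCH detected"
```

Run it from a cron job and page somebody on mismatch — boring, but cheaper than the weekend you will otherwise spend.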

2. Scaling Limitations and Performance Degradation

Corosync was originally designed for smaller clusters, and its performance in Proxmox degrades noticeably as you scale:

# Example of problematic Corosync configuration in a large cluster
totem {
    version: 2
    secauth: on
    cluster_name: proxmox-cluster
    transport: knet
    # Default token timeout is too low for larger clusters
    token: 3000
    # Inadequate tuning for network latency in distributed environments
    token_retransmits_before_loss_const: 10
}

If you don't see these parameters in your config, check the current runtime values with:

corosync-cmapctl | grep totem

The default parameters work for 3-5 node clusters (homelab!) but become progressively problematic as you add more nodes. Token timeouts, in particular, often need adjustment.
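As an illustration only (these numbers are an assumption, not an official Proxmox recommendation), a cluster around 10 nodes usually wants a noticeably higher token timeout than the snippet above:

```
totem {
    version: 2
    secauth: on
    cluster_name: proxmox-cluster
    transport: knet
    # Illustrative: raise the token timeout for more nodes / higher latency
    token: 10000
    token_retransmits_before_loss_const: 10
}
```

And whenever you edit /etc/pve/corosync.conf, bump config_version in the totem block so nodes can tell a stale copy from the current one.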

The Cascade Failure

  1. Initial trigger: A network partition or node failure occurs, affecting Corosync communication
  2. Configuration confusion: Nodes have inconsistent views of the cluster in /etc/pve/corosync.conf
  3. Quorum battles: Sub-clusters form, each attempting to establish quorum
  4. Resource conflicts: Multiple nodes attempt to control the same resources
  5. Performance collapse: CPU and I/O spike as the cluster attempts to resolve conflicts
  6. Complete outage: Eventually, the entire cluster becomes unresponsive, requiring manual intervention

At this point the only way to save the day is to dismantle the cluster node by node with the trinity of

systemctl stop corosync
systemctl stop pve-cluster
pmxcfs -l

executed on every node in the cluster.

Finally, when you have at last restored GUI access and the ability to stop/start the VMs, you will need to carefully plan the next few weekends necessary to build a new cluster and migrate the VMs.

To be crystal clear: this is not a Proxmox issue. Proxmox is OK.
We have overwhelming respect and endless love for the folks over there.
They cannot be blamed for how underlying tools behave.

With that being said: maybe limit your clusters to 5 nodes and limit your trust a bit 🙂
