Why Distributed Locking Is Hard
A distributed lock has one job: ensure only one process does a thing at a time. In a single-process system, a mutex does this trivially. In a distributed system, you have to deal with network partitions, clock skew, and process crashes.
The fundamental challenge: how do you know if a lock holder is still alive?
Redis Redlock
Redlock acquires a lock on N independent Redis nodes (typically 5). A lock is considered acquired if a majority of nodes grant it (N/2 + 1 with N/2 rounded down, so 3 of 5). The lock has a TTL, so if the holder crashes, the lock eventually expires on its own.
The problem: Martin Kleppmann's analysis showed that Redlock is unsafe under certain failure modes. If a process pauses (GC, OS scheduling) after acquiring the lock but before completing the operation, the lock can expire and another process can acquire it. Both processes now believe they hold the lock, and nothing stops the first one from finishing its write when it wakes up.
For most use cases, this is an acceptable risk. For our container provisioning system, it was not.
etcd Lease-Based Locking
etcd provides linearizable reads and writes. A lock is implemented as a key with a lease. The lease is renewed by the lock holder via a keepalive goroutine. If the holder crashes, the lease expires and the key is deleted.
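The key difference: the lease lives inside the etcd cluster and stays alive only while keepalives keep arriving, so expiry follows the holder's liveness instead of clocks compared across independent nodes. If the holder is paused, the keepalive goroutine is paused too, and the lease expires. A holder that resumes after expiry can still briefly believe it owns the lock, so for strict safety the protected resource should also check a fencing token, such as the lock key's creation revision.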
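lease, err := client.Grant(ctx, 10) // 10-second TTL
if err != nil {
	return err
}
cmp := clientv3.Compare(clientv3.CreateRevision(lockKey), "=", 0) // true only if the key does not exist
put := clientv3.OpPut(lockKey, "", clientv3.WithLease(lease.ID))
resp, err := client.Txn(ctx).If(cmp).Then(put).Commit() // a bare Put would overwrite another holder's key
if err != nil {
	return err // request failed
}
if !resp.Succeeded {
	return errors.New("lock not acquired: key already exists")
}
ch, err := client.KeepAlive(ctx, lease.ID) // renews the lease in the background; drain ch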
The Trade-offs
Redis Redlock:
- Latency: ~1ms
- Operational complexity: low (you probably already have Redis)
- Correctness: good enough for most use cases

etcd:
- Latency: ~5ms
- Operational complexity: requires at least a 3-node cluster
- Correctness: linearizable; at most one client owns the lock key at a time, even across partitions and crashes
My Decision Framework
Use Redis for locks where the cost of a split-brain is low (e.g., cache warming, background job deduplication). Use etcd for locks where split-brain causes data corruption or incorrect billing.
For our container provisioning system, a split-brain would result in two processes trying to provision the same container, which corrupts data. etcd was the right choice.