Summer 2024 Incident: Causes, Impact, and Resolution

Introduction

On June 26, 2024, the Humanode network encountered a finalization stall on its mainnet: block production continued, but no new blocks could be finalized. Although our team restored finalization within two hours, it was essential to investigate the root cause thoroughly and implement a long-term fix to prevent a recurrence. This report traces the journey from the incident's detection to its resolution.

June 26, 2024: First finalization stall at Humanode Mainnet

Humanode Mainnet, based on Substrate v0.9.40, experienced its first finalization stall at block #8327327, which remained the latest finalized block while new blocks kept being produced. The stall disrupted consensus and prevented new transactions from reaching finality.

2024-06-26 12:35:07 ♻️  Reorg on #8327330,0xa5ee…9c61 to #8327328,0x2669…84f8, common ancestor #8327327,0x9b9c…019b    
2024-06-26 12:35:07 ✨ Imported #8327328 (0x2669…84f8)    
2024-06-26 12:35:08 Encountered error finding best chain containing 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b not in the best chain 0x2669f7c45d66835eac78f5e90c4a1a4e093c761d7f786bfb50c8920537c984f8    
2024-06-26 12:35:08 Could not cast prevote: previously known block 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b has disappeared 
2024-06-26 12:35:09 💤 Idle (1359 peers), best: #8327328 (0x2669…84f8), finalized #8327327 (0x9b9c…019b), ⬇ 3.5MiB/s ⬆ 438.3kiB/s    
2024-06-26 12:35:12 ✨ Imported #8327329 (0x6dfe…d531)    
2024-06-26 12:35:14 💤 Idle (1359 peers), best: #8327329 (0x6dfe…d531), finalized #8327327 (0x9b9c…019b), ⬇ 1.0MiB/s ⬆ 973.2kiB/s    
2024-06-26 12:35:18 ✨ Imported #8327330 (0x56aa…28a2)    
2024-06-26 12:35:19 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 158.8kiB/s ⬆ 33.4kiB/s    
2024-06-26 12:35:24 💤 Idle (1359 peers), best: #8327330 (0x56aa…28a2), finalized #8327327 (0x9b9c…019b), ⬇ 35.2kiB/s ⬆ 9.3kiB/s

TL;DR:

  • 12:35:07: A reorganization occurred, switching the best chain from #8327330 back to #8327328 (common ancestor #8327327).
  • 12:35:08: The GRANDPA voter logged errors: its prevote target 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b had disappeared from the best chain.
  • 12:35:09 onward: The node kept importing new blocks (#8327329, #8327330), but the finalized head stayed at #8327327.

The disappearance of the block 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b in the logs was initially perplexing. However, the immediate priority was to restore finalization as quickly as possible. A detailed investigation into the root cause was scheduled for after finalization was secured.

To address the immediate issue, we focused on the note_stalled extrinsic, as detailed in the Substrate code. Before submitting it, we reviewed the relevant code meticulously and re-examined the OneSessionHandler implementation to make sure note_stalled would behave as expected.

The note_stalled extrinsic was successfully submitted to the network, and finalization was restored within two hours of the incident.
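
For reference, the dispatchable itself looks roughly like this in pallet-grandpa (a close paraphrase of the Substrate version we run, not a verbatim copy; it requires the Root origin, so it has to be dispatched through sudo or governance):

// Paraphrased from pallet-grandpa; attribute macros and weights omitted.
// Flags the current authority set as stalled: a forced authority-set change
// is scheduled `delay` blocks ahead, restarting GRANDPA voting from
// `best_finalized_block_number`.
pub fn note_stalled(
    origin: OriginFor<T>,
    delay: T::BlockNumber,
    best_finalized_block_number: T::BlockNumber,
) -> DispatchResult {
    ensure_root(origin)?;
    Ok(Self::on_stalled(delay, best_finalized_block_number))
}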

June 26 - July 26, 2024: Initial investigation of the issue

Post-incident, we began a thorough investigation by reviewing block production and finalization-related pull requests and issues in the Substrate codebase. During this process, we discovered a similar issue reported by another project in this GitHub issue. We brought this to the attention of the Parity team and sought their input on the incident affecting the Humanode mainnet. However, the Parity team acknowledged the complexity of the issue without providing immediate solutions.

To further diagnose the problem, we analyzed the code associated with specific log entries:

2024-06-26 12:35:08 Encountered error finding best chain containing 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b not in the best chain 0x2669f7c45d66835eac78f5e90c4a1a4e093c761d7f786bfb50c8920537c984f8    
2024-06-26 12:35:08 Could not cast prevote: previously known block 0x5335a41fe6da0c54a090caa5b6d8f81f1356096a8677466ed46490e3da252b2b has disappeared

We traced each part of these warnings back to its source:

  • "Encountered error finding best chain containing 0xAA..A: couldn't find target block": https://github.com/humanode-network/substrate/blob/98f2e3451c9143278ec53c6718940aeabcd3b68a/client/consensus/grandpa/src/environment.rs#L1214
  • "Chain lookup failed:": https://github.com/humanode-network/substrate/blob/98f2e3451c9143278ec53c6718940aeabcd3b68a/client/consensus/common/src/longest_chain.rs#L155
  • "Requested a finality target using a base 0xAA..A not in the best chain 0xCC..C": https://github.com/humanode-network/substrate/blob/98f2e3451c9143278ec53c6718940aeabcd3b68a/client/consensus/common/src/longest_chain.rs#L116
  • "Could not cast prevote: previously known block 0xAA..A has disappeared": https://github.com/paritytech/finality-grandpa/blob/8c45a664c05657f0c71057158d3ba555ba7d20de/src/voter/voting_round.rs#L589

Our research pointed to a potential root cause: the validator holding the primary slot took unusually long to produce its block (A1). In the meantime, the chain continued with a block from a secondary-slot validator (B1), which was followed by two more blocks (B2, B3). Nodes began voting to finalize B1 and its successors. When A1 eventually arrived and its branch (A1, A2, A3, etc.) became the best chain, a reorganization occurred, and B1 was no longer part of the best chain.

Consequently, the GRANDPA consensus mechanism was attempting to finalize a block that had been pruned from the best chain. Precommits for it were received but could never be resolved, because the block itself was gone.
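
Illustratively, with P the last common ancestor (#8327327 in the logs above):

             +-- B1 -- B2 -- B3           <- GRANDPA voters target B1 for finalization
... -- P ----+
             +-- A1 -- A2 -- A3 -- ...    <- becomes the best chain after the reorg; B1 is pruned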

To validate our hypothesis, we prepared a regression test for the described behavior, as outlined in this test case. Additionally, we set up internal Alertmanager rules that monitor the distance between the latest best block and the latest finalized block, so that a similar stall is detected and acted upon quickly.
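
A minimal sketch of such a rule, assuming the standard Substrate Prometheus exporter (which reports substrate_block_height with status="best" and status="finalized" labels); the alert name and threshold here are illustrative, not our production values:

groups:
  - name: finality                      # illustrative group name
    rules:
      - alert: FinalityLagging          # hypothetical alert name
        expr: >
          (substrate_block_height{status="best"}
            - ignoring(status) substrate_block_height{status="finalized"}) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Finalized head is more than 10 blocks behind the best block"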

We then focused on reproducing the issue in a controlled environment using our Kubernetes-based tool designed for running Substrate-based networks.

July 26 - July 30, 2024: Humanode x Parity further investigation

With the network scaling to over 1,000 validators, finalization issues became more frequent. A tracking issue was created to monitor these occurrences. We collaborated with the Parity team, sharing insights specifically about the LongestChain::finality_target method: it returns an error when the target block is not in the best chain, so this error case must be handled properly by the Environment::best_chain_containing execution.
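
Condensed into a sketch, the failing lookup behaves roughly like this (stand-in types and simplified logic, not the upstream code):

// Simplified sketch; `Hash` and `Error` stand in for the real Substrate
// types. The key point: if the requested `base` (GRANDPA's current vote
// target) is no longer on the best chain, e.g. it was just reorged away,
// the call returns an error instead of a usable finality target.
type Hash = u64;

#[derive(Debug)]
enum Error {
    ChainLookup(String),
}

fn finality_target(base: Hash, best_chain: &[Hash]) -> Result<Hash, Error> {
    if !best_chain.contains(&base) {
        // The path behind "Requested a finality target using a base ... not
        // in the best chain" in the logs above.
        return Err(Error::ChainLookup(format!(
            "base {base} not in the best chain"
        )));
    }
    // Otherwise pick a target on the best chain above `base` (the real code
    // also honours a maximum block number and other constraints).
    Ok(*best_chain.last().expect("best chain is never empty"))
}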

The Parity team engaged actively with the problem, and our research contributed significantly to the fix they proposed. The root cause was a failure to handle the situation where a vote target is no longer present in the best chain, which prevented finality from proceeding whenever a round estimate was reorganized away. Ideally, GRANDPA should finalize the fork that was reorganized away, with the node then reorganizing its chain to follow the finalized fork; the full consequences of such a change had not been considered initially.

To validate this proposed fix, a regression test was created to check the block finalization logic after a chain reorganization, detailed in this pull request. The pull request containing both the fix and our test was merged into the main branch of Parity’s repository.

Despite the merge, the fix was not immediately rolled out due to concerns about its potential impact on other modules. As a precaution, the plan was to first reproduce the problem and observe network behavior with the new fix in place. Meanwhile, a temporary workaround was implemented: the block selected as the finalization voting target is kept at least seven blocks behind the current best block, as described in this pull request.
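
The idea behind that workaround, sketched with hypothetical names of our own rather than the identifiers from the actual patch:

// Sketch only: keep the finality vote target at least BEHIND_BEST blocks
// behind the best block, so a short-range reorg cannot prune the target out
// from under the voter. `BEHIND_BEST` and `vote_target` are illustrative
// names, not taken from the real patch.
const BEHIND_BEST: u32 = 7;

fn vote_target(best_number: u32, finalized_number: u32) -> u32 {
    // Never vote below the already-finalized head.
    best_number.saturating_sub(BEHIND_BEST).max(finalized_number)
}

#[test]
fn target_lags_behind_best() {
    assert_eq!(vote_target(100, 50), 93);
    assert_eq!(vote_target(100, 98), 98); // clamped to the finalized head
}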

Moving forward, we plan to enhance our chain-load-test tool to better reproduce and analyze such issues within a controlled environment.

July 30 - August 17, 2024: Controlled environment testing

We developed an initial version of our chain-load-test tool, which emulates a network using a specified Docker image and number of nodes. Previously, the Humanode team had considered using the Zombienet tool developed by the Parity team. However, we decided to implement and maintain our own tool due to the reasons outlined here and here.

We ran the network with over 1200 nodes to increase the likelihood of reproducing the issue. The nodes communicated effectively, and consensus operated as expected. However, we realized that the network lacked real production conditions, such as network delays and packet loss.

To address this, we investigated tools to introduce network chaos into the Kubernetes network layer to better emulate production conditions.

We improved our chain-load-test tool by integrating Chaos Mesh resources to simulate network issues. With chaos enabled, we reproduced the issue with just 9 nodes, as detailed in this comment; all three load tests we ran encountered the same failure. The chain-load-test tool now effectively emulates network conditions close to a production environment.
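
For illustration, the kind of Chaos Mesh resource involved looks like this (a sketch only; our actual selectors, latencies, and durations differ):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: peer-latency              # illustrative name
spec:
  action: delay                   # inject latency between node pods
  mode: all
  selector:
    labelSelectors:
      app: chain-load-test-node   # hypothetical label on the test nodes
  delay:
    latency: "200ms"
    jitter: "100ms"
    correlation: "25"
  duration: "10m"

Packet loss is injected analogously with a separate resource using action: loss.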

August 17 - August 19, 2024: Fix implementation and verification

We backported the fix into our Substrate fork and prepared a new image for load testing. In the subsequent load tests, consensus functioned as expected and the bug did not reproduce, confirming that the issue was resolved effectively (details here).

The Humanode network code was updated to incorporate the backported fix from the substrate fork (pull request), and a new release was prepared and shipped.

Summary

The finalization stalling issue has been successfully addressed. We identified the bug, reproduced it reliably in a controlled environment, developed and applied a fix, and confirmed that the bug no longer occurs under the same conditions. We also verified that the fix introduced no unintended side effects.

Moving forward, we plan to:

  • Develop more dashboards and alerts for humanode-peer, building on our initial finality graphing and incorporating useful information from libp2p metrics.
  • Make it a practice to monitor peers after each new release, visually assessing the graphs for anomalies and performance regressions.
  • Conduct a thorough comparison between the Zombienet tool and our chain-load-test tool, and consider open-sourcing our tool for broader use.