CLAUDE.md: Improving Error Recovery Documentation

by ADMIN

Hey guys! Let's dive into how we can enhance the CLAUDE.md documentation by adding clear error recovery procedures. This is super important because, while the existing documentation does a solid job explaining the architecture, it needs more guidance on handling failures. Trust me, making this better will save everyone a lot of headaches down the road!

The Importance of Error Recovery Procedures

Error recovery procedures are crucial for any robust system. When things go wrong—and let's face it, they often do—having a well-defined plan can be the difference between a minor hiccup and a major meltdown. For CLAUDE.md, which has extensive architecture documentation, the lack of clear error recovery steps creates a significant gap. Imagine a new developer trying to figure out what to do when a terminal fails or a daemon crashes. Without proper guidance, they’re essentially flying blind. This is why we need to prioritize adding these procedures.

Firstly, well-documented error recovery procedures significantly speed up the onboarding process for new developers. Instead of spending hours troubleshooting common issues, they can quickly refer to the documentation and follow the outlined steps. This not only makes their lives easier but also boosts their confidence in the system. Knowing how to handle errors effectively can transform a newbie into a proficient contributor much faster. Imagine the frustration of encountering a terminal failure without any clue how to resolve it. Clear documentation acts as a safety net, providing a reliable reference point.

Secondly, clear error recovery steps reduce the support burden on senior developers and maintainers. When issues arise, the first instinct is often to ask for help. However, if the documentation provides comprehensive guidance, many of these questions can be answered independently. This frees up experienced team members to focus on more complex tasks and strategic initiatives. Think of it as a self-service help desk; the more information available upfront, the fewer support tickets need to be handled manually. This efficiency boost can be a game-changer for productivity.

Lastly, detailing error recovery mechanisms leads to a better understanding of the overall system architecture. By documenting how the system handles failures, we also implicitly explain the dependencies and interactions between different components. This holistic view is invaluable for both debugging and future development. For instance, understanding the steps involved in daemon reconnection can shed light on the importance of state preservation and retry mechanisms. It’s like having a detailed map of the system’s resilience, making it easier to navigate and improve.

Documentation Gap: Why It Matters

Currently, the documentation gap in CLAUDE.md regarding error recovery is marked as a medium severity issue. This isn't something we can just sweep under the rug. While the architecture patterns and state management are well-documented, the absence of recovery workflows leaves a critical piece of the puzzle missing. This omission can lead to inefficiencies, increased support requests, and a steeper learning curve for new developers. Let's break down why this gap is so significant.

The existing documentation excels at explaining the intricacies of the system's architecture. It meticulously details how different components interact and how state is managed. This is fantastic for understanding the system's design and intended behavior. However, it falls short when it comes to real-world scenarios where things don't go as planned. Knowing how a system is supposed to work is only half the battle; you also need to know how to fix it when it breaks. The lack of recovery workflows means developers are left to figure out these solutions on their own, which is time-consuming and error-prone.

The absence of documented recovery procedures can significantly impact the onboarding process. New developers often start by tackling smaller tasks and fixing bugs. If they encounter an error and can't find guidance in the documentation, they might feel overwhelmed and less confident. This can slow down their progress and create a negative first impression of the project. Clear, step-by-step recovery instructions, on the other hand, empower them to handle issues independently and contribute more effectively from the start. It's like giving them a cheat sheet for common problems.

Moreover, the support burden on experienced team members increases when error recovery isn't documented. When developers encounter issues, their first instinct is often to ask for help. If the answers aren't readily available, senior developers and maintainers end up spending time answering the same questions repeatedly. This not only diverts their attention from other important tasks but also creates a bottleneck. By documenting common recovery procedures, we can reduce the number of support requests and free up valuable time for the team. Think of it as creating a FAQ section that anticipates and answers common questions before they're even asked.

Recommended Additions: Error Recovery Workflows

So, how do we bridge this gap? I propose we add a new section to CLAUDE.md specifically dedicated to Error Recovery Workflows. This section will outline step-by-step procedures for handling various failure scenarios. Let's walk through some of the key workflows we should include.

Terminal Missing Session Recovery

Imagine a scenario where a terminal loses its tmux session. This can happen for various reasons, such as a system restart or a network interruption. Without a clear recovery procedure, the user might be left staring at a blank screen, unsure of what to do. Here’s how we can document the recovery process:

  1. Detection: The first step is detecting that the session is missing. This can be done using a health monitor that periodically checks the status of the tmux session. For example, the health monitor can attempt to connect to the session and, if the connection fails, it flags the session as missing.
  2. State Update: Once a missing session is detected, the system should update the terminal’s state. This might involve setting a flag like missingSession: true in the terminal’s configuration. This state update is crucial for triggering the appropriate UI changes and recovery actions.
  3. User Prompt: The user interface (UI) should then display a prompt, such as a "Restart" button, to inform the user about the issue and provide a clear action to take. The prompt should be user-friendly and clearly indicate that clicking the button will attempt to recover the session.
  4. Recovery: When the user clicks the restart button, the daemon should create a new tmux session. This involves allocating the necessary resources and setting up the environment for the terminal to reconnect. This step effectively replaces the lost session with a fresh one.
  5. State Sync: Finally, once the terminal is reconnected to the new session, the missingSession flag should be cleared. This ensures that the system returns to its normal operational state. It's crucial to verify that the terminal is functioning correctly after the reconnection.

The implementation details for this workflow can be found in src/lib/health-monitor.ts:218-296. Referencing specific code locations helps developers quickly understand how the recovery process is implemented.
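
To make this concrete, here is a minimal sketch of the detect, flag, and restart loop described above. The function names, the TerminalState shape, and the use of a tmux has-session check are illustrative assumptions for this post, not the actual health-monitor API; the real logic lives in src/lib/health-monitor.ts.

```typescript
// Hypothetical sketch of the detect -> flag -> restart flow described above.
// Names and the TerminalState shape are illustrative, not the real API.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

interface TerminalState {
  id: string;
  sessionName: string;
  missingSession: boolean;
}

// Detection: ask tmux whether the session still exists.
async function sessionExists(sessionName: string): Promise<boolean> {
  try {
    await exec("tmux", ["-L", "loom", "has-session", "-t", sessionName]);
    return true;
  } catch {
    return false; // non-zero exit status means the session is gone
  }
}

// Periodic health check: flag terminals whose sessions disappeared.
async function checkTerminals(terminals: TerminalState[]): Promise<void> {
  for (const terminal of terminals) {
    if (!(await sessionExists(terminal.sessionName))) {
      terminal.missingSession = true; // triggers the "Restart" prompt in the UI
    }
  }
}

// Recovery: recreate the session, then clear the flag (state sync).
async function restartSession(terminal: TerminalState): Promise<void> {
  await exec("tmux", ["-L", "loom", "new-session", "-d", "-s", terminal.sessionName]);
  terminal.missingSession = false;
}
```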

Daemon Reconnection

Another common scenario is the loss of connection to the daemon. The daemon is a critical component, and losing connection can disrupt the entire system. Here’s a documented recovery workflow for this situation:

  1. Detection: The loss of connection is typically detected when an attempt to write to the Inter-Process Communication (IPC) socket fails. This indicates that the daemon is no longer responsive.
  2. Retry: The system should implement an exponential backoff strategy to retry the connection. This means the system will attempt to reconnect, doubling the delay between each attempt (e.g., 2 seconds, 4 seconds, 8 seconds), up to a maximum delay (e.g., 30 seconds). Exponential backoff prevents overwhelming the daemon with connection attempts and gives it time to recover.
  3. Fallback: If the daemon remains unresponsive after several retries, the system should prompt the user to restart the application. This provides a clear fallback option when automatic reconnection fails. The prompt should explain the situation and guide the user on how to restart the app.
  4. State Preservation: It’s crucial to ensure that terminal configurations are saved to a persistent storage location, such as .loom/config.json. This allows the system to restore the user's setup even after a daemon crash or application restart. Preserving state minimizes disruption and provides a seamless recovery experience.

The implementation for daemon reconnection can be found in src/lib/daemon-client.ts. Again, providing specific code references helps developers understand the technical details.
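
Here is what that retry strategy might look like in practice. This is a hedged sketch: connectToDaemon and promptUserToRestart are hypothetical stand-ins for the real daemon-client and UI calls, and the attempt limit is an assumption.

```typescript
// Hypothetical sketch of retry-with-exponential-backoff as described above.
// connectToDaemon and promptUserToRestart are stand-ins, not the real API.
async function connectWithBackoff(
  connectToDaemon: () => Promise<void>,
  promptUserToRestart: () => void,
): Promise<boolean> {
  const maxDelayMs = 30_000; // cap each wait at 30 seconds
  let delayMs = 2_000;       // start with a 2-second delay
  const maxAttempts = 6;     // assumed limit before falling back to the user

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await connectToDaemon(); // e.g. reopen the IPC socket
      return true;             // connected; normal operation resumes
    } catch {
      if (attempt === maxAttempts) break;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs = Math.min(delayMs * 2, maxDelayMs); // 2s, 4s, 8s, ... capped at 30s
    }
  }

  // Fallback: automatic reconnection failed, so ask the user to restart the app.
  promptUserToRestart();
  return false;
}
```

Capping the delay keeps the worst-case wait bounded while still giving a slow-to-restart daemon time to come back before we bother the user.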

State Cleanup After Crash

Sometimes, the application might crash unexpectedly, leaving behind orphaned resources and potentially corrupted state. Documenting how to clean up after a crash is essential for maintaining system stability. Here’s a workflow for this:

  1. Orphaned tmux sessions: After a crash, there might be orphaned tmux sessions. These sessions can consume resources and cause conflicts. The recommended solution is to run tmux -L loom kill-server. This command kills the tmux server associated with the application, cleaning up any orphaned sessions.
  2. Stale worktrees: Git worktrees can become stale after a crash, especially if a terminal was working in a specific worktree. To clean these up, run git worktree prune. This command removes stale worktree entries whose directories no longer exist on disk.
  3. Corrupted config: In rare cases, the application’s configuration might become corrupted due to a crash. A factory reset can be performed by loading the default configuration from defaults/config.json. This ensures the application starts with a known good state.
  4. Terminal logs: Inspecting terminal logs can provide valuable insights into the cause of the crash. The logs are typically located in ~/.loom/console.log and ~/.loom/daemon.log. Analyzing these logs can help diagnose the issue and prevent future occurrences.

Additionally, we can provide guidance on using MCP (Model Context Protocol) servers to inspect the state: mcp__loom-ui__read_state_file and mcp__loom-logs__tail_daemon_log. These tools let developers dig into the system's state and logs directly, which makes troubleshooting much easier.
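
For anyone who prefers a script over a checklist, here is a rough sketch that chains the cleanup steps above. It only uses the commands and paths named in this section; the resetConfig flag and the .loom/config.json destination are assumptions for illustration, not the project's actual tooling.

```typescript
// Hypothetical post-crash cleanup script based on the steps above.
// Paths and commands come from this section; error handling is minimal.
import { execFileSync } from "node:child_process";
import { copyFileSync, existsSync } from "node:fs";
import { homedir } from "node:os";
import { join } from "node:path";

function cleanupAfterCrash(resetConfig = false): void {
  // 1. Kill the app's tmux server, which removes any orphaned sessions.
  try {
    execFileSync("tmux", ["-L", "loom", "kill-server"]);
  } catch {
    // No server running on the "loom" socket; nothing to clean up.
  }

  // 2. Prune stale worktree metadata left behind by the crash.
  execFileSync("git", ["worktree", "prune"]);

  // 3. Optional factory reset: restore the default configuration.
  //    The destination path is an assumption based on the .loom/config.json
  //    location mentioned earlier in this post.
  if (resetConfig && existsSync("defaults/config.json")) {
    copyFileSync("defaults/config.json", join(".loom", "config.json"));
  }

  // 4. Point the developer at the logs that explain what went wrong.
  console.log("Inspect logs at:");
  console.log(" ", join(homedir(), ".loom", "console.log"));
  console.log(" ", join(homedir(), ".loom", "daemon.log"));
}
```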

Worktree Orphan Cleanup

Another specific scenario we should address is when worktrees remain after terminal deletion. This can happen if the cleanup process is interrupted. Here’s how to handle it:

  1. List all worktrees: First, list all existing worktrees using git worktree list. This provides an overview of the current worktree setup.
  2. Remove specific worktree: To remove a specific orphaned worktree, use git worktree remove .loom/worktrees/issue-42 --force. Replace .loom/worktrees/issue-42 with the path of the orphaned worktree. The --force flag ensures the worktree is removed even if it contains uncommitted changes.
  3. Prune all orphaned worktrees: To clean up leftover entries in one go, use git worktree prune. This command removes stale worktree entries whose directories have already been deleted; worktrees that still exist on disk must be removed explicitly, as in step 2.

To prevent this issue, the daemon should include auto-cleanup logic. The implementation for this can be found in loom-daemon/src/terminal.rs:87-102. This proactive approach minimizes the chances of orphaned worktrees accumulating.
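
As a companion to the manual steps, here is an illustrative script that lists, removes, and prunes worktrees. This is not the daemon's actual auto-cleanup (that's the Rust code referenced above); in particular, treating every registered worktree under .loom/worktrees/ as orphaned is a simplifying assumption for the sketch.

```typescript
// Illustrative cleanup of leftover worktrees, mirroring the manual steps above.
// The .loom/worktrees/ filter is a simplifying assumption; a real implementation
// would check whether a terminal still owns each worktree.
import { execFileSync } from "node:child_process";

// Step 1: list all worktrees (porcelain output is stable and easy to parse).
function listWorktrees(): string[] {
  const output = execFileSync("git", ["worktree", "list", "--porcelain"], {
    encoding: "utf8",
  });
  return output
    .split("\n")
    .filter((line) => line.startsWith("worktree "))
    .map((line) => line.slice("worktree ".length));
}

// Step 2: remove a specific worktree, even if it has uncommitted changes.
function removeWorktree(path: string): void {
  execFileSync("git", ["worktree", "remove", path, "--force"]);
}

// Step 3: prune metadata for worktrees whose directories no longer exist.
function pruneWorktrees(): void {
  execFileSync("git", ["worktree", "prune"]);
}

// Example: remove every registered worktree under .loom/worktrees/, then prune.
for (const path of listWorktrees()) {
  if (path.includes("/.loom/worktrees/")) {
    removeWorktree(path);
  }
}
pruneWorktrees();
```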

Benefits of Documenting Error Recovery

Adding these error recovery workflows to CLAUDE.md isn't just a nice-to-have; it's a game-changer. Here’s a rundown of the benefits:

  • Faster onboarding for new developers: Clear documentation means new team members can quickly get up to speed and handle common issues independently.
  • Clear troubleshooting procedures: Developers have a reliable reference for resolving errors, reducing guesswork and wasted time.
  • Reduced support burden: Fewer support requests mean senior developers can focus on more strategic tasks.
  • Better understanding of recovery mechanisms: Documenting recovery workflows enhances the overall understanding of the system's architecture and resilience.

Related Issues and Next Steps

This effort complements the planned troubleshooting section in README.md, which is being tracked in a separate issue. By addressing both areas, we’ll have comprehensive documentation for handling errors and troubleshooting issues.

So, guys, let’s make this happen! Adding these error recovery procedures to CLAUDE.md will significantly improve the developer experience and make our system more robust. Let's aim for clarity, completeness, and, most importantly, a system that’s easy to recover when things go sideways. What do you think? Let's get this done!