OpenBMC System Resilience


The OpenBMC platform is a core component of modern server deployments. As such, it needs to ensure that the services it provides are always available to the host system and to out-of-band controllers. Currently, the design of OpenBMC's core software architecture falls well short of the ideal level of fault tolerance and recovery. The goal of this talk is to raise awareness of how OpenBMC currently handles failures of hardware and internal software components, and of what changes are needed to improve the reliability and debuggability of the platform. This talk will address methods for automated health checking and recovery of services within the system, and the utilities OpenBMC should provide to make it easier for the system operator to monitor and triage failures. We will go over, at a high level, how the system is constructed today and what mechanisms it currently provides for recovery and monitoring. We will dive into some real failure modes that have been experienced at Google, and examine how the current system either recovered from those failures or failed to provide enough information to triage the issue. The talk will also cover BMC-to-host interactions and the options a system integrator has for configuring the BMC to provide powerful watchdog coverage of the host system.