- There is a new commit queued up for the Linux kernel 6.1 merge window that aims to make it easier to find faulty CPUs in systems.
- The feature should be especially useful on systems with many CPUs, such as the servers where Linux is most commonly deployed.
- The changes are expected to be merged into Linux kernel 6.1 in early October; the final release is expected in the second half of November.
Finding faulty hardware in large server environments is not always easy. It is already possible to spot a faulty CPU or core by noticing that commonly run kernel code keeps crashing on the same core of one particular system. A new feature queued up for the Linux kernel 6.1 merge window aims to make this process easier.
Making faulty CPUs easier to find
The new commit queued up for kernel 6.1 prints the likely CPU, core, and socket when a segmentation fault occurs. Rik van Riel, the author of this change, summarizes the feature as follows:
« In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system.
However, the failure modes in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaults in programs like bash, python, or various system daemons that run fine everywhere else.
Add a printk() to show_signal_msg() to print the CPU, core, and socket at segfault time.
This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice, this has been good enough to help people identify several bad CPU cores. »
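To illustrate how such log lines might be used across a fleet, here is a minimal Python sketch that tallies segfault messages per CPU. The message suffix `likely on CPU N (core N, socket N)` and the sample log lines are assumptions made for illustration; the exact wording printed by a given kernel may differ, and the helper name `tally_segfault_cpus` is hypothetical.

```python
import re
from collections import Counter

# Assumed suffix format; adjust the pattern to match the actual kernel output.
CPU_SUFFIX = re.compile(r"likely on CPU (\d+) \(core (\d+), socket (\d+)\)")


def tally_segfault_cpus(log_lines):
    """Count segfault log lines per (cpu, core, socket) tuple."""
    counts = Counter()
    for line in log_lines:
        if "segfault" not in line:
            continue  # not a segfault report
        match = CPU_SUFFIX.search(line)
        if match:
            cpu, core, socket = (int(group) for group in match.groups())
            counts[(cpu, core, socket)] += 1
    return counts


# Hypothetical log lines, not taken from a real system.
sample_log = [
    "bash[1201]: segfault at 0 ip 0000000000401136 error 4 in bash "
    "likely on CPU 7 (core 3, socket 0)",
    "python[1342]: segfault at 8 ip 00000000004f2a10 error 6 in python "
    "likely on CPU 7 (core 3, socket 0)",
    "sshd[1500]: segfault at 10 ip 0000000000412345 error 4 in sshd "
    "likely on CPU 2 (core 2, socket 0)",
    "systemd[1]: Started Session 4 of user root.",
]

if __name__ == "__main__":
    for (cpu, core, socket), hits in tally_segfault_cpus(sample_log).most_common():
        print(f"CPU {cpu} (core {core}, socket {socket}): {hits} segfault(s)")
```

A CPU that shows up far more often than its neighbours is a candidate for closer inspection. Because the task may have been rescheduled before the message was printed, the counts are hints rather than proof.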
The Linux kernel 6.1 merge window is expected to open in early October. The full release of Linux kernel 6.1 is expected in the second half of November.