My go-to answer comes from my time in AIX kernel development at IBM. We'd get bugs for kernel crashes that appeared related to memory corruption. They frequently ended up being caused by stale DMA addresses in device drivers for (mostly) InfiniBand adapters, which would write into memory that now belonged to some userland process or kernel data structure.
How I'd debug these (it took me a while to be effective in this regard):
- Main tool was the AIX kernel debugger (like cutting bone with a butter knife :)
- Identify corrupted memory, look for clues like recognizable data structures or pointers in the raw dump that could be cross-checked against symbol maps, etc.
- Confirm the alignment of the corrupted memory. Page alignment was a tell-tale sign of errant DMA writes in our system... cache alignment is more mysterious and can be related to CPU design bugs (IBM designs their own POWER processors, and we'd test on alpha hardware frequently).
- Scour the voluminous kernel trace for the physical frame # of the corrupted memory. A typical offending sequence was:
1. Frame assigned to adapter for DMA
2. Physical memory layout change (we supported live hot-swappable memory arbitrated by the POWER hypervisor)
3. Frame allocated for use by page fault handler
4. Crash happens
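To make the alignment check and the trace search concrete, here is a rough Python sketch of the idea. It is not what we actually ran (the real work happened inside the kernel debugger), and everything in it is made up for illustration: the event names (`dma_map`, `dma_unmap`, `mem_relayout`, `frame_alloc`), the `TraceEvent` record, and the page and cache-line sizes are all assumptions.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096   # assumed page size for illustration
CACHE_LINE = 128   # assumed cache-line size for illustration


def classify_alignment(start: int, length: int) -> str:
    """Guess what kind of writer damaged the byte range [start, start + length)."""
    for name, unit in (("page", PAGE_SIZE), ("cache-line", CACHE_LINE)):
        if start % unit == 0 and length % unit == 0:
            return name
    return "unaligned"


@dataclass
class TraceEvent:
    timestamp_us: int
    kind: str    # hypothetical: "dma_map", "dma_unmap", "mem_relayout", "frame_alloc"
    frame: int   # physical frame number (-1 for global events such as a relayout)


def find_stale_dma(events, frame):
    """Look for: frame mapped for DMA -> memory layout change -> frame reallocated.

    Returns the three matching events, or None if the frame was unmapped
    in between or the sequence never occurred.
    """
    mapped = None     # the dma_map event still in effect for this frame, if any
    relayout = None   # a layout change seen while the frame was still mapped
    for ev in sorted(events, key=lambda e: e.timestamp_us):
        if ev.kind == "mem_relayout":
            # Layout changes are global; they only matter while our frame is mapped.
            if mapped is not None:
                relayout = ev
        elif ev.frame != frame:
            continue  # unrelated activity on other frames
        elif ev.kind == "dma_map":
            mapped, relayout = ev, None
        elif ev.kind == "dma_unmap":
            mapped, relayout = None, None
        elif ev.kind == "frame_alloc" and mapped is not None and relayout is not None:
            return mapped, relayout, ev
    return None
```

A page-aligned result from `classify_alignment` points toward DMA, and a non-`None` result from `find_stale_dma` gives you the three trace records to put in front of the driver team.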
Sometimes the root cause was that the device drivers were not properly serialized with the dynamic memory resource subsystem (the hot-swappable memory), and the sequence above happened very quickly (<1 ms). Sometimes the bug took a while to manifest, and the nice story told above for our page was interspersed with thousands of unrelated activities in the same region of memory.
We had to be like a prosecutor and build a strong case to implicate a bug somewhere else. Until then, our team was always on the hook to figure these out.
This class of problem was hard because the tools we had at our disposal to collect evidence were quite inadequate, and the amount of data to sift through was enormous. Also, any tool we thought might help sift through all this data needed to already be in the system and in the kernel debugger as a diagnostic command (a crashed system in the debugger cannot be modified in practice). There were hundreds of those debugger commands for all kinds of randomly recurring problems we had trouble figuring out. Over time, you'd build your own for your own set of problems in your kernel specialty :-)