How To Monitor Kernel Panic

Written by Nick Otter.

Introduction
- Requirements
kdump
crash

Introduction

Okay, what is Kernel panic?

In basic terms a kernel panic is a situation when kernel can’t load properly and fails to boot properly or crashes. When the kernel detects an error from which it cannot recover itself. This happens rarely but it is majorly caused due to hosed updates or failing hardware or missing drive or partitions results in panic or voluntary halt to system activity.

Nicely put, Codementor.io. Let’s work out a way to save the state of the kernel when it crashes so we can debug it. We will be taking a look at kdump and crash. As a nice intro, here’s a diagram of what happens during a kernel panic without, and then with, kdump configured.

Here’s the test case we will be using to trigger a kernel panic.

$ echo 1 > /proc/sys/kernel/sysrq
$ echo c > /proc/sysrq-trigger
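
Triggering the test panic only helps if a crash kernel is already loaded, otherwise nothing gets saved. Here is a minimal sketch of a guard to run first, assuming the kernel exposes `/sys/kernel/kexec_crash_loaded` (it reads `1` once a crash kernel is loaded). `crash_kernel_loaded` is a hypothetical helper name, and the path is a parameter only so the check can be tried against a test file.

```shell
# Return success only if a crash (kdump) kernel is loaded.
# crash_kernel_loaded is a hypothetical helper for this sketch.
# $1 is optional and defaults to the sysfs flag, which reads "1" when loaded.
crash_kernel_loaded() {
    flag_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

if crash_kernel_loaded; then
    echo "kdump is armed: a panic should produce a vmcore"
else
    echo "kdump is not armed: do not trigger the test panic yet"
fi
```

The sketch only reports the state. It never writes to `/proc/sysrq-trigger` itself, so it is safe to run on any machine.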

Requirements

Updated	`04/2020`
Linux	`Kernel 5.4` `RHEL 8 4.18`

kdump

exports memory image file of Kernel in the event of a Kernel crash to analyze.

Kdump ships with RHEL 8 and has some great features: it can send the dump file (the memory image of the kernel) to a target machine over SSH, and it will show us the trace from before and after the kernel panic.

What follows is a pretty minimal walkthrough, but I hope it’s helpful.

Here’s a kdump cheatsheet to get familiar.

`kdump`	Daemon.
`/etc/kdump.conf`	Conf file.
`makedumpfile`	Customise dump file in conf file.
`/var/crash`	Default dump file path in conf file.
`IP-YYYY-MM-DD-HH:MM:SS`	Default dump file directory name created.
`vmcore`	Compressed dump file name.
`vmcore-dmesg.txt`	Human-readable kernel log extracted from the dump file.
`dmesg`	Kdump calls `dmesg`.
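
The default path and directory naming in the cheatsheet make it easy to locate the latest dump from a script. This is a rough sketch, assuming dumps land in timestamped directories under `/var/crash` as listed above. `latest_dump_dir` is a hypothetical helper name, and sorting by modification time is my own choice rather than anything kdump prescribes.

```shell
# Print the most recently modified dump directory under a crash path.
# latest_dump_dir is a hypothetical helper; $1 defaults to /var/crash.
latest_dump_dir() {
    base=${1:-/var/crash}
    # -d lists the directories themselves, -t sorts newest first.
    ls -1dt "$base"/*/ 2>/dev/null | head -n 1
}

dir=$(latest_dump_dir)
if [ -n "$dir" ]; then
    # Expect to see vmcore and vmcore-dmesg.txt in here.
    ls -l "$dir"
else
    echo "no dump directories found under /var/crash"
fi
```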

kdump install instructions

kdump is already installed in RHEL 8 and its installation steps are well documented.

kdump output explained

In this example I have configured kdump on a RHEL 8 client and triggered a kernel panic. kdump writes two files: vmcore, the exported memory image of the kernel, and vmcore-dmesg.txt, the kernel log extracted from that image. The log is plain text, so that’s what we can look at first.

Here I’m using the famous oops as a reference point to isolate the panic for the sake of a first look.

[root@rhel-8-1 127.0.0.1-2020-04-28-01:31:54]# grep -C 4 Oops vmcore-dmesg.txt
[   71.563871] ISO 9660 Extensions: RRIP_1991A
[  566.651975] sysrq: SysRq : Trigger a crash
[  566.651985] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[  566.651987] PGD 0 P4D 0 
[  566.651990] Oops: 0002 [#1] SMP PTI
[  566.651993] CPU: 0 PID: 6631 Comm: bash Kdump: loaded Tainted: G           OE    ---------r-  - 4.18.0-147.3.1.el8_1.x86_64 #1
[  566.651994] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20
[  566.652011] Code: 34 d3 c5 ff 48 89 df e8 7c fb ff ff e9 9c fe ff ff 90 90 90 90 90 90 90 0f 1f 44 00 00 c7 05 4d 0a d3 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 bf 01 00

Let’s take a look at the RIP ([ 566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20) line of that trace just because. What does it all mean? Here you go.

`[ 566.651999]`	`RIP:`	`0010:`	`sysrq_handle_crash+0x12/0x20`
Timestamp of the event in seconds (0 is the time of kernel boot).	Instruction pointer.	Code segment register: the task leading to the crash was running in kernel mode (0010), not user mode (0033).	Function executed, the byte offset of the faulting instruction inside it (0x12) and the total size of the function (0x20). The offset is not a source line number.
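
To make the breakdown of the RIP line concrete, here is a small sketch that splits such a line into its parts using plain shell parameter expansion. `parse_rip` is a hypothetical helper name, and the line shape it expects is just the one shown in this example output.

```shell
# Split a RIP line from vmcore-dmesg.txt into its parts.
# parse_rip is a hypothetical helper name used only for this sketch.
# Expected shape: "[ <timestamp>] RIP: <cs>:<function>+0x<offset>/0x<size>"
parse_rip() {
    line=$1
    rest=${line#*RIP: }      # 0010:sysrq_handle_crash+0x12/0x20
    cs=${rest%%:*}           # 0010
    sym=${rest#*:}           # sysrq_handle_crash+0x12/0x20
    func=${sym%%+*}          # sysrq_handle_crash
    offsets=${sym#*+}        # 0x12/0x20
    offset=${offsets%%/*}    # 0x12
    size=${offsets#*/}       # 0x20
    case $cs in
        0010) mode=kernel ;;
        0033) mode=user ;;
        *)    mode=unknown ;;
    esac
    # %d converts the hex values to decimal.
    printf 'function=%s offset=%d size=%d mode=%s\n' "$func" "$offset" "$size" "$mode"
}

parse_rip '[  566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20'
# function=sysrq_handle_crash offset=18 size=32 mode=kernel
```

The decimal offset of 18 is the same value crash prints later as `sysrq_handle_crash+18`.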

crash

analyze live system or a core dump file.

Now that we’ve created a dump file of the kernel with kdump, we can dig a bit deeper into that file with crash. This is an easier way to analyse and debug than just looking at vmcore-dmesg.txt.

Some crash info:

Requires dumpfile created by kdump.
Requires Kernel DEBUG_INFO customization or kernel-debuginfo package.
Can also analyze netdump, diskdump, LKCD, xendump or kvmdump Kernel dump files.

Below is a brief runthrough of the crash install after a cheat sheet.

crash cheatsheet

`crash`	Yum package.
`kernel-debuginfo-<kernel>`	Yum package. Provides a `vmlinux` file with debug data.
`vmcore`	Memory image Kernel crash dump file that will be analyzed.
`extract-vmlinux`	Kernel script to extract `vmlinuz` file to a Kernel object file.
`/usr/lib/modules/$(uname -r)/vmlinuz`	Kernel boot bzImage file path. Must include debug data. Debug data is managed in Kernel settings.
`/usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux`	Kernel object file path with debug data. Installed by yum package `kernel-debuginfo`.

crash install instructions

Get debug info.

$ subscription-manager repos --enable=rhel-8-for-x86_64-baseos-debug-rpms --enable=rhel-8-for-x86_64-appstream-debug-rpms
$ yum install kernel-debuginfo-$(uname -r)

This next step is optional, and only applies if the Kernel was compiled with DEBUG_INFO:

Extract kernel image.

$ cd /usr/lib/modules/$(uname -r)
$ /usr/src/kernels/$(uname -r)/scripts/extract-vmlinux vmlinuz > vmlinux

Copy vmlinux and vmcore file to a directory.

$ cp /usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux /tmp/vmlinux
$ cp /var/crash/127.0.0.1-2020-04-28-01\:31\:54/vmcore /tmp/vmcore
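
If you do this copy step often, it can be wrapped so that it fails early when either file is missing. This is only a sketch. `stage_for_crash` is a hypothetical helper name, and the target names `vmlinux` and `vmcore` simply mirror the ones used above.

```shell
# Copy a vmlinux and a vmcore into one working directory for crash.
# stage_for_crash is a hypothetical helper:
#   $1 = vmlinux with debug data, $2 = vmcore, $3 = target directory.
stage_for_crash() {
    for f in "$1" "$2"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
    done
    mkdir -p "$3" &&
        cp "$1" "$3/vmlinux" &&
        cp "$2" "$3/vmcore"
}

# Example call, using the paths from this article (commented out so the
# sketch does nothing on a machine without a dump):
# stage_for_crash "/usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux" \
#     "/var/crash/127.0.0.1-2020-04-28-01:31:54/vmcore" /tmp
```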

How to start a crash session

Now crash is installed, let’s start a session with a kernel dump file generated by kdump called vmcore.

$ crash vmlinux vmcore

output

[root@rhel-8-1 tmp]# crash vmlinux vmcore

crash 7.2.7-3.el8
Copyright (C) 2002-2020  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.
 
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [392MB]: patching 93367 gdb minimal_symbol values

      KERNEL: vmlinux                                                  
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 1
        DATE: Wed Apr 29 07:01:07 2020
      UPTIME: 00:09:26
LOAD AVERAGE: 0.21, 0.35, 0.25
       TASKS: 596
    NODENAME: rhel-8-1
     RELEASE: 4.18.0-147.3.1.el8_1.x86_64
     VERSION: #1 SMP Wed Nov 27 01:11:44 UTC 2019
     MACHINE: x86_64  (2807 Mhz)
      MEMORY: 10.1 GB
       PANIC: "sysrq: SysRq : Trigger a crash"
         PID: 6631
     COMMAND: "bash"
        TASK: ffff8da79cad2f80  [THREAD_INFO: ffff8da79cad2f80]
         CPU: 0
       STATE: TASK_RUNNING (SYSRQ)

crash> 
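
The session above is interactive, but crash can also read its commands from a file, which is handy for saving a report. Here is a minimal sketch, assuming the `-s` (silent start-up) and `-i` (input file) options described in the crash man page. The temporary file, the report name and the choice of commands are my own assumptions.

```shell
# Write a crash command file; crash runs the commands in order.
# "quit" as the last command ends the session so the shell gets control back.
cmds=$(mktemp)
cat > "$cmds" <<'EOF'
bt
log
quit
EOF

# -s skips the start-up banner, -i reads commands from the file first.
# Run from the directory holding vmlinux and vmcore. Commented out so the
# sketch is harmless on a machine without crash or a dump:
# crash -s -i "$cmds" vmlinux vmcore > crash-report.txt
```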

Use crash session command backtrace

see stack trace leading up to kernel panic and trace of panic itself.

Now in the crash session, run the backtrace command by typing bt and pressing enter. You will now see the stack trace that led up to the kernel panic and the trace of the panic itself.

crash> bt
PID: 6631   TASK: ffff8da79cad2f80  CPU: 0   COMMAND: "bash"
 #0 [ffffa49002107bf0] machine_kexec at ffffffff99857e9e
 #1 [ffffa49002107c48] __crash_kexec at ffffffff99955b4d
 #2 [ffffa49002107d10] crash_kexec at ffffffff99956a2d
 #3 [ffffa49002107d28] oops_end at ffffffff99820e8d
 #4 [ffffa49002107d48] no_context at ffffffff998677ae
 #5 [ffffa49002107da0] do_page_fault at ffffffff998682e2
 #6 [ffffa49002107dd0] page_fault at ffffffff9a20114e
    [exception RIP: sysrq_handle_crash+18]
    RIP: ffffffff99d12ec2  RSP: ffffa49002107e80  RFLAGS: 00010246
    RAX: ffffffff99d12eb0  RBX: 0000000000000063  RCX: 0000000000000006
    RDX: 0000000000000000  RSI: 0000000000000086  RDI: 0000000000000063
    RBP: 0000000000000007   R8: 0000000000000229   R9: 0000000000000007
    R10: 0000000000000000  R11: ffffffff9b239b2d  R12: 0000000000000000
    R13: ffffffff9ab37040  R14: 0000557d8bcd5cf0  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffa49002107e80] __handle_sysrq.cold.9 at ffffffff99d13a75
 #8 [ffffa49002107ea8] write_sysrq_trigger at ffffffff99d1393b
 #9 [ffffa49002107eb8] proc_reg_write at ffffffff99b3413c
#10 [ffffa49002107ed0] vfs_write at ffffffff99ab9ae5
#11 [ffffa49002107f00] ksys_write at ffffffff99ab9d5f
#12 [ffffa49002107f38] do_syscall_64 at ffffffff998041cb
#13 [ffffa49002107f50] entry_SYSCALL_64_after_hwframe at ffffffff9a2000ad
    RIP: 00007faad0d84e18  RSP: 00007ffdab172798  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000002  RCX: 00007faad0d84e18
    RDX: 0000000000000002  RSI: 0000557d8bcd5cf0  RDI: 0000000000000001
    RBP: 0000557d8bcd5cf0   R8: 000000000000000a   R9: 00007faad0e16300
    R10: 000000000000000a  R11: 0000000000000246  R12: 00007faad1056780
    R13: 0000000000000002  R14: 00007faad1051740  R15: 0000000000000002
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Backtrace output explained

Let’s break down that backtrace output for debugging. Analysing the kdump generated file, we can now see more granular information around the panic: call trace numbers, the stack address of each frame and the memory address of each function. This is more like it.

We could see the function that caused the kernel panic without the use of crash, but this will be more helpful in the long run.

#0 [ffffa49002107bf0] machine_kexec at ffffffff99857e9e

`#0`	`[ffffa49002107bf0]`	`machine_kexec`	`at ffffffff99857e9e`
Call trace number.	Stack address of the frame.	Function executed.	Memory address of the instruction inside the function.
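
As a rough illustration of that frame layout, the sketch below pulls the frame number, bracketed address and function name out of saved `bt` output with awk. `bt_frames` is a hypothetical helper name, and the sample lines are copied from the backtrace above.

```shell
# Print "frame address function" for each numbered frame in bt output.
# bt_frames is a hypothetical helper; it reads bt output on stdin.
bt_frames() {
    awk '$1 ~ /^#[0-9]+$/ {
        # Drop the surrounding [ ] from the second field.
        addr = substr($2, 2, length($2) - 2)
        print $1, addr, $3
    }'
}

bt_frames <<'EOF'
 #0 [ffffa49002107bf0] machine_kexec at ffffffff99857e9e
 #1 [ffffa49002107c48] __crash_kexec at ffffffff99955b4d
    [exception RIP: sysrq_handle_crash+18]
#10 [ffffa49002107ed0] vfs_write at ffffffff99ab9ae5
EOF
# #0 ffffa49002107bf0 machine_kexec
# #1 ffffa49002107c48 __crash_kexec
# #10 ffffa49002107ed0 vfs_write
```

Lines that do not start with a frame number, such as the exception and register lines, are skipped.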

[exception RIP: sysrq_handle_crash+18]

`exception RIP:`	`sysrq_handle_crash`	`+18`
Oops at instruction pointer.	Function executed.	Byte offset, in decimal, inside the function where the exception occurred (18 is the 0x12 seen in vmcore-dmesg.txt). It is not a source line number.

CS: 0010

`CS:`	`0010`
Code segment register.	Task leading to crash was running in Kernel mode (0010) not User mode (0033).

Thanks. Some of the articles I used for this included understanding a kernel oops, determining cause of linux kernel panic, analyze crash and this crash whitepaper. This was written by Nick Otter.