Written by Nick Otter.
Okay, what is a kernel panic?
In basic terms a kernel panic is a situation when kernel can’t load properly and fails to boot properly or crashes. When the kernel detects an error from which it cannot recover itself. This happens rarely but it is majorly caused due to hosed updates or failing hardware or missing drive or partitions results in panic or voluntary halt to system activity.
Nicely put, Codementor.io. Let’s work out a way to save the state of the kernel when it crashes so we can debug it. We will be taking a look at kdump and crash. As a nice intro, here’s a diagram of what happens during a kernel panic without, and then with, kdump configured.
Here’s the test case we will be using to trigger a kernel panic (run it as root).
$ echo 1 > /proc/sys/kernel/sysrq
$ echo c > /proc/sysrq-trigger
Updated | 04/2020 |
Linux | Kernel 5.4 RHEL 8 4.18 |
kdump: exports a memory image file of the kernel in the event of a kernel crash, so it can be analyzed afterwards.
Kdump ships with RHEL 8 and has some great features: it can use SSH to transfer the dump file (the memory image file of the kernel) to a target machine, and it shows us the trace before and after the kernel panic.
What follows is a pretty minimal walkthrough, but I hope it’s helpful.
Here’s a kdump cheatsheet to get familiar.
Item | Description |
---|---|
kdump | Daemon. |
/etc/kdump.conf | Conf file. |
makedumpfile | Customise dump file in conf file. |
/var/crash | Default dump file path in conf file. |
IP-YYYY-MM-DD-HH:MM:SS | Default dump file directory name created. |
vmcore | Compressed dump file name. |
vmcore-dmesg.txt | Pretty dump file name. |
dmesg | Kdump calls dmesg. |
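To see how the conf file and the default dump path fit together, here's a minimal sketch that reads the path directive from a kdump.conf style file and falls back to /var/crash when the directive is absent. The function name kdump_path is made up for this example; it isn't something kdump ships.

```shell
# Hypothetical helper (not part of kdump): print the dump directory set by the
# "path" directive, or the default /var/crash if no such directive is present.
kdump_path() {
    conf="${1:-/etc/kdump.conf}"
    # Keep the last uncommented "path <dir>" line; commented lines start with '#'.
    configured=$(sed -n 's/^path[[:space:]][[:space:]]*//p' "$conf" 2>/dev/null | tail -n 1)
    printf '%s\n' "${configured:-/var/crash}"
}
```

Calling kdump_path with no argument reads /etc/kdump.conf; passing another file name is handy for trying it out on a copy.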
kdump is already installed in RHEL 8 and its installation steps are well documented.
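Before triggering a panic on purpose it's worth checking that a crash kernel is actually loaded, otherwise nothing gets captured. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_crash_loaded (it reads 1 when a crash kernel is loaded). The name crash_kernel_loaded is my own, not a kdump command.

```shell
# Hypothetical helper: succeed only if a crash (capture) kernel is loaded.
# The optional argument points at a different flag file, which helps when testing.
crash_kernel_loaded() {
    flag_file="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}
```

Something like `crash_kernel_loaded && echo "crash kernel loaded"` before the sysrq trigger is one way to use it.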
In this example I have configured kdump on a RHEL 8 client and triggered a kernel panic. vmcore-dmesg.txt is one of the files generated by kdump. It holds the kernel log taken from the exported memory image file of the kernel (vmcore), and it's the first thing we can look at.
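If a machine has panicked more than once there will be several dump directories. Here's a rough sketch for picking the newest one, assuming the default IP-YYYY-MM-DD-HH:MM:SS directory naming from the cheatsheet. latest_dump_dir is a hypothetical helper, not something kdump provides.

```shell
# Hypothetical helper: print the name of the newest dump directory under a crash root.
# Assumes directory names end in YYYY-MM-DD-HH:MM:SS.
latest_dump_dir() {
    crash_root="${1:-/var/crash}"
    for dir in "$crash_root"/*/; do
        # Skip anything that does not actually hold a vmcore.
        [ -f "${dir}vmcore" ] || continue
        name=$(basename "$dir")
        # Pull out the trailing timestamp so the sort ignores the IP prefix.
        stamp=$(printf '%s\n' "$name" | sed -n 's/^.*\([0-9]\{4\}-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\)$/\1/p')
        [ -n "$stamp" ] && printf '%s %s\n' "$stamp" "$name"
    done | sort | tail -n 1 | cut -d ' ' -f 2-
}
```

The timestamp format sorts chronologically as plain text, which is why a simple sort is enough here.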
Here I’m using the famous oops as a reference point to isolate the panic for the sake of a first look.
[root@rhel-8-1 127.0.0.1-2020-04-28-01:31:54]# grep -C 4 Oops vmcore-dmesg.txt
[ 71.563871] ISO 9660 Extensions: RRIP_1991A
[ 566.651975] sysrq: SysRq : Trigger a crash
[ 566.651985] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 566.651987] PGD 0 P4D 0
[ 566.651990] Oops: 0002 [#1] SMP PTI
[ 566.651993] CPU: 0 PID: 6631 Comm: bash Kdump: loaded Tainted: G OE ---------r- - 4.18.0-147.3.1.el8_1.x86_64 #1
[ 566.651994] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20
[ 566.652011] Code: 34 d3 c5 ff 48 89 df e8 7c fb ff ff e9 9c fe ff ff 90 90 90 90 90 90 90 0f 1f 44 00 00 c7 05 4d 0a d3 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 bf 01 00
Let’s take a look at the RIP line of that trace ([ 566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20), just because. What does it all mean? Here you go.
[ 566.651999] | RIP: | 0010: | sysrq_handle_crash+0x12/0x20 |
---|---|---|---|
Timestamp of the event, in seconds since kernel boot. | Instruction pointer. | Code segment register: the task leading to the crash was running in kernel mode (0010), not user mode (0033). | Function executed, followed by the byte offset of the crash inside that function and the function's total length in bytes (both in hex). |
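The same breakdown can be scripted. This is a small sketch that splits a RIP line from vmcore-dmesg.txt into its parts and converts the two hex values after the function name (byte offset and function length) to decimal. parse_rip is a name I made up, and the pattern only covers lines shaped like the one above.

```shell
# Hypothetical helper: split an Oops "RIP:" line into its parts.
# Prints: <code segment> <function> <offset in bytes> <function length in bytes>
parse_rip() {
    fields=$(printf '%s\n' "$1" | sed -n 's|^.*RIP: \([0-9a-f]\{4\}\):\([A-Za-z0-9_.]*\)+\(0x[0-9a-f]*\)/\(0x[0-9a-f]*\).*$|\1 \2 \3 \4|p')
    [ -n "$fields" ] || return 1
    # Word-split the four fields into $1..$4 (none of them contains a space).
    set -- $fields
    # The offsets are hexadecimal; arithmetic expansion turns them into decimal.
    printf '%s %s %d %d\n' "$1" "$2" "$(($3))" "$(($4))"
}

parse_rip '[ 566.651999] RIP: 0010:sysrq_handle_crash+0x12/0x20'
# prints: 0010 sysrq_handle_crash 18 32
```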
crash: analyze a live system or a core dump file.
Now that we’ve created a dump file of the kernel with kdump, we can dig a bit deeper into that file with crash. This is an easier way to analyse and debug than just looking at vmcore-dmesg.txt.
Some crash info:
- Analyzes dump files generated by kdump.
- Needs kernel debug data: a kernel built with the DEBUG_INFO customization, or the kernel-debuginfo package.
- Also reads netdump, diskdump, LKCD, xendump or kvmdump kernel dump files.
Below is a brief runthrough of the crash install, after a cheat sheet.
Item | Description |
---|---|
crash | Yum package. |
kernel-debuginfo-<kernel> | Yum package. Provides a vmlinux file with debug data. |
vmcore | Memory image (kernel crash dump) file that will be analyzed. |
extract-vmlinux | Kernel script to extract a kernel object file from a vmlinuz file. |
/usr/lib/modules/$(uname -r)/vmlinuz | Kernel boot bzImage file path. Must include debug data. Debug data is managed in kernel settings. |
/usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux | Kernel object file path with debug data. Installed by the yum package kernel-debuginfo. |
Get debug info.
$ subscription-manager repos --enable=rhel-8-for-x86_64-baseos-debug-rpms --enable=rhel-8-for-x86_64-appstream-debug-rpms
$ yum install kernel-debuginfo-$(uname -r)
This next step is optional (only if the kernel was compiled with DEBUG_INFO). Extract the kernel image.
$ cd /usr/lib/modules/$(uname -r)
$ /usr/src/kernels/$(uname -r)/scripts/extract-vmlinux vmlinuz > vmlinux
Copy the vmlinux and vmcore files to a directory.
$ cp /usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux /tmp/vmlinux
$ cp /var/crash/127.0.0.1-2020-04-28-01\:31\:54/vmcore /tmp/vmcore
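If you do this often, the copy step can be wrapped so it fails loudly when one of the two files is missing (for example, when kernel-debuginfo was never installed). A minimal sketch; stage_crash_files is a hypothetical name and the three arguments are whatever paths you use.

```shell
# Hypothetical helper: copy a vmlinux and a vmcore into one working directory,
# refusing to continue if either input file is missing.
stage_crash_files() {
    vmlinux_src=$1
    vmcore_src=$2
    workdir=$3
    for f in "$vmlinux_src" "$vmcore_src"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
    # Stop at the first step that fails so a half-copied directory is noticed.
    mkdir -p "$workdir" &&
        cp "$vmlinux_src" "$workdir/vmlinux" &&
        cp "$vmcore_src" "$workdir/vmcore"
}
```

With the paths from this example that would be `stage_crash_files "/usr/lib/debug/usr/lib/modules/$(uname -r)/vmlinux" "/var/crash/127.0.0.1-2020-04-28-01:31:54/vmcore" /tmp`.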
Now crash is installed, let’s start a session with a kernel dump file generated by kdump called vmcore.
$ crash vmlinux vmcore
Output:
[root@rhel-8-1 tmp]# crash vmlinux vmcore
crash 7.2.7-3.el8
Copyright (C) 2002-2020 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...
WARNING: kernel relocated [392MB]: patching 93367 gdb minimal_symbol values
KERNEL: vmlinux
DUMPFILE: vmcore [PARTIAL DUMP]
CPUS: 1
DATE: Wed Apr 29 07:01:07 2020
UPTIME: 00:09:26
LOAD AVERAGE: 0.21, 0.35, 0.25
TASKS: 596
NODENAME: rhel-8-1
RELEASE: 4.18.0-147.3.1.el8_1.x86_64
VERSION: #1 SMP Wed Nov 27 01:11:44 UTC 2019
MACHINE: x86_64 (2807 Mhz)
MEMORY: 10.1 GB
PANIC: "sysrq: SysRq : Trigger a crash"
PID: 6631
COMMAND: "bash"
TASK: ffff8da79cad2f80 [THREAD_INFO: ffff8da79cad2f80]
CPU: 0
STATE: TASK_RUNNING (SYSRQ)
crash>
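If you save that banner to a text file, single fields are easy to pull out, for instance to confirm that RELEASE matches the kernel-debuginfo version you installed. A small sketch, assuming the banner layout shown above; crash_field is a hypothetical helper and not a crash command.

```shell
# Hypothetical helper: read saved crash banner text on stdin and print the
# value of one field, e.g. RELEASE or PANIC.
crash_field() {
    # Match "<optional spaces>FIELD:<spaces>value" and keep only the value.
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" | head -n 1
}
```

For example, `crash_field RELEASE < banner.txt` (banner.txt being wherever you saved the output) would print the kernel release of the dump.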
bt: see the stack trace leading up to the kernel panic and the trace of the panic itself.
Now in the crash session, run the backtrace command by typing bt and pressing enter. You will now see the stack trace that led up to the kernel panic and the trace of the panic itself.
crash> bt
PID: 6631 TASK: ffff8da79cad2f80 CPU: 0 COMMAND: "bash"
#0 [ffffa49002107bf0] machine_kexec at ffffffff99857e9e
#1 [ffffa49002107c48] __crash_kexec at ffffffff99955b4d
#2 [ffffa49002107d10] crash_kexec at ffffffff99956a2d
#3 [ffffa49002107d28] oops_end at ffffffff99820e8d
#4 [ffffa49002107d48] no_context at ffffffff998677ae
#5 [ffffa49002107da0] do_page_fault at ffffffff998682e2
#6 [ffffa49002107dd0] page_fault at ffffffff9a20114e
[exception RIP: sysrq_handle_crash+18]
RIP: ffffffff99d12ec2 RSP: ffffa49002107e80 RFLAGS: 00010246
RAX: ffffffff99d12eb0 RBX: 0000000000000063 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000086 RDI: 0000000000000063
RBP: 0000000000000007 R8: 0000000000000229 R9: 0000000000000007
R10: 0000000000000000 R11: ffffffff9b239b2d R12: 0000000000000000
R13: ffffffff9ab37040 R14: 0000557d8bcd5cf0 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffa49002107e80] __handle_sysrq.cold.9 at ffffffff99d13a75
#8 [ffffa49002107ea8] write_sysrq_trigger at ffffffff99d1393b
#9 [ffffa49002107eb8] proc_reg_write at ffffffff99b3413c
#10 [ffffa49002107ed0] vfs_write at ffffffff99ab9ae5
#11 [ffffa49002107f00] ksys_write at ffffffff99ab9d5f
#12 [ffffa49002107f38] do_syscall_64 at ffffffff998041cb
#13 [ffffa49002107f50] entry_SYSCALL_64_after_hwframe at ffffffff9a2000ad
RIP: 00007faad0d84e18 RSP: 00007ffdab172798 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007faad0d84e18
RDX: 0000000000000002 RSI: 0000557d8bcd5cf0 RDI: 0000000000000001
RBP: 0000557d8bcd5cf0 R8: 000000000000000a R9: 00007faad0e16300
R10: 000000000000000a R11: 0000000000000246 R12: 00007faad1056780
R13: 0000000000000002 R14: 00007faad1051740 R15: 0000000000000002
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
Let’s break down that backtrace output for debugging. Analysing the kdump generated file, we can now see more granular information around the panic: frame numbers, stack addresses and the memory address of each function. This is more like it.
We could see the function executed that caused the kernel panic without the use of crash, but this will be more helpful in the long run.
#0 [ffffa49002107bf0] machine_kexec at ffffffff99857e9e
#0 | [ffffa49002107bf0] | machine_kexec | at ffffffff99857e9e |
---|---|---|---|
Frame number in the call trace. | Stack address of the frame. | Function executed. | Memory address of the instruction in that function. |
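To get a quick overview of a long backtrace, the frame lines can be reduced to just frame number and function. This is a sketch that works on bt output saved as text, assuming the "#N [address] function at address" layout shown above. bt_frames is a name I made up.

```shell
# Hypothetical helper: read saved `bt` output on stdin and print
# "<frame number> <function>" for every frame line, skipping register dumps.
bt_frames() {
    sed -n 's/^[[:space:]]*#\([0-9][0-9]*\)[[:space:]]*\[[0-9a-f]*][[:space:]]*\([^[:space:]]*\) at [0-9a-f]*[[:space:]]*$/\1 \2/p'
}
```

Reading the list top down walks back in time, from the kexec into the capture kernel (frame 0) to the write syscall from bash that started it all.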
[exception RIP: sysrq_handle_crash+18]
exception RIP: | sysrq_handle_crash | +18 |
---|---|---|
Oops at instruction pointer. | Function executed. | Offset in bytes (decimal) inside the function where the exception occurred. 18 is the same offset as the 0x12 in the Oops RIP line. |
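crash prints this offset in decimal while the Oops line in vmcore-dmesg.txt prints it in hex, which makes the two harder to compare by eye. Here's a small sketch that rewrites the exception line in the hex form; exception_site is a hypothetical helper that reads saved bt output.

```shell
# Hypothetical helper: find the "[exception RIP: func+N]" line on stdin and
# print it as "func+0x<hex>", the notation used in the Oops message.
exception_site() {
    site=$(sed -n 's/^[[:space:]]*\[exception RIP: \(.*\)][[:space:]]*$/\1/p' | head -n 1)
    [ -n "$site" ] || return 1
    func=${site%+*}      # everything before the last '+'
    offset=${site##*+}   # decimal offset after the last '+'
    printf '%s+0x%x\n' "$func" "$offset"
}

echo '    [exception RIP: sysrq_handle_crash+18]' | exception_site
# prints: sysrq_handle_crash+0x12
```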
CS: 0010
CS: | 0010 |
---|---|
Code segment register. | Task leading to crash was running in kernel mode (0010), not user mode (0033). |
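Why do 0010 and 0033 mean kernel and user mode? On x86 the low two bits of the CS value are the privilege level: 0 is ring 0 (kernel) and 3 is ring 3 (user). A tiny sketch of that check, with cs_mode being a name I made up:

```shell
# Hypothetical helper: decode the privilege level from a CS value given in hex
# (as printed by the Oops and by crash), using its low two bits.
cs_mode() {
    case $(( 0x$1 & 3 )) in
        0) echo kernel ;;
        3) echo user ;;
        *) echo other ;;
    esac
}

cs_mode 0010   # prints: kernel
cs_mode 0033   # prints: user
```

This matches the two register dumps in the backtrace: the first (CS: 0010) is the kernel side at the point of the exception, the second (CS: 0033) is bash in user space making the write syscall.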
Thanks. Some of the articles I used for this include understanding a kernel oops, determining cause of linux kernel panic, analyze crash and this crash whitepaper. This was written by Nick Otter.