Debugging a superblock percpu_rw_semaphore deadlock
November 24, 2024 · Rustam Kovhaev's blog

There are many ways you can go about analyzing a kernel deadlock.

While some prefer using lockdep, I find it easier to analyze a kernel dump file.

The issue that has recently come up is this:

[ 2335.487579] INFO: task mariadbd:1914 blocked for more than 122 seconds.
...
[ 2335.489157] INFO: task peer local sock:2449 blocked for more than 122 seconds.

The kernel's hung task detector has noticed that the tasks with PIDs 1914 and 2449 have stopped making forward progress.

Let’s trigger a kdump with “echo c > /proc/sysrq-trigger” and load the resulting dump into the crash utility.

crash> bt
PID: 2449     TASK: ffff9c7d41fc8000  CPU: 3    COMMAND: "peer local sock"
 #0 [ffffa87d03997908] __schedule at ffffffffb3a3d99c
 #1 [ffffa87d039979b0] schedule at ffffffffb3a3de13
 #2 [ffffa87d039979d0] percpu_down_write at ffffffffb3a42a5a
 #3 [ffffa87d03997a20] freeze_super at ffffffffb2ce906c
 #4 [ffffa87d03997ab0] fs_bdev_freeze at ffffffffb2ce94ac
 #5 [ffffa87d03997ad8] bdev_freeze at ffffffffb2facfc8
 #6 [ffffa87d03997b00] snapshot_free at ffffffffc0d28e9c [veeamblksnap]
 #7 [ffffa87d03997b40] snapshot_destroy at ffffffffc0d2971f [veeamblksnap]
 #8 [ffffa87d03997b68] ioctl_snapshot_destroy at ffffffffc0d247ff [veeamblksnap]
 #9 [ffffa87d03997b98] ctrl_unlocked_ioctl at ffffffffc0d24215 [veeamblksnap]
#10 [ffffa87d03997bc8] __x64_sys_ioctl at ffffffffb2d02b90
#11 [ffffa87d03997c00] x64_sys_call at ffffffffb28068fb
#12 [ffffa87d03997c10] do_syscall_64 at ffffffffb3a2601f
...

Judging by the backtrace, the snapshot removal, initiated from the veeamblksnap kernel module, got stuck trying to freeze the superblock.

This is a v6.8 kernel; percpu_down_write() is trying to acquire the writer side of the per-CPU sb->s_writers.rw_sem semaphore.

To do that, it has to wait for all the existing readers to finish.
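As a rough mental model (a simplified sketch, not the actual code in kernel/locking/percpu-rwsem.c), the writer first sets a flag that blocks new readers and then sleeps until the per-CPU reader counts sum to zero:

#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy model of struct percpu_rw_semaphore: one reader counter per CPU
   plus a "block" flag that sends new readers onto the slow path. */
struct toy_percpu_rwsem {
        atomic_int  read_count[NR_CPUS];
        atomic_bool block;
};

/* Only the sum across all CPUs is meaningful, not any single counter. */
static int toy_readers_active(struct toy_percpu_rwsem *sem)
{
        int sum = 0;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                sum += atomic_load(&sem->read_count[cpu]);
        return sum;
}

/* Writer side: block new readers, then wait for the existing ones to
   drain.  The real percpu_down_write() sleeps in rcuwait instead of
   spinning, and that is exactly where PID 2449 is parked in the dump. */
static void toy_down_write(struct toy_percpu_rwsem *sem)
{
        atomic_store(&sem->block, true);
        while (toy_readers_active(sem) != 0)
                ;       /* the kernel sleeps here, the model just spins */
}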

Now, let’s try to find them.

The first thing to do is to find the address of the semaphore on the stack (I will skip going over the function’s disassembly).

crash> bt -f
PID: 2449     TASK: ffff9c7d41fc8000  CPU: 3    COMMAND: "peer local sock"
 #0 [ffffa87d03997908] __schedule at ffffffffb3a3d99c
    ffffa87d03997910: ffffc87cffd91ee8 ffff9c7d408b5280
    ffffa87d03997920: ffff9c7d41fc8000 ffff9c7e77db4b80
    ffffa87d03997930: ffffa87d039979a8 ffffffffb3a3d99c
    ffffa87d03997940: ffffa87d03997950 ffff9c7d41fc8ae8
    ffffa87d03997950: 0000000000000000 0000000000000004
    ffffa87d03997960: ffffffffb3e161a0 0000000000000000
    ffffa87d03997970: 0000000000000002 dfb7acc13a389e00
    ffffa87d03997980: ffff9c7d41fc8000 ffff9c7d41fc8000
    ffffa87d03997990: ffff9c7d41fc8000 0000000000000002
    ffffa87d039979a0: 00000000000517e0 ffffa87d039979c8
    ffffa87d039979b0: ffffffffb3a3de13
 #1 [ffffa87d039979b0] schedule at ffffffffb3a3de13
    ffffa87d039979b8: 0000000000000001 ffff9c7d417ed270
    ffffa87d039979c8: ffffa87d03997a18 ffffffffb3a42a5a
 #2 [ffffa87d039979d0] percpu_down_write at ffffffffb3a42a5a
    ffffa87d039979d8: ffffffffb3a42702 ffff9c7d417ed000
    ffffa87d039979e8: ffffa87d03997a68 0000000000000000
    ffffa87d039979f8: ffff9c7d417ed000 ffff9c7d417ed070
    ffffa87d03997a08: ffff9c7d417ed2d0 ffff9c7d417ed270
    ffffa87d03997a18: ffffa87d03997aa8 ffffffffb2ce906c
 #3 [ffffa87d03997a20] freeze_super at ffffffffb2ce906c

The semaphore is at 0xffff9c7d417ed270.

crash> struct percpu_rw_semaphore -o
struct percpu_rw_semaphore {
   [0] struct rcu_sync rss;
  [48] unsigned int *read_count;
  [56] struct rcuwait writer;
  [64] wait_queue_head_t waiters;
  [88] atomic_t block;
}
SIZE: 96
crash> p ((struct percpu_rw_semaphore *) 0xffff9c7d417ed270).writer
$1 = {
  task = 0xffff9c7d41fc8000
}
crash> p ((struct percpu_rw_semaphore *) 0xffff9c7d417ed270).block
$2 = {
  counter = 1
}
crash> p ((struct task_struct *) 0xffff9c7d41fc8000)->pid
$3 = 2449
crash> p ((struct task_struct *) 0xffff9c7d41fc8000)->comm
$4 = "peer local sock"
crash>
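So the task parked as the pending writer is PID 2449 itself, and block is set to 1. In the toy model from above (again, just a sketch), that flag is what forces every reader arriving from now on down the slow path, where it has to wait for the writer to finish:

/* Reader side: if no writer is pending, just bump this CPU's counter;
   otherwise fall into the slow path.  The real code sleeps in
   percpu_rwsem_wait() on the waiters queue here -- the queue we will
   inspect next. */
static void toy_down_read(struct toy_percpu_rwsem *sem, int cpu)
{
        while (atomic_load(&sem->block))
                ;       /* slow path: wait for the writer to be done */

        atomic_fetch_add(&sem->read_count[cpu], 1);
}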

Now let’s find tasks waiting on it:

crash> p &(*((struct percpu_rw_semaphore *) 0xffff9c7d417ed270)).waiters
$5 = (wait_queue_head_t *) 0xffff9c7d417ed2b0
crash> waitq 0xffff9c7d417ed2b0
PID: 1914     TASK: ffff9c7d40a0d280  CPU: 0    COMMAND: "mariadbd"
PID: 2373     TASK: ffff9c7d4212a940  CPU: 2    COMMAND: "mariadbd"

crash> search -t ffff9c7d417ed270
PID: 1914     TASK: ffff9c7d40a0d280  CPU: 0    COMMAND: "mariadbd"
ffffa87d02cffc38: ffff9c7d417ed270

PID: 2373     TASK: ffff9c7d4212a940  CPU: 2    COMMAND: "mariadbd"
ffffa87d03767d20: ffff9c7d417ed270

PID: 2449     TASK: ffff9c7d41fc8000  CPU: 3    COMMAND: "peer local sock"
ffffa87d039979c0: ffff9c7d417ed270
ffffa87d03997a10: ffff9c7d417ed270

Let’s get their backtraces:

crash> bt 1914
PID: 1914     TASK: ffff9c7d40a0d280  CPU: 0    COMMAND: "mariadbd"
 #0 [ffffa87d02cffb38] __schedule at ffffffffb3a3d99c
 #1 [ffffa87d02cffbe0] schedule at ffffffffb3a3de13
 #2 [ffffa87d02cffc00] percpu_rwsem_wait at ffffffffb299f7f3
 #3 [ffffa87d02cffc60] __percpu_down_read at ffffffffb3a42898
 #4 [ffffa87d02cffc80] io_write at ffffffffb302ea54
 #5 [ffffa87d02cffd88] io_issue_sqe at ffffffffb3019be1
 #6 [ffffa87d02cffdd0] io_submit_sqes at ffffffffb301a878
 #7 [ffffa87d02cffe10] __do_sys_io_uring_enter at ffffffffb301ae32
 #8 [ffffa87d02cffe88] __x64_sys_io_uring_enter at ffffffffb301b012
 #9 [ffffa87d02cffe98] x64_sys_call at ffffffffb2806169
...
crash> bt 2373
PID: 2373     TASK: ffff9c7d4212a940  CPU: 2    COMMAND: "mariadbd"
 #0 [ffffa87d03767c20] __schedule at ffffffffb3a3d99c
 #1 [ffffa87d03767cc8] schedule at ffffffffb3a3de13
 #2 [ffffa87d03767ce8] percpu_rwsem_wait at ffffffffb299f7f3
 #3 [ffffa87d03767d48] __percpu_down_read at ffffffffb3a42898
 #4 [ffffa87d03767d68] vfs_write at ffffffffb2ce4e05
 #5 [ffffa87d03767e00] __x64_sys_pwrite64 at ffffffffb2ce4f16
 #6 [ffffa87d03767e48] x64_sys_call at ffffffffb28070d5
...
crash>

So, we have one writer (PID 2449) waiting for all previous readers to finish, and two new readers (PID 1914 and PID 2373) queued up behind it.

Now, since the backtraces show io_uring involved, finding those previous readers could be a bit complicated.

With io_uring, a user-space task executes a syscall, enters the kernel, calls percpu_down_read(), bumps the per-CPU counter of the semaphore, and returns to user space, leaving no trace of the held lock on any kernel stack.
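The reader side here is the freeze protection that the write path takes via sb_start_write() and drops again via sb_end_write(), which boil down to percpu_down_read() and percpu_up_read() on sb->s_writers. A minimal sketch of the shape of the problem (the request structure and function names are made up; only the sb_*_write() calls are real):

#include <linux/fs.h>

/* Made-up request structure, just to show the two halves of the
   pattern living in different contexts. */
struct sketch_request {
        struct super_block *sb;
};

/* Submission, in syscall context: take freeze protection, queue the
   I/O, return to user space.  Nothing sleeps here, so no backtrace
   will ever show this task holding the reader lock. */
static void sketch_submit_write(struct sketch_request *req)
{
        sb_start_write(req->sb);        /* read_count++ on this CPU */
        /* hand the request off and return immediately */
}

/* Completion, later and possibly on another CPU: only now is the
   reader side dropped.  Until this runs, freeze_super() cannot make
   progress. */
static void sketch_complete_write(struct sketch_request *req)
{
        sb_end_write(req->sb);          /* read_count-- on this CPU */
}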

Let’s get the read_count of the semaphore; there are 4 CPUs to read it from:

crash> p (*((struct percpu_rw_semaphore *) 0xffff9c7d417ed270)).read_count
$6 = (unsigned int *) 0x517e0
crash> p *(int *)(__per_cpu_offset[0] + 0x517e0)
$7 = 713
crash> p *(int *)(__per_cpu_offset[1] + 0x517e0)
$8 = 236
crash> p *(int *)(__per_cpu_offset[2] + 0x517e0)
$9 = -55
crash> p *(int *)(__per_cpu_offset[3] + 0x517e0)
$10 = -893
crash>
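A side note on the arithmetic above: read_count holds 0x517e0, which is not a regular kernel address but an offset into the per-CPU area, so CPU n's copy of the counter lives at __per_cpu_offset[n] plus that offset. The commands above do by hand what the kernel's per_cpu_ptr() does; as a sketch:

/* Sketch of what the crash commands compute: the per-CPU copy of
   read_count is the stored offset plus that CPU's per-CPU base,
   i.e. per_cpu_ptr(sem->read_count, cpu). */
static int read_count_on_cpu(const unsigned long per_cpu_offset[],
                             unsigned long read_count_off, int cpu)
{
        /* per_cpu_offset[] stands in for the kernel's __per_cpu_offset[] */
        return *(int *)(per_cpu_offset[cpu] + read_count_off);
}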

The sum of (713 + 236 - 55 - 893) gives us 1.
There is one reader somewhere.
And percpu_down_write() is waiting for it.
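The negative per-CPU values are expected, by the way: percpu_down_read() increments the counter on whichever CPU the task happens to be running on, and percpu_up_read() decrements it on whichever CPU the task is on at release time, which may be a different one, so individual counters drift and only the sum matters. A tiny self-contained model of that (the CPU numbers are made up):

#include <stdio.h>

#define NR_CPUS 4

int main(void)
{
        int read_count[NR_CPUS] = { 0 };

        read_count[0]++;        /* task A: percpu_down_read() while on CPU 0 */
        read_count[0]++;        /* task B: percpu_down_read() while on CPU 0 */
        read_count[2]--;        /* task A: percpu_up_read() after migrating to CPU 2 */
        /* task B has not released yet -- that is the "one reader somewhere" */

        int sum = 0;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                sum += read_count[cpu];

        printf("per-CPU: %d %d %d %d, sum = %d\n",
               read_count[0], read_count[1], read_count[2],
               read_count[3], sum);
        /* prints: per-CPU: 2 0 -1 0, sum = 1 -> one reader still active */
        return 0;
}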

io_uring is a complicated beast, and I am not familiar with its code base, so at this point I decided to search the LKML to see whether someone else had already reported this deadlock in io_uring.

To my surprise, Peter Mann had run into the same issue just a month earlier, and Jens Axboe had already fixed it upstream.

https://lore.kernel.org/io-uring/46f67f82-846b-4607-902a-c71e3d2a68c6@kernel.dk/

So, this deadlock is fixed in the v6.12 kernel, and the fix, commit 1d60d74e852647255bd8e76f5a22dc42531e4389, will eventually be backported to older kernels.

The issue was quite easy to find.
Thank God for small mercies, eh? =)