When root on ZFS breaks on Arch Linux…

Today was update day¹. Then the expected unexpected happened: the ZFS module was missing from the initramfs. Desktop's dead in the water. I boot up my laptop to quickly flash an Arch live ISO onto a USB drive, and while at it also upgrade that one. Knowing that what went wrong on my desktop would likely also fail here, I pay a bit more attention and there it is, in the post-transaction hooks². Most notably: pacman does not exit with an error code, so yay just keeps going with the AUR upgrades.

==> dkms install --no-depmod zfs/2.1.9 -k 6.2.8-arch1-1
Error! Bad return status for module build on kernel: 6.2.8-arch1-1 (x86_64)
Consult /var/lib/dkms/zfs/2.1.9/build/make.log for more information.
==> WARNING: `dkms install --no-depmod zfs/2.1.9 -k 6.2.8-arch1-1' exited 10
…
==> ERROR: module not found: 'zavl'
==> ERROR: module not found: 'znvpair'
==> ERROR: module not found: 'zunicode'
==> ERROR: module not found: 'zcommon'
==> ERROR: module not found: 'zfs'
==> ERROR: module not found: 'spl'
…
==> WARNING: errors were encountered during the build. The image may not be complete.
error: command failed to execute correctly
(5/5) Updating the info directory file...

In said log file:

ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'bio_start_io_acct'
ERROR: modpost: GPL-incompatible module zfs.ko uses GPL-only symbol 'bio_end_io_acct_remapped'

Turns out that in Linux 6.2.8 some symbols got marked GPL-only, causing the CDDL-licensed ZFS to fail to build.
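For next time, pulling just the offending symbol names out of a long make.log could look roughly like the sketch below. The function name gpl_only_symbols is made up for illustration, and it assumes the modpost lines have exactly the format shown above.

```shell
# Hypothetical helper: print each GPL-only symbol named in modpost errors.
# Reads a build log on stdin; assumes the message format shown above.
gpl_only_symbols() {
    sed -n "s/.*uses GPL-only symbol '\([^']*\)'.*/\1/p" | sort -u
}

# Usage, with the log path taken from the DKMS error message:
# gpl_only_symbols < /var/lib/dkms/zfs/2.1.9/build/make.log
```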

I make very sure not to reboot my laptop. I downgrade linux linux-headers, pick 6.2.7 and everything looks fine. Back to the other machine!

curl https://ftp.halifax.rwth-aachen.de/archlinux/iso/latest/archlinux-x86_64.iso -Ssf | sudo dd of=/dev/disk/by-id/usb-_USB_DISK_whatever bs=1M status=progress
sync

Boot the thing³ only to notice that the live ISO doesn't have ZFS, of course.

Since my machine only has network access over fiber and I use an Intel X520-DA2 network card with an "unsupported" SFP module, I have no network access in the rescue environment to add the archzfs repo and install ZFS.

Fine, let's just connect the ethernet port to the laptop instead and bridge the connections:

ip link add br0 type bridge
ip link set wlp166s0 master br0
Error: Device does not allow enslaving to a bridge.

Oh come on! Why can't anything just work ;_;

I guess that's a no for network access then.

Time to start digging into how to build a live ISO with ZFS. But I can't, because I can only easily build a live ISO with Linux 6.2.8, which is the original problem.

Back to fixing the network on the plain rescue system then, I guess. Setting the kernel module parameter to allow the unsupported SFP module was easy enough after all⁴: rmmod ixgbe ; modprobe ixgbe allow_unsupported_sfp=1. Good thing I still knew more or less what I had to do to get that to work. Interestingly, the slot with the unsupported module didn't show up in ip link at all before adding the parameter, but the second, empty slot did.

Being able to copy, paste and scroll is more convenient than working on a bare tty, so I'll continue over ssh: curl https://mnus.de/minus.pub > /root/.ssh/authorized_keys. Having your pubkeys readily available somewhere is really helpful in moments like this! sshd is already running in the live system, how convenient!

Anyway, back to work:

cat >> /etc/pacman.conf <<'EOF'
[archzfs]
Server = http://archzfs.com/$repo/x86_64
EOF
pacman-key --recv-keys DDF7DB817396A49B2A2723F7403BD972F75D9D76
pacman-key --lsign archzfs
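Running that cat >> twice would leave two [archzfs] sections in the file. A minimal sketch of a guarded variant, assuming the same repo block as above; the function name add_archzfs_repo is hypothetical:

```shell
# Hypothetical helper: append the [archzfs] section to a pacman.conf
# only if no such section is present yet. $1 is the config file path.
add_archzfs_repo() {
    conf=$1
    if grep -q '^\[archzfs\]' "$conf"; then
        return 0    # already configured, nothing to do
    fi
    cat >> "$conf" <<'EOF'
[archzfs]
Server = http://archzfs.com/$repo/x86_64
EOF
}

# Usage: add_archzfs_repo /etc/pacman.conf
```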

Now a quick pacman -Sy archzfs-linux to get ZFS going… except the binaries from the archzfs repo don't match the kernel, of course. I reach for zfs-dkms, which fails because linux-headers is not installed. But I can't just install that because the ISO uses an older kernel than the current one (which would be the broken 6.2.8 anyway).

Time to downgrade. Except you can't build the package as root, so we'll need a user account first:

useradd -mG wheel build
echo '%wheel ALL=(ALL:ALL) NOPASSWD: ALL' > /etc/sudoers.d/wheel
sudo -iu build

As build:

curl https://aur.archlinux.org/cgit/aur.git/snapshot/downgrade.tar.gz | tar zvx
cd downgrade
sudo pacman -S base-devel  # the build user is not root, so pacman needs sudo
makepkg -si

I downgrade linux-headers to 6.2.7 and it's finally time to pacman -S zfs-dkms for real. Except not. Not enough disk space, ugh! The live system only has 256MiB of writable space by default! Nothing a quick mount -oremount,size=4G /run/archiso/cowspace can't solve!

Okay, we've got ZFS!

cryptsetup open /dev/disk/by-id/whatever-part2 whatever
zpool import -R /mnt zroot  # -R /mnt very important!
mount /dev/disk/by-id/whatever-part1 /mnt/boot  # also very important!
arch-chroot /mnt

⁵

Now just downgrade linux{,-headers} and all should be well again:

downgrade linux linux-headers

That is, if I hadn't forgotten to mount the boot partition and thus installed the initramfs and kernel inside my ZFS root…

Solution to prevent that from happening again: chmod 000 /boot && chattr +i /boot while it's not mounted.
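A complementary guard for the rescue side is to refuse to enter the chroot unless the boot partition is really mounted. This is only a sketch: the function name require_mounted is made up, and it relies on the mountpoint tool from util-linux.

```shell
# Hypothetical guard: succeed only if $1 is an active mount point.
require_mounted() {
    if ! mountpoint -q "$1"; then
        echo "refusing to continue: $1 is not mounted" >&2
        return 1
    fi
}

# Usage in the rescue system, before entering the chroot:
# require_mounted /mnt/boot && arch-chroot /mnt
```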

The End?

Anyway, could this all have been prevented?

Yes, if Arch shipped ZFS itself. The problem with that, as gromit on #archlinux pointed out, is that ZFS may lag behind on mainline support. Which is exactly what caused this dilemma. Holding back the kernel certainly would work (in fact, that's why it works on Ubuntu), but if I were the kernel package maintainer for Arch, I wouldn't want to be held back by what's basically third-party software. Oh, would it be nice if ZFS weren't CDDL or if the CDDL/GPL compatibility question were settled.
Yes, if pacman failed more visibly after the DKMS hook for ZFS failed. pacman exited with status code 0. It also prints mkinitcpio's output without colors.
Yes, if Arch kept around the old kernel and initramfs. ~~Copying those to {,.old} or whatever in /boot would be very useful!~~ Copying only those would not be enough since all of the other kernel modules would also be missing.
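Since pacman exits with 0 even when the DKMS hook fails, one workaround is to check the freshly built initramfs by hand before rebooting. A minimal sketch, assuming the module files are called zfs.ko and spl.ko (optionally with a compression suffix); initramfs_has_zfs is a made-up name, and the listing is expected on stdin, for example from lsinitcpio, which ships with mkinitcpio.

```shell
# Hypothetical check: read an initramfs file listing on stdin and
# succeed only if both the zfs and spl modules appear in it.
# Assumes module files are named zfs.ko / spl.ko, optionally compressed.
initramfs_has_zfs() {
    listing=$(cat)
    for mod in zfs spl; do
        printf '%s\n' "$listing" | grep -Eq "/${mod}\.ko(\.[a-z0-9]+)?\$" || {
            echo "missing module: $mod" >&2
            return 1
        }
    done
}

# Usage after an upgrade, before rebooting (image path as used by the
# stock linux package on Arch):
# lsinitcpio /boot/initramfs-linux.img | initramfs_has_zfs || echo "do NOT reboot"
```

It only looks at two of the modules from the error list above, which should be enough to notice that the DKMS build went wrong.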

Note: this post is mostly notes for myself when, not if, this happens again.

¹ Or rather, I had played with fan settings and I wanted to reboot to get them back to their automatic setting, but that's a different story.
² Can you really blame me for missing this? I'd probably have caught the red error if it wasn't followed by tons of log output from building AUR packages :)
³ After various failed attempts at getting video output or getting into the BIOS, and somehow wiping my complete BIOS settings. Having two graphics cards and one monitor plugged into both of them sure doesn't make this easier…
⁴ Using USB tethering on my phone, as MacGyver on #archlinux pointed out, would have worked too but would have prevented me from SSHing in from my laptop.
⁵ The part you don't see here is the LVM in the LUKS-encrypted device. But you don't need to anyway. I just keep my swap in there. zpool import just discovering what exists is very convenient, isn't it?
