The Mystery
For weeks, my Raspberry Pi cluster was going offline seemingly at random. No pattern, no warning—just sudden, complete failure. And when I say complete, I mean complete: the root partition would disappear, SSH would die mid-session, and I couldn’t even run basic commands like reboot or exit. The only solution was a hard power reset.
This was maddening because I had no logs. When your root filesystem vanishes, so do all your binaries and your ability to investigate what just happened.
The Investigation
I knew I needed logs that would survive the crashes, so I had to get creative:
- Configured rsyslog to talk to journald
- Set up remote logging to forward everything to my personal computer (a minimal sketch of that config follows this list)
- Waited for the next crash (which felt like watching paint dry, but with more anxiety)
- Trudged through mountains of logs looking for clues
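For anyone who wants to replicate the remote logging piece, here is a rough sketch. It assumes rsyslog is installed on both machines and that the receiver sits at 192.168.1.50; the file name and the address are placeholders I picked for illustration, not anything specific to my cluster:

```
# On the Pi, e.g. /etc/rsyslog.d/90-remote.conf (any *.conf file in that directory works):
# forward every facility and priority to the receiver over UDP (use @@ for TCP instead)
*.* @192.168.1.50:514

# On the receiving machine, enable the UDP listener in rsyslog's config:
# module(load="imudp")
# input(type="imudp" port="514")
```

Then restart rsyslog on both ends (`sudo systemctl restart rsyslog`). The whole point is simply that the logs land on a disk that doesn't vanish when the Pi's root filesystem does.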
Finally, I found something:
```
cluster0 fstrim[264392]: fstrim: /boot/firmware: FITRIM ioctl failed: Input/output error
cluster0 fstrim[264392]: /data/brick1: 1.7 TiB (1888567922688 bytes) trimmed on /dev/nvme0n1p3
cluster0 fstrim[264392]: /: 112.1 GiB (120364019712 bytes) trimmed on /dev/nvme0n1p2
```
And the smoking gun:
```
cluster0 kernel: nvme nvme0: I/O tag 152 (3098) opcode 0x9 (I/O Cmd) QID 1 timeout, aborting req_op:DISCARD(3) size:4096
cluster0 kernel: nvme nvme0: I/O tag 153 (2099) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:20480
cluster0 kernel: nvme nvme0: I/O tag 152 (3098) opcode 0x9 (I/O Cmd) QID 1 timeout, reset controller
cluster0 kernel: INFO: task jbd2/nvme0n1p3-:680 blocked for more than 120 seconds.
```
The fstrim systemd timer was triggering, attempting to TRIM the /boot/firmware partition (which is vfat/FAT32), and causing the NVMe controller to timeout and reset. When the controller reset, my entire system—including the root partition—would go offline.
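If you want to check whether the same timer is armed on your own system, systemd will tell you when it last ran and what it actually executes:

```
# When did fstrim last run, and when will it run next?
systemctl list-timers fstrim.timer

# What command does the service run?
systemctl cat fstrim.service
```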
The Research Phase (and a Dead End)
Before I even discovered it was fstrim causing the problem, I was reading forum posts about NVMe HATs on Raspberry Pis. The overwhelming message was: “Your HAT (or drive) doesn’t work properly. Go buy different hardware.”
Something didn’t sit right with me about that diagnosis, but without logs showing what was actually failing, I couldn’t prove otherwise.
Once I finally captured logs showing fstrim was the last thing to run before the system panicked, I started searching specifically for “fstrim Raspberry Pi issues.” Guess what I found? Almost nothing. Hardly anyone was discussing fstrim problems on a Pi, which seemed bizarre.
Then it hit me: the Raspberry Pi Imager creates that vfat /boot/firmware partition automatically. Every single person using the official imaging tool has this partition. And fstrim runs weekly by default on most Linux systems.
This means a significant number of people blaming their “incompatible” NVMe drives are probably experiencing the exact same fstrim/vfat issue. They either:
- Get lucky and the timing never quite lines up to cause a full crash
- Experience random instability they can’t explain and give up
- Replace their hardware thinking it’s defective
- Never capture logs because the root partition disappears when it crashes
I searched for fstrim solutions specifically and found nothing helpful. Most suggestions were either:
- “Your NVMe drive doesn’t support TRIM” (but mine clearly did—TRIM worked fine on my ext4 partitions!)
- “Disable TRIM entirely” (throwing the baby out with the bathwater)
- Vague workarounds that didn’t address the root cause
I could manually run fstrim on my ext4 partitions (/ and /data/brick1) without any issues—they trimmed in seconds. But running it on /boot/firmware would lock up, time out, and crash the entire system.
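For reference, the manual runs look roughly like this (-v makes fstrim report how much it trimmed), and lsblk can confirm that the drive itself advertises discard support, which is why “your drive doesn’t support TRIM” wasn’t the answer:

```
# Trimming the ext4 mounts by hand completed in seconds
sudo fstrim -v /
sudo fstrim -v /data/brick1

# This is the one that hung and took the controller down with it
# sudo fstrim -v /boot/firmware

# Nonzero DISC-GRAN / DISC-MAX means the device advertises discard support
lsblk --discard /dev/nvme0n1
```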
The Solution
Here’s what I learned: FAT/vfat filesystems have problematic TRIM support. The way the vfat filesystem’s FITRIM ioctl issues TRIM commands can cause certain NVMe controllers to time out and reset.
The fix is beautifully simple: tell fstrim to only operate on filesystem types that handle TRIM well.
Step-by-Step Fix
- Edit the fstrim systemd service:

```
sudo systemctl edit fstrim.service
```

- Add this override configuration:

```
[Service]
ExecStart=
ExecStart=/sbin/fstrim --fstab --verbose --quiet-unsupported -t ext4
```

The first empty ExecStart= clears the original command, and the second one replaces it with a version that only trims ext4 filesystems.

- Reload systemd:

```
sudo systemctl daemon-reload
```

- Test it manually:

```
sudo systemctl start fstrim.service
sudo journalctl -u fstrim.service -n 50
```
That’s it! No more crashes, no more controller resets, and TRIM still works perfectly on the partitions that matter.
Why This Works
My /etc/fstab looked like this:
```
PARTUUID=64dd9cc5-01  /boot/firmware  vfat  defaults                  0  2
PARTUUID=64dd9cc5-02  /               ext4  defaults,noatime,discard  0  1
/dev/nvme0n1p3        /data/brick1    ext4  defaults,noatime,discard  1  2
```
By adding -t ext4 to the fstrim command, we’re telling it: “Only look at ext4 filesystems when reading from fstab.” The vfat /boot/firmware partition gets automatically skipped, and since that partition rarely changes and is tiny anyway, we’re not losing anything meaningful.
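You can sanity-check which mounts the override will actually touch with a dry run before trusting the weekly timer to it (the output depends on your own fstab, of course):

```
# --dry-run prints what would be trimmed without issuing any discards
sudo /sbin/fstrim --fstab --dry-run --verbose -t ext4
```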
Scalability
The beauty of this approach is that it’s scalable and maintainable:
- Add as many ext4 partitions as you want—they’ll automatically be trimmed
- Problematic filesystem types (vfat, ntfs, etc.) are automatically excluded
- No need to manually list every partition or create exclusion lists
If you use other filesystem types that support TRIM well, you can add them:
```
ExecStart=/sbin/fstrim --fstab --verbose --quiet-unsupported -t ext4,xfs,btrfs
```
Note: If you use ZFS, don’t add it here—ZFS handles TRIM internally through zpool trim commands.
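For completeness, this is what the ZFS-side equivalent looks like; “tank” here is a placeholder pool name, not part of my setup:

```
# One-off TRIM of a pool
sudo zpool trim tank

# Or let the pool issue discards continuously as space is freed
sudo zpool set autotrim=on tank
```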
The Aftermath
After applying this fix, my Pis have been rock solid. No more random crashes, no more power resets, and TRIM is still doing its job on the partitions that matter.
It’s frustrating that this issue is so common yet the solution isn’t widely documented. Hopefully this helps someone else avoid the weeks of debugging I went through!
My Setup
For context, here’s what I’m running:
- Raspberry Pi with NVMe HAT
- Booting from NVMe drive
- Three partitions: vfat boot, ext4 root, ext4 data
- GlusterFS cluster (which is irrelevant to this issue but explains the /data/brick1 mount)
Lessons Learned
- Remote logging is essential for debugging systems that can fail catastrophically
- Not all filesystems handle all operations equally well—what works for ext4 might break on vfat
- The default configuration isn’t always right for every use case
- When in doubt, limit operations to what you know works rather than trying to exclude what you know fails
Special thanks to Claude for helping me connect the dots when I was completely stuck. Sometimes you just need a fresh perspective!