The Wrong Card

The server had been crashing for weeks. Sense key errors filling the logs. mpt3sas driver events firing every five minutes. Error codes that looked serious because error codes always look serious when a system is unstable.

We investigated. Some findings were real. Some turned out to be nothing. This is the post about which was which — and about the conclusion we reached at the end of it.


The Errors Were Benign

The first batch — the ones filling the syslog on a five-minute cycle — meant nothing.

Sense Key 0x1 / ASC=0x1c ASCQ=0x0. "Defect List Not Found." Four SAS SSDs, each responding with the same message every few minutes. The pattern looked like a polling loop gone wrong. The codes sounded serious.

They weren't.

The HBA firmware sends READ DEFECT DATA commands to every drive on the bus as standard SAS maintenance — it's asking each drive to report its internal defect list. The Toshiba SAS SSDs respond that they don't maintain one. The kernel logs it as a sense key event. The whole exchange is correct behaviour at the protocol level.
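For anyone triaging the same log lines, the decoding step can be sketched in a few lines of Python. The tables are a small subset of the standard SCSI sense key and ASC/ASCQ assignments. The `BENIGN` set and the `describe_sense` helper are our own illustrative names, not part of any tool, and the triage rule reflects this server only.

```python
# Subset of the standard SCSI sense key table.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}

# Subset of the ASC/ASCQ table: only the defect-list family seen in these logs.
ADDITIONAL_SENSE = {
    (0x1C, 0x00): "DEFECT LIST NOT FOUND",
    (0x1C, 0x01): "PRIMARY DEFECT LIST NOT FOUND",
    (0x1C, 0x02): "GROWN DEFECT LIST NOT FOUND",
}

# Assumption: a hand-maintained list of combinations judged harmless on this server.
BENIGN = {(0x1, 0x1C, 0x00)}


def describe_sense(key, asc, ascq):
    """Turn a (sense key, ASC, ASCQ) triple into a readable line with a verdict."""
    key_name = SENSE_KEYS.get(key, f"UNKNOWN KEY 0x{key:x}")
    detail = ADDITIONAL_SENSE.get((asc, ascq), f"ASC=0x{asc:02x} ASCQ=0x{ascq:02x}")
    verdict = "benign" if (key, asc, ascq) in BENIGN else "investigate"
    return f"{key_name}: {detail} [{verdict}]"


print(describe_sense(0x1, 0x1C, 0x00))
```

Sense key 0x1 is RECOVERED ERROR, which is the drive's way of saying the command completed and it has a note to add. That alone is a hint that the message is informational.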

The SMART configuration was also wrong, though that turned out to be a separate issue. Unraid was set to query the SSDs using ATA attribute polling, the kind designed for ATA and SATA drives. SAS SSDs don't speak that way. Changing the SMART controller type to -d scsi for the SSDs and -d sat for the SATA drives behind the expander stopped the mismatched queries. The sense key events continued regardless, because those come from the HBA firmware, not from SMART polling.
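As a rough sketch, the mapping from drive transport to smartctl device type looks like this. The `-a` and `-d` options are real smartctl flags. The `smartctl_command` helper and the device paths are hypothetical, and the function only builds the command without running it.

```python
# Map each drive's transport to the smartctl device type it should be queried with.
DEVICE_TYPES = {
    "sas": "scsi",  # SAS SSDs: native SCSI log pages
    "sata": "sat",  # SATA drives behind the expander: SCSI-to-ATA translation
}


def smartctl_command(device, transport):
    """Build (but do not run) a smartctl invocation for one drive."""
    return ["smartctl", "-a", "-d", DEVICE_TYPES[transport], device]


# Hypothetical device paths, for illustration only.
print(" ".join(smartctl_command("/dev/sdb", "sas")))
print(" ".join(smartctl_command("/dev/sdf", "sata")))
```

In Unraid the same choice is made through the SMART controller type setting, so this is only a way of seeing what ends up on the command line.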

Two separate things that looked connected. They weren't.


Running at Half Width

lspci -vv shows the PCIe link capability and status for every device in the system. The lines that mattered:

LnkCap:  Speed 8GT/s, Width x8
LnkSta:  Speed 8GT/s, Width x4 (downgraded)

The card is rated for x8. Eight lanes of PCIe 3.0. It has been running at four.
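The same comparison can be automated. Below is a minimal sketch that parses the two lines above and reports whether the negotiated link is below the rated one. The function names are ours, and it assumes the text passed in covers a single device.

```python
import re

# Matches the speed and width on a LnkCap or LnkSta line. Other fields
# (port number, ASPM, status annotations) are skipped by the lazy wildcards.
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_links(text):
    """Return {'Cap': (speed, width), 'Sta': (speed, width)} for one device."""
    links = {}
    for kind, speed, width in LINK_RE.findall(text):
        links[kind] = (float(speed), int(width))
    return links


def is_downgraded(text):
    """True if the negotiated speed or width is below the device's capability."""
    links = parse_links(text)
    cap_speed, cap_width = links["Cap"]
    sta_speed, sta_width = links["Sta"]
    return sta_speed < cap_speed or sta_width < cap_width


sample = """LnkCap:  Speed 8GT/s, Width x8
LnkSta:  Speed 8GT/s, Width x4 (downgraded)"""

print(parse_links(sample))
print(is_downgraded(sample))
```

Comparing the numbers directly, instead of searching for the word "downgraded", keeps the check independent of how a given lspci version annotates the line.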

lspci flags this as (downgraded): the link negotiated below the device's rated capability. The card is running, but not as designed. Whether this caused any of the crashes is unclear. Plenty of systems run x8 cards in x4 slots without instability, and a PCIe 3.0 x4 link still offers roughly 3.9 GB/s, more than enough for spinning rust. But it means the card has never been operating to spec on this machine.
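For a sense of scale, here is the back-of-the-envelope arithmetic for the two link widths. It uses the PCIe 3.0 signalling rate of 8 GT/s per lane and the 128b/130b line encoding. The result is a theoretical ceiling, since packet and protocol overhead reduce what is actually usable.

```python
def pcie3_bandwidth_gbs(lanes):
    """Theoretical PCIe 3.0 bandwidth in GB/s, one direction, before protocol overhead."""
    transfers_per_s = 8e9   # 8 GT/s per lane, one bit per transfer
    encoding = 128 / 130    # 128b/130b line encoding
    bytes_per_s_per_lane = transfers_per_s * encoding / 8
    return bytes_per_s_per_lane * lanes / 1e9


for lanes in (4, 8):
    print(f"x{lanes}: {pcie3_bandwidth_gbs(lanes):.2f} GB/s")
# x4: 3.94 GB/s
# x8: 7.88 GB/s
```

Halving the width halves the ceiling, but the x4 figure is still well above what a set of hard drives can deliver in aggregate.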

The reason is simple: no x8 slot is free without moving the GPU. The remaining slots are x4. The card went in where it fit.


The Right Tool for the Wrong Job

This is where the investigation shifted from "what's wrong" to "what's wrong with having this card at all."

The Broadcom 9400-16i is tri-mode: it can connect SAS drives, SATA drives, and NVMe drives from a single controller. This matters in environments where all three drive types share one card: dense enterprise shelves, high-throughput systems, mixed-media storage pools.

On this server: SAS SSDs and SATA hard drives. No NVMe. No plans for NVMe. The tri-mode capability has been unused since the card was installed.

The 9400 is also a first-generation product. Broadcom's first tri-mode HBA. The firmware has documented stability issues that were fixed in later revisions. The driver has instability reports across multiple Linux distributions and kernel versions. Users who needed the NVMe support found it worthwhile. Users who didn't found they'd bought complexity they couldn't use and inherited instability they didn't expect.

The 9305-16i is the previous generation: SAS and SATA, no NVMe. Same port count. Same connectors. Same physical footprint. A chip architecture with more production history and fewer instability reports. It is, by several accounts, boring in the specific way that storage controllers should be boring.


The Decision

A 9305-16i has been ordered.

Not because the 9400 is definitively responsible for the crashes — the root cause remains unconfirmed. Not because the x4 bandwidth is provably a problem — it probably isn't for this workload. The decision came from a simpler question: is this the right card for what this server actually does?

It isn't. It solves a problem that doesn't exist here, with a reputation for problems that do exist here, running below its rated spec in a slot that can't give it what it needs.

The replacement addresses a different question from the investigation. The investigation asked: why is this crashing? We found partial answers. The replacement asks: is this hardware appropriate for this role? The answer is no, and that's reason enough.

The 9305 will go in the same slot. It will also negotiate x4. The difference is that the 9305's track record in that configuration is considerably quieter.

Sometimes the right fix is not a root cause. It's a better choice.