nvme SSD: critical medium error, dev nvme0n1

I'm guessing this is bad, right?

[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0

Is it an oh-shit-get-yerself-a-new-drive-NOW thing, or …?

Drive is a 2+ year old Intel 512 GB SSD. Not entirely sure what the right diagnostics are for SSDs. Filesystem is showing clean but touching certain known-bad files triggers the error in the system log.

Dunno if these nvme stats are useful:

Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 25 C
available_spare                     : 85%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 10,349,479
data_units_written                  : 10,098,299
host_read_commands                  : 183,018,841
host_write_commands                 : 136,702,227
controller_busy_time                : 1,342
power_cycles                        : 201
power_on_hours                      : 15,722
unsafe_shutdowns                    : 10
media_errors                        : 803
num_err_log_entries                 : 844
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

Any suggestions, please, for:

* what I should be looking for in stats (nvme smart-log-add doesn't give me anything at all, so no wear-levelling stats)
* a decent brand to replace it with. I'm likely okay with a SATA SSD.

cheers,
 Stewart
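For pulling the full health data and the per-error entries off the drive, something along these lines should work with stock nvme-cli and a reasonably recent smartmontools (a sketch only; the device names and the availability of the Intel plugin are assumptions, adjust for your system):

    # full SMART/health log for the controller (same fields as above)
    sudo nvme smart-log /dev/nvme0

    # the persistent error log behind media_errors / num_err_log_entries;
    # each entry should include the LBA that failed
    sudo nvme error-log /dev/nvme0

    # vendor wear-levelling stats, if your nvme-cli build includes the Intel plugin
    sudo nvme intel smart-log-add /dev/nvme0

    # smartmontools can also read NVMe health data directly
    sudo smartctl -a /dev/nvme0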

On Mon, 29 Jul 2019 at 13:28, Stewart C. Russell via talk <talk@gtalug.org> wrote:
I'm guessing this is bad, right?
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0
Is it an oh-shit-get-yerself-a-new-drive-NOW thing, or …?
Drive is a 2+ year old Intel 512 GB SSD. Not entirely sure what the right diagnostics are for SSDs. Filesystem is showing clean but touching certain known-bad files triggers the error in the system log.
Dunno if these nvme stats are useful:
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 25 C
available_spare                     : 85%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 10,349,479
data_units_written                  : 10,098,299
host_read_commands                  : 183,018,841
host_write_commands                 : 136,702,227
controller_busy_time                : 1,342
power_cycles                        : 201
power_on_hours                      : 15,722
unsafe_shutdowns                    : 10
media_errors                        : 803
num_err_log_entries                 : 844
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
Any suggestions, please, for:
* what I should be looking for in stats (nvme smart-log-add doesn't give me anything at all, so no wear-levelling stats)
* a decent brand to replace it with. I'm likely okay with a SATA SSD.
cheers, Stewart
The log doesn't sound like heavy use ... and yet that sounds like an "oh-shit-get-yerself-a-new-drive-NOW" error to me. At the very least, stay on top of your backups.

As I understand it, when "segments" go bad on a solid state drive (hell, even on a spinning disk these days), the drive firmware should silently move the data and you'd never even know it happened. That you're seeing the errors is alarming and suggests a fairly serious malfunction.

But ... I have no expertise with SSD (or NVMe) drives - I have a few, but none have failed so I haven't had to learn. Ignore this suggestion if you get advice from someone with more knowledge of those drives ...

--
Giles
https://www.gilesorr.com/
gilesorr@gmail.com

On 2019-07-29 3:58 p.m., Giles Orr via talk wrote:
The log doesn't sound like heavy use ... and yet that sounds like an "oh-shit-get-yerself-a-new-drive-NOW" error to me.
Thought so.
At the very least, stay on top of your backups.
Today was a great day to discover that my automatic backup system hadn't really been working since late 2017. The only major casualty is my Windows 10 VM, which has a hard error somewhere in its drive image and hard-locks if it's run for more than about 20 minutes.

Stewart

On Mon, Jul 29, 2019 at 01:28:32PM -0400, Stewart C. Russell via talk wrote:
I'm guessing this is bad, right?
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
[Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0
Is it an oh-shit-get-yerself-a-new-drive-NOW thing, or …?
Drive is a 2+ year old Intel 512 GB SSD. Not entirely sure what the right diagnostics are for SSDs. Filesystem is showing clean but touching certain known-bad files triggers the error in the system log.
Dunno if these nvme stats are useful:
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 25 C
available_spare                     : 85%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 10,349,479
data_units_written                  : 10,098,299
host_read_commands                  : 183,018,841
host_write_commands                 : 136,702,227
controller_busy_time                : 1,342
power_cycles                        : 201
power_on_hours                      : 15,722
unsafe_shutdowns                    : 10
media_errors                        : 803
num_err_log_entries                 : 844
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
Any suggestions, please, for:
* what I should be looking for in stats (nvme smart-log-add doesn't give me anything at all, so no wear-levelling stats)
* a decent brand to replace it with. I'm likely okay with a SATA SSD.
So according to Intel's datasheet:

    Media Errors: Contains the number of occurrences where the controller
    detected an unrecovered data integrity error. Errors such as
    uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are
    included in this field.

Now, could that mean it simply has a bad block that throws read errors, and that if you were to rewrite that block it would be remapped and fixed? Could be. Is it the same sector numbers each time you see a log message?

--
Len Sorensen
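A rough way to test Len's theory, assuming the LBAs in the original dmesg lines are the ones that keep repeating (a sketch only; the second command destroys whatever data is left in those sectors, so only run it against a file you have given up on, with backups elsewhere):

    # read the suspect sectors directly; a genuinely bad one should fail
    # the same way every time and log another print_req_error
    sudo dd if=/dev/nvme0n1 of=/dev/null bs=512 skip=296089600 count=8 iflag=direct

    # overwrite the same sectors so the controller maps them to fresh flash
    # (destroys the data in whatever file owns them -- last resort)
    sudo dd if=/dev/zero of=/dev/nvme0n1 bs=512 seek=296089600 count=8 oflag=direct

    # then see whether media_errors keeps climbing afterwards
    sudo nvme smart-log /dev/nvme0 | grep media_errors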

| From: Stewart C. Russell via talk <talk@gtalug.org>

| Drive is a 2+ year old Intel 512 GB SSD.

| data_units_read : 10,349,479
| data_units_written : 10,098,299

I wonder what a unit is. A logical sector (512 or 4096 bytes)?

The ratio of those numbers surprises me. I've always assumed that reads are more common than writes. Except for archiving. Those numbers are so very very close.

| host_read_commands : 183,018,841
| host_write_commands : 136,702,227

These numbers are quite close but (a) not as close as the previous pair (b) larger than the previous pair (how could that be?)

| unsafe_shutdowns : 10

There's a chance that this number explains your bad data.

| media_errors : 803
| num_err_log_entries : 844

You can read this log (using smartctl). Only the most recent N entries might be preserved.

| * what I should be looking for in stats (nvme smart-log-add doesn't give
| me anything at all, so no wear-levelling stats)
|
| * a decent brand to replace it with. I'm likely okay with a SATA SSD.

I know too little to answer this. There are lots of review sites (of varying quality).

You want an NVMe drive, right?

Some of the inexpensive NVMe drives "borrow" some RAM from your main memory. That may be a Good Thing (cheaply increasing performance) or not (adding a new and exciting way that a system crash could curdle your disk).

There are a lot of sins that can be covered up by firmware. Here's one:

    SLC > MLC > TLC > QLC
     1     2     3     4    bits per cell

The more bits per cell,
- the more bits you can fit on a flash chip
- the slower the operations
- the sooner the cell will wear out.

For consumers, SLC hasn't ever been available. MLC is probably gone from the market. TLC is very common. QLC is just coming in.

Many drives reserve a bit of flash to use in SLC mode as a fast buffer.
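On the "what is a unit" question: the NVMe spec defines data_units_read / data_units_written as counts of thousands of 512-byte units, so (assuming this drive follows the spec) the totals here work out to only a few terabytes each way, which squares with percentage_used sitting at 1%:

    # data units -> bytes: units * 1000 * 512
    awk 'BEGIN { printf "read  ~%.1f TB\n", 10349479 * 512e3 / 1e12 }'   # ~5.3 TB
    awk 'BEGIN { printf "wrote ~%.1f TB\n", 10098299 * 512e3 / 1e12 }'   # ~5.2 TB

That thousands factor would also explain how the host command counts can be larger than the data-unit counts.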

On 2019-08-06 3:30 p.m., D. Hugh Redelmeier via talk wrote:
| unsafe_shutdowns : 10
Oddly, there hadn't been an unsafe shutdown for ages. While I do have a UPS, I've sometimes had the machine hard-lock with no way to get it to do anything; maybe those forced power-offs are counted as unsafe shutdowns, but there was no disk activity when I did them.
There's a chance that this number explains your bad data.
| media_errors : 803 | num_err_log_entries : 844
You can read this log (using smartctl).
Not on NVME, and seemingly the Intel-specific drivers that this device uses only give summary results. `nvme smart-log-add` does nothing. These numbers haven't changed at all on subsequent remounts.
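Even without the vendor plugin, the individual entries behind num_err_log_entries should be readable with plain nvme-cli, and you can watch whether the counters move when poking one of the known-bad files (a sketch; the file path is a placeholder):

    # dump the most recent error-log entries, including the failing LBAs
    sudo nvme error-log /dev/nvme0 -e 64

    # check the counters, touch a known-bad file, check again
    sudo nvme smart-log /dev/nvme0 | grep -E 'media_errors|num_err_log'
    cat /path/to/known-bad-file > /dev/null
    sudo nvme smart-log /dev/nvme0 | grep -E 'media_errors|num_err_log'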
| * a decent brand to replace it with. I'm likely okay with a SATA SSD. … You want an NVMe drive, right?
I ended up with a 1 TB WD Blue SSD for $140 + tax. Not NVME, but fast enough. Because this machine (though only 2-ish years old) had accumulated some legacy OS boot cruft, the clean Ubuntu install also brought boot time down to around 5 seconds.
The more bits per cell, - the more bits you can fit on a flash chip - the slower the operations - the sooner the cell will wear out.
I just wish the old drive had the smarts to remap bad sectors. There were a couple of files that *always* caused kernel errors, yet there was plenty of free space the data could have been moved to, or the files turned into lost+found entries. What really annoyed me was that fsck would show clean, but these files still existed with errors.

cheers,
 Stewart

| From: Stewart C. Russell via talk <talk@gtalug.org>

| I just wish the old drive had the smarts to remap bad sectors. There
| were a couple of files that *always* caused kernel errors, but enough
| space that files could be moved/made lost+found.

I'm pretty sure that the drive can remap those bad "sectors". What you have to do is write to the bad sectors.

Here's the logic of remapping, as I understand it:

- if a write fails and the controller knows this at the time of the write, that block is marked as bad AND the write is attempted at another block. This is done with no indication to the OS (except S.M.A.R.T. counts are changed). This is fine since there is no observable discrepancy (except for time). This is called "remapping" in the HDD world. It should not be called this in the SSD world because all blocks are mapped in the SSD world.

- if a read fails, even on retry, the controller must report this to the OS. After all, the information has been lost. No remapping is done because that would be hiding the loss of information.

- as a user, if you notice a bad block, and give up on trying to recover its content, just overwrite that block.
  + on an HDD, that will likely trigger a remapping
  + on an SSD, that will just cause the block to be mapped somewhere else (as always with a write). One hopes that the controller is smart enough to consider the bad block permanently bad and never use it again.

Complications:

- bad blocks are probably part of a bigger bad unit. For SSDs, I'd guess that the unit is an erase block (perhaps 128k or larger). For HDDs it is surely 4K these days. For HDDs, that may mean that a cluster of blocks (adding up to 4k) is bad, not just one. They may not be contiguous within one file but they probably are (I think that most Linux file systems allocate 4k or more at a time). For SSDs, many files may have blocks within a single erase block.

- the normal way of finding bad blocks systematically is to use smartctl(8). You should be able to get a list of bad blocks on the drive. You probably want to be able to find out which file contains each bad block, and at what offset. It seems that the debugfs command can help for extN filesystems. See <https://linoxide.com/linux-how-to/how-to-fix-repair-bad-blocks-in-linux/>

- e2fsck -c almost does what you want, but not quite.

| What really annoyed me
| was that fsck would show clean, but these files still existed with errors.

fsck only checks metadata. Errors are probably in plain old data. But do see e2fsck's -c option.

smartctl --test=long is the right way. Not surprisingly this can take quite some time on a large HDD. Probably not on an ordinary SSD.
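To make that concrete with the sectors from the original dmesg lines: assuming an ext4 filesystem with 4 KB blocks sitting on a partition (the partition name, start offset, and inode number below are placeholders), the sector-to-file mapping Hugh describes would go roughly like this:

    # where the filesystem's partition starts on the disk, in 512-byte sectors
    cat /sys/block/nvme0n1/nvme0n1p2/start        # suppose it prints 2048

    # ext4 block = (bad 512-byte sector - partition start) / 8 for 4 KB blocks
    echo $(( (296089600 - 2048) / 8 ))            # -> 37010944

    # which inode owns that block, then which path owns that inode
    sudo debugfs -R "icheck 37010944" /dev/nvme0n1p2
    sudo debugfs -R "ncheck <inode-number>" /dev/nvme0n1p2

    # once the file is given up on: overwrite the block so the controller
    # maps it to fresh flash (destroys that block's contents; safest with
    # the filesystem unmounted)
    sudo dd if=/dev/zero of=/dev/nvme0n1p2 bs=4096 seek=37010944 count=1 oflag=direct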

On 2019-08-08 1:08 a.m., D. Hugh Redelmeier via talk wrote:
smartctl --test=long is the right way. Not surprisingly this can take quite some time on a large HDD. Probably not on an ordinary SSD.
smartctl returns instantly with no output when tried on any reasonable permutation of the device name. I know the main file that was affected, and it's a VirtualBox dynamic disk image. Unfortunately, VBox (or possibly Win 10 running as the VM) doesn't seem to be very clever about underlying media errors on this type of disk, and it locks up.

Stewart
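For what it's worth, smartctl usually wants the controller's character device rather than the namespace or a partition, and NVMe self-tests are a newer feature that older smartmontools builds and older drives may simply not support; reading the health data should still work (device names assumed):

    # point smartctl at the controller device
    sudo smartctl -a /dev/nvme0

    # or force the device type explicitly on the namespace node
    sudo smartctl -d nvme -a /dev/nvme0n1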