On Mon, 29 Jul 2019 at 13:28, Stewart C. Russell via talk <talk@gtalug.org> wrote:
I'm guessing this is bad, right?

    [Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089600 flags 80700
    [Mon Jul 29 12:59:48 2019] print_req_error: critical medium error, dev nvme0n1, sector 296089744 flags 0

Is it an oh-shit-get-yerself-a-new-drive-NOW thing, or …?

Drive is a 2+ year old Intel 512 GB SSD. Not entirely sure what the
right diagnostics are for SSDs. Filesystem is showing clean but touching
certain known-bad files triggers the error in the system log.

Dunno if these nvme stats are useful:

    Smart Log for NVME device:nvme0 namespace-id:ffffffff
    critical_warning                    : 0
    temperature                         : 25 C
    available_spare                     : 85%
    available_spare_threshold           : 10%
    percentage_used                     : 1%
    data_units_read                     : 10,349,479
    data_units_written                  : 10,098,299
    host_read_commands                  : 183,018,841
    host_write_commands                 : 136,702,227
    controller_busy_time                : 1,342
    power_cycles                        : 201
    power_on_hours                      : 15,722
    unsafe_shutdowns                    : 10
    media_errors                        : 803
    num_err_log_entries                 : 844
    Warning Temperature Time            : 0
    Critical Composite Temperature Time : 0
    Thermal Management T1 Trans Count   : 0
    Thermal Management T2 Trans Count   : 0
    Thermal Management T1 Total Time    : 0
    Thermal Management T2 Total Time    : 0

Any suggestions, please, for:

* what I should be looking for in stats (nvme smart-log-add doesn't give
me anything at all, so no wear-levelling stats)

* a decent brand to replace it with. I'm likely okay with a SATA SSD.

cheers,
 Stewart

The stats don't look like heavy use (percentage_used is only 1%) ... and yet that sounds like an "oh-shit-get-yerself-a-new-drive-NOW" error to me. At the very least, stay on top of your backups. As I understand it, when "segments" go bad on a solid state drive (hell, even on a spinning disk these days), the drive firmware should silently move the data and you'd never even know it happened. That you're seeing the errors in the system log, along with 803 media_errors in the SMART output, is alarming and suggests a fairly serious malfunction.

But ... I have no expertise with SSD (or NVMe) drives - I have a few, but none have failed so I haven't had to learn.  Ignore this suggestion if you get advice from someone with more knowledge of those drives ...
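
For what it's worth, here is a rough way to put numbers on the stats quoted above. It's a minimal Python sketch, not a proper diagnostic: it assumes the NVMe convention that one "data unit" is 1000 blocks of 512 bytes, it only looks at a subset of the fields you posted, and the function names (parse_smart_log, summarize) are made up for illustration. The three checks are just my guess at what is worth flagging.

```python
# A subset of the smart-log fields quoted earlier in this thread.
SMART_LOG = """\
critical_warning                    : 0
available_spare                     : 85%
available_spare_threshold           : 10%
percentage_used                     : 1%
data_units_read                     : 10,349,479
data_units_written                  : 10,098,299
power_on_hours                      : 15,722
media_errors                        : 803
num_err_log_entries                 : 844
"""

# Assumption: one NVMe data unit = 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512


def parse_smart_log(text):
    """Turn 'name : value' lines into a dict of ints (strips ',' and '%')."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'name : value' line
        digits = value.strip().replace(",", "").rstrip("%")
        if digits.isdigit():
            fields[name.strip()] = int(digits)
    return fields


def summarize(fields):
    """Return (TB read, TB written, list of warning strings)."""
    read_tb = fields["data_units_read"] * BYTES_PER_DATA_UNIT / 1e12
    written_tb = fields["data_units_written"] * BYTES_PER_DATA_UNIT / 1e12
    warnings = []
    if fields["critical_warning"] != 0:
        warnings.append("critical_warning is set")
    if fields["available_spare"] <= fields["available_spare_threshold"]:
        warnings.append("available_spare at or below threshold")
    if fields["media_errors"] > 0:
        warnings.append("media_errors = %d" % fields["media_errors"])
    return read_tb, written_tb, warnings


if __name__ == "__main__":
    read_tb, written_tb, warnings = summarize(parse_smart_log(SMART_LOG))
    print("read: %.2f TB, written: %.2f TB" % (read_tb, written_tb))
    for w in warnings:
        print("WARNING:", w)
```

On those figures it works out to a bit over 5 TB written in about 15,700 power-on hours, which is why the drive doesn't look heavily used to me, and the only check that trips is media_errors.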

--
Giles
https://www.gilesorr.com/
gilesorr@gmail.com