Slow boot and corrupt GPT

Hi!

The system takes nearly 20 seconds just to show the
BIOS drive C: is ...
message. I really wonder why it's taking so long; it didn't use to happen before.

The HDD was previously formatted under Linux; I then installed FreeBSD and selected the ZFS filesystem.
Ever since, I've been getting this annoying boot message: "secondary GPT table is corrupt or invalid"
and this
Code:
# gpart show
=>       40  976773088  ada0  GPT  (466G) [CORRUPT]
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048    4194304     2  freebsd-swap  (2.0G)
    4196352  972576768     3  freebsd-zfs  (464G)
  976773120          8        - free -  (4.0K)
and even though gpart recover ada0
seems to fix the CORRUPT status, it only lasts until the next reboot; then the problem comes back.

Thanks in advance
 
You might want to run smartctl(8) on that drive, it sounds like it's dead or dying.
It shows
Code:
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MQ01ACF050
Serial Number:    76N6C339T
LU WWN Device Id: 5 000039 723587d2c
Firmware Version: AV0A2C
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Oct 24 09:50:41 2022 WEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

and I got this error several times:
(...)
Error 59 occurred at disk power-on lifetime: 16756 hours (698 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 88 e8 bb 52 40  Error: WP at LBA = 0x0052bbe8 = 5422056

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 d0 e8 34 71 40 00   6d+17:10:24.350  WRITE FPDMA QUEUED
  60 20 c8 18 d2 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  60 80 c0 98 d1 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  60 70 b0 30 d2 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  61 08 a8 68 35 71 40 00   6d+17:10:24.349  WRITE FPDMA QUEUED
 
Yeah, that's a dead/dying disk, you're going to want to replace it.
 
Yeah, that's a dead/dying disk, you're going to want to replace it.
So that seems to be bad news for me.
I've been having a hard time setting up this system with FreeBSD, and now everything is going to the trash can?
Is there a tool to backup all configuration?
If not, I'm thinking of backing up
/etc/rc.conf
/etc/wpa_supplicant.conf
/boot/loader.conf
and probably checking all the extra packages installed after the initial setup.
 
Back up at least /etc/rc.conf, and look for any modified configuration files in /usr/local/etc/; you want to back those up too. Save the output of pkg prime-list: those are the packages you explicitly installed, and everything else is a dependency (dependencies will get reinstalled automatically). You may also want to save some things from your home directory.

You could also do a clean install on a new disk and attach this old one to it. That'll make it easier to copy some files. But keep in mind this disk is about to die permanently, it could take minutes, it could take a couple of days, no real good way of telling when it'll be lost forever.
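Roughly, those backup steps could look like the sketch below. The staging directory and file list are just examples, not a prescribed procedure; pkg prime-list is a default alias in recent pkg versions:

```shell
# Stage config files and the package list before swapping the disk.
# /tmp/freebsd-backup is an arbitrary staging directory.
backup=/tmp/freebsd-backup
mkdir -p "$backup"

# Copy the usual suspects; skip any that don't exist on this system.
for f in /etc/rc.conf /etc/wpa_supplicant.conf /boot/loader.conf; do
    [ -f "$f" ] && cp "$f" "$backup"/
done

# Record explicitly installed packages (dependencies reinstall themselves).
pkg prime-list > "$backup/prime-list.txt" 2>/dev/null || true

# Grab everything under /usr/local/etc in one archive.
tar -C /usr/local -cf "$backup/local-etc.tar" etc 2>/dev/null || true

ls "$backup"
```

Copy the staging directory to removable media or another machine before pulling the disk.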
 
There are a couple of useful aliases: pkg prime-list, pkg prime-origins and pkg leaf. Most of these involve some clever pkg-query(8) searches; there's really no need to take pkg-info(8) output and parse that.
 
SirDice, how can I tell from these logs that this disk is dying?
Aren't Error: WP at LBA and Error: UNC at LBA recoverable problems?
I already had Windows and Linux installed without any problem, and the BIOS HDD check was successful before I installed FreeBSD.

Code:
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
(...)
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 
I remember an app called HD Regenerator that was able to scan the disk for damaged sectors and repair them. I don't know what kind of process it performs, but I think the damage was due to poor magnetization; I have saved a few practically useless disks with that software.

The bad thing is that it is not free; it has a trial version, but that only lets you analyze the disk without repairing it.

On the other hand, it would be interesting to know whether there is any other application. Wouldn't importing the zpool to a new drive be another option? How would those damaged sectors affect ZFS?
 
The LBA in question is deep inside the ZFS partition, so that particular LBA should not slow the boot. Try reading the whole disk (non-destructive): dd if=/dev/ada0 of=/dev/null bs=64k status=progress, and see where you hit the problem; you should see more I/O errors. Also, smartctl -a /dev/ada0 will show you more SMART variables.
You can also run smartctl -t long /dev/ada0 to trigger the disk's self-test. While it doesn't catch all errors, it's usually a good start.
 
The LBA in question is deep inside the ZFS partition, so that particular LBA should not slow the boot. Try reading the whole disk (non-destructive): dd if=/dev/ada0 of=/dev/null bs=64k status=progress, and see where you hit the problem; you should see more I/O errors. Also, smartctl -a /dev/ada0 will show you more SMART variables.
You can also run smartctl -t long /dev/ada0 to trigger the disk's self-test. While it doesn't catch all errors, it's usually a good start.
After sending all my disk data to neverland (/dev/null), I got no errors.
Here's the report:
Code:
500048855040 bytes (500 GB, 466 GiB) transferred 4730.000s, 106 MB/s
7631040+1 records in
7631040+1 records out
500107862016 bytes transferred in 4730.819244 secs (105712739 bytes/sec)

But wait a moment: 500107862016 bytes transferred... of 500048855040 bytes? It transferred more than the disk capacity?
So I'm quite confused about whether there's really a problem or I should just ignore it, but just in case I made a backup.

The short and long self-tests don't return any errors either.
 
But wait a moment, 500107862016 bytes transferred... of 500048855040 bytes?
I eyeballed the bs=64k in that dd command. Clearly that was not right, as the disk can't be divided evenly into 64k chunks, so the last record dd read from the disk was not a complete 64k block. For a proper copy you'd specify bs=4k for your disk. But here dd was used just to test whether you can read the disk, so it's not a big deal.
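For the record, the arithmetic behind those record counts can be checked directly; the capacity and block size below are the values quoted earlier in this thread:

```python
# Verify the dd record counts for a 500107862016-byte disk read with bs=64k.
capacity = 500_107_862_016   # smartctl "User Capacity" reported above
bs = 64 * 1024               # the bs=64k used in the dd command

full, rem = divmod(capacity, bs)
print(full, rem)             # 7631040 full records plus a 24576-byte partial one

# dd reads the tail as one short record, hence "7631040+1 records in/out",
# and the grand total still equals the full capacity:
assert full * bs + rem == capacity

# By contrast, a 4k block size divides the disk evenly:
print(capacity % 4096)       # 0
```

So the "+1" in dd's output is exactly that final short read, not an extra block.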

Anyway, you had problems booting, so something is up there. On desktops, faulty cables are among the usual suspects too. Can you show the whole output of smartctl -a /dev/ada0? What does gpart say now: does it still report corruption? And how about the boot time: is it still a problem?
 
After sending all my disk data to the neverland (/dev/null) it got no errors.
Here's the report:
500048855040 bytes (500 GB, 466 GiB) transferred 4730.000s, 106 MB/s

That would be the last status report line before the end.

7631040+1 records in
7631040+1 records out
500107862016 bytes transferred in 4730.819244 secs (105712739 bytes/sec)

But wait a moment, 500107862016 bytes transferred... of 500048855040 bytes? It transferred more than the disk capacity?

No, more than the last in-flight status report was up to, and exactly what smartctl had reported above:
User Capacity: 500,107,862,016 bytes

So I'm quite confused if there's really a problem or if I should just ignore but just in case I did a backup.

short and long self tests don't return any error too.

Yeah, I'm not convinced it's in trouble, but I must defer to the experience of SirDice, who may elaborate on his findings?

And backups and spare disks are always reassuring <&^}=

Also note that the disk size doesn't divide evenly by 64k; using bs=4k will take longer but exercises each physical sector.
 
I must say I don't understand why there's a need to say the same thing twice. smithi, haven't you seen my reply above explaining why that is?

but must defer to the experience of SirDice, who may elaborate on his finding?
I did. And not to challenge SirDice but to challenge the output and get more information. We don't even know the OP's setup. We've seen only five queued commands from one moment in the life of the disk, and we don't even know whether that's relevant to the time of the issue (that's why I asked for the full SMART output).
 
I must say I don't understand why there's a need to say the same thing twice. smithi, haven't you seen my reply above explaining why that is?

I only saw yours after posting mine, which I'd been composing for some time, double checking sizes etc by calculator, interrupted by dinner.

I was happy to see your post came to much the same conclusion, and 'liked' it too, but that doesn't seem to have taken. Maybe I needed to refresh the page. Sheesh.

I did. And not to challenge SirDice but to challenge the output and get more information.

Me too. If you imagine I'm "challenging" anyone, I must ask what I've said to get your back up? I bear no ill intent.

We don't even know the OP's setup. We've seen only five queued commands from one moment in the life of the disk, and we don't even know whether that's relevant to the time of the issue (that's why I asked for the full SMART output).

Yes, good move. Don't mind me. Maybe it's a timezone or cultural thing?

Cheers, Ian
 
I only saw yours after posting mine, which I'd been composing for some time, double checking sizes etc by calculator, interrupted by dinner.
Fair enough then.
I must ask what I've said to get your back up? I bear no ill intent.
I may have got up on the wrong side of the bed today. I've seen this happen here so many times that I had to say something. Also, double (or even triple) answers saying the same thing can be so irritating when you google something and go through the posts (or is it just me?)

Maybe it's a timezone or cultural thing?
Maybe a language barrier. I was not challenging you; I wanted to elaborate on those findings.
Cheers mate.

Now let's see what rmomota shares with us.
 
Hi!
I didn't mean to disagree or anything; it's just curiosity, nothing more.
I've never used this tool before, so I'm still learning how to interpret the results.
The BIOS HDD test didn't detect any problem, and neither did the dd command, but this tool reports 59 errors that happened at some point in time and never happened again.
Maybe I didn't prepare the HDD properly for ZFS.

It still takes about 20 seconds just to show the message
BIOS drive C: is ...
and then the boot starts.
During boot there's still the message
"the secondary GPT table is corrupt or invalid."

and the command gpart show ada0 still shows
=> 40 976773088 ada0 GPT (466G) [CORRUPT]

Then I run
gpart recover ada0
and the message goes away:
=> 40 976773088 ada0 GPT (466G)
but it only keeps this status until the next reboot; then the corruption comes back.


Here's the output in attachment.
Thank you again.
 

Attachments

  • smartctl_ada0.out.zip (3.1 KB)
I've never used this tool before, so I'm still learning how to interpret the results.
Even the wiki has a nice summary. It can get complicated sometimes, as some manufacturers interpret some fields differently.

Well, you did use this disk pretty heavily; several variables show it was really used. The fact that you still have issues on it means more than a passing self-test. If you have important data on it, I suggest you make a backup.

I don't quite understand why the GPT keeps getting corrupted though. If the write is OK, which gpart seems to say, I don't see a reason for that. I'm assuming you don't have it in a gmirror software RAID.
Your ZFS setup looks OK, at least from what you shared.

I'm thinking that maybe you see that delay because of the corrupted GPT. Let's see what that corruption is. Boot the system, fix the GPT, and verify it's fixed with gpart show as you did before. Now save the beginning and end of the disk:
Code:
dd if=/dev/ada0 of=ada0_ok_start bs=512 count=40
dd if=/dev/ada0 of=ada0_ok_end bs=512 skip=976773128

Reboot the notebook, don't fix the GPT, and execute these commands again:
Code:
dd if=/dev/ada0 of=ada0_bad_start bs=512 count=40
dd if=/dev/ada0 of=ada0_bad_end bs=512 skip=976773128

And please share them; these will be your GPT metadata.
I'm assuming the disk size; you could share the diskinfo ada0 output so we can verify. But if I'm correct, the second dd (the one with skip) should show that 40 blocks were copied.
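If it helps to pinpoint where two of those dumps diverge, a small comparison like the one below reports the first differing byte offset. This is purely illustrative (cmp(1) does the same job); the file names follow the dd output files suggested above:

```python
# Return the first byte offset at which two binary dumps differ,
# e.g. first_diff("ada0_ok_end", "ada0_bad_end"), or None if identical.
def first_diff(path_a, path_b):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    if a == b:
        return None
    for i in range(min(len(a), len(b))):
        if a[i] != b[i]:
            return i
    return min(len(a), len(b))  # identical prefix, different lengths

# Tiny demo on throwaway files:
import os, pathlib, tempfile
d = tempfile.mkdtemp()
ok, bad = os.path.join(d, "ok"), os.path.join(d, "bad")
pathlib.Path(ok).write_bytes(b"EFI PART" + bytes(504))
pathlib.Path(bad).write_bytes(b"EFI PART" + bytes(100) + b"\xff" + bytes(403))
print(first_diff(ok, bad))  # 108
```

Knowing the offset tells you whether the difference is in the backup header itself or in the partition entry array before it.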
 
Even the wiki has a nice summary. It can get complicated sometimes, as some manufacturers interpret some fields differently.

Well, you did use this disk pretty heavily; several variables show it was really used. The fact that you still have issues on it means more than a passing self-test. If you have important data on it, I suggest you make a backup.

I don't quite understand why the GPT keeps getting corrupted though. If the write is OK, which gpart seems to say, I don't see a reason for that. I'm assuming you don't have it in a gmirror software RAID.
Your ZFS setup looks OK, at least from what you shared.

I'm thinking that maybe you see that delay because of the corrupted GPT. Let's see what that corruption is. Boot the system, fix the GPT, and verify it's fixed with gpart show as you did before. Now save the beginning and end of the disk:
Code:
dd if=/dev/ada0 of=ada0_ok_start bs=512 count=40
dd if=/dev/ada0 of=ada0_ok_end bs=512 skip=976773128

Reboot the notebook, don't fix the GPT, and execute these commands again:
Code:
dd if=/dev/ada0 of=ada0_bad_start bs=512 count=40
dd if=/dev/ada0 of=ada0_bad_end bs=512 skip=976773128

And please share them; these will be your GPT metadata.
I'm assuming the disk size; you could share the diskinfo ada0 output so we can verify. But if I'm correct, the second dd (the one with skip) should show that 40 blocks were copied.
Ok, I'll try that.
If my next answer takes a while, it's because the system is gone :D
 
Still here.
The start seems the same, but there are differences at the end.
Could there be a GPT copy at the end of the disk?
I remember doing some dd to zero the last bytes of the disk (I read about it in some forum).
Does it somehow get overwritten?
 

Attachments

  • dd_ada.zip (8.4 KB)
It shows
Code:
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MQ01ACF050
Serial Number:    76N6C339T
LU WWN Device Id: 5 000039 723587d2c
Firmware Version: AV0A2C
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Mon Oct 24 09:50:41 2022 WEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

and got this error several times
(...)
Error 59 occurred at disk power-on lifetime: 16756 hours (698 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 88 e8 bb 52 40  Error: WP at LBA = 0x0052bbe8 = 5422056

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 d0 e8 34 71 40 00   6d+17:10:24.350  WRITE FPDMA QUEUED
  60 20 c8 18 d2 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  60 80 c0 98 d1 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  60 70 b0 30 d2 6c 40 00   6d+17:10:24.350  READ FPDMA QUEUED
  61 08 a8 68 35 71 40 00   6d+17:10:24.349  WRITE FPDMA QUEUED
What do the error counters in the SMART Attributes Data Structure say? #197 and #198 will tell you whether there's an immediate problem. Is #5, the reallocated sector count, below the threshold?

The errors may look serious, but are they recent or from long ago? Smartctl will tell you the age of the drive. (Remember the counter does roll over.)

Regardless of whether the error counters are zero or not, run smartctl -t long against the drive; it will read every sector. Don't reboot, because a system reset performed during boot will prematurely terminate the test. Not powering off the machine during the test goes without saying.

You can use the machine normally while the test is being performed; it will only take longer.

The output from all tests is accumulated at the bottom of the smartctl -a report.

On older drives or drives throwing errors I recommend running smartctl -t long every 4-6 months or longer.

Replace the drive if smartctl -t long fails to complete.

The other thing to notice is whether your drive has excessively slow writes; its writeback cache may be disabled. I had a drive which showed no surface errors, but its writeback cache was disabled. I re-enabled it, only to discover it disabled again. It would remain enabled while performing tagged writes, but as soon as the tagged write operations stopped (they could last for minutes), it immediately went back into write-through mode, disabling the write cache. Smartctl won't discover those errors, and clean drives may suffer this problem; probably a flaky chip on its logic board gave up the ghost.

Run smartctl -t long over night and report back what it finds.

BTW, you can remediate/rescue drives using dd_rescue, ddrescue, or simply dd. You don't need to spend money on a remediation or recovery tool; the tools in FreeBSD and the ports collection are more than sufficient for the task.
 
Since gpart detects the corruption, something is going on. Once gpart has recovered it and confirmed it's OK, there shouldn't be any reason for corruption to appear again: if you don't touch the partitions, these headers are not touched.

The GPT header at the beginning is what you showed with gpart: 3 partitions (boot, swap, zfs). But the end is weird. Even in the "ok" dump you have 6 partition entries defined: boot, swap, zfs, boot, swap, zfs.
The backup GPT header (field at offset +0x48) points to a different start of the partition table than the bad one does.

While the disk could still have issues, this particular situation looks more like an admin mess than a disk issue. I'd start with finding out why there's corruption after a reboot.
Please can you share the output of
Code:
diskinfo /dev/ada0
zdb -l /dev/ada0p3

I remember doing some dd to zero the last bytes of the disk (read it in some forum).
What with, and when? If it was before setting up FreeBSD, it's OK.
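For what it's worth, the header fields being discussed can be read out of such a dump with a few lines of Python. The offsets follow the standard GPT header layout from the UEFI specification; the demo header below is synthetic, not taken from the attachment:

```python
# Decode the interesting fields of a GPT header dumped with dd.
# Offsets are per the UEFI GPT header layout, little-endian.
import struct

def parse_gpt_header(raw):
    assert raw[0:8] == b"EFI PART", "not a GPT header"
    current_lba, backup_lba = struct.unpack_from("<QQ", raw, 0x18)
    entries_lba, = struct.unpack_from("<Q", raw, 0x48)   # partition table start
    num_entries, entry_size = struct.unpack_from("<II", raw, 0x50)
    return dict(current_lba=current_lba, backup_lba=backup_lba,
                entries_lba=entries_lba,
                num_entries=num_entries, entry_size=entry_size)

# Demo on a synthetic 512-byte header (made-up values):
hdr = bytearray(512)
hdr[0:8] = b"EFI PART"
struct.pack_into("<QQ", hdr, 0x18, 1, 976773167)   # current LBA, backup LBA
struct.pack_into("<Q", hdr, 0x48, 2)               # partition entries start
struct.pack_into("<II", hdr, 0x50, 128, 128)       # entry count, entry size
print(parse_gpt_header(bytes(hdr)))
```

Running it over the first sector of ada0_ok_start (skipping LBA 0, the protective MBR) and over the backup header in the end dumps would show exactly which field differs between the "ok" and "bad" states.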
 