ZFS dd if=/dev/zero - One hard drive much slower than others

Avery Freeman

Member

Reaction score: 12
Messages: 58

Hey,

This thread is kind of a follow-up to an issue I think is related to problems I was having back around May 16:

https://forums.freebsd.org/threads/zpool-not-degraded-but-certain-files-cause-transfers-to-hang.65875/

So this isn't specifically a ZFS problem, but I'm still trying to figure out why an array of aging hard drives I have fails when running an rsync clone either to or from another machine.

It was a 6-drive, 3-vdev mirror pool. The rsync failure was reproducible at a specific point during the clone, whether reading from or writing to the pool (I had to lose some data when copying off the first time, and am now losing data trying to copy it back as a backup).

Ran # zpool scrub several times: no errors detected.
Ran smartctl long tests on every drive. I noticed this particular drive reports its self-tests as 'aborted by host' even though I have run them several times all the way through:

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Green
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-WMAZA5640072
LU WWN Device Id: 5 0014ee 1aef33c24
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri May 18 04:44:44 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  38)    The self-test routine was interrupted
                    by the host with a hard or soft reset.
Total time to complete Offline
data collection:         (41100) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 396) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       10
  3 Spin_Up_Time            0x0027   181   155   021    Pre-fail  Always       -       5950
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4943
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   047   047   000    Old_age   Always       -       39324
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       346
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       214
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       767153
194 Temperature_Celsius     0x0022   118   106   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       4

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      60%     39294         -
# 2  Extended offline    Completed without error       00%     39291         -
# 3  Short offline       Completed without error       00%     29116         -
# 4  Short offline       Completed without error       00%     11529         -
# 5  Conveyance offline  Completed without error       00%     11515         -
# 6  Extended offline    Completed without error       00%     11452         -
# 7  Short offline       Aborted by host               90%      3750         -
# 8  Short offline       Aborted by host               90%      3750         -
# 9  Short offline       Aborted by host               90%      3750         -
#10  Short offline       Aborted by host               90%      3750         -
#11  Short offline       Aborted by host               90%      3750         -
#12  Short offline       Aborted by host               90%      3750         -
#13  Short offline       Aborted by host               90%      3750         -
#14  Short offline       Aborted by host               90%      3750         -
#15  Short offline       Aborted by host               90%      3750         -
#16  Short offline       Aborted by host               90%      3750         -
#17  Short offline       Aborted by host               90%      3750         -
#18  Short offline       Aborted by host               90%      3750         -
#19  Short offline       Aborted by host               90%      3750         -
#20  Short offline       Aborted by host               90%      3750         -
#21  Short offline       Aborted by host               90%      3750         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
As per a recommendation from another thread (I was asking whether badblocks would be a decent remedy), I'm now trying to write zeros to all the drives using dd and pv:

Code:
# dd if=/dev/zero | pv -s 1907729M | dd of=/dev/adax bs=4096
I noticed right off the bat that the first drive is WAY slower writing the zeros than the others, even though it was the first one the command was started on. All the others are fairly comparable speed-wise.

e.g. the slow drive is like:

Code:
6.00GiB 0:38:53 [2.26MB/s] ETA 8:06:15:34
When the drive started right after it is like:

Code:
80.4GiB 0:38:16 [36.8MB/s] ETA 13:59:17
There's like a whole 'nother column for the ETA (!!)

So I think this one must be the culprit. Thankfully I have another drive that is one of the original WD20EARS models; phoenix told me in the other thread that the newer Greens (the EARX model) do not allow WDIDLE.EXE to turn off spin-down.

Anyway, I just thought it was kind of interesting that dd was running so slowly on that one drive compared to the others. I was having a hard time figuring out which one it was until now.

Thanks to everyone who has helped me along this slow and arduous journey.
 

phoenix

Administrator
Staff member
Moderator

Reaction score: 1,245
Messages: 4,092

What's pv, and why use it in a pipeline like that?

Code:
dd if=/dev/zero of=/dev/ada0 bs=1M

Play around with the bs parameter. Some disks like 1M. Others like larger values. Over 16M isn't too useful, though.
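As a quick illustration (my sketch, not from the thread): you can compare block sizes by timing the same amount of data through dd at each size. This writes to a scratch file under /tmp as a stand-in for /dev/adaX, so the numbers mostly reflect the filesystem and cache rather than the raw drive, but the relative cost of tiny blocks still shows:

```shell
#!/bin/sh
# Push the same 64 MB through dd with three block sizes and compare
# the transfer rate dd reports on its last line of output.
# /tmp/ddtest is a scratch file standing in for a raw device.
TOTAL=67108864                        # 64 MB
for bs in 4096 1048576 16777216; do   # 4K, 1M, 16M
    count=$((TOTAL / bs))
    echo "bs=$bs:"
    dd if=/dev/zero of=/tmp/ddtest bs="$bs" count="$count" 2>&1 | tail -n 1
done
```

Against a real disk the gap is much larger, since every tiny write also pays rotational latency.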
 

ralphbsz

Daemon

Reaction score: 1,086
Messages: 1,726

Agree with phoenix: This can be done simply and well with dd. Use large block sizes.

Testing a disk drive with 4K blocks is just pointless. You are not testing the drive; you are testing the whole stack between the application in user space and the drive. You also used a somewhat complex pipeline, where every 4K block needs to be handed from dd to pv (meaning a user->kernel transition for dd to write, and a kernel->user transition for pv to read), another pair of transitions to get from pv into the second dd, and then a third pair from dd to the /dev/ada device. And these are writes, so if all this data management takes a bit too long, the platter has rotated away under the head, and the next write may have to wait a full rotation (depending on the drive's write-buffer settings), all that for 4096 bytes.

This is just not a useful way to measure disk performance. The single biggest improvement is to use seriously large blocks, significantly larger than a typical track (which on modern drives is around 1 MB), and to use the simplest pipeline possible, ideally none at all.

Next step: look up the bandwidth specification for this drive. It should be somewhere on the manufacturer's website, and it should be around 100 to 200 MByte/s. Some manufacturers even quote a range, which is always highest at the beginning of the disk (the outer diameter). When using dd to read or write the disk with very large blocks (anything that's multiple MBytes; I like to go to 16M), you should get at least 80% of the disk's nominal bandwidth. Real-world example from a 4 TB Hitachi nearline drive connected via SATA: the manufacturer specifies a bandwidth of 110-140 MByte/s, and I did a read test only (this drive has real data on it):
Code:
# dd if=/dev/ada3 of=/dev/null bs=16M count=1024
1024+0 records in
1024+0 records out
17179869184 bytes transferred in 127.202259 secs (135059466 bytes/sec)
So the drive got more than 95% of the rated speed, which I consider good.
 

phoenix

Administrator
Staff member
Moderator

Reaction score: 1,245
Messages: 4,092

And, if you need to get stats out of dd while it's running, just hit CTRL+T.
 
OP

Avery Freeman

Member

Reaction score: 12
Messages: 58

Agree with phoenix: This can be done simply and well with dd. Use large block sizes.

Testing a disk drive with 4K blocks is just pointless. You are not testing the drive; you are testing all the stack between the application in user space and the drive.
Well, I was writing zeros to the drives in an attempt to get them to work better, which is why I used a 4K block size.

I was mostly just noting the discrepancy between speeds, I didn't go into the whole process for anything to do with speed testing.

Thank you for the info, though, I really appreciate it.

I removed the drive that was staggeringly slow, but now it appears I'm having a new issue:
Code:
(aprobe0:ahcich1:0:0:0): NOP FLUSHQUEUE. ACB: 00 00 00 00 00 00 00 00 00 00
(aprobe0:ahcich1:0:0:0): CAM status: ATA Status Error
(aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error :04 (ABRT )
(aprobe0:ahcich1:0:0:0): RES: d1 04 ff ff ff ff ff ff ff ff ff
(aprobe0:ahcich1:0:0:0): Error 5, Retries exhausted
That's a new one to me.
 
OP

Avery Freeman

Member

Reaction score: 12
Messages: 58

And, if you need to get stats out of dd while it's running, just hit CTRL+T.
Tried CTRL+T; that's great, I'd never known about that. Is that only on FreeBSD?

pv is kind of nice because of the real-time updates. If you have a better suggestion about how to use it, I'm all ears.
 

phoenix

Administrator
Staff member
Moderator

Reaction score: 1,245
Messages: 4,092

Tried CTRL-T, that's great I'd never known about that - is that only on FreeBSD?
Correct, that is a FreeBSD-ism, and something that I miss a lot when using Linux systems. You can hit CTRL+T while pretty much any command is running (even simple things like cp(1)) and get a status update from it.

pv is kind of nice because of the real-time updates. If you have a better suggestion about how to use it, I'm all ears.
I've never used pv; that's why I was wondering. To get semi-real-time updates, you can run the dd command in the background, then send a SIGINFO signal to its process ID in a loop to get similar output.

Code:
# dd if=/dev/ada0 of=/dev/null bs=1M &
[1] 60038
# while true; do kill -INFO 60038; echo ""; sleep 2; done
807+0 records in
807+0 records out
846200832 bytes transferred in 5.673023 secs (149162243 bytes/sec)
That will give you a status report from process ID 60038 every 2 seconds. :) I use that construct all the time on Linux systems (using -USR1 instead of -INFO) to monitor what dd is doing, as Linux doesn't support CTRL+T.
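As a side note (my addition, not from the thread): newer GNU dd (coreutils 8.24 and later), and I believe later FreeBSD releases as well, accept status=progress, which prints a running transfer line on stderr without any signal loop. A small sketch against a scratch file standing in for a raw device:

```shell
# GNU dd (coreutils >= 8.24) can report its own progress, no signals
# needed; here writing 64 MB to a scratch file instead of /dev/adaX.
dd if=/dev/zero of=/tmp/dd-progress-test bs=1M count=64 status=progress
```

For long-running wipes this gives roughly the same information as the kill -USR1 loop on Linux, with less typing.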
 

ralphbsz

Daemon

Reaction score: 1,086
Messages: 1,726

Well, I was writing zeros to the drives in an attempt to get them to work better, which is why I used 4K block size.
You could have written the zeros with a 16M or 64M block size, and gotten done faster!
but now it appears I'm having a new issue:
Code:
...
(aprobe0:ahcich1:0:0:0): ATA status: d1 (BSY DRDY SERV ERR), error :04 (ABRT )
Looks like a SATA interface problem. Check cabling and power first. Could also be the drive itself, hard to tell with just one line of error message.
 
OP

Avery Freeman

Member

Reaction score: 12
Messages: 58

You could have written the zeros with a 16M or 64M block size, and gotten done faster!

Looks like a SATA interface problem. Check cabling and power first. Could also be the drive itself, hard to tell with just one line of error message.
That's a good point, I'll check it out before I do anything drastic. It's kind of on the back burner for now.

The reason I believe it has to do with the HDD/platter surface, though, is that it keeps failing around the same files in the rsync queue, which seems to indicate that the drive in question craps out once it fills past a certain point.
 