HDD problems or something else?

Hello.

I have a system with four mechanical HDDs and I have installed FreeBSD 12.1-p10 on one of them, but FreeBSD seems to run a bit slowly.

By "slow" I mean that, for example, when installing a package with pkg(8), the extraction of some packages - not all - is really slow (for example, installing the Papirus icon theme took about 10-15 minutes just to extract after downloading). Starting the GUI (Xfce 4.14) also sometimes takes a bit longer than usual. Other than these hiccups, the system seems to run fine.

So I suspect there could be a problem with the drive FreeBSD is installed on, and I booted into Debian and ran a long smartctl(8) test on that drive (smartctl -t long /dev/sdc, which is the FreeBSD disk). These are the results:

Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         ( 7313) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 122) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   142   142   054    Pre-fail  Offline      -       69
  3 Spin_Up_Time            0x0007   115   115   024    Pre-fail  Always       -       197 (Average 201)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       849
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   115   115   020    Pre-fail  Offline      -       34
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       16352
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       848
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       853
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       853
194 Temperature_Celsius     0x0002   187   187   000    Old_age   Always       -       32 (Min/Max 15/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     16352         -
# 2  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I don't know how to interpret these; are they good or bad?
 
By "slow" I mean that, for example, when installing a package with pkg(8), the extraction of some packages - not all - is really slow (for example, installing the Papirus icon theme took about 10-15 minutes just to extract after downloading). Starting the GUI (Xfce 4.14) also sometimes takes a bit longer than usual. Other than these hiccups, the system seems to run fine.

Just look at what the machine does during these 10 minutes. top -HPS will show whether a CPU is loaded and by which process. gstat -po will show how busy each disk is and how long a request takes. systat -ifstat shows what is going through the network. /var/log/messages should get some logging in case of disk timeouts.

The SMART data is hard to read without formatting, but it doesn't seem to show a flaw.
 
Sorry about the formatting; the SMART data is the same as shown in the code block in my first post above.

Additionally, I did run a small benchmark of the disk using diskinfo(1) in FreeBSD, the results are:

Code:
I/O command overhead:
    time to read 10MB block      0.126974 sec    =    0.006 msec/sector
    time to read 20480 sectors   0.997883 sec    =    0.049 msec/sector
    calculated command overhead            =    0.043 msec/sector

Seek times:
    Full stroke:      250 iter in   8.201265 sec =   32.805 msec
    Half stroke:      250 iter in   5.436290 sec =   21.745 msec
    Quarter stroke:      500 iter in   8.195377 sec =   16.391 msec
    Short forward:      400 iter in   2.752451 sec =    6.881 msec
    Short backward:      400 iter in   3.321364 sec =    8.303 msec
    Seq outer:     2048 iter in   0.153588 sec =    0.075 msec
    Seq inner:     2048 iter in   0.130773 sec =    0.064 msec

Transfer rates:
    outside:       102400 kbytes in   0.539472 sec =   189815 kbytes/sec
    middle:        102400 kbytes in   0.621512 sec =   164759 kbytes/sec
    inside:        102400 kbytes in   1.010047 sec =   101381 kbytes/sec

How do these results look?
And I did another test: I installed another icon theme using pkg(8), and the system became noticeably slower (mouse cursor movement was sluggish, the GUI reacted more slowly to clicks) during the extraction phase. gstat(8) showed that ada2 (the FreeBSD drive) was at 100% busy and, if I read the report correctly, the write speed was oscillating between 1 and 5 MB/s.
 
I remember seeing something similar with an old Dell OptiPlex 3010.
In its BIOS the AHCI setting for the HDDs was disabled!
Switching it on increased the speed by a factor of at least 10.

Have you looked into your BIOS?
 
It might be helpful to see the boot messages related to ada2: grep ada2 /var/run/dmesg.boot
 
And I did another test: I installed another icon theme using pkg(8), and the system became noticeably slower (mouse cursor movement was sluggish, the GUI reacted more slowly to clicks) during the extraction phase. gstat(8) showed that ada2 (the FreeBSD drive) was at 100% busy and, if I read the report correctly, the write speed was oscillating between 1 and 5 MB/s.
That might be normal behaviour. If that "icon theme" consists of thousands of files, the disk will have to do many thousands of seeks (each costing 3-12 ms). Then it will certainly show 100% busy, and if these files are small, it will indeed move only a few MB per second.
The cache is rather ineffective here, because these are writes.
TCQ/NCQ will not speed it up much, because it is a single thread and there is no way to parallelize it.
Then, since the OS is on the same disk and the disk is saturated, all access is delayed and the OS gets slower.

There may well be ways to improve this, but at first impression it is not really uncommon. (When people say an SSD might be 100 times faster, that claim must come from somewhere; this is exactly where it comes from.)

So one can either live with that, get one or (for safety) two SSDs, or start optimizing. And optimizing can become an elaborate piece of work. As olli@ said, start with dmesg. Then check the disk settings: camcontrol identify adaX. Then the filesystem options, design and layout (make the seeks shorter). Etc.
 
Sorry for the delayed response, I was busy.

Zvoni, yes, I did look into BIOS and AHCI is active so this is not the cause.

olli@, here is the dmesg:

Code:
grep ada2 /var/run/dmesg.boot
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada2: <TOSHIBA DT01ACA100> ATA8-ACS SATA 3.x device
ada2: Serial Number
ada2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada2: Command Queueing enabled
ada2: 953869MB (1953525168 512 byte sectors)
Trying to mount root from ufs:/dev/ada2s1a [rw]...

PMc, perhaps you are right but I had FreeBSD installed on another system with a different mechanical HDD and I don't remember it being that slow, even when using pkg to install icon themes or lots of small files. That's why I'm a bit worried about my hard drive.
Anyway, here is the camcontrol report:

Code:
camcontrol identify ada2
pass2: <TOSHIBA DT01ACA100> ATA8-ACS SATA 3.x device
pass2: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)

protocol              ATA8-ACS SATA 3.x
device model          TOSHIBA DT01ACA100
firmware revision    
serial number        
WWN                  
additional product id
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       1953525168 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200
Zoned-Device Commands no

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes    yes
write cache                    yes    yes
flush cache                    yes    yes
Native Command Queuing (NCQ)   yes        32 tags
NCQ Priority Information       yes
NCQ Non-Data Command           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
NCQ Autosense                  no
SMART                          yes    yes
security                       yes    no
power management               yes    yes
microcode download             yes    yes
advanced power management      yes    no    0/0x00
automatic acoustic management  no    no
media status notification      no    no
power-up in Standby            yes    no
write-read-verify              no    no
unload                         no    no
general purpose logging        yes    yes
free-fall                      no    no
sense data reporting           no    no
extended power conditions      yes    yes
device statistics notification no    no
Data Set Management (DSM/TRIM) no
Trusted Computing              no
encrypts all user data         no
Sanitize                       no
Host Protected Area (HPA)      yes      no      1953525168/1953525167
HPA - Security                 yes      no
Accessible Max Address Config  no

My filesystem is UFS2, not ZFS, nothing special about the layout: a root slice and a swap one.
 
From a quick scan through the web, this Toshiba drive is supposed to be a rebadged Hitachi Deskstar.
I have a couple of these; they are weak on random seeks. One of them runs in an array with a Seagate Barracuda, and I have often seen the Seagate at ~40% utilization while the Deskstar is already at 100%.
On the other hand, the Seagate consumes at least twice the power and is delicate about operating conditions, quickly producing raw read errors and high-fly writes, while the Hitachis are rather rugged; they have occasionally been run near 70 degrees Celsius and are still working reliably after almost 10 years.

Important to check: is your filesystem aligned to a 4k boundary?
 
The smartctl looks perfect, like a healthy disk. Not too old, not too many power cycles, no errors.
The performance numbers from diskinfo look reasonable. It's a very simple and silly way to measure performance, but the result is entirely as expected for a consumer desktop disk: 100-200 MByte/sec head throughput, and 50-150 seeks per second depending on distance (the full-stroke number is unrealistic).

Question about your file system setup: The disk has 512 bytes logical block (sector) size, 4096 physical. Did you configure UFS for 4K sectors and align the partitions? Doing 512-byte updates on a disk that internally has to perform read-modify-write operations on 4K sectors will be slow.
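The read-modify-write penalty can be illustrated with a bit of plain shell arithmetic (a sketch, not FreeBSD-specific; the sector numbers are made up for illustration): depending on where a partition starts, a 4K logical write either lands on exactly one 4K physical block or straddles two.

```shell
#!/bin/sh
# Sketch: which 4K physical blocks does the first 4K write of a partition
# touch, depending on the partition's start offset (in 512-byte sectors)?
# A start of 64 sectors is 4K-aligned; a start of 63 (the classic MBR
# default) is not, so every 4K write straddles two physical blocks and
# forces the drive into read-modify-write.
for start in 64 63; do
    first=$(( start * 512 / 4096 ))
    last=$(( (start * 512 + 4095) / 4096 ))
    echo "start sector $start: first 4K write spans physical blocks $first to $last"
done
```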
 
Anything in dmesg(8)? This is what a flaky SATA controller looks like on one of my boxes:
Code:
(ada0:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 40 91 c0 40 00 00 00 01 00 00
(ada0:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada0:ahcich1:0:0:0): Retrying command, 3 more tries remain

The smartctl looks perfect, like a healthy disk. Not too old, not too many power cycles, no errors.
Hey Ralphbsz, any chance you would write a brief HOWTO on what to look for in SMART results?
 
I think my filesystem is 4K-aligned, but I'm not sure, since I did nothing special when installing the OS, just the normal procedure. Here is what diskinfo -v ada2 shows about my disk:

Code:
ada2
    512             # sectorsize
    1000204886016    # mediasize in bytes (932G)
    1953525168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    1938021         # Cylinders according to firmware.
    16              # Heads according to firmware.
    63              # Sectors according to firmware.
    TOSHIBA DT01ACA100    # Disk descr.
                # Disk ident.
    No              # TRIM/UNMAP support
    7200            # Rotation rate in RPM
    Not_Zoned       # Zone Mode

Is this 4k alignment?

Jose, I posted the dmesg output for ada2 above. It does not show any errors.
 
Is this 4k alignment?
No, diskinfo tells us about the hardware of the disk. It gives us one bit of information: the physical block size is 4K, and the offset of the first physical block (in logical blocks) is zero. The easier way to get this information, by the way, is with "camcontrol identify". But then we need to make sure the software structures (the file systems, which in turn sit on top of partitions) are aligned and use the correct block size. To verify that, we have to start with gpart (to make sure the partition is not misaligned by a partial 4K block), and then look at dumpfs for the file systems. I honestly don't remember how to find the block size in the output of dumpfs for UFS; it is quite dense and cryptic, and I would have to spend a few minutes with the documentation. With ZFS it's easier for me (make sure ashift is at least 12), because I happen to remember that.
 
Partition alignment:
Code:
# gpart show /dev/ada3
=>       40  234441568  ada3  GPT  (112G)
         40        472     1  freebsd-boot  (236K)
        512   10485760     2  freebsd-swap  (5.0G)
   10486272    2000000     3  freebsd-ufs  (977M)
   12486272    3200000     4  freebsd-ufs  (1.5G)
   15686272     600000     5  freebsd-ufs  (293M)
-> Leftmost column should be a multiple of 8.
UFS block size:
Code:
 # dumpfs /dev/ada3p3 | grep fsize
fsize   4096    shift   12      mask    0xfffff000
-> fsize should be 4096 or a multiple of that.
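The partition-alignment check above reduces to one line of shell arithmetic (a sketch; the start sector 10486272 is taken from the example gpart output, substitute your own partition's leftmost column):

```shell
#!/bin/sh
# Sketch: the start sector is 4K-aligned iff it is divisible by 8,
# because 8 x 512-byte sectors = 4096 bytes.
start=10486272   # leftmost gpart column for the partition of interest
if [ $(( start % 8 )) -eq 0 ]; then
    echo "start sector $start is 4K-aligned"
else
    echo "start sector $start is NOT 4K-aligned"
fi
```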
 
Jose: I checked dmesg messages after boot and there is nothing out of order, the messages about ada2 are basically identical to boot messages.

ralphbsz: here are the results of running gpart and dumpfs on ada2 (I have truncated dumpfs's output because it is indeed quite long, and I believe the information at the beginning of the report is the most relevant):

Code:
gpart show ada2
=>        63  1953525105  ada2  MBR  (932G)
          63           1        - free -  (512B)
          64  1953525104     1  freebsd  [active]  (932G)

and dumpfs:

Code:
magic    19540119 (UFS2)    time    Fri Nov 13 21:24:33 2020
superblock location    65536    id    [  ]
ncg    1518    size    243269632    blocks    235631026
bsize    32768    shift    15    mask    0xffff8000
fsize    4096    shift    12    mask    0xfffff000
frag    8    shift    3    fsbtodb    3
minfree    8%    optim    time    symlinklen 120
maxbsize 32768    maxbpg    4096    maxcontig 4    contigsumsize 4
nbfree    28313021    ndir    10129    nifree    121403912    nffree    5428
bpg    20035    fpg    160280    ipg    80256    unrefs    0
nindir    4096    inopb    128    maxfilesize    2252349704110079
sbsize    4096    cgsize    32768    csaddr    5056    cssize    24576
sblkno    24    cblkno    32    iblkno    40    dblkno    5056
cgrotor    189    fmod    0    ronly    0    clean    0
metaspace 6408    avgfpdir 64    avgfilesize 16384
flags    soft-updates+journal
check hashes    cylinder-groups
fsmnt    /
volname        swuid    0    providersize    243269632

Can you make any sense out of it?

And I can say that the system really is abnormally slow: I installed the Flat Remix icons from the repo, it took over 20 minutes for pkg to extract the package, and during this time Xfce became nearly unresponsive. Definitely not normal, even for a mechanical HDD.
 
Yes, and there was nothing. I even searched for errors of any kind, and all I found was something about mod_load vesa from console-kit; I don't think it's related.
 
I only have a moment right now, so here's half the answer: The gpart output looks good. It shows that the root file system begins at block 64. Now, what I don't know is whether that's a physical or logical block (to figure that out, we'd have to either read more documentation or get the output from gpart list). But it doesn't matter: the number is divisible by 8, so even if it is in logical blocks (512 bytes), the file system starts on a 4K boundary. I've seen people create partitions that are offset by one sector, and that absolutely kills performance (because then every 4K operation by the file system turns into a read-modify-write in the disk drive).

On the UFS output, your fragment size and block size look good; both are multiples of 4K. I think that means all I/O will be done in 4K blocks, and initial allocation in 32K units. What I don't know is how big the inodes are; if they are still 512 bytes, this could get pretty slow when doing small file operations (creation in particular), but I don't remember how to adjust the inode size.

We can't use your "pkg" experiences as a good benchmark, since we have no idea what is really going on. Does installing those icons create 1 file, or 1000, or a million? Does it do other incredibly complex file system operations? Or is it CPU limited? I have no idea. To do performance optimization, you first have to measure where the bottleneck is.

Here is my suggestion: Using your favorite scripting language, write some simple file system benchmarks. All you need for this is the shell: Create an example small file (say exactly 512 bytes), then create a temporary directory, and create 100 copies of the small file in that directory, for example with "cat example_file > temp_dir/file.${counter}", incrementing counter in a loop. How long does that take? How about 1M files? How about 1000 directories, each with 100 files? I just did that for fun on my system, and I can create ~300 files per second on either ZFS with a 7200 RPM spinning disk, or UFS on an SSD.
 
Here is my suggestion: Using your favorite scripting language, write some simple file system benchmarks. All you need for this is the shell: Create an example small file (say exactly 512 bytes), then create a temporary directory, and create 100 copies of the small file in that directory, for example with "cat example_file > temp_dir/file.${counter}", incrementing counter in a loop. How long does that take? How about 1M files? How about 1000 directories, each with 100 files? I just did that for fun on my system, and I can create ~300 files per second on either ZFS with a 7200 RPM spinning disk, or UFS on an SSD.

I'm afraid I don't know much about shell scripting, but I'm willing to give this a try if you can give me some example code.

My "pkg experience" on the current system is definitely not normal. I can say this because I have FreeBSD installed on another system with a spinning HDD that does not exhibit this issue, so clearly something is not OK.
Until now I have never encountered this kind of problem, where using pkg to install some programs can nearly hang the system.
And the problem is not limited to installing icon themes (although it's most visible in that case) but also affects other packages that seem to consist of many small files. For example, Boost was a dependency of another big package and it took a long time to extract, much longer than usual. Another example: installing Python as a dependency of a package took longer than expected, and so on.
 
Code:
#!/bin/sh
tempdir=temp_create_files
mkdir $tempdir
dd if=/dev/random of=$tempdir/master.file bs=512 count=1

cycles=1000
counter=0
while [ $counter -lt $cycles ]
do
  cat $tempdir/master.file > $tempdir/file.$counter
  counter=$(($counter + 1))
done
Save that script (for example as measure_create_files), make it executable with chmod a+x measure_create_files, then run it as /usr/bin/time ./measure_create_files.

Here is what it does: It creates one file with random content (by reading 512 bytes from /dev/random), then creates 1000 files that are copies of it. The time command in front tells you how long it ran. On my machine, it says "3.64 real 0.88 user 2.71 sys", meaning it took 3.6 seconds, or 3.6 milliseconds per file, or about 300 files per second. That's running on ZFS, with 7200 RPM nearline disks underneath. Note that nearly all of the wall-clock time (3.6 seconds) was spent as CPU time (0.9 and 2.7 seconds), so this test is not disk limited but CPU limited. I ran the same test on a UFS file system on top of an SSD, and I get fundamentally the same answer: I'm CPU limited, not disk limited. I don't have a UFS file system on spinning disks at home. For fun, I ran the exact same script on a Mac with an SSD (a little slower, 250 files per second, and only 3/4 disk limited), on a cloud-hosted FreeBSD machine (much faster, but I don't know how it is virtualized or what the CPU speed there is), and on a cloud-based Linux machine (even faster; again, I have no idea what the underlying hardware is).
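The "1000 directories, each with 100 files" variant I mentioned can be sketched along the same lines (smaller numbers here, 10 directories of 100 files, so it finishes quickly; /dev/urandom is used instead of /dev/random purely as a portability precaution):

```shell
#!/bin/sh
# Sketch of the directory fan-out variant: 10 directories with 100 files
# each (scale the counts up for a real measurement, and run it under
# /usr/bin/time just like the flat version).
tempdir=temp_create_dirs
mkdir -p $tempdir
dd if=/dev/urandom of=$tempdir/master.file bs=512 count=1 2>/dev/null

d=0
while [ $d -lt 10 ]; do
    mkdir -p $tempdir/dir.$d
    f=0
    while [ $f -lt 100 ]; do
        cat $tempdir/master.file > $tempdir/dir.$d/file.$f
        f=$(( f + 1 ))
    done
    d=$(( d + 1 ))
done
echo "created $(find $tempdir -type f -name 'file.*' | wc -l) files"
```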
 
OK, saved your script as test.sh and ran it. Here is the output:

Code:
/usr/bin/time ./test.sh
1+0 records in
1+0 records out
512 bytes transferred in 0.000162 secs (3167161 bytes/sec)
        0.92 real         0.26 user         0.66 sys

How is it? And thanks for your help and patience.
 
OK, so you needed 0.92 seconds (wall-clock time) to create 1000 files, or slightly less than 1 ms per file. Of that, 100% was spent as CPU time (add 0.26 and 0.66 seconds), so the storage subsystem's latency was completely hidden (probably due to parallelism). That's very fast, three times faster than my server (which runs at 1 GHz with a 32-bit CPU). It also means that in this trivial test (creating tiny files, plus the inodes and directory entries for them), your storage subsystem is not limited by the disk at all, but only by CPU power.
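The arithmetic behind those statements can be checked directly from the /usr/bin/time figures:

```shell
# Quick check, with awk, of the per-file time and the CPU share of the
# wall-clock time, using the figures reported above (1000 files, 0.92 s
# real, 0.26 s user, 0.66 s sys).
awk 'BEGIN {
    files = 1000; real = 0.92; user = 0.26; sys = 0.66
    printf "per-file: %.2f ms\n", real * 1000 / files
    printf "CPU share of wall clock: %.0f%%\n", (user + sys) / real * 100
}'
```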

My conclusion: Your disk speed is irrelevant *TO THIS TEST*. We can now make a leap of faith, and generalize that to all small file operations. Which means that either (a) the performance of the pkg command is to be expected, or (b) there is another bottleneck that we don't know about.

Your next task: Using tools like top, iostat and vmstat, measure which parts of the system are busy while running that pkg command again.
 
If I were interested in how long it takes to write one file *AND* get it flushed to disk, I would have done that. But here we're trying to create a synthetic micro-benchmark that somewhat resembles the OP's use case: creating a lot of small files from a program (in this case pkg), very likely without intervening syncs. So this benchmark matches his use case.

For fun, I just ran the same thing with dd and oflag=(f)sync: It brings the performance on my ZFS (spinning rust) filesystem down to ~15ms/file (to be expected, the disks can take 8ms per revolution, and there are two disks), and about 4.6ms/file on ZFS with SSD (of which 3/4 is still the CPU time).
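For reference, a sketch of that fsync variant (conv=fsync support in dd(1) varies between systems, so treat it as an assumption and check your man page; smaller count here so it finishes quickly):

```shell
#!/bin/sh
# Sketch: same create loop, but each file is flushed to stable storage
# before the next one starts. conv=fsync is assumed to be supported by
# your dd(1); check the dd man page on your platform.
tempdir=temp_create_sync
mkdir -p $tempdir
dd if=/dev/urandom of=$tempdir/master.file bs=512 count=1 2>/dev/null

counter=0
while [ $counter -lt 100 ]; do
    dd if=$tempdir/master.file of=$tempdir/file.$counter conv=fsync 2>/dev/null
    counter=$(( counter + 1 ))
done
echo "wrote $counter files with per-file fsync"
```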
 