FreeBSD 13.1 extremally slow

cmiu147 · Oct 11, 2022

SirDice said:
There's barely any swap being used. Swap usage in and of itself isn't bad.

But I see a mariadbd and mysqld running. Those two plus ARC like to use a lot of memory. ARC and MySQL/MariaDB can and will fight over memory. So tune your databases to not use any more more memory than it actually needs and limit your ARC. And remember to keep some memory left over for the rest of your applications. With this much stuff on it I would probably opt for a memory upgrade too, 16GB isn't much.

Using a nvme SSD with ZIL and L2ARC, would help?

vmb said:
Sadly, SMR drives are effectively incompatible with ZFS. Yes, you can create a ZFS file system on them but as soon as the file system needs to write on a block that has already been written on, ZFS slows down due to the overhead in all the extra work it has to do because of the shingled recording. ZFS should only be used with Conventional Magnetic Recording (CMR) hard drives.

Haven't known that... So I should consider swap the drives in the future.

ralphbsz · Oct 11, 2022

... empty post ...

ralphbsz · Oct 11, 2022

If the root of the file system slowness is the SMR drives, that should be visible in iostat, vmstat, or zpool iostat. You should be seeing long latencies, and deep queues.

cmiu147 said:
why my all apps are fantastic slow compared to a linux box using the same config?

What file system and RAID system do you use on Linux?

gpw928 · Oct 11, 2022

cmiu147 said:
Even when I open ssh to connect to this it takes couple of second until I get my login prompt.

You are showing a near idle system. If it's like that when you try to login with ssh (and not flat-lined), something is mis-configured. I think that the original instincts of SirDice are well founded.

I suggest you focus on why ssh is taking so long. Run the ssh server daemon in the foreground in debug mode. Run the ssh client also in debug mode. Watch the traces. See where the delays are happening:

Code:

# On the server
$ cd /tmp
$ script
$ sudo /usr/sbin/sshd -p 1234 -d -d -d
# watch until login is complete
^C
$ exit
 
# On the client
$ cd /tmp
$ script
$ ssh -Y -p 1234 -v -v -v YourLoginName@ServerName
# watch until login is complete
$ exit
$ exit

PMc · Oct 12, 2022

I highly recommend gstat -p for disk issues.If there is an issue with a disk, it's immediateily visible turning red. If there is a problem with a disk in a raid, one can directly see that the counters of that disk are different to the others.

PMc · Oct 12, 2022

cmiu147 said:
Using a nvme SSD with ZIL and L2ARC, would help?

And a special device, yes, that may work. But: in order to do that, one needs information about the workload. The working set size, which filesystems should go into l2arc (and which shouldn't because they just overwrite the data that is needed). Etc.
And for zil we should look into gstat -po if we have flush events that delay our work.

cmiu147 said:
Haven't known that... So I should consider swap the drives in the future.

Yes, Seagate doesn't tell openly.

They go to quite a length to fill their spec sheets with useless statements, while not telling the relevant facts. And then they resort on the position that a consumer desktop drive (like the "DM") is not suited for databases.
But, if you notice a cache of 256M on a consumer drive, something must be wrong (datacenter ultrastar have 128M).

SirDice · Oct 12, 2022

cmiu147 said:
Using a nvme SSD with ZIL and L2ARC, would help?

It's not going to alleviate the memory pressure. It would actually start using even more. ZIL and L2ARC could help with disk performance, but it really depends on the workload.

Alain De Vos · Oct 12, 2022

To give more room to the databases you can limit the ZFS-ARC
sysctl.conf eg 2.5G

Code:

vfs.zfs.arc_max= 2500000000              #0

To reduce the fight of memory you can also limit database mariadb,
E.g. in my.cnf,

Code:

aria_pagecache_buffer_size=256M
innodb_buffer_pool_size=512M

Crivens · Oct 12, 2022

_martin said:
On Linux strace is your friend to quickly diagnose what are you waiting for. I'm sad to say this is no longer supported on FreeBSD. dtrace is a nice tool, more powerful too, but it has steep learning curve and sometimes it's harder to use for simple diagnostics.

What about ktrace?

_martin · Oct 12, 2022

Crivens said:
What about ktrace?

Personally I don't see benefit of it to truss. Certainly no versatility as strace. strace is able to do nice filtering with -e which is a feature I use the most. But other formatting features are missing from truss. I find it easy to use and for simple cases just one nice tool to have.

But in OPs problem I was hoping he'd react to SirDice suggestion about DNS and attach a truss to sshd daemon already with truss -fp <pidofdaemon> so we could see why there's lag in connection.

cmiu147 · Oct 12, 2022

PMc said:
I highly recommend gstat -p for disk issues.If there is an issue with a disk, it's immediateily visible turning red. If there is a problem with a disk in a raid, one can directly see that the counters of that disk are different to the others.

Ok, this really helped me... looks like my disks are doing something even in idle mode... The thing is that is fluctuating from 0.x straight to 100+ %. I stoped all the services and still get 100% spikes.

Alain De Vos · Oct 12, 2022

How do you find the process-id of the one creating disk-activity ?

W.hâ/t · Oct 12, 2022

Alain De Vos said:
How do you find the process-id of the one creating disk-activity ?

In top key 'm' display processes iostat

Alain De Vos · Oct 12, 2022

Code:

top -m io -o total

Peacekeeper2000 · Oct 12, 2022

Let me throw my two cents in here: I had two troubles with DNS: using it with encryption has delayed everything as the timeout is long, when encryption is not working. Using DNS to provide Fingerprints for PKI is also slow, when not every host is listed. And looks like your router works not as DNS server ? only Google (and noch cache) ?

gpw928 · Oct 13, 2022

It's entirely possible that you are conflating a collection of maladies.

The problem is that the possibilities being canvassed are growing at a pace.

So you need to narrow down the search be eliminating variates, and move forward one step at a time.

Yes, SMR disks are crap for ZFS. Their achilles heel is mostly with deletions and re-writing. They will never work well with a database that needs to be constantly updated. Deleting database rows is going to be slow. Resilvering a replacement disk will be a nightmare -- it will take an eternity (like two weeks instead of a day), and the system will run like a dog for the entire time. Your only way forward with these SMR disks is to eventually replace them.

So to trouble-shoot, quiesce your applications, your databases, and your disks. Then move forward, one step at a time. You need to eliminate mis-configuration before considering hardware problems.

Several people have mentioned different ways in which delays in every DNS lookup can multiply to kill application performance. This is experience speaking. It needs to be near top of the list for your investigations.

Time DNS lookups (forward and reverse) on your server and your client. Repeat the tests several times. Not all DNS queries are resolved in the same way. This is why I suggested you test ssh connections above.

Run the iocage list. Is it still slow?

cmiu147 · Oct 13, 2022

gpw928 said:
It's entirely possible that you are conflating a collection of maladies.

The problem is that the possibilities being canvassed are growing at a pace.

So you need to narrow down the search be eliminating variates, and move forward one step at a time.

Yes, SMR disks are crap for ZFS. Their achilles heel is mostly with deletions and re-writing. They will never work well with a database that needs to be constantly updated. Deleting database rows is going to be slow. Resilvering a replacement disk will be a nightmare -- it will take an eternity (like two weeks instead of a day), and the system will run like a dog for the entire time. Your only way forward with these SMR disks is to eventually replace them.

So to trouble-shoot, quiesce your applications, your databases, and your disks. Then move forward, one step at a time. You need to eliminate mis-configuration before considering hardware problems.

Several people have mentioned different ways in which delays in every DNS lookup can multiply to kill application performance. This is experience speaking. It needs to be near top of the list for your investigations.

Time DNS lookups (forward and reverse) on your server and your client. Repeat the tests several times. Not all DNS queries are resolved in the same way. This is why I suggested you test ssh connections above.

Run the iocage list. Is it still slow?

If I run iocage list after a period of time... let's say couple of hours, take around 10-20s until list the cages. If I run it again immediately, the cages are listed instantly. I sort out the dns problem... looks like now ssh password promo is showing in 1-2seconds.This happens with a wordpress website for example... if I open it for the first time take 10-20seconds until I get the response and then works almost normally... pretty slow compared to the same apache/php config in a similar server with centos and SAS Hitachi drives.

I suspect 2 things here...

1. Controller
2. SMR drives

Crivens · Oct 13, 2022

SMR Drives are slow in writes. What may help is to disable the atime tracking. That creates a write with every open() of a file, even for reading.

cmiu147 · Oct 13, 2022

One more thing... now the drives are under ZFS raid10. will I have a performance gain if I use a hardware raid controller, set to raid5 or raid10?

rootbert · Oct 13, 2022

if at all probably not by much. In addition to that, the error-correcting-codes used in hardware raid controllers are not as advanced as those of zfs, furthermore controller might pass the wrong information from the device up to zfs (though this probability is really low). Also note: most hardware controllers do save metadata, so in case your controller does not work any more, data restoration is a more complicated process. I'd rather spend the money on two SSDs, have a fast zpool and add two small partitions of each as ZIL (mirror) + L2ARC to my slow hdd based zpool.

PMc · Oct 13, 2022

cmiu147 said:
If I run iocage list after a period of time... let's say couple of hours, take around 10-20s until list the cages. If I run it again immediately, the cages are listed instantly.

Do the drives go into a powersave mode? Desktop drives usually do, and then this would be perfectly normal.

gpw928 · Oct 13, 2022

cmiu147 said:
If I run iocage list after a period of time... let's say couple of hours, take around 10-20s until list the cages. If I run it again immediately, the cages are listed instantly.

That's new and important information. I suspect that identifying the cause will solve the problem.

The powersave mode suggested by PMc is a strong contender. Look at what the disks are doing during that very long delay.

Edit, the Seagate ST2000DM008 has a "Standby Mode/Sleep Mode". This post on the use of camcontrol(8) may be useful.

cmiu147 · Oct 13, 2022

t

gpw928 said:
That's new and important information. I suspect that identifying the cause will solve the problem.

The powersave mode suggested by PMc is a strong contender. Look at what the disks are doing during that very long delay.

Hi,
They are spiking like crazy... and the speed transfer is... I don't want to talk about it... any little thing, the disk goes to 100%... is like I was having an old ATA33 drive ... not a modern sata drive... I understand that SMR is crap... but even that bad?

gpw928 · Oct 13, 2022

The fact that the delay only happens once is really important. If the delay problem was because of write amplification due to SMR, it should be consistent. It's not. I was editing my post above while you were responding. Suggest you follow the "Standby Mode/Sleep Mode" trail.

PMc · Oct 13, 2022

cmiu147 said:
They are spiking like crazy... and the speed transfer is... I don't want to talk about it... any little thing, the disk goes to 100%... is like I was having an old ATA33 drive ... not a modern sata drive... I understand that SMR is crap... but even that bad?

Disk going to 100% without heavy traffic usually means it is internally busy. SATA does auto-startup, so the first command will take a while:

Code:

# camcontrol standby ada2
# time dd if=/dev/ada2 count=1 of=/dev/null
1+0 records in
1+0 records out
512 bytes transferred in 6.262433 secs (82 bytes/sec)
0.000u 0.003s 0:06.26 0.0%      0+0k 1+0io 0pf+0w

 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      1      1      0   2442      0      0    0.0  243.1| ada2

First, find out if these disks support epc (I don't think so, but let's check):

Code:

# camcontrol epc ada2 -c list
(pass9:ahcich2:0:0:0): READ_LOG_DMA_EXT. ACB: 47 00 08 00 00 40 00 00 00 00 02 00
(pass9:ahcich2:0:0:0): CAM status: ATA Status Error
[etc.etc.]

-> no epc support

Code:

# camcontrol epc ada7 -c list
ATA Power Conditions Log:
  Idle power conditions page:
    Idle A condition:
[etc.etc.]

-> disk with epc support

Then check apm:

Code:

# camcontrol identify ada2 | egrep "^(device m|Feature|advan)"
device model          ST3000DM008-2DM166
Feature                      Support  Enabled   Value           Vendor
advanced power management      yes      yes     192/0xC0

This seems to switch it off:

Code:

# camcontrol apm ada2
# camcontrol identify ada2 | egrep "^(device m|Feature|advan)"
device model          ST3000DM008-2DM166
Feature                      Support  Enabled   Value           Vendor
advanced power management      yes      no      0/0x00

FreeBSD 13.1 extremally slow

cmiu147

ralphbsz

ralphbsz

gpw928

PMc

PMc

SirDice

Administrator

Alain De Vos

Crivens

Administrator

_martin

cmiu147

Attachments

Alain De Vos

W.hâ/t

Alain De Vos

Peacekeeper2000

gpw928

cmiu147

Crivens

Administrator

cmiu147

rootbert

PMc

gpw928

cmiu147

gpw928

PMc