Need help in identifying the bottleneck

I built my home server around a low-profile Intel Atom D510MO board. Relevant output of dmesg:

Code:
FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012
    root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
CPU: Intel(R) Atom(TM) CPU D510   @ 1.66GHz (1666.73-MHz K8-class CPU)
...
real memory  = 2147483648 (2048 MB)
avail memory = 2030952448 (1936 MB)

It has two identical 3 TB Hitachi (7200 rpm) SATA drives attached via a mini-PCIe ASMedia SATA controller:

Code:
...
ahci0: <ASMedia ASM1061 AHCI SATA controller> 0x2028-0x202f,0x2034-0x2037,0x2020-0x2027,
       0x2030-0x2033,0x2000-0x201f mem 0xf0200000-0xf02001ff irq 17 at device 0.0 on pci2
ahci0: AHCI v1.20 with 2 6Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
...
...
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <Hitachi HDS723030ALA640 MKAOA3B0> ATA-8 SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada1: <Hitachi HDS723030ALA640 MKAOA3B0> ATA-8 SATA 3.x device
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada1: Command Queueing enabled
ada1: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)

Both drives hold UFS2 file systems with soft updates journaling enabled.

When copying ten large 1 GB files from one drive to the other with plain cp, I consistently see transfer rates of about 65 to 70 MByte/s. I am not unhappy with that, but I wonder whether it could be improved by attaching the drives to a more powerful board.

So, is 65 to 70 MByte/s limited by the raw transfer rate of the drives, or could I push it to, say, 100 MByte/s by attaching them to a high-end board?

Of course, I do not want to buy another board just to end up at 71.5 MByte/s.

Please share your experience with copy speeds of real files from one drive to another; this is a different task from copying files between two volumes on the same drive.

I know that copy speed could be improved with a RAID setup, but that would be a separate consideration.
 
Copying a 1G file from an SSD to a WD Black 1T drive gives 127.92MB/s. That's UFS with soft updates but no SUJ or the old gjournal(8).
Reviews of that Hitachi suggest it ought to be at least as fast as the WD drive. It is a 512-byte block drive, so alignment is not a problem.

I have not used that controller, but it would be nice to try a different one to isolate it. Or it could be something with the PCIe slots on that motherboard.

If the destination partition is on the slow (inner) part of the drive, it could go much slower than the outer section of the drive.
 
wblock@ said:
Copying a 1G file from an SSD to a WD Black 1T drive gives 127.92MB/s. That's UFS with soft updates but no SUJ or the old gjournal(8).
Reviews of that Hitachi suggest it ought to be at least as fast as the WD drive. It is a 512-byte block drive, so alignment is not a problem.

I have not used that controller, but it would be nice to try a different one to isolate it. Or it could be something with the PCIe slots on that motherboard.

If the destination partition is on the slow (inner) part of the drive, it could go much slower than the outer section of the drive.

Many thanks for the reply.

Seeing that much higher transfer rates are possible with modern hard drives, I am now fairly sure that my Atom board is the bottleneck.

Best regards

Rolf
 
Make sure it has the latest BIOS. Intel's Atom boards used to ship with embarrassingly bad firmware versions. In fact, that put me off those boards entirely.
 
I already updated the firmware to the latest version available for this revision of the board some time ago.

When copying big files from one disk to the other, the CPU utilization of cp as reported by top is about 25 to 33 %.
In the meantime, I finished work on a file tree cloning program which runs three threads: a scheduler, a reader, and a writer thread. Reading and writing occur in parallel, which is of course most beneficial when transferring data from one disk to another.

With that, cloning the whole file hierarchy (2.3 TBytes of data) from one disk to the other took 7.5 h, so the average transfer rate was about 89 MByte/s -- for everything, i.e. directories, very small and very big files, symbolic links and hard links, including file attributes, flags, ACLs and EAs. top reported 50 to 67 % CPU utilization for the file-level cloning.

So I guess that cp reads and writes sequentially. If one side were a very fast SSD, the transfer rate would already benefit from that alone.
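
For illustration, the reader/writer part of my program follows roughly this pattern (a minimal sketch with POSIX threads and a bounded chunk queue; all names, the chunk size and the queue depth are illustrative, error handling is simplified, and this is not the actual program):

Code:
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK_SIZE  (1024 * 1024)   /* 1 MByte per chunk */
#define QUEUE_SLOTS 100             /* reader may run ~100 MB ahead of the writer */

struct chunk {
    ssize_t len;
    char    data[CHUNK_SIZE];
};

static struct chunk    queue[QUEUE_SLOTS];
static int             head, tail, count;
static pthread_mutex_t mtx       = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Reader: fills chunks from the source; a zero-length chunk signals EOF to the writer. */
static void *
reader(void *arg)
{
    int fd = *(int *)arg;
    ssize_t n;

    do {
        pthread_mutex_lock(&mtx);
        while (count == QUEUE_SLOTS)            /* throttle: queue is full */
            pthread_cond_wait(&not_full, &mtx);
        struct chunk *c = &queue[head];
        pthread_mutex_unlock(&mtx);

        n = c->len = read(fd, c->data, CHUNK_SIZE);

        pthread_mutex_lock(&mtx);
        head = (head + 1) % QUEUE_SLOTS;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&mtx);
    } while (n > 0);
    return (NULL);
}

/* Writer: drains chunks to the destination until it sees the zero-length chunk. */
static void *
writer(void *arg)
{
    int fd = *(int *)arg;
    ssize_t n;

    do {
        pthread_mutex_lock(&mtx);
        while (count == 0)                      /* nothing to write yet */
            pthread_cond_wait(&not_empty, &mtx);
        struct chunk *c = &queue[tail];
        pthread_mutex_unlock(&mtx);

        n = c->len;
        if (n > 0 && write(fd, c->data, n) != n)
            perror("write");

        pthread_mutex_lock(&mtx);
        tail = (tail + 1) % QUEUE_SLOTS;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&mtx);
    } while (n > 0);
    return (NULL);
}

int
main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s source destination\n", argv[0]);
        return (1);
    }
    int in  = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) {
        perror("open");
        return (1);
    }

    pthread_t rt, wt;
    pthread_create(&rt, NULL, reader, &in);
    pthread_create(&wt, NULL, writer, &out);
    pthread_join(rt, NULL);
    pthread_join(wt, NULL);
    close(in);
    close(out);
    return (0);
}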
 
Sorry to chime in late, but this is what I do to check for tuning points:
  • Read the raw disk device using dd and output to /dev/null
  • Write to the raw disk device using dd, reading from /dev/zero
  • Do the same as above, using a file system
These steps give you the maximum bandwidth the system can handle in streaming accesses. Should the raw read/write top out at, say, 70 MB/s, you cannot expect a file system to do better. When access through a file system turns out to be a lot worse, you may be dealing with alignment or driver problems which need to be investigated before continuing.
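
Concretely, the three tests could look like this (the device names adaX/adaXpY and the mount point are only placeholders; the raw write test destroys whatever is on its target, so point it only at a scratch disk or partition):

Code:
// raw read from the whole disk device
# dd if=/dev/adaX of=/dev/null bs=1m count=1024

// raw write to a scratch partition (destroys its contents!)
# dd if=/dev/zero of=/dev/adaXpY bs=1m count=1024

// the same tests through the file system
# dd if=/dev/zero of=/mnt/test/testfile bs=1m count=1024
# dd if=/mnt/test/testfile of=/dev/null bs=1m count=1024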

After that, in your case, I would run these tests on both channels at the same time to check for interference. One example is my NAS box, where four disks on one controller together are only as fast as a single disk; there the bus is the limiting factor. Yes, I had some spare PCI controllers around, and since the network link is 100 MBit that is not the bottleneck. But you might be surprised by such checks when you find out which slots share the same PCIe lanes.

Your tree cloning program sounds interesting. I remember reading something about speeding up copy operations by avoiding the memory copy of the read/write calls when they run in separate threads. IIRC it involved mmap()ing the target file and then read()ing directly into that, avoiding the penalty of copying the memory from/to user space several times. Might be worth a try; I'll check if I can find that reference again.
 
Crivens said:
Sorry to chime in late, but this is what I do to check for tuning points:
  • Read the raw disk device using dd and output to /dev/null
  • Write to the raw disk device using dd, reading from /dev/zero
  • Do the same as above, using a file system

Code:
# dd if=one_giga_byte_file of=/dev/null bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 16.108395 secs (65095002 bytes/sec)

# dd if=/dev/zero of=testfile bs=1m count=1000
1000+0 records in
1000+0 records out
1048576000 bytes transferred in 14.891147 secs (70416067 bytes/sec)

Well, I see 62 to 67 MByte/s using dd. However, these tests raise more questions than they answer with regard to my actual question.

Crivens said:
These steps give you the maximum bandwidth the system can handle in streaming accesses. Should the raw read/write top out at, say, 70 MB/s, you cannot expect a file system to do better. When access through a file system turns out to be a lot worse, you may be dealing with alignment or driver problems which need to be investigated before continuing.

Sorry, I am not a native English speaker, and perhaps I did not make it clear enough what I need to know, so please let me try again:

If I connect the same disks to a high-end motherboard, could I achieve much higher transfer rates or not?

Crivens said:
After that, in your case, I would run these tests on both channels at the same time to check for interference. One example is my NAS box, where four disks on one controller together are only as fast as a single disk; there the bus is the limiting factor. Yes, I had some spare PCI controllers around, and since the network link is 100 MBit that is not the bottleneck. But you might be surprised by such checks when you find out which slots share the same PCIe lanes.

It is of course always interesting to discuss how to squeeze the maximum performance out of a given setup; at the moment, however, this is of minor importance to me. I will come back to it once I have decided whether or not to spend money on a new board.

Crivens said:
Your tree cloning program sounds interesting. I remember reading something about speeding up copy operations by avoiding the memory copy of the read/write calls when they run in separate threads. IIRC it involved mmap()ing the target file and then read()ing directly into that, avoiding the penalty of copying the memory from/to user space several times. Might be worth a try; I'll check if I can find that reference again.

Perhaps you read it here:

The GNU C Library Reference Manual - 13.7 Memory-mapped I/O

... This is more efficient than read or write, as only the regions of the file that a program actually accesses are loaded. Accesses to not-yet-loaded parts of the mmapped region are handled in the same way as swapped out pages. ...

I considered using this, but I decided against it, because it did not seem to fit naturally (without forcing it) into the required procedure, i.e. reading a chunk of a file from disk 1 and writing it to disk 2 while at the same time reading the next chunk from disk 1, and so on. In any case the whole file is read in and written out, not only regions of it, so in this scenario I did not expect the promised performance improvement. Also, the reader thread does not wait for the writer to complete a file; the reader may be up to a hundred files ahead.
 
@rolfheinrich

You skipped over an important piece, namely the raw-device read test. While a raw write would destroy the data that is already on there, read performance can still be tested like this:
# dd if=/dev/(a)daX of=/dev/zero bs=1m count=1024
If that gives better performance, you have to consider alignment/driver issues, filesystem tweaks, etc. that could be a factor.

/Sebulon
 
Sebulon said:
You skipped over an important piece, namely the raw-device read test. While a raw write would destroy the data that is already on there, read performance can still be tested like this:
# dd if=/dev/(a)daX of=/dev/zero bs=1m count=1024
Code:
# dd if=/dev/ada0p2 of=/dev/zero bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 9.778621 secs (109805036 bytes/sec)

# dd if=/dev/ada0p3 of=/dev/zero bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 9.131814 secs (117582533 bytes/sec)

So, this is 105 MByte/s and 112 MByte/s now. What would this disk give when attached to a perfect SATA controller on a perfect high-end motherboard?

Sebulon said:
If that gives better performance, you have to consider alignment/driver issues, filesystem tweaks, etc. that could be a factor.

Code:
# gpart show
=>        34  5860533101  ada0  GPT  (2.7T)
          34           6        - free -  (3.0k)
          40         128     1  freebsd-boot  (64k)
         168     8388608     2  freebsd-swap  (4.0G)
     8388776  5852144352     3  freebsd-ufs  [bootme]  (2.7T)
  5860533128           7        - free -  (3.5k)

As Warren already noted, the Hitachi HDS723030ALA640 is a 512-byte-sector disk; anyway, I routinely align all partitions at 4 KB boundaries. I guess by driver issues you mean the driver of my ASMedia AHCI SATA controller; well, that would be history once I decide to buy a new high-end board.
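
For what it is worth, that 4 KB alignment is simply enforced when the partitions are created, along these lines (the device name and partition sizes below are examples only, not the exact commands I used):

Code:
# gpart create -s gpt ada0
# gpart add -a 4k -s 64k -t freebsd-boot ada0
# gpart add -a 4k -s 4g -t freebsd-swap ada0
# gpart add -a 4k -t freebsd-ufs ada0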

- Shall I buy a new one?
- Would it significantly improve disk IO performance?
- What can I expect?
 
rolfheinrich said:
Sorry, I am not a native English speaker
Neither am I. Your name suggests you may be more at home with German, is that correct?
rolfheinrich said:
If I connect the same disks to a high-end motherboard, could I achieve much higher transfer rates or not?
If those disks are bound by the controller (meaning, the bottleneck is on the board), then yes. If you have a system with higher specs, attaching one disk to it and checking the read performance in comparison to the Atom should give some more information about this.

rolfheinrich said:
I considered using this, but I decided against it, because it did not seem to fit naturally (without forcing it) into the required procedure, i.e. reading a chunk of a file from disk 1 and writing it to disk 2 while at the same time reading the next chunk from disk 1, and so on. In any case the whole file is read in and written out, not only regions of it, so in this scenario I did not expect the promised performance improvement. Also, the reader thread does not wait for the writer to complete a file; the reader may be up to a hundred files ahead.

There may be a problem here: do you check memory usage?
If the memory for a file gets paged out between reading and writing that file, you need to throttle the reader.

Back to the mmap() mechanism: the reason this ought to be faster is the way the OS handles write requests. The system cannot write the data you pass to write() without copying it into kernel-owned memory, because it cannot be sure that you do not modify the buffer between the write() call and the driver actually putting that memory on disk. So the data gets copied into kernel memory with memcpy() (which I think is called differently in kernel space, but you know what I mean).
When using the mmap() method, the read() call delivers the data directly into memory that already belongs to the target file, thus skipping that copy. Also, since the target file is mapped, and after the unmap its pages belong only to the kernel, you have no need for a writer task; the system will do that for you.
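
In rough C terms, the idea would be something like this minimal sketch (just to illustrate the pattern; the function name is made up, empty files and error paths are glossed over, and it is not meant as a drop-in replacement for your program):

Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy one file by mmap()ing the *target* and read(2)ing the source
 * directly into that mapping, so the data never passes through a
 * separate user-space buffer.
 */
int
copy_via_mmap(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return (-1);

    struct stat st;
    if (fstat(in, &st) < 0 || st.st_size == 0) {
        close(in);
        return (-1);
    }

    int out = open(dst, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return (-1);
    }

    /* Grow the target to its final size before mapping it. */
    if (ftruncate(out, st.st_size) < 0) {
        close(in);
        close(out);
        return (-1);
    }

    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
        MAP_SHARED, out, 0);
    if (map == MAP_FAILED) {
        close(in);
        close(out);
        return (-1);
    }

    /* read(2) delivers the source data straight into the target's pages. */
    off_t off = 0;
    while (off < st.st_size) {
        ssize_t n = read(in, map + off, (size_t)(st.st_size - off));
        if (n <= 0)
            break;
        off += n;
    }

    /* After munmap the dirty pages belong to the kernel, which writes them out. */
    munmap(map, (size_t)st.st_size);
    close(in);
    close(out);
    return (off == st.st_size ? 0 : -1);
}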

I also found this information, which deals with filling the file system cache: using plain reads and writes keeps these files as copies in the cache (two copies, even) when you might not want even one.

This mmap/read topic also seems to come up once in a while; maybe it is still interesting to read. I will do so now.

And I would like to state that I have no intention of commandeering this thread :)
 
@rolfheinrich

Well, as I see it, since you got ~100 MB/s (which is normal) streaming from the device, it would be interesting to see the difference in a raw-device write test as well, though as I understand it that is impossible because you have data on there. It would also be useful to compare aligned vs. unaligned read/write tests.

I am not leaning so much towards driver/HW issues, since the controller can actually produce a better score directly from the raw device. I don't think your situation would significantly improve just by getting new hardware; I would imagine this lies somewhere in your configuration, but I cannot pin it down exactly...

Something interesting to test for a possible HW-based bottleneck though:
# dd if=/dev/ada0p3 of=/dev/zero bs=1m count=1024 | dd if=/dev/ada1p3 of=/dev/zero bs=1m count=1024
Both should complete at about the same time with an equal score, i.e. both should be able to read at 100 MB/s simultaneously. But if they turn out to read more like 50 MB/s, you have a HW-based bottleneck.

/Sebulon
 
diskinfo(8)'s speed test is essentially a dd(1) read, but gives more information and also tests different read rates in different parts of the disk:
Code:
% diskinfo -tv ada1
ada1
	512         	# sectorsize
	1000204886016	# mediasize in bytes (931G)
	1953525168  	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	1938021     	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	WD-W            # Disk ident.

Seek times:
	Full stroke:	  250 iter in   5.085314 sec =   20.341 msec
	Half stroke:	  250 iter in   3.502225 sec =   14.009 msec
	Quarter stroke:	  500 iter in   5.775660 sec =   11.551 msec
	Short forward:	  400 iter in   2.002646 sec =    5.007 msec
	Short backward:	  400 iter in   1.459001 sec =    3.648 msec
	Seq outer:	 2048 iter in   0.248437 sec =    0.121 msec
	Seq inner:	 2048 iter in   0.206458 sec =    0.101 msec
Transfer rates:
	outside:       102400 kbytes in   0.776317 sec =   131905 kbytes/sec
	middle:        102400 kbytes in   0.885671 sec =   115619 kbytes/sec
	inside:        102400 kbytes in   1.598301 sec =    64068 kbytes/sec

That last value gives an idea of how much the rate can vary.
 
Crivens said:
Neither am I. Your name suggests you may be more at home with German, is that correct?

Yes, I am a German, living in Brazil.

Crivens said:
... If you have a system with higher specs, attaching one disk to it and checking the read performance in comparison to the Atom should give some more information about this.

That was the idea behind my post. My hope was only that I would not need to buy the high-end board just to do the test, but that somebody could help me out with some meaningful tests before I decide to spend money.

Crivens said:
rolfheinrich said:
...Also, the reader thread does not wait for the writer to complete a file; the reader may be up to a hundred files ahead.

There may be a problem here: do you check memory usage?
If the memory for a file gets paged out between reading and writing that file, you need to throttle the reader.

Yes, the reader waits if it is 100 chunks (of at most 1 MByte each) ahead of the writer, and the scheduler waits if it has 100 files in the queue ahead of the reader. The chunks are buffered in a fixed-size memory pool of about 100 MB, and the queue items generated by the scheduler occupy a fixed-size buffer of about 16 MB.

Crivens said:
...
When using the mmap() method, the read() call delivers the data directly into memory that already belongs to the target file, thus skipping that copy. Also, since the target file is mapped, and after the unmap its pages belong only to the kernel, you have no need for a writer task; the system will do that for you.

Sounds good; however, this way reading and writing are strictly synchronized (no need for a writer thread), which is a different concept from what I wanted to realize.

Crivens said:
I also found this information, which deals with filling the file system cache: using plain reads and writes keeps these files as copies in the cache (two copies, even) when you might not want even one.

Yes, this is very interesting; I will have a look at it. Exactly as you said, there is no need to cache the copied chunks, since each is transferred only once. Thank you for the link!

Crivens said:
This mmap/read topic also seems to come up once in a while; maybe it is still interesting to read. I will do so now.

Also mmap() is not without overhead. The respective d_mmap() routine of the device driver is called once for each page of the memory region to be mapped. The page size is usually 4096 bytes. So, for mapping a 1 GB file into virtual memory, d_mmap() of the device driver needs to be called 262144 times.

Crivens said:
And I would like to state that I have no intention of commandeering this thread :)

No problem, Germans like discussions :-)
 
Sebulon said:
Well, as I see it, since you got ~100 MB/s (which is normal) streaming from the device, it would be interesting to see the difference in a raw-device write test as well, though as I understand it that is impossible because you have data on there. It would also be useful to compare aligned vs. unaligned read/write tests.

For this, I can use the inactive swap partition of the cloned drive:
Code:
// raw reading
# dd if=/dev/ada1p2 of=/dev/zero bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 9.718974 secs (110478929 bytes/sec)

// raw writing
# dd if=/dev/zero of=/dev/ada1p2 bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 8.190581 secs (131094710 bytes/sec)

Sebulon said:
Something interesting to test for a possible HW-based bottleneck though:
# dd if=/dev/ada0p3 of=/dev/zero bs=1m count=1024 | dd if=/dev/ada1p3 of=/dev/zero bs=1m count=1024
Both should complete at about the same time with an equal score, i.e. both should be able to read at 100 MB/s simultaneously. But if they turn out to read more like 50 MB/s, you have a HW-based bottleneck.

Code:
# dd if=/dev/ada0p2 of=/dev/zero bs=1m count=1024 |\
  dd if=/dev/ada1p2 of=/dev/zero bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 9.329889 secs (115086237 bytes/sec)
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 11.703784 secs (91743134 bytes/sec)

Wait a moment, I have a third drive in the system (a 2.5" Samsung attached to motherboard SATA port 1).

Code:
# dd if=/dev/ada0p2 of=/dev/zero bs=1m count=1024 |\
  dd if=/dev/ada1p2 of=/dev/zero bs=1m count=1024 |\
  dd if=/dev/ada2p2 of=/dev/zero bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 10.343785 secs (103805504 bytes/sec)
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 11.262390 secs (95338719 bytes/sec)
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 12.181642 secs (88144260 bytes/sec)

Looks like the combined SATA bandwidth of the Atom board is at least 274 MByte/s: the three parallel reads above add up to 103805504 + 95338719 + 88144260 = 287288483 bytes/s, i.e. about 274 MiB/s. So, if ~100 MB/s for a single drive is normal, these suggested tests have answered my question sufficiently. My conclusion: a new high-end motherboard won't give me a significant improvement in disk transfer rates. That saves me from spending a few hundred euros. Thank you!
 
wblock@ said:
diskinfo(8)'s speed test is essentially a dd(1) read, but gives more information and also tests different read rates in different parts of the disk:
Code:
% diskinfo -tv ada1
ada1
	512         	# sectorsize
	1000204886016	# mediasize in bytes (931G)
	1953525168  	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	1938021     	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	WD-W            # Disk ident.

Seek times:
	Full stroke:	  250 iter in   5.085314 sec =   20.341 msec
	Half stroke:	  250 iter in   3.502225 sec =   14.009 msec
	Quarter stroke:	  500 iter in   5.775660 sec =   11.551 msec
	Short forward:	  400 iter in   2.002646 sec =    5.007 msec
	Short backward:	  400 iter in   1.459001 sec =    3.648 msec
	Seq outer:	 2048 iter in   0.248437 sec =    0.121 msec
	Seq inner:	 2048 iter in   0.206458 sec =    0.101 msec
Transfer rates:
	outside:       102400 kbytes in   0.776317 sec =   131905 kbytes/sec
	middle:        102400 kbytes in   0.885671 sec =   115619 kbytes/sec
	inside:        102400 kbytes in   1.598301 sec =    64068 kbytes/sec

That last value gives an idea of how much the rate can vary.

Code:
# diskinfo -tv ada0
ada0
	512         	# sectorsize
	3000592982016	# mediasize in bytes (2.7T)
	5860533168  	# mediasize in sectors
	0           	# stripesize
	0           	# stripeoffset
	5814021     	# Cylinders according to firmware.
	16          	# Heads according to firmware.
	63          	# Sectors according to firmware.
	MK0311YHG19EKA	# Disk ident.

Seek times:
	Full stroke:	  250 iter in   6.432925 sec =   25.732 msec
	Half stroke:	  250 iter in   4.624562 sec =   18.498 msec
	Quarter stroke:	  500 iter in   7.292195 sec =   14.584 msec
	Short forward:	  400 iter in   2.297175 sec =    5.743 msec
	Short backward:	  400 iter in   2.856070 sec =    7.140 msec
	Seq outer:	 2048 iter in   0.332453 sec =    0.162 msec
	Seq inner:	 2048 iter in   0.314238 sec =    0.153 msec
Transfer rates:
	outside:       102400 kbytes in   0.750510 sec =   136441 kbytes/sec
	middle:        102400 kbytes in   0.837482 sec =   122271 kbytes/sec
	inside:        102400 kbytes in   1.334603 sec =    76727 kbytes/sec

Thank you, Warren. This supports what you wrote in your other post, that my Hitachi drives are at about the same level as your WD drive: the Hitachi's seek times are a little worse, while its transfer rates are a little better.

My overall conclusion is that the drives are the bottleneck, not the low-end Atom board.

Many thanks to all who responded.

Best regards

Rolf
 