dd consistently stopping early

I am attempting to clone a 68GB FreeBSD 8.0 machine (1.1.1.1) over ssh to a 120GB FreeBSD 12.2 machine (2.2.2.2). Here is my command:
Code:
2.2.2.2# ssh root@1.1.1.1 "dd if=/dev/da0 bs=4096 | gzip -1 -" | dd of=image.gz status=progress | pv -s 68G

After some time, the operation halts at only 6% reporting:
Code:
4.37GiB 0:52:12 [1.43MiB/s] [>                                        ]     6%
62914560+0 records in
62914560+0 records out
128849018880 bytes transferred in 3132.538034 secs (41132467 bytes/sec)
9174820+1 records in
9174820+1 records out
4697507868 bytes transferred in 3129.408783 secs (1501085 bytes/sec)

I can repeat the operation, each time resulting in a similarly premature end. I can't determine why this would be. Can you guys lend me a hand with this? I don't really have anybody around locally to bounce ideas off of.
 
Forgive my naivety, but why do you presume there are read errors?
Because that would break the pipe and therefore the transfer. And you're reading from a disk on an old system (FreeBSD 8.0), so it is very likely it has some bad sectors on it.
 
Because that would break the pipe and therefore the transfer. And you're reading from a disk on an old system (FreeBSD 8.0), so it is very likely it has some bad sectors on it.
I did suspect this as a possibility, but I'm not familiar with dd. Are there any other possibilities I should investigate? Or assuming the cause is IO error, is there a better tool for this job? I suppose I could attempt using conv=sync,noerror to see if that makes a difference.
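If I understand correctly, that would go on the dd that reads the disk, something like this (untested):
Code:
2.2.2.2# ssh root@1.1.1.1 "dd if=/dev/da0 bs=4096 conv=noerror,sync | gzip -1 -" | dd of=image.gz status=progress | pv -s 68G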
Nor does dd continue when there are write errors.
And the target is also quite old (120GB), so I'd also look at the target carefully.
smartmontools might give clues.
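For example (sysutils/smartmontools; the device name is just a guess, and a disk behind a hardware RAID controller may need an extra -d type option):
Code:
1.1.1.1# smartctl -a /dev/da0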
These are the sort of assumptions I was concerned about - this 120GB system is actually a VM running on an HP DL380 with several TB SSD datastore in RAID5+0. I'm attempting to carry over the old box which is running many complicated services to a virtualized environment. Recreating from scratch would be tedium ad infinitum considering outdated dependencies, etc.

I'm hesitant to assume the issue is simply read errors, as the source machine is also on a fairly decent hardware RAID controller, but of course it would be equally foolish to discount this as a cause.
 
I'm attempting to carry over the old box which is running many complicated services to a virtualized environment. Recreating from scratch would be tedium ad infinitum considering outdated dependencies, etc.
Don't, I see no reason for this. You don't want to put a FreeBSD 8.0 system back in production in any case, it's been EoL for 11 years. Copy the data from that system and set up an equivalent system with similar services (but up to date). Then transfer only the data. The data is important, the server and the applications on it are not.
 
Don't, I see no reason for this. You don't want to put a FreeBSD 8.0 system back in production in any case, it's been EoL for 11 years. Copy the data from that system and set up an equivalent system with similar services (but up to date). Then transfer only the data. The data is important, the server and the applications on it are not.
I thank you for the insight into dd, but respectfully, you don't see the reason because you're far removed from the relevant circumstances, and operating on an extremely limited amount of information.

For the sake of brevity, I'll limit my explanation to this: my priority is removing this production server from its current box which is consuming a large amount of electricity and producing a large amount of heat in its current suboptimal environment. Once it is virtualized or at the very least replicated, I can undertake the task of bringing it up to date, or starting from scratch.

Unfortunately, when asking a question, there's always somebody in the crowd ready to answer the one you didn't ask instead.
 
There are good reasons you got this answer.*) It's unlikely you'll be able to upgrade such an old system successfully. So, focusing on a backup of important data first will be the best course of action no matter what.

Then, wanting to turn off that old box ASAP is understandable, and of course having it run in a VM would be one way to do that (you'd probably STILL need a new install on a second VM, see above). But given the likelihood**) that the disk is indeed not fully readable, you'll probably find no way to do it.

You could of course try to copy files instead of a disk image (with tar), but that's no guarantee of success either. So better to first back up as much important data as possible.

---
*) So this machine has always been in production, which is worse. Apart from the risk, you now have two problems instead of just one… I'm not saying it's your fault, I can't know, but *someone* should realize that money spent on sensible maintenance is money well spent.
**) The only other reason (well, apart from really weird hardware issues that just make one of the processes crash) would be that the SSH connection breaks, which will also break the pipe. Which one is more likely?
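If you want to rule out the connection dropping on an idle or flaky link, you could at least add client keepalives, roughly like this:
Code:
2.2.2.2# ssh -o ServerAliveInterval=30 -o ServerAliveCountMax=6 root@1.1.1.1 "dd if=/dev/da0 bs=4096 | gzip -1 -" | dd of=image.gz status=progress | pv -s 68G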
 
As you stated yourself, conv=noerror,sync can help (if the disk is faulty you may still get lucky). But you can also test the disk locally: use of=/dev/null and see if the read goes through. If the disk is faulty, other issues will most likely show up as well (dmesg, syslog, ...).
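Something like this, run directly on the old machine (not over ssh):
Code:
1.1.1.1# dd if=/dev/da0 of=/dev/null bs=4096
If it stops at roughly the same spot every time, that points at the disk.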

If there's real pressure to move this server you could try it at the filesystem level instead. Create a tarball and copy it over to the destination. Check if that works.
Prepare a vanilla 8.x virtual server (ISO images are still available) and then restore that tarball onto the virtual server.
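Roughly like this (all paths are just examples; in practice you'd exclude /dev, /tmp and the tarball itself, and you need enough free space for it):
Code:
1.1.1.1# tar czpf /var/tmp/oldbox.tar.gz -C / .
1.1.1.1# scp /var/tmp/oldbox.tar.gz root@2.2.2.2:/var/tmp/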

EDIT: I see Zirias mentioned the tarball already. Well, I guess it's not wise to leave a half-written message sitting around and send it later without checking for updates first. :)
 
EDIT: I see Zirias mentioned the tarball already.
Just for completeness: No need to create a "tarball", you can also pipe the output of tar into another tar instance on the target machine ;)

But good idea to test locally to rule out network problems.
 
that would break the pipe and therefore the transfer
I've had rsync fail over ssh due to bad RAM - which broke the pipe in my case. Don't think that's the case here but might be something to look at if none of the suggestions above work.

Unfortunately, when asking a question, there's always somebody in the crowd ready to answer the one you didn't ask instead.
Think that's probably down to experience. If someone is busy shooting themselves in one foot and asking how they can shoot themselves in the OTHER foot the temptation is to help them save the other foot! Because when they blow off the second foot they might come back here asking why no-one warned them or asking more questions about their new situation. But I think it's common everywhere - ask about X and you'll also get told about Y and Z which may or may not be useful.
 
No need to create a "tarball", you can also pipe the output of tar into another tar instance on the target machine
Well, technically you'd still be creating the tarball; it's just that it will be on the other side. ;-) If I were creating it I'd use the ssh way too. But since the OP is using dd this way, I figured he knows how to copy it over.
 
Well, technically you'd still be creating the tarball; it's just that it will be on the other side.
OR it only "lives" in the form of a transient stream when the other side is another (extracting) tar ;) Yes, sure, just wanted to mention it, as the original command shows a similar approach at the block level.

untested sketch: ssh root@oldmachine -C "cd /; tar cpzlf - ." | tar xzf -
 
To debug this, I would suggest two things. First, look at dmesg or /var/log/messages to see whether there were actually any I/O errors. Second, rewrite your pipeline so you save the exit status (the number that is usually 0 if everything works) of every step. That way, you'll know which part of the pipeline failed and can target your debugging better.
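A rough sketch of that second part, assuming bash is installed on the receiving machine (the base /bin/sh has no PIPESTATUS):
Code:
#!/usr/local/bin/bash
ssh root@1.1.1.1 "dd if=/dev/da0 bs=4096 | gzip -1 -" \
    | dd of=image.gz bs=4096 status=progress
# one exit code per local pipeline stage: ssh, then the local dd
echo "exit codes (ssh, local dd): ${PIPESTATUS[*]}"
# ssh itself returns the status of the remote command (the remote gzip);
# to see the remote dd's status you would have to echo it to stderr on the far side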

Honestly, I suspect read errors too. And if that's really it (once debugging confirms it), you'll have to find a way around it. For example: instead of dd'ing the whole disk, just read a few files that contain important data and abandon the rest. Painful, but might work. But before you make any decisions, make sure you have the data.
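For instance (the directory list is purely an illustration; pick whatever actually holds your data and configuration):
Code:
2.2.2.2# ssh root@1.1.1.1 "tar czpf - -C / etc usr/local/etc var/db usr/home" > important.tar.gz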
 