Very creepy: PKI key damaged twice

I am using sysutils/bareos-server for backup.
Today I tried to restore some files and got an error instead:
Code:
Error: Encryption session provided an invalid symmetric key: ERR=error:0407109F:rsa routines:RSA_padding_check_PKCS1_type_2:pkcs decoding error
Error: Encryption session provided an invalid symmetric key: ERR=error:04065072:rsa routines:RSA_EAY_PRIVATE_DECRYPT:padding check failed
Error: Encryption session provided an invalid symmetric key: ERR=error:0607A082:digital envelope routines:EVP_CIPHER_CTX_set_key_length:invalid key length

I tried different files and got the same error. Even a file that was saved with the same running tool two minutes before could not be restored!

The bareos tools use a PKI key that is loaded by the sysutils/bareos-client for encryption and signing of the individual files, so this is transparent to the server side. I checked that key on file: everything is allright.

Then I restarted the client process.
Now the files were restored, but a different error was reported:
Code:
Error: OpenSSL digest Verify final failed: ERR=error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding
Error: OpenSSL digest Verify final failed: ERR=error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed
Error: filed/crypto.cc:205 Signature validation failed for file /tmp/bareos-restore/var/mail/admin: ERR=Signature is invalid

This explains what had happened: the secret part of the key must have been damaged, so backups where encrypted correctly with the public part, but the signature created with the secret part was wrong. And decryption failed because it also needs the secret part.

Then I checked which of the daily backups had that flaw: all from Sep 2 to Sep 10, and from Sep 14 until today.
Code:
boot time                                  Wed Sep 11 01:05
shutdown time                              Tue Sep 10 22:54
boot time                                  Fri Aug 30 18:59
shutdown time                              Fri Aug 30 18:20

The damage had happened midflight while the program was running, then disappeared due to reboot, and reappeared three days later again while the program was running.
No other errors of any kind were reported.

How unlikely is this to happen?
 
When you say "the secret key was damaged", do you mean that the on-disk file that contains the key was damaged? That is at least very unlikely (it would take an undetected ECC error, which has probability 10^-15), or ridiculously unlikely (if you are using ZFS, it adds its own CRC and checks it on read).

If you mean that the key was damaged in memory, by a random event (alpha particle, random bit flip), that is also very unlikely, although more possible than a disk error.

I suspect some form of software bug, but it is hard to understand why it would come and go. Could it be that a software upgrade was involved, and for a while client and server were not compatible, but after the upgrade to the other side the problem went away?
 
When you say "the secret key was damaged", do you mean that the on-disk file that contains the key was damaged?

No. The key file checksums are all correct, as compared to the copies in the vault.

If you mean that the key was damaged in memory, by a random event (alpha particle, random bit flip), that is also very unlikely, although more possible than a disk error.

Yes. Thats what must have happened, I don't see another explanation. But, how probable is this to happen two times in a row in the same way?

I suspect some form of software bug, but it is hard to understand why it would come and go. Could it be that a software upgrade was involved, and for a while client and server were not compatible, but after the upgrade to the other side the problem went away?

No, it was version 18.2.6 at Jul 26 already, and still is. Only now, when switching to ports@2019Q4, i get 18.2.6_1 (which is also not a technical change).
And that client daemon was not restarted!
It gets started at boot, then it did work, and after two days the key (in memory!) somehow went bad.
Then at Sep 11 the machine was fully rebooted, things worked correctly again, and after another 3 days they key got bad again, in the same way - while the daemon was still running since boot when I figured the problem.

Okay, if there is a bitflip typically happening after 2-3 days, then no memcheck would find that.
Then, when the machine is booted twice, it is possible that the same data gets placed to the same memory location. But how likely is that, given that the memory is 150% used and regularly paging and moving stuff around?
And finally, while ECC is present and active, the hardware is antique, and I don't really know how that chipset would handle the ECC.

On the software side, there may be a defective pointer damaging something. The bacula/bareos stuff has it's own story. While architecture and features are really great, I cannot say the same for the coding style. When I came across it, I found about a dozen flaws in the code. I considered it worth the effort to start fixing them, but while the first two fixes were accepted, I then experienced a pointless discussion about the definition of "bug", based on the agenda that "we have a backup solution, and we dont want to have bugs". So I quit the efforts.
Later on I got a request by the chief developer - he wanted me to sign a written paper that I have no claims on the code. That seems to have been related to the splitting between bacula and bareos.
Over all, no offense intended, but I think it's not really fun having to deal with these people.
 
And it happened again - after 3 days of system uptime.
I restarted the daemon, did some restore, and during this restore, after about 10 minutes, it happened once again.

Maybe I send them a price inquiry now, let's see how much they charge for fixing their broken code...
 
Back
Top