Solved: ECC error reporting functionality

Linux and Windows can report corrected and uncorrected ECC errors as they happen. In some scenarios this is very important: it lets you see a failing memory module coming from miles away and deal with it proactively, rather than after the memory has already died.

Is there anything similar in FreeBSD? If so, how would I go about having a scheduled task run every so often to scan some kind of log and send out an email when it finds ECC errors mentioned?
 
Is there anything similar in FreeBSD?
Yes, of course.

If so, how would I go about having a scheduled task run every so often to scan some kind of log and send out an email when it finds ECC errors mentioned?
Look in /var/log/messages. You're going to see messages like these:
Code:
Dec  3 09:23:34 testhost MCA: Bank 7, Status 0xcc00170000010092
Dec  3 09:23:34 testhost MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Dec  3 09:23:34 testhost MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
Dec  3 09:23:34 testhost MCA: CPU 0 COR (92) OVER RD channel 2 memory error
Dec  3 09:23:34 testhost MCA: Address 0x1280a63f00
Dec  3 09:23:34 testhost MCA: Misc 0x1422a9686

You can also use sysutils/mcelog to help decode those messages.
 
Cool, great, and thanks. Any chance you know of some resource that explains how to automate a process that scans for those errors and sends an email when found?
 
Any chance you know of some resource that explains how to automate a process that scans for those errors and sends an email when found?
A shell script? Using Perl/Python/Lua/whatever? Using a monitoring solution like Munin, Nagios or Zabbix? I mean there are countless ways of achieving this.
 
Point being, I am new to this whole Unix/Linux world, and a bit of guidance would go a long way. I'd hate to be forced back into the Microsoft world again, but at least their community always points to useful resources.
 
Not sure what exactly you are after.

SirDice has given you guidance about looking in the logs and the mcelog utility. He then advised some languages you can use, some other tools (like Munin) that might be of interest, and suggested you script it yourself (which is very good advice, but might depend on the audience).

Code:
tail -f /var/log/messages | grep "memory error"
might be exactly what you need. But how would we know?

You might need to be more explicit about exactly what you want to do. Is there a specific utility that works in Linux that you have in mind? Is there a FreeBSD version of it (have a look in www.freshports.org)? EDIT: something like logwatch? https://www.freshports.org/sysutils/logwatch/ Logwatch is a customizable, pluggable log-monitoring system. It will go through your logs for a given period of time and make a report in the areas that you wish with the detail that you wish.

As soon as SirDice gave you some advice, you moved the goal posts "really small footprint" - well, you can't get smaller than writing a shell script.

People are quite willing to help, but you do have to do some homework and be clear in what you are after. FreeBSD isn't Windows, and many FreeBSD users don't use Windows, so they won't know what you are used to or what you are expecting.

No-one is forcing you to use Linux, or FreeBSD or "back into the microsoft world again".
 
As soon as SirDice gave you some advice, you moved the goal posts "really small footprint" - well, you can't get smaller than writing a shell script.
If I came across like that then my humble apologies. As a total noob I might sometimes 'shift' the goal unintentionally.

The goal is to have some kind of process that sends an email when single-bit or multi-bit ECC errors are detected.

Clearly I have no idea yet how to create a shell script or how to automate it, but I'll try to learn over time. All I had hoped for is a pointer to a source that explains how to do that on Unix (FreeBSD). Of course I can spend countless hours finding many sources myself, only to end up reading the outdated ones. Hence my request for community support.

I will also have a look at the suggestions made by all of you, thanks for that btw. I appreciate it.

I just come from a world where prior knowledge is either a given or compensated with the abundance of information available online.

Please forgive me, I have more questions to ask in the future I am sure.

respect
 
ralphbsz, do you mean that newcomers should stay away? And unix is only for the initiated?
No, I mean that the easiest solution to monitoring this is probably to write a small script.

It is possible that there are pre-built packages (*) that already do this. If yes, you can find them and use them. A small search among packages might help. Honestly, I haven't looked ... the machines I have at home don't have ECC (not because I don't like ECC, but because it isn't available in the class of machines I use at home). The machines I use at work either have such monitoring tools built into the "BIOS" (which on industrial-grade servers is quite a complex beast, and integrated into systems for gathering data from many machines), or the operations staff has tools that monitor ECC errors for us, so I don't have to worry about it. My hunch is that there is nothing other than mcelog, which SirDice already pointed out. And even using that probably takes a while of adapting and configuring. And since it doesn't have a daemon, you still need to write scripts to get warning e-mails out of it.

You could try adapting generic monitoring solutions (such as Nagios). I've used Nagios, and installing and configuring it takes hours. And if you want to adapt it to monitor /var/log/messages for specific ECC things, you'll be writing scripts.

So I think the easiest answer is indeed to write a small script. You could for example put it in crontab, and once a day search /var/log/messages, and if you find ECC errors, just send an e-mail to the administrator. It might even be possible to tie it into the syslog mechanism, so it runs automatically whenever the current log is detached and compressed (I think newsyslog has a hook for such functionality). You can make it super simple, or highly complex (tie it to a database, mix it with other data about system health), and so on.

I fail to see how any of this has to do with newcomers or the initiated. It's a script, which gets written, installed (/usr/local/sbin is a good place for it), and hooked up.
 
Clearly I have no idea yet how to create a shell script or how to automate it, but I'll try to learn over time.
We all started at the bottom. Definitely learn some scripting, it's super useful for just about everything.
All I had hoped for is a pointer to a source that explains how to do that on Unix (FreeBSD).
This is still my go-to resource when it comes to scripting, even after 20 or so years I still have to look stuff up.
 
Clearly I have no idea how to create a shell script yet, ...
Ah, I see. You can't run this race, because you have to get to the starting point of it first, and you don't know where that is. Well, that's a bit of a problem.

What's a shell script? Da stellen wir uns mal ganz dumm ... (old German joke from a classic movie: a physics teacher explaining how a steam engine works starts with "let's pretend we're really stupid ..." to the class, and then demonstrates that he knows nothing, but is friendly and funny). Any command you issue from the shell is a part of a shell script.

Like: "fgrep ECC /var/log/messages". Good. Useful. Might not work (I can't remember whether ECC is in upper or lower case). So tune and tweak it by trial and error. Now, we need to get this recorded in a mail: "fgrep -i ecc /var/log/messages | mail -s "ECC error report" root". You're done (but better read some man pages to figure out why I added -i and -s to the commands). There's your very first version. Learn how to put a header line on this (search the web for shebang line), copy it into /usr/local/sbin/diversity_ecc_monitor_v0, and edit /etc/crontab to run it once a day.

EDIT: I forgot to tell you that when you save the command in a script, you'll have to do "chmod a+x" on it. You would have figured that out sooner or later, but it would be frustrating.
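
For reference, that first version could look roughly like the sketch below. It is only an illustration, not a finished tool. The pattern "memory error" is taken from the sample log lines earlier in the thread rather than the literal string ECC, and the LOG and MAILER variables and the function name ecc_report are names made up here so the pieces can be swapped out while experimenting.

```shell
#!/bin/sh
# First-draft sketch. Assumptions: LOG, MAILER, ecc_report and the search
# pattern are illustrative names and values, not part of any FreeBSD tool.

LOG="${LOG:-/var/log/messages}"
MAILER="${MAILER:-mail}"

# Mail every matching log line to root, like the one-liner above.
# -i makes the match case-insensitive, -s sets the subject of the mail.
ecc_report() {
    grep -i 'memory error' "$LOG" | "$MAILER" -s "ECC error report" root
}
```

Saved as a script, the file would end with one more line that simply calls ecc_report. Like the one-liner, this version still sends a mail when nothing was found.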

Now, is this a GOOD and COMFORTABLE solution? Hell no. To begin with, the message should have time and date in the subject line. Next, if there is no output, you shouldn't get an empty message. By the way, the script needs error handling: mail might fail, fgrep might fail, and so on. Don't wait for it to fail, rather think through "what could possibly go wrong". Even better: You should only get a message when the situation changes (if one ECC error happens Monday, getting an e-mail Monday night is reasonable, getting a second one Tuesday night is not). All that can be done easily ... except for a beginner, the easy things are hard and frustrating. Just keep working at it.
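
To make those wishes a bit more concrete, here is one possible sketch, under stated assumptions. LOG, STATE, MAILER and the function name ecc_check are invented for this example, the state file path is hypothetical, and the pattern "memory error" comes from the sample log lines earlier in the thread. The idea is to remember which matching lines were seen on the previous run and to mail only the ones that are new, with the date in the subject.

```shell
#!/bin/sh
# Sketch only. LOG, STATE, MAILER, ecc_check and the pattern are assumptions
# chosen for illustration; the STATE path in particular is hypothetical.

LOG="${LOG:-/var/log/messages}"
STATE="${STATE:-/var/db/ecc_monitor.seen}"
MAILER="${MAILER:-mail}"

ecc_check() {
    current=$(mktemp) || return 2
    fresh=$(mktemp) || { rm -f "$current"; return 2; }
    status=0

    # Collect all matching lines. grep exits 1 for "no match" and
    # with a higher code when it could not read the file.
    grep -i 'memory error' "$LOG" > "$current" || status=$?
    if [ "$status" -gt 1 ]; then
        echo "ecc_check: cannot read $LOG" >&2
        rm -f "$current" "$fresh"
        return 2
    fi

    # Keep only the lines that were not already seen on the previous run.
    [ -f "$STATE" ] || : > "$STATE"
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' \
        "$STATE" "$current" > "$fresh"

    status=0
    if [ -s "$fresh" ]; then
        # Something new showed up: mail it, with host and date in the subject.
        subject="ECC errors on $(hostname) at $(date '+%Y-%m-%d %H:%M')"
        "$MAILER" -s "$subject" root < "$fresh" || status=3
    fi

    # Remember what has been reported, unless the mail failed.
    if [ "$status" -eq 0 ]; then
        cp "$current" "$STATE" || status=4
    fi
    rm -f "$current" "$fresh"
    return "$status"
}
```

Run from cron, a failed mail leaves the state file untouched, so the same lines are tried again on the next run.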

Also, go get a book about simple shell script programming. Honestly, I don't have a recommendation, having learned shell scripts by trial and error about 30 years ago (on Unix, on other machines considerably before that). Personally, if it weren't for Covid, I would say go to the programming section of your local library, and pick whatever shell script book they have.

Is this enough for a starting point?
 
To be clear, though, you don't have to write a script. Something like logwatch or munin might do what you need. Or you could write a program in any number of programming languages that you compile or run under an interpreter.

On Windows you could write a VBScript (if that's still a thing; maybe it's PowerShell something these days) to monitor the events and send an email using COM (if it's called that), but you might also find a tool that does what you want.

You are the only person who knows exactly what you are trying to do and what's important to you - your time? Low memory footprint? Don't want to install anything, just use what's in a base FreeBSD install? Or do you want an easy-to-use GUI tool that does everything and means you can get things done in 5 minutes and move on?

Something like this does feel very much like a job for a script, and it will be a good learning experience, but if you are time-poor then maybe not.
 
Is this enough for a starting point?
As a starting point, yes! I do have extensive knowledge of programming logic, so it is just a matter of adapting.
 
maybe it's PowerShell something these days
Definitely. It's a big improvement over the old batch-style command scripts. Those were horrible. You had to be a real masochist if you wanted to write large and complex batch scripts. Still, one of my Windows colleagues recently complained that he had to parse a 5 million line text file and PowerShell completely barfed at it. He had to split the file into easier-to-handle chunks. I just had to giggle: grep(1), awk(1) or perl(1) would have made short work of that file in mere seconds.
 
Just donated $100 to the community. I hope it benefits us. I just hope iXsystems starts paying attention to ECC reporting.
 
A shell script? Using Perl/Python/Lua/whatever? Using a monitoring solution like Munin, Nagios or Zabbix? I mean there are countless ways of achieving this.
Ahh yes, I think I read that a daemon is the current way to go. What that daemon runs, I need more time to answer.
Code:
tail -f /var/log/messages | grep "memory error"
might be exactly what you need. But how would we know?
We would know if that sent an email whenever the output of that statement != null.
 
Hmm syntax error ;(.

This forum software's interface and my sluggish fingers on a mobile phone don't play nice.
 
Ahh yes, I think I read that a daemon is the current way to go. What that daemon runs, I need more time to answer.
A fairly simple cron job is probably easier for you to do. Daemons are nice but require quite a bit of code to do right. An added bonus with crontab(1) is that output is automatically emailed, so you don't have to figure out how to do that. Just schedule the script to run once a day, or more often if you like. Keep an eye on the settings in newsyslog.conf(5): log files (especially /var/log/messages) are automatically rotated at specific intervals (or sizes).
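
A minimal sketch of that cron approach, assuming that matching on "memory error" (as in the sample log lines earlier in the thread) is good enough. The function name and the script path mentioned afterwards are made up for illustration. The function prints matching lines and nothing else, so cron only has something to mail when there is a hit.

```shell
#!/bin/sh
# Sketch only: print_memory_errors and the pattern are assumptions.

# Print log lines that look like memory-error reports.
# $1: log file to scan (defaults to /var/log/messages)
print_memory_errors() {
    status=0
    grep -i 'memory error' "${1:-/var/log/messages}" || status=$?
    # grep exits 1 when nothing matched, which is the healthy case here.
    if [ "$status" -le 1 ]; then
        return 0
    fi
    return "$status"
}
```

Saved as, say, /usr/local/sbin/ecc_cron_check with a final line that calls print_memory_errors, an /etc/crontab entry such as "0 6 * * * root /usr/local/sbin/ecc_cron_check" would run it every morning at six. Cron then mails any output to the job's owner, or to the address in MAILTO if the crontab sets one.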
 
As a starting point, yes! I do have extensive knowledge of programming logic, so it is just a matter of adapting.
A "script" doesn't have to be a shell script. In theory there is nothing wrong with writing an "ECC error watching daemon" in Perl, Python, C, assembly, or COBOL, depending on what you're most productive in. In practice, a lot of these choices would be really hard to use; I only added assembly and COBOL as a joke. In practice, it will probably be a shell script or a high-level language (Perl, Python ...).

Ahh yes, I think I read that a daemon is the current way to go. What that daemon runs, I need more time to answer.
Actual daemons (programs that run continuously, and regularly wake up and do something useful) are very hard to write, as SirDice already said. And if it's not a daemon, the question is: when is it run? I like the idea of hooking it into newsyslog and/or cron.

We would know if that sent an email whenever the output of that statement != null.
The mail program is perfectly happy to send a zero-length message, which is probably not a very user-friendly thing to do. So one needs to program around that. For example, save all the ECC error messages in a buffer (memory or file); then, if the buffer is empty, send nothing (or send an "everything is good" message), otherwise send the messages. A lot depends on the desired user interaction.
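
As a rough sketch of that buffering idea: the function below collects the matches in a shell variable and prints either the matches or an all-clear line. The name build_report, the wording of the messages and the pattern "memory error" (taken from the sample log lines earlier in the thread) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: build_report, its wording and the pattern are assumptions.

# Print a report body for the given log file.
# $1: log file to scan
build_report() {
    # Buffer the matches; grep exits 1 when nothing matches, hence "|| true".
    matches=$(grep -i 'memory error' "$1" || true)
    if [ -n "$matches" ]; then
        printf 'Memory errors found in %s:\n%s\n' "$1" "$matches"
    else
        printf 'No memory errors found in %s.\n' "$1"
    fi
}
```

Whether the all-clear branch should produce a mail at all is exactly the kind of user-interaction question raised above. Dropping the else branch gives the "send nothing" behaviour.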

What is the most important thing in software engineering? Tricky question, as there are several correct answers. One of them is: find out what your requirements are. What do you want your ECC error watcher to do? How do you want it to react to certain situations? I would begin by writing the user manual for it, which explains how a human will interact with it. First optimize that user interface description, then actually implement it.
 
The mail program is perfectly happy to send a zero-length message.
mail -E should take care of empty mail bodies. From mail(1):
Code:
     -E      Do not send messages with an empty body.  This is useful for
             piping errors from cron(8) scripts.
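
Applied to the pipeline discussed earlier in the thread, that could look like the sketch below. The function name mail_if_any, the MAILER variable and the pattern "memory error" are assumptions for illustration. Only the -E and -s flags come from mail(1).

```shell
#!/bin/sh
# Sketch only: mail_if_any, MAILER and the pattern are assumptions.

MAILER="${MAILER:-mail}"

# Mail matching log lines; with -E an empty body results in no mail at all.
# $1: log file to scan, $2: recipient
mail_if_any() {
    grep -i 'memory error' "$1" | "$MAILER" -E -s "ECC error report" "$2"
}
```

For example, mail_if_any /var/log/messages root would stay silent on a clean log.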
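There is not much more I can contribute. Ein blindes Huhn trinkt auch mal ein Korn ("even a blind hen has a drink of Korn now and then", a twist on the German proverb about the blind hen that still finds a grain) :).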
 