Solved ecc error reporting functionality

diversity · Dec 7, 2020

EDIT: the question is how to remove lines containing a substring from a file using sed command. It's at the bottom of the test code sample

I think that Python makes the most sense to me, as a POSIX shell probably can't easily send emails. (EDIT: the previous statement does not make any sense.) But most importantly, I think Python might come in handy in the future for AI (TensorFlow and the like), and I can easily test Python code. (EDIT: Also, TrueNAS, which is based on FreeBSD, seems to either have Python available by default or did not remove it.)

so here goes;

Code:

#test.py
import os;
#run a command line statement
os.system("cat /var/log/messages | grep MCA: > MCAmessages.cat"); #create new file MCAmessages.cat
## as i know how to trigger real ECC errors i can easily test this.
#already confirmed that /var/log/messages contains MCA: lines as soon as ECC errors are triggered

lineCnt = len(open("MCAmessages.cat").readlines(  ));
if lineCnt > 0:
    print("sending email"); #TODO: implement sending of email
else: #TODO: actually do nothing so remove the else statement
    print ("no lines containing MCA: substring found in messages"); #INFO: only for testing

#remove all lines from /var/log/messages that contain MCA:
os.system("sed -i.bak '/ MCA: /' ./MCAmessages.cat"); #INFO: test it first on MCAmessages.cat
#results in;
#sed: 1 "/ MCA: /": command expected
#does anyone know the proper syntax? regex does also not seem to work
#perhaps a good resource on where the syntax is explained?

richardtoohey2 · Dec 7, 2020

What's going to stop your script hitting the same memory errors repeatedly and sending the warnings repeatedly? e.g. if there was a memory fault at 9 a.m. and you have your script running every hour, what's to stop it emailing at 10, 11, 12, etc.?

Or is that what the "remove" part is about?

You are using Python to do a lot of OS calls which seems to defeat the point. Make a real shell script, calling the OS functionality directly, or use the Python libraries. You don't HAVE to, but might make things simpler.

I'd have a last_mem_error file.

I'd read the first part of /var/log/messages lines to get the date/time stamp e.g. Dec 8 08:30:14

If that's a memory error line, I check the contents of last_mem_error - if no file or it's empty, I'd store the time stamp in the file and send the email
If there is already a value in the file, I'd compare the date/time of the line just read to work out if a newer error message and if it was, send the email & update the file

That would still mean more than one email if there are a lot of matching lines in /var/log/messages - so maybe read the log backwards, so you'd read & handle the most recent error first.

But there's a lot of ways to get this job done. And my approach will no doubt have flaws.

diversity · Dec 7, 2020

richardtoohey2 said:
What's going to stop your script hitting the same memory errors repeatedly and sending the warnings repeatedly? e.g. if there was a memory fault at 9 a.m. and you have your script running every hour, what's to stop it emailing at 10, 11, 12, etc.?

Or is that what the "remove" part is about?

Indeed, to avoid sending more than one email when MCA errors are encountered

richardtoohey2 said:
You are using Python to do a lot of OS calls which seems to defeat the point. Make a real shell script, calling the OS functionality directly, or use the Python libraries. You don't HAVE to, but might make things simpler.

Thx, I will consider your suggestion. For now I am having fun with Python.

And the script and command line calls will probably stay quite uncomplicated, so I do not expect to hit a brick wall any time soon. I just need something to scratch an itch real fast, and found that Python seems to scratch several itches at once.

richardtoohey2 said:
I'd have a last_mem_error file.

I'd read the first part of /var/log/messages lines to get the date/time stamp e.g. Dec 8 08:30:14

If that's a memory error line, I check the contents of last_mem_error - if no file or it's empty, I'd store the time stamp in the file and send the email
If there is already a value in the file, I'd compare the date/time of the line just read to work out if a newer error message and if it was, send the email & update the file

That would still mean more than one email if there are a lot of matching lines in /var/log/messages - so maybe read the log backwards, so you'd read & handle the most recent error first.

But there's a lot of ways to get this job done. And my approach will no doubt have flaws.

Sweet thanks a lot. I will be sure to take a better look at your flow logic soon.

Would you or anyone happen to know where I can learn how to use sed to remove lines from files? Or does anyone know how to use sed, or another command, to remove every line that contains the " MCA: " substring from a file?
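On the sed error itself: an address such as '/ MCA: /' only selects lines, and sed still expects a command after it (d is the one that deletes the selected lines). Since the script is already Python, the same job can also be done without sed. The following is only a minimal sketch; remove_lines_containing is a made-up helper name, and it assumes the file is small enough to read into memory.

```python
import shutil


def remove_lines_containing(path, needle, backup_suffix=".bak"):
    """Rewrite `path` without the lines that contain `needle`.

    A copy of the original is kept at path + backup_suffix first,
    similar in spirit to what sed -i.bak does.
    Returns the number of lines that were removed.
    """
    # Keep a backup before touching the file.
    shutil.copy2(path, path + backup_suffix)

    with open(path, "r", errors="replace") as src:
        lines = src.readlines()

    # Plain substring test, no regular expressions involved.
    kept = [line for line in lines if needle not in line]

    with open(path, "w") as dst:
        dst.writelines(kept)

    return len(lines) - len(kept)
```

It could be tried on the test file first, for example remove_lines_containing("./MCAmessages.cat", " MCA: "), before pointing it at anything important.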

richardtoohey2 · Dec 8, 2020

It looks like you are set on removing lines from /var/log/messages - I'm not sure that's advisable, but it's your computer!

All you want to do is read a file, record where you last got to (so you don't re-email about the same problem over-and-over), and maybe email if you found something. It shouldn't be a destructive process (risking trashing your system's logs). But that's my 2c!

if file last-date-checked found, open it and read in last-date-checked else set last-date-checked to 1970-01-01
set memory-error-found false
open /var/log/messages
loop through the lines in /var/log/messages
    if log date > last-date-checked
        if line includes memory error set memory-error-found true
        set last-date-checked to log date
end loop
write last-date-checked to last-date-checked file
if memory-error-found send email
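The flow above could look roughly like this in Python. This is only a sketch under a few assumptions: log lines start with a syslog-style stamp such as "Dec 8 08:30:14" (which carries no year, so the caller passes one in), month names are in English, and the function names and the idea of storing the stamp in ISO format are made up for illustration. It needs Python 3.7 or newer for fromisoformat.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse a leading 'Dec  8 08:30:14' style stamp; return None if absent."""
    parts = line.split(None, 3)
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            "%d %s %s %s" % (year, parts[0], parts[1], parts[2]),
            "%Y %b %d %H:%M:%S",
        )
    except ValueError:
        return None


def load_last_checked(path):
    """Read the stored stamp; fall back to 1970-01-01 if missing or empty."""
    try:
        with open(path) as f:
            return datetime.fromisoformat(f.read().strip())
    except (OSError, ValueError):
        return datetime(1970, 1, 1)


def save_last_checked(path, stamp):
    """Store the newest stamp seen so the next run can skip older lines."""
    with open(path, "w") as f:
        f.write(stamp.isoformat())


def scan_for_new_errors(lines, last_checked, year, marker="MCA:"):
    """Return (new lines containing marker, newest timestamp seen)."""
    newest = last_checked
    found = []
    for line in lines:
        stamp = parse_syslog_time(line, year)
        # Skip lines without a stamp and lines already handled earlier.
        if stamp is None or stamp <= last_checked:
            continue
        if marker in line:
            found.append(line.rstrip("\n"))
        if stamp > newest:
            newest = stamp
    return found, newest
```

A run would load the stamp, call scan_for_new_errors on the open log file, save the returned stamp, and send one email if the returned list is not empty. Because the year is guessed, entries around New Year would need extra care.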

SirDice · Dec 8, 2020

[Mod: Thread moved to Userland Programming and scripting]

diversity · Dec 10, 2020

richardtoohey2 said:
if file last-date-checked found, open it and read in last-date-checked else set last-date-checked to 1970-01-01

Thanks. I will consider your suggestion once I have gotten the cron to run. The sed issue has been resolved in the meantime.

This is what I have so far

Code:

# Import smtplib for the actual sending function
import smtplib
# Import the email modules we'll need
from email.message import EmailMessage

import os;
#run a command line statement
os.system("cat /var/log/messages | grep MCA: > MCAmessages.cat"); #create new file MCAmessages.cat

lineCnt = len(open("MCAmessages.cat").readlines(  ));

if lineCnt > 0:
    print("sending email");
    # Import smtplib for the actual sending function
    import smtplib

    # Import the email modules we'll need
    from email.message import EmailMessage

    msg = EmailMessage()
    msg.set_content("MCA error(s) detected on TrueNAS") #TODO: send the contents of MCAmessages.cat

    msg['Subject'] = "MCA error(s) detected on TrueNAS"
    msg['From'] = "your email goes here"
    msg['To'] = "your email goes here"

    # Send the email via our own SMTP server.
    s = smtplib.SMTP("your smtp server goes here")
    s.send_message(msg)
    s.quit()
else: #TODO: actually do nothing so remove the else statement
    print ("no lines containing MCA: substring found in messages"); #INFO: only for testing

#remove all lines from /var/log/messages that contain MCA: as to prevent getting stuck in a loop
os.system("sed -i.bak '/ MCA: /d' ./MCAmessages.cat"); #INFO: test it first on MCAmessages.cat

When running this in the shell using python {filename.py}, I get an email. All good so far.

But when I install it with crontab -e as 0/10 0/1 * * * * * python /root/test.py #(run every 10 seconds), the cron job gets installed but I am not getting any output or emails.

Does anyone have a suggestion on how to proceed?

richardtoohey2 · Dec 10, 2020

You usually have to specify the full path of executables in cron.

So do "which python" to see the full path of your system python and put that full path in your cron table.

richardtoohey2 · Dec 10, 2020

That didn't quite work as expected (I don't use Python so haven't got it set up properly), but something like this depending on your Python set-up and version

Code:

 % which python3.7
/usr/local/bin/python3.7

ralphbsz · Dec 10, 2020

As Richard said, Python is in /usr/local, so cron might not see that. So what you do is to mark your script as a native python script, by adding the "shebang" in the first line:

Code:

#!/usr/local/bin/py...

For python, it is usually a better idea to specify /usr/local/bin/python3 in that line. That's a soft-link which points to the current version, and you won't have to update your scripts when you get updates such as 3.7 -> 3.8. Within python 3, compatibility from version to version should be excellent, so there is virtually no risk during updates.

Matter-of-fact, with python 2 becoming unsupported within the next few weeks, it might even be that /usr/local/bin/python now points at python3, in which case you should probably just use /usr/local/bin/python in the shebang line. But check first, your script will not run under python 2 (why? print function instead of print statement).

Here is a little bit of code review:

Code:

#!/usr/local/bin/python

# This program does ... describe its purpose in 1 or 2 sentence.
# How do you use it?
# Do you need to explain how it works internally?

# It was written by ... on ...
# Copyright ...

# Import smtplib for the actual sending function
import smtplib
# Import the email modules we'll need
from email.message import EmailMessage

# Please collect all import statements at the top. It makes it easier to see what this program uses.
import os;

# Run a command line statement  # I fixed the capitalization and spacing
os.system("cat /var/log/messages | grep MCA: > MCAmessages.cat"); #create new file MCAmessages.cat

# Suggestion for improvement, a bit more work:
# Run this using subprocess (add "import subprocess" at the top):
p = subprocess.run("cat ... | grep ...", shell=True, capture_output=True, text=True)
# Then it returns the output, without you having to read the file, and it doesn't leave an intermediate file on disk:
lines = p.stdout.splitlines()
lineCnt = len(lines)
# This allows you to also do more detailed error handling.

lineCnt = len(open("MCAmessages.cat").readlines(  ));

if lineCnt > 0:
# You are being silly: You are comparing an integer (lineCnt) to zero. But that is exactly the built-in conversion
# of an integer to a boolean value. So what you write is exactly identical to if lineCnt:
# But wait: lineCnt is just the length of a list of lines. The built-in conversion of a list to a boolean value is
# also the length. So what you wrote is exactly equivalent to:
if open(...).readlines():
    print("sending email");
    # Import smtplib for the actual sending function
    import smtplib # You already have that at the top. No need to repeat it.
   
    # Import the email modules we'll need
    from email.message import EmailMessage

    msg = EmailMessage()
    msg.set_content("MCA error(s) detected on TrueNAS") #TODO: send the contents of MCAmessages.cat
    # Ah! If you had used subprocess, you would have had the lines ready to go in a list!
    # But I would like to make a suggestion: If the number of lines is huge (thousands, or millions), send only
    # the first few, and dump the rest in a file. Sending multi-megabyte e-mail messages is not polite.
    # Something like this might work:
    if len(lines) > 1000:
        # lines is a list, so join each slice into a string before concatenating
        msg.set_content(('Detected %d errors, here are the first and last 500 lines:\n' % len(lines)) +
            '\n'.join(lines[:500]) + '\n...\n' + '\n'.join(lines[-500:]))

    msg['Subject'] = "MCA error(s) detected on TrueNAS"
    msg['From'] = "your email goes here"
    msg['To'] = "your email goes here"

    # Send the email via our own SMTP server.
    s = smtplib.SMTP("your smtp server goes here")
    s.send_message(msg)
    s.quit()
else: #TODO: actually do nothing so remove the else statement
    print ("no lines containing MCA: substring found in messages"); #INFO: only for testing

#remove all lines from /var/log/messages that contain MCA: as to prevent getting stuck in a loop
os.system("sed -i.bak '/ MCA: /d' ./MCAmessages.cat"); #INFO: test it first on MCAmessages.cat

There are two bigger structural problems which I didn't want to put in the code review. First, you are using a temporary file. Do you know what directory you are running in? I would explicitly put that file into /tmp/MCA... And what happens if some idiot runs two copies of the program at once? They will both mess with the temporary file, and they can step on each other. The common fix is to put the PID of the running process in the file name. But running the program under subprocess.run(...) is much easier.
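A minimal sketch of the subprocess.run route, with no intermediate file at all. It assumes Python 3.7 or newer (for capture_output and text) and a grep that can be found on the PATH; grep_lines is a made-up helper name.

```python
import subprocess


def grep_lines(path, pattern):
    """Return the lines of `path` that contain the fixed string `pattern`."""
    result = subprocess.run(
        # -F treats the pattern as a plain string, -- ends option parsing.
        ["grep", "-F", "--", pattern, path],
        capture_output=True,
        text=True,
    )
    # grep exits with 0 on a match, 1 on no match, and above 1 on errors.
    if result.returncode > 1:
        raise RuntimeError("grep failed: " + result.stderr.strip())
    return result.stdout.splitlines()
```

Something like lines = grep_lines("/var/log/messages", "MCA:") then gives the list directly, and if lines: decides whether to send the email. Passing the command as a list also avoids going through a shell.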

Second: I would not modify /var/log/messages. If something goes wrong, you just destroyed the evidence you need. Instead, I would save the current copy in a safe place (like /tmp/old.messages), then do a diff between old and new (using diff or using python), and only look for messages in the diff.
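The old-versus-new comparison can be done in Python with difflib from the standard library. This is a sketch only: the function names are invented, the snapshot location is up to the caller, and it reads both files fully into memory, which is fine for a modest log but slow for a very large one.

```python
import difflib
import os


def added_lines(old_lines, new_lines):
    """Return the lines that appear in new_lines but were not in old_lines."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both mean new material on the b side.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added


def new_matching_lines(log_path, snapshot_path, marker="MCA:"):
    """Return new lines containing marker, then refresh the snapshot."""
    with open(log_path, errors="replace") as f:
        content = f.read()
    new_lines = content.splitlines()

    old_lines = []
    if os.path.exists(snapshot_path):
        with open(snapshot_path, errors="replace") as f:
            old_lines = f.read().splitlines()

    hits = [line for line in added_lines(old_lines, new_lines) if marker in line]

    # Save exactly what was just examined, so nothing is skipped next run.
    with open(snapshot_path, "w") as f:
        f.write(content)
    return hits
```

The log itself is only ever read, so the evidence stays in place, and a second run over an unchanged log returns an empty list.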

a6h · Dec 10, 2020

ralphbsz said:
Matter-of-fact, with python 2 becoming unsupported within the next few weeks, it might even be that /usr/local/bin/python now points at python3

Good point. Is that going to happen?

ralphbsz · Dec 11, 2020

Honestly, I don't know. I always hand-tweak my python installation: I clean up the soft-links in /usr/local/bin, and add them to /usr/bin.

diversity · Dec 12, 2020

for a complete solution check this out

Monitor and email an alert for ECC Memory errors on OS-level

Purpose Monitor /var/log/messages for MCA related messages and email them to you when found. MCA messages contain, for example, memory ECC error reportings and much more. More details: https://en.wikipedia.org/wiki/Machine_Check_Architecture Why...

www.truenas.com

diversity · Dec 12, 2020

I'd like to think I helped with that solution

SOLVED - The usefulness of ECC (if we can't assess it's working)?

Thank you Mastakilla. I must admit I am out of my league ;( also on your stability issue. I did not even notice first time around the amount of disks you have and hence the need for a HBA ;( Although very very much appreciated I am having the hardest time debugging your bash script. I am trying...

www.truenas.com

diversity · Dec 12, 2020

Although triggering ECC errors as Mastakilla suggests is the best way to do it (as it can trigger both single- and multi-bit errors), it is also enormously time consuming and hard to get right.

For those who are strapped for time and want a quick fix, one can short pin 2 and 5 of a memory slot. It's fast and, when done with a steady hand, safe to do. It won't give you multi-bit ECC errors though, only single-bit ones.
