Solved bge0 errors

Hello!

Just wanted to ask whether this is a hardware error or something wrong with the configuration?

Code:
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 0, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 4, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 5, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 10, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 25, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 0, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 4, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 5, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 10, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 25, val 0xffffffff)
bge0: PHY read timed out (phy 1, reg 1, val 0xffffffff)

After that I cannot restore the network connection any other way than rebooting. The machine has been running for a long time, but it has now given this error twice. Is there any way to further test this issue?
 
Is there any way to further test this issue?
What's the exact card type? Have a look at pciconf -lv. The bge(4) driver covers a lot of Broadcom chips, but yours may be some variant.

The machine has been running for a long time, but it has now given this error twice.
If it has been working correctly for quite some time and suddenly starts showing errors it could be a hardware issue. Anything changed with regards to the OS? Updates, newer version, etc?
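
A minimal sketch for pulling just one device's record out of the pciconf -lv output, so the vendor and device IDs are easy to compare against the bge(4) man page. The helper name pci_record is made up for this example; it only filters text on stdin.

```shell
#!/bin/sh
# pci_record is a hypothetical helper, not part of FreeBSD.
# Reads `pciconf -lv` output on stdin and prints the record for one
# device: its header line plus the indented detail lines that follow.
pci_record() {
    awk -v dev="$1" '
        { c = substr($0, 1, 1) }
        # A line that does not start with whitespace begins a new record.
        c != " " && c != "\t" && c != "" { show = (index($0, dev "@") == 1) }
        show { print }
    '
}

# Usage on the affected host (pciconf exists on FreeBSD only):
#   pciconf -lv | pci_record bge0
```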
 
What's the exact card type? Have a look at pciconf -lv. The bge(4) driver covers a lot of Broadcom chips, but yours may be some variant.


If it has been working correctly for quite some time and suddenly starts showing errors it could be a hardware issue. Anything changed with regards to the OS? Updates, newer version, etc?

Code:
bge0@pci0:4:0:0:        class=0x020000 rev=0x10 hdr=0x00 vendor=0x14e4 device=0x1687 subvendor=0x103c subdevice=0x2215
    vendor     = 'Broadcom Inc. and subsidiaries'
    device     = 'NetXtreme BCM5762 Gigabit Ethernet PCIe'
    class      = network
    subclass   = ethernet

I have also configured some bridges and a VM.
When the machine was new I also had issues with bge and added

Code:
hw.bge.allow_asf=0

to /boot/loader.conf. It was all good after that until yesterday.
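
To double-check that the tunable is really set as intended, something like the following sketch could be used. loader_value is a hypothetical helper that just parses loader.conf-style text on stdin; it assumes one name=value pair per line, with or without double quotes.

```shell
#!/bin/sh
# loader_value is a hypothetical helper, not a FreeBSD tool.
# Reads loader.conf-style text on stdin and prints the value assigned
# to the given tunable, with surrounding double quotes removed.
loader_value() {
    awk -F= -v key="$1" '
        $1 == key { v = $2; gsub(/"/, "", v); print v }
    '
}

# Usage on the affected host:
#   loader_value hw.bge.allow_asf < /boot/loader.conf
# The value the loader actually passed to the kernel can be read with:
#   kenv hw.bge.allow_asf
```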
 
The hw.bge.allow_asf tunable has something to do with sharing the interface with IPMI, by the looks of it. I doubt this has anything to do with PHY errors.
Code:
     hw.bge.allow_asf
             Allow the ASF feature for cooperating with IPMI.  Can cause
             system lockup problems on a small number of systems.  Enabled by
             default.

Are you able to check the port status on the switch it's connected to? Any transmission errors or other issues? Maybe somebody yanked on the cable, have you tried a new one? Should be easy enough to test if the cable may be problematic, even if it's just to rule it out as a possible cause.
 
Are you able to check the port status on the switch it's connected to? Any transmission errors or other issues? Maybe somebody yanked on the cable, have you tried a new one? Should be easy enough to test if the cable may be problematic, even if it's just to rule it out as a possible cause.
I rebooted and it is working now without errors. See what happens.
 
I agree with SirDice that this is likely a hardware problem. I've noticed FreeBSD is far more sensitive to bad cables and connectors than other OSes. Mac OS will happily downgrade to 100BaseTX and say nothing about it. It's a "feature".
 
What's the output of
Code:
netstat -i
Code:
root@Silicium ~ [1]# netstat -i
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge0   1500 <Link#1>      64:51:06:5f:17:85    83192     3     0    54993     0     0
bge0      - 0000-0000-000 1785-fe5f-06ff-66    47878     -     -    38008     -     -
bge0      - fe80::%bge0/6 fe80::6651:6ff:fe      523     -     -      879     -     -
bge0      - 0000-0000-000 0100-0000-0000-50      612     -     -      820     -     -
bge0      - 192.168.5.0/2 Silicium.lan         18786     -     -    15060     -     -
lo0   16384 <Link#2>      lo0                     13     0     0       13     0     0
lo0       - localhost     localhost                0     -     -        1     -     -
lo0       - fe80::%lo0/64 fe80::1%lo0              0     -     -        0     -     -
lo0       - your-net      localhost               12     -     -       12     -     -
vm-in  1500 <Link#3>      12:e5:67:7d:c2:9b      211     0     0      576     0     0
vm-in     - 192.168.5.0/2 192.168.5.113            0     -     -        0     -     -
vm-in     - 0100-0000-000 0101-0000-0000-00        0     -     -    19554     -     -
vm-in     - 0200-0000-000 0201-0000-0000-00       17     -     -       21     -     -
vm-pu  1500 <Link#4>      8e:39:00:ce:36:1f     4352     0     0    54976     0     0
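
The counters that matter for a cable or port problem are in the link-level row (the one whose Network column reads <Link#N>). A small sketch for extracting them, assuming the ten-column layout shown above; link_errors is a made-up helper name.

```shell
#!/bin/sh
# link_errors is a hypothetical helper, not a FreeBSD tool.
# Reads `netstat -i` output on stdin and prints the input and output
# error counters from the link-level row of one interface.
# Assumes the column order: Name Mtu Network Address Ipkts Ierrs Idrop
# Opkts Oerrs Coll, and that the Address column is not empty.
link_errors() {
    awk -v ifn="$1" '
        $1 == ifn && $3 ~ /^<Link#/ { print "Ierrs=" $6, "Oerrs=" $9 }
    '
}

# Usage on the affected host:
#   netstat -i | link_errors bge0
```

For the bge0 row in the listing above this prints Ierrs=3 Oerrs=0. Note that netstat -i truncates long interface names, so the name passed in has to match what is printed in the Name column.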
 
I agree with SirDice that this is likely a hardware problem. I've noticed FreeBSD is far more sensitive to bad cables and connectors than other OSes. Mac OS will happily downgrade to 100BaseTX and say nothing about it. It's a "feature".
Hello!

I suspect something else, certainly not the cable. It may be a driver bug or the actual bge chip, which is on the motherboard and cannot be replaced. But I did this now:
Code:
root@Silicium ~# ifconfig vm-public deletem bge0
and will see if there are any changes. The chip has probably not burned out, but higher load from the VMs may have somehow crashed the bge chip. Maybe, maybe not.
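
To confirm that the interface really left the bridge, the member lines in the ifconfig output for the bridge can be listed. A minimal sketch, assuming ifconfig prints each member as a line starting with "member:"; bridge_members is a hypothetical helper name.

```shell
#!/bin/sh
# bridge_members is a hypothetical helper, not a FreeBSD tool.
# Reads `ifconfig <bridge>` output on stdin and prints one member
# interface name per line.
bridge_members() {
    awk '$1 == "member:" { print $2 }'
}

# Usage on the affected host (bge0 should no longer be listed):
#   ifconfig vm-public | bridge_members
```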

Will see how it runs now.

P.S. The kernel is also the latest:
Code:
root@Silicium ~# uname -a
FreeBSD Silicium 13.1-RELEASE-p5 FreeBSD 13.1-RELEASE-p5 753d65a19 RHODIUM amd64
 
Broadcom made a million "OEM" chips for Dell and almost none of them were tested with the bge driver, which is ancient. They're roughly suitable for system access but not for any sort of network stream. Drivers written by Bill Paul report errors but do nothing to make the NIC work; I've had to hack many drivers over the years since he cranked out 30 drivers back in the day.

Get yourself an Intel add-on card if you can. If you can't, hack the driver so it does a reset after getting the errors. Calling bge_init() should get it working; then figure out how to do it gracefully. If you get the error once in a blue moon it will be usable. If it happens every time you get a burst of traffic you'll need to abandon the bge ports altogether.
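
Short of patching the driver, a userland watchdog can approximate the same idea. The sketch below is only that: it assumes that taking the interface down and up is enough to re-initialize the chip, which is not confirmed for this hardware (the original poster reported that only a reboot helped). The function names and the 30-second interval are made up for the example.

```shell
#!/bin/sh
# Hypothetical helpers, not part of FreeBSD.

# Count "PHY read timed out" lines for one interface in kernel log
# text read from stdin. grep -c exits non-zero on zero matches, hence
# the "|| true".
phy_timeouts() {
    grep -c "^$1: PHY read timed out" || true
}

# Poll dmesg and bounce the interface when the count grows.
# ASSUMPTION: `ifconfig down` followed by `up` resets the chip.
# If the kernel message buffer wraps, the count can drop; in that
# case the baseline is simply updated and no reset is done.
watch_phy() {
    ifn=$1
    last=$(dmesg | phy_timeouts "$ifn")
    while sleep 30; do
        now=$(dmesg | phy_timeouts "$ifn")
        if [ "$now" -gt "$last" ]; then
            logger "watch_phy: new PHY timeouts on $ifn, bouncing it"
            ifconfig "$ifn" down && ifconfig "$ifn" up
        fi
        last=$now
    done
}

# Usage on the affected host, as root:
#   watch_phy bge0
```

If the interface is a bridge member or carries VM traffic, bouncing it will interrupt that too, so this is more of a stopgap than a fix.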
 
They're roughly suitable for system access but not for any sort of network stream.
You've obviously had more experience than me by a long shot, but I've got bge cards in production on Dell servers and nothing obviously bad happening.

Have you got some specific examples of bad Broadcom cards in Dell servers?
 
It has been working so far after I removed bge0 from virtual bridge. No errors. Looks like there is a connection with that.
 
You've obviously had more experience than me by a long shot, but I've got bge cards in production on Dell servers and nothing obviously bad happening.

Have you got some specific examples of bad Broadcom cards in Dell servers?

Problems come in different shapes and sizes. The driver isn't maintained, and these drivers haven't been hardened to recover from a failure. The re driver, for example, works OK until it doesn't, and then it's broken. So it might run for months and then hit an overrun or a bus error, and the thing doesn't have recovery built in. Usually with these "once in a blue moon" occurrences you just build in a reset and you have a few seconds of downtime.
 
It has been working so far after I removed bge0 from virtual bridge. No errors. Looks like there is a connection with that.
I will mark this thread 'solved', meaning it works so far. As it turned out, the driver is not maintained and the chip itself is (to put it mildly) not the best one.
 
My experience with bge driver NICs is that if it's on the mainboard, it's OK for accessing the machine for maintenance or a GUI; I just don't run network streams through them. Same with re.

With free OSes, if some driver guy isn't using a chip, it's probably not safe to use for critical streams. Nobody has the time to debug every odd NIC chip on the market. You really have to stick to what's well maintained and widely used.
 
My experience with bge driver NICs is that if it's on the mainboard, it's OK for accessing the machine for maintenance or a GUI; I just don't run network streams through them. Same with re.
I've got bge working on a number of Dell R430s (5720) and R640s (5720) with no noticeable issues. The newer Rx50s also offer the 5720. But the man page only goes as far as the R300s, and if the driver is unmaintained then it's not a good long-term option, so I'll keep that in mind (the newer Broadcom bnxt driver doesn't work with the other Broadcom NICs in my R450).
 
With free OSes, if some driver guy isn't using a chip, it's probably not safe to use for critical streams. Nobody has the time to debug every odd NIC chip on the market. You really have to stick to what's well maintained and widely used.
The alternative is that the chip manufacturer supports the driver work...
 
The alternative is that the chip manufacturer supports the driver work...
They all support Linux. *Some* of them support FreeBSD. Intel has a dedicated driver guy for FreeBSD. I doubt Broadcom does, since the driver was written at Wind River 20 years ago. When new chips are encountered, somebody just adds the device ID to the driver and hopes that it works. Some people have the skills to update the driver if necessary, but the fact that nobody has taken the bge driver under their wing in two decades tells me that *most* people just avoid motherboards with non-Intel NICs.
 