Thursday, November 4, 2010

IOS: EIGRP Peering Flapping, Auth Failure - %DUAL-5-NBRCHANGE: IP-EIGRP: Auth failure

This is an actual case we recently raised with Cisco after seeing unexplained EIGRP flaps between two of our devices. The peering had been working for more than a year; in fact, it had never had any issues since it was brought online last year.



Scenario:

EIGRP peering flaps between two devices due to authentication failure. The output of show logging is flooded with the following syslogs, repeated continuously:
Nov 2 01:30:43.436 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is down: Auth failure


Nov 2 01:30:45.040 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is up: new adjacency

Nov 2 01:30:47.316 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is down: Auth failure

Nov 2 01:30:48.820 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is up: new adjacency
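To get a quick measure of how often the peering flaps, the syslogs above can be tallied by neighbor, state and reason. The following is only a minimal sketch: the regular expression is an assumption based on the four sample lines shown here, and the names LOG, NBRCHANGE and count_transitions are hypothetical.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt above.
LOG = """\
Nov 2 01:30:43.436 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is down: Auth failure
Nov 2 01:30:45.040 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is up: new adjacency
Nov 2 01:30:47.316 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is down: Auth failure
Nov 2 01:30:48.820 GMT: %DUAL-5-NBRCHANGE: IP-EIGRP(0) 2: Neighbor 10.10.10.2 (GigabitEthernet1/1) is up: new adjacency
"""

# Pattern inferred from the sample lines only; other IOS releases may format
# this message differently.
NBRCHANGE = re.compile(
    r"%DUAL-5-NBRCHANGE: IP-EIGRP\(\d+\) (?P<asn>\d+): "
    r"Neighbor (?P<neighbor>\S+) \((?P<interface>[^)]+)\) "
    r"is (?P<state>up|down): (?P<reason>.+)$"
)


def count_transitions(log_text):
    """Count neighbor state changes keyed by (neighbor, state, reason)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NBRCHANGE.search(line.strip())
        if match:
            key = (match["neighbor"], match["state"], match["reason"])
            counts[key] += 1
    return counts


counts = count_transitions(LOG)
for (neighbor, state, reason), total in sorted(counts.items()):
    print(f"{neighbor} {state:<4} {reason}: {total}")
```

Run against a full show logging capture, a tally like this makes it easy to see whether every "down" event carries the Auth failure reason or whether other causes are mixed in.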

Topology is straightforward:
Router1 Gi1/1 <-----> Gi2/2 Router2


Router1 Gi1/1 = 10.10.10.1/24
Router2 Gi2/2 = 10.10.10.2/24

MD5 Authentication is used and the same key string is configured on both devices
Router1#show key chain MYCHAIN

 key-chain MYCHAIN
  key 1 -- text "myCiscoChain"
   accept lifetime (always valid) - (always valid) [valid now]
   send lifetime (always valid) - (always valid) [valid now]
Router1#
Router1#
Router1#
Router1# show run | begin key chain
key chain MYCHAIN
 key 1
  key-string 5 098123456SA679
...
Router1# show run int Gi1/1
interface GigabitEthernet1/1
 ip address 10.10.10.1 255.255.255.0
 ip authentication mode eigrp 2 md5
 ip authentication key-chain eigrp 2 MYCHAIN
...

Problem:
The issue was caused by a Level 2/Severe bug in the IOS image running on one of the devices. Bug details below:

CSCdu73495 - All routes to network not seen because of invalid md5 authentication
http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCdu73495

Enhanced Interior Gateway Routing Protocol (EIGRP) routes cannot be seen even when message digest algorithm 5 (MD5) is authenticated on all routers. This problem is intermittent and may occur when authentication is turned off and subsequently turned back on again. Sometimes, this problem occurs just after authentication is enabled.  

Workaround: This problem is intermittent and may be resolved by disabling and reenabling authentication a second time. This problem may automatically be resolved after a few minutes.


EIGRP Authentication problems & flaps on unrelated links


This bug is a duplicate of CSCdu73495, which causes authentication-related breakage in establishing peers that eventually clears up on its own after an indeterminate time. It can be triggered by bouncing peers/interfaces. You will not encounter this issue if you disable EIGRP authentication. CSCdu73495 was resolved in later versions of 12.1E IOS.

EIGRP neighbour cant be established if use MD5 authentication

On a C2610, an EIGRP neighbour could be established with MD5 authentication the first time. After a shut/no shut of the C2610 Ethernet interface, the neighbour can no longer be established. Over the serial interface it works fine.

EIGRP MD5 Authentication Breaks Neighbor Adjacencies over LANE

In a LANE environment with 3 or more devices running EIGRP, when upgrading from 12.1(6)E4 to 12.1(10)E4 on 7500's, EIGRP neighbor relationships may not be formed between devices running 12.1(10)E4. This is verified by performing a on one of the devices running 12.1(10)E4. The workaround for this scenario is to wait an unpredictable amount of time for the neighbors to converge, or remove and re-add EIGRP authentication from the interfaces on the affected devices. Also, neighbors can be statically configured in order for EIGRP to use unicast, rather than multicast.

2921-EIGRP flap due to bad TLV received on serial interface

Symptom: EIGRP flaps observed due to retransmission retry limit exceeded. Bad TLV error messages are seen in the logs. Conditions: Issue seen when 2921 replaces the 2611 device with similar configs.

Workaround: None. Apart from 2921, customer is using 2611 that works fine.


Known Affected Versions (Not a comprehensive list):
12.1(9)M

12.1(26)M
15.0M
12.1(8b)E15
12.3(12e)M
 
Fixed-In (Not comprehensive list):

12.1(10.2)M
12.2(4.2)M
12.0(30)SZ4
12.0(32)S6b
12.0(32)S7
12.0(32)SY4
12.0(32.3)S
12.1(6)E11
12.1(10.5)E
12.1(10.5)EC
12.2(4.2)PI
12.2(4.2a)DA
12.2(5.1)S
12.2(6.4)B
12.2(6.4)PB
12.2(15)BW
12.2(15)BX
12.2(15)ZN
12.0(32.11.1)SY


Workarounds:

1. Disable then re-enable EIGRP authentication;
2. Instead of MD5, use clear text authentication; or
3. Disable EIGRP authentication.

Permanent Fix:
Upgrade IOS version.

Because the problem is intermittent and unpredictable, either use clear text authentication or disable authentication outright if an immediate IOS upgrade is not possible. In the meantime, bouncing (disabling and re-enabling) the authentication can serve as a quick fix.

Tuesday, October 26, 2010

IOS: show interface FastEthernet mod/port - Detailed

The show interface output for a physical interface:
Router#sh interfaces FastEthernet 6/1

FastEthernet6/1 is up, line protocol is up (connected)
 Hardware is C6k 100Mb 802.3, address is 0009.11f3.8848 (bia 0009.11f3.8848)
 MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
  reliability 255/255, txload 1/255, rxload 1/255
 Encapsulation ARPA, loopback not set
 Full-duplex, 100Mb/s
 input flow-control is off, output flow-control is off
 ARP type: ARPA, ARP Timeout 04:00:00
 Last input 00:00:14, output 00:00:36, output hang never
 Last clearing of "show interface" counters never
 Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
 Queueing strategy: fifo
 Output queue :0/40 (size/max)
 5 minute input rate 0 bits/sec, 0 packets/sec
 5 minute output rate 0 bits/sec, 0 packets/sec
 1117058 packets input, 78283238 bytes, 0 no buffer
 Received 1117035 broadcasts, 0 runts, 0 giants, 0 throttles
 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
 0 watchdog, 0 multicast, 0 pause input
 0 input packets with dribble condition detected
 285811 packets output, 27449284 bytes, 0 underruns
 0 output errors, 0 collisions, 2 interface resets
 0 babbles, 0 late collision, 0 deferred
 0 lost carrier, 0 no carrier
 0 output buffer failures, 0 output buffers swapped out

Show interface output (physical interface) explained
 
up, line protocol up (connected) - the "up" is the physical layer (OSI layer 1) status of the link; the "line protocol up" is the data link layer (OSI layer 2) status of the link. Possible outputs are as follows:
 
up, line protocol up
up, line protocol down
down, line protocol down

Hardware - the interface hardware type, as well as the hardware/MAC address.

Description - the user-specified interface description as configured in the interface configuration mode.

MTU 1500 bytes - Maximum Transmission Unit.
BW -  Bandwidth.
DLY - Delay (in microseconds).

reliability - reliability, as fraction of 255 (where 255/255 = 100% reliability), exponential average over 5 minutes.
txload - current output load, as fraction of 255 (where 255/255 = 100% saturation), exponential average over 5 minutes.
rxload - current input load, as fraction of 255 (where 255/255 = 100% saturation), exponential average over 5 minutes.
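As a small worked example of the x/255 notation, the sketch below converts the three fractions from the sample output into percentages. The function name load_percentages is hypothetical, and the parsing assumes the fields appear exactly as in the line shown above.

```python
import re

# Line copied from the sample "show interfaces" output above.
LINE = "reliability 255/255, txload 1/255, rxload 1/255"


def load_percentages(line):
    """Return {field: percent} for every 'name n/255' pair found in the line."""
    result = {}
    for name, numerator, denominator in re.findall(r"(\w+) (\d+)/(\d+)", line):
        result[name] = 100.0 * int(numerator) / int(denominator)
    return result


percentages = load_percentages(LINE)
for name, percent in percentages.items():
    print(f"{name}: {percent:.1f}%")
```

A txload of 1/255 therefore works out to roughly 0.4% of the configured bandwidth, which matches the idle 0 bits/sec rates in the sample output.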

Encapsulation - current data link/layer 2 encapsulation of the interface.
loopback - defines if loopback (hardware or software) is enabled or disabled.

Full-duplex, 100Mb/s - current duplex and speed settings of the interface.

ARP Type - the Address Resolution Protocol type enabled.
ARP Timeout - the time in hh:mm:ss that each entry remains in the ARP cache before being removed.

Last input 00:00:14, output 00:00:36 - the time in hh:mm:ss since the last packet was received (input) or transmitted (output) by the interface.

output hang - the time in hh:mm:ss since the interface was last reset because of a transmission that took too long.

Last clearing of "show interface" counters - the time since the interface counters were last cleared via the "clear counters" command.

Input queue: 0/2000/0/0 (size/max/drops/flushes) - the input queue counters and thresholds; the first number (size) is the current number of frames in the queue; the second number (max) is the maximum number of frames in the queue before it starts dropping; the third number (drops) is the number of frames dropped because the max was exceeded; the last number (flushes) is the number of low-priority frames dropped due to Selective Packet Discard (SPD) algorithm when CPU is overloaded.

Total output drops: 0 - total number of packets dropped because the output queue is full; high output drops may indicate mismatched bandwidth settings of this and the remote connecting interface.

Queueing strategy - one of First-In/First-Out (fifo), priority-list, custom-list, or weighted-fair.

Output queue :0/40 (size/max) - The number of packets in the output queue. Size is the current number of frames in the queue. Max is the number of frames the queue can hold before it starts dropping frames.

5 minute input/output rate - The average input and output rate seen by the interface in the last five minutes. The interval can be changed via the "load-interval" interface command.

packets input, bytes - Total number of error-free packets received by the system. Total number of bytes, including data and MAC encapsulation, in the error free packets received by the system.

no buffer - The number of packets received and discarded because there is no buffer space. Can be caused by broadcast storms.

Received broadcasts - The number of broadcast/multicast packets received by the interface.

runts - The number of packets that are discarded because they are smaller than the minimum packet size.

giants - The number of packets that are discarded because they exceed the maximum packet size (MTU).

throttles - The number of times the receiver on the port was disabled, possibly due to buffer or processor overload.

input errors - Includes runts, giants, no buffer, CRC, frame, overrun, and ignored counts.

CRC - Number of packets where the CRC generated by the originating far-end device does not match the checksum calculated from the data received; usually indicates noise or transmission problems on the LAN interface or the LAN bus itself.

frame - Number of packets received incorrectly, having a CRC error and a noninteger number of octets (alignment errors).

overrun - Number of times the receiver hardware was unable to hand received data to a hardware buffer because the input rate exceeded the receiver's ability to handle the data.

ignored - Number of received packets ignored by the interface because the interface hardware ran low on internal buffers.

watchdog - Number of times watchdog receive timer expired. It happens when receiving a packet with length greater than 2048.

multicast - Number of multicast packets received.

pause input - Number of times the connected device requests a traffic pause when its receive buffer is almost full. This counter is incremented for informational purposes, since the switch accepts the frame. The pause packets stop when the connected device is able to receive the traffic.

input packets with dribble condition detected - A dribble bit error indicates that a frame is slightly too long. This frame error counter is incremented for informational purposes, since the switch accepts the frame.

packets output, bytes - Total number of error-free packets transmitted by the system. Total number of bytes, including data and MAC encapsulation, in the error free packets transmitted by the system.

underruns - Number of times that the transmitter has been run faster than the switch can handle. This can occur in a high throughput situation where an interface is hit with a high volume of bursty traffic from many other interfaces all at once. Interface resets can occur along with the underruns.

output errors - Sum of all errors that prevented the final transmission of datagrams out of the interface.

collisions - Number of times a collision occurred before the interface transmitted a frame to the media successfully. Collisions are normal for interfaces configured as half duplex but must not be seen on full duplex interfaces. If collisions increase dramatically, this points to a highly utilized link or possibly a duplex mismatch with the attached device.


interface resets - Number of times the interface transitioned from up to down to up.
 
babbles - Number of times that the transmit jabber timer expired. A jabber is a frame longer than 1518 octets (which exclude framing bits, but include FCS octets), which does not end with an even number of octets (alignment error) or has a bad FCS error.
 
late collision - Number of times a late collision occurred. A late collision occurs when two devices transmit at the same time, and neither side of the connection detects a collision. The reason for this occurrence is because the time to propagate the signal from one end of the network to another is longer than the time to put the entire packet on the network. The two devices that cause the late collision never see that the other is sending until after it puts the entire packet on the network. Late collisions are not detected by the transmitter until after the first 64-byte slot time. This is because they are only detected in transmissions of packets longer than 64 bytes.



deferred - Number of frames that have been transmitted successfully after they wait because the media was busy. This is usually seen in half duplex environments where the carrier is already in use when it tries to transmit a frame.

lost carrier - The number of times the carrier was lost in transmission. This is usually caused by a bad cable. Check the physical connection on both sides.


no carrier - Number of times the carrier was not present in the transmission. This is usually caused by a bad cable. Check the physical connection on both sides.


output buffer failures, output buffers swapped out - Number of failed buffers and the number of buffers swapped out. A port buffers the packets to the Tx buffer when the rate of traffic switched to the port is high and it cannot handle the amount of traffic. The port starts to drop the packets when the Tx buffer is full and thus increases the underruns and the output buffer failure counters. The increase in the output buffer failure counters can be a sign that the ports are run at an inferior speed and/or duplex, or there is too much traffic that goes through the port.
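With this many counters, it helps to pull out only the ones that are non-zero. The sketch below is one possible way to do that, not a complete parser: it assumes the comma-separated "value name" layout of the sample output above, and the names SAMPLE, WATCHED and parse_counters are hypothetical.

```python
# Counter lines copied from the sample "show interfaces" output above.
SAMPLE = """\
 1117058 packets input, 78283238 bytes, 0 no buffer
 Received 1117035 broadcasts, 0 runts, 0 giants, 0 throttles
 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
 285811 packets output, 27449284 bytes, 0 underruns
 0 output errors, 0 collisions, 2 interface resets
 0 babbles, 0 late collision, 0 deferred
 0 lost carrier, 0 no carrier
"""

# Counters worth flagging when they are non-zero (an illustrative selection).
WATCHED = [
    "no buffer", "runts", "giants", "throttles", "input errors", "CRC",
    "frame", "overrun", "ignored", "underruns", "output errors",
    "collisions", "interface resets", "babbles", "late collision",
    "deferred", "lost carrier", "no carrier",
]


def parse_counters(text):
    """Map counter name -> value for each 'value name' item in the text."""
    counters = {}
    for line in text.splitlines():
        for piece in line.split(","):
            piece = piece.strip()
            if piece.startswith("Received "):
                piece = piece[len("Received "):]
            value, _, name = piece.partition(" ")
            if value.isdigit() and name:
                # "bytes" appears on both the input and output lines, so the
                # later value overwrites the earlier one; it is not in WATCHED.
                counters[name] = int(value)
    return counters


counters = parse_counters(SAMPLE)
nonzero = {name: counters[name] for name in WATCHED if counters.get(name, 0) > 0}
print(nonzero)
```

For the sample interface, the only watched counter above zero is the two interface resets.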

Reference: Troubleshooting Switch Port and Interface Problems
http://www.cisco.com/en/US/products/hw/switches/ps700/products_tech_note09186a008015bfd6.shtml

CatOS: %SYS-3-PORT_OUT_DISCARD flood on disabled switchports

Scenario
Multiple %SYS-3-PORT_OUT_DISCARD syslogs are generated for a switchport that is currently disabled (administratively shut down).

2010 Jan 04 16:04:10 EST -05:00 %SYS-3-PORT_OUT_DISCARD:Port 4/47 detected 6029 output discard error(s) in last 30 minutes

MySwitch> sh port status 4/47
# = 802.1X Authenticated Port Name.

Port  Name                 Status     Vlan       Duplex Speed       Type
----- -------------------- ---------- ---------- ------ ----------- ------------
4/47                       disabled   66         full   100         10/100/1000
MySwitch>

MySwitch> (enable) sh run 4
... [output omitted] ...
#module 4 : 48-port 10/100/1000BaseT Ethernet
set port disable 4/3-48


Explanation
This is due to a known bug in the CatOS version.

CSCeg24345 - WS-X6748-GE-TX: Tx counters increment on not connected ports

On a WS-X6748-GE-TX module in a Catalyst 6500 running CatOS 8.2(2), a port that is not connected may increment Tx counters as well as ifOutErrors, ifOutDiscards and txCRC.

This bug impacts CatOS releases prior to 8.6 and occurs on WS-X6748-GE-TX blades. It is a cosmetic bug and is non-service impacting.

Versions:
  • 1st Found-In: 8.2(2)


  • Fixed-In : 8.4(3.2), 8.4(4), 8.6(0.85)TAL

Note that although this is not service-impacting, it may wreak havoc on your monitoring system, as it will generate one syslog for each disabled port every thirty minutes.
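If the monitoring system can be scripted, one option is to check whether the port named in each syslog is administratively disabled before alerting on it. The sketch below illustrates the idea using the config line and syslog from this post. It assumes a single "mod/first-last" range (real configs may use comma-separated lists, which are not handled here), and expand_ports is a hypothetical helper.

```python
import re


def expand_ports(spec):
    """Expand 'mod/first-last' or 'mod/port' into a set of 'mod/port' strings."""
    module, _, ports = spec.partition("/")
    first, _, last = ports.partition("-")
    last = last or first
    return {f"{module}/{port}" for port in range(int(first), int(last) + 1)}


# Both lines are copied from the output shown earlier in this post.
config_line = "set port disable 4/3-48"
syslog = (
    "2010 Jan 04 16:04:10 EST -05:00 %SYS-3-PORT_OUT_DISCARD:Port 4/47 "
    "detected 6029 output discard error(s) in last 30 minutes"
)

disabled = expand_ports(config_line.split()[-1])
match = re.search(
    r"%SYS-3-PORT_OUT_DISCARD:Port (\d+/\d+) detected (\d+) output discard",
    syslog,
)
port, discards = match.group(1), int(match.group(2))

# A discard alarm on a disabled port is a candidate for the cosmetic bug.
is_cosmetic_candidate = port in disabled
print(port, discards, is_cosmetic_candidate)
```

Alarms for ports that are not in the disabled set should still be treated as real output discards.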

    Monday, August 16, 2010

    IOS: %SYS-SP-3-CPUHOG: RFSS_server_action

    Scenario:
    A Cat6K throws the following syslog messages:
    Jul 18 01:48:12.362 EDT: %SYS-SP-3-CPUHOG: Task is running for (4000)msecs, more than (2000)msecs (0/0),process = RFSS_server_action.

     
    -Traceback= 4045D2CC 4045F5F8 4045F504 4047F45C 4047ED38 4047F31C 40481F5C 40489F04 4048A3CC 4048AF5C 40485DE4 4048B1AC 404816A8 402E41D8 40451534 4029A764

     
    Jul 18 01:48:14.366 EDT: %SYS-SP-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (1/0),process = RFSS_server_action.

     
    -Traceback= 4045D2A8 4045F5F8 4045F504 4047F45C 4047ED38 4047F31C 40481F5C 40489F04 4048A3CC 4048AF5C 40485DE4 4048B1AC 404816A8 402E41D8 40451534 4029A764

     
    Jul 18 01:48:18.370 EDT: %SYS-SP-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (2/1),process = RFSS_server_action.

     
    -Traceback= 4045D2A8 4045F5F8 4045F504 4047F45C 4047ED38 4047F31C 40481F5C 40489F04 4048A3CC 4048AF5C 40485DE4 4048B1AC 4048B504 404817C0 402E440C 40451660
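When these messages arrive in bulk, summarising them by process name and run time shows quickly whether a single process is responsible. This is a minimal sketch: the pattern is an assumption derived from the three sample messages above, and the names LOG, CPUHOG and events are hypothetical.

```python
import re

# The three CPUHOG messages copied from the log above (tracebacks omitted).
LOG = """\
Jul 18 01:48:12.362 EDT: %SYS-SP-3-CPUHOG: Task is running for (4000)msecs, more than (2000)msecs (0/0),process = RFSS_server_action.
Jul 18 01:48:14.366 EDT: %SYS-SP-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (1/0),process = RFSS_server_action.
Jul 18 01:48:18.370 EDT: %SYS-SP-3-CPUHOG: Task is running for (2000)msecs, more than (2000)msecs (2/1),process = RFSS_server_action.
"""

# Pattern inferred from the sample messages only.
CPUHOG = re.compile(
    r"%SYS-SP-3-CPUHOG: Task is running for \((?P<ran>\d+)\)msecs, "
    r"more than \((?P<limit>\d+)\)msecs \(\d+/\d+\),process = (?P<process>\w+)"
)

events = []
for line in LOG.splitlines():
    match = CPUHOG.search(line)
    if match:
        events.append(
            {
                "process": match["process"],
                "ran_ms": int(match["ran"]),
                "limit_ms": int(match["limit"]),
            }
        )

processes = {event["process"] for event in events}
longest_ms = max(event["ran_ms"] for event in events)
print(processes, longest_ms)
```

In this case every message points at RFSS_server_action, which is what led to looking at the flash disk.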

    Description:
    The traceback shown indicates a problem with writing into the flash disk. Running the privileged-mode command "dir disk1:" will cause your login session to apparently hang for a few minutes. After that the logs will be filled up with a new batch of the above %SYS-SP-3-CPUHOG syslogs and traceback messages.
     
    ------------------ show disk1: all ------------------
    172683264 bytes available (83296256 bytes used)
    ******** ATA Flash Card Geometry/Format Info ********
    ATA CARD GEOMETRY
     Number of Heads: 16
     Number of Cylinders 978
     Sectors per Cylinder 32
     Sector Size 512
     Total Sectors 500736
     
    ATA CARD FORMAT
     Number of FAT Sectors 245
     Sectors Per Cluster 8
     Number of Clusters 62495
     Number of Data Sectors 500596
     Base Root Sector 598
     Base FAT Sector 108
     Base Data Sector 630
     
    %Error show disk1: (TF I/O failed in data-in phase)
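As a sanity check on the output above, the reported geometry and format figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers printed by show disk1: all. The variable names are hypothetical, and the check says nothing about the I/O error itself.

```python
# Values copied from the "show disk1: all" output above.
heads = 16
cylinders = 978
sectors_per_cylinder = 32  # label as printed in the output
sector_size = 512
total_sectors_reported = 500736

clusters = 62495
sectors_per_cluster = 8
bytes_available = 172683264
bytes_used = 83296256

# Geometry: heads x cylinders x the per-cylinder figure gives the total sectors.
total_sectors = heads * cylinders * sectors_per_cylinder

# Format: clusters x cluster size gives the usable bytes (available + used).
usable_bytes = clusters * sectors_per_cluster * sector_size

print(total_sectors, total_sectors == total_sectors_reported)
print(usable_bytes, usable_bytes == bytes_available + bytes_used)
```

Both checks agree with the printed values, so the card reports a self-consistent geometry and format even though reading the data fails.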
     
    Workaround/Resolution:
    1. Reseat the compact flash card.
    2. If error still occurs, reformat the flash card.
    3. If error still occurs, replace the flash card.

    Monday, July 5, 2010

    FortiOS v3.00 MR5 - CPU Usage Too High

    Problem:

    A FortiGate 3600 running version 3.00 MR5 Patch 2 keeps sending high-CPU SNMP traps to the SNMP trap servers. CPU utilization is confirmed to be high, based on the output of “get system performance status” or the GUI. The output of “diag sys top” confirms that the “merged_daemons” process uses 99% of the total CPU, then shortly drops to 14%.


    Cause:
    This is due to the bug documented below:

    0062617: race condition in flgd can cause merged_daemons to spin
    The merged_daemons was constantly in the 'R' state and consuming 99% of CPU (when top is first started, the usage will display as 99% -- the usage will decrease to 14% while top is running).

    Fix: Build: 0566


    Workaround:
    Restart merged_daemons as follows:
    • Enter diag sys top and take note of the PID of merged_daemons
    • Enter diagnose sys kill 11 [pid]
    Note that merged_daemons may still climb back up to 99%.


    Resolution:
    Upgrade to FortiOS MR6 or later.

    Monday, January 4, 2010

    IOS: %EARL_L3_ASIC-SP-3-INTR_WARN: EARL L3 ASIC: Non-fatal interrupt Packet Parser block interrupt

    Dec 18 09:54:43.989 JST: %EARL_L3_ASIC-SP-STDBY-3-INTR_WARN: EARL L3 ASIC: Non-fatal interrupt Packet Parser block interrupt
    Dec 18 09:54:43.993 JST: %EARL_L3_ASIC-SP-3-INTR_WARN: EARL L3 ASIC: Non-fatal interrupt Packet Parser block interrupt

    Description
    These messages indicate that the switch received an invalid packet containing a Layer 3 IP checksum error. Older IOS releases normally drop such packets silently. In some IOS releases, the switch reports this condition to warn users that one or more outside devices are sending IP packets with checksum errors and/or an incorrect length.

    See CSCdz10360 (Need a CLI to be able to disable L3 error checking in HW) regarding this enhancement.

    Workaround
    These messages are purely informational. You may either:


    1. SPAN all the VLANs, look at the Layer 3 source IP address of the invalid packets, then remove the device generating them. Unfortunately the switch does not track the IP address, so the only way to find where the invalid packets are coming from is to sniff every suspected VLAN.


    2. Configure the following (a new config option added by means of CSCdz10360):
      no mls verify ip checksum ---> stops checking for packet checksum errors
      no mls verify ip length ---> stops checking for packet length errors
      no mls verify ip length minimum ---> stops the minimum-length check on IP packets
      no mls verify ip same-address ---> stops checking for packets with equal source and destination IP addresses


    3. Do nothing, as these messages are purely informational.

    IOS: %ETHCNTR-3-LOOP_BACK_DETECTED : Keepalive packet loop-back detected on [chars]

    Scenario
    The switch reports this error message, and the port is forced to linkdown:
    %ETHCNTR-3-LOOP_BACK_DETECTED : Keepalive packet loop-back detected on [chars]

    Oct 2 10:40:13: %ETHCNTR-3-LOOP_BACK_DETECTED: Keepalive packet loop-back detected on GigabitEthernet0/1
    Oct 2 10:40:13: %PM-4-ERR_DISABLE: loopback error detected on Gi0/1, putting Gi0/1 in err-disable state


    Description
    The problem occurs because the keepalive packet is looped back to the port that sent the keepalive. Keepalives are sent on the Catalyst switches in order to prevent loops in the network. Keepalives are enabled by default on all interfaces. You see this problem on the device that detects and breaks the loop, but not on the device that causes the loop.

    Workaround
    Issue the no keepalive interface command in order to disable keepalives. Disabling keepalives prevents the interface from being err-disabled, but it does not remove the loop.

    Permanent Fix
    In Cisco IOS Software Release 12.2(x)SE-based releases and later, keepalives are not sent on fiber and uplink interfaces by default. Upgrading the IOS version to this or later images should prevent the above issue in the first place.