
VRRP master/master issue on CSS 11501 with 3550


In the setup illustrated, two Cisco CSS 11501 load balancers provided redundancy with fate sharing. The route from the "Servers" networks towards the client network went through an IP address configured on a redundant interface shared by the two load balancers.

The VRRP announcements for the virtual routers holding the redundant interfaces on VLANs A and B travelled between the two load balancers through the two Cisco Catalyst 3550 switches, which were running IOS Version 12.1(19)EA1c (C3550-I9Q3L2-M).

In more detail: each of the two load balancers had one physical link to its corresponding L3 3550, carrying VLANs A and B (on the server side). One ISC link connected the two CSS for adaptive session redundancy (ASR). The link between the two Cisco 3550s was set up as an 802.1Q trunk and transported, among others, VLANs A and B, over which the VRRP communication had to take place.
Although the setup and configuration were double- and triple-checked, each of the load balancers claimed to be master of the virtual router instance on its corresponding VLAN (A or B).
For brevity I will illustrate only the case of the virtual router on VLAN A. The problem seemed to be strongly related to the fact that the CSS were connected to the 3550s through trunk links.
CSS11501_right# show redundant-interfaces

Redundant-Interfaces:

Interface Address: 192.168.0.2 VRID: 1
Redundant Address: 192.168.0.1 Range: 1
State: Master Master IP: 192.168.0.2

CSS11501_left# show redundant-interfaces

Redundant-Interfaces:

Interface Address: 192.168.0.3 VRID: 1
Redundant Address: 192.168.0.1 Range: 1
State: Master Master IP: 192.168.0.3
While searching for this specific problem (both CSS being master), I found that most reported cases were caused by misconfiguration: either an access list was blocking traffic between the two devices, or the VRID was incorrect, and so on. However, there was nothing wrong with the configuration on either the CSS or the 3550s.
Whenever I checked the counter of VRRP announcements received by the load balancer that should have been the slave, the value was always 0.
CSS11501_left# llama
CSS11501_left(debug)# ip scp statistics

totalIpFrames received: 211300
invalidIPFrame: 0 malformedIPFrame: 0
noIngressIPFrame: 0 srcDestSameIPFrame: 0
badIPVersion: 0 badIpHeaderLength: 0
badIpChecksum: 0 badSrcIPFrame: 0
loopbackIPFrame: 0 badIPAddress: 0
badIpDestAddress: 0 zeroTTLIPFrame: 0
badIpProtocol: 0 badIpOptions: 0

Packets received with supported protocol types:
IPPROTO_IP: 0 IPPROTO_ICMP: 12285
IPPROTO_IGMP: 0 IPPROTO_GGP: 0
IPPROTO_TCP: 3129 IPPROTO_EGP: 0
IPPROTO_PUP: 0 IPPROTO_UDP: 47625
IPPROTO_IDP: 0 IPPROTO_TP: 0
IPPROTO_EON: 0 IPPROTO_OSPF: 0
IPPROTO_ENCAP: 0 IPPROTO_VRRP: 0
IPPROTO_OSPF: 0

IP PACKET TO VXWORKS STATISTICS:
packetLeakToVxWorks: 170436
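The check above can also be automated when the output is captured to a text file. The following is a minimal Python sketch, not a Cisco tool: the function names and the SAMPLE text are hypothetical, and it assumes the counters keep the "IPPROTO_NAME: value" layout shown above.

```python
import re


def parse_protocol_counters(text):
    """Return a dict mapping IPPROTO_* names to integer counters.

    Assumes the 'NAME: value' layout of the captured output; if a name
    appears more than once, the last value seen is kept.
    """
    counters = {}
    for name, value in re.findall(r"(IPPROTO_\w+):\s*(\d+)", text):
        counters[name] = int(value)
    return counters


def vrrp_advertisements_seen(text):
    """True if the captured statistics show at least one VRRP packet."""
    return parse_protocol_counters(text).get("IPPROTO_VRRP", 0) > 0


# Hypothetical capture, shortened from the output shown above.
SAMPLE = """\
Packets received with supported protocol types:
IPPROTO_IP: 0 IPPROTO_ICMP: 12285
IPPROTO_TCP: 3129 IPPROTO_EGP: 0
IPPROTO_PUP: 0 IPPROTO_UDP: 47625
IPPROTO_ENCAP: 0 IPPROTO_VRRP: 0
"""

if __name__ == "__main__":
    if vrrp_advertisements_seen(SAMPLE):
        print("VRRP announcements are reaching this CSS")
    else:
        print("No VRRP announcements received by this CSS")
```

Comparing two captures taken a few seconds apart would show whether the counter is increasing, which is a stronger check than a single snapshot.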
As mentioned earlier, the 3550s were running IOS Version 12.1(19)EA1c, while the CSS were running WebNS sg0730203 (07.30.2.03).
I didn't solve the issue myself. I was notified that there was a problem with the IOS version running on the 3550s and that an upgrade to at least an EMI image of 12.1.20 was needed. There is also a bug logged with Cisco, although its setup and configuration are not exactly the same as in the issue presented here.
Here is the bug logged to Cisco.
After upgrading IOS to 12.1.20, the slave load balancer received the VRRP announcements and the initial VRRP negotiation took place correctly.
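To illustrate why lost announcements lead to two masters, here is a simplified Python model of VRRP election. It loosely follows the generic VRRP rule (highest priority wins, ties broken by the higher IP address) and is not the CSS implementation. The Router class, the elect function and the priorities 110 and 100 are assumptions made for the example; only the two interface addresses come from the output above.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass
class Router:
    name: str
    priority: int          # hypothetical values, not taken from the real config
    address: IPv4Address


def elect(routers, adverts_delivered):
    """Return the VRRP state each router settles in.

    If announcements never reach the peers, every router times out
    waiting for a master and promotes itself. Otherwise the router with
    the highest (priority, address) pair stays master and the rest
    fall back to backup (preemption is assumed to be enabled).
    """
    if not adverts_delivered:
        return {r.name: "Master" for r in routers}
    winner = max(routers, key=lambda r: (r.priority, r.address))
    return {r.name: "Master" if r is winner else "Backup" for r in routers}


right = Router("CSS11501_right", 110, IPv4Address("192.168.0.2"))
left = Router("CSS11501_left", 100, IPv4Address("192.168.0.3"))

if __name__ == "__main__":
    # Announcements dropped on the path between the two CSS: both are master.
    print(elect([right, left], adverts_delivered=False))
    # Announcements delivered: only one master remains.
    print(elect([right, left], adverts_delivered=True))
```

In this model the configuration of the two routers is identical in both runs; only the delivery of announcements changes, which matches the observation that nothing was wrong with the CSS configuration itself.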
Reference: CSS Redundancy Configuration Guide

2 comments:

AY said...

Hi,

I have just read the details of the Cisco 11501 and VRRP issue you experienced.

I have the exact same problem and wondered if you managed to get an answer.

I have managed to force the correct behaviour, at least initially:
I start off with both load balancers configured how I want them,
Primary priority 120, preempting
Secondary priority 110

Both loadbalancers declare they are master.

I then reconfigure the secondary with a greater priority than the primary using preempt also. For some reason this behaves correctly and forces the original primary to backup.

I then throw the Secondary back to the original priority of 110 removing preemption. The Primary reassumes mastership and the Secondary correctly becomes Backup.

All is well until the primary unit fails. VRRP mastership moves over to the secondary unit as it should but when the primary unit comes back on line we end up with both loadbalancers becoming master.

I have 2 Catalyst 6509s running 12.2(18)SXD5 where you have your 3550s, and other parts of my configuration differ considerably from yours, but I have the same fundamental issue.

Valentin said...

Hi AY,

The problem I had was resolved by upgrading the IOS on the 3550 to 12.1.20. This issue was caused by the fact that the master VRRP announcements were not reaching the peer.

From the description you made it looks like the VRRP announcements are passing through the 6509s, but you can double check that on the CSS with 'ip scp statistics' in debug mode.

My suggestion is to also have a look at the release notes for the WebNS version installed on the load balancers, as older releases might have problems (e.g. Software Version 7.30.3.03 Open Caveats - CSCeg10594 - The CSS does not correctly handle VRRP announcement upon a link failure being brought back into service by a backup CSS when using VIP interface redundancy.)