This post was originally written back in 2012 in my previous job. I’ve reposted it here in case it’s of any use in the future.
How can you improve Internet connectivity for an office on a budget?
This was a problem we had recently for a customer of ours. The customer did not have a budget for a highly available Internet connection from an ISP. Their existing service was a standard business/consumer cable connection that met all of their needs most of the time and was about all they felt they needed to pay for. However, they needed to insure against the occasional outage that is not uncommon for such a service (one or two hours a month, say). Spending thousands of pounds was not an option.
The service did indeed fail every few weeks, and the provider had been unable to improve reliability, so the choice remained: tolerate the outages or look for an alternative, more reliable provider.
However, the customer already had a Cisco 1811 ISR router (circa £600) which would be able to provide route failover if only there was a second Internet connection.
The solution we offered and implemented was to install a second low cost Internet connection from another supplier, and to implement IP SLAs to handle failover. The existing router had all of the features required using only static routes and without any routing protocols.
The following post is very Cisco oriented, but similar features may be available with alternative vendors’ equipment.
The cable connection was the faster of the two, so it would be the active connection. The DSL connection would be the standby and only come into use when the cable connection failed.
The DSL connection was tested and then connected to a free Layer 3 port on the router. It was added to the routing table thus:
ip route 0.0.0.0 0.0.0.0 FastEthernet0 192.168.1.1 10
ip route 0.0.0.0 0.0.0.0 FastEthernet1 192.168.2.1 20
The lower administrative distance (the final number on each line) for FastEthernet0 means that all traffic will be routed out of that interface unless the interface itself goes down. However, if the interface is up and the ISP has a failure upstream, then the connection will not automatically fail over. More on that later.
Address translation and route maps
Outgoing connections need to be translated to the correct public IP address depending on what route is taken. (This is in contrast to higher availability solutions where the IP addresses might be rerouted.) The existing NAT rules were simply duplicated for the new connection. Here, using route maps, the connection is translated to the correct IP address depending on which interface the connection leaves:
ip access-list extended LAN-WAN
 remark Inside to Outside traffic
 permit ip 192.168.100.0 0.0.0.255 any
!
route-map NAT-FE0 permit 10
 match ip address LAN-WAN
 match interface FastEthernet0
!
route-map NAT-FE1 permit 10
 match ip address LAN-WAN
 match interface FastEthernet1
!
ip nat inside source route-map NAT-FE0 interface FastEthernet0 overload
ip nat inside source route-map NAT-FE1 interface FastEthernet1 overload
At this stage, the failover can be tested by manually shutting down FastEthernet0. (It may also be necessary to clear the NAT translation table.) If all is correct, Internet connectivity should remain available.
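A minimal sketch of that manual test, assuming FastEthernet0 is the primary interface as in the configuration above. Lines starting with ! are comments, and the exact output (and occasionally command availability) varies between IOS releases, so treat this as a guide rather than a transcript.

```
! Simulate a failure of the primary link
configure terminal
 interface FastEthernet0
  shutdown
 end
!
! The default route should now point at the backup next hop (192.168.2.1)
show ip route 0.0.0.0
!
! Flush translations that were built against the primary interface
clear ip nat translation *
show ip nat translations
!
! Restore the primary link once the test is complete
configure terminal
 interface FastEthernet0
  no shutdown
 end
```

Clearing the translation table interrupts existing sessions, so this is best done in a quiet period.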
The final step to providing failover is remarkably simple.
First, a suitable upstream IP address needs to be found for each connection to monitor for availability. Your ISPs have probably provided equipment that serves as the next hop. However, monitoring that equipment is not suitable, because it will remain reachable when the connection has failed further upstream.
Perform a trace route for each connection to determine a suitable test candidate. Here, upstream routers have been identified that can be monitored for availability.
Now, SLA rules can be defined to monitor the two connections.
ip sla 1
 icmp-echo 10.123.42.5 source-interface FastEthernet0
 threshold 500
 timeout 1000
 frequency 5
 history lives-kept 2
 history filter all
ip sla schedule 1 life forever start-time now
ip sla 2
 icmp-echo 10.235.111.52 source-interface FastEthernet1
 threshold 500
 timeout 1000
 frequency 5
 history lives-kept 2
 history filter all
ip sla schedule 2 life forever start-time now
!
track 1 ip sla 1 reachability
 delay down 15 up 120
track 2 ip sla 2 reachability
 delay down 15 up 120
Above, two separate SLAs have been configured. Each independently pings its upstream router every 5 seconds, and notes a problem if there is no response within 1 second (1000ms). However, we want to avoid invoking a failover if the disruption is short, so the target must be unreachable for 15 seconds before the track object is marked down. We also want to avoid flapping (if the connection is going down every minute, bringing it back each time will only cause frustration), so the track object only returns to up after the target has been reachable again for 120 seconds.
You can now check the current status with the ‘show track’ command. The command should show ‘Reachability is Up’ for each connection. To put the rules into effect, the routing table must be modified to use the rules. In the following, the routes have been altered to reference the track objects, and so a failure will cause the route entry to be disabled.
ip route 0.0.0.0 0.0.0.0 FastEthernet0 192.168.1.1 10 track 1
ip route 0.0.0.0 0.0.0.0 FastEthernet1 192.168.2.1 20 track 2
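Once the tracked routes are in place, the following show commands are one way to confirm that the probes, track objects and routes agree with each other. This is a sketch only: the commands are standard IOS, but some older releases use different keywords for the SLA commands (for example "ip sla monitor"), so check against your own image.

```
! State of each track object, including the configured up/down delays
show track
!
! Latest result of each ICMP echo probe
show ip sla statistics
!
! Static routes bound to track objects and their current state
show ip route track-table
!
! The default route actually installed right now
show ip route 0.0.0.0
```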
One more thing…
Once a failover does occur, the route is effectively gone from the routing table. That applies equally for traffic originating from the router as it does for traffic going across it. That includes the monitoring traffic. To allow the router to detect the resolution of a service outage after a failover, explicit routes need to be added to the routing table for each monitored router:
ip route 10.123.42.5 255.255.255.255 FastEthernet0 192.168.1.1 10
ip route 10.235.111.52 255.255.255.255 FastEthernet1 192.168.2.1 10
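As a quick check that the monitoring traffic is pinned to the right link, something like the following can be run from the router. It assumes the monitored addresses and interfaces used in the examples above; substitute your own.

```
! Each monitored address should resolve to its own /32 static route
show ip route 10.123.42.5
show ip route 10.235.111.52
!
! Probe each target, sourcing the ping from the link it monitors
ping 10.123.42.5 source FastEthernet0
ping 10.235.111.52 source FastEthernet1
```

If either host route is missing, the probe for that link follows whichever default route is currently active, and a failed connection may never be detected as recovered.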
You now have a simple failover between two basic Internet connections using standard features available in most Cisco IOS routers, while avoiding the use of routing protocols and expensive availability solutions from service providers.
So far, in the last two months the primary connection (which is the fastest and preferred, but unfortunately the less reliable) has been down on three occasions for an average of 3 hours each time. Already the improvements have paid off.
In future articles I will elaborate on expanding the configuration to include load-balancing traffic to make better use of the redundant connection, and how incoming traffic for services such as web and email can continue to be served during a service provider failure.
* There are caveats with this post’s selection of router to monitor. If the ISP has a failure somewhere further upstream then your monitoring will fail to notice it. Alternatively, if the ISP reorganises the network and the router you were monitoring is removed, the monitoring will incorrectly believe the connection is down. These problems must be considered on a per connection basis and may need to be solved with the assistance of the particular service providers involved.