High availability Internet connectivity on a budget

This post was originally written back in 2012 in my previous job. I’ve reposted it here in case it’s of any use in the future.

How can you improve Internet connectivity for an office on a budget?

This was a problem we had recently for a customer of ours. The customer did not have a budget for a highly available Internet connection from an ISP. Their existing service was a standard business/consumer cable connection that met all of their needs most of the time and cost about as much as they felt they needed to pay. However, they needed to insure against the occasional outage that is not uncommon for such a service (one or two hours a month, say). Spending thousands of pounds was not an option.

The service did indeed fail every few weeks, and the provider had been unable to improve reliability, so the choice remained: tolerate the problem or look for a more reliable alternative provider.

However, the customer already had a Cisco 1811 ISR router (circa £600), which would be able to provide route failover if only there were a second Internet connection.

The solution we offered and implemented was to install a second low-cost Internet connection from another supplier, and to implement IP SLAs to handle failover. The existing router already had all of the required features, and the solution needed only static routes, with no routing protocols.

Figure: The planned configuration (two ISPs connected to one router)

The following post is very Cisco oriented, but similar features may be available with alternative vendors’ equipment.

Basic configuration

The cable connection was the faster of the two, and so that would be the active connection. The DSL connection would be redundant and only come into use when the cable connection failed.

The DSL connection was tested and then connected to a free Layer 3 port on the router. It was then added to the routing table as a second default route, with a higher metric than the route for the cable connection on FastEthernet0.
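As a minimal sketch, the two static default routes might look like the following. The interface name FastEthernet1 for the DSL connection, the next-hop addresses (taken from documentation address ranges) and the metric values 10 and 20 are assumptions for illustration, not the customer's actual values.

```
! Assumed: cable on FastEthernet0, DSL on FastEthernet1
! Assumed next hops: 192.0.2.1 (cable), 198.51.100.1 (DSL)
!
! Preferred default route via the cable connection (lower metric)
ip route 0.0.0.0 0.0.0.0 FastEthernet0 192.0.2.1 10
! Backup default route via the DSL connection (higher metric)
ip route 0.0.0.0 0.0.0.0 FastEthernet1 198.51.100.1 20
```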

The lower metric (administrative distance) for FastEthernet0 means that all traffic will be routed out of that interface unless the interface is down, for example because it has been administratively shut down. However, if the interface is up and the ISP has a failure upstream, then the connection will not automatically fail over. More on that later.

Address translation and route maps

Outgoing connections need to be translated to the correct public IP address depending on which route is taken. (This is in contrast to higher availability solutions where the IP addresses might be rerouted.) The existing NAT rules were simply duplicated for the new connection. Using route maps, each connection is translated to the correct IP address depending on the interface through which it leaves.
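A sketch of how such a pair of NAT rules could be written is shown below. The access list number, the inside network 192.168.1.0/24 and the route map names NAT-CABLE and NAT-DSL are hypothetical, and the interfaces are assumed to be already marked with `ip nat inside` and `ip nat outside` as part of the existing NAT setup.

```
! Assumed inside network; adjust to match the existing NAT access list
access-list 100 permit ip 192.168.1.0 0.0.0.255 any
!
! Match inside traffic that is leaving via the cable interface
route-map NAT-CABLE permit 10
 match ip address 100
 match interface FastEthernet0
!
! Match inside traffic that is leaving via the DSL interface
route-map NAT-DSL permit 10
 match ip address 100
 match interface FastEthernet1
!
! Translate to the address of whichever interface the traffic leaves by
ip nat inside source route-map NAT-CABLE interface FastEthernet0 overload
ip nat inside source route-map NAT-DSL interface FastEthernet1 overload
```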

At this stage, the failover can be tested by manually shutting down FastEthernet0. (It may also be necessary to clear the NAT translation table.) If everything is correct, Internet connectivity should remain available.
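The manual test could be run along these lines. This is only a sketch, and it should be done in a maintenance window, since it interrupts the primary connection.

```
! Simulate a failure of the primary (cable) connection
configure terminal
 interface FastEthernet0
  shutdown
 end
!
! Drop stale translations that still refer to the cable address
clear ip nat translation *
!
! Check Internet connectivity from a client, then restore the interface
configure terminal
 interface FastEthernet0
  no shutdown
 end
```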

IP SLAs

The final step to providing failover is remarkably simple.

First, a suitable upstream IP address needs to be found for each connection to monitor for availability. Your ISPs have probably provided equipment that serves as the next hop. However, monitoring that equipment is not suitable, because it will remain reachable when the connection has failed further upstream.

Perform a traceroute over each connection to determine a suitable test candidate. Here, upstream routers on each service provider's network have been identified that can be monitored for availability.

Figure: Service provider routers identified for monitoring, found by running traceroute over each connection *

Now, SLA rules can be defined to monitor the two connections.
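A sketch of such a pair of SLA and track definitions follows. The monitored addresses 203.0.113.1 and 203.0.113.129 are placeholders from a documentation range, and the syntax shown is that of IOS 15.x style releases; older releases use different keywords for the same features.

```
! SLA 1: ping the upstream router on the cable provider's network
ip sla 1
 icmp-echo 203.0.113.1 source-interface FastEthernet0
 threshold 1000
 timeout 1000
 frequency 5
ip sla schedule 1 life forever start-time now
!
! SLA 2: ping the upstream router on the DSL provider's network
ip sla 2
 icmp-echo 203.0.113.129 source-interface FastEthernet1
 threshold 1000
 timeout 1000
 frequency 5
ip sla schedule 2 life forever start-time now
!
! Track objects: down after 15 seconds of failure, up after 120 seconds of success
track 1 ip sla 1 reachability
 delay down 15 up 120
track 2 ip sla 2 reachability
 delay down 15 up 120
```

The `threshold` value is set explicitly here because it is assumed that the router will not accept a timeout lower than the threshold.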

Two separate SLAs are configured, one for each connection. Each independently pings its upstream router every 5 seconds, and notes a problem if there is no response within 1 second (1000 ms). However, we want to avoid invoking a failover if the disruption is short, so the connection must be down for 15 seconds before a failure is declared. We also want to avoid flapping (if the connection is going down every minute, bringing it back into use each time will only cause frustration), so a failed state will only end after 120 seconds without a timeout.

You can now check the current status with the ‘show track’ command. The command should show ‘Reachability is Up’ for each connection. To put the rules into effect, the routing table must be modified to use them. The default routes are altered to reference the track objects, so that a failure causes the corresponding route entry to be disabled.
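As a sketch, the tracked default routes might be entered as follows, reusing the assumed next hops 192.0.2.1 and 198.51.100.1 and the assumed metrics 10 and 20. Removing the untracked routes first is a precaution rather than a confirmed requirement.

```
! Remove the original untracked default routes (assumed values)
no ip route 0.0.0.0 0.0.0.0 FastEthernet0 192.0.2.1 10
no ip route 0.0.0.0 0.0.0.0 FastEthernet1 198.51.100.1 20
!
! Re-add them so that each route is installed only while its track object is up
ip route 0.0.0.0 0.0.0.0 FastEthernet0 192.0.2.1 10 track 1
ip route 0.0.0.0 0.0.0.0 FastEthernet1 198.51.100.1 20 track 2
```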

One more thing…

Once a failover does occur, the route is effectively gone from the routing table. That applies equally to traffic originating from the router and to traffic passing through it, which includes the monitoring traffic. To allow the router to detect the resolution of a service outage after a failover, explicit routes need to be added to the routing table for each monitored router.

You now have a simple failover between two basic Internet connections using standard features available in most Cisco IOS routers, while avoiding the use of routing protocols and expensive availability solutions from service providers.
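A sketch of these explicit host routes is shown below, again using the placeholder monitored addresses 203.0.113.1 and 203.0.113.129 and the assumed next hops. They are deliberately left untracked so that the probes always leave via the connection they are testing.

```
! Always reach the cable provider's monitored router via the cable connection
ip route 203.0.113.1 255.255.255.255 FastEthernet0 192.0.2.1
! Always reach the DSL provider's monitored router via the DSL connection
ip route 203.0.113.129 255.255.255.255 FastEthernet1 198.51.100.1
```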

Results

So far, in the last two months, the primary connection (which is the faster and preferred one, but unfortunately the less reliable) has been down on three occasions for an average of 3 hours each time. Already the improvements have paid off.

In future articles I will elaborate on expanding the configuration to include load-balancing traffic to make better use of the redundant connection, and how incoming traffic for services such as web and email can continue to be served during a service provider failure.

* There are caveats with this post’s selection of router to monitor. If the ISP has a failure somewhere further upstream then your monitoring will fail to notice it. Alternatively, if the ISP reorganises the network and the router you were monitoring is removed, the monitoring will incorrectly believe the connection is down. These problems must be considered on a per connection basis and may need to be solved with the assistance of the particular service providers involved.
