High Availability

From TBwiki

Revision as of 07:40, 1 December 2015 by Abrassard (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

High availability refers both to the design of a system as well as its ability to continue operating following one or more component faults. Concepts that are associated with high availability include:

fault tolerance (the ability for a given component to recover from a fault)
redundancy (the presence of one or more back-up instances of a hardware or software component)
hot-swappable (the ability to add, remove or change a system component without taking the overall system down and without compromising functionality
scalability (the ability of a system to grow over time as well as to respond to unexpected spikes in usage without negatively impacting performance

TelcoBridges and High Availability

While TelcoBridges hardware could always be purchased and configured to support high-availability (HA) requirements, this new release of Toolpack provides complete HA support in software as well, enabling complete end-to-end high availability. With Toolpack version 2.3, existing application-level redundancy is now complimented by redundancy of the core configuration database. With this release of Toolpack, the primary database can go down, while master versions of applications on the main server can keep running. Following a fault, they will simply refer to the secondary (backup) database. If master host machine should encounter a fault, then the primary configuration database and all other master application instances will also go down. In this case, all standby application instances on the standby server as well as the configuration database become the new primary (master) instances. Finally, it is important to note that all applications become highly available by default once HA support has been turned on.

HA available at various layer of the system

There are a lot of different type of redundancy in a Toolpack system. Let's try to summarize them.

IP Network (GW Control and Managment) (1+1)

All communications between application<-->application, application<-->TMedia, TMedia<-->TMedia are done using redundant ip interfaces. This is build from the ground up in all our code.

IP Network (SIP Signaling) (1+1)

For outgoing calls, the system can be configured such that you can use alternate addresses to reach the SIP Proxy, so if one network path is down, it will use the other one. For incoming calls, you can configure SIP stacks on two separate ip interfaces so that if one is not reachable by a peer, the second can be reached. Note however that we don't offer mechanism such as virtual IP, so the peer SIP agent must be configured with the 2 ip addresses used on the TMedia and it must be able to switch from one to the other if one is down.

IP Network (Voice Path) (1+1)

We support redundant voice path for SIP calls. If one path is down, the other one will be selected for all new calls. If both path are up, they can be used in load sharing. Not however, that if a path is lost, the active calls currently using that path won't be switched automatically to the other path (this would require sending a SIP re-INVITE because the ip address of the second interface is not the same as the address of the first interface.

Ip Network exception

TMG800 has limited physical ip ports so it does not support the IP network redundancy.

SS7 Redundancy (N+1)

At the MTP2 level, you can have redundancy by having links reaching a destination spread across at least 2 TMedias, so that if one TMedia fails you can still talk to the destination through the link(s) on the other TMedia. Full load-sharing at the MTP3 layer. MTP3 processing is distributed on all TMedia of a system. Active-Standby scheme available at the ISUP, SCCP, and TCAP layer. Bottom line: no loss of calls when any TMedia crash (the Standby stack will take control and keep the active call opened)

ISDN, CAS

Directly linked to the physical trunk and we don't support HA at the trunk level so no HA.

Automatic protection mechanism on the TMedia

On board process monitoring: dead-lock or crashed process are automatically detected and blade is rebooted to ensure minimal downtime. HW watch dog: When a Toolpack system is active, the TMedia is polled continuously by the System Manager. If this polling stops, it means the TMedia lost contact with the Toolpack applications and it will reboot automatically if communication is not restored in a timely manner.

Application Level (1+1)

All Toolpack applications support Active-Standby mode so you can have two hosts, each having a running instance of our Toolpack application. If the active server crashes, the application on the standby server will take the relay. Furthermore, on each host, Toolpack installs a service that monitors all Toolpack application and that automatically restart crashed or deadlocked applications. The Active-Standby scheme only works when using external host. When using TMG with their internal host, each TMG is considered a separate system and separate systems don't talk at all with each others. Note however that the automatic restart of application is still valid even when using the internal host.

Configuration Database HA (1+1)

We also support HA at the DB level. Toolpack can configure 2 DB Server and setup replication between them so that all changes made to the Master DB are replicated on the Slave DB. If the Master DB is lost, Toolpack will automatically switch to using the Slave DB.

Prerequisites

The first prerequisite for enabling HA database functionality is to have a secondary server available running the MySQL database software. To configure that secondary server you would follow the same prerequisites for a normal installation of Toolpack on that machine. One thing to note with this change is that the IP address for the primary as well as the secondary server must now be made explicit (i.e., using ‘localhost’ or ‘127.0.0.1’ is no longer permitted). You will also want to activate database replication functionality using MySQL’s procedures for achieving this.

Figure 2: Server scenarios

Testing this feature

While we expect enhanced HA functionality to have a major positive impact on your operations, HA is like an insurance policy in that you hope you never have to use it and it is not something you necessarily want to test on production servers. Consequently, it is the area where we expect to spend significant testing efforts. We have identified the following items as being worthy of testing and we recommend that they be part of your testing efforts as well.

You should expect different effects on the system depending on the test operations performed. These can be divided into two broad categories. The first is where the system will NOT drop any calls and will still accept new incoming calls and calls that are currently being processed (transient calls). The second is where established calls will NOT be dropped but the system will not be able to accept transient calls. Please note that this ‘interruption’ is temporary in nature, lasting only so long as the the switchover; once the switchover is complete, any incoming calls will be accepted again.

MySQL operations (no effect on established calls, still able to accept transient calls)

Shutdown MySQL service on master
Kill MySQL service on master
Shutdown MySQL service on slave
Kill MySQL service on slave

Toolpack operations

Quit (graceful) master Toolpack service: (no effect on established calls, will drop transient calls)
Quit (graceful) slave Toolpack service: (no effect on established calls, still able to accept transient calls)

Toolpack applications operations

Master
- tboamapp
- toolpack_sys_manager (no effect on established calls, still able to accept transient calls)
- toolpack_engine (no effect on established calls, will drop transient calls)
- gateway (no effect on established calls, will drop transient calls)
Slave
- tboamapp
- toolpack_sys_manager (no effect on established calls, still able to accept transient calls)
- toolpack_engine (no effect on established calls, still able to accept transient calls)
- gateway (no effect on established calls, still able to accept transient calls)

Host operations

Master (no effect on established calls, will drop transient calls)
Disconnect the network(s)
Shutdown the server (or disconnect power)
Slave (no effect on established calls, still able to accept transient calls)
Disconnect the network(s)
Shutdown the server

Following a switchover, you will want to test the following items and verify that there are no errors:

Established calls are still open
New incoming calls are accepted
The system closes established calls
You are able to switch configurations (this requires a configuration reload)
You are able to change the state (Run / Don't Run) of an application
You are able to enable / disable trunks (this requires a configuration reload)
You are able to enable / disable stacks ( SS7, SIP, ISDN) (this requires a configuration reload)
You are able to add/ remove / modify gateway routes (this requires a configuration reload)
You are able to add / remove / modify gateway accounts (this requires a configuration reload)

Known limitations

We have identified the following limitations with the enhanced high availability support available in Toolpack v2.3.

It is not always possible to recover from a double fault. For example, should both configuration databases go offline, any telephony applications that use these databases will cease to function correctly.

References

Wikipedia article