Our client, India’s fastest-growing 4G, broadband, fibre and cable provider, was working on the world’s largest fibre-to-the-home rollout. The largest implementation in the US has about 4000 switches, while this rollout had about 8000 to 9000 switches. The client was looking for a solution that would monitor switches, routers, wireless access points and other networked devices for faults. It wanted high availability and zero downtime, and a service that scaled horizontally and could handle a doubling of the network size. It also wanted the flexibility to reconfigure how alarms are correlated and deduplicated at a moment’s notice.
Wissen’s deep expertise in the telecom industry was used to create an ingenious solution for this problem. A set of Node.js-based services was created to collect traps, syslog events and poll-based alarms. A rule-based correlation and enrichment engine was built to group and deduplicate related events and present them as a single alarm, based on device and event types, among other things. The solution used stateless containerized services that could seamlessly scale up or down based on CPU utilization, throughput rates and processing delays.
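To make the grouping and deduplication idea concrete, here is a minimal sketch in Python (chosen for brevity; the actual services were Node.js-based). It assumes events are correlated on device and event type within a fixed time window. The `Event` and `Alarm` shapes, the `correlate` function and the 60-second window are all hypothetical and do not describe the engine that was built.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    device: str       # device that emitted the event
    event_type: str   # e.g. "link_down" (hypothetical type name)
    source: str       # "trap", "syslog" or "poll"
    timestamp: float  # seconds


@dataclass
class Alarm:
    device: str
    event_type: str
    first_seen: float
    last_seen: float
    count: int = 1
    sources: set = field(default_factory=set)


def correlate(events, window_seconds=60.0):
    """Collapse events sharing (device, event_type) within a window into one alarm."""
    open_alarms = {}  # (device, event_type) -> most recent alarm for that key
    alarms = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.event_type)
        current = open_alarms.get(key)
        if current is not None and ev.timestamp - current.last_seen <= window_seconds:
            # Duplicate of an alarm that is still open: update it in place.
            current.last_seen = ev.timestamp
            current.count += 1
            current.sources.add(ev.source)
        else:
            # First event for this key, or the previous alarm's window has expired.
            current = Alarm(ev.device, ev.event_type, ev.timestamp, ev.timestamp,
                            sources={ev.source})
            open_alarms[key] = current
            alarms.append(current)
    return alarms


# The same link failure reported by three collectors, plus one unrelated device.
events = [
    Event("sw-01", "link_down", "trap", 100.0),
    Event("sw-01", "link_down", "syslog", 101.5),
    Event("sw-01", "link_down", "poll", 130.0),
    Event("sw-02", "link_down", "trap", 105.0),
]
alarms = correlate(events)
for alarm in alarms:
    print(alarm.device, alarm.event_type, alarm.count, sorted(alarm.sources))
```

Four raw events become two alarms here: the three reports for `sw-01` fall inside one window and merge, while `sw-02` stays separate. Note that this sketch keeps open alarms in memory for simplicity; stateless services such as those described above would need to hold that state in a shared store.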
A fault management solution for the WAN was successfully built, spanning 15 data centers, 4000 switches (growing to over 7500 in 2019) and over 100,000 connected network devices. New alarm types and correlation rules could be added through configuration changes. The correlation rules reduced alarm volume by 30%, helping to surface true problems.
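One way to support new correlation rules through configuration is to express each rule as data rather than code. The sketch below assumes a JSON rule format with `name`, `match` and `group_by` fields; that format, the event field names and the helper functions are hypothetical illustrations, not the solution's actual configuration schema.

```python
import json

# Hypothetical rule file contents: each rule says which events it applies to
# and which event fields identify "the same" underlying problem.
RULES_JSON = """
[
  {"name": "link-flap", "match": {"event_type": "link_down"},
   "group_by": ["device", "interface"]},
  {"name": "power", "match": {"event_type": "psu_fail"},
   "group_by": ["device"]}
]
"""


def load_rules(text):
    """Parse rules from JSON and check that each has the required fields."""
    rules = json.loads(text)
    for rule in rules:
        missing = {"name", "match", "group_by"} - set(rule)
        if missing:
            raise ValueError(f"rule is missing fields: {sorted(missing)}")
    return rules


def correlation_key(event, rules):
    """Return (rule name, grouping values...) for the first matching rule, else None."""
    for rule in rules:
        if all(event.get(k) == v for k, v in rule["match"].items()):
            return (rule["name"],) + tuple(event.get(f) for f in rule["group_by"])
    return None


rules = load_rules(RULES_JSON)
trap = {"device": "sw-01", "interface": "ge-0/0/1",
        "event_type": "link_down", "source": "trap"}
syslog = {"device": "sw-01", "interface": "ge-0/0/1",
          "event_type": "link_down", "source": "syslog"}
fan = {"device": "sw-01", "event_type": "fan_fail", "source": "poll"}

key_trap = correlation_key(trap, rules)
key_syslog = correlation_key(syslog, rules)
key_fan = correlation_key(fan, rules)

# Supporting a new alarm type means adding one more rule entry, not new code.
new_rule = {"name": "fan", "match": {"event_type": "fan_fail"},
            "group_by": ["device"]}
extended = load_rules(json.dumps(rules + [new_rule]))
key_fan_extended = correlation_key(fan, extended)
```

Events that produce the same key would be treated as one alarm, so the trap and syslog reports above collapse together, while the fan failure is only correlated once its rule has been added to the configuration.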