At the time of writing, LONAP has approximately 210 members and 350 member-facing ports, with a total connected capacity of 6Tb/sec spread over 8 separate data centres spanning from London Docklands to Slough, a distance of approximately 30 miles. Traffic on the exchange currently peaks at over 550Gb/sec.
We run the exchange with four full-time staff, three of whom are engineers, and all of whom work remotely. The exchange has historically been very stable, and monitoring and alerting on our own infrastructure is essential to keeping it that way.
Several different systems make up our monitoring and alerting. One is a central store of historic switch port statistics, which can be used for live status pages (‘Weathermaps’) as well as historical analysis. Another is a monitoring system which makes active checks on parts of our infrastructure and alerts on them where appropriate. This post is concerned with the evolution of the latter.
The scale of the problem…
As well as the switches that operate the peering LAN, there is a variety of ancillary equipment such as out-of-band management switches, console servers, power distribution units, access points, and optical systems. There are also a number of servers running virtual machines, which operate services such as the website, ticketing system, mail, code repositories and route servers. In total there are around 150 different devices.
The pre-existing alerting system was built on the venerable Nagios system which, while stable, had a number of problems that needed addressing, including configuration from static files, large amounts of alert flapping, and a tendency to drift out of date with our ever-changing infrastructure. While there are options for updating and refreshing the Nagios system, we took the opportunity to look around for alternatives and arrived at Sensu.
Sensu is a different style of system. It is API driven, and the checks it makes are distributed: they are run by agents on the hosts themselves, or via a proxy for devices that can’t run their own checks (such as switches). Proxies or agents are referred to as ‘entities’ in Sensu. The backend deployment of Sensu itself doesn’t run any checks; it just allows entities to register themselves and schedules the checks that the entities run.
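To make the agent and proxy split more concrete, here is a minimal sketch of what a check aimed at proxy entities might look like. It assumes the Sensu Go core/v2 resource format; the `proxy_check` helper, the `proxy` subscription name, the `address` label and the ping command are all illustrative assumptions, not our actual configuration.

```python
import json


def proxy_check(name, command, interval=60):
    """Return a Sensu Go style CheckConfig resource as a plain dict.

    The check is scheduled by the backend, picked up by any agent
    subscribed to "proxy", and run once for each matching proxy entity.
    """
    return {
        "type": "CheckConfig",
        "api_version": "core/v2",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "command": command,
            "interval": interval,  # seconds between scheduled runs
            "publish": True,
            "subscriptions": ["proxy"],  # assumed subscription name
            "round_robin": True,  # share proxy requests between agents
            "proxy_requests": {
                # Select every entity registered with the proxy class.
                "entity_attributes": ["entity.entity_class == 'proxy'"],
            },
        },
    }


# Hypothetical example: the token is filled in from each entity's
# "address" label when the check request is built.
ping_check = proxy_check("ping", "ping -c 3 {{ .labels.address }}")
print(json.dumps(ping_check, indent=2))
```

The resulting JSON could then be loaded into a Sensu backend with its usual tooling, or kept in version control alongside other check definitions.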
One primary benefit in moving to this system is that it allows us to automatically generate the monitoring and alerting as we deploy and update systems by using the sources of truth the exchange is built on:
SaltStack: Systems that we run are defined in our deployment of the Salt configuration management tool - deploying the Sensu monitoring agent automatically and consistently.
Netbox: Infrastructure that cannot be deployed from Salt is monitored via proxies whose configuration is pulled from our installation of Netbox, which is the canonical source of our physical infrastructure.
IXP Manager: BGP sessions which should be monitored are pulled from our installation of IXP Manager.
Using these three methods, it is not necessary to manually install or configure agents as we add or alter our infrastructure. The check and alerting configurations themselves are stored in GitLab and are version controlled. It’s possible to deploy a complete new instance of the monitoring infrastructure in minutes from a blank machine.
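As a rough illustration of generating monitoring from a source of truth, the sketch below turns a list of inventory records into proxy entity definitions. It assumes the Sensu Go core/v2 entity format; the record fields (`name`, `address`, `role`, `site`), the device names and the helper functions are hypothetical stand-ins for whatever an inventory export from a tool like Netbox would actually contain.

```python
import json


def proxy_entity(device, namespace="default"):
    """Build a Sensu Go style proxy Entity resource from one inventory record."""
    return {
        "type": "Entity",
        "api_version": "core/v2",
        "metadata": {
            "name": device["name"],
            "namespace": namespace,
            # Label values are kept as strings so checks can use them as tokens.
            "labels": {
                "address": str(device["address"]),
                "role": str(device["role"]),
                "site": str(device["site"]),
            },
        },
        "spec": {
            "entity_class": "proxy",
            "subscriptions": ["proxy", "role:" + str(device["role"])],
        },
    }


def build_entities(devices):
    """Return entities sorted by name, skipping records with no address."""
    usable = [d for d in devices if d.get("address")]
    return [proxy_entity(d) for d in sorted(usable, key=lambda d: d["name"])]


# Hypothetical inventory, standing in for data exported from the source of truth.
inventory = [
    {"name": "pdu1.site-b", "address": "192.0.2.20", "role": "pdu", "site": "site-b"},
    {"name": "sw1.site-a", "address": "192.0.2.10", "role": "switch", "site": "site-a"},
    {"name": "spare-sw", "address": None, "role": "switch", "site": "site-a"},
]

entities = build_entities(inventory)
print(json.dumps(entities, indent=2))
```

Because the output is derived entirely from the inventory, re-running a generator like this after a device is added or retired keeps the monitored set in step with the infrastructure, which is the property described above.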
We currently monitor around 150 separate entities, which generate around 1500 monitoring events every 30-60 seconds, and there is much more we will be adding in the near future.
Watch the full talk, in which Ian Chilton describes our approach to monitoring, at: NLNOG Live! - Sep 2020.