Monitoring of External Systems
How can you be alerted to a failure when the monitored system keeps no fixed working hours and even takes a lunch break 🍜?
Pascal L.
Aug 1, 2023
For many businesses, it's crucial to monitor both their own and external systems. But what if an external system regularly stops communicating for reasons other than a failure? How can false alarms be prevented?
The Challenge
For one of our clients, we developed a back office and a backend. The back office assists with various business processes and is supplied with data via the backend. The backend communicates with the client's Enterprise Resource Planning (ERP) system and other external interfaces.
We were then confronted with the issue that the ERP system shuts down overnight, and to everyone's amusement, it even takes a daily lunch break. For smooth business operations, it's necessary to identify when communication with the ERP fails, without triggering false alarms several times a day.
What makes it difficult?
The backend runs on Google Cloud, and we use OpenTelemetry to send additional metrics to Google Cloud Monitoring (formerly Stackdriver). Hence, we can analyze the traffic very accurately. The ERP starts communicating with the backend at some point between 5 AM and 7 AM each day. Apart from a brief pause at lunchtime, the data transfer continues until it stops for the day at some point between 4 PM and 10 PM. Because the start and stop times vary this much, we can't simply enable alerting during a fixed time window: we want to be alerted within roughly an hour of a failure.
The Trick
We found that the ERP traffic strongly correlates with the requests to the back office API. This allows us to determine when the ERP should be active. As soon as we notice that the back office API is active, indicating the employees are working, we expect requests from the ERP system.
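The rule described above can be sketched as a small predicate. This is a minimal illustration, not our actual alert policy: the function name `erp_should_alert` and the threshold `min_office_requests` are assumptions made for this example.

```python
def erp_should_alert(office_requests: int,
                     erp_requests: int,
                     min_office_requests: int = 1) -> bool:
    """Return True when the back office is in use but the ERP is silent.

    Both counts are request totals for the same time window.
    min_office_requests is a hypothetical threshold for "employees are working".
    """
    office_active = office_requests >= min_office_requests
    erp_silent = erp_requests == 0
    return office_active and erp_silent


# Nobody is working at night: a silent ERP is expected, so no alert.
print(erp_should_alert(office_requests=0, erp_requests=0))
# Employees are working but the ERP sends nothing: this is the failure case.
print(erp_should_alert(office_requests=25, erp_requests=0))
```

The important design choice is that ERP silence alone never raises an alert; it only counts while the correlated back office metric shows activity.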
The alignment period is the window over which Cloud Monitoring aggregates the data points of a metric. If it is longer than a brief transmission pause, the pause is smoothed out. In our case, it covers the ERP's lunch break.
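To see why the length of the alignment period matters, here is a rough simulation in plain Python. It only imitates summing request counts into fixed windows; it is not the Cloud Monitoring API. The helper `align_sum`, the traffic of 5 requests per minute, and the 45-minute lunch break are all made-up values for illustration.

```python
def align_sum(counts_per_minute: list[int], period_minutes: int) -> list[int]:
    """Sum per-minute request counts into consecutive windows of period_minutes."""
    return [
        sum(counts_per_minute[start:start + period_minutes])
        for start in range(0, len(counts_per_minute), period_minutes)
    ]


# Hypothetical ERP traffic from 11:00 to 14:00:
# one hour of traffic, a 45-minute lunch break, then traffic again.
erp_traffic = [5] * 60 + [0] * 45 + [5] * 75

short_windows = align_sum(erp_traffic, period_minutes=15)
long_windows = align_sum(erp_traffic, period_minutes=60)

# With 15-minute windows the lunch break shows up as windows without any traffic.
print(short_windows)  # [75, 75, 75, 75, 0, 0, 0, 75, 75, 75, 75, 75]
# With 60-minute windows every window still contains some traffic.
print(long_windows)   # [300, 75, 300]
```

The trade-off is that a longer alignment period also delays the detection of a real failure, so it should be only as long as the pause it needs to bridge.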
A retest window prevents an alarm from being triggered by a single measurement: the condition has to hold for the whole window before an alert fires. This helps at the start and end of the day, because traffic on the back office API doesn't start and stop at exactly the same time as the ERP's traffic.
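The effect of a retest window can be sketched as "the condition must hold for several consecutive evaluations". The following is a simplified model in plain Python, not how Cloud Monitoring implements it; `should_fire` and `retest_evaluations` are names invented for this example.

```python
def should_fire(condition_met: list[bool], retest_evaluations: int) -> bool:
    """Return True if the condition held for retest_evaluations consecutive checks."""
    streak = 0
    for met in condition_met:
        # A single evaluation where the condition is not met resets the streak.
        streak = streak + 1 if met else 0
        if streak >= retest_evaluations:
            return True
    return False


# Early morning: employees start before the ERP does, so the condition
# is met once and then clears. No alert is raised.
print(should_fire([False, True, False, False], retest_evaluations=3))
# Real failure: the ERP stays silent while the back office remains active.
print(should_fire([False, True, True, True], retest_evaluations=3))
```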
Conclusion
If a system only needs to be monitored during business hours, a metric that correlates with the usage of the product can indicate when alerting should be active.
Setting up such a monitoring and alerting system can be a challenge, but it is crucial to ensure smooth operation. In our case, it enables us to respond quickly and minimize downtime. By combining monitoring, analytics, and intelligent alerting, we can ensure a swift response to problems and reliable operation of services.
We would be happy to advise you on the development, operation, and monitoring of cloud applications. Just contact us!
Technologies
OpenTelemetry
Google Cloud Operations
Terraform
Google Cloud