Status | Published |
Title | MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network |
Author | Yu, Xian (1); Ye, Kejiang (2); He, Dongbiao (3); Chen, Xianfan (3); Xu, Chengzhong (4); Li, Jianhui (1); Xie, Gaogang (1) |
Year | 2024 |
Conference Name | 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024 |
Source Publication | Proceedings - International Conference on Distributed Computing Systems |
Pages | 913-924 |
Conference Date | 23 July 2024 through 26 July 2024 |
Conference Place | Jersey City |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Abstract | Through providing cheap rack and network hosting services, third-party internet data centers (IDCs) have gained significant popularity among cloud service providers. Real-time monitoring of the quality of the IDC network and proactively alarming is crucial to guaranteeing the reliability of cloud services. The prevailing approach to addressing this problem involves utilizing active probes and making evaluations based on the results of single-link or multi-link probing. However, the existing efforts still tend to generate a significant number of unnecessary alerts, resulting in enormous operational costs. For this reason, we first build a large-scale distributed ping-based dial test system that enables monitoring the quality of the IDC network in a many-to-one probe mode. We develop an efficient exporter tool based on the standard Prometheus' data interface to ensure real-time and precise measurement data collection. To quickly and accurately detect potential network issues, we also design a multi-step heuristic-based fault detection and alarm method. Furthermore, we propose a comprehensive alarm life-cycle model based on the results of multi-link probing to guide alarm management in production practice. This system has been successfully deployed in the production environment of Sangfor company's managed cloud for over a year, enabling proactive diagnosis of hundreds of IDC gateway IP addresses. The actual statistical results indicate a significant improvement in the mean time to repair (MTTR) for IDC network failures, reducing it from a few hours to just a few minutes. The average daily number of alarms generated by this system is less than 15, decreasing approximately 85% compared to before. The alarm accuracy exceeds 95% and the false negative rate is less than 2%. |
Keyword | Active Probing ; Alarm Aggregation ; Fault Detection ; Internet Data Center Network |
DOI | 10.1109/ICDCS60910.2024.00089 |
Indexed By | CPCI-S |
Language | English |
WOS Research Area | Computer Science ; Telecommunications |
WOS Subject | Computer Science, Information Systems ; Computer Science, Theory & Methods ; Telecommunications |
WOS ID | WOS:001304430200080 |
Scopus ID | 2-s2.0-85203197458 |
Document Type | Conference paper |
Collection | Faculty of Science and Technology ; Department of Computer and Information Science |
Corresponding Author | Yu, Xian |
Affiliation | 1. CNIC, CAS, China ; 2. SIAT, CAS, China ; 3. Sangfor Technologies Inc., China ; 4. University of Macau, Faculty of Science and Technology, Macao |
Recommended Citation GB/T 7714 | Yu, Xian, Ye, Kejiang, He, Dongbiao, et al. MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network[C]. Institute of Electrical and Electronics Engineers Inc., 2024: 913-924. |
APA | Yu, Xian., Ye, Kejiang., He, Dongbiao., Chen, Xianfan., Xu, Chengzhong., Li, Jianhui., & Xie, Gaogang (2024). MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network. Proceedings - International Conference on Distributed Computing Systems, 913-924. |