Status: Published
MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network
Yu, Xian¹; Ye, Kejiang²; He, Dongbiao³; Chen, Xianfan³; Xu, Chengzhong⁴; Li, Jianhui¹; Xie, Gaogang¹
2024
Conference Name: 44th IEEE International Conference on Distributed Computing Systems, ICDCS 2024
Source Publication: Proceedings - International Conference on Distributed Computing Systems
Pages: 913-924
Conference Date: 23 July 2024 through 26 July 2024
Conference Place: Jersey City
Publisher: Institute of Electrical and Electronics Engineers Inc.
Abstract

By providing cheap rack and network hosting services, third-party internet data centers (IDCs) have gained significant popularity among cloud service providers. Monitoring the quality of the IDC network in real time and raising alarms proactively are crucial to guaranteeing the reliability of cloud services. The prevailing approach uses active probes and evaluates network quality from the results of single-link or multi-link probing. However, existing efforts still tend to generate a large number of unnecessary alerts, resulting in enormous operational costs. For this reason, we first build a large-scale distributed ping-based dial test system that monitors the quality of the IDC network in a many-to-one probe mode. We develop an efficient exporter tool based on the standard Prometheus data interface to ensure real-time and precise collection of measurement data. To detect potential network issues quickly and accurately, we also design a multi-step heuristic-based fault detection and alarm method. Furthermore, we propose a comprehensive alarm life-cycle model based on the results of multi-link probing to guide alarm management in production practice. The system has been deployed in the production environment of Sangfor's managed cloud for over a year, enabling proactive diagnosis of hundreds of IDC gateway IP addresses. Production statistics indicate a significant improvement in the mean time to repair (MTTR) for IDC network failures, which fell from a few hours to just a few minutes. The system generates fewer than 15 alarms per day on average, approximately 85% fewer than before. The alarm accuracy exceeds 95% and the false negative rate is less than 2%.
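
As an illustration only, the idea of judging a gateway from many-to-one multi-link probing, rather than from any single link, might be sketched as follows. This is not the paper's detection method: the `ProbeResult` type, the `detect_faulty_targets` function, the `loss_threshold` and `quorum` values, and the sample addresses are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProbeResult:
    """One ping measurement window from one prober to one target (hypothetical type)."""
    prober: str        # name of the probing vantage point
    target: str        # IDC gateway IP address being probed
    loss_ratio: float  # fraction of ping packets lost in the window, 0.0 to 1.0


def detect_faulty_targets(
    results: List[ProbeResult],
    loss_threshold: float = 0.5,  # assumed: a link counts as bad at >= 50% loss
    quorum: float = 0.6,          # assumed: alarm when >= 60% of links are bad
) -> List[str]:
    """Return targets that a quorum of probers sees as unhealthy.

    A single lossy link does not raise an alarm by itself; the target is
    flagged only when enough independent probers agree.
    """
    by_target: Dict[str, List[ProbeResult]] = {}
    for result in results:
        by_target.setdefault(result.target, []).append(result)

    faulty = []
    for target, group in by_target.items():
        bad_links = sum(1 for r in group if r.loss_ratio >= loss_threshold)
        if bad_links / len(group) >= quorum:
            faulty.append(target)
    return sorted(faulty)


# Sample data using documentation-only addresses (192.0.2.0/24).
SAMPLE = [
    ProbeResult("prober-a", "192.0.2.1", 1.0),
    ProbeResult("prober-b", "192.0.2.1", 0.8),
    ProbeResult("prober-c", "192.0.2.1", 0.0),
    ProbeResult("prober-a", "192.0.2.2", 1.0),
    ProbeResult("prober-b", "192.0.2.2", 0.0),
    ProbeResult("prober-c", "192.0.2.2", 0.0),
]

if __name__ == "__main__":
    print(detect_faulty_targets(SAMPLE))
```

In this sample, two of three probers report heavy loss to `192.0.2.1`, so it is flagged, while the single lossy link to `192.0.2.2` is treated as a local problem and suppressed. Requiring agreement across links is one simple way to cut unnecessary alerts; the method described in the abstract is multi-step and may differ substantially.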

Keyword: Active Probing; Alarm Aggregation; Fault Detection; Internet Data Center Network
DOI: 10.1109/ICDCS60910.2024.00089
Indexed By: CPCI-S
Language: English
WOS Research Area: Computer Science; Telecommunications
WOS Subject: Computer Science, Information Systems; Computer Science, Theory & Methods; Telecommunications
WOS ID: WOS:001304430200080
Scopus ID: 2-s2.0-85203197458
Document Type: Conference paper
Collection: Faculty of Science and Technology
DEPARTMENT OF COMPUTER AND INFORMATION SCIENCE
Corresponding Author: Yu, Xian
Affiliation:
1. CNIC, CAS, China
2. SIAT, CAS, China
3. Sangfor Technologies Inc., China
4. University of Macau, Faculty of Science and Technology, Macao
Recommended Citation
GB/T 7714
Yu, Xian, Ye, Kejiang, He, Dongbiao, et al. MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network[C]: Institute of Electrical and Electronics Engineers Inc., 2024, 913-924.
APA
Yu, X., Ye, K., He, D., Chen, X., Xu, C., Li, J., & Xie, G. (2024). MLPing: Real-Time Proactive Fault Detection and Alarm for Large-Scale Distributed IDC Network. Proceedings - International Conference on Distributed Computing Systems, 913-924.