Log correlation analysis plays an important role in many areas of information security. For example, it can help find abnormal navigation behaviors in insider threat detection, and it can serve as the data source for intrusion detection [1]. However, raw logs are filled with noise. Data cleaning is therefore an indispensable preprocessing step in log correlation analysis, improving detection efficiency and reducing storage space.
Many methods have been proposed to improve data quality by removing irrelevant items, such as JPEG, GIF, or sound files and accesses generated by spider navigation. Most of them are designed for web servers (e.g., e-commerce sites) and work by inspecting the user-agent, HTTP status, and URL suffix fields of web requests. However, they cannot adequately address the problem of improving the data quality of proxy logs (which record web requests passing through intermediate roles), because proxy logs exhibit different characteristics from server logs. The biggest difference is that proxy logs must be cleaned without knowledge of the website accessed by a request, such as its web structure and content types. This makes traditional data cleaning methods incapable of filtering noise specific to proxy logs, such as software updates and requests from network behavior analyzers. Moreover, proxy logs experience rapid growth of web requests generated by an unbounded set of websites and users, which makes the problem even more difficult to tackle.
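The field-inspection approach of traditional server-log cleaning can be sketched as a set of simple rules; the field names and suffix/pattern lists below are illustrative assumptions, not the exact rules of any cited method:

```python
import re

# Illustrative suffixes of embedded resources (images, sounds, scripts)
# that traditional server-log cleaning removes from the log.
STATIC_SUFFIXES = (".jpg", ".jpeg", ".gif", ".png", ".css", ".js",
                   ".wav", ".mp3")
# Illustrative pattern for spider/crawler user-agent strings.
SPIDER_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)

def is_relevant(entry: dict) -> bool:
    """Return True if a log entry should be kept after cleaning.

    `entry` is assumed to carry "url", "user_agent", and "status"
    fields parsed from one request line.
    """
    url = entry.get("url", "").lower()
    if url.endswith(STATIC_SUFFIXES):
        return False  # request for an embedded image/sound/script file
    if SPIDER_PATTERN.search(entry.get("user_agent", "")):
        return False  # access generated by spider navigation
    if not (200 <= entry.get("status", 0) < 400):
        return False  # failed request (error HTTP status)
    return True
```

Rules like these rely on knowing which suffixes and agents are noise for a given site; as argued above, proxy logs lack that site-specific knowledge, which is why such filters miss proxy-specific noise like software-update traffic.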