Abstract: Data cleaning is an important step in the preprocessing stage of web usage mining and is widely used in data mining systems. Despite many efforts on data cleaning for web server logs, it remains an open question for enterprise proxy logs. Because clients have unrestricted access to websites, enterprise proxy logs record web requests from multiple clients to multiple web servers, which makes them quite different from web server logs in both location and content. As a result, many irrelevant items, such as software update requests, cannot be filtered out by traditional data cleaning methods. In this paper, we propose the first method, named EPLogCleaner, that can filter out a large number of irrelevant items based on the common prefix of their URLs. We evaluate EPLogCleaner on a real network traffic trace captured from one enterprise proxy. Experimental results show that EPLogCleaner can improve the data quality of enterprise proxy logs by further filtering out more than 30% of URL requests compared with traditional data cleaning methods.
Keywords: web usage mining, data cleaning, enterprise proxy logs
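
To illustrate the general idea of filtering by a common URL prefix, the following is a minimal Python sketch. It is not the EPLogCleaner implementation: the function names `common_url_prefix` and `filter_by_prefix`, the example URLs, and the rule of trimming the prefix back to the last `/` are all assumptions made for illustration.

```python
from os.path import commonprefix


def common_url_prefix(urls):
    """Return the longest common prefix of the URLs, cut at the last '/'.

    Hypothetical helper: trimming to a '/' boundary is an assumption,
    made so that the prefix ends on a whole path segment.
    """
    prefix = commonprefix(list(urls))  # character-wise common prefix
    cut = prefix.rfind("/")
    return prefix[:cut + 1] if cut >= 0 else ""


def filter_by_prefix(requests, irrelevant_prefixes):
    """Keep only the requests that start with none of the given prefixes."""
    prefixes = tuple(irrelevant_prefixes)
    return [url for url in requests if not url.startswith(prefixes)]


# Illustrative log entries (made-up URLs, not taken from the evaluated trace).
log = [
    "http://update.example.com/av/defs/10.bin",
    "http://news.example.org/index.html",
    "http://update.example.com/av/defs/11.bin",
]

# Derive a prefix from requests assumed to be software updates,
# then drop every request that shares it.
update_prefix = common_url_prefix([log[0], log[2]])
cleaned = filter_by_prefix(log, [update_prefix])
```

In this sketch the two update requests share the prefix `http://update.example.com/av/defs/` and are removed, while the news request is kept. A practical filter would also need to reject overly short prefixes (for example a bare `http://`), which would otherwise match every request.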