EPLogCleaner: Improving Data Quality of Enterprise Proxy Logs for Efficient Web Usage Mining

Hongzhou Sha, Tingwen Liu, Peng Qin, Yong Sun, Qingyun Liu



Abstract: Data cleaning is an important step in the preprocessing stage of web usage mining and is widely used in data mining systems. Despite many efforts on data cleaning for web server logs, it remains an open question for enterprise proxy logs. With unrestricted access to websites, enterprise proxy logs trace web requests from multiple clients to multiple web servers, which makes them quite different from web server logs in both location and content. As a result, many irrelevant items, such as software update requests, cannot be filtered out by traditional data cleaning methods. In this paper, we propose EPLogCleaner, the first method that can filter out a large number of irrelevant items based on the common prefix of their URLs. We evaluate EPLogCleaner on a real network traffic trace captured from one enterprise proxy. Experimental results show that EPLogCleaner improves the data quality of enterprise proxy logs by filtering out more than 30% of URL requests beyond what traditional data cleaning methods remove.
Keywords: web usage mining, data cleaning, enterprise proxy logs
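
The idea of filtering by a common URL prefix can be illustrated with a minimal sketch. This is not the EPLogCleaner implementation; the function names `url_prefix` and `filter_by_prefix`, the `depth` parameter, and the choice of "host plus leading path segments" as the prefix are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def url_prefix(url: str, depth: int = 1) -> str:
    """Return the host plus the first `depth` path segments of a URL.

    Hypothetical definition of "prefix"; the paper may define it differently.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s][:depth]
    return "/".join([parts.netloc] + segments)


def filter_by_prefix(urls, irrelevant_prefixes, depth: int = 1):
    """Keep only the URLs whose prefix is not in the irrelevant set."""
    blocked = set(irrelevant_prefixes)
    return [u for u in urls if url_prefix(u, depth) not in blocked]
```

Under these assumptions, a log containing two requests under `update.example.com/patch` and one under `news.example.org/story` would keep only the latter when `update.example.com/patch` is listed as an irrelevant prefix. How such prefixes are chosen is the substance of the method and is not covered by this sketch.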


