Abstract: Cybercriminals use malicious Uniform Resource Locators (URLs) as the entry point for a variety of web attacks, such as phishing, spamming, and malware distribution, which may lead to substantial financial and data losses. Malicious URLs should therefore be detected as accurately and quickly as possible. Heuristic-based detection is one of the most popular approaches to this goal, and its detection results depend on a large number of heuristic features. However, the enormous number of new pages and meaningless tokens causes the feature set to explode and eventually exhaust memory. In this paper, we address the problem by selecting a representative subset of the initial feature set, chosen so that it has the best predictive ability among all subsets of the same size. For each feature, we give an evaluation method of O(1) complexity to measure its predictive ability. We then make the selection over all measured values with linear complexity. Experimental results show that, compared with prior approaches, our approach achieves almost the same false negative rate for malicious URL detection while using only 8.3% of the features. Moreover, our approach is efficient enough for big-data workloads: it handles 20 thousand URLs per second on average in our experiments.
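The two-step idea summarized above (score each feature in constant time, then keep the best-scoring ones) can be illustrated with a minimal sketch. The abstract does not state the scoring function, so the one used here is an assumption: the absolute difference between the fraction of malicious URLs and the fraction of benign URLs that contain a feature. The function names `count_features`, `score`, and `select_top_k` and the token-set input format are likewise hypothetical. The selection step uses `heapq.nlargest`, which runs in O(n log k) rather than the strictly linear procedure the paper refers to.

```python
import heapq
from collections import Counter


def count_features(samples):
    """One pass over (feature_set, is_malicious) pairs to build per-class counts."""
    mal, ben = Counter(), Counter()
    n_mal = n_ben = 0
    for features, is_malicious in samples:
        if is_malicious:
            mal.update(set(features))
            n_mal += 1
        else:
            ben.update(set(features))
            n_ben += 1
    return mal, ben, n_mal, n_ben


def score(feature, mal, ben, n_mal, n_ben):
    """Assumed scoring rule (not from the paper): class-frequency gap.

    Costs O(1) per feature: two hash lookups and a subtraction.
    """
    p_mal = mal[feature] / n_mal if n_mal else 0.0
    p_ben = ben[feature] / n_ben if n_ben else 0.0
    return abs(p_mal - p_ben)


def select_top_k(samples, k):
    """Keep the k features with the highest assumed score."""
    mal, ben, n_mal, n_ben = count_features(samples)
    features = set(mal) | set(ben)
    return heapq.nlargest(
        k, features, key=lambda f: score(f, mal, ben, n_mal, n_ben)
    )


# Toy data: each URL is represented by a set of tokens (hypothetical).
samples = [
    ({"login", "paypal"}, True),
    ({"login", "verify"}, True),
    ({"login", "news"}, True),
    ({"news", "sports"}, False),
    ({"news", "weather"}, False),
    ({"news", "login"}, False),
]
selected = select_top_k(samples, 2)
```

In this toy data, `login` and `news` are the two tokens whose frequency differs most between the classes, so they are the ones retained; a classifier would then be trained on the reduced feature set only.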