Clairvoyant Squirrel: Large Scale Malicious Domain Classification
Large-scale classification of domain names has many applications in network monitoring, intrusion detection, and forensics. The goal of this research is to predict a domain’s maliciousness based solely on the domain string itself, and to perform this classification in real time on domains seen on high-traffic networks, giving network administrators insight into possible intrusions. Our classification model uses the Random Forest algorithm with a 22-feature vector of domain string characteristics. Most of these features are numeric and quick to calculate. The model is currently trained offline on a corpus of highly malicious domains gathered from DNS traffic originating from a malware execution sandbox, together with benign, popular domains from a high-traffic DNS sensor. For stream classification, we use an internally developed platform for distributed high-speed event processing, built on Twitter's recently open-sourced Storm project. We discuss the system architecture, the logic behind our model's features, and the sampling techniques that have led to 97% classification accuracy on our dataset, as well as the model's performance within our streaming environment.
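To give a feel for what "numeric features of the domain string" can look like, here is a minimal sketch in Python. The abstract does not list the 22 features, so the ones below (length, label count, digit ratio, vowel ratio, hyphen count, character entropy) are illustrative assumptions, and the names `domain_features` and `shannon_entropy` are hypothetical helpers, not part of the actual system.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(s):
    """Shannon entropy (bits per character) of the string s."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def domain_features(domain):
    """Compute a few cheap numeric features from a domain string.

    These features are assumptions chosen for illustration; they are
    not the 22 features used by the model described above.
    """
    name = domain.lower().rstrip(".")      # normalize case, drop trailing dot
    labels = name.split(".")               # e.g. ["mail", "example", "com"]
    chars = name.replace(".", "")          # characters without separators
    letters = [c for c in chars if c.isalpha()]
    n = len(chars)
    return {
        "length": len(name),
        "num_labels": len(labels),
        "longest_label": max(len(label) for label in labels),
        "digit_ratio": sum(c.isdigit() for c in chars) / n if n else 0.0,
        "vowel_ratio": (
            sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
        ),
        "hyphen_count": chars.count("-"),
        "entropy": shannon_entropy(chars),
    }


if __name__ == "__main__":
    for d in ("example.com", "x7f3k9q2zb1.net"):
        print(d, domain_features(d))
```

Each domain yields a small dictionary of numbers that could be flattened into a fixed-order vector and passed to any Random Forest implementation. Because every feature is computed in a single pass over a short string, this kind of extraction is cheap enough to run per event in a streaming pipeline.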
Here are the slides in case you’re interested.