Botnets are arguably one of the biggest threats online at present. To control a network of infected machines, the command-and-control (C&C) server communicates with its bots via an IP address or domain known only to them. However, if a dedicated domain, or a fixed set of domains, is used for this communication, it can be easily detected and blacklisted.
Domain Generation Algorithms (DGAs) are a technique used by modern botnets to evade blacklisting and sinkholing. A DGA periodically generates a large number of domain names for the bots to contact. A new list can be generated every day, and only a few of those domains are registered and activated for bot-C&C communication. Since there is an infinite number of possible algorithms, it is impossible to compile a finite blacklist of domains, which makes detection of communication between bots and the C&C server extremely difficult.
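As a rough illustration of the idea (not modeled on any real malware family), a date-seeded DGA can be sketched in a few lines. Because the bot and the C&C server run the same algorithm with the same seed, both sides can compute today's list independently:

```python
import random
import string

def toy_dga(date_str, count=10, length=12, tld=".com"):
    """Deterministically derive `count` pseudo-random domains from a date seed.
    Bot and C&C server share the algorithm and the seed, so both can
    compute the same daily list without any prior communication."""
    rng = random.Random(date_str)  # shared seed: both sides agree on the date
    return [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(length)) + tld
        for _ in range(count)
    ]

domains = toy_dga("2018-01-15")
```

The defender would have to predict or block the whole daily list, while the botmaster needs to register only one domain from it.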
DGA Classification and Detection
Different approaches have been used to detect DGA-generated domains. As in many areas of data mining, they fall into two categories: rule-based algorithms and machine learning algorithms. A rule-based approach relies on manually created rules. For example, in the early days of DGAs, automatically generated domains were in general much longer than human-generated ones; consequently, applying a threshold to the domain length could serve as a simple rule for DGA detection. Another distinguishing feature of DGA-generated domains is their randomness; thus entropy, a quantity that measures randomness, can be computed and thresholded for DGA detection. However, this approach is easily circumvented by modifying DGAs to generate shorter and "less random" domain names.
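The length and entropy rules can be sketched as follows; the threshold values here are illustrative assumptions, not taken from any particular study or product:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy (bits per character) of a string."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def rule_based_is_dga(domain, max_len=20, max_entropy=3.8):
    """Flag a domain as DGA-generated if it is unusually long or
    unusually random. Thresholds are illustrative, hand-picked values."""
    name = domain.split(".")[0]  # consider only the second-level label
    return len(name) > max_len or shannon_entropy(name) > max_entropy
```

A DGA author who caps the length and biases the character distribution toward natural-looking text defeats both rules at once, which is exactly the weakness described above.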
More recently, machine learning (ML) has been called into action. There are several areas of machine learning: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. The task of identifying DGA-generated domains falls into the category of supervised learning, and more specifically supervised classification. The basic idea behind machine learning is the ability of computer algorithms to automatically detect patterns in data that are too complex for a human to spot. Supervised machine learning, in the context of DGA detection, consists of presenting a computer algorithm with samples of domain names that have been reliably classified as legitimate or DGA-generated and training the algorithm to distinguish between the two. Once an ML algorithm is trained, it can be tested to determine its accuracy and, given sufficient accuracy, subsequently deployed to classify new, previously unseen domain names as legitimate or DGA-generated.
Most ML algorithms used for classification, or classifiers, do not use the domain name directly. Instead, they use features derived from the domain names. These features range from basic ones, such as the string length or the number of vowels and consonants, to more complex ones, such as entropy and conditional probabilities of n-grams. It is important to note that, even though the features are created manually, no manual rules are created here, unlike in the rule-based approach. The power of the ML approach is that the machine, or rather the algorithm, automatically creates the "rules" during training and then applies them during classification.
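A feature extractor of the kind described might look like the sketch below; the particular feature set is an illustrative assumption, not the one used in any of the papers discussed later:

```python
import math
from collections import Counter

def shannon_entropy(s):
    """Shannon entropy (bits per character) of a string."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def domain_features(domain):
    """Map a domain name to a numeric feature vector that a generic
    classifier (e.g. Random Forest) can consume."""
    name = domain.split(".")[0].lower()
    n = max(len(name), 1)
    vowels = sum(ch in "aeiou" for ch in name)
    consonants = sum(ch.isalpha() and ch not in "aeiou" for ch in name)
    digits = sum(ch.isdigit() for ch in name)
    return [
        len(name),             # string length
        vowels / n,            # vowel ratio
        consonants / n,        # consonant ratio
        digits / n,            # digit ratio
        shannon_entropy(name), # character randomness
    ]
```

The classifier never sees the raw string, only this vector; the training procedure, not a human, decides how the features combine into a decision.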
State-of-the-Art DGA Classifiers
Even more recently, state-of-the-art machine learning algorithms, such as Random Forest and Deep Neural Network (DNN) classifiers, have been used for DGA classification and detection. Some examples of such research are reported in the following three papers:
Paper 1 "Detecting Broad Length Algorithmically Generated Domains" by Aashna Ahluwalia, Issa Traore, Karim Ganame, and Nainesh Agarwal published in 2017 (https://link.springer.com/chapter/10.1007/978-3-319-69155-8_2)
Paper 2 "Inline DGA Detection with Deep Networks" by Bin Yu, Daniel L. Gray, Jie Pan, Martine De Cock, and Anderson C. A. Nascimento published in 2017 (http://ieeexplore.ieee.org/document/8215728/)
Paper 3 "Predicting Domain Generation Algorithms with Long Short-Term Memory Networks" by Jonathan Woodbridge, Hyrum S. Anderson, Anjum Ahuja, and Daniel Grant published in 2016 (https://arxiv.org/abs/1611.00791)
Paper 1 was dedicated to exploring the use of various complex features with a Random Forest classifier. Papers 2 and 3 explored the application of deep neural networks (DNNs) to DGA classification. Paper 2 conducted a comparative analysis of a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) network, using Random Forest as the baseline. The authors showed that both the CNN and the LSTM network outperform Random Forest, but their head-to-head comparison was inconclusive. Paper 3 dealt only with LSTM networks, comparing their performance against a Random Forest baseline as well as two other, less advanced algorithms. The LSTM network was shown to outperform the alternatives.
In addition to their superior performance, another advantage of DNNs is that they use the domain names directly and require no feature engineering. There is yet another reason why they are preferable to other approaches. One of the main complaints about neural networks is that they are viewed as a "black box": the decision made by the machine cannot be explained in a rational fashion, unlike, for example, decisions made by decision trees, SVMs, or other classification algorithms, which can in some way be explained to a human user. However, in the case of DGA detection, this shortcoming of neural networks turns out to be a strong point. For classifiers that use a manual set of features, cybercriminals can conceivably modify their DGAs to "distort" the features used by the classifier and circumvent detection. By contrast, precisely because a neural network is a black box, it is nearly impossible to reverse engineer its classification process and design a DGA that would circumvent a DNN classifier.
Our Work and First Results
The goal of our work is the development of a state-of-the-art DGA classifier. We plan to use state-of-the-art algorithms, such as DNN and Random Forest classifiers. To achieve further performance improvement, we will use stacking, an ensemble learning technique that combines multiple classifiers via a meta-classifier.
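A minimal sketch of stacking with scikit-learn is shown below. The toy dataset and hand-picked features are placeholders, not our actual pipeline, and `MLPClassifier` merely stands in for a real DNN:

```python
import math
from collections import Counter
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def entropy(s):
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def features(domain):
    name = domain.split(".")[0]
    return [len(name), entropy(name), sum(ch.isdigit() for ch in name)]

# Toy training data: a few legitimate names vs. random-looking ones.
legit = ["google", "facebook", "wikipedia", "amazon", "twitter", "github"]
dga = ["xjw9qmz4tpl", "qk2v8rhw3nz", "zf7xu1pqm9d",
       "wq8rn5vkx2j", "hv3zp9qtx7m", "bn6kw2zrq8v"]
X = [features(d + ".com") for d in legit + dga]
y = [0] * len(legit) + [1] * len(dga)

# Base classifiers' cross-validated predictions are fed into a
# logistic-regression meta-classifier.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                              random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
```

The meta-classifier learns how much to trust each base model, which is where the additional performance gain over any single classifier comes from.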
The data used for the initial round of training is taken from http://datadrivensecurity.info/blog/pages/dds-dataset-collection.html. It contains 133,926 samples, of which 81,261 are legitimate domain names and 52,665 are DGA domain names of three types: 34,319 cryptolocker, 7,347 goz, and 10,999 negoz.
Here we report on the first stage of our work. Clearly, one of the classifiers used for stacking should be a DNN classifier. The research cited above used LSTM networks. The LSTM architecture is fairly complex; it was designed to handle very long sequences and the associated problem of exploding and vanishing gradients. However, for the purpose of DGA detection this problem may not arise, since domain names are relatively short. Therefore, the first goal of our research was to find out whether the complexity of LSTM networks is required for DGA classification.
We have evaluated three different architectures, comparing LSTM with a simple RNN (recurrent neural network) and a GRU (gated recurrent unit) network, which is a simplified version of LSTM. For this purpose we considered only the binary classification of legitimate vs. DGA domains, i.e., we did not differentiate between DGA types, treating all of them as one class. Our results on the above dataset show that, indeed, there is no advantage to using LSTM over RNN or GRU: all three networks achieved a classification accuracy of ~99.3%. In fact, if we increase the number of nodes so that the number of trainable parameters is very close to that of the LSTM, the RNN and GRU outperform the LSTM, albeit by a fairly small margin. Given equal performance, RNN and GRU are the preferred choice because they are faster to train and to apply.
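The parameter-count comparison can be made concrete. A recurrent layer with input dimension d and h hidden units has g·h·(h + d + 1) trainable weights, where g is the number of gate-like units per cell (1 for a simple RNN, 3 for GRU, 4 for LSTM). The dimensions below are illustrative, not those of our actual networks:

```python
def recurrent_params(d, h, gates):
    """Trainable parameters in one recurrent layer: each gate has an
    input matrix (h x d), a recurrent matrix (h x h) and a bias (h)."""
    return gates * h * (h + d + 1)

d = 64           # e.g. character-embedding dimension (illustrative)
h_lstm = 128     # LSTM hidden size (illustrative)
lstm = recurrent_params(d, h_lstm, gates=4)  # LSTM: 4 gates

# Grow the simple-RNN hidden size until its parameter count matches
# or exceeds the LSTM's, giving both networks an equal budget.
h_rnn = h_lstm
while recurrent_params(d, h_rnn, gates=1) < lstm:
    h_rnn += 1
```

With these numbers, a simple RNN needs a hidden layer a bit over twice as wide to match the LSTM's parameter budget, which is the kind of like-for-like comparison described above.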
Further work will address using a larger data set, evaluating multiclass classification performance, and creating a meta-classifier. To extend our data set, we plan to use the top 1 million Alexa legitimate domains (https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-...) and seven to eight hundred thousand domains from the DGA OSINT feed from Bambenek Consulting (http://osint.bambenekconsulting.com/feeds/). We will evaluate multiclass classification accuracy, as well as other performance metrics, computed separately for each DGA type. Finally, we will create a meta-classifier by combining the output of a DNN classifier with a Random Forest and perhaps one more classifier yet to be decided upon.
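Per-DGA-type evaluation of the planned multiclass classifier can be done with scikit-learn's `classification_report`, which computes precision, recall, and F1 separately for each class. The labels below are toy stand-ins, not real results:

```python
from sklearn.metrics import classification_report

# Toy true vs. predicted labels for the four classes in our dataset;
# the real evaluation will use held-out test data.
y_true = ["legit", "legit", "legit", "cryptolocker", "cryptolocker",
          "goz", "goz", "negoz", "negoz"]
y_pred = ["legit", "legit", "cryptolocker", "cryptolocker", "cryptolocker",
          "goz", "goz", "negoz", "legit"]

# output_dict=True returns per-class precision/recall/F1 as a nested dict.
report = classification_report(y_true, y_pred, output_dict=True)
```

Per-class metrics matter here because the dataset is imbalanced (34,319 cryptolocker vs. 7,347 goz samples), so overall accuracy alone could hide poor performance on the rarer DGA families.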
We will report our progress as it happens.