Typosquatting Detection using Advanced Analytics

Typosquatting - the good, bad, and ugly

 

Let me get this out of the way first: there isn't much good about typosquatting. It's predominantly bad, and sometimes really ugly. Typosquatting is the registration of Internet domain names that appear similar to popular and reputable domains (e.g., vacebook.com instead of facebook.com). It is a common trick used by hackers, who slightly misspell the name of a valid domain to fool users wishing to access that domain. A study [1] conducted in 2015 concluded that, for the top 500 most popular websites alone, over 10,000 malicious typosquatting domains are produced daily.

There are multiple attack vectors that typosquatters can exploit: email, mistyping a domain name in the address bar, and mistyping URLs in HTML or JavaScript code. There are two types of email-based attacks: passive and active. In a passive attack, the attacker registers typosquatting domains and sets up an email server to receive all emails addressed to those domains, regardless of the destination user. Since the email volume of most companies is very high, even a small percentage of emails with misspelled destination addresses represents a significant number of intercepted emails. The intercepted emails can contain all kinds of valuable and proprietary information, such as user names, passwords, sensitive information about corporate network configuration, and trade secrets. As one example, in 2011 two researchers [2] managed to collect, over a period of 6 months, 20 GB of misdirected emails sent to a number of domains typosquatting the domains of Fortune 500 companies. The active attack vector involves social engineering. Since a typosquatting domain is very similar to the legitimate email domain, an attacker can impersonate a person at the legitimate organization and attempt to obtain sensitive information.

Typosquatting websites can attract traffic even without setting up email servers. True to its name, typosquatting feeds off the human tendency to err and mistype URLs in the address bar. Sophos, a British IT security company, conducted an investigation [3] during which researchers intentionally typed misspelled variations of popular sites, such as Facebook, Twitter, and Apple, into the address bar. They discovered that 80% of these mistyped addresses were redirected to a phishing website.

The main goal of redirection by typosquatters is to monetize Internet traffic that is intended to go to popular websites. A study of the monetization methods [4] concluded that most of them rely on pay-per-click advertisements. However, users who follow a typosquatted URL may also be exposed to a hacking attempt, e.g. malware or ransomware can be installed on their computer. Typosquatting can also be exploited to acquire control of a website and exfiltrate data from it. Finally, typosquatting can damage a brand name, e.g. by displaying inappropriate messages.

 

Common Approaches to Typosquatting Detection

 

Companies can fight typosquatting proactively by registering potential typosquatting domains and redirecting the Internet traffic to the real domain. For example, if one enters "facebok.com" or "faseboom.com" in the address bar, the request is redirected to "facebook.com". However, this can be an expensive exercise, as it requires registering many variations of a domain name. Large companies may be able to afford to have hundreds of typosquatting domain names registered as part of their brand protection. For small businesses, though, it will most likely present an undue burden.

Alternatively, one can choose a more defensive approach that involves discovering and tagging potential typosquatters. In order to discover typosquatting domains, companies can screen incoming email traffic or scan all newly registered domains. Once potential typosquatting domains are discovered, companies can configure their networks to block DNS requests for those domains and emails sent to them by employees.

 

While it’s possible to manually create a list of blacklisted domains, the discovery of typosquatting domains in most cases involves data processing programs. Similar to other areas of data processing, there are two main approaches to typosquatting detection: rule-based and data-driven. A rule-based approach uses regular expressions to generate a set of domain names resembling the target name(s). For example, the regular expression g(o{0,3}0{0,2}|0{0,3}o{0,2})gle (using Python conventions) matches strings such as "google", "go0gle", "g0ogle", "g00gle", "gooogle", etc. Rule-based approaches generate possible typo-variants of a legitimate domain, creating blacklists which can be used to prevent users from accessing typosquatted domains. These approaches, however, have a number of shortcomings. A set of rules can miss some variations of the target domain name, and as the number of target names grows, the number of rules becomes unmanageable.
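As a rough illustration of the rule-based approach, the regular expression quoted above can be tried out with Python's standard re module. The helper name looks_like_google and the list of candidate strings are made up for this sketch. The last two candidates show how a single rule can miss plausible variants.

```python
import re

# Pattern from the text: "g", a run of o's and zeros in one of two orders, then "gle".
PATTERN = re.compile(r"g(o{0,3}0{0,2}|0{0,3}o{0,2})gle")

def looks_like_google(label: str) -> bool:
    """Return True if the whole label matches the rule (hypothetical helper)."""
    return PATTERN.fullmatch(label.lower()) is not None

# Illustrative candidates only; the last two are not covered by the rule.
candidates = ["google", "go0gle", "g0ogle", "g00gle", "gooogle", "goggle", "g0o0gle"]
for name in candidates:
    print(name, looks_like_google(name))
```

Note that fullmatch is used rather than search, so that a longer label that merely contains a matching substring is not reported as a match.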

 

A more flexible, data-driven approach involves using some type of similarity measure between domain names, such as edit distance. Edit distance [5] is a concept that originated in computational linguistics. It measures the dissimilarity between two strings, or more generally between two sequences of objects. One of the most common edit distances is the Levenshtein distance. The Levenshtein distance between two strings is defined as the minimum number of operations required to transform one string into the other using three basic operations: insertion, deletion and substitution. For example, the Levenshtein distance between "facebook" and "aceb0oks" is 3: character 'f' is deleted, character 'o' is replaced with character '0', and character 's' is inserted.
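The definition above can be turned into code with the usual dynamic programming scheme. The following is a minimal sketch in plain Python, not the implementation used in any of the cited work, and the function name levenshtein is our own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] is the distance between the characters of a processed so far
    # (before the current one) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into the empty string: i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        previous = current
    return previous[-1]

print(levenshtein("facebook", "aceb0oks"))  # 3, as in the example above
```

The sketch keeps only two rows of the table in memory, which is sufficient when only the distance, and not the sequence of edits, is needed.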

 

The more similar two domain names are, the smaller their Levenshtein distance is. New domains are flagged as potentially typosquatting if their Levenshtein distance from the target domain falls below a threshold. The threshold can be determined empirically. The optimal threshold should minimize the chances of flagging a legitimate domain as typosquatting while still detecting most, if not all, of the true typosquatting domains. Examples of recent research that used string similarity measures for typosquatting detection include [6], which used a generalized Levenshtein distance, and [7], which used the Damerau–Levenshtein distance, a variation of the Levenshtein distance that adds the transposition of two adjacent characters to the set of possible operations.
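To make the role of transpositions and of the threshold concrete, here is a small sketch. It implements the optimal string alignment variant, a commonly used restricted form of the Damerau–Levenshtein distance, which is not necessarily the exact variant used in [7]. The function names, the candidate list and the threshold of 2 are assumptions chosen purely for illustration.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Swap of two adjacent characters counts as a single operation.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def flag_candidates(target: str, candidates: list[str], threshold: int) -> list[str]:
    """Return candidates whose distance from target falls below the threshold."""
    return [c for c in candidates if osa_distance(target, c) < threshold]

# Hypothetical candidate labels and threshold, for illustration only.
print(flag_candidates("facebook", ["faecbook", "facebok", "fakebook", "example"], 2))
```

With a plain Levenshtein distance the swapped letters in "faecbook" would cost two operations; counting the swap as one operation is what brings it under this illustrative threshold.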

 

Our Approach to Typosquatting Detection

 

Using the Levenshtein distance, or related measures, has its disadvantages, especially when applied to short domain names. Specifically, any two three-character domain names are at most a Levenshtein distance of 3 apart, since one can go from any three-character string to any other by substituting all three characters. To alleviate this shortcoming, we decided to use a normalized Levenshtein distance for typosquatting detection. There are a number of ways to define such a distance. We define the normalized Levenshtein distance as the Levenshtein distance divided by the square root of the product of the lengths of the two strings. For example, the normalized Levenshtein distance between the strings "ook" and "00q" is 3/3 = 1, whereas the normalized Levenshtein distance between the strings "facebook" and "faceb00q" is 3/8 = 0.375. To the best of our knowledge, this is the first instance of using a normalized Levenshtein distance for typosquatting detection.
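The normalization defined above is straightforward to sketch in Python. The code below is an illustration of the formula rather than the production implementation; the function names and the choice to reject empty strings are our own assumptions.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Levenshtein distance divided by sqrt(len(a) * len(b))."""
    if not a or not b:
        # Assumption of this sketch: the measure is left undefined for empty strings.
        raise ValueError("both strings must be non-empty")
    return levenshtein(a, b) / math.sqrt(len(a) * len(b))

print(normalized_levenshtein("ook", "00q"))            # 1.0
print(normalized_levenshtein("facebook", "faceb00q"))  # 0.375
```

Both pairs differ by three substitutions, yet the normalized value separates the short pair, where every character changed, from the long pair, where most of the name is intact.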

 

The algorithm is being deployed to perform typosquatting detection for a TCS customer that has some 30+ target domain names. In order to test our algorithm and to estimate the false positive rate, i.e. the share of non-typosquatting domains that the algorithm flags, we downloaded a large collection of legitimate domain names, the Majestic Million [8]. The Majestic Million is a list of the top 1 million websites in the world, ordered by the number of referring subnets. We computed the normalized Levenshtein distance between each domain in the Majestic Million and the 30+ target customer domains. Since anomalously small normalized Levenshtein distances are outliers in the distribution of all distances, we set the threshold using a standard definition of outliers [9]. Using this threshold, we tagged 268 out of 1,000,000 domain names as outliers, which corresponds to a false positive rate of under 0.03%. At the same time, using this threshold, we detected all newly registered domains that had been detected by a popular cybersecurity tool.
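The text does not spell out which outlier rule was applied. One common choice, described in the NIST handbook section cited as [9], is the box plot fence: values below Q1 - 1.5 * IQR are treated as outliers on the low side. The sketch below assumes that rule; the function name, the multiplier k and the sample distances are invented for illustration and are not the customer data.

```python
import statistics

def lower_outlier_threshold(values: list[float], k: float = 1.5) -> float:
    """Lower box plot fence Q1 - k * IQR (assumed rule, see lead-in)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q1 - k * (q3 - q1)

# Made-up normalized distances between candidate domains and a target domain.
distances = [0.80, 0.85, 0.90, 0.88, 0.92, 0.87, 0.83, 0.91, 0.86, 0.25]

threshold = lower_outlier_threshold(distances)
flagged = [d for d in distances if d < threshold]
print(threshold, flagged)
```

A larger multiplier, such as k = 3 for the outer fence, would flag fewer domains and so trade a lower false positive rate for a higher chance of missing true typosquatting domains.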

 

So far we have shown how to apply advanced data analytics to typosquatting detection for a small set of domains. The next question is whether a similarity threshold optimized for one set of domain names can be generalized to other domain names. This could be the first step on the road to applying machine learning to this problem. One challenge here is creating an extensive training set of actual typosquatting domain names for a significant number of real domains. Having a good training set will go a long way towards building a supervised classifier for typosquatting detection.

 

References 

 

[1] P. Agten, W. Joosen, F. Piessens, N. Nikiforakis, Seven months’ worth of mistakes: A longitudinal study of typosquatting abuse, in: Proceedings of the 22nd Network and Distributed System Security Symposium (NDSS 2015), Internet Society, 2015

[2] https://www.wired.com/images_blogs/threatlevel/2011/09/Doppelganger.Domains.pdf

[3] https://nakedsecurity.sophos.com/typosquatting/

[4] T. Moore and B. Edelman, “Measuring the perpetrators and funders of typosquatting,” in Financial Cryptography and Data Security. Springer, 2010, pp. 175–191

[5] https://en.wikipedia.org/wiki/Edit_distance

[6] Liu, T., Zhang, Y., Shi, J., Ya, J., Li, Q., & Guo, L. (2016). Towards quantifying visual similarity of domain names for combating typosquatting abuse. MILCOM 2016 - 2016 IEEE Military Communications Conference, pp. 770-775

[7] J. Szurdi, N. Christin, Email typosquatting, in: Proceedings of the 2017 Internet Measurement Conference (IMC '17), pp. 419-431

[8] https://blog.majestic.com/development/majestic-million-csv-daily/

[9] https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
