High placement in a search engine is critical for the success of any online business. Pages appearing higher in the search engine results for queries relevant to a site’s business will get more targeted traffic. To gain this kind of competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results.
In the best case, SEO specialists create relevant, well-structured, keyword-rich pages, which not only please the eyes of a search engine crawler but also have value to the human visitor. Unfortunately, it takes months for this strategic approach to produce tangible results, so many search engine optimizers resort to so-called “black-hat” SEO.
‘Black Hat’ SEO and Search Engine Spam
The oldest and simplest “black-hat” SEO strategy is adding a variety of popular keywords to web pages to make them rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, “black-hat” SEO went one step further and created so-called “doorway” pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but they are never seen by human visitors, who are redirected to the page intended to receive the traffic.
Another trend is abusing link-popularity-based ranking algorithms such as PageRank with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of “black-hat” SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.
“Black-hat” SEO is responsible for an immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because the identified pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.
Using Statistics to Detect Search Engine Spam
An example of applying statistical methods to detect web spam is presented in the paper “Spam, Damn Spam and Statistics” by Dennis Fetterly, Mark Manasse and Marc Najork of Microsoft. They used two sets of pages downloaded from the Internet. The first set (Set 1) was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.
Another set (Set 2) was crawled between July and September 2002 and comprises 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links; for the HTTP redirects, the source and the target URL. A sample of 535 pages was manually inspected and 37 of them were identified as spam (6.9%).
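The sampling figures above can be checked with a few lines of arithmetic. The sketch below assumes the usual normal-approximation interval for a proportion; the article does not say which method the authors used, and the function name is only illustrative.

```python
import math

def proportion_with_margin(hits, sample_size, z=1.96):
    """Return the sample proportion and its margin of error.

    Assumes the normal approximation to the binomial; z=1.96
    corresponds to a 95% confidence level.
    """
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# 61 spam pages found in a manually inspected sample of 751 pages
p, margin = proportion_with_margin(61, 751)
print(f"spam share: {p:.1%} +/- {margin:.2%}")  # spam share: 8.1% +/- 1.95%
```

Under this assumption the result matches the 8.1% and 1.95% quoted for the first set.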
The research concentrates on studying the following properties of web pages:
– URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
– Host name resolutions.
– Linkage properties.
– Content properties.
– Content evolution properties.
– Clustering properties.
Search engine optimizers often use numerous automatically generated pages to funnel their low individual PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.
Manual inspection of the 100 longest host names revealed that 80 of them belong to adult sites and 11 refer to financial and credit-related sites. Therefore, in order to produce a spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
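A rule of this kind is easy to express in code. The sketch below is one possible reading of the thresholds quoted above (a host name of at least 45 characters that also contains at least 6 dots, 5 dashes or 10 digits); the function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

def host_looks_machine_generated(url, min_length=45,
                                 min_dots=6, min_dashes=5, min_digits=10):
    """Flag a URL whose host component is long and heavy on
    non-alphabetical characters. Thresholds follow the article;
    combining them this way is an assumption."""
    host = urlsplit(url).hostname or ""
    if len(host) < min_length:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits

# Hypothetical examples; only the host component is examined
print(host_looks_machine_generated("http://www.example.com/some/long/path"))
print(host_looks_machine_generated(
    "http://cheap-loans-fast-credit-best-rates-online-now.example.com/"))
```

Raising or lowering the keyword arguments trades the number of flagged pages against the number of false positives, as described above.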
Host Name Resolutions
One can notice that Google, given a query q, tends to rank a page higher if the host component of the page’s URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases and set up DNS servers to resolve these host names to a single IP. Generally SEOs generate a large number of host names to rank for a wide variety of popular queries.
This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In our set 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.
To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of these pages showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
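The counting step can be sketched as follows, assuming the crawl has already produced (host name, IP address) pairs. The function name and the sample data are made up for illustration; the 10,000 default is the threshold from the text.

```python
from collections import defaultdict

def ips_over_threshold(resolutions, threshold=10_000):
    """Map each IP to the number of distinct host names resolving to it
    and keep only the IPs at or above the threshold."""
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host.lower())
    return {ip: len(hosts) for ip, hosts in hosts_by_ip.items()
            if len(hosts) >= threshold}

# Toy data with a toy threshold; real crawls involve millions of pairs
sample = [
    ("cheap-loans.example", "10.0.0.1"),
    ("best-credit.example", "10.0.0.1"),
    ("fast-mortgage.example", "10.0.0.1"),
    ("www.site.example", "10.0.0.2"),
]
print(ips_over_threshold(sample, threshold=3))  # {'10.0.0.1': 3}
```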
The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology, the number of outgoing links of a page is its out-degree, while its in-degree is the number of links pointing to the page. By analyzing the out- and in-degree values it is also possible to detect spam pages, which show up as outliers in the corresponding distributions.
In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall, 0.05% of the pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution would suggest, and according to manual inspection of a cross section, almost all of them are spam.
The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of the pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
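One way to find such over-represented degree values is to fit a straight line to the degree histogram on a log-log scale and compare each observed count with the fitted one. The sketch below assumes a plain least-squares fit, which is not necessarily how the paper's authors estimated the trend; the function names and the synthetic histogram are illustrative only.

```python
import math

def fit_power_law(degree_counts):
    """Least-squares line through (log degree, log page count).
    Degrees and counts must all be positive."""
    xs = [math.log(d) for d in degree_counts]
    ys = [math.log(c) for c in degree_counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def overrepresented_degrees(degree_counts, factor=3.0):
    """Return degrees whose page count is at least `factor` times
    the count predicted by the fitted trend."""
    slope, intercept = fit_power_law(degree_counts)
    flagged = {}
    for degree, observed in degree_counts.items():
        expected = math.exp(intercept + slope * math.log(degree))
        if observed >= factor * expected:
            flagged[degree] = (observed, round(expected))
    return flagged

# Synthetic histogram: count ~ 1e6 / degree^2, with one planted spike
counts = {d: round(1e6 / d ** 2) for d in range(1, 201)}
counts[100] = 5000  # the trend alone would predict about 100 pages
print(overrepresented_degrees(counts))
```

The same function works for in-degrees and out-degrees, since it only sees a mapping from degree to number of pages.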
Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs, who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.
For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count across pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis; both axes use a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (a page with a message indicating that the resource is not currently available, despite the HTTP status code 200 “OK”).
The natural evolution of content on the Web is slow. Over a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam pages are generated in response to an HTTP request independently of the requested URL and will change completely on every download. Therefore, by looking into extreme cases of content mutation, search engines are able to detect web spam.
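The zero-variance case can be sketched in a few lines, assuming (host, word count) pairs are available for every downloaded page. Treating "same word count" as zero variance over all pages of a host with at least 10 pages is one interpretation of the description above; the function name and data are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

def hosts_with_identical_word_counts(pages, min_pages=10):
    """Return hosts that served at least `min_pages` pages whose
    non-markup word counts show no variance at all."""
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(host for host, counts in counts_by_host.items()
                  if len(counts) >= min_pages and pvariance(counts) == 0)

# Toy data: a templated host, an ordinary host, and a host with too few pages
pages = ([("template.example", 120)] * 10
         + [("blog.example", n) for n in range(100, 110)]
         + [("tiny.example", 50)] * 3)
print(hosts_with_identical_word_counts(pages))  # ['template.example']
```

As the manual sample shows, hosts flagged this way still need a second look, because soft-error pages produce the same signature.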
The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 was an adult page counted as a false positive.
Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same template and differ only slightly (for example, by the keywords inserted into the template). Pages with such properties can be detected by applying cluster analysis to our samples.
To form clusters of similar pages, the ‘shingling’ algorithm described by Broder et al. is used. Figure 7 shows the distribution of the cluster sizes of near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.
The outliers can be put into two groups. The first group does not contain any spam pages; the pages in this group are related to the duplicated-content issue. The second group, in contrast, is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
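A rough sketch of this check is shown below. It assumes that a change score between 0.0 (identical) and 1.0 (completely different) has already been computed for each page between two weekly downloads; the article does not describe how the recorded change vectors are turned into such a score, so both the score and the function name are assumptions.

```python
from collections import defaultdict

def servers_changing_completely(observations, min_pages=1, complete=1.0):
    """Return IPs for which every observed page changed completely
    between downloads. `observations` holds (ip, change_score) pairs."""
    scores_by_ip = defaultdict(list)
    for ip, change in observations:
        scores_by_ip[ip].append(change)
    return sorted(ip for ip, scores in scores_by_ip.items()
                  if len(scores) >= min_pages and min(scores) >= complete)

# Toy data: one server regenerates everything, the other mostly does not
observations = [
    ("10.0.0.1", 1.0), ("10.0.0.1", 1.0), ("10.0.0.1", 1.0),
    ("10.0.0.2", 0.0), ("10.0.0.2", 0.1), ("10.0.0.2", 1.0),
]
print(servers_changing_completely(observations))  # ['10.0.0.1']
```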
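The idea behind shingling can be illustrated with a small sketch: each page is reduced to its set of overlapping word sequences (shingles), two pages are compared by the overlap of those sets, and pages above a similarity threshold are grouped. The version below compares all pairs directly, which only works for small inputs; the shingle width, the threshold, the function names and the sample pages are assumptions, and a production system would rely on compact sketches of the shingle sets instead.

```python
def shingles(text, w=4):
    """Set of all runs of w consecutive words in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard overlap of the two pages' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cluster_near_duplicates(pages, threshold=0.5, w=4):
    """Group page ids whose resemblance reaches the threshold,
    merging groups transitively with a small union-find."""
    parent = {pid: pid for pid in pages}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(pages)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if resemblance(pages[a], pages[b], w) >= threshold:
                parent[find(a)] = find(b)
    clusters = {}
    for pid in ids:
        clusters.setdefault(find(pid), []).append(pid)
    return sorted((sorted(c) for c in clusters.values()), key=len, reverse=True)

# Three pages from one template with a varying keyword, plus an unrelated page
template = ("buy cheap {} online today with the best prices guaranteed "
            "and free shipping to every country worldwide")
pages = {f"spam{i}": template.format(kw)
         for i, kw in enumerate(["watches", "laptops", "tickets"], start=1)}
pages["news"] = "the quarterly report covers revenue growth and hiring plans for next year"
print(cluster_near_duplicates(pages))
```

Unusually large clusters produced this way are the candidates for manual review, keeping in mind that some of them are legitimate duplicated content rather than spam.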
To Sum Up
The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources rather than on technical tricks.