Minasan, Watashiwa Wawan Desu...

Saturday, February 19, 2011

How Much Blog Spam? A Study of a Ping Dataset

How much blog spam is produced in 5 minutes on a quiet Sunday evening? What is the ratio of spam blogs on the most popular blog services? To answer these questions, I present the results of an experiment that analyzed ping data and manually reviewed blogs.

The relative ease of creating and maintaining blogs makes them ideal tools for spamming search engines. Spam blogs, or splogs, serve two basic purposes: making money from advertising and affiliate programs, and participating in link farms. But making money from AdSense and providing nepotistic links are not enough to call a blog a splog. Otherwise we would have to classify every blog that shows ads or promotes a business as spam, and there are thousands of popular, quality blogs that would fall into this category. The distinctive feature of a splog is that it is of no use to its visitors. Should Google ban a splog from AdSense and prevent its links from passing on authority, the splog would have no remaining value or reason to exist. So my definition of a splog is “a blog whose only purpose is showing contextual or affiliate ads, or boosting the link popularity of certain target sites”.

How active are these splogs? This question calls for a little experiment, similar to the one described by P. Kolari, A. Java and T. Finin in their paper “Characterizing the Splogosphere”. They ran their experiment in early 2006, and I am going to repeat it on a smaller scale now, in early 2007.

Every time a blog is updated, it sends a ping to one of many ping servers to invite search engine crawlers to index the new post. I am going to use ping data provided by one of the most popular ping servers, Weblogs.com. Given the limited scale of the experiment, I will use the smaller dataset covering the last 5 minutes of pings. It is still pretty big: 8117 pings. I wrote a simple Java application to parse the XML file and extract the URLs and names of the blogs in the dataset. Some of the blogs were also classified by blog platform: Blogspot (Blogger), MySpace, Spaces.Live.com, etc. Along the way I discovered a number of popular blog services that I had not come across before, such as the popular Taiwanese site Wretch.cc and the Italian Libero.it and Splinder.com. I was surprised to see how few pings came from some other popular blog services; Livejournal, for instance, had only 6 pings! Obviously LJ doesn’t rely much on Weblogs.com, but LJ has little bearing on my experiment, as it is known to have a very small percentage of splogs.
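The original Java application is not shown, so the following is only a minimal sketch of how such a parser might look. It assumes each ping is a `weblog` element carrying `name` and `url` attributes; the class name `PingParser`, the host-suffix table, and the sample URLs are all hypothetical and are not taken from the author's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class PingParser {
    // Host suffix -> platform label. Illustrative only, not the author's list.
    static final String[][] PLATFORMS = {
        {"blogspot.com", "Blogspot"},
        {"myspace.com", "MySpace"},
        {"spaces.live.com", "Spaces.Live.com"},
        {"wordpress.com", "Wordpress.com"},
        {"livejournal.com", "Livejournal"},
        {"wretch.cc", "Wretch.cc"},
    };

    // Maps a blog URL to a platform label; anything unrecognized falls into "Rest".
    static String classify(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return "Rest"; // malformed URL in the ping data
        }
        if (host == null) {
            return "Rest";
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String[] platform : PLATFORMS) {
            // Match the domain itself or any subdomain of it.
            if (host.equals(platform[0]) || host.endsWith("." + platform[0])) {
                return platform[1];
            }
        }
        return "Rest";
    }

    // Parses the ping XML and counts pings per platform.
    static Map<String, Integer> countByPlatform(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse ping XML", e);
        }
        NodeList pings = doc.getElementsByTagName("weblog");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < pings.getLength(); i++) {
            Element ping = (Element) pings.item(i);
            counts.merge(classify(ping.getAttribute("url")), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny made-up sample; a real run would read the downloaded file instead.
        String sample = "<weblogUpdates>"
                + "<weblog name='A' url='http://a.blogspot.com/'/>"
                + "<weblog name='B' url='http://blog.myspace.com/b'/>"
                + "<weblog name='C' url='http://www.example.org/blog/'/>"
                + "</weblogUpdates>";
        System.out.println(countByPlatform(sample));
    }
}
```

Matching on the host suffix rather than on a substring of the whole URL keeps a standalone blog that merely mentions a platform name in its path out of that platform's category.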

Below is a breakdown of blogs by platform, according to a ping dataset retrieved on a Sunday evening, Feb. 11. Do not confuse blogs in the Wordpress.com category with blogs that use WP as a blog engine: only blogs hosted by Wordpress.com are included in this category.

Fig. 1 Popular Blog Services in the Sunday Weblogs Dataset

The huge ‘Rest’ category consists of standalone blogs and blogs hosted by minor blog services.
A few words on the blogs in the dataset: a lot of them were not in English, by my estimate as many as 70%. For instance, all Wretch.cc blogs and many Spaces.Live.com ones are in Chinese; there are also a lot of blogs in Italian, Spanish, Russian, Japanese and German.

Once the dataset was downloaded and processed, I started manually reviewing the blogs to find spam. Of course I couldn’t visit all 8117 blogs, so I randomly selected 20 blogs from each category.
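How the random selection was done is not described. One simple way to draw a sample of up to 20 blogs from a category is to shuffle a copy of the list and keep the first 20 entries. The class name `BlogSampler`, the fixed seed, and the generated URLs below are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class BlogSampler {
    // Returns up to `size` distinct items chosen uniformly at random.
    // The input list is left untouched; a shuffled copy is truncated instead.
    static <T> List<T> sample(List<T> items, int size, Random rng) {
        List<T> copy = new ArrayList<>(items);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, Math.min(size, copy.size())));
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for one platform category.
        List<String> category = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            category.add("http://blog" + i + ".example.org/");
        }
        // A fixed seed makes the sample reproducible for later re-checking.
        List<String> picked = sample(category, 20, new Random(42));
        picked.forEach(System.out::println);
    }
}
```

A category with fewer than 20 pings simply yields all of its blogs, which is the situation a service with only a handful of pings would be in.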

How did I classify spam blogs? While blogs with automatically generated content or dictionary dumps are easily classified as spam, those with plagiarized content or in foreign languages required a bit more effort. Nepotistic links with keyword-stuffed anchors were a good indicator of spam. Copyscape.com helped a lot in discovering plagiarized posts. Finally, affiliate and contextual ads completed the picture. It has to be noted that very few blogs in languages other than English were classified as spam. I can be sure about my judgment of German and Russian blogs, since I know these languages, but when dealing with others I relied only on excessive advertising and nepotistic links as spam indicators. I skipped the Wretch.cc and Explog.jp samples, as I was totally unable to judge Chinese and Japanese blogs. In total, 36 of the 177 reviewed blogs were classified as spam.

Below you can see two charts: one showing the ratio of spam within each sample, and another showing how much each blog platform contributes to the total amount of spam.

Fig. 2 Percentage of Spam Blogs in 20-Blog Samples

Fig. 3 Contribution of Each Category to the Total Blog Spam

With the notable exception of Blogspot, the majority of blogs hosted by popular blog services are spam-free. Of course, one can question their quality, as many of them are of little value to others. But let’s not forget that most of those blogs are private diaries or personal playgrounds never intended to have big audiences; as long as they have value to the author and his/her close circle of friends, we can’t call them spam.

Thus, according to my reviews, blogs hosted by beon.ru, Libero.it, Spaces.Live.com, Livejournal.com, splinder.com, and typepad.com showed no instances of blog spam in the 20-blog samples. Among 20 MySpace blogs I discovered 1 splog, and the Wordpress.com sample contained 2. Google’s popular service Blogspot lived up to its unofficial name of Splogspot with a 50% spam ratio. ‘The Rest’ category, made up of standalone blogs and blogs attached to commercial sites, showed an even bigger proportion of blog spam: 23 of the 27 blogs reviewed were classified as spam. The relatively low number of splogs hosted by public services can be explained by the anti-spam actions taken by the administrators of such services. Standalone splogs, however, are not subject to such moderation, which allows them to thrive, producing tons of junk content for search engine crawlers and overloading ping servers with spam pings.
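The counts quoted above can be checked with a little arithmetic. The sketch below assumes that Blogspot's 50% ratio means 10 splogs in a 20-blog sample; with that assumption the four non-zero categories add up to 1 + 2 + 10 + 23 = 36, matching the total reported earlier. It computes each category's share of the splogs found in the samples only, which may differ from the weighting used in the charts. The class name `SpamRatios` is hypothetical.

```java
import java.util.Locale;

class SpamRatios {
    // Percentage of `part` in `whole`.
    static double percent(int part, int whole) {
        if (whole <= 0) {
            throw new IllegalArgumentException("whole must be positive");
        }
        return 100.0 * part / whole;
    }

    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Categories with at least one splog, as reported in the text.
        String[] category = {"MySpace", "Wordpress.com", "Blogspot", "Rest"};
        int[] spam = {1, 2, 10, 23};       // Blogspot: assumed 50% of 20
        int[] reviewed = {20, 20, 20, 27};

        int totalSpam = sum(spam);
        for (int i = 0; i < category.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                    "%-14s %5.1f%% of its sample, %5.1f%% of all splogs found",
                    category[i],
                    percent(spam[i], reviewed[i]),
                    percent(spam[i], totalSpam)));
        }
        System.out.println("Total splogs: " + totalSpam);
    }
}
```

Under these assumptions the ‘Rest’ category alone accounts for roughly 64% of the splogs found, and Blogspot for roughly 28%.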

As you might have noticed, I used the chart style introduced by the famous blog ModernLifeIsRubbish.co.uk, which has an excellent tutorial on how to create pretty pie charts in Adobe Illustrator. Highly recommended!

If anybody is interested, here is the dataset I used: Dataset


