Minasan, Watashiwa Wawan Desu...

Friday, February 4, 2011

SEO Articles : How Much Blog Spam? A Study of a Ping Dataset

How much blog spam is produced in 5 minutes on a quiet Sunday evening? What is the ratio of spam blogs on the most popular blog services? To answer these questions, I present the results of an experiment that analyzed ping data and manually reviewed blogs.

The relative ease of creating and maintaining blogs makes them ideal tools for spamming search engines. Spam blogs, or splogs, serve two basic purposes: making money from advertising and affiliate programs, and participating in link farms. But making money from AdSense and providing nepotistic links are not enough to call a blog a splog. Otherwise we would have to classify all blogs showing ads or promoting a business as spam, and there are thousands of popular, quality blogs that would fall into this category. The distinctive feature of a splog is that it is of no use to its visitors. Should Google ban a splog from AdSense and prevent its links from passing on authority, the splog would have no remaining value or reason to exist. So my definition of a splog is a blog whose only purpose is to show contextual or affiliate ads, or to boost the link popularity of certain target sites.

How active are these splogs? This question calls for a little experiment, similar to the one described by P. Kolari, A. Java and T. Finin in their paper Characterizing the Splogosphere. They ran their experiment in early 2006, and I am going to repeat it on a smaller scale now, in early 2007.

Every time a blog is updated, it sends a ping to one of many ping servers to invite search engine crawlers to index the new post. I am going to use ping data provided by one of the most popular ping servers, Weblogs.com. Due to the limited scale of the experiment, I will be using the smaller dataset covering the last 5 minutes of pings. It's pretty big though: 8117 pings. I've written a simple Java application to parse the XML file and extract the URLs and names of the blogs in the dataset. Some of the blogs were also classified by blog platform: Blogspot (Blogger), MySpace, Spaces.Live.com, etc. I discovered a number of popular blog services that I hadn't come across before, such as the popular Taiwanese site Wretch.cc, or the Italian Libero.it and Splinder.com. I was surprised to see how few pings came from some other popular blog services; Livejournal, for instance, had only 6 pings! Obviously LJ doesn't rely much on Weblogs.com, but LJ has little to do with my experiment anyway, as it is known to have a very small percentage of splogs.
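The parsing and platform classification described above can be sketched roughly as follows. This is a Python illustration, not the Java application mentioned in the post. It assumes the ping file is a `weblogUpdates` document whose `weblog` elements carry `name` and `url` attributes; the hostname suffixes in `PLATFORM_SUFFIXES` and the three sample entries are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Assumed hostname suffixes for the services named in the post.
# The author's actual matching rules are not given.
PLATFORM_SUFFIXES = {
    "blogspot.com": "Blogspot",
    "myspace.com": "MySpace",
    "spaces.live.com": "Spaces.Live.com",
    "wordpress.com": "Wordpress.com",
    "livejournal.com": "Livejournal",
    "wretch.cc": "Wretch.cc",
    "libero.it": "Libero.it",
    "splinder.com": "Splinder.com",
}

def parse_pings(xml_text):
    """Return (name, url) pairs for every <weblog> element in the ping file."""
    root = ET.fromstring(xml_text)
    return [(w.get("name", ""), w.get("url", "")) for w in root.iter("weblog")]

def classify(url):
    """Map a blog URL to a hosting platform, or "Rest" if none matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, platform in PLATFORM_SUFFIXES.items():
        # Match the domain itself or any subdomain, but not lookalike domains.
        if host == suffix or host.endswith("." + suffix):
            return platform
    return "Rest"

def count_by_platform(pings):
    """Count pings per platform."""
    return Counter(classify(url) for _, url in pings)

# Invented sample in the assumed format; not real ping data.
SAMPLE = """<weblogUpdates version="2" count="3">
  <weblog name="Example A" url="http://example-a.blogspot.com/" when="1"/>
  <weblog name="Example B" url="http://blog.myspace.com/exampleb" when="2"/>
  <weblog name="Example C" url="http://www.example.org/blog/" when="3"/>
</weblogUpdates>"""

if __name__ == "__main__":
    for platform, n in count_by_platform(parse_pings(SAMPLE)).most_common():
        print(platform, n)
```

Because the match is on hostname only, a blog running the WordPress engine on its own domain lands in Rest, while only blogs hosted under wordpress.com count as Wordpress.com.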

Below is a breakdown of blogs by platform, according to a ping dataset retrieved on a Sunday evening, Feb. 11. Do not confuse blogs in the Wordpress.com category with blogs using WP as a blog engine. Only blogs hosted by Wordpress.com are included in this category.

Fig. 1 Popular Blog Services in the Sunday Weblogs Dataset

The huge Rest category consists of standalone blogs and blogs hosted by minor blog services.
A few words on the blogs in the dataset: a lot of blogs were not in English, I estimate as many as 70% of them. For instance, all Wretch.cc blogs and many Spaces.Live.com ones are in Chinese. There are also a lot of blogs in Italian, Spanish, Russian, Japanese and German.

Once the dataset was downloaded and processed, I started manually reviewing the blogs and looking for spam. Of course I couldn't visit all 8117 blogs, so I randomly selected 20 blogs from each category.
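Drawing a fixed-size random sample from each category could look roughly like the Python sketch below. The function name, the `seed` parameter, and the input shape (a dict mapping category names to lists of blog URLs) are assumptions for illustration. Categories with fewer than 20 blogs are taken whole, since a category such as Livejournal had only 6 pings.

```python
import random

def sample_per_category(blogs_by_category, k=20, seed=None):
    """Pick up to k blogs at random, without replacement, from each category."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return {
        category: rng.sample(list(urls), min(k, len(urls)))
        for category, urls in blogs_by_category.items()
    }

if __name__ == "__main__":
    # Hypothetical input: 50 blogs in one category, 6 in another.
    blogs = {
        "Blogspot": [f"http://blog{i}.blogspot.com/" for i in range(50)],
        "Livejournal": [f"http://user{i}.livejournal.com/" for i in range(6)],
    }
    for category, picked in sample_per_category(blogs, seed=1).items():
        print(category, len(picked))
```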

How did I classify spam blogs? While blogs with automatically generated content or dictionary dumps are easily classified as spam, those with plagiarized content or in foreign languages required a bit more effort. Nepotistic links with keyword-stuffed anchors were a good indicator of spam. Copyscape.com helped a lot in discovering plagiarized posts. Finally, affiliate and contextual ads rounded out the spam classification. It has to be noted that very few blogs in languages other than English were classified as spam. I can be sure about my judgment of German and Russian blogs, since I know these languages, but when dealing with others I relied only on excessive advertising and nepotistic links as spam indicators. I skipped the Wretch.cc and Explog.jp samples, as I was totally unable to judge Chinese and Japanese blogs. In total, 36 of the 177 reviewed blogs were classified as spam.

Below you can see two charts: one showing the ratio of spam within each sample, and another showing how much each blog platform contributes to the total amount of spam.

Fig. 2 Percentage of Spam Blogs in 20-Blog Samples

Fig. 3 Contribution of Each Category to the Total Blog Spam

With the notable exception of Blogspot, the majority of blogs hosted by popular blog services are spam-free. Of course one can question their quality, as many of them are of little value to others. But let's not forget that most of those blogs are private diaries or personal playgrounds never intended to have big audiences, and as long as they have value to the author and his/her close circle of friends we can't call them spam.

Thus, according to my reviews, blogs hosted by beon.ru, Libero.it, Spaces.Live.com, Livejournal.com, splinder.com, and typepad.com showed no instances of blog spam in the 20-blog samples. Among the 20 MySpace blogs I discovered 1 splog, and the Wordpress.com sample contained 2. Google's popular service Blogspot has lived up to its unofficial name of Splogspot with a 50% spam ratio. The Rest category, comprising standalone blogs and blogs attached to commercial sites, showed an even bigger proportion of blog spam: 23 of the 27 blogs reviewed were classified as spam. The relatively low number of splogs hosted by public services can be explained by the anti-spam actions taken by the administration of such services. Standalone splogs, however, are not subject to such moderation, which allows them to thrive, producing tons of junk content for search engine crawlers and overloading ping servers with spam pings.
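As a quick check on the arithmetic, the short Python sketch below recomputes the percentages from the counts stated in the text. The Blogspot count of 10 is not given directly; it is inferred here from the 50% ratio on a 20-blog sample, which also agrees with the stated total of 36 splogs once the other three counts are subtracted. The variable and function names are made up for the example.

```python
# Splog counts per category as stated in the text.
# Blogspot's 10 is inferred (50% of a 20-blog sample), not quoted directly.
SPAM_COUNTS = {"Blogspot": 10, "MySpace": 1, "Wordpress.com": 2, "Rest": 23}

TOTAL_REVIEWED = 177  # blogs reviewed across all samples
TOTAL_SPAM = 36       # blogs classified as spam
REST_REVIEWED = 27    # blogs reviewed in the Rest category

def share_of_total_spam(counts):
    """Percentage of all discovered splogs contributed by each category."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

if __name__ == "__main__":
    # The per-category counts should add up to the stated total.
    assert sum(SPAM_COUNTS.values()) == TOTAL_SPAM

    print(f"Overall spam ratio: {100.0 * TOTAL_SPAM / TOTAL_REVIEWED:.1f}%")
    print(f"Rest spam ratio: {100.0 * SPAM_COUNTS['Rest'] / REST_REVIEWED:.1f}%")
    for name, pct in share_of_total_spam(SPAM_COUNTS).items():
        print(f"{name}: {pct:.1f}% of all splogs found")
```

On these counts, roughly one reviewed blog in five was spam overall, and the Rest category accounts for close to two thirds of the splogs found.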

As you might have noticed I used the same style of charts introduced by the famous blog ModernLifeIsRubbish.co.uk, which has an excellent tutorial on how to create pretty pie charts in Adobe Illustrator. Highly recommended!
