June 11, 2010

One of the most difficult challenges that web crawlers and directories face is spam. Web spam, also known as spamdexing (spamming the index of a search engine), refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Related terms are search spam and search engine spam. It involves injecting artificially created pages into the web in order to influence search engine results and drive traffic to certain pages for fun or profit. The paper "Detecting Spam Web Pages through Content Analysis" defines web spam as "the practices of crafting web pages for the sole purpose of increasing the ranking of sites or some affiliated pages, without improving the utility to the viewer."

As more business is transacted over the Internet, search engines have become the entry point for reaching relevant information. Consequently, site owners have observed that most of their traffic, or most of the growth in it, comes from search engine referrals. For commercial web sites, this translates into increased revenue and sales, which pushes web operators to adjust their sites in order to influence search results. This is not necessarily unethical. In fact, Search Engine Optimization (SEO), a set of procedures accepted in the industry for making a website indexable by search engines without misleading the indexing process, exists for this purpose. However, it has also opened up opportunities for other webmasters to take the idea further, resulting in spamming.

Some of the types of web spam are:

Term/Content/Text Spamming

This refers to repeating some important terms and dumping many unrelated terms into a web page. It can be recognized as keywords placed in various text fields such as the body of a page, the title, meta tags, the URL or anchor text.
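As a rough illustration of how heavy term repetition might be measured, the sketch below computes the share of a page's words taken up by its single most frequent term. The function name `top_term_share`, the tokenization rule and the sample text are assumptions made for this example; they are not taken from the paper quoted above, and a real detector would combine many more signals.

```python
import re
from collections import Counter


def top_term_share(text):
    """Return (term, share) for the most frequent word in text.

    share is the fraction of all words that are that term.
    Hypothetical helper, for illustration only.
    """
    # Lowercase, then keep runs of letters, digits and apostrophes as words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return None, 0.0
    term, count = Counter(words).most_common(1)[0]
    return term, count / len(words)


# Made-up sample of a keyword-heavy page body.
stuffed = "cheap pills cheap pills buy cheap pills cheap cheap"
term, share = top_term_share(stuffed)
print(term, round(share, 2))
```

In ordinary prose the most frequent word is usually a common word such as "the" with a small share, so an unusually large share for a commercial term is the kind of signal this sketch is meant to surface. Any cutoff value would have to be chosen empirically.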
Under this type of technique, we have so-called "keyword stuffing", in which chosen keywords are repeated excessively in the body of a web page. Another technique, which has become ineffective, is "meta tag stuffing", or overly repeating keywords in meta tags. Similarly, "doorway" or "gateway" pages are pages that offer very little content but are again stuffed with keywords. Sites "made for AdSense", or "scrapers", can also be considered under this form. Their approach is to take information from other sources without permission and present it in a unique way. Another form of content spamming is "article spinning", which is done by rewriting an existing article either manually or automatically. "Redirecting sites", on the other hand, redirect a user to a different web site without showing him or her the spammed site.

Link Spamming

This is the practice of adding extraneous and misleading links to and from web pages. One of its forms is the "link farm", a group of web pages that all hyperlink to every other page in the group. Sometimes web spammers purchase expired domains in order to populate their pages with links to the spammer's own web site. Another version is "domain flooding", where a plethora of domains redirect to a target website. "Page awards" may also be considered link spamming. The spammer pretends to run an organization that distributes awards for web site design or information, and the participating site gets to display the award. The award may take the form of an image linking back to the awarding organization, effectively increasing the visibility of the spammer's site.

Other techniques involve concealing spamming sentences, terms and links so that web users do not see them. This is done by giving the text the same color as the background. Sometimes spam web servers return one HTML document to the user and a different document to a web crawler for indexing; this is known as "cloaking".
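One way to picture cloaking is as a mismatch between the text served to a browser and the text served to a crawler. The sketch below, which is only an illustration, scores that mismatch with Jaccard similarity over word sets. The function name `word_overlap` and the sample strings are hypothetical, the inputs are assumed to be plain text already extracted from the HTML, and fetching the two versions of a page is left out.

```python
import re


def word_overlap(doc_a, doc_b):
    """Jaccard similarity of the word sets of two documents (0.0 to 1.0)."""
    words_a = set(re.findall(r"[a-z0-9]+", doc_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", doc_b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty documents are treated as identical
    # Shared words divided by all distinct words seen in either document.
    return len(words_a & words_b) / len(words_a | words_b)


# Made-up examples of what a cloaking server might return.
served_to_user = "Welcome to our casino, play now"
served_to_crawler = "Free recipes, cooking tips and healthy meals"
print(word_overlap(served_to_user, served_to_crawler))
```

A score near 1.0 means both visitors saw essentially the same words, while a score near 0.0 suggests two unrelated documents. Legitimate pages also vary between requests (ads, timestamps, personalization), so a low score is a hint to investigate rather than proof of cloaking.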
Studies show that web spam is most prevalent under the .biz domain, followed by .us and then .com, and that it occurs most often on sites written in French, German and English. In addition, medications and health-related goods and services retained the lead for English-language spam.

No matter how we look at it, spamming is a menace that must be prevented, if not stopped. Web spam decreases the quality of search results and inflates search engine indexes with useless pages. It deprives legitimate sites of revenue they might otherwise earn, and it wastes significant search engine resources, thereby increasing the cost of processing queries. It is a frustrating problem for search engine services and even forces honest web operators to spamdex in order to be found.

Resources: