Monday, February 22, 2010

Search Engine and types of search engines

How did search engines begin, and where are they going? We use them so often that we take them for granted and forget how miraculous it is to find answers to our most ridiculous questions within a few seconds. Before search engines were developed, people found information through web directories. One of the earliest and best known was created by David Filo and Jerry Yang in 1994 and became the Yahoo! Directory. As the number of links in such directories kept growing, there arose a need to search this vast body of data, and WebCrawler was launched in April of that year. It was the first crawler to index the full text of entire pages, and this led to a new form of search. The search engine was born.

WebCrawler was followed by Lycos, Ask Jeeves and ultimately Google, which was launched in 1998. The era of content search had begun. Search engines are also an extremely powerful way of promoting your website online. Many studies have shown that between 40% and 80% of users find what they are looking for by using a search engine.

According to Search Engine Watch, 625 million searches are performed every day! The great thing about search engines is that they bring targeted traffic to your website. These visitors are already motivated to buy from you, because they sought you out.

Let's look at the different types of search engine:

Crawler-Based Search Engines

Crawler-based search engines use automated software programs to survey and categorize web pages. The programs used by the search engines to access your web pages are called ‘spiders’, ‘crawlers’, ‘robots’ or ‘bots’.

A spider finds a web page, downloads it and analyses the information presented on it. This is a seamless process. The page is then added to the search engine’s database. When a user performs a search, the search engine checks its database of web pages for the keywords the user entered and presents a list of matching links.

The results (a list of suggested links) are spread over pages and ordered by how closely, as judged by the search engine’s software, each one matches what the user wants to find.

Crawler-based search engines are constantly searching the Internet for new web pages and updating their database of information with these new or altered pages.
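The download, index and look-up cycle described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real search engine works: the pages in PAGES are made up, the "database" is a plain dictionary that maps each word to the pages containing it, and the results are not ranked at all.

```python
from collections import defaultdict

# Hypothetical pages a spider has already downloaded: URL -> page text.
PAGES = {
    "http://www.example.com/": "welcome to our shop for garden tools",
    "http://www.example.com/spades": "garden spades and forks",
    "http://www.example.com/contact": "contact our shop",
}


def build_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query, sorted by URL."""
    words = query.lower().split()
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)
```

Calling search(build_index(PAGES), "our shop") returns only the pages that contain both words. A real engine would add ranking on top of this look-up.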

Examples of crawler-based search engines are:

1) Google (www.google.com)
2) Ask Jeeves (www.ask.com)

Directories

A ‘directory’ uses human editors who decide which category a site belongs to and place it within that category in the directory’s database. The editors check the website thoroughly and rank it, based on the information they find, using a pre-defined set of rules.

Examples of Web directories are:

1) Yahoo Directory (www.yahoo.com)
2) Open Directory (www.dmoz.org)

(Note: Since late 2002 Yahoo has provided search results using crawler-based technology as well as its own directory.)

Hybrid Search Engines

Hybrid search engines use a combination of both crawler-based results and directory results. More and more search engines these days are moving to a hybrid-based model.

Examples of hybrid search engines are:

1) Yahoo (www.yahoo.com)
2) Google (www.google.com)

Meta Search Engines

Meta search engines take the results from several other search engines and combine them into one large listing.

Examples of meta search engines are:

1) Metacrawler (www.metacrawler.com)
2) Dogpile (www.dogpile.com)
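To give a feel for what "combining results" can mean, here is a minimal Python sketch. The scoring rule is an assumption chosen for illustration (each engine gives a link 1/rank points and the points are added up); it is not the method used by Metacrawler, Dogpile or any other real meta search engine, and the result lists are invented.

```python
def merge_results(result_lists):
    """Merge ranked result lists from several engines into one listing.

    Assumed scoring rule: a link at position `rank` earns 1/rank points
    from that engine; links are ordered by total points, then by URL.
    """
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from two engines for the same query.
engine_a = ["a.example.com", "b.example.com", "c.example.com"]
engine_b = ["b.example.com", "d.example.com"]
merged = merge_results([engine_a, engine_b])
```

A link that several engines return near the top ends up ahead of a link that only one engine lists.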

Apart from these major types, specialised search engines have been developed to cater for niche areas such as shopping, local search, and freeware and shareware software.


Tuesday, February 9, 2010

Different Types of Sitemap

Sitemaps, as the name implies, are just a map of your site: on a single page you show the structure of your site, its sections, the links between them, and so on. Sitemaps make navigating your site easier, and keeping an up-to-date sitemap is good both for your users and for search engines. From the viewpoint of purpose there are normally two types of sitemap, one for real users and one for web crawlers. For real users you can use an HTML sitemap, generated by PHP, ASP or any similar technology. You may opt for any of them, as it is the output that matters.


Let's look first at the sitemaps that can be used for real users:

HTML Sitemap:

Code example of HTML:

<html lang="en">
<head><title>This is a site map head</title></head>
<body>
<h1>header of HTML site map</h1>
<p>site map paragraph with links</p>
</body>
</html>
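Since it is the output that matters, an HTML sitemap like this can be produced by any language. The Python sketch below is one possible way, not a recommended tool: the function name html_sitemap and the page list are hypothetical, and the links are put in a simple bulleted list.

```python
from html import escape


def html_sitemap(title, links):
    """Build a small HTML sitemap page.

    `links` is a list of (link text, URL) pairs; both are HTML-escaped.
    """
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(text))
        for text, url in links
    )
    return (
        '<html lang="en">\n'
        "<head><title>{0}</title></head>\n"
        "<body>\n"
        "<h1>{0}</h1>\n"
        "<ul>\n{1}\n</ul>\n"
        "</body>\n"
        "</html>\n"
    ).format(escape(title), items)


# Hypothetical pages of a small site.
page = html_sitemap("Site map", [("Home", "/"), ("Tips & tricks", "/tips.html")])
```

Writing the returned string to a file such as sitemap.html gives you a page you can link to from your footer.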

You can also have an XML-based version of the HTML sitemap above, i.e. the same sitemap written as XHTML.

Code example of XHTML:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>This is a site map head</title></head>
<body>
<h1>header of XHTML site map</h1>
<p>site map paragraph with links</p>
</body>
</html>

Apart from these, let's look at the sitemaps that serve search engines, i.e. sitemaps in formats that search engines understand. Because such sitemaps are meant to be read only by search engines, they are of no use to real people. Their importance should not be underestimated, however.

Text sitemaps

Text sitemaps contain one website URL per line. Many search engines, including Google and Yahoo, can scan text sitemaps. For Yahoo, name the primary text sitemap file urllist.txt.

Example of text sitemap file:

http://www.example.com/
http://www.example.com/some-directory/
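Because the format is just one URL per line, a text sitemap is easy to generate. Here is a minimal Python sketch; the function name text_sitemap is made up for this example, and dropping blank entries and duplicates is my own assumption about what you would want, not part of any specification.

```python
def text_sitemap(urls):
    """Return the contents of a text sitemap: one URL per line.

    Blank entries and repeated URLs are dropped; order is preserved.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "\n".join(lines) + "\n"


contents = text_sitemap([
    "http://www.example.com/",
    "http://www.example.com/some-directory/",
    "http://www.example.com/",  # duplicate, will be skipped
])
```

Save the returned text as urllist.txt (or any other name the search engine expects) in the root of your site.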

ROR sitemaps

ROR (Resources of a Resource) expands on the RSS protocol with its own extensions. The standard file extension for ROR files is .ror. All search engines that understand RSS sitemap files can still understand the RSS parts of ROR files. However, few search engines, if any, and none of the major ones, currently support the ROR sitemap extensions, so we won't elaborate on them here.

XML sitemaps

In 2005 Google started its own sitemaps protocol based on XML, called Google Sitemaps. Google later convinced other search engines to follow, and the standard was renamed the XML Sitemaps protocol. Currently Google, Yahoo, Microsoft MSN Search, Ask, IBM and possibly others support XML sitemaps, and it is likely that more search engines will implement support.

The XML sitemaps protocol also defines auto discovery, i.e. how search engines can automatically discover a website's XML sitemaps. The answer is to link to the XML sitemap, e.g. sitemap.xml, from robots.txt.

Linking to the XML sitemap:

User-agent: *
Sitemap: http://www.example.com/sitemap.xml

Instead of just pointing to one XML sitemap file for auto discovery, you can list multiple sitemaps:

Sitemap: http://www.example.com/sitemap-1.xml
Sitemap: http://www.example.com/sitemap-2.xml

Or point to an XML sitemap index file:

Sitemap: http://www.example.com/sitemap-index.xml
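To see what auto discovery looks like from the crawler's side, here is a small Python sketch that pulls the Sitemap: lines out of a robots.txt file. It is a simplified reading of the format, written for illustration: the function name sitemap_urls is hypothetical, and real crawlers may treat comments and unusual lines differently.

```python
def sitemap_urls(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt text."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop anything after '#', which this sketch treats as a comment.
        line = raw_line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


ROBOTS_TXT = """\
User-agent: *
Sitemap: http://www.example.com/sitemap-1.xml
Sitemap: http://www.example.com/sitemap-2.xml
"""
found = sitemap_urls(ROBOTS_TXT)
```

The field name is matched without regard to case, so "sitemap:" and "Sitemap:" are treated the same.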

Code example of XML sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.microsystools.com/</loc>
<priority>1.0</priority>
<changefreq>weekly</changefreq>
<lastmod>2007-06-18</lastmod>
</url>
<url>
<loc>http://www.microsystools.com/blogs/</loc>
<priority>0.8</priority>
<changefreq>weekly</changefreq>
<lastmod>2007-06-21</lastmod>
</url>
</urlset>
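If you prefer to generate such a file rather than write it by hand, the Python standard library is enough. The sketch below is one possible approach, not the output of any particular sitemap tool: the function name build_sitemap and the page data are assumptions, and only the four tags used in the example above are handled.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build an XML sitemap string from a list of dictionaries.

    Each dictionary needs a 'loc' key and may also have 'lastmod',
    'changefreq' and 'priority'.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical page data.
sitemap_xml = build_sitemap([
    {"loc": "http://www.example.com/", "priority": "1.0", "changefreq": "weekly"},
    {"loc": "http://www.example.com/blogs/", "priority": "0.8"},
])
```

ElementTree takes care of escaping characters such as & in the URLs, which is easy to forget when writing the file by hand.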

RSS feeds as sitemaps

The RSS (Really Simple Syndication) protocol is often used in feed files for blogs, forums etc. The RSS file format uses XML and has evolved over multiple versions and names, all fairly compatible with each other:
• Really Simple Syndication (RSS 2.0)
• RDF (Resource Description Framework) Site Summary (RSS 1.0 and RSS 0.90)
• Rich Site Summary (RSS 0.91)
After Google and Yahoo adopted RSS feeds as a kind of website sitemap, more search engines have followed.

Code example of RSS sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Website title</title>
<link>http://www.example.com</link>
<generator>A1 Sitemap Generator</generator>
<lastBuildDate>Tue, 13 Mar 2007 22:28:20 GMT</lastBuildDate>
<item>
<title>Page 1</title>
<link>http://www.example.com/page1.html</link>
</item>
<item>
<title>Page 2</title>
<link>http://www.example.com/page2.html</link>
</item>
</channel>
</rss>
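To check which pages an RSS 2.0 feed would expose as a sitemap, you can list the link of every item. The following Python sketch does that with the standard library; the function name rss_links and the sample feed are made up, and the code assumes a plain RSS 2.0 feed without namespaces.

```python
import xml.etree.ElementTree as ET


def rss_links(rss_xml):
    """Return the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if link:
            links.append(link)
    return links


# Hypothetical feed with two pages.
SAMPLE_FEED = (
    '<rss version="2.0"><channel>'
    "<title>Website title</title>"
    "<link>http://www.example.com</link>"
    "<item><title>Page 1</title><link>http://www.example.com/page1.html</link></item>"
    "<item><title>Page 2</title><link>http://www.example.com/page2.html</link></item>"
    "</channel></rss>"
)
pages = rss_links(SAMPLE_FEED)
```

Note that the channel's own link is not an item, so it is not included in the list.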


Sunday, February 7, 2010

How to Restrict Robots by making use of Meta Tags

I hope everyone is now well versed in robots.txt and has learnt its benefits and uses, but the problem is not over yet. There may be times when you are unable to upload or control the robots.txt file on your website, for example because you do not have access to the root of your server. This is where a meta tag comes into action, helping you keep your content out of search engine indexes and services. Like /robots.txt, the robots META tag is a de facto standard. It originated from a "birds of a feather" meeting at a 1996 distributed indexing workshop, and was described in the meeting notes.

You can use a special HTML tag to tell robots not to index the content of a page, and/or not to scan it for links to follow. Like any meta tag, it should be placed in the head of an HTML page. Let's clarify this with an example:

<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>

The robots meta tag controls whether search engine spiders are allowed to index a page and whether they should follow links from it. The noindex value prevents a page from being indexed, and nofollow prevents its links from being crawled. The meta tag has to be added to every page you want to restrict, and a spider must still download the page to read it, so it is not the most efficient way to keep search engines away from large parts of your website. For that, the robots.txt file (the robots exclusion standard) is the more efficient method. In other words, give first preference to robots.txt, and opt for meta tags only if you are unable to upload a robots.txt file to the root directory. Do not combine the two for the same page: a spider that robots.txt keeps away from a page never gets to see the meta tag on it.

Googlebot interprets the following robots meta tag values:

1) NOINDEX - prevents the page from being included in the index.
2) NOFOLLOW - prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
3) NOARCHIVE - prevents a cached copy of this page from being available in the search results.
4) NOSNIPPET - prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
5) NOODP - blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.
6) NONE - equivalent to "NOINDEX, NOFOLLOW".
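If you want to check what a page is actually telling robots, you can read the tag back out of the HTML. The Python sketch below uses the standard html.parser module; the class and function names are hypothetical, and it only expands NONE as described in the list above. It does not claim to reproduce how Googlebot itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        for token in (attributes.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the robots meta directives of a page as lowercase strings."""
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    if "none" in directives:
        # NONE is equivalent to "NOINDEX, NOFOLLOW".
        directives.update({"noindex", "nofollow"})
    return directives
```

For example, robots_directives('<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">') gives a set containing "noindex" and "nofollow", because tag names, attribute names and values are all compared in lowercase.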



Tuesday, February 2, 2010

Robots.txt Optimization

Today I am going to talk about an interesting file: robots.txt. The purpose of this file is to tell search engine robots, or crawlers, which parts of your website they are allowed to access. At this point many of you might ask why you should put a robots.txt file in the root directory of your site when you want spiders to crawl the whole site anyway, as is their normal job.

Wait, I have a reply for you. When a spider requests a page that is not available on your website, the result is a 404 error, as everyone knows. This is where robots.txt comes in. It is a well-known name to search engine spiders, and they look for the file to check whether any barrier has been set for them on the site. If no robots.txt file has been created, the request ends in a 404 error. Well-behaved spiders treat that as permission to crawl everything, but the error still shows up in your logs on every visit, and you lose the chance to guide the spiders at all. To avoid this situation it is always advisable to upload this simple text file to the root directory of your server.

I hope it is clear from the above that the main purpose of robots.txt is to tell web spiders not to crawl the listed links. At the same time, no one can force spiders to crawl a website, as that depends purely on the spiders. What you can do is block spiders from accessing certain parts of your website, or even all of it.

Let me give you an example. You may not want Google to crawl the /images directory of your site, as it is both meaningless to you and a waste of your site’s bandwidth. Robots.txt lets you tell Google just that, using a simple text file.

Let’s start with the optimization process. Create a regular text file called “robots.txt”, and make sure it is named exactly that. This file must be uploaded to the root directory of your site, not a subdirectory. The format is simple enough for most intents and purposes: a User-agent: line to identify the crawler in question, followed by one or more Disallow: lines to stop it crawling certain parts of your site.

1) Here's a basic "robots.txt":

User-agent: *
Disallow: /

With the above declared, all robots (indicated by "*") are instructed not to crawl any of your pages (indicated by "/").

2) Let's get a little more discriminating now. While every webmaster loves Google, you may not want Google's image bot crawling your site's images and making them searchable online, if only to save bandwidth. The declaration below will do the trick:

User-agent: Googlebot-Image
Disallow: /

3) The following disallows all search engines and robots from crawling selected directories and pages:

User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm

4) You can conditionally target multiple robots in "robots.txt". Take a look at the following:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/

This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. So the rules of specificity apply, not inheritance.
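You can test rules like these before uploading them. Python ships with a robots.txt parser, urllib.robotparser, and the sketch below feeds it the rules from example 4. The helper name allowed, the domain www.example.com and the bot name SomeOtherBot are made up for the test, and Python's parser is only one implementation, so a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from example 4.
RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""


def allowed(user_agent, path, rules=RULES):
    """Return True if `user_agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


googlebot_home = allowed("Googlebot", "/index.html")
googlebot_cgi = allowed("Googlebot", "/cgi-bin/test")
other_home = allowed("SomeOtherBot", "/index.html")
```

Here googlebot_home is True while googlebot_cgi and other_home are False, which matches the explanation above: Googlebot follows only its own block, and every other robot falls back to the "*" block.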

