Crawl Files for Search Engines
Do you know what search engine crawl files are? Learn how they work and why they make a website easier for search engines to read.
Crawl files help search engine robots find and read your site more easily. That's why they are so important.
Below, you will better understand what sitemap and robots files are and how they work.
Sitemap
In simple terms, Sitemap.xml is a list of all the pages (URLs) of a website. It maps out how the site is structured and what is included in it. In other words, the sitemap is a file designed to facilitate the process of indexing pages on search engines such as Yahoo!, Bing and Google.
In general, sitemaps can be accessed by adding “/sitemap.xml” right after the website address. For example: https://www.corona.com/sitemap.xml
If a website has a large volume of pages, categories, subcategories, products, and subproducts, for example, the sitemap file makes it much easier for the search engine to find each of those sections. So remember: whenever you add a new page, you must also update the sitemap.
How to create a sitemap
To create a sitemap file manually, you can check all the protocol and tag definitions available on the sitemaps.org portal. The main requirements described there are:
- Start the document with the opening tag “<urlset>” and end with the closing tag “</urlset>”, without the quotes;
- Specify the protocol namespace in the “<urlset>” tag;
- For each URL added, include a “<url>” parent tag with a “<loc>” child entry containing the page address.
(Check out more information on creating sitemaps at sitemaps.org or through the Google guide.)
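Following these rules, a minimal sitemap.xml looks something like the sketch below (the example.com addresses and the date are placeholders; the <lastmod> tag is optional):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
  </url>
</urlset>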
To make creating the file easier, there are also tools that can help, such as GsiteCrawler, which simulates search engine robots and creates a sitemap automatically; XML-Sitemaps, which creates free maps for sites with up to 500 pages; and Xenu, for sites with more than 500 pages.
If your website was built with WordPress, you can use plugins like Better WordPress Google XML Sitemaps or complete SEO tools like Yoast SEO, which generate the sitemap automatically.
It is worth noting that neither a perfect architecture nor a perfect sitemap guarantees complete indexing of the site; they are complementary tools that help the robot better understand the site's structure.
It is also important to remember to submit the sitemap to Google Search Console, which, among its functions, helps you optimize your SEO actions.
Robots
(See Google's documentation on the robots.txt file.)
The robots.txt file, also known as the robot exclusion protocol or standard, is a text file that tells internet robots which directories and pages on your website they should or should not access.
The commands in robots.txt use a simple, plain-text syntax of directives and values. Search robots follow these commands when navigating and finding the pages on your website.
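In skeleton form, each group of rules starts with a User-agent line followed by one or more directives, each on its own line (the bracketed values are placeholders):

User-agent: [name of the robot the rules apply to]
Disallow: [path the robot should not crawl]

The individual directives are explained below.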
User-agent
The function of the User-agent command is to list which robots must follow the rules indicated in the robots.txt file.
To know the name of each User-agent you can consult the Web Robots Database, which lists the robots of the main search engines on the market.
For example, Google's main search robot is Googlebot. To give it specific instructions, the command would be:
User-agent: Googlebot
Or, if you wanted to leave specific instructions for Bing's search robot, the command would be:
User-agent: Bingbot
Just change the User-agent name according to the robot of each search engine.
Now, if you want the rules to apply to all search robots, the command is:
User-agent: *
The asterisk indicates that the robots.txt file applies to every type of internet robot that visits the site.
Disallow
The Disallow command tells search robots which pages and directories on your site should not be crawled. Just enter the path after the command. For example:
Suppose that, for some reason, I don't want robots to read the page https://www.ambev.com.br/venda-ambev/ on my site. The command for that would be:
Disallow: /venda-ambev
The same applies to folders. If a folder in my directory is called "files", for example, and I don't want it to be read by search engine robots, the command would be:
Disallow: /files/
Allow
The Allow command asks search engines to read pages and directories you want indexed.
By default, all pages on your site can be crawled and indexed, except when you use the Disallow command. Therefore, the Allow command is recommended only when you need to block a folder or directory with Disallow but want a specific file, folder, or page inside that blocked directory to be indexed.
If you want to block access to the “files” folder, for example, but need to allow access to the “products” page inside it, the commands would look like this:
Disallow: /files/
Allow: /files/products
Sitemap command
Another command allowed in robots.txt indicates the path and name of the site's XML sitemap, which is very useful for helping search robots identify all the existing pages on the site.
To point to the sitemap address, your sitemap file needs to be saved in the root folder of your site. The command to add this address to your robots.txt is:
Sitemap: https://www.domain.com.br/sitemap.xml
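Putting the directives together, a complete robots.txt for the examples above might look like this:

User-agent: *
Disallow: /files/
Allow: /files/products
Sitemap: https://www.domain.com.br/sitemap.xml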
Other information
In addition to configuring your robots.txt file and placing it in your site's root folder, you can give page-level instructions to robots by placing a meta tag inside each page's <head>:
<meta name="robots" content="index,follow">
For your content to be read and indexed by search engine robots, use the index value; the follow value lets robots follow the links on the page.