Crawler Settings

You have set up the crawler to crawl your site (or sitemap!), but you're dealing with missing, unwanted, or duplicate pages? Is crawling taking too long for your tight schedule? There is probably a crawler setting for that!

Missing Pages

If you notice that some search results are missing, the first thing to check is whether the missing URLs are indexed.

The Index section of the Control Panel allows you to look up any URL and check if it was indexed.

Index Log

If a page is missing from the index and you use Sitemap Indexing, you will need to make sure the missing URL is included in your sitemap.

If you are crawling your website, you can try re-indexing the missing page manually.

You might get an error that tells you why the page is not indexable. Possible causes: the page points to a different domain than your root URL(s); a noindex robots meta tag tells the crawler not to index the page; or you have set up blacklisting, whitelisting, or no-index rules that prevent the URL from being indexed.

When you fix these issues and can index the single URL successfully, recrawl your entire site and check again for the missing page(s).

A couple of notes: We skip your root URL(s) by default. For example, if your root URL is domain.com, the page at domain.com itself will not get indexed. If you want your homepage in your search results, uncheck Skip Homepage.

The crawler does NOT go to external websites including Facebook, Twitter, LinkedIn, etc.

Refer to this post for a more in-depth look at missing pages.

What is the 'noindex robots meta tag' and how does it affect my search results?

You might already be using the noindex robots meta tag to keep Google from picking up specific pages or even your entire site (e.g. when it's still in development):

<meta name='robots' content='noindex,follow' />

If you want to keep your site pages hidden from Google, but allow Zoovu Search to index them, simply check the Ignore Robots Meta Tag box.

If it's the other way around and you want to keep the pages visible in Google, but remove them from your on-site search results, use blacklisting or no-index rules. Alternatively, you can add a meta tag to the unwanted pages that uses zoovu-indexer as its name instead of robots:

<meta name="zoovu-indexer" content="noindex" />
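To see how the two meta tags and the Ignore Robots Meta Tag box interact, here is a minimal sketch. It is not the crawler's actual code: the function name is_indexable and the ignore_robots_meta_tag parameter are made up for illustration, and we assume the zoovu-indexer tag is always honored regardless of the checkbox.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name:
            self.meta[name] = (attributes.get("content") or "").lower()


def is_indexable(html, ignore_robots_meta_tag=False):
    """Hypothetical helper: decide whether a page may appear in on-site search."""
    collector = MetaTagCollector()
    collector.feed(html)
    # Assumption: the search-specific tag always applies.
    if "noindex" in collector.meta.get("zoovu-indexer", ""):
        return False
    # The robots tag only applies while the checkbox is left unchecked.
    if not ignore_robots_meta_tag and "noindex" in collector.meta.get("robots", ""):
        return False
    return True


hidden_from_google = "<head><meta name='robots' content='noindex,follow' /></head>"
print(is_indexable(hidden_from_google))
print(is_indexable(hidden_from_google, ignore_robots_meta_tag=True))
```

With the box unchecked the page is skipped; with it checked the same page becomes indexable for on-site search while staying hidden from Google.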

Unwanted Pages

If your index contains pages you do not want in your search results, you have a few options depending on the source of the unwanted pages.

By default, when we crawl your root URL https://domain.com, we will also crawl https://blog.domain.com and any other subdomains. Turn the Crawl Subdomains setting OFF if you'd like to exclude pages from your subdomains.

If you want to remove specific pages or documents from your search results (without deleting them from your website), you can apply blacklisting, whitelisting, or no-index rules.

How do I use whitelisting, blacklisting, and no-index URLs to control which pages and documents are shown in search results?

The best method will depend on your site structure and which pages or documents you want in your index.

Before we dive in, please remember that URL and XPath patterns are interpreted as regular expressions, so put a backslash (\) before special characters such as []\^$.|?*+(){}.
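If you are unsure why escaping matters, the following sketch uses Python's re module to show what happens when a literal URL fragment is used as a pattern without backslashes. The URL fragment is a made-up example, and we are assuming the crawler's regular expressions behave like Python's for these basic characters.

```python
import re

literal = "/products/item.php?id=1"  # hypothetical URL fragment

# Unescaped: "." matches any character and "?" makes the preceding "p" optional,
# so the pattern no longer matches the very URL it was copied from.
unescaped = re.compile(literal)

# re.escape puts a backslash before each special character for us.
escaped = re.compile(re.escape(literal))

print(unescaped.search(literal) is not None)
print(escaped.search(literal) is not None)
```

The unescaped pattern misses the intended URL and could match unrelated ones, while the escaped pattern matches exactly the text you typed.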

Blacklist URL patterns:

These patterns tell the crawler to completely ignore specific areas of your site or even types of documents.

For example, with blacklist patterns such as /wp-admin/, \.php, and \.pdf, pages found under /wp-admin/ and .php pages will not be indexed, and the crawler will not follow any links found on those pages either. The crawler will also not index any PDFs.

Note: blacklisting takes priority over whitelisting. If there's a conflict in your settings, the whitelisted patterns will be ignored.
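The blacklist behavior and its priority over whitelisting can be sketched as follows. The patterns and the is_crawled function are illustrative assumptions, not the crawler's real implementation; we also assume that a pattern matches when it is found anywhere in the URL and that an empty whitelist allows everything.

```python
import re

# Hypothetical patterns for the areas described above.
BLACKLIST = [r"/wp-admin/", r"\.php", r"\.pdf"]
WHITELIST = [r"/blog/"]


def is_crawled(url, blacklist=BLACKLIST, whitelist=WHITELIST):
    """Return True if the URL would be visited under these rules."""
    # Blacklisting is checked first, so it wins any conflict with the whitelist.
    if any(re.search(pattern, url) for pattern in blacklist):
        return False
    # Assumption: with no whitelist patterns, every remaining URL is allowed.
    if not whitelist:
        return True
    return any(re.search(pattern, url) for pattern in whitelist)


for url in [
    "https://domain.com/blog/my-post",
    "https://domain.com/blog/guide.pdf",
    "https://domain.com/wp-admin/options",
]:
    print(url, is_crawled(url))
```

Note that /blog/guide.pdf is rejected even though it matches the whitelist, because the blacklist pattern for PDFs takes priority.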

Whitelist URL patterns:

These patterns restrict the crawler to a specific area of your site.

For example, imagine you want to limit your search to blog pages only. If you whitelist /blog/, our crawler won't index anything except for the URLs containing /blog/.

This can also be useful for multilingual sites. Depending on your URL structure, you could, for instance, whitelist a pattern such as /fr/ to limit the search to French-language pages only.

Important: make sure that your root URL matches your whitelisting pattern (e.g. https://website.com/fr/). If the root URL doesn't contain the whitelist pattern, it will be blacklisted (which means nothing can be indexed, and there can be no search results).
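A quick way to reason about the root URL requirement is to test the root URL against your whitelist before you crawl. This is only a sketch: the /fr/ pattern is inferred from the example root URL above, and whitelist_allows is a hypothetical helper, not a product feature.

```python
import re


def whitelist_allows(url, whitelist_patterns):
    """Return True if the URL contains at least one whitelist pattern."""
    return any(re.search(pattern, url) for pattern in whitelist_patterns)


french_only = [r"/fr/"]  # assumed pattern for French-language pages

print(whitelist_allows("https://website.com/fr/", french_only))
print(whitelist_allows("https://website.com/", french_only))
```

The first root URL passes, so the crawl can start. The second one does not contain /fr/, so the crawler would have no allowed starting point and nothing would be indexed.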

No-index URL patterns:

These patterns have the same effect as the "noindex,follow" robots meta tag. The crawler follows the page and all the outgoing links, but doesn't include the no-indexed page in the results. It is different from blacklisting, where the crawler fully ignores the page without checking it for other useful links.

You should set up no-index patterns for pages that should not be in your search results, but contain links to important pages. For example, you could exclude your Blog landing page, but index all your blog posts, or exclude "tag" pages, but index the tagged posts.

Note the $ sign in a pattern such as /specific-url-to-ignore/$: it indicates where the matching pattern should stop. In this case, URLs linked from the excluded page, such as /specific-url-to-ignore/product1, will still be followed, indexed, and shown in search results.

No-index URL patterns take priority over whitelisting. If there's a conflict in your settings, the whitelisted patterns will be ignored.
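The effect of the $ sign is easy to verify with a small regular expression test. The pattern below is an assumed example built from the URL mentioned above, and is_noindexed is a hypothetical helper used only for illustration.

```python
import re

# The trailing $ anchors the match at the end of the URL.
ANCHORED = r"/specific-url-to-ignore/$"
# Without $, the pattern matches anywhere in the URL, including child pages.
UNANCHORED = r"/specific-url-to-ignore/"


def is_noindexed(url, pattern):
    return re.search(pattern, url) is not None


parent = "https://domain.com/specific-url-to-ignore/"
child = "https://domain.com/specific-url-to-ignore/product1"

print(is_noindexed(parent, ANCHORED), is_noindexed(child, ANCHORED))
print(is_noindexed(parent, UNANCHORED), is_noindexed(child, UNANCHORED))
```

With the anchored pattern only the parent page is no-indexed. Without the $ sign, the child page would be no-indexed as well.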

No-index XPaths:

Sometimes you need to no-index pages that do not share any specific URL patterns. Instead of adding every URL one by one to the no-index URL patterns, check if you can no-index them based on a specific CSS class or ID.

Let's say you have category pages for your products, and you want to hide them from the search results while still crawling your products. If those category pages have a distinct element which isn't used elsewhere, e.g. <div class="product-grid"></div>, you can add it as a No-Index XPath: //div[@class="product-grid"]

In this case, the crawler would go to the category pages, then follow and index all the outgoing URLs, so your product pages will get indexed and shown in the results. If you need help with XPaths, check out this guide or reach out to support.
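If you want to test an XPath against a page before adding it, a sketch like the one below can help. It uses Python's built-in ElementTree, which only handles well-formed markup and a subset of XPath (hence the leading dot in the path), so treat it as an approximation; the sample pages and helper names are invented for this example.

```python
import xml.etree.ElementTree as ET

category_page = """<html><body>
  <div class="product-grid">
    <a href="/products/red-shoes">Red shoes</a>
    <a href="/products/blue-shoes">Blue shoes</a>
  </div>
</body></html>"""

product_page = """<html><body>
  <div class="product-detail"><h1>Red shoes</h1></div>
</body></html>"""


def matches_noindex_xpath(page_source):
    """True if the page contains the element targeted by the No-Index XPath."""
    root = ET.fromstring(page_source)
    # ElementTree needs a relative path, so //div[...] is written as .//div[...]
    return root.find('.//div[@class="product-grid"]') is not None


def outgoing_links(page_source):
    """Links that would still be followed on a no-indexed page."""
    root = ET.fromstring(page_source)
    return [a.get("href") for a in root.iter("a")]


print(matches_noindex_xpath(category_page), outgoing_links(category_page))
print(matches_noindex_xpath(product_page))
```

The category page matches and is kept out of the results, but its product links are still collected. The product page does not match, so it is indexed normally.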

Note: using a lot of no-index URL patterns or no-index XPaths slows down the indexing process, as the crawler needs to scan every page and check it against all the indexing rules. If you're sure that a page or a directory with all outgoing links can be safely excluded from indexing, use the blacklist URL pattern feature instead.

Whitelist XPaths:

Similar to whitelist URL patterns, whitelisting by XPath restricts the crawler to a specific area of your site.

If you want to limit your search to specific pages, but they do not share any URL pattern, then the whitelist XPaths option will come in handy.

For example, an XPath such as //html[@lang="ru"] would limit the search to Russian-language pages only, assuming those pages declare their language on the html element.

Note: whitelist XPath takes priority over no-index XPath. If there's a conflict in your settings, the no-index XPaths will be ignored.

Duplicate Pages

How can I remove duplicate pages?

If you find duplicate pages in your index, you can remove them using a few special crawler settings.

Use Canonical URL

Canonical tags are a great strategy to avoid duplicate results not only in your internal site search, but also in Google and other search engines. Learn more about the changes required on your side here.

Let's assume you have 3 distinct URLs, but the content is exactly the same:

http://mysite.com/url1

http://mysite.com/url2

http://mysite.com/page1

You don't want to have the same search result three times, so you would add the following tag to the first two pages to indicate that they refer to the same "master" URL:

<link rel="canonical" href="http://mysite.com/page1" />

Once this is set up correctly on your site, turn on the "Use Canonical URL" toggle and re-index your site.
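Here is a rough sketch of how canonical tags collapse duplicates into one result. It only illustrates the idea and is not how the crawler is implemented; CanonicalFinder and index_key are hypothetical names, and the page sources are reduced to the one tag that matters.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Finds the href of <link rel="canonical"> if the page has one."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = (attributes.get("href") or "").strip() or None


def index_key(url, html):
    """Use the canonical URL as the index entry, falling back to the page URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or url


pages = {
    "http://mysite.com/url1": '<link rel="canonical" href="http://mysite.com/page1" />',
    "http://mysite.com/url2": '<link rel="canonical" href="http://mysite.com/page1" />',
    "http://mysite.com/page1": "<title>Page 1</title>",
}

unique_results = {index_key(url, html) for url, html in pages.items()}
print(unique_results)
```

All three URLs end up under the same key, so the page would appear only once in the results.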

Ignore Query Parameters

Perhaps you have these two URLs with the same content:

http://mysite.com/url1

http://mysite.com/url1?utm_campaign=google&sessionId=cb5q69wo5

Even though these URLs refer to the same page, they are different for the crawler and would appear as separate entries in the index. You can avoid this by removing URL parameters that have no influence over the content of the page. To do so, turn ON Ignore Query Parameters.

Note: The setting cannot be applied safely if you use query parameters:

  • For pagination (?p=1, ?p=2 or ?page=1, ?page=2, etc.)

  • To identify pages (?id=1, ?id=2, etc.), and not only as a sorting or filtering method (?category=news)

In these cases, ignoring all query parameters might prevent our crawler from picking up relevant pages and documents. We can recommend the following strategies instead:

  • Submit a sitemap with clean URLs and switch from Website Crawling to Sitemap Indexing, which is faster and usually produces cleaner results.

  • Add pagination to your no-index patterns (e.g., \?p=) and blacklist other query parameter patterns under blacklist URL patterns.
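To make both the benefit and the risk of Ignore Query Parameters concrete, here is a small sketch using Python's standard urllib.parse. The strip_query helper and the /news URLs are our own illustrations and only approximate what the setting does.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Drop everything after the "?" while keeping the rest of the URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


# Tracking parameters: both URLs collapse into one entry, which is what we want.
print(strip_query("http://mysite.com/url1?utm_campaign=google&sessionId=cb5q69wo5"))

# Pagination: page 1 and page 2 collapse too, so page 2 would never be indexed.
print(strip_query("http://mysite.com/news?p=1") == strip_query("http://mysite.com/news?p=2"))
```

The second case is why the setting is unsafe when parameters carry pagination or page identity.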

Lowercase All URLs

Before turning this setting ON, make sure your server is not case-sensitive, i.e. that both of the following URLs display the same page:

http://mysite.com/category/product

http://mysite.com/Category/Product

Remove Trailing Slashes

Only turn this setting ON if the URLs with and without the slash at the end display the same page:

http://mysite.com/category/product/

http://mysite.com/category/product
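The Lowercase All URLs and Remove Trailing Slashes settings both work by normalizing URLs so that variants map to a single index entry. A minimal sketch of that idea, with a hypothetical normalize function and flag names of our own choosing:

```python
def normalize(url, lowercase=False, remove_trailing_slash=False):
    """Apply the selected normalizations to a URL."""
    if lowercase:
        url = url.lower()
    if remove_trailing_slash and url.endswith("/"):
        url = url[:-1]  # drop exactly one trailing slash
    return url


variants = [
    "http://mysite.com/category/product",
    "http://mysite.com/Category/Product",
    "http://mysite.com/category/product/",
]

print({normalize(u, lowercase=True, remove_trailing_slash=True) for u in variants})
```

All three variants collapse into one URL. That is only safe if your server really serves the same page for each of them.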

After adjusting any crawler settings, save your changes and re-index your site. When changing root URLs or sitemaps, we recommend emptying the index first (press "Empty Entire Index") to start with a clean slate.

Crawling Speed

Depending on the number of pages you have on your site, the indexing time could range from mere minutes to several hours. Since a full re-index is required every time you make changes to the project, the process might prove quite cumbersome if your project is on the bigger side.

If that is the situation you find yourself in, the best way to move forward would be making Sitemap Indexing your primary method of updating the project.

Our crawler has an easier time dealing with sitemaps because it doesn't have to visit your site and actively search for available URLs, following each link one by one. Instead, it can simply go through a single list of URLs and add them to your Index in a fixed order.

Plus, our crawler can easily identify if any changes have been made to the sitemap (e.g., an existing page is updated or an entirely new one is added) by analyzing the <lastmod> tag of each of those pages. This tag contains the date (and often the time) of the last modification made to the page.

This is where our "Optimize Indexing" feature comes into play.

What is Optimize Indexing?

If this setting is enabled, the crawler only visits updated or new pages whenever your project needs a re-index (and that frequency ranges from daily to monthly depending on your plan). This significantly reduces re-indexing time and server stress on both sides.

In order to use this feature, you'll need to take the following steps:

  1. Upload your sitemap XML file formatted according to these guidelines under Sitemap Indexing and make sure every page has a <lastmod> tag.

  2. Move the "Auto Re-Index" toggle to ON and tick the box for "Optimize Indexing".

  3. Disable auto re-indexing for your root URL (if you have one) under Website Crawling or remove it completely if your sitemap contains all relevant pages.
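To illustrate why every page needs a <lastmod> tag, here is a sketch of how changed pages could be picked out of a sitemap. This is not the crawler's actual logic: urls_to_revisit is a hypothetical function, the sample sitemap is invented, and we assume that pages without <lastmod> must be revisited because there is no way to tell whether they changed.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://mysite.com/page1</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
  <url>
    <loc>http://mysite.com/page2</loc>
    <lastmod>2024-03-02T08:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>http://mysite.com/page3</loc>
  </url>
</urlset>"""


def urls_to_revisit(sitemap_source, last_indexed):
    """Return the URLs modified on or after the last indexing date."""
    root = ET.fromstring(sitemap_source)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        # Assumption: a page without <lastmod> is always revisited.
        # Only the date part (YYYY-MM-DD) is compared; the time is ignored.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) >= last_indexed:
            changed.append(loc)
    return changed


print(urls_to_revisit(sitemap_xml, last_indexed=date(2024, 2, 1)))
```

Only page2 (modified after the last run) and page3 (no <lastmod>) are revisited, while page1 is skipped. The more pages carry an accurate <lastmod>, the more work can be skipped.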

Keep in mind that the number of pages you have in the sitemap will still affect the indexing time. Even with optimized indexing, a project with tens of thousands of URLs will require a bit of a wait.

There is one case in which our crawler ignores the "Optimize Indexing" setting and goes through all data sources with "Auto Re-Index" enabled in their entirety. This kind of full re-index runs whenever you make changes to the project that apply across all pages, like when you reconfigure your data sources, set up new content extraction rules, create a new result group, etc. This happens because the <lastmod> tag becomes inconsequential: every single page needs to be checked and updated so that the changes you made are reflected in all your search results.

If you're actively updating a large project and staying patient has become a struggle, you can try using a URL List of your site’s most relevant content instead of other data sources.

This solution is far from ideal since you'd have to manually track your changes and update the URL list each time a page gets added to or deleted from your site, as well as each time a page is modified. The URL list also has a limit of 1000 pages, because its primary purpose is indexing small batches of pages that aren't found on the site or in the sitemap. This option can help in a pinch, but it isn't a long-term solution.

If you are not using Sitemap Indexing, you still have one option. In the crawler's advanced settings, you will find the indexing intensity, which ranges from 1 to 5 (slowest to fastest indexing speed, and least to most stress on your server). This is set to 2 by default, but if you want to crawl faster (and if your server can handle it), you can increase the intensity.