Working with XPaths

What are XPaths?

XPaths are expressions that identify elements on a web page. For example, the XPath //img selects all <img> elements.

If you are not used to XPath expressions but are familiar with CSS selectors, you can look at a very simple conversion table here.

Here is a list of potentially useful XPaths that you could modify and use for your purposes:

XPath | Description
--- | ---
//h1 | Selects all <h1> elements.
//div[@id="main"] | Selects the div element with the id "main": <div id="main"></div>. This is useful if you want to index only the text within a certain element of the page and avoid indexing text from footers and sidebars.
//p[contains(@class,"notes")] | Selects the p elements whose class attribute contains "notes": <p class="something notes whatever">.
//img[contains(@class,"main-image")]//@src | Selects the src attribute of all image elements whose class attribute contains "main-image": <img class="main-image" src="image.jpg" />. Use this path to tell the crawler which image to index for your page.
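To try these expressions outside the browser, here is a minimal Python sketch. It assumes a small, well-formed sample page (hypothetical markup) and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: paths are written relative to the root (.//h1 instead of //h1), and contains() and attribute steps such as @src are not available, so those two expressions are approximated in plain Python. A full XPath engine, such as the one in your browser, evaluates the expressions as written.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real pages are often not well-formed XML.
SAMPLE_PAGE = """
<html>
  <body>
    <h1>Product overview</h1>
    <div id="main">
      <p class="something notes whatever">Keep this note.</p>
      <p class="intro">Intro text.</p>
      <img class="main-image large" src="image.jpg" />
    </div>
    <div id="footer"><p class="footnotes">Footer text.</p></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# //h1
h1_texts = [h1.text for h1 in root.findall(".//h1")]

# //div[@id="main"]
main_divs = root.findall(".//div[@id='main']")

# //p[contains(@class,"notes")], approximated with a substring test.
# contains() matches substrings, so class="footnotes" is selected too.
notes_texts = [p.text for p in root.iter("p") if "notes" in p.get("class", "")]

# //img[contains(@class,"main-image")]//@src, approximated the same way.
image_srcs = [
    img.get("src")
    for img in root.iter("img")
    if "main-image" in img.get("class", "") and img.get("src") is not None
]

print(h1_texts)
print(len(main_divs))
print(notes_texts)
print(image_srcs)
```

Note that the footer paragraph is matched as well, because contains() tests for a substring of the class attribute rather than for a whole class name. Choose class fragments that are specific enough for your pages.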

How to control what content is indexed and used in search results

We made the Zoovu Search crawler as intelligent as possible when it comes to analyzing your website and picking the right title, image, and content for your search results.

Nonetheless, you may still need to fine-tune your indexing rules by pointing the crawler directly to the desired content or by excluding unwanted pieces of information from being indexed and, therefore, used in the search.

This can be done via XPath expressions placed in the Zoovu Search control panel. You can set up the general rules on the Crawler settings page and the rules specific to a result group on the Result Groups page (if you use any result groups).

Follow these steps:

  1. Search for and install the Google Chrome extension called "XPath Helper". It lets you define XPaths directly on your own site.

  2. Navigate to one of your website's pages. Press the XPath Helper icon in the top right corner of your browser to open the black overlay which will reveal the currently selected XPath expression.

  3. Now we want to extract the main content. After opening the XPath Helper, hold the [Shift] key and hover your mouse over your website's elements.

    The extension highlights the elements in yellow and displays the matching XPath query in the left half of the black overlay (the box labeled "QUERY"). The query changes as you move your mouse. Keep going until all the content you are targeting is highlighted in yellow.

    The right half of the overlay (the box labeled "RESULTS") previews the targeted content.

  4. Tweak your XPath expression by shortening it. There are two ways to shorten an XPath query: remove steps from the end so that it selects a parent element containing more child nodes, or keep the tail and cut off the head so that it matches more generally. When shortening from the front, make sure the XPath still starts with //. Finding the right cut can take some testing, but an element with an id is a good place to cut: remove everything before that element and start the XPath with two forward slashes.

    Example: the XPath shown is /html/body[@id='body']/main/section[@class='u-pb-xxl u-pb-xl--sm']/article[@class='flex container'][1]/div[@class='main-feature__content col-6 col-12-sm']

    And you can shorten it to //*[@id='body']//div[contains(@class,'main-feature__content')]

  5. Copy the XPath query over to the Zoovu Search control panel and place it under Data Structuring -> Content Extraction in the appropriate XPath section: Title XPaths, Image XPaths, Include and Exclude Content XPaths.

  6. Press the "Test" button and enter your webpage URL to test the XPath query. If everything is fine, you will see the extracted content, headline, or image URL below. You can also use Index Single URL to check everything that will be extracted from this page at once.
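The shortening described in step 4 can also be checked outside the browser. The Python sketch below is an approximation under stated assumptions: the page markup is hypothetical (built to match the example XPath), and because the standard library's xml.etree.ElementTree supports only a subset of XPath, the long path is written relative to the <html> root without the [1] position predicate, and contains() is replaced by a substring test in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical page built to match the example XPath from step 4.
PAGE = """
<html>
  <body id="body">
    <main>
      <section class="u-pb-xxl u-pb-xl--sm">
        <article class="flex container">
          <div class="main-feature__content col-6 col-12-sm">Main feature text</div>
        </article>
      </section>
    </main>
    <footer><div class="footer__content">Footer text</div></footer>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Long form: every step and every full class value must match exactly.
LONG_PATH = (
    "./body[@id='body']/main"
    "/section[@class='u-pb-xxl u-pb-xl--sm']"
    "/article[@class='flex container']"
    "/div[@class='main-feature__content col-6 col-12-sm']"
)
long_hits = root.findall(LONG_PATH)

# Short form: start at the element with the id, then take any descendant
# <div> whose class attribute contains the distinctive class name.
short_hits = [
    div
    for div in root.findall(".//*[@id='body']//div")
    if "main-feature__content" in div.get("class", "")
]

# Both forms should select the very same element on this page.
same_element = len(long_hits) == 1 and long_hits == short_hits

print([div.text for div in short_hits])
print(same_element)
```

The short form keeps working if a wrapper element or an extra utility class is added to the page, whereas the long form stops matching as soon as any step or class value changes.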

Default XPaths and common strategies for your search results

You can use XPath expressions (one per line) for:

  1. Title XPaths pointing to the main title of the page. Default is //h1, i.e. the crawler takes your <h1> heading. Other common scenarios include //title, to pick up the page title tag content, or sometimes //h2. Change it according to your site structure.

    Title Regular expression allows you to apply a regular expression condition on the extracted titles, if you need even more control. For example, you might have your brand or company name repeated in every page title: <title>Working with XPaths – Zoovu Search</title>

    To use only the "Working with XPaths" part as the search result title, set //title as your Title XPath and ([^–])+ as the Regular expression. The "– Zoovu Search" part will be cut off.

  2. Image XPaths pointing to the main picture on your page. These images, if available, are automatically shown as search result thumbnails. Leave this field empty if our default crawler settings work well for your site, or adjust it to point to a specific image instead. For example: //img[@id='main']/@src

    If your images are lazy-loaded, try something similar to the following pattern: //div[@class='product-detail-images']//img/@data-src

    You can also tell the crawler to ignore all images by toggling "Extract Images" off. Alt texts and captions can be indexed separately.

  3. Default Image XPath pointing to the default image to be used when no other image is found. For example, //img[@id='logo']/@src

  4. Include Content XPaths pointing to the content blocks that should be indexed. One XPath per line. Leave empty if everything should be indexed.

  5. Exclude Content XPaths pointing to the content blocks that should be ignored by the crawler. One XPath per line. Leave empty if everything should be indexed.

  6. Search Snippet XPath (located under Search Settings -> Search Snippet -> Use content behind search snippet XPath) pointing to the content that you want to display in the search results. By default, we show the content around the terms matching the search query.

    Another common strategy would be using your page meta descriptions instead. That's why //meta[@name="description"]/@content is pre-filled for you. To start showing meta descriptions in your search snippets, go to Search Settings and change the Search Snippet Source.
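The title, image, and snippet strategies from the list above can be sketched in Python as follows. This is an illustration, not the crawler's implementation: the page markup is hypothetical, the standard library's xml.etree.ElementTree stands in for a full XPath engine (so attribute steps such as /@content are read with .get()), and taking the whole regex match and trimming whitespace is an assumption about how the extracted title is post-processed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page; the markup is assumed to be well-formed.
PAGE = """
<html>
  <head>
    <title>Working with XPaths – Zoovu Search</title>
    <meta name="description" content="How to point the crawler at the right content." />
  </head>
  <body>
    <h1>Working with XPaths</h1>
    <div class="product-detail-images">
      <img data-src="/images/detail.jpg" />
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Title XPath //title combined with the regular expression ([^–])+
raw_title = root.findtext(".//title")
match = re.match(r"([^–])+", raw_title)
# Assumption: use the whole match and trim the space left before the dash.
title = match.group(0).strip() if match else raw_title

# Search Snippet XPath //meta[@name="description"]/@content
meta = root.find(".//meta[@name='description']")
description = meta.get("content") if meta is not None else None

# Image XPath //div[@class='product-detail-images']//img/@data-src
lazy_images = [
    img.get("data-src")
    for img in root.findall(".//div[@class='product-detail-images']//img")
    if img.get("data-src") is not None
]

print(title)
print(description)
print(lazy_images)
```

The regular expression matches everything up to the first "–" character, so the raw match still ends with the space that precedes the dash. Use the "Test" button in the control panel to confirm how the extracted title looks for your own pages.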

For more information about XPath, see: https://www.w3schools.com/xml/xpath_intro.asp