The IMPORTXML function in Google Sheets is a built-in function you can use to import various types of data from webpages across the Internet. You will learn all of these in this article alongside important information and common errors you may encounter.
The IMPORTXML function uses the following syntax:
=IMPORTXML(“URL”,”xpath”)
Where
URL is the URL of the webpage you want to import
xpath is the specific XPath code to identify the data you want to import
Learn about XPath in the next section.
XPath, which stands for XML Path Language, is used to access data stored in XML files. XML files are used to store data, and its XML format is also used as the basis for HTML format today. As HTML format is used in web pages, XPath can also be used to access data from web pages.
The HTML syntax consists of a system of tags and attributes that dictate how the content should be displayed. The tags describe in general terms how the content should be displayed. The attributes further refine the formatting and also include ways to identify the specific data stored in the given tag.
For example, we want to access the page title of a specific webpage. The page title is the one that appears on the top of your browsers. The page title is enclosed in title tag:
<title>Page Title</title>
To access the page title, the XPath that should be used is
//title
The double slash tells Google Sheets that all the instances of <title></title> should be imported. There is only one instance for each webpage and so will give us the page title. If you want to only find one instance of the tag, use single slash.
You can pinpoint the content of a specific tag if it has an identifying attribute such as class and id. This is important since most of the webpage content is wrapped in div tags. For example, we want to extract content in a div tag for the table of contents. We discover the HTML syntax as
<div class=”toc”>
.
.
.
</div>
The XPath would be:
//div[contains(@class,'toc')]
Google Sheets will scan for all div tags that contain the attribute class and then narrow down the list to those whose attribute value is “toc”.
Some HTML tags have their own unique set of attributes. For example, the hyperlink tag has the href attribute that contains the URL. Its syntax is
<a href=”URL”>Link text</a>
You can directly list all the links in the webpage by using the following XPath:
//@href
Where the @ sign indicates that href is an attribute of a tag. The a tag is not included because in this case we want to extract the URL which is the value of the attribute href. If we format this to include the a tag, what we will extract is the link text instead of the URL. Google Sheets will list down all the URLs from the webpage.
Finally, if the specified XPath gives a blank result, you can add a series of tags and attributes to ensure that Google Sheets are directed to the right content. This can be useful when handling div and p tags. For example, you can add the following in the start of the XPath
/html/body
As these are the top-level tags that every web page should have.
You can also go further within the tag if it has branches inside. For example, the div tag that we have for table of contents have the following structure
<div class=”toc”>
<ul>
<li>Entry one</li>
<li>Entry two</li>
<li>Entry three</li>
.
.
.
</ul>
</div>
You can include the ul and li tags, that are used for specifying lists, into the XPath:
//div[contains(@class,'toc')]/ul/li/*
Google Sheets will identify the div tag that contains the attribute class with value “toc”, then go further to the ul and li tags to extract the specific contents for each li tag.
We will see these in action through the IMPORTXML function examples in Google Sheets.
Sometimes you need to dig into the webpage code itself and identify the XPath from it.
Here’s an example:
We want to import the table of contents of this page to Google Sheets. Here are the steps:
The word Inspect is used in Chrome; for Firefox it’s Inspect element. The webpage code will show up on the right side of the browser.
Screencap of step 1 and 2
Sometimes the exact line may not be useful for our purpose. We should look around the highlighted line for the best XPath code.
We will use the line with class=”toc” included. To confirm it’s the right line, hover the cursor on the line and observe which part of the webpage.
The resulting XPath is
//div[contains(@class,'toc')]/ul/li/*
This will look into the div tag with class=”toc” then select the values stored in li tags within.
Finally you can combine the XPath with the URL in Google Sheets. The formula is
=importxml("https://en.wikipedia.org/wiki/List_of_spreadsheet_software","//div[contains(@class,'toc')]/ul/li/*")
We have five examples to demonstrate how IMPORTXML function works in Google Sheets.
We want to import the page title of the home page of Lido. The URL is www.lido.app
The XPath for page title is
//title
The formula is
=importxml("https://www.lido.app/","//title")
The page title will be imported to Google Sheets.
The meta description of a web page also helps in its visibility in search engines. Here are the steps in importing the meta description of a webpage to Google Sheets using IMPORTXML function:
We want to import the meta description of the home page of Lido. The URL is www.lido.app
The XPath for meta description is
/html/head/meta[@name='description']/@content
The formula is
=importxml("https://www.lido.app/","/html/head/meta[@name='description']/@content")
The meta description is now imported to your Google Sheets:
Web page structure consists of headers and paragraphs. A well-structured web page is easier to read and therefore attracts more traffic. Here are the steps to import headers present in the webpage and list them in your Google Sheets using the IMPORTXML function:
We want to import the headers on the Wikipedia page for “Spreadsheet” and analyze it in Google Sheets. The URL is
https://en.wikipedia.org/wiki/Spreadsheet
The XPaths for the headers are:
H1: //h1
H2: //h2
H3: //h3
The formula for importing h1 headings is:
=importxml("https://en.wikipedia.org/wiki/Spreadsheet","//h1")
For h2:
=importxml("https://en.wikipedia.org/wiki/Spreadsheet","//h2")
For h3:
=importxml("https://en.wikipedia.org/wiki/Spreadsheet","//h3”)
The list of headers will be loaded to Google Sheets:
We want to import the links in the Wikipedia page for Spreadsheet and analyze them in Google Sheets. The URL is
https://en.wikipedia.org/wiki/Spreadsheet
The XPath for the links is
//@href
The formula for is:
=importxml("https://en.wikipedia.org/wiki/Spreadsheet","//@href")
The links are then listed in Google Sheets:
We want to import the table in the Wikipedia page for “List of spreadsheet software” and analyze it in Google Sheets. The URL is
https://en.wikipedia.org/wiki/List_of_spreadsheet_software
The XPath for the links is
//table[n]/tbody/*
Where n points to what table to import. Use 1 if there is only one table in the page; else identify the order in which the table you want to import appears in the page.
The formula for is:
=importxml("https://en.wikipedia.org/wiki/List_of_spreadsheet_software","//table[1]/tbody/*")
The table is then loaded to Google Sheets:
It means that the XPath input does not follow the correct syntax; thus Google Sheets cannot import any data from the webpage. Check the formula to make sure there is no typo and that the proper syntax is followed.
It can mean two things:
You may need to check the webpage itself to identify the proper XPath.
No. There are websites that actively block codes that import their content. This will cause Google Sheets to output errors in the IMPORTXML function when otherwise the XPath should work.
If what you need is to import analytics data from your e-Commerce and marketing platforms to Google Sheets, the IMPORTXML function is not the solution. I recommend Lido. There’s no need to dig through the webpage code and use the XPath. Instead, you just need to do a few clicks and the data will seamlessly load in your spreadsheet.
The IMPORTXML function lets you use the XPath to pinpoint the specific data you want from a web page and import it to Google Sheets. You can import a wide variety of data present in web pages such as page titles, metadata, lists, links, and paragraphs.
IMPORTXML function depends on using the correct XPath for each type of data that can be found in a webpage. Some examples you can use in Google Sheets are the following:
Page title: //title
Meta description: /html/head/meta[@name='description']/@content
H1: //h1
H2: //h2
H3: //h3
Links: //@href
Tables: //table[n]/tbody/*
You can review the previous section about XPath so you can further use the IMPORTXML function in Google Sheets with even better results!
We hope this article has helped you and given you a better understanding of how to use the IMPORTXML function in Google Sheets. You might also like our articles on how to use the Google Sheets IMPORTRANGE function and the TODAY function in Google Sheets.
On a side note, we also recommend reading our guide on how to get email notifications from Google Sheets.
-Import Multiple Sheets With IMPORTRANGE