Spreadsheets

|

August 13, 2020

Best Methods for Pulling Data into Your Spreadsheet

Test spreadsheet Google Sheet with list of blog titles and excerpts
SECTIONS
  1. Understanding web page structure
  2. Pulling emails from a web page with the ImportXML function
  3. Pulling website data with the ImportXML function
  4. Pulling tables with importHTML function
  5. Pulling RSS or Atom feeds from a web page

Data is everywhere, and in this age, you can’t really escape from it. All our decisions are based on data, and when we don’t have enough to make a decision, we get scared. The question that then arises is ‘how do I store my data?’ Google Sheets is a powerful data management tool that allows you to quickly organize your data into a structured format. With it, you can perform all sorts of analysis and keep track of important things.

One of the highlights of Google Sheets is that it lets you add data directly from the internet without having complicated procedures. 

In this article, we’re going to discuss the best methods you can use to pull data from websites into your spreadsheets. Luckily, you don’t need to be some experienced programmer to take advantage of these features!

Google Sheet named Test spreadsheet


Note: Google Sheets works best with small amounts of data. If you need to work with a plethora of data, you might be better off using a Relational Database Management System.

Understanding web page structure

Before we can extract data from web pages, we need to know where that specific data actually is inside the page. To find it, we need to first learn a bit about how web pages store data. Every web page has an HTML structure and everything you see on a page is contained within that structure.

In layman terms, you could consider the website to be a very big box. Inside that big box, are smaller boxes (web pages) which could contain even smaller boxes (headings and subheadings) and inside those boxes is the content (text, tables, pictures, etc.).

Box with several compartments


The boxes give organization so that the content inside them is easily managed and we know where everything would be. The boxes are also labeled and that allows us to know exactly where to look for whatever we need.

There’s really no limit on how long a box-within-a-box chain can be. The more complex a website is, the more boxes it could contain. 

In HTML terms, the above-mentioned labels are called tags and they come in a variety of types that specify what kind of content they can contain. This means that not every box is capable of containing another kind of box.

In order to see the HTML structure of a website and its corresponding tags, simply open up any web page and right-click (or press CTRL + Shift+ i) on any part of it. On Google Chrome, you should see an option called ‘Inspect’ at the bottom of the options pane. Clicking on it will open a window on your screen that looks complicated, but is fairly simple to comprehend. 

Lido Website next to corresponding HTML code


Hovering your mouse over the new window will highlight certain parts of the page. The part that gets highlighted is the box your mouse is hovering over in the new window. In order to see the box of a specific part of the page, move your cursor to the page window, right-click on it and press the inspect button.

The new window that will pop up will show you exactly where in the stack of boxes your desired box lies in. You will also notice that some of these boxes have small black triangles to their left and clicking them reveals more code. Clicking on the arrow is like opening a box to find more stuff inside it.

Highlighted HTML code showing where it corresponds on the page
As we can see here, the ‘Build reports in minutes’ heading belongs to the label/tag of h1.


Tags also have different properties/attributes which can give room for customization and added functionality. They can also be separated from the rest of the code by using the ‘class’ property.

This allows us to reuse certain styling elements without having to type them again for every place we want to use it in and also enables us to easily identify the tags we want to give different functionality to.

Now that we’ve figured out how to identify content inside a webpage, let’s look into examples of actually extracting the said data into Google Sheets.

What you need to keep in mind is that every website has a different structure and you will need to understand it first before attempting to extract data from it.

Pulling email data from a web page with the ImportXML function

We’ll now be attempting to pull a list of emails from a site. The first step is to go to Google Sheets and open a blank sheet. In another tab, open your desired web page that contains email addresses. This example uses  Cambridge University Architecture department’s visiting faculty page for its simple structure.

When you inspect the page element, you should be able to easily identify where in the HTML structure the email is embedded.


Visiting Lecturers section next to HTML code


In order to keep this example simple, we won’t be going much into the technical details of the code we’re going to use. The only focus will be on the key parts that you will need to change to use it according to your needs.

Google Sheets has a function called ImportXML which helps in extracting data from websites. In our case, we will provide the function with the link containing the emails and the label they reside in.

In the first cell of the sheet, give the column a name for better readability. In the second row, copy the code below and press enter: 


=importxml("https://www.arct.cam.ac.uk/people/visiting-lecturers", "//div[@class='emailAddress']")


You should have the following set of emails in your list.

Test spreadsheet with list of emails in Column A


No need to worry, we’ll break it down now to explain what happened.

The =importXML part is simply using the function. Inside the brackets, the first part ("https://www.arct.cam.ac.uk/people/visiting-lecturers") is where we tell the function what link it needs to go to pull the data from.

After the comma, there are two forward slashes which are just part of the syntax. After that, we tell it what tag AND what class ("//div[@class='emailAddress']") we need it to look at. In this case, the tag is ‘div’ and the class is ‘emailAddress’.

The end result is the function looking in the page’s HTML code for ‘div’ tags that have the ‘emailAddress’ class while it ignores everything else. Whenever it finds a suitable candidate, it pulls the email into the sheet.

In a lot of cases, you will not find clearly labeled classes like this and sometimes that is intentionally done to ward off scrapers. A good understanding of the HTML code is therefore necessary for accurate results.

Pulling website data with the ImportXML function

Now that we’ve established a method of identifying where the data of interest is located, let’s try pulling data from a different website. In this section, we’re going to attempt to pull the names and excerpts of featured articles of the Lido Blog page.

The steps will be identical to the above section: Understand the HTML structure of the page, identify where the data is stored, and make the necessary changes in the ImportXML function.

These steps will be repeated twice since we have two different things to scrape. This means that we will be using the function in two different columns.

Here we have identified that the title of the page is located in the ‘h2’ tag and its class is ‘no-bottom-space no-top-space’.


A blog page with corresponding HTML code highlighted


We’ll now add this information into the function. The function will now look like:


=importxml("https://www.lido.app/blog", "//h2[@class='no-bottom-space no-top-space']")


Our spreadsheet should now have a few entries like below.

Test spreadsheet with List of blog titles in Column A


By making the required changes for the excerpts, we’ll end up with the following function for the second column:


=importxml("https://www.lido.app/blog","//div[@class='text-block-blog-card']")


The resulting spreadsheet should look similar to our results.

Test spreadsheet Google Sheet with list of blog titles and excerpts


Pulling tables with the ImportHTML function

Sometimes we need to work with tables and they can be fairly large. That means copy-pasting them can become tedious and the process is prone to errors. To handle that, we can make use of the ImportHTML function, which is much simpler (and weaker) than the ImportXML function.

Open up a new spreadsheet and in the first cell, type:


=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area","table",1)


The result should be something similar to ours.

Test spreadsheet 2 with a table of information from Wiki


Let’s now break down what happened.

The first part inside the brackets ("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area") specifies the URL from which we want to extract our tables from. The second part ("table") tells the function to look for tables.

ImportHTML has the ability to pull tables and lists. If we wanted to pull a list, we would replace the word ‘table’ with ‘list’.

The last part tells the function which table we need from the page. The number of the table can be found through trial and error. The indexing of the tables begins from one and usually, a page doesn’t have many tables.

The indexing of tables might change over time so if you get a different result, try changing the number.

There are some technical methods to find out exactly which index a table is stored in, which you can find more about here.

Pulling RSS feeds from a web page with ImportFEED

RSS feeds are a handy way of keeping yourself updated with new content that is uploaded to any site. With an RSS feed, you don’t have to manually visit a website to see if it has anything new as you can simply use the feed to see information regarding updates.

Storing that information in a Google Sheet, which you can easily update with a button, saves time and effort.

Note: in order to get the RSS feed for a page, you’ll need to find its RSS link first. To do that, read this guide.

Open a new spreadsheet and in the first cell, type in the function below.


=IMPORTFEED("http://rss.cnn.com/rss/edition.rss")


The resulting feed should arrive in a similar format. As it’s a news website, the content that will appear in your attempt will undoubtedly be different from the screenshot below.

Test spreadsheet 3 with RSS feed
The ImportFEED function can be given more information about what kind of content it needs to pull. To understand them, check out its documentation.


Closing words

We’ve looked at a number of scenarios in which we pulled different kinds of data into Sheets. The list of use cases for pulling data is virtually limitless. If there is data on the internet, chances are someone, somewhere will need it. 

All sorts of data is extractable and in order to do that, you need to understand the art of web scraping. We stated above that every website has a different structure and that’s why it’s hard to find a standard way of pulling data.

Every tutorial that you will find regarding pulling data into Google Sheets will be different. Some tutorials will be very straightforward, while some will definitely involve complex coding structures.

You will also need to make sure if the website you’re trying to extract data from allows this. Not every website welcomes people to scrape its data, and some even have protective elements that simply return empty values for the fields you try extracting.

It’s a high-paying skill to have, and in order to be an effective web scraper, web development and HTML knowledge will do you wonders. Some websites have dynamic data (constantly changing) and extracting that requires techniques beyond the scope of this article.

Further reading

We’ve gathered a bunch of resources that will help you enhance your web scraping skills. If you’re constantly in the need for pulling data from the web, chances are you will soon encounter a page that requires an intermediate-level of knowledge.

You will also need to learn some form of programming for this to work. We suggest Python because it’s one of the easiest languages to understand and has amazing utility.

Suscribe to get more data and analytics tips!

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.