
More Fun with Online Data - A Dynamic Screen Scrape

As the web has matured, there have been a great number of changes in the way that information is presented to the user. Though this is not a true history, a simplified story might start with static web pages that simply sent HTML to the client, later adding styling information and perhaps some JavaScript to achieve a certain look. These would be considered static pages. Server-side improvements like the Common Gateway Interface (CGI) allowed a program to change the page returned from the server based on, say, the time of day, the URL requested, or anything else accessible through code. These were called dynamic pages, and they greatly enhanced the usefulness of the web. However, the technique increased the amount of work done by the servers, which can lead to slow pages or even site crashes in the event of a sudden surge in client requests.

This article will cover a dynamic screen scrape as a follow-on to our earlier articles, which introduced screen scraping and provided an overview of a static screen scrape.


With the increase in power of client systems, and the maturity of JavaScript support in modern browsers, pages today will often use the client’s system to render the page, passing HTML, CSS, and JavaScript files that then call back to the server through a data API to obtain the information used in the presentation. These types of pages require more browser capability than our previously used urllib offers, because urllib does not actually run the JavaScript needed by the page. In order to run that JavaScript, we will initially turn to Selenium WebDriver, a tool used for controlling a browser from a program. Python has Selenium bindings, so it can be installed quite easily using pip:

pip install selenium

Selenium requires a driver for the browser that it is controlling. This driver is browser specific; links to a variety of Selenium extensions and components can be found at http://docs.seleniumhq.org/download/. Once the driver is installed, we are ready to start scraping a client-side dynamic site.
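If the driver executable is not on your PATH, Selenium’s Python bindings can also be pointed at it directly. Here is a minimal sketch, assuming the Selenium 3-era executable_path keyword and a hypothetical download location:

from selenium import webdriver

# Hypothetical location of the downloaded ChromeDriver executable; adjust to
# wherever you unpacked it. Newer Selenium releases prefer a Service object
# over the executable_path keyword.
driver = webdriver.Chrome(executable_path=r'C:\tools\chromedriver.exe')
driver.get('https://data.world')
print(driver.title)  # quick sanity check that the browser is being driven
driver.quit()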

The site to be scraped is the landing page of data.world (https://data.world/) which is a site that hosts data sets, provides a social network for data, offers tools to analyze the data, and attempts to give meaning to the data sets through semantic tags.

The site’s robots.txt file allows access to everything except /meta, and the site does not appear to be ad supported. I also agree with their project and would like to help spread the word.
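For readers who want to check this sort of thing programmatically, Python’s standard library includes a robots.txt parser. A minimal sketch follows; the exact rules on data.world may have changed since this was written:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt
rp = RobotFileParser()
rp.set_url('https://data.world/robots.txt')
rp.read()

# Check whether a generic crawler may fetch the landing page and /meta
print(rp.can_fetch('*', 'https://data.world/'))      # expected: True
print(rp.can_fetch('*', 'https://data.world/meta'))  # expected: False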

Determining if the page is Client Side Rendered

The easiest way to discover if the page is client-side rendered is to look at the source in a web browser and search for some of the text of the page in the source. For example, at the time of writing this post, the first list entry of the data.world site (beneath the “Dig this Data” header) is a data set named IndiaCPS (FY2013-17).

Opening the page source in a new tab (in Chrome, right click and select “View Page Source”) and initiating a text search (CTRL-F or CMD-F) for the text “IndiaCPS” returns nothing found (no highlighted text, and “0 of 0” in the search box). This is evidence that the page is constructed using JavaScript, and thus not amenable to urllib; a quick programmatic check with urllib, sketched after the list below, confirms it. We do still need to inspect the rendered page to determine how to parse out the data that we want. There are a couple of ways to do this:

  1. Use the Chrome Developer Tools (Customize and control Google Chrome -> More Tools -> Developer Tools, Elements tab).

  2. Use a custom program to pull the rendered HTML and write it to a file.
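As promised, here is a minimal sketch of the urllib check. It fetches the raw HTML exactly as the server delivers it (before any JavaScript runs) and searches it for the dataset name; the search string and the User-Agent header are just illustrative choices:

from urllib.request import Request, urlopen

# Fetch the raw HTML as delivered by the server - no JavaScript runs here
req = Request('https://data.world/', headers={'User-Agent': 'Mozilla/5.0'})
raw_html = urlopen(req).read().decode('utf-8')

# If the page is rendered client side, the dataset name will not appear in the raw HTML
print('IndiaCPS' in raw_html)  # expected: False for a client-side rendered page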

For this post, we’ll write a quick Python program that uses Selenium to pull the HTML we need. Make sure Chrome and its matching ChromeDriver are installed before you run the code.

Here is the code for the rendered HTML grab:

from bs4 import BeautifulSoup
import codecs
from selenium import webdriver

# Drive a real Chrome instance so the page's JavaScript actually runs
driver = webdriver.Chrome()
driver.get('https://data.world')

# Grab the HTML after client-side rendering has happened
page_source = driver.page_source
driver.quit()

# Parse, prettify, and save the rendered HTML for inspection
soup = BeautifulSoup(page_source, 'html.parser')
pretty = soup.prettify()

f = codecs.open('dataworld.txt', 'w', 'utf-8')
f.write(pretty)
f.close()

This uses Selenium to drive Chrome to get the page. Chrome runs the JavaScript to render the page, and the code grabs the rendered HTML via the driver.page_source property. The HTML is then parsed by BeautifulSoup, prettified, and written to a text file. Opening that file and searching for “India” leads to the discovery of the following anchor tag:

<a class="dw-dataset-name DatasetCard__name___2U4-H" href="/finance/indiacps-fy2013-17" target="">
  IndiaCPS (FY2013-17)
</a>

So the HTML has been successfully rendered and captured by the code. After examining the tags surrounding the above anchor, a promising class to search on for the data sets is “dw-DatasetCard”, which appears in every one of the spans surrounding the listed data sets. A quick check in the interactive Python shell to see if the class will work for us:

>>> data_sets = soup.find_all('span', {'class':'dw-DatasetCard'})
>>> len(data_sets)
21

If you count the data sets, there are 15 in the section that we are examining, and 6 more in the Featured Datasets section above. Since we don’t want the featured data sets for this application, we’ll need to thin the result set a tad. A useful divider appears to be the “Dig this data” title, resulting in the following:

>>> diggit = soup.find('span', text=u'Dig this data')
>>> diggit
<span data-reactid="187">Dig this data</span>
>>> data_sets = diggit.find_all_next('span', {'class':'dw-DatasetCard'})
>>> len(data_sets)
15

The extra work of searching after the “Dig this data” span has resulted in the correct number of data sets. The remaining work of parsing out the pertinent information from the “dw-DatasetCard” is similar to the last post’s extraction process.
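As an illustration of that remaining step, here is a minimal sketch that pulls the name and link out of each card. It assumes each card contains an anchor with the dw-dataset-name class, as seen in the tag above; the class names are whatever data.world happened to generate at the time and will likely change:

# Continuing from the interactive session above: extract name and link from each card
for card in data_sets:
    link = card.find('a', {'class': 'dw-dataset-name'})
    if link is not None:
        print(link.get_text(strip=True), 'https://data.world' + link['href'])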

Going Headless

Using WebDriver to control Chrome is tremendously useful, allowing one to drive to a certain place and then take over manually. However, it might be argued that a user application like Chrome is not appropriate on a server installation: the Chrome browser requires a desktop to function, which means a logged-in user. A headless (not having any user windows) alternative is PhantomJS. The download for Windows is available at http://phantomjs.org/download.html; simply unzip the file and move phantomjs.exe into a location listed in your PATH (or add an entry). The change to the code is minimal:

from bs4 import BeautifulSoup
import codecs
from selenium import webdriver

# The only change: use the headless PhantomJS driver instead of Chrome
driver = webdriver.PhantomJS()
driver.get('https://data.world')

page_source = driver.page_source
driver.quit()

soup = BeautifulSoup(page_source, 'html.parser')
pretty = soup.prettify()

f = codecs.open('dataworld.txt', 'w', 'utf-8')
f.write(pretty)
f.close()

In Windows, a console window opens to run the phantomjs.exe executable, but no browser window opens. This allows services to run the scraper without needing a logged-in user.

Note: Since Chrome 59 (for Mac and Linux) and Chrome 60 (for Windows), Chrome can also be run in a headless mode. To enable this, start Chrome from the command line (or a shortcut) with the following flags:

chrome --headless --disable-gpu
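The same flags can be passed through Selenium’s Python bindings, which avoids the PhantomJS dependency entirely. A minimal sketch, assuming a Selenium version whose Chrome options API accepts these arguments (the keyword was chrome_options in the Selenium 3 era and is options in newer releases):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Ask Chrome itself to run headless - no visible browser window
opts = Options()
opts.add_argument('--headless')
opts.add_argument('--disable-gpu')

driver = webdriver.Chrome(chrome_options=opts)  # use options=opts on newer Selenium
driver.get('https://data.world')
page_source = driver.page_source
driver.quit()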

Thanks for reading.

neXus Data Solutions is a Software Development Company in Anchorage, Alaska. Bob is our in-house tech lead with over 25 years of experience in software design and development. He is neXus Data Solutions' Employee Number One, helping to drive our technical solutions using an innovative yet practical approach.
