To Scrape or Not to Scrape

The art of data science and analysis continues to evolve as organizations make data available to the public via the web and APIs. Our own local government here in Anchorage, Alaska, recently released the Municipality of Anchorage Open Data portal (https://data.muni.org), where it offers data and statistics about restaurant ratings, homeless population counts, property and tax values, and more.

Anchorage Muni Data Portal

Data from the Municipality of Anchorage Open Data portal is available to the public through a user-friendly dashboard displaying tabular data that can be exported to .csv and visualizations that can be exported to .png. Alternatively, users can access the information programmatically via an API (Application Programming Interface).

As software developers, we naturally lean towards accessing data via an API, but sometimes that's just not an option. When an API is not available, web or screen scraping is a possible alternative for obtaining data from the web. Scraping is a programmatic way of extracting data from web sites and saving that data to a local file or a database for future use in analyses.

Of course, when embarking on any activity that involves collecting data, or building any other type of collection, it's important to understand the ethics of the activity. For example, perhaps one enjoys collecting sea shells, and thus wanders the public beaches searching the shores for a particularly striking example. As long as one is not trespassing or damaging property, one can safely consider oneself to be acting in an ethical manner.

What then of scraping? Is there a way to scrape ethically? There are a few ways to ensure that one is acting in an ethical manner:

  1. Check for any terms and conditions on the site if available and ensure that you're accessing and using the data as instructed.

  2. If there are no posted terms and conditions, ask permission. Look for a contact link on the site and ask if you can scrape it.

  3. If you can find no contact link, check the robots.txt file and follow its instructions. Learn more about the robots exclusion standard at https://en.wikipedia.org/wiki/Robots_exclusion_standard .

  4. Look for copyright information. If the site is copyrighted, respect the author’s rights.

  5. Is the site ad supported? If so, your scraping might cause difficulty for the site’s owner and their ad providers. Unless explicit permission is obtained, it is probably best to avoid scraping ad-supported sites.

  6. If scraping is appropriate, minimize bandwidth use by efficiently scraping the site. Minimize server load by implementing a crawl delay.
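
Items 3 and 6 above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the upcoming tutorials: it assumes Python 3 and its standard-library urllib.robotparser module, and the robots.txt content, the example.gov URLs, the user-agent name and the helper functions are all hypothetical. The robots.txt rules are parsed from a string so the sketch makes no network calls.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, embedded here so nothing is downloaded.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-scraper"  # hypothetical user-agent name


def build_parser(robots_text):
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent=USER_AGENT, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, parser, user_agent=USER_AGENT, sleep=time.sleep):
    """Fetch each permitted URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns its content.
    """
    delay = polite_delay(parser, user_agent)
    results = {}
    for index, url in enumerate(allowed_urls(parser, urls, user_agent)):
        if index > 0:
            sleep(delay)  # crawl delay keeps server load low
        results[url] = fetch(url)
    return results
```

Against a live site, one would instead call set_url() with the address of the site's robots.txt followed by read(), and pass a real download function as fetch. A missing Crawl-delay line does not mean a delay is unnecessary, which is why the sketch falls back to a default pause.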

In the coming weeks we'll present a series of tutorials on how to perform a screen or web scrape of a federal government site, one that provides no API for data extraction and is not ad supported. The tutorials will use Python, an interpreted language with a remarkable set of libraries that support many different activities. We'll incorporate three libraries:

  1. urllib2, a library supporting web access (in Python 3 its functionality lives in urllib.request)

  2. BeautifulSoup, an HTML access library that allows for loose parsing and searching, and

  3. Scrapy, a more abstracted web scraping library

We'll perform two different extractions. The first will be performed on a static web site, that is, a site served in its final form by the web server, with no JavaScript used to modify the DOM (Document Object Model). The second will be performed on a dynamic web site, one that is rendered and modified using JavaScript.
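
To give a feel for the static case, here is a small sketch that pulls the rows out of an HTML table and converts them to CSV text. It is not the code from the upcoming tutorials: it assumes Python 3 and, to stay dependency-free, swaps in the standard-library html.parser module where the tutorials will use BeautifulSoup. The sample page, its data and the helper names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical static page: the table is already present in the served HTML.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Example Cafe A</td><td>95</td></tr>
  <tr><td>Example Cafe B</td><td>88</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    extractor = TableExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, ready to be written to a local file."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()
```

Calling rows_to_csv(extract_rows(SAMPLE_HTML)) yields CSV text that could be saved to disk. On a dynamic site the same approach would likely find nothing, because the table is only added to the DOM after JavaScript runs in a browser.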

Stay tuned for our follow-on tutorial, or subscribe below to our monthly TechTalk newsletter to have the update sent directly to your email inbox.

Bob is our in-house tech lead with over 25 years of experience in software design and development. He is neXus Data Solutions' Employee Number One, who helps to drive our technical solutions using an innovative yet practical approach.