To Scrape or Not to Scrape

August 30, 2017

The art of data science and analysis continues to evolve as organizations make data available to the public via the web and APIs.  Our own local government here in Anchorage, Alaska, recently released the Municipality of Anchorage Open Data portal (https://data.muni.org), which offers data and statistics about restaurant ratings, homeless population counts, property and tax values, and more.

 

 

Data from the Municipality of Anchorage Open Data portal is available to the public through a user-friendly dashboard that displays tabular data, which can be exported to .csv, and visualizations, which can be exported to .png.  Alternatively, users can access the information programmatically via an API (Application Programming Interface).

 

As software developers, we naturally lean toward accessing data via an API, but sometimes that's just not an option.  When an API is not available, web scraping (also called screen scraping) is a possible alternative for obtaining data from the web.  Scraping is a programmatic way of extracting data from web sites and saving it to a local file or a database for future analysis.
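As a rough illustration of what "extracting data and saving it" means, here is a minimal sketch using only Python's standard library. The HTML snippet, the restaurant names, and the helper names (TableScraper, scrape_table, rows_to_csv) are made up for this example; a real page would need its own parsing rules.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, ready to be written to a local file."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical page content, standing in for HTML downloaded from a site.
SAMPLE_HTML = """
<table>
  <tr><th>Restaurant</th><th>Score</th></tr>
  <tr><td>Example Cafe</td><td>95</td></tr>
  <tr><td>Sample Diner</td><td>88</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)), end="")
```

In practice the HTML would come from a web request rather than a string, and the CSV text would be written to disk or loaded into a database.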

 

Of course, when embarking on any activity that involves collecting data, or building any other type of collection, it's important to understand the ethics of the activity. For example, perhaps one enjoys collecting seashells and wanders the public beaches searching the shores for a particularly striking example. As long as one is not trespassing or damaging property, one can safely consider oneself to be acting in an ethical manner.

 

What then of scraping?  Is there a way to scrape ethically?  There are a few ways to ensure that one is acting in an ethical manner:

 

  1. Check for any terms and conditions on the site if available and ensure that you're accessing and using the data as instructed.

  2. If there are no posted terms and conditions, ask permission. Look for a contact link on the site and ask if you can scrape it.

  3. If you can find no contact link, check the robots.txt file and follow its instructions.  Learn more about the robots exclusion standard at https://en.wikipedia.org/wiki/Robots_exclusion_standard .

  4. Look for copyright information. If the site is copyrighted, respect the author’s rights.

  5. Is the site ad-supported? If so, your scraping might cause difficulty for the site’s owner and their ad providers. Unless you obtain explicit permission, it is probably best to avoid scraping ad-supported sites.

  6. If scraping is appropriate, minimize bandwidth use by efficiently scraping the site. Minimize server load by implementing a crawl delay.
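Points 3 and 6 can be partly automated. The sketch below uses Python 3's standard urllib.robotparser module to check which URLs a robots.txt file allows and to honor its Crawl-delay value. The robots.txt content, the example.org URLs, the user-agent string, and the helper names are all hypothetical, and the fetch step is passed in as a function so that the sketch itself makes no network requests.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download this
# from the site's /robots.txt before requesting anything else.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-research-bot"  # hypothetical user-agent name


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, urls, default_delay=1.0):
    """Return (allowed_urls, delay_seconds) for our user agent."""
    delay = parser.crawl_delay(USER_AGENT)
    if delay is None:
        # No Crawl-delay given: fall back to our own conservative pause.
        delay = default_delay
    allowed = [url for url in urls if parser.can_fetch(USER_AGENT, url)]
    return allowed, float(delay)


def crawl(parser, urls, fetch, sleep=time.sleep):
    """Fetch only the allowed URLs, pausing between consecutive requests."""
    allowed, delay = polite_fetch_plan(parser, urls)
    results = []
    for index, url in enumerate(allowed):
        if index > 0:
            sleep(delay)  # crawl delay keeps the load on the server low
        results.append(fetch(url))
    return results
```

Here fetch is whatever function actually downloads a page. Keeping it as a parameter makes it easy to try the politeness logic against a stand-in before pointing it at a real site.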

 

In the coming weeks we'll present a series of tutorials on how to scrape a federal government site, one that provides no API for data extraction and is not ad-supported.  The tutorials will use Python, an interpreted language with a remarkable set of libraries that support many different activities. We'll incorporate three libraries:

 

  1. urllib2, a library supporting web access (in Python 3, its functionality lives in urllib.request)

  2. BeautifulSoup, an HTML access library that allows for loose parsing and searching, and

  3. Scrapy, a more abstracted web scraping library

 

We'll perform two different extractions.  The first will be performed on a static web site, that is, a site served in its final form by the web server (no JavaScript is used to modify the DOM, or Document Object Model). The second will be performed on a dynamic web site, one that is rendered and modified using JavaScript.
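To see why the distinction matters to a scraper, consider this small sketch, again using only the standard library. Both pages are invented for illustration, as are the RowCounter and summarize_page names. The static page carries its rows in the HTML the server sends, while the dynamic page sends an empty table plus a script that would fill it in once a browser runs it.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count table rows and note whether the page includes any script."""

    def __init__(self):
        super().__init__()
        self.row_count = 0
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_count += 1
        elif tag == "script":
            self.has_script = True


def summarize_page(html):
    """Return (row_count, has_script) for the HTML exactly as served."""
    counter = RowCounter()
    counter.feed(html)
    counter.close()
    return counter.row_count, counter.has_script


# Hypothetical static page: the data is already in the markup.
STATIC_PAGE = """<html><body><table>
<tr><td>Example Cafe</td><td>95</td></tr>
<tr><td>Sample Diner</td><td>88</td></tr>
</table></body></html>"""

# Hypothetical dynamic page: the table is empty until a script runs.
DYNAMIC_PAGE = """<html><body>
<table id="results"></table>
<script>
  // In a browser, this script would request the data and add the rows.
  loadRows("results");
</script>
</body></html>"""

if __name__ == "__main__":
    print("static:", summarize_page(STATIC_PAGE))
    print("dynamic:", summarize_page(DYNAMIC_PAGE))
```

A plain HTML parser finds the rows on the static page but none on the dynamic one, which is why dynamic sites generally call for a tool that can execute the page's JavaScript before extracting anything.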

 

Stay tuned for our follow-on tutorial, or subscribe below to our monthly TechTalk newsletter to have the update sent directly to your email inbox.

 

Bob is our in-house tech lead with over 25 years of experience in software design and development.  He is neXus Data Solutions' Employee Number One, who helps drive our technical solutions using an innovative yet practical approach.

 

 




© 2018 by neXus Data Solutions, LLC.