Basic Web Scraper using Python, Selenium, and PhantomJS

The example code below runs searches on Yahoo and collects the results into a JSON file for later analysis. This tutorial assumes you have already installed the prerequisites.

Note: this code is meant to demonstrate how to write a basic scraper; please do not use it to abuse Yahoo search resources.

This article is still being written, and is currently incomplete.

Python Entry Point

from selenium_crawler import SeleniumCrawler
from selenium.webdriver.common.keys import Keys
import json
import time

try:
    crawler = SeleniumCrawler()
    browser = crawler.navigate('')

    # Get search terms that we'll send to Yahoo (and then scrape the results)
    search_texts = ['collect all 7', 'golden state warriors', 'data science utah']
    links_data = {}

    for search_text in search_texts:
        # type the search terms into the input box and press <enter>
        el = browser.find_element_by_id("yschsp")
        el.send_keys(search_text + Keys.RETURN)

        # get ready to inject JS on our search results page

        # inject JS into the page, using jQuery to retrieve the URL text
        link_data = crawler.run_js('get_search_urls.js')
        links_data[search_text] = link_data

    # dump our results to a JSON file for later analysis
    with open('links.json', 'w') as links_file:
        links_file.write(json.dumps(links_data, indent=4))

except Exception as ex:
    # Assumed placeholder: the handler body is not given, so just report the error.
    print(ex)
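
For the "later analysis" step, a minimal sketch of reading links.json back in might look like the following. The helper name `count_urls` is hypothetical, and the code assumes each value in the file has the shape `{"urls": [...]}`, matching the object the JavaScript builds; any other value is counted as zero.

```python
import json


def count_urls(path="links.json"):
    """Return {search_text: number_of_urls} from the scraper's output file."""
    with open(path, encoding="utf-8") as links_file:
        links_data = json.load(links_file)

    counts = {}
    for search_text, link_data in links_data.items():
        # Assumed shape: {"urls": [...]}; anything else counts as zero.
        urls = link_data.get("urls", []) if isinstance(link_data, dict) else []
        counts[search_text] = len(urls)
    return counts
```

Calling `count_urls()` after a run gives a quick check of how many links were captured per search term.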

SeleniumCrawler Python Class

The SeleniumCrawler class is shown here.
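
As a rough guide to what the entry point expects, here is a minimal sketch of such a class, inferred only from the calls made above (`navigate` returning the browser, `run_js` taking a script file name). It should not be read as the actual class. The `driver` and `js_dir` constructor parameters and the `close` method are assumptions, added so the sketch can be exercised without a real browser. Note that `webdriver.PhantomJS()` exists only in Selenium releases before 4.0.

```python
import os


class SeleniumCrawler:
    """Hypothetical sketch; the real class may differ."""

    def __init__(self, driver=None, js_dir="."):
        if driver is None:
            # Imported lazily so the sketch also works with a stand-in driver.
            from selenium import webdriver
            driver = webdriver.PhantomJS()  # removed in Selenium 4
        self.driver = driver
        self.js_dir = js_dir

    def navigate(self, url):
        """Load the URL and hand back the driver for element lookups."""
        self.driver.get(url)
        return self.driver

    def run_js(self, file_name):
        """Read a JavaScript file and execute it in the current page."""
        path = os.path.join(self.js_dir, file_name)
        with open(path, encoding="utf-8") as js_file:
            script = js_file.read()
        return self.driver.execute_script(script)

    def close(self):
        """Shut down the browser process."""
        self.driver.quit()
```

`execute_script` only passes a value back to Python when the script contains a top-level `return` statement, so a script that stores its results on a global object needs a second call to read them back.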

Injecting JavaScript Libraries

We inject the jQuery and Lodash JavaScript libraries into the page to help us extract information from it. In the end, the JavaScript I wrote to extract URL text from the search results uses only jQuery.
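
One way to do the injection, sketched here under assumptions, is to read a local copy of each library and hand its source to Selenium's `execute_script`. The helper name `inject_library` and the file names are hypothetical, and the approach assumes the page does not define an AMD loader that would stop the libraries from registering themselves as globals.

```python
def inject_library(driver, source_path):
    """Execute a local copy of a JavaScript library inside the current page."""
    with open(source_path, encoding="utf-8") as source_file:
        source = source_file.read()
    # execute_script runs the text in the page, where jQuery and Lodash
    # attach themselves to the window object as globals.
    driver.execute_script(source)
```

With local copies on disk, usage would be along the lines of `inject_library(browser, 'jquery.min.js')` followed by `inject_library(browser, 'lodash.min.js')`.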

JavaScript to Get Link Text

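(function($, _) {
    var urls = [];

    // Collect the text of each result URL element
    $('span.fz-15px.fw-m.fc-12th.wr-bw').each(function() {
        urls.push($(this).text());
    });

    // Assumption: results are stored on the namespace under "data";
    // the property name is not given in the text.
    window.WebScrapeNS.data = {urls: urls};

})(window.WebScrapeNS.$, window.WebScrapeNS._);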

I create a global WebScrapeNS (namespace) variable to store data and references to JavaScript library objects (using $.noConflict() with jQuery, for example, to avoid clashing with the site's version of jQuery).
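
The namespace setup can be sketched as a small bootstrap script run from Python once both libraries are on the page. The names `NAMESPACE_BOOTSTRAP_JS` and `create_namespace` are hypothetical. The sketch uses `jQuery.noConflict(true)` rather than plain `$.noConflict()`, which is a swapped-in variant: it restores the page's own `jQuery` global as well as `$`.

```python
NAMESPACE_BOOTSTRAP_JS = """
window.WebScrapeNS = window.WebScrapeNS || {};
// noConflict(true) hands back jQuery and restores the page's own $ and jQuery.
window.WebScrapeNS.$ = jQuery.noConflict(true);
// Lodash's noConflict restores the page's previous _ in the same way.
window.WebScrapeNS._ = _.noConflict();
"""


def create_namespace(driver):
    """Run the bootstrap script; call after jQuery and Lodash are injected."""
    driver.execute_script(NAMESPACE_BOOTSTRAP_JS)
```

Keeping the references on one global object means the scraping scripts can receive them as arguments, as the `(window.WebScrapeNS.$, window.WebScrapeNS._)` call does, without touching whatever `$` or `_` the site itself defines.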


