Basic Web Scraper using Python, Selenium, and PhantomJS

The below example code performs searches on Yahoo and collects the results into a JSON file for later analysis. This tutorial assumes you have installed the prerequisites.

Note: this code is meant to demonstrate how to write a basic scraper; please do not use it to abuse Yahoo search resources.

This article is still being written, and is currently incomplete.

Python Entry Point

from selenium_crawler import SeleniumCrawler
from selenium.webdriver.common.keys import Keys
import json
import time

try:
    crawler = SeleniumCrawler()
    browser = crawler.navigate('http://search.yahoo.com')

    # Get search terms that we'll send to Yahoo (and then scrape the results)
    search_texts = ['collect all 7', 'golden state warriors', 'data science utah']
    links_data = {}

    for search_text in search_texts:
        # type in the seach terms into the input box and press <enter>
        el = browser.find_element_by_id("yschsp")
        el.clear()
        el.send_keys(search_text + Keys.RETURN)

        crawler.screenshot()

        # get ready to inject JS on our search results page
        crawler.setup_scripts()

        # inject JS into the page, using jQuery to retrieve the URL text
        link_data = crawler.run_js('get_search_urls.js')
        links_data[search_text] = link_data

        time.sleep(10)

    # dump our results to a JSON file for later analysis
    with open('links.json', 'w') as links_file:
        links_file.write(json.dumps(links_data, indent=4))

    browser.quit()
except Exception as ex:
    print(ex)
    pass

SeleniumCrawler Python Class

The SeleniumCrawler class is shown here.

Injecting JavaScript Libraries

We inject jQuery and Lodash JavaScript libraries into the page to help us extract information from the page. I actually only ended up using jQuery in the JavaScript I wrote to extract URL text from the search results.

JavaScript to Get Link Text

(function($, _) {
    var urls = [];

    $('span.fz-15px.fw-m.fc-12th.wr-bw').each(function() {
        urls.push($(this).text());
    });
    window.WebScrapeNS.data = {urls: urls};

})(window.WebScrapeNS.$, window.WebScrapeNS._);

I create a global WebScrapeNS (namespace) variable to store data and references to JavaScript library objects (using $.noConflict() with jQuery, for example, to avoid clashing with the site's version of jQuery).

Comments

Leave a comment

What color are green eyes? (spam prevention)
Submit
Code under MIT License unless otherwise indicated.
© 2020, Downranked, LLC.