The below example code performs searches on Yahoo and collects the results into a JSON file for later analysis. This tutorial assumes you have installed the prerequisites.
Note: this code is meant to demonstrate how to write a basic scraper; please do not use it to abuse Yahoo search resources.
This article is still being written, and is currently incomplete.
from selenium_crawler import SeleniumCrawler
from selenium.webdriver.common.keys import Keys
import json
import time
try:
crawler = SeleniumCrawler()
browser = crawler.navigate('http://search.yahoo.com')
# Get search terms that we'll send to Yahoo (and then scrape the results)
search_texts = ['collect all 7', 'golden state warriors', 'data science utah']
links_data = {}
for search_text in search_texts:
# type in the seach terms into the input box and press <enter>
el = browser.find_element_by_id("yschsp")
el.clear()
el.send_keys(search_text + Keys.RETURN)
crawler.screenshot()
# get ready to inject JS on our search results page
crawler.setup_scripts()
# inject JS into the page, using jQuery to retrieve the URL text
link_data = crawler.run_js('get_search_urls.js')
links_data[search_text] = link_data
time.sleep(10)
# dump our results to a JSON file for later analysis
with open('links.json', 'w') as links_file:
links_file.write(json.dumps(links_data, indent=4))
browser.quit()
except Exception as ex:
print(ex)
pass
The SeleniumCrawler class is shown here.
We inject jQuery and Lodash JavaScript libraries into the page to help us extract information from the page. I actually only ended up using jQuery in the JavaScript I wrote to extract URL text from the search results.
(function($, _) {
var urls = [];
$('span.fz-15px.fw-m.fc-12th.wr-bw').each(function() {
urls.push($(this).text());
});
window.WebScrapeNS.data = {urls: urls};
})(window.WebScrapeNS.$, window.WebScrapeNS._);
I create a global WebScrapeNS (namespace) variable to store data and references to JavaScript library objects (using $.noConflict() with jQuery, for example, to avoid clashing with the site's version of jQuery).
Leave a comment