I have this website and want to write a script that produces the same output as clicking 'Export' -> 'Generate tsv' -> waiting for the file to be generated -> 'Download'.
The end goal is to use this for a list of approx. 1700 proteins which I have in a .txt file (so extract a protein, in this case 'Q9BXF6', and put it in the URL: https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table) and download all results as .tsv files.
I tried inspecting the 'Export' button, but the source code wasn't illuminating (or I didn't know where to look). I also tried this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table')
soup = BeautifulSoup(r.content, 'html.parser')
to locate what I need but it outputs a bunch of characters that I can't really understand.
I also tried downloading the whole page just like it is with the urllib library:
myurl = 'https://www.ebi.ac.uk/interpro/protein/UniProt/Q9BXF6/entry/InterPro/#table'
with urllib.request.urlopen(myurl) as f:
    html = f.read().decode('utf-8')
or
urllib.urlretrieve(myurl, 'interpro.txt')  # although this didn't work
It seems as if all the content is written somewhere else and only referred to, and everything I've tried outputs something unintelligible, but I don't know anything about HTML and am really new to Python (I only use R).
For your first question, you can use the URL of the following element to retrieve the protein data you require for the next problem.
href="blob:https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b"
The URL is set in the href attribute, which you can then use to make the request to download the file. You can also obtain it by right-clicking the TSV download button and choosing Inspect Element; you will then be able to see this href attribute.
Following that, download by doing e.g.
import urllib.request

url = 'https://www.ebi.ac.uk/806960aa-720c-4958-9392-f242adee627b'
urllib.request.urlretrieve(url, '/Users/abc/Downloads/file.tsv')  # any directory to save to

with open("/Users/abc/Downloads/file.tsv") as file_in:
    for line in file_in:
        # here make your calls for your second problem
        pass
You can also use a web automation tool such as Selenium to solve this problem gracefully. If the latter is of interest, do look into it - it's not hard.
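For example, a rough Selenium sketch could look like the following. This is only a sketch: the download directory, the proteins.txt file and, above all, the button XPaths are assumptions that you would need to adapt to the actual page via Inspect Element.
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# Assumed download directory -- change to wherever you want the .tsv files saved.
options.add_experimental_option('prefs', {'download.default_directory': '/Users/abc/Downloads'})
driver = webdriver.Chrome(options=options)

# proteins.txt is assumed to hold one UniProt accession per line.
with open('proteins.txt') as f:
    proteins = [line.strip() for line in f if line.strip()]

for protein in proteins:
    driver.get('https://www.ebi.ac.uk/interpro/protein/UniProt/%s/entry/InterPro/#table' % protein)
    time.sleep(5)   # crude wait for the JavaScript to render the table
    # The selectors below are placeholders mirroring Export -> Generate tsv -> Download;
    # inspect the real buttons and adjust them.
    driver.find_element_by_xpath('//button[contains(., "Export")]').click()
    driver.find_element_by_xpath('//button[contains(., "Generate")]').click()
    time.sleep(10)  # wait for the TSV to be generated
    driver.find_element_by_xpath('//a[contains(., "Download")]').click()

driver.quit()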
I'm trying to write some code which download the two latest publications of the Outage Weeks found at the bottom of http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/
It's xlsx-files, which I'm going to load into Excel afterwards.
It doesn't matter which programming language the code is written in.
My first idea was to use the direct URLs, like http://www.eirgridgroup.com/site-files/library/EirGrid/Outage-Weeks_36(2016)-51(2016)_31%20August.xlsx, and then write some code which guesses the URLs of the two latest publications.
But I have noticed some inconsistencies in the URL names, so that solution wouldn't work.
Instead, it might be a solution to scrape the website and use XPath to locate and download the files. I found out that the two latest publications always have the following XPaths:
/html/body/div[3]/div[3]/div/div/p[5]/a
/html/body/div[3]/div[3]/div/div/p[6]/a
This is where I need help. I'm new to both XPath and Web Scraping. I have tried stuff like this in Python
from lxml import html
import requests
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('/html/body/div[3]/div[3]/div/div/p[5]/a')
But v seems to be empty.
Any ideas would be greatly appreciated!
Just use contains to find the hrefs and slice the first two:
tree.xpath('//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")]/@href')[:2]
Or doing it all in the XPath using [position() < 3]:
tree.xpath('(//p/a[contains(@href, "site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
The files are ordered from latest to oldest so getting the first two gives you the two newest.
To download the files you just need to join each href to the base url and write the content to a file:
from lxml import html
import requests
import os
from urlparse import urljoin # from urllib.parse import urljoin
page = requests.get('http://www.eirgridgroup.com/customer-and-industry/general-customer-information/outage-information/')
tree = html.fromstring(page.content)
v = tree.xpath('(//p/a[contains(@href, "/site-files/library/EirGrid/Outage-Weeks")])[position() < 3]/@href')
for href in v:
    # os.path.basename(href) -> Outage-Weeks_35(2016)-50(2016).xlsx
    with open(os.path.basename(href), "wb") as f:
        f.write(requests.get(urljoin("http://www.eirgridgroup.com", href)).content)
I've been reading up on parsing XML with Python all day, but looking at the site I need to extract data from, I'm not sure if I'm barking up the wrong tree. Basically I want to get the 13-digit barcodes from a supermarket website (found in the names of the images). For example:
http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985
has 11 items and 11 images; the barcode for the first item is 0000003235676. However, when I look at the page source (I assume this is the best way to extract all of the barcodes in one go with Python, urllib and BeautifulSoup), all of the barcodes are on one line (line 12), but the data doesn't seem to be structured as I would expect in terms of elements and attributes.
new TESCO.sites.UI.entities.Product({name:"Lb Mens Mattifying Dust 7G",xsiType:"QuantityOnlyProduct",productId:"275303365",baseProductId:"72617958",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"g",unitPrice:3.58,catchWeight:"0",shelfName:"Mens Styling",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Thickening Shampoo 250Ml",xsiType:"QuantityOnlyProduct",productId:"275301223",baseProductId:"72617751",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/225/5021320051225/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:1,catchWeight:"0",shelfName:"Mens Shampoo ",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
new TESCO.sites.UI.entities.Product({name:"Lb Mens Sculpting Puty 75Ml",xsiType:"QuantityOnlyProduct",productId:"275301557",baseProductId:"72617906",quantity:1,isPermanentlyUnavailable:true,imageURL:"http://img.tesco.com/Groceries/pi/287/5021320051287/IDShot_90x90.jpg",maxQuantity:99,maxGroupQuantity:0,bulkBuyLimitGroupId:"",increment:1,price:2.5,abbr:"ml",unitPrice:3.34,catchWeight:"0",shelfName:"Pastes, Putty, Gums, Pomades",superdepartment:"Health & Beauty",superdepartmentID:"TO_1448953606"});
Maybe something like BeautifulSoup is overkill? I understand the DOM tree is not the same thing as the raw source, but why are they so different - when I go to Inspect Element in Firefox the data seems structured as I would expect.
Apologies if this comes across as totally stupid, thanks in advance.
Unfortunately, the barcode is not given in the HTML as structured data; it only appears embedded as part of a URL. So we'll need to isolate the URL and then pick off the barcode with string manipulation:
import urllib2
import bs4 as bs
import re
import urlparse

url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
response = urllib2.urlopen(url)
content = response.read()

# Useful for debugging off-line:
# with open('/tmp/test.html', 'w') as f:
#     f.write(content)
# with open('/tmp/test.html', 'r') as f:
#     content = f.read()

soup = bs.BeautifulSoup(content)
barcodes = set()
for tag in soup.find_all('img', {'src': re.compile(r'/pi/')}):
    src = tag['src']
    scheme, netloc, path, query, fragment = urlparse.urlsplit(src)
    # The 13-digit barcode is the path component just before the image file name,
    # e.g. .../Groceries/pi/805/5021320051805/IDShot_90x90.jpg
    barcodes.add(path.split('/')[-2])
print(barcodes)
yields
set(['0000003222737', '0000010039670', '0000010036297', '0000010008393', '0000003050453', '0000010062951', '0000003239438', '0000010078402', '0000010016312', '0000003235676', '0000003203132'])
As the site uses JavaScript to format its content, you might find it useful to switch from urllib to a tool like Selenium. That way you can crawl pages as they render for a real user in a web browser. This GitHub project seems to solve your task.
Another option would be to filter the JSON-like data out of the page's JavaScript and read the values directly from there.
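For instance, assuming the page source really does contain the new TESCO.sites.UI.entities.Product(...) calls shown above, a quick (and admittedly brittle) sketch of that second option is to pull the 13-digit number straight out of each imageURL with a regular expression:
import re
import urllib2

url = 'http://www.tesco.com/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31033985'
content = urllib2.urlopen(url).read()

# Each product line carries an imageURL such as
#   http://img.tesco.com/Groceries/pi/805/5021320051805/IDShot_90x90.jpg
# where the 13-digit barcode is the directory before the file name.
barcodes = set(re.findall(r'imageURL:"[^"]*/(\d{13})/[^"]*"', content))
print(barcodes)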
I'm just getting into scraping with Scraperwiki in Python. Already figured out how to scrape tables from a page, run the scraper every month and save the results on top of each other. Pretty cool.
Now I want to scrape this page with information on Android versions and run the script monthly. In particular, I want the table for the version, codename, API and distribution. It's not easy.
The table is rendered inside a wrapper div. Is there any way to scrape this information? I can't find any solution.
Plan B is to scrape the visualisation. What I eventually need is the codename and the percentage, so that's sufficient. This information can be found in the HTML in a Google Chart script.
But I can't find this information in my 'souped' HTML. I have a public scraper over here; you can edit it to make it work.
Can anyone explain how I can approach this problem? A working scraper with comments on what's going on would be awesome.
This is really a difficult case because, as kisamoto mentioned, the data is inside the embedded JavaScript and not in a separate JSON file as you would expect. It is possible with BeautifulSoup, but it involves some ugly string processing:
import json

# assumes soup is a BeautifulSoup object built from the dashboards page
last_paragraph = soup.find_all('p', style='clear:both')[-1]
script_tag = last_paragraph.next_sibling.next_sibling
script_text = script_tag.text
lines = script_text.split('\n')

# collect the lines belonging to VERSION_DATA, stopping once SCREEN_DATA starts
data_text = ''
for line in lines:
    if 'SCREEN_DATA' in line:
        break
    data_text = data_text + line

data_text = data_text.replace('var VERSION_DATA =', '')
# delete the semicolon at the end
data_text = data_text[:-1]

data = json.loads(data_text)
data = data[0]
print data['data']
Output:
[{u'perc': u'0.1', u'api': 4, u'name': u'Donut'}, ... ]
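Since what you eventually need is just the codename and the percentage, you can then walk that list directly; the name and perc keys are the ones shown in the output above:
for entry in data['data']:
    print entry['name'], entry['perc']  # e.g. Donut 0.1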
As this is stored and rendered with JavaScript, a plain Python scraper is unable to execute that code and see the visualisation or the table.
ScraperWiki is great, but I've always found that if you're only scraping a single page each month, a Python script plus cron is much better, and if you need this JavaScript to be executed, using Selenium and its Python driver is a much more powerful solution.
When you have the Selenium server installed you can do roughly the following (in pseudocode):
#!/usr/bin/env python
from selenium import webdriver
browser = webdriver.Firefox()
# Load page with all Javascript rendered in the DOM for you.
browser.get("http://developer.android.com/about/dashboards/index.html")
# Find the table
table = browser.find_element_by_xpath("/html/body/div[3]/div[2]/div/div/div[2]/div/div/table")
# Do something with the table element
# Save the data
browser.close()
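To flesh out the "Do something with the table element" placeholder a little (before browser.close()), and assuming the dashboard table is ordinary rows of cells, something like this would pull the text out:
for row in table.find_elements_by_tag_name("tr"):
    cells = [cell.text for cell in row.find_elements_by_tag_name("td")]
    if cells:  # skips the header row, which uses <th> cells instead
        print(cells)  # one list of cell texts per row: version, codename, API, distribution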
Then just have a cron job running the script on the first day of the month like so:
0 0 1 * * /path/to/python_script.py
I'm currently working on a school project whose goal is to analyze scam mails with the Natural Language Toolkit package. Basically, what I want to do is compare scams from different years and try to find a trend: how has their structure changed over time?
I found a scam-database: http://www.419scam.org/emails/
I would like to download the content of the links with python, but I am stuck.
My code so far:
from BeautifulSoup import BeautifulSoup
import urllib2, re
html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')
links2 = soup.findAll(href=re.compile("index"))
print links2
So I can fetch the links, but I don't know yet how I can download the content. Any ideas? Thanks a lot!
You've got a good start, but right now you're simply retrieving the index page and loading it into the BeautifulSoup parser. Now that you have hrefs from the links, you essentially need to open all of those links and load their contents into data structures that you can then use for your analysis.
This essentially amounts to a very simple web crawler. If you can use other people's code, you may find something that fits by googling "python web crawler". I've looked at a few of those; they are straightforward enough, but may be overkill for this task. Most web crawlers use recursion to traverse the full tree of a given site, and it looks like something much simpler could suffice for your case.
Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully get you on the right path, or at least give you a sense of how the crawling could be done:
from BeautifulSoup import BeautifulSoup
import urllib2, re

emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after they are
    # loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # open, soup, and parse the page.
    # Looks like the email itself is in a "blockquote" tag, so that may be the starting place.
    # From there you'll need to create arrays and/or dictionaries of the emails'
    # contents to do your analysis on, e.g. emailContents
    pass

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # add your own code here to filter the list-page soup and collect all the
    # relevant links to the actual email pages (note they may be relative, so
    # you may need urlparse.urljoin to make them absolute)
    email_page_links = []
    for link in email_page_links:
        parse_email_page(link['href'])

def main():
    html = urllib2.urlopen('http://www.419scam.org/emails/').read()
    soup = BeautifulSoup(html)
    # I use '20' to filter links since all the relevant links seem to have a 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))
    for link in links:
        parse_list_page(link['href'])
    analyze_emails()

if __name__ == "__main__":
    main()
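If the emails really do sit inside blockquote tags as assumed above, a minimal parse_email_page could look something like this (BeautifulSoup 3 syntax to match the rest of the script):
def parse_email_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    # assumes each email body is wrapped in a <blockquote>; adjust if the markup differs
    for quote in soup.findAll('blockquote'):
        emailContents.append(''.join(quote.findAll(text=True)))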