I am a beginner to Python. I am the developer of the Easy APIs Project (http://gcdc2013-easyapisproject.appspot.com) and was doing a Python implementation of the weather API using my project. Visit http://gcdc2013-easyapisproject.appspot.com/APIs_Doc.html to see the Weather API. Below is my implementation, but it returns HTTPError: HTTP Error 400: Bad request.
import urllib2
def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q='+a+' in celsius')
    htmlx = responsex.read()
    responsex.close()
    htmlx = html[1:] #remove first {
    htmlx = html[:-1] #remove last }
    htmlx = html.split('}{') #split and put each result into an array
    return str(htmlx[1])
print "Enter a city name:",
q = raw_input() #get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q='+q)
html = response.read()
response.close()
html = html[1:] #remove first {
html = html[:-1] #remove last }
html = html.split('}{') #split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3])
Please help me.
The query string should be quoted using urllib.quote or urllib.quote_plus:
import urllib
import urllib2
def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + urllib.quote(a + ' in celsius'))
    html = responsex.read()
    responsex.close()
    html = html[1:] #remove first {
    html = html[:-1] #remove last }
    html = html.split('}{') #split and put each result into an array
    return html[0]
print "Enter a city name:",
q = raw_input() #get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q='+urllib.quote(q))
html = response.read()
print repr(html)
response.close()
html = html[1:] #remove first {
html = html[:-1] #remove last }
html = html.split('}{') #split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3].split()[0])
In addition to that, I modified celsius to use html consistently; the original code mixed html and htmlx.
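For context, a quick interactive sketch of what quoting does to the query string (the city name here is just an example). Without quoting, the literal spaces in ' in celsius' end up in the HTTP request line, which is most likely what triggers the 400:
>>> import urllib
>>> urllib.quote('London in celsius')
'London%20in%20celsius'
>>> urllib.quote_plus('London in celsius')
'London+in+celsius'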
I have found the answer: the query should be quoted with urllib2.quote(q).
I'm running into an intermittent issue when I run the code below. I'm trying to collect all the page tokens from the AJAX calls made by pressing the "load more" button, if it exists. Basically, I'm trying to get all the page tokens from a YouTube channel.
Sometimes it retrieves the tokens, and other times it doesn't. My best guess is that either I made a mistake in my find_embedded_page_token function or I need some sort of delay/sleep inserted somewhere.
Below is the full code:
import requests
import pprint
import urllib.parse
import lxml
def find_XSRF_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('"', pos_begin)
    return html[pos_begin: pos_end]

def find_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    return html[pos_begin: pos_end]

def find_embedded_page_token(html, key, num_chars=2):
    pos_begin = html.find(key) + len(key) + num_chars
    pos_end = html.find('&', pos_begin)
    excess_str = html[pos_begin: pos_end]
    sep = '\\'
    rest = excess_str.split(sep, 1)[0]
    return rest
sxeVid = 'https://www.youtube.com/user/sxephil/videos'
ajaxStr = 'https://www.youtube.com/browse_ajax?action_continuation=1&continuation='
s = requests.Session()
r = s.get(sxeVid)
html = r.text
session_token = find_XSRF_token(html, 'XSRF_TOKEN', 4)
page_token = find_page_token(html, ';continuation=', 0)
print(page_token)
s = requests.Session()
r = s.get(ajaxStr+page_token)
ajxHtml = r.text
ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
while page_token:
    ajxBtn = ajxHtml.find('data-uix-load-more-href=')
    if ajxBtn != -1:
        s = requests.Session()
        r = s.get(ajaxStr + ajax_page_token)
        ajxHtml = r.text
        ajax_page_token = find_embedded_page_token(ajxHtml, ';continuation=', 0)
        print(ajax_page_token)
    else:
        break
This is what is returned intermittently, which is unexpected: it pulls not just the token, but also the HTML after the desired cut-off.
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D"><span class="yt-uix-button-content"> <span class="load-more-loading hid">
<span class="yt-spinner">
<span class="yt-spinner-img yt-sprite" title="Loading icon"></span>
The response I'm expecting is this:
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk1yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5iZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk5yZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk43Z0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9MZ0JBQSUzRCUzRA%253D%253D
4qmFsgJAEhhVQ2xGU1U5X2JVYjRSYzZPWWZUdDVTUHcaJEVnWjJhV1JsYjNNZ0FEZ0JZQUZxQUhvQk9iZ0JBQSUzRCUzRA%253D%253D
Any help is greatly appreciated. Also, if my tags are wrong, let me know what tags to +/-.
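In case it helps narrow things down, below is a minimal sketch of a stricter extractor that keeps only the characters that appear in the sample tokens above (letters, digits, '%', '_' and '-'), so trailing markup is dropped even when the terminating character varies; the character class is an assumption based on those samples:

import re

def find_embedded_page_token_strict(html, key='continuation='):
    # Keep only characters that can appear in the sample tokens;
    # stop at the first character outside that set (e.g. '"' or '<').
    match = re.search(re.escape(key) + r'([A-Za-z0-9%_-]+)', html)
    return match.group(1) if match else None

It would be called the same way as the existing helper, e.g. find_embedded_page_token_strict(ajxHtml, ';continuation=').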
I am trying to get the similarity score between two sets of GO terms. There is a webpage that does it, and I am trying to automate this calculation for several sets using a Python script.
So far my script generates two files that work when uploaded manually to the webpage, but I cannot figure out how to submit them from the code.
import urllib
import urllib2
import lxml.html

temp1 = somefile
url = "http://bioinformatics.clemson.edu/G-SESAME/tools.php?id=3"
isa = "0.8"
partof = "0.6"

for f_set in features_set_list:
    temp2 = open('temp2.txt', "w")
    for item in f_set:
        print >> temp2, item + '\t'
    temp2.close()

    values = {"tool_id": "3", "uploadedfile1": temp1, "uploadedfile2": temp2,
              "isA": isa, "partOf": partof, "email": "False",
              "emailAddress": "", "description": "", "submit": "Submit"}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    html = response.read()
    root = lxml.html.fromstring(html)
    score = root.cssselect("div.row span b")[0].text
    print score
My guess is that my error is in the values that I give as input, but I cannot find it. Any help would be appreciated!
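For comparison, here is a minimal sketch of sending the two files as a real multipart upload with requests instead of url-encoding the file objects; the field names are taken from the values dict above, the file names are placeholders, and whether the endpoint accepts a scripted upload this way is an assumption:

import requests
import lxml.html

url = "http://bioinformatics.clemson.edu/G-SESAME/tools.php?id=3"

# Open both GO-term files in binary mode so requests builds a multipart/form-data body.
with open('temp1.txt', 'rb') as f1, open('temp2.txt', 'rb') as f2:
    files = {"uploadedfile1": f1, "uploadedfile2": f2}
    data = {"tool_id": "3", "isA": "0.8", "partOf": "0.6",
            "email": "False", "emailAddress": "", "description": "",
            "submit": "Submit"}
    response = requests.post(url, data=data, files=files)

root = lxml.html.fromstring(response.text)
score = root.cssselect("div.row span b")[0].text
print score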
I am using Python to scrape US postal code population data from http://www.city-data.com, through this directory: http://www.city-data.com/zipDir.html. The specific pages I am trying to scrape are individual postal code pages with URLs like this: http://www.city-data.com/zips/01001.html. All of the individual zip code pages I need to access have this same URL format, so my script simply does the following for each postal_code in range:
Creates URL given postal code
Tries to get response from URL
If (2) succeeds, check the HTTP status code of that URL
If HTTP is 200, retrieves the HTML and scrapes the data into a list
If HTTP is not 200, pass and count error (not a valid postal code/URL)
If no response from URL because of error, pass that postal code and count error
At end of script, print counter variables and timestamp
The problem is that I run the script and it works fine for ~500 postal codes, then suddenly stops working and returns repeated timeout errors. My suspicion is that the site's server is limiting the page views coming from my IP address, preventing me from completing the amount of scraping that I need to do (all 100,000 potential postal codes).
My question is as follows: Is there a way to confuse the site's server, for example using a proxy of some kind, so that it will not limit my page views and I can scrape all of the data I need?
Thanks for the help! Here is the code:
##POSTAL CODE POPULATION SCRAPER##
import requests
import re
import datetime
def zip_population_scrape():
"""
This script will scrape population data for postal codes in range
from city-data.com.
"""
postal_code_data = [['zip','population']] #list for storing scraped data
#Counters for keeping track:
total_scraped = 0
total_invalid = 0
errors = 0
for postal_code in range(1001,5000):
#This if statement is necessary because the postal code can't start
#with 0 in order for the for statement to interate successfully
if postal_code <10000:
postal_code_string = str(0)+str(postal_code)
else:
postal_code_string = str(postal_code)
#all postal code URLs have the same format on this site
url = 'http://www.city-data.com/zips/' + postal_code_string + '.html'
#try to get current URL
try:
response = requests.get(url, timeout = 5)
http = response.status_code
#print current for logging purposes
print url +" - HTTP: " + str(http)
#if valid webpage:
if http == 200:
#save html as text
html = response.text
#extra print statement for status updates
print "HTML ready"
#try to find two substrings in HTML text
#add the substring in between them to list w/ postal code
try:
found = re.search('population in 2011:</b> (.*)<br>', html).group(1)
#add to # scraped counter
total_scraped +=1
postal_code_data.append([postal_code_string,found])
#print statement for logging
print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
#if substrings not found, try searching for others
#and doing the same as above
except AttributeError:
found = re.search('population in 2010:</b> (.*)<br>', html).group(1)
total_scraped +=1
postal_code_data.append([postal_code_string,found])
print postal_code_string + ": " + str(found) + ". Data scrape successful. " + str(total_scraped) + " total zips scraped."
#if http =404, zip is not valid. Add to counter and print log
elif http == 404:
total_invalid +=1
print postal_code_string + ": Not a valid zip code. " + str(total_invalid) + " total invalid zips."
#other http codes: add to error counter and print log
else:
errors +=1
print postal_code_string + ": HTTP Code Error. " + str(errors) + " total errors."
#if get url fails by connnection error, add to error count & pass
except requests.exceptions.ConnectionError:
errors +=1
print postal_code_string + ": Connection Error. " + str(errors) + " total errors."
pass
#if get url fails by timeout error, add to error count & pass
except requests.exceptions.Timeout:
errors +=1
print postal_code_string + ": Timeout Error. " + str(errors) + " total errors."
pass
#print final log/counter data, along with timestamp finished
now= datetime.datetime.now()
print now.strftime("%Y-%m-%d %H:%M")
print str(total_scraped) + " total zips scraped."
print str(total_invalid) + " total unavailable zips."
print str(errors) + " total errors."
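For reference, a minimal sketch of routing a request through a proxy and pausing between requests; the proxy address and the delay are placeholder assumptions, and whether this gets around the site's limits is not guaranteed:

import time
import requests

# Hypothetical proxy; replace with an HTTP proxy you actually control.
proxies = {'http': 'http://203.0.113.10:3128'}

url = 'http://www.city-data.com/zips/01001.html'
response = requests.get(url, timeout=5, proxies=proxies)
print response.status_code

# Sleeping between requests also reduces the chance of being rate-limited.
time.sleep(2)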
I have this simple Python code, which returns the content of a URL and stores the result in a JSON text file named "file", but it keeps returning an empty result.
What am I doing wrong here? It is just simple code, and I am so disappointed.
I have included all the imports needed: import facebook, import requests, and import json.
url ="https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=xx&__after_id=139433456868"
content = requests.get(url).json()
file = open("file.txt" , 'w')
file.write(json.dumps(content, indent=1))
file.close()
But it keeps returning an empty result. What am I missing here?
Here is the result:
"data": []
Any help, please?
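One thing worth checking before anything else is whether the request itself was rejected (for example, an expired token). This is just a debugging sketch; the URL is the one from the question with the token still redacted as xx:

import requests
import json

url = "https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token=xx&__after_id=139433456868"
content = requests.get(url).json()

# A failed Graph API call returns an "error" object instead of "data".
if "error" in content:
    print json.dumps(content["error"], indent=1)
else:
    print len(content.get("data", []))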
It's working fine:
import urllib2
accesstoken="CAACEdEose0cBACF6HpTDEuVEwVnjx1sHOJFS3ZBQZBsoWqKKyFl93xwZCxKysqsQMUgOZBLjJoMurSxinn96pgbdfSYbyS9Hh3pULdED8Tv255RgnsYmnlxvR7JZCN7X25zP6fRnRK0ZCNmChfLPejaltwM2JGtPGYBQwnmAL9tQBKBmbZAkGYCEQHAbUf7k1YZD"
urllib2.urlopen("https://graph.facebook.com/search?limit=5000&type=page&q=%26&access_token="+accesstoken+"&__after_id=139433456868").read()
I think you have not requested an access token before making the request.
How do you find the access token?
def getSecretToken(verification_code):
    token_url = ("https://graph.facebook.com/oauth/access_token?" +
                 "client_id=" + app_id +
                 "&redirect_uri=" + my_url +
                 "&client_secret=" + app_secret +
                 "&code=" + verification_code)
    response = requests.get(token_url).content
    params = {}
    result = response.split("&", 1)
    print result
    for p in result:
        (k, v) = p.split("=")
        params[k] = v
    return params['access_token']
How do you get that verification code?
verification_code = ""
if "code" in request.query:
    verification_code = request.query["code"]
if not verification_code:
    dialog_url = ("http://www.facebook.com/dialog/oauth?" +
                  "client_id=" + app_id +
                  "&redirect_uri=" + my_url +
                  "&scope=publish_stream")
    return "<script>top.location.href='" + dialog_url + "'</script>"
else:
    access_token = getSecretToken(verification_code)
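Assuming the OAuth flow above has run and access_token now holds a valid token, a short sketch of plugging it back into the search call from the question (the URL simply mirrors that one):

import requests

url = ("https://graph.facebook.com/search?limit=5000&type=page&q=%26"
       "&access_token=" + access_token + "&__after_id=139433456868")
print requests.get(url).json()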
I am parsing HTML from the following website: http://www.asusparts.eu/partfinder/Asus/All In One/E Series. I was just wondering if there is any way I could explore a parsed attribute in Python?
For example, the code below outputs the following:
datas = s.find(id='accordion')
a = datas.findAll('a')
for data in a:
    if(data.has_attr('onclick')):
        model_info.append(data['onclick'])
        print data
[OUTPUT]
<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>
These are the values I would like to retrieve:
nCategoryID = Bracket
nModelID = ET10B
family = E Series
As the page is rendered from AJAX, they are using a script source, resulting in the following URL from the script file:
url = 'http://json.zandparts.com/api/category/GetCategories/' + country + '/' + currency + '/' + nModelID + '/' + family + '/' + nCategoryID + '/' + brandName + '/' + null
How can I retrieve only the 3 values listed above?
[EDIT]
import string, urllib2, urlparse, csv, sys
from urllib import quote
from urlparse import urljoin
from bs4 import BeautifulSoup
from ast import literal_eval
changable_url = 'http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series'
page = urllib2.urlopen(changable_url)
base_url = 'http://www.asusparts.eu'
soup = BeautifulSoup(page)
#Array to hold all options
redirects = []
#Array to hold all data
model_info = []
print "FETCHING OPTIONS"
select = soup.find(id='myselectListModel')
#print select.get_text()
options = select.findAll('option')
for option in options:
    if(option.has_attr('redirectvalue')):
        redirects.append(option['redirectvalue'])

for r in redirects:
    rpage = urllib2.urlopen(urljoin(base_url, quote(r)))
    s = BeautifulSoup(rpage)
    #print s

    print "FETCHING MAIN TITLE"
    #Finding all the headings for each specific Model
    maintitle = s.find(id='puffBreadCrumbs')
    print maintitle.get_text()

    #Find entire HTML container holding all data, rendered by AJAX
    datas = s.find(id='accordion')
    #Find all 'a' tags inside data container
    a = datas.findAll('a')
    #Find all 'span' tags inside data container
    content = datas.findAll('span')

    print "FETCHING CATEGORY"
    #Find all 'a' tags which have an attribute of 'onclick'
    #Error: (doesn't display anything, can't seem to find 'onclick' attr)
    if(hasattr(a, 'onclick')):
        arguments = literal_eval('(' + a['onclick'].replace(', this', '').split('(', 1)[1])
        model_info.append(arguments)
        print arguments #arguments[1] + " " + arguments[3] + " " + arguments[4]

    print "FETCHING DATA"
    for complete in content:
        #Find all 'class' attributes inside 'span' tags
        if(complete.has_attr('class')):
            model_info.append(complete['class'])
            print complete.get_text()

    #Find all 'table data cells' inside table held in data container
    print "FETCHING IMAGES"
    img = s.find('td')
    #Find all 'img' tags held inside these 'td' cells and print out
    images = img.findAll('img')
    print images
I have added an Error comment above where the problem lies...
Similar to Martijn's answer, but this makes primitive use of pyparsing (i.e., it could be refined to recognise the function name and only take the quoted strings within the parentheses):
from bs4 import BeautifulSoup
from pyparsing import QuotedString
from itertools import chain
s = '''<a onclick="getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')">Bracket</a>'''
soup = BeautifulSoup(s)
for a in soup('a', onclick=True):
    print list(chain.from_iterable(QuotedString("'", unquoteResults=True).searchString(a['onclick'])))
    # ['Asus', 'Bracket', 'ET10B', '7138', 'E Series']
You could parse that as a Python literal if you remove the this argument from it and only take everything between the parentheses:
from ast import literal_eval
if data.has_attr('onclick'):
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    print arguments
We remove the this argument because it is not a valid Python literal and you don't want it anyway.
Demo:
>>> literal_eval('(' + "getProductsBasedOnCategoryID('Asus','Bracket','ET10B','7138', this, 'E Series')".replace(', this', '').split('(', 1)[1])
('Asus', 'Bracket', 'ET10B', '7138', 'E Series')
Now you have a Python tuple and can pick out any value you like.
You want the values at indices 1, 2 and 4, for example:
nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
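A hedged sketch of how this might slot into the loop from the question, yielding one tuple per anchor; replacing the hasattr check with findAll('a', onclick=True) is an assumption about the intent of that line:

for data in datas.findAll('a', onclick=True):
    # Strip the 'this' argument and evaluate the remaining call arguments as a tuple.
    arguments = literal_eval('(' + data['onclick'].replace(', this', '').split('(', 1)[1])
    model_info.append(arguments)
    nCategoryID, nModelID, family = arguments[1], arguments[2], arguments[4]
    print nCategoryID, nModelID, family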