Sending input values to a webpage using urllib2 in Python

I am trying to get the similarity score between two sets of GO terms. There is a webpage that does it, and I am trying to automate this calculation for several sets using a Python script.
So far my script generates two files that work on the webpage when uploaded manually, but I cannot figure out how to do the upload from the code.
import urllib
import urllib2
import lxml.html

temp1 = somefile
url = "http://bioinformatics.clemson.edu/G-SESAME/tools.php?id=3"
isa = "0.8"
partof = "0.6"

for f_set in features_set_list:
    temp2 = open('temp2.txt', "w")
    for item in f_set:
        print >> temp2, item + '\t'
    temp2.close()
    values = {"tool_id": "3",
              "uploadedfile1": temp1,
              "uploadedfile2": temp2,
              "isA": isa,
              "partOf": partof,
              "email": "False",
              "emailAddress": "",
              "description": "",
              "submit": "Submit"}
    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    html = response.read()
    # pull the similarity score out of the returned page
    root = lxml.html.fromstring(html)
    score = root.cssselect("div.row span b")[0].text
    print score
My guess is that the error is in the values I give as input, but I cannot find it. Any help would be appreciated!
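For reference, a minimal sketch of how a form with file uploads is usually submitted, assuming this page expects a standard multipart/form-data POST: urlencode() produces an application/x-www-form-urlencoded body and cannot carry file contents, which is a likely cause of the failure. The field names below are copied from the values dict above, the requests library is an alternative to urllib2 rather than part of the original script, and the file names are placeholders; whether the form accepts exactly this should be verified against the page's HTML.

import requests
import lxml.html

url = "http://bioinformatics.clemson.edu/G-SESAME/tools.php?id=3"
fields = {"tool_id": "3", "isA": "0.8", "partOf": "0.6",
          "email": "False", "emailAddress": "", "description": "",
          "submit": "Submit"}

# 'temp1.txt' / 'temp2.txt' stand in for the two files the script generates;
# opening them in binary mode lets requests build a multipart body from them
with open('temp1.txt', 'rb') as f1, open('temp2.txt', 'rb') as f2:
    files = {"uploadedfile1": f1, "uploadedfile2": f2}
    response = requests.post(url, data=fields, files=files)

root = lxml.html.fromstring(response.text)
score = root.cssselect("div.row span b")[0].text
print score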

Related

Read a column/list containing URLs, skipping blank cells, using Python

I have a column all_sheet_url in a Google Sheet which has some URL links in it.
I need to read them one by one and fetch data from them.
There are some blank/NA cells in the column which I want to skip.
I have tried the following code, but it does not manage to read only the URLs and skip any blanks in the column.
sheet_url = df['Links']
for line in sheet_url:
    #if line in sheet_url:
    try:
        url = line
        req = requests.get(url, stream=True)
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        # checking if it is an html page
        content_type = req.headers.get('content-type')
        if 'html' in content_type or 'application/xhtml+xml' in content_type:
            # reading the contents
            html = req.content
            # req.close()
            output = html
            print(output)
            continue
        else:
            print("\t{} is not an HTML file".format(url))
            #req.close()
    except Exception:
        return
Kindly suggest a method or corrections for the same. Thanks.
I also need to read a range of data from each URL and write it into the main Google Sheet.
I am able to read and write for a single URL at a time, but I want it to be dynamic, so that it reads and writes the range of data from the URLs continuously and automatically, using the index of each URL placed in the column.
This is the code I have tried:
wks = gc.open_by_url(url)
wks1 = gc.open_by_url(url1)
# Defining Range of cells to read
range1 = wks1.range('A5:A9')
range2 = wks1.range('A5:B9')
range3 = wks1.range('G11:L19')
range4 = wks1.range('B12:C13')
# Defining Range of cells to write
range1n = wks.range('CO43:CS43')
range2n = wks.range('CT43:DR43')
range3n = wks.range('DS43:FK43')
range4n = wks.range('FL43:FM43')
range_names = [['range1', 'range2', 'range3', 'range4']]
for range1_cell, range1n_cell in zip(range1, range1n):
    range1n_cell.value = range1_cell.value
wks.update_cells(range1n)
pandas has a dropna method to do exactly what you need.
Documentation: pandas.Series.dropna
In your case that could just be the same processing code with the first line being:
all_sheet_url = df['Workbook Link'].dropna()
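For context, a minimal sketch of the loop with dropna() applied, assuming the URLs live in the df['Links'] column used in the question:

import requests

# skip blank/NA cells up front so only real URLs are requested
sheet_url = df['Links'].dropna()

for url in sheet_url:
    try:
        req = requests.get(url, timeout=10)
        req.raise_for_status()
        content_type = req.headers.get('content-type', '')
        if 'html' in content_type:
            print(req.content)
        else:
            print("\t{} is not an HTML file".format(url))
    except requests.RequestException as exc:
        print("\tSkipping {}: {}".format(url, exc))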

Python 3.6 API while loop to json script not ending

I'm trying to create a loop over an API call that returns JSON, since each call is limited to 200 rows. With the code below, the loop doesn't seem to end even when I left it running for an hour or so. The maximum I'm looking to pull is about 200k rows from the API.
bookmark = ''
urlbase = 'https://..../?'
alldata = []
while True:
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    if len(data['rows']) < 200:
        break
Also, I'm looking to filter the loop to only output if json value 'pet.type' is "Puppies" or "Kittens." Haven't been able to figure out the syntax.
Any ideas?
Thanks
The break condition for your loop is incorrect. Notice it's checking len(data["rows"]), where data only includes rows from the most recent request.
Instead, you should be looking at the total number of rows you've collected so far: len(alldata).
import requests

bookmark = ''
urlbase = 'https://..../?'
alldata = []

while True:
    if len(bookmark) > 0:
        url = urlbase + 'bookmark=' + bookmark
    else:
        url = urlbase  # first request has no bookmark yet
    # keep the response so its JSON body can be read below
    response = requests.get(url, auth=('username', 'password'))
    data = response.json()
    alldata.extend(data['rows'])
    bookmark = data['bookmark']
    # Check `alldata` instead of `data["rows"]`,
    # and set the limit to 200k instead of 200.
    if len(alldata) >= 200000:
        break
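For the filtering part of the question, a hedged sketch, assuming each element of data['rows'] is a dict whose 'pet' entry is itself a dict with a 'type' key (adjust the keys to the actual JSON layout):

wanted = {'Puppies', 'Kittens'}
filtered = [row for row in data['rows']
            if row.get('pet', {}).get('type') in wanted]
alldata.extend(filtered)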

Python 3 and BeautifulSoup 4 searching a word on dictionary site and returning translations

I have recently followed this tutorial (http://blog.dispatched.ch/2009/03/15/webscraping-with-python-and-beautifulsoup/) on web scraping with Python and BeautifulSoup. Unfortunately, the tutorial was written for a 2.x version of Python, while I am using version 3.2.3 so as to focus on learning the version of the language that is still being developed. My program retrieves, opens, reads and enters the search term on the web page just fine (as far as I can tell), but it never enters the
for result in results
loop, so nothing is collected and printed. I think this may have to do with how I have assigned results, but I am unsure of how to fix it. Here's my code:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import string
import sys

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = {'user-agent': user_agent}
url = "http://www.dict.cc"
values = {'s': 'rainbow'}
data = urllib.parse.urlencode(values)
data = data.encode('utf-8')
request = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(request)
the_page = response.read()
pool = BeautifulSoup(the_page)
results = pool.find_all('td', attrs={'class': 'td7n1'})
source = ''
translations = []
for result in results:
    word = ''
    for tmp in result.find_all(text=True):
        word += " " + unicode(tmp).encode("utf-8")
    if source == '':
        source = word
    else:
        translations.append((source, word))
for translation in translations:
    print("%s => %s", translation[0], translation[1])
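Two Python 3 issues that would surface once the loop runs, plus a quick check on why it may never be entered. This is a sketch reusing the results variable from the code above; whether the 'td7n1' class is actually present in the page served to a script is not verified here:

print(len(results))  # 0 here means the 'td7n1' selector matched nothing

source = ''
translations = []
for result in results:
    # str() replaces Python 2's unicode(); no .encode() is needed just to print
    word = " ".join(result.find_all(text=True))
    if source == '':
        source = word
    else:
        translations.append((source, word))

for source_word, target_word in translations:
    # use %-interpolation; passing the format string and values as separate
    # arguments to print() just prints them side by side
    print("%s => %s" % (source_word, target_word))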

HTTPError: HTTP Error 400: Bad request urllib2

I am a beginner to Python. I am the developer of the Easy APIs Project (http://gcdc2013-easyapisproject.appspot.com) and was doing a Python implementation of the weather API using my project. Visit http://gcdc2013-easyapisproject.appspot.com/APIs_Doc.html to see the Weather API. Below is my implementation, but it returns an HTTPError: HTTP Error 400: Bad request error.
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + a + ' in celsius')
    htmlx = responsex.read()
    responsex.close()
    htmlx = html[1:]   # remove first {
    htmlx = html[:-1]  # remove last }
    htmlx = html.split('}{')  # split and put each result into an array
    return str(htmlx[1])

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + q)
html = response.read()
response.close()
html = html[1:]   # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3])
Please help me..
The query string should be quoted using urllib.quote or urllib.quote_plus:
import urllib
import urllib2

def celsius(a):
    responsex = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/unitconversion?q=' + urllib.quote(a + ' in celsius'))
    html = responsex.read()
    responsex.close()
    html = html[1:]   # remove first {
    html = html[:-1]  # remove last }
    html = html.split('}{')  # split and put each result into an array
    return html[0]

print "Enter a city name:",
q = raw_input()  # get word from user
response = urllib2.urlopen('http://gcdc2013-easyapisproject.appspot.com/weather?q=' + urllib.quote(q))
html = response.read()
print repr(html)
response.close()
html = html[1:]   # remove first {
html = html[:-1]  # remove last }
html = html.split('}{')  # split and put each result into an array
print "Today weather is " + html[1]
print "Temperature is " + html[3]
print "Temperature is " + celsius(html[3].split()[0])
In addition to that, I modified celsius to use html instead of htmlx. The original code mixed the use of html and htmlx.
I have found the answer. The query should be quoted with urllib2.quote(q).

python mechanize follow_link fails

I'm trying to access search results on the NCBI Images search page (http://www.ncbi.nlm.nih.gov/images) in a script. I want to feed it a search term, report on all of the results, and then move on to the next search term. To do this I need to get to results pages after the first page, so I'm trying to use python mechanize to do it:
import mechanize

browser = mechanize.Browser()
page1 = browser.open('http://www.ncbi.nlm.nih.gov/images?term=drug')
a = browser.links(text_regex='Next')
nextlink = a.next()
page2 = browser.follow_link(nextlink)
This just gives me back the first page of search results again (in variable page2). What am I doing wrong, and how can I get to that second page and beyond?
Unfortunately that page uses Javascript to POST 2459 bytes of form variables to the server, just to navigate to a subsequent page. Here are a few of the variables (I count 38 vars in total):
EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.Term=drug
EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.CurrDb=images
EntrezSystem2.PEntrez.ImagesDb.Images_ResultsPanel.Entrez_Pager.CurrPage=2
You'll need to construct a POST request to the server containing some or all of these variables. Luckily if you get it working for page 2 you can simply increment CurrPage and send another POST to get each subsequent page of results (no need to extract links).
Update - That site is a total pain-in-the-ass, but here is a POST-based scrape of the 2-N pages. Set MAX_PAGE to the highest page number + 1. The script will produce files like file_000003.html.
Note: Before you use it, you need to replace POSTDATA with the contents of this paste blob (it expires in 1 month). It's just the body of a POST request as captured by Firebug, which I use to seed the correct params:
import cookielib
import json
import mechanize
import sys
import urllib
import urlparse

MAX_PAGE = 6
TERM = 'drug'
DEBUG = False

base_url = 'http://www.ncbi.nlm.nih.gov/images?term=' + TERM
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.set_handle_referer(True)
browser.set_debug_http(DEBUG)
browser.set_debug_responses(DEBUG)
cjar = cookielib.CookieJar()
browser.set_cookiejar(cjar)

# make first GET request. this will populate the cookie
res = browser.open(base_url)

def write(num, data):
    with open('file_%06d.html' % num, 'wb') as out:
        out.write(data)

def encode(kvs):
    res = []
    for key, vals in kvs.iteritems():
        if isinstance(vals, list):
            for v in vals:
                res.append('%s=%s' % (key, urllib.quote(v)))
        else:
            res.append('%s=%s' % (key, urllib.quote(vals)))
    return '&'.join(res)

write(1, res.read())

# set this var equal to the contents of this: http://pastebin.com/UfejW3G0
POSTDATA = '''<post data>'''

# parse the embedded json vars into POST parameters
PREFIX1 = 'EntrezSystem2.PEntrez.ImagesDb.'
PREFIX2 = 'EntrezSystem2.PEntrez.DbConnector.'
params = dict((k, v[0]) for k, v in urlparse.parse_qs(POSTDATA).iteritems())
base_url = 'http://www.ncbi.nlm.nih.gov/images'

for page in range(2, MAX_PAGE):
    params[PREFIX1 + 'Images_ResultsPanel.Entrez_Pager.CurrPage'] = str(page)
    params[PREFIX1 + 'Images_ResultsPanel.Entrez_Pager.cPage'] = [str(page - 1)] * 2
    data = encode(params)
    req = mechanize.Request(base_url, data)
    cjar.add_cookie_header(req)
    req.add_header('Content-Type', 'application/x-www-form-urlencoded')
    req.add_header('Referer', base_url)
    res = browser.open(req)
    write(page, res.read())
