I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.request in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
But what is returned to me and saved in 'my_html_file.html' is the original page containing
the unaltered form without any indication that my form data was recognized, i.e. I get this page in response: http://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I made this request without the
data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the
documentation and unsuccessfully searching the web for an answer to this problem, I
moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3
issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the
same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
http://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as
the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).
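How a relative action resolves against the page URL can be reproduced with urllib.parse.urljoin. This is only a minimal sketch using the URLs from the question; the variable names are my own:

```python
from urllib.parse import urljoin, urlencode

# URL of the page that contains the form, and the form's action attribute
page_url = 'http://www.w3schools.com/html/html_forms.asp'
action = 'html_form_action.asp'

# A relative action is resolved against the page URL, like any other link
target_url = urljoin(page_url, action)

# An action that starts with a scheme is used as-is (example.com is a placeholder)
absolute_target = urljoin(page_url, 'https://example.com/submit')

# For method="get", the URL-encoded fields go after a question mark
get_url = target_url + '?' + urlencode({'user': 'ThisIsMyUserName'})
print(get_url)
```

Resolving the URL this way avoids hard-coding the target when the same script has to handle several pages with differently located forms.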
Related
I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls song lyrics in Japanese from one website, and makes post requests to another website that basically annotates the lyrics with extra information.
When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.
This is a minimal script that doesn't exhibit the issue:
#!/usr/bin/env python2
from bs4 import BeautifulSoup
import requests
import mechanize
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)
    browser.form['text'] = raw_lyrics
    request = browser.submit()
    # My actual script does more stuff at this point, but this snippet doesn't need it
    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics
if __name__ == '__main__':
    main()
The truncated output is:
扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]
This is a minimal script that exhibits the issue:
#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests
def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()
    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()
    print(annotated_lyrics)
if __name__ == '__main__':
    main()
whose truncated output is:
IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]
It's worth noting that if I just try to get the HTML of the second request, like so:
# Use second website to annotate lyrics with furigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')
An embedded null character error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the POST request. In the current example, only one character can be passed.
However, with
url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'
I can pass up to 51 characters, like so:
data = {'text': raw_lyrics[0:51], 'state': 'output'}
before triggering the embedded null character error.
I've tried using urllib instead of requests, decoding and encoding to utf-8 the resulting HTML of the post request, or the data passed as an argument to this request. I've also checked that the encoding of the website is utf-8, which matches the encoding of the post requests:
r = requests.post(url, data=data)
print(r.encoding)
prints utf-8.
I think the problem has to do with how Python 3 is more strict in how it treats strings vs bytes, but I've been unable to pinpoint the exact cause.
While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong, in what is the code doing that results in failure.
I'm able to get the lyrics properly with this code in python3.x:
url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())
Printing (truncated)
>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'
A few things strike me as odd there, notably the \r\n (windows line ending) and \u3000 (IDEOGRAPHIC SPACE) but that's probably not the problem
The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is the form is using multipart instead of urlencoded form data. (signified by enctype="multipart/form-data")
Sending multipart form data is a little bit strange in requests, I had to poke around a bit and eventually found this which helps show how to format the multipart data in a way that the backing server understands. To do this you have to abuse files but have a "None" filename. "for humans" hah!
url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})
And the text is not mangled now!
>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'
(Note that this code should work in either python2 or python3)
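To make the difference between the two encodings concrete, here is a rough standard-library sketch of what each request body looks like on the wire. The encode_multipart helper and the fixed boundary string are my own illustration, not what requests does internally; in practice requests builds the multipart body for you when you pass files=.

```python
import uuid
from urllib.parse import urlencode


def encode_multipart(fields, boundary=None):
    """Encode a dict of text fields as a multipart/form-data body.

    Hypothetical helper for illustration; returns (content_type, body_bytes).
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="{}"'.format(name))
        lines.append('')  # blank line separates part headers from the value
        lines.append(value)
    lines.append('--' + boundary + '--')  # closing delimiter
    lines.append('')
    body = '\r\n'.join(lines).encode('utf-8')
    return 'multipart/form-data; boundary=' + boundary, body


fields = {'text': '扉開けば', 'state': 'output'}

# application/x-www-form-urlencoded: non-ASCII text is percent-escaped
urlencoded_body = urlencode(fields).encode('ascii')
print(urlencoded_body)

# multipart/form-data: each field is a separate part carrying raw UTF-8 bytes
content_type, multipart_body = encode_multipart(fields, boundary='demo-boundary')
print(content_type)
print(multipart_body.decode('utf-8'))
```

A server that only expects the multipart layout may misread the percent-escaped form, which would be consistent with the mangled output in the question.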
I'm trying to scrape a site. When I run the following code without region_id=[any number from 1 to 32], I get a [500] response. If I set region_id=1, I get only the first page by default (in the URL it is pagina=&), and there are up to 500 pages. Is there a command or parameter for retrieving every page (every possible value of pagina=) while avoiding for loops?
import requests
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre="
resp = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
data = resp.json()
Even without a for loop, you are still going to need iteration. You could do it with recursion or map as I've done below, but the iteration is still there. This solution has the advantage that everything is a generator, so only when you ask for a page's json from all_data will url be formatted, the request made, checked and converted to json. I added a filter to make sure you got a valid response before trying to get the json out. It still makes every request sequentially, but you could replace map with a parallel implementation quite easily.
import requests
from itertools import product, starmap
from functools import partial
def is_valid_resp(resp):
    return resp.status_code == requests.codes.ok
def get_json(resp):
    return resp.json()
# There's a .format hiding on the end of this really long url,
# with {} in appropriate places
url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format
regions = range(1, 33)
pages = range(1, 501)
urls = starmap(url, product(regions, pages))
moz_get = partial(requests.get, headers={'User-Agent':'Mozilla/5.0'})
responses = map(moz_get, urls)
valid_responses = filter(is_valid_resp, responses)
all_data = map(get_json, valid_responses)
# all_data is a generator that will give you each page's json.
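One possible parallel replacement for map is ThreadPoolExecutor.map from the standard library. The sketch below is only an illustration: fetch_all and fake_fetch are names I made up, and fake_fetch stands in for the real requests.get call so the example runs without touching the network. Unlike the lazy version above, it collects all results eagerly.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product, starmap

url = "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=%E2%9C%93&grupo_id=Plantas&region_id={}&parent_id=&pagina={}&nombre=".format


def fetch_all(fetch, urls, max_workers=8):
    """Apply fetch to every URL in a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(u):
    # Stand-in for requests.get (an assumption, so the sketch runs offline):
    # it just returns the page number embedded in the URL.
    return u.split('pagina=')[1].split('&')[0]


# Small ranges for the demonstration: 2 regions x 3 pages
urls = list(starmap(url, product(range(1, 3), range(1, 4))))
pages = fetch_all(fake_fetch, urls)
print(pages)
```

With the real site you would pass moz_get instead of fake_fetch, and probably keep max_workers small to avoid hammering the server.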
import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script. I send this request to the website, but the result looks as if I did nothing: the web service didn't receive my request. This method used to work fine with other websites. Could it be because this page shows a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
You can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to post to the right URL:
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
This submits a job with your input data; it doesn't return the result directly. To check the results, you need to extract the job ID from the previous response and then make another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programmatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data = q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]
#check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)
#it might take some time for the job to finish
time.sleep(10)
#download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))
#prints the full response
#print(r.text)
#isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)
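A fixed sleep can be too short for a slow job or needlessly long for a fast one. A polling loop is one alternative; the following is a generic sketch, not something taken from the service's documentation. poll_until and the check callable are hypothetical: check would wrap the result request and return None while the job is still running, and how to detect that on the real site is an assumption you would have to verify.

```python
import time


def poll_until(check, interval=5.0, timeout=300.0):
    """Call check() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('job did not finish within {} seconds'.format(timeout))
        time.sleep(interval)


# Offline demonstration: the job "finishes" on the third check
attempts = iter([None, None, 'alignment text'])
result = poll_until(lambda: next(attempts), interval=0.01, timeout=1.0)
print(result)
```

Keeping the interval at several seconds is kinder to the server than polling in a tight loop.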
I have a colleague whose task is to submit gene sequences of Hepatitis C viruses from patient samples into a request form on a specific website, which then identifies mutations that provide information about potential drug resistance.
This is very cumbersome and takes days.
My thought is to automate this with a Python script using urllib2 (I cannot use mechanize: I have to develop on Mac OS and, for reasons I don't understand, neither Python setup.py install nor pip mechanize install works, so I am bound to urllib2).
My first try was to access the respective website and submit a sample gene sequence. (On the original website you simply paste the sequence into an entry field named "or paste in" and then press "go".)
On the next page, you will get the result and I want to read out the mutations via regular expressions.
My first try:
import urllib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {'or paste in:': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC'}
params = urllib.urlencode(form_data)
response = urllib2.urlopen(url, params)
data = response.read()
print data
What I get from "data" is the source code from http://hcv.geno2pheno.org/index.php and not from the following result page.
Therefore, I have two questions:
1) How can I be sure that my sequence was pasted into the entry field "or paste in:" properly?
2) How do I access the source code of the result page so I can apply regular expressions?
There are a couple of things going wrong here. First, you need more parameters in your form_data dict. Just because you only manually fill in one field doesn't mean that's the only parameter the server needs to complete your request. I've included a form_data dict that worked for me below. The main key you're concerned with is 'v3seq'. This is the sequence you want to "paste in".
Then, when you're requesting the page, you need to use a Request object and read the response of that request. Looks like this:
import urllib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {
'v3seq': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC',
'H77Switch': '1',
'ignore_sgtSwitch': '1',
'alignwidth': '3',
'action': '1',
'go': 'Go',
'viewResults': '1',
'viewResSec': 'Prediction'
}
data = urllib.urlencode(form_data)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html_data = response.read()
You can then scrape the data from the response and apply your regular expressions. If you're able to get your pip working, I would also suggest taking a look at BeautifulSoup - it's an excellent library for scraping data from html.
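To find out which parameters a form actually sends, including hidden ones, you can list the named fields in the page source. Below is a small sketch using only the standard library. Note that it is written for Python 3 (html.parser), unlike the urllib2 code above, and that the sample HTML is a made-up form loosely modeled on the field names in my form_data dict, not the real page markup.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name and default value of every named form control."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag not in ('input', 'textarea', 'select'):
            return
        attrs = dict(attrs)
        name = attrs.get('name')
        if name:
            # Controls without a value attribute default to an empty string
            self.fields[name] = attrs.get('value') or ''


# Hypothetical markup for illustration only
sample_html = """
<form action="index.php" method="post">
  <textarea name="v3seq"></textarea>
  <input type="hidden" name="action" value="1">
  <input type="submit" name="go" value="Go">
</form>
"""

collector = FormFieldCollector()
collector.feed(sample_html)
print(collector.fields)
```

The keys it prints are the names the server expects, which is how you would discover that the visible label "or paste in:" is not the field name.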
I'm participating in a hash-breaking contest and I'm trying to automate posting strings to an HTML form and getting the hash score back. So far I've managed to get SOMETHING posted to the URL, but it's not the exact string I'm expecting, and thus the value returned for the hash is way off from the one obtained by just typing in the string manually.
import urllib.parse, urllib.request
url = "http://almamater.xkcd.com/?edu=una.edu"
data = "test".encode("ascii")
header = {"Content-Type":"application/octet-stream"}
req = urllib.request.Request(url, data, header)
f = urllib.request.urlopen(req)
print(f.read())
#parse f to pull out hash
I obtain the following hash from the site:
0fff9563bb3279289227ac77d319b6fff8d7e9f09da1247b72a0a265cd6d2a62645ad547ed8193db48cff847c06494a03f55666d3b47eb4c20456c9373c86297d630d5578ebd34cb40991578f9f52b18003efa35d3da6553ff35db91b81ab890bec1b189b7f52cb2a783ebb7d823d725b0b4a71f6824e88f68f982eefc6d19c6
This differs considerably from what I expected, which is what you get if you type in "test" (no quotes) into the form:
e21091dbb0d61bc93db4d1f278a04fe1a51165fb7262c7da31f886ae09ff3e04c41483c500db2792c59742958d8f7f39fe4f4f2cdc7940b7b25e3289b89d344e06f76305b9de525933b5df5dae2a37388f82cf76374fe363587acfb49b9d2c8fc131ef4a32c762be083b07330989b298d60e312f56a6b8a4c0f53c9b59864fb7
Obviously the code isn't doing what I'm expecting it to do. Any tips?
When you submit your form data, it also includes the field name, so when you submit "test" the data submitted actually looks like "hashable=test". Try changing your data like this:
data = "hashable=test".encode("ascii")
or alternatively:
data = urllib.parse.urlencode({'hashable': 'test'}).encode("ascii")
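The urlencode variant is the safer of the two once the candidate strings contain characters that have a special meaning in form data. Here is a short sketch; form_body is a hypothetical helper name, and 'hashable' is the field name mentioned above:

```python
from urllib.parse import urlencode


def form_body(value, field='hashable'):
    """Build a URL-encoded POST body (as bytes) for a single form field."""
    return urlencode({field: value}).encode('ascii')


plain = form_body('test')
tricky = form_body('a b&c=d')  # space, & and = must be escaped
print(plain)
print(tricky)
```

Without the escaping, a string such as 'a b&c=d' would be split into several fields by the server, and the hash would again be computed on something other than what you intended.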