I have a small script that uses mechanize to manipulate a webform. Here is a screenshot of the form (without the submit button at the bottom. Don't worry about that.)
The code:
import re
import mechanize
bs = mechanize.Browser()
server = raw_input("IP to retry: ")
bs.open("http://"+server+"/avicapture.html")
assert bs.viewing_html()
bs.select_form(name="avistatus_form")
form = bs.form
bs.find_control("AVI_STATUS_ACTION").items[1].selected=True
bs.find_control("avistatuscheck0").items[0].selected=True
bs.find_control("avistatuscheck1").items[0].selected=True
bs.find_control("avistatuscheck2").items[0].selected=True
bs.find_control("avistatuscheck3").items[0].selected=True
bs.find_control("avistatuscheck4").items[0].selected=True
bs.find_control("avistatuscheck5").items[0].selected=True
print "Sending retry signal."
bs.submit()
print server+" Retried!"
As it is, it will check all six boxes and submit the form with the dropdown option (AVI_STATUS_ACTION) as [1].
How do I go about having it determine which row (correlating to the proper avistatuscheck# control (checkbox)) is the most recent, and to only submit the form with that checkbox checked? As more files are transferred, they accumulate, and I don't need to resend them all. Just the most recent.
I know a little about regex; enough to use urllib2 to load an html page into a string and grab the percentage amount from the current 'In Progress' transfer, but I'm a bit lost on how to determine the most recent transfer corresponding to the correct control (checkbox.)
The source code contains the data in a nicer format than the actual HTML, in comment form:
</td>
<!--$FREETEXT|AVI_STATUS_START_TIME0||XXXXXXXXXXXXXXXXXXXXXXXXX$-->
<td>
2014/07/11 12:00:03
</td>
<!--$FREETEXT|AVI_STATUS_END_TIME0||XXXXXXXXXXXXXXXXXXXXXXXXX$-->
<td>
2014/07/11 14:00:00
</td>
<!--$FREETEXT|AVI_STATUS_FILE_SIZE0||XXXXXXXXXXXXX$-->
<td>
You can use a regular expression to parse this:
import re
import mechanize
bs = mechanize.Browser()
server = raw_input("IP to retry: ")
response = bs.open("http://" + server + "/avicapture.html")
assert bs.viewing_html()
bs.select_form(name="avistatus_form")
matches = re.findall(
r'(?s)<!--\$FREETEXT\|AVI_STATUS_END_TIME([0-9]+).*?<td>\s*([0-9/]+ [0-9:]+)\s*\n',
response.read())
latest_id, latest_time = max(matches, key=lambda m: m[1])
form = bs.form
bs.find_control("AVI_STATUS_ACTION").items[1].selected = True
bs.find_control("avistatuscheck" + latest_id).items[0].selected = True
print "Sending retry signal."
bs.submit()
print server+" Retried!"
Related
I am trying to develop a script with python to web scraping some information on a specific website for learning purposes.
I went over a lot of different tutorials and posts, trying to gather some insights from them, they are very useful but still didn't help me to find a way to log in the website and do searches with different keywords.
I tried to use different APIs, such as requests and urllib, maybe I didn't find the right way to solve it.
The steps lists as follow:
login information set up
Send login information to the website and get response for future use
keywords setup
import header
set up cookiejar
from login response, do the search
After I tried, it will work randomly, and
here is the code:
import getpass
# marvin
# date:2018/2/7
# login stage preparation
def login_values():
login="https://www.****.com/login"
username = input("Please insert your username: ")
password = getpass.getpass("Please type in your password: ")
host="www.****.com"
#store login screts
data = {
"username": username,
"password": password,
}
return login,host,data
The following is for getting the HTML file from a website
import requests
import random
import http.cookiejar
import socket
# Set up web scraping function to output the html text file
def webscrape(login_url,host_url,login_data,target_url):
#static values preparation
##import header
user_agents = [
***
]
agent = random.choice(user_agents)
headers={'User-agent':agent,
'Accept':'*/*',
'Accept-Language':'en-US,en;q=0.9;zh-cmn-Hans',
'Host':host_url,
'charset':'utf-8',
}
##set up cookie jar
cj = http.cookiejar.CookieJar()
#
# get the html file
socket.setdefaulttimeout(20)
s=requests.Session()
req=s.post(login_url, data=login_data)
res = s.get(target_url, cookies=cj,headers=headers)
html=res.text
return html
Here is the code to get each links from html:
from bs4 import BeautifulSoup
#set up html parsing function for parsing all the list links
def getlist(keyword,loginurl,hosturl,valuesurl,html_lists):
page=1
pagenum=10# set up maximum page num
links=[]
soup=BeautifulSoup(html_lists,"lxml")
try:
for li in soup.find("div",class_="search_pager human_pager in-block").ul.find_all('li'):
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
page+=1
if page<=pagenum:
try:
nexturl=soup.find('div',class_='search_pager human_pager in-block').ul.find('li',class_='pagination-next ng-scope ').a['href'] #next page
except AttributeError:
print("{}'s links are all stored!".format(keyword))
return links
else:
chs_html=webscrape(loginurl,hosturl,valuesurl,nexturl)
soup=BeautifulSoup(chs_html,"lxml")
except AttributeError:
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
print("There is only one page")
return links
The test code is:
keyword="****"
myurl="https://www.****.com/search/os2?key={}".format(keyword)
chs_html=webscrape(login,host,values,myurl)
chs_links=getlist(keyword,login,host,values,chs_html)
targethtml=webscrape(login,host,values,chs_links[1])
There are total 22 links and one page containing 19 links, so it is supposed to have more than one page, if the result "There is only one page" shown up, it indicates a failure.
Problems:
The login_values function is to secure my login information by combining all functions to a final function, but apparently, the username and password are still really easy to show just by print() command.
This the main problem!! Like I mentioned before, this method works randomly. By the way, what I mean not working, it is that the HTML file is only the login page instead of the searching result. I want to get a better control to make it work most of the time. I checked user-agents by print agent every time to see if they are relevant, and it is not! I cleared cookies with suspicious to full storage memory, and it is not.
There are sometimes I facing max trial error or OS error, I guess it is the error from the server I was trying to reach, is there a way I can set up a wait timer for me to prevent these errors from happening?
import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script, I send this request to website, but the result looks like I did nothing, web service didn't receive my request. This method used to be fine with other website, maybe this page with a pop window to ask cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
you can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to access the right page
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
this will submit a job with your input data, it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then generate another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data = q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]
#check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
sys.exit(1)
#it might take some time for the job to finish
time.sleep(10)
#download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))
#prints the full response
#print(r.text)
#isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[#id="alignmentContent"]/text()')[0]
print(alignment)
I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.requests in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
But what is returned to me and saved in 'my_html_file.html' is the original page containing
the unaltered form without any indication that my form data was recognized, i.e. I get this page in response: qqqhttp://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I made this request without the
data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the
documentation and unsuccessfuly searching the web for an answer to this problem, I
moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3
issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the
same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
qqqhttp://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as
the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.
My python level is Novice. I have never written a web scraper or crawler. I have written a python code to connect to an api and extract the data that I want. But for some the extracted data I want to get the gender of the author. I found this web site http://bookblog.net/gender/genie.php but downside is there isn't an api available. I was wondering how to write a python to submit data to the form in the page and extract the return data. It would be a great help if I could get some guidance on this.
This is the form dom:
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction
<input type="radio" value="nonfiction" name="genre">
nonfiction
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>
results page dom:
<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>
No need to use mechanize, just send the correct form data in a POST request.
Also, using regular expression to parse HTML is a bad idea. You would be better off using a HTML parser like lxml.html.
import requests
import lxml.html as lh
def gender_genie(text, genre):
url = 'http://bookblog.net/gender/analysis.php'
caption = 'The Gender Genie thinks the author of this passage is:'
form_data = {
'text': text,
'genre': genre,
'submit': 'submit',
}
response = requests.post(url, data=form_data)
tree = lh.document_fromstring(response.content)
return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()
if __name__ == '__main__':
print gender_genie('I have a beard!', 'blog')
You can use mechanize to submit and retrieve content, and the re module for getting what you want. For example, the script below does it for the text of your own question:
import re
from mechanize import Browser
text = """
My python level is Novice. I have never written a web scraper
or crawler. I have written a python code to connect to an api and
extract the data that I want. But for some the extracted data I want to
get the gender of the author. I found this web site
http://bookblog.net/gender/genie.php but downside is there isn't an api
available. I was wondering how to write a python to submit data to the
form in the page and extract the return data. It would be a great help
if I could get some guidance on this."""
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']
response = browser.submit()
content = response.read()
result = re.findall(
r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!', content)
print result[0]
What does it do? It creates a mechanize.Browser and goes to the given URL:
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
Then it selects the form (since there is only one form to be filled, it will be the first):
browser.select_form(nr=0)
Also, it sets the entries of the form...
browser['text'] = text
browser['genre'] = ['nonfiction']
... and submit it:
response = browser.submit()
Now, we get the result:
content = response.read()
We know that the result is in the form:
<b>The Gender Genie thinks the author of this passage is:</b> male!
So we create a regex for matching and use re.findall():
result = re.findall(
r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
content)
Now the result is available for your use:
print result[0]
You can use mechanize, see examples for details.
from mechanize import ParseResponse, urlopen, urljoin
uri = "http://bookblog.net"
response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
#print form
form['text'] = 'cheese'
form['genre'] = ['fiction']
print urlopen(form.click()).read()