My Python level is novice. I have never written a web scraper or crawler. I have written Python code to connect to an API and extract the data that I want. But for some of the extracted data I want to get the gender of the author. I found this web site http://bookblog.net/gender/genie.php, but the downside is that there isn't an API available. I was wondering how to write a Python script to submit data to the form on that page and extract the returned data. It would be a great help if I could get some guidance on this.
This is the form DOM:
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction
<input type="radio" value="nonfiction" name="genre">
nonfiction
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>
Results page DOM:
<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>
No need to use mechanize; just send the correct form data in a POST request.
Also, using regular expressions to parse HTML is a bad idea. You would be better off using an HTML parser like lxml.html.
import requests
import lxml.html as lh

def gender_genie(text, genre):
    # POST the form fields straight to the endpoint named in the form's action.
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'
    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }
    response = requests.post(url, data=form_data)
    tree = lh.document_fromstring(response.content)
    # $caption is an XPath variable bound by the keyword argument below;
    # .tail is the text that follows the matched <b> element.
    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()

if __name__ == '__main__':
    print gender_genie('I have a beard!', 'blog')
You can use mechanize to submit the form and retrieve the content, and the re module to extract what you want. For example, the script below does it for the text of your own question:
import re
from mechanize import Browser
text = """
My python level is Novice. I have never written a web scraper
or crawler. I have written a python code to connect to an api and
extract the data that I want. But for some the extracted data I want to
get the gender of the author. I found this web site
http://bookblog.net/gender/genie.php but downside is there isn't an api
available. I was wondering how to write a python to submit data to the
form in the page and extract the return data. It would be a great help
if I could get some guidance on this."""
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']
response = browser.submit()
content = response.read()
result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)
print result[0]
What does it do? It creates a mechanize.Browser and goes to the given URL:
browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")
Then it selects the form (since there is only one form to be filled, it will be the first):
browser.select_form(nr=0)
Also, it sets the entries of the form...
browser['text'] = text
browser['genre'] = ['nonfiction']
... and submits it:
response = browser.submit()
Now, we get the result:
content = response.read()
We know that the result is in the form:
<b>The Gender Genie thinks the author of this passage is:</b> male!
So we create a regex for matching and use re.findall():
result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)
Now the result is available for your use:
print result[0]
You can use mechanize; see its examples for details.
from mechanize import ParseResponse, urlopen, urljoin
uri = "http://bookblog.net"
response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]
#print form
form['text'] = 'cheese'
form['genre'] = ['fiction']
print urlopen(form.click()).read()
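Note that this prints the whole results page. Continuing from the snippet above, you could reuse the caption regex from the earlier answer to pull out just the verdict; a minimal sketch:
import re
content = urlopen(form.click()).read()
match = re.search(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)
if match:
    print match.group(1)  # e.g. 'male'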
As in the title, I would like to crawl the comments on the website under the "Project Activity" section: https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false
However, what I don't understand is that the comment text can be found neither in the plain HTML nor in the responses to the XHR calls.
That's the end of my knowledge; I have no idea what to do beyond those two tricks, and I am a bit lost as to where exactly those texts come from and how I can crawl them. Can someone enlighten me on that?
Many thanks!
You can use this script to load the comments from the external URL:
import re
import json
import requests

url = 'https://www.donorschoose.org/project/social-distancing-in-kindergarten/5025093/?context=false'
comments_url = 'https://cdn.donorschoose.net/dwr/jsonp/ProposalMessageWebService/getProposalMessagesByProposalId?callback=projectTimelineCallback&param0={id}&context=false'

# extract the project id from the page URL and request the JSONP endpoint:
id_ = re.search(r'/(\d+)/', url).group(1)
text = requests.get(comments_url.format(id=id_)).text

# strip the JSONP callback wrapper, then replace the JavaScript
# new Date(...) constructors with plain numbers so json.loads can parse it:
text = re.search(r'\((.*)\)', text).group(1)
data = json.loads(re.sub(r'new Date\((\d+)\)', r'\1', text))

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

# print some info to screen:
for t in data['data']['threads']:
    print(t['original']['author']['firstName'])
    print(t['original']['message'])
    print('-' * 80)
Prints:
Stephanie
purchased the <span>resources</span> for Ms. Carway's classroom and notified the school principal of delivery
--------------------------------------------------------------------------------
Maree
<img alt="Teacher Mail" src="https://cdn.donorschoose.net/images/project/posted_mail.gif"><span>Thank You Letter</span> posted!
--------------------------------------------------------------------------------
Maree
<strong class='good-news'>Good news: Project fully funded!</strong>
--------------------------------------------------------------------------------
...and so on.
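The two regex passes are needed because the endpoint returns JSONP (JSON wrapped in a JavaScript callback) rather than plain JSON, and the payload embeds new Date(...) constructors that json.loads cannot parse. A toy illustration of the same transformation, using a made-up payload in the shape of the real response:
import re
import json

sample = 'projectTimelineCallback({"data": {"posted": new Date(1588000000000)}})'
inner = re.search(r'\((.*)\)', sample).group(1)      # drop the callback wrapper
clean = re.sub(r'new Date\((\d+)\)', r'\1', inner)   # keep only the timestamp
print(json.loads(clean))                             # {'data': {'posted': 1588000000000}}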
Website link: https://www.cdw.com/search/software/?b=mic&w=f&pcurrent=1
I have to fetch "MS MPSAD SYS CTR SRV CNFG MGR LSA" from the pop-up which appears after clicking on "add to cart".
The basic question is how to deal with pop-up content when clicking generates the pop-up, which comes from the following tag:
<form action="/cart/addtocart/" class="addToCartForm cart-top-addtocart-by-edc" data-addtocartasync="/cart/addtocartasync/" method="post">
<input name="__RequestVerificationToken" type="hidden" value="sB_lUL7jwTaTOwC5W0fGHnj_xjPtIKzkJPeHuQejajSQq06Sz2hlY8i-LMlfrnQ0GwxJeGwc7tQ6SIu2HGhQD821fB41"/>
<input name="ProductContext.ProductCode" type="hidden" value="3667016"/>
How do I get the hidden input's value using Python?
I have tried getting values from inputs of hidden type, but I am not sure how I can proceed with that.
Here I am using inspect element for the URL and the form data.
I am using Postman to get the response, but the response is null.
import json
import requests

def api_call(product_code, pricekey, token):
    # POST the same fields the page's add-to-cart form sends.
    url = "https://www.cdw.com/cart/addtocartasync/"
    payload = {'CartItems[0][Quantity]': '1',
               'CartItems[0][Product][ProductCode]': int(product_code),
               'CartItems[0][SelectedPrice][PriceKey]': str(pricekey),
               '__RequestVerificationToken': str(token)}
    response = requests.post(url, data=payload)
    try:
        res = json.loads(response.text)
        return res['cartItems'][0]['product']['name']
    except (ValueError, KeyError, IndexError):
        return None
We can hit the API with a POST request carrying the relevant info from the HTML and get the desired info.
As you can see in the inspector screenshot below, we can get all of that info from it.
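To avoid copying the token by hand, you could scrape the hidden inputs from the page first. A rough sketch, assuming the add-to-cart form from the question is present in the fetched HTML (the selectors are based only on the snippet above and may need adjusting against the live page):
import requests
import lxml.html as lh

page_url = 'https://www.cdw.com/search/software/?b=mic&w=f&pcurrent=1'
tree = lh.document_fromstring(requests.get(page_url).content)

# Take the first add-to-cart form and read its hidden inputs.
form = tree.xpath("//form[contains(@class, 'addToCartForm')]")[0]
token = form.xpath(".//input[@name='__RequestVerificationToken']/@value")[0]
product_code = form.xpath(".//input[@name='ProductContext.ProductCode']/@value")[0]

# These values feed api_call() above; the PriceKey would still need to be
# located in the page markup as well.
print(token)
print(product_code)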
I have a small script that uses mechanize to manipulate a web form. Here is a screenshot of the form (without the submit button at the bottom; don't worry about that).
The code:
import re
import mechanize
bs = mechanize.Browser()
server = raw_input("IP to retry: ")
bs.open("http://"+server+"/avicapture.html")
assert bs.viewing_html()
bs.select_form(name="avistatus_form")
form = bs.form
bs.find_control("AVI_STATUS_ACTION").items[1].selected=True
bs.find_control("avistatuscheck0").items[0].selected=True
bs.find_control("avistatuscheck1").items[0].selected=True
bs.find_control("avistatuscheck2").items[0].selected=True
bs.find_control("avistatuscheck3").items[0].selected=True
bs.find_control("avistatuscheck4").items[0].selected=True
bs.find_control("avistatuscheck5").items[0].selected=True
print "Sending retry signal."
bs.submit()
print server+" Retried!"
As it is, it checks all six boxes and submits the form with the dropdown option (AVI_STATUS_ACTION) set to item [1].
How do I go about having it determine which row (correlating to the proper avistatuscheck# checkbox control) is the most recent, and only submit the form with that checkbox checked? As more files are transferred they accumulate, and I don't need to resend them all, just the most recent.
I know a little about regex; enough to use urllib2 to load an HTML page into a string and grab the percentage amount from the current 'In Progress' transfer, but I'm a bit lost on how to determine the most recent transfer and map it to the correct checkbox control.
The page source contains the data in a nicer format than the rendered HTML, in comment form:
</td>
<!--$FREETEXT|AVI_STATUS_START_TIME0||XXXXXXXXXXXXXXXXXXXXXXXXX$-->
<td>
2014/07/11 12:00:03
</td>
<!--$FREETEXT|AVI_STATUS_END_TIME0||XXXXXXXXXXXXXXXXXXXXXXXXX$-->
<td>
2014/07/11 14:00:00
</td>
<!--$FREETEXT|AVI_STATUS_FILE_SIZE0||XXXXXXXXXXXXX$-->
<td>
You can use a regular expression to parse this:
import re
import mechanize
bs = mechanize.Browser()
server = raw_input("IP to retry: ")
response = bs.open("http://" + server + "/avicapture.html")
assert bs.viewing_html()
bs.select_form(name="avistatus_form")
matches = re.findall(
    r'(?s)<!--\$FREETEXT\|AVI_STATUS_END_TIME([0-9]+).*?<td>\s*([0-9/]+ [0-9:]+)\s*\n',
    response.read())
# Plain string comparison works here because the timestamps are zero-padded
# YYYY/MM/DD HH:MM:SS, so lexicographic order matches chronological order.
latest_id, latest_time = max(matches, key=lambda m: m[1])
form = bs.form
bs.find_control("AVI_STATUS_ACTION").items[1].selected = True
bs.find_control("avistatuscheck" + latest_id).items[0].selected = True
print "Sending retry signal."
bs.submit()
print server+" Retried!"
I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
That is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form, without any indication that the form data I submitted was recognized. In other words, I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for; I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.request in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
my_html_file.close()
But what is returned to me and saved in 'my_html_file.html' is the original page containing the unaltered form, without any indication that my form data was recognized, i.e. I get this page in response: http://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I had made this request without the data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the documentation and unsuccessfully searching the web for an answer to this problem, I moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3
issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the
same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
http://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as
the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
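For instance (a quick illustration, not part of the original code), urllib.parse.urljoin resolves the relative action against the page URL:
from urllib.parse import urljoin

page_url = 'http://www.w3schools.com/html/html_forms.asp'
action = 'html_form_action.asp'
print(urljoin(page_url, action))
# http://www.w3schools.com/html/html_form_action.asp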
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).
I'm trying to modify some existing code to return the value of 'title' from an API call. I was wondering if that's possible.
Example API Url: http://domain.com/rest/getSong.view?u=username&p=password&v=1.8.0&id=11452
The above URL returns:
<domain-response xmlns="http://domain.org/restapi" status="ok" version="1.8.0">
<song id="11452" parent="11044" title="The Title" album="The Album"/>
</domain-response>
Now, is there a way to use Python to get the 'title' value if I know the id?
Example of the current code using the REST API, in a file called domain.py:
def get_playlist(self, playlist_id):
    Addon.log('get_playlist: ' + playlist_id)
    payload = self.__get_json('getPlaylist.view', {'id': playlist_id})
    if payload:
        songs = self.listify(payload['playlist']['entry'])
        self.display_music_directory(songs)
The rest of the referenced code, from another file called default.py:
elif Addon.plugin_queries['mode'] == 'playlist':
    subsonic.get_playlist(Addon.plugin_queries['playlist_id'])
As your response is in XML format, the intuitive way is to use an XML parser. Here's how to use lxml to parse the response and get the title of the song with ID 11452:
from lxml import etree
s = """<domain-response xmlns="http://domain.org/restapi" status="ok" version="1.8.0">
<song id="11452" parent="11044" title="The Title" album="The Album"/>
</domain-response>"""
tree = etree.fromstring(s)
song = tree.xpath("//ns:song[@id='11452']", namespaces={'ns': 'http://domain.org/restapi'})
print song[0].get('title')
It's worth mentioning that there is also a dirty way to get the title, if you don't care about the rest of the content, by using a regular expression:
import re
print re.compile("song id=\"11452\".*?title=\"(.*?)\"").search(s).group(1)
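If you want to fold this into your addon, a hypothetical helper along these lines might work; the function name is made up, and the URL and credentials are placeholders copied from the example request in the question:
from lxml import etree
import requests

NS = {'ns': 'http://domain.org/restapi'}

def get_song_title(song_id):
    # Placeholder endpoint and credentials from the question's example URL.
    url = 'http://domain.com/rest/getSong.view'
    params = {'u': 'username', 'p': 'password', 'v': '1.8.0', 'id': song_id}
    resp = requests.get(url, params=params)
    tree = etree.fromstring(resp.content)
    # $sid is an XPath variable bound via the keyword argument.
    songs = tree.xpath('//ns:song[@id=$sid]', namespaces=NS, sid=str(song_id))
    return songs[0].get('title') if songs else None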