I would like to use Mechanize (with Python) to submit a form, but unfortunately the page has been badly coded and the <select> element is not actually inside <form> tags.
So I can't use the traditional method via the form:
forms = [f for f in br.forms()]
mycontrol = forms[1].controls[0]
What can I do instead?
Here is the page I would like to scrape, and the relevant bit of code. I'm interested in the la select item:
<fieldset class="searchField">
<label>By region / local authority</label>
<p id="regp">
<label>Region</label>
<select id="region" name="region"><option></option></select>
</p>
<p id="lap">
<label>Local authority</label>
<select id="la" name="la"><option></option></select>
</p>
<input id="byarea" type="submit" value="Go" />
<img id="regmap" src="/schools/performance/img/map_england.png" alt="Map of regions in England" border="0" usemap="#England" />
</fieldset>
This is actually more complex than you think, but still easy to implement. The webpage you linked pulls in the local authorities as JSON, which is why the name="la" select element isn't filled in under Mechanize: Mechanize doesn't run Javascript. The easiest way around this is to request the JSON data directly with Python and use the results to go straight to each data page.
import urllib2
import json

#The URL where we get our array of LA data
GET_LAS = 'http://www.education.gov.uk/cgi-bin/schools/performance/getareas.pl?level=la&code=0'
#The URL which we interpolate the LA ID into to get individual pages
GET_URL = 'http://www.education.gov.uk/schools/performance/geo/la%s_all.html'

def get_performance(la):
    page = urllib2.urlopen(GET_URL % la)
    #print(page.read())

#get the local authority list
las = json.loads(urllib2.urlopen(GET_LAS).read())
for la in las:
    if la != 0:
        print('Processing LA ID #%s (%s)' % (la[0], la[1]))
        get_performance(la[0])
As you can see, you don't even need to load the page you linked or use Mechanize to do it! However, you will still need a way to parse out the school names and then the performance figures.
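For Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. It assumes the JSON decodes to a list whose entries are either a bare 0 or an [id, name] pair, which is what the loop above implies. The helper names build_la_urls and fetch_las and the sample rows are made up for illustration.

```python
import json
from urllib.request import urlopen

GET_LAS = 'http://www.education.gov.uk/cgi-bin/schools/performance/getareas.pl?level=la&code=0'
GET_URL = 'http://www.education.gov.uk/schools/performance/geo/la%s_all.html'


def build_la_urls(las):
    """Return (la_id, la_name, page_url) for each real entry in the decoded list."""
    urls = []
    for la in las:
        # Assumption: placeholder entries are a bare 0, as in the loop above.
        if la == 0:
            continue
        urls.append((la[0], la[1], GET_URL % la[0]))
    return urls


def fetch_las():
    """Download and decode the local authority list (needs network access)."""
    with urlopen(GET_LAS) as response:
        return json.loads(response.read().decode('utf-8'))


# Hypothetical sample in the assumed shape, so the helper runs offline.
sample = [[202, 'Camden'], 0, [203, 'Greenwich']]
for la_id, name, url in build_la_urls(sample):
    print('LA ID #%s (%s): %s' % (la_id, name, url))
```

Against the live site you would call build_la_urls(fetch_las()) and then fetch each URL in turn.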
I am trying to scrape the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with Python 3.
The only way I found to locate it specifically in the HTML is through this a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the data wanted.
So I used the following code:
import bs4
import urllib.request as req

url="https://www.coinbase.com/price/bitcoin/hkd"
request=req.Request(url,headers={
    "user-agent":"..."
})
with req.urlopen(request) as response:
    data=response.read().decode("utf-8")
root=bs4.BeautifulSoup(data,"html.parser")
secBitcoin=root.find('a',href="/pt-PT/price/bitcoin")
realtimeCurrency=secBitcoin.find('p')
print(realtimeCurrency.string)
However, it always returns secBitcoin = None. No result matches.
The find function works just fine when I search for a div tag with a class parameter.
I have also tried a selector-style format like
.find('a[href="/pt-PT/price/bitcoin"]')
But nothing works.
It's possible the page is loading the currency values after the initial page load. You could try pressing Ctrl+S to save the full webpage and parse that file instead of the downloaded response. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like selenium to get what you need.
href is an attribute of an element and hence I think you cannot find it that way.
def is_a_and_href_matching(element):
    # Tag names and attribute names must be passed as strings
    is_a = element.name == 'a'
    if is_a and element.has_attr('href'):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False

secBitcoins=root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
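For comparison, here is a small offline check using a trimmed copy of the markup from the question. As far as I can tell, Beautiful Soup accepts href as a keyword filter in find(), and CSS selectors go through select_one() rather than find(). If both lookups succeed on the static snippet, a None result on the live page more likely means the anchor is missing from the downloaded HTML, for example because it is rendered by JavaScript or because a different URL was requested. The html string here is an abbreviated stand-in, not the real page.

```python
import bs4

# Trimmed copy of the markup quoted in the question.
html = '''
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin">
  <h3>Bitcoin</h3>
  <p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
</a>
'''

root = bs4.BeautifulSoup(html, "html.parser")

# Keyword arguments to find() filter on attribute values.
by_keyword = root.find("a", href="/pt-PT/price/bitcoin")

# CSS selector syntax belongs to select()/select_one(), not find().
by_selector = root.select_one('a[href="/pt-PT/price/bitcoin"]')

price = None
if by_keyword is not None:
    # Guard against None before chaining, to avoid an AttributeError.
    price = by_keyword.find("p").get_text(strip=True)
print(price)
```

Checking for None before calling find('p') also avoids the AttributeError you would otherwise get when the anchor is absent.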
I'm learning to create an Omegle bot, but the Omegle interface is written in HTML and I don't know very much about HTML or MechanicalSoup.
In the part where the text is inserted, the code snippet is as follows:
<td class="chatmsgcell">
<div class="chatmsgwrapper">
<textarea class="chatmsg " cols="80" rows="3"></textarea>
</div>
</td>
In the part of the button to send the text, the code snippet is:
<td class="sendbthcell">
<div class="sendbtnwrapper">
<button class="sendbtn">Send<div class="btnkbshortcut">Enter</div></button>
</div>
</td>
I want to set text in the textarea and send it with the button.
Looking at some examples in HTML, I guess the correct way to set text in a textarea is as follows:
<textarea>Here's a text.</textarea>
Also, I'm new at MechanicalSoup, but I think I know how to find and set a value in an HTML code:
# example in the Twitter interface
login_form = login_page.soup.find("form", {"class": "signin"})
LOGIN = "yourlogin"
login_form.find("input", {"name": "session[username_or_email]"})["value"] = LOGIN
From what I understand, the first argument is the name of the tag and the second argument is a dictionary that maps an attribute name to the attribute value to match.
But the textarea tag doesn't have an attribute for setting text, like value="Here's a text.". What should I do to set text in a textarea using MechanicalSoup?
I know it's not the answer you expect, but reading the doc would help ;-).
The full documentation is available at:
https://mechanicalsoup.readthedocs.io/
You probably want to start with the tutorial:
https://mechanicalsoup.readthedocs.io/en/stable/tutorial.html
In short, you need to select the form you want to fill in:
browser.select_form('form[action="/post"]')
Then, filling in fields is as simple as
browser["custname"] = "Me"
browser["custtel"] = "00 00 0001"
browser["custemail"] = "nobody@example.com"
browser["comments"] = "This pizza looks really good :-)"
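On the narrower question of where a textarea keeps its text: it is the element's content, not a value attribute. MechanicalSoup exposes pages as Beautiful Soup trees (the question's own code uses login_page.soup), so a rough sketch with Beautiful Soup alone is shown below, using the markup from the question. This only edits the parsed document. The Omegle textarea shown has no name attribute and no enclosing form, so whether changing it leads to anything being sent is a separate problem this sketch does not address.

```python
import bs4

# Markup copied from the question.
html = '''
<td class="chatmsgcell">
<div class="chatmsgwrapper">
<textarea class="chatmsg " cols="80" rows="3"></textarea>
</div>
</td>
'''

soup = bs4.BeautifulSoup(html, "html.parser")
textarea = soup.find("textarea", {"class": "chatmsg"})

# A textarea keeps its text as element content, not in a value attribute.
textarea.string = "Here's a text."
print(textarea.get_text())
```

When the textarea sits inside a form and has a name, the browser["comments"] = ... style from the answer is the more convenient route.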
I have a small .py program that renders 2 HTML pages. One of those HTML pages has a form in it: a basic form requesting a name and a comment. I cannot figure out how to take the name and the comment from the form and store them in the csv file. I have got the code to the point where the little I manually put into the csv file is printed/returned on the HTML page, which is one of the goals. But I can't get the data I enter in the form into the csv file and then back on the HTML page. I feel like this is a simple fix, but the Flask book makes absolutely no sense to me. I'm dyslexic and I find it impossible to make sense of the examples and the written explanations.
This is the code I have for reading the csv back onto the page:
@app.route('/guestbook')
def guestbook():
    with open('nameList.csv','r') as inFile:
        reader=csv.reader(inFile)
        names=[row for row in reader]
    return render_template('guestbook.html',names=names[1:])
And this is my form code:
<h3 class="tab">Feel free to enter your comments below</h3>
<br />
<br />
<form action="" method="get" enctype="text/plain" name="Comments Form">
<input id="namebox" type="text" maxlength="45" size="32" placeholder="Name"
class="tab"/>
<br />
<textarea id="txt1" class="textbox tab" rows="6" placeholder="Your comment"
class="tab" cols="28"></textarea>
<br />
<button class="menuitem tab" onclick="clearComment()" class="tab">Clear
comment</button>
<button class="menuitem" onclick="saveComment()" class="tab">Save comment</button>
<br>
</div>
As I understand it, you just need to save the form data into the file and don't know how to handle that in Flask. I'll try to explain it with code as clearly as possible:
# request gives access to the data sent with the current HTTP request
from flask import request
import csv

# methods lists the HTTP methods that are allowed for this route.
@app.route('/save-comment', methods=['POST'])
def save_comment():
    # This is to make sure the HTTP method is POST and not any other
    if request.method == 'POST':
        # request.form is a dictionary-like object that contains the form
        # sent through the HTTP request. Values are looked up by the
        # name="xxx" attribute of the html form field. So, if you want to
        # get the name, your input should be something like this:
        # <input type="text" name="name" />.
        name = request.form['name']
        comment = request.form['comment']
        # These are the fields your csv file has. Change them to your
        # actual csv's fields.
        fieldnames = ['name', 'comment']
        # We open the file as in the reading code, but with "a" so that each
        # new row is appended. Using "w" would erase the existing rows.
        with open('nameList.csv','a') as inFile:
            # DictWriter lets you write each row as a dictionary instead of
            # building the csv line manually.
            writer = csv.DictWriter(inFile, fieldnames=fieldnames)
            # writerow() will write a row in your csv file
            writer.writerow({'name': name, 'comment': comment})
        # Return a text or a template. A Flask view that returns nothing
        # raises an error.
        return 'Thanks for your input!'
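The csv part can be tried on its own, without Flask. The sketch below assumes the file starts with a header row (the guestbook view skips the first row with names[1:]) and that the columns are name and comment. The helper names append_comment and read_comments and the demo rows are made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ['name', 'comment']


def append_comment(path, name, comment):
    """Append one row; mode 'a' keeps rows already in the file."""
    with open(path, 'a', newline='') as out_file:
        writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
        writer.writerow({'name': name, 'comment': comment})


def read_comments(path):
    """Return all rows after the header, like names[1:] in the guestbook view."""
    with open(path, 'r', newline='') as in_file:
        return list(csv.reader(in_file))[1:]


# Demo on a throwaway file that starts with only a header row.
demo_path = os.path.join(tempfile.mkdtemp(), 'nameList.csv')
with open(demo_path, 'w', newline='') as f:
    csv.writer(f).writerow(FIELDNAMES)

append_comment(demo_path, 'Ann', 'Nice site')
append_comment(demo_path, 'Bob', 'Hello, world')
rows = read_comments(demo_path)
print(rows)
```

For the route to receive anything, the form in the question would also need method="post", an action pointing at /save-comment, and name="name" and name="comment" attributes on the input and the textarea.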
I am trying to write some python code to automate the querying of an online medical calculation tool. The resource is available at:
http://www.shef.ac.uk/FRAX/tool.aspx?lang=en
I am new to this type of thing, but understand from my research that I should be able to use the python requests package for this.
From my inspection of the page source I have identified the form element
<form method="post" action="tool.aspx?lang=en" id="form1">
And the elements that seem to directly correspond to the fields (eg. age) look like this
<input name="ctl00$ContentPlaceHolder1$toolage" type="text" id="ContentPlaceHolder1_toolage" maxlength="5" size="3" onkeypress="numericValidate(event)" style="width:40px;" />
My testing code so far looks like this (The only required fields to have filled out are age, sex, weight and height):
import requests
url="http://www.shef.ac.uk/FRAX/tool.aspx?lang=en"
payload ={'ctl00$ContentPlaceHolder1$toolage':'60',
'ctl00$ContentPlaceHolder1$year':'1954',
'ctl00$ContentPlaceHolder1$month':'01',
'ctl00$ContentPlaceHolder1$day':'01',
'ctl00$ContentPlaceHolder1$sex':'female',
'ctl00$ContentPlaceHolder1$weight':'70',
'ctl00$ContentPlaceHolder1$ht':'165',
'ctl00$ContentPlaceHolder1$facture':'no',
'ctl00$ContentPlaceHolder1facture_hip$':'no',
'ctl00$ContentPlaceHolder1$smoking':'no',
'ctl00$ContentPlaceHolder1$glu':'no',
'ctl00$ContentPlaceHolder1$rhe_art':'no',
'ctl00$ContentPlaceHolder1$sec_ost':'no',
'ctl00$ContentPlaceHolder1$alcohol':'no',
'ctl00$ContentPlaceHolder1$bmd_input':'',
'ctl00$ContentPlaceHolder1$btnCalculate':'Calculate',
}
req = requests.post(url, params=payload)
with open("requests_results.html", "w") as f:
    f.write(req.content)
This, however, does not work. I don't get an error message, but the saved html page (which I would later parse for the results) contains just the initial page with no resulting values. In addition to the fields in my current payload, the form also contains other elements that are perhaps necessary, such as hidden elements for some of the same data types like age
<input name="ctl00$ContentPlaceHolder1$toolagehidden" type="hidden" id="ContentPlaceHolder1_toolagehidden"
I have tried different combinations of payloads, but the results are the same. Any help would be much appreciated.
You want to encode the payload before the POST. Like this:
import urllib
usefulpayload = urllib.urlencode(payload)
Then use usefulpayload in your request.
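Two other things may be worth checking. In requests, params= puts the values in the URL query string, while data= sends them form-encoded in the POST body (a dict passed to data= is encoded for you). Also, pages like this one often expect the hidden inputs from the original form to be posted back. The sketch below collects hidden inputs from a page with the standard library. The sample HTML, including the __VIEWSTATE field, is a hypothetical fragment, and HiddenInputCollector and build_payload are illustrative names.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and (attrs.get('type') or '').lower() == 'hidden':
            name = attrs.get('name')
            if name:
                self.fields[name] = attrs.get('value') or ''


def build_payload(page_html, visible_fields):
    """Merge the page's hidden fields with the values you want to submit."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload.update(visible_fields)
    return payload


# Hypothetical page fragment; the real page's hidden fields may differ.
sample_html = '''
<form method="post" action="tool.aspx?lang=en" id="form1">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input name="ctl00$ContentPlaceHolder1$toolagehidden" type="hidden" />
<input name="ctl00$ContentPlaceHolder1$toolage" type="text" />
</form>
'''
payload = build_payload(sample_html, {'ctl00$ContentPlaceHolder1$toolage': '60'})
print(payload)
```

With requests, one way to use this would be to GET the page inside a requests.Session(), pass its .text to build_payload, and then call session.post(url, data=payload).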
I'm trying to automate the login to a site, http://www.tthfanfic.org/login.php.
The problem I am having is that the password field has a name that is randomly generated. I have tried using its label, type and id, all of which remain static, but to no avail.
Here is the HTML of the form:
<tr>
<th><label for="urealname">User Name</label></th>
<td><input type='text' id='urealname' name='urealname' value=''/> NOTE: Your user name may not be the same as your pen name.</td>
</tr>
<tr>
<th><label for="password">Password</label></th><td><input type='password' id='password' name='e008565a17664e26ac8c0e13af71a6d2'/></td>
</tr>
<tr>
<th>Remember Me</th><td><input type='checkbox' id='remember' name='remember'/>
<label for="remember">Log me in automatically for two weeks on this computer using a cookie. </label> Do not select this option if this is a public computer, or you have an evil sibling.</td>
</tr>
<tr>
<td colspan='2' style="text-align:center">
<input type='submit' value='Login' name='loginsubmit'/>
</td>
</tr>
I've tried to format that for readability but it still looks bad, consider checking the code on the supplied page.
Here is the code I get when printing the form through mechanize:
<POST http://www.tthfanfic.org/login.php application/x-www-form-urlencoded
<HiddenControl(ctkn=a40e5ff08d51a874d0d7b59173bf3d483142d2dde56889d35dd6914de92f2f2a) (readonly)>
<TextControl(urealname=)>
<PasswordControl(986f996e16074151964c247608da4aa6=)>
<CheckboxControl(remember=[on])>
<SubmitControl(loginsubmit=Login) (readonly)>>
The number sequence in the PasswordControl is the part that changes each time I reload the page. In the HTML from the site the field has several other attributes, but none of them work when I try to select by them, or I'm doing it incorrectly.
Here is the code I am using to try and select the control by label:
fieldTwo = br.form.find_control(label='password')
br[fieldOne] = identifier
br[fieldTwo] = password
I can post the rest of my login code if necessary, but this is the only part that is not working. I have had success with other sites where the password name remains the same.
So, is it possible for me to select the PasswordControl using its label, type or ID, or do I need to scrape its name?
EDIT: Oops, forgot to add the error message:
raise ControlNotFoundError("no control matching "+description)
mechanize._form.ControlNotFoundError: no control matching label 'password'
SOLVED:
Solution given by a guy on reddit, thanks Bliti.
Working code:
#Select the form you want to use.
br.select_form(nr=2)
#Collect the name of each control in br.form.controls
#(named "names" to avoid shadowing the built-in list).
names = []
for f in br.form.controls:
    names.append(f.name)
#Select the correct one from the list: the password control is the third.
fieldTwo = names[2]
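Picking the control by position works as long as the form layout does not change. A variation that does not depend on position is to look the field up by its type. As far as I know, mechanize's find_control also accepts type and id keyword arguments, so br.form.find_control(type="password") may be enough. If you would rather scrape the name yourself, the standard-library sketch below pulls it from the HTML; PasswordNameFinder and find_password_field_name are illustrative names.

```python
from html.parser import HTMLParser


class PasswordNameFinder(HTMLParser):
    """Record the name attribute of the first <input type="password">."""

    def __init__(self):
        super().__init__()
        self.password_name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_password = (attrs.get('type') or '').lower() == 'password'
        if tag == 'input' and is_password and self.password_name is None:
            self.password_name = attrs.get('name')


def find_password_field_name(page_html):
    """Return the password input's name, or None if the page has none."""
    finder = PasswordNameFinder()
    finder.feed(page_html)
    return finder.password_name


# Fragment copied from the form in the question.
sample_html = """
<th><label for="password">Password</label></th><td><input type='password' id='password' name='e008565a17664e26ac8c0e13af71a6d2'/></td>
"""
print(find_password_field_name(sample_html))
```

The returned name can then be used as the key when filling in the form, in place of the positional lookup.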