Yesterday, I was trying to scrape data from this webpage (an interface to the number of people in waiting lists in several hospitals in Madrid, Spain). The form is pretty simple: one selects the hospital and then the form is populated with the different services (dermatology, endocrinology, etc) and the date (currently, only December 2016).
I was reading through the HTML code and tried to code a very simple Python script to accomplish the task of downloading all the data automatically. The full code is in this github repository, but I am going to present it here step-by-step.
First, I load the page and get a reference to the main form:
import mechanize

# Config vars
URL = "https://servicioselectronicos.sanidadmadrid.org/LEQ/Consulta.aspx"

if __name__ == "__main__":
    br = mechanize.Browser()
    br.open(URL)
    response = br.response()
    br.select_form('aspnetForm')
    form = br.form
At this point, form contains a reference to the main form in the document. I then select the second hospital in the list (it could have been any of them) and submit the form (the original page does a __doPostBack to fill in the rest of the drop-down items):
# Select the second hospital and re-submit in order to have the list of
# services available in that hospital
form.controls[2].items[1].selected = True
request = br.submit()
# We can now submit with the desired selection
br.select_form('aspnetForm')
form = br.form
Now, form is a reference to the filled-in form, with the list of services available for that particular hospital and the date. I then select several of these fields and re-submit:
form.controls[2].items[1].selected = True
form.controls[3].items[1].selected = True
form.controls[4].items[0].selected = True
req_data = form.click_request_data()
response = br.submit()
However, and this is where I get really confused, response does not contain the desired result (the number of people on the waiting list). The HTML code only contains the filled-in form, with the values I have selected, but nothing else.
I know this can be scraped. I have seen another solution, written in R, that uses selenium as the browser engine. Am I missing something, or is this particular example something that cannot be scraped simply using mechanize and needs something a bit more complex?
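For context on the __doPostBack call mentioned above: on ASP.NET WebForms pages, that JavaScript function normally copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and then submits the form, so a plain submit that leaves those fields empty may simply return the form again. The sketch below only illustrates that idea and is not a tested fix for this site. It collects the hidden inputs from a page and builds the POST payload by hand. The HTML fragment and the control name ctl00$ddlHospital are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback(html, event_target, event_argument="", extra=None):
    """Return a dict emulating what __doPostBack(event_target, event_argument) submits."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)      # e.g. __VIEWSTATE and similar state fields
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(extra or {})           # visible fields, such as the selected option
    return payload


# Hypothetical page fragment and control name, for illustration only
SAMPLE_HTML = """
<form name="aspnetForm" method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
</form>
"""

payload = build_postback(SAMPLE_HTML, "ctl00$ddlHospital",
                         extra={"ctl00$ddlHospital": "2"})
body = urlencode(payload)  # this string would be the body of the POST request
```

With mechanize, the equivalent would be to make the two hidden controls writable and set their values before submitting. Whether that is enough for this particular page is something to verify against the requests the browser actually sends.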
I am designing a small sports news app as a school project. It scrapes data from the web and posts it to a Firebase Realtime Database, which is then used by an application my project partner is building in Android Studio. So far during development I have just been deleting the database and rebuilding it every time I run the code, to prevent the same data from building up. How would I go about checking whether a piece of data already exists before I push it to the database?
Thanks if anyone is able to point me in the right direction. Here is my code for pushing the data to firebase:
import requests
from bs4 import BeautifulSoup
from firebase_admin import db

ref = db.reference('/news')
ref.delete()
url = 'https://news.sky.com/topic/premier-league-3810'
content = requests.get(url)
soup = BeautifulSoup(content.text, "html.parser")
body = soup.find_all("h3", "sdc-site-tile__headline")
titles_list = []
links_list = []
for item in body:
    headline = item.find_all('span', class_='sdc-site-tile__headline-text')[0].text
    titles_list.append(headline)
    link = item.find('a', class_='sdc-site-tile__headline-link').get('href')
    links_list.append('https://news.sky.com' + link)
i = 0
while i < len(titles_list):
    ref.push({
        'Title': titles_list[i],
        'Link': links_list[i]
    })
    i += 1
There are a few main options here:
You can use a query to check if the data already exists, before writing it. Then when it doesn't exist yet, you can write it.
If multiple users/devices can be adding data at the same time, the above won't work as somebody may write their data just after you have checked if the values already exist. In that case you'll want to:
Use the values that are to be unique as the key of the data (so using child("your_unique_values").set instead of push), and use a transaction to ensure you don't overwrite each other's data.
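The last option can be sketched roughly as follows. This is a minimal illustration, assuming the Firebase Admin SDK for Python, whose Reference objects provide child() and transaction(). The helper names key_for and add_if_absent are made up for the example, and hashing the link is just one way to get a key, since Firebase keys cannot contain characters such as "/" or "." that appear in URLs.

```python
import hashlib


def key_for(link):
    """Derive a stable, Firebase-safe key from the value that must be unique."""
    return hashlib.sha1(link.encode("utf-8")).hexdigest()


def add_if_absent(news_ref, title, link):
    """Write the article under its derived key unless something is already there."""
    child = news_ref.child(key_for(link))

    def keep_existing(current):
        # Keep the stored value if present, otherwise store the new record
        if current is not None:
            return current
        return {"Title": title, "Link": link}

    return child.transaction(keep_existing)
```

In the scraping loop this would replace ref.push(...), for example add_if_absent(db.reference('/news'), headline, link), and the ref.delete() call would no longer be needed.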
I have a Flask application and need to store users' place when they navigate the content.
For example, I have a route like this: @main_bp.route('/articles/<category>/<article_number>', defaults={'category': 'new'})
The content is organized such that you page through articles under a category: starting at 0, then 1, and so forth. The URL for article number 3 would look like: articles/<category>/3
I'd like to save users' place so that if they leave the site after visiting article 3, when they navigate to the articles page they'll land on articles/<category>/3, rather than articles/<category>/0.
What is the best way to achieve this? Currently, I've modeled the data in the database so there is a column that looks like category_article_last_visited (integer). I'm able to store this data as a user browses the site, but I'm not sure how to retrieve it when they return to the articles page.
What I've tried:
@main_bp.route('/articles/<category>/<article_number>', defaults={'category': 'new', 'article_number': current_user.category_article_last_visited}), but I get an error that there is no such attribute.
Checking current_user.category_article_last_visited in the routes function and using the article number. This renders the correct content, but doesn't change the URL, which won't work.
Redirecting users if they have a value for current_user.category_article_last_visited. This doesn't seem to yield any change.
I am curious if storing in the db (assigning the value, db.commit(), etc.) is the right path, or if I should explore flask-sessions more. I need this information to persist across sessions, so that if a user logs out, clears cookies, uses a different device, etc. it is still available. I may also perform analytics on these values in the future.
Is the method I've described above the best way to achieve this? Is flask-sessions or something else preferable?
If the method outlined above is best, how do I correctly route this information so that users are directed to the page they left off, and the URL is changed appropriately?
Thanks
I would go with the redirect solution, as it is clearer.
I would add an if statement at the beginning of the route function and, if there is data for this user, redirect to that page. For example:
@main_bp.route('/articles/<category>/<article_number>', defaults={'category': 'new'})
def routefunc(category, article_number):
    if current_user.category_article_last_visited != 0:  # or whatever your column keeps for empty data
        # the stored article number is an integer, so convert it to a string
        return redirect('/articles/' + category + '/' + str(current_user.category_article_last_visited))
    # ... otherwise render the requested article as usual
This must be combined with some other functionality to avoid infinite redirection to this route:
Option 1:
You can add another variable in the route that will have specific value on these redirections and will ignore this if statement. For example:
@main_bp.route('/articles/<category>/<article_number>/<check>', defaults={'category': 'new'})
def routefunc(category, article_number, check):
    # URL variables arrive as strings, so compare against '1'
    if current_user.category_article_last_visited != 0 and check != '1':
        return redirect('/articles/' + category + '/' + str(current_user.category_article_last_visited) + '/1')
However, in this case you must add this variable (with some value other than 1) to all of your URLs, hrefs, etc., and it will make your URLs more "dirty". It would be effective for a small app, but I would avoid it for a big app or website with many internal links.
Option 2:
You could add one more column to your database table that holds 1 or 0 depending on whether the user visits this route directly or via a redirect. In this case you must add a couple of queries to check and/or update this value before and after the redirect.
Option 3:
You could create another similar route that will only handle redirections, and produce the same results (same html) but without the if statement. For example:
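@main_bp.route('/articles/<category>/<article_number>', defaults={'category': 'new'})
def routefunc(category, article_number):
    if current_user.category_article_last_visited != 0:  # or whatever your column keeps for empty data
        # the stored article number is an integer, so convert it to a string
        return redirect('/articles2/' + category + '/' + str(current_user.category_article_last_visited))
    return render_template('yourhtml.html')

@main_bp.route('/articles2/<category>/<article_number>', defaults={'category': 'new'})
def routefunc2(category, article_number):
    # same page as above, but without the redirect check
    return render_template('yourhtml.html')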
Note: a session-based approach is not a good fit here, as you want a long-term solution.
As you probably have many categories, articles, and users, you had better create a separate table specifically for this.
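A separate table along those lines could look like the sketch below. It uses the standard-library sqlite3 module purely for illustration, so the table name reading_progress, its columns, and the helper functions are all assumptions rather than part of the original app. In the Flask app the same shape would be expressed with whatever database layer is already in use.

```python
import sqlite3


def init_db(conn):
    """Create one row per (user, category) holding the last article visited."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reading_progress (
            user_id      INTEGER NOT NULL,
            category     TEXT    NOT NULL,
            last_article INTEGER NOT NULL,
            PRIMARY KEY (user_id, category)
        )
        """
    )


def save_progress(conn, user_id, category, article_number):
    # The primary key makes a later visit replace the earlier row
    conn.execute(
        "INSERT OR REPLACE INTO reading_progress (user_id, category, last_article) "
        "VALUES (?, ?, ?)",
        (user_id, category, article_number),
    )
    conn.commit()


def last_article(conn, user_id, category, default=0):
    """Return the stored article number, or the default if nothing is stored."""
    row = conn.execute(
        "SELECT last_article FROM reading_progress WHERE user_id = ? AND category = ?",
        (user_id, category),
    ).fetchone()
    return row[0] if row else default
```

Keeping one row per user and category avoids adding a new column to the users table for every category, and the rows remain available for later analytics.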
I don't know what is the best way to achieve what you want but here's what you could try. Assuming you want to perform some analytics on the data you might want to store it in a database.
You could have a route designed to create a user cookie when a new user visits your page and redirects him to the articles page with the new cookie set:
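from flask import current_app, make_response, redirect, session
from flask.sessions import SecureCookieSessionInterface

@main_bp.route('/articles/set_cookie', methods=["GET"])
def set_article_cookie():
    # get_signing_serializer expects the Flask app (it needs the secret key), not the blueprint
    session_serializer = SecureCookieSessionInterface().get_signing_serializer(current_app)
    temp_cookie = session_serializer.dumps(dict(session))
    resp = make_response(redirect('/articles'))
    resp.set_cookie("user", temp_cookie)
    return resp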
And your existing route in which you check if the user has already visited the page. In which case you will want to check in the database what was the last article he read and redirect him accordingly:
@main_bp.route('/articles/<category>/<article_number>', defaults={'category': 'new'})
def articles(category, article_number):
    # If the user cookie is already set, check whether there is data in the database
    # and redirect to the last article visited
    cookie = request.cookies
    if "user" in cookie:
        # Retrieve the user cookie value and check the database for this value
        # (last_article_visited is a placeholder for the result of that lookup)
        return redirect('/articles/' + last_article_visited)
    # Else redirect the user to set_article_cookie
    else:
        return redirect("/articles/set_cookie")
OK, here is the solution I decided on:
I update the paths of nav links throughout the site, so instead of /articles/<category>/0 it's /articles/<category>/current_user.article_number_last_visited
Since not all users have visited articles in every category, I added default routing logic, similar to:
@main_bp.route('/articles/<category>/', defaults={'article_number': 0})
@main_bp.route('/articles/<category>/<article_number>', methods=['GET', 'POST'])
This routes users correctly even if current_user.article_number is null.
I believe this will also work if the user is not logged in (and therefore there will be no article_number attribute). I haven't checked this case out thoroughly though because in my use case users have to be logged in to view content.
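The link-building part of this can be sketched as a small helper. The function name article_path is hypothetical; it only shows how a missing value falls back to the default route instead of producing a broken URL.

```python
def article_path(category, last_visited=None):
    """Build the nav link for a category, resuming at last_visited when known."""
    if last_visited is None:
        # No stored position: let the default route pick article 0
        return "/articles/{}/".format(category)
    return "/articles/{}/{}".format(category, last_visited)
```

For example, article_path("new", 3) gives "/articles/new/3", while article_path("new") gives "/articles/new/".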
Here is my example form: https://docs.google.com/forms/d/e/1FAIpQLSfVXZ1721ZRrHetp1qUak9T-o-MwKA9G3q01rLAFI2OJhZjUw/viewform
I want to send a response to it with python, but I don't know how to fill the "text box", so I can't even start it. Can you help me, please?
To submit data to a Google Form, first replace viewform with formResponse in your URL.
You will then POST the submission to the form response URL.
You need to keep 2 things in mind.
Get the form response URL. It can be found by replacing your form ID into the following:
https://docs.google.com/forms/d/<form_id>/formResponse
Assemble the submission. This will be a dictionary whose keys are the IDs of the form questions and whose values are what you'd like to submit. To get the IDs, go to your live form and inspect the HTML (Right Click -> Inspect Element) of the components where you would normally type your answers. You should find input elements with a name attribute like:
name="entry.<id>"
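Both steps can also be done in code. The sketch below is a rough illustration using only the standard library: it assumes you have the form's HTML as a string and that it contains inputs named entry.<id> as described above, which may not hold for every form. The helper names are made up for the example.

```python
import re
from html.parser import HTMLParser


class EntryNameCollector(HTMLParser):
    """Collect name attributes of the form 'entry.<digits>' in document order."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("name") or ""
        if re.fullmatch(r"entry\.\d+", name) and name not in self.names:
            self.names.append(name)


def find_entry_names(html):
    """Return the question field names found in the given form HTML."""
    collector = EntryNameCollector()
    collector.feed(html)
    return collector.names


def form_response_url(view_url):
    """Swap the trailing 'viewform' (and any query string) for 'formResponse'."""
    return re.sub(r"viewform(\?.*)?$", "formResponse", view_url)
```

The names returned by find_entry_names become the keys of the dictionary you POST, and form_response_url gives the address to POST it to.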
A simple program to send response would be:
import requests
url ="https://docs.google.com/forms/d/e/1FAIpQLSfVXZ1721ZRrHetp1qUak9T-o-MwKA9G3q01rLAFI2OJhZjUw/formResponse"
data_to_send = 'DATA' # Assign Data to be sent
requests.post(url, {"entry.685250623":data_to_send}) # Found the entry Id viewing your form
Hope this answers your question!!!
I'm currently working on a webpage using the django framework for python.
I need to have a page where admin users can register an event in the system.
An event consists of: location on a map, description, images, links, etc.
I feel it would be less confusing if the user adds the location details on the first page and, once he has finished choosing a location, clicks next. This would take him to another page where he would finish filling out the information about the event.
I'm not sure, but I think this is more of a database question than a Django question.
How would I continue adding to the same table in a database across two separate pages?
I thought about using timestamp so I could select the last modified table on the next page but I think that might be risky + if the user goes back to modify the table the timestamp is useless.
I'm using Django 1.5 + postgresql database. Any reading references that might be good to check out for this kind of operation?
I've done something similar to this before. I asked users to enter a zip code on one page and then based upon that zip code it loaded in different options for the form on the next page. Here is how I did it using request.session
Note that this is my solution to MY problem. It may not be exactly what you are looking for, but it might help you get started. If anyone has a better solution I'd love to see it, since I'm not entirely happy with my answer.
views.py
def find_zip(request):
    c = {}
    form = FindZip()
    c['form'] = form
    if request.method == 'POST':
        form = FindZip(request.POST)
        c['form'] = form
        if form.is_valid():
            zip = form.data['zip']
            form = ExternalDonateForm(initial={'zip': zip})
            # keep the submitted data in the session for the next view
            request.session['_old_post'] = request.POST
            c['form'] = form
            response = HttpResponseRedirect('/external')
            return response
    return render_to_response(
        'find_zip.html',
        c,
        context_instance=RequestContext(request)
    )
I then try to retrieve that session data in the next view:
def donate_external(request):
    zip = None
    if request.session.get('_old_post'):
        old_post = request.session.get('_old_post')
        zip = old_post['zip']
    # rest of code ....
Based on some quick examples found on SO and other sources, I am trying to use Python urllib/urllib2 to submit a form in the following manner:
>>> import urllib, urllib2
>>> url = 'http://example.com'
>>> r_params = {'a':'test','b':'hooray'}
>>> e_params = urllib.urlencode(r_params)
>>> user_agent = 'some browser and such'
>>> headers = {'User-Agent': user_agent}
>>> req = urllib2.Request(url, e_params, headers)
>>> response = urllib2.urlopen(req)
>>> data = response.read()
I've gotten this to work, however, on the particular form I am looking for there are two buttons of type "submit". e.g.:
<b><input type="submit" name="ButtonA" value="SUBMIT"></b>
<b><input type="submit" name="ButtonB" value="LINK"></b>
I believe the problem I'm having results from the current code choosing the wrong one. How do I get a response by submitting ButtonB rather than ButtonA? Some of the stuff I've read seems to indicate that I could try using mechanize, but I was hoping to keep this simple without having to read up and learn mechanize. Is there an easy way to do this, or do I need to suck it up and actually take the time to learn and understand what I'm doing?
It should be fairly simple once you look at what exactly you're doing. You're sending a POST request (urllib2.urlopen sends a POST request automatically if the data argument is supplied) with the data that would normally be supplied by the form element itself. In the case of multiple "submit" inputs, the name and value of the activated submit input are sent along with the rest of the form data.
So, that's all you have to do - include "ButtonB":"LINK" as data.
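As a minimal sketch of that idea: the example below uses Python 3's urllib.request and urllib.parse rather than the urllib2 module from the question, and the helper name build_request is made up. It adds only the activated button's name and value to the other form fields and builds the POST request without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, fields, button_name, button_value,
                  user_agent="some browser and such"):
    """Build a POST request that looks like the named submit button was clicked."""
    data = dict(fields)
    data[button_name] = button_value      # only the activated button is sent
    body = urlencode(data).encode("ascii")
    return Request(url, data=body, headers={"User-Agent": user_agent})


req = build_request("http://example.com", {"a": "test", "b": "hooray"},
                    "ButtonB", "LINK")
# urllib.request.urlopen(req) would send it; it is left out here so the
# sketch has no network side effects
```

Because a data argument is supplied, the request is sent as a POST, just as with urllib2 in the question.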
A quick reference so you can see how HTML does all the stuff it does:
http://www.w3.org/TR/html401/interact/forms.html#submit-format
I recommend using a tool like TamperData for Firefox to discover precisely how the site's POSTs are formed. Activate TamperData just before you're ready to click one of the buttons. When it's up, go ahead and click one. The POST will be recorded in TamperData. Find it and click on it.
Find the POSTDATA row below and double-click it. Select the "Decoded" radio button to remove the HTML escapes. Now you have a 1:1 reference you should copy when making your "r_params" dictionary. For instance, if the POSTDATA looked like this:
Name | Value
--------------------
QueryString | test
Page |
Search | blah
then you will create your dictionary like this:
r_params = {'QueryString': 'test',
            'Page': '',
            'Search': 'blah'}
After you've found out what the POSTDATA looks like for each separate submit event, you'll know how to create the right dictionary to send along. Also, be sure to confirm you are POSTing to the correct URL. Good luck!