Submitting queries to, and scraping results from, ASPX pages using Python?

I am trying to get results for a batch of queries to this demographics tools page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx
The POST action on the form calls the same page (_self) and is probably posting some event data. I read in another post here on Stack Overflow that ASPX pages typically need some viewstate and validation data. Do I simply save these from a GET request and re-send them in a POST request?
Or is there a cleaner way to do this? One of those ASPX viewstate parameters is about 1,000 characters long, and the incredible ugliness of pasting that into my code makes me think there HAS to be a better way. Any and all references to stuff I can read up on would be helpful, thanks!

Perhaps mechanize may be of use.

Use urllib2. Your POST data is a simple Python dictionary. Very easy to edit and maintain.
If your form contains hidden fields -- some of which are encoded -- then you need to do a GET to get the form and the various hidden field seed values.
Once you GET the form, you can add the necessary input values to the given, hidden values and POST the response back again.
Also, you'll have to be sure that you handle any cookies. urllib2 will help with that, also.
After all, that's all a browser does, and it works in a browser. Browsers don't know ASPX from CGI from WSGI, so there's no magic just because it's ASPX. You sometimes have to do a GET before a POST to get values and cookies set up properly.
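Here is a minimal sketch of that GET-then-POST flow in Python 2 with urllib2 and cookielib. It is untested against that particular page; the hidden field names are the standard ASP.NET ones, and the query field name is a placeholder you would replace with the real input name from the form's HTML.
import re
import urllib
import urllib2
import cookielib

url = "http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx"

# Keep cookies across the GET and the POST.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# 1) GET the form so the server sets its cookies and hands us the seed values.
form_html = opener.open(url).read()

def hidden_value(name, html):
    # Pull the value attribute of a hidden input out of the raw HTML.
    match = re.search(r'id="%s" value="([^"]*)"' % name, html)
    return match.group(1) if match else ""

# 2) POST the hidden values back together with our own input.
post_data = urllib.urlencode({
    "__VIEWSTATE": hidden_value("__VIEWSTATE", form_html),
    "__EVENTVALIDATION": hidden_value("__EVENTVALIDATION", form_html),
    "QueryTextBox": "example query",  # placeholder field name
})
result_html = opener.open(url, post_data).read()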

I've used a combination of requests and BeautifulSoup4 for a similar task.
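For what it's worth, the same GET-then-POST idea with those two libraries looks roughly like this (a sketch, not tested against that page; the query field name is a placeholder, and BeautifulSoup simply collects every hidden input so the ASP.NET state fields come along for free):
import requests
from bs4 import BeautifulSoup

url = "http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx"

with requests.Session() as session:  # the Session keeps cookies between requests
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    payload = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", type="hidden")
    }
    payload["QueryTextBox"] = "example query"  # placeholder field name
    result = session.post(url, data=payload)
    print(result.text[:500])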

Related

Unable to find a way to paginate through API data

I'm trying to use Python 3 requests.get to retrieve data from this page using its API. I'm interested in retrieving it using the data found here and saving the entire table into my own JSON file.
Here's my attempt so far:
import json
import requests

source = requests.get("https://www.mwebexplorer.com/api/mwebblocks").json()
with open('mweb.json', 'w') as json_file:
    json.dump(source, json_file)
I've looked through other questions about pagination, and in all of them it was possible to write a for loop to iterate through the pages, but in my specific case the link does not change when clicking Next to go to the next page of data. I also can't use Scrapy's XPath method to click Next, because the table and its pagination are not accessible through the HTML or XML.
Is there something I can add to my requests.get to include the entire JSON of all pages of the table?
Depending on what browser you're using it might be different, but in Chrome I can go to the Network tab in DevTools and view the full details of the request. This reveals that it's actually a POST request, not a GET request. If you look at the payload, you can see a bunch of key-value pairs, including a start and a length.
So, try something like
requests.post("https://www.mwebexplorer.com/api/mwebblocks", data={"start": "50", "length": "50"})
or similar. You might need to include the other parts of the form data, depending on the response you get.
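If that works, paging through the table is just a matter of stepping start by length until the server stops returning rows. A rough sketch, assuming the rows come back under a "data" key (check the actual response shape first):
import json
import requests

url = "https://www.mwebexplorer.com/api/mwebblocks"
page_size = 50
all_rows = []
start = 0

while True:
    # Repeat the POST the browser makes, stepping the offset each time.
    resp = requests.post(url, data={"start": start, "length": page_size})
    rows = resp.json().get("data", [])  # assumption: rows live under "data"
    if not rows:
        break
    all_rows.extend(rows)
    start += page_size

with open('mweb.json', 'w') as json_file:
    json.dump(all_rows, json_file)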
Keep in mind that sites frequently don't like it when you try to scrape them like this.

Python requests vs. urllib2

I have used the requests library many times and I know it has a ton of advantages. However, I was trying to retrieve the following Wikipedia page:
https://en.wikipedia.org/wiki/Talk:Land_value_tax
and requests.get retrieves it partially:
response = requests.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax', verify=False)
html = response.text
I tried it using urllib2 and urllib2.urlopen and it retrieves the same page completely:
html = urllib2.urlopen('https://en.wikipedia.org/wiki/Talk:Land_value_tax').read()
Does anyone know why this happens and how to solve it using requests?
By the way, looking at the number of times this post has been viewed, I realized that people are interested in knowing the differences between these two libraries. If anyone knows about other differences between them, I'd appreciate it if they edited this question or posted an answer adding those differences.
It seems to me the problem lies in the scripting on the target page. The JS-driven content is rendered there (in particular, I found calls to MediaWiki), so you need to look at a web sniffer to identify it.
What to do? If you want to retrieve the whole page content, you had better plug in one of the libraries that can execute (evaluate) the in-page JavaScript. Read more here.
Update
"I am not interested in retrieving the whole page and statistics or JS libraries retrieved from MediaWiki. I only need the whole content of the page (through scraping, not the MediaWiki API)."
The issue is that those JS calls to other resources (including MediaWiki) are what allow the whole page to be rendered on the client. Since the library does not execute JavaScript, those calls never run, so parts of the page are never loaded from the other resources, and the page you retrieve is not whole.
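One way to plug in JavaScript execution is to drive a real browser, for example with Selenium. A minimal sketch (it assumes a working chromedriver install and is not specific to this page):
from selenium import webdriver

driver = webdriver.Chrome()  # needs chromedriver on your PATH
driver.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax')
html = driver.page_source    # the HTML after the page's scripts have run
driver.quit()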

How to use urllib to fill out forms and gather data?

I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to. One task is to auto-fill searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like:
import urllib
import urllib2

url = "http://marriott.com"
values = {'Location': 'New York'}

# Encode the form values and POST them to the site.
data = urllib.urlencode(values)
website = urllib2.Request(url, data)
response = urllib2.urlopen(website)
stuff = response.read()

# Save the response so I can open it in a browser and inspect it.
f = open('test.html', 'w')
f.write(stuff)
f.close()
My questions are the following:
How do you know how the website receives information?
How do I know a simple "Post" will work?
If it is simple, how do I know what the names of the dictionary should be for "Values?"
How do I check whether it's working? The write lines at the end are my attempt to see whether my inputs are working properly, but that isn't sufficient.
You may also have a look at splinter for the cases where urllib isn't enough (JS, AJAX, etc.).
For finding out the form parameters, Firebug could be useful.
You need to read and analyze the HTML code of the relevant site. Every browser has decent tools for inspecting the DOM of a site and analyzing the network traffic and requests.
Usually you want to use the mechanize module for performing automated interactions with a website. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, making it hard to "emulate" a human user using Python.
Apart from that: the marriott.com site does not contain an input field named "Location"... so are you guessing URL parameters without having analyzed their forms and functionality?
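For reference, the basic mechanize pattern looks roughly like this. The form index and field name are placeholders you would confirm against the page source, so treat it as a sketch rather than something known to work against marriott.com:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # the site's robots.txt may otherwise block mechanize
br.open("http://marriott.com")
br.select_form(nr=0)         # placeholder: pick the search form by index
br["destinationAddress.destination"] = "New York"  # field name found in the page source
response = br.submit()
print(response.read()[:500])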
What I do to check is use a web-debugging proxy to view the request you send.
First send a real request with your browser and compare that request to the request your script sends, then try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case, "Location" should actually be "destinationAddress.destination".
So look in the HTML code to get the names of the form fields; those names are what should be in the dictionary. Google Chrome and Mozilla Firefox both have tools to view the structure of the HTML (for example, Inspect Element in Google Chrome).
For more info on urllib2, read here.
I really hope this helps :)
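Applied to the snippet from the question, that correction would look something like the following. The URL is left as it was; in practice you would POST to the form's actual action URL, which the proxy capture will show you:
import urllib
import urllib2

# "destinationAddress.destination" is the field name found in the page source;
# the URL should really be the form's action URL, taken from the proxy capture.
url = "http://marriott.com"
values = {'destinationAddress.destination': 'New York'}
data = urllib.urlencode(values)
website = urllib2.Request(url, data)
response = urllib2.urlopen(website)

with open('test.html', 'w') as f:
    f.write(response.read())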

Query website(that requires login information) with input parameters and retrieve results

I'm a noob at this, so can anyone tell me how to log in to a website, fill out forms, and retrieve results that can be parsed into, say, .csv? For instance, a website where you enter certain parameters and the server returns the products that best match your input parameters. I need to retrieve the list of products with their specifications and parse them into a .csv file. Doing that requires me to select certain buttons on the webpage, which seem to be JavaScript objects. I tried mechanize but it doesn't seem to work for JavaScript objects. I would prefer to post my queries in Python. Thanks!
You have two options I can think of:
1) Figure out what the fields generated by the JavaScript are named, or their naming scheme, and submit the values directly to the form handler -- eliminating the need to even deal with the input page (a rough sketch of this follows after the list).
2) Use a framework that emulates a "real" browser and is capable of processing javascript. This question has some suggestions for frameworks, but having never used one, I can't suggest any myself.
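A rough outline of option 1) with requests; every URL and field name here is a placeholder for whatever your browser's network tab shows for the real site:
import csv
import requests

with requests.Session() as session:  # the Session keeps the login cookie around
    # 1) Log in. The URL and field names are placeholders.
    session.post("https://example.com/login",
                 data={"username": "me", "password": "secret"})

    # 2) Submit the search values straight to the form handler, bypassing the
    #    JavaScript-driven input page.
    resp = session.post("https://example.com/search",
                        data={"param1": "value1", "param2": "value2"})

    # 3) Parse the product list out of resp.text (e.g. with BeautifulSoup)
    #    and write the rows to a CSV file.
    with open("products.csv", "w") as out:
        writer = csv.writer(out)
        writer.writerow(["name", "specification"])
        # writer.writerows(parsed_rows)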

How can I scrape this frame?

If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answer lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of http requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the 2nd request.
It could also be that the initial page has buttons and links that get you to the second page. Those links will look something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
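If it does turn out to be cookie-based, the flow is simply two requests in one session. A sketch with requests (urllib2 plus a cookielib.CookieJar, or mechanize, works the same way); the URLs are placeholders for the two links in the question:
import requests

SETUP_URL = "http://..."   # placeholder: the link you have to visit first
TARGET_URL = "http://..."  # placeholder: the link that errors when visited on its own

with requests.Session() as session:  # the Session carries cookies between requests
    session.get(SETUP_URL)           # first request: the server sets its session cookie
    page = session.get(TARGET_URL)   # second request: now the page comes through
    print(page.text[:1000])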
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.
