How to use urllib to fill out forms and gather data? - python

I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to do it by hand. One task is to auto-fill searches on Marriott.com so I can see what the best deals are on my own.
I've attempted something simple like
import urllib
import urllib2
url = "http://marriott.com"
values = {'Location': 'New York'}
data = urllib.urlencode(values)          # encode the form fields
website = urllib2.Request(url, data)     # supplying data makes this a POST
response = urllib2.urlopen(website)
stuff = response.read()
f = open('test.html', 'w')
f.write(stuff)
f.close()
My questions are the following:
How do you know how the website receives information?
How do I know a simple POST will work?
If it does, how do I know what the keys of the "values" dictionary should be?
How do I check whether it's working? The write lines at the end are my attempt to see whether my inputs are working properly, but that is insufficient.

You may also want to have a look at splinter for cases where urllib is not useful (JS, AJAX, etc.).
For finding out the form parameters, Firebug can be useful.
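For example, a minimal splinter sketch might look like the following; note that the field name and submit-button selector below are placeholders, not Marriott's real form names, so you still have to inspect the page to find the actual ones.
from splinter import Browser

with Browser('chrome') as browser:                                # drives a real browser, so JS works
    browser.visit('https://www.marriott.com/')
    browser.fill('destinationAddress.destination', 'New York')    # assumed field name
    browser.find_by_css('button[type="submit"]').first.click()    # assumed selector
    html = browser.html                                           # the page after JS has run
    with open('test.html', 'w') as f:
        f.write(html)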

You need to read and analyze the HTML code of the relevant site. Every browser has decent tools for inspecting the DOM of a page and analyzing the network traffic and requests.
Usually you want to use the mechanize module for performing automated interactions with a website. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, making it hard to "emulate" a human user with Python.
Apart from that: the marriott.com site does not contain an input field "Location"... so you are guessing URL parameters without having analyzed their forms and functionality?
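For what it's worth, a minimal mechanize sketch looks roughly like this; the form index and field name are assumptions you would have to verify against the actual page.
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)                            # the site may disallow robots
br.open('http://www.marriott.com/')
br.select_form(nr=0)                                   # assumed: the search form is the first form
br['destinationAddress.destination'] = 'New York'      # assumed field name
response = br.submit()
open('test.html', 'w').write(response.read())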

What I do to check is use a web-debugging proxy to view the request you send.
First send a real request with your browser and compare that request to the request your script sends, then try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill in. In your case, "Location" should actually be "destinationAddress.destination".
So look in the HTML code to get the names of the form fields; those are what should be in the dictionary. Both Google Chrome and Mozilla Firefox have tools to view the structure of the HTML (I used Inspect Element in Google Chrome).
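Putting that together with your original snippet, a hedged version would look something like this; the action URL and the exact field name are assumptions you would need to confirm in the page source or the Network tab.
import urllib
import urllib2

url = 'http://www.marriott.com/search/findHotels.mi'             # assumed action URL
values = {'destinationAddress.destination': 'New York'}          # assumed field name
data = urllib.urlencode(values)
request = urllib2.Request(url, data)                             # passing data makes this a POST
response = urllib2.urlopen(request)
with open('test.html', 'w') as f:
    f.write(response.read())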
For more info on urllib2, read here.
I really hope this helps :)

Related

Web scraping using python, how to deal with ngif?

I'm trying to read the price of a fund which is not available through an API. The fund is listed here https://bors.e24.no/#!/instrument/KL-AFMI2.OSE
At first I thought this would be a simple task, so I looked at BeautifulSoup, but realized that what I wanted was not returned. As far as I can tell this is due to the:
<!-- ngIf: $root.allowStreamingToggle -->
I'm a beginner so hoping someone can help me with an easy way to get this value.
I see JSON being returned from the following endpoint in the Network tab:
import requests
headers = {'user-agent': 'Mozilla/5.0'}
r = requests.get('https://bors.e24.no/server/components/graphdata/(PRICE)/DAY/KL-AFMI2.OSE?points=500&stop=2019-07-30&period=1weeks', headers = headers).json()
The price is then:
r['rows'][0]['values']['series']['c1']['data'][3][1]
The tag "ngIf" almost certainly means that the website you are attempting to scrape is an AngularJS app... in which case, the data is almost certainly NOT in the HTML page you are pulling and attempting to parse with BeautifulSoup.
Rather, the page is probably pulling the data later -- say, via AJAX -- and then rendering it INTO the page via Angular's client-side code.
If all that is right... then BeautifulSoup is not the right tool.
You might have some hope if you can identify the AJAX call that the page is calling, then call THAT directly. Inspect it to see the data structure; if you are lucky maybe it is JSON and then super easy to parse. If that looks promising, then you can probably simply use the requests library, and skip BeautifulSoup. But you have to do the reverse engineering to figure out WHAT you should be calling.
Here, try this: I did a little snooping with the browser console. Is this the data you are looking for? get info for KL-AFMI2.OSE
If so, then just use that URL directly in requests.

Python get request and retrieving data from search

I'm trying to use the requests module to retrieve data from this website:
https://toelatingen.ctgb.nl/
I want to receive the data that is found when I put "11462" in the "Zoekterm" field, for example.
data = { "searchTerm": "11462"}
session = requests.Session()
r = session.post('https://toelatingen.ctgb.nl/',data=data)
body_data = r.text
Unfortunately, the content of body_data does not contain the information I searched for.
Thanks for helping me.
The reason you're not getting the response data is that the site doesn't do the search at that URL. Instead it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview.
When you're trying to get information off the internet the first thing you want to do is check how your web browser is getting the data. If you open up whatever inspection tool comes with your browser of choice (typically the hotkey is ctrl+shift+i), you should be able to find a Network tab that tracks the requests and responses the browser makes. Once you have that open, get your browser to display the information you want and watch the Network Tab while it's doing so. Check whatever responses come up to find the one that has the information you want and then replicate the request your browser used.
In your case:
The site first loads an empty page from https://toelatingen.ctgb.nl/
It then loads a bunch of static files (mostly woff and js; these are used for styling the web page and handling different procedures).
Then it makes a call to https://toelatingen.ctgb.nl/nl/admissions/overview. We can be pretty sure this is the call we want, because the response is JSON containing the information we see displayed on the screen.
We then copy out all the information (headers and form fields, line for line) from that request, plug it in, and see if the requests module returns the same JSON.
If it doesn't then that most likely means we're missing something (most often a CSRF Token or a special Accept-Encoding) and we need to do some more tinkering.
I would also recommend taking a little bit of time to prune out parts of the request data/headers: most of the time they contain extra terms that the server doesn't actually need. This will save space and give you a better idea of what parts of the request you can change.
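As a rough sketch under those assumptions (the payload key and the extra header below are guesses you would need to verify against the request your browser actually sends):
import requests

session = requests.Session()
session.get('https://toelatingen.ctgb.nl/')                         # pick up any cookies first
payload = {'searchTerm': '11462'}                                   # assumed form field name
resp = session.post('https://toelatingen.ctgb.nl/nl/admissions/overview',
                    data=payload,
                    headers={'X-Requested-With': 'XMLHttpRequest'}) # assumed header
print(resp.json())                                                  # assuming the server returns JSON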

How do I post data to a website's search bar using a Python script?

As we all know, in web applications we have the GET method and the POST method for sending data.
My problem is with posting data.
For example, I want my Python code to access a website's search bar, insert some values and submit them (press the site's button), then check the resulting page.
What would the code look like, and is there any documentation about these Python concepts?
I am totally confused.
Note: I am just a beginner in Python.
If the website relies on javascript, you're going to need to use something like Selenium which will emulate a typical browser and allow you to insert information onto a page and execute javascript commands.
If, however, the search bar simply posts data to a URL, you can determine that URL and then use requests to post the data and retrieve the result.
import requests
resp = requests.post('http://website/search', data={'term': 'value'})
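If the page does turn out to need JavaScript, a minimal Selenium sketch would look something like the following; the URL and the element name are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()                      # needs chromedriver installed
driver.get('http://website/')                    # placeholder URL
box = driver.find_element(By.NAME, 'term')       # placeholder field name
box.send_keys('value')
box.send_keys(Keys.RETURN)                       # submit the search
html = driver.page_source                        # the page after JS has run
driver.quit()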

Python requests vs. urllib2

I have used the requests library many times and I know it has a ton of advantages. However, I was trying to retrieve the following Wikipedia page:
https://en.wikipedia.org/wiki/Talk:Land_value_tax
and requests.get retrieves it partially:
import requests
response = requests.get('https://en.wikipedia.org/wiki/Talk:Land_value_tax', verify=False)
html = response.text
I tried it using urllib2 and urllib2.urlopen and it retrieves the same page completely:
import urllib2
html = urllib2.urlopen('https://en.wikipedia.org/wiki/Talk:Land_value_tax').read()
Does anyone know why this happens and how to solve it using requests?
By the way, looking at the number of times this post has been viewed, I realized that people are interested in knowing the differences between these two libraries. If anyone knows about other differences, I'd appreciate it if they edited this question or posted an answer and added those differences.
It seems to me the problem lies in the scripting on the target page. The JS-driven content is rendered in there (in particular I found calls to MediaWiki), so you need to look at a web sniffer to identify it.
What to do? If you want to retrieve the whole page content, you had better plug in one of the libraries that evaluate in-page JavaScript. Read more here.
Update
I am not interested in retrieving the whole page and statistics or JS libraries retrieved from MediaWiki. I only need the whole content of the page (through scraping, not MediaWiki API).
The issue is that those JS calls to other resources (including MediaWiki) are what make it possible to render the WHOLE page on the client. But since the library does not support JS execution, the JS is not executed, so parts of the page are never loaded from those other resources, and the target page is not whole.
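As a quick sanity check (a hedged sketch; the counts are only roughly comparable because .text is decoded text while .read() returns raw bytes), you can compare what the two libraries actually receive:
import requests
import urllib2

url = 'https://en.wikipedia.org/wiki/Talk:Land_value_tax'
via_requests = requests.get(url).text
via_urllib2 = urllib2.urlopen(url).read()
print(len(via_requests))    # if the two sizes are close, both got the same raw HTML,
print(len(via_urllib2))     # and the missing parts are rendered later by JavaScript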

Submitting queries to, and scraping results from aspx pages using python?

I am trying to get results for a batch of queries to this demographics tools page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx
The POST action on the form calls the same page (_self) and is probably posting some event data. I read on another post here at stackoverflow that aspx pages typically need some viewstate and validation data. Do I simply save these from a request and re-send them in the POST request?
Or is there a cleaner way to do this? One of those aspx viewstate parameters is about a 1000 characters and the incredible ugliness of pasting that into my code makes me think there HAS to be a better way. Any and all references to stuff I can read up will be helpful, thanks!
Perhaps mechanize may be of use.
Use urllib2. Your POST data is a simple Python dictionary. Very easy to edit and maintain.
If your form contains hidden fields -- some of which are encoded -- then you need to do a GET to get the form and the various hidden field seed values.
Once you GET the form, you can add the necessary input values to the given, hidden values and POST the response back again.
Also, you'll have to be sure that you handle any cookies. urllib2 will help with that, also.
After all, that's all a browser does, and it works in a browser. Browsers don't know ASPX from CGI from WSGI, so there's no magic just because it's ASPX. You sometimes have to do a GET before a POST to get values and cookies set up properly.
I've used a combination of requests and BeautifulSoup4 for a similar task.
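A hedged sketch of that approach is below; __VIEWSTATE and __EVENTVALIDATION are the standard ASP.NET hidden fields, but the query field name is a placeholder you would have to read out of the actual form.
import requests
from bs4 import BeautifulSoup

url = 'http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx'
session = requests.Session()

# GET the form first so we can scrape the hidden ASP.NET state fields (and pick up cookies).
soup = BeautifulSoup(session.get(url).text, 'html.parser')
data = {field['name']: field.get('value', '')
        for field in soup.find_all('input', type='hidden') if field.get('name')}
data['txtKeyword'] = 'example query'        # placeholder: use the real input name from the form

# POST back to the same page with the hidden values plus our input.
result = session.post(url, data=data)
print(result.status_code)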
