If you visit this link right now, you will probably get a VBScript error.
On the other hand, if you visit this link first and then the above link (in the same session), the page comes through.
The way this application is set up, the first page is meant to serve as a frame in the second (main) page. If you click around a bit, you'll see how it works.
My question: How do I scrape the first page with Python? I've tried everything I can think of -- urllib, urllib2, mechanize -- and all I get is 500 errors or timeouts.
I suspect the answer lies with mechanize, but my mechanize-fu isn't good enough to crack this. Can anyone help?
It always comes down to the request/response model. You just have to craft a series of HTTP requests such that you get the desired responses. In this case, you also need the server to treat each request as part of the same session. To do that, you need to figure out how the server is tracking sessions. It could be a number of things, from cookies to hidden inputs to form actions, post data, or query strings. If I had to guess, I'd put my money on a cookie in this case (I haven't checked the links). If this holds true, you need to send the first request, save the cookie you get back, and then send that cookie along with the second request.
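If the cookie guess is right, the save-and-resend step could be sketched like this with only the standard library (Python 3). The function names are mine, and the URLs you pass in are whatever the two links from the question are; I haven't run this against that site.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_in_session(opener, urls, timeout=30):
    """Request each URL in order through the same opener.

    Returns the body of the last response (b"" if urls is empty).
    """
    body = b""
    for url in urls:
        with opener.open(url, timeout=timeout) as response:
            body = response.read()
    return body
```

You would call it as fetch_in_session(opener, [main_page_url, frame_page_url]), where both names are placeholders for the main page and the framed page. If the second request still fails, the session is probably tracked some other way.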
It could also be that the initial page will have buttons and links that get you to the second page. Those links will have something like <A href="http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b="> where a lot of the gobbledygook is generated by the first page.
The "Center=RDCC&LogNumber=0197D0820&t=Traffic%20Hazard&l=3358%20MYRTLE&b=" part encodes some session information that you must get from the first page.
And, of course, you might even need to do both.
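To see what such a link actually carries, you can split its query string into named values. This is a small sketch using the href quoted above; the helper name is made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# The example link quoted above.
href = (
    "http://cad.chp.ca.gov/iiqr.asp?Center=RDCC&LogNumber=0197D0820"
    "&t=Traffic%20Hazard&l=3358%20MYRTLE&b="
)


def link_parameters(url):
    """Return the query string of url as a dict of name -> list of values."""
    # keep_blank_values=True keeps empty parameters such as "b=".
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


params = link_parameters(href)
for name, values in params.items():
    print(name, values)
```

In a real scraper you would pull the href out of the first page rather than hard-coding it, since those values are generated there.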
You might also try BeautifulSoup in addition to Mechanize. I'm not positive, but you should be able to parse the DOM down into the framed page.
I also find Tamper Data to be a rather useful plugin when I'm writing scrapers.
I need your help because, for the first time, I'm having trouble getting some information with BeautifulSoup.
I have two problems on this page:
The green GET COUPON CODE button appears only after a few moments (see the GIF capture).
When we inspect the button link, we find a simple href attribute that calls an out.php script, which opens the destination link that I am trying to capture.
GET COUPON CODE
Thank you for your help
Your problem is a little unclear but if I understand correctly, your first problem is that the 'get coupon code' button looks like this when you render the HTML that you receive from the original page request.
Much of this page's markup is rendered dynamically using JavaScript, so that button is missing its href value until it gets loaded in later. You would need to also run the JavaScript on that page to render it after the initial request. You can't really get this easily using just the Python requests library and BeautifulSoup. It will be a lot easier if you also use Selenium, which lets you control a browser so that it runs all that JavaScript for you; you can then read the button info a couple of seconds after loading the page.
There is a way to do this all with plain requests, but it's a bit tedious. You would need to read through the requests the page makes and figure out which one gets the link for the button. The upside to this is it would cut the number of steps to get the info you need and the amount of time it takes to get. You could just use this new request every time to get the right PHP link then just get the info from there.
For your second point, I'm also not sure if I answered it already, but maybe you're also trying to get the redirect link from that PHP link. From inspecting the network requests, it looks like the info is in the response headers (a redirect carries its target in the Location header); there is no body to inspect.
(I know it says 'from cache' but the point is that the redirect is being caused by the header info)
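Reading that header without following the redirect could look roughly like this. It is a sketch using http.client, which never follows redirects on its own; the function names are mine, and I'm assuming the out.php link answers a plain GET with a 3xx status.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def location_if_redirect(status, headers):
    """Return the Location header for a redirect status, else None."""
    if status in REDIRECT_CODES:
        return headers.get("Location")
    return None


def redirect_target(url, timeout=30):
    """Request url once, without following redirects, and return its target."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the path plus query string for the request line.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        return location_if_redirect(response.status, response.headers)
    finally:
        conn.close()
```

Calling redirect_target on the button's href should then give you the destination link, or None if the server did not answer with a redirect.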
Web-scraping adjacent question about URLs acting whacky.
If I go to the Glassdoor job search and fill in 6 fields (Austin, "engineering manager", fulltime, exact city, etc.), I get a results page with 38 results. This is the link I get. Ideally I'd like to save this link with its search criteria and reference it later.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&sc.locationSeoString=austin&locId=1139761&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00
However, if I copy that exact link and paste it into a new tab, it doesn't behave as desired.
It redirects to this different link, which keeps some of the criteria but loses the location, bringing up thousands of results from around the country instead of just Austin.
https://www.glassdoor.com/Job/jobs.htm?sc.generalKeyword=%22engineering+manager%22&fromAge=30&radius=0&minRating=4.0
I understand I could use Selenium to select all 6 fields, but I'd like to understand what's going on here and know whether there is a solution that uses just a URL.
The change of URL seems to happen on the server that is handling the request. I would think the server-side endpoint is configured to trim out extra parameters and redirect you to another URL. There's nothing you can do about this: however you pass it, it will always resolve into the second URL format.
I have also tried a URL shortener, but the same behavior persists.
The only way around this is to use automation such as Selenium to select the same fields and display the results from the first URL.
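To see exactly which criteria the redirect drops, you can parse both URLs from the question and compare their query parameters. This is a small sketch with the standard library only. It also shows that the first URL contains a second "?" (in locT=C?jobType=fulltime), so jobType never shows up as its own parameter. Whether that stray "?" has anything to do with the server's rewrite is a guess on my part, not something I have verified.

```python
from urllib.parse import parse_qs, urlsplit

# The two URLs from the question, split across lines for readability.
first_url = (
    "https://www.glassdoor.com/Job/jobs.htm"
    "?sc.generalKeyword=%22engineering+manager%22"
    "&sc.locationSeoString=austin&locId=1139761"
    "&locT=C?jobType=fulltime&fromAge=30&radius=0&minRating=4.00"
)
second_url = (
    "https://www.glassdoor.com/Job/jobs.htm"
    "?sc.generalKeyword=%22engineering+manager%22"
    "&fromAge=30&radius=0&minRating=4.0"
)


def query_params(url):
    """Return the query string of url as a dict of name -> list of values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


first = query_params(first_url)
second = query_params(second_url)

# Parameter names present before the redirect but missing afterwards.
dropped = sorted(set(first) - set(second))
print("dropped:", dropped)
print("locT as parsed:", first["locT"])
```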
I want to get some information from a web page. I use requests.get to fetch the page, but I cannot find what I want. Checking carefully, I found that the info I want is in a list with a scrollbar. When I drag the scrollbar down, more and more info is loaded. So I guess not all of the info in the list is loaded yet when I get the page with the requests module. I want to know what happens in this process and how I can gather the information I want. (I am not familiar with HTML.)
I want to know what happens in this process
It sounds like when the user scrolls, the scrolling causes some JavaScript (JS) to execute, and the JS makes repeated requests to the server for more data. Unfortunately, the requests module cannot execute the JavaScript on an HTML page; all you get back is the text of the JS. The inability to execute JavaScript on an HTML page in order to retrieve what the user actually sees has been a problem for a long time. Fortunately, smart programmers have largely solved that problem. You need to use a different module. Check out the selenium module.
I am not familiar with HTML
Scraping web pages can get really tricky really fast, and some web pages proactively try to prevent computer programs from scraping their content, so you need to know both HTML and JS in order to figure out what is going on.
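If you would rather not drive a browser, the other route is to make those repeated requests yourself. The sketch below assumes the scroll handler fetches a paged JSON list; the "page" parameter and the list-shaped response are both assumptions, so look up the real request in your browser's network tab first.

```python
import json
import urllib.request
from urllib.parse import urlencode


def fetch_page(base_url, page, timeout=30):
    """Fetch one page of items from a (hypothetical) JSON endpoint."""
    # "page" is an assumed parameter name; the real site may use another.
    url = base_url + "?" + urlencode({"page": page})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_all(fetch, max_pages=100):
    """Call fetch(page) for page = 1, 2, ... until it returns an empty list."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch(page)
        if not batch:
            break
        items.extend(batch)
    return items
```

Usage would be something like collect_all(lambda page: fetch_page(endpoint_url, page)), where endpoint_url is a placeholder for whatever URL the page's JS actually calls.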
I come from a world of scientific computing and number crunching.
I am trying to interact with the internet to compile data so I don't have to. One task is to automatically fill out searches on Marriott.com so I can see what the best deals are all on my own.
I've attempted something simple like:
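# Python 2 (urllib2 became urllib.request in Python 3)
import urllib
import urllib2

url = "http://marriott.com"
values = {'Location': 'New York'}       # guessed form field name
data = urllib.urlencode(values)         # encode the fields as form data
website = urllib2.Request(url, data)    # supplying data makes this a POST
response = urllib2.urlopen(website)
stuff = response.read()
# Save the response body so it can be opened in a browser
with open('test.html', 'w') as f:
    f.write(stuff)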
My questions are the following:
How do you know how the website receives information?
How do I know a simple "Post" will work?
If it is simple, how do I know what the names of the dictionary should be for "Values?"
How to check if it's working? The write lines at the end are an attempt for me to see if my inputs are working properly but that is insufficient.
You may also want to have a look at splinter for cases where urllib is not enough (JS, AJAX, etc.).
For finding out the form parameters, Firebug could be useful.
You need to read and analyze the HTML code of the page in question. Every browser has decent tools for inspecting the DOM of a site and analyzing the network traffic and requests.
Usually you want to use the mechanize module for automating interactions with a web site. There is no guarantee that this will work in every case. Nowadays many websites use AJAX or more complex client-side programming, making it hard to "emulate" a human user using Python.
Apart from that: the marriott.com site does not contain an input field "Location"... so you are guessing URL parameters without having analyzed their forms and functionality?
What I do to check is use a web-debugging proxy to view the request you send.
First send a real request with your browser and compare that request to the one your script sends. Try to make the two requests match.
What I use for this is Charles Proxy.
Another way is to open the HTML file you saved (in this case test.html) in your browser and compare it to the actual response.
To find out what the dictionary should contain, look at the page source and find the names of the form fields you're trying to fill. In your case, "Location" should actually be "destinationAddress.destination".
Here is a picture:
So look in the HTML code to get the names of the form fields; those are what should be in the dictionary. I know that Google Chrome and Mozilla Firefox both have tools to view the structure of the HTML (in the picture I used Inspect Element in Google Chrome).
For more info on urllib2, read here.
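If you'd rather list the field names from a script than read them in the inspector, something like this works on the saved page. It is a Python 3 sketch using the standard html.parser module; the sample form is made up (only the destinationAddress.destination name comes from the real page), so feed it the contents of test.html instead.

```python
from html.parser import HTMLParser


class FieldNameParser(HTMLParser):
    """Collect the name attribute of every input, select, and textarea."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "select", "textarea"):
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def field_names(html):
    """Return the form field names found in html, in document order."""
    parser = FieldNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up sample form; the action URL is hypothetical.
sample = """
<form action="/search" method="post">
  <input type="text" name="destinationAddress.destination">
  <input type="submit" value="Find">
</form>
"""
print(field_names(sample))
```

The names it prints are the keys your values dictionary needs.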
I really hope this helps :)
I am trying to get results for a batch of queries to this demographics tools page: http://adlab.microsoft.com/Demographics-Prediction/DPUI.aspx
The POST action on the form calls the same page (_self) and is probably posting some event data. I read in another post here on Stack Overflow that ASPX pages typically need some viewstate and validation data. Do I simply save these from a request and re-send them in a POST request?
Or is there a cleaner way to do this? One of those ASPX viewstate parameters is about 1000 characters long, and the incredible ugliness of pasting that into my code makes me think there HAS to be a better way. Any and all references to stuff I can read up on will be helpful, thanks!
Perhaps mechanize may be of use.
Use urllib2. Your POST data is a simple Python dictionary. Very easy to edit and maintain.
If your form contains hidden fields -- some of which are encoded -- then you need to do a GET to get the form and the various hidden field seed values.
Once you GET the form, you can add the necessary input values to the given, hidden values and POST the response back again.
Also, you'll have to be sure that you handle any cookies. urllib2 will help with that, also.
After all, that's all a browser does, and it works in a browser. Browsers don't know ASPX from CGI from WSGI, so there's no magic because it's ASPX. You sometimes have to do a GET before a POST to get values and cookies set up properly.
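The GET-then-POST steps might be sketched like this, so the long viewstate string never has to be pasted into your code. It is written for Python 3's standard library rather than urllib2, and the "query" field name in the usage note is hypothetical; the real form's input names have to come from its HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_post_data(form_html, user_values):
    """Merge the form's hidden seed values with your own inputs.

    Returns URL-encoded bytes suitable as the body of the POST.
    """
    parser = HiddenFieldParser()
    parser.feed(form_html)
    parser.close()
    fields = dict(parser.fields)   # e.g. __VIEWSTATE, __EVENTVALIDATION
    fields.update(user_values)     # your inputs win on a name clash
    return urlencode(fields).encode("ascii")
```

You would GET the page, pass its HTML to build_post_data along with something like {"query": "some search"}, and then send the result with urllib.request.urlopen(url, data) through a cookie-aware opener.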
I've used a combination of requests and BeautifulSoup4 for a similar task.