Google serves its homepage to urllib2 when a local search is made - python

When a local search is done on Google and the user then clicks the 'More ...' link below the map, they are brought to a page such as this.
If the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
is copied and pasted back into a browser, one arrives, as expected, at the same page. Likewise, when a browser is opened with WebDriver, directly accessing the URL brings WebDriver to the same page.
When an attempt is made to request the same page with urllib2, however, Google serves its home page (google.com) instead, which means, among other things, that lxml's extraction capabilities cannot be used on the results.
While urllib2 may not be the culprit here (perhaps Google does the same with all headless requests), is there any way of getting Google to serve the desired page? A quick test with the requests library indicates the same issue.

I think the big hint here is in the URL:
https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl
Do you notice the hash character (#) in there? Everything following the hash (the fragment) is never actually sent to the server, so the server can't process it. This indicates (in this case) that the page you are seeing in WebDriver and in your browser is the result of client-side scripting.
When you load the page, your browser sends a request for https://www.google.com/ncr and Google returns the home page. The home page contains JavaScript that analyses the fragment after the hash and uses it to generate the page you expect to see. The browser and WebDriver can do this because they execute the JavaScript. If you disable JavaScript in your browser and go to that link, you'll see that the page isn't generated either.
urllib2, however, does not execute JavaScript. All it sees is the HTML the site initially sent, along with the JavaScript itself, but it can't run the JavaScript that actually generates the page you are expecting.
So Google is serving the page you're asking for; your problem is that urllib2 is not equipped to render it. To fix this, you'll have to use a scraping framework that supports JavaScript. Alternatively, in this particular case, you could simply use the non-JavaScript version of Google for your scraping.
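To see concretely what the server receives, here is a minimal sketch using Python 3's urllib.parse (the Python 2 equivalent is the urlparse module): everything after the # is the fragment, which the client keeps to itself.
from urllib.parse import urlsplit

url = "https://www.google.com/ncr#q=chiropractors%2BNew+York,+NY&rflfq=1&rlha=0&tbm=lcl"
parts = urlsplit(url)
print(parts.netloc + parts.path)  # www.google.com/ncr -- all the server is ever asked for
print(parts.fragment)             # q=chiropractors... -- never sent; consumed by client-side JS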

Related

Requests: how to use two different web sites and switch between them?

Hello, is there a way to use two different website URLs and switch between them?
I mean, I have two different websites, like:
import requests

session = requests.Session()  # note: requests.Session(), not request.session()
firstPage = session.get("https://stackoverflow.com")
print("Hey! I'm on the first page now!")
secondPage = session.get("https://youtube.com")
print("Hey! I'm on the second page now!")
I know a way to do it in Selenium, like this: driver.switch_to.window(driver.window_handles[1])
But I want to do it in Requests. So, is there a way to do it?
Selenium and Requests are two fundamentally different tools. Selenium drives a real (optionally headless) browser, fully simulating a user. Requests is a Python library which simply sends HTTP requests.
Because of this, Requests is particularly good for scraping static data and data that does not involve JavaScript rendering (through jQuery or similar), such as RESTful APIs, which often return JSON-formatted data (with no HTML styling or page rendering at all). With Requests, after the initial HTTP request is made, the data is saved in an object and the connection is closed.
Selenium allows you to traverse complex, JavaScript-rendered menus and the like, since each page is actually built (under the hood) as if it were being displayed to a user. Selenium encapsulates everything your browser does (including the HTTP requests that Requests is built to perform). After connecting to a page with Selenium, the session remains open. This allows you to navigate through a complex site where you would otherwise need the full URL of the final page to use Requests.
Because of this distinction, it makes sense that Selenium would have a switch_to.window method but Requests would not. The way your code is written, you can access the responses to the HTTP GET calls you've made directly through your variables (firstPage contains the response from Stack Overflow, secondPage contains the response from YouTube). While using Requests, you are never "in" a page in the sense that you can be with Selenium, since it is an HTTP library and not a full browser.
Depending on what you're looking to scrape, it might be better to use either Requests or Selenium.
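A minimal sketch of that point, using the asker's two URLs: both responses remain available at once, so there is nothing to "switch" to.
import requests

session = requests.Session()
firstPage = session.get("https://stackoverflow.com")
secondPage = session.get("https://youtube.com")

# Both responses are plain objects held in variables; neither one is "current".
print(firstPage.status_code, firstPage.url)
print(secondPage.status_code, secondPage.url)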

Posting to ASP.NET URL keeps bringing me back to the same page via Python requests

I am using Python Requests to hit a site that uses the ASP.NET framework. One of the URLs is giving me trouble: the response is the exact same page that I POSTed to. The browser does not behave this way with the same URL; it refreshes with a new URL and all (though I do not think it is technically redirecting). What are some ways I can troubleshoot this? I would provide code and links, but it is a secured website and requires authentication/subscription.
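One common cause worth checking (an assumption, since the site can't be shown): ASP.NET WebForms pages usually re-render the same page when a POST is missing the hidden __VIEWSTATE and __EVENTVALIDATION fields. A minimal sketch of fetching and echoing those fields back with requests and BeautifulSoup, using a hypothetical URL and a hypothetical visible field name:
import requests
from bs4 import BeautifulSoup

url = "https://example.com/secure/form.aspx"  # hypothetical URL
session = requests.Session()

# 1. GET the form so the server-issued hidden fields can be read out.
soup = BeautifulSoup(session.get(url).text, "html.parser")
payload = {
    "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
    "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
    "txtSearch": "query",  # hypothetical visible form field
}

# 2. POST the payload back, hidden fields included.
response = session.post(url, data=payload)
print(response.url, response.status_code)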

Open URL in IE, then scrape the same page using Python

I am a novice in Python (a C++ developer), trying to get some hands-on web-scraping experience on Windows with IE.
The problem I am facing is that when I open a URL using the "requests" library, the server always sends me a login page. I figured out the problem: the server presumes you are coming through IE and tries to execute a function which uses information from the SSO (single sign-on) object running in the background on Windows since the first login to the web server (consider this a somewhat unusual setup).
On observing this I changed my strategy and started using the webbrowser library.
Now, when I try webbrowser.open("url"), the browser opens the page properly, which is great!
But my problems now are:
1) I do not want the opened browser page to be visible to the user (i.e. the browser should open in the background). I tried this:
ie = webbrowser.BackgroundBrowser(webbrowser.iexplore)
ie.Visible = 0  # has no effect: BackgroundBrowser instances have no Visible attribute
ie.open('url')
but with no success: the page it opens is still visible to the user.
2) [This is the main activity] I want to scrape the page opened in the IE window above. How can I do that?
I tried to dig into this link but did not find any APIs for getting the data.
Kindly help.
PS: I tried to use Beautiful Soup for scraping some other web pages fetched with requests. That was successful and I got the data I wanted, but not in this case.
The webbrowser module doesn't let you do that. The get function you mentioned retrieves registered web browsers; it does not scrape the result of an HTTP GET request.
I don't know what is triggering the behavior you described with IE. Have you tried changing your User-Agent to an IE one? You can check this post for more details: Sending "User-agent" using Requests library in Python
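A minimal sketch of that suggestion (the URL is hypothetical, and the exact IE User-Agent string your server expects is an assumption):
import requests

# An IE 11-style User-Agent string; adjust to whatever IE version the server checks for.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}
response = requests.get("https://example.com/protected-page", headers=headers)  # hypothetical URL
print(response.status_code)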

Scraping site that uses AJAX

I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and the user must press "Show more" to get 10 more reviews (which also appends #add10 to the end of the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that in my spider I get only the first 10 reviews using site_url#add1000, just as with site_url, so this approach doesn't work.
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y' and I tried all of these:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all, headers={all_headers}, cookies={all_cookies})
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
What you need here is a browser-automation tool, since the requests module cannot execute the JavaScript behind the AJAX calls. One such tool is Selenium, which can also run the browser headlessly.
For example:
driver.find_element_by_id("show more").click() # just an example case, using the pre-Selenium-4 API
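A fuller sketch of that idea with the current Selenium 4 API (the URL and element id are assumptions, not the asker's real site):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/reviews")        # hypothetical reviews page
driver.find_element(By.ID, "show_more").click()  # hypothetical button id
html = driver.page_source                        # rendered page, including the extra reviews
driver.quit()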
Normally, when you scroll down the page, AJAX sends a request to the server, and the server then responds with a JSON/XML file that your browser uses to refresh the page.
You need to figure out the URL of this JSON/XML request. Normally, you can open Firefox, open Tools > Web Developer > Network, monitor the network activity, and easily catch this JSON/XML request.
Once you find it, you can parse the reviews from it directly (I recommend the Python modules requests and bs4 for this) and save a huge amount of time. Remember to use some different clients and IPs. Be nice to the server and it won't block you.
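A minimal sketch of that approach, assuming a hypothetical JSON endpoint shaped like the URL above (the parameter names, header, and "reviews" key are all assumptions):
import requests

url = "https://example.com/ajaxlst"  # hypothetical endpoint found in the network tab
params = {"par1": "x", "par2": "y"}
headers = {"X-Requested-With": "XMLHttpRequest"}  # many AJAX endpoints expect this header

resp = requests.get(url, params=params, headers=headers)
data = resp.json()  # assumes the endpoint returns JSON
for review in data.get("reviews", []):  # key name is an assumption
    print(review)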

Urllib Counts as WebPage Hit?

I'm studying how to open pages with Python (3.3).
import urllib.request

url = 'http://www.google.com'
page = urllib.request.urlopen(url)
Does the above code count as one hit to Google, or does this?
import os

os.system('start chrome.exe google.com')
The first one scrapes the page while the second one actually opens the page in a browser. I was just wondering if it makes a difference hit-wise?
Both do very different things.
Using urllib.request.urlopen makes a single HTTP request.
Your second example will do the same, and then the browser will parse the document it receives and request subsequent resources (images/JavaScript/CSS/whatever). So loading google.com in your browser will trigger many hits.
Try it yourself by looking in your browser's developer tools (usually in the Network section) while you load a page.
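To make the difference concrete, here is a minimal sketch (standard library only) that makes exactly one request for the HTML and merely counts the tags that would cause a browser to make further requests:
import urllib.request
from html.parser import HTMLParser

class ResourceCounter(HTMLParser):
    """Counts tags that typically trigger extra requests in a real browser."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Rough count: not every img/script/link tag actually fetches a resource.
        if tag in ("img", "script", "link"):
            self.count += 1

html = urllib.request.urlopen('http://www.google.com').read().decode('utf-8', 'replace')
counter = ResourceCounter()
counter.feed(html)
print("one hit made; a browser would likely make about", counter.count, "more")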
