Python urllib2 open URL and wait some time - python

Here is the situation: I want to access the content of a URL in Python via urllib2.
import urllib2

url = 'http://www.iwanttoknowwhatsinside.com'
hdr = {
    'User-Agent': 'OpenAnything/1.0 +http://somepage.org/',
    'Connection': 'keep-alive'
}
request = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
HTML = opener.open(request).read()
This code normally works fine. But if I access a certain page via a web browser, it says something like "Checking your browser before accessing ... Your browser will be redirected shortly" and then the page loads. The URL never changes. Edit: after that I can freely click around on the page, or open a second tab with the same URL; I only have to wait before the initial access.
If I try to access this page via Python, I instantly get urllib2.HTTPError: Service Temporarily Unavailable, so I figure urllib2 doesn't wait out that time. Is there a way to force some wait time before exceptions are thrown or the content is retrieved? Or am I looking at this the wrong way?
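urllib2 has no built-in wait-and-retry, but one approach is to catch the HTTPError and retry after a delay. A minimal sketch, reusing the request object from above (the retry count and delay are arbitrary assumptions):

import time
import urllib2

def open_with_retries(request, retries=5, delay=10):
    # keep cookies between attempts, in case the server sets one
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    for attempt in range(retries):
        try:
            return opener.open(request).read()
        except urllib2.HTTPError as e:
            if e.code != 503 or attempt == retries - 1:
                raise  # not the temporary error, or out of attempts
            time.sleep(delay)

HTML = open_with_retries(request)

Note, however, that a plain retry cannot pass a JavaScript-based browser check: the wait in a real browser involves executing JavaScript and receiving a cookie, so if that is what the site does, a real browser engine (e.g. Selenium) is likely needed.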

Related

Can't get the html of a page python

So I have been trying to solve this for the past 3 days and just can't figure out why it happens.
I'm trying to access the HTML of a site that requires login first.
I tried every way I could think of, and every attempt runs into the same problem.
Here is what I tried:
import requests

# first attempt: GET the internal page directly
response = requests.get('https://de-legalization.tlscontact.com/eg/CAI/myapp.php',
                        headers=headers, params=params, cookies=cookies)
print(response.content)

# second attempt: log in first, then reuse the session
payload = {
    '_token': 'TOKEN HERE',
    'email': 'EMAIL HERE',
    'pwd': 'PASSWORDHERE',
    'client_token': 'CLIENT_TOKEN HERE'
}
with requests.Session() as s:
    r = s.post(login_url, data=payload)
    print(r.text)
I also tried using urllib, but all attempts return this:
<script>window.location="https://de-legalization.tlscontact.com/eg/CAI/index.php";</script>
Does anyone know why this is happening?
Also here is the url of the page I want the html of:
https://de-legalization.tlscontact.com/eg/CAI/myapp.php
You see this particular output because it is in fact the content of the page you are downloading.
You can test it in chrome by opening the following url:
view-source:https://de-legalization.tlscontact.com/eg/CAI/myapp.php
This is happening because you are being redirected by the JavaScript code on the page.
Since the page you are trying to access requires login, you cannot reach it just by sending an HTTP request to the internal page.
You either need to extract all the cookies from a logged-in browser session and add them to the Python script, or you need to use a tool like Selenium that lets you control a browser from your Python code.
Here you can find how to extract all the cookies from the browser session:
How to copy cookies in Google Chrome?
Here is how to add cookies to an HTTP request in Python:
import requests
cookies = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookies)
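If extracting cookies is not enough, a minimal Selenium sketch of the login flow could look like this (assuming chromedriver is installed; the field names 'email' and 'pwd' come from the payload above, but the submit-button selector is hypothetical and must match the real form):

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://de-legalization.tlscontact.com/eg/CAI/index.php')
driver.find_element_by_name('email').send_keys('EMAIL HERE')
driver.find_element_by_name('pwd').send_keys('PASSWORD HERE')
driver.find_element_by_css_selector('button[type=submit]').click()  # hypothetical selector
# the session is now logged in, so the internal page should render
driver.get('https://de-legalization.tlscontact.com/eg/CAI/myapp.php')
html = driver.page_source
driver.quit()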

Request not returning same data as browser

Trying to get some values from Duolingo using Python, but urllib gives me something different from what I see when I navigate to the URL in my browser.
Navigating to a url (https://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday) via browser gives: {"xpGoalMetToday": false}.
However, trying via the below script:
import urllib.request

url = 'http://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday'
user_agent = '[insert my local user agent copied from browser attempt]'
# request with a browser User-Agent and caching disabled
headers = {'User-Agent': user_agent, 'Cache-Control': 'no-cache, max-age=0'}
req = urllib.request.Request(url, None, headers)
print(urllib.request.urlopen(req).read())
returns just a blank {}.
As you can tell from the above, I've tried a couple of things: adding a user agent and cache control headers. I've even tried the requests module and adding authentication (didn't work).
Any ideas? Am I missing something?
Actually, when I open the link in the browser it shows me {} as well.
Maybe you have some kind of cookie set in your browser?
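If a cookie is the difference, one way to test is to copy the session cookie from the browser's developer tools and attach it to the request. A sketch (the cookie name 'jwt_token' is an assumption; use whatever your browser actually sends):

import urllib.request

url = 'https://www.duolingo.com/2017-06-30/users/215344344?fields=xpGoalMetToday'
headers = {
    'User-Agent': 'Mozilla/5.0',  # any browser-like agent
    'Cookie': 'jwt_token=PASTE_VALUE_FROM_BROWSER',  # hypothetical cookie name
}
req = urllib.request.Request(url, None, headers)
print(urllib.request.urlopen(req).read())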

python requests cannot get html

I tried to get the HTML code from a site named dcinside in Korea. I am using requests but cannot get the HTML code.
And this is my code:
import requests
url = "http://gall.dcinside.com/board/lists/?id=bitcoins&page=1"
req = requests.get(url)
print(req)
print(req.content)
but the result contained no HTML.
Why can't I get the HTML code even when using requests?
Most likely they are detecting that you are trying to crawl data programmatically and are not returning any content. Try pretending to be a browser by passing a User-Agent header.
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'
}
response = requests.get(url, headers=headers)
# use an authentic Mozilla or Chrome User-Agent string if this doesn't work
Take a look at this:
Python Web Crawlers and "getting" html source code
As suggested in that post, you could also use urllib2, which lets you easily fetch web resources.
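For reference, a urllib2 version with a browser-style User-Agent might look like this (a sketch; substitute a current, authentic User-Agent string):

import urllib2

req = urllib2.Request(
    'http://gall.dcinside.com/board/lists/?id=bitcoins&page=1',
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
print(urllib2.urlopen(req).read()[:200])  # first part of the page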

Selenium Webdriver / Beautifulsoup + Web Scraping + Error 416

I'm doing web scraping using Selenium WebDriver in Python with a proxy.
I want to browse more than 10k pages of a single site with this scraper.
The issue: with this proxy I'm able to send a request only once. When I send another request on the same link, or on another link on this site, I get a 416 error (my IP seems to be blocked by a firewall) for 1-2 hours.
Note: I'm able to scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping.
Here is the code:
from selenium import webdriver
import time

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()

browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)

elements = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
for ele in elements:
    print ele.get_attribute('href')

browser.quit()
Any solution?
Selenium wasn't helpful for me, so I solved the problem with BeautifulSoup. The website blocks a proxy as soon as it receives a request from it, so I continuously change the proxy URL and User-Agent whenever the server blocks the current one.
I'm pasting my code here:
from bs4 import BeautifulSoup
import urllib2

url = 'http://terriblewebsite.com/'

# create a URL opener that routes requests through the proxy
proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

# fetch the page with a browser-like User-Agent
request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)
data = result.read()

soup = BeautifulSoup(data, 'html.parser')
ptag = soup.find('p', {'class': 'text-primary'}).text
print ptag
Note:
- Change the proxy and User-Agent, and use only recently updated proxies.
- A few servers accept proxies from specific countries only; in my case I used proxies from the United States.
- This process might be slow, but you can still scrape the data.
Going through the 416 error issues in the following links, it seems that some cached information (cookies, maybe) is creating the issue. You are able to send a request the first time, but subsequent requests fail.
https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error
416 Requested Range Not Satisfiable
Try not saving cookies at all by setting a preference, or delete the cookies after every request:
profile.set_preference("network.cookie.cookieBehavior", 2)

Get HTML source, including result of javascript and authentication

I am building a web scraper and need to get the HTML page source as it actually appears on the page. However, I only get a limited HTML source, one that does not include the needed info. I think I am either seeing it before the JavaScript has run, or maybe I'm not getting the full info because I don't have the right authentication. My result is the same as Chrome's "view source", when what I want is what Chrome's "inspect element" shows. My test case is cimber.dk after entering flight information and searching.
I am coding in Python and tried the urllib2 library. Then I heard that Selenium was good for this, so I tried that too. However, that also gets me the same limited page source.
This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk, so I was starting with a clean slate.)
import urllib
import urllib2

url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY': 'D', ...}  # one for each value

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
# Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same.
urllib2.install_opener(opener)

request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....)  # one for each header, also the cookie one

p = urllib.urlencode(values)
data = opener.open(request, p).read()
# data is now the limited source, like Chrome's "View Source"

# I tried to add the following in some vain attempt to do a redirect.
# The result is always "HTTP Error 400: Bad request".
f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')
data = f.read()
f.close()
Most libraries like this do not support JavaScript.
If you want JavaScript, you will need to either automate an existing browser or browser engine, or use a really monolithic, beefy library that is essentially an advanced web crawler.
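With Selenium, the usual trick is to wait until the JavaScript has actually produced the content before reading page_source; otherwise you still see only the initial HTML. A sketch (the CSS selector is hypothetical; wait for whichever element really holds the results):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get('https://www.cimber.dk/booking/')
# ... fill in and submit the search form here, then:
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.results')))  # hypothetical selector
html = browser.page_source  # now includes JavaScript-generated markup
browser.quit()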
