I found that you can't read from some sites using Python's urllib2(or urllib). An example...
urllib2.urlopen("http://www.dafont.com/").read()
# Returns ''
These sites work when you visit the site with a browser. I can even scrape them using PHP(didn't try other languages). I have seen other sites with the same issue - but can't remember the URL at the moment.
My questions are...
What is the cause of this issue?
Any workarounds?
I believe it gets blocked by the User-Agent. You can change User-Agent using the following sample code:
USERAGENT = 'something'
HEADERS = {'User-Agent': USERAGENT}
req = urllib2.Request(URL_HERE, headers=HEADERS)
f = urllib2.urlopen(req)
s = f.read()
f.close()
Try setting a different user agent. Check the answers in this link.
I'm the guy who posted the question. I have some suspicions - but not sure about them - that's why I posted the question here.
What is the cause of this issue?
I think its due to the host blocking the urllib library using robot.txt or htaccess. But not sure about it. Not even sure if its possible.
Any workaround for this issue?
If you are in Unix, this will work...
contents = commands.getoutput("curl -s '"+url+"'")
Related
I am coding a web scraper for the website with the following Python code:
import requests
def scrape(url):
req = requests.get(url)
with open('out.html', 'w') as f:
f.write(req.text)
It works a few times but then an error HTML page is returned by the website (when I open my browser, I have a captcha to complete).
Is there a way to avoid this “ban” by for example changing the IP address?
As already mentioned in the comments and from yourself, changing the IP could help. To do this quite easily have a look at vpngate.py:
https://gist.github.com/Lazza/bbc15561b65c16db8ca8
An How to is provided at the link.
You can use a proxy with the requests library. You can find some free proxies at a couple different websites like https://www.sslproxies.org/ and http://free-proxy.cz/en/proxylist/country/US/https/uptime/level3 but not all of them work and they should not be trusted with sensitive information.
example:
proxy = {
"https": 'https://158.177.252.170:3128',
"http": 'https://158.177.252.170:3128'
}
response=requests.get('https://httpbin.org/ip', proxies=proxy)
I recently answered this on another question here, but using the requests-ip-rotator library to rotate IPs through API gateway is usually the most effective way.
It's free for the first million requests per region, and it means you won't have to give your data to unreliable proxy sites.
Late answer, I found this looking for IP-spoofing, but to the OP's question - as some comments point out, you may or may not actually be getting banned. Here's two things to consider:
A soft ban: they don't like bots. Simple solution that's worked for me in the past is to add headers, so they think you're a browser, e.g.,
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
On-page active elements, scripts or popups that act as content gates, not a ban per se - e.g., country/language selector, cookie config, surveys, etc. requiring user input. Not-as-simple solution: use a webdriver like Selenium + chromedriver to render the page including JS and then add "user" clicks to deal with the problems.
I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about those procedures but appearently this inteferes with my method of getting the html content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import responses
#...
response = requests.get(url, proxies=proxies)
content = requests.text
Where the website i am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site I receive the html-content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what website is really doing and lack real Web-Developement experience I could not find a solution so far, even if a similar question might have been asked before. Is there any solution to access the content of this website via Python?
startr = requests.get('https://viennaairport.com/login/')
secondr = requests.post('http://xxx/', cookies=startr.cookies)
I'm trying to get the source code of a page by using:
import urllib2
url="http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560"
page =urllib2.urlopen(url)
data=page.read()
print data
and also by using a user_agent(headers)
I did not succeed to get the source code of the page!
Have you guys any ideas what can be done?
Thanks in Advance
I tried it and the requests works, but the content that you receive says that your browser must accept cookies (in french). You could probably get around that with urllib2, but I think the easiest way would be to use the requests lib (if you don't mind having an additional dependency).
To install requests:
pip install requests
And then in your script:
import requests
url = 'http://france.meteofrance.com/france/meteo?PREVISIONS_PORTLET.path=previsionsville/750560'
response = requests.get(url)
print(response.content)
I'm pretty sure the source code of the page will be what you expect then.
requests library worked for me as Martin Maillard showed.
Also in another thread I have noticed this note by leoluk here:
Edit: It's 2014 now, and most of the important libraries have been
ported and you should definitely use Python 3 if you can.
python-requests is a very nice high-level library which is easier to
use than urllib2.
So I wrote this get_page procedure:
import requests
def get_page (website_url):
response = requests.get(website_url)
return response.content
print get_page('http://example.com')
Cheers!
I tried a lot of things, "urllib" "urllib2" and many other things, but one thing worked for me for everything I needed and solved any problem I faced. It was Mechanize .This library simulates using a real browser, so it handles a lot of issues in that area.
I am a novice programmer attempting to access google insights using python. I can access sites which dont require cookies fine, but i cant seem to properly pass the cookies along. The cookines file was exported from mozilla firefox, is in the Z: drive which is also where im running python from.
Im also pretty sure my code for saving the file could be better done than reading and writing but I dont know how to do that either. Any helpo would be appreciated.
import urllib2
import cookielib
import os
url = "http://www.google.com/insights/search/overviewReport?q=eagles%2Ccsco&geo=US&cmpt=q&content=1&export=2"
cj = cookielib.MozillaCookieJar()
cj.load('cookies6.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
file = opener.open(url)
output = open('test2.csv','wb')
output.write(file.read())
output.close()
I haven't tested your code however:
As far as I can tell there seems to be nothing wrong with your code
I've tried the url you're searching and had no problems downloading the csv without any cookies
In my previous experience with google, you might be looking at the problem the wrong way, it is not that you don't have the right cookies but that google automatically blocks requests from bots. If this is the case you must replace the user agent http header to mimic an actual browser. Beware however that this is against googles terms of service and if you make too many requests per minute google will block all requests from your ip for about 8h.
I've recently written this with help from SO. Now could someone please tell me how to make it actually log onto the board. It brings up everything just in a non logged in format.
import urllib2, re
import urllib, re
logindata = urllib.urlencode({'username': 'x', 'password': 'y'})
page = urllib2.urlopen("http://www.woarl.com/board/index.php", logindata)
pagesource = page.read()
print pagesource
Someone recently asked the same question you're asking. If you read through the answers to that question you'll see code examples showing you how to stay logged in while browsing a site in a Python script using only stuff in the standard library.
The accepted answer might not be as useful to you as this other answer, since the accepted answer deals with a specific problem involving redirection. However, I recommend reading through all of the answers regardless.
You probably want to look into preserving cookies from the server.
Pycurl or Mechanize will make this much easier for you
If actually look at the page, you see that the login link takes you to http://www.woarl.com/board/ucp.php?mode=login
That page has the login form and submits to http://www.woarl.com/board/ucp.php?mode=login again with POST.
You'll then have to extract the cookies that are probably set, and put those in a CookieJar or similar.
You probably want to create an opener with these handlers and apply it to urllib2.
With these applied your cookies are handled and you'll be redirected, if server decides it wants you somewhere else.
# Create handlers
cookieHandler = urllib2.HTTPCookieProcessor() # Needed for cookie handling
redirectionHandler = urllib2.HTTPRedirectHandler() # needed for redirection (not needed for javascript redirect?)
# Create opener
opener = urllib2.build_opener(cookieHandler,redirectionHandler)
# Install the opener
urllib2.install_opener(opener)