I am trying to open a URL with Python 3:
import urllib.request
fp = urllib.request.urlopen("http://lebed.com/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
But it hangs on the second line, the urlopen() call.
What's the reason for this problem, and how can I fix it?
I suspect the reason is that the site blocks visits from robots/scripts. You need to fake a browser visit by sending browser headers along with your request:
import urllib.request

url = "http://lebed.com/"
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
I tried this on my system and it works.
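For completeness, here is a minimal sketch of reading and decoding the response from that request, assuming the page is UTF-8 encoded as in the question:

import urllib.request

req = urllib.request.Request(
    "http://lebed.com/",
    headers={'User-Agent': 'Mozilla/5.0'}  # any browser-like value works for this purpose
)
# urlopen returns a file-like response; decode the bytes with the assumed encoding
with urllib.request.urlopen(req) as f:
    mystr = f.read().decode("utf8")
print(mystr)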
I agree with Arpit Solanki. Below is the output for a failed request vs. a successful one.
Failed
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Python-urllib/3.5
Success
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36
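If you want to reproduce this kind of dump yourself, one option (a sketch using the standard library's debug switch) is to build an opener with HTTPHandler(debuglevel=1), which prints the raw request and response headers:

import urllib.request

# debuglevel=1 makes the underlying http.client connection print the
# exact request line and headers it sends, like the dumps shown above
opener = urllib.request.build_opener(urllib.request.HTTPHandler(debuglevel=1))
req = urllib.request.Request(
    "http://lebed.com/",
    headers={'User-Agent': 'Mozilla/5.0'}
)
response = opener.open(req)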
I have a simple GET request I'd like to make using Python's requests library.
import requests

HEADERS = {'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)'
                          'AppleWebKit/537.36 (KHTML, like Gecko)'
                          'Chrome/45.0.2454.101 Safari/537.36'),
           'referer': 'http://stats.nba.com/scores/'}

url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
However, when I make the requests.get call, I get the error requests.exceptions.ReadTimeout: HTTPConnectionPool(host='stats.nba.com', port=80): Read timed out. (read timeout=5). But I am able to copy/paste that URL into my browser and view the resulting JSON. Why is requests not able to get the result?
Your HEADERS format is wrong: the implicit string concatenation drops the spaces between the parts of your User-Agent value. I tried with this code and it worked without any issues:
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

url = 'http://stats.nba.com/stats/playbyplayv2?EndPeriod=10&EndRange=55800&GameID=0021500281&RangeType=2&Season=2016-17&SeasonType=Regular+Season&StartPeriod=1&StartRange=0'
response = requests.get(url, timeout=5, headers=HEADERS)
print(response.text)
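If you want to double-check what was actually sent, requests attaches the prepared request to the response, so you can print the outgoing headers (a small sketch):

import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}
response = requests.get('http://stats.nba.com/scores/', timeout=5, headers=HEADERS)
# response.request is the PreparedRequest that was actually sent;
# its headers show exactly what the server saw
print(response.request.headers)
print(response.status_code)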
Image path --> http://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg
Code I am using:
import urllib
urllib.urlretrieve("https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg" , "photu.jpg")
What it returns (it returns the same thing for successful and unsuccessful attempts):
('photu.jpg', <httplib.HTTPMessage instance at 0x7fe3cfb27d88>)
Can someone help?
You need to fake the User-Agent to bypass this restriction imposed by the web server.
Using Python 3 and the requests library, I managed to get the picture:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = 'https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg'
res = requests.get(url, headers=headers)

with open('photo.jpg', 'wb') as W:
    W.write(res.content)
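As a usage note, it can be worth verifying that the request succeeded and actually returned an image before writing the file; a small sketch using the standard requests API:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg'
res = requests.get(url, headers=headers)

res.raise_for_status()  # turns HTTP errors (403, 404, ...) into exceptions

# quick sanity check that the server sent an image, not an HTML error page
assert res.headers.get('Content-Type', '').startswith('image/')

with open('photo.jpg', 'wb') as f:
    f.write(res.content)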
This might help (Python 2):
import urllib
f = open('photu.jpg','wb')
f.write(urllib.urlopen('https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg').read())
f.close()
Since you're sending a raw HTTP request without any User-Agent header, the server is not allowing the request through. You can mock it with a defined User-Agent in the headers, and it will work as if the request came from a browser.
url = "https://markinternational.info/data/out/366/221983609-black-hd-desktop-wallpaper.jpg"
req = urllib.request.Request(
url,
data=None,
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
}
)
with open('image.jpg', 'wb') as img_file:
img_file.write(urllib.request.urlopen(req).read())
I am trying to scan the HTML of two pages using two requests.
With the first one it works, but with the second one the HTML I am trying to locate is not visible; it comes back wrong.
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get('https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new', headers=headers)

newHeader = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36'}
pagePrice = requests.get('https://www.amazon.com/gp/aw/ol/B01EQJU8AW/ref=mw_dp_olp?ie=UTF8&condition=new', headers=newHeader)
The first request works fine and gets me the good HTML.
The second request gives bad HTML.
I tried this package, but without success:
https://pypi.python.org/pypi/fake-useragent
And I saw this topic, which went unanswered:
Double user-agent tag, "user-agent: user-agent: Mozilla/"
Thank you very much!
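For reference, the fake-useragent package mentioned in the question is typically used like this (a sketch of its documented API; whether it fixes the Amazon case is a separate question):

import requests
from fake_useragent import UserAgent  # pip install fake-useragent

ua = UserAgent()
# ua.random returns a random real-browser User-Agent string;
# ua.chrome / ua.firefox return browser-specific ones
headers = {'User-Agent': ua.random}
page = requests.get('https://www.amazon.com/gp/aw/ol/B00DZKQSRQ/ref=mw_dp_olp?ie=UTF8&condition=new',
                    headers=headers)
print(page.status_code)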
How can I download a webpage with a user agent other than the default one on urllib2.urlopen?
I answered a similar question a couple weeks ago.
There is example code in that question, but basically you can do something like this: (Note the capitalization of User-Agent as of RFC 2616, section 14.43.)
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('http://www.stackoverflow.com')
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('http://www.example.com', None, headers)
html = urllib2.urlopen(req).read()
Or, a bit shorter:
req = urllib2.Request('http://www.example.com', headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
Setting the User-Agent from everyone's favorite Dive Into Python.
The short story: You can use Request.add_header to do this.
You can also pass the headers as a dictionary when creating the Request itself, as the docs note:
headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2‘s default user agent string is "Python-urllib/2.6" (on Python 2.6).
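A short sketch showing both forms side by side in urllib2 (Python 2); per the docs quoted above, the two are equivalent:

import urllib2

# Form 1: set the header after constructing the Request
req = urllib2.Request('http://www.example.com/')
req.add_header('User-Agent', 'Mozilla/5.0')

# Form 2: pass a headers dict at construction time; it is treated
# as if add_header() were called with each key and value
req = urllib2.Request('http://www.example.com/',
                      headers={'User-Agent': 'Mozilla/5.0'})

html = urllib2.urlopen(req).read()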
For Python 3, urllib is split into submodules (urllib.request, urllib.parse, urllib.error); use urllib.request:
import urllib.request

req = urllib.request.Request(url="http://localhost/", headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'})
handler = urllib.request.urlopen(req)
All of these should work in theory, but (with Python 2.7.2 on Windows at least) any time you send a custom User-agent header, urllib2 doesn't send that header. If you don't try to send a User-agent header, it sends the default Python-urllib one.
None of these methods seemed to work for adding User-agent, but they work for other headers:
opener = urllib2.build_opener(proxy)
opener.addheaders = [('User-agent', 'Custom user agent')]
urllib2.install_opener(opener)

request = urllib2.Request(url, headers={'User-agent': 'Custom user agent'})
request.headers['User-agent'] = 'Custom user agent'
request.add_header('User-agent', 'Custom user agent')
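One way to debug this is to inspect the headers stored on the Request before sending it; note that urllib2 normalizes header names with capitalize(), so 'User-agent' is the stored spelling (a sketch):

import urllib2

request = urllib2.Request('http://www.example.com/')
request.add_header('User-agent', 'Custom user agent')

# header_items() lists what will be sent; urllib2 stores names
# capitalized, e.g. [('User-agent', 'Custom user agent')]
print(request.header_items())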
For urllib you can use:
from urllib import FancyURLopener

class MyOpener(FancyURLopener, object):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
myopener.retrieve('https://www.google.com/search?q=test', 'useragent.html')
Another solution in urllib2 and Python 2.7:
import urllib2

req = urllib2.Request('http://www.example.com/')
req.add_unredirected_header('User-Agent', 'Custom User-Agent')
urllib2.urlopen(req)
There are two properties of urllib.URLopener(), namely:
addheaders = [('User-Agent', 'Python-urllib/1.17'), ('Accept', '*/*')] and
version = 'Python-urllib/1.17'.
To fool the website, you need to change both of these values to an accepted User-Agent, e.g.:
Chrome browser: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36'
Google bot: 'Googlebot/2.1'
Like this:
import urllib

page_extractor = urllib.URLopener()
page_extractor.addheaders = [('User-Agent', 'Googlebot/2.1'), ('Accept', '*/*')]
page_extractor.version = 'Googlebot/2.1'
page_extractor.retrieve(<url>, <file_path>)

Changing just one property does not work because the website marks the request as suspicious.
Try this:
import requests

html_source_code = requests.get("http://www.example.com/",
                                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
                                         'Upgrade-Insecure-Requests': '1',
                                         'x-runtime': '148ms'},
                                allow_redirects=True).content
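Of the headers above, User-Agent is usually the one that actually matters. Note that x-runtime is normally a response header (set by server frameworks to report processing time), so sending it in a request generally has no effect and can be omitted.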