I am trying to download a file with urllib. I am using a direct link to this RAR (if I open the link in Chrome, it immediately starts downloading the RAR file), but when I run the following code:
file_name = url.split('/')[-1]
u = urllib.urlretrieve(url, file_name)
... all I get back is a 22 KB RAR file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and here is the URL.
(Disclaimer: this is a free download, as seen on the band's website.)
Got it. The headers were lacking a lot of information. I resorted to using Requests, and with each GET request I added the following content to the headers:
'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
However, I noticed that not all of this is necessary (the Cookie alone is enough), but it did the trick: I was able to download the entire file. If you are using urllib2, I am sure that doing the same (sending requests with the appropriate header content) would work too. Thank you all for the good tips, and for pointing me in the right direction. I used Fiddler to see what my Requests GET header was missing in comparison to Chrome's GET header. If you have a similar issue, I suggest you check it out.
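For the standard library, a minimal sketch of the same idea with Python 3's urllib.request might look like the following. The helper name build_browser_like_request and the example.com URL are hypothetical, and the cookie value is a placeholder: a real session cookie would have to be copied from the browser's own request, for example with Fiddler.

```python
import urllib.request


def build_browser_like_request(url, cookie, user_agent=None):
    """Return a Request carrying a Cookie header (and optionally a User-Agent)."""
    headers = {"Cookie": cookie}
    if user_agent:
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


# Placeholder URL and cookie, for illustration only.
req = build_browser_like_request(
    "http://example.com/file.rar",
    "JSESSIONID=PLACEHOLDER; ziplocale=en",
)

# Inspect what would be sent before making any network call.
print(req.header_items())

# To actually download (needs network access and a valid cookie):
# data = urllib.request.urlopen(req).read()
```

Building the request separately from sending it makes it easy to compare the headers against what the browser sends.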
I tried this with Python using the following code, which replaces urllib with urllib2:
url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
import urllib2
file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
    bin_writer.write(data)
and I get the same 22k file. Trying it with wget on that URL yields the same file; however I was able to begin the download of the full file (around 35MB as I recall) by pasting the URL in the Chrome navigation bar. Perhaps they are serving different files based upon the headers that you are sending in your request? The User-Agent GET request header is going to look different to their server (i.e. not like a browser) from Python/wget than it does from your browser when you click on the button.
I did not open the .rar archives to inspect the two files.
This thread discusses setting headers with urllib2 and this is the Python documentation on how to read the response status codes from your urllib2 request which could be helpful as well.
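One quick way to tell whether a small downloaded file is a real archive or an HTML page saved under a .rar name is to look at its first bytes. This is only a sketch: the helper names are hypothetical, and the check relies on RAR archives starting with the signature bytes Rar!\x1a\x07.

```python
RAR_MAGIC = b"Rar!\x1a\x07"  # common prefix of the RAR 4 and RAR 5 signatures


def looks_like_rar(data):
    """True if the bytes start with the RAR signature."""
    return data.startswith(RAR_MAGIC)


def looks_like_html(data):
    """Rough heuristic: True if the bytes start like an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Usage with a downloaded file (file_name as in the code above):
# with open(file_name, "rb") as f:
#     head = f.read(512)
# print(looks_like_rar(head), looks_like_html(head))
```

If looks_like_html returns True, the server most likely sent a landing or error page instead of the archive.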
I'm experimenting with crawling the Bing web search page using Python.
The raw content I receive is a bytes object, but my attempt to decompress it failed.
Does anyone have a clue what kind of data this is, and how I should extract readable text from this raw content? Thanks!
My code displays the raw content and then tries to gunzip it, so you can see both the raw content and the error from the decompression.
Because the raw content is too long, I have pasted only the first few lines below.
Code:
import urllib.request as Request
import gzip
req = Request.Request('http://www.bing.com')  # urlopen needs a URL with a scheme
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()  # name matches the print calls below
print("RAW Content: %s" %ResPage) # show raw content of web
print("Try decompression:")
print(gzip.decompress(ResPage)) # try decompression
Result:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')
Process finished with exit code 1
It's much easier to get started with the requests library. Plus, this is also the most commonly used lib for http requests nowadays.
Install requests in your python environment:
pip install requests
In your .py file:
import requests
r = requests.get("http://www.bing.com")
print(r.text)
OSError: Not a gzipped file (b'+p')
You need to either add an "accept-encoding" request header ("gzip" or "br"), or read "content-encoding" from the response and choose the matching decompressor, or use the requests library instead, which will do all of this for you.
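A minimal sketch of the "read content-encoding and choose the correct one" option, using only the Python 3 standard library for gzip and deflate. The function name decode_body is hypothetical, and the "br" branch assumes the third-party brotli package is installed, since the standard library has no Brotli decoder.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Decompress a response body according to its Content-Encoding value."""
    encoding = (content_encoding or "").strip().lower()
    if encoding in ("", "identity"):
        return body  # not compressed
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # Assumes the zlib-wrapped form; some servers send raw deflate instead.
        return zlib.decompress(body)
    if encoding == "br":
        import brotli  # third-party package, assumed to be installed
        return brotli.decompress(body)
    raise ValueError("unsupported Content-Encoding: %r" % content_encoding)


# Usage with urllib.request (needs network access):
# resp = urllib.request.urlopen(req)
# html = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Dispatching on the response header avoids guessing the format from the first bytes, which is what produced the "Not a gzipped file" error.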
A second problem that might appear: you need to pass a user-agent in the request headers to look like a "real" user visit.
If no user-agent is passed in the request headers, the requests library defaults to python-requests, so Bing or another search engine understands that it's a bot/script and blocks the request. Check what your user-agent is.
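The same applies to the standard library. As a small sketch (Python 3, no network access needed), the default User-Agent that urllib.request sends can be read from a freshly built opener; the variable names are only illustrative.

```python
import urllib.request

# A new opener carries urllib's default headers, including its User-agent.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints something like "Python-urllib/3.x", depending on the Python version.
print(default_headers["User-agent"])
```

A value like this is easy for a server to recognize as a script rather than a browser.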
Pass user-agent using requests library:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines or figure out different tedious problems such as picking the correct CSS selector if the HTML layout is not the best out there.
Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.
I want to download an image file from a URL using the Python module "urllib.request". It works for some websites (e.g. mangastream.com) but not for another (mangadoom.co), where I receive the error "HTTP Error 403: Forbidden". What could be the problem in the latter case, and how can I fix it?
I am using python3.4 on OSX.
import urllib.request
# does not work
img_url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
At the end of error message it said:
...
HTTPError: HTTP Error 403: Forbidden
However, it works for another website
# work
img_url = 'http://img.mangastream.com/cdn/manga/51/3140/006.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
I have tried the solutions from the post below, but none of them works on mangadoom.co.
Downloading a picture via urllib and python
How do I copy a remote image in python?
The solution here also does not fit, because my case is downloading an image.
urllib2.HTTPError: HTTP Error 403: Forbidden
A non-Python solution is also welcome. Any suggestion would be much appreciated.
This website is blocking the user-agent used by urllib, so you need to change it in your request. Unfortunately I don't think urlretrieve supports this directly.
I advise using the beautiful requests library; the code becomes (from here):
import requests
import shutil
r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png', stream=True)
if r.status_code == 200:
    with open("img.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Note that this website does not seem to forbid the requests user-agent. But if it needs to be modified, that is easy:
r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png',
stream=True, headers={'User-agent': 'Mozilla/5.0'})
Also relevant : changing user-agent in urllib
You can build an opener. Here's the example:
import urllib.request
opener=urllib.request.build_opener()
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
url=''
local=''
urllib.request.urlretrieve(url,local)
By the way, the following two snippets are equivalent:
(without an opener)
req=urllib.request.Request(url,data,hdr)
html=urllib.request.urlopen(req)
(with an opener built)
html=opener.open(url,data,timeout)
However, we cannot add headers when we use:
urllib.request.urlretrieve()
So in this case, we have to build an opener.
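If installing a global opener is not wanted, one alternative (not what this answer does) is to skip urlretrieve and copy the response to disk yourself. The sketch below assumes Python 3; the function name download and the default User-Agent string are illustrative choices.

```python
import shutil
import urllib.request


def download(url, filename, user_agent="Mozilla/5.0"):
    """Fetch url with a custom User-Agent and stream the body into filename."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req) as resp, open(filename, "wb") as out:
        # Copy in chunks so large files are not held in memory at once.
        shutil.copyfileobj(resp, out)


# Usage (needs network access):
# download("http://mangadoom.co/wp-content/manga/5170/886/005.png", "005.png")
```

Because the header lives on the Request object, other code in the same process that calls urlopen is not affected.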
I tried wget with the URL in a terminal and it works:
wget -O out_005.png http://mangadoom.co/wp-content/manga/5170/886/005.png
so my workaround is to use the script below, and it works too.
import os
out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
os.system("wget -O {0} {1}".format(out_image, url))
I'm trying to download an image, however it doesn't seem to work. Is it being blocked by DDoS protection?
Here is the code:
urllib.request.urlretrieve("http://archive.is/Xx9t3/scr.png", "test.png")
Basically, it should download that image as "test.png". I'm using Python 3, hence the urllib.request before urlretrieve.
import urllib.request
Have that as well.
Any way I can download the image? thanks!
For reasons that I cannot even imagine, the server requires a well-known user agent. So you must pretend to be, for example, Firefox, and it will agree to send the image:
# first build a request object
req = urllib.request.Request("http://archive.is/Xx9t3/scr.png",
    headers={
        'User-agent':
        'Mozilla/5.0 (Windows NT 5.1; rv:43.0) Gecko/20100101 Firefox/43.0'})
# then use it
resp = urllib.request.urlopen(req)
with open("test.png", "wb") as fd:
    fd.write(resp.read())
Rather stupid, but when a server admin goes mad, just be as stupid as he is...
I'd advise you to use requests. Basically, the way you are trying to get the image is forbidden; check this:
import requests
import shutil
r = requests.get('http://archive.is/Xx9t3/scr.png', stream=True)
if r.status_code == 200:
    with open("test.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
This snippet was adapted from here
The magic behind this is how the resource is retrieved; with requests, that part is the stream=True line. Some servers are more restrictive about the methods used to pull resources such as media.
I am trying to make a small program that downloads subtitles for movie files.
I noticed, however, that following a link in Chrome and opening it with urllib2.urlopen() do not give the same results.
As an example let's consider the link http://www.opensubtitles.org/en/subtitleserve/sub/5523343 . In chrome this redirects to http://osdownloader.org/en/osdownloader.subtitles-for.you/subtitles/5523343 which after a little while downloads the file I want.
However, when I use the following code in python, I get redirected to another page:
import urllib2
url = "http://www.opensubtitles.org/en/subtitleserve/sub/5523343"
response = urllib2.urlopen(url)
if response.url == url:
    print "No redirect"
else:
    print url, " --> ", response.url
Result: http://www.opensubtitles.org/en/subtitleserve/sub/5523343 --> http://www.opensubtitles.org/en/subtitles/5523343/the-musketeers-commodities-en
Why does that happen? How can I follow the same redirect as with the browser?
(I know that these sites offer APIs in python, but this is meant as practice in python and playing with urllib2 for the first time)
There's a significant difference in the request you're making from Chrome and your script using urllib2 above, and that is the HTTP header User-Agent (https://en.wikipedia.org/wiki/User_agent).
opensubtitles.org probably detects that you're trying to retrieve the webpage programmatically, and is blocking it. Try using one of the User-Agent strings from Chrome (more here http://www.useragentstring.com/pages/Chrome/):
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36
in your script.
See this question on how to edit your script to support a custom User-Agent header - Changing user agent on urllib2.urlopen.
I would also like to recommend using the requests library for Python instead of urllib2, as the API is much easier to understand - http://docs.python-requests.org/en/latest/.
I looked here and here for information on my issue, but with no luck.
I made some python code that is intended to grab a webpage's source, as in Safari's Web Inspector. However, I have been getting different code from my application and Safari's Web Inspector. Here is my code so far:
#!/usr/bin/python
import urllib2
# headers
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Cache-Control': 'max-age=0'}
# request data
req = urllib2.Request("https://www.google.com/#q=rainbow&safe=active", headers=hdr)
# try to get data
try:
    page = urllib2.urlopen(req)
    print page.info()
except urllib2.HTTPError, e:
    print e.fp.read()
content = page.read()
#print content
print content
And the headers match up to what is in Web Inspector:
The code returned is different, though, for a google search for "rainbow".
My python:
http://paste.ubuntu.com/6270549/
Web Inspector:
http://paste.ubuntu.com/6270606/
As far as I know, it seems that my code is missing a large number of the ubiquitous }catch(e){gbar_._DumpException(e)} lines that are present in the Web Inspector code. Also, my code only has 78 lines, while the Web Inspector code has 235 lines. Does this mean that my code is not getting all of the JavaScript or some other portion of the webpage? How can I get my code to retrieve the same data as the Web Inspector?
You are using the wrong link to search with Google; the correct link should be:
https://www.google.com/search?q=rainbow&safe=active
instead of:
https://www.google.com/#q=rainbow&safe=active
The second link returns Google's homepage when used from Python, because everything after the # is a URL fragment: it is never sent to the server and is only acted on by JavaScript running in the browser. This is why the code is different.
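A short sketch of what each link actually asks the server for, using the standard library's urllib.parse (Python 3). The helper name request_target is hypothetical.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path and query that an HTTP client sends for this URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target  # parts.fragment is deliberately left out


for url in (
    "https://www.google.com/#q=rainbow&safe=active",
    "https://www.google.com/search?q=rainbow&safe=active",
):
    print(request_target(url), "| fragment:", repr(urlsplit(url).fragment))
```

For the first link the search terms sit entirely in the fragment, so the request target is just the homepage path; for the second link they are part of the query string that reaches the server.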