I'm trying to download an image, but it does not seem to work. Is it being blocked by DDoS protection?
Here is the code:
import urllib.request
urllib.request.urlretrieve("http://archive.is/Xx9t3/scr.png", "test.png")
Basically, I want to download that image as "test.png". I'm using Python 3, hence the urllib.request prefix before urlretrieve, and I already have the import in place.
Any way I can download the image? thanks!
For reasons that I cannot even imagine, the server requires a well-known user agent. So you must pretend to be, for example, Firefox, and it will agree to send the image:
import urllib.request

# first build a request object with a browser-like User-Agent header
req = urllib.request.Request(
    "http://archive.is/Xx9t3/scr.png",
    headers={
        'User-Agent':
            'Mozilla/5.0 (Windows NT 5.1; rv:43.0) Gecko/20100101 Firefox/43.0'})

# then use it
resp = urllib.request.urlopen(req)
with open("test.png", "wb") as fd:
    fd.write(resp.read())
Rather stupid, but when a server admin goes mad, just be as stupid as he is...
I'd advise you to use requests; basically, the way you are trying to get the image is forbidden. Check this:
import requests
import shutil

r = requests.get('http://archive.is/Xx9t3/scr.png', stream=True)
if r.status_code == 200:
    with open("test.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
This snippet was adapted from here.
The magic behind this is how the resource is retrieved: with requests, that part is the stream=True argument. Some servers are more restrictive about such methods of pulling resources like media.
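The copy itself is plain file-to-file streaming. As a minimal offline sketch (using io.BytesIO buffers in place of the network response, so no download happens), this is what the shutil.copyfileobj call is doing:

```python
import io
import shutil

# Stand-in for a streamed HTTP body; requests' r.raw is also a file-like object.
source = io.BytesIO(b"\x89PNG fake image bytes")
destination = io.BytesIO()

# copyfileobj reads the source in fixed-size chunks and writes each one out,
# so the whole file never has to sit in memory at once.
shutil.copyfileobj(source, destination, length=4)

print(destination.getvalue() == b"\x89PNG fake image bytes")  # True
```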
I've got a problem parsing a document with BS4, and I'm not sure what's happening. The response code is OK, the url is fine, the proxies work, everything is great, proxy shuffling works as expected, but soup comes back blank using any parser other than html5lib. The soup that html5lib comes back with stops at the <body> tag.
I'm working in Colab and I've been able to run pieces of this function successfully in another notebook, and have gotten as far as being able to loop through a set of search results, make soup out of the links, and grab my desired data, but my target website eventually blocks me, so I have switched to using proxies.
check(proxy) is a helper function that checks a list of proxies before attempting to make a request of my target site. The problem seems to have started when I included it in a try/except. I'm speculating that maybe it's something to do with the try/except being inside a for loop --- I don't know.
What's confounding is that I know the site isn't blocking scrapers/robots generally, as I can use BS4 in another notebook piecemeal and get what I'm looking for.
from bs4 import BeautifulSoup as bs
from itertools import cycle
import time
from time import sleep
import requests
import random

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36', "X-Requested-With": "XMLHttpRequest"}
ips = []
proxy_pool = cycle(ips)
search_results_page_attempt = []  # defined here; it was referenced but not defined in the original snippet

def scrape_boxrec():
    search_result_pages = [num for num in range(0, 22700, 20)]
    random.shuffle(search_result_pages)
    for i in search_result_pages:
        search_results_page_attempt.append(i)
        proxy = next(proxy_pool)
        proxies = {
            'http': proxy,
            'https': proxy
        }
        if check(proxy) == True:
            url = 'https://boxrec.com/en/locations/people?l%5Brole%5D=proboxer&l%5Bdivision%5D=&l%5Bcountry%5D=&l%5Bregion%5D=&l%5Btown%5D=&l_go=&offset=' + str(i)
            try:
                results_source = requests.get(url, headers=head, timeout=5, proxies=proxies)
                results_content = results_source.content
                res_soup = bs(results_content, 'html.parser')
                # WHY IS IT NOT PARSING THIS PAGE CORRECTLY!!!!
            except Exception as ex:
                print(ex)
        else:
            print("Bad Proxy. Moving On")

def check(proxy):
    check_url = 'https://httpbin.org/ip'
    check_proxies = {
        'http': proxy,
        'https': proxy
    }
    try:
        response = requests.get(check_url, proxies=check_proxies, timeout=5)
        if response.status_code == 200:
            return True
    except:
        return False
    return False  # non-200 responses also count as a bad proxy
Since nobody took a crack at it, I thought I would come back through and update with a solution: my use of "X-Requested-With": "XMLHttpRequest" in my head variable is what was causing the error. I'm still new to programming, especially to making HTTP requests, but I do know it has something to do with Ajax. Anyway, when I removed that bit from the headers attribute in my request, BeautifulSoup parsed the document in full.
This answer as well as this one explain in a lot more detail that this is a common approach to preventing Cross-Site Request Forgery, which is why my request was always coming back empty.
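The fix itself was just dropping that one key from the headers dict before making the request (a minimal offline sketch; pop with a default is safe even if the key is absent):

```python
head = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# Remove the Ajax marker that made the server return an empty body.
head.pop('X-Requested-With', None)

print('X-Requested-With' in head)  # False
```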
I'm playing with crawling the Bing web search page using Python.
The raw content I receive looks like bytes, but my attempt to decompress it has failed.
Does someone have a clue what kind of data this is, and how I should extract something readable from this raw content? Thanks!
My code displays the raw content and then tries to gunzip it, so you can see the raw content as well as the error from the decompression.
Since the raw content is too long, I just paste the first few lines below.
Code:
import urllib.request as Request
import gzip

req = Request.Request('http://www.bing.com')
req.add_header('upgrade-insecure-requests', 1)
ResPage = Request.urlopen(req).read()
print("RAW Content: %s" % ResPage)  # show raw content of the page
print("Try decompression:")
print(gzip.decompress(ResPage))  # try decompression
Result:
RAW Content: b'+p\xe70\x0bi{)\xee!\xea\x88\x9c\xd4z\x00Tgb\x8c\x1b\xfa\xe3\xd7\x9f\x7f\x7f\x1d8\xb8\xfeaZ\xb6\xe3z\xbe\'\x7fj\xfd\xff+\x1f\xff\x1a\xbc\xc5N\x00\xab\x00\xa6l\xb2\xc5N\xb2\xdek\xb9V5\x02\t\xd0D \x1d\x92m%\x0c#\xb9>\xfbN\xd7\xa7\x9d\xa5\xa8\x926\xf0\xcc\'\x13\x97\x01/-\x03... ...
Try decompression:
Traceback (most recent call last):
OSError: Not a gzipped file (b'+p')
Process finished with exit code 1
It's much easier to get started with the requests library, which is also the most commonly used library for HTTP requests nowadays.
Install requests in your python environment:
pip install requests
In your .py file:
import requests
r = requests.get("http://www.bing.com")
print(r.text)
OSError: Not a gzipped file (b'+p')
You either need to add Accept-Encoding: gzip (or br) to the request headers, or read the Content-Encoding header from the response and choose the matching decoder, or use the requests library instead, which will do all of this for you.
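The traceback above already tells you which case you are in: gzip data always starts with the magic bytes \x1f\x8b, while the b'+p' prefix means this response was compressed with something else (most likely Brotli). A small offline sketch of checking the magic bytes before decompressing:

```python
import gzip

def try_gunzip(raw: bytes) -> bytes:
    # gzip streams always begin with the two magic bytes 0x1f 0x8b; anything
    # else (e.g. Brotli output) raises "Not a gzipped file" in gzip.decompress.
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("not gzip data; check the Content-Encoding response header")
    return gzip.decompress(raw)

compressed = gzip.compress(b"<html>hello</html>")
print(try_gunzip(compressed))  # b'<html>hello</html>'
```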
The second problem that might appear: you need to pass a user-agent in the request headers to look like a "real" user visit.
If no user-agent is passed in the request headers while using the requests library, it defaults to python-requests, so Bing or another search engine understands that it's a bot/script and blocks the request. Check what your user-agent is.
Pass user-agent using requests library:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines or figure out different tedious problems such as picking the correct CSS selector if the HTML layout is not the best out there.
Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer: I work for SerpApi.
I want to download an image file from a URL using the Python module urllib.request. It works for some websites (e.g. mangastream.com) but not for others (mangadoom.co), where I receive the error "HTTP Error 403: Forbidden". What could be the problem in the latter case, and how do I fix it?
I am using Python 3.4 on OS X.
import urllib.request
# does not work
img_url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
At the end of error message it said:
...
HTTPError: HTTP Error 403: Forbidden
However, it works for another website
# work
img_url = 'http://img.mangastream.com/cdn/manga/51/3140/006.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
I have tried the solutions from the post below, but none of them works on mangadoom.co.
Downloading a picture via urllib and python
How do I copy a remote image in python?
The solution here also does not fit because my case is downloading an image.
urllib2.HTTPError: HTTP Error 403: Forbidden
A non-Python solution is also welcome. Your suggestions will be much appreciated.
This website is blocking the user-agent used by urllib, so you need to change it in your request. Unfortunately I don't think urlretrieve supports this directly.
I advise using the beautiful requests library; the code becomes (adapted from here):
import requests
import shutil

r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png', stream=True)
if r.status_code == 200:
    with open("img.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Note that this website does not seem to forbid the default requests user-agent. But if it needs to be modified, it is easy:
r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png',
                 stream=True, headers={'User-agent': 'Mozilla/5.0'})
Also relevant : changing user-agent in urllib
You can build an opener. Here's an example:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)
url = ''
local = ''
urllib.request.urlretrieve(url, local)
By the way, the following two are equivalent:
(no opener)
req = urllib.request.Request(url, data, hdr)
html = urllib.request.urlopen(req)
(opener built)
html = opener.open(url, data, timeout)
However, we are not able to add headers when we use urllib.request.urlretrieve(), so in this case we have to build an opener.
I tried wget with the URL in a terminal and it works:
wget -O out_005.png http://mangadoom.co/wp-content/manga/5170/886/005.png
So my workaround is to use the script below, and it works too.
import os
out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
os.system("wget -O {0} {1}".format(out_image, url))
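If you go the external-wget route, a slightly safer variant (a sketch, assuming wget is installed) builds the command as an argument list and runs it with subprocess, so the filename and URL are never re-parsed by a shell:

```python
import subprocess

def build_wget_cmd(out_image: str, url: str) -> list:
    # An argument list (rather than one shell string) avoids quoting problems
    # when the filename or URL contains spaces or shell metacharacters.
    return ["wget", "-O", out_image, url]

cmd = build_wget_cmd("out_005.png",
                     "http://mangadoom.co/wp-content/manga/5170/886/005.png")
# subprocess.run(cmd, check=True)  # uncomment to actually download
print(cmd)
```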
I am trying to download a file with urllib. I am using a direct link to this rar (if I use Chrome on this link, it will immediately start downloading the rar file), but when I run the following code:
file_name = url.split('/')[-1]
u = urllib.urlretrieve(url, file_name)
... all I get back is a 22 kB rar file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and here is the url.
(Disclaimer: this is a free download, as seen on the band's website.)
Got it. The headers were lacking a lot of information. I resorted to using Requests, and with each GET request I would add the following content to the header:
'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
However, I noticed that not all of this is necessary (the Cookie alone is enough), but it did the trick: I was able to download the entire file. If using urllib2, I am sure that sending requests with the appropriate header content would likewise do the trick. Thank you all for the good tips and for pointing me in the right direction. I used Fiddler to see what my Requests GET header was missing in comparison to Chrome's GET header. If you have a similar issue, I suggest you check it out.
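The same headers can also be attached with the standard library. A Python 3 sketch (no request is actually sent; urllib.request.Request just stores the headers, and get_header lets you confirm what would go out before calling urlopen):

```python
import urllib.request

# The header set that made the download work, per the answer above.
headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;',
}

req = urllib.request.Request(
    'http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar',
    headers=headers)

# Inspect the stored headers before calling urllib.request.urlopen(req).
print(req.get_header('Cookie').startswith('JSESSIONID'))  # True
```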
I tried this with Python using the following code, which replaces urllib with urllib2:
import urllib2

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
    bin_writer.write(data)
and I get the same 22 kB file. Trying it with wget on that URL yields the same file; however, I was able to begin the download of the full file (around 35 MB as I recall) by pasting the URL into the Chrome navigation bar. Perhaps they are serving different files based upon the headers you send in your request? The User-Agent request header is going to look different to their server (i.e. not like a browser) coming from Python/wget than it does from your browser when you click the button.
I did not open the .rar archives to inspect the two files.
This thread discusses setting headers with urllib2 and this is the Python documentation on how to read the response status codes from your urllib2 request which could be helpful as well.
I would like to write a program that changes my user agent string.
How can I do this in Python?
I assume you mean a user-agent string in an HTTP request? This is just an HTTP header that gets sent along with your request.
using Python's urllib2:
import urllib2

url = 'http://foo.com/'
# add a header to define a custom User-Agent
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
req = urllib2.Request(url, None, headers)  # data=None so the request stays a GET
response = urllib2.urlopen(req).read()
In urllib, it's done like this:
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "MyStrangeUserAgent"

urllib._urlopener = AppURLopener()
and then just use urllib.urlopen normally. In urllib2, pass headers=somedict to req = urllib2.Request(...) to set all the headers you want (including the user agent) on the new request object req, and then call urllib2.urlopen(req).
Other ways of sending HTTP requests have other ways of specifying headers, of course.
Using Python, you can use urllib to download webpages and override the version attribute to change the user-agent.
There is a very good example on http://wolfprojects.altervista.org/changeua.php
Here is an example copied from that page:
>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
>>> myopener = MyOpener()
>>> page = myopener.open('http://www.google.com/search?q=python')
>>> page.read()
[…]Results <b>1</b> - <b>10</b> of about <b>81,800,000</b> for <b>python</b>[…]
urllib2 is nice because it's built in, but I tend to use mechanize when I have the choice. It extends a lot of urllib2's functionality (though much of it has been added to python in recent years). Anyhow, if it's what you're using, here's an example from their docs on how you'd change the user-agent string:
import mechanize

cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                     ("From", "responsible.person@example.com")]
Best of luck.
As mentioned in the answers above, the user-agent field in the HTTP request header can be changed using built-in Python modules such as urllib2. At the same time, it is also important to analyze what exactly the web server sees. A recent post on user agent detection gives sample code and output describing what the web server sees when a programmatic request is sent.
If you want to change the user agent string you send when opening web pages, google around for a Firefox plugin. ;) For example, I found this one. Or you could write a proxy server in Python, which changes all your requests independent of the browser.
My point is, changing the string is going to be the easy part; your first question should be, where do I need to change it? If you already know that (at the browser? proxy server? on the router between you and the web servers you're hitting?), we can probably be more helpful. Or, if you're just doing this inside a script, go with any of the urllib answers. ;)
Updated for Python 3.2 (py3k):
import urllib.request

url = 'http://www.google.com'
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
request = urllib.request.Request(url, headers=headers)  # no data, so this remains a GET
response = urllib.request.urlopen(request).read()