Download a zip file from a URL using the requests module in Python

When I access this website, my browser opens a box to download a zip file.
I am trying to download the zip file through a Python script (I am a beginner in coding). I would like to automate the process of downloading a batch of similar links in the future, but I am testing with only one link for now. Here is my code:
import requests
url = 'https://sigef.incra.gov.br/geo/exportar/vertice/shp/454698fd-6dfa-49a1-8096-bd9bb57b62ca'
r = requests.get(url, verify=False, allow_redirects=True)
open('verticeshp454698fd-6dfa-49a1-8096-bd9bb57b62ca.zip', 'wb').write(r.content)
As output I get a broken zip file, not the one I wanted. I also get the following message in the command prompt:
C:\Users\joaop\AppData\Local\Programs\Python\Python38\lib\site-packages\urllib3\connectionpool.py:979: InsecureRequestWarning: Unverified HTTPS request is being made to host 'sigef.incra.gov.br'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
What steps am I missing here?
Thanks in advance for your help.

I got it working by adding a / at the end of the URL:
import requests

# the `/` at the end is important
url = 'https://sigef.incra.gov.br/geo/exportar/vertice/shp/454698fd-6dfa-49a1-8096-bd9bb57b62ca/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2866.71 Safari/537.36",
}
r = requests.get(url, headers=headers, verify=False, allow_redirects=True)
# get the filename from the headers: `454698fd-6dfa-49a1-8096-bd9bb57b62ca_vertice.zip`
filename = r.headers['Content-Disposition'].split("filename=")[-1]
with open(filename, 'wb') as f:
    f.write(r.content)
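One caveat worth flagging (my addition, not part of the original answer): some servers wrap the filename in quotes inside the Content-Disposition header, so a slightly more defensive parse strips them:

filename = r.headers['Content-Disposition'].split("filename=")[-1].strip('"')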

Related

Why does a Python request return an HTML file instead of an Excel file?

I want to download an Excel file from this link via Python:
https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&locale=en_US&periodView=A
Here is my code:
import requests

url = 'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&periodView=A'
resp = requests.get(url)
with open('file.xls', 'wb') as f:
    f.write(resp.content)
But file.xls is actually an HTML text file, not a spreadsheet.
I've tried adding headers:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
resp = requests.get(url, headers=headers)
But it didn't help. Thank you in advance.
Edit:
Found a way using pandas.
import pandas as pd
url = r'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&periodView=A'
# read into HTML tables
tables = pd.read_html(url)
# merge HTML tables
merged = pd.concat(tables)
# Write tables to excel file
merged.to_excel("output.xlsx")
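A brief usage note (my addition, not part of the original answer): pandas.read_html returns one DataFrame per table element it finds on the page, and it relies on an HTML parser such as lxml being installed; to_excel likewise needs a writer such as openpyxl. You can inspect what was parsed before merging:

# assumes `url` from the snippet above
tables = pd.read_html(url)
print(len(tables))       # number of tables found on the page
print(tables[0].head())  # peek at the first table before concatenating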
Hope this helps :)
Ignore below, this was before edit:
I know this is still problematic depending on your downstream application. The code below does still seem to download it in an HTML format, but that format can be opened in Excel regardless.
import requests

url = r'https://www.tfex.co.th/tfex/historicalTrading.html?locale=en_US&symbol=S50Z21&decorator=excel&series=&page=4&locale=en_US&periodView=A'
r = requests.get(url, allow_redirects=False)
with open('out.xls', 'wb') as f:
    f.write(r.content)
When I open this in Excel I get a warning, and click OK.

Log in to a website in Python 3.6 without the requests module

I have been trying to log in to a website using Python 3.6, but it has proven to be more difficult than I originally anticipated. So far this is my code:
import urllib.request
import urllib.parse

headers = {}
headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36"
url = "https://www.pinterest.co.uk/login/"
data = {
    "email": "my#email",
    "password": "my_password"
}
data = urllib.parse.urlencode(data)
data = data.encode("utf-8")
request = urllib.request.Request(url, headers=headers, data=data)
response = urllib.request.urlopen(request)
responseurl = response.geturl()
print(responseurl)
This throws a 403 (Forbidden) error, and I'm not sure why, as I have added my email and password and even changed the user agent. Am I just missing something simple like a cookie jar?
If possible, is there a way to do this without using the requests module? This is a challenge I have been given: do it with only built-in modules (but I am allowed to get help, so I'm not cheating).
Most sites will use a CSRF token or other means to block exactly what you are attempting to do. One possible workaround would be to use a browser automation framework such as Selenium and log in through the site's UI, as in the sketch below.
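A sketch of that workaround (my addition; note Selenium is a third-party package, so it steps outside the built-in-only constraint, and the form selectors below are hypothetical placeholders, so inspect the actual login form first):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.pinterest.co.uk/login/")
# the field names below are hypothetical; find the real ones in the page source
driver.find_element(By.NAME, "email").send_keys("my#email")
driver.find_element(By.NAME, "password").send_keys("my_password")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
print(driver.current_url)  # check whether the login redirected us
driver.quit()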

Download ZIP file from the web (Python)

I am trying to download a ZIP file from this website. I have looked at other questions like this one and tried using the requests and urllib modules, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code:
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked URL redirects indefinitely; that's why you get the 302 error.
You can examine this yourself with a redirect-checking tool: the linked URL immediately redirects to itself, creating a single-URL loop.
It works for me using the requests library:
import requests

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)

# Unzip it into a local directory if you want
import zipfile, io
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall("/path/to/your/directory")
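For larger archives, a streamed variant (a sketch of a common pattern, not from the original answer) avoids holding the whole response in memory before writing:

import requests

url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open('D_megase.zip', 'wb') as f:
        # write the archive to disk in chunks instead of all at once
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)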
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to look like a browser's. Here's an example of making a request look like it's coming from a Chrome browser:
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
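For instance, a minimal sketch with fake-useragent (my addition; assumes the package is installed, and its API may vary between versions):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a randomly chosen real-browser UA string
response = requests.get(url, headers=headers)  # url as defined above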

Urllib2 is downloading corrupted pictures [duplicate]

I want to download an image file from a URL using the Python module "urllib.request", which works for some websites (e.g. mangastream.com) but does not work for another (mangadoom.co), giving the error "HTTP Error 403: Forbidden". What could be the problem in the latter case, and how can I fix it?
I am using Python 3.4 on OS X.
import urllib.request
# does not work
img_url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
At the end of the error message it said:
...
HTTPError: HTTP Error 403: Forbidden
However, it works for another website:
# works
img_url = 'http://img.mangastream.com/cdn/manga/51/3140/006.png'
img_filename = 'my_img.png'
urllib.request.urlretrieve(img_url, img_filename)
I have tried the solutions from the posts below, but none of them works on mangadoom.co:
Downloading a picture via urllib and python
How do I copy a remote image in python?
The solution here also does not fit, because my case is downloading an image:
urllib2.HTTPError: HTTP Error 403: Forbidden
A non-Python solution is also welcome. Your suggestions will be much appreciated.
This website is blocking the user-agent used by urllib, so you need to change it in your request. Unfortunately, I don't think urlretrieve supports this directly.
I advise using the requests library instead; the code becomes:
import requests
import shutil

r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png', stream=True)
if r.status_code == 200:
    with open("img.png", 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
Note that this website does not seem to block the requests user-agent. But if it needs to be modified, it is easy:

r = requests.get('http://mangadoom.co/wp-content/manga/5170/886/005.png',
                 stream=True, headers={'User-agent': 'Mozilla/5.0'})
Also relevant: changing the user-agent in urllib.
You can build an opener. Here's an example:
import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
urllib.request.install_opener(opener)

url = ''    # fill in the image URL
local = ''  # fill in the local filename
urllib.request.urlretrieve(url, local)
By the way, the following two snippets are equivalent.
Without an opener:
req = urllib.request.Request(url, data, hdr)
html = urllib.request.urlopen(req)
With a built opener:
html = opener.open(url, data, timeout)
However, we are not able to add headers when we use urllib.request.urlretrieve(), so in that case we have to build an opener.
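Alternatively, if you'd rather not install a global opener, a minimal sketch (my addition, same idea as above) passes the header on a single Request and writes the bytes manually instead of using urlretrieve:

import urllib.request

url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# urlopen accepts a Request object, so per-request headers work here
with urllib.request.urlopen(req) as response, open('my_img.png', 'wb') as f:
    f.write(response.read())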
I tried wget with the URL in the terminal and it works:
wget -O out_005.png http://mangadoom.co/wp-content/manga/5170/886/005.png
so my workaround is to use the script below, and it works too.
import os
out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
os.system("wget -O {0} {1}".format(out_image, url))
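A side note (my addition, not part of the original answer): subprocess.run is generally preferable to os.system here, because the arguments are passed as a list rather than interpolated into a shell string:

import subprocess

out_image = 'out_005.png'
url = 'http://mangadoom.co/wp-content/manga/5170/886/005.png'
# no shell involved, so no quoting issues with the URL
subprocess.run(["wget", "-O", out_image, url], check=True)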

download zipfile with urllib2 fails

I am trying to download a file with urllib. I am using a direct link to this rar file (if I open the link in Chrome, it immediately starts downloading the rar file), but when I run the following code:
import urllib

file_name = url.split('/')[-1]
u = urllib.urlretrieve(url, file_name)
...all I get back is a 22 KB rar file, which is obviously wrong. What is going on here? I'm on OS X Mavericks with Python 2.7.5, and the URL is quoted in full in the answer below.
(Disclaimer: this is a free download, as seen on the band's website.)
Got it. The headers were lacking a lot of information. I resorted to using requests, and with each GET request I would add the following content to the headers:
'Connection': 'keep-alive'
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'
'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;'
However, I noticed that not all of this is necessary (the Cookie alone is all you need), but it did the trick; I was able to download the entire file. If using urllib2, I am sure that doing the same (sending requests with the appropriate header content) would do the trick. Thank you all for the good tips and for pointing me in the right direction. I used Fiddler to see what my requests GET header was missing in comparison to Chrome's GET header. If you have a similar issue, I suggest you check it out.
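A minimal sketch of that approach (my addition, not the asker's exact code; the Cookie is the session-specific value captured above and would need to be refreshed for a new session):

import requests

url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
headers = {
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    'Cookie': 'JSESSIONID=36DAD704C8E6A4EF4B13BCAA56217961; ziplocale=en; zippop=2;',
}
r = requests.get(url, headers=headers)
with open(url.split('/')[-1], 'wb') as f:
    f.write(r.content)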
I tried this with Python using the following code, which replaces urllib with urllib2:
url = "http://www29.zippyshare.com/d/12069311/2695/Del%20Paxton-Worst.%20Summer.%20Ever%20EP%20%282013%29.rar"
import urllib2
file_name = url.split('/')[-1]
response = urllib2.urlopen(url)
data = response.read()
with open(file_name, 'wb') as bin_writer:
bin_writer.write(data)
and I get the same 22 KB file. Trying it with wget on that URL yields the same file; however, I was able to begin the download of the full file (around 35 MB, as I recall) by pasting the URL into the Chrome navigation bar. Perhaps they are serving different files based upon the headers you are sending in your request? The User-Agent request header is going to look different to their server (i.e. not like a browser) coming from Python/wget than it does from your browser when you click on the button.
I did not open the .rar archives to inspect the two files.
This thread discusses setting headers with urllib2, and the Python documentation explains how to read the response status codes from your urllib2 request, which could be helpful as well.
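For completeness, a tiny sketch (my addition) of checking the status code and final URL with urllib2, matching the Python 2 code above:

import urllib2

response = urllib2.urlopen(url)
print response.getcode()  # e.g. 200
print response.geturl()   # final URL after any redirects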
