Python download file from URL

I'm trying to download a file from this URL. It is an Excel file, but only 1 KB of it is downloaded, while the actual file size is 7 MB. I don't understand what has to be done here.
But if I copy and paste the URL into IE, the entire file is downloaded.
import requests

res = requests.get('http://fescodoc.***.com/fescodoc/component/getcontent?objectId=09016fe98f2b59bb&current=true')
res.raise_for_status()
playFile = open('DC_Estimation Form.xlsm', 'wb')
for chunk in res.iter_content(1024):
    playFile.write(chunk)

You should set stream=True in the get() call:
res = requests.get('http://fescodoc.***.com/fescodoc/component/getcontent?objectId=09016fe98f2b59bb&current=true', stream=True)
res.raise_for_status()
with open('DC_Estimation Form.xlsm', 'wb') as playFile:
    for chunk in res.iter_content(1024):
        playFile.write(chunk)
See here: http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

For such cases it is easier to use the built-in urllib module: https://docs.python.org/2/library/urllib.html#urllib.urlretrieve
import urllib

urllib.urlretrieve('http://fescodoc/component/.', 'DC_Estimation_Form.xlsm')
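On Python 3 the same helper lives in urllib.request; a minimal sketch (the URL below is a hypothetical stand-in, since the real fescodoc address is elided above):
import urllib.request

# hypothetical placeholder for the elided fescodoc URL above
url = 'http://fescodoc.example.com/fescodoc/component/getcontent?current=true'
urllib.request.urlretrieve(url, 'DC_Estimation_Form.xlsm')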

Related

How to download a huge gz file (around 3 GB size) from a URL where there is no file name present using python

I am trying to download a file from a URL using Python. However, it's not working: instead of the file, I am getting index.html. Please help.
import requests

target_url = "https://transparency-in-coverage.uhc.com/?file=2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz&origin=uhc"
filename = "2022-07-01_United-HealthCare-Services_Third-Party-Administrator_EP1-50_C1_in-network-rates.json.gz"
with requests.get(target_url, stream=True) as r:
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
That's because the URL you specified is for an HTML page that subsequently starts the download of the .gz file you want.
This is the link for the file:
https://mrfstorageprod.blob.core.windows.net/mrf-even/2022-07-01_ALL-SAVERS-INSURANCE-COMPANY_Insurer_PS1-50_C2_in-network-rates.json.gz?sv=2021-04-10&st=2022-07-05T22%3A19%3A13Z&se=2022-07-09T22%3A19%3A13Z&skoid=89efab61-5daa-4cf2-aa04-ce3ba9d1d1e8&sktid=db05faca-c82a-4b9d-b9c5-0f64b6755421&skt=2022-07-05T22%3A19%3A13Z&ske=2022-07-09T22%3A19%3A13Z&sks=b&skv=2021-04-10&sr=b&sp=r&sig=NaLrw2KG239S%2BpfZibvw7%2B25AAQsf9GYZ1gFK0KRN20%3D&rscd=attachment
To find it, you need to have the inspector open on the 'Network' tab whilst loading the page (or you can click on the file in the list once the page has loaded its list of files). When the download starts, you'll see two requests pop up, one of which is the actual URL of the .gz file.
The URL does look like it has a timestamp in it, so it might not work at a later time; I don't know.
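Once you have the direct URL, the streaming code from the question works against it unchanged. A minimal sketch (the SAS-signed link above will have expired, so direct_url below is a stand-in for a freshly captured one):
import requests

# stand-in: paste the .gz URL captured from the Network tab here
direct_url = "https://mrfstorageprod.blob.core.windows.net/...&rscd=attachment"
filename = "in-network-rates.json.gz"  # any local name you like

with requests.get(direct_url, stream=True) as r:
    r.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)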

How to download a file using requests

I am using the requests library to download a file from a URL. This is my code:
for tag in soup.find_all('a'):
    if '.zip' in str(tag):
        file_name = str(tag).strip().split('>')[-2].split('<')[0]
        link = link_name + tag.get('href')
        r = requests.get(link, stream=True)
        with open(os.path.join(download_path, file_name), 'wb') as fd:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    fd.write(chunk)
And then I unzip the file using this code:
unzip_path = os.path.join(download_path, file_name.split('.')[0])
with zipfile.ZipFile(os.path.join(download_path, file_name), 'r') as zip_ref:
    zip_ref.extractall(unzip_path)
This code looks for a zip file on the provided page and then downloads the zipped file into a directory. Then it unzips the file using the zipfile library.
The problem with this code is that sometimes the download is not complete. For example, if the zipped file is 312 KB, only part of it is downloaded, and then I get a BadZipFile error. But sometimes the entire file is downloaded correctly.
I tried the same without streaming, and even that results in the same problem.
How do I check whether all the chunks were downloaded properly?
Maybe this works:
r = requests.get(link)
with open(os.path.join(download_path, file_name), 'wb') as fd:
    fd.write(r.content)
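As for checking whether all the chunks arrived: when the server sends a Content-Length header, you can compare it against the number of bytes actually written to disk. A hedged sketch, reusing link, download_path and file_name from the question (the header is optional, so the check is skipped when it is absent):
import os
import requests

r = requests.get(link, stream=True)
r.raise_for_status()

expected = int(r.headers.get("Content-Length", 0))
path = os.path.join(download_path, file_name)
with open(path, "wb") as fd:
    for chunk in r.iter_content(chunk_size=1024):
        fd.write(chunk)

# Content-Length is optional; expected == 0 means the server did not send it
if expected and os.path.getsize(path) != expected:
    raise IOError("incomplete download: got %d of %d bytes"
                  % (os.path.getsize(path), expected))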

Download a file in python with urllib2 instead of urllib

I'm trying to download a tarball file and save it locally with Python. With urllib it's pretty simple:
import urllib
import tarfile

urllib.urlretrieve(url, 'compressed_file.tar.gz')
tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
So my question is really simple: What's the way to achieve this using the urllib2 library?
Quoting the docs:
urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])
Open the URL url, which can be either a string or a Request object.
data may be a string specifying additional data to send to the server, or None if no such data is needed.
Nothing in the urlopen interface documentation says that the second argument is the name of a file where the response should be written.
You need to explicitly write the data read from the response to a file:
import urllib2

r = urllib2.urlopen(url)
CHUNK_SIZE = 1 << 20
with open('compressed_file.tar.gz', 'wb') as f:
    # the line below reads the whole file into memory at once and dumps it to disk afterwards
    # f.write(r.read())
    # the preferable lazy solution below downloads and writes the data in chunks
    while True:
        chunk = r.read(CHUNK_SIZE)
        if not chunk:
            break
        f.write(chunk)
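Since the question's goal was to open the tarball afterwards, opening it with tarfile also doubles as a rough integrity check; a short follow-up sketch (getmembers() reads through the whole archive and raises an error if it is truncated):
import tarfile

tar = tarfile.open('compressed_file.tar.gz')
print tar.getmembers()
tar.close()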

Download and save PDF file with Python requests module

I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
In [1]: import requests
In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
In [3]: response = requests.get(url)
In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2 f.write(response.text)
3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
In [5]: import codecs
In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:
I know it is a codec problem of some kind but I can't seem to get it to work.
You should use response.content in this case:
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
From the documentation:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
So that means: response.text returns the output as a string object; use it when you're downloading a text file, such as an HTML file.
And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
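A two-line check makes the difference concrete; a minimal sketch:
import requests

response = requests.get('http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf')
print(type(response.text))     # <class 'str'>  - decoded text, wrong for a PDF
print(type(response.content))  # <class 'bytes'> - raw bytes, right for a PDF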
You can also use response.raw instead; however, use it when the file you're about to download is large. Below is a basic example which you can also find in the documentation:
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000  # bytes per chunk
r = requests.get(url, stream=True)
with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
chunk_size is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time, writing each chunk to the file, again and again, until it finishes.
This can save RAM. But I'd prefer to use response.content in this case, since your file is small. As you can see, the streaming approach is more complex.
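For completeness, reading from response.raw is usually done with shutil.copyfileobj; a minimal sketch of that pattern:
import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
r.raw.decode_content = True  # have urllib3 undo any gzip/deflate transfer-encoding
with open('/tmp/metadata.pdf', 'wb') as fd:
    shutil.copyfileobj(r.raw, fd)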
Related:
How to download large file in python with requests.py?
How to download image using requests
In Python 3, I find pathlib is the easiest way to do this. Requests' response.content marries up nicely with pathlib's write_bytes.
from pathlib import Path
import requests
filename = Path('metadata.pdf')
url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)
filename.write_bytes(response.content)
You can use urllib:
import urllib.request
urllib.request.urlretrieve(url, "filename.pdf")
Please note I'm a beginner. If my solution is wrong, please feel free to correct me and/or let me know; I may learn something new too.
My solution:
Change downloadPath according to where you want your file to be saved. Feel free to use an absolute path as well.
Save the below as downloadFile.py.
Usage: python downloadFile.py url-of-the-file-to-download new-file-name.extension
Remember to add an extension!
Example usage: python downloadFile.py http://www.google.co.uk google.html
import requests
import sys
import os

def downloadFile(url, fileName):
    with open(fileName, "wb") as file:
        response = requests.get(url)
        file.write(response.content)

scriptPath = sys.path[0]
downloadPath = os.path.join(scriptPath, '../Downloads/')
url = sys.argv[1]
fileName = sys.argv[2]

print('path of the script: ' + scriptPath)
print('downloading file to: ' + downloadPath)
downloadFile(url, downloadPath + fileName)
print('file downloaded...')
print('exiting program...')
Generally, this should work in Python 3:
import urllib.request
..
data = urllib.request.urlopen(url).read()
Remember that urllib and urllib2 as separate top-level modules are Python 2 only; in Python 3 their functionality lives in urllib.request.
If in some mysterious case requests doesn't work (it happened to me), you can also try using
wget.download(url)
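That wget is the third-party wget package from PyPI (pip install wget), not the command-line tool; a minimal usage sketch:
import wget  # pip install wget

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
filename = wget.download(url)  # returns the name of the saved file
print(filename)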
Related:
Here's a decent explanation/solution to find and download all pdf files on a webpage:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
Regarding Kevin's answer about writing into a tmp folder, it should be like this:
with open('./tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
He forgot the . before the path, and of course your tmp folder should already have been created.

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 KB when it should be around 700 KB, and it contains a list of the contents of the page (i.e., all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 and different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2

urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use a Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
I was able to download what seems like the correct size for your file with the following Python 2 code:
import urllib2

filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: the answer above was posted faster and is much better. I thought I'd post this anyway, since I had already tested and written it out.
