Unzip downloaded gzipped content on the fly - python

I am downloading a gzipped CSV with Python, and I would like to write it to disk directly as a CSV.
I tried several variations of the following:
url = 'blabla.s3....csv.gz'
filename = 'my.csv'
compressed = requests.get(url).content
data = gzip.GzipFile(fileobj=compressed)
with open(filename, 'wb') as out_file:
    out_file.write(data)
but I am getting various errors - I am not sure I am passing the right part of the response to the gzip method. If anyone has experience with this, input is appreciated.

You should be able to use zlib to decompress the response.
import zlib

res = requests.get(url)
# 32 + MAX_WBITS tells zlib to auto-detect the gzip/zlib header
data = zlib.decompress(res.content, zlib.MAX_WBITS | 32)
Now, write to a file:
with open(filename, 'wb') as f:
    f.write(data)
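If you would rather stay with the gzip module from the question, wrapping the downloaded bytes in io.BytesIO gives GzipFile the file-like object it expects. A minimal sketch, reusing the url and filename from the question:
import gzip
import io
import requests

url = 'blabla.s3....csv.gz'
filename = 'my.csv'

compressed = requests.get(url).content
# GzipFile needs a file-like object, not raw bytes
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
    data = gz.read()  # decompressed CSV bytes

with open(filename, 'wb') as out_file:
    out_file.write(data)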

Related

How to download a file using requests

I am using the requests library to download a file from a URL. This is my code:
for tag in soup.find_all('a'):
    if '.zip' in str(tag):
        file_name = str(tag).strip().split('>')[-2].split('<')[0]
        link = link_name + tag.get('href')
        r = requests.get(link, stream=True)
        with open(os.path.join(download_path, file_name), 'wb') as fd:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    fd.write(chunk)
And then I unzip the file using this code:
unzip_path = os.path.join(download_path, file_name.split('.')[0])
with zipfile.ZipFile(os.path.join(download_path, file_name), 'r') as zip_ref:
    zip_ref.extractall(unzip_path)
This code looks for zip files linked on the provided page, downloads each zipped file into a directory, and then unzips it using the zipfile library.
The problem with this code is that sometimes the download is not complete. For example, if the zipped file is 312KB long, only part of it is downloaded, and then I get a BadZipFile error. But sometimes the entire file is downloaded correctly.
I tried the same without streaming and even that results in the same problem.
How do I check whether all the chunks are downloaded properly?
Maybe this works:
r = requests.get(link)
with open(os.path.join(download_path, file_name), 'wb') as fd:
    fd.write(r.content)
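To answer the "are all the chunks downloaded" part: if the server sends a Content-Length header you can compare it against the number of bytes actually written. This is only a sketch, and it assumes the header is present and matches the body size (it reuses link, download_path and file_name from the question's code):
import os
import requests

r = requests.get(link, stream=True)
r.raise_for_status()  # fail early on HTTP errors instead of writing a partial body
expected = int(r.headers.get('Content-Length', 0))

target = os.path.join(download_path, file_name)
with open(target, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=1024):
        fd.write(chunk)

actual = os.path.getsize(target)
if expected and actual != expected:
    raise IOError("Incomplete download: got %d of %d bytes" % (actual, expected))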

Python write file content from requests response to corresponding file

I have an API service which responds with the file content as a byte string plus the extension of the file. I need to write this response to the corresponding file based on the extension. The problem is that I have to work with a lot of file types, including pdf, pkl, sav, csv, etc. I can't find a generic solution to tackle this problem.
For csv files I am doing:
data = response.content.decode('utf-8').splitlines()
print(data)
import csv, re
with open("tet.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter='\t')
    for line in data:
        writer.writerow(re.split(r'\s+', line))
For pdf files:
with open('hh.pdf', 'wb') as fd:
    fd.write(oR.content)
I can't seem to get a generic solution to tackle this problem.
Any help is appreciated. Thanks in advance.
The incoming response.content from the API is in binary format, so irrespective of the file type it just needs to be written back out in binary mode:
with open(f'file.{extension}', 'w+b') as f:
    f.write(res.content)
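A generic helper along those lines, assuming the API hands you the raw bytes and the extension separately (the save_response name and base_name argument are just illustrative):
def save_response(content, extension, base_name='file'):
    # content is the raw bytes from response.content; extension is e.g. 'pdf', 'csv', 'pkl'
    filename = '{}.{}'.format(base_name, extension)
    with open(filename, 'wb') as f:
        f.write(content)
    return filename

# e.g. save_response(response.content, 'pdf')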

How to compress pdf without losing quality using PyPDF2 [duplicate]

I am struggling to compress my merged PDFs using the PyPDF2 module. This is my attempt, based on http://www.blog.pythonlibrary.org/2012/07/11/pypdf2-the-new-fork-of-pypdf/:
import PyPDF2
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
pdf.filters.compress(merger)
merger.write(open("test_out2.pdf", 'wb'))
The error I receive is
TypeError: must be string or read-only buffer, not file
I have also tried compressing the pdf after the merging is complete. I am judging the result against the file size I got after using PDFSAM with compression.
Any thoughts? Thanks.
PyPDF2 doesn't have a reliable compression method. That said, there's a compress_content_streams() method with the following description:
Compresses the size of this page by joining all content streams and applying a FlateDecode filter.
However, it is possible that this function will perform no action if content stream compression becomes "automatic" for some reason.
Again, this won't make any difference in most cases but you can try this code:
from PyPDF2 import PdfReader, PdfWriter

writer = PdfWriter()
for pdf in ["path/to/hello.pdf", "path/to/another.pdf"]:
    reader = PdfReader(pdf)
    for page in reader.pages:
        page.compress_content_streams()
        writer.add_page(page)

with open("test_out2.pdf", "wb") as f:
    writer.write(f)
Your error says that it must be a string or read-only buffer, not a file.
So it's better to write your merger to a bytes buffer first.
import PyPDF2
from io import BytesIO
tmp = BytesIO()
path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')
merger = PyPDF2.PdfFileMerger()
merger.append(fileobj=path2)
merger.append(fileobj=path)
merger.write(tmp)
PyPDF2.filters.compress(tmp.getvalue())
merger.write(open("test_out2.pdf", 'wb'))
The initial approach isn't that wrong. Just add the pages to your writer and compress them before writing to a file:
...
for i in range(reader.numPages):
    page = reader.getPage(i)
    writer.addPage(page)
for i in range(writer.getNumPages()):
    page = writer.getPage(i)
    page.compressContentStreams()
...
pypdf offers several ways to reduce the file size: https://pypdf.readthedocs.io/en/latest/user/file-size.html
compress_content_streams is one whose only disadvantage is that it can take a long time (depending on the PDF; think of it as ZIP-for-PDF):
from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.compress_content_streams()  # This is CPU intensive!
    writer.add_page(page)

with open("out.pdf", "wb") as f:
    writer.write(f)

Reading and returning jpg file in Python, ASP.NET?

I have an image.asp file, using Python, in which I am attempting to open a JPEG image and write it to the response so that it can be retrieved from the relevant link. What I have currently:
<%# LANGUAGE = Python%>
<%
path = "path/to/image.jpg"
with open(url, 'rb') as f:
    jpg = f.read()
Response.ContentType = "image/jpeg"
Response.WriteBinary(jpg)
%>
In a browser, this returns the following error:
The image "url/to/image.asp" cannot be displayed because it contains errors.
I suspect the issue is that I am just not writing contents of the jpg file correctly. What do I need to fix?
Your issue is here:
with open(url, 'rb') as f:
The variable that contains the path is named path, not url.
Make it:
with open(path, 'rb') as f:
and it will work better.
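Put together, the block in image.asp would then read as below (the Response calls are kept exactly as in the question, since they depend on your ASP setup):
<%# LANGUAGE = Python%>
<%
path = "path/to/image.jpg"
with open(path, 'rb') as f:
    jpg = f.read()
Response.ContentType = "image/jpeg"
Response.WriteBinary(jpg)
%>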

downloading a file, not the contents

I am trying to automate downloading a .Z file from a website, but the file I get is 2 KB when it should be around 700 KB, and it contains a listing of the page's contents (i.e. all the files available for download). I am able to download it manually without a problem. I have tried urllib and urllib2 with different configurations of each, but each does the same thing. I should add that the urlVar and fileName variables are generated in a different part of the code, but I have given an example of each here to demonstrate.
import urllib2
urlVar = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z"
fileName = "txga1000.14d.Z"
downFile = urllib2.urlopen(urlVar)
with open(fileName, "wb") as f:
    f.write(downFile.read())
At least the urllib2 documentation suggests you should use the Request object. This works for me:
import urllib2
req = urllib2.Request("ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/txga1000.14d.Z")
response = urllib2.urlopen(req)
data = response.read()
Data length seems to be 740725.
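To save those bytes to disk, write them out in binary mode, for example:
with open("txga1000.14d.Z", "wb") as f:
    f.write(data)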
I was able to download what seems like the correct size for your file with the following Python 2 code:
import urllib2
filename = "txga1000.14d.Z"
url = "ftp://www.ngs.noaa.gov/cors/rinex/2014/100/txga/{}".format(filename)
reply = urllib2.urlopen(url)
buf = reply.read()
with open(filename, "wb") as fh:
    fh.write(buf)
Edit: the answer above was posted faster and is much better; I thought I'd post this anyway since I had already tested and written it out.
