how to gzip files using writelines? - python

I have a few files in my tmp folder that I want to gzip individually and upload to S3. The testList contains paths like /tmp/files/File1, so fileName2, which I pass to gzip.open(), is /tmp/files/File1.gz. I want to gzip each file in testList.
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open("path/to/file", 'rb') as orig_file:
        with gzip.open(fileName2, 'wb') as zipped_file:
            zipped_file.writelines(orig_file)
    bucket.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
When I download the files from S3, they have a gz file type but I am unable to open them locally. It throws an error that the .gz file is empty and cannot be expanded. I believe the way I am writing content is incorrect.
How can I fix this?
Edit:
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open(i, 'rb') as f_in:
        with gzip.open(fileName2, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    f_out.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
Even with this, the gzip files still cannot be expanded.

You are getting an open file in orig_file, not just lines.
I think your use case is about turning an existing file into a compressed one. So the following should be the relevant paragraph from the Examples of usage section of the documentation:
Example of how to GZIP compress an existing file:
import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
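For the upload step, here is a minimal sketch of one way it could look. It assumes the cause of the empty upload is that the question passes the gzip handle opened in write mode (zipped_file) to upload_fileobj, so it closes the .gz file first and reopens it for reading. The names gzip_file, gzip_and_upload and the upload callable are hypothetical, and naming the key after the .gz file is my assumption.

```python
import gzip
import os
import shutil


def gzip_file(src_path):
    """Compress src_path to src_path + '.gz' and return the new path."""
    gz_path = src_path + ".gz"
    with open(src_path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    # Both handles are closed here, so the complete gzip stream is on disk.
    return gz_path


def gzip_and_upload(paths, upload):
    """Gzip each path, then hand the finished .gz file to `upload`.

    `upload(fileobj, key)` is a hypothetical callable, for example a small
    wrapper around bucket.upload_fileobj from the question.
    """
    for path in paths:
        gz_path = gzip_file(path)
        key = os.path.basename(gz_path)  # assumed key naming, e.g. "File1.gz"
        # Reopen the compressed file for reading; a handle opened with
        # mode 'wb' cannot be read back by the uploader.
        with open(gz_path, "rb") as gz_in:
            upload(gz_in, key)
```

With the bucket object from the question, upload could be something like lambda f, key: bucket.upload_fileobj(f, key, ExtraArgs={'ContentType': "application/gzip"}).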

Related

Unzipping multiple .gz files into single text file using Python

I am trying to unzip multiple files with the .gz extension into a single .txt file. All of these files contain JSON data.
I tried the following code:
from glob import glob
import gzip
import shutil
for fname in glob('.../2020-04/*gz'):
    with gzip.open(fname, 'rb') as f_in:
        with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
But the decompressed.txt file only has the last .gz file's data.
Just shuffle f_out to the outside, so you open it before iterating over the input files and keep that one handle open.
from glob import glob
import gzip
import shutil
with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
    for fname in glob('.../2020-04/*gz'):
        with gzip.open(fname, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
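Use "ab" mode instead. a opens the file in append mode; w alone truncates the file each time it is opened. Note that in append mode, re-running the script adds to any existing decompressed.txt.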
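Putting the first suggestion into a reusable form, a rough sketch might look like this. The function name concat_gz and its parameters are made up for illustration, and sorting the matches is my own addition so the output order is predictable.

```python
import glob
import gzip
import shutil


def concat_gz(pattern, out_path):
    """Decompress every file matching `pattern` into one output file."""
    # Sort the matches first: glob does not promise any particular order.
    names = sorted(glob.glob(pattern))
    # Open the output once, outside the loop, so "wb" truncates only once.
    with open(out_path, "wb") as f_out:
        for name in names:
            with gzip.open(name, "rb") as f_in:
                shutil.copyfileobj(f_in, f_out)
    return names
```

Because the output handle stays open for the whole loop, each input is written after the previous one instead of replacing it.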

Trying to merge all text files in a folder and append file as well

I am trying to merge all text files in a folder. I have this part working, but when I try to write the file name before the contents of each text file, I get an error that reads: TypeError: a bytes-like object is required, not 'str'
The code below must be pretty close, but something is definitely off. Any thoughts what could be wrong?
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f)
            outfile.write(infile.read())
outfile.close
outfile.write(f) seems to be your problem: you opened the file in binary mode with 'wb', but f is a str. You can convert it to bytes using encode. You also don't need to close outfile in your last line, since the with block already does that (and without parentheses you aren't calling the function anyway). So something like this might work for you:
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f.encode('utf-8'))
            outfile.write(infile.read())
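A slightly more defensive variant of the same idea, as a sketch only. The function name merge_with_names is hypothetical, and two choices here are my assumptions rather than anything from the question: it writes just the base name followed by a newline, and it skips the output file itself, which also matches *.txt once it exists from an earlier run.

```python
import glob
import os


def merge_with_names(folder, out_name="final_result.txt"):
    """Merge every .txt file in `folder`, each preceded by its file name."""
    out_path = os.path.join(folder, out_name)
    # Collect the inputs before creating the output, and skip the output
    # itself in case it is left over from an earlier run.
    inputs = [p for p in sorted(glob.glob(os.path.join(folder, "*.txt")))
              if os.path.abspath(p) != os.path.abspath(out_path)]
    with open(out_path, "wb") as outfile:
        for path in inputs:
            header = os.path.basename(path) + "\n"
            outfile.write(header.encode("utf-8"))  # str -> bytes for "wb"
            with open(path, "rb") as infile:
                outfile.write(infile.read())
    return out_path
```

The alternative is to open everything in text mode ("w" and "r"), in which case no encode call is needed, at the cost of Python decoding and re-encoding the file contents.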

Python "gzip" module acting weirdly if compressed extension is not ".gz"

I need to compress a file using the gzip module, but the output file extension may not be .gz.
Look at this simple code:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
It works fine. But if I replace ".gz" with ".gzip", for example, I am no longer able to open the compressed file correctly.
I tried with 7-Zip and WinRAR; the result is the same, and the problem persists even if I rename the file.
Does anyone know where the problem comes from, please?
I tried with compression bz2 and lzma, they seem to work properly no matter what the extension is.
You actually have two versions of file created this way:
First, .gz file:
with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Second, .gzip file:
with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Both create a GZIP with your test.txt in it. The only difference is that in the second case, test.txt is renamed to test.txt.gzip.
The problem is that the argument to gzip.open actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).
So, if you do gzip.open("abcd", 'wb') and write to it, it will create gzip archive named abcd with a file named abcd inside.
But then, there comes magic: if the filename ends with .gz, then it behaves differently, e.g. gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.
So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.
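If you want to check this yourself, here is a small sketch that reads the name stored in the gzip header (the FNAME field described in RFC 1952). The helper name stored_name is made up for illustration; it only parses the fixed header fields and the optional extra field that precede the name.

```python
import struct


def stored_name(path):
    """Return the FNAME field of a gzip file's header, or None if absent."""
    with open(path, "rb") as fh:
        magic, method, flags = struct.unpack("<2sBB", fh.read(4))
        if magic != b"\x1f\x8b":
            raise ValueError("not a gzip file")
        fh.read(6)  # skip MTIME (4 bytes), XFL and OS (1 byte each)
        if flags & 0x04:  # FEXTRA: 2-byte length, then that many bytes
            (extra_len,) = struct.unpack("<H", fh.read(2))
            fh.read(extra_len)
        if not flags & 0x08:  # FNAME flag not set, no name stored
            return None
        name = bytearray()
        while True:
            byte = fh.read(1)
            if byte in (b"", b"\x00"):  # NUL terminator or end of file
                break
            name += byte
        return name.decode("latin-1")
```

Calling stored_name on the two archives from the question should show the difference between the .gz and .gzip cases directly.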
The filename inside the archive can be controlled by using the gzip.GzipFile constructor instead of gzip.open. gzip.GzipFile then needs a separate open() call before it, so that the archive path and the stored filename can differ.
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        ...
        f_out.flush()
Note also the added f_out.flush(): in my experience, without this line GzipFile may in some cases fail to flush the data before the file is closed, resulting in a corrupt archive.
Or as a complete example:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
        f_out.flush()

How to save created Zip file to file system in python?

Using python zipfile module I created a zip file as follows:
s = StringIO()
zip_file = zipfile.ZipFile(s, "w")
zip_file.write('/local/my_files/my_file.txt')
s.seek(0)
and now, I want this zip_file to be saved in my file system at path /local/my_files/ as my_file.zip.
Generally, to save a normal file I use the following flow:
with open(dest_file, 'w') as out_file:
    for line in in_file:
        out_file.write(line)
But I don't think I can save a zip file this way. Can anyone please help me get this done?
zip_file = zipfile.ZipFile("/local/my_files/my_file.zip", "w")
zip_file.write('/local/my_files/my_file.txt')
zip_file.close()
The first argument of the ZipFile object initialization is the path to which you want to save the zip file.
If you need to use StringIO, just try this code (it is Python 2; on Python 3 the StringIO module no longer exists and the buffer has to be an io.BytesIO):
from StringIO import StringIO
import zipfile
s = StringIO()
with zipfile.ZipFile(s, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('/local/my_files/my_file.txt')
with open('/local/my_files/my_file.zip', 'wb') as f_out:
    f_out.write(s.getvalue())
Or you can do it in a simpler way:
import zipfile
with zipfile.ZipFile("/local/my_files/my_file.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("/local/my_files/my_file.txt")
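For Python 3, the in-memory version could be sketched roughly like this, using io.BytesIO as the buffer. The function names zip_to_bytes and save_zip and the arcname parameter are my own, chosen for illustration.

```python
import io
import zipfile


def zip_to_bytes(src_path, arcname=None):
    """Build a zip archive in memory and return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src_path, arcname=arcname)
    # The archive is complete only after the ZipFile has been closed,
    # because closing is what writes the central directory.
    return buf.getvalue()


def save_zip(src_path, dest_path, arcname=None):
    """Write the in-memory archive to dest_path."""
    with open(dest_path, "wb") as f_out:
        f_out.write(zip_to_bytes(src_path, arcname))
```

This is only worth doing if you need the bytes for something else as well (sending them over a network, say); otherwise passing the destination path straight to ZipFile is simpler.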

IOError when downloading and decompressing gzip file

I'm trying to download and decompress a gzip file, then convert the resulting TSV file into CSV, which would be easier to parse. I am trying to gather the data from the "Download Table" link in this URL. My code is as follows; it uses the same idea as in this post. However, I get the error IOError: [Errno 2] No such file or directory: 'file=data/irt_euryld_d.tsv' on the line with open(outFilePath, 'w') as outfile:
import os
import urllib2
import gzip
import StringIO
baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?"
filename = "D:\Sidney\irt_euryld_d.tsv.gz" #Edited after heinst's comment below
outFilePath = filename[:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())
#Now have to deal with tsv file
import csv
with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout) #Converting output into CSV Format
Thank You
The path you set filename to is not a valid path for writing a file. Change filename = "data/irt_euryld_d.tsv.gz" to a valid path to wherever you want the irt_euryld_d.tsv.gz file to live. For example, if I wanted the file on my desktop, I would set filename = "/Users/heinst/Desktop/data/irt_euryld_d.tsv.gz". Since this is a valid path, Python will no longer raise the No such file or directory error.
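To keep the local path separate from whatever is used to build the URL, the decompress-and-convert step could be sketched like this in Python 3. The download itself is left out: the function assumes you already have the compressed bytes (for example from response.read()). The name gz_tsv_to_csv is hypothetical, and UTF-8 is an assumed encoding for the data.

```python
import csv
import gzip
import io
import os


def gz_tsv_to_csv(compressed, csv_path):
    """Decompress gzip-compressed TSV bytes and write them out as CSV."""
    out_dir = os.path.dirname(csv_path)
    if out_dir:
        # Create the target folder so open() cannot fail on a missing one.
        os.makedirs(out_dir, exist_ok=True)
    text = gzip.decompress(compressed).decode("utf-8")  # assumed encoding
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_out:
        csv.writer(csv_out).writerows(reader)
    return csv_path
```

Here csv_path is used only for the local file, so it can be any writable location regardless of how the remote file is named.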
