Python "gzip" module acting weirdly if compressed extension is not ".gz" - python

I need to compress a file using the gzip module, but the output file extension may not be .gz.
Look at this simple code:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
It works fine. But if I replace ".gz" with ".gzip", for example, I am no longer able to open the compressed file correctly.
I tried with 7-Zip and WinRAR; the result is the same, and the problem persists even if I rename the file.
Does anyone know where the problem comes from, please?
I also tried bz2 and lzma compression; they seem to work properly no matter what the extension is.

You actually get two different archives this way.
First, the .gz file:
with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Second, the .gzip file:
with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Both create a gzip archive with the contents of your test.txt in it. The only difference is that in the second case the file inside is named test.txt.gzip instead of test.txt.
The problem is that the argument to gzip.open actually serves two purposes: it is the filename of the gzip archive and also the filename recorded for the file inside it (bad design, imho).
So, if you do gzip.open("abcd", 'wb') and write to it, it will create a gzip archive named abcd with a file named abcd inside.
But then comes the magic: if the filename ends with .gz, it behaves differently, e.g. gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.
So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.
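One way to check this is to read the name back out of the gzip header. The sketch below is only an illustration: read_gzip_fname and stored_names are hypothetical helper names, not part of the gzip module, and the parsing assumes the header layout from RFC 1952 (10 fixed bytes, an optional extra field, then the optional zero-terminated file name).

```python
import gzip
import os
import tempfile

FEXTRA = 0x04  # header flag: an "extra" field precedes the name
FNAME = 0x08   # header flag: a zero-terminated original file name is present


def read_gzip_fname(path):
    """Return the file name stored in a gzip header, or None if there is none."""
    with open(path, "rb") as f:
        header = f.read(10)
        if len(header) < 10 or header[:2] != b"\x1f\x8b":
            raise ValueError("not a gzip file: %r" % path)
        flags = header[3]
        if flags & FEXTRA:
            # Skip the optional extra field (2-byte little-endian length + data).
            extra_len = int.from_bytes(f.read(2), "little")
            f.read(extra_len)
        if not flags & FNAME:
            return None
        name = bytearray()
        while True:
            byte = f.read(1)
            if byte in (b"", b"\x00"):
                break
            name += byte
        return name.decode("latin-1")


def stored_names(extensions=(".gz", ".gzip")):
    """Compress the same data under each extension and collect the stored names."""
    names = {}
    with tempfile.TemporaryDirectory() as tmp:
        for ext in extensions:
            path = os.path.join(tmp, "test.txt" + ext)
            with gzip.open(path, "wb") as f_out:
                f_out.write(b"abc" * 10)
            names[ext] = read_gzip_fname(path)
    return names


if __name__ == "__main__":
    for ext, name in stored_names().items():
        print(ext, "->", name)
```

With CPython's gzip module, the .gz archive should report test.txt and the .gzip archive should report test.txt.gzip, which matches what the archive tools display.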

The filename inside the archive can be controlled by using the gzip.GzipFile constructor instead of gzip.open. GzipFile then needs a separate open() call before it, so that it can be handed the already opened output file:
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        ...
        f_out.flush()
Note also the added f_out.flush(): in my experience, without this line GzipFile may in some cases fail to flush the data before the file is closed, resulting in a corrupt archive.
Or as a complete example:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
        f_out.flush()

Related

how to gzip files using writelines?

I have a few files in my tmp folder that I want to gzip individually and upload to S3. The testList contains paths like /tmp/files/File1. So fileName2, which I use for gzip.open(), is /tmp/files/File1.gz. I want to gzip each file in the testList.
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open("path/to/file", 'rb') as orig_file:
        with gzip.open(fileName2, 'wb') as zipped_file:
            zipped_file.writelines(orig_file)
    bucket.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
When I download the files from S3, they have a gz file type but I am unable to open them locally. It throws an error that the .gz file is empty and cannot be expanded. I believe the way I am writing content is incorrect.
How can I fix this?
Edit:
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open(i, 'rb') as f_in:
        with gzip.open(fileName2, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    f_out.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
Even with this, the gzip files are still not expandable.
You are getting an open file in orig_file, not just lines.
I think your use case is about turning an existing file into a compressed one. So the following should be the relevant paragraph from the Examples of usage section of the documentation:
Example of how to GZIP compress an existing file:
import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
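For the upload step, the empty archives are probably caused by handing the write-mode gzip object to the uploader rather than the finished file on disk. A minimal sketch of one way to order the steps, assuming the uploader only needs a readable binary file object: gzip_and_upload and the upload callable are hypothetical names; with boto3, upload could wrap the bucket.upload_fileobj call from the question.

```python
import gzip
import os
import shutil
import tempfile


def gzip_and_upload(paths, upload):
    """Compress each path to path + '.gz', then pass the finished archive to upload().

    upload is a hypothetical callable taking (binary_file_object, key).
    """
    for path in paths:
        gz_path = path + ".gz"
        with open(path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        # The gzip stream is closed here, so the archive on disk is complete.
        # Reopen it as a plain binary file so the uploader reads the compressed bytes.
        with open(gz_path, "rb") as compressed:
            upload(compressed, os.path.basename(gz_path))


def demo():
    """Run the helper against a temporary file with an in-memory fake uploader."""
    uploaded = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "File1")
        with open(path, "wb") as f:
            f.write(b"some text\n" * 3)
        gzip_and_upload(
            [path],
            lambda fileobj, key: uploaded.update({key: fileobj.read()}),
        )
    return uploaded
```

The point of the design is simply that the upload happens after the gzip with block has exited and reads from a file opened in 'rb' mode.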

Unzipping multiple .gz files into single text file using Python

I am trying to unzip multiple .gz files into a single .txt file. All of these files contain JSON data.
I tried the following code:
from glob import glob
import gzip
import shutil
for fname in glob('.../2020-04/*gz'):
    with gzip.open(fname, 'rb') as f_in:
        with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
But the decompressed.txt file only has the last .gz file's data.
Just shuffle f_out to the outside, so you open it before iterating over the input files and keep that one handle open.
from glob import glob
import gzip
import shutil
with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
    for fname in glob('.../2020-04/*gz'):
        with gzip.open(fname, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
Use "ab" mode instead. "a" opens the file in append mode; "w" erases the file upon opening. The two cannot be combined, so a mode such as "wba" is rejected by open().
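A mode string may contain only one of "r", "w", "a" or "x", so appending in binary is spelled "ab". The sketch below illustrates both points on a temporary file; append_chunks and demo are hypothetical names used only for this example.

```python
import os
import tempfile


def append_chunks(path, chunks):
    """Reopen the file for every chunk, as the loop in the question does."""
    for chunk in chunks:
        with open(path, "ab") as f_out:  # "ab" keeps what is already there
            f_out.write(chunk)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "decompressed.txt")
        append_chunks(path, [b"first\n", b"second\n"])
        with open(path, "rb") as f:
            appended = f.read()
        try:
            open(path, "wba")  # mixes write and append
        except ValueError:
            combined_mode_rejected = True
        else:
            combined_mode_rejected = False
    return appended, combined_mode_rejected
```

One caveat with append mode: data left over from an earlier run stays in the file, so the output would need to be deleted first, which is one reason opening the output once outside the loop may be preferable.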

Trying to merge all text files in a folder and append file as well

I am trying to merge all text files in a folder. I have this part working, but when I try to write the file name before the contents of each text file, I get an error that reads: TypeError: a bytes-like object is required, not 'str'
The code below must be pretty close, but something is definitely off. Any thoughts what could be wrong?
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f)
            outfile.write(infile.read())
outfile.close
outfile.write(f) seems to be your problem, because you opened the file in binary mode with 'wb' and f is a str. You can convert it to bytes using encode. You also don't need to close outfile in your last line, since the with block does that for you (and outfile.close without parentheses doesn't call the function anyway). So something like this might work for you:
import glob
folder = 'C:\\my_path\\'
read_files = glob.glob(folder + "*.txt")
with open(folder + "final_result.txt", "wb") as outfile:
    for f in read_files:
        with open(f, "rb") as infile:
            outfile.write(f.encode('utf-8'))
            outfile.write(infile.read())
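As written, the file name and the contents run together with no separator, and on a second run the glob would also pick up final_result.txt itself. A possible variant is sketched below; the newline separators, the sorting and the exclusion of the output file are assumptions about what is wanted, and merge_with_headers is a hypothetical name.

```python
import glob
import os
import tempfile


def merge_with_headers(folder, output_name="final_result.txt"):
    """Write '<file name>\\n<contents>\\n' for every .txt file in folder."""
    out_path = os.path.join(folder, output_name)
    # Collect the inputs first and skip a leftover output file from an earlier run.
    inputs = sorted(
        p for p in glob.glob(os.path.join(folder, "*.txt"))
        if os.path.basename(p) != output_name
    )
    with open(out_path, "wb") as outfile:
        for path in inputs:
            outfile.write(os.path.basename(path).encode("utf-8") + b"\n")
            with open(path, "rb") as infile:
                outfile.write(infile.read())
            outfile.write(b"\n")
    return out_path


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in (("a.txt", b"alpha"), ("b.txt", b"beta")):
            with open(os.path.join(tmp, name), "wb") as f:
                f.write(body)
        with open(merge_with_headers(tmp), "rb") as f:
            return f.read()
```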

How to save created Zip file to file system in python?

Using python zipfile module I created a zip file as follows:
s = StringIO()
zip_file = zipfile.ZipFile(s, "w")
zip_file.write('/local/my_files/my_file.txt')
s.seek(0)
and now, I want this zip_file to be saved in my file system at path /local/my_files/ as my_file.zip.
Generally, to save normal files I use the following flow:
with open(dest_file, 'w') as out_file:
    for line in in_file:
        out_file.write(line)
But I don't think I can save a zip file this way. Can anyone please help me get this done?
zip_file = zipfile.ZipFile("/local/my_files/my_file.zip", "w")
zip_file.write('/local/my_files/my_file.txt')
zip_file.close()
The first argument of the ZipFile object initialization is the path to which you want to save the zip file.
If you need to use StringIO, just try this code:
from StringIO import StringIO  # Python 2 only; on Python 3 use io.BytesIO instead
import zipfile
s = StringIO()
with zipfile.ZipFile(s, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('/local/my_files/my_file.txt')
with open('/local/my_files/my_file.zip', 'wb') as f_out:
    f_out.write(s.getvalue())
Or you can do it in a simpler way:
import zipfile
with zipfile.ZipFile("/local/my_files/my_file.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("/local/my_files/my_file.txt")
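On Python 3 the StringIO module no longer exists, and a zip archive is binary data, so the in-memory variant would use io.BytesIO. A minimal sketch under that assumption: zip_to_file is a hypothetical helper name, and passing arcname is an addition here so that only the base name is stored rather than the full path.

```python
import io
import os
import tempfile
import zipfile


def zip_to_file(source_path, zip_path):
    """Build the archive in memory, then write the bytes to zip_path."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(source_path, arcname=os.path.basename(source_path))
    # The archive is only complete once the ZipFile has been closed.
    with open(zip_path, "wb") as f_out:
        f_out.write(buffer.getvalue())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "my_file.txt")
        with open(source, "w") as f:
            f.write("hello")
        target = os.path.join(tmp, "my_file.zip")
        zip_to_file(source, target)
        with zipfile.ZipFile(target) as zf:
            return zf.namelist(), zf.read("my_file.txt")
```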

Replace and overwrite instead of appending

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing, and then use file.truncate() if you want to do an in-place replacement:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file and then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()
with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
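To see why truncate() matters after seek(0), consider new content that is shorter than the old: without truncation the tail of the old content stays in the file. A small sketch on a temporary file; overwrite and demo are hypothetical names used only for illustration.

```python
import os
import tempfile


def overwrite(path, new_text, truncate):
    """Rewrite the file from the start, optionally cutting off the old tail."""
    with open(path, "r+") as f:
        f.read()
        f.seek(0)
        f.write(new_text)
        if truncate:
            f.truncate()  # drop everything after the current position
    with open(path) as f:
        return f.read()


def demo():
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.xml")
        for truncate in (False, True):
            with open(path, "w") as f:
                f.write("0123456789")
            results[truncate] = overwrite(path, "ab", truncate)
    return results
```

Without truncation the file ends up as ab23456789; with it, only ab remains.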
file = 'path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode and you will be able to replace its current text, saving the file with the new contents.
Using truncate(), the solution could be:
import re
# open the xml file for reading and writing:
with open('path/test.xml', 'r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
import os  # must import this library
if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
The answer from "How to Replace String in File" works in a simple way and uses str.replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
    fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
In my case, the following code did the trick:
import json
# "w+" creates the file if it does not exist and overwrites any existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)
Using python3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method, using a different approach to backups:
import re
from pathlib import Path
filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')  # /tmp/test.bak
filepath.rename(backup)  # the original path no longer exists after this
content = backup.read_text()  # so read the content from the backup
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
