I'm trying to use the Python GZIP module to simply uncompress several .gz files in a directory. Note that I do not want to read the files, only uncompress them. After searching this site for a while, I have this code segment, but it does not work:
import gzip
import glob
import os
import shutil
for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    #print file
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        inF = gzip.open(file, 'rb')
        s = inF.read()
        inF.close()
The .gz files are in the correct location, and I can print the full path + filename with the print command, but the gzip module doesn't seem to be doing anything. What am I missing?
If you get no error, the gzip module probably is being executed properly, and the file is already getting decompressed.
The precise definition of "decompressed" depends on context:
I do not want to read the files, only uncompress them
The gzip module doesn't work as a desktop archiving program like 7-zip - you can't "uncompress" a file without "reading" it. Note that "reading" (in programming) usually just means "storing (temporarily) in the computer RAM", not "opening the file in the GUI".
What you probably mean by "uncompress" (as in a desktop archiving program) is more precisely described (in programming) as "read an in-memory stream/buffer from a compressed file, and write it to a new file (and possibly delete the compressed file afterwards)".
inF = gzip.open(file, 'rb')
s = inF.read()
inF.close()
With these lines, you're just reading the stream. If you expect a new "uncompressed" file to be created, you just need to write the buffer to a new file:
with open(out_filename, 'wb') as out_file:
    out_file.write(s)
If you're dealing with very large files (larger than the amount of your RAM), you'll need to adopt a different approach. But that is the topic for another question.
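For the large-file case, one common approach is to copy the decompressed stream to disk in chunks instead of calling read() on the whole thing. A minimal sketch, assuming a hypothetical helper name gunzip_file and an arbitrary 1 MiB chunk size:

```python
import gzip
import shutil


def gunzip_file(gz_path, out_path, chunk_size=1024 * 1024):
    """Decompress gz_path into out_path without holding the whole file in RAM."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes at a time
        shutil.copyfileobj(src, dst, chunk_size)
```

Calling gunzip_file("data.csv.gz", "data.csv") would leave the original .gz in place and write the decompressed copy next to it; the file names here are only examples.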
You're decompressing the file into the s variable and then doing nothing with it. You should stop searching Stack Overflow and at least read the Python tutorial. Seriously.
Anyway, there are several things wrong with your code:
you need to STORE the unzipped data in s in some file.
there's no need to copy the actual *.gz files, because your code unpacks the original gzip file and not the copy.
you're using file, which is the name of a built-in in Python 2, as a variable. This is not an error, just bad practice.
This should probably do what you wanted:
import gzip
import glob
import os
import os.path
for gzip_path in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(gzip_path) == False:
        inF = gzip.open(gzip_path, 'rb')
        # uncompress the gzip_path INTO THE 's' variable
        s = inF.read()
        inF.close()
        # get gzip filename (without directories)
        gzip_fname = os.path.basename(gzip_path)
        # get original filename (remove 3 characters from the end: ".gz")
        fname = gzip_fname[:-3]
        uncompressed_path = os.path.join(FILE_DIR, fname)
        # store uncompressed file data from 's' variable (binary mode: 's' holds bytes)
        with open(uncompressed_path, 'wb') as out_file:
            out_file.write(s)
You should use with to open files and, of course, store the result of reading the compressed file. See gzip documentation:
import gzip
import glob
import os
import os.path
for gzip_path in glob.glob("%s/*.gz" % PATH_TO_FILE):
    if not os.path.isdir(gzip_path):
        with gzip.open(gzip_path, 'rb') as in_file:
            s = in_file.read()
        # Now store the uncompressed data
        path_to_store = gzip_path[:-3]  # remove the '.gz' from the filename
        # store uncompressed file data from 's' variable (binary mode: 's' holds bytes)
        with open(path_to_store, 'wb') as f:
            f.write(s)
Depending on what exactly you want to do, you might want to have a look at tarfile and its 'r:gz' option for opening files.
I was able to resolve this issue by using the subprocess module:
import glob
import os
import shutil
import subprocess
for file in glob.glob(PATH_TO_FILE + "/*.gz"):
    if os.path.isdir(file) == False:
        shutil.copy(file, FILE_DIR)
        # uncompress the file
        subprocess.call(["gunzip", FILE_DIR + "/" + os.path.basename(file)])
Since my goal was simply to uncompress the archive, the above code accomplishes this. The archived files are located in a central location, and are copied to a working area, uncompressed, and used in a test case. The gzip module was too complicated for what I was trying to accomplish.
Thanks for everyone's help. It is much appreciated!
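I think there is a much simpler solution than the others presented, given that the OP only wanted to extract all the files in a directory. As far as I know, archive_util only recognizes zip and tar archives, so this assumes the .gz files are gzipped tar archives: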
import glob
from setuptools import archive_util
for fn in glob.glob('*.gz'):
    archive_util.unpack_archive(fn, '.')
I have been trying to make a Python script to zip a file with the zipfile module. Although the text file is made into a zip file, it doesn't seem to be compressed: testtext.txt is 1024KB, while testtext.zip (the file my code creates) is also 1024KB. However, if I compress testtext.txt manually in File Explorer, the resulting zip file is compressed (to 2KB, specifically). How, if possible, can I fix this?
Below is the script that I have used to (unsuccessfully) zip a text file.
from zipfile import ZipFile
textFile = ZipFile("compressedtextstuff.zip", "w")
textFile.write("testtext.txt")
textFile.close()
Well that's odd. Python's zipfile defaults to the stored compression method, which does not compress! (Why would they do that?)
You need to specify a compression method. Use ZIP_DEFLATED, which is the most widely supported.
import zipfile
zip = zipfile.ZipFile("stuff.zip", "w", zipfile.ZIP_DEFLATED)
zip.write("test.txt")
zip.close()
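To check that the compression method actually took effect, the archive's own metadata can be compared before and after. A small sketch, assuming a hypothetical helper name zip_one_file; file_size and compress_size are attributes of zipfile.ZipInfo:

```python
import os
import zipfile


def zip_one_file(src_path, zip_path):
    """Zip src_path with DEFLATE and return (original_size, compressed_size)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(src_path, arcname=os.path.basename(src_path))
    # Re-open the archive and read the sizes recorded for the single member
    with zipfile.ZipFile(zip_path) as archive:
        info = archive.infolist()[0]
        return info.file_size, info.compress_size
```

With the default ZIP_STORED method the two numbers would be equal; with ZIP_DEFLATED a repetitive text file should come out much smaller.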
The documentation at https://docs.python.org/3/library/zipfile.html#zipfile-objects suggests this example:
with ZipFile('spam.zip', 'w') as myzip:
    myzip.write('eggs.txt')
So your code will be
from zipfile import ZipFile, ZIP_DEFLATED
with ZipFile('compressedtextstuff.zip', 'w', ZIP_DEFLATED) as myzip:
    myzip.write('testtext.txt')
https://docs.python.org/3/library/zipfile.html#:~:text=with%20ZipFile(%27spam.zip%27%2C%20%27w%27)%20as%20myzip%3A%0A%20%20%20%20myzip.write(%27eggs.txt%27)
In the docs they have it written with a with statement so I would try that first.
Edit:
I just came back to say that you have to specify your compression method but Mark beat me to the punch.
Here is a link to a StackOverflow post about it
https://stackoverflow.com/questions/4166447/python-zipfile-module-doesnt-seem-to-be-compressing-my-files#:~:text=This%20is%20because%20ZipFile%20requires,the%20method%20to%20be%20zipfile.
So I've had this system that scrapes and compresses files for a while now using bz2 compression. The way it does so is using the following block of code I found on SO a few months back:
Let's assume for the purposes of this post the filename is always file.XXXX where XXXX is the relevant extension. We start with .txt
### How to compress a text file
import bz2
filepath_compressed = "file.tar.bz2"
with open("file.txt", 'rb') as data:
    tarbz2contents = bz2.compress(data.read(), 9)
with bz2.BZ2File(filepath_compressed, 'wb') as f_comp:
    f_comp.write(tarbz2contents)
Now, to decompress it, I've always got it to work using a decompression software I have called Keka which decompresses the .tar.bz2 file to .tar, then I run it through Keka again to get an "extensionless" file which I then add a .txt to on my mac and then it works.
Now, to decompress programmatically, I've tried a few things. I've tried the stuff from this post and the code from this post. I've tried using BZ2Decompressor and BZ2File and everything. I just seem to be missing something and I'm not sure what it is.
Here is what I have so far, and I'd like to know what is wrong with this code:
import bz2, tarfile, shutil
# Decompress to tar
with bz2.BZ2File("file.tar.bz2") as fr, open("file.tar", "wb") as fw:
    shutil.copyfileobj(fr, fw)
# Decompress from tar to txt
with tarfile.open("file.tar", "r:") as tar:
    tar.extractall("file_out.txt")
This code crashes because of a "tarfile.ReadError: truncated header" problem. I think the first context manager outputs a binary text file, and I tried decoding that but that failed too. What am I missing here? I feel like a noob.
If you would like a minimum runnable piece of code to replicate this, add the following to make a dummy file:
lines = ["Line 1","Line 2", "Line 3"]
with open("file.txt", "w") as f:
    for line in lines:
        f.write(line+"\n")
The thing that you're making is not a .tar.bz2 file, but rather a .bz2.bz2 file. You are compressing twice with bzip2 (the second time with no effect), and there is no tar file generation anywhere to be seen.
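To make that concrete, here is a sketch of both directions: producing a real .tar.bz2 with the tarfile module, and recovering the text from a file that was already written by the doubly compressing code in the question. The function names are hypothetical; the "w:bz2" and "r:bz2" modes are the standard tarfile ones.

```python
import bz2
import os
import tarfile


def make_tar_bz2(src_path, archive_path):
    """Pack src_path into a bzip2-compressed tar archive."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        tar.add(src_path, arcname=os.path.basename(src_path))


def extract_tar_bz2(archive_path, dest_dir):
    """Extract every member of a .tar.bz2 archive into dest_dir (trusted archives only)."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest_dir)


def recover_double_bz2(path):
    """Return the original bytes from a file that was bzip2-compressed twice."""
    with bz2.open(path, "rb") as f:
        inner = f.read()          # first layer removed, still bzip2 data
    return bz2.decompress(inner)  # second layer removed
```

For the existing files, recover_double_bz2("file.tar.bz2") should return the original text as bytes, since there is no tar layer to unpack.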
I am trying to gzip files using python 3. When I gzip the files, the code is changing the filename without me doing anything. I am not sure I totally understand the working of gzip module.
Below is the code:
import gzip
dir_in = '/localfolder/new_files/'
dir_out = '/localfolder/zippedfiles/'
file_name = 'transactions_may05'
def gzip_files(dir_in, dir_out, file_name):
    with open(dir_in + file_name, 'rb') as f_in, gzip.open(dir_out + 'unprocessed.' + file_name + '.gz', 'wb') as f_out:
        f_out.writelines(f_in)
Expected Output:
Outer file: unprocessed.transactions_may05.gz
when I double click it, I should get the original file transactions_may05
Current Output:
Outer file: unprocessed.transactions_may05.gz -- As expected
When I double-click it, the internal file also has unprocessed. prepended to its name. I am not sure why unprocessed. gets added to the internal file name.
Internal File:unprocessed.transactions_may05
Any help would be appreciated. Thank you.
That's the expected behavior of gzip and gunzip.
As mentioned in the manual page:
gunzip takes a list of files on its command line and replaces each
file whose name ends with .gz, -gz, .z, -z, or _z (ignoring case) and
which begins with the correct magic number with an uncompressed
file without the original extension.
If you don't want the name to change, you should not modify the filename when you compress it.
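A gzip file can also carry an original file name in its header, and Python's gzip.GzipFile lets you set it through the filename argument when a fileobj is passed. Whether an extraction tool honors that stored name or just strips the extension from the outer name depends on the tool (gunzip, for instance, only uses it with -N), so treat this as a sketch that may or may not change what you see on double-click. The helper name gzip_with_stored_name is hypothetical.

```python
import gzip
import shutil


def gzip_with_stored_name(src_path, dest_path, stored_name):
    """Compress src_path to dest_path, recording stored_name in the gzip header."""
    with open(src_path, "rb") as f_in, open(dest_path, "wb") as raw_out:
        # With fileobj given, filename is only used for the header field
        with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=raw_out) as f_out:
            shutil.copyfileobj(f_in, f_out)
```

For the case in the question this would be called with dest_path ending in unprocessed.transactions_may05.gz and stored_name set to transactions_may05.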
I have been searching for a solution for this and haven't been able to find one. I have a directory of folders which contain multiple, very-large csv files. I'm looping through each csv in each folder in the directory to replace values of certain headers. I need the headers to be consistent (from file to file) in order to run a different script to process all the data properly.
I found this solution that I thought would work: change first line of a file in python.
However this is not working as expected. My code:
from_file = open(filepath)
# for line in f:
# if
data = from_file.readline()
# print(data)
# with open(filepath, "w") as f:
print 'DBG: replacing in file', filepath
# s = s.replace(search_pattern, replacement)
for i in range(len(search_pattern)):
    data = re.sub(search_pattern[i], replacement[i], data)
# data = re.sub(search_pattern, replacement, data)
to_file = open(filepath, mode="w")
to_file.write(data)
shutil.copyfileobj(from_file, to_file)
I want to replace the header values in search_pattern with values in replacement without saving or writing to a different file - I want to modify the file. I have also tried
shutil.copyfileobj(from_file, to_file, -1)
As I understand it that should copy the whole file rather than breaking it up in chunks, but it doesn't seem to have an effect on my output. Is it possible that the csv is just too big?
I haven't been able to determine a different way to do this or make this way work. Any help would be greatly appreciated!
The answer you copied from "change first line of a file in python" doesn't work on Windows.
On Linux, you can open a file for reading & writing at the same time. The system ensures that there's no conflict, but behind the scenes, 2 different file objects are being handled. And this method is very unsafe: if the program crashes while reading/writing (power off, disk full)... the file has a great chance to be truncated/corrupt.
Anyway, in Windows, you cannot open a file for reading and writing at the same time using 2 handles. It just destroys the contents of the file.
So there are 2 options, which are portable and safe:
create a new file in the same directory; once the contents are copied, delete the original file and rename the new one
Like this:
import os
import shutil
filepath = "test.txt"
with open(filepath) as from_file, open(filepath+".new","w") as to_file:
    data = from_file.readline()
    to_file.write("something else\n")
    shutil.copyfileobj(from_file, to_file)
os.remove(filepath)
os.rename(filepath+".new",filepath)
This doesn't take much longer, because the rename operation is instantaneous. Besides, if the program/computer crashes at any point, one of the files (old or new) is valid, so it's safe.
if patterns have the same length, use read/write mode
like this:
filepath = "test.txt"
with open(filepath,"r+") as rw_file:
    data = rw_file.readline()
    data = "h"*(len(data)-1) + "\n"
    rw_file.seek(0)
    rw_file.write(data)
Here we read the first line, replace it with the same number of h characters, rewind the file, and write the first line back, overwriting the previous contents while keeping the rest of the lines. This is also safe, and even if the file is huge, it's very fast. The only constraint is that the replacement must be exactly the same size as the original (otherwise you would leave remainders of the previous data, or overwrite the next line(s), since no data is shifted).
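Applied to the question's header renaming, the first option (write a new file, then swap it in) might look like the sketch below. The function name replace_header is hypothetical, the patterns are treated as regular expressions as in the question's re.sub calls, and os.replace (Python 3.3+) is swapped in for the separate remove and rename steps.

```python
import os
import re
import shutil


def replace_header(filepath, search_patterns, replacements):
    """Rewrite only the first line of filepath, streaming the rest unchanged."""
    tmp_path = filepath + ".new"
    # newline="" keeps the file's original line endings untouched
    with open(filepath, newline="") as src, open(tmp_path, "w", newline="") as dst:
        header = src.readline()
        for pattern, repl in zip(search_patterns, replacements):
            header = re.sub(pattern, repl, header)
        dst.write(header)
        # The remaining rows are copied in chunks, so file size is not a problem
        shutil.copyfileobj(src, dst)
    # Swap the new file into place, overwriting the original
    os.replace(tmp_path, filepath)
```

Because only the header line is ever held in memory, this should behave the same for very large csv files as for small ones.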
I am trying to decompress some MMS messages sent to me zipped. The problem is that sometimes it works, and other times it doesn't. When it doesn't work, the Python zipfile module complains and says that it is a bad zip file. But the zip file decompresses fine using the Unix unzip command.
This is what I've got:
zippedfile = open('%stemp/tempfile.zip' % settings.MEDIA_ROOT, 'w+')
zippedfile.write(string)
z = zipfile.ZipFile(zippedfile)
I am using 'w+' and writing a string to it, the string contains a base64 decoded string representation of a zip file.
Then I do like this:
filelist = z.infolist()
images = []
for f in filelist:
    raw_mimetype = mimetypes.guess_type(f.filename)[0]
    if raw_mimetype:
        mimetype = raw_mimetype.split('/')[0]
    else:
        mimetype = 'unknown'
    if mimetype == 'image':
        images.append(f.filename)
This way I've got a list of all the images in the zip file. But this doesn't always work, since the zipfile module complains about some of the files.
Is there a way to do this without using the zipfile module?
Could I somehow use the Unix unzip command instead of zipfile and then do the same thing to retrieve all the images from the archive?
You should very probably open the file in binary mode, when writing zipped data into it. That is, you should use
zippedfile = open('%stemp/tempfile.zip' % settings.MEDIA_ROOT, 'wb+')
You might have to close and reopen the file, or maybe seek to the start of the file after writing it.
filename = '%stemp/tempfile.zip' % settings.MEDIA_ROOT
zippedfile = open(filename, 'wb+')
zippedfile.write(string)
zippedfile.close()
z = zipfile.ZipFile(filename,"r")
You say the string is base64 decoded, but you haven't shown any code that decodes it - are you sure it's not still encoded?
data = string.decode('base64')
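In Python 3, str has no decode('base64'), and the temporary file can be skipped altogether by handing the decoded bytes to zipfile through io.BytesIO. A sketch under those assumptions, with a hypothetical function name and the same MIME-type filtering as in the question:

```python
import base64
import io
import mimetypes
import zipfile


def list_images_in_b64_zip(b64_payload):
    """Return names of archive members whose guessed MIME type is image/*."""
    raw = base64.b64decode(b64_payload)  # bytes of the actual zip file
    images = []
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for info in archive.infolist():
            guessed = mimetypes.guess_type(info.filename)[0]
            if guessed and guessed.split("/")[0] == "image":
                images.append(info.filename)
    return images
```

Working on bytes in memory avoids both the text-mode corruption and the need to seek or reopen the file before reading it back.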