Unzipping multiple .gz files into a single text file using Python

I am trying to unzip multiple files with the .gz extension into a single .txt file. All of these files contain JSON data.
I tried the following code:
from glob import glob
import gzip
import shutil

for fname in glob('.../2020-04/*gz'):
    with gzip.open(fname, 'rb') as f_in:
        with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
But the decompressed.txt file only has the last .gz file's data.

Just move the f_out open to the outside, so you open it once before iterating over the input files and keep that one handle open.
from glob import glob
import gzip
import shutil

with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
    for fname in glob('.../2020-04/*gz'):
        with gzip.open(fname, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)

Alternatively, use "ab" mode instead: a opens the file in append mode, while w alone erases the file's contents upon opening.
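With append mode, the original loop shape works without restructuring. A minimal, self-contained sketch; the part*.gz names and JSON payloads here are made-up stand-ins for the question's files:

```python
import gzip
import shutil
from glob import glob

# Hypothetical setup: create two gzipped JSON fragments to concatenate.
for i, payload in enumerate([b'{"id": 1}\n', b'{"id": 2}\n']):
    with gzip.open('part%d.gz' % i, 'wb') as f:
        f.write(payload)

# "ab" appends on each iteration instead of truncating, so every
# decompressed file accumulates in decompressed.txt.
for fname in sorted(glob('part*.gz')):
    with gzip.open(fname, 'rb') as f_in:
        with open('decompressed.txt', 'ab') as f_out:
            shutil.copyfileobj(f_in, f_out)
```

One caveat with append mode: if decompressed.txt already exists from a previous run, its old contents are kept, so you may want to delete it first.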

Related

how to gzip files using writelines?

I have a few files in my tmp folder that I want to gzip individually and upload to S3. The testList contains paths like /tmp/files/File1, so fileName2, which I use for gzip.open(), is /tmp/files/File1.gz. I want to gzip each file in the testList.
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open("path/to/file", 'rb') as orig_file:
        with gzip.open(fileName2, 'wb') as zipped_file:
            zipped_file.writelines(orig_file)
            bucket.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
When I download the files from S3, they have a gz file type but I am unable to open them locally. It throws an error that the .gz file is empty and cannot be expanded. I believe the way I am writing content is incorrect.
How can I fix this?
Edit:
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open(i, 'rb') as f_in:
        with gzip.open(fileName2, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
            f_out.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
Even with this, the gzip files are still not expandable.
You are getting an open file in orig_file, not just lines.
I think your use case is about turning an existing file into a compressed one. So the following should be the relevant paragraph from the Examples of usage section of the documentation:
Example of how to GZIP compress an existing file:
import gzip
import shutil

with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
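Putting it together for the S3 use case: compress each file completely, so the gzip stream is flushed and closed, and only then upload the finished .gz file. A minimal sketch; the testList path and the bucket object are assumptions taken from the question, so the upload line is commented out and the compression part stands alone:

```python
import gzip
import os
import shutil

# Hypothetical stand-ins for the question's setup.
os.makedirs('/tmp/files', exist_ok=True)
testList = ['/tmp/files/File1']
with open(testList[0], 'wb') as f:
    f.write(b'hello s3\n')

for i in testList:
    fileName = os.path.basename(i) + '.gz'
    fileName2 = i + '.gz'
    # Compress fully; the with block closes the gzip stream on exit.
    with open(i, 'rb') as f_in:
        with gzip.open(fileName2, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    # Now the .gz file on disk is complete and valid:
    # bucket.upload_file(fileName2, fileName,
    #                    ExtraArgs={'ContentType': 'application/gzip'})
```

The key difference from the question's code is that the upload happens after the with block has closed the gzip stream; uploading the still-open zipped_file handle is why the uploaded archives came out empty.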

Python "gzip" module acting weirdly if compressed extension is not ".gz"

I need to compress a file using the gzip module, but the output file extension may not be .gz.
Look at this simple code:
import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
It works fine. But if I replace ".gz" with ".gzip" for example, then I am not able to open the compressed file correctly:
I tried with 7-Zip and WinRAR; the result is the same, and the bug persists even if I rename the file.
Does anyone know where the problem comes from, please?
I tried with compression bz2 and lzma, they seem to work properly no matter what the extension is.
You actually have two versions of the file created this way:
First, .gz file:
with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Second, .gzip file:
with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Both create a GZIP with your test.txt in it. The only difference is that in the second case, test.txt is renamed to test.txt.gzip.
The problem is that the argument to gzip.open actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).
So, if you do gzip.open("abcd", 'wb') and write to it, it will create gzip archive named abcd with a file named abcd inside.
But then comes the magic: if the filename ends with .gz, it behaves differently, e.g. gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.
So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.
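The stored name lives in the gzip FNAME header field, which you can inspect directly to confirm this behavior. A small sketch; the header layout follows RFC 1952, and the file names here are arbitrary:

```python
import gzip

def stored_name(path):
    """Return the FNAME header field of a gzip file, or None if absent."""
    with open(path, 'rb') as f:
        # Fixed header: magic(2) method(1) flags(1) mtime(4) xfl(1) os(1)
        header = f.read(10)
        assert header[:2] == b'\x1f\x8b'
        flags = header[3]
        if not flags & 0x08:              # FNAME flag not set
            return None
        name = bytearray()
        while (b := f.read(1)) != b'\x00':  # FNAME is NUL-terminated
            name += b
        return name.decode('latin-1')

with gzip.open('bla.gz', 'wb') as f:      # trailing .gz is stripped in the header
    f.write(b'data')
with gzip.open('bla.gzip', 'wb') as f:    # .gzip is kept as-is
    f.write(b'data')

print(stored_name('bla.gz'), stored_name('bla.gzip'))
```

This is what archive tools like 7-Zip display as the "file inside" the archive, which is why the two extensions behave differently.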
The filename inside the archive can be controlled by using the gzip.GzipFile constructor instead of gzip.open. gzip.GzipFile then needs a separate open() call before it.
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        ...
        f_out.flush()
Note also the added f_out.flush(): in my experience, without this line GzipFile may in some cases not flush the data before the file is closed, resulting in a corrupt archive.
Or as a complete example:
import gzip
import shutil

input_path = "test.txt"
output_path = input_path + ".gz"

with open(input_path, 'w') as file:
    file.write("abc" * 10)

with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
        f_out.flush()

How to combine several text files into one file?

I want to combine several text files into one output file.
My original code downloads 100 text files, then for each one filters out several words and writes the remaining text to the output file.
Here is the part of my code that is supposed to combine the new text with the output text. Each time, the result overwrites the output file, deleting the previous content and adding the new text.
import fileinput
import glob

urls = ['f1.txt', 'f2.txt', 'f3.txt']
N = 0
print "read files"
for url in urls:
    read_files = glob.glob(urls[N])
    with open("result.txt", "wb") as outfile:
        for f in read_files:
            with open(f, "rb") as infile:
                outfile.write(infile.read())
    N += 1
and I tried this also
import fileinput
import glob

urls = ['f1.txt', 'f2.txt', 'f3.txt']
N = 0
print "read files"
for url in urls:
    file_list = glob.glob(urls[N])
    with open('result-1.txt', 'w') as file:
        input_lines = fileinput.input(file_list)
        file.writelines(input_lines)
    N += 1
Any suggestions?
I need to concatenate/combine approximately 100 text files into one .txt file in a sequential manner (each time I read one file and append it to result.txt).
The problem is that you are re-opening the output file on each loop iteration, which causes it to be truncated -- unless you explicitly open it in append mode.
The glob logic is also unnecessary when you already know the filename.
Try this instead:
with open("result.txt", "wb") as outfile:
    for url in urls:
        with open(url, "rb") as infile:
            outfile.write(infile.read())
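If some of the ~100 files are large, reading each one fully into memory with infile.read() can be avoided by streaming in chunks with shutil.copyfileobj. A minimal sketch in Python 3; the f*.txt names and contents are placeholders for the downloaded files:

```python
import shutil

# Placeholder input files standing in for the downloaded text files.
urls = ['f1.txt', 'f2.txt', 'f3.txt']
for i, url in enumerate(urls):
    with open(url, 'w') as f:
        f.write('line from file %d\n' % (i + 1))

# Open the output once; copyfileobj streams each input in fixed-size
# chunks instead of loading whole files into memory.
with open('result.txt', 'wb') as outfile:
    for url in urls:
        with open(url, 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
```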

How to save created Zip file to file system in python?

Using python zipfile module I created a zip file as follows:
s = StringIO()
zip_file = zipfile.ZipFile(s, "w")
zip_file.write('/local/my_files/my_file.txt')
s.seek(0)
Now I want this zip_file to be saved in my file system at /local/my_files/ as my_file.zip.
Generally, to save normal files I use the following flow:
with open(dest_file, 'w') as out_file:
    for line in in_file:
        out_file.write(line)
But I don't think I can save a zipfile this way. Can anyone help me get this done?
zip_file = zipfile.ZipFile("/local/my_files/my_file.zip", "w")
zip_file.write('/local/my_files/my_file.txt')
zip_file.close()
The first argument of the ZipFile object initialization is the path to which you want to save the zip file.
If you need to use StringIO, just try this code:
from StringIO import StringIO
import zipfile

s = StringIO()
with zipfile.ZipFile(s, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('/local/my_files/my_file.txt')
with open('/local/my_files/my_file.zip', 'wb') as f_out:
    f_out.write(s.getvalue())
Or you can do it in a simpler way:
import zipfile

with zipfile.ZipFile("/local/my_files/my_file.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("/local/my_files/my_file.txt")
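On Python 3, the StringIO module is gone, and the in-memory variant needs io.BytesIO anyway, since zip data is binary. A sketch of the same flow with a hypothetical member file in the current directory:

```python
import io
import zipfile

# Hypothetical file to archive.
with open('my_file.txt', 'w') as f:
    f.write('hello zip\n')

buf = io.BytesIO()  # binary in-memory buffer (Python 3 replacement for StringIO here)
with zipfile.ZipFile(buf, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('my_file.txt')

# Persist the buffer to disk as a normal binary file.
with open('my_file.zip', 'wb') as f_out:
    f_out.write(buf.getvalue())
```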

How to create a csv file in Python, and export (put) it to some local directory

This problem may be tricky.
I want to create a csv file from a list in Python. The csv file does not exist beforehand, and there is no such file in the local directory either; I just want to create a new csv file and put it in some local directory.
I found that StringIO.StringIO can generate the csv content from a list in Python, but what are the next steps?
Thank you.
And I found the following code can do it:
import os
import os.path
import StringIO
import csv

dir = r"C:\Python27"
if not os.path.exists(dir):
    os.mkdir(dir)
my_list = [[1, 2, 3], [4, 5, 6]]
with open(os.path.join(dir, "filename" + '.csv'), "w") as f:
    csvfile = StringIO.StringIO()
    csvwriter = csv.writer(csvfile)
    for l in my_list:
        csvwriter.writerow(l)
    for a in csvfile.getvalue():
        f.writelines(a)
Did you read the docs?
https://docs.python.org/2/library/csv.html
Lots of examples on that page of how to read / write CSV files.
One of them:
import csv
with open('some.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
import csv
with open('/path/to/location', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(youriterable)
https://docs.python.org/2/library/csv.html#examples
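A concrete, runnable version of the same pattern using the question's sample list, with no StringIO detour. Note this sketch is Python 3, where csv targets are opened in text mode with newline='' instead of 'wb'; the output file name is arbitrary:

```python
import csv

my_list = [[1, 2, 3], [4, 5, 6]]

# In Python 3, open csv output in text mode with newline='' so the
# csv module controls line endings itself.
with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(my_list)  # writes every row; no StringIO intermediary needed
```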
