How to save a created zip file to the file system in Python?

Using python zipfile module I created a zip file as follows:
s = StringIO()
zip_file = zipfile.ZipFile(s, "w")
zip_file.write('/local/my_files/my_file.txt')
s.seek(0)
and now, I want this zip_file to be saved in my file system at path /local/my_files/ as my_file.zip.
Generally, to save a normal file I use the following flow:
with open(dest_file, 'w') as out_file:
    for line in in_file:
        out_file.write(line)
But I don't think I can save a zip file this way. Can anyone please help me get this done?

zip_file = zipfile.ZipFile("/local/my_files/my_file.zip", "w")
zip_file.write('/local/my_files/my_file.txt')
zip_file.close()
The first argument of the ZipFile object initialization is the path to which you want to save the zip file.

If you need to use StringIO, just try this code:
from StringIO import StringIO  # Python 2 only; on Python 3 use io.BytesIO instead
import zipfile
s = StringIO()
with zipfile.ZipFile(s, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('/local/my_files/my_file.txt')
with open('/local/my_files/my_file.zip', 'wb') as f_out:
    f_out.write(s.getvalue())
Or you can do it in a simpler way:
import zipfile
with zipfile.ZipFile("/local/my_files/my_file.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write("/local/my_files/my_file.txt")
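On Python 3, a zip archive is binary data, so the in-memory variant is usually written with io.BytesIO rather than StringIO. A minimal sketch, assuming a hypothetical helper name and demo files in a temporary directory rather than the /local/my_files paths used above:

```python
import io
import os
import tempfile
import zipfile

def zip_to_disk_via_memory(src_path, zip_path):
    """Build a zip archive in memory, then save it to zip_path."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname stores only the base name instead of the full source path
        zf.write(src_path, arcname=os.path.basename(src_path))
    # The archive is complete only after the ZipFile has been closed
    with open(zip_path, "wb") as f_out:
        f_out.write(buffer.getvalue())

# Hypothetical demo files in a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "my_file.txt")
with open(src, "wb") as f:
    f.write(b"hello zip\n")
dest = os.path.join(workdir, "my_file.zip")
zip_to_disk_via_memory(src, dest)

# Reopen the saved archive to check what it contains
with zipfile.ZipFile(dest) as zf:
    names = zf.namelist()
    payload = zf.read("my_file.txt")
```

If the archive does not need to exist in memory at all, passing the destination path straight to ZipFile, as in the simpler version above, avoids the extra copy.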

Related

how to gzip files using writelines?

I have a few files in my tmp folder that I want to gzip individually and upload to S3. testList contains paths like /tmp/files/File1, so fileName2, which I pass to gzip.open(), is /tmp/files/File1.gz. I want to gzip each file in testList.
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open("path/to/file", 'rb') as orig_file:
        with gzip.open(fileName2, 'wb') as zipped_file:
            zipped_file.writelines(orig_file)
            bucket.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
When I download the files from S3, they have a gz file type but I am unable to open them locally. It throws an error that the .gz file is empty and cannot be expanded. I believe the way I am writing content is incorrect.
How can I fix this?
Edit:
for i in testList:
    fileName = i.replace("/tmp/files/", "")
    fileName2 = i + '.gz'
    with open(i, 'rb') as f_in:
        with gzip.open(fileName2, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
            f_out.upload_fileobj(zipped_file, fileName, ExtraArgs={'ContentType': "application/gzip"})
Even with this, the gzip files still cannot be expanded.
You are getting an open file in orig_file, not just lines.
I think your use case is about turning an existing file into a compressed one. So the following should be the relevant paragraph from the Examples of usage section of the documentation:
Example of how to GZIP compress an existing file:
import gzip
import shutil
with open('/home/joe/file.txt', 'rb') as f_in:
    with gzip.open('/home/joe/file.txt.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
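Applied to the loop from the question, the important detail is that the .gz file should be closed before it is uploaded, and the upload needs a handle opened for reading. A hedged sketch: upload is a hypothetical callback standing in for something like bucket.upload_fileobj, and the helper name and demo paths are assumptions.

```python
import gzip
import os
import shutil
import tempfile

def gzip_and_upload(paths, upload):
    """Compress each path to path + '.gz', then hand the result to upload()."""
    for path in paths:
        gz_path = path + ".gz"
        with open(path, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        # Both files are closed here, so the gzip trailer has been written.
        with open(gz_path, "rb") as gz_file:
            upload(gz_file, os.path.basename(gz_path))

# Demo with an in-memory stand-in for the real uploader (an assumption)
uploaded = {}

def fake_upload(fileobj, key):
    uploaded[key] = fileobj.read()

workdir = tempfile.mkdtemp()
sample = os.path.join(workdir, "File1")
with open(sample, "wb") as f:
    f.write(b"some data\n" * 3)

gzip_and_upload([sample], fake_upload)
restored = gzip.decompress(uploaded["File1.gz"])
```

Reopening the finished file in 'rb' mode keeps the compression step and the upload step independent, at the cost of one extra open call per file.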

Unzipping multiple .gz files into single text file using Python

I am trying to unzip multiple .gz files into a single .txt file. All of these files contain JSON data.
I tried the following code:
from glob import glob
import gzip
import shutil
for fname in glob('.../2020-04/*gz'):
    with gzip.open(fname, 'rb') as f_in:
        with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
But the decompressed.txt file only has the last .gz file's data.
Just shuffle f_out to the outside, so you open it before iterating over the input files and keep that one handle open.
from glob import glob
import gzip
import shutil
with open('.../datafiles/202004_twitter/decompressed.txt', 'wb') as f_out:
    for fname in glob('.../2020-04/*gz'):
        with gzip.open(fname, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
Use "ab" mode instead. a opens the file in append mode, whereas w alone truncates the file every time it is opened.
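A self-contained sketch of the open-once approach, assuming a hypothetical helper name and a few small .gz files created in a temporary directory for the demo:

```python
import glob
import gzip
import os
import shutil
import tempfile

def concat_gz(pattern, out_path):
    """Decompress every file matching pattern into one output file."""
    with open(out_path, "wb") as f_out:           # opened once, before the loop
        for fname in sorted(glob.glob(pattern)):  # sorted for a stable order
            with gzip.open(fname, "rb") as f_in:
                shutil.copyfileobj(f_in, f_out)

# Create three small gzip files holding one JSON line each
workdir = tempfile.mkdtemp()
for i in range(3):
    with gzip.open(os.path.join(workdir, "part%d.gz" % i), "wb") as f:
        f.write(('{"id": %d}\n' % i).encode("utf-8"))

out_path = os.path.join(workdir, "decompressed.txt")
concat_gz(os.path.join(workdir, "*.gz"), out_path)

with open(out_path, "rb") as f:
    combined = f.read()
```

Opening the output once in 'wb' mode also means a rerun starts from an empty file, whereas append mode inside the loop would keep growing the output across runs.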

Python "gzip" module acting weirdly if compressed extension is not ".gz"

I need to compress a file using the gzip module, but the output file extension may not be .gz.
Look at this simple code:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with gzip.open(output_path, 'wb') as f_out:
    with open(input_path, 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
It works fine. But if I replace ".gz" with ".gzip" for example, then I am not able to open the compressed file correctly:
I tried with 7-Zip and WinRar, the result is the same, and the bug persists even if I rename the file.
Does anyone know where the problem comes from, please?
I tried with compression bz2 and lzma, they seem to work properly no matter what the extension is.
You actually get two different files this way:
First, the .gz file:
with gzip.open("test.txt.gz", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Second, the .gzip file:
with gzip.open("test.txt.gzip", 'wb') as f_out:
    with open("test.txt", 'rb') as f_in:
        shutil.copyfileobj(f_in, f_out)
Both create a GZIP with your test.txt in it. The only difference is that in the second case, test.txt is renamed to test.txt.gzip.
The problem is that the argument to gzip.open actually has two purposes: the filename of the gzip archive and the filename of the file inside (bad design, imho).
So, if you do gzip.open("abcd", 'wb') and write to it, it will create gzip archive named abcd with a file named abcd inside.
But then comes the magic: if the filename ends with .gz, it behaves differently. For example, gzip.open("bla.gz", 'wb') creates a gzip archive named bla.gz with a file named bla inside.
So, with .gz you activated the (undocumented, as far as I can see!) magic, whereas with .gzip you did not.
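The stored name can be inspected directly, because the gzip format keeps it as an optional zero-terminated FNAME field after the fixed 10-byte header. The following sketch uses a hypothetical helper, stored_name, and writes its demo files to a temporary directory; it only illustrates the behaviour described above.

```python
import gzip
import os
import tempfile

def stored_name(gz_path):
    """Return the original-name (FNAME) field of a gzip header, or None."""
    with open(gz_path, "rb") as f:
        header = f.read(10)              # fixed 10-byte gzip header
        flags = header[3]
        if flags & 0x04:                 # FEXTRA set: skip the extra field
            extra_len = int.from_bytes(f.read(2), "little")
            f.read(extra_len)
        if not flags & 0x08:             # FNAME bit not set
            return None
        name = bytearray()
        while True:
            byte = f.read(1)
            if byte in (b"", b"\x00"):   # the name is zero-terminated
                break
            name += byte
        return name.decode("latin-1")

# Write the same data with both extensions and compare the stored names
workdir = tempfile.mkdtemp()
names = {}
for ext in (".gz", ".gzip"):
    path = os.path.join(workdir, "test.txt" + ext)
    with gzip.open(path, "wb") as f_out:
        f_out.write(b"abc" * 10)
    names[ext] = stored_name(path)
```

Comparing names[".gz"] with names[".gzip"] shows whether the extension was stripped from the name recorded inside the archive.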
The filename inside the archive can be controlled by using the gzip.GzipFile constructor instead of the gzip.open function. gzip.GzipFile then needs a separate open call before it.
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        ...
        f_out.flush()
Note also the added f_out.flush(). In my experience, without this line GzipFile may in some cases fail to flush the data before the file is closed, resulting in a corrupt archive.
Or as a complete example:
import gzip
import shutil
input_path = "test.txt"
output_path = input_path + ".gz"
with open(input_path, 'w') as file:
    file.write("abc" * 10)
with open(output_path, 'wb') as f_out_gz:
    with gzip.GzipFile(fileobj=f_out_gz, filename=input_path, mode='wb') as f_out:
        with open(input_path, 'rb') as f_in:
            shutil.copyfileobj(f_in, f_out)
        f_out.flush()

IOError when downloading and decompressing gzip file

I'm trying to download and decompress a gzip file and then convert the decompressed file, which is in TSV format, into CSV format, which would be easier to parse. I am trying to gather the data from the "Download Table" link in this URL. My code is as follows, using the same idea as in this post. However, I get the error IOError: [Errno 2] No such file or directory: 'file=data/irt_euryld_d.tsv' on the line with open(outFilePath, 'w') as outfile:
import os
import urllib2
import gzip
import StringIO
baseURL = "http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?"
filename = "D:\Sidney\irt_euryld_d.tsv.gz" #Edited after heinst's comment below
outFilePath = filename[:-3]
response = urllib2.urlopen(baseURL + filename)
compressedFile = StringIO.StringIO()
compressedFile.write(response.read())
compressedFile.seek(0)
decompressedFile = gzip.GzipFile(fileobj=compressedFile, mode='rb')
with open(outFilePath, 'w') as outfile:
    outfile.write(decompressedFile.read())
#Now have to deal with tsv file
import csv
with open(outFilePath,'rb') as tsvin, open('ECB.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout) #Converting output into CSV Format
Thank You
The path you set filename to is not a valid path to write a file to. Change filename = "data/irt_euryld_d.tsv.gz" to a valid path where you want the irt_euryld_d.tsv.gz file to live. For example, to put the file on my desktop I would set filename = "/Users/heinst/Desktop/data/irt_euryld_d.tsv.gz". Since this is a valid path, Python will no longer raise the No such file or directory error.
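The question's code also uses the same filename string both as the URL query and as the local output path, which is why the error message contains 'file=data/...'. A sketch for Python 3 that keeps the local path separate and creates its directory first; the download step is left out, and gz_bytes, the helper name, and the output location are assumptions for illustration:

```python
import csv
import gzip
import io
import os
import tempfile

def gz_tsv_to_csv(gz_bytes, csv_path):
    """Decompress gzipped TSV bytes and write them out as a CSV file."""
    # Create the target directory first to avoid "No such file or directory"
    parent = os.path.dirname(csv_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    with open(csv_path, "w", newline="") as csv_out:
        csv.writer(csv_out).writerows(reader)

# Stand-in for the downloaded payload (response.read() in the question)
gz_bytes = gzip.compress(b"geo\t2015\nDE\t0.5\n")
csv_path = os.path.join(tempfile.mkdtemp(), "data", "irt_euryld_d.csv")
gz_tsv_to_csv(gz_bytes, csv_path)

# Read the CSV back to check the conversion
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))
```

Working on the bytes in memory avoids writing the intermediate .tsv file; for very large downloads, streaming through gzip.GzipFile as in the question would use less memory.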

Replace and overwrite instead of appending

I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do to delete the old content and keep only the new?
You need to seek to the beginning of the file before writing, and then call file.truncate() if you want to replace the content in place:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()
with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really related to Python. The interpreter calls the corresponding low level API. The method truncate() works the same in the C programming language: See http://man7.org/linux/man-pages/man2/truncate.2.html
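The effect of truncate() is easiest to see when the new content is shorter than the old. A minimal sketch, assuming a hypothetical helper name and a throwaway temporary file with a simplified pattern:

```python
import os
import re
import tempfile

def replace_in_place(path, pattern, repl):
    """Rewrite path in place with re.sub applied to its whole content."""
    with open(path, "r+") as f:
        data = f.read()
        f.seek(0)        # go back to the start before writing
        f.write(re.sub(pattern, repl, data))
        f.truncate()     # cut off leftovers of the old, longer content

# Create a small demo file
fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
with open(path, "w") as f:
    f.write("<string>ABC</string>\n<string>some long value</string>\n")

# The replacement is shorter than the original, so truncate() matters here
replace_in_place(path, r"<string>(.*)</string>", r"<x>\1</x>")

with open(path) as f:
    result = f.read()
```

Without the truncate() call, the tail of the old content would remain after the shorter new content.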
file = 'path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode. This discards its current text, so the file is saved with only the new contents.
Using truncate(), the solution could be
import re
# open the xml file for reading and writing:
with open('path/test.xml','r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
    f.truncate()
import os  # must import this module
if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # avoids an error when the file is missing
I had a similar problem. Instead of overwriting my existing file using the different modes, I just deleted the file before using it again, so that each run of my code appends to a new file.
See "How to Replace String in File", which has an answer that works in a simple way using replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
    fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
In my case the following code did the trick:
import json
# w+ mode creates the file if it does not exist and overwrites any existing content
with open("output.json", "w+") as outfile:
    json.dump(result_plot, outfile)
Using the Python 3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method using a different approach to backups:
import re
from pathlib import Path
filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')
filepath.rename(backup)  # different approach to backups: move the original aside
content = backup.read_text()  # read from the backup, since the original path no longer exists
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
