Can anyone provide an example of how to create a zip file from a CSV file using Python and the pandas package?
Thank you
Use
df.to_csv('my_file.gz', compression='gzip')
From the docs:
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first
argument is a filename
See discussion of support of zip files here.
In the to_csv() method of pandas, besides the compression type (gzip, zip, etc.), you can specify the name of the file inside the archive. Pass a dict with the necessary parameters as the compression argument:
compression_opts = dict(method='zip',
archive_name='out.csv')
df.to_csv('out.zip', compression=compression_opts)
In the example above, the first argument of the to_csv method defines the name of the [ZIP] archive file, the method key of the dict defines [ZIP] compression type and the archive_name key of the dict defines the name of the [CSV] file inside the archive file.
Result:
├─ out.zip
│ └─ out.csv
See details in to_csv() pandas docs
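One way to confirm the layout shown above is to list the archive with the standard zipfile module and read it back with pandas. This is only a minimal sketch, assuming pandas 1.0.0 or later is installed; the sample DataFrame and the temporary directory are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "out.zip"
    compression_opts = dict(method="zip", archive_name="out.csv")
    df.to_csv(zip_path, index=False, compression=compression_opts)

    # List the members stored inside the archive.
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.namelist()

    # pandas can read a single-file zip archive back directly.
    restored = pd.read_csv(zip_path, compression="zip")

print(members)
print(restored)
```

The member list should contain only the name passed as archive_name, regardless of where the zip file itself was written.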
Building on Stefan's answer: use the '.csv.gz' extension so that the gzip file decompresses to a .csv file
df.to_csv('my_file.csv.gz', compression='gzip')
Hope that helps
The pandas to_csv compression has a security weakness: on a Linux machine it can leave the absolute path of the file inside the zip archive. Apart from that, you may simply want the file stored at the top level of the archive. The following function addresses both points by using zipfile directly. On top of that, it is not affected by the pickle protocol change (4 to 5).
from pathlib import Path
import zipfile

def save_compressed_df(df, dirPath, fileName):
    """Save a Pandas dataframe as a zipped .csv file.

    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        Input dataframe.
    dirPath : str or pathlib.PosixPath
        Parent directory of the zipped file.
    fileName : str
        File name without extension.
    """
    dirPath = Path(dirPath)
    path_zip = dirPath / f'{fileName}.csv.zip'
    txt = df.to_csv(index=False)
    with zipfile.ZipFile(path_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{fileName}.csv', txt)
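The reason writestr avoids the path problem is that the member name is whatever string you pass as its first argument, independent of where the archive lives on disk. Here is a small sketch of that idea using only the standard library. The csv module stands in for df.to_csv(), and the helper name rows_to_csv_text, the folder names and the file names are all hypothetical.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def rows_to_csv_text(rows):
    """Serialise rows to CSV text in memory (stand-in for df.to_csv())."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    # The archive itself lives several folders deep.
    nested_dir = Path(tmp) / "some" / "nested" / "dir"
    nested_dir.mkdir(parents=True)
    path_zip = nested_dir / "report.csv.zip"

    text = rows_to_csv_text([["col1", "col2"], [1, "A"], [2, "B"]])
    with zipfile.ZipFile(path_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        # Only this name is stored in the archive, not the path on disk.
        zf.writestr("report.csv", text)

    with zipfile.ZipFile(path_zip) as zf:
        member_names = zf.namelist()
        round_trip = zf.read("report.csv").decode("utf-8")

print(member_names)
```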
I have .pdf files in a folder and an .xls file with two columns. The first column holds each filename without the .pdf extension, and the second column holds a value.
I need to open the .xls file, match the values in the first column against the filenames in the folder, and rename each .pdf file to the value in the second column.
Is it possible?
Thank you for your support
Angelo
You'll want to use the pandas library within python. It has a function called pandas.read_excel that is very useful for reading excel files. This will return a dataframe, which will allow you to use iloc or other methods of accessing the values in the first and second columns. From there, I'd recommend using os.rename(old_name, new_name), where old_name and new_name are the paths to where your .pdf files are kept. A full example of the renaming part looks like this:
import os
# Absolute path of a file
old_name = r"E:\demos\files\reports\details.txt"
new_name = r"E:\demos\files\reports\new_details.txt"
# Renaming the file
os.rename(old_name, new_name)
I've purposely left out a full explanation because you simply asked if it is possible to achieve your task, so hopefully this points you in the right direction! I'd recommend asking questions with specific reproducible code in the future, in accordance with stackoverflow guidelines.
I would encourage you to do this with a .csv file instead of an .xls, as it is a much simpler format (no formatting of borders, colors, etc.).
You can use the os.listdir() function to list all files and folders in a given directory. Check the docs of the built-in os library for that. Then grab the name of each file, strip the .pdf extension, read your .csv file with the names and values, and then rename the file.
All the utilities needed are built into Python. Most come from the os library; the others are the csv library and ordinary file opening:
with open(filename) as f:
    # anything you have to do with the file goes here
    # you may need to pass a mode (e.g. 'r' or 'w') to open()
    pass
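Putting those pieces together, the whole task could look roughly like the sketch below. It assumes the mapping has been exported to a two-column CSV without a header row (old name, new name). The function name rename_pdfs and all file names are hypothetical, and the demonstration runs in a throwaway directory.

```python
import csv
import tempfile
from pathlib import Path


def rename_pdfs(folder, mapping_csv):
    """Rename <old>.pdf to <new>.pdf for each (old, new) row in mapping_csv."""
    folder = Path(folder)
    renamed = []
    with open(mapping_csv, newline="") as f:
        for row in csv.reader(f):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            old_stem, new_stem = row[0].strip(), row[1].strip()
            source = folder / f"{old_stem}.pdf"
            target = folder / f"{new_stem}.pdf"
            # Only rename when the source exists and the target is free.
            if source.is_file() and not target.exists():
                source.rename(target)
                renamed.append((source.name, target.name))
    return renamed


# Small demonstration with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "scan001.pdf").write_bytes(b"%PDF-1.4\n")
    (tmp_dir / "scan002.pdf").write_bytes(b"%PDF-1.4\n")
    mapping = tmp_dir / "mapping.csv"
    mapping.write_text("scan001,invoice_A\nscan002,invoice_B\nmissing,ignored\n")

    result = rename_pdfs(tmp_dir, mapping)
    remaining = sorted(p.name for p in tmp_dir.glob("*.pdf"))

print(result)
```

Rows whose source file does not exist are skipped silently here; you may prefer to log or raise instead.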
I am having trouble finding the compression options available to me.
At the bottom of this page:
to_csv
they have an example that shows 2 options:
compression_opts = dict(method='zip',
archive_name='out.csv')
But I see no listing of all the available options, and I can't find one elsewhere.
I'd love to see the full list (assuming there are more than these 2)
End goal currently: the zip operation zips the file up in a zip file, but all the folders are also within the zip file, so that the file is actually buried in a bunch of folders within the zip. I'm sure there is an easy option to prevent the folders from being added to the zip...
I think I understand your question. Let's say I have a dataframe, and I want to save it locally in a zipfile. And let's say I want that zipfile saved at the location somepath/myfile.zip
Let's say I run this program (also assuming that somepath/ is a valid folder in the current working directory):
### with_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
If I list the content of the resulting file, I can see the path I wanted to store the zip file at was ALSO used as the name of the file INSIDE the zip, including the folder structure, and still named .zip even, which is weird:
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:56   somepath/myfile.zip
---------                     -------
       17                     1 file
Instead, I can supply an archive_name as a compression option to explicitly provide a name for my file inside the zip. Like so:
### without_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip", "archive_name": f"{filename}.csv"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
Now although our resulting zip file was still written at the desired location of somepath/ the file inside the zip does NOT include the path as part of the filename, and is correctly named with a .csv extension.
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:59   myfile.csv
---------                     -------
       17                     1 file
The strange default behavior doesn't seem to be called out in the documentation, but you can see the use of the archive_name parameter in the final example of the pandas.DataFrame.to_csv documentation. IMHO they should throw an error and force you to provide an archive_name value, because I can't imagine when you would want to name the file inside a zip the exact same as the zip file itself.
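If you write many archives, a tiny helper can derive the inner file name from the zip path so the folders never end up in archive_name. This is just a sketch using pathlib; the helper name zip_compression_options is hypothetical and not part of pandas.

```python
from pathlib import Path


def zip_compression_options(zip_path):
    """Build a to_csv compression dict whose archive_name has no folders."""
    # "somepath/myfile.zip" -> "myfile"; "sample.csv.zip" -> "sample.csv"
    name = Path(zip_path).stem
    if not name.lower().endswith(".csv"):
        name += ".csv"
    return {"method": "zip", "archive_name": name}


options = zip_compression_options("somepath/myfile.zip")
print(options)

# Intended use (requires pandas and an existing DataFrame df):
#   df.to_csv("somepath/myfile.zip", compression=options)
```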
After some research into to_csv's built-in compression mechanisms, I would suggest a different approach to your problem:
Assuming that you have a number of DataFrames that you want to write to your zip file as individual csv files (for this example I keep the DataFrames in a list, so I can loop over them):
df_list = []
df_list.append(pd.DataFrame({'col1':[1, 2, 3],
'col2':['A', 'B', 'C']}))
df_list.append(pd.DataFrame({'col1':[4, 5, 6],
'col2':['a', 'b', 'c']}))
Then you can convert each of these DataFrames to a csv string in memory (not a file) and write that string to your zip archive as a file (e.g. df0.csv, df1.csv, ...):
import zipfile

with zipfile.ZipFile(file="out.zip", mode="w") as zf:
    for index, df in enumerate(df_list):
        # Serialise each DataFrame to a CSV string in memory.
        csv_str = df.to_csv(index=False)
        zf.writestr("{}{}.csv".format("df", index), csv_str)
EDIT:
Here is what I think pandas does with the compression options (you can look at the code in Github or in your local filesystem among the python libraries):
When the save function in pandas/io/formats/csvs.py is called, it will use get_handle from pandas/io/common.py with the compression options as a parameter. There, method is expected as the first entry. For zip, it will use a class named _BytesZipFile (derived from zipfile.ZipFile) with the handle (file path or buffer), mode and archive_name, which explains the example from the pandas documentation. Other parameters in **kwargs will just be passed through to the __init__ function of the super class (except for compression, which is set to ZIP_DEFLATED).
So it seems that you can pass allowZip64, compresslevel (python 3.7 and above), and strict_timestamps (python 3.8 and above) as documented here, which I could verify at least for allowZip64 with python 3.6.
I do not see a way to use something like the -j / --junk-paths option in the zipfile library.
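That said, when the CSV already exists on disk and you call zipfile yourself, ZipFile.write takes an arcname argument, and passing only the base name has an effect similar to zip -j for that one member. A minimal sketch with the standard library only; the folder and file names are made up.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp:
    # A CSV file buried a few folders deep.
    nested = os.path.join(tmp, "a", "b")
    os.makedirs(nested)
    csv_path = os.path.join(nested, "data.csv")
    with open(csv_path, "w", newline="") as f:
        f.write("col1,col2\n1,A\n")

    zip_path = os.path.join(tmp, "flat.zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname controls the name stored in the archive; using the
        # base name drops the folder structure for this member.
        zf.write(csv_path, arcname=os.path.basename(csv_path))

    with zipfile.ZipFile(zip_path) as zf:
        flat_names = zf.namelist()

print(flat_names)
```

This is a per-member choice rather than an archive-wide switch, so two files with the same base name in different folders would collide.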
I am having a peculiar problem when writing zip files through to_csv.
Using GZIP:
df.to_csv(path_or_buf = 'sample.csv.gz', compression="gzip", index = None, sep = ",", header=True, encoding='utf-8-sig')
gives a neat gzip file with name 'sample.csv.gz' and inside it I get my csv 'sample.csv'
However, things change when using ZIP
df.to_csv(path_or_buf = 'sample.csv.zip', compression="zip", index = None, sep = ",", header=True, encoding='utf-8-sig')
gives a zip file with name 'sample.csv.zip', but inside it the csv has been renamed to 'sample.csv.zip' as well.
Removing the extra '.zip' from the file gives the csv back.
How can I implement zip extension without this issue?
I need to have zip files as a requirement that I can't bypass.
I am using python 2.7 on windows 10 machine.
Thanks in advance for help.
It is pretty straightforward in pandas since version 1.0.0 using dict as compression options:
filename = 'sample'
compression_options = dict(method='zip', archive_name=f'{filename}.csv')
df.to_csv(f'{filename}.zip', compression=compression_options, ...)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
As the thread linked in the comment discusses, ZIP's directory-like nature makes it hard to do what you want without making a lot of assumptions or complicating the arguments for to_csv
If your goal is to write the data directly to a ZIP file, that's harder than you'd think.
If you can bear temporarily writing your data to the filesystem, you can use Python's zipfile module to put that file in a ZIP with the name you preferred, and then delete the file.
import zipfile
import os
df.to_csv('sample.csv',index=None,sep=",",header=True,encoding='utf-8-sig')
with zipfile.ZipFile('sample.zip', 'w') as zf:
    zf.write('sample.csv')
os.remove('sample.csv')
Since pandas 1.0.0, the compression argument of to_csv() accepts a dict, which lets you set the name of the file inside the archive.
Example in one line:
df.to_csv('sample.zip', compression={'method': 'zip', 'archive_name': 'sample.csv'})
Say you unzip a file called file123.zip with zipfile.ZipFile, which yields an unzipped file saved to a known path. However, this unzipped file has a completely random name. How do you determine this completely random filename? Or is there some way to control what the name of the unzipped file is?
I am trying to implement this in python.
By "random" I assume that you mean that the files are named arbitrarily.
You can use ZipFile.read() which unzips the file and returns its contents as a string of bytes. You can then write that string to a named file of your choice.
from zipfile import ZipFile
with ZipFile('file123.zip') as zf:
    for i, name in enumerate(zf.namelist()):
        with open('outfile_{}'.format(i), 'wb') as f:
            f.write(zf.read(name))
This will write each file from the archive to a file named outfile_n in the current directory. The names of the files contained in the archive are obtained with ZipFile.namelist(). I've used enumerate() as a simple way of generating the file names, but you could substitute whatever naming scheme you require.
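For large members, reading everything into memory with read() may not be ideal. A possible variation is to stream each member to the chosen name with ZipFile.open() and shutil.copyfileobj(). This is a sketch only: the archive is built on the fly in a temporary directory, and the member name x9f3k2.tmp is invented to play the role of the arbitrary name.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive = tmp_dir / "file123.zip"
    # Build a small archive with an arbitrary member name to extract from.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("x9f3k2.tmp", "hello")

    with zipfile.ZipFile(archive) as zf:
        for i, info in enumerate(zf.infolist()):
            if info.is_dir():
                continue  # directory entries have no content
            target = tmp_dir / f"outfile_{i}"
            # Stream the member to the chosen name in chunks.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)

    extracted = (tmp_dir / "outfile_0").read_bytes()

print(extracted)
```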
If the filename is completely random you can first check for all filenames in a particular directory using os.listdir(). Now you know the filename and can do whatever you want with it :)
See this topic for more information.
I'm using zipfile and under some circumstance I need to create an empty zip file for some placeholder purpose. How can I do this?
I know this:
Changed in version 2.7.1: If the file is created with mode 'a' or 'w'
and then closed without adding any files to the archive, the
appropriate ZIP structures for an empty archive will be written to the
file.
but my server runs an older version, 2.6.
You can create an empty zip file without using zipfile at all:
empty_zip_data = b'PK\x05\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
with open('empty.zip', 'wb') as f:  # avoid shadowing the built-in zip()
    f.write(empty_zip_data)
empty_zip_data is the data of an empty zip file.
You can simply do:
from zipfile import ZipFile
archive_name = 'test_file.zip'
with ZipFile(archive_name, 'w') as file:
    pass
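To check that either approach yields a valid empty archive, you can inspect the bytes in memory. The sketch below assumes a reasonably recent Python 3 and uses io.BytesIO instead of real files; the variable names are made up for illustration.

```python
import io
import zipfile

# End-of-central-directory record: 4-byte signature plus 18 zero bytes.
empty_zip_data = b"PK\x05\x06" + b"\x00" * 18

# The hand-written bytes are recognised as a zip file with no members.
is_valid = zipfile.is_zipfile(io.BytesIO(empty_zip_data))
with zipfile.ZipFile(io.BytesIO(empty_zip_data)) as zf:
    names = zf.namelist()

# An archive opened for writing and closed without adding anything
# should contain the same 22 bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w"):
    pass
same_bytes = buffer.getvalue() == empty_zip_data

print(is_valid, names, same_bytes)
```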