Python pandas to_csv zip format - python

I am having a peculiar problem when writing zip files through to_csv.
Using GZIP:
df.to_csv(path_or_buf = 'sample.csv.gz', compression="gzip", index = None, sep = ",", header=True, encoding='utf-8-sig')
gives a neat gzip file with name 'sample.csv.gz' and inside it I get my csv 'sample.csv'
However, things change when using ZIP
df.to_csv(path_or_buf = 'sample.csv.zip', compression="zip", index = None, sep = ",", header=True, encoding='utf-8-sig')
gives a zip file with name 'sample.csv.zip', but inside it the csv has been renamed to 'sample.csv.zip' as well.
Removing the extra '.zip' from the file gives the csv back.
How can I implement zip extension without this issue?
I need to have zip files as a requirement that I can't bypass.
I am using Python 2.7 on a Windows 10 machine.
Thanks in advance for help.

It is pretty straightforward in pandas since version 1.0.0 using dict as compression options:
filename = 'sample'
compression_options = dict(method='zip', archive_name=f'{filename}.csv')
df.to_csv(f'{filename}.zip', compression=compression_options, ...)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

As the thread linked in the comment discusses, ZIP's directory-like nature makes it hard to do what you want without making a lot of assumptions or complicating the arguments for to_csv.
If your goal is to write the data directly to a ZIP file, that's harder than you'd think.
If you can bear temporarily writing your data to the filesystem, you can use Python's zipfile module to put that file in a ZIP with the name you preferred, and then delete the file.
import zipfile
import os
df.to_csv('sample.csv',index=None,sep=",",header=True,encoding='utf-8-sig')
with zipfile.ZipFile('sample.zip', 'w') as zf:
    zf.write('sample.csv')
os.remove('sample.csv')
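If the temporary file is the concern, one possible alternative is to build the CSV text in memory and hand it to ZipFile.writestr. The sketch below uses only the standard library. The sample rows, the helper name write_csv_to_zip and the file names are made up for illustration. With a DataFrame, the string returned by df.to_csv() called without a path could be passed to writestr in the same way.

```python
import csv
import io
import os
import tempfile
import zipfile


def write_csv_to_zip(rows, zip_path, member_name, encoding="utf-8-sig"):
    """Write rows as one CSV member of a new ZIP archive, with no temporary CSV file.

    Hypothetical helper for illustration; not part of pandas or zipfile.
    """
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Encode explicitly so the member keeps the requested encoding (BOM included).
        zf.writestr(member_name, buffer.getvalue().encode(encoding))


# Made-up sample data standing in for the DataFrame contents.
rows = [["id", "name"], [1, "alpha"], [2, "beta"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "sample.zip")
    write_csv_to_zip(rows, zip_path, "sample.csv")
    with zipfile.ZipFile(zip_path) as zf:
        member_names = zf.namelist()
        csv_text = zf.read("sample.csv").decode("utf-8-sig")

print(member_names)
```

The archive name and the member name are passed separately, so the file inside the archive keeps its plain .csv name regardless of what the archive is called.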

Since pandas 1.0.0, the compression argument of to_csv() accepts a dict, so the name of the file inside the archive can be set explicitly.
Example in one line:
df.to_csv('sample.zip', compression={'method': 'zip', 'archive_name': 'sample.csv'})

Related

Is there a way to automatically add a filename extension when using the pandas df.to_csv?

I know how to use 'defaultextension' and 'filetypes', as follows:
self.filetypes = (('CSV files', '*.csv'), ('CSV files', '*.csv'))
self.result_file = fd.asksaveasfile(filetypes = self.filetypes, defaultextension = 'csv')
I can simply add the extension when entering the filename, but I'd prefer not to do that. If I enter 'result' for my filename, I'd like for the actual filename to be result.csv.
While I'm at it, I know my filetypes specification looks a little odd, two identical options. When reading files, I couldn't figure out how to provide only one option without getting an error message. This seems to work, at least when reading. Not sure if that's part of my problem when writing.
You can use the Path.suffix attribute from the pathlib library.
Assuming you have a file called 'test.csv' in the same directory:
import pandas as pd
from pathlib import Path
file = Path('test.csv')
df = pd.read_csv(file)
# do stuff
df.to_csv(f"new_name{file.suffix}")  # suffix already includes the leading dot
print(file.suffix)
.csv
I just figured this out myself. Rather than delete the question, I thought it might be better to provide the answer, in case someone else would benefit from it.
I was using the 'asksaveasfile()' function, which actually opens a file for writing, creating it if necessary.
I have discovered that 'asksaveasfilename()' returns the name of a file the user proposes to open, but does not actually open the file. That allows me to easily add whatever extension I like prior to opening the file and writing to it.
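A minimal sketch of the "add the extension before opening" step described above. The helper name ensure_extension and its default are assumptions made for illustration. The string it receives would come from asksaveasfilename(), which is left out here so the example runs without a GUI.

```python
from pathlib import Path


def ensure_extension(filename, extension=".csv"):
    """Return filename with extension appended when it has no suffix yet.

    Hypothetical helper; the name and the default extension are assumptions.
    """
    if not filename:
        # An empty string (for example a cancelled dialog) is passed through unchanged.
        return filename
    path = Path(filename)
    if path.suffix == "":
        path = path.with_suffix(extension)
    return str(path)


print(ensure_extension("result"))
print(ensure_extension("result.csv"))
```

A name that already carries any suffix is left alone, so a user who types result.txt on purpose keeps that extension.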

Is it possible to change save path of a file saved by an external library?

I use a library in python called pyansys in which I use a method called save_as_vtk.
There it is: documentation
This method generates a file for me and saves it to my working directory. I would like that file to be saved elsewhere... I don't want it moved because sometimes it is 20+ Gb and it would take too long.
Anybody has an idea?
Thank you!
I'm the maintainer of the pyansys package.
This was answered in https://github.com/akaszynski/pyansys/issues/219
Repeated here:
It appears that ResultFile.save_as_vtk already has a filename parameter:
def save_as_vtk(self, filename, rsets=None, result_types=['ENS']):
    """Writes results to a vtk readable file.
    The file extension will select the type of writer to use.
    ``'.vtk'`` will use the legacy writer, while ``'.vtu'`` will
    select the VTK XML writer.
    Parameters
    ----------
    filename : str
        Filename of grid to be written. The file extension will
        select the type of writer to use. ``'.vtk'`` will use the
        legacy writer, while ``'.vtu'`` will select the VTK XML
        writer.

how to read multiple csv files in a directory through python csv() function?

In one of my directories, I have multiple CSV files. I want to read the content of all the CSV files through Python code and print the data, but so far I have not been able to do so.
All the CSV files have the same number of columns and the same column names as well.
I know a way to list all the CSV files in the directory and iterate over them through "os" module and "for" loop.
for files in os.listdir("C:\\Users\\AmiteshSahay\\Desktop\\test_csv"):
Now use the "csv" module to read the files name
reader = csv.reader(files)
till here I expect the output to be the names of the CSV files. which happens to be sorted. for example, names are 1.csv, 2.csv so on. But the output is as below
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
<_csv.reader object at 0x0000019F97E0E730>
<_csv.reader object at 0x0000019F97E0E528>
if I add next() function after the csv.reader(), I get below output
['1']
['2']
['3']
['4']
['5']
['6']
This happens to be the initials of my CSV files name. Which is partially correct but not fully.
Apart from this once I have the files iterated, how to see the contents of the CSV files on the screen? Today I have 6 files. Later on, I could have 100 files. So, it's not possible to use the file handling method in my scenario.
Any suggestions?
The easiest way I found during developing my project is by using dataframe, read_csv, and glob.
import glob
import os
import pandas as pd
folder_name = 'train_dataset'
file_type = 'csv'
separator = ','
dataframe = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*." + file_type)], ignore_index=True)
Here, all the csv files are loaded into 1 big dataframe.
I would recommend reading your CSVs using the pandas library.
Check this answer here: Import multiple csv files into pandas and concatenate into one DataFrame
Although you asked for python in general, pandas does a great job at data I/O and would help you here in my opinion.
till here I expect the output to be the names of the CSV files
This is the problem. csv.reader objects do not represent filenames. They are lazy objects which may be iterated to yield rows from a CSV file, and they must be given an open file object rather than a filename. If you wish to print the entire CSV file, you can open it and call list on the csv.reader object:
folder = "C:\\Users\\AmiteshSahay\\Desktop\\test_csv"
for name in os.listdir(folder):
    with open(os.path.join(folder, name), newline='') as fin:
        reader = csv.reader(fin)
        print(list(reader))
if I add next() function after the csv.reader(), I get below output
Yes, this is what you should expect. Calling next on an iterator gives you the next value that comes out of that iterator. When the reader wraps an open file, that value is the first row of the file. For example:
from io import StringIO
import csv
some_file = StringIO("""1
2
3""")
with some_file as fin:
    reader = csv.reader(fin)
    print(next(reader))
['1']
which happens to be sorted. for example, names are 1.csv, 2.csv so on.
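This is not a coincidence. In your loop, csv.reader receives the filename string itself rather than an open file, so it iterates over the characters of the name, and next(reader) returns the first character of each filename. Open the file and pass the file object to csv.reader to get its rows instead.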
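A quick way to see what csv.reader does with a filename string compared with a file-like object. This is a small sketch: the name "1.csv" mirrors the question, and the in-memory sample data is made up.

```python
import csv
import io

# A str is iterable, and csv.reader treats each item it receives as one line,
# so a filename is read character by character.
rows_from_name = list(csv.reader("1.csv"))
print(rows_from_name[0])

# A file-like object yields real lines, so the reader returns real rows.
rows_from_file = list(csv.reader(io.StringIO("a,b\n1,2\n")))
print(rows_from_file[0])
```

The first print shows a one-element row holding the first character of the name, and the second shows the header row of the data.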
Apart from this once I have the files iterated, how to see the
contents of the csv files on the screen?
Use the print command, as in the examples above.
Today I have 6 files. Later on, I could have 100 files. So, it's not
possible to use the file handling method in my scenario.
This is not true. You can define a function to print all or part of your csv file. Then call that function in a for loop with the filename as an input.
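One way this could look, as a sketch rather than a definitive implementation. The function names print_csv and print_all_csvs and the max_rows parameter are made up for illustration.

```python
import csv
import glob
import os


def print_csv(path, max_rows=None):
    """Print up to max_rows rows of a CSV file and return how many were printed."""
    printed = 0
    with open(path, newline="") as fin:
        for row in csv.reader(fin):
            if max_rows is not None and printed >= max_rows:
                break
            print(row)
            printed += 1
    return printed


def print_all_csvs(folder, max_rows=None):
    """Call print_csv for every .csv file in folder, in sorted name order."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    for path in paths:
        print(os.path.basename(path))
        print_csv(path, max_rows)
    return paths
```

Calling print_all_csvs on the directory from the question would handle 6 files or 100 files the same way, because the loop only depends on what glob finds.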
If you want to import your files as separate dataframes, you can try this:
import pandas as pd
import os
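data_dir = "../data/"
filenames = [f for f in os.listdir(data_dir) if f.lower().endswith('.csv')]  # keeps only the csv files in your directory
def extract_name_files(text):  # removes the extension from the name of each file
    # str.strip('.csv') would strip any of the characters '.', 'c', 's', 'v' from both ends
    name_file = os.path.splitext(text)[0].lower()
    return name_file
names_of_files = list(map(extract_name_files, filenames))  # creates a list that will be used to name your dataframes
dataframes = {}  # a dict keyed by name avoids exec and works for names that are not valid identifiers
for name, filename in zip(names_of_files, filenames):  # saves each csv in a dataframe structure
    dataframes[name] = pd.read_csv(os.path.join(data_dir, filename))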
You can read and store several dataframes into separate variables using two lines of code.
import pandas as pd
datasets_list = ['users', 'calls', 'messages', 'internet', 'plans']
users, calls, messages, internet, plans = [(pd.read_csv(f'datasets/{dataset_name}.csv')) for dataset_name in datasets_list]

Python/Pandas create zip file from csv

Can anyone provide an example of how to create a zip file from a csv file using the Python/Pandas package?
Thank you
Use
df.to_csv('my_file.gz', compression='gzip')
From the docs:
compression : string, optional
a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first
argument is a filename
See discussion of support of zip files here.
In the to_csv() method of pandas, besides the compression type (gz, zip etc) you can specify the archive file name - just pass the dict with necessary params as the compression parameter:
compression_opts = dict(method='zip',
                        archive_name='out.csv')
df.to_csv('out.zip', compression=compression_opts)
In the example above, the first argument of the to_csv method defines the name of the [ZIP] archive file, the method key of the dict defines [ZIP] compression type and the archive_name key of the dict defines the name of the [CSV] file inside the archive file.
Result:
├─ out.zip
│ └─ out.csv
See details in to_csv() pandas docs
In response to Stefan's answer, use the '.csv.gz' extension for the gzipped csv file to work:
df.to_csv('my_file.csv.gz', compression='gzip')
Hope that helps
The pandas to_csv compression has a security weakness: it leaves the absolute path of the file in the zip archive on a Linux machine. In addition, one might want to save the file at the top level of the zip archive. The following function addresses both points by using zipfile. On top of that, it doesn't suffer from the pickle protocol change (4 to 5).
from pathlib import Path
import zipfile
def save_compressed_df(df, dirPath, fileName):
    """Save a Pandas dataframe as a zipped .csv file.
    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        Input dataframe.
    dirPath : str or pathlib.PosixPath
        Parent directory of the zipped file.
    fileName : str
        File name without extension.
    """
    dirPath = Path(dirPath)
    path_zip = dirPath / f'{fileName}.csv.zip'
    txt = df.to_csv(index=False)
    with zipfile.ZipFile(path_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{fileName}.csv', txt)

How to create an empty zip file?

I'm using zipfile and under some circumstance I need to create an empty zip file for some placeholder purpose. How can I do this?
I know this:
Changed in version 2.7.1: If the file is created with mode 'a' or 'w'
and then closed without adding any files to the archive, the
appropriate ZIP structures for an empty archive will be written to the
file.
but my server uses a lower version as 2.6.
You can create an empty zip file without using the zipfile module:
empty_zip_data = b'PK\x05\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
with open('empty.zip', 'wb') as zip_file:  # avoid shadowing the built-in zip
    zip_file.write(empty_zip_data)
empty_zip_data is the data of an empty zip file.
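Those 22 bytes are the ZIP "end of central directory" record with every count set to zero. A small sketch that rebuilds the same bytes with struct and checks that the zipfile module accepts them. It was written for Python 3, so its behaviour on Python 2.6 is an open assumption.

```python
import io
import struct
import zipfile

# End of central directory record: 4-byte signature, four 16-bit counters,
# two 32-bit fields (directory size and offset), and a 16-bit comment length.
empty_zip_data = struct.pack("<4s4H2LH", b"PK\x05\x06", 0, 0, 0, 0, 0, 0, 0)

print(len(empty_zip_data))

# Opening the bytes from memory confirms that they form a valid, empty archive.
with zipfile.ZipFile(io.BytesIO(empty_zip_data)) as zf:
    members = zf.namelist()
print(members)
```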
You can simply do:
from zipfile import ZipFile
archive_name = 'test_file.zip'
with ZipFile(archive_name, 'w') as file:
    pass
