Accessing zip compression options in pandas to_csv - python

I am having trouble finding the compression options available to me.
At the bottom of this page:
to_csv
they have an example that shows 2 options:
compression_opts = dict(method='zip',
                        archive_name='out.csv')
But I see no listing of all the options available, and I can't find one elsewhere.
I'd love to see the full list (assuming there are more than these two).
End goal currently: the zip operation does produce a zip file, but the full folder path is reproduced inside the archive, so the csv is actually buried in a bunch of folders within the zip. I'm sure there is an easy option to prevent the folders from being added to the zip...

I think I understand your question. Let's say I have a dataframe, and I want to save it locally in a zipfile. And let's say I want that zipfile saved at the location somepath/myfile.zip
Let's say I run this program (also assuming that somepath/ is a valid folder in the current working directory):
### with_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
If I list the contents of the resulting file, I can see that the path I wanted to store the zip file at was ALSO used as the name of the file INSIDE the zip, folder structure included, and it is even still named .zip, which is weird:
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:56   somepath/myfile.zip
---------                     -------
       17                     1 file
Instead, I can supply an archive_name as a compression option to explicitly provide a name for my file inside the zip. Like so:
### without_path.py
import pandas as pd
filename = "myfile"
df = pd.DataFrame([["a", 1], ["b", 2]])
compression_options = {"method": "zip", "archive_name": f"{filename}.csv"}
df.to_csv(f"somepath/{filename}.zip", compression=compression_options)
Now, although our resulting zip file was still written at the desired location of somepath/, the file inside the zip does NOT include the path as part of the filename, and is correctly named with a .csv extension.
(.venv) pandas_test % unzip -l somepath/myfile.zip
Archive:  somepath/myfile.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
       17  09-17-2021 12:59   myfile.csv
---------                     -------
       17                     1 file
The strange default behavior doesn't seem to be called out in the documentation, but you can see the use of the archive_name parameter in the final example of the pandas.DataFrame.to_csv documentation. IMHO they should throw an error and force you to provide an archive_name value, because I can't imagine when you would want the file inside a zip to be named exactly the same as the zip file itself.

After some research into to_csv's integrated compression mechanisms, I would suggest a different approach for your problem:
Assuming that you have a number of DataFrames that you want to write to your zip file as individual csv files (for this example I keep the DataFrames in a list, so I can loop over them):
import pandas as pd

df_list = []
df_list.append(pd.DataFrame({'col1': [1, 2, 3],
                             'col2': ['A', 'B', 'C']}))
df_list.append(pd.DataFrame({'col1': [4, 5, 6],
                             'col2': ['a', 'b', 'c']}))
Then you can convert each of these DataFrames to a csv string in memory (not a file) and write that string to your zip archive as a file (e.g. df0.csv, df1.csv, ...):
import zipfile

with zipfile.ZipFile(file="out.zip", mode="w") as zf:
    for index, df in enumerate(df_list):
        # Convert the DataFrame to a csv string and store it as dfN.csv.
        csv_str = df.to_csv(index=False)
        zf.writestr("{}{}.csv".format("df", index), csv_str)
EDIT:
Here is what I think pandas does with the compression options (you can look at the code on GitHub or in your local filesystem among the python libraries):
When the save function in pandas/io/formats/csvs.py is called, it will use get_handle from pandas/io/common.py with the compression options as a parameter. There, method is expected as the first entry. For zip, it will use a class named _BytesZipFile (derived from zipfile.ZipFile) with the handle (file path or buffer), mode and archive_name, which explains the example from the pandas documentation. Other parameters in **kwargs will just be passed through to the __init__ function of the super class (except for compression, which is set to ZIP_DEFLATED).
So it seems that you can pass allowZip64, compresslevel (python 3.7 and above), and strict_timestamps (python 3.8 and above) as documented here, which I could verify at least for allowZip64 with python 3.6.
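For example, a minimal sketch of passing compresslevel through the compression dict (assuming a pandas version that forwards the extra keys to ZipFile as described above):

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# method and archive_name are consumed by pandas; the remaining keys
# (here compresslevel) should be passed through to zipfile.ZipFile.
compression_opts = {
    "method": "zip",
    "archive_name": "out.csv",
    "compresslevel": 9,  # python 3.7 and above, per the zipfile docs
}
df.to_csv("out.zip", compression=compression_opts)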
I do not see a way to pass something like zip's -j / --junk-paths option through pandas' compression options.
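When writing the archive yourself with zipfile, though, the arcname parameter of ZipFile.write gets you the same effect by dropping the directories; a sketch with a hypothetical existing file:

import os
import zipfile

src = "somepath/myfile.csv"  # hypothetical csv that already exists on disk

with zipfile.ZipFile("somepath/myfile.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    # arcname controls the name stored inside the archive, so passing
    # only the basename strips the folder structure, like zip -j.
    zf.write(src, arcname=os.path.basename(src))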

Related

Python files pdf rename

I have .pdf files in a folder, and I have an .xls file with two columns. In the first column I have the filename without the .pdf extension, and in the second column I have a value.
I need to open the .xls file, match the value in the first column with all the filenames in the folder, and rename each .pdf file with the value in the second column.
Is it possible?
Thank you for your support
Angelo
You'll want to use the pandas library within python. It has a function called pandas.read_excel that is very useful for reading excel files. It returns a dataframe, which will allow you to use iloc or other methods to access the values in the first and second columns. From there, I'd recommend using os.rename(old_name, new_name), where old_name and new_name are the paths to where your .pdf files are kept. A full example of the renaming part looks like this:
import os
# Absolute path of a file
old_name = r"E:\demos\files\reports\details.txt"
new_name = r"E:\demos\files\reports\new_details.txt"
# Renaming the file
os.rename(old_name, new_name)
I've purposely left out a full explanation because you simply asked if it is possible to achieve your task, so hopefully this points you in the right direction! I'd recommend asking questions with specific reproducible code in the future, in accordance with stackoverflow guidelines.
I would encourage you to do this with a .csv file instead of an xls, as it is a much easier format (requires 0 formatting of borders, colors, etc.).
You can use the os.listdir() function to list all files and folders in a certain directory; check the os built-in library docs for that. Then grab the string name of each file, remove the .pdf, read your .csv file with the names and values, and then rename the file (a full sketch follows the snippet below).
All the utilities needed are built into python. Most are from the os lib; the others are just from the csv lib and normal opening of files:
with open(filename) as f:
    # anything you have to do with the file here
    # you may need to specify the mode you are opening the file with in the open function
    ...
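Putting those pieces together, here is a minimal sketch, assuming a mapping.csv whose rows are old_name,new_name (the csv name and the folder path are hypothetical):

import csv
import os

folder = r"E:\demos\files\reports"  # hypothetical folder holding the PDFs

# Build an old-name -> new-name mapping from the csv (names without .pdf).
with open("mapping.csv", newline="") as f:
    mapping = {old: new for old, new in csv.reader(f)}

for entry in os.listdir(folder):
    stem, ext = os.path.splitext(entry)
    if ext == ".pdf" and stem in mapping:
        os.rename(os.path.join(folder, entry),
                  os.path.join(folder, mapping[stem] + ".pdf"))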

Python: Excel file (xlsx) export with variable as the file path, using pandas

I defined an .xlsx file path as the variable output:
print(output)
r'C:\Users\Kev\Documents\Python code.xlsx'
I want to export a pandas dataframe as an .xlsx file, but need to have the file path as the output variable.
I can get the code to work with a literal file path. I've tried about a dozen ways (copying and/or piecing code together from documentation, stack overflow, blogs, etc.) and got a variety of errors. None worked. Here is one that worked with the literal path:
df = pd.DataFrame(file_list)
df.to_excel(r'C:\Users\Kev\Documents\Python code.xlsx', index=False)
I would want something like:
df.to_excel(output, index=False)
In any form or package, as long as it produces the same xlsx file and won't need to be edited to change the file path and name (that would be done where the variable output is defined).
I've attempted several iterations from the XlsxWriter site, the openpyxl site, the pandas site, etc. (with the appropriate python packages). Working in Jupyter Notebook, Python 3.8.
Any resources, packages, or code that will help me to use a variable in place of a file path for an xlsx export from a pandas dataframe?
Why I want it like this is a long story, but basically I'll have several places at the top of the code where myself and other (inexperienced) coders can quickly put file paths in and search for keywords (rather than hunt through code to find where to replace paths). The data itself is file paths that I'll iteratively search through (this is the beginning of a larger project).
Try to put the path this way:
output = "C://Users//Kev//Documents//Python code.xlsx"
df.to_excel(output, index=False)
Always worked for me.
Or you can also do it like:
output = "C://Users//Kev//Documents//"
df.to_excel(output + "Python code.xlsx", index=False)
os module would be the most useful here:
from os import path
output = path.abspath("your_excel_file.xlsx")
print(output)
This will return the current working directory path plus the file name you've put into the abspath function as a parameter. Also, for those interested in why some people use backslash "\" and not forward slash "/" when writing file paths, here is a good stackoverflow answer for it: So what IS the right direction of the path's slash (/ or \) under Windows?
You can use format strings with python3:
import pandas as pd

df = pd.DataFrame({"a": ["b"], "c": ["d"]})
file_name = "filename.xlsx"
df.to_excel(f"/your/path/to/file/{file_name}", index=False)
Assuming that OP's dataframe is df, that OP is using Windows and wants to store the file on the Desktop, that OP's username is cowboykevin05, and that the desired filename is 0001.xlsx, one can use os.path as follows:
from os import path
df.to_excel(path.join('C:\\Users\\cowboykevin05\\Desktop', '0001.xlsx'), index=False)

Duplicate in list created from filenames (python)

I'm trying to create a list of excel files that are saved to a specific directory, but I'm having an issue where when the list is generated it creates a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using python 3.7 on windows 10 pro
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (probably an auto-backup).
Whenever you open an excel file, Excel creates a ghost copy that works as a temporary backup for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means that the file is open in some software, which is showing you that backup (usually that file is hidden from the explorer as well).
Just search for the program and close it. Other actions, such as adding validation so that "~$*.xlsx"-style files are ignored, should also be implemented if this is something you want to avoid.
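A sketch of that validation, reusing the glob approach from the question:

import glob
import os

path = r'D:\larvalSchooling\data'

# Skip Excel's ~$ lock/backup files when collecting workbooks.
filenames = [f for f in glob.glob(path + "/*.xlsx")
             if not os.path.basename(f).startswith("~$")]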
You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. The open excel files can be skipped using the following code:
import os

filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx':
        if not file.startswith('~$'):
            filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)

Convert dta files to csv

I want to convert several dta files into csv.
So far my code is (to be honest I used an answer I found on stackoverflow...)
library(foreign)
setwd("C:\\Users\\Victor\\Folder")
for (f in Sys.glob('*.dta'))
    write.csv(read.dta(f), file = gsub('dta$', 'csv', f))
It works, but if my folder contains sub-folders they are ignored.
My problem is that I have 11 sub-folders (which may contain sub-folders themselves). I would like to find a way to loop over my folder and its sub-folders, because right now I need to change my working directory for each sub-folder.
I'm using R for now; I tried to use pandas (python) but it seems that the quality of the conversion is debatable...
Thank you
In R to do this you just set recursive = T in list.files.
Actually, specifying recursion when dealing with directories is kind of general -- it works with command line operations in OS's including Linux and Windows with commands like rm -rf and applies to multiple functions in R.
This post has a nice example:
How to use R to Iterate through Subfolders and bind CSV files of the same ID?
Their example (which is different only in what they're doing with the results of the directory/subdirectory search) is:
lapply(c('1234', '1345', '1456', '1560'), function(x) {
    sources.files <- list.files(path = TF,
                                recursive = T,
                                pattern = paste('*09061*', x, '*.csv', sep = ''),
                                full.names = T)
    ## You read all files with the id and bind them
    dat <- do.call(rbind, lapply(sources.files, read.csv))
    ### write the file for the id
    write(dat, paste('agg', x, '.csv', sep = ''))
})
So for you, pattern = '.dta', and just set your base directory in path.
Consider using base R's list.files(), as its recursive argument tells it to search subdirectories. You will also want full.names set to return absolute paths for file referencing.
So, set your pattern to look for .dta extensions (i.e., Stata datasets) and then run the read in and write out function:
library(foreign)

statafiles <- list.files("C:\\Users\\Victor\\Folder", pattern="\\.dta$",
                         recursive = TRUE, full.names = TRUE)

lapply(statafiles, function(x) {
    df <- read.dta(x)
    write.csv(df, gsub("\\.dta$", ".csv", x))
})
And the counterpart in Python pandas, which has built-in methods to read and write Stata files:
import os
import pandas as pd

for dirpath, subdirs, files in os.walk("C:\\Users\\Victor\\Folder"):
    for f in files:
        if f.endswith(".dta"):
            df = pd.read_stata(os.path.join(dirpath, f))
            df.to_csv(os.path.join(dirpath, f.replace(".dta", ".csv")))

Python/Pandas create zip file from csv

Can anyone provide an example of how to create a zip file from a csv file using the Python/pandas package?
Thank you
Use
df.to_csv('my_file.gz', compression='gzip')
From the docs:
compression : string, optional
    a string representing the compression to use in the output file, allowed values are ‘gzip’, ‘bz2’, ‘xz’, only used when the first argument is a filename
See discussion of support of zip files here.
In the to_csv() method of pandas, besides the compression type (gz, zip, etc.) you can specify the archive file name - just pass the dict with the necessary params as the compression parameter:
compression_opts = dict(method='zip',
                        archive_name='out.csv')
df.to_csv('out.zip', compression=compression_opts)
In the example above, the first argument of the to_csv method defines the name of the [ZIP] archive file, the method key of the dict defines the [ZIP] compression type, and the archive_name key of the dict defines the name of the [CSV] file inside the archive file.
Result:
├─ out.zip
│ └─ out.csv
See details in to_csv() pandas docs
In response to Stefan's answer, use a '.csv.gz' extension for the gzipped csv file to work:
df.to_csv('my_file.csv.gz', compression='gzip')
Hope that helps.
The pandas to_csv compression has some security vulnerabilities where it leaves the absolute path of the file in the zip archive on Linux machines. Not to mention one might want to save a file at the top level of the zip archive. The following function addresses this issue by using zipfile. On top of that, it doesn't suffer from the pickle protocol change (4 to 5).
from pathlib import Path
import zipfile


def save_compressed_df(df, dirPath, fileName):
    """Save a Pandas dataframe as a zipped .csv file.

    Parameters
    ----------
    df : pandas.core.frame.DataFrame
        Input dataframe.
    dirPath : str or pathlib.PosixPath
        Parent directory of the zipped file.
    fileName : str
        File name without extension.
    """
    dirPath = Path(dirPath)
    path_zip = dirPath / f'{fileName}.csv.zip'
    txt = df.to_csv(index=False)
    with zipfile.ZipFile(path_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f'{fileName}.csv', txt)
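A quick usage sketch (hypothetical dataframe; this would write out.csv.zip containing out.csv to the current directory):

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["A", "B", "C"]})
save_compressed_df(df, ".", "out")  # creates ./out.csv.zip with out.csv inside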
