Pandas - save multiple CSVs in a zip archive - python

I need to save multiple dataframes as CSVs, all in the same zip file.
Is it possible without creating temporary files?
I tried using zipfile:
with zipfile.ZipFile("archive.zip", "w") as zf:
    with zf.open("file1.csv", "w") as buffer:
        data_frame.to_csv(buffer, mode="wb")
This works with to_excel but fails with to_csv, as zipfile expects binary data while to_csv writes a string, despite the mode="wb" parameter:
.../lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 283, in _save_header
writer.writerow(encoded_labels)
.../lib/python3.8/zipfile.py", line 1137, in write
TypeError: a bytes-like object is required, not 'str'
On the other hand, I tried using the compression parameter of to_csv, but the archive is overwritten, and only the last dataframe remains in the final archive.
If there is no other way, I'll use temporary files, but I was wondering if someone has an idea for making to_csv and zipfile work together.
Thanks in advance!

I would approach this the following way:
import io
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
string_io = io.StringIO()
df.to_csv(string_io)
string_io.seek(0)
df_bytes = string_io.read().encode('utf-8')
As df_bytes is bytes, it should now work with zipfile. Edit: after looking into the to_csv help I found a simpler way to get bytes, namely:
import pandas as pd
df = pd.DataFrame({"x":[1,2,3]})
df_bytes = df.to_csv().encode('utf-8')
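For completeness, a minimal sketch that writes several dataframes into a single archive this way, using ZipFile.writestr (which accepts either str or bytes):
import zipfile

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})

with zipfile.ZipFile("archive.zip", "w") as zf:
    # writestr adds a named member directly from an in-memory payload,
    # so no temporary files are needed.
    zf.writestr("file1.csv", df1.to_csv(index=False).encode('utf-8'))
    zf.writestr("file2.csv", df2.to_csv(index=False).encode('utf-8'))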

For saving multiple Excel files from dataframes in a zip file:
import zipfile

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})
df3 = pd.DataFrame({"z": [7, 8, 9]})

with zipfile.ZipFile("rishabh.zip", "w") as zf:
    with zf.open("check1.xlsx", "w") as buffer:
        df1.to_excel(buffer, index=False)
    with zf.open("check2.xlsx", "w") as buffer:
        df2.to_excel(buffer, index=False)
    with zf.open("check3.xlsx", "w") as buffer:
        df3.to_excel(buffer, index=False)
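For the CSV case from the original question, one possible sketch (not part of the answers above) is to wrap each zip entry in io.TextIOWrapper, so to_csv sees the text interface it expects while bytes are written into the archive underneath:
import io
import zipfile

import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})

with zipfile.ZipFile("archive.zip", "w") as zf:
    for name, df in [("file1.csv", df1), ("file2.csv", df2)]:
        with zf.open(name, "w") as raw:
            # TextIOWrapper encodes the text written by to_csv into the
            # binary zip entry; closing it flushes before the entry closes.
            with io.TextIOWrapper(raw, encoding="utf-8", newline="") as buffer:
                df.to_csv(buffer, index=False)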

Related

How to save new results extracted from a csv file in python?

I have a file (M.csv) that contains many columns.
I extract 2 columns from the M.csv file by using DictReader. Now I want to save these 2 columns in a new CSV file. I use data.to_csv(.....) but it is not working.
This is the code:
import pandas as pd
import csv

with open('M.csv', newline='') as f:
    data = csv.DictReader(f)
    print("I_CaIIHK", ",", 'Swm')
    print("---------------------------------")
    for row in data:
        print(row['I_CaIIHK_err'], ",", row['Swm'])
    data.to_csv('New.csv', index=False)
The code runs and I got the 2 columns, but they cannot be saved in the new csv file.
And this is the error:
AttributeError: 'DictReader' object has no attribute 'to_csv'
It looks like you are trying to call the pandas.DataFrame.to_csv method on a csv.DictReader object (which is not a DataFrame).
You should be able to read the CSV file into a DataFrame with the pandas.read_csv function, specifying only the columns you want with the usecols argument, then save the data with the pandas.DataFrame.to_csv method.
Then you don't need the csv library at all :).
Something like:
import pandas as pd
COLUMNS_TO_USE = ["I_CaIIHK_err", "Swm"]
PATH_TO_FILE = "./M.csv"
OUTPUT_FILE = "New.csv"
df = pd.read_csv(PATH_TO_FILE, usecols=COLUMNS_TO_USE)
df.to_csv(path_or_buf=OUTPUT_FILE)
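Alternatively, if you would rather keep the csv module from your original code, a minimal stdlib-only sketch using csv.DictWriter (not part of the answer above) would be:
import csv

fields = ["I_CaIIHK_err", "Swm"]
with open("M.csv", newline="") as src, open("New.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        # keep only the two requested columns
        writer.writerow({k: row[k] for k in fields})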

Extract a particular value of csv file without uploading whole file

So I have several tables in CSV format, and I am using Python and the csv module. I want to extract a particular value, let's say column=80, row=109.
Here is a random example:
import csv

with open('hugetable.csv', 'r') as file:
    rows = list(csv.reader(file))
    print(rows[109][80])
I am doing this many times with large tables, and I would like to avoid loading the whole table into a list (the list(csv.reader(file)) call above) just to ask for a single value. Is there a way to open the file, load the specific value, and close it again? Would this process be more efficient than what I have done above?
Thanks for all the answers; all of them so far work pretty well.
You could try reading the file without the csv library:
row = 108
column = 80

with open('hugetable.csv', 'r') as file:
    header = next(file)        # consume the header line
    for _ in range(row - 1):   # skip the lines before the target row
        _ = next(file)
    line = next(file)
    print(line.strip().split(',')[column])
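A slightly tidier sketch of the same idea uses itertools.islice to skip lines lazily and parses the single target line with csv.reader, which also handles quoted commas that a plain split(',') would miss:
import csv
from itertools import islice

with open('hugetable.csv', 'r') as file:
    # islice skips the header plus the preceding rows without
    # materializing them; this reaches the same line as the loop above.
    target = next(islice(file, 108, 109))
    print(next(csv.reader([target]))[80])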
You can try pandas to load only certain columns of your csv file
import pandas as pd
pd.read_csv('foo.csv', usecols=["column1", "column2"])
You could use pandas to load it:
import pandas as pd

# With header=None the column labels are integers, so text[50]
# selects column 50 of the three rows loaded after skipping 100.
text = pd.read_csv('Book1.csv', sep=',', header=None, skiprows=100, nrows=3)
print(text[50])
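Combining skiprows, nrows, and usecols gives a sketch that keeps only the one requested cell in memory (pandas still scans the skipped lines, but does not hold them):
import pandas as pd

# Reads the line at 0-based index 109 and keeps only column 80;
# adjust skiprows for your header convention.
value = pd.read_csv('hugetable.csv', header=None, skiprows=109,
                    nrows=1, usecols=[80]).iat[0, 0]
print(value)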

pandas: write dataframe to excel file *object* (not file)?

I have a dataframe that I want to convert to an excel file and return it using HTTP. The dataframe's to_excel method accepts either a path or an ExcelWriter, which, in turn, refers to a path.
Is there any way to convert the dataframe to a file object, without writing it to disk?
This can be done using the BytesIO object from the standard library:
import pandas
from io import BytesIO
# Create Random Data for example
cols = ["col1", "col2"]
df = pandas.DataFrame.from_records([{k: 0.0 for k in cols} for _ in range(25)])
# Create an in memory binary file object, and write the dataframe to it.
in_memory_fp = BytesIO()
df.to_excel(in_memory_fp)
# Write the file out to disk to demonstrate that it worked.
in_memory_fp.seek(0,0)
with open("my_file.xlsx", 'wb') as f:
f.write(in_memory_fp.read())
In the above example, I wrote the object out to a file so you can verify that it works. If you want to just return the raw binary data in memory, all you need is:
in_memory_fp.seek(0,0)
binary_xl = in_memory_fp.read()
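To actually return the workbook over HTTP, a hypothetical sketch using Flask (the question does not name a web framework, so this is an assumption) could look like:
from io import BytesIO

import pandas
from flask import Flask, send_file  # Flask is an assumption, not from the question

app = Flask(__name__)

@app.route("/report")
def report():
    df = pandas.DataFrame({"col1": [0.0], "col2": [0.0]})
    buffer = BytesIO()
    df.to_excel(buffer, index=False)
    buffer.seek(0)
    return send_file(
        buffer,
        mimetype="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        as_attachment=True,
        download_name="my_file.xlsx",  # Flask >= 2.0; older versions use attachment_filename
    )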

Reading CSV files from zip archive with python-3.x

I have a zipped archive that contains several csv files.
For instance, assume myarchive.zip contains myfile1.csv, myfile2.csv, myfile3.csv
In Python 2.7 I was able to iteratively load all the files into pandas using:
import pandas as pd
import zipfile

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        mydf = pd.read_csv(zippedyear.open(filename))
Now doing the same thing with Python 3 throws the error
ParserError: iterator should return strings, not bytes (did you open
the file in text mode?)
I am at a loss here. Any idea what is the issue?
Thanks!
Strange indeed, since the only mode you can specify is r/w (character modes).
Here's a workaround: read the file using file.read, decode the resulting bytes, load the data into a StringIO buffer, and pass that to read_csv.
import zipfile
from io import StringIO

import pandas as pd

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        with zippedyear.open(filename) as f:
            # f.read() returns bytes, so decode before building the StringIO
            mydf = pd.read_csv(StringIO(f.read().decode('utf-8')))
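An alternative sketch that avoids reading each member fully into memory first is to wrap the binary stream in io.TextIOWrapper, which decodes on the fly:
import io
import zipfile

import pandas as pd

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        with zippedyear.open(filename) as f:
            # TextIOWrapper turns the binary zip stream into a text stream
            mydf = pd.read_csv(io.TextIOWrapper(f, encoding='utf-8'))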

Python: Parsing multiple csv files and skip files without a keyword

I am trying to read some .csv field data in Python for post-processing; I typically just use something like:
for flist in glob('*.csv'):
    df = pd.read_csv(flist, delimiter=',')
However, I need to filter out the bad files, which contain "Run_Terminated" somewhere in the file, and skip each such file entirely. I'm still new to Python, so I'm not familiar with all of its functionality; any input would be appreciated. Thank you.
What you could do is first read the file fully into memory (using an io.StringIO file-like object) and look for the Run_Terminated string anywhere in the file (dirty, but should be OK).
Then pass the handle to read_csv (since you can pass a handle OR a filename) so you don't have to read it again from the file.
import glob
import io

import pandas as pd

for flist in glob.glob('*.csv'):
    with open(flist) as f:
        data = io.StringIO()
        data.write(f.read())
        if "Run_Terminated" not in data.getvalue():
            data.seek(0)  # rewind or it won't read anything
            df = pd.read_csv(data, delimiter=',')
