Reading CSV files from a zip archive with Python 3

I have a zipped archive that contains several csv files.
For instance, assume myarchive.zip contains myfile1.csv, myfile2.csv, myfile3.csv
In Python 2.7 I was able to iteratively load all the files into pandas using
import pandas as pd
import zipfile

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        mydf = pd.read_csv(zippedyear.open(filename))
Now doing the same thing with Python 3 throws the error
ParserError: iterator should return strings, not bytes (did you open
the file in text mode?)
I am at a loss here. Any idea what the issue is?
Thanks!

Strange indeed, since the only mode you can specify is r/w (character modes).
Here's a workaround: read the file with file.read, decode the bytes, load the data into a StringIO buffer, and pass that to read_csv.
import io
import zipfile
import pandas as pd

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        with zippedyear.open(filename) as f:
            # f.read() returns bytes in Python 3, so decode before wrapping
            mydf = pd.read_csv(io.StringIO(f.read().decode('utf-8')))
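Alternatively, you can wrap the binary member handle in io.TextIOWrapper so pandas sees a text-mode stream without reading the whole file into memory first; a minimal sketch, assuming the files are UTF-8 encoded:

import io
import zipfile
import pandas as pd

with zipfile.ZipFile('myarchive.zip', 'r') as zippedyear:
    for filename in ['myfile1.csv', 'myfile2.csv', 'myfile3.csv']:
        with zippedyear.open(filename) as f:
            # TextIOWrapper decodes the binary stream on the fly
            mydf = pd.read_csv(io.TextIOWrapper(f, encoding='utf-8'))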

Related

Pandas - save multiple CSV in a zip archive

I need to save multiple dataframes as CSV, all in the same zip file.
Is it possible without making temporary files?
I tried using zipfile:
with zipfile.ZipFile("archive.zip", "w") as zf:
with zf.open(f"file1.csv", "w") as buffer:
data_frame.to_csv(buffer, mode="wb")
This works with to_excel but fails with to_csv, because zipfile expects binary data while to_csv writes a string, despite the mode="wb" parameter:
.../lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 283, in _save_header
writer.writerow(encoded_labels)
.../lib/python3.8/zipfile.py", line 1137, in write
TypeError: a bytes-like object is required, not 'str'
On the other hand, I tried using the compression parameter of to_csv, but the archive is overwritten on each call, and only the last dataframe remains in the final archive.
If there is no other way, I'll use temporary files, but I was wondering if someone has an idea to make to_csv and zipfile work together.
Thanks in advance!
I would approach this the following way:
import io
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
string_io = io.StringIO()
df.to_csv(string_io)
string_io.seek(0)
df_bytes = string_io.read().encode('utf-8')
Since df_bytes is bytes, it should now work with zipfile. Edit: after looking into the to_csv help I found a simpler way to get bytes, namely:
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df_bytes = df.to_csv().encode('utf-8')
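To actually place several such strings into one archive, you can hand them to ZipFile.writestr, which accepts str or bytes; a minimal sketch, with file names chosen for illustration:

import zipfile
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})

with zipfile.ZipFile("archive.zip", "w") as zf:
    # writestr stores the CSV text under the given archive name
    zf.writestr("file1.csv", df1.to_csv(index=False))
    zf.writestr("file2.csv", df2.to_csv(index=False))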
For saving multiple Excel files from dataframes in a zip file:
import zipfile
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})
df3 = pd.DataFrame({"z": [7, 8, 9]})

with zipfile.ZipFile("rishabh.zip", "w") as zf:
    with zf.open("check1.xlsx", "w") as buffer:
        df1.to_excel(buffer, index=False)
    with zf.open("check2.xlsx", "w") as buffer:
        df2.to_excel(buffer, index=False)
    with zf.open("check3.xlsx", "w") as buffer:
        df3.to_excel(buffer, index=False)
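The same pattern can work for CSV if you wrap the binary member handle in io.TextIOWrapper, so to_csv writes text while zipfile receives bytes; a sketch of that idea, assuming UTF-8 output:

import io
import zipfile
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3]})
df2 = pd.DataFrame({"y": [4, 5, 6]})

with zipfile.ZipFile("archive.zip", "w") as zf:
    for name, frame in [("file1.csv", df1), ("file2.csv", df2)]:
        # zf.open(..., "w") yields a binary handle; TextIOWrapper encodes text for it
        with zf.open(name, "w") as buffer:
            with io.TextIOWrapper(buffer, encoding="utf-8", newline="") as text_buffer:
                frame.to_csv(text_buffer, index=False)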

Use Pandas HDFStore to open a file in read-only mode

I needed compatibility between Pandas versions, so pickle was not enough, and I stored a bunch of dataframes like this:
import pandas as pd

hdf = pd.HDFStore('storage.h5')
hdf.put('mydata', df_mydata)
...and brought them back like this:
df_mydata = hdf.get('mydata')
Thing is, in Python, you can usually open a file read-only like this:
f = open('workfile', 'r')
I saved the dataframes for local use as it takes too long and stresses out a server to pull them out of SQL otherwise. How can you open these .h5 files so as to not accidentally alter them?
Try:
hdf = pd.HDFStore('storage.h5', 'r')
This class is backed by PyTables; you can read more in the PyTables documentation.
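For completeness, a minimal sketch of the read-only round trip with a context manager, so the store is also closed cleanly (key name as in the question):

import pandas as pd

# mode='r' opens the store read-only; attempts to write will raise an error
with pd.HDFStore('storage.h5', mode='r') as hdf:
    df_mydata = hdf.get('mydata')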

Python: Parsing multiple csv files and skip files without a keyword

I am trying to read some .csv field data in Python for post-processing. I typically just use something like:

from glob import glob

import pandas as pd

for flist in glob('*.csv'):
    df = pd.read_csv(flist, delimiter=',')
However, I need to filter out the bad files, which contain "Run_Terminated" somewhere in the file, and skip each such file entirely. I'm still new to Python, so I'm not familiar with all of its functionality; any input would be appreciated. Thank you.
What you could do is first read the file fully into memory (using an io.StringIO file-like object) and look for the Run_Terminated string anywhere in the file (dirty, but should be OK).
Then pass the handle to read_csv (since you can pass a handle OR a filename) so you don't have to read the file again from disk.
import io
from glob import glob

import pandas as pd

for flist in glob('*.csv'):
    with open(flist) as f:
        data = io.StringIO()
        data.write(f.read())
    if "Run_Terminated" not in data.getvalue():
        data.seek(0)  # rewind or read_csv won't read anything
        df = pd.read_csv(data, delimiter=',')
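A slightly simpler variant of the same idea, reading each file into a plain string and only building the StringIO when the keyword is absent (a sketch; the frames list is just for illustration):

import io
from glob import glob

import pandas as pd

frames = []
for path in glob('*.csv'):
    with open(path) as f:
        text = f.read()
    if "Run_Terminated" in text:
        continue  # skip the bad file entirely
    frames.append(pd.read_csv(io.StringIO(text)))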

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error:
Error: line contains NULL byte
I tried adding unicode='utf-8'; I even tried the csv reader, and nothing works with this file.
The csv file is totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You can either convert the Excel file to a CSV file (note: if you have multiple sheets, each sheet should be converted to its own CSV file) and then read those, or read the Excel format directly.
You can use read_excel, or a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv for Excel files:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
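If the workbook contains several sheets, pandas can also load all of them at once as a dict of DataFrames; a small sketch using sheet_name=None:

import pandas as pd

# sheet_name=None returns a dict mapping sheet names to DataFrames
sheets = pd.read_excel("Data_Matches_tekha.xlsx", sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)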
I have encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read function: the pickle module.
Note that pickle ships with Python's standard library, so there is nothing to install.
Then you can write your data (first) with the code below:

import pickle

with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally load your data in another script using:

import pickle

with open(path, 'rb') as infile:  # 'input' would shadow the built-in
    data = pickle.load(infile)
Note that if you want to read the saved data with a different Python version than the one you saved it with, you can specify that in the writing step with protocol=x, where x is a pickle protocol supported by the Python version that will do the reading.
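For instance, a sketch pinning the protocol at write time (protocol 2 is the highest protocol Python 2 can read):

import pickle

with open(path, 'wb') as output:
    # protocol=2 keeps the file readable from Python 2 as well
    pickle.dump(variable_to_save, output, protocol=2)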
I hope this is of some use.

Closing file after using to_csv()

I am new to Python and so far I am loving the IPython notebook for learning. I am using the to_csv() function to write a pandas dataframe out to a file. I wanted to open the CSV to see how it would look in Excel, and it would only open in read-only mode because it was still in use by another process. How do I close the file?
import pandas as pd
import numpy as np
import statsmodels.api as sm
import csv
df = pd.DataFrame(file)
path = "File_location"
df.to_csv(path+'filename.csv', mode='wb')
This will write out the file no problem, but when I "check" it in Excel I get the read-only warning. This also brought up a larger question for me: is there a way to see what files Python is currently using/touching?
This is the better way of doing it.
With a context manager, you don't have to handle the file resource yourself:

with open("thefile.csv", "w") as f:
    df.to_csv(f)
(Thanks to @rpattiso for the suggestion.)
Try opening and closing the file yourself:

outfile = open(path + 'filename.csv', 'w')  # 'w', not 'wb': to_csv writes text in Python 3
df.to_csv(outfile)
outfile.close()
Recent versions of pandas close the file automatically when to_csv is given a path rather than an open handle.
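So the simplest form is to pass a path and let pandas open and close the file itself; a minimal sketch:

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
# pandas opens, writes, and closes the file when given a path
df.to_csv("filename.csv", index=False)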
