Read in-memory csv file with open() in python

I have a dataframe that I read from an Excel file, and I want to write it to CSV without saving it to disk. StringIO() looked like the clear solution. I then wanted to open the in-memory file-like object with open(), but got a TypeError. How do I get around this TypeError and read the in-memory CSV with open()? Or is it wrong to assume that open() will work at all?
TypeError: expected str, bytes or os.PathLike object, not StringIO
The error points to the line below, taken from the full code further down.
f = open(writer_file)
To get a proper example I had to open the file "pandas_example.xlsx" after creating it and remove a row, so that the code then runs into an empty row.
import io
import csv
import pandas as pd
from pandas import util

df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')

writer_file = io.StringIO()
write_to_the_in_mem_file = csv.writer(writer_file, dialect='excel', delimiter=',')
write_to_the_in_mem_file.writerow(df_1)

f = open(writer_file)  # <-- raises the TypeError
while f.readline() not in (',,,,,,,,,,,,,,,,,,\n', '\n'):
    pass
final_df = pd.read_csv(f, header=None)
f.close()

Think of writer_file as what open() returns; you don't need to open it again.
For example:
import pandas as pd
from pandas import util
import io
# Create test file
df = util.testing.makeDataFrame()
df.to_excel('pandas_example.xlsx')
df_1 = pd.read_excel('pandas_example.xlsx')
writer_file = io.StringIO()
df_1.to_csv(writer_file)
writer_file.seek(0) # Rewind back to the start
for line in writer_file:
    print(line.strip())
The to_csv() call writes the dataframe into your memory file in CSV format.

After writer_file = io.StringIO(), writer_file is already a file-like object. It already has a readline method, as well as read, write, seek, etc. See io.TextIOBase, which io.StringIO inherits from.
In other words, open(writer_file) is not only unnecessary, it raises a TypeError (as you already observed).
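Building on that, a minimal sketch (reusing the pandas_example.xlsx file from the question) shows that the rewound buffer can be handed straight to pd.read_csv, completing the round trip without touching disk:

import io
import pandas as pd

df_1 = pd.read_excel('pandas_example.xlsx')   # file created in the question

writer_file = io.StringIO()
df_1.to_csv(writer_file)        # write the CSV text into the in-memory buffer
writer_file.seek(0)             # rewind so the next reader starts at the top

final_df = pd.read_csv(writer_file, index_col=0)   # read it back directly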

Related

Python - Pandas "BadGzipFile" Error When Reading in ".json.gz" File

I am trying to read in data from a ".json.gz" file as a dataframe. I keep getting an error indicating that it is a "BadGzipFile". However, when I unzip the file manually (i.e., just double-clicking it in Finder), I am able to successfully open the JSON file. This leads me to believe that the file is fine, but when I run the code below in Python, I receive the "BadGzipFile" error.
I am very new to .gzip files and have done a fair bit of research trying to figure out what the issue is. So far, I have been unsuccessful. Any help would be greatly appreciated!
Here is my code:
import os
import json
import gzip
import pandas as pd

file_path = '/data/data_0_0_0.json.gz'
with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)
And here is the error I am receiving:
BadGzipFile: Not a gzipped file (b'{"')
What's happening with your code here:
with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, compression='gzip', lines=True)
is that you're opening a gzip file at file_path, and then telling Pandas that the thing you opened (f) is itself another gzip file. It isn't; it's a JSON file. When it says BadGzipFile with that opening bracket, it is telling you that the data it found starts with a bracket instead of the gzip magic number.
You should change it either to open the file with gzip and read the result directly, or to have Pandas open the file itself.
The first would be:
with gzip.open(file_path, 'rb') as f:
    df = pd.read_json(f, lines=True)
The second is actually easier. Because pd.read_json will infer the compression format based on the file name and your file ends with .gz, you can just write:
df = pd.read_json(file_path, lines=True)
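As a small round-trip sketch (using a hypothetical records.json.gz path), both approaches can be checked against a freshly written JSON Lines file:

import gzip
import json
import pandas as pd

path = 'records.json.gz'   # hypothetical file for illustration

# Write two records as gzip-compressed JSON Lines
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'a': 1, 'b': 'x'}) + '\n')
    f.write(json.dumps({'a': 2, 'b': 'y'}) + '\n')

# Option 1: decompress explicitly, then let pandas parse the JSON lines
with gzip.open(path, 'rb') as f:
    df1 = pd.read_json(f, lines=True)

# Option 2: let pandas infer the gzip compression from the .gz suffix
df2 = pd.read_json(path, lines=True)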

Convert bytes to a file object in python

I have a small application that reads local files using:
open(diefile_path, 'r') as csv_file
open(diefile_path, 'r') as file
and also uses the linecache module.
I need to expand this to files that are sent from a remote server.
The content received from the server is of type bytes.
I couldn't find a lot of information about handling io.BytesIO and I was wondering if there is a way to convert the bytes chunk to a file-like object.
My goal is to use the APIs specified above (open, linecache).
I was able to convert the bytes into a string using data.decode("utf-8"),
but I can't use the methods above (open and linecache).
A small example to illustrate:
data = b'First line\nSecond line\nThird line\n'
with open(data) as file:
    line = file.readline()
    print(line)
output:
First line
Second line
Third line
can it be done?
open is used to open actual files, returning a file-like object. Here, you already have the data in memory, not in a file, so you can instantiate the file-like object directly.
import io
data = b'First line\nSecond line\nThird line\n'
file = io.StringIO(data.decode())
for line in file:
    print(line.strip())
However, if what you are getting is really just a newline-separated string, you can simply split it into a list directly.
lines = data.decode().strip().split('\n')
The main difference is that the StringIO version is slightly lazier: it has a smaller memory footprint compared to the list, since it splits lines off only as the iterator requests them.
The answer above that uses StringIO needs to specify an encoding, which may cause a wrong conversion.
From the Python documentation, using BytesIO:
from io import BytesIO
f = BytesIO(b"some initial binary data: \x00\x01")
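If you specifically need a text-mode file-like object (so that readline and iteration yield str, as with open), one alternative sketch is to wrap the bytes in io.TextIOWrapper; this assumes the data is UTF-8 encoded:

import io

data = b'First line\nSecond line\nThird line\n'

# Wrap the raw bytes in a binary buffer, then in a text wrapper so that
# line-oriented text APIs (readline, iteration) work without manual decoding
file = io.TextIOWrapper(io.BytesIO(data), encoding='utf-8')
for line in file:
    print(line.strip())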

Decoding UTF8 literals in a CSV file

Question:
Does anyone know how I could transform this b"it\\xe2\\x80\\x99s time to eat" into this it's time to eat
More details & my code:
Hello everyone,
I'm currently working with a CSV file which is full of rows with UTF-8 literals in them, for example:
b"it\xe2\x80\x99s time to eat"
The end goal is to get something like this:
it's time to eat
To achieve this I have tried using the following code:
import pandas as pd
file_open = pd.read_csv("/Users/Downloads/tweets.csv")
file_open["text"]=file_open["text"].str.replace("b\'", "")
file_open["text"]=file_open["text"].str.encode('ascii').astype(str)
file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]
print(file_open["text"])
After running the code the row that I took as an example is printed out as:
it\xe2\x80\x99s time to eat
I have tried solving this issue using the following code to open the CSV file:
file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")
which printed out the example row in the following manner:
it\xe2\x80\x99s time to eat
and I have also tried decoding the rows using this:
file_open["text"]=file_open["text"].str.decode('utf-8')
Which gave me the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Thank you very much in advance for your help.
b"it\\xe2\\x80\\x99s time to eat" sounds like your file contains an escaped encoding.
In general, you can convert this to a proper Python3 string with something like:
x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x) # it’s time to eat
(The .encode('latin1') step recovers the original UTF-8 bytes, because Latin-1 maps each of the first 256 code points back to the same byte value; see the sketch below.)
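A step-by-step sketch of what that chain does to the example value:

x = b"it\\xe2\\x80\\x99s time to eat"

# 1. Interpret the literal backslash escapes; the result is a str whose
#    code points carry the raw byte values 0xe2, 0x80, 0x99
s = x.decode('unicode-escape')

# 2. Encode with latin1, mapping each code point back to the same byte value
#    and recovering the original UTF-8 byte sequence
raw = s.encode('latin1')       # b'it\xe2\x80\x99s time to eat'

# 3. Decode those bytes as UTF-8
print(raw.decode('utf8'))      # it’s time to eat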
So, if after you use pd.read_csv(..., encoding="utf8") you still have escaped strings, you can do something like:
pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
# itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val) # it’s time to eat
But I think it's probably better to do this to the whole file instead of to each value individually, for example with StringIO (if the file isn't too big):
from io import StringIO
import pandas as pd

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)  # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")

Pandas ParserError EOF character when reading multiple csv files to HDF5

Using Python3, Pandas 0.12
I'm trying to write multiple csv files (total size is 7.9 GB) to an HDF5 store to process later on. The csv files contain around a million rows each and 15 columns; the data types are mostly strings, with some floats. However, when I try to read the csv files I get the following error:
Traceback (most recent call last):
  File "filter-1.py", line 38, in <module>
    to_hdf()
  File "filter-1.py", line 31, in to_hdf
    for chunk in reader:
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 578, in __iter__
    yield self.read(self.chunksize)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 740, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7146)
  File "parser.pyx", line 781, in pandas.parser.TextReader._read_rows (pandas\parser.c:7568)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: EOF inside string starting at line 754991
Closing remaining open files: ta_store.h5... done
Edit:
I managed to find a file that produced this problem. I think it's reading an EOF character. However, I have no clue how to overcome this problem. Given the large size of the combined files, I think it's too cumbersome to check each single character in each string. (Even then I would still not be sure what to do.) As far as I checked, there are no strange characters in the csv files that could raise the error.
I also tried passing error_bad_lines=False to pd.read_csv(), but the error persists.
My code is the following:
# -*- coding: utf-8 -*-
import pandas as pd
import os
from glob import glob

def list_files(path=os.getcwd()):
    ''' List all files in specified path '''
    list_of_files = [f for f in glob('2013-06*.csv')]
    return list_of_files

def to_hdf():
    """ Function that reads multiple csv files to HDF5 Store """
    # Defining path name
    path = 'ta_store.h5'
    # If path exists delete it such that a new instance can be created
    if os.path.exists(path):
        os.remove(path)
    # Creating HDF5 Store
    store = pd.HDFStore(path)
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        reader = pd.read_csv(f, chunksize=50000)
        # Looping over chunks and storing them in store file, node name 'ta_data'
        for chunk in reader:
            chunk.to_hdf(store, 'ta_data', mode='w', table=True)
    # Return store
    return store.select('ta_data')
    return 'Finished reading to HDF5 Store, continuing processing data.'

to_hdf()
Edit
If I go into the CSV file that raises the CParserError EOF... and manually delete all rows after the line that is causing the problem, the csv file is read properly. However all I'm deleting are blank rows anyway.
The weird thing is that when I manually correct the erroneous csv files, they are loaded fine into the store individually. But when I again use a list of multiple files the 'false' files still return me errors.
I had a similar problem. The line listed with the 'EOF inside string' had a string that contained within it a single quote mark ('). When I added the option quoting=csv.QUOTE_NONE it fixed my problem.
For example:
import csv
import pandas as pd

df = pd.read_csv(csvfile, header=None, delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
I have the same problem, and after adding these two params to my code, the problem is gone.
read_csv (...quoting=3, error_bad_lines=False)
I realize this is an old question, but I wanted to share some more details on the root cause of this error and why the solution from @Selah works.
From the csv.py docstring:
* quoting - controls when quotes should be generated by the writer. It can take on any of the following module constants:
  csv.QUOTE_MINIMAL means only when required, for example, when a field contains either the quotechar or the delimiter
  csv.QUOTE_ALL means that quotes are always placed around fields.
  csv.QUOTE_NONNUMERIC means that quotes are always placed around fields which do not parse as integers or floating point numbers.
  csv.QUOTE_NONE means that quotes are never placed around fields.
csv.QUOTE_MINIMAL is the default value and " is the default quotechar. If somewhere in your csv file there is a quotechar, it will be parsed as part of a string until the next occurrence of the quotechar. If your file has an odd number of quotechars, the last one will not be closed before reaching the EOF (end of file). Also be aware that anything between the quotechars is parsed as a single string. Even if there are many line breaks (which you would expect to be parsed as separate rows), it all goes into a single field of the table. So the line number you get in the error can be misleading. To illustrate, consider this example:
In[4]: import pandas as pd
...: from io import StringIO
...: test_csv = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: p,q,r
...: s,t,u
...: '''
...:
In[5]: test = StringIO(test_csv)
In[6]: pd.read_csv(test)
Out[6]:
a b c
0 d,e,f\ng,h,i\nm n o
1 p q r
2 s t u
In[7]: test_csv_2 = '''a,b,c
...: "d,e,f
...: g,h,i
...: "m,n,o
...: "p,q,r
...: s,t,u
...: '''
...: test_2 = StringIO(test_csv_2)
...:
In[8]: pd.read_csv(test_2)
Traceback (most recent call last):
...
...
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The first string has 2 (an even number of) quotechars, so each quotechar is closed and the csv is parsed without an error, although probably not as we expected. The other string has 3 (an odd number of) quotechars; the last one is not closed and the EOF is reached, hence the error. But the "line 2" we get in the error message is misleading: we would expect 4, but since everything between the first and second quotechar is parsed as a single string, our "p,q,r line is actually the second.
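Tying this back to the quoting fix above, a quick sketch shows that disabling quote handling makes the second example parse line by line instead of raising the EOF error:

import csv
import pandas as pd
from io import StringIO

test_csv_2 = '''a,b,c
"d,e,f
g,h,i
"m,n,o
"p,q,r
s,t,u
'''

# With QUOTE_NONE the stray quote characters are treated as ordinary text,
# so every physical line becomes its own row and no EOF error is raised
df = pd.read_csv(StringIO(test_csv_2), quoting=csv.QUOTE_NONE)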
Making your inner loop like this will allow you to detect the 'bad' file (and investigate further):
from pandas.io import parser

def to_hdf():
    .....
    # Reading csv files from list_files function
    for f in list_files():
        # Creating reader in chunks -- reduces memory load
        try:
            reader = pd.read_csv(f, chunksize=50000)
            # Looping over chunks and storing them in store file, node name 'ta_data'
            for chunk in reader:
                chunk.to_hdf(store, 'ta_data', table=True)
        except (parser.CParserError) as detail:
            print(f, detail)
The solution is to use the parameter engine='python' in the read_csv function. The pandas CSV parser can use two different "engines" to parse a CSV file: Python or C (the default).
pandas.read_csv(filepath, sep=',', delimiter=None,
header='infer', names=None,
index_col=None, usecols=None, squeeze=False,
..., engine=None, ...)
The Python engine is described as "slower, but more feature complete" in the pandas documentation.
engine : {‘c’, ‘python’}
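As a concrete call (with a hypothetical file name), switching engines is just one extra keyword argument:

import pandas as pd

# engine='python' trades parsing speed for a more tolerant, feature-complete parser
df = pd.read_csv('problem_file.csv', engine='python')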
My error:
ParserError: Error tokenizing data. C error: EOF inside string starting at row 4488
was resolved by adding delimiter="\t" in my code as:
import pandas as pd
df = pd.read_csv("filename.csv", delimiter="\t")
Use
engine="python",
error_bad_lines=False,
on the read_csv call.
The full call will be like this:
df = pd.read_csv(csvfile,
                 delimiter="\t",
                 engine="python",
                 error_bad_lines=False,
                 encoding='utf-8')
For me, the other solutions did not work and caused me quite a headache. error_bad_lines=False still gives the error C error: EOF inside string starting at line. Using a different quoting didn't give the desired results either, since I did not want to have quotes in my text.
I realised that there was a bug in Pandas 0.20. Upgrading to version 0.21 completely solved my issue. More info about this bug, see: https://github.com/pandas-dev/pandas/issues/16559
Note: this may be Windows-related as mentioned in the URL.
After looking up for a solution for hours, I have finally come up with a workaround.
The best way to eliminate this C error: EOF inside string starting at line exception without reducing multiprocessing efficiency is to preprocess the input data (if you have such an opportunity).
Replace all of the '\n' entries in the input file with, for instance, ', ', or with any other unique symbol sequence (for example, 'aghr21*&'). Then you will be able to read_csv the data into your dataframe.
After you have read the data, you may want to replace all of your unique symbol sequences ('aghr21*&') back with '\n'.
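A minimal sketch of that preprocessing step, under the assumptions that '"' is the quote character, that quotes are not escaped by doubling, and using hypothetical input/output paths; only newlines that fall inside an open quoted field are masked, so normal row separators stay intact:

import pandas as pd

TOKEN = 'aghr21*&'   # unique placeholder from the answer

def mask_embedded_newlines(src_path, dst_path, token=TOKEN):
    # Walk the file character by character, tracking whether we are inside
    # an open quoted field; newlines found there are replaced by the token.
    with open(src_path, 'r', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        inside_quotes = False
        for ch in src.read():
            if ch == '"':
                inside_quotes = not inside_quotes
            if ch == '\n' and inside_quotes:
                dst.write(token)
            else:
                dst.write(ch)

mask_embedded_newlines('input.csv', 'cleaned.csv')   # hypothetical paths
df = pd.read_csv('cleaned.csv')

# Restore the embedded newlines in the string columns
df = df.applymap(lambda v: v.replace(TOKEN, '\n') if isinstance(v, str) else v)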
I had a similar issue while trying to pull data from a GitHub repository. Simple mistake: I was trying to pull data from the git blob (the HTML-rendered page) instead of the raw csv.
If you're pulling data from a git repo, make sure your link doesn't include a <repo name>/blob segment unless you're specifically interested in the HTML code from the repo.
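For illustration (with hypothetical placeholders), the difference between the two URL forms looks like this; pandas can read the raw URL directly:

import pandas as pd

# HTML page (git blob) -- reading this triggers the tokenizing error:
#   https://github.com/<user>/<repo>/blob/main/data.csv
# Raw file contents -- this is what read_csv needs:
url = 'https://raw.githubusercontent.com/<user>/<repo>/main/data.csv'
df = pd.read_csv(url)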

How can I use io.StringIO() with the csv module?

I tried to backport a Python 3 program to 2.7, and I'm stuck with a strange problem:
>>> import io
>>> import csv
>>> output = io.StringIO()
>>> output.write("Hello!") # Fail: io.StringIO expects Unicode
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unicode argument expected, got 'str'
>>> output.write(u"Hello!") # This works as expected.
6L
>>> writer = csv.writer(output) # Now let's try this with the csv module:
>>> csvdata = [u"Hello", u"Goodbye"] # Look ma, all Unicode! (?)
>>> writer.writerow(csvdata) # Sadly, no.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unicode argument expected, got 'str'
According to the docs, io.StringIO() returns an in-memory stream for Unicode text. It works correctly when I try and feed it a Unicode string manually. Why does it fail in conjunction with the csv module, even if all the strings being written are Unicode strings? Where does the str come from that causes the Exception?
(I do know that I can use StringIO.StringIO() instead, but I'm wondering what's wrong with io.StringIO() in this scenario)
The Python 2.7 csv module doesn't support Unicode input: see the note at the beginning of the documentation.
It seems that you'll have to encode the Unicode strings to byte strings, and use io.BytesIO, instead of io.StringIO.
The examples section of the documentation includes examples of UnicodeReader and UnicodeWriter wrapper classes (thanks @AlexeyKachayev for the pointer).
Please use StringIO.StringIO().
http://docs.python.org/library/io.html#io.StringIO
http://docs.python.org/library/stringio.html
io.StringIO is a class. It handles Unicode. It reflects the preferred Python 3 library structure.
StringIO.StringIO is a class. It handles strings. It reflects the legacy Python 2 library structure.
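As a small Python 2.7 sketch of that difference, the legacy class accepts the byte strings that the 2.x csv module emits (encode any Unicode fields yourself first):

# Python 2.7
import csv
import StringIO

output = StringIO.StringIO()
writer = csv.writer(output)
writer.writerow([u"Hello".encode('utf-8'), u"Goodbye".encode('utf-8')])
print(output.getvalue())   # Hello,Goodbye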
I found this when I tried to serve a CSV file via Flask directly without creating the CSV file on the file system. This works:
import io
import csv
data = [[u'cell one', u'cell two'], [u'cell three', u'cell four']]
output = io.BytesIO()
writer = csv.writer(output, delimiter=',')
writer.writerows(data)
your_csv_string = output.getvalue()
See also
More about CSV
The Flask part
A few notes about Strings / Unicode
From csv documentation:
The csv module doesn't directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
You can find examples of UnicodeReader and UnicodeWriter here: http://docs.python.org/2/library/csv.html
To use CSV reader/writer with 'memory files' in python 2.7:
from io import BytesIO
import csv
csv_data = """a,b,c
foo,bar,foo"""
# creates and stores your csv data into a file the csv reader can read (bytes)
memory_file_in = BytesIO(csv_data.encode(encoding='utf-8'))
# classic reader
reader = csv.DictReader(memory_file_in)
# writes a csv file
fieldnames = reader.fieldnames # here we use the data from the above csv file
memory_file_out = BytesIO() # create a memory file (bytes)
# classic writer (here we copy the first file in the second file)
writer = csv.DictWriter(memory_file_out, fieldnames)
for row in reader:
    print(row)
    writer.writerow(row)
