pandas read_csv from BytesIO - python

I have a BytesIO file-like object, containing a CSV.
I want to read it into a Pandas dataframe, without writing to disk in between.
MWE
In my use case I downloaded the file straight into BytesIO.
For this MWE I'll have a file on disk, read it into BytesIO, then read that into Pandas.
The disk step is just to make a MWE.
file.csv
a,b
1,2
3,4
Script:
import pandas as pd
from io import BytesIO
bio = BytesIO()
with open('file.csv', 'rb') as f:
    bio.write(f.read())
# now we have a BytesIO with a CSV
df = pd.read_csv(bio)
Result:
Traceback (most recent call last):
File "pandas-io.py", line 8, in <module>
df = pd.read_csv(bio)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Note that this sounds like a similar problem to the title of this post, but the error messages are different, and that post has the X-Y problem.

The error says the file is empty.
That's because after writing to a BytesIO object, the file pointer is at the end of the file, ready to write more. So when Pandas tries to read it, it starts reading after the last byte that was written.
So you need to move the pointer back to the start, for Pandas to read.
bio.seek(0)
df = pd.read_csv(bio)
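Alternatively, if you already have the raw bytes in hand, you can pass them straight to the BytesIO constructor: a stream built that way starts with its position at 0, so no explicit seek is needed. A minimal sketch (the byte string here stands in for your downloaded data):

```python
import pandas as pd
from io import BytesIO

# Stand-in for bytes downloaded from elsewhere.
raw = b"a,b\n1,2\n3,4\n"

# BytesIO(raw) leaves the stream position at 0, so read_csv
# can consume it immediately without a seek(0).
bio = BytesIO(raw)
df = pd.read_csv(bio)
print(df)
```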

Related

Pandas giving file not found error while trying to access csv via anaconda prompt

Beginner here. I'm trying to load this table via Python so I can figure out how to manipulate it and gain some insight, with the eventual intention of calculating the WOE and/or running a regression.
The command ran fine on a test db of two rows I created, so it must be something to do with the format of the CSV I'm trying to use. It's a file with 8000 customers and 50 associated variables, including some dates and then counts, sums and averages for 30, 60 and 90 day windows of a number of different factors. Could any of this be the reason I get the error message at the bottom?
(* are just redactions)
data = pd.read_csv("C:\Users\******\Desktop\*******.csv")
>>> data = pd.read_csv(r"C:\Users\******\Desktop\**************")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\******\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1853, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 705, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: [Errno 2] File b'C:\\Users\\******\\Desktop\\**************' does not exist: b'C:\\Users\\******\\Desktop\\**************'
....
Add r (raw string) before the opening quote:
data = pd.read_csv(r"C:\Users\******\Desktop\*******.csv")
You should replace each single backslash with a double backslash, like so
data = pd.read_csv("C:\\Users\\******\\Desktop\\*******.csv")
or prefix the path with r
data = pd.read_csv(r"C:\Users\******\Desktop\*******.csv")
See here for a full description of which characters need escaping in Python strings.
It's better to create a separate folder where you keep both your script and your CSV file, then read it by file name only. Try pressing Tab while inside the parentheses: it will show suggestions, so you can see whether the file is available or not.
df = pd.read_csv('filename.csv')
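For what it's worth, the two suggested spellings produce the same string; the raw-string prefix simply disables backslash escaping. A quick check (the path here is made up):

```python
# Both literals denote the same Windows path; r"" turns off
# backslash escape processing, so \U, \D, etc. stay literal.
escaped = "C:\\Users\\me\\Desktop\\data.csv"
raw = r"C:\Users\me\Desktop\data.csv"
print(escaped == raw)

# Forward slashes also work for most Python file APIs on Windows,
# including pandas.read_csv, and need no escaping at all.
forward = "C:/Users/me/Desktop/data.csv"
```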

Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1106012 rows and 171 columns. Other, smaller files don't show any error and finish successfully, but when I read this 1 GB file it fails every time at exactly line 1106011, the second-to-last line of the file. I could manually remove that line, but that is not a solution because I have hundreds of other files of the same size and I cannot fix them all manually. Can anyone help me with this, please?
def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
    # engine='python'
    # quoting=csv.QUOTE_NONE
    # encoding='utf-8'
    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    df = df.apply(lambda x: x.astype(str).str.lower())
    return df
I'm then calling this function within a loop, and it works just fine.
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
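As an aside, calling read_csv repeatedly with shifting skiprows/nrows re-scans the file from the top on every call; passing chunksize instead returns an iterator that makes a single pass. A sketch, with a small in-memory string standing in for the large file:

```python
from io import StringIO
import pandas as pd

# Ten data rows under a header, standing in for the huge file.
data = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)) + "\n"

total_rows = 0
# chunksize=4 yields DataFrames of up to 4 rows each, in one pass.
for chunk in pd.read_csv(StringIO(data), chunksize=4):
    total_rows += len(chunk)

print(total_rows)
```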
I read this Pandas ParserError EOF character when reading multiple csv files to HDF5, this read_csv() & EOF character in string cause parsing issue, and this https://github.com/pandas-dev/pandas/issues/11654, and many more, and tried to include read_csv parameters such as
engine='python'
quoting=csv.QUOTE_NONE  # hangs, and even hangs the Python shell; don't know why
encoding='utf-8'
but none of them worked; it's still throwing the following error:
Error:
Traceback (most recent call last):
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
sep=',')
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>>
If you are on Linux, try removing all non-printable characters, then load your file again after this operation:
tr -dc '[:print:]\n' < file > newfile
I tried many solutions; some of them worked, but they affected the computation. Use this one, and it will skip the lines that are causing the error:
pd.read_csv(file, engine='python', error_bad_lines=False)
# engine='python' provides better output
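The "EOF inside string" error usually means the data contains an unbalanced quote character, which makes the parser swallow everything up to the end of the file as one field. A small reproduction with made-up data, and the quoting=csv.QUOTE_NONE workaround mentioned above:

```python
import csv
from io import StringIO
import pandas as pd

# The unclosed double quote on row 1 trips the C parser.
bad = 'a,b\n1,"oops\n2,3\n'

failed = False
try:
    pd.read_csv(StringIO(bad))
except pd.errors.ParserError:
    failed = True  # "EOF inside string starting at ..."

# QUOTE_NONE tells the parser to treat quotes as ordinary characters,
# so the file tokenizes cleanly (the quote stays in the data).
df = pd.read_csv(StringIO(bad), quoting=csv.QUOTE_NONE)
print(df)
```

Note that unlike error_bad_lines=False, this keeps every row rather than dropping the offending ones, which matters if the skipped lines affect downstream calculations.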

Reading in csv file as dataframe from hdfs

I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if it can read the file as a regular local file:
with open("/home/file.csv") as f:
    print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
You can use the following code to read a CSV from HDFS:
import pandas as pd
import pyarrow as pa

hdfs_config = {
    "host": "XXX.XXX.XXX.XXX",
    "port": 8020,
    "user": "user"
}
fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
Use read instead of open; it works:
with hd.read("/home/file.csv") as f:
    df = pd.read_csv(f)

Python Error when reading data from .xls file

I need to read a few xls files into Python. The sample data file can be found through Link:data.file. I tried:
import pandas as pd
pd.read_excel('data.xls',sheet=1)
But it gives an error message:
ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    pd.read_excel('data.xls', sheet=1)
  File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 113, in read_excel
    return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
  File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 150, in __init__
    self.book = xlrd.open_workbook(io)
  File "C:\Anaconda3\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 1170, in parse_globals
    self.handle_codepage(data)
  File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 794, in handle_codepage
    self.derive_encoding()
  File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 775, in derive_encoding
    _unused = unicode(b'trial', self.encoding)
  File "C:\Anaconda3\lib\site-packages\xlrd\timemachine.py", line 30, in <lambda>
    unicode = lambda b, enc: b.decode(enc)
LookupError: unknown encoding: unknown_codepage_21010
Could anyone help with this problem?
PS: I know if I open the file in windows excel, and resave it, the code could work, but I am looking for a solution without manual adjustment.
Using the ExcelFile class, I was successfully able to read the file into Python. Let me know if this helps!
import xlrd
import pandas as pd
xls = pd.ExcelFile('C:\data.xls')
xls.parse('Index Constituents Data', index_col=None, na_values=['NA'])
The below worked for me.
import xlrd
my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")

Python Pandas read_csv with nrows=1

I have this code reading a text file with headers and appending another file with the same headers to it. As the main file is very large, I only want to read in part of it and get the column headers.
I get this error if the only line in the file is the header, and I have no idea how many rows the file has. What I would like to achieve is to read in the file and get its column headers; because I want to append another file to it, I am trying to ensure that the columns are correct.
import csv
import pandas as pd

main = pd.read_csv(main_input, nrows=1)
data = pd.read_csv(file_input)
data = data.reindex_axis(main.columns, axis=1)
data.to_csv(main_input,
            quoting=csv.QUOTE_ALL,
            mode='a', header=False, index=False)
Examine the stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 420, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 221, in _read
return parser.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 626, in read
ret = self._engine.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1070, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7110)
File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7671)
StopIteration
It seems that the whole file may be being read into memory. You can specify a chunksize= in read_csv(...), as discussed in the docs here.
read_csv's memory usage was overhauled in version 0.10, so your pandas version makes a difference too; see this answer from @WesMcKinney and the associated comments. The changes were also discussed a while ago on Wes' blog.
import pandas as pd
from cStringIO import StringIO
csv_data = """\
header, I want
0.47094534, 0.40249001,
0.45562164, 0.37275901,
0.05431775, 0.69727892,
0.24307614, 0.92250565,
0.85728819, 0.31775839,
0.61310243, 0.24324426,
0.669575 , 0.14386658,
0.57515449, 0.68280618,
0.58448533, 0.51793506,
0.0791515 , 0.33833041,
0.34361147, 0.77419739,
0.53552098, 0.47761297,
0.3584255 , 0.40719249,
0.61492079, 0.44656684,
0.77277236, 0.68667805,
0.89155627, 0.88422355,
0.00214914, 0.90743799
"""
tfr = pd.read_csv(StringIO(csv_data), header=None, chunksize=1)
main = tfr.get_chunk()
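If all you need are the column names, nrows=0 parses just the header and, on the pandas versions I've tried, works even when the file contains nothing but the header row. A sketch with in-memory data standing in for the large file:

```python
from io import StringIO
import pandas as pd

# Stand-in for a file that contains only the header line.
header_only = "a,b\n"

# nrows=0 reads the header and no data rows, so it is cheap
# even on a very large file.
main = pd.read_csv(StringIO(header_only), nrows=0)
cols = list(main.columns)
print(cols)
```

You can then reindex the second file against cols before appending, as in the question's code.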
