I have this code that reads a text file with headers and appends another file with the same headers to it. Because the main file is very large, I only want to read part of it to get the column headers.
I get the error below if the only line in the file is the header, and I have no way of knowing in advance how many rows the file has. What I would like to do is read the file and get its column headers. Because I want to append another file to it, I am trying to ensure that the columns match.
import csv
import pandas as pd
main = pd.read_csv(main_input, nrows=1)
data = pd.read_csv(file_input)
# reindex_axis was removed in pandas 1.0; on newer versions use data.reindex(columns=main.columns)
data = data.reindex_axis(main.columns, axis=1)
data.to_csv(main_input,
            quoting=csv.QUOTE_ALL,
            mode='a', header=False, index=False)
Examine the stack trace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 420, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 221, in _read
return parser.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 626, in read
ret = self._engine.read(nrows)
File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1070, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7110)
File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7671)
StopIteration
It seems that the whole file may be being read into memory. You can specify chunksize= in read_csv(...), as discussed in the pandas docs.
I think that the memory usage of read_csv was overhauled in version 0.10, so your pandas version makes a difference too; see the answer from Wes McKinney and the associated comments. The changes were also discussed a while ago on Wes' blog.
import pandas as pd
from io import StringIO  # cStringIO exists only on Python 2
csv_data = """\
header, I want
0.47094534, 0.40249001,
0.45562164, 0.37275901,
0.05431775, 0.69727892,
0.24307614, 0.92250565,
0.85728819, 0.31775839,
0.61310243, 0.24324426,
0.669575 , 0.14386658,
0.57515449, 0.68280618,
0.58448533, 0.51793506,
0.0791515 , 0.33833041,
0.34361147, 0.77419739,
0.53552098, 0.47761297,
0.3584255 , 0.40719249,
0.61492079, 0.44656684,
0.77277236, 0.68667805,
0.89155627, 0.88422355,
0.00214914, 0.90743799
"""
# header=None keeps the header line as the first data row, so the first chunk holds the header
tfr = pd.read_csv(StringIO(csv_data), header=None, chunksize=1)
main = tfr.get_chunk()
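For the header-only case in the question, a minimal sketch of an alternative is to ask for zero data rows, which on reasonably recent pandas versions returns an empty frame that still carries the column names. The temporary file and the sample columns a, b, c are made up for illustration, and reindex(columns=...) is used here in place of the older reindex_axis.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical main file that contains only a header line.
tmpdir = tempfile.mkdtemp()
main_path = os.path.join(tmpdir, "main.csv")
with open(main_path, "w", newline="") as f:
    f.write('"a","b","c"\n')

# nrows=0 parses the header and no data rows.
header = pd.read_csv(main_path, nrows=0).columns

# Hypothetical new data whose columns are in a different order.
new_data = pd.DataFrame({"c": [3], "a": [1], "b": [2]})

# Put the columns in the main file's order before appending.
aligned = new_data.reindex(columns=header)
aligned.to_csv(main_path, mode="a", header=False, index=False,
               quoting=csv.QUOTE_ALL)

result = pd.read_csv(main_path)
```

Only the header line is parsed here, so the size of the main file should not matter for this step.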
Related
I have a BytesIO file-like object, containing a CSV.
I want to read it into a Pandas dataframe, without writing to disk in between.
MWE
In my use case I downloaded the file straight into BytesIO.
For this MWE I'll have a file on disk, read it into BytesIO, then read that into Pandas.
The disk step is just to make a MWE.
file.csv
a,b
1,2
3,4
Script:
import pandas as pd
from io import BytesIO
bio = BytesIO()
with open('file.csv', 'rb') as f:
    bio.write(f.read())
# now we have a BytesIO with a CSV
df = pd.read_csv(bio)
Result:
Traceback (most recent call last):
File "pandas-io.py", line 8, in <module>
df = pd.read_csv(bio)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 457, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/home/ec2-user/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 1917, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 545, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Note that this sounds like a similar problem to the title of this post, but the error messages are different, and that post has the X-Y problem.
The error says the file is empty.
That's because after writing to a BytesIO object, the file pointer is at the end of the file, ready to write more. So when Pandas tries to read it, it starts reading after the last byte that was written.
So you need to move the pointer back to the start, for Pandas to read.
bio.seek(0)
df = pd.read_csv(bio)
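A self-contained sketch of the same point, with the CSV bytes written inline rather than read from disk (the sample content mirrors file.csv above). It also shows a second option: constructing the BytesIO from the bytes directly, which leaves the position at the start.

```python
from io import BytesIO

import pandas as pd

csv_bytes = b"a,b\n1,2\n3,4\n"

# Writing moves the position to the end of the buffer.
bio = BytesIO()
bio.write(csv_bytes)
position_after_write = bio.tell()

# Rewind so that pandas starts reading from the first byte.
bio.seek(0)
df = pd.read_csv(bio)

# Passing the bytes to the constructor starts at position 0, so no seek is needed.
df2 = pd.read_csv(BytesIO(csv_bytes))
```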
I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1106012 rows and 171 columns. Other, smaller files finish successfully without any error, but when I read this 1 GB file it fails every time at exactly line 1106011, which is the second-to-last line of the file. I could remove that line manually, but that is not a solution because I have hundreds of other files of the same size and cannot fix them all by hand. Can anyone help me with this, please?
def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
    # engine='python'
    # quoting=csv.QUOTE_NONE
    # encoding='utf-8'
    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    df = df.apply(lambda x: x.astype(str).str.lower())
    return df
I then call this function within a loop, and it works just fine.
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
I read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654 and many more, and tried adding read_csv parameters such as
engine='python'
quoting=csv.QUOTE_NONE  # hangs, and even the Python shell hangs; I don't know why
encoding='utf-8'
but none of them worked; it still throws the following error.
Error:
Traceback (most recent call last):
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
sep=',')
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>>
If you are on Linux, try removing all non-printable characters.
Then try to load your file again after this operation.
tr -dc '[:print:]\n' < file > newfile
I tried many solutions. Some of them worked but affected the calculations. In the end I used this one, which skips the line that is causing the error:
pd.read_csv(file, engine='python', error_bad_lines=False)
# engine='python' provides a better output
# (error_bad_lines was deprecated in pandas 1.3; newer versions use on_bad_lines='skip')
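To see where "EOF inside string" can come from, here is a small sketch with made-up data in which the last row opens a double quote that is never closed. Whether this is what happens in the 1 GB file is an assumption; it only reproduces the same class of error with the default C parser and shows the effect of quoting=csv.QUOTE_NONE on it.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical sample: the last row opens a quote that is never closed.
bad_csv = 'a,b\n1,2\n3,"4\n'

# The default C parser reaches the end of the input while still inside the quoted field.
try:
    pd.read_csv(StringIO(bad_csv))
    error_message = None
except pd.errors.ParserError as exc:
    error_message = str(exc)

# QUOTE_NONE treats the quote character as ordinary data, so every row is kept.
df = pd.read_csv(StringIO(bad_csv), quoting=csv.QUOTE_NONE)
```

With QUOTE_NONE the stray quote stays in the value, so it may need to be stripped afterwards, and fields that legitimately contain quoted commas would be split. That trade-off is worth checking against the real data.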
I'm using pydoop to read in a file from hdfs, and when I use:
import pydoop.hdfs as hd
with hd.open("/home/file.csv") as f:
    print f.read()
It shows me the file in stdout.
Is there any way for me to read in this file as dataframe? I've tried using pandas' read_csv("/home/file.csv"), but it tells me that the file cannot be found. The exact code and error is:
>>> import pandas as pd
>>> pd.read_csv("/home/file.csv")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
self._make_engine(self.engine)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/lib64/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 353, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3246)
File "pandas/parser.pyx", line 591, in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:6111)
IOError: File /home/file.csv does not exist
I know next to nothing about hdfs, but I wonder if the following might work:
with hd.open("/home/file.csv") as f:
    df = pd.read_csv(f)
I assume read_csv works with a file handle, or in fact any iterable that will feed it lines. I know the numpy csv readers do.
pd.read_csv("/home/file.csv") would work if the regular Python file open works, i.e. if it can read the file as a regular local file.
with open("/home/file.csv") as f:
    print f.read()
But evidently hd.open is using some other location or protocol, so the file is not local. If my suggestion doesn't work, then you (or we) need to dig more into the hdfs documentation.
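The file-handle idea can be checked without HDFS. In this sketch a local temporary file stands in for the remote one (the path and contents are made up), and an already open binary handle is passed to read_csv instead of a path. Whether the handle returned by hd.open behaves the same way is an assumption to verify.

```python
import os
import tempfile

import pandas as pd

# Hypothetical local stand-in for the HDFS file.
path = os.path.join(tempfile.mkdtemp(), "file.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n3,4\n")

# read_csv accepts an open file-like object as well as a path string.
with open(path, "rb") as f:
    df = pd.read_csv(f)
```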
You can use the following code to read a CSV from HDFS:
import pandas as pd
import pyarrow as pa
hdfs_config = {
"host" : "XXX.XXX.XXX.XXX",
"port" : 8020,
"user" : "user"
}
fs = pa.hdfs.connect(hdfs_config['host'], hdfs_config['port'],
                     user=hdfs_config['user'])
df = pd.read_csv(fs.open("/home/file.csv"))
Use read instead of open; it works:
with hd.read("/home/file.csv") as f:
    df = pd.read_csv(f)
I need to read a few xls files into Python. The sample data file can be found through the link data.file. I tried:
import pandas as pd
pd.read_excel('data.xls',sheet=1)
But it gives an error message:
ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' ->
LookupError: unknown encoding: unknown_codepage_21010 Traceback (most
recent call last):
File "", line 1, in
pd.read_excel('data.xls',sheet=1)
File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 113,
in read_excel
return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
File "C:\Anaconda3\lib\site-packages\pandas\io\excel.py", line 150,
in init
self.book = xlrd.open_workbook(io)
File "C:\Anaconda3\lib\site-packages\xlrd__init__.py", line 435, in
open_workbook
ragged_rows=ragged_rows,
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 116, in
open_workbook_xls
bk.parse_globals()
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 1170, in
parse_globals
self.handle_codepage(data)
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 794, in
handle_codepage
self.derive_encoding()
File "C:\Anaconda3\lib\site-packages\xlrd\book.py", line 775, in
derive_encoding
_unused = unicode(b'trial', self.encoding)
File "C:\Anaconda3\lib\site-packages\xlrd\timemachine.py", line 30,
in
unicode = lambda b, enc: b.decode(enc)
LookupError: unknown encoding: unknown_codepage_21010
Could anyone help with this problem?
PS: I know that if I open the file in Windows Excel and resave it, the code works, but I am looking for a solution without manual adjustment.
Using the ExcelFile class, I was able to read the file into Python.
Let me know if this helps!
import xlrd
import pandas as pd
xls = pd.ExcelFile(r'C:\data.xls')  # raw string so the backslash is kept literally
xls.parse('Index Constituents Data', index_col=None, na_values=['NA'])
The below worked for me.
import xlrd
my_xls = xlrd.open_workbook('//myshareddrive/something/test.xls',encoding_override="gb2312")
I created a file by using:
store = pd.HDFStore('/home/.../data.h5')
and stored some tables using:
store['firstSet'] = df1
store.close()
I closed down python and reopened in a fresh environment.
How do I reopen this file?
When I go:
store = pd.HDFStore('/home/.../data.h5')
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 207, in __init__
self.open(mode=mode, warn=False)
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 302, in open
self.handle = _tables().openFile(self.path, self.mode)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 230, in openFile
return File(filename, mode, title, rootUEP, filters, **kwargs)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 495, in __init__
self._g_new(filename, mode, **params)
File "hdf5Extension.pyx", line 317, in tables.hdf5Extension.File._g_new (tables/hdf5Extension.c:3039)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5F.c", line 1582, in H5Fopen
unable to open file
File "H5F.c", line 1373, in H5F_open
unable to read superblock
File "H5Fsuper.c", line 334, in H5F_super_read
unable to find file signature
File "H5Fsuper.c", line 155, in H5F_locate_signature
unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file '/home/.../data.h5'
What am I doing wrong here? Thank you.
In my hands, the following approach works best:
df = pd.DataFrame(...)
# write
with pd.HDFStore('test.h5', mode='w') as store:
    store.append('df', df, data_columns=df.columns, format='table')
# read
with pd.HDFStore('test.h5', mode='r') as newstore:
    df_restored = newstore.select('df')
You could try doing instead:
store = pd.io.pytables.HDFStore('/home/.../data.h5')
df1 = store['firstSet']
or use the read method directly:
df1 = pd.read_hdf('/home/.../data.h5', 'firstSet')
Either way, you should have pandas 0.12.0 or higher...
I had the same problem and finally fixed it by installing the pytables module (next to the pandas modules which I was using):
conda install pytables
which got me numexpr-2.4.3 and pytables-3.2.0
After that it worked. I am using pandas 0.16.2 under python 2.7.9
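The "unable to find a valid file signature" line in the traceback suggests that the file on disk does not start with the HDF5 signature, for example because it is empty, was truncated, or is not the file that was written. A minimal standard-library sketch for checking this is shown below. The helper name has_hdf5_signature and the temporary files are hypothetical, and the check only looks at offset 0; HDF5 also allows the signature at later offsets (512, 1024, ...) when a user block is present.

```python
import os
import tempfile

# The 8-byte signature that begins an HDF5 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the file starts with the HDF5 signature (offset 0 only)."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Hypothetical files for illustration: an empty file, and one that merely
# starts with the signature bytes (it is not a complete HDF5 file).
tmpdir = tempfile.mkdtemp()
empty_path = os.path.join(tmpdir, "empty.h5")
open(empty_path, "wb").close()
fake_path = os.path.join(tmpdir, "fake.h5")
with open(fake_path, "wb") as f:
    f.write(HDF5_SIGNATURE + b"\x00" * 8)

empty_ok = has_hdf5_signature(empty_path)
fake_ok = has_hdf5_signature(fake_path)
```

If the check fails for data.h5, the file itself is the likely problem rather than the way it is being reopened.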