Errors when loading .csv file using pandas in Python

I have a large CSV file, approximately 6 GB, and it's taking a long time to load into Python. I get the following error:
import pandas as pd
df = pd.read_csv('nyc311.csv', low_memory=False)
Python(1284,0x7fffa37773c0) malloc: *** mach_vm_map(size=18446744071562067968) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 851, in pandas.parser.TextReader.read (pandas/parser.c:10438)
File "pandas/parser.pyx", line 939, in pandas.parser.TextReader._read_rows (pandas/parser.c:11607)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
I don't think I'm understanding the error message; the last line seems to suggest that the file is too big to load? I also tried the low_memory=False option, but that did not work either.
I'm not sure what "can't allocate region" means. Could it be that the header includes 'region' and pandas cannot locate the column underneath it?

The out-of-memory error is about RAM: the combined memory footprint of all in-memory objects is not smaller than the RAM available. There is no other explanation for it; you can see this directly from the line malloc: *** mach_vm_map(size=18446744071562067968) failed.
Try reading the file in chunks:
df = pd.read_csv('nyc311.csv', chunksize=5000, lineterminator='\r')
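Note that with chunksize, read_csv returns an iterator of DataFrames rather than a single DataFrame, so each chunk has to be processed in turn. A minimal sketch (the row-counting body is only a placeholder for whatever per-chunk processing you need):
import pandas as pd

# Only one 5000-row chunk is held in memory at a time.
total_rows = 0
for chunk in pd.read_csv('nyc311.csv', chunksize=5000):
    total_rows += len(chunk)  # replace with your own per-chunk processing
print(total_rows)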
Or, if reading this CSV is only one part of your program and other DataFrames were created earlier, try deleting the ones that are no longer in use:
import gc
del old_df         # delete DataFrames that are no longer in use
gc.collect()       # run a garbage collection pass
del gc.garbage[:]  # clear the list of uncollectable objects

Related

Read h5 files through a NAS

I'm reading .h5 files remotely. Previously they were on a server, but more recently I had to store some on a NAS device. When I try to read the ones on the NAS, for some of them I get the following error:
HDF5ExtError: HDF5 error back trace
File "C:\ci\hdf5_1593121603621\work\src\H5Dio.c", line 199, in H5Dread
can't read data
File "C:\ci\hdf5_1593121603621\work\src\H5Dio.c", line 603, in H5D__read
can't read data
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 621, in H5D__contig_read
contiguous read failed
File "C:\ci\hdf5_1593121603621\work\src\H5Dselect.c", line 283, in H5D__select_read
read error
File "C:\ci\hdf5_1593121603621\work\src\H5Dselect.c", line 218, in H5D__select_io
read error
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 956, in H5D__contig_readvv
can't perform vectorized sieve buffer read
File "C:\ci\hdf5_1593121603621\work\src\H5VM.c", line 1500, in H5VM_opvv
can't perform operation
File "C:\ci\hdf5_1593121603621\work\src\H5Dcontig.c", line 753, in H5D__contig_readvv_sieve_cb
block read failed
File "C:\ci\hdf5_1593121603621\work\src\H5Fio.c", line 118, in H5F_block_read
read through page buffer failed
File "C:\ci\hdf5_1593121603621\work\src\H5PB.c", line 732, in H5PB_read
read through metadata accumulator failed
File "C:\ci\hdf5_1593121603621\work\src\H5Faccum.c", line 260, in H5F__accum_read
driver read request failed
File "C:\ci\hdf5_1593121603621\work\src\H5FDint.c", line 205, in H5FD_read
driver read request failed
File "C:\ci\hdf5_1593121603621\work\src\H5FDsec2.c", line 725, in H5FD_sec2_read
file read failed: time = Tue May 10 11:37:06 2022
, filename = 'Y:/myFolder\myFile.h5', file descriptor = 4, errno = 22, error message = 'Invalid argument', buf = 0000020F03F14040, total read size = 16560000, bytes this sub-read = 16560000, bytes actually read = 18446744073709551615, offset = 480252764
End of HDF5 error back trace
Problems reading the array data.
I don't really understand the error. It always happens for the same files, yet I can open the file and read the data myself with HDFView, and if I put the file on the server I can read it without problems with the same lines of code (the path is correct in both cases):
hdf5store = pd.HDFStore(myPath[fich])
datacopy = hdf5store['my_data']
By the way, the error occurs on the second line of code. Right now I don't have access to the server and I can't copy the file locally because I don't have enough space. Does anyone know how to fix this so I can keep working through the NAS?
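One workaround worth trying (an assumption on my part, not something confirmed for this setup) is to have PyTables read the whole file into memory via the H5FD_CORE driver, which avoids the many small reads over the network share; pd.HDFStore forwards extra keyword arguments to tables.open_file:
import pandas as pd

# Sketch of a possible workaround: driver='H5FD_CORE' is passed through to
# tables.open_file and loads the entire file into memory before reading datasets.
with pd.HDFStore('Y:/myFolder/myFile.h5', mode='r', driver='H5FD_CORE') as hdf5store:
    datacopy = hdf5store['my_data']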

Python Pandas: Error tokenizing data. C error: EOF inside string starting when reading 1GB CSV file

I'm reading a 1 GB CSV file in chunks of 10,000 rows. The file has 1106012 rows and 171 columns. Other, smaller files don't show any error and finish successfully, but when I read this 1 GB file it fails every time at exactly line 1106011, which is the second-to-last line of the file. I can remove that line manually, but that is not a solution because I have hundreds of other files of the same size and I cannot fix all the lines by hand. Can anyone help me with this, please?
def extract_csv_to_sql(input_file_name, header_row, size_of_chunk, eachRow):
    df = pd.read_csv(input_file_name,
                     header=None,
                     nrows=size_of_chunk,
                     skiprows=eachRow,
                     low_memory=False,
                     error_bad_lines=False,
                     sep=',')
    # engine='python'
    # quoting=csv.QUOTE_NONE
    # encoding='utf-8'
    df.columns = header_row
    df = df.drop_duplicates(keep='first')
    df = df.apply(lambda x: x.astype(str).str.lower())
    return df
I'm then calling this function within a loop, and it works just fine:
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
I read "Pandas ParserError EOF character when reading multiple csv files to HDF5", "read_csv() & EOF character in string cause parsing issue", https://github.com/pandas-dev/pandas/issues/11654, and many more, and tried adding read_csv parameters such as
engine='python'
quoting=csv.QUOTE_NONE  # hangs the Python shell, I don't know why
encoding='utf-8'
but none of it worked; it still throws the following error:
Error:
Traceback (most recent call last):
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 115, in <module>
huge_chunk_return = extract_csv_to_sql(huge_input_filename, huge_header_row, the_size_of_chunk_H, each_Row_H)
File "C:\Users\WCan\Desktop\wcan_new_python\pandas_test_3.py", line 24, in extract_csv_to_sql
sep=',')
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "C:\Users\WCan\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 893, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10885)
File "pandas\_libs\parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas\_libs\parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas\_libs\parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 1106011
>>>
If you are under Linux, try removing all non-printable characters, then load your file again after this operation:
tr -dc '[:print:]\n' < file > newfile
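If you are not on Linux, a rough Python equivalent of that cleanup (my own sketch, not part of the original answer; the file names are placeholders) keeps printable characters and newlines and drops everything else:
# Strip non-printable characters, roughly like tr -dc '[:print:]\n'.
with open('file', 'r', errors='ignore') as src, open('newfile', 'w') as dst:
    for line in src:
        dst.write(''.join(ch for ch in line if ch.isprintable() or ch == '\n'))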
I looked into many solutions; some of them worked but affected the calculations. I used this one, which skips the lines that cause the error:
pd.read_csv(file, engine='python', error_bad_lines=False)
# engine='python' provides a better output
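Note that in recent pandas versions (1.3 and later) error_bad_lines is deprecated and has since been removed; the equivalent call uses on_bad_lines instead:
pd.read_csv(file, engine='python', on_bad_lines='skip')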

How can I fetch data from a database chunk by chunk using Python PETL, pygramETL, or pandas?

Is there any way to fetch data from the database chunkwise? I have around 30 million rows in my database, and loading them all at once causes heavy memory usage. I tried this with pandas version 0.17.1:
for sd in psql.read_sql(sql, myconn, chunksize=100):
    print sd
but it throws:
/usr/bin/python2.7 /home/subin/PythonIDE/workspace/python/pygram.py
Traceback (most recent call last):
File "/home/subin/PythonIDE/workspace/python/pygram.py", line 20, in <module>
for sd in psql.read_sql(sql,myconn,chunksize=100):
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 1565, in _query_iterator
parse_dates=parse_dates)
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 137, in _wrap_result
coerce_float=coerce_float)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 969, in from_records
coerce_float=coerce_float)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 5279, in _to_arrays
dtype=dtype)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 5357, in _list_to_arrays
content = list(lib.to_object_array_tuples(data).T)
TypeError: Argument 'rows' has incorrect type (expected list, got tuple)
please help me
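For what it's worth, a generic way to stream rows chunk by chunk without read_sql is the DB-API cursor's fetchmany. This is only a sketch under the assumption that myconn is a standard DB-API connection and sql is the query from above; it is not a confirmed fix for the traceback:
import pandas as pd

cursor = myconn.cursor()
cursor.execute(sql)
columns = [col[0] for col in cursor.description]

while True:
    rows = cursor.fetchmany(100)  # same chunk size as above
    if not rows:
        break
    sd = pd.DataFrame(list(rows), columns=columns)
    print sd  # process each chunk here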

pandas HDFStore - how to reopen?

I created a file by using:
store = pd.HDFStore('/home/.../data.h5')
and stored some tables using:
store['firstSet'] = df1
store.close()
I closed down python and reopened in a fresh environment.
How do I reopen this file?
When I go:
store = pd.HDFStore('/home/.../data.h5')
I get the following error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 207, in __init__
self.open(mode=mode, warn=False)
File "/misc/apps/linux/python-2.6.1/lib/python2.6/site-packages/pandas-0.10.0-py2.6-linux-x86_64.egg/pandas/io/pytables.py", line 302, in open
self.handle = _tables().openFile(self.path, self.mode)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 230, in openFile
return File(filename, mode, title, rootUEP, filters, **kwargs)
File "/apps/linux/python-2.6.1/lib/python2.6/site-packages/tables/file.py", line 495, in __init__
self._g_new(filename, mode, **params)
File "hdf5Extension.pyx", line 317, in tables.hdf5Extension.File._g_new (tables/hdf5Extension.c:3039)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5F.c", line 1582, in H5Fopen
unable to open file
File "H5F.c", line 1373, in H5F_open
unable to read superblock
File "H5Fsuper.c", line 334, in H5F_super_read
unable to find file signature
File "H5Fsuper.c", line 155, in H5F_locate_signature
unable to find a valid file signature
End of HDF5 error back trace
Unable to open/create file '/home/.../data.h5'
What am I doing wrong here? Thank you.
In my hands, the following approach works best:
df = pd.DataFrame(...)
# write
with pd.HDFStore('test.h5', mode='w') as store:
    store.append('df', df, data_columns=df.columns, format='table')
# read
with pd.HDFStore('test.h5', mode='r') as newstore:
    df_restored = newstore.select('df')
You could instead try:
store = pd.io.pytables.HDFStore('/home/.../data.h5')
df1 = store['firstSet']
or use pd.read_hdf directly:
df1 = pd.read_hdf('/home/.../data.h5', 'firstSet')
Either way, you should have pandas 0.12.0 or higher...
I had the same problem and finally fixed it by installing the pytables module (alongside the pandas module I was using):
conda install pytables
which got me numexpr-2.4.3 and pytables-3.2.0. After that it worked. I am using pandas 0.16.2 under Python 2.7.9.

Pickle: Reading a dictionary, EOFError

I recently found out about pickle, which is amazing. But it errors on me when used for my actual script, testing it with a one item dictionary it worked fine. My real script is thousands of lines of code storing various objects within maya into it. I do not know if it has anything to do with the size, I have read around a lot of threads here but none are specific to my error.
I have tried writing with all priorities. No luck.
This is my output code:
output = open('locatorsDump.pkl', 'wb')
pickle.dump(l.locators, output, -1)
output.close()
This is my read code:
jntdump = open('locatorsDump.pkl', 'rb')
test = pickle.load(jntdump)
jntdump.close()
This is the error:
# Error: Error in maya.utils._guiExceptHook:
# File "C:\Program Files\Autodesk\Maya2011\Python\lib\site-packages\pymel-1.0.0-py2.6.egg\maya\utils.py", line 277, in formatGuiException
# exceptionMsg = excLines[-1].split(':',1)[1].strip()
# IndexError: list index out of range
#
# Original exception was:
# Traceback (most recent call last):
# File "<maya console>", line 3, in <module>
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 1370, in load
# return Unpickler(file).load()
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 858, in load
# dispatch[key](self)
# File "C:\Program Files\Autodesk\Maya2011\bin\python26.zip\pickle.py", line 880, in load_eof
# raise EOFError
# EOFError #
Try using pickle.dumps() and pickle.loads() as a test.
If you don't receive the same error, you know the problem is related to the file write.
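A minimal version of that in-memory round-trip test (using a stand-in dictionary in place of l.locators) might look like this:
import pickle

locators = {'locator1': [0.0, 1.0, 2.0]}  # stand-in for l.locators

data = pickle.dumps(locators, -1)  # serialize to bytes with the same protocol as the original dump
restored = pickle.loads(data)
assert restored == locators  # if this passes, the problem is likely in the file write or read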
