PyTables writing error - python

I am creating and filling a PyTables CArray the following way:
import tables as tb

# a, b are scipy.sparse.csr_matrix instances
l = a.shape[0]
n = b.shape[1]   # a.dot(b) has shape (l, n)
bl = 2048        # block width

f = tb.open_file('../data/pickle/dot2.h5', 'w')
filters = tb.Filters(complevel=1, complib='blosc')
out = f.create_carray(f.root, 'out', tb.Atom.from_dtype(a.dtype),
                      shape=(l, n), filters=filters)
for i in range(0, l, bl):
    out[:, i:min(i + bl, l)] = (a.dot(b[:, i:min(i + bl, l)])).toarray()
The script was running fine for nearly two days (I estimated that it would need at least 4 more days).
However, it suddenly failed with this error stack trace:
File "prepare_data.py", line 168, in _tables_dot
out[:,i:min(i+bl, l)] = (a.dot(b[:,i:min(i+bl, l)])).toarray()
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 719, in __setitem__
self._write_slice(startl, stopl, stepl, shape, nparr)
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 809, in _write_slice
self._g_write_slice(startl, stepl, countl, nparr)
File "hdf5extension.pyx", line 1678, in tables.hdf5extension.Array._g_write_slice (tables/hdf5extension.c:16287)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5Dio.c", line 266, in H5Dwrite
can't write data
File "../../../src/H5Dio.c", line 671, in H5D_write
can't write data
File "../../../src/H5Dchunk.c", line 1840, in H5D_chunk_write
error looking up chunk address
File "../../../src/H5Dchunk.c", line 2299, in H5D_chunk_lookup
can't query chunk address
File "../../../src/H5Dbtree.c", line 998, in H5D_btree_idx_get_addr
can't get chunk info
File "../../../src/H5B.c", line 362, in H5B_find
can't lookup key in subtree
File "../../../src/H5B.c", line 340, in H5B_find
unable to load B-tree node
File "../../../src/H5AC.c", line 1322, in H5AC_protect
H5C_protect() failed.
File "../../../src/H5C.c", line 3567, in H5C_protect
can't load entry
File "../../../src/H5C.c", line 7957, in H5C_load_entry
unable to load entry
File "../../../src/H5Bcache.c", line 143, in H5B_load
wrong B-tree signature
End of HDF5 error back trace
Internal error modifying the elements (H5ARRAYwrite_records returned errorcode -6)
I am really clueless about what the problem is, as the script was running fine for about a quarter of the dataset. Disk space is available.

Related

pd.read_parquet() works fine in PyCharm but not on Ubuntu (KeyError: 24)

I am making a program that reads all my S3 buckets. As I have a lot of them, I want to run it on an EC2 instance. My program works fine in PyCharm, but as soon as I try to run it on my Ubuntu instance I get this error:
File "/home/ubuntu/DataRecap/main.py", line 72, in <module>
create_table()
File "/home/ubuntu/DataRecap/main.py", line 43, in create_table
small_column = get_column()
File "/home/ubuntu/DataRecap/main.py", line 32, in get_column
df = pd.read_parquet(buffer)
File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 493, in read_parquet
return impl.read(
File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 347, in read
result = parquet_file.to_pandas(columns=columns, **kwargs)
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/api.py", line 751, in to_pandas
self.read_row_group_file(rg, columns, categories, index,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/api.py", line 361, in read_row_group_file
core.read_row_group(
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 608, in read_row_group
read_row_group_arrays(file, rg, columns, categories, schema_helper,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 580, in read_row_group_arrays
read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/core.py", line 466, in read_col
dic2 = convert(dic2, se)
File "/home/ubuntu/.local/lib/python3.10/site-packages/fastparquet/converted_types.py", line 249, in convert
parquet_thrift.ConvertedType._VALUES_TO_NAMES[ctype]) # pylint:disable=protected-access
KeyError: 24
I have no idea why it does not work. Here is my code:
import io
import boto3
import pandas as pd

s3 = boto3.resource('s3')  # assumed setup; bucket and parquet_name are defined elsewhere
buffer = io.BytesIO()
obj = s3.Object(bucket, parquet_name)  # renamed from `object` to avoid shadowing the built-in
obj.download_fileobj(buffer)
df1 = pd.read_parquet(buffer)
Any ideas?
Thank you very much in advance.

Error tokenizing data. C error: out of memory - python

I am trying to read 4 .txt files delimited by |.
As one of them is over 1 GB, I found the 'chunk' method to read them, but I am getting "Error tokenizing data. C error: out of memory" on the line df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI")).
Does anyone know how I can solve this problem?
Below is my code:
import os
import re

import pandas as pd
from termcolor import colored  # assumed source of colored()

# CoreMurexFilesLoc is defined elsewhere as the directory containing the files

def findmefile(directory, containsInFilename):
    entity_filenames = {}
    for file in os.listdir(directory):
        if containsInFilename in file:
            if file[:5] == "Trade":
                entity_filenames["MHI"] = file
            else:
                entity_filenames[re.findall("(.*?)_", file)[0]] = file
    return entity_filenames

# Get the core Murex file names
mhi_tradeFiles = findmefile(CoreMurexFilesLoc, "Trade")
mhi_tradeCashFiles = findmefile(CoreMurexFilesLoc, "TradeCash_")
mheu_tradeFiles = findmefile(CoreMurexFilesLoc, "MHEU")
mheu_tradeCashFiles = findmefile(CoreMurexFilesLoc, "MHEU_TradeCash")

# Read the csv in chunks
size = 10**2

def chunk_read(fileName, entity):
    mylist = []  # keep the list local so chunks from earlier files are not concatenated again
    for chunk in pd.read_csv(
        CoreMurexFilesLoc + "\\" + fileName[entity],
        delimiter="|",
        low_memory=False,
        chunksize=size,
    ):
        mylist.append(chunk)
    return mylist

df_trade_mhi = pd.concat(chunk_read(mhi_tradeFiles, "MHI"))
df_trade_mheu = pd.concat(chunk_read(mheu_tradeFiles, "MHEU"))
df_tradeCash_mheu = pd.concat(chunk_read(mheu_tradeCashFiles, "MHEU"))
df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))

df_trades = pd.concat(
    [df_trade_mheu, df_trade_mhi, df_tradeCash_mheu, df_tradeCash_mhi]
)
del df_trade_mhi
del df_tradeCash_mhi
del df_trade_mheu
del df_tradeCash_mheu

# Drop any blank fields and duplicates
nan_value = float("NaN")
df_trades.replace("", nan_value, inplace=True)
df_trades.dropna(subset=["MurexCounterpartyRef"], inplace=True)
df_trades.drop_duplicates(subset=["MurexCounterpartyRef"], inplace=True)
counterpartiesList = df_trades["MurexCounterpartyRef"].tolist()

print(colored('All Core Murex trade and tradeCash data loaded.', "green"))
Error:
Traceback (most recent call last):
File "h:\DESKTOP\test_check\check_securityPrices.py", line 52, in <module>
df_tradeCash_mhi = pd.concat(chunk_read(mhi_tradeCashFiles, "MHI"))
File "h:\DESKTOP\test_check\check_securityPrices.py", line 39, in chunk_read
for chunk in pd.read_csv(
File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1024, in __next__
return self.get_chunk()
File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1074, in get_chunk
return self.read(nrows=size)
File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\readers.py", line 1047, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Users\MIRABR\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 228, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 857, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1925, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
I think the problem is obvious: you're running out of memory because you're trying to load so much data into memory at once and then process it.
You need to either:
get a machine with more memory, or
re-architect the solution to use a pipelined approach, with a generator or coroutine pipeline, so the processing happens stepwise over your data.
The problem with the first approach is that it won't scale indefinitely and is expensive. The second approach is the right way to do it, but it needs more coding (a sketch follows below).
As a good reference on generator/coroutine pipeline approaches, check out any of the PyCon talks by David Beazley.
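For illustration, here is a minimal sketch of the generator-pipeline idea, assuming the only thing needed downstream is the set of unique MurexCounterpartyRef values. The "|" delimiter, the chunked read and the column name come from the question; the helper names and list_of_file_paths are hypothetical.
import pandas as pd

def iter_chunks(path, chunksize=100_000):
    # Yield one DataFrame chunk at a time instead of accumulating them in a list.
    for chunk in pd.read_csv(path, delimiter="|", chunksize=chunksize, low_memory=False):
        yield chunk

def unique_counterparties(paths):
    # Reduce each chunk to the single column needed and keep only distinct, non-blank values.
    seen = set()
    for path in paths:
        for chunk in iter_chunks(path):
            refs = chunk["MurexCounterpartyRef"].dropna()
            seen.update(refs[refs != ""])
    return seen

counterpartiesList = list(unique_counterparties(list_of_file_paths))  # list_of_file_paths is hypothetical
This keeps at most one chunk in memory at a time instead of four fully concatenated DataFrames.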

Conversion from big csv to parquet using python error

I have a csv file that has approximately 200+ cols and 1mil+ rows. When converting it from csv to parquet with Python, I get an error. Here is my code:
import argparse

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = 'bigcut.csv'
chunksize = 100_000
parquet_file = 'output.parquet'

parser = argparse.ArgumentParser(description='Process Arguments')
parser.add_argument("--fname", action="store", default="", help="specify <run/update>")
args = parser.parse_args()
csv_file = args.fname  # overrides the default above

csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',',
                         chunksize=chunksize, low_memory=False)
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Infer the Parquet schema from the first chunk and open the writer once.
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
parquet_writer.close()
When I ran it, it produced the following error:
File "pyconv.py", line 25, in <module>
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 387, in dataframe_to_arrays
convert_types))
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 376, in convert_column
raise e
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
return pa.array(col, type=ty, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)", 'Conversion failed for column agent_number__c with type float64')
I am new to pandas/pyarrow/Python; any recommendation on what I should do next to debug this would be appreciated.
'utf-32-le' codec can't decode bytes in position 0-3
It looks like a library is trying to decode your data as utf-32-le, whereas you read the csv data as utf-8.
So you'll somehow have to tell that reader (pyarrow.lib) to read as utf-8 (I don't know Python/Parquet so I can't provide the exact code to do this).
The csv has around 3 million records. I managed to catch one potential problem.
One of the columns has a string/text data type. Most of its values are numeric, but some are mixed with text; for example, many of them are 1000, 230, 400, etc., but a few were entered as 5k, 100k, 29k.
So somehow the code did not like that and tried to treat the whole column as a number/int.
Can you advise?
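For what it's worth, a hedged sketch (not from the answer above): if the mixed numeric/text column is the culprit, one workaround is to force that column to a string dtype when reading each chunk, so the Arrow schema inferred from the first chunk stays valid for later chunks. The column name agent_number__c is taken from the error message; everything else is an assumption.
csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',',
                         chunksize=chunksize, low_memory=False,
                         dtype={'agent_number__c': str})  # keep the mixed column as text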

I/O Error while saving Excel file - Python

I'm using Python to open an existing Excel file, do some formatting, and then save and close the file. My code works fine when the file size is small, but when the Excel file is big (approx. 40 MB) I get a serialization I/O error, and I'm sure it is due either to a memory problem or to my code. Kindly help.
System Config:
RAM - 8 GB
32-bit operating system
Windows 7
Code:
import numpy as np
from openpyxl import load_workbook
from openpyxl.styles import colors, Font

dest_loc = '/Users/abdulr06/Documents/Python Scripts/'
np.seterr(divide='ignore', invalid='ignore')
SRC = 'TSYS'
YM1 = '201707'
dest_file = dest_loc + SRC + '_' + '' + YM1 + '.xlsx'
sheetname = [SRC + '' + ' GL-Recon']

# Following code is common for the rest of the source systems
wb = load_workbook(dest_file)
fmtB = Font(color=colors.BLUE)
fmtR = Font(color=colors.RED)
for i in range(len(sheetname)):
    sheet1 = wb.get_sheet_by_name(sheetname[i])
    print(sheetname[i])
    last_record = sheet1.max_row + 1
    for m in range(2, last_record):
        if -30 <= sheet1.cell(row=m, column=5).value <= 30:
            ft = sheet1.cell(row=m, column=5)
            ft.font = fmtB
            ft.number_format = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)'
            ft1 = sheet1.cell(row=m, column=6)
            ft1.number_format = '0.00%'
        else:
            ft = sheet1.cell(row=m, column=5)
            ft.font = fmtR
            ft.number_format = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)'
            ft1 = sheet1.cell(row=m, column=6)
            ft1.number_format = '0.00%'
wb.save(filename=dest_file)
Exception:
Traceback (most recent call last):
File "<ipython-input-17-fc16d9a46046>", line 6, in <module>
wb.save(filename=dest_file)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\workbook\workbook.py", line 263, in save
save_workbook(self, filename)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\writer\excel.py", line 239, in save_workbook
writer.save(filename, as_template=as_template)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\writer\excel.py", line 222, in save
self.write_data(archive, as_template=as_template)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\writer\excel.py", line 80, in write_data
self._write_worksheets(archive)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\writer\excel.py", line 163, in _write_worksheets
xml = sheet._write(self.workbook.shared_strings)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 776, in _write
return write_worksheet(self, shared_strings)
File "C:\Users\abdulr06\AppData\Local\Continuum\Anaconda3\lib\site-packages\openpyxl\writer\worksheet.py", line 263, in write_worksheet
xf.write(worksheet.page_breaks.to_tree())
File "serializer.pxi", line 1016, in lxml.etree._FileWriterElement.__exit__ (src\lxml\lxml.etree.c:141944)
File "serializer.pxi", line 904, in lxml.etree._IncrementalFileWriter._write_end_element (src\lxml\lxml.etree.c:140137)
File "serializer.pxi", line 999, in lxml.etree._IncrementalFileWriter._handle_error (src\lxml\lxml.etree.c:141630)
File "serializer.pxi", line 195, in lxml.etree._raiseSerialisationError (src\lxml\lxml.etree.c:131006)
SerialisationError: IO_WRITE
Why do you allocate a font on each loop iteration?
fmt = Font(color=colors.BLUE)
Or red. Create the two fonts, red and blue, once, and then reuse them; each time you allocate a Font you use more memory.
Optimise your code first. Less code -> fewer errors. For example:
mycell = sheet1.cell(row=m, column=5)
if -30 <= mycell.value <= 30:
    mycell.font = fmtB  # the blue font created once, outside the loop
This should ensure that you do not have the issue again (hopefully).
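For reference, a minimal sketch of that suggestion applied to the question's inner loop, reusing the existing fmtB/fmtR Font objects and looking each cell up only once (the number-format strings are copied from the question):
for m in range(2, last_record):
    cell_val = sheet1.cell(row=m, column=5)
    cell_pct = sheet1.cell(row=m, column=6)
    # pick the blue or red font created once above, instead of creating new Font objects
    cell_val.font = fmtB if -30 <= cell_val.value <= 30 else fmtR
    cell_val.number_format = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)'
    cell_pct.number_format = '0.00%'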

TypeError: not JSON serializable Py2neo Batch submit

I am creating a huge graph database with over 1.4 million nodes and 160 million relationships. My code looks as follows:
from py2neo import neo4j

# assumed connection; the question does not show how graph_db is created
graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# first we create all the nodes
batch = neo4j.WriteBatch(graph_db)
nodedata = []
for index, i in enumerate(words):  # words is predefined
    batch.create({"term": i})
    if index % 5000 == 0:  # so as not to exceed the batch restrictions
        results = batch.submit()
        for x in results:
            nodedata.append(x)
        batch = neo4j.WriteBatch(graph_db)
results = batch.submit()
for x in results:
    nodedata.append(x)

# nodedata contains all the node instances now
# time to create relationships
batch = neo4j.WriteBatch(graph_db)
for iindex, i in enumerate(weightdata):  # weightdata is predefined
    # there is a different way how I decide the indexes of nodedata, but as an example I use iindex and -iindex
    batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))
    if iindex % 5000 == 0:  # again batch constraints
        batch.submit()  # this is the line that shows the error
        batch = neo4j.WriteBatch(graph_db)
batch.submit()
I am getting the following error:
Traceback (most recent call last):
File "test.py", line 53, in <module>
batch.submit()
File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2116, in submit
for response in self._submit()
File "/usr/lib/python2.6/site-packages/py2neo/neo4j.py", line 2085, in _submit
for id_, request in enumerate(self.requests)
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 427, in _send
return self._client().send(request)
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 351, in send
rs = self._send_request(request.method, request.uri, request.body, request.$
File "/usr/lib/python2.6/site-packages/py2neo/rest.py", line 326, in _send_re$
data = json.dumps(data, separators=(",", ":"))
File "/usr/lib64/python2.6/json/__init__.py", line 237, in dumps
**kw).encode(obj)
File "/usr/lib64/python2.6/json/encoder.py", line 367, in encode
chunks = list(self.iterencode(o))
File "/usr/lib64/python2.6/json/encoder.py", line 306, in _iterencode
for chunk in self._iterencode_list(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 204, in _iterencode_list
for chunk in self._iterencode(value, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 309, in _iterencode
for chunk in self._iterencode_dict(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 275, in _iterencode_dict
for chunk in self._iterencode(value, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 317, in _iterencode
for chunk in self._iterencode_default(o, markers):
File "/usr/lib64/python2.6/json/encoder.py", line 323, in _iterencode_default
newobj = self.default(o)
File "/usr/lib64/python2.6/json/encoder.py", line 344, in default
raise TypeError(repr(o) + " is not JSON serializable")
TypeError: 3448 is not JSON serializable
Could anybody please explain what exactly is happening here and how I can overcome it? Any kind of help would be appreciated. Thanks in advance! :)
It's hard to tell without being able to run your code with the same data set, but this is likely caused by the type of the items in weightdata.
Step through your code or print the data type as you go to determine what the type of i is within the {"weight": i} portion of the relationship descriptor. You may find that this is not an int - which would be required for JSON number serialisation. If this theory is correct, you will need to find a way to cast or otherwise convert that property value into an int before using it in a property set.
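For illustration only, assuming i turns out to be a NumPy integer (e.g. numpy.int64), which Python's json module refuses to serialise, the cast would look like this:
batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": int(i)}))  # plain int is JSON-serialisable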
I've never used py2neo, but looking at the documentation, this:
batch.create((nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))
is missing the rel() part:
batch.create(rel(nodedata[iindex], "rel", nodedata[-iindex], {"weight": i}))
