Memory leak logs indicate shapely.wkt.loads - python

I have code like the following:
shapes_list = [shapely.wkt.loads(entry['shape'] if type(entry) is dict else entry) for entry in entries]
The problem is that tracemalloc points to this line as the source of a memory leak. Has anyone encountered this?
EDIT logs:
*** Trace for largest memory block - (102367 blocks, 12395.703125 Kb) ***
File "file.py", line 27
masks = obj.generate_masks()
File "/code/src/project/processing/file.py", line 514
image_objects_dict=image_objects_dict)
File "/code/src/project/processing/mask.py", line 322
shapes_list = [shapely.wkt.loads(entry['shape'] if type(entry) is dict else entry) for entry in entries]
File "/code/src/project/processing/mask.py", line 322
shapes_list = [shapely.wkt.loads(entry['shape'] if type(entry) is dict else entry) for entry in entries]
File "/code/venv/lib/python3.7/site-packages/shapely/wkt.py", line 22
return geos.WKTReader(geos.lgeos).read(data)
File "/code/venv/lib/python3.7/site-packages/shapely/geos.py", line 337
return geom_factory(geom)
File "/code/venv/lib/python3.7/site-packages/shapely/geometry/base.py", line 84
ob._set_geom(g)
File "/code/venv/lib/python3.7/site-packages/shapely/geometry/base.py", line 241
self._empty()
File "/code/venv/lib/python3.7/site-packages/shapely/geometry/base.py", line 199
self._is_empty = True
File "/code/venv/lib/python3.7/site-packages/shapely/geometry/base.py", line 250
super().__setattr__(name, value)
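For what it's worth, tracemalloc reporting this line as the largest block does not by itself prove a leak; it may simply be where the (legitimately retained) geometries were allocated. One way to distinguish the two is to compare snapshots across repeated runs of the same step. Below is a minimal sketch of that approach, where process_entries is a hypothetical stand-in for the code that builds shapes_list:
import tracemalloc
import shapely.wkt

tracemalloc.start(25)  # keep enough frames so the shapely call shows up in tracebacks

def process_entries(entries):
    # hypothetical stand-in for the real step that builds shapes_list
    return [shapely.wkt.loads(e['shape'] if isinstance(e, dict) else e) for e in entries]

before = tracemalloc.take_snapshot()
for _ in range(10):
    shapes_list = process_entries(entries)  # `entries` comes from your own pipeline
after = tracemalloc.take_snapshot()

# if this line's total keeps growing between snapshots, the geometries are being retained somewhere
for stat in after.compare_to(before, 'traceback')[:5]:
    print(stat)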

from_personal_row() takes 1 positional argument but 2 were given

The code below (and linked in full here) attempts to read from a .csv uploaded to Google Sheets; however, I cannot get past the following error:
Traceback (most recent call last):
File "import_report.py", line 211, in <module>
main()
File "import_report.py", line 163, in main
all_trackings.extend(objects_to_sheet.download_from_sheet(from_personal_row, sheet_id, tab_name))
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\tenacity\__init__.py", line 329, in wrapped_f
return self.call(f, *args, **kw)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\tenacity\__init__.py", line 409, in call
do = self.iter(retry_state=retry_state)
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\tenacity\__init__.py", line 368, in iter
raise retry_exc.reraise()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\tenacity\__init__.py", line 186, in reraise
raise self.last_attempt.result()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 432, in result
return self.__get_result()
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
raise self._exception
File "C:\Users\xxx\AppData\Local\Programs\Python\Python38\lib\site-packages\tenacity\__init__.py", line 412, in call
result = fn(*args, **kwargs)
File "C:\Users\xxx\Documents\GitHub\order-tracking\lib\objects_to_sheet.py", line 26, in download_from_sheet
return [from_row_fn(header, value) for value in values]
File "C:\Users\xxx\Documents\GitHub\order-tracking\lib\objects_to_sheet.py", line 26, in <listcomp>
return [from_row_fn(header, value) for value in values]
TypeError: from_personal_row() takes 1 positional argument but 2 were given
I've read a lot of threads regarding similar errors other posters have encountered, but I can't figure out how to apply the advice here.
Google Sheet CSV:
Code:
def from_personal_row(row: Dict[str, str]) -> Optional[Tracking]:
    tracking_col = row['Carrier Name & Tracking Number']
    if not tracking_col:
        return None
    tracking = tracking_col.split('(')[1].replace(')', '')
    orders = {row['Order ID'].upper()}
    price_str = str(row['Subtotal']).replace(',', '').replace('$', '').replace('N/A', '0.0')
    price = float(price_str) if price_str else 0.0
    to_email = row['Ordering Customer Email']
    ship_date = get_ship_date(str(row["Shipment Date"]))
    street_1 = row['Shipping Address Street 1']
    city = row['Shipping Address City']
    state = row['Shipping Address State']
    address = f"{street_1} {city}, {state}"
    group, reconcile = get_group(address)
    if group is None:
        return None
    tracked_cost = 0.0
    items = price_str
    merchant = 'Amazon'
    return Tracking(
        tracking,
        group,
        orders,
        price,
        to_email,
        ship_date=ship_date,
        tracked_cost=tracked_cost,
        items=items,
        merchant=merchant,
        reconcile=reconcile)
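The traceback shows download_from_sheet calling from_row_fn(header, value) with two positional arguments, while from_personal_row accepts only the row dict. A minimal sketch of one way to bridge that gap, assuming header is the list of column names and value is one row of cell values (I have not checked the linked repository, so treat the adapter as hypothetical):
from typing import List, Optional

def from_personal_row_adapter(header: List[str], values: List[str]) -> Optional[Tracking]:
    # pair the header row with the value row to build the Dict[str, str]
    # that from_personal_row already expects, then delegate to it
    return from_personal_row(dict(zip(header, values)))

# then pass the adapter to download_from_sheet instead of the original function:
# all_trackings.extend(objects_to_sheet.download_from_sheet(from_personal_row_adapter, sheet_id, tab_name))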

pandas merge command failing in parallel loop - "ValueError: buffer source array is read-only"

I am writing a bootstrap algorithm using parallel loops and pandas. The problem I experience is that a merge command inside the parallel loop causes a "ValueError: buffer source array is read-only" error, but only if I use the full dataset to merge (120k lines). Any subset with fewer than 12k lines works just fine, so I infer it is not a problem with the syntax. What can I do?
Current pandas version is 0.24.2 and Cython is 0.29.7.
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
r = call_item()
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 567, in __call__
return self.func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "<ipython-input-72-cdb83eaf594c>", line 12, in bootstrap
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 6868, in merge
copy=copy, indicator=indicator, validate=validate)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 48, in merge
return op.get_result()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 546, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 756, in _get_join_info
right_indexer) = self._get_join_indexers()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 735, in _get_join_indexers
how=self.how)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1130, in _get_join_indexers
llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1662, in _factorize_keys
rlab = rizer.factorize(rk)
File "pandas/_libs/hashtable.pyx", line 111, in pandas._libs.hashtable.Int64Factorizer.factorize
File "stringsource", line 653, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 348, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-73-652c1db5701b> in <module>()
1 num_cores = multiprocessing.cpu_count()
----> 2 results = Parallel(n_jobs=num_cores, prefer='processes', verbose = 5)(delayed(bootstrap)() for i in range(n_trials))
3 #pd.DataFrame(results[0])
~/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
932
933 with self._backend.retrieval_context():
--> 934 self.retrieve()
935 # Make sure that we get a last message telling us we are done
936 elapsed_time = time.time() - self._start_time
~/.local/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())
~/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
ValueError: buffer source array is read-only
The code is:
def bootstrap():
    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by="0").reset_index(drop=True)
    df_resample_ids.columns = [ob_id_field]
    df_resample = pd.DataFrame(df_resample_ids.merge(df, on=ob_id_field))
    return df_resample

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose=5)(delayed(bootstrap)() for i in range(n_trials))
The algorithm creates resampled/replaced IDs from an ID variable and uses the merge command to create a new dataset based on the resampled IDs and the original dataset stored in df. If I cut out a subset of the original dataset (anywhere) leaving fewer than ~12k lines, the parallel loop finishes without an error and does what is expected.
As requested, below is a new snippet to re-create the data structures and mirror the principal approach I am currently working with:
import numpy as np
import pandas as pd
import sklearn as skl
import multiprocessing
from joblib import Parallel, delayed

df = pd.DataFrame(np.random.randn(200000, 24), columns=list('ABCDDEFGHIJKLMNOPQRSTUVW'))
df["ID"] = df.index.drop_duplicates().tolist()
ob_ids = df.index.drop_duplicates().tolist()

def bootstrap2():
    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by=0).reset_index(drop=True)
    df_resample_ids.columns = ['ID']
    df_resample = pd.DataFrame(df.merge(df_resample_ids, on='ID'))  # the original snippet had df1 here, which is not defined; df appears intended
    result = df_resample
    return result

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose=5)(delayed(bootstrap2)() for i in range(n_trials))
However, I notice that when the data is completely made up of np.random numbers, the loop goes through without an error. The dtypes of the original dataframe are:
start_rtg int64
end_rtg float64
days_diff float64
ultimate_customer_system_id int64
How can I avoid the read-only error?
Posting an answer to my own question, as I found that one of the variables was of int64 dtype. When I converted all variables to float64, the error disappeared. So it is an issue restricted to certain dtypes only...
cheers
stephan
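A minimal sketch of that workaround, assuming df is the original dataframe whose dtypes are listed above (convert before handing the frame to the parallel workers):
# cast every int64 column to float64 before the frame is shipped to the worker processes;
# the read-only buffer error was raised while factorizing an int64 merge key
int_cols = df.select_dtypes(include=['int64']).columns
df[int_cols] = df[int_cols].astype('float64')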

PyQt5 plt.show() not-blocking function

I have a problem when I have to display the results as a histogram with the plt.show() function: I need the call to be non-blocking, because I need the return value of the main function to make new calculations and show a new histogram.
(P.S. I'm using the PyCharm IDE with Python 3.6.7rc2 and matplotlib 3.0.0.)
This is the function in the code:
def plotHistStd(data_file, attribute, number, method):
    original = copy.deepcopy(data_file)
    modified = copy.deepcopy(data_file)
    tmp_arr = []
    tmp_arr2 = []
    if method == 1:
        for i in range(0, original.__len__()):
            if original[i] > 0:
                if number > 0:
                    tmp = log1p(original[i]+number)
                    tmp_arr.append(copy.deepcopy(original[i]))
                    tmp_arr2.append(copy.deepcopy(tmp.real))
                else:
                    tmp = np.log1p(original[i]+number)
                    tmp_arr.append(copy.deepcopy(original[i]))
                    tmp_arr2.append(copy.deepcopy(tmp.real))
            else:
                tmp = np.log1p(original[i]+number)
                tmp_arr.append(copy.deepcopy(original[i]))
                tmp_arr2.append(copy.deepcopy(tmp.real))
            modified[i] = tmp.real
    elif method == 2:
        for i in range(0, original.__len__()):
            if original[i] > 0:
                tmp2 = pow((original[i]), (1.0/number))
                tmp_arr.append(copy.deepcopy(original[i]))
                tmp_arr2.append(copy.deepcopy(tmp2.real))
            else:
                tmp2 = cm.exp(number*cm.log(original[i]))
                tmp_arr.append(copy.deepcopy(original[i]))
                tmp_arr2.append(copy.deepcopy(tmp2.real))
            modified[i] = tmp2.real
    v_max = max(tmp_arr)
    v_min = min(tmp_arr)
    v_max2 = max(tmp_arr2)
    v_min2 = min(tmp_arr2)
    v_max = max(v_max, v_max2)
    v_min = min(v_min, v_min2)
    fig = plt.figure()
    ax1 = fig.add_subplot(2, 1, 1)
    ax2 = fig.add_subplot(2, 1, 2)
    n, bins, patches = ax1.hist(original, color="blue", bins=15, range=(v_min, v_max))
    ax1.set_xlabel(attribute)
    ax1.set_ylabel('Frequency')
    n, bins, patches = ax2.hist(modified, color="red", bins=15, range=(v_min, v_max))
    ax2.set_xlabel(attribute)
    ax2.set_ylabel('Frequency')
    plt.subplots_adjust(hspace=0.3)
    plt.xticks(rotation=90)
    plt.show()  # I need a non-blocking version of this call
    return modified
The function is called from the main window, developed in another Python class file.
I tried to insert plt.ion() after plt.show() but it does not work.
P.S. I add the function in the main window that makes the call mentioned above:
def apply_stdandardization(self):
    if self.file[self.file.__len__() - 4:self.file.__len__()] != ".csv":
        QtWidgets.QMessageBox.warning(QtWidgets.QWidget(), "Alert Message", "No CSV file was selected!")
    else:
        number = self.lineEdit_2.text()
        index = self.comboBox_2.currentText()
        std_method = 0
        #data_file = pd.read_csv(self.file)
        number = float(number)
        try:
            number = float(number)
        except Exception:
            QtWidgets.QMessageBox.warning(QtWidgets.QWidget(), "Alert Message",
                                          "Insert only numeric value")
        if self.rad_log.isChecked():
            std_method = 1
        elif self.rad_exp.isChecked():
            std_method = 2
        self.new_df = std_function(str(index), number, std_method)
EDIT:
I tried to insert plt.ion() before plt.show(); now the program is not blocked, but the histogram window does not open correctly, and if I click on it the program crashes with the following errors:
Fatal Python error: PyEval_RestoreThread: NULL tstate
Thread 0x000016c0 (most recent call first):
File "C:\Python36\lib\threading.py", line 299 in wait
File "C:\Python36\lib\threading.py", line 551 in wait
File "C:\PyCharm\helpers\pydev\pydevd.py", line 128 in _on_run
File "C:\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 320 in run
File "C:\Python36\lib\threading.py", line 916 in _bootstrap_inner
File "C:\Python36\lib\threading.py", line 884 in _bootstrap
Thread 0x00000fb8 (most recent call first):
File "C:\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 382 in _on_run
File "C:\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 320 in run
File "C:\Python36\lib\threading.py", line 916 in _bootstrap_inner
File "C:\Python36\lib\threading.py", line 884 in _bootstrap
Thread 0x00000b28 (most recent call first):
File "C:\Python36\lib\threading.py", line 299 in wait
File "C:\Python36\lib\queue.py", line 173 in get
File "C:\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 459 in _on_run
File "C:\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 320 in run
File "C:\Python36\lib\threading.py", line 916 in _bootstrap_inner
File "C:\Python36\lib\threading.py", line 884 in _bootstrap
Current thread 0x00000b08 (most recent call first):
File "C:/Users/Enrico/PycharmProjects/PythonDataset/PDA.py", line 681 in <module>
File "C:\PyCharm\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18 in execfile
File "C:\PyCharm\helpers\pydev\pydevd.py", line 1135 in run
File "C:\PyCharm\helpers\pydev\pydevd.py", line 1735 in main
File "C:\PyCharm\helpers\pydev\pydevd.py", line 1741 in <module>
Process finished with exit code 255
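For reference, a commonly suggested pattern for a non-blocking display is to call plt.show(block=False) and let the GUI event loop breathe with plt.pause(); a minimal self-contained sketch (independent of the code above, with placeholder data):
import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000)  # placeholder data for the sketch

fig, ax = plt.subplots()
ax.hist(data, bins=15)
plt.show(block=False)  # returns immediately instead of blocking the caller
plt.pause(0.001)       # give the event loop a moment to actually draw the window

# the caller can now keep computing and open further figures the same way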

using Dask library to merge two large dataframes

I am very new to Dask. I am trying to merge two dataframes (one comes from a small file that would fit in a pandas dataframe, but I'm using it as a Dask dataframe for convenience; the other is really large). I try to save the result to a CSV file since I know it might not fit in memory.
import pandas as pd
import dask.dataframe as dd
AF=dd.read_csv("../data/AuthorFieldOfStudy.csv")
AF.columns=['AID','FID']
#extract subset of Authors to reduce final merge size
AF = AF.loc[AF['FID'] == '0271BC14']
#This is a large file 9 MB
PAA=dd.read_csv("../data/PAA.csv")
PAA.columns=['PID','AID', 'AffID']
result = dd.merge(AF,PAA, on='AID')
result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()
I get the following error, and don't quite understand it:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-6b2f889f44ff> in <module>()
14 result = dd.merge(AF,PAA, on='AID')
15
---> 16 result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()
/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.pyc in to_csv(self, filename, **kwargs)
936 """ See dd.to_csv docstring for more information """
937 from .io import to_csv
--> 938 return to_csv(self, filename, **kwargs)
939
940 def to_delayed(self):
/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in to_csv(df, filename, name_function, compression, compute, get, **kwargs)
411 if compute:
412 from dask import compute
--> 413 compute(*values, get=get)
414 else:
415 return values
/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
74 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
75 cache=cache, get_id=_thread_get_id,
---> 76 **kwargs)
77
78 # Cleanup pools associated to dead threads
/usr/local/lib/python2.7/dist-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, dumps, loads, **kwargs)
491 _execute_task(task, data) # Re-execute locally
492 else:
--> 493 raise(remote_exception(res, tb))
494 state['cache'][key] = res
495 finish_task(dsk, key, state, results, keyorder.get)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 14: ordinal not in range(128)
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 268, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 329, in collect
res = p.get(part)
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
return self._get(keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
for chunk in raw]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 144, in deserialize
for block, dt, shape in zip(b_blocks, dtypes, shapes)]
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 127, in deserialize
l = decode(l)
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 114, in decode
return list(map(decode, o))
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 110, in decode
return [item.decode() for item in o]
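The failure happens while partd deserializes shuffled partitions and decodes byte strings with Python 2's default ASCII codec. Not an explanation of the root cause, but one workaround worth trying: since the filtered AF frame is small, compute it into pandas first and merge the large Dask frame against it, which keeps the merge partition-wise and avoids the disk-based shuffle where the decode happens. A sketch under that assumption:
import dask.dataframe as dd

AF = dd.read_csv("../data/AuthorFieldOfStudy.csv")
AF.columns = ['AID', 'FID']
# small after filtering, so bring it into memory as a pandas dataframe
AF_small = AF.loc[AF['FID'] == '0271BC14'].compute()

PAA = dd.read_csv("../data/PAA.csv")
PAA.columns = ['PID', 'AID', 'AffID']

# merging a Dask dataframe with an in-memory pandas dataframe is done partition by partition,
# so no shuffle to disk is needed
result = PAA.merge(AF_small, on='AID')
result.to_csv("../data/CompSciPaperAuthorAffiliations.csv")  # to_csv computes by default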

PyTables writing error

I am creating and filling a PyTables CArray in the following way:
# a, b = scipy.sparse.csr_matrix
f = tb.open_file('../data/pickle/dot2.h5', 'w')
filters = tb.Filters(complevel=1, complib='blosc')
out = f.create_carray(f.root, 'out', tb.Atom.from_dtype(a.dtype),
                      shape=(l, n), filters=filters)
bl = 2048
l = a.shape[0]
for i in range(0, l, bl):
    out[:, i:min(i+bl, l)] = (a.dot(b[:, i:min(i+bl, l)])).toarray()
The script had been running fine for nearly two days (I estimated that it would need at least 4 more days).
However, I suddenly received this error stack trace:
File "prepare_data.py", line 168, in _tables_dot
out[:,i:min(i+bl, l)] = (a.dot(b[:,i:min(i+bl, l)])).toarray()
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 719, in __setitem__
self._write_slice(startl, stopl, stepl, shape, nparr)
File "/home/psinger/venv/local/lib/python2.7/site-packages/tables/array.py", line 809, in _write_slice
self._g_write_slice(startl, stepl, countl, nparr)
File "hdf5extension.pyx", line 1678, in tables.hdf5extension.Array._g_write_slice (tables/hdf5extension.c:16287)
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "../../../src/H5Dio.c", line 266, in H5Dwrite
can't write data
File "../../../src/H5Dio.c", line 671, in H5D_write
can't write data
File "../../../src/H5Dchunk.c", line 1840, in H5D_chunk_write
error looking up chunk address
File "../../../src/H5Dchunk.c", line 2299, in H5D_chunk_lookup
can't query chunk address
File "../../../src/H5Dbtree.c", line 998, in H5D_btree_idx_get_addr
can't get chunk info
File "../../../src/H5B.c", line 362, in H5B_find
can't lookup key in subtree
File "../../../src/H5B.c", line 340, in H5B_find
unable to load B-tree node
File "../../../src/H5AC.c", line 1322, in H5AC_protect
H5C_protect() failed.
File "../../../src/H5C.c", line 3567, in H5C_protect
can't load entry
File "../../../src/H5C.c", line 7957, in H5C_load_entry
unable to load entry
File "../../../src/H5Bcache.c", line 143, in H5B_load
wrong B-tree signature
End of HDF5 error back trace
Internal error modifying the elements (H5ARRAYwrite_records returned errorcode -6)
I am really clueless about what the problem is, as the script ran fine for about a quarter of the dataset. Disk space is available.
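Not a diagnosis of the corrupted B-tree itself, but a defensive measure for multi-day writes like this is to flush the array and the file periodically, so buffered state is pushed to disk as the loop progresses; a sketch using the same loop as above:
for i in range(0, l, bl):
    out[:, i:min(i+bl, l)] = (a.dot(b[:, i:min(i+bl, l)])).toarray()
    if (i // bl) % 100 == 0:
        out.flush()  # push the CArray's buffered data down to the HDF5 layer
        f.flush()    # flush the file's buffers to disk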
