Using the Dask library to merge two large dataframes - Python

I am very new to Dask. I am trying to merge two dataframes (one from a small file that would fit in a pandas dataframe, but I'm using it as a Dask dataframe for convenience; the other is really large). I save the result to a CSV file since I know it might not fit in a dataframe.
import pandas as pd
import dask.dataframe as dd
AF=dd.read_csv("../data/AuthorFieldOfStudy.csv")
AF.columns=['AID','FID']
#extract subset of Authors to reduce final merge size
AF = AF.loc[AF['FID'] == '0271BC14']
#This is a large file 9 MB
PAA=dd.read_csv("../data/PAA.csv")
PAA.columns=['PID','AID', 'AffID']
result = dd.merge(AF,PAA, on='AID')
result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()
I get the following error, and don't quite understand it:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-6b2f889f44ff> in <module>()
14 result = dd.merge(AF,PAA, on='AID')
15
---> 16 result.to_csv("../data/CompSciPaperAuthorAffiliations.csv").compute()
/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.pyc in to_csv(self, filename, **kwargs)
936 """ See dd.to_csv docstring for more information """
937 from .io import to_csv
--> 938 return to_csv(self, filename, **kwargs)
939
940 def to_delayed(self):
/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.pyc in to_csv(df, filename, name_function, compression, compute, get, **kwargs)
411 if compute:
412 from dask import compute
--> 413 compute(*values, get=get)
414 else:
415 return values
/usr/local/lib/python2.7/dist-packages/dask/base.pyc in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/usr/local/lib/python2.7/dist-packages/dask/threaded.pyc in get(dsk, result, cache, num_workers, **kwargs)
74 results = get_async(pool.apply_async, len(pool._pool), dsk, result,
75 cache=cache, get_id=_thread_get_id,
---> 76 **kwargs)
77
78 # Cleanup pools associated to dead threads
/usr/local/lib/python2.7/dist-packages/dask/async.pyc in get_async(apply_async, num_workers, dsk, result, cache, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, dumps, loads, **kwargs)
491 _execute_task(task, data) # Re-execute locally
492 else:
--> 493 raise(remote_exception(res, tb))
494 state['cache'][key] = res
495 finish_task(dsk, key, state, results, keyorder.get)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 14: ordinal not in range(128)
Traceback
---------
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 268, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/async.py", line 249, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 329, in collect
res = p.get(part)
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
return self._get(keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
for chunk in raw]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 144, in deserialize
for block, dt, shape in zip(b_blocks, dtypes, shapes)]
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 127, in deserialize
l = decode(l)
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 114, in decode
return list(map(decode, o))
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 110, in decode
return [item.decode() for item in o]
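For reference, one possible mitigation (an untested sketch, not a confirmed fix): the 0xc5 byte suggests non-ASCII text in one of the CSVs, which Python 2's default 'ascii' codec cannot decode when the shuffled partitions are deserialized. Since dd.read_csv forwards keyword arguments to pandas.read_csv, an explicit encoding can be supplied at read time:
import dask.dataframe as dd

# untested sketch: read both files as UTF-8 so string columns are decoded up front
AF = dd.read_csv("../data/AuthorFieldOfStudy.csv", encoding='utf-8')
PAA = dd.read_csv("../data/PAA.csv", encoding='utf-8')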

Related

Failed to convert a NumPy array to a Tensor while converting pandas dataframe

I am trying to convert a pandas dataframe to a tf dataset, but I constantly run into this problem:
Traceback (most recent call last):
File "/home/arch_poppin/dev/AI/reviews/rev.py", line 36, in <module>
dataset = tf.data.Dataset.from_tensor_slices((df.values, label.values))
File "/usr/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 689, in from_tensor_slices
return TensorSliceDataset(tensors)
File "/usr/lib/python3.8/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 3066, in __init__
element = structure.normalize_element(element)
File "/usr/lib/python3.8/site-packages/tensorflow/python/data/util/structure.py", line 129, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "/usr/lib/python3.8/site-packages/tensorflow/python/profiler/trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1535, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 52, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 264, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 276, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 301, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "/usr/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 98, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type int).
This is what my code looks like:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import io
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
# In[0] open file
df = pd.read_csv(r'PATH')
df = df.sample(frac=1).reset_index(drop=True)
# In[1] make comma separated integers into list
objectColumnList = list(df.select_dtypes(include=['object']).columns)
for column in objectColumnList:
    colArr = []
    for row in df[column]:
        arr = np.asarray(row.split(',')).astype(np.float32)
        colArr.append(arr)
    df[column] = colArr
# In[2] make dataset
label = df.pop('MYLABELS')
dataset = tf.data.Dataset.from_tensor_slices((df.values, label.values))
Here is the link to the CSV file I'm using, if you want to reproduce the error:
https://mega.nz/file/uOwiwK5K#FVG7K0glMh2mGa53UDWQiG6iKgNFn5972Kdjb-gmAV4 (I had to remove the column names for privacy reasons.)
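A possible workaround sketch (untested against the linked file; it assumes every comma-separated list in a given object column has the same length, and reuses the df and label variables from the snippet above): build one dense float32 feature matrix before calling from_tensor_slices, so TensorFlow never sees an object-dtype array.
import numpy as np
import tensorflow as tf

feature_blocks = []
for column in df.columns:
    block = np.stack(df[column].to_numpy())   # (n_rows, k) for list columns, (n_rows,) for scalar columns
    if block.ndim == 1:
        block = block.reshape(-1, 1)          # treat plain numeric columns as width-1 blocks
    feature_blocks.append(block.astype(np.float32))

features = np.hstack(feature_blocks)          # single dense float32 matrix
dataset = tf.data.Dataset.from_tensor_slices(
    (features, label.values.astype(np.float32)))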

IsADirectoryError: [Errno 21] Is a directory: '/' error while using multiprocessing

Say I have a function that runs over multiple data frames in a list, like this:
listdF = [os.path.join(os.sep, path, x) for x in os.listdir(path) if x.endswith('.csv')]

def corre_arrys(listdF):
    data = []
    for files in listdF:
        df = pd.read_csv(files, sep='\t', header=0, engine='python')
        # do something
    return(df)
When I run the above function as it is, there is no error; it prints out the output I need. However, when I try to run it using multiprocessing as follows,
from multiprocessing import Pool
NUM_PROCS = 8
pool = Pool(processes=NUM_PROCS)
allDfs = pool.map(corre_arrys,listdF)
It is throwing the following error message,
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/alva/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/alva/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "<ipython-input-42-e4b97b52ffff>", line 4, in corre_arrys
df = pd.read_csv(files,sep='\t',header=0,engine='python')
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1126, in _make_engine
self._engine = klass(self.f, **self.options)
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2269, in __init__
memory_map=self.memory_map,
File "/home/alva/anaconda3/lib/python3.7/site-packages/pandas/io/common.py", line 431, in get_handle
f = open(path_or_buf, mode, errors="replace", newline="")
IsADirectoryError: [Errno 21] Is a directory: '/'
"""
The above exception was the direct cause of the following exception:
IsADirectoryError Traceback (most recent call last)
<ipython-input-46-4971753cdf30> in <module>
4 NUM_PROCS = 8
5 pool = Pool(processes=NUM_PROCS)
----> 6 allDfs = pool.map(corre_arrys,listdF)
~/anaconda3/lib/python3.7/multiprocessing/pool.py in map(self, func, iterable, chunksize)
266 in a list that is returned.
267 '''
--> 268 return self._map_async(func, iterable, mapstar, chunksize).get()
269
270 def starmap(self, func, iterable, chunksize=None):
~/anaconda3/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):
IsADirectoryError: [Errno 21] Is a directory: '/'
The listDF looks like the following, which has both paths and files.
['/path/scripts/pc_2_lc_1_T.csv',
'/path/scripts/pc_2_lc_2_T.csv',
'/path/scripts/pc_1_lc_1_T.csv',
'/path/scripts/pc_1_lc_2_T.csv']
I am not able to understand where exactly the problem is.
Any help is greatly appreciated. Thanks!!
From your stack trace it looks like a directory is creeping into your listdF, and pandas.read_csv() fails trying to load it. Try explicitly filtering out directories (and keeping the full path):
listdF = [os.path.join(path, x) for x in os.listdir(path) if os.path.isfile(os.path.join(path, x)) and x.endswith('.csv')]
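Note also that Pool.map calls the worker once per element of the iterable, so a function that accepts a single path (rather than the whole list) matches how the paths are handed out. A minimal sketch, assuming the same tab-separated files and reusing the path variable from the question:
import os
import pandas as pd
from multiprocessing import Pool

def corre_arrys_single(csv_path):
    # handle exactly one CSV, since Pool.map passes one path per call
    df = pd.read_csv(csv_path, sep='\t', header=0, engine='python')
    # do something with df
    return df

if __name__ == '__main__':
    listdF = [os.path.join(path, x) for x in os.listdir(path)
              if os.path.isfile(os.path.join(path, x)) and x.endswith('.csv')]
    with Pool(processes=8) as pool:
        allDfs = pool.map(corre_arrys_single, listdF)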

Pandas index labelling type error within multiprocessing code

I've been wrestling with getting the below multiprocessing code to work (to append nearest stores to a customer file using co-ordinates).
I believe it's a pandas issue that's causing the problem, potentially something to do with passing the dataframe into the function parallelize_dataframe(), where it's split into different numpy arrays (that's just a guess). Oddly, when I run on the full postcodes file (rather than the test customer file), it doesn't crash (it ran for 15 minutes until I stopped it). However, as postcodes is 2.6m records long, I don't know whether it just hadn't reached the point where it would crash, or whether I'm introducing the problem when I create the test files.
It's a long process that utilises most of my CPU, so I want to prove it works on the test files first before letting it run for a long time on the full file.
Either way, it persistently throws an index labelling type error (at the end of this post).
Any help with this appreciated.
import multiprocess as mp #pip install multiprocess
import pandas as pd
import numpy as np
import functools
postcodes = pd.read_csv('national_postcode_stats_file.csv')
customers = postcodes.sample(n = 10000, random_state=1) # customers test file
stores = postcodes.sample(n = 100, random_state=1) # store test file
stores.reset_index(inplace=True)
cores = mp.cpu_count() # 8 CPUs
partitions = cores
def parallelize_dataframe(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def dist_func(stores, data):
    # Re-import libraries (parallel processes run in a fresh interpreter each time)
    import pandas as pd
    import numpy as np
    def nearest(inlat1, inlon1, inlat2, inlon2, store, postcode):
        lat1 = np.radians(inlat1)
        lat2 = np.radians(inlat2)
        longdif = np.radians(inlon2 - inlon1)
        r = 6371.1009  # gives d in kilometers
        d = np.arccos(np.sin(lat1)*np.sin(lat2) + np.cos(lat1)*np.cos(lat2) * np.cos(longdif)) * r
        near = pd.DataFrame({'store': store, 'postcode': postcode, 'distance': d})
        near_min = near.loc[near['distance'].idxmin()]
        x = str(near_min['store']) + '~' + str(near_min['postcode']) + '~' + str(near_min['distance'])
        return x
    data['appended'] = data['lat'].apply(nearest, args=(data['long'], stores['lat'], stores['long'], stores['index'], stores['pcds']))
    data[['store','store_postcode','distance_km']] = data['appended'].str.split("~", expand=True)
    return data

dist_func_with_stores = functools.partial(dist_func, stores)  # needed to pass stores to parallelize_dataframe
dist = parallelize_dataframe(customers, dist_func_with_stores)
And the full error:
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\multiprocess\pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\multiprocess\pool.py", line 44, in mapstar
return list(map(*args))
File "<ipython-input-34-7a1b788055e2>", line 41, in dist_func
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 3591, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2217, in pandas._libs.lib.map_infer
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 3578, in f
return func(x, *args, **kwds)
File "<ipython-input-34-7a1b788055e2>", line 37, in nearest
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1500, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1912, in _getitem_axis
self._validate_key(key, axis)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 1799, in _validate_key
self._convert_scalar_indexer(key, axis)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 262, in _convert_scalar_indexer
return ax._convert_scalar_indexer(key, kind=self.name)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\numeric.py", line 211, in _convert_scalar_indexer
._convert_scalar_indexer(key, kind=kind))
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2877, in _convert_scalar_indexer
return self._invalid_indexer('label', key)
File "C:\Users\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3067, in _invalid_indexer
kind=type(key)))
TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [nan] of <class 'float'>
"""
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-34-7a1b788055e2> in <module>
45 dist_func_with_stores = functools.partial(dist_func, stores) # Needed to pass stores to parrellise_dataframe
46
---> 47 dist = parallelize_dataframe(customers, dist_func_with_stores)
<ipython-input-34-7a1b788055e2> in parallelize_dataframe(data, func)
16 data_split = np.array_split(data, partitions)
17 pool = mp.Pool(cores)
---> 18 data = pd.concat(pool.map(func, data_split))
19 pool.close()
20 pool.join()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\multiprocess\pool.py in map(self, func, iterable, chunksize)
266 in a list that is returned.
267 '''
--> 268 return self._map_async(func, iterable, mapstar, chunksize).get()
269
270 def starmap(self, func, iterable, chunksize=None):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\multiprocess\pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):
TypeError: cannot do label indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [nan] of <class 'float'>
Solved it: I changed the apply to a lambda function, and it works fine now.
data['appended'] = data.apply(lambda row: nearest(row['lat'], row['long'], stores['lat'], stores['long'], stores['index'], stores['pcds']),axis=1)
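For completeness, a sketch of how that row-wise apply might sit inside dist_func from the question (the distance helper is reproduced essentially unchanged; only the apply call differs):
import numpy as np
import pandas as pd

def dist_func(stores, data):
    def nearest(inlat1, inlon1, inlat2, inlon2, store, postcode):
        lat1 = np.radians(inlat1)
        lat2 = np.radians(inlat2)
        longdif = np.radians(inlon2 - inlon1)
        r = 6371.1009  # gives d in kilometres
        d = np.arccos(np.sin(lat1)*np.sin(lat2) + np.cos(lat1)*np.cos(lat2)*np.cos(longdif)) * r
        near = pd.DataFrame({'store': store, 'postcode': postcode, 'distance': d})
        near_min = near.loc[near['distance'].idxmin()]
        return str(near_min['store']) + '~' + str(near_min['postcode']) + '~' + str(near_min['distance'])

    # Row-wise apply: each row supplies scalar lat/long values, so nearest()
    # never receives a whole Series as its first arguments
    data['appended'] = data.apply(
        lambda row: nearest(row['lat'], row['long'],
                            stores['lat'], stores['long'],
                            stores['index'], stores['pcds']),
        axis=1)
    data[['store', 'store_postcode', 'distance_km']] = data['appended'].str.split('~', expand=True)
    return data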

pandas merge command failing in parallel loop - "ValueError: buffer source array is read-only"

I am writing a bootstrap algorithm using parallel loops and pandas. The problem I experience is that a merge command inside the parallel loop causes a "ValueError: buffer source array is read-only" error, but only if I use the full dataset in the merge (120k lines). Any subset with fewer than 12k lines works just fine, so I infer it is not a problem with the syntax. What can I do?
Current pandas version is 0.24.2 and cython is 0.29.7.
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
r = call_item()
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 567, in __call__
return self.func(*args, **kwargs)
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
for func, args, kwargs in self.items]
File "/home/ubuntu/.local/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
for func, args, kwargs in self.items]
File "<ipython-input-72-cdb83eaf594c>", line 12, in bootstrap
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 6868, in merge
copy=copy, indicator=indicator, validate=validate)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 48, in merge
return op.get_result()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 546, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 756, in _get_join_info
right_indexer) = self._get_join_indexers()
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 735, in _get_join_indexers
how=self.how)
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1130, in _get_join_indexers
llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1662, in _factorize_keys
rlab = rizer.factorize(rk)
File "pandas/_libs/hashtable.pyx", line 111, in pandas._libs.hashtable.Int64Factorizer.factorize
File "stringsource", line 653, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 348, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-73-652c1db5701b> in <module>()
1 num_cores = multiprocessing.cpu_count()
----> 2 results = Parallel(n_jobs=num_cores, prefer='processes', verbose = 5)(delayed(bootstrap)() for i in range(n_trials))
3 #pd.DataFrame(results[0])
~/.local/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
932
933 with self._backend.retrieval_context():
--> 934 self.retrieve()
935 # Make sure that we get a last message telling us we are done
936 elapsed_time = time.time() - self._start_time
~/.local/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
831 try:
832 if getattr(self._backend, 'supports_timeout', False):
--> 833 self._output.extend(job.get(timeout=self.timeout))
834 else:
835 self._output.extend(job.get())
~/.local/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
519 AsyncResults.get from multiprocessing."""
520 try:
--> 521 return future.result(timeout=timeout)
522 except LokyTimeoutError:
523 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
ValueError: buffer source array is read-only
And the code is:
def bootstrap():
    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by="0").reset_index(drop=True)
    df_resample_ids.columns = [ob_id_field]
    df_resample = pd.DataFrame(df_resample_ids.merge(df, on=ob_id_field))
    return df_resample
num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose = 5)(delayed(bootstrap)() for i in range(n_trials))
The algorithm creates resampled/replaced IDs from an ID variable and uses the merge command to create a new dataset based on the resampled IDs and the original dataset stored in df. If I cut out a subset of the original dataset (anywhere), leaving fewer than ~12k lines, then the parallel loop finishes without an error and behaves as expected.
As requested, below is a new snippet to re-create the data structures and mirror the principal approach I am currently working on:
import numpy as np
import pandas as pd
import sklearn as skl
import multiprocessing
from joblib import Parallel, delayed

df = pd.DataFrame(np.random.randn(200000, 24), columns=list('ABCDDEFGHIJKLMNOPQRSTUVW'))
df["ID"] = df.index.drop_duplicates().tolist()
ob_ids = df.index.drop_duplicates().tolist()

def bootstrap2():
    df_resample_ids = skl.utils.resample(ob_ids)
    df_resample_ids = pd.DataFrame(df_resample_ids).sort_values(by=0).reset_index(drop=True)
    df_resample_ids.columns = ['ID']
    df_resample = pd.DataFrame(df.merge(df_resample_ids, on='ID'))
    result = df_resample
    return result

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores, prefer='processes', verbose=5)(delayed(bootstrap2)() for i in range(n_trials))
However, I notice that when the data is completely made up of np.random numbers, the loop goes through without an error. The dtypes of the original dataframe are:
start_rtg int64
end_rtg float64
days_diff float64
ultimate_customer_system_id int64
How can I avoid the read-only error?
Posting an answer to my own question, as I found that one of the variables was of int64 datatype. When I converted all variables to float64, the error disappeared, so it is an issue that is restricted to certain datatypes only.
Cheers,
Stephan
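For anyone hitting the same thing, a sketch of the workaround described above (illustrative only, using the df from the reproduction snippet): cast the int64 columns to float64 before launching the parallel loop.
# cast every int64 column to float64, per the workaround above
int_cols = df.select_dtypes(include=['int64']).columns
df[int_cols] = df[int_cols].astype('float64')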

Dask Dataframe Distributed Process ID Access Denied

I am running a set of pandas-like transformations to a dask dataframe, using the "distributed" set-up, running on my own machine - so using 8 workers corresponding to my computer's 8 cores.
I have the default set up of a distributed client:
from dask.distributed import Client
c = Client()
The process runs successfully with a small amount of data (1000 records), but when I scale it up only slightly to 7500 records, I get the following warnings:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: tcp://127.0.0.1:58103, threads: 1>>
Traceback (most recent call last):
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 348, in catch_zombie
yield
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 387, in _get_pidtaskinfo
ret = cext.proc_pidtaskinfo_oneshot(self.pid)
ProcessLookupError: [Errno 3] No such process
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
return self.callback()
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/nanny.py", line 245, in memory_monitor
memory = psutil.Process(self.process.pid).memory_info().rss
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_common.py", line 337, in wrapper
return fun(self)
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/__init__.py", line 1049, in memory_info
return self._proc.memory_info()
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 330, in wrapper
return fun(self, *args, **kwargs)
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 456, in memory_info
rawtuple = self._get_pidtaskinfo()
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_common.py", line 337, in wrapper
return fun(self)
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 387, in _get_pidtaskinfo
ret = cext.proc_pidtaskinfo_oneshot(self.pid)
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/Users/user1/anaconda3/envs/ldaenv/lib/python3.6/site-packages/psutil/_psosx.py", line 361, in catch_zombie
raise AccessDenied(proc.pid, proc._name)
psutil._exceptions.AccessDenied: psutil.AccessDenied (pid=17998)
This repeats itself multiple times as Dask attempts to start the computation block again. After it has failed the number of times specified in the config file, there is finally a KilledWorker error, e.g. the one below. I've tried this with different lengths of data, and the KilledWorker is sometimes on a melt task, sometimes on an apply task.
KilledWorker Traceback (most recent call last)
<ipython-input-28-7ba288919b51> in <module>()
1 #Optional checkpoint to view output
2 with ProgressBar():
----> 3 output = aggdf.compute()
4 output.head()
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
133 dask.base.compute
134 """
--> 135 (result,) = compute(self, traverse=False, **kwargs)
136 return result
137
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
331 postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
332 else (None, a) for a in args]
--> 333 results = get(dsk, keys, **kwargs)
334 results_iter = iter(results)
335 return tuple(a if f is None else f(next(results_iter), *a)
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, **kwargs)
1997 secede()
1998 try:
-> 1999 results = self.gather(packed, asynchronous=asynchronous)
2000 finally:
2001 for f in futures.values():
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
1435 return self.sync(self._gather, futures, errors=errors,
1436 direct=direct, local_worker=local_worker,
-> 1437 asynchronous=asynchronous)
1438
1439 #gen.coroutine
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
590 return future
591 else:
--> 592 return sync(self.loop, func, *args, **kwargs)
593
594 def __repr__(self):
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
252 e.wait(1000000)
253 if error[0]:
--> 254 six.reraise(*error[0])
255 else:
256 return result[0]
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/utils.py in f()
236 yield gen.moment
237 thread_state.asynchronous = True
--> 238 result[0] = yield make_coro()
239 except Exception as exc:
240 logger.exception(exc)
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/tornado/gen.py in run(self)
1053
1054 try:
-> 1055 value = future.result()
1056 except Exception:
1057 self.had_exception = True
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
236 if self._exc_info is not None:
237 try:
--> 238 raise_exc_info(self._exc_info)
239 finally:
240 self = None
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/tornado/gen.py in run(self)
1061 if exc_info is not None:
1062 try:
-> 1063 yielded = self.gen.throw(*exc_info)
1064 finally:
1065 # Break up a reference to itself
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
1313 six.reraise(type(exception),
1314 exception,
-> 1315 traceback)
1316 if errors == 'skip':
1317 bad_keys.add(key)
~/anaconda3/envs/ldaenv/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
KilledWorker: ("('melt-b85b6b6b1aee5b5aaa8d24db1de65b8a', 0)", 'tcp://127.0.0.1:58108')
I'm not very familiar with the distributed or tornado packages, or the underlying architecture of which processes are being created and killed - is anyone able to point me in the right direction to debug/resolve this?
In the meantime I am switching to the default dask dataframe behaviour of multithreaded computation, which works successfully with a large amount of data.
It looks like your workers are dying for some reason. Unfortunately it's not clear from the workers what the cause is. You might consider setting up the cluster manually to get clearer access to the worker logs:
$ dask-scheduler                     # run this in one terminal
$ dask-worker tcp://localhost:8786   # run this in another
# worker logs will appear here
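With the scheduler and worker started manually as above, the client can then be pointed at that scheduler instead of starting its own local cluster (a small sketch; the address matches the default dask-scheduler port used above):
from dask.distributed import Client

# connect to the manually started scheduler; worker output (and any crash
# messages) stays visible in the dask-worker terminal
c = Client('tcp://localhost:8786')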
