I am looping through a set of large files and using multiprocessing for manipulation/writing. I create an iterable out of my dataframe and pass it to multiprocessing's map function. Processing works fine for the smaller files, but when I hit the larger ones (~10 GB) I get the error:
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
the code:
data = np.array_split(data, 10)
with mp.Pool(processes=5, maxtasksperchild=1) as pool1:
pool1.map(write_in_parallel, data)
pool1.close()
pool1.join()
Based on this answer I thought the problem was that the object I am passing to map is too large. So I tried splitting the dataframe into 1.5 GB chunks first and passing each one to map independently, but I am still getting the same error.
Full traceback:
Traceback (most recent call last):
File "_FNMA_LLP_dataprep_final.py", line 51, in <module>
write_files()
File "_FNMA_LLP_dataprep_final.py", line 29, in write_files
'.txt')
File "/DATAPREP/appl/FNMA_LLP/code/FNMA_LLP_functions.py", line 116, in write_dynamic_columns_fannie
pool1.map(write_in_parallel, first)
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/opt/Python364/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/opt/Python364/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
The answer you mentioned also contains another key point: the data should be loaded by the child function. In your case, that's the write_in_parallel function. What I recommend is altering your child function as follows:
def write_in_parallel(path):
    """We'll make the assumption that your data is stored in a csv file."""
    data = pd.read_csv(path)
    ...
Then your "Pool code" should look like this:
with mp.Pool(processes=(mp.cpu_count() - 1)) as pool:
chunks = pool.map(write_in_parallel, ('/path/to/your/data',))
df = pd.concat(chunks)
I hope that will help you.
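For completeness, the limit itself comes from multiprocessing pickling each task and prefixing it with a signed 32-bit length header (the struct.pack("!i", n) call in your traceback), so any single pickled chunk over roughly 2 GiB overflows it, and a pickled dataframe chunk can easily end up larger than the in-memory size you measured. A minimal sketch of the path-based approach, staying close to your original pool code (the file names, chunk size and the read_csv assumption are placeholders, not taken from your script):

import pandas as pd
import multiprocessing as mp

def split_to_chunk_files(path, rows_per_chunk=1000000):
    """Stream the large csv and re-write it as smaller chunk files, returning their paths."""
    paths = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
        chunk_path = '{}.chunk{}.csv'.format(path, i)
        chunk.to_csv(chunk_path, index=False)
        paths.append(chunk_path)
    return paths

def write_in_parallel(chunk_path):
    """Each child loads its own chunk, so only a short string is pickled and sent to it."""
    data = pd.read_csv(chunk_path)
    # ... manipulate and write `data` here ...

if __name__ == '__main__':
    chunk_paths = split_to_chunk_files('/path/to/your/data.csv')
    with mp.Pool(processes=5, maxtasksperchild=1) as pool1:
        pool1.map(write_in_parallel, chunk_paths)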
I am performing distributed training and have the following code:
training_sampler = DistributedSampler(training_set, num_replicas=2, rank=0)
training_generator = data.DataLoader(training_set, **params, sampler=training_sampler)
for x, y, z in training_generator: # Error occurs here.
...
Overall, I get the following message:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/ubuntu/VC/ppg_training_extraction/ppg_training_scripts/train_ASR_trim_scp.py", line 336, in train
for local_batch_src, local_batch_tgt, lengths in dataloaders[phase]:
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
return self._get_iterator()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 827, in __init__
self._reset(loader, first_iter=True)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 857, in _reset
self._try_put_index()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1091, in _try_put_index
index = self._next_index()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
for idx in self.sampler:
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 97, in __iter__
indices = torch.randperm(len(self.dataset), generator=g).tolist() # type: ignore
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Now at that line, I ran the following instructions in pdb:
(Pdb) g = torch.Generator()
(Pdb) g.manual_seed(0)
<torch._C.Generator object at 0x7ff7f8143110>
(Pdb) indices = torch.randperm(4556, generator=g).tolist()
(Pdb) indices = torch.randperm(455604, generator=g).tolist()
*** RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'
Why am I getting the runtime error when the upper bound is high, but not when it is low enough?
Note that I ran the same thing in a clean Python session and it worked fine:
>>> import torch
>>> g = torch.Generator()
>>> g.manual_seed(0)
<torch._C.Generator object at 0x7f9d2dfb39f0>
>>> indices = torch.randperm(455604, generator=g).tolist()
Is it some configuration issue in how I'm handling distributed training across multiple GPUs? Any sort of insight would be appreciated!
I just ran into the same problem using a DataLoader, and I found that the following helps without removing torch.set_default_tensor_type('torch.cuda.FloatTensor'):
data.DataLoader(..., generator=torch.Generator(device='cuda'))
since I don't want to manually add .to('cuda') to the tons of tensors in my code.
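Applied to the loader from the question, that looks roughly like this (training_set, params and the sampler come from the question; depending on your torch version the DistributedSampler may still construct its own CPU generator internally, so treat this as a sketch rather than a guaranteed fix):

import torch
from torch.utils import data
from torch.utils.data.distributed import DistributedSampler

torch.set_default_tensor_type('torch.cuda.FloatTensor')  # kept in place

training_sampler = DistributedSampler(training_set, num_replicas=2, rank=0)
training_generator = data.DataLoader(training_set, **params,
                                     sampler=training_sampler,
                                     generator=torch.Generator(device='cuda'))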
I had the same issue. Looks like somebody found the source of the problem here:
torch.set_default_tensor_type('torch.cuda.FloatTensor')
I fixed my problem by getting rid of this line in my code and manually using .to(device).
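In other words, something along these lines, where model and training_generator stand in for whatever your training script already defines (this only illustrates the explicit placement, it is not your exact code):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)  # move the model once

for x, y, z in training_generator:
    # move each batch explicitly instead of relying on a global default tensor type
    x, y, z = x.to(device), y.to(device), z.to(device)
    output = model(x)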
I have a pandas data frame that consists of a single column of numpy arrays. I can use the numpy.mean function to calculate the mean of the arrays.
import numpy
import pandas
f = pandas.DataFrame({"a":[numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])]})
numpy.mean(f["a"]) # returns array([2., 3.])
I want to do the same thing in Dask.
import dask.dataframe
import dask.array
g = dask.dataframe.from_pandas(f, npartitions=1)
dask.array.mean(g["a"], dtype="float64")
(You have to specify the dtype, otherwise you get a TypeError: unsupported operand type(s) for /: 'NoneType' and 'int' exception.)
The call to dask.array.mean returns the following, which looks correct.
dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
However, when I run dask.array.mean(g["a"], dtype="float64").compute() to get the final value I get a ValueError: setting an array element with a sequence. exception. The full stack is as follows.
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/base.py", line 165, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/base.py", line 436, in compute
results = schedule(dsk, keys, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/threaded.py", line 81, in get
**kwargs
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 118, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/optimization.py", line 982, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 149, in get
result = _execute_task(task, cache)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/utils.py", line 29, in apply
return func(*args, **kwargs)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/dask/array/reductions.py", line 539, in mean_chunk
total = sum(x, dtype=dtype, **kwargs)
File "<__array_function__ internals>", line 6, in sum
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2229, in sum
initial=initial, where=where)
File "/Users/wmcneill/src.private/radius_limit/venv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: setting an array element with a sequence.
Is it possible to perform the equivalent Dask operation?
It would be good if Dask Dataframe handled this case, but it doesn't today. It's not actually that surprising given the situation.
Your dataframe is a bit odd, in that elements of that dataframe are themselves Numpy arrays.
>>> f
a
0 [1.0, 2.0]
1 [3.0, 4.0]
As a result, Pandas thinks that this is an object dtype dataframe
>>> f.dtypes
a object
dtype: object
Because Dask Dataframe is lazy, it doesn't actually keep track of all of the data at any given point; it only knows the dtypes, which in this case are pretty uninformative. Dask Dataframe doesn't really know what to do with a mean computation over these complex elements. It doesn't know whether your elements are numpy arrays, strings, custom Python objects, etc.
So it errors out, and you need to provide a data type explicitly.
The full solution to this problem is probably for Pandas to develop a much richer dtype hierarchy, but that's unlikely in the near term.
Ideally Dask Dataframe would give a better error message here encouraging you to specify a dtype manually. If you wanted to raise an issue, that would be welcome.
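In the meantime, one workaround is to do the reduction yourself with map_partitions, assuming every element really is a NumPy array of the same length (a minimal sketch for your example, not a general solution):

import numpy as np
import pandas as pd
import dask.dataframe as dd

f = pd.DataFrame({"a": [np.array([1.0, 2.0]), np.array([3.0, 4.0])]})
g = dd.from_pandas(f, npartitions=1)

# reduce each partition to a (sum, count) pair, then combine on the client
partials = g["a"].map_partitions(
    lambda s: pd.Series([(np.stack(s.values).sum(axis=0), len(s))]),
    meta=("a", "object"),
).compute()

total = np.sum([p[0] for p in partials], axis=0)
count = sum(p[1] for p in partials)
print(total / count)  # [2. 3.], matching numpy.mean(f["a"])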
I tried inspecting what is in a graph that I created, to see whether the nodes were indeed created.
The code to create a small graph for testing:
from gremlin_python import statics
from gremlin_python.structure.graph import Graph
from gremlin_python.process.graph_traversal import __
from gremlin_python.process.strategies import *
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
graph = Graph()
g = graph.traversal().withRemote(DriverRemoteConnection('ws://localhost:8182/gremlin','g'))
# in a loop add nodes and properties to get a small graph for testing
t = g.addV('testnode').property('val',1)
for i in range(2,11):
t = g.addV('testnode').property('val', i)
t.iterate()
# proceed to create edge (as_ and from_ contain an underscore because as & from are python's reserved words)
g.V().has("val", 2).as_("a").V().has("val", 4).as_("b").addE("link").property("someproperty", "abc").from_("a").to("b").iterate()
list1 = []
list1 = g.V().has("val", 2).toList()
print(len(list1))
I would expect this to print the value "1" in the terminal, which it did correctly when I tested previously (but it now fails).
However, this returns an error:
Traceback (most recent call last):
File "test_addingVEs.py", line 47, in <module>
list1 = g.V().has("val_i", 2).toList()
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/process/traversal.py", line 52, in toList
return list(iter(self))
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/process/traversal.py", line 43, in __next__
self.traversal_strategies.apply_strategies(self)
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/process/traversal.py", line 346, in apply_strategies
traversal_strategy.apply(traversal)
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/driver/remote_connection.py", line 143, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/driver/driver_remote_connection.py", line 54, in submit
results = result_set.all().result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
return self.__get_result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/driver/resultset.py", line 81, in cb
f.result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
return self.__get_result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/driver/connection.py", line 77, in _receive
self._protocol.data_received(data, self._results)
File "/home/user/.local/lib/python3.5/site-packages/gremlin_python/driver/protocol.py", line 98, in data_received
"{0}: {1}".format(status_code, message["status"]["message"]))
gremlin_python.driver.protocol.GremlinServerError: 598:
A timeout occurred during traversal evaluation of [RequestMessage
{, requestId=d56cce63-77f3-4c1f-9c14-3f5f33d4a67b, op='bytecode', processor='traversal', args={gremlin=[[], [V(), has(val, 2)]], aliases={g=g}}}]
- consider increasing the limit given to scriptEvaluationTimeout
The .toList() function did work previously, but not anymore.
Is there anything wrong in my code, or should I look elsewhere for a possible cause?
Well, the error is indicating the problem:
A timeout occurred during traversal evaluation of [RequestMessage
{, requestId=d56cce63-77f3-4c1f-9c14-3f5f33d4a67b, op='bytecode', processor='traversal', args={gremlin=[[], [V(), has(val, 2)]], aliases={g=g}}}]
- consider increasing the limit given to scriptEvaluationTimeout
Of course, assuming the default scriptEvaluationTimeout of 30 seconds, it should not take that long to return a result for the query you are executing unless you have a significant number of vertices and no index on "val". So given that your graph is really small, I don't see why such an execution would take so long.
I don't know what your test environment is like, but if you're running all of JanusGraph/Cassandra on a highly underpowered machine, I suppose something that resource-starved could take a long time to execute. I would try increasing the scriptEvaluationTimeout as suggested in the error to see just how high you have to raise it to get the result back. If you don't have an index on "val", you should probably add one anyway (though I don't think that's your problem unless the vertex count is bigger than your code indicates).
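For reference, with more recent TinkerPop versions (3.4+, if I recall correctly) you can also raise the timeout per request from gremlin-python instead of editing the server config; double-check the token name against the version you are running, since older releases used scriptEvaluationTimeout:

# raise the evaluation timeout for this one traversal (value in milliseconds)
list1 = g.with_('evaluationTimeout', 120000).V().has('val', 2).toList()
print(len(list1))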
I am getting a memory error when trying to use a mask to select values from a 4M-row table with 3 columns.
When I run df.memory_usage().sum() it returns 173526080, which is roughly 0.17 GB, and I have 32 GB of RAM. So it doesn't seem like it should run out of RAM, as there is no earlier code consuming lots of RAM.
This method worked for previous versions of code with the same 4M rows.
The code I run is:
x = df[exit_point] > 0
print(df[x].shape)
The error I get is:
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\frame.py", line 2175, in _getitem_array
return self._take(indexer, axis=0, convert=False)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 2143, in _take
self._consolidate_inplace()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3677, in _consolidate_inplace
self._protect_consolidate(f)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3666, in _protect_consolidate
result = f()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\generic.py", line 3675, in f
self._data = self._data.consolidate()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 3826, in consolidate
bm._consolidate_inplace()
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 3831, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 4853, in _consolidate
_can_consolidate=_can_consolidate)
File "C:\Users\joaoa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\internals.py", line 4876, in _merge_blocks
new_values = new_values[argsort]
MemoryError
I am lost on how to start debugging this. Any clues and hints would be very much appreciated.
Maybe this helps:
[1] Use the low_memory=False argument while importing the file. For example:
df = pd.read_csv('filepath', low_memory=False)
[2] Use the dtype argument while importing the file.
[3] If you use Jupyter Notebook: Kernel > Restart & Clear Output.
Hope this helps!
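For example, something like this, where the column names and dtypes are only placeholders for whatever your file actually contains:

import pandas as pd

# smaller numeric dtypes can cut the in-memory footprint roughly in half
df = pd.read_csv('filepath',
                 low_memory=False,
                 dtype={'some_id_column': 'int32',
                        'exit_point_1': 'float32',
                        'exit_point_2': 'float32'})
print(df.memory_usage(deep=True).sum())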
What am I trying to do?
pd.read_csv(... nrows=###) can read the top nrows of a file. I'd like to do the same while using pd.read_hdf(...).
What is the problem?
I am confused by the documentation. start and stop look like what I need, but when I try them, a ValueError is returned. The second thing I tried was passing nrows=10, thinking it might be an accepted **kwargs entry. When I do that, no error is thrown, but the full dataset is returned instead of just 10 rows.
Question: How does one correctly read a smaller subset of rows from an HDF file? (edit: without having to read the whole thing into memory first!)
Below is my interactive session:
>>> import pandas as pd
>>> df = pd.read_hdf('storage.h5')
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
df = pd.read_hdf('storage.h5')
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 367, in read_hdf
raise ValueError('key must be provided when HDF5 file '
ValueError: key must be provided when HDF5 file contains multiple datasets.
>>> import h5py
>>> f = h5py.File('storage.h5', mode='r')
>>> list(f.keys())[0]
'table'
>>> f.close()
>>> df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
df = pd.read_hdf('storage.h5', key='table', start=0, stop=10)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 370, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 740, in select
return it.get_result()
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 1447, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 733, in func
columns=columns, **kwargs)
File "C:\Python35\lib\site-packages\pandas\io\pytables.py", line 2890, in read
return self.obj_type(BlockManager(blocks, axes))
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 2795, in __init__
self._verify_integrity()
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 3006, in _verify_integrity
construction_error(tot_items, block.shape[1:], self.axes)
File "C:\Python35\lib\site-packages\pandas\core\internals.py", line 4280, in construction_error
passed, implied))
ValueError: Shape of passed values is (614, 593430), indices imply (614, 10)
>>> df = pd.read_hdf('storage.h5', key='table', nrows=10)
>>> df.shape
(593430, 614)
Edit:
I just attempted to use where:
mylist = list(range(30))
df = pd.read_hdf('storage.h5', key='table', where='index=mylist')
I received a TypeError indicating a Fixed format store (the default format of df.to_hdf(...)):
TypeError: cannot pass a where specification when reading from a
Fixed format store. this store must be selected in its entirety
Does this mean I can't select a subset of rows if the format is Fixed format?
I ran into the same problem. I am pretty certain by now that https://github.com/pandas-dev/pandas/issues/11188 tracks this very problem. It is a ticket from 2015 and it contains a repro. Jeff Reback suggested that this is actually a bug, and he even pointed us towards a solution back in 2015. It's just that nobody has built that solution yet. I might give it a try.
Seems like this now works, at least with pandas 1.0.1. Just provide start and stop arguments:
df = pd.read_hdf('test.h5', '/floats/trajectories', start=0, stop=5)
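Note that the where-style selection asked about in the edit still requires the file to be written in table format (the default fixed format rejects it, as the TypeError above shows). A minimal sketch, using a throwaway dataframe and file name rather than your actual data:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 3), columns=['a', 'b', 'c'])
df.to_hdf('storage_table.h5', key='table', format='table')  # queryable 'table' format

head = pd.read_hdf('storage_table.h5', key='table', start=0, stop=10)  # first 10 rows
subset = pd.read_hdf('storage_table.h5', key='table', where='index < 30')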