MemoryError from melt or concat with large data - python

I get a MemoryError when I try to run pd.melt().
I checked this post and tried to modify the code accordingly, but I still got the error. (LINK)
Here is my original code:
melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')
After modifying:
pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
melted = pd.concat(pivot_list).sort_values('ID')
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
return op.get_result()
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
new_data = concatenate_managers(
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
values = _concatenate_join_units(join_units, concat_axis, copy=copy)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
to_concat = [
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
values = algos.take_nd(values, indexer, axis=ax)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
I think the main issue comes from the melt() and concat() steps.
Any ideas on how to deal with this would be appreciated.

Usually, when you get a "MemoryError: unable to allocate" error, this falls into the "user error" category of requesting a reshape operation which is simply too large to fit into memory.
pd.melt is a memory-intensive operation. It not only requires creating new arrays for all elements in your data, it also reshapes your data into a less efficient format in which the id values are duplicated many times. The size of the result, and so the memory penalty, will depend on the structure of your data and the number of value columns.
Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all elements in your id_vars column and repeat them for all columns specified by value_vars.
As an example, if your dataframe has 1M rows and 1000 columns, with all cells as float32, the dataframe would take up approximately 4 GB in memory. If you then melt it and specify 4 id_vars, you'll have 4 * 1M id cells which will each get repeated 996 times, giving 4 * 1e6 * 996, or about 4 billion cells for the id columns. Additionally, you'll have a column with 1e6 * 996 "variables" and finally the same number of "values". You'd need to know the length and dtype of all the column names and the data types of the cells, but even this simple example would result in roughly 23 GB of arrays, even if all values were relatively compact float32s.
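The arithmetic above can be checked with a few lines of Python. This is only a rough sketch: estimate_melt_size is a hypothetical helper, and it assumes every output cell costs the same number of bytes (4, as for float32), which understates the cost of object columns such as the variable names.

```python
def estimate_melt_size(n_rows, n_cols, n_id_vars, bytes_per_cell=4):
    """Rough size of a melted frame, assuming a flat cost per cell."""
    n_value_cols = n_cols - n_id_vars      # columns that get unpivoted
    out_rows = n_rows * n_value_cols       # one output row per (row, value column)
    id_cells = out_rows * n_id_vars        # every id value is repeated per value column
    variable_cells = out_rows              # the "variable" column
    value_cells = out_rows                 # the "value" column
    total_bytes = (id_cells + variable_cells + value_cells) * bytes_per_cell
    return out_rows, id_cells, total_bytes


# Numbers from the example: 1M rows, 1000 columns, 4 id_vars.
out_rows, id_cells, total_bytes = estimate_melt_size(1_000_000, 1000, 4)
print(f"output rows: {out_rows:,}")
print(f"id cells:    {id_cells:,}")
print(f"total size:  {total_bytes / 1e9:.1f} GB")
```

Running the same estimate on your own row and column counts before calling melt is a cheap way to see whether the result has any chance of fitting in memory.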
Melt is a helpful convenience function for reshaping small dataframes. If you have a dataframe which is anywhere near the size I'm talking about in this example, I'd mostly suggest you don't do this, or if you really do need to reshape this way, then you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size. You may want to write out the data iteratively rather than attempting to concatenate the data at the end. This isn't something that will work out of the box - expect some trial & error. You could also look into using out-of-core computation tools - dask.dataframe has a port of melt which could leverage multiple cores and write in parallel to disk.
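One way to write the data out iteratively is to melt one chunk at a time and append each result to a file, so the full melted frame never has to exist in memory. The sketch below is an illustration under assumptions, not a tuned solution: melt_to_csv is a hypothetical helper, CSV is just one possible output format, and the small frame with columns V1 and V2 is made-up sample data using the id columns from the question.

```python
import os
import tempfile

import pandas as pd


def melt_to_csv(df, id_vars, out_path, chunk_size=100_000):
    """Melt df chunk by chunk, appending each melted chunk to out_path."""
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        melted = pd.melt(chunk, id_vars=id_vars,
                         var_name='New_Var', value_name='Value')
        first = start == 0
        # Write the header once, then append without it.
        melted.to_csv(out_path, mode='w' if first else 'a',
                      header=first, index=False)


# Tiny made-up frame to show the call.
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Col2': list('abcde'),
    'Col3': list('vwxyz'),
    'Year': [2019, 2019, 2020, 2020, 2021],
    'V1': [0.1, 0.2, 0.3, 0.4, 0.5],
    'V2': [1.0, 2.0, 3.0, 4.0, 5.0],
})
id_vars = ['ID', 'Col2', 'Col3', 'Year']

out_path = os.path.join(tempfile.mkdtemp(), 'melted.csv')
melt_to_csv(df, id_vars, out_path, chunk_size=2)
result = pd.read_csv(out_path)
print(result.shape)
```

Note that the output is in chunk order, not sorted by ID. A global sort_values('ID') would need the whole result in memory again, so any sorting has to be handled separately, for example by sorting the input before chunking.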

Related

Python3 Pandas - handle overflow when casting to number greater than data type int64

I am writing a standard script that fetches data from a database, does some manipulation, and inserts the data back into another table.
I am facing an overflow issue while converting a column's type in a DataFrame.
Here's an example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error :
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code they are generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but then the rows that do not overflow would also be overwritten, which would lead to inconsistent results.
Can you advise me on some elegant solutions here?
Building on your idea, you can use np.iinfo to get the int64 limits and clip the values to them:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
col1
0 9223372036854775807
Take care of the limit of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
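np.float128 is not available on every platform, so the line above may not run everywhere. A variation on the same clipping idea, sketched here as an assumption rather than as the method above, is to compare the values as Python integers (which have arbitrary precision) before casting. clip_to_int64 is a hypothetical helper name, and the extra rows are made-up sample values.

```python
import numpy as np
import pandas as pd

ii64 = np.iinfo(np.int64)


def clip_to_int64(value):
    """Parse value as a Python int and clamp it to the int64 range."""
    n = int(value)  # Python ints do not overflow
    return max(ii64.min, min(ii64.max, n))


df = pd.DataFrame({'col1': ['66666666666666666666666666666', '42', '-5']})
df['col1'] = df['col1'].map(clip_to_int64).astype('int64')
print(df)
```

Only the overflowing row is replaced by the int64 limit; the other rows keep their exact values. This is slower than a vectorized cast on large columns, since map calls the helper once per row.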

How to match fit_tranform of the imputed dataset with the original dataset while handling missing data in an ML model?

When trying to fill in missing values with KNNImputer using the following code:
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.columns)
I receive this error message:
Traceback (most recent call last):
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 384, in <module>
main()
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\database_analyzer.py", line 232, in main
train_data_engineered = missingvalue_handler(train_data_engineered)
File "c:\Users\myname\Desktop\Project\PythonTool\calculator\database-analyzer\utilities_module.py", line 1268, in missingvalue_handler
return pd.DataFrame(knn_imputer.fit_transform(new_data),
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\frame.py", line 695, in __init__
mgr = ndarray_to_mgr(
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:\ProgramData\Anaconda3\envs\tf\lib\site-packages\pandas\core\internals\construction.py", line 422, in _check_values_indices_shape_match
raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (196, 1032), indices imply (196, 1033)
I know the reason for this is that the imputer drops one column entirely, bringing the count down from 1033 to 1032. How can I fix the issue without knowing which column has been removed?
I actually figured it out. I did not need to know the exact column name. I made the following change so that the number of columns in the imputed array matches the number of column labels when building the pandas DataFrame from the imputed dataset.
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.columns)
to
pd.DataFrame(knn_imputer.fit_transform(data),
             index=data.index,
             columns=data.dropna(axis=1, how='all').columns)
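A small reproduction makes the shape mismatch and the fix easier to see. This is a sketch on made-up data, assuming scikit-learn's default KNNImputer settings, under which a column with no observed values is left out of the output; the column names a, b and c are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Column 'b' has no observed values at all.
data = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, np.nan],
    'c': [2.0, 4.0, np.nan, 8.0],
})

knn_imputer = KNNImputer(n_neighbors=2)
imputed = knn_imputer.fit_transform(data)
print(data.shape, imputed.shape)  # the imputed array has one column fewer

# Label the result with only the columns that had at least one value.
kept_columns = data.dropna(axis=1, how='all').columns
result = pd.DataFrame(imputed, index=data.index, columns=kept_columns)
print(result)
```

The dropna(axis=1, how='all') call selects exactly the columns that contain at least one non-missing value, which is why its labels line up with the imputed array.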

Can I use my harddisk as memory when I get MemoryError in Python?

I have a big dask dataframe with 45 million records
that I am trying to pivot using
features_df = df_features.pivot_table(index='filename', columns='code', values='frequency')
but I get this error:
File "C:\Users\ASMGX\Anaconda3\lib\site-packages\pandas\core\sorting.py", line 65, in get_group_index
labels, shape = map(list, zip(*map(maybe_lift, labels, shape)))
File "pandas\_libs\algos_common_helper.pxi", line 361, in pandas._libs.algos.ensure_int64
MemoryError
Is there any way I can use my hard disk as memory, or as a place to keep temp files?

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, while reading from parquet, and setting the index I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
At this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And, when I'm trying to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, skipping the index does not seem like the right approach, since I'm actually trying to merge several of these datasets and thought that setting the index would give a speed-up over not setting any. (I get exactly the same error when trying to merge two dataframes indexed this way.)
I tried to inspect df._meta, and it turns out my categories are "known" as they should be? dask-categoricals
I also read this github post about something that looks similar but somehow did not find a solution.
Thanks for your help,

How to add two columns efficiently in Pandas DataFrame?

I have a quite large dataset (over 6 million rows with just a few columns). When I try to add two float64 columns (data['C'] = data.A + data.B), I get a memory error:
Traceback (most recent call last):
File "01_processData.py", line 354, in <module>
prepareData(snp)
File "01_processData.py", line 161, in prepareData
data['C'] = data.A + data.C
File "/usr/local/lib/python2.7/dist-packages/pandas/core/ops.py", line 480, in wrapper
return_indexers=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tseries/index.py", line 976, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1304, in join
return_indexers=return_indexers)
File "/usr/local/lib/python2.7/dist-packages/pandas/core/index.py", line 1345, in _join_non_unique
how=how, sort=True)
File "/usr/local/lib/python2.7/dist-packages/pandas/tools/merge.py", line 465, in _get_join_indexers
return join_func(left_group_key, right_group_key, max_groups)
File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
I understand that this operation uses the index to align the operands before computing the output, but that seems inefficient here: since the two columns belong to the same DataFrame, they are already perfectly aligned.
I was able to solve the problem by using
data['C'] = data.A.values + data.B.values
but I wonder if there is a method designed to do this or more elegant solution?
I cannot reproduce what you are doing (as it won't hit the alignment issue as the indexes are the same).
In master/0.14 (releasing shortly)
In [2]: df = DataFrame(np.random.randn(6000000,2),columns=['A','C'],index=pd.MultiIndex.from_product([['foo','bar'],range(3000000)]))
In [3]: df.values.nbytes
Out[3]: 96000000
In [4]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 625.839844 MB per loop
However, in 0.13.1 (I do remember some optimizations were put into 0.14):
In [3]: %memit df['D'] = df['A'] + df['C']
maximum of 1: 1113.671875 MB per loop
Do you have a hierarchical index set? My python used to crash with that, but reset_index() prior to summing used to help. However, this was not reproduced by others, so this is not a "guaranteed improvement".
See my post on this
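The traceback in the question passes through _join_non_unique, which suggests the index contained duplicate labels. The sketch below, on made-up data, shows the two workarounds mentioned in this thread side by side: adding the underlying arrays so that no index alignment takes place, and resetting the index before adding. It uses to_numpy(), which on recent pandas versions plays the same role as .values in the question.

```python
import numpy as np
import pandas as pd

# Small frame with duplicate index labels (made-up data).
index = np.repeat(np.arange(3), 2)  # 0, 0, 1, 1, 2, 2
data = pd.DataFrame(
    {'A': np.arange(6, dtype='float64'), 'B': np.ones(6)},
    index=index,
)

# Workaround 1: add the raw arrays, which skips index alignment.
data['C'] = data['A'].to_numpy() + data['B'].to_numpy()

# Workaround 2: replace the index with a unique default one first.
flat = data.reset_index(drop=True)
flat['D'] = flat['A'] + flat['B']

print(data)
print(flat)
```

Both give the same element-wise sums. The array version is only safe when the operands are known to be in the same row order, which holds for two columns of one DataFrame.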
