Issue when computing/merging dask dataframe(s) when index is categorical - python

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID".
However, after reading from parquet and setting the index, I encountered a "TypeError: to union ordered Categoricals, all categories must be the same" which I did not manage to solve by myself.
This code replicates the issue I'm having:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import traceback
# create ids
ids = ["AAA", "BBB", "CCC", "DDD"]
# create data
df = pd.DataFrame(index=np.random.choice(ids, 50), data=np.random.rand(50, 1), columns=["FOO"]).reset_index().rename(columns={"index": "ID"})
# serialize to parquet
f = r"C:/temp/foo.pq"
df.to_parquet(f, compression='gzip', engine='fastparquet', partition_cols=["ID"])
# read with dask
df = dd.read_parquet(f)
try:
    df = df.set_index("ID")
except Exception as ee:
    print(traceback.format_exc())
at this point I get the following error:
~\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\arrays\categorical.py in check_for_ordered(self, op)
1492 if not self.ordered:
1493 raise TypeError(
-> 1494 f"Categorical is not ordered for operation {op}\n"
1495 "you can use .as_ordered() to change the "
1496 "Categorical to an ordered one\n"
TypeError: Categorical is not ordered for operation max
you can use .as_ordered() to change the Categorical to an ordered one
I then did:
# we order the categorical
df.ID = df.ID.cat.as_ordered()
df = df.set_index("ID")
And, when I'm trying to use df.compute(scheduler="processes"), I get the TypeError I mentioned before:
try:
    schd_str = 'processes'
    aa = df.compute(scheduler=schd_str)
    print(f"{schd_str}: OK")
except:
    print(f"{schd_str}: KO")
    print(traceback.format_exc())
gives:
Traceback (most recent call last):
File "<ipython-input-6-e15c4e86fee2>", line 3, in <module>
aa = df.compute(scheduler=schd_str)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 166, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in compute
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\base.py", line 438, in <listcomp>
return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 103, in finalize
return _concat(results)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\core.py", line 98, in _concat
else methods.concat(args2, uniform=True, ignore_index=ignore_index)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 431, in concat_pandas
ind = concat([df.index for df in dfs])
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 383, in concat
ignore_index=ignore_index,
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\dask\dataframe\methods.py", line 400, in concat_pandas
return pd.CategoricalIndex(union_categoricals(dfs), name=dfs[0].name)
File "C:\Users\xxx\.conda\envs\env_dask_py37\lib\site-packages\pandas\core\dtypes\concat.py", line 352, in union_categoricals
raise TypeError("Categorical.ordered must be the same")
TypeError: Categorical.ordered must be the same
Surprisingly enough, using df.compute(scheduler="threads"), df.compute(scheduler="synchronous"), or not setting the index at all works properly.
However, not setting the index does not seem like something I should settle for, since I'm actually trying to merge several of these datasets and expected that setting the index would give a speed-up over not setting one. (I get the exact same error when trying to merge two dataframes indexed this way.)
I tried to inspect df._meta, and it turns out my categories are "known", as they should be.
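One way to double-check what each partition actually carries, beyond df._meta, is to pull the index dtype out of every partition and compare (a rough sketch, run with the threaded scheduler since that one works here):
import dask

# what dask advertises for the whole collection
print(df._meta.index.dtype)

# what each partition actually holds
parts = df.to_delayed()
per_partition = dask.compute(*[p.index.dtype for p in parts], scheduler="threads")
for i, dt in enumerate(per_partition):
    print(i, dt)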
I also read this github post about something that looks similar but somehow did not find a solution.
Thanks for your help,

Related

MemoryError from melt or concat with large data

I get the error when I try to run pd.melt().
I checked this post and tried to modify the code, but I still get the error. (LINK)
Here is my original code:
melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')
After modifying:
pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
melted = pd.concat(pivot_list).sort_values('ID')
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
return op.get_result()
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
new_data = concatenate_managers(
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
values = _concatenate_join_units(join_units, concat_axis, copy=copy)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
to_concat = [
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
values = algos.take_nd(values, indexer, axis=ax)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
I think the main issue comes from the melt() and concat() parts.
Any ideas on how to deal with this would be appreciated.
Usually, when you get a "MemoryError: unable to allocate" error, this falls into the "user error" category of requesting a reshape operation which is simply too large to fit into memory.
pd.melt is a memory-intensive operation which not only requires creating new arrays for all elements in your data, it also reshapes your data into a less efficient format, creating many duplicates of the current values. The result and the memory penalty will depend on the structure of your data and the number of value columns.
Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all elements in your id_vars column and repeat them for all columns specified by value_vars.
As an example, if your dataframe has 1M rows and 1000 columns, with all cells as float32, the dataframe would take up approximately 4GB in memory. If you then melt and specify 4 id_vars, you'll have 4*1M id cells, each repeated once per value column (996 times), giving 4*1e6*996, i.e. roughly 4Bn cells for the index. Additionally, you'll have a "variable" column with 1e6*996 entries and a "value" column of the same length. You'd need to know the length and dtype of the column names and the data types of the cells, but this simple example would result in roughly a 23 GB array even if all values were relatively compact float32s.
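A rough back-of-the-envelope version of that estimate in code (a sketch which, like the paragraph above, optimistically prices every resulting cell at 4 bytes; object columns and long strings cost considerably more):
n_rows, n_cols, n_id_vars = 1_000_000, 1000, 4
n_value_vars = n_cols - n_id_vars              # 996 columns get melted
melted_rows = n_rows * n_value_vars            # ~996 million result rows
cells = melted_rows * (n_id_vars + 2)          # id_vars + "variable" + "value"
print(f"{cells:,} cells, ~{cells * 4 / 1e9:.1f} GB at 4 bytes per cell")
# 5,976,000,000 cells, ~23.9 GB at 4 bytes per cell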
Melt is a helpful convenience function for reshaping small dataframes. If you have a dataframe anywhere near the size in this example, I'd mostly suggest you don't do this. If you really do need to reshape this way, then you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size, and you may want to write out the data iteratively rather than attempting to concatenate everything at the end. This isn't something that will work out of the box - expect some trial & error. You could also look into out-of-core computation tools - dask.dataframe has a port of melt which can leverage multiple cores and write in parallel to disk.
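To make the "write it out iteratively" suggestion concrete, here is a minimal sketch that reuses the chunking loop from the question but appends each melted chunk to disk instead of holding everything for a final concat; the output path is made up, and the result on disk is not sorted by 'ID' (a global sort_values('ID') is exactly the kind of whole-dataset operation this avoids):
import pandas as pd

id_vars = ['ID', 'Col2', 'Col3', 'Year']
chunk_size = 100000
out_path = 'melted.csv'                          # hypothetical output file

for i in range(0, len(df), chunk_size):
    chunk = pd.melt(df.iloc[i:i + chunk_size], id_vars,
                    var_name='New_Var', value_name='Value')
    # write the header only for the first chunk, then append
    chunk.to_csv(out_path, mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)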

Python3 Pandas - handle overflow when casting to number greater than data type int64

I am writing a standard script where I fetch data from a database, do some manipulation, and insert the data back into another table.
I am facing an overflow issue while converting a column's type in a DataFrame.
Here's an example:
import numpy as np
import pandas as pd
d = {'col1': ['66666666666666666666666666666']}
df = pd.DataFrame(data=d)
df['col1'] = df['col1'].astype('int64')
print(df)
Error :
Traceback (most recent call last):
File "HelloWorld.py", line 6, in <module>
df['col1'] = df['col1'].astype('int64')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 5548, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 604, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 409, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 595, in astype
values = astype_nansafe(vals1d, dtype, copy=True)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/cast.py", line 974, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 615, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long
I cannot control the values inside d['col1'] because in the actual code it is being generated by another function.
How can I solve this problem if I want to keep the final data type as 'int64'?
I was thinking of catching the exception and then assigning the largest int64 value to the whole column, but then the rows of the column which are not overflowing would also end up with inconsistent results.
Can you advise me on some elegant solutions here?
Building on your idea, you can use np.iinfo:
ii64 = np.iinfo(np.int64)
df['col1'] = df['col1'].astype('float128').clip(ii64.min, ii64.max).astype('int64')
print(df)
# Output
col1
0 9223372036854775807
Take care of the limit of float128 too :-D
>>> np.finfo(np.float128)
finfo(resolution=1e-18, min=-1.189731495357231765e+4932, max=1.189731495357231765e+4932, dtype=float128)
>>> np.iinfo('int64')
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
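If you'd rather avoid the float128 detour (np.float128 is not available on all platforms, e.g. on Windows builds of NumPy), a hedged alternative sketch is to clip at the Python-int level before casting:
import numpy as np
import pandas as pd

ii64 = np.iinfo(np.int64)
d = {'col1': ['66666666666666666666666666666', '42']}
df = pd.DataFrame(data=d)

# Python ints have arbitrary precision, so the comparison never overflows
df['col1'] = [max(ii64.min, min(int(x), ii64.max)) for x in df['col1']]
df['col1'] = df['col1'].astype('int64')
print(df)
#                   col1
# 0  9223372036854775807
# 1                   42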

Cannot concatenate object when adding to a DataFrame

I am trying to add a sentence as well as a coin (like a label in this case, I guess) to a DataFrame, but I keep getting this error:
Traceback (most recent call last):
File "c:\Users\gjohn\Documents\code\machineLearning\trading_bot\filter.py", line 132, in <module>
df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\series.py", line 2877, in append
return concat(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 294, in concat
op = _Concatenator(
File "C:\Users\gjohn\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\reshape\concat.py", line 384, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'dict'>'; only Series and DataFrame objs are valid
Here is the code:
data = pd.read_csv('C:\\Users\\gjohn\\Documents\\code\\machineLearning\\trading_bot\\testreviews.csv')
df = data['review'] # Create a dataframe of the reviews.
classes = data['class'] # Create a dataframe of the classes.
for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin != None:
        df = df.append({'coin': coin, 'review': sentence}, ignore_index=True)
I can't figure out how to fix this and I really need help; it would be great if you could help me out. Thanks!
Also sorry for the messy code :D
What is the sentence you use to construct the dictionary?
Perhaps you should check if the dictionary is constructed correctly?
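Worth noting: the traceback goes through pandas\core\series.py because data['review'] selects a single column, so df is a Series, and Series.append only accepts Series or DataFrame objects (as the TypeError says). A hedged sketch of one way around that, reusing data, sentences, find_coin and common_words from the question:
import pandas as pd

df = data[['review']].copy()                     # a DataFrame, not a Series
new_rows = []
for sentence in sentences:
    coin = find_coin(common_words, sentence)
    if len(sentence) > 0 and coin is not None:
        new_rows.append({'coin': coin, 'review': sentence})

# one concat at the end instead of repeated appends
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)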

pandas.DataFrame.agg does not work with np.std?

I am trying to use the pandas.DataFrame.agg function on the first column of a dataframe, with numpy.std as the aggregation function.
I don't know why it works with numpy.mean but not with numpy.std.
Can someone tell me under what circumstances this happens?
This is very strange.
The following describes what I am facing.
My source is like this:
print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.mean]})
print(agg_df)
Then it shows this result:
<class 'pandas.core.frame.DataFrame'>
ax
0 -98.06
1 -97.81
2 -96.00
3 -93.44
4 -92.94
ax
mean -98.06
Now I change the function from np.mean to np.std (without changing anything else):
print(type(dataframe))
print(dataframe.head(5))
first_col = dataframe.columns.values[0]
agg_df = dataframe.agg({first_col: [np.std]})
print(agg_df)
It shows this error:
Traceback (most recent call last):
File "C:\prediction_framework_django\predictions\predictor.py", line 112, in pre_aggregated_unseen_data
agg_df = dataframe.agg({axis: [np.std]})
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7578, in aggregate
result, how = self._aggregate(func, axis, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\frame.py", line 7609, in _aggregate
return aggregate(self, arg, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 582, in aggregate
return agg_dict_like(obj, arg, _axis), True
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in agg_dict_like
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 768, in <dictcomp>
results = {key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()}
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\series.py", line 3974, in aggregate
result, how = aggregate(self, func, *args, **kwargs)
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 586, in aggregate
return agg_list_like(obj, arg, _axis=_axis), None
File "C:\prediction_framework_django\env\lib\site-packages\pandas\core\aggregation.py", line 672, in agg_list_like
raise ValueError("no results")
ValueError: no results
So the error is
in agg_list_like raise ValueError("no results") ValueError: no results
Thank you for your time and help.
Simply use the pandas builtin:
# Note the use of string to denote the function here
df.agg({first_col: ['mean', 'std']})
# You can also simply use the following
df[first_col].mean()
df[first_col].std()
[EDIT]: The error that you are getting probably results from mixed types. You can check that all dtypes are float by looking at df.dtypes. If one of them is object, convert the problematic values (probably empty strings) into whatever you need, and both np.std and pandas' builtin std should work.
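If you want to act on that [EDIT], a minimal sketch reusing the dataframe and first_col names from the question (to_numeric with errors='coerce' turns non-numeric entries such as empty strings into NaN):
import pandas as pd

print(dataframe.dtypes)                          # any 'object' column is suspect

first_col = dataframe.columns.values[0]
dataframe[first_col] = pd.to_numeric(dataframe[first_col], errors='coerce')
print(dataframe.agg({first_col: ['mean', 'std']}))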

Tableau error "All Fields must be aggregate or constant" when invoking TabPy SCRIPT_REAL

I am calling a TabPy server via a calculated field in a Tableau worksheet to run a hypothesis test: does the rate of Bookings vary significantly by Group?
I have a table such as:
Group Bookings
0 A 1
1 A 0
3998 B 1
3999 B 0
In Python, on the same server (using the python 2.7 docker image) the test I want is simply:
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(df['Group'], df['Bookings'])
prop_test = fisher_exact(df_cont_tbl)
print 'Fisher exact test: Odds ratio = {:.2f}, p-value = {:.3f}'.format(*prop_test)
Returns: Fisher exact test: Odds ratio = 1.21, p-value = 0.102
I connected Tableau to the TabPy server and can execute a hello-world calculated field. For example, I get 42 back with the calculated field: SCRIPT_REAL("return 42", ATTR([Group]),ATTR([Bookings]) )
However, when I try to invoke the stats function above with a calculated field to extract the p-value:
SCRIPT_REAL("
import pandas as pd
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(_arg1, _arg2)
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
", [Group], [Bookings] )
I get the notification "The calculation contains errors", with the drop-down detail "All fields must be aggregate or constant when using table calculation functions or fields from multiple data sources".
I tried wrapping the inputs with ATTR(), as in:
SCRIPT_REAL("
import pandas as pd
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(_arg1, _arg2)
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
",ATTR([Group]), ATTR([Bookings])
)
This changes the notification to "The calculation is valid", but returns a pandas ValueError from the server:
An error occurred while communicating with the External Service.
Error processing script
Error when POST /evaluate: Traceback
Traceback (most recent call last):
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tabpy_server/tabpy.py", line 467, in post
result = yield self.call_subprocess(function_to_evaluate, arguments)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/concurrent.py", line 232, in result
raise_exc_info(self._exc_info)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1014, in run
yielded = self.gen.throw(*exc_info)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tabpy_server/tabpy.py", line 488, in call_subprocess
ret = yield future
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/tornado/gen.py", line 1008, in run
value = future.result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_base.py", line 400, in result
return self.__get_result()
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_base.py", line 359, in __get_result
reraise(self._exception, self._traceback)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/_compat.py", line 107, in reraise
exec('raise exc_type, exc_value, traceback', {}, locals_)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/concurrent/futures/thread.py", line 61, in run
result = self.fn(*self.args, **self.kwargs)
File "<string>", line 5, in _user_script
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/tools/pivot.py", line 479, in crosstab
df = DataFrame(data)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 266, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 402, in _init_dict
return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 5398, in _arrays_to_mgr
index = extract_index(arrays)
File "/opt/conda/envs/Tableau-Python-Server/lib/python2.7/site-packages/pandas/core/frame.py", line 5437, in extract_index
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Error type : ValueError
Error message : If using all scalar values, you must pass an index
Example dataset:
To generate the CSV I am connecting to:
import os
import pandas as pd
import numpy as np
from collections import namedtuple
OUTPUT_LOC = os.path.expanduser('~/TabPy_demo/ab_test_demo_results.csv')
GroupObs = namedtuple('GroupObs', ['name','n','p'])
obs = [GroupObs('A',3000,.10),GroupObs('B',1000,.13)]
# note true odds ratio = (13/87)/(10/90) = 1.345
np.random.seed(2019)
df = pd.concat(
    [pd.DataFrame({'Group': grp.name,
                   'Bookings': pd.Series(np.random.binomial(n=1, p=grp.p, size=grp.n))})
     for grp in obs],
    ignore_index=True)
df.to_csv(OUTPUT_LOC, index=False)
Old question, but perhaps this will help someone else. There are a couple of issues. The first is the way the data is passed to pd.crosstab: Tableau passes lists of values to the TabPy server, so wrap them in arrays to fix the error you are getting.
SCRIPT_REAL(
"
import pandas as pd
import numpy as np
from scipy.stats import fisher_exact
df_cont_tbl = pd.crosstab(np.array(_arg1), np.array(_arg2))
prop_test = fisher_exact(df_cont_tbl)
return prop_test[1]
",
attr([Group]), attr([Bookings])
)
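As a quick way to sanity-check the Python body outside Tableau, here is a hedged sketch; _arg1/_arg2 are made-up stand-ins for what TabPy receives from ATTR([Group]) and ATTR([Bookings]), i.e. plain Python lists with one element per mark:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

_arg1 = ['A', 'A', 'A', 'B', 'B', 'B']   # stand-in Group values
_arg2 = [1, 0, 0, 1, 1, 0]               # stand-in Bookings values

df_cont_tbl = pd.crosstab(np.array(_arg1), np.array(_arg2))
odds_ratio, p_value = fisher_exact(df_cont_tbl)
print(odds_ratio, p_value)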
Another problem is the way the table calculation is being performed. You want to send TabPy two lists of information, each as long as your table; in the default case Tableau wants to calculate at the row level, which is not going to work.
I included the row count F1 in the CSV that I built the workbook on and made sure to compute the Python value along that field.
Now when you put F1 into the worksheet, the calculation will return the p-value as many times as you have rows. A workaround is to wrap your calculation in another calculation that only returns the value for the first row, and place that in your worksheet.
