I'm trying to run a function over many partitions of a Dask dataframe. The code requires unpacking tuples, and it works well with Pandas but not with Dask's map_partitions. The data consists of lists of tuples; the lists can vary in length, but the tuples always have a known, fixed length.
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
            # Do more stuff
# Create Pandas dataframe
# Each list may have a different length, tuples have fixed known length
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
# Pandas to Dask
ddf = dd.from_pandas(df, npartitions=2)
# Run function over Pandas dataframe
func(df)
# Run function over Dask dataframe
ddf.map_partitions(func).compute()
Here, the Pandas version runs with no issues. However, the Dask one raises the error:
ValueError: Metadata inference failed in `func`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
ValueError('not enough values to unpack (expected 2, got 1)')
In my original function, I'm using these tuples as auxiliary variables, and the data which is finally returned is completely different so using meta doesn't fix the problem. How can I unpack the tuples?
When you use map_partitions without specifying meta, Dask will run your function on a small sample dataframe to infer what the output looks like. This can cause problems if your function is not compatible with that sample. You can inspect the sample with ddf._meta_nonempty; in this case column A contains the placeholder string 'foo' rather than a list of tuples. Iterating over 'foo' yields single characters, so x, y = t tries to unpack a one-character string into two names, which is exactly the original ValueError shown above.
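A minimal way to see this, assuming the ddf from the question:

print(ddf._meta_nonempty)    # column A holds the placeholder string 'foo'
func(ddf._meta_nonempty)     # reproduces: ValueError: not enough values to unpack (expected 2, got 1)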
An easy fix in this case is to provide meta. It's fine for the returned data to have a completely different format: e.g. if each partition's result is a list, you can pass meta=list:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, y = t
    return [1, 2, 3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.map_partitions(func, meta=list).compute()
Another approach is to make your function compatible with the sample dataframe. The sample has an object column, but it contains the string 'foo' rather than a list of tuples, so its elements cannot be unpacked into two names. Modifying your function to tolerate values of any length (with x, *y = t) lets the inference succeed:
import dask.dataframe as dd
import pandas as pd
def func(df):
    for index, row in df.iterrows():
        tuples = row['A']
        for t in tuples:
            x, *y = t
    return [1, 2, 3]
df = pd.DataFrame({'A': [[(1, 1), (3, 4)], [(3, 2)]]})
ddf = dd.from_pandas(df, npartitions=2)
# notice that no meta is specified here
ddf.map_partitions(func).compute()
I'm trying to export the following dictionary to Excel using a pandas DataFrame:
results = {'ZF_DTSPP': [735.0500558302846,678.5413714617252,772.0300704610595,722.254907241738,825.2955175305726], 'ZF_DTSPPG': [732.0500558302845,637.4786326591071,655.8462451037873,721.404907241738,821.8455175305724]}
This is my code:
df = pd.DataFrame(data=results, index=[5, 2])
df = (df.T)
print(df)
df.to_excel('dict1.xlsx')
Somehow I always receive the following error:
"ValueError: Shape of passed values is (5, 2), indices imply (2, 2)".
What can I do? How do I need to adapt the index?
Is there a way to compare the different values of "ZF_DTSPP" and "ZF_DTSPPG" directly with python?
You can use pd.DataFrame.from_dict as shown in pandas-from-dict; it builds the row index from the length of the value lists. The original error arises because index=[5, 2] declares only two row labels while each list in results contains five values. Your code then becomes:
df = pd.DataFrame.from_dict(results)
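A minimal end-to-end sketch, assuming the results dict from the question; the last two lines address the side question about comparing ZF_DTSPP and ZF_DTSPPG directly:

df = pd.DataFrame.from_dict(results)   # 5 rows, one column per key
df = df.T                              # one row per key, as in the original code
print(df)
df.to_excel('dict1.xlsx')

# Element-wise comparison of the two value lists:
diff = pd.Series(results['ZF_DTSPP']) - pd.Series(results['ZF_DTSPPG'])
print(diff)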
I want to concatenate the multidimensional output of a NumPy computation back onto the input DataFrame, matching its shape with respect to rows and the selected columns.
But it fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.
I do not want to flatten the indices first - so is there another way to get it to work?
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame({('metrik_0', Timestamp('2020-01-01 00:00:00')): {(1, 1): 2.5393693602911447, (1, 5): 4.316896324314225, (1, 6): 4.271001191238499, (1, 9): 2.8712588011247377, (1, 11): 4.0458495954752545}, ('metrik_0', Timestamp('2020-01-01 01:00:00')): {(1, 1): 4.02779063729038, (1, 5): 3.3849606155101224, (1, 6): 4.284114856052976, (1, 9): 3.980919941298365, (1, 11): 5.042488191587525}, ('metrik_0', Timestamp('2020-01-01 02:00:00')): {(1, 1): 2.374592085569529, (1, 5): 3.3405503781564487, (1, 6): 3.4049690284720366, (1, 9): 3.892686173978996, (1, 11): 2.1876998087043127}})
def compute_return_columns_to_df(df, colums_to_process, axis=0):
    method = 'compute_result'
    renamed_base_levels = map(lambda x: f'{x}_{method}', colums_to_process.get_level_values(0).unique())
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)
    #####
    # perform calculation in numpy here
    # for the sake of simplicity (and as the actual computation is irrelevant - it is omitted in this minimal example)
    result = df[colums_to_process].values
    #####
    result = pd.DataFrame(result, columns=renamed_columns)
    display(result)
    return pd.concat([df, result], axis=1) # fails with: NotImplementedError: Can only union MultiIndex with MultiIndex or Index of tuples, try mi.to_flat_index().union(other) instead.

# I do not want to flatten the indices first - so is there another way to get it to work?
compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])
Your code fails because of these two lines:
result = df[colums_to_process].values
result = pd.DataFrame(result, columns=renamed_columns)
Note that result has column names with the top index level renamed to metrik_0_compute_result (so far OK), but its row index is the default single-level index, composed of consecutive numbers.
Then, when you concatenate df and result, Pandas attempts to
align both source DataFrames on the row index, but they are incompatible
(df has a MultiIndex, whereas result has an "ordinary" index).
Change this part of your code to:
result = df[colums_to_process]
result.columns = renamed_columns
This way result keeps the original index and concat raises no
exception.
Another remark: your function has an axis parameter which is never used. Consider removing it.
Another possible approach
Since result has a default (single level) index, you can leave the
previous part of code as is, but reset the index in df before joining:
return pd.concat([df.reset_index(drop=True), result], axis=1)
This way both DataFrames have the same indices and you can concatenate
them as well.
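For completeness, a minimal sketch of the function with the first fix applied (same names as in the question; the map is wrapped in list() so that set_levels receives a concrete sequence):

def compute_return_columns_to_df(df, colums_to_process):
    method = 'compute_result'
    renamed_base_levels = list(map(lambda x: f'{x}_{method}',
                                   colums_to_process.get_level_values(0).unique()))
    renamed_columns = colums_to_process.set_levels(renamed_base_levels, level=0)
    result = df[colums_to_process]          # plain column selection keeps the MultiIndex rows
    result.columns = renamed_columns
    return pd.concat([df, result], axis=1)  # aligns on the shared row index

compute_return_columns_to_df(df[df.columns[0:3]].head(), df.columns[0:2])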
I have a dataframe, where one column consists of Sympy symbols, and another column that consists of values.
import sympy as sym
import pandas as pd
d1, c, bc = sym.symbols(r"\delta, c, b_c")  # raw string avoids the invalid \d escape
values = [(d1,1),(c,2),(bc,3)]
df = pd.DataFrame(values, columns = ['Symbol', 'Value'])
df['Symbol'].sort_values()
When I run the above, I get the following error (which I expected, because Sympy symbols aren't directly sortable, even though the string each symbol contains is):
TypeError: cannot determine truth value of Relational
Sympy symbols can be sorted via their .name attribute. I've done this with numpy arrays and lists of dictionaries:
import numpy as np
values = np.array(values, dtype = [('Symbol','O'),('Value','O')])
values = sorted(values, key = lambda x: x['Symbol'].name)
display(values)
Output:
>>[(\delta, 1), (b_c, 3), (c, 2)]
I'm wondering if it's possible with dataframes because I'd rather not convert to a different format to apply a sort.
I am not sure this is more efficient, but you could create a new column, e.g. 'Symbol_names', holding only the names and sort by that; you can always drop the helper column afterwards:
df['Symbol_names'] = df['Symbol'].apply(lambda x: x.name)
df = df.sort_values('Symbol_names')\
.drop("Symbol_names", axis=1)\
.reset_index(drop=True) # optional
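Alternatively, on pandas 1.1+ you can skip the helper column with the key argument of sort_values (a small sketch; the key callable receives the whole column, so map over it):

df = df.sort_values('Symbol', key=lambda col: col.map(lambda s: s.name))\
       .reset_index(drop=True)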
I have a large dataframe and would like to update specific values at known row and column indices. I would like to do this without an explicit for loop.
For example:
import string
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10), index = range(10), columns = list(string.ascii_lowercase)[:10])
I have arbitrary arrays of indexes, columns, and values that I would like to use to update df. For example:
update_values = [0,-2,-3]
update_index = [3,5,7]
update_columns = ["d","g","i"]
I can loop over the arrays to update the original dataframe:
for i, j, v in zip(update_index, update_columns, update_values):
    df.loc[i, j] = v
but would like to use a technique not involving an explicit for loop.
Use the underlying numpy values:

indexes = map(df.columns.get_loc, update_columns)
# Note: this relies on df.values being a view into the frame's data. That holds
# for a single-dtype frame like this one, but with mixed dtypes (or with pandas
# copy-on-write enabled) .values can be a copy and the assignment is silently lost.
# update_index works as positions here because the index is range(10).
df.values[update_index, list(indexes)] = update_values
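A sketch of a variant that does not depend on .values being a view, using positional indexers and rebuilding the frame (assumes the imports and arrays from the question):

row_pos = df.index.get_indexer(update_index)      # labels -> integer positions
col_pos = df.columns.get_indexer(update_columns)
arr = df.to_numpy()                               # explicit copy of the data
arr[row_pos, col_pos] = update_values
df = pd.DataFrame(arr, index=df.index, columns=df.columns)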
You can also use loc, which accepts lists of index labels and column names, as in loc[index_labels, column_names]. Note, though, that two lists select the full cross-section, so the line below fills a 3x3 block (broadcasting [0, -2, -3] into each listed row) rather than only the three individual cells the question's loop updates:
df.loc[[3,5,7], ["d","g","i"]] = [0,-2,-3]
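A quick check of that broadcasting behaviour, assuming the df from the question:

df.loc[[3, 5, 7], ["d", "g", "i"]] = [0, -2, -3]
print(df.loc[[3, 5, 7], ["d", "g", "i"]])
#      d    g    i
# 3  0.0 -2.0 -3.0
# 5  0.0 -2.0 -3.0
# 7  0.0 -2.0 -3.0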
I am trying to pass values to stats.friedmanchisquare from a dataframe df that has shape (11, 17).
This is what works for me (only for three rows in this example):
df = df.to_numpy()
print(stats.friedmanchisquare(df[1, :], df[2, :], df[3, :]))
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code is too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.to_numpy()
print(stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])]))
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to a matrix form, leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:

print(stats.friedmanchisquare([row for index, row in df.iterrows()]))
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.to_numpy() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory's and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to unpack the list of rows with the * (unpacking) operator, as follows:

df = df.to_numpy()
print(stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])]))
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print(stats.friedmanchisquare(*[row for index, row in df.iterrows()]))

In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster; as it turns out, converting to a numpy array first is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print(min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000)))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print(min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000)))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing a single list with multiple arrays inside it. stats.friedmanchisquare expects multiple array_like arguments, not one list. Try using the * (star/unpack) operator to unpack the list, like this:
df = df.to_numpy()
print(stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])]))
You could pass it using the "star operator", similar to this:

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
stats.friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))