Let's say I'm in the following situation:
import pandas as pd
import dask.dataframe as dd
import random
s = "abcd"
lst = 10*[0]+list(range(1,6))
n = int(1e2)
df = pd.DataFrame({"col1": [random.choice(s) for i in range(n)],
"col2": [random.choice(lst) for i in range(n)]})
df["idx"] = df.col1
df = df[["idx","col1","col2"]]
def fun(data):
    if data["col2"].mean() > 1:
        return 2
    else:
        return 1
df.set_index("idx", inplace=True)
ddf1 = dd.from_pandas(df, npartitions=4)
gpb = ddf1.groupby("col1").apply(fun, meta=pd.Series(name='col3'))
ddf2 = ddf1.join(gpb.to_frame(), on="col1")
While ddf1.known_divisions is True, ddf2.known_divisions is False. I would like to preserve the same divisions on the ddf2 dataframe.
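For reference, this is how I inspect the divisions (the concrete division values come from one run; with an index drawn from the letters of "abcd" they look something like the tuple in the comment):
print(ddf1.known_divisions)  # True
print(ddf1.divisions)        # e.g. ('a', 'b', 'c', 'd', 'd')
print(ddf2.known_divisions)  # False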
In one random run I even got an empty partition:
for i in range(ddf1.npartitions):
    print(i, len(ddf1.get_partition(i)), len(ddf2.get_partition(i)))
0 27 50
1 29 0
2 23 21
3 21 29
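One workaround I have considered (just a sketch; I am not sure it is the intended approach, and it adds a shuffle) is to rebuild the index after the join so that dask recomputes the divisions:
ddf2 = ddf1.join(gpb.to_frame(), on="col1")
ddf2 = ddf2.reset_index().set_index("idx")  # set_index recomputes divisions, at the cost of a shuffle
print(ddf2.known_divisions)  # now True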
I have a dataframe called 'erm' like this:
[screenshot of the erm dataframe]
I would like to add a new column 'typeRappel' with value 1 wherever erm['Calcul'] has the value 4.
This is my code:
# IF ( calcul = 4 ) TypeRappel = 1.
# erm.loc[erm.Calcul = 4, "typeRappel"] = 1
#erm["typeRappel"] = np.where(erm['Calcul'] = 4.0, 1, 0)
# erm["Terminal"] = ["1" if c = "010" for c in erm['Code']]
# erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
import numpy as np
import pandas as pd
erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
But this code gives me an error like this:
[screenshot of the error message]
What could be the problem?
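The immediate problem is the list comprehension syntax: a conditional expression used as the produced value needs an else branch, so [1 if x == 4 for x in ...] is a SyntaxError (and the commented attempts fail for a related reason: they use the assignment operator = where the comparison == is needed). A minimal fix that keeps the comprehension, assuming you want 0 in the non-matching rows:
erm['typeRappel'] = [1 if x == 4 else 0 for x in erm['Calcul']]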
Alternatively, you can achieve what you want using a lambda with apply:
import pandas as pd
df = pd.DataFrame(
    data=[[1, 2], [4, 5], [7, 8], [4, 11]],
    columns=['Calcul', 'other_col']
)
df['typeRappel'] = df['Calcul'].apply(lambda x: 1 if x == 4 else None)
This results in:
   Calcul  other_col  typeRappel
0       1          2         NaN
1       4          5         1.0
2       7          8         NaN
3       4         11         1.0
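If you would rather have 0 than NaN in the non-matching rows (keeping the column integer), return 0 instead of None:
df['typeRappel'] = df['Calcul'].apply(lambda x: 1 if x == 4 else 0)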
You have two ways to do this.
First way: use the .loc method, since you have just one condition:
df['new']=None
df.loc[df.calcul.eq(4), 'new'] =1
Second way: use the numpy.select method:
import numpy as np
cond=[df.calcul.eq(4)]
df['new']= np.select(cond, [1], None)
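Note that with None as the default, np.select gives you an object-dtype column; if you would rather keep a numeric column with NaN for the non-matching rows, use np.nan as the default (a small variation, not from the original answer):
df['new'] = np.select(cond, [1], np.nan)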
Applied to your erm dataframe:
import numpy as np
import pandas as pd
# first way
# erm['typeRappel'] = None
erm.loc[erm.Calcul.eq(4), 'typeRappel'] = 1
# second way
cond = [erm.Calcul.eq(4)]
erm['ok'] = np.select(cond, [1], None)
I'm looking to make it so that NaN values in a dataframe are filled in by the mean of all the values up to that point, as such:
A
0 1
1 2
2 3
3 4
4 5
5 NaN
6 NaN
7 11
8 NaN
Would become
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
You can solve it by running the following code
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, pd.NA, pd.NA, 11, pd.NA]
})
for idx in df[pd.isna(df["A"])].index:
    df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
It iterates over each NaN and fills it with the mean of all the values before it, including previously filled NaNs. For instance, index 5 gets mean(1, 2, 3, 4, 5) = 3, and index 8 gets mean(1, 2, 3, 4, 5, 3, 3, 11) = 32 / 8 = 4.
At the end you will have:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
EDIT
As stated by RichieV, performance may be an issue with this solution (its runtime complexity is O(N^2)) when there are many NaNs, and we should also avoid Python iterations, since they are slow compared to native pandas / numpy calls.
Here is an optimized version:
last_idx = None
cumsum = 0
cumnum = 0
for idx in df[pd.isna(df["A"])].index:
    prev_values = df.loc[last_idx:idx, "A"]
    # .loc label slicing is inclusive of both endpoints, so we drop idx itself
    prev_values = prev_values[:-1]
    cumsum += prev_values.sum()
    cumnum += len(prev_values)
    df.loc[idx, "A"] = int(cumsum / cumnum)
    last_idx = idx
Result:
>>> df
A
0 1
1 2
2 3
3 4
4 5
5 3
6 3
7 11
8 4
Since in the worst case the script passes over the dataframe twice, the runtime complexity is now O(N).
Marco's answer works fine, but it can be optimized with the incremental average formulas from math.stackexchange.com.
Here is an adaptation of that other question (not the exact formula, just the concept).
cumsum = 0
expanding_mean = []
for i, xi in enumerate(df['A']):
    if pd.isna(xi):
        mean = cumsum / i  # divide by the number of items up to the previous row
        expanding_mean.append(mean)
        cumsum += mean
    else:
        cumsum += xi
df.loc[df['A'].isna(), 'A'] = expanding_mean
The main advantage of this code is not having to read all items up to the current index on each iteration to get the mean.
This option still uses a Python loop--which is not the best choice with pandas--but there seems to be no way around it for this use case (hopefully someone will get inspired by this and find such a method without a loop).
Performance tests
Three alternative functions were defined:
incremental: My answer.
from_origin: Marco's original answer.
incremental_pandas: Marco's updated answer.
Tests were done using the timeit module with 3 repetitions on random samples with a 0.4 probability of NaN.
Full code for testing
import pandas as pd
import numpy as np
import timeit
import collections
from matplotlib import pyplot as plt

def incremental(df: pd.DataFrame):
    # error handling: if the first value is NaN there is no prior mean, so seed it with 0
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    cumsum = 0
    expanding_mean = []
    for i, xi in enumerate(df['A']):
        if pd.isna(xi):
            mean = cumsum / i  # divide by the number of items up to the previous row
            expanding_mean.append(mean)
            cumsum += mean
        else:
            cumsum += xi
    df.loc[df['A'].isna(), 'A'] = expanding_mean
    return df

def incremental_pandas(df: pd.DataFrame):
    # error handling: if the first value is NaN there is no prior mean, so seed it with 0
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    last_idx = None
    cumsum = 0
    cumnum = 0
    for idx in df[pd.isna(df["A"])].index:
        prev_values = df.loc[last_idx:idx, "A"]
        # .loc label slicing is inclusive of both endpoints, so we drop idx itself
        prev_values = prev_values[:-1]
        cumsum += prev_values.sum()
        cumnum += len(prev_values)
        df.loc[idx, "A"] = cumsum / cumnum
        last_idx = idx
    return df

def from_origin(df: pd.DataFrame):
    # error handling: if the first value is NaN there is no prior mean, so seed it with 0
    if pd.isna(df.iloc[0, 0]):
        df.iloc[0, 0] = 0
    for idx in df[pd.isna(df["A"])].index:
        df.loc[idx, "A"] = np.mean(df.loc[:idx, "A"])
    return df

def get_random_sample(n, p):
    np.random.seed(123)
    return pd.DataFrame({'A':
        np.random.choice(list(range(10)) + [np.nan],
                         size=n, p=[(1 - p) / 10] * 10 + [p])})

r = 3
p = 0.4  # portion of NaNs

# check result from all functions
results = []
for func in [from_origin, incremental, incremental_pandas]:
    random_df = get_random_sample(1000, p)
    new_df = random_df.copy(deep=True)
    results.append(func(new_df))
print('Passed' if all(np.allclose(r, results[0]) for r in results[1:])
      else 'Failed', 'implementation test')

timings = {}
for n in np.geomspace(10, 10000, 10):
    random_df = get_random_sample(int(n), p)
    timings[n] = collections.defaultdict(float)
    results = {}
    for func in ['incremental', 'from_origin', 'incremental_pandas']:
        timings[n][func] = (
            timeit.timeit(f'{func}(random_df.copy(deep=True))', number=r, globals=globals())
            / r
        )

timings = pd.DataFrame(timings).T
print(timings)

timings.plot()
plt.xlabel('size of array')
plt.ylabel('avg runtime (s)')
plt.ylim(0)
plt.grid(True)
plt.tight_layout()
plt.show()
plt.close('all')
I would like to print the first and last 5 rows of my one-hot encoded data. The code is below. When it prints, the first and last 30 rows are shown instead.
Code:
from random import randint
import pandas_datareader.data as web
import pandas as pd
import datetime
import itertools as it
import numpy as np
import csv

df = pd.read_csv(r'C:\Users\GrahamFam\Desktop\Data Archive\Daily3mid(Archive).txt')
df.columns = ['Date', 'b1', 'b2', 'b3']
df = df.set_index('Date')
reversed_df = df.iloc[::-1]
n = 5
#print(reversed_df.drop(df.index[n:-n]))

df = pd.read_csv(r'C:\Users\GrahamFam\Desktop\Data Archive\Daily3eve(Archive).txt')
df.columns = ['Date', 'b1', 'b2', 'b3']
df = df.set_index('Date')
reversed_df = df.iloc[::-1]
n = 5
print(reversed_df.drop(df.index[n:-n]), "\n")

BallOne = pd.get_dummies(reversed_df.b1)
BallTwo = pd.get_dummies(reversed_df.b2)
BallThree = pd.get_dummies(reversed_df.b3)
print(BallOne)
print(BallTwo)
print(BallThree)
You can use the head and tail functions (the pandas documentation describes both):
>>> DataFrame.head(n)
>>> DataFrame.tail(n)
where n is the number of rows you want.
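Applied to the frames from the question (BallOne being the dummies frame built above), a sketch:
n = 5
print(BallOne.head(n))
print(BallOne.tail(n))
# or the two stacked into a single frame:
print(pd.concat([BallOne.head(n), BallOne.tail(n)]))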
If you're fine displaying the tail before the head, then you can use np.r_ slicing from negative to positive:
import pandas as pd
import numpy as np
df = pd.DataFrame(list(range(30)))
df.iloc[np.r_[-3:3]]
# 0
#27 27
#28 28
#29 29
#0 0
#1 1
#2 2
Otherwise slice explicitly:
n = 3
l = len(df)
df.iloc[np.r_[0:n, l-n:l]]
# 0
#0 0
#1 1
#2 2
#27 27
#28 28
#29 29
I have a Pandas dataframe with the columns ['week', 'price_per_unit', 'total_units']. I wish to create a new column called 'weighted_price' as follows: first group by 'week' and then for each week calculate price_per_unit * total_units / sum(total_units) for that week. I have code that does this:
import pandas as pd
import numpy as np

def create_features_by_group(df):
    # first group data
    grouped = df.groupby(['week'])
    df_temp = pd.DataFrame(columns=['weighted_price'])
    # run through the groups and create the weighted_price per group
    for name, group in grouped:
        res = (group['total_units'] * group['price_per_unit']) / np.sum(group['total_units'])
        for idx in res.index:
            df_temp.loc[idx] = [res[idx]]
    df = df.join(df_temp['weighted_price'])  # join returns a new frame, so assign it back
    return df
The only problem is that this is very, very slow. Is there some faster way to do this?
I used the following code to test the function.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['week', 'price_per_unit', 'total_units'])
for i in range(10):
    df.loc[i] = [round(int(i % 3), 0), 10 * np.random.rand(), round(10 * np.random.rand(), 0)]
I think you need to do it this way:
df
   price  total_units  week
0      5          100     1
1      7          200     1
2      9          150     2
3     11          250     2
4     13          125     2

def fun(table):
    table['measure'] = table['price'] * (table['total_units'] / table['total_units'].sum())
    return table

df.groupby('week').apply(fun)
   price  total_units  week   measure
0      5          100     1  1.666667
1      7          200     1  4.666667
2      9          150     2  2.571429
3     11          250     2  5.238095
4     13          125     2  3.095238
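As a side note (not part of the original answer): the same measure can be computed without apply at all, using a groupby transform, which stays fully vectorized and is usually faster:
df['measure'] = df['price'] * df['total_units'] / df.groupby('week')['total_units'].transform('sum')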
I have grouped the dataset by 'Week' to calculate the weighted price for each week.
Then I joined the original dataset with the grouped dataset to get the result:
# importing the libraries
import pandas as pd
import numpy as np

# creating the dataset
df = {
    'Week': [1, 1, 1, 1, 2, 2],
    'price_per_unit': [10, 11, 22, 12, 12, 45],
    'total_units': [10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(df)
df['price'] = df['price_per_unit'] * df['total_units']

# calculate the total sales and total number of units sold in each week
df_grouped_week = df.groupby(by='Week').agg({'price': 'sum', 'total_units': 'sum'}).reset_index()

# calculate the weighted price
df_grouped_week['wt_price'] = df_grouped_week['price'] / df_grouped_week['total_units']

# merging df and df_grouped_week
df_final = pd.merge(df, df_grouped_week[['Week', 'wt_price']], how='left', on='Week')
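For the toy data above the weekly figures can be checked by hand: week 1 sells 550 for 40 units (wt_price 13.75) and week 2 sells 570 for 20 units (wt_price 28.5), so df_final should look like this:
print(df_final)
#    Week  price_per_unit  total_units  price  wt_price
# 0     1              10           10    100     13.75
# 1     1              11           10    110     13.75
# 2     1              22           10    220     13.75
# 3     1              12           10    120     13.75
# 4     2              12           10    120     28.50
# 5     2              45           10    450     28.50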
In pandas, I use the typical pattern below to apply a vectorized function to a df and return multiple values. This is really only necessary when the said function produces multiple independent outputs from a single task. See my overly trivial example:
import pandas as pd

df = pd.DataFrame({'val1': [1, 2, 3, 4, 5],
                   'val2': [1, 2, 3, 4, 5]})

def myfunc(in1, in2):
    out1 = in1 + in2
    out2 = in1 * in2
    return (out1, out2)
df['out1'], df['out2'] = zip(*df.apply(lambda x: myfunc(x['val1'], x['val2']), axis=1))
Currently I write a separate function to chunk the pandas df and using multiprocessing for efficiency gains, but I would like to use dask to accomplish this task instead. Continuing the example, here is how I would run a vectorized function to return a single value when using dask:
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)

def simple_func(in1, in2):
    out1 = in1 + in2
    return out1

df['out3'] = ddf.map_partitions(lambda x: simple_func(x['val1'], x['val2']), meta=(None, 'i8')).compute()
Now I would like to use dask and return two values as in the pandas example. I have tried to add a list to meta and return a tuple but just get errors. Is this possible in dask and how?
I think the problem here stems from the way you are combining your results, which is not great. Ideally you would use df.apply with the result_type='expand' argument and then use df.merge. Porting this code from Pandas to Dask is trivial. For pandas this would be:
Pandas
import pandas as pd

def return_two_things(x, y):
    return (
        x + y,
        x * y,
    )

def pandas_wrapper(row):
    return return_two_things(row['val1'], row['val2'])

df = pd.DataFrame({
    'val1': range(1, 6),
    'val2': range(1, 6),
})

res = df.apply(pandas_wrapper, axis=1, result_type='expand')
res.columns = ['out1', 'out2']
full = df.merge(res, left_index=True, right_index=True)
print(full)
Which outputs:
val1 val2 out1 out2
0 1 1 2 1
1 2 2 4 4
2 3 3 6 9
3 4 4 8 16
4 5 5 10 25
Dask
For Dask, applying the function to the data and collating the results is virtually identical:
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
# here 0 and 1 refer to the default column names of the resulting dataframe
res = ddf.apply(pandas_wrapper, axis=1, result_type='expand', meta={0: int, 1: int})
# which are renamed to out1 and out2 here
res.columns = ['out1', 'out2']
# this merge is considered "embarrassingly parallel", as a worker does not need to contact
# any other workers when it is merging the results (that it created) with the input data it used.
full = ddf.merge(res, left_index=True, right_index=True)
print(full.compute())
Output:
val1 val2 out1 out2
0 1 1 2 1
1 2 2 4 4
2 3 3 6 9
3 4 4 8 16
4 5 5 10 25
Late to the party; perhaps this was not possible back when the question was asked.
I don't like the ending assignment pattern: as far as I can tell, dask does not allow new column assignment the way pandas does.
You need to set the meta value to the basic type you are returning. From my testing you can return a dict, tuple, set, or list quite simply, and meta doesn't actually seem to care whether the type matches the type of the returned object anyway.
import pandas
import dask.dataframe

def myfunc(in1, in2):
    out1 = in1 + in2
    out2 = in1 * in2
    return (out1, out2)

df = pandas.DataFrame({'val1': [1, 2, 3, 4, 5],
                       'val2': [1, 2, 3, 4, 5]})
ddf = dask.dataframe.from_pandas(df, npartitions=2)

# pandas version, for comparison
df['out1'], df['out2'] = zip(*df.apply(lambda x: myfunc(x['val1'], x['val2']), axis=1))

# dask version
output = ddf.map_partitions(lambda part: part.apply(lambda x: myfunc(x['val1'], x['val2']), axis=1), meta=tuple).compute()
out1, out2 = zip(*output)
ddf = ddf.assign(out1=pandas.Series(out1))
ddf = ddf.assign(out2=pandas.Series(out2))

print('\nPandas\n', df)
print('\nDask\n', ddf.compute())
print('\nEqual\n', ddf.eq(df).compute().all())
This outputs:
Pandas
val1 val2 out1 out2
0 1 1 2 1
1 2 2 4 4
2 3 3 6 9
3 4 4 8 16
4 5 5 10 25
Dask
val1 val2 out1 out2
0 1 1 2 1
1 2 2 4 4
2 3 3 6 9
3 4 4 8 16
4 5 5 10 25
Equal
val1 True
val2 True
out1 True
out2 True
dtype: bool
It helps to note that the lambda passed to map_partitions receives a partition of the larger dataframe (based, in this case, on your npartitions value), which you can then treat like any other dataframe with your .apply().
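Building on that note, here is a sketch (my own variant, not from the answers above) that does the expansion inside each partition and returns a DataFrame, so the result keeps its index and partitioning and the zip/assign round-trip is avoided; dask infers the meta here, though you could pass it explicitly to silence the inference warning:
def per_partition(part):
    # compute both outputs within the partition and return a regular DataFrame
    out = part.copy()
    out['out1'] = part['val1'] + part['val2']
    out['out2'] = part['val1'] * part['val2']
    return out

ddf2 = ddf.map_partitions(per_partition)
print(ddf2.compute())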