I'm trying to understand how I can convert a lambda function to a normal one. I have this lambda function that is supposed to fill the null values of each column with the mode:
def fill_nn(data):
    df = data.apply(lambda column: column.fillna(column.mode()[0]))
    return df
I tried this:
def fill_nn(df):
    for column in df:
        if df[column].isnull().any():
            return df[column].fillna(df[column].mode()[0])
Hi 👋 Hope you are doing well!
If I understood your question correctly, the best way would be something like this:
import pandas as pd
def fill_missing_values(series: pd.Series) -> pd.Series:
    """Fill missing values in series/column."""
    value_to_use = series.mode()[0]
    return series.fillna(value=value_to_use)

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [None, 2, 3, 4, None],
        "C": [None, None, 3, 4, None],
    }
)
df = df.apply(fill_missing_values) # type: ignore
print(df)
# A B C
# 0 1 2.0 3.0
# 1 2 2.0 3.0
# 2 3 3.0 3.0
# 3 4 4.0 4.0
# 4 5 2.0 3.0
Personally, though, I would still use the lambda, as it requires less code and is easier to handle (especially for such a small task).
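For comparison, here is a minimal sketch of an explicit loop version. The attempt in the question returns inside the loop, so it stops at the first column that has nulls and hands back a single Series instead of the whole frame. The version below assumes you want a new DataFrame back and leaves a column untouched when it has no mode (an all-null column); the helper name fill_nn_loop is only illustrative.

```python
import pandas as pd


def fill_nn_loop(data: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls in every column with that column's mode (loop version)."""
    df = data.copy()  # do not modify the caller's frame
    for column in df.columns:
        modes = df[column].mode()
        if not modes.empty:  # an all-null column has no mode
            df[column] = df[column].fillna(modes.iloc[0])
    return df  # return once, after every column has been handled


df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "B": [None, 2, 3, 4, None],
        "C": [None, None, 3, 4, None],
    }
)
filled = fill_nn_loop(df)
print(filled)
```

The only structural change from the attempt in the question is that the result is assigned back to the column and the return happens after the loop.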
Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a number, say 5, and return the index where it previously occurred. So for the element 5 at index 6, I should get 4, since the element appears at indices 4 and 6. I want to do this for all of the elements of the series, which can easily be done with a for loop:
prev = []
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()  # index of the first occurrence of x
    prev.append(idx_prev if idx_prev < idx else float('nan'))
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
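A self-contained sketch of the same idea, for checking it against the expected output from the question. The second series (repeated) is made-up data, added only to show that with more than two occurrences each position gets the index of the occurrence just before it.

```python
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])

# Group the index labels by the values of num, then shift within each group:
# every position receives the index label of the previous equal value.
prev_idx = num.index.to_series().groupby(num).shift()
print(prev_idx.tolist())

# Hypothetical data with three occurrences of the same value.
repeated = pd.Series([7, 7, 7])
prev_repeated = repeated.index.to_series().groupby(repeated).shift()
print(prev_repeated.tolist())
```

The result is float because the first occurrence of each value has no predecessor and is filled with NaN.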
It's possible to keep working in numpy.
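The NumPy equivalent of [num[num == x].idxmax() for idx,x in enumerate(num)] is:
# first[i] is the position of the first occurrence of the i-th unique value
_, first, inverse = np.unique(num.values, return_index=True, return_inverse=True)
out = first[inverse]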
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the bad values of out to NaN like this:
out_series = pd.Series(out, dtype=float)  # float dtype so that NaN can be stored
out_series[out >= np.arange(len(out))] = np.nan
I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])[1]
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False) # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
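A note on the 0.0 in the first row: the rolling mean is NaN there for both columns, and sum(axis=1) skips NaN by default, so an all-NaN row sums to 0. A small sketch, assuming you would rather keep NaN for incomplete windows, is to pass min_count=1 to sum:

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

rolled = df.rolling(2).mean()

# Default: NaN values are skipped, so the all-NaN first row sums to 0.0.
default_sum = rolled.sum(axis=1)

# min_count=1 requires at least one valid value, otherwise the result is NaN.
strict_sum = rolled.sum(axis=1, min_count=1)

print(pd.DataFrame({"default": default_sum, "min_count=1": strict_sum}))
```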
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
I gather after researching coint that you are trying to pass two rolling arrays to f as x['a'] and x['b']. The following will create the arrays and the dataframe.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use .apply(f) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance of this will be like, if it is what you are looking for?
Hats off to @Zero, who really had the solution for this problem here.
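Another way to get all columns of a window into one function, sketched here under a few assumptions: roll over a single column with raw=False, so the function receives a Series that still carries its index, and use that index to look up the matching rows of the full frame. The statistic below (pair_stat) is a made-up stand-in that only shows both columns being available; it is not coint, and the data is invented so the numbers are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame(data={"a": [1, 2, 4, 7], "b": [5, 6, 7, 9]})


def pair_stat(window: pd.DataFrame) -> float:
    """Hypothetical two-column statistic: sum of a*b over the window."""
    return float((window["a"] * window["b"]).sum())


# raw=False passes each window as a Series, so s.index identifies its rows.
result = df["a"].rolling(2).apply(lambda s: pair_stat(df.loc[s.index]), raw=False)
print(result)
```

The function still runs once per window in Python, so this is tidier than an explicit loop rather than truly vectorized. It also needs a unique index, and the function must return a single number.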
I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer and gives the same result (apart from the first row, where it returns NaN instead of 0.0); on large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean, it still returns just the mean of all valid numbers. I would expect NaN to be returned as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I can probably write my own aggregation function that returns NaN as soon as a NaN is within the group. This function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips the NaN values. You can make it include NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
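The same aggregation can be written with a named function instead of a lambda, which makes it easier to reuse. A minimal sketch; the name strict_mean is only illustrative:

```python
import numpy as np
import pandas as pd


def strict_mean(values: pd.Series) -> float:
    """Mean of the group that is NaN as soon as any value is NaN."""
    return values.mean(skipna=False)


c = pd.DataFrame({"a": [1, np.nan, 2, 3], "b": [1, 2, 1, 2]})
result = c.groupby("b")["a"].agg(strict_mean)
print(result)
```

As with the lambda, this calls a Python function once per group, so it will not be as fast as the built-in mean on large data.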
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for exactly this task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) that prevents it from working correctly.
Quick workaround
Complete working example based on these comments by @Serge Ballesta and @RoelAdriaans:
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option -
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
faster than apply but slower than default sum:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()  # group keys that contain a NaN in 'a'
method3.loc[method3.index.isin(nan_groups)] = np.nan
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
'a': [1, 2, 1, 2],
'b': [1, np.nan, 2, 3],
'c': [1, np.nan, 2, np.nan],
'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
out = gb.sum() * (cnt / cnt)
out
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
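Putting the count-versus-size mask together with mean gives a reusable helper. This is only a sketch for real-valued columns (the complex column from the setup is left out here), and the function name strict_group_mean is made up for illustration:

```python
import numpy as np
import pandas as pd


def strict_group_mean(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Group means that are NaN for any (group, column) containing a NaN."""
    gb = df.groupby(key)
    # count() ignores NaN and size() does not, so they are equal only
    # for groups that are complete in that column.
    complete = gb.count().eq(gb.size(), axis=0)
    return gb.mean().where(complete)


df = pd.DataFrame({
    "a": [1, 2, 1, 2],
    "b": [1, np.nan, 2, 3],
    "c": [1, np.nan, 2, np.nan],
})
result = strict_group_mean(df, "a")
print(result)
```

Everything here uses the built-in groupby reductions, so no Python function is called per group.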
There is already an answer that deals with a relatively simple dataframe that is given here.
However, the dataframe I have at hand has multiple columns and large number of rows. One Dataframe contains three dataframes attached along axis=0. (Bottom end of one is attached to the top of the next.) They are separated by a row of NaN values.
How can I create three dataframes out of this one data by splitting it along the NaN rows?
Like in the answer you linked, you want to create a column which identifies the group number. Then you can apply the same solution.
To do so, you have to test whether all the values of a row are NaN. isnull() marks each NaN cell, and all(axis=1) reduces across the columns, so the result is True only for rows made up entirely of NaN. The cumulative sum of that flag increases by one at every separator row, which gives each block its own group number:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
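An end-to-end sketch of the steps above on made-up data (the frame and its column names x and y are invented for illustration). It drops the separator rows before grouping, so no .dropna() is needed afterwards, and it keeps the group number in a separate Series instead of adding a column to the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three blocks separated by all-NaN rows.
df = pd.DataFrame({
    "x": [1, 2, np.nan, 3, np.nan, 4, 5],
    "y": [10, 20, np.nan, 30, np.nan, 40, 50],
})

is_separator = df.isnull().all(axis=1)  # True only for all-NaN rows
group_no = is_separator.cumsum()        # increases by one at each separator

# Drop the separator rows, then collect one DataFrame per group number.
parts = [part for _, part in df[~is_separator].groupby(group_no[~is_separator])]

for part in parts:
    print(part, end="\n\n")
```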
I ran into this same question in 2022. Here is what I did to split dataframes on rows with NaNs. The caveat is that it relies on the python-rle package (pip install python-rle) for run-length encoding:
import numpy as np
import pandas as pd
import rle

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if the row contains at least one NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
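If you would rather avoid the extra dependency, the same chunks can probably be obtained with a cumulative sum instead of run-length encoding. This is a sketch rather than a drop-in replacement: the name split_on_nan_rows is made up, and it treats any row that contains at least one NaN as a separator, which is the behaviour shown in the examples above.

```python
import numpy as np
import pandas as pd


def split_on_nan_rows(df: pd.DataFrame) -> list:
    """Split df into runs of consecutive rows that contain no NaN."""
    has_nan = df.isnull().any(axis=1)  # True if the row has at least one NaN
    block_id = has_nan.cumsum()        # changes every time a NaN row is passed
    clean = df[~has_nan]
    return [chunk for _, chunk in clean.groupby(block_id[~has_nan])]


sample_df1 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, np.nan, 3, 4],
    "c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
    "a": [1, 2, np.nan, 3, 4],
    "b": [1, 2, 3, np.nan, 4],
    "c": [1, 2, np.nan, 3, 4],
})

chunks1 = split_on_nan_rows(sample_df1)
chunks2 = split_on_nan_rows(sample_df2)
print(chunks1)
print(chunks2)
```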
For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
'DayofWeek': [1, 1, 3, 2, 4, 2],
'Hour_Bucket': [1, 5, 7, 4, 3, 12],
'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000+ rows). I'm looking to perform functions on 'Values' where "Value_Bucket" == 5, for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially, the data will be grouped into a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), with each cell filled with the result of a function (the average, for example). I can use groupby with one criterion; can someone explain how to group by two criteria and tabulate the result?
query to subset
groupby
unstack
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
index='Hour_Bucket',
columns='DayofWeek',
values='Values',
aggfunc='mean',
fill_value=0)
Output
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
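Both approaches only produce rows and columns for combinations that actually occur in the data. To get the full 24 x 7 table asked for in the question, one option is to reindex the result. The sketch below assumes Hour_Bucket runs from 0 to 23 and DayofWeek from 1 to 7; adjust the ranges if your buckets are numbered differently.

```python
import pandas as pd

df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
                   'DayofWeek': [1, 1, 3, 2, 4, 2],
                   'Hour_Bucket': [1, 5, 7, 4, 3, 12],
                   'Values': [1, 1.5, 2, 3, 5, 3]})

table = (
    df.query('Value_Bucket == 5')
      .pivot_table(index='Hour_Bucket', columns='DayofWeek',
                   values='Values', aggfunc='mean')
      # Assumed bucket ranges: hours 0..23 as rows, days 1..7 as columns.
      .reindex(index=range(24), columns=range(1, 8))
)
print(table.shape)
print(table.loc[[1, 5, 7], [1, 3]])
```

Cells without data stay NaN; add .fillna(0) at the end if zeros are preferred.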