statistics on subsets of a pandas dataframe - python

Let's consider this pandas dataframe:
            A  B  C  D
2012-08-16  2  1  1  7
2012-08-17  6  4  8  6
2012-08-18  8  3  1  1
2012-08-19  7  2  8  9
2012-08-20  6  7  5  8
2012-08-21  1  3  3  3
2012-08-22  8  2  3  8
2012-08-23  7  1  7  4
2012-08-24  2  6  0  6
2012-08-25  4  6  8  1
I would like to compute statistics on subsets of the rows, where each subset is defined by a value in column A. A minimal example that achieves this is:
rows = []
for id in set(df.A):
    sub = df[df.A == id]
    rows.append({'B_mean': sub.B.mean(), 'B_std': sub.B.std(), 'id': id})
# build the result once (DataFrame.append was removed in pandas 2.0)
new = pd.DataFrame(rows)
I would like to know if there is a more efficient way to do that.

Like this? It calculates the mean and standard deviation of B for each group of identical values in A.
df.groupby('A').agg({'B': ['mean', 'std']})
      B
   mean       std
A
1   3.0       NaN
2   3.5  3.535534
4   6.0       NaN
6   5.5  2.121320
7   1.5  0.707107
8   2.5  0.707107
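If flat column names like the B_mean and B_std from the question are preferred, named aggregation can produce them directly. This is only a sketch, assuming pandas 0.25 or later; the frame is rebuilt from the question's sample, and the name stats is illustrative rather than taken from the original posts.

```python
import pandas as pd

# Sample data from the question, indexed by date
df = pd.DataFrame(
    {
        'A': [2, 6, 8, 7, 6, 1, 8, 7, 2, 4],
        'B': [1, 4, 3, 2, 7, 3, 2, 1, 6, 6],
        'C': [1, 8, 1, 8, 5, 3, 3, 7, 0, 8],
        'D': [7, 6, 1, 9, 8, 3, 8, 4, 6, 1],
    },
    index=pd.date_range('2012-08-16', periods=10),
)

# One row per distinct value of A, with flat column names
stats = (
    df.groupby('A')
      .agg(B_mean=('B', 'mean'), B_std=('B', 'std'))
      .reset_index()
      .rename(columns={'A': 'id'})
)
print(stats)
```

Groups with a single row still get NaN for the standard deviation, as in the output above.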

Related

Closest non equal row in a column in Pandas dataframe

I have this DataFrame:
d={}
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
df = pd.DataFrame(d)
I would like to create a column holding the next value of qty that differs from the current one. For example, if qty is 5 and the next row is also 5, I skip that row and keep looking until I find a value not equal to 5; in my case that is 6. All of this should be done per id group.
Here is the desired DataFrame:
d['id']=['1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2']
d['qty']=[5,5,5,5,5,6,5,5,1,1,2,2,2,3,5,8]
d['qty2']=[6,6,6,6,6,5,'NAN','NAN',2,2,3,3,3,5,8,'NAN']
Any help is very much appreciated
You can groupby.shift, mask the identical values, and groupby.bfill:
# shift up per group
s = df.groupby('id')['qty'].shift(-1)
# keep only the different values and bfill per group
df['qty2'] = s.where(df['qty'].ne(s)).groupby(df['id']).bfill()
Output:
   id  qty  qty2
0   1    5   6.0
1   1    5   6.0
2   1    5   6.0
3   1    5   6.0
4   1    5   6.0
5   1    6   5.0
6   1    5   NaN
7   1    5   NaN
8   2    1   2.0
9   2    1   2.0
10  2    2   3.0
11  2    2   3.0
12  2    2   3.0
13  2    3   5.0
14  2    5   8.0
15  2    8   NaN
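To see what each of the three steps contributes, here is the same approach written out on the question's data. It is a self-contained sketch; the intermediate names nxt and changed are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['1'] * 8 + ['2'] * 8,
    'qty': [5, 5, 5, 5, 5, 6, 5, 5, 1, 1, 2, 2, 2, 3, 5, 8],
})

# Step 1: the next row's qty within each id (NaN on the last row of a group)
nxt = df.groupby('id')['qty'].shift(-1)
# Step 2: blank out positions where the next value equals the current one
changed = nxt.where(df['qty'].ne(nxt))
# Step 3: back-fill within each id so every row sees the next differing value
df['qty2'] = changed.groupby(df['id']).bfill()
print(df)
```

Grouping the back-fill by id is what keeps a value from group 2 from leaking into the trailing NaN rows of group 1.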

Why does pd.rolling and .apply() return multiple outputs from a function returning a single value?

I'm trying to create a rolling function that:
1. Divides two DataFrames with 3 columns each.
2. Calculates the mean of each row of the output from step 1.
3. Sums the averages from step 2.
This could be done with DataFrame.iterrows(), looping through each row. However, that would be inefficient on larger datasets, so my objective is to build this on DataFrame.rolling(), which should be much faster.
What I need help with is understanding why my approach below returns multiple values when the function I'm using returns only a single value.
EDIT : I have updated the question with the code that produces my desired output.
This is the test dataset I'm working with:
#import libraries
import pandas as pd
import numpy as np
#create two dataframes
values = {'column1': [7,2,3,1,3,2,5,3,2,4,6,8,1,3,7,3,7,2,6,3,8],
'column2': [1,5,2,4,1,5,5,3,1,5,3,5,8,1,6,4,2,3,9,1,4],
"column3" : [3,6,3,9,7,1,2,3,7,5,4,1,4,2,9,6,5,1,4,1,3]
}
df1 = pd.DataFrame(values)
df2 = pd.DataFrame([[2,3,4],[3,4,1],[3,6,1]])
print(df1)
print(df2)
column1 column2 column3
0 7 1 3
1 2 5 6
2 3 2 3
3 1 4 9
4 3 1 7
5 2 5 1
6 5 5 2
7 3 3 3
8 2 1 7
9 4 5 5
10 6 3 4
11 8 5 1
12 1 8 4
13 3 1 2
14 7 6 9
15 3 4 6
16 7 2 5
17 2 3 1
18 6 9 4
19 3 1 1
20 8 4 3
0 1 2
0 2 3 4
1 3 4 1
2 3 6 1
One method to achieve my desired output by looping through each row:
RunningSum = []
for index, rows in df1.iterrows():
    if index > 3:
        # absolute percentage deviation of df2 from the last 3 rows of df1
        Div = abs((((df2 / df1.iloc[index-3+1:index+1].reset_index(drop=True).values)-1)*100))
        Average = Div.mean(axis=0)
        SumOfAverages = np.sum(Average)
        RunningSum.append(SumOfAverages)
#printing my desired output values
print(RunningSum)
[330.42328042328046,
212.0899470899471,
152.06349206349208,
205.55555555555554,
311.9047619047619,
209.1269841269841,
197.61904761904765,
116.94444444444444,
149.72222222222223,
430.0,
219.51058201058203,
215.34391534391537,
199.15343915343914,
159.6031746031746,
127.6984126984127,
326.85185185185185,
204.16666666666669]
However, this would be time-consuming when working with large datasets. Therefore, I've tried to create a function that I apply to a rolling object.
def SumOfAverageFunction(vals):
    Div = df2 / vals.reset_index(drop=True)
    Average = Div.mean(axis=0)
    SumOfAverages = np.sum(Average)
    return SumOfAverages
RunningSum = df1.rolling(window=3,axis=0).apply(SumOfAverageFunction)
The problem here is that I get multiple outputs (one per column) instead of a single value. How can I solve this?
print(RunningSum)
column1 column2 column3
0 NaN NaN NaN
1 NaN NaN NaN
2 3.214286 4.533333 2.277778
3 4.777778 3.200000 2.111111
4 5.888889 4.416667 1.656085
5 5.111111 5.400000 2.915344
6 3.455556 3.933333 5.714286
7 2.866667 2.066667 5.500000
8 2.977778 3.977778 3.063492
9 3.555556 5.622222 1.907937
10 2.750000 4.200000 1.747619
11 1.638889 2.377778 3.616667
12 2.986111 2.005556 5.500000
13 5.333333 3.075000 4.750000
14 4.396825 5.000000 3.055556
15 2.174603 3.888889 2.148148
16 2.111111 2.527778 1.418519
17 2.507937 3.500000 3.311111
18 2.880952 3.000000 5.366667
19 2.722222 3.370370 5.750000
20 2.138889 5.129630 5.666667
After reordering the operations, your calculation can be simplified:
BASE = df2.sum(axis=0) /3
BASE_series = pd.Series({k: v for k, v in zip(df1.columns, BASE)})
result = df1.rdiv(BASE_series, axis=1).sum(axis=1)
print(np.around(result[4:], 3))
Outputs:
4 5.508
5 4.200
6 2.400
7 3.000
...
If you don't want to calculate anything before index 4, change the last step to start from:
df1.iloc[4:].rdiv(...
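For context, rolling(...).apply(...) evaluates the function on each column's window separately, which is why one value per column comes back in the question. The loop from the question can also be vectorized in a different way than the answer above; the following is a sketch assuming NumPy 1.20 or later (for sliding_window_view). The names windows, div and running_sum are illustrative and do not come from the original posts.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df1 = pd.DataFrame({
    'column1': [7, 2, 3, 1, 3, 2, 5, 3, 2, 4, 6, 8, 1, 3, 7, 3, 7, 2, 6, 3, 8],
    'column2': [1, 5, 2, 4, 1, 5, 5, 3, 1, 5, 3, 5, 8, 1, 6, 4, 2, 3, 9, 1, 4],
    'column3': [3, 6, 3, 9, 7, 1, 2, 3, 7, 5, 4, 1, 4, 2, 9, 6, 5, 1, 4, 1, 3],
})
df2 = pd.DataFrame([[2, 3, 4], [3, 4, 1], [3, 6, 1]])

# All 3-row windows of df1 at once: shape (19, 3, 3); window k ends at row k + 2
windows = sliding_window_view(df1.to_numpy(), (3, 3))[:, 0]
# Absolute percentage deviation of df2 from each window (broadcast over windows)
div = np.abs((df2.to_numpy() / windows - 1) * 100)
# Mean over the rows of each window, then the sum of the three column means
running_sum = div.mean(axis=1).sum(axis=1)
# The loop in the question starts at index 4, so skip the first two windows
running_sum = running_sum[2:]
print(running_sum)
```

This reproduces the 17 values of the question's RunningSum list without a Python-level loop, at the cost of holding all windows in one array.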

Create rolling average in dataframe until a set point

I have a dataframe like this:
month val1 val2 val3
1 2 3 5
2 3 4 7
3 5 1 2
4 7 4 3
5 2 6 4
6 2 2 2
The last month in my initial column is 6 here, but it could be anything from month 1 to month 12. I want to extend each val column up to month 12, where every new value is the average of the previous 2 values, to get something like this:
month val1 val2 val3
1 2 3 5
2 3 4 7
3 5 1 2
4 7 4 3
5 2 6 4
6 2 2 2
7 2 4 3
8 2 3 2.5
9 2 3.5 2.75
10 2 3.25 2.63
11 2 3.38 2.69
12 2 3.32 2.66
The main problem is that appending rows to dataframes is a very inefficient process (i.e. creating a new dataframe series each iteration and appending it to the original dataframe will be extremely costly).
Probably the best way to do this is to create an array from the dataframe, do the rolling calculations there, and convert the result into a new dataframe.
import pandas as pd
import numpy as np
# create dataframe with the first month removed to show the solution is generalizable
df = pd.DataFrame({'month':[2,3,4,5,6],'val1':[3,5,7,2,2],'val2':[4,1,4,6,2],'val3':[7,2,3,4,2]})
df
month val1 val2 val3
0 2 3 4 7
1 3 5 1 2
2 4 7 4 3
3 5 2 6 4
4 6 2 2 2
# extract values of the dataframe as numpy and perform rolling operations
# separate out months from other columns
array_values = df.drop(columns = 'month').values
# loop from most recent month to month 12
for month in range(df.month.iloc[-1],12):
    array_values = np.append(array_values, np.apply_along_axis(np.mean, 0,array_values[-2:]).reshape(1,3), axis = 0)
array_months = np.append(df.month.values, np.arange(df.month.values[-1]+1,13,1))
array_months = array_months.reshape(len(array_months),1)
array_values = np.append(array_months, array_values, axis = 1)
new_df = pd.DataFrame(data = array_values, columns = df.columns)
new_df.month = new_df.month.astype('int')
Output:
new_df
month val1 val2 val3
0 2 3.0 4.0000 7.00000
1 3 5.0 1.0000 2.00000
2 4 7.0 4.0000 3.00000
3 5 2.0 6.0000 4.00000
4 6 2.0 2.0000 2.00000
5 7 2.0 4.0000 3.00000
6 8 2.0 3.0000 2.50000
7 9 2.0 3.5000 2.75000
8 10 2.0 3.2500 2.62500
9 11 2.0 3.3750 2.68750
10 12 2.0 3.3125 2.65625
Define the following function, generating the rows for the rest of
the current year, based on last 2 rows:
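def getRest(last2):
    last2 = last2.set_index('month')
    lastMonth = last2.index[1]
    rv = []
    for mnth in range(lastMonth, 12):
        newRow = last2.mean()
        newRow.name = mnth + 1
        rv.append(newRow)
        # keep only the 2 most recent rows: drop the older one, add the new one
        last2 = last2.drop([mnth - 1])
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        last2 = pd.concat([last2, newRow.to_frame().T])
    return rv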
Then invoke it the following way, concatenating with the original
DataFrame:
pd.concat([df, pd.concat(getRest(df.iloc[-2:]), axis=1).T.reset_index()
.rename(columns={'index': 'month'})], ignore_index=True)
The result is:
month val1 val2 val3
0 1 2.0 3.0000 5.00000
1 2 3.0 4.0000 7.00000
2 3 5.0 1.0000 2.00000
3 4 7.0 4.0000 3.00000
4 5 2.0 6.0000 4.00000
5 6 2.0 2.0000 2.00000
6 7 2.0 4.0000 3.00000
7 8 2.0 3.0000 2.50000
8 9 2.0 3.5000 2.75000
9 10 2.0 3.2500 2.62500
10 11 2.0 3.3750 2.68750
11 12 2.0 3.3125 2.65625
If you want, save this result under either the original variable
or another one.
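Both answers follow the same idea: compute the new rows outside the DataFrame and build the result once. A compact variant using plain Python lists might look like the sketch below. The function name extend_with_rolling_mean and its parameters last_month and window are hypothetical, chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'month': [1, 2, 3, 4, 5, 6],
    'val1': [2, 3, 5, 7, 2, 2],
    'val2': [3, 4, 1, 4, 6, 2],
    'val3': [5, 7, 2, 3, 4, 2],
})

def extend_with_rolling_mean(frame, last_month=12, window=2):
    """Append rows up to last_month; each new value is the mean of the
    previous `window` values in the same column."""
    value_cols = [c for c in frame.columns if c != 'month']
    rows = frame[value_cols].astype(float).values.tolist()
    months = frame['month'].tolist()
    for month in range(months[-1] + 1, last_month + 1):
        recent = rows[-window:]
        # column-wise mean of the most recent rows
        rows.append([sum(col) / len(recent) for col in zip(*recent)])
        months.append(month)
    out = pd.DataFrame(rows, columns=value_cols)
    out.insert(0, 'month', months)
    return out

extended = extend_with_rolling_mean(df)
print(extended)
```

The DataFrame is constructed a single time at the end, so the cost of the loop stays proportional to the number of months added.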

Compute difference between values in dataframe column

I have this DataFrame:
a b c d
4 7 5 12
3 8 2 8
1 9 3 5
9 2 6 4
I want column 'd' to become the difference between the n-th value of column 'a' and the (n+1)-th value of column 'a'.
I tried this, but it doesn't run:
for i in data.index-1:
    data.iloc[i]['d']=data.iloc[i]['a']-data.iloc[i+1]['a']
Can anyone help me?
Basically what you want is diff.
df = pd.DataFrame.from_dict({"a":[4,3,1,9]})
df["d"] = df["a"].diff(periods=-1)
print(df)
Output
a d
0 4 1.0
1 3 2.0
2 1 -8.0
3 9 NaN
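As a quick check of what diff(periods=-1) computes, it can be compared with an explicit subtraction of the column shifted up by one row. This is a small sketch on the question's column a; the column name d_shift is made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 3, 1, 9]})

# Current value minus the next value
df['d'] = df['a'].diff(periods=-1)
# The same thing spelled out: shift(-1) moves each next value up one row
df['d_shift'] = df['a'] - df['a'].shift(-1)
print(df)
```

Both columns end in NaN because the last row has no following value to subtract.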
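Let's try a simple way (note that this computes the next value minus the current one, i.e. the opposite sign of diff(periods=-1)):
import numpy as np
import pandas as pd
df=pd.DataFrame.from_dict({'a':[2,4,8,15]})
diff=[]
for i in range(len(df)-1):
    diff.append(df['a'][i+1]-df['a'][i])
# the last row has no following value
diff.append(np.nan)
df['d']=diff
print(df)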
a d
0 2 2.0
1 4 4.0
2 8 7.0
3 15 NaN

How to fill first N/A cell when apply rolling mean to a column -python

I need to apply a rolling mean to a column, shown as s3 in pic1. After applying the rolling mean with window = 5, I get the correct answer, but the first 4 rows are left empty, as shown in column sa3 of pic2.
I want to fill the first 4 empty cells of sa3 with the mean of all the data in s3 up to the current row, as shown in column a3 of pic3.
How can I do this with a simple function, in addition to the rolling mean?
I think you need the parameter min_periods=1 in rolling:
min_periods : int, default None
Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, this will default to 1.
df = df.rolling(5, min_periods=1).mean()
Sample:
np.random.seed(1256)
df = pd.DataFrame(np.random.randint(10, size=(10, 5)), columns=list('abcde'))
print (df)
a b c d e
0 1 5 8 8 9
1 3 6 3 0 6
2 7 0 1 5 1
3 6 6 5 0 4
4 4 9 4 6 1
5 7 7 5 8 3
6 0 7 2 8 2
7 4 8 3 5 5
8 8 2 0 9 2
9 4 7 1 5 1
df = df.rolling(5, min_periods=1).mean()
print (df)
a b c d e
0 1.000000 5.000000 8.00 8.000000 9.000000
1 2.000000 5.500000 5.50 4.000000 7.500000
2 3.666667 3.666667 4.00 4.333333 5.333333
3 4.250000 4.250000 4.25 3.250000 5.000000
4 4.200000 5.200000 4.20 3.800000 4.200000
5 5.400000 5.600000 3.60 3.800000 3.000000
6 4.800000 5.800000 3.40 5.400000 2.200000
7 4.200000 7.400000 3.80 5.400000 3.000000
8 4.600000 6.600000 2.80 7.200000 2.600000
9 4.600000 6.200000 2.20 7.000000 2.600000
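With min_periods=1, the first rows of the rolling mean are averages of all values seen so far, which is what the question asks for. The sketch below checks this against expanding().mean() on made-up data; the names s3 and sa3 are borrowed from the question, and the values 1 to 9 are an assumption for illustration.

```python
import numpy as np
import pandas as pd

s3 = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# Rolling mean that also produces values for incomplete leading windows
sa3 = s3.rolling(5, min_periods=1).mean()
# Mean of everything up to and including the current row
expanding = s3.expanding().mean()

print(pd.DataFrame({'s3': s3, 'sa3': sa3, 'expanding': expanding}))
```

The two columns agree for the first 5 rows and diverge afterwards, once the rolling window starts dropping old values.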
So you want to add:
df['sa3'] = df['sa3'].fillna(df['s3'].mean())
Hopefully I used the correct column names.
You can use pandas to find the rolling mean and then fill the NaN with zero.
Use something like the following:
col = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(col)
df['rm'] = df.rolling(5).mean().fillna(value=0)
print(df)
0 rm
0 1 0.0
1 2 0.0
2 3 0.0
3 4 0.0
4 5 3.0
5 6 4.0
6 7 5.0
7 8 6.0
8 9 7.0
I see that some of the answers deal with nulls by replacing them with the overall mean, while others compute a rolling mean but do not use it to replace the nulls. So I worked out the code myself and am posting it here.
df['Col']= df['Col'].fillna(df['Col'].rolling(4,center=True,min_periods=1).mean())
4 is the length of the rolling window.
center=True means the window is centered on the null value, so the replacement takes into account roughly half of the values above it and half of the values below it.
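A small runnable illustration of this last approach, on made-up data with two gaps. The series below and the name filled are assumptions for the example only.

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

# Centered rolling mean; NaNs inside a window are skipped, and
# min_periods=1 lets a window with a single valid value still yield a result
local_mean = col.rolling(4, center=True, min_periods=1).mean()
# Only the missing entries are replaced; existing values are left untouched
filled = col.fillna(local_mean)
print(filled)
```

If a gap were wider than the window, the rolling mean itself would be NaN there and the gap would remain unfilled.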
