dataframe previous rows mean - python

I have the following dataframe and want to create a new column 'measurement_mean'. I expect the value of each row in the new column to be the mean of all the previous 'Measurement' values.
How can I do this?
Measurement
0 2.0
1 4.0
2 3.0
3 0.0
4 100.0
5 3.0
6 2.0
7 1.0

Use pandas.Series.expanding:
df['measurement_mean'] = df.Measurement.expanding().mean()

df['measurement_mean'] = df.Measurement.cumsum()/(df.index+1)
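For reference, a minimal runnable sketch of both answers on the sample data; it assumes, as both answers do, that the running mean should include the current row (apply .shift(1) afterwards if you want strictly the previous rows):

import pandas as pd

df = pd.DataFrame({'Measurement': [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0]})

# Expanding mean: mean of all rows up to and including the current one
df['measurement_mean'] = df.Measurement.expanding().mean()

# Equivalent: cumulative sum divided by the 1-based position (works for the default RangeIndex)
df['measurement_mean_alt'] = df.Measurement.cumsum() / (df.index + 1)

print(df)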

Related

Forward fill on custom value in pandas dataframe

I am looking to perform forward fill on some dataframe columns.
The ffill method replaces missing values (NaN) with the previous filled value.
In my case, I would like to perform a forward fill, with the difference that I don't want to do it on NaN but on a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a":1, "b":10},
{"a":2, "b":"*"},
{"a":3, "b":"*"},
{"a":4, "b":"*"},
{"a":np.nan, "b":50},
{"a":6, "b":60},
{"a":7, "b":70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
If I replace "*" with np.nan and then ffill, that would also apply the forward fill to the NaN in column a.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking whether each one contains "*", then replacing and forward filling.
You can use df.mask together with df.isin and df.replace:
df.mask(df.isin(['*']),df.replace('*',np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.NaN, "<special>").replace("*", np.NaN).ffill().replace("<special>", np.NaN)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.NaN).ffill()
df[original_nan] = np.NaN
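If you need this as a reusable step across many columns, here is a minimal sketch of the same idea wrapped in a helper; the function name ffill_on_value and the sentinel argument are mine, not from the answers above:

import numpy as np
import pandas as pd

def ffill_on_value(frame, sentinel="*"):
    # Remember where the original NaNs were, so they are not forward filled
    original_nan = frame.isna()
    # Treat the sentinel as missing, forward fill, then restore the original NaNs
    return frame.replace(sentinel, np.nan).ffill().mask(original_nan)

df = ffill_on_value(df, sentinel="*")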

How to set rolling window size by size of each group?

I have a data frame like below:
>df
ID Value
---------------
1 1.0
1 2.0
1 3.0
1 4.0
2 6.0
2 7.0
2 8.0
3 2.0
I want to calculate min/max/sum/mean/var on the 'Value' field of the last int(group size / 2) records of each group, instead of a fixed number of records.
For ID = 1, apply min/max/sum/mean/var to the 'Value' field of the last 4/2 = 2 records.
For ID = 2, apply min/max/sum/mean/var to the 'Value' field of the last int(3/2) = 1 record.
For ID = 3, apply min/max/sum/mean/var to the 'Value' field of the last (and only) record, since the group has just one.
so the output should be
Value
ID min max sum mean var
----------------------------------
1 3.0 4.0 7.0 3.5 0.5 # the last 4/2 rows for group with ID =1
2 7.0 7.0 7.0 7.0 0.5 # the last 3/2 rows for group with ID =2
3 2.0 2.0 2.0 2.0 Nan # the last 1 rows for group with ID =3
I am thinking of using the rolling function like below:
df_group = (df.groupby('ID')
    .apply(lambda x: x
        .sort_values(by=['ID'])
        .rolling(window=int(x.size/2), min_periods=1)
        .agg({'Value': ['min', 'max', 'sum', 'mean', 'var']})
        .tail(1)))
but the result turns out to be as below
Value
min max sum mean var
ID
------------------------------------------------
1 3 1.0 4.0 10.0 2.5 1.666667
2 6 6.0 8.0 21.0 7.0 1.000000
3 7 2.0 2.0 2.0 2.0 NaN
It seems that x.size does not work at all.
Is there any way to set the rolling window size based on the group size?
A possible solution, looping over the groups. (Note that in your attempt x.size is the total number of cells in the group, rows × columns, not the number of rows, which is why the window came out too large; len(x) or x.shape[0] gives the row count.)
import pandas as pd
df = pd.DataFrame(dict(ID=[1,1,1,1,2,2,2,3],
Value=[1,2,3,4,6,7,8,2]))
print(df)
##
ID Value
0 1 1
1 1 2
2 1 3
3 1 4
4 2 6
5 2 7
6 2 8
7 3 2
Loop over the groups as below:
# Object to store the results
stats = []
# Group over ID
for ID, Values in df.groupby('ID'):
    # tail: get the last n values, with n between 1 and half the group length
    # describe: get the statistics
    _stat = Values.tail(max(1, int(len(Values) / 2)))['Value'].describe()
    # Add the group ID to the result
    _stat.loc['ID'] = ID
    # Store the result
    stats.append(_stat)
# Create the new dataframe
pd.DataFrame(stats).set_index('ID')
Result
count mean std min 25% 50% 75% max
ID
1.0 2.0 3.5 0.707107 3.0 3.25 3.5 3.75 4.0
2.0 1.0 8.0 NaN 8.0 8.00 8.0 8.00 8.0
3.0 1.0 2.0 NaN 2.0 2.00 2.0 2.00 2.0
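If you want exactly the min/max/sum/mean/var columns from your expected output rather than describe(), a sketch along the same lines (take the tail of each group, then aggregate) could look like this; it assumes the df built above:

stats = (df.groupby('ID')['Value']
           .apply(lambda s: s.tail(max(1, len(s) // 2))
                             .agg(['min', 'max', 'sum', 'mean', 'var']))
           .unstack())
print(stats)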
Links:
How to loop over grouped Pandas dataframe?
Series Describe

How to keep only the rows of a DataFrame whose value in one column is unique in Pandas

I have a DataFrame like this:
item_id item_price
1 10.0
1 5.0
1 6.0
1 7.0
2 2.0
3 3.0
4 5.0
I am trying to get a DataFrame in which the item_id column contains only unique values, dropping all the rows that don't fit this condition, like this:
item_id item_price
2 2.0
3 3.0
4 5.0
But I am confused about how to implement this in Pandas. Any help would be appreciated.
Use drop_duplicates with the subset parameter to specify the column to check for duplicates, and keep=False to remove all duplicated rows:
df = df.drop_duplicates(subset=['item_id'], keep=False)
print (df)
item_id item_price
4 2 2.0
5 3 3.0
6 4 5.0
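An equivalent way to express the same filter, if you prefer building the boolean mask yourself, is Series.duplicated:

# keep=False marks every occurrence of a repeated item_id, so ~ keeps only the unique ones
df = df[~df['item_id'].duplicated(keep=False)]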

pandas take average on odd rows

I want to insert a row between each pair of rows in a dataframe, filled with the average of the current and the next row (for the numeric columns).
starting data:
time value value_1 value-2
0 0 0 4 3
1 2 1 6 6
intermediate df:
time value value_1 value-2
0 0 0 4 3
1 1 0 4 3 #duplicate of row 0
2 2 1 6 6
3 3 1 6 6 #duplicate of row 2
I would like to create df_1:
time value value_1 value-2
0 0 0 4 3
1 1 0.5 5 4.5 #average of row 0 and 2
2 2 1 6 6
3 3 2 8 8 #average of row 2 and 4
To do this I appended a copy of the starting dataframe to create the intermediate dataframe shown above:
df = df_0.append(df_0)
df.sort_values(['time'], ascending=[True], inplace=True)
df = df.reset_index()
df['value_shift'] = df['value'].shift(-1)
df['value_shift_1'] = df['value_1'].shift(-1)
df['value_shift_2'] = df['value_2'].shift(-1)
then I was thinking of applying a function to each column:
def average_vals(numeric_val):
    # average every odd row
    if int(row.name) % 2 != 0:
        # take the average of value and value_shift for each value
        # but this way I need to create 3 separate functions
Is there a way to do this without writing a separate function for each column and applying to each column one by one (in real data I have tens of columns)?
How about this method, using DataFrame.reindex and DataFrame.interpolate?
df.reindex(np.arange(len(df.index) * 2) / 2).interpolate().reset_index(drop=True)
Explanation
Reindex in half steps: reindex(np.arange(len(df.index) * 2) / 2)
This gives a DataFrame like this:
time value value_1 value-2
0.0 0.0 0.0 4.0 3.0
0.5 NaN NaN NaN NaN
1.0 2.0 1.0 6.0 6.0
1.5 NaN NaN NaN NaN
Then use DataFrame.interpolate to fill in the NaN values; the default is linear interpolation, which here is just the mean of the neighbouring rows.
Finally, use .reset_index(drop=True) to fix your index.
This should give:
time value value_1 value-2
0 0.0 0.0 4.0 3.0
1 1.0 0.5 5.0 4.5
2 2.0 1.0 6.0 6.0
3 2.0 1.0 6.0 6.0
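Put together as a runnable sketch (note the numpy import the one-liner needs; also note that the trailing half-step row is carried forward from the last observation rather than extrapolated, as in the output above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [0, 2], 'value': [0, 1], 'value_1': [4, 6], 'value-2': [3, 6]})

# Reindex at half steps (0.0, 0.5, 1.0, 1.5), interpolate the inserted NaN rows,
# then reset to a clean 0..n integer index
df_1 = (df.reindex(np.arange(len(df.index) * 2) / 2)
          .interpolate()
          .reset_index(drop=True))
print(df_1)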

Drop Pandas columns with a high percentage of NaN values [duplicate]

This question already has answers here:
How to drop column according to NAN percentage for dataframe?
(4 answers)
Let's say I have the following data.
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': list('aaaabbbb'),
                   'val': [1, 3, 3, np.NaN, 5, 6, 6, 2],
                   'id': [1, np.NaN, np.NaN, np.NaN, np.NaN, 3, np.NaN, 3]})
df
I want to drop columns where the percentage of NaN values is over 50%. I could do it manually by running the following and then using drop.
df.isnull().sum()/len(df)*100
However, I was wondering if there was an elegant and quick code to do this?
You could use the thresh parameter of dropna:
df.dropna(axis=1, thresh=int(0.5*len(df)))
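Here thresh is the minimum number of non-NaN values a column must have to be kept; with 8 rows, int(0.5 * len(df)) = 4, so the id column (only 3 non-missing values) is dropped. A quick check on the df above:

df1 = df.dropna(axis=1, thresh=int(0.5 * len(df)))
print(df1.columns.tolist())  # ['group', 'val']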
Use mean with boolean indexing to remove the columns:
print (df.isnull().mean() * 100)
group 0.0
id 62.5
val 12.5
dtype: float64
df1 = df.loc[:, df.isnull().mean() <= .5]
print (df1)
group val
0 a 1.0
1 a 3.0
2 a 3.0
3 a NaN
4 b 5.0
5 b 6.0
6 b 6.0
7 b 2.0
df.dropna(thresh=len(df)//2,axis=1)
Out[57]:
group val
0 a 1.0
1 a 3.0
2 a 3.0
3 a NaN
4 b 5.0
5 b 6.0
6 b 6.0
7 b 2.0
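If you need this repeatedly, the boolean-indexing answer is easy to wrap in a small helper with the cutoff as a parameter; the name drop_mostly_nan is mine, purely for illustration:

def drop_mostly_nan(frame, max_nan_frac=0.5):
    # Keep only the columns whose fraction of NaN values is at or below the cutoff
    return frame.loc[:, frame.isnull().mean() <= max_nan_frac]

df1 = drop_mostly_nan(df, max_nan_frac=0.5)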
