I'm working with a Pandas DataFrame that looks like this:
0 Data
1
2
3
4 5
5
6
7
8 21
9
10 2
11
12
13
14
15
I'm trying to fill the blanks with the next valid value using df.fillna(method='backfill') (df.bfill() in recent pandas). This works, but then I need to add each valid value to the next valid value, from the bottom up, like this:
0 Data
1 28
2 28
3 28
4 28
5 23
6 23
7 23
8 23
9 2
10 2
11
12
13
14
15
I can get this to work by looping over it, but is there a method within pandas that can do this?
Thanks a lot!
You could reverse the df, then fillna(0) and then cumsum and reverse again:
In [12]:
df = df[::-1].fillna(0).cumsum()[::-1]
df
Out[12]:
Data
0 28.0
1 28.0
2 28.0
3 28.0
4 23.0
5 23.0
6 23.0
7 23.0
8 2.0
9 2.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
Here we use slicing notation to reverse the df, replace all NaN with 0, perform cumsum, and reverse back.
Another simple way to do it: df.sum() - df.fillna(0).cumsum()
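A self-contained sketch of the reverse-cumsum approach (the positions of the non-null values, indices 3, 7 and 9, are assumed here so that the result matches the output above):

```python
import numpy as np
import pandas as pd

# 15-row series with three known values; their positions are an
# assumption chosen to reproduce the output shown above.
df = pd.DataFrame({'Data': [np.nan] * 15})
df.loc[3, 'Data'] = 5
df.loc[7, 'Data'] = 21
df.loc[9, 'Data'] = 2

# Reverse the frame, treat NaN as 0, take the running total, reverse back
result = df[::-1].fillna(0).cumsum()[::-1]
print(result)
```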
Below is an example of my dataset:
[index] [pressure] [flow rate]
0 NaN 0
1 NaN 0
2 3 25
3 5 35
4 6 42
5 NaN 44
6 NaN 46
7 NaN 0
8 5 33
9 4 26
10 3 19
11 NaN 0
12 NaN 0
13 NaN 39
14 NaN 36
15 NaN 41
I would like to find a polynomial relationship between pressure and flow rate over the stretch where both are present (in this example, from index 0 to index 4), and then extend the pressure values over the NaN rows using that relationship, up to the point where both are present again (here, from index 8 to index 11). At that point I need to fit a new polynomial relationship between pressure and flow rate, use it to extend the pressure values up to the next stretch of available data, and so on.
I appreciate any advice on how best to accomplish that.
You can interpolate:
df['[pressure 2]'] = df.set_index('[flow rate]')['[pressure]'].interpolate('polynomial', order=2).values
Output
[index] [pressure] [flow rate] [pressure 2]
0 0 2.0 21 2.000000
1 1 4.0 29 4.000000
2 2 3.0 25 3.000000
3 3 5.0 35 5.000000
4 4 6.0 42 6.000000
5 5 NaN 44 6.000000
6 6 NaN 46 NaN
7 7 NaN 50 NaN
8 8 5.0 33 5.000000
9 9 4.0 26 4.000000
10 10 3.0 19 3.000000
11 11 6.0 44 6.000000
12 12 NaN 41 5.915690
13 13 NaN 39 5.578449
14 14 NaN 36 5.044156
15 15 NaN 40 5.775173
NB. The remaining NaNs cannot be interpolated without ambiguity; you can ffill them if needed.
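A runnable sketch of the set_index-then-interpolate idea on made-up data (the column names and values here are hypothetical; method='index' is used, which interpolates linearly against the flow values without needing SciPy — if SciPy is installed you can swap in interpolate('polynomial', order=2) as in the answer above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing pressure value at flow=30
df = pd.DataFrame({
    'flow':     [10, 20, 30, 40, 50],
    'pressure': [1.0, 4.0, np.nan, 16.0, 25.0],
})

# Set flow as the index so interpolation runs against flow values,
# not row positions. method='index' is linear in flow; 'polynomial'
# with an order argument requires SciPy.
df['pressure_filled'] = (
    df.set_index('flow')['pressure']
      .interpolate(method='index')
      .values
)
print(df)
# pressure at flow=30 is filled linearly between (20, 4) and (40, 16) -> 10.0
```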
Compare each row of column A with the previous row
If it is greater, reassign it to the previous row's value
If it is less, leave the value unchanged
The problem is that each comparison is currently made against the original value
What I want is to compare against the previous row after reassignment
import pandas as pd
import numpy as np
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_changed']=np.where(df.A>df.A.shift(),df.A.shift(),df.A)
df
A A_changed
0 16 16.0
1 19 16.0
2 18 18.0
3 15 15.0
4 13 13.0
5 16 13.0
expected output
A A_changed
0 16 16.0
1 19 16.0
2 18 16.0
3 15 15.0
4 13 13.0
5 16 13.0
Are you trying to do cummin?
df['compare_min'] = df['A'].cummin()
Output:
A compare compare_min
0 5 5.0 5
1 14 5.0 5
2 12 12.0 5
3 15 12.0 5
4 13 13.0 5
5 16 13.0 5
df['b'] = [10, 11, 12, 5, 8, 2]
df['compare_min_b'] = df['b'].cummin()
Output:
A compare compare_min b compare_min_b
0 5 5.0 5 10 10
1 14 5.0 5 11 10
2 12 12.0 5 12 10
3 15 12.0 5 5 5
4 13 13.0 5 8 5
5 16 13.0 5 2 2
Update: using your example, this is exactly what cummin does:
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_change'] = df['A'].cummin()
df
Output:
A A_change
0 16 16
1 19 16
2 18 16
3 15 15
4 13 13
5 16 13
Here is why your code will not work:
d={'A':[16,19,18,15,13,16]}
df = pd.DataFrame(d)
df['A_shift'] = df['A'].shift()
df
Output:
A A_shift
0 16 NaN
1 19 16.0
2 18 19.0
3 15 18.0
4 13 15.0
5 16 13.0
Look at the output of the shifted column: what you want is the cumulative min, not a comparison of A against shifted A. That is why index 2 is not giving you what you expected.
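To see that cummin matches the "compare with the previous value after reassignment" loop the question describes, here are both side by side (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({'A': [16, 19, 18, 15, 13, 16]})

# Vectorised running minimum
df['A_changed'] = df['A'].cummin()

# The same logic written as the explicit loop the question describes:
# compare each value against the previous *reassigned* value.
vals = []
for a in df['A']:
    vals.append(a if not vals or a <= vals[-1] else vals[-1])

assert vals == df['A_changed'].tolist()
print(df)
```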
I need to get the output in column 5_Days_Up as shown below.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 5
25-May-21 7 6
26-May-21 8 7
27-May-21 9 8
28-May-21 10 9
29-May-21 11 10
30-May-21 12 11
31-May-21 13 12
1-Jun-21 14 13
2-Jun-21 15 14
But, got the output like this.
Date price 5_Days_Up
20-May-21 1
21-May-21 2
22-May-21 4
23-May-21 5
24-May-21 6 6
25-May-21 7 7
26-May-21 8 8
27-May-21 9 9
28-May-21 10 10
29-May-21 11 11
30-May-21 12 12
31-May-21 13 13
1-Jun-21 14 14
2-Jun-21 15 15
Here, in pandas, I am using
df['5_Days_Up'] = df['price'].rolling(window=5).max()
Is there a way to get the maximum value of the last 5 periods, skipping today's price, using the same rolling() or any other method?
Your data has only 4 (instead of 5) previous entries before the entry on 24-May-21 with price 6, since there is no entry with price 3 in the sample. Therefore the first non-NaN value will appear on 25-May-21, the entry with price 7.
To include entries up to, but excluding, the current one, you can use the parameter closed='left':
df['5_Days_Up'] = df['price'].rolling(window=5, closed='left').max()
Result:
Date price 5_Days_Up
0 20-May-21 1 NaN
1 21-May-21 2 NaN
2 22-May-21 4 NaN
3 23-May-21 5 NaN
4 24-May-21 6 NaN
5 25-May-21 7 6.0
6 26-May-21 8 7.0
7 27-May-21 9 8.0
8 28-May-21 10 9.0
9 29-May-21 11 10.0
10 30-May-21 12 11.0
11 31-May-21 13 12.0
12 1-Jun-21 14 13.0
13 2-Jun-21 15 14.0
So I have a dataframe that has the beginning and end times of certain activities in subsequent rows that share the same id and activity. Every now and then there is a row without an end, which I eventually want to drop (id 3 & 5 in this example). The paired rows (with id/act pairs: 1/10, 2/10, and 1/10 at a different time) can be merged, i.e. the second row can be dropped. I can add the end times simply by shifting one column, but I am having a hard time getting rid of the unnecessary rows without iterating through the whole dataframe.
import pandas as pd
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
df["time 2"]=df["time"].shift(-1)
Thank you so much for the quick reply, but I actually fixed this myself with a very simple solution:
df = pd.DataFrame([[1,10,20],[1,10,25],[2,10,40],[2,10,41],[3,10,42],[1,10,45],[1,10,45],[5,10,50]], columns=['id','act','time'])
id act time
0 1 10 20
1 1 10 25
2 2 10 40
3 2 10 41
4 3 10 42
5 1 10 45
6 1 10 45
7 5 10 50
df["end"]=df["time"].shift(-1)
df["id 2"]=df["id"].shift(-1)
df["act 2"]=df["act"].shift(-1)
df.drop(df.index[len(df)-1],inplace=True)
id act time end id 2 act 2
0 1 10 20 25.0 1.0 10.0
1 1 10 25 40.0 2.0 10.0
2 2 10 40 41.0 2.0 10.0
3 2 10 41 42.0 3.0 10.0
4 3 10 42 45.0 1.0 10.0
5 1 10 45 45.0 1.0 10.0
6 1 10 45 50.0 5.0 10.0
df = df.loc[(df["id"]==df["id 2"]) & (df["act"]==df["act 2"])]
df.drop(columns=["id 2","act 2"],inplace=True)
id act time end
0 1 10 20 25.0
2 2 10 40 41.0
5 1 10 45 45.0
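For reference, the shift-and-mask idea can also be written without the helper columns; a sketch using the question's data (the "end" column name follows the snippet above, and the & combines the two match conditions):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 10, 20], [1, 10, 25], [2, 10, 40], [2, 10, 41],
     [3, 10, 42], [1, 10, 45], [1, 10, 45], [5, 10, 50]],
    columns=['id', 'act', 'time'])

# Pair each row with the next one, then keep only rows whose
# (id, act) matches the next row's (id, act).
nxt = df.shift(-1)
paired = df[(df['id'] == nxt['id']) & (df['act'] == nxt['act'])].copy()
paired['end'] = nxt.loc[paired.index, 'time']
print(paired)
```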
I'm playing around with the Titanic dataset, and what I'd like to do is fill in all the NaN/null values of the Age column with the median value based on that row's Pclass.
Here is some data:
train
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 NaN
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 NaN
Here is what I would like to end up with:
PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1 35
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1 35
The first thing I came up with is this (in the interest of brevity I have only included the slice for Pclass equal to 1, rather than 2 and 3):
Pclass_1 = train['Pclass']==1
train[Pclass_1]['Age'].fillna(train[train['Pclass']==1]['Age'].median(), inplace=True)
As far as I understand, this method creates a view rather than editing train itself (I don't quite understand how this is different from a copy, or if they are analogous in terms of memory -- that is an aside I would love to hear about if possible). I particularly like this Q/A on the topic View vs Copy, How Do I Tell? but it doesn't include the insight I'm looking for.
Looking through Pandas docs I learned why you want to use .loc to avoid this pitfall. However I just can't seem to get the syntax right.
Pclass_1 = train.loc[:,['Pclass']==1]
Pclass_1.Age.fillna(train[train['Pclass']==1]['Age'].median(),inplace=True)
I'm getting lost in indices. This one ends up looking for a column named False which obviously doesn't exist. I don't know how to do this without chained indexing. train.loc[:,train['Pclass']==1] returns an exception IndexingError: Unalignable boolean Series key provided.
In this part of the line,
train.loc[:,['Pclass']==1]
the part ['Pclass'] == 1 is comparing the list ['Pclass'] to the value 1, which returns False. The .loc[] is then evaluated as .loc[:,False] which is causing the error.
I think you mean:
train.loc[train['Pclass']==1]
which selects all of the rows where Pclass is 1. This fixes the error, but it will still give you the "SettingWithCopyWarning".
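One way to write the single-class fill with a single .loc assignment, avoiding both chained indexing and the warning (a sketch on data shaped like the question's; the mask variable is mine):

```python
import numpy as np
import pandas as pd

# Data shaped like the question's example
train = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 1, 1, 3, 3, 2, 1],
    'Age':    [22, 35, 26, 35, 35, np.nan, 54, 2, 27, 14, np.nan],
})

# One .loc assignment: select the class-1 rows and write the filled
# values straight back into the original frame.
mask = train['Pclass'] == 1
train.loc[mask, 'Age'] = train.loc[mask, 'Age'].fillna(
    train.loc[mask, 'Age'].median())
print(train)
```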
EDIT 1
(old code removed)
Here is an approach that uses groupby with transform to create a Series
containing the median Age for each Pclass. The Series is then used as the argument to fillna() to replace the missing values with the median. Using this approach will correct all passenger classes at the same time, which is what the OP originally requested. The solution comes from the answer to Python-pandas Replace NA with the median or mean of a group in dataframe
import pandas as pd
from io import StringIO
tbl = """PassengerId Pclass Age
0 1 3 22
1 2 1 35
2 3 3 26
3 4 1 35
4 5 3 35
5 6 1
6 7 1 54
7 8 3 2
8 9 3 27
9 10 2 14
10 11 1
"""
train = pd.read_table(StringIO(tbl), sep=r'\s+')
print('Original:\n', train)
median_age = train.groupby('Pclass')['Age'].transform('median') #median Ages for all groups
train['Age'].fillna(median_age, inplace=True)
print('\nNaNs replaced with median:\n', train)
The code produces:
Original:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 NaN
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 NaN
NaNs replaced with median:
PassengerId Pclass Age
0 1 3 22.0
1 2 1 35.0
2 3 3 26.0
3 4 1 35.0
4 5 3 35.0
5 6 1 35.0
6 7 1 54.0
7 8 3 2.0
8 9 3 27.0
9 10 2 14.0
10 11 1 35.0
One thing to note is that this line, which uses inplace=True:
train['Age'].fillna(median_age, inplace=True)
can be replaced with assignment using .loc:
train.loc[:,'Age'] = train['Age'].fillna(median_age)