Limit cumsum to only the previous 4 rows - python

beginner's question:
I want to create a cumulative sum column on my dataframe, but I only want the column to add the values from the previous 4 rows (inclusive of the current row). I also need to start the count again with each new 'Type' in the frame.
This is what I'm going for:
Type Value Desired column
A 1 -
A 2 -
A 1 -
A 1 5
A 2 6
A 2 6
B 2 -
B 2 -
B 2 -
B 2 8
B 1 7
B 1 6

You can do this by applying a rolling sum after grouping by Type. (pd.rolling_sum comes from older pandas; current versions spell it with the .rolling method, and transform keeps the result aligned with the original rows.) For example:
>>> df["sum4"] = df.groupby("Type")["Value"].transform(lambda x: x.rolling(4).sum())
>>> df
Type Value sum4
0 A 1 NaN
1 A 2 NaN
2 A 1 NaN
3 A 1 5
4 A 2 6
5 A 2 6
6 B 2 NaN
7 B 2 NaN
8 B 2 NaN
9 B 2 8
10 B 1 7
11 B 1 6
pandas uses NaN to represent missing data; if you really want - instead, you can replace the NaNs afterwards (note that this turns the column into an object/string column):
df["sum4"] = df["sum4"].fillna('-')

Related

Create a new column using the previous row of another column, based on group [duplicate]

I have a Pandas dataframe, and I want to create a new column whose values are that of another column, shifted down by one row. The last row should show NaN.
The catch is that I want to do this by group, with the last row of each group showing NaN; I do NOT want the last row of a group to "steal" a value from whichever group happens to be adjacent in the dataframe.
My attempted implementation is quite shamefully broken, so I'm clearly misunderstanding something fundamental.
df['B_shifted'] = df.groupby(['A'])['B'].transform(lambda x:x.values[1:])
Newer versions of pandas can now perform a shift on a group:
df['B_shifted'] = df.groupby(['A'])['B'].shift(1)
Note that when shifting down, it's the first row that has NaN.
Shift works on the output of the groupby clause:
>>> df = pandas.DataFrame(numpy.random.randint(1,3, (10,5)), columns=['a','b','c','d','e'])
>>> df
a b c d e
0 2 1 2 1 1
1 2 1 1 1 1
2 1 2 2 1 2
3 1 2 1 1 2
4 2 2 1 1 2
5 2 2 2 2 1
6 2 2 1 1 1
7 2 2 2 1 1
8 2 2 2 2 1
9 2 2 2 2 1
for k, v in df.groupby('a'):
    print(k)
    print('normal')
    print(v)
    print('shifted')
    print(v.shift(1))
1
normal
a b c d e
2 1 2 2 1 2
3 1 2 1 1 2
shifted
a b c d e
2 NaN NaN NaN NaN NaN
3 1 2 2 1 2
2
normal
a b c d e
0 2 1 2 1 1
1 2 1 1 1 1
4 2 2 1 1 2
5 2 2 2 2 1
6 2 2 1 1 1
7 2 2 2 1 1
8 2 2 2 2 1
9 2 2 2 2 1
shifted
a b c d e
0 NaN NaN NaN NaN NaN
1 2 1 2 1 1
4 2 1 1 1 1
5 2 2 1 1 2
6 2 2 2 2 1
7 2 2 1 1 1
8 2 2 2 1 1
9 2 2 2 2 1
@EdChum's comment is a better answer to this question, so I'm posting it here for posterity:
df['B_shifted'] = df.groupby(['A'])['B'].transform(lambda x:x.shift())
or similarly
df['B_shifted'] = df.groupby(['A'])['B'].transform('shift')
The former notation is more flexible, of course (e.g. if you want to shift by 2).
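As a small illustration of that flexibility, here is a sketch on a made-up frame (only the column names A and B come from the question; fill_value is assumed to be available, as it is in any reasonably recent pandas):
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "x", "y", "y"],
                   "B": [1, 2, 3, 4, 5]})

# shift within each group; the first row of each group becomes NaN
df["B_shifted"] = df.groupby("A")["B"].shift(1)

# shift by 2 instead, and fill the missing leading values with 0
df["B_shifted2"] = df.groupby("A")["B"].shift(2, fill_value=0)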

How to remove or drop all rows after first occurrence of `NaN` from the entire DataFrame

I want to remove/drop all rows after the first occurrence of NaN, based on any column of the DataFrame.
I have created two sample DataFrames, illustrated below. In the first dataframe the dtypes of the first two columns are object while the last one is int; in the second dataframe they are float, object and int.
First:
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,'NaN','NaN','NaN','NaN'),"B": (1,2,3,'NaN',4,5,6,7,'NaN',"9","10"),"C": range(11)})
>>> df
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A object
B object
C int64
dtype: object
The index-based approach below, which works on one particular column, is just fine as long as the dtype is object or int, but I'm looking for a DataFrame-level operation, not one limited to a single column.
>>> df[:df[df['A'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
>>> df[:df[df['B'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
Second:
Another interesting fact: when the DataFrame is created with np.nan we get different dtypes, and then the index-based approach fails even for a single-column operation.
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan,np.nan),"B": (1,2,3,np.nan,4,5,6,7,np.nan,"9","10"),"C": range(11)})
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 NaN 3
4 5.0 4 4
5 6.0 5 5
6 7.0 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
dtypes:
>>> df.dtypes
A float64
B object
C int64
dtype: object
Error:
>>> df[:df[df['B'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
>>> df[:df[df['A'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
Expected output for the second DataFrame:
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
So I am looking for a way to check across the entire DataFrame, regardless of dtype, and drop all rows from the first occurrence of NaN onwards.
You can try:
out = df.iloc[:df.isna().any(axis=1).idxmax()]
OR
via replace(), turn the string 'NaN's into real NaN values, then check for missing values and filter rows:
df = df.replace({'NaN': float('NaN'), 'nan': float('NaN')})
out = df.iloc[:df.isna().any(axis=1).idxmax()]
output of out:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
Just for posterity ...
>>> df.iloc[:df.isna().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
>>> df.iloc[:df.isnull().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
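One caveat worth keeping in mind: if the frame contains no NaN at all, isna().any(axis=1) is all False, so idxmax()/argmax() point at the very first row and the slice drops everything. A small guard sketch (drop_from_first_nan is a hypothetical helper name, not part of pandas):
import pandas as pd

def drop_from_first_nan(df):
    # keep only the rows before the first row containing any NaN;
    # if there is no NaN anywhere, return the frame unchanged
    mask = df.isna().any(axis=1)
    if not mask.any():
        return df
    return df.iloc[:mask.argmax()]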

Fast way to fill NaN in DataFrame

I have DataFrame object df with column like that:
[In]: df
[Out]:
id sum
0 1 NaN
1 1 NaN
2 1 2
3 1 NaN
4 1 4
5 1 NaN
6 2 NaN
7 2 NaN
8 2 3
9 2 NaN
10 2 8
10 2 NaN
... ... ...
[1810601 rows x 2 columns]
I have a lot of NaN values in this column and I want to fill them in the following way:
if the NaN is at the beginning of an id group (the first rows for that id), it should become 0
otherwise, take the value from the previous row with the same id
Output should be like that:
[In]: df
[Out]:
id sum
0 1 0
1 1 0
2 1 2
3 1 2
4 1 4
5 1 4
6 2 0
7 2 0
8 2 3
9 2 3
10 2 8
10 2 8
... ... ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using a loop with iterrows(), but that is a very inefficient method. I believe it can be done faster with pandas methods.
Try ffill with groupby, then fill the remaining leading NaNs with 0:
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
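For reference, a small self-contained sketch of the same idea on a frame shaped like the one in the question (the values are made up; only the column names id and sum come from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({"id":  [1, 1, 1, 2, 2, 2],
                   "sum": [np.nan, 2, np.nan, np.nan, 3, np.nan]})

# forward-fill within each id; any NaN still left can only sit at the
# start of a group, so fill those with 0
df["sum"] = df.groupby("id")["sum"].ffill().fillna(0)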

search for duplicated consecutive rows and put in additional column pandas

I have a df:
df1
a b c d
0 2 4 1
0 2 5 1
0 1 6 2
1 2 7 2
1 1 8 1
1 1 4 1
I need to group by a and b, and when two consecutive values of d within a group are equal to 1, I want the second row's c placed in a new column next to the first row. Like:
df1
a b c d c1
0 2 4 1 5
0 1 6 2 nan
1 2 7 2 nan
1 1 8 1 4
Any ideas?
I tried
df1.groupby([df1.a, df1.b, df1.d.diff().ne(0)])
and then loc() only the rows with 1s and merging the two dataframes again, but the first step is not completely correct.
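No answer is shown here, so here is a sketch of one possible direction, building on the groupby/shift ideas from the other answers on this page; the pairing rule (the second row's c goes next to the first row, and the second row is then dropped) is an assumption read off the desired output above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [0, 0, 0, 1, 1, 1],
                    "b": [2, 2, 1, 2, 1, 1],
                    "c": [4, 5, 6, 7, 8, 4],
                    "d": [1, 1, 2, 2, 1, 1]})

g = df1.groupby(["a", "b"])

# within each (a, b) group, look at the next row: if both the current
# and the next row have d == 1, store the next row's c alongside this row
next_d = g["d"].shift(-1)
next_c = g["c"].shift(-1)
pair = (df1["d"] == 1) & (next_d == 1)
df1["c1"] = np.where(pair, next_c, np.nan)

# drop the second row of each matched pair to mirror the desired output
second = pair.groupby([df1["a"], df1["b"]]).shift(1, fill_value=False).astype(bool)
result = df1[~second].reset_index(drop=True)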

Pandas reverse column values groupwise

I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below is a minimal demonstration example, where I want to "flip" the values that belong to the same letter A, B or C:
df = pd.DataFrame({"group": ["A","A","A","B","B","B","B","C","C"],
                   "value": [1,3,2,4,4,2,3,2,5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this: (a column is added instead of replaced purely for demonstration purposes)
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vector-style approach, I end up messing with loops just for the sake of the final output, but my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"]==i,"value"].values.tolist()[::-1]
    df.loc[df["group"]==i,"value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
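A variant of the same transform that works purely on positions (returning a plain array instead of a reversed Series), in case you prefer to make the positional intent explicit:
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.to_numpy()[::-1])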
