Pandas ffill for certain values in a column - python

I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to forward fill only after certain values? Specifically, I want to forward fill only if the value in df.data is 1 or 4. The result should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4

One option is to forward fill the column with ffill, then use isin and mask to take the filled values only where they are 1 or 4:
s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
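For reference, here is a self-contained sketch of the same idea that rebuilds the example frame from the question. The helper name ffill_only is hypothetical (it is not a pandas function), and the extra isna() check is an assumption added to make explicit that only the gaps are touched.

```python
import numpy as np
import pandas as pd


def ffill_only(series, values):
    """Forward fill NaN gaps only when the carried-forward value is in `values`.

    Hypothetical helper for illustration; not part of pandas.
    """
    filled = series.ffill()
    # Replace an entry only if it was NaN and the value carried forward is allowed.
    return series.mask(series.isna() & filled.isin(values), filled)


df = pd.DataFrame({
    "time": range(11),
    "data": [1, 1, np.nan, np.nan, 6, np.nan, np.nan, np.nan, 5, 4, np.nan],
})
df["data"] = ffill_only(df["data"], [1, 4])
print(df)
```

The gaps after 6 stay NaN because 6 is not in the allowed list, while the gaps after 1 and 4 are filled.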

Related

Pandas assign series to another Series based on index

I have three Pandas Dataframes:
df1:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
df2:
0 1
3 7
6 5
9 2
df3:
1 2
4 6
7 6
My goal is to assign the values of df2 and df3 to df1 based on the index.
df1 should then become:
0 1
1 2
2 NaN
3 7
4 6
5 NaN
6 5
7 6
8 NaN
9 2
I tried with simple assignment:
df1.loc[df2.index] = df2.values
or
df1.loc[df2.index] = df2
but this gives me a ValueError:
ValueError: Must have equal len keys and value when setting with an iterable
Thanks for your help!
You can do concat with combine_first:
pd.concat([df2,df3]).combine_first(df1)
Or reindex:
pd.concat([df2,df3]).reindex_like(df1)
0 1.0
1 2.0
2 NaN
3 7.0
4 6.0
5 NaN
6 5.0
7 6.0
8 NaN
9 2.0
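A minimal runnable sketch of both suggestions follows. It assumes the three objects are single-column Series (the question calls them DataFrames but shows one column each); the names s1, s2, and s3 are stand-ins for df1, df2, and df3.

```python
import numpy as np
import pandas as pd

# Stand-ins for df1, df2, df3 from the question (assumed to be Series).
s1 = pd.Series(np.nan, index=range(10))
s2 = pd.Series([1, 7, 5, 2], index=[0, 3, 6, 9])
s3 = pd.Series([2, 6, 6], index=[1, 4, 7])

combined = pd.concat([s2, s3])

# combine_first keeps the values of `combined` and fills any remaining
# labels from s1; the result covers the union of both indexes.
via_combine = combined.combine_first(s1)

# reindex_like keeps only the labels of s1, in the order of s1.
via_reindex = combined.reindex_like(s1)

print(via_combine)
print(via_reindex)
```

The two results agree here because s1 is all NaN and its index already covers every label. If s1 held real values, combine_first would keep them wherever s2 and s3 have nothing, while reindex_like would leave NaN there.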

Count preceding non NaN values in pandas

I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value contains the count of non-NaN values in the same column up to and including that row. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    # count the non-NaN values in rows 0..i of every column
    out.iloc[i] = df.iloc[:i + 1].notna().sum()
However, I can only do so for one column at a time. My real DataFrame contains thousands of columns, so iterating over them is far too slow. What can I do? Maybe the answer involves the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Use notna with cumsum:
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
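The following sketch reproduces the answer on the frame from the question. It also shows a strictly-preceding variant, which is an addition not requested in the question: the desired output counts the current row as well, so the variant only matters if the current row should be excluded.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "a": [nan, nan, nan, 7, 3, 5, 7, 8, nan, nan],
    "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    "c": [nan] * 7 + [3, 5, 4],
})

present = df.notna()  # True where a value exists

# Running count of non-NaN values, including the current row
# (this matches the desired output in the question).
inclusive = present.cumsum()

# Variant: count only the rows strictly before the current one.
exclusive = inclusive - present.astype(int)

print(inclusive)
print(exclusive)
```

Both operations are vectorized over all columns at once, so no per-column loop or apply is involved.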

create a sliding window of input data from pandas dataframe

I have to create a sliding window of input data with window size = 3
Dataframe
0 1
0 1 2
1 3 4
2 5 6
3 7 8
Desired output:
0 1 2 3 4 5
0 1 2 NA NA NA NA
1 3 4 1 2 NA NA
2 5 6 3 4 1 2
3 7 8 5 6 3 4
I used data.values.flatten(), but it converts all rows of the dataframe into a nested list.
How can I create a sliding window of input data (of a desired window length) from a dataframe?
You can just concat:
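import numpy as np
import pandas as pd

# shift(i) moves every row down by i, so block i holds the row from i steps back
new_df = pd.concat([df.shift(i) for i in range(3)], axis=1)
# rename columns
new_df.columns = np.arange(new_df.shape[1])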
output:
0 1 2 3 4 5
0 1 2 NaN NaN NaN NaN
1 3 4 1.0 2.0 NaN NaN
2 5 6 3.0 4.0 1.0 2.0
3 7 8 5.0 6.0 3.0 4.0
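To make the window length configurable, the same concat-of-shifts idea can be wrapped in a small function. This is a sketch under assumptions: the name sliding_window and its parameters are hypothetical, and dropping the incomplete leading rows with dropna() is an optional step the question does not ask for.

```python
import pandas as pd


def sliding_window(frame, window):
    """Place each row next to its `window - 1` predecessors.

    Hypothetical helper for illustration; not part of pandas.
    """
    # shift(i) moves rows down by i, so block i holds the row from i steps back.
    shifted = [frame.shift(i) for i in range(window)]
    out = pd.concat(shifted, axis=1)
    out.columns = range(out.shape[1])  # flat 0..n-1 column labels
    return out


df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]])
windows = sliding_window(df, 3)

# Optional: keep only the rows whose window is completely filled.
complete = windows.dropna()

print(windows)
print(complete)
```

Shifting introduces NaN at the top, so the shifted columns are upcast to float, as in the output shown in the answer.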

How to loop through each row in pandas dataframe and set values equal to nan after a threshold is surpassed?

If I have a pandas dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 4 5
B 10 10 10 8 1 1
C 8 8 0 9 6 3
D 10 10 11 4 2 9
E 0 9 1 5 8 3
If I set a threshold of 7, how do I loop through each row and set the values after the threshold is no longer met equal to np.nan such that I get a data frame like this:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2 9
E 0 9 1 5 8 NaN
Where everything after the last number greater than 7 is set equal to np.nan.
Let's try this:
df.where(df.where(df > 7).bfill(axis=1).notna())
Output:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
Create a mask m by applying df.where to df.gt(7), back-filling along each row with bfill, and testing notna. Then index df with m:
m = df.where(df.gt(7)).bfill(axis=1).notna()
df[m]
Output:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
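Nice question. Reverse the column order, flag the values greater than 7, and take the cumulative sum along each row; positions where that sum is still 0 have no value above the threshold at or after them, so they become NaN: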
df.where(df.iloc[:,::-1].gt(7).cumsum(1).ne(0))
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
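The answers above all build the same boolean mask: a cell is kept if some value at or after it in its row exceeds the threshold. Here is a self-contained sketch of the reversed-cumsum variant on the frame from the question; the function name blank_after_last_above is hypothetical.

```python
import pandas as pd


def blank_after_last_above(frame, threshold):
    """Set to NaN every cell after the last value in its row that exceeds `threshold`.

    Hypothetical helper for illustration; not part of pandas.
    """
    above = frame.gt(threshold)
    # Cumulative sum from right to left: it is positive wherever a value
    # above the threshold exists at or after the current column.
    from_right = above.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]
    return frame.where(from_right.gt(0))


df = pd.DataFrame(
    [[5, 5, 10, 9, 4, 5],
     [10, 10, 10, 8, 1, 1],
     [8, 8, 0, 9, 6, 3],
     [10, 10, 11, 4, 2, 9],
     [0, 9, 1, 5, 8, 3]],
    index=list("ABCDE"),
)
result = blank_after_last_above(df, 7)
print(result)
```

Row D is left untouched because its last value, 9, is itself above the threshold.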

How to use pandas.DataFrame.mask for NaN

I want to ignore NaN values in my selected dataframe columns when I want to normalize with sklearn.preprocessing.normalize. Column example:
0 12.0
1 12.0
2 3.0
3 NaN
4 3.0
5 3.0
6 NaN
7 NaN
8 3.0
9 3.0
10 3.0
11 4.0
12 10.0
You can use dropna(). It returns a new object with the rows containing NaN removed; the original is left unchanged.
>>> a.dropna()
0 12.0
0 1 12
1 2 3
3 4 3
4 5 3
7 8 3
8 9 3
9 10 3
10 11 4
11 12 10
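If the NaN rows should stay in place rather than be dropped, pandas arithmetic already skips them. The sketch below assumes that plain L2 scaling of the single column is what is wanted, and it uses NumPy directly instead of sklearn.preprocessing.normalize; the Series a is rebuilt from the column in the question.

```python
import numpy as np
import pandas as pd

a = pd.Series([12, 12, 3, np.nan, 3, 3, np.nan, np.nan, 3, 3, 3, 4, 10], dtype=float)

# Option 1: remove the NaN rows first (the original index labels are kept).
dropped = a.dropna()

# Option 2: keep the NaN rows. sum() skips NaN by default, and dividing
# NaN by the norm leaves it as NaN, so the gaps stay where they were.
norm = np.sqrt((a ** 2).sum())
scaled = a / norm

print(dropped)
print(scaled)
```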
