DataFrame operation - Python

I have a dataframe populated with zeros and non-zero values. For each row I want to apply the following condition:
If the value in a given cell is non-zero and the value in the cell to the right is zero, put that value in the cell to the right.
For example, this is one of the rows in the dataframe now:
[0,0,0,20,0,0,0,33,3,0,5,0,0,0,0,0]
The function would convert it to the following:
[0,0,0,20,20,20,20,33,3,3,5,5,5,5,5,5]
I want to apply this to the whole dataframe.
Your help would be much appreciated!
Thank you.

Since you imply you are using pandas, I would leverage a bit of the built-in muscle in the library:
import pandas as pd
import numpy as np

s = pd.Series([0, 0, 0, 20, 0, 0, 0, 33, 3, 0, 5, 0, 0, 0, 0, 0])
s = s.replace(0, np.nan).ffill()
Output:
0 NaN
1 NaN
2 NaN
3 20.0
4 20.0
5 20.0
6 20.0
7 33.0
8 3.0
9 3.0
10 5.0
11 5.0
12 5.0
13 5.0
14 5.0
15 5.0
dtype: float64
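To extend this from a single Series to the whole dataframe, as the question asks, the same replace-and-ffill idea can be applied row-wise with axis=1. A minimal sketch (the second row is a made-up extra example, and fillna(0) at the end restores the leading zeros that have nothing to their left):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([
    [0, 0, 0, 20, 0, 0, 0, 33, 3, 0, 5, 0, 0, 0, 0, 0],
    [0, 7, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9],
])

# Treat zeros as missing, forward-fill along each row, then restore the
# remaining leading NaNs to 0 and return to an integer dtype.
filled = df.replace(0, np.nan).ffill(axis=1).fillna(0).astype(int)
print(filled)
```

The first row comes out as [0, 0, 0, 20, 20, 20, 20, 33, 3, 3, 5, 5, 5, 5, 5, 5], matching the expected output above.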


Replace values in dataframe with difference to last row value by condition

I'm trying to replace every value above 1000 in my dataframe by its difference to the previous row value.
This is the way I tried with pandas:
data_df.replace(data_df.where(data_df["value"] >= 1000), data_df["value"].diff(), inplace=True)
This does not result in an error, but nothing in the dataframe changes. What am I missing?
import numpy as np
import pandas as pd

d = {'value': [1000, 200002, 50004, 600005]}
data_df = pd.DataFrame(data=d)
data_df["diff"] = data_df["value"].diff()
data_df["value"] = np.where(data_df["value"] > 1000, data_df["diff"], data_df["value"])
data_df.drop(columns='diff', inplace=True)
I introduce a helper column "diff" to hold the difference from the previous row; np.where lets you implement the if/else logic. Hope it helps!
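As an aside (a sketch of an alternative, not part of the answer above): the helper column can be avoided with Series.mask, which substitutes values wherever the condition holds:

```python
import pandas as pd

data_df = pd.DataFrame({'value': [1000, 200002, 50004, 600005]})

# Where the value exceeds 1000, replace it with the difference to the
# previous row; elsewhere keep the original value.
data_df['value'] = data_df['value'].mask(data_df['value'] > 1000,
                                         data_df['value'].diff())
print(data_df)
```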
You can set freq to 1000 or whatever interval you want; I have it at 10 to make the sample easier to see. Basically this shifts the column, and for every row whose index is evenly divisible by the frequency, uses the difference to the shifted value; otherwise it leaves the value as is.
import pandas as pd
import numpy as np

freq = 10
df = pd.DataFrame({'data': list(range(30))})
df['previous'] = df['data'].shift(1)
df['data'] = np.where((df.index % freq == 0) & (df.index > 0), df['data'] - df['previous'], df['data'])
df.drop(columns='previous', inplace=True)
Output
data
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 6.0
7 7.0
8 8.0
9 9.0
10 1.0
11 11.0
12 12.0
13 13.0
14 14.0
15 15.0
16 16.0
17 17.0
18 18.0
19 19.0
20 1.0
21 21.0
22 22.0
23 23.0
24 24.0
25 25.0
26 26.0
27 27.0
28 28.0
29 29.0

Forward fill on custom value in pandas dataframe

I am looking to perform a forward fill on some dataframe columns.
The ffill method replaces missing (NaN) values with the previous filled value.
In my case, I would like to perform a forward fill, with the difference that I don't want to do it on NaN but on a specific value (say "*").
Here's an example
import pandas as pd
import numpy as np
d = [{"a":1, "b":10},
{"a":2, "b":"*"},
{"a":3, "b":"*"},
{"a":4, "b":"*"},
{"a":np.nan, "b":50},
{"a":6, "b":60},
{"a":7, "b":70}]
df = pd.DataFrame(d)
with df being
a b
0 1.0 10
1 2.0 *
2 3.0 *
3 4.0 *
4 NaN 50
5 6.0 60
6 7.0 70
The expected result should be
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
Replacing "*" by np.nan and then calling ffill would also apply the fill to column a, which I don't want.
Since my data has hundreds of columns, I was wondering if there is a more efficient way than looping over all columns, checking if each contains "*", then replacing and filling.
You can use df.mask with df.isin and df.replace:
df.mask(df.isin(['*']), df.replace('*', np.nan).ffill())
a b
0 1.0 10
1 2.0 10
2 3.0 10
3 4.0 10
4 NaN 50
5 6.0 60
6 7.0 70
I think you're going in the right direction, but here's a complete solution. What I'm doing is 'marking' the original NaN values, then replacing "*" with NaN, using ffill, and then putting the original NaN values back.
df = df.replace(np.nan, "<special>").replace("*", np.nan).ffill().replace("<special>", np.nan)
output:
a b
0 1.0 10.0
1 2.0 10.0
2 3.0 10.0
3 4.0 10.0
4 NaN 50.0
5 6.0 60.0
6 7.0 70.0
And here's an alternative solution that does the same thing, without the 'special' marking:
original_nan = df.isna()
df = df.replace("*", np.nan).ffill()
df[original_nan] = np.nan

How to vectorize Pandas rolling and shifting operations to improve performance

Using Pandas 1.0 and Numpy 1.18, I need to apply Rolling multiple times, with a varying window size and summary functions to a large dataframe with a large number of groups. Before applying the summary function the Series is also shifted by 1 to discard the current row value. This is an example for a rolling max shifted by 1:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [5,2,4,5,4,2,3,5,5,2,4,1], 'b': [18,37,60,45,40,40,50,10,30,2,46,19]})
df = df.sort_values('a').reset_index(drop=True)
df['max'] = df.groupby('a', sort=False, as_index=False)['b'].rolling(2, min_periods=1).apply(lambda x: np.max(x[:-1])).reset_index(drop=True)
Result:
df
a b max
0 1 19 NaN
1 2 37 NaN
2 2 40 37.0
3 2 2 40.0
4 3 50 NaN
5 4 60 NaN
6 4 40 60.0
7 4 46 40.0
8 5 18 NaN
9 5 45 18.0
10 5 10 45.0
11 5 30 10.0
The result is correct, but it takes too long once it is applied to a large dataframe and I was wondering if there is a way to refactor this logic to make use of vectorization instead of relying on apply, which, as I read, is implemented as a loop under the hood and it performs poorly.
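One possible refactoring (a sketch, not a definitive answer): because the lambda discards the current row with x[:-1], a rolling window of size w that excludes the current value is equivalent to shift(1) followed by a rolling window of size w - 1. That lets you replace the per-window Python apply with pandas' compiled rolling max:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [5, 2, 4, 5, 4, 2, 3, 5, 5, 2, 4, 1],
                   'b': [18, 37, 60, 45, 40, 40, 50, 10, 30, 2, 46, 19]})
# A stable sort keeps the original order of rows within each group,
# matching the result table shown above.
df = df.sort_values('a', kind='stable').reset_index(drop=True)

w = 2  # original window size, current row included
# shift(1) drops the current row from every window; the built-in rolling
# max then runs in compiled code instead of a per-window Python lambda.
df['max'] = df.groupby('a', sort=False)['b'].transform(
    lambda s: s.shift(1).rolling(w - 1, min_periods=1).max()
)
```

transform still visits each group in Python, but the per-window work is now vectorized, which is usually where the time goes when there are many windows per group.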

Find longest subsequence without NaN values in set of series

Hello, I'm trying to figure out a method that finds the longest common continuous subsequence (in this case, time interval) without any missing (NaN) values in a set of series. This is an example dataframe:
time s_1 s_2 s_3
0 1 2 2 NaN
1 2 3 NaN NaN
2 3 3 2 2
3 4 5 3 10
4 5 8 4 3
5 6 NaN NaN 7
6 7 5 2 NaN
7 8 NaN 3 NaN
For this small example the "best" time interval would be from 3-5 or index 2-4. The real dataframe is way bigger and contains more series. Is it possible to find an efficient solution to this problem?
Thank you very much.
I updated this with a bit of setup for a working example:
import pandas as pd
import numpy as np

s1 = [2, 3, 3, 5, 8, np.nan, 5, np.nan, 1]
s2 = [2, np.nan, 2, 3, 4, np.nan, 2, 3, 1]
s3 = [np.nan, np.nan, 2, 10, 3, 7, np.nan, np.nan, 1]
data = {'time': np.arange(1, 10), 's_1': s1, 's_2': s2, 's_3': s3}
df = pd.DataFrame(data)
print(df)
This creates the DataFrame you posted above, but with an additional entry at the end, so there are two zones with continuous indexes.
I think the best approach from here is to drop all of the rows that are missing data and then count up the longest sequence in the remaining index. Something like this should do the trick:
sequence = np.array(df.dropna(how='any').index)
longest_seq = max(np.split(sequence, np.where(np.diff(sequence) != 1)[0]+1), key=len)
print(df.iloc[longest_seq])
Which will give you:
time s_1 s_2 s_3
2 3 3.0 2.0 2.0
3 4 5.0 3.0 10.0
4 5 8.0 4.0 3.0
dropna first, then use cumsum on diff to build a key that distinguishes the groups by whether their time values are continuous (differ by 1):
s = df.dropna()
idx = s.time.groupby(s.time.diff().ne(1).cumsum()).transform('count')
idx
0 1
2 3
3 3
4 3
Name: time, dtype: int64
yourmax = s[idx == idx.max()]
yourmax
time s_1 s_2 s_3
2 3 3.0 2.0 2.0
3 4 5.0 3.0 10.0
4 5 8.0 4.0 3.0

Fill NaN values in dataframe with previous values in column

Hi, I have a dataframe with some missing values.
The black numbers 40 and 50 are the values already inputted, and the red ones are to be autofilled from the previous values. Row 2 is blank, as there is no previous number to fill.
Any idea how I can do this efficiently? I was trying loops, but maybe there is a better way.
It can be done easily with the ffill method in pandas fillna.
To illustrate, consider the following sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Vals'] = [1, 2, 3, np.nan, np.nan, 6, 7, np.nan, 8]
Vals
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 6.0
6 7.0
7 NaN
8 8.0
To fill the missing values, do this:
df['Vals'].fillna(method='ffill', inplace=True)
Vals
0 1.0
1 2.0
2 3.0
3 3.0
4 3.0
5 6.0
6 7.0
7 7.0
8 8.0
There is a direct synonym function for this, pandas.DataFrame.ffill:
df['Vals'] = df['Vals'].ffill()
