Hi, I am trying to fill my DataFrame's NaN values with the fillna method.
After applying fillna with value=df.mean(axis=1), I am still getting NaN values in some columns.
Can anyone explain how it is filling up the NaN values?
Try:
df.fillna(df.mean())
This fills every NaN with the mean of its column. Note that df.fillna(df.mean(axis=1)) does not do the row-wise equivalent: when fillna is given a Series, pandas aligns the Series index against the DataFrame's column labels, so the row means get matched to the wrong axis and some NaNs survive.
Given df,
0 1 2 3 4
0 804.0 271.0 690.0 401.0 158.0
1 352.0 995.0 770.0 616.0 791.0
2 381.0 824.0 61.0 152.0 NaN
3 907.0 607.0 NaN 488.0 180.0
4 981.0 938.0 378.0 957.0 176.0
5 NaN NaN NaN NaN NaN
Output:
0 1 2 3 4
0 804.0 271.0 690.00 401.0 158.00
1 352.0 995.0 770.00 616.0 791.00
2 381.0 824.0 61.00 152.0 326.25
3 907.0 607.0 474.75 488.0 180.00
4 981.0 938.0 378.00 957.0 176.00
5 685.0 727.0 474.75 522.8 326.25
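If what you actually wanted was to fill each NaN with its row's mean (the axis=1 intent), fillna won't take a row-indexed Series directly, but a common workaround is to transpose, fill, and transpose back. A minimal sketch:
# transpose so rows become columns, fill column-wise (i.e. by original row), transpose back
df = df.T.fillna(df.mean(axis=1)).T
Note that a row which is entirely NaN (like row 5 above) has no mean, so it stays NaN either way.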
I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to only apply the result to the non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward-fill the null values, group by EntryBar, and get the max of each group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
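This should yield the same MaxPriceBetweenEntries values as the answers above; rows whose EntryBar is NaN simply find no match in the merge and come back as NaN, since NaN keys never join.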
Try
df.loc[df['ExitBar'].notna(),'Max']=df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
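Note that this assigns through .values, which discards the grouped index, so it relies on the groups coming back in the same order as the non-null ExitBar rows; that holds here, but it is positional rather than label alignment.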
Current_df:
Unnamed: 0 Div Date Time HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee Unnamed: 62 GB>2.5 GB<2.5 GBAHH GBAHA GBAH HT AT
0 0 E0 2019-08-09 20:00:00 Liverpool Norwich 4 1 H 4 0 H M Oliver NaN NaN NaN NaN NaN NaN NaN NaN
1 1 E0 2019-08-10 12:30:00 West Ham Man City 0 5 A 0 1 A M Dean NaN NaN NaN NaN NaN NaN NaN NaN
2 2 E0 2019-08-10 15:00:00 Bournemouth Sheffield United 1 1 D 0 0 D K Friend NaN NaN NaN NaN NaN NaN NaN NaN
3 3 E0 2019-08-10 15:00:00 Burnley Southampton 3 0 H 0 0 D G Scott NaN NaN NaN NaN NaN NaN NaN NaN
4 4 E0 2019-08-10 15:00:00 Crystal Palace Everton 0 0 D 0 0 D J Moss NaN NaN NaN NaN NaN NaN NaN NaN
5 5 E0 2019-08-10 15:00:00 Watford Brighton 0 3 A 0 1 A C Pawson NaN NaN NaN NaN NaN NaN NaN NaN
6 6 E0 2019-08-10 17:30:00 Tottenham Aston Villa 3 1 H 0 1 A C Kavanagh NaN NaN NaN NaN NaN NaN NaN NaN
7 7 E0 2019-08-11 14:00:00 Leicester Wolves 0 0 D 0 0 D A Marriner NaN NaN NaN NaN NaN NaN NaN NaN
8 7084 G1 2004-09-18 NaN NaN NaN 0 1 A 0 0 D NaN NaN 1.83 1.83 1.66 1.95 0.5 Ergotelis Iraklis
9 7085 G1 2004-09-18 NaN NaN NaN 3 1 H 1 1 D NaN NaN 2.00 1.65 1.90 1.71 -0.5 Xanthi Aris
10 7086 G1 2004-09-19 NaN NaN NaN 1 0 H 1 0 H NaN NaN 2.00 1.65 1.85 1.85 0.0 Chalkidona Panionios
11 7087 G1 2004-09-19 NaN NaN NaN 1 1 D 0 0 D NaN NaN 1.83 1.83 1.67 1.95 0.5 Egaleo AEK
12 7088 G1 2004-09-19 NaN NaN NaN 1 0 H 1 0 H NaN NaN 1.85 1.79 1.85 1.85 0.0 Kalamaria OFI
13 7089 G1 2004-09-19 NaN NaN NaN 2 1 H 1 1 D NaN NaN NaN NaN NaN NaN NaN Olympiakos Kalithea
14 7090 G1 2004-09-19 NaN NaN NaN 3 0 H 2 0 H NaN NaN NaN NaN NaN NaN NaN Panathinaikos Ionikos
Expected df:
Unnamed: 0 Div Date Time HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR Referee Unnamed: 62 GB>2.5 GB<2.5 GBAHH GBAHA GBAH HT AT
0 0 E0 2019-08-09 20:00:00 Liverpool Norwich 4 1 H 4 0 H M Oliver NaN NaN NaN NaN NaN NaN NaN NaN
1 1 E0 2019-08-10 12:30:00 West Ham Man City 0 5 A 0 1 A M Dean NaN NaN NaN NaN NaN NaN NaN NaN
2 2 E0 2019-08-10 15:00:00 Bournemouth Sheffield United 1 1 D 0 0 D K Friend NaN NaN NaN NaN NaN NaN NaN NaN
3 3 E0 2019-08-10 15:00:00 Burnley Southampton 3 0 H 0 0 D G Scott NaN NaN NaN NaN NaN NaN NaN NaN
4 4 E0 2019-08-10 15:00:00 Crystal Palace Everton 0 0 D 0 0 D J Moss NaN NaN NaN NaN NaN NaN NaN NaN
5 5 E0 2019-08-10 15:00:00 Watford Brighton 0 3 A 0 1 A C Pawson NaN NaN NaN NaN NaN NaN NaN NaN
6 6 E0 2019-08-10 17:30:00 Tottenham Aston Villa 3 1 H 0 1 A C Kavanagh NaN NaN NaN NaN NaN NaN NaN NaN
7 7 E0 2019-08-11 14:00:00 Leicester Wolves 0 0 D 0 0 D A Marriner NaN NaN NaN NaN NaN NaN NaN NaN
8 7084 G1 2004-09-18 NaN NaN NaN 0 1 A 0 0 D NaN NaN 1.83 1.83 1.66 1.95 0.5 NaN NaN
9 7085 G1 2004-09-18 NaN Ergotelis Iraklis 3 1 H 1 1 D NaN NaN 2.00 1.65 1.90 1.71 -0.5 NaN NaN
10 7086 G1 2004-09-19 NaN Xanthi Aris 1 0 H 1 0 H NaN NaN 2.00 1.65 1.85 1.85 0.0 NaN NaN
11 7087 G1 2004-09-19 NaN Chalkidona Panionios 1 1 D 0 0 D NaN NaN 1.83 1.83 1.67 1.95 0.5 NaN NaN
12 7088 G1 2004-09-19 NaN Egaleo AEK 1 0 H 1 0 H NaN NaN 1.85 1.79 1.85 1.85 0.0 NaN NaN
13 7089 G1 2004-09-19 NaN Kalamaria OFI 2 1 H 1 1 D NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 7090 G1 2004-09-19 NaN Olympiakos Kalithea 3 0 H 2 0 H NaN NaN NaN NaN NaN NaN NaN NaN NaN
Essentially, I want to place the non-null values of HT and AT into the HomeTeam and AwayTeam columns.
There does not seem to be a straightforward way; I can think of a couple of approaches:
Create new columns: if HT and AT are not blank and HomeTeam and AwayTeam are blank, then use HT and AT, else keep HomeTeam and AwayTeam.
Overwrite in place: in the HomeTeam and AwayTeam columns, if HomeTeam and AwayTeam are blank, then use HT and AT, else keep them.
How can I go about this in pandas?
You can do this, assuming df is your pandas DataFrame and you have imported NumPy as np:
df = df.replace('', np.nan)
This turns empty strings into real NaN values. After that, apply a lambda function with a condition, following the pattern in the code below:
import pandas as pd
names = {'First_name': ['Jon','Bill','Maria','Emma']}
df = pd.DataFrame(names,columns=['First_name'])
df['name_match'] = df['First_name'].apply(lambda x: 'Match' if x == 'Bill' else 'Mismatch')
print (df)
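For the original question, a minimal sketch, assuming the goal is simply to copy HT/AT into HomeTeam/AwayTeam wherever the latter are missing (the one-row offset in the expected output above looks like a transcription artifact):
import numpy as np

# fill missing HomeTeam/AwayTeam from HT/AT, then blank out the source columns,
# matching the expected df where HT and AT end up all NaN
df['HomeTeam'] = df['HomeTeam'].fillna(df['HT'])
df['AwayTeam'] = df['AwayTeam'].fillna(df['AT'])
df[['HT', 'AT']] = np.nan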
I have a dataframe like this
import pandas as pd
import numpy as np
raw_data = {'Country':['UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','UK','US','US','US','US','US','US'],
'Product':['A','A','A','A','B','B','B','B','B','B','B','B','C','C','C','D','D','D','D','D','D'],
'Week': [1,2,3,4,1,2,3,4,5,6,7,8,1,2,3,1,2,3,4,5,6],
'val': [5,4,3,1,5,6,7,8,9,10,11,12,5,5,5,5,6,7,8,9,10]
}
df2 = pd.DataFrame(raw_data, columns = ['Country','Product','Week', 'val'])
print(df2)
I want to calculate the moving average and standard deviation of the val column by Country and Product, over several window sizes: 3 weeks, 5 weeks, 7 weeks, etc.
Wanted dataframe columns:
'Country', 'Product', 'Week', 'val', '3wks_avg', '3wks_std', '5wks_avg', '5wks_std', ...etc
Like WenYoBen suggested, we can create a list of all the window sizes you want, and then dynamically create the desired columns with GroupBy.rolling:
weeks = [3, 5, 7]
for week in weeks:
    df2[[f'{week}wks_avg', f'{week}wks_std']] = (
        df2.groupby(['Country', 'Product']).rolling(window=week, on='Week')['val']
           .agg(['mean', 'std']).reset_index(drop=True)
    )
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
0 UK A 1 5 nan nan nan nan nan nan
1 UK A 2 4 nan nan nan nan nan nan
2 UK A 3 3 4.00 1.00 nan nan nan nan
3 UK A 4 1 2.67 1.53 nan nan nan nan
4 UK B 1 5 nan nan nan nan nan nan
5 UK B 2 6 nan nan nan nan nan nan
6 UK B 3 7 6.00 1.00 nan nan nan nan
7 UK B 4 8 7.00 1.00 nan nan nan nan
8 UK B 5 9 8.00 1.00 7.00 1.58 nan nan
9 UK B 6 10 9.00 1.00 8.00 1.58 nan nan
10 UK B 7 11 10.00 1.00 9.00 1.58 8.00 2.16
11 UK B 8 12 11.00 1.00 10.00 1.58 9.00 2.16
12 UK C 1 5 nan nan nan nan nan nan
13 UK C 2 5 nan nan nan nan nan nan
14 UK C 3 5 5.00 0.00 nan nan nan nan
15 US D 1 5 nan nan nan nan nan nan
16 US D 2 6 nan nan nan nan nan nan
17 US D 3 7 6.00 1.00 nan nan nan nan
18 US D 4 8 7.00 1.00 nan nan nan nan
19 US D 5 9 8.00 1.00 7.00 1.58 nan nan
20 US D 6 10 9.00 1.00 8.00 1.58 nan nan
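Side note, not part of the answer above: if you also want estimates for the first weeks of each group, rolling() accepts min_periods (e.g. rolling(window=week, min_periods=1, on='Week')), which lets the window start partially filled instead of producing NaN.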
This is how you would get the moving average for 3 weeks:
df['3weeks_avg'] = list(df.groupby(['Country', 'Product']).rolling(3).mean()['val'])
Apply the same principle for the other columns you want to compute.
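For example, the 3-week standard deviation follows the same pattern (this relies on list() to throw away the grouped index, so it assumes the rows are already ordered by group):
df['3weeks_std'] = list(df.groupby(['Country', 'Product']).rolling(3).std()['val'])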
IIUC, you may try this
wks = ['Week_3', 'Week_5', 'Week_7']
df_calc = (df2.groupby(['Country', 'Product']).expanding().val
              .agg(['mean', 'std']).rename(lambda x: f'Week_{x+1}', level=-1)
              .query('ilevel_2 in @wks').unstack())
Out[246]:
mean std
Week_3 Week_5 Week_7 Week_3 Week_5 Week_7
Country Product
UK A 4.0 NaN NaN 1.0 NaN NaN
B NaN 5.0 6.0 NaN NaN 1.0
You will want to use a groupby-transform to get the rolling moments of your data. The following should compute what you are looking for:
weeks = [3, 5, 7] # define weeks
df2 = df2.sort_values('Week') # order by time
for i in weeks: # loop through time intervals you want to compute
df2['{}wks_avg'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).mean()) # i-week rolling mean
df2['{}wks_std'.format(i)] = df2.groupby(['Country', 'Product'])['val'].transform(lambda x: x.rolling(i).std()) # i-week rolling std
Here is what the resulting dataframe will look like.
print(df2.dropna().head().to_string())
Country Product Week val 3wks_avg 3wks_std 5wks_avg 5wks_std 7wks_avg 7wks_std
10 UK B 7 11 10.0 1.0 9.0 1.581139 8.0 2.160247
11 UK B 8 12 11.0 1.0 10.0 1.581139 9.0 2.160247
I have a dataframe like so:
df = pd.DataFrame({'time':['23:59:45','23:49:50','23:59:55','00:00:00','00:00:05','00:00:10','00:00:15'],
'X':[-5,-4,-2,5,6,10,11],
'Y':[3,4,5,9,20,22,23]})
As you can see, the times are strings of the form HH:MM:SS and they cross midnight. The time is given every 5 seconds.
My goal, however, is to add empty rows (filled with NaN, for example) so that there is one row per second. Finally, the time column should be converted to a timestamp and set as the index.
Could you please suggest a smart and elegant way to achieve this?
Here is what the output should look like:
X Y
time
23:59:45 -5.0 3.0
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
... ... ...
00:00:10 10.0 22.0
00:00:11 NaN NaN
00:00:12 NaN NaN
00:00:13 NaN NaN
00:00:14 NaN NaN
00:00:15 11.0 23.0
Note: I do not need the dates.
Use to_timedelta with reindex by timedelta_range:
df['time'] = pd.to_timedelta(df['time'])
idx = pd.timedelta_range('0', '23:59:59', freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5.0 9.0
1 00:00:01 NaN NaN
2 00:00:02 NaN NaN
3 00:00:03 NaN NaN
4 00:00:04 NaN NaN
5 00:00:05 6.0 20.0
6 00:00:06 NaN NaN
7 00:00:07 NaN NaN
8 00:00:08 NaN NaN
9 00:00:09 NaN NaN
If you need to replace the NaNs:
df = df.set_index('time').reindex(idx, fill_value=0).reset_index()
print (df.head(10))
time X Y
0 00:00:00 5 9
1 00:00:01 0 0
2 00:00:02 0 0
3 00:00:03 0 0
4 00:00:04 0 0
5 00:00:05 6 20
6 00:00:06 0 0
7 00:00:07 0 0
8 00:00:08 0 0
9 00:00:09 0 0
Another solution with resample, but it is possible that some rows at the end are missing, because resample only spans from the first to the last existing timestamp:
df = df.set_index('time').resample('S').first()
print (df.tail(10))
X Y
time
23:59:46 NaN NaN
23:59:47 NaN NaN
23:59:48 NaN NaN
23:59:49 NaN NaN
23:59:50 NaN NaN
23:59:51 NaN NaN
23:59:52 NaN NaN
23:59:53 NaN NaN
23:59:54 NaN NaN
23:59:55 -2.0 5.0
EDIT1:
import numpy as np

idx1 = pd.timedelta_range('23:59:45', '23:59:59', freq='S', name='time')
idx2 = pd.timedelta_range('0', '00:00:15', freq='S', name='time')
idx = np.concatenate([idx1, idx2])
df['time'] = pd.to_timedelta(df['time'])
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:59:45 -5.0 3.0
1 23:59:46 NaN NaN
2 23:59:47 NaN NaN
3 23:59:48 NaN NaN
4 23:59:49 NaN NaN
5 23:59:50 NaN NaN
6 23:59:51 NaN NaN
7 23:59:52 NaN NaN
8 23:59:53 NaN NaN
9 23:59:54 NaN NaN
print (df.tail(10))
time X Y
21 00:00:06 NaN NaN
22 00:00:07 NaN NaN
23 00:00:08 NaN NaN
24 00:00:09 NaN NaN
25 00:00:10 10.0 22.0
26 00:00:11 NaN NaN
27 00:00:12 NaN NaN
28 00:00:13 NaN NaN
29 00:00:14 NaN NaN
30 00:00:15 11.0 23.0
EDIT:
Another solution - shift the times after midnight into the next day by adding 1-day timedeltas:
df['time'] = pd.to_timedelta(df['time'])
a = pd.to_timedelta(df['time'].diff().dt.days.abs().cumsum().fillna(1).sub(1), unit='d')
df['time'] = df['time'] + a
print (df)
X Y time
0 -5 3 0 days 23:59:45
1 -4 4 0 days 23:49:50
2 -2 5 0 days 23:59:55
3 5 9 1 days 00:00:00
4 6 20 1 days 00:00:05
5 10 22 1 days 00:00:10
6 11 23 1 days 00:00:15
idx = pd.timedelta_range(df['time'].min(), df['time'].max(), freq='S', name='time')
df = df.set_index('time').reindex(idx).reset_index()
print (df.head(10))
time X Y
0 23:49:50 -4.0 4.0
1 23:49:51 NaN NaN
2 23:49:52 NaN NaN
3 23:49:53 NaN NaN
4 23:49:54 NaN NaN
5 23:49:55 NaN NaN
6 23:49:56 NaN NaN
7 23:49:57 NaN NaN
8 23:49:58 NaN NaN
9 23:49:59 NaN NaN
print (df.tail(10))
time X Y
616 1 days 00:00:06 NaN NaN
617 1 days 00:00:07 NaN NaN
618 1 days 00:00:08 NaN NaN
619 1 days 00:00:09 NaN NaN
620 1 days 00:00:10 10.0 22.0
621 1 days 00:00:11 NaN NaN
622 1 days 00:00:12 NaN NaN
623 1 days 00:00:13 NaN NaN
624 1 days 00:00:14 NaN NaN
625 1 days 00:00:15 11.0 23.0
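If you then want the index back as a plain time of day (dropping the day component introduced above), one possible sketch is to take each timedelta modulo one day; Timedelta arithmetic supports % in recent pandas versions, but double-check on yours:
# reduce '1 days 00:00:15' back to '00:00:15' (time of day only)
df['time'] = df['time'] % pd.Timedelta(days=1)
df = df.set_index('time')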