DataFrame.fillna method - filling the NaN values with df.mean(axis=1) - Python

Hi, I am trying to fill my DataFrame's NaN values with the fillna method:
after applying fillna with value=df.mean(axis=1), I am still getting NaN values in some columns.
Can anyone explain how it fills the NaN values?

Try:
df.fillna(df.mean())
This fills each NaN with the mean of its column.
Given df,
0 1 2 3 4
0 804.0 271.0 690.0 401.0 158.0
1 352.0 995.0 770.0 616.0 791.0
2 381.0 824.0 61.0 152.0 NaN
3 907.0 607.0 NaN 488.0 180.0
4 981.0 938.0 378.0 957.0 176.0
5 NaN NaN NaN NaN NaN
Output:
0 1 2 3 4
0 804.0 271.0 690.00 401.0 158.00
1 352.0 995.0 770.00 616.0 791.00
2 381.0 824.0 61.00 152.0 326.25
3 907.0 607.0 474.75 488.0 180.00
4 981.0 938.0 378.00 957.0 176.00
5 685.0 727.0 474.75 522.8 326.25
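If you actually want the row mean (axis=1), note that fillna aligns a Series argument with the column labels, not the row labels, so df.fillna(df.mean(axis=1)) fills nothing (or the wrong thing) and leaves NaN behind. A minimal sketch of one workaround, transposing so the row means line up:
# after transposing, the original row labels become columns,
# so the row-mean Series can align with them
df = df.T.fillna(df.mean(axis=1)).T
# an all-NaN row (like row 5 above) stays NaN, since its mean is NaN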

Related

Calculate maximum difference of rolling interval of n columns

I have a dataset
df
Time Spot Ubalance
0 2017-01-01T00:00:00+01:00 20.96 NaN
1 2017-01-01T01:00:00+01:00 20.90 29.40
2 2017-01-01T02:00:00+01:00 18.13 24.73
3 2017-01-01T03:00:00+01:00 16.03 24.73
4 2017-01-01T04:00:00+01:00 16.43 27.89
5 2017-01-01T05:00:00+01:00 13.75 28.26
6 2017-01-01T06:00:00+01:00 11.10 30.43
7 2017-01-01T07:00:00+01:00 15.47 32.85
8 2017-01-01T08:00:00+01:00 16.88 33.91
9 2017-01-01T09:00:00+01:00 21.81 28.58
10 2017-01-01T10:00:00+01:00 26.24 28.58
I want to generate a Series/DataFrame holding the maximum difference between the highest and lowest value of the last n rows across multiple columns; e.g., the maximum difference over the "last" 10 rows here would be
33.91 (the highest, in "Ubalance") - 11.10 (the lowest, in "Spot") = 22.81
I've tried .rolling(), but it apparently does not provide a difference method.
Expected outcome:
Time Spot Ubalance Diff
0 2017-01-01T00:00:00+01:00 20.96 NaN NaN
1 2017-01-01T01:00:00+01:00 20.90 29.40 NaN
2 2017-01-01T02:00:00+01:00 18.13 24.73 NaN
3 2017-01-01T03:00:00+01:00 16.03 24.73 NaN
4 2017-01-01T04:00:00+01:00 16.43 27.89 NaN
5 2017-01-01T05:00:00+01:00 13.75 28.26 NaN
6 2017-01-01T06:00:00+01:00 11.10 30.43 NaN
7 2017-01-01T07:00:00+01:00 15.47 32.85 NaN
8 2017-01-01T08:00:00+01:00 16.88 33.91 NaN
9 2017-01-01T09:00:00+01:00 21.81 28.58 NaN
10 2017-01-01T10:00:00+01:00 26.24 28.58 22.81
Use Rolling.aggregate and then subtract:
df1 = df['Spot'].rolling(10).agg(['min','max'])
print (df1)
min max
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 11.1 21.81
10 11.1 26.24
df['dif'] = df1['max'].sub(df1['min'])
print (df)
Time Spot Ubalance dif
0 2017-01-01T00:00:00+01:00 20.96 NaN NaN
1 2017-01-01T01:00:00+01:00 20.90 29.40 NaN
2 2017-01-01T02:00:00+01:00 18.13 24.73 NaN
3 2017-01-01T03:00:00+01:00 16.03 24.73 NaN
4 2017-01-01T04:00:00+01:00 16.43 27.89 NaN
5 2017-01-01T05:00:00+01:00 13.75 28.26 NaN
6 2017-01-01T06:00:00+01:00 11.10 30.43 NaN
7 2017-01-01T07:00:00+01:00 15.47 32.85 NaN
8 2017-01-01T08:00:00+01:00 16.88 33.91 NaN
9 2017-01-01T09:00:00+01:00 21.81 28.58 10.71
10 2017-01-01T10:00:00+01:00 26.24 28.58 15.14
Or use a custom function with a lambda:
df['diff'] = df['Spot'].rolling(10).agg(lambda x: x.max() - x.min())
EDIT:
To process all columns from a list, use:
cols = ['Spot','Ubalance']
N = 10
df['dif'] = (df[cols].stack(dropna=False)
                     .rolling(len(cols) * N)
                     .agg(lambda x: x.max() - x.min())
                     .groupby(level=0)
                     .max())
print (df)
Time Spot Ubalance dif
0 2017-01-01T00:00:00+01:00 20.96 NaN NaN
1 2017-01-01T01:00:00+01:00 20.90 29.40 NaN
2 2017-01-01T02:00:00+01:00 18.13 24.73 NaN
3 2017-01-01T03:00:00+01:00 16.03 24.73 NaN
4 2017-01-01T04:00:00+01:00 16.43 27.89 NaN
5 2017-01-01T05:00:00+01:00 13.75 28.26 NaN
6 2017-01-01T06:00:00+01:00 11.10 30.43 NaN
7 2017-01-01T07:00:00+01:00 15.47 32.85 NaN
8 2017-01-01T08:00:00+01:00 16.88 33.91 NaN
9 2017-01-01T09:00:00+01:00 21.81 28.58 NaN
10 2017-01-01T10:00:00+01:00 26.24 28.58 22.81
You could use a rolling window like this:
n = 10
df.rolling(n).apply(func=lambda x: x.max() - x.min())
Note that rolling applies the function to each column separately, so select the column you want before rolling.
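A vectorized alternative (my own sketch, not from the answers above): take the per-row max/min across the columns first, then roll over those. The NaN handling differs slightly from the stack approach, because the per-row max/min skip NaN, so a window containing NaN can still produce a result:
cols = ['Spot', 'Ubalance']
N = 10
# highest value in any of the columns over the last N rows
roll_max = df[cols].max(axis=1).rolling(N).max()
# lowest value in any of the columns over the last N rows
roll_min = df[cols].min(axis=1).rolling(N).min()
df['dif'] = roll_max - roll_min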

Append empty rows by subtracting 7 days from date

How can I create empty rows, going back 7 days at a time from 2016-01-01 to January 2015? I tried reindexing.
df
date value
0 2016-01-01 4.0
1 2016-01-08 5.0
2 2016-01-15 1.0
Expected Output
date value
2015-01-02 NaN
....
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
First create a DatetimeIndex:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
Then use DataFrame.reindex with a date_range running from your start date to the minimal index value, combined with the original index via Index.union to avoid losing the original index values:
rng = pd.date_range('2015-01-02', df.index.min(), freq='7d').union(df.index)
df = df.reindex(rng)
print (df)
value
2015-01-02 NaN
2015-01-09 NaN
2015-01-16 NaN
2015-01-23 NaN
2015-01-30 NaN
2015-02-06 NaN
2015-02-13 NaN
2015-02-20 NaN
2015-02-27 NaN
2015-03-06 NaN
2015-03-13 NaN
2015-03-20 NaN
2015-03-27 NaN
2015-04-03 NaN
2015-04-10 NaN
2015-04-17 NaN
2015-04-24 NaN
2015-05-01 NaN
2015-05-08 NaN
2015-05-15 NaN
2015-05-22 NaN
2015-05-29 NaN
2015-06-05 NaN
2015-06-12 NaN
2015-06-19 NaN
2015-06-26 NaN
2015-07-03 NaN
2015-07-10 NaN
2015-07-17 NaN
2015-07-24 NaN
2015-07-31 NaN
2015-08-07 NaN
2015-08-14 NaN
2015-08-21 NaN
2015-08-28 NaN
2015-09-04 NaN
2015-09-11 NaN
2015-09-18 NaN
2015-09-25 NaN
2015-10-02 NaN
2015-10-09 NaN
2015-10-16 NaN
2015-10-23 NaN
2015-10-30 NaN
2015-11-06 NaN
2015-11-13 NaN
2015-11-20 NaN
2015-11-27 NaN
2015-12-04 NaN
2015-12-11 NaN
2015-12-18 NaN
2015-12-25 NaN
2016-01-01 4.0
2016-01-08 5.0
2016-01-15 1.0
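If you would rather not hardcode the start date, a sketch that derives it instead (assuming you want exactly 52 weekly steps before the first date): build the range backwards from the minimal index value with the end parameter:
# 53 weekly dates ending at df.index.min() (2016-01-01),
# i.e. starting 52 weeks earlier at 2015-01-02
rng = pd.date_range(end=df.index.min(), periods=53, freq='7d').union(df.index)
df = df.reindex(rng)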

If the row below is a duplicate, use values from other columns until a new value is found

I have a tricky data manipulation question. Basically, I have a list of dates. On each day, there is a count of how many issues are open. I want to create a new column, ideal_issues_left, that uses np.linspace to calculate the ideal number of issues left, if they are all to be completed at a steady rate each day to zero at the end of the date range.
I have managed to create a dataframe of the estimates per day from each starting point, but what I want to do now is fill the ideal_issues_left column with the estimates based on the following logic:
If the number of open issues is different the next day, fill ideal_issues_left with the first column from the estimates data frame.
If the number of open issues is the same, fill ideal_issues_left with data from the columns 1+, until a new number of open_issues is reached.
For example, say this is the date range and open issues:
import pandas as pd

chart_data = pd.DataFrame({
    'date': pd.date_range('2018-08-19', '2018-09-01', freq='d'),
    'open_issues': [23.0, 25.0, 26.0, 26.0, 28.0, 36.0, 33.0,
                    39.0, 39.0, 38.0, 38.0, 38.0, 38.0, 38.0]
})
chart_data
date open_issues
0 2018-08-19 23.0
1 2018-08-20 25.0
2 2018-08-21 26.0
3 2018-08-22 26.0
4 2018-08-23 28.0
5 2018-08-24 36.0
6 2018-08-25 33.0
7 2018-08-26 39.0
8 2018-08-27 39.0
9 2018-08-28 38.0
10 2018-08-29 38.0
11 2018-08-30 38.0
12 2018-08-31 38.0
13 2018-09-01 38.0
import numpy as np

p = []
for day, val in enumerate(chart_data.loc[:, 'open_issues']):
    days_left = 14 - day
    p.append(np.linspace(start=val, stop=0, num=days_left))
estimates = pd.DataFrame(p)
estimates
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 23.0 21.230769 19.461538 17.692308 15.923077 14.153846 12.384615 10.615385 8.846154 7.076923 5.307692 3.538462 1.769231 0.0
1 25.0 22.916667 20.833333 18.750000 16.666667 14.583333 12.500000 10.416667 8.333333 6.250000 4.166667 2.083333 0.000000 NaN
2 26.0 23.636364 21.272727 18.909091 16.545455 14.181818 11.818182 9.454545 7.090909 4.727273 2.363636 0.000000 NaN NaN
3 26.0 23.400000 20.800000 18.200000 15.600000 13.000000 10.400000 7.800000 5.200000 2.600000 0.000000 NaN NaN NaN
4 28.0 24.888889 21.777778 18.666667 15.555556 12.444444 9.333333 6.222222 3.111111 0.000000 NaN NaN NaN NaN
5 36.0 31.500000 27.000000 22.500000 18.000000 13.500000 9.000000 4.500000 0.000000 NaN NaN NaN NaN NaN
6 33.0 28.285714 23.571429 18.857143 14.142857 9.428571 4.714286 0.000000 NaN NaN NaN NaN NaN NaN
7 39.0 32.500000 26.000000 19.500000 13.000000 6.500000 0.000000 NaN NaN NaN NaN NaN NaN NaN
8 39.0 31.200000 23.400000 15.600000 7.800000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN
9 38.0 28.500000 19.000000 9.500000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 38.0 25.333333 12.666667 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 38.0 19.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 38.0 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 38.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
The desired end result should be:
chart_data
date open_issues ideal_issues_left
0 2018-08-19 23.0 23.0
1 2018-08-20 25.0 25.0
2 2018-08-21 26.0 26.0 # <- this value is from estimates row 2 col 0
3 2018-08-22 26.0 23.6 # <- this value is from estimates row 2 col 1
4 2018-08-23 28.0 28.0
5 2018-08-24 36.0 36.0
6 2018-08-25 33.0 33.0
7 2018-08-26 39.0 39.0 # <- this value is from estimates row 7 col 0
8 2018-08-27 39.0 32.5 # <- this value is from estimates row 7 col 1
9 2018-08-28 38.0 38.0 # <- this value is from estimates row 9 col 0
10 2018-08-29 38.0 28.5 # <- this value is from estimates row 9 col 1
11 2018-08-30 38.0 19.0 # <- this value is from estimates row 9 col 2
12 2018-08-31 38.0 9.5 # <- this value is from estimates row 9 col 3
13 2018-09-01 38.0 0.0 # <- this value is from estimates row 9 col 4
Thank you!
When consecutive rows have the same number of open issues, build a cumulative count of the duplicates with groupby/cumcount. That count then selects the matching column from the estimates frame for each row's open-issues value.
chart_data['flg'] = (chart_data['open_issues']
    .groupby((chart_data['open_issues'] != chart_data['open_issues'].shift()).cumsum())
    .cumcount())
chart_data
date open_issues flg
0 2018-08-19 23.0 0
1 2018-08-20 25.0 0
2 2018-08-21 26.0 0
3 2018-08-22 26.0 1
4 2018-08-23 28.0 0
5 2018-08-24 36.0 0
6 2018-08-25 33.0 0
7 2018-08-26 39.0 0
8 2018-08-27 39.0 1
9 2018-08-28 38.0 0
10 2018-08-29 38.0 1
11 2018-08-30 38.0 2
12 2018-08-31 38.0 3
13 2018-09-01 38.0 4
for i, issues in enumerate(chart_data['open_issues']):
    k = chart_data.loc[i, 'flg']
    df = estimates[estimates[0] == issues]
    # k-th estimate column of the first row whose start value matches
    l = df.iloc[:1, k].values
    chart_data.loc[i, 'ideal_issues_left'] = l
chart_data
chart_data
date open_issues flg ideal_issues_left
0 2018-08-19 23.0 0 23.000000
1 2018-08-20 25.0 0 25.000000
2 2018-08-21 26.0 0 26.000000
3 2018-08-22 26.0 1 23.636364
4 2018-08-23 28.0 0 28.000000
5 2018-08-24 36.0 0 36.000000
6 2018-08-25 33.0 0 33.000000
7 2018-08-26 39.0 0 39.000000
8 2018-08-27 39.0 1 32.500000
9 2018-08-28 38.0 0 38.000000
10 2018-08-29 38.0 1 28.500000
11 2018-08-30 38.0 2 19.000000
12 2018-08-31 38.0 3 9.500000
13 2018-09-01 38.0 4 0.000000
If your dataset is large and you want to avoid looping, you can use merge instead.
chart_data["prev_day_open_issues"] = chart_data["open_issues"].shift(1)
chart_data["no match"] = chart_data["open_issues"] != chart_data["prev_day_open_issues"]
# same idea as in r-beginners code
chart_data["ideal_pos"] = (chart_data["open_issues"]
.groupby(chart_data["no match"].cumsum())
.cumcount())
# tidy up and remove temp columns
new_chart_data = chart_data[["date", "open_issues", "ideal_pos"]]
# make your estimates dataframe into a one-to-one lookup in long format
estimates["open_issues"] = estimates[0]
new_estimates = (estimates
                 .drop_duplicates(subset=["open_issues"])
                 .melt(id_vars="open_issues", var_name="ideal_pos",
                       value_name="ideal_issues_left"))
# join
final = new_chart_data.merge(new_estimates, how="left", on=["open_issues", "ideal_pos"])
print(final[["date", "open_issues", "ideal_issues_left"]])
date open_issues ideal_issues_left
0 2018-08-19 23.0 23.000000
1 2018-08-20 25.0 25.000000
2 2018-08-21 26.0 26.000000
3 2018-08-22 26.0 23.636364
4 2018-08-23 28.0 28.000000
5 2018-08-24 36.0 36.000000
6 2018-08-25 33.0 33.000000
7 2018-08-26 39.0 39.000000
8 2018-08-27 39.0 32.500000
9 2018-08-28 38.0 38.000000
10 2018-08-29 38.0 28.500000
11 2018-08-30 38.0 19.000000
12 2018-08-31 38.0 9.500000
13 2018-09-01 38.0 0.000000
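Since every estimate comes from np.linspace, a closed-form sketch (my own variant of the same cumcount idea, not from either answer) can skip the estimates lookup entirely: for a row at offset k inside a run of equal values whose first row is i - k, the ideal value is open_issues * (n - 1 - k) / (n - 1) with n = len(chart_data) - (i - k):
total = len(chart_data)
k = chart_data.groupby(
    (chart_data['open_issues'] != chart_data['open_issues'].shift()).cumsum()
).cumcount()
n = total - (chart_data.index - k)  # linspace length from the run's first row
# caveat: n == 1 (a new value appearing on the final day) would divide by zero
chart_data['ideal_issues_left'] = chart_data['open_issues'] * (n - 1 - k) / (n - 1)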

How do I add data in a dataframe under the columns and across the rows?

I have a few dataframes that were loaded previously from CSV files:
b = portfolionew_df.loc[1, ['x_1','x_2','x_3','x_4','x_5']]
x = stockprice_df.loc[:, b]
print(x)
This is the result for x:
NYSEARCA:RYE NYSEARCA:XOP NYSEARCA:PXE NYSEARCA:VAW NYSEARCA:PYZ
0 68.37 52.00 25.37 87.94 35.00
1 60.70 48.04 22.64 83.78 32.61
2 67.04 54.48 24.70 86.61 34.44
3 65.86 53.75 24.16 84.94 34.21
c = pd.DataFrame(index=time_df['Date'], columns=b)
print(c)
This is the result for c:
Date NYSEARCA:RYE NYSEARCA:XOP NYSEARCA:PXE NYSEARCA:VAW NYSEARCA:PYZ
2007-12-31 NaN NaN NaN NaN NaN
2008-01-31 NaN NaN NaN NaN NaN
2008-02-29 NaN NaN NaN NaN NaN
2008-03-31 NaN NaN NaN NaN NaN
The content is all NaN because I did not manage to add the data in.
How can I achieve this?
Date NYSEARCA:RYE NYSEARCA:XOP NYSEARCA:PXE NYSEARCA:VAW NYSEARCA:PYZ
2007-12-31 68.37 52.00 25.37 87.94 35.00
2008-01-31 60.70 48.04 22.64 83.78 32.61
2008-02-29 67.04 54.48 24.70 86.61 34.44
2008-03-31 65.86 53.75 24.16 84.94 34.21
My aim is to add the data from x into the c dataframe. How can I do it?
As your data is in x and your index in time_df, you can do:
c = x
c = c.set_index(time_df['Date'])
print(c)
IIUC, by using combine_first:
c.reset_index().combine_first(x)
Out[523]:
Date NYSEARCA:PXE NYSEARCA:PYZ NYSEARCA:RYE NYSEARCA:VAW \
0 2007-12-31 25.37 35.00 68.37 87.94
1 2008-01-31 22.64 32.61 60.70 83.78
2 2008-02-29 24.70 34.44 67.04 86.61
3 2008-03-31 24.16 34.21 65.86 84.94
NYSEARCA:XOP
0 52.00
1 48.04
2 54.48
3 53.75
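Alternatively, since x already has the right values in the right order, a minimal sketch that builds c in one step from the raw values (assuming the rows of x line up with time_df['Date']):
# bypass index alignment entirely by taking the underlying array
c = pd.DataFrame(x.values, index=time_df['Date'], columns=b)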

How to assign values to a new data frame from another data frame in Python

I set up a new data frame SimMean:
import pandas as pd

columns = ['Tenor', '5x16', '7x8', '2x16H']
index = range(0, 12)
SimMean = pd.DataFrame(index=index, columns=columns)
SimMean
Tenor 5x16 7x8 2x16H
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 NaN NaN NaN NaN
10 NaN NaN NaN NaN
11 NaN NaN NaN NaN
I have another data frame FwdDf:
FwdDf
Tenor 5x16 7x8 2x16H
0 2017-01-01 50.94 34.36 43.64
1 2017-02-01 50.90 32.60 42.68
2 2017-03-01 42.66 26.26 37.26
3 2017-04-01 37.08 22.65 32.46
4 2017-05-01 42.21 20.94 33.28
5 2017-06-01 39.30 22.05 32.29
6 2017-07-01 50.90 21.80 38.51
7 2017-08-01 42.77 23.64 35.07
8 2017-09-01 37.45 19.61 32.68
9 2017-10-01 37.55 21.75 32.10
10 2017-11-01 35.61 22.73 32.90
11 2017-12-01 40.16 29.79 37.49
12 2018-01-01 53.45 36.09 47.61
13 2018-02-01 52.89 35.74 45.00
14 2018-03-01 44.67 27.79 38.62
15 2018-04-01 38.48 24.21 34.43
16 2018-05-01 43.87 22.17 34.69
17 2018-06-01 40.24 22.85 34.31
18 2018-07-01 49.98 23.58 39.96
19 2018-08-01 45.57 24.76 37.23
20 2018-09-01 38.90 21.74 34.22
21 2018-10-01 39.75 23.36 35.20
22 2018-11-01 38.04 24.20 34.62
23 2018-12-01 42.68 31.03 40.00
Now I need to assign the 'Tenor' data from rows 12 to 23 of FwdDf to the new data frame SimMean.
I used
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor']
but it didn't work:
SimMean
Tenor 5x16 7x8 2x16H
0 None NaN NaN NaN
1 None NaN NaN NaN
2 None NaN NaN NaN
3 None NaN NaN NaN
4 None NaN NaN NaN
5 None NaN NaN NaN
6 None NaN NaN NaN
7 None NaN NaN NaN
8 None NaN NaN NaN
9 None NaN NaN NaN
10 None NaN NaN NaN
11 None NaN NaN NaN
I'm new to Python. I would appreciate your help. Thanks!
Call .values so there are no index alignment issues (the plain assignment aligns on index labels, and labels 12-23 never match 0-11, so every value comes back missing):
In [35]:
SimMean.loc[0:11,'Tenor'] = FwdDf.loc[12:23,'Tenor'].values
SimMean
Out[35]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
EDIT
As your column is actually datetime, you then need to convert the type back:
In [46]:
SimMean['Tenor'] = pd.to_datetime(SimMean['Tenor'])
SimMean
Out[46]:
Tenor 5x16 7x8 2x16H
0 2018-01-01 NaN NaN NaN
1 2018-02-01 NaN NaN NaN
2 2018-03-01 NaN NaN NaN
3 2018-04-01 NaN NaN NaN
4 2018-05-01 NaN NaN NaN
5 2018-06-01 NaN NaN NaN
6 2018-07-01 NaN NaN NaN
7 2018-08-01 NaN NaN NaN
8 2018-09-01 NaN NaN NaN
9 2018-10-01 NaN NaN NaN
10 2018-11-01 NaN NaN NaN
11 2018-12-01 NaN NaN NaN
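An equivalent sketch that re-labels instead of stripping to raw values: reset the slice's index so it lines up with SimMean's 0-11 (assuming FwdDf['Tenor'] is already datetime, the dtype is kept and the conversion step is not needed):
SimMean['Tenor'] = FwdDf.loc[12:23, 'Tenor'].reset_index(drop=True)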
