Filling Null values with respective mean [duplicate] - python

This question already has answers here:
How to fillna by groupby outputs in pandas?
(3 answers)
Closed 4 years ago.
I have a dataset as follows:
alldata.loc[:,["Age","Pclass"]].head(10)
Out[24]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 NaN 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
Now I want to fill every null value in Age with the mean of all Age values for that row's Pclass.
Example:
In the snippet above, the null Age belongs to Pclass = 3, so it should be replaced by the mean of all ages with Pclass = 3, i.e. 22.4.
I tried some solutions using groupby, but they changed only one specific Pclass value and converted the rest of the fields to null. How can I end up with zero null values here?
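For reference, here is a minimal sketch rebuilding the sample frame above (the answers below call it df rather than alldata):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})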

You can use
1] transform with a lambda function
In [41]: df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))
Out[41]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
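Since transform returns a Series aligned to the original index, you can write the result straight back (a sketch):
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))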
Or use
2] fillna with the group means
In [46]: df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean'))
Out[46]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
5 22.4
6 54.0
7 2.0
8 27.0
9 14.0
Name: Age, dtype: float64
Or use
3] loc to replace null values
In [47]: df.loc[df['Age'].isnull(), 'Age'] = df.groupby('Pclass')['Age'].transform('mean')
In [48]: df
Out[48]:
Age Pclass
0 22.0 3
1 38.0 1
2 26.0 3
3 35.0 1
4 35.0 3
5 22.4 3
6 54.0 1
7 2.0 3
8 27.0 3
9 14.0 2
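One caveat, assuming your full data might contain a Pclass whose ages are all NaN (not the case in the snippet above): such a group's mean is itself NaN, so its rows stay null after any of the three approaches. A sketch of a fallback to the overall mean:
df['Age'] = df.groupby('Pclass')['Age'].transform(lambda x: x.fillna(x.mean()))
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fallback for all-NaN groups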

Related

Pandas - new column based on other row

I want to create a new column in my dataframe containing the value from another row.
DataFrame
TimeStamp Event Value
0 1603822620000 1 102.0
1 1603822680000 1 108.0
2 1603822740000 1 107.0
3 1603822800000 2 1
4 1603823040000 1 106.0
5 1603823100000 2 0
6 1603823160000 2 1
7 1603823220000 1 105.0
I would like to add a new column with the previous value where event = 1.
TimeStamp Event Value PrevValue
0 1603822620000 1 102.0 NaN
1 1603822680000 1 108.0 102.0
2 1603822740000 1 107.0 108.0
3 1603822800000 2 1 107.0
4 1603823040000 1 106.0 107.0
5 1603823100000 2 0 106.0
6 1603823160000 2 1 106.0
7 1603823220000 1 105.0 106.0
So I can't simply use shift(1), and groupby('Event').shift(1) doesn't work either.
Current solution:
df["PrevValue"] = df.TimeStamp.apply(lambda ts: df[(df.Event == 1) & (df.TimeStamp < ts)].iloc[-1].Value)
But I guess that's not the best solution: it scans the whole frame per row, and iloc[-1] raises an IndexError on the first row, where no earlier Event == 1 row exists.
Is there something like shiftUntilCondition(condition)?
Thanks a lot!
Try with
df['new'] = df['Value'].where(df['Event']==1).ffill().shift()
Out[83]:
0 NaN
1 102.0
2 108.0
3 107.0
4 107.0
5 106.0
6 106.0
7 106.0
Name: Value, dtype: float64
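To see why this works, here is the chain unrolled on the sample data (a sketch; masked and carried are hypothetical intermediate names):
masked = df['Value'].where(df['Event'] == 1)  # keep Value only where Event == 1, else NaN
carried = masked.ffill()                      # carry the last Event == 1 value forward
df['new'] = carried.shift()                   # shift down one row: previous, not current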

How to calculate totals in dataframe by pandas

I have got this dataframe:
Date Trader1 Trader2 Trader3
01/04/2020 4 6 8
02/04/2020 4 6 8
03/04/2020 4 7 8
04/04/2020 4 7 8
05/04/2020 3 5 7
06/04/2020 2 4 7
07/04/2020 2 3 6
08/04/2020 3 3 6
09/04/2020 3 5 7
10/04/2020 3 5 7
11/04/2020 3 5 6
I would like to get totals for each column using the python/pandas library. When I apply a.loc['Total'] = pd.Series(a.sum()) I do get a totals row, but it also "sums" the Date column by concatenating the date strings. How can I calculate totals only for the needed columns?
You can select only the numeric columns with DataFrame.select_dtypes (np.number assumes import numpy as np):
a.loc['Total'] = a.select_dtypes(np.number).sum()
You can remove the Date column with DataFrame.drop:
a.loc['Total'] = a.drop('Date', axis=1).sum()
Or select all columns except the first, by position, with DataFrame.iloc:
a.loc['Total'] = a.iloc[:, 1:].sum()
print (a)
Date Trader1 Trader2 Trader3
0 01/04/2020 4.0 6.0 8.0
1 02/04/2020 4.0 6.0 8.0
2 03/04/2020 4.0 7.0 8.0
3 04/04/2020 4.0 7.0 8.0
4 05/04/2020 3.0 5.0 7.0
5 06/04/2020 2.0 4.0 7.0
6 07/04/2020 2.0 3.0 6.0
7 08/04/2020 3.0 3.0 6.0
8 09/04/2020 3.0 5.0 7.0
9 10/04/2020 3.0 5.0 7.0
10 11/04/2020 3.0 5.0 6.0
Total NaN 35.0 56.0 78.0
data[['Trader1', 'Trader2', 'Trader3']].sum()
I just saw your comment; there may be better ways, but I think this should work too:
data[data.columns[1:]].sum()
You have to provide the column range in the last line.
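On reasonably recent pandas versions you can also let sum skip non-numeric columns itself (a sketch, same frame a as above):
a.loc['Total'] = a.sum(numeric_only=True)  # ignores the string Date column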

How to interpolate in Pandas using only previous values?

This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two things here: group by id, then interpolate using only the values above each row, never the values below it.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group - hence index 6 will return 0 not 2)
You can group by id and then loop over the groups to interpolate each one. Note that for id = 2 this will not give you the value 2: limit_direction='forward' only controls which NaNs get filled, while krogh still fits through later values in the group, which is why index 6 comes out as 4.7.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = pd.concat(data).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
The current pandas.Series.interpolate does not support what you want, so to achieve your goal you need two groupbys that restrict the fit to previous rows. The idea is as follows: put each missing value (!!!) into one group together with the rows before it (this has limitations if several values in a row are missing, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID", "extrapolate"], each missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to extrapolate (with a spline of order=1):
df.groupby(["ID", "extrapolate"], as_index=False).apply(
    lambda grp: grp.interpolate(method="spline", order=1)
).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.

How to add the sum of some column at the end of dataframe

I have a pandas dataframe with 11 columns. I want to append the sums of column 9 and column 10 as a row at the end of the table. So far I have tried two methods:
Assigning the data to the cell with dataframe.iloc[rownumber, 8]. This results in an out-of-bounds error.
Creating a vector with some blanks ('') using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice, as it added the totals as a vertical vector at the end rather than a horizontal one. What can I do to solve the issue?
You need to use pandas.DataFrame.append with ignore_index=True:
dataframe = dataframe.append(dataframe[['Column 9', 'Column 10']].sum(), ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'; df1 is the DataFrame that needs to be summed up, and df3 is a one-line DataFrame containing only the sums:
data = [[df1.a.sum(), df1.b.sum()]]
df3 = pd.DataFrame(data, columns=['a', 'b'])
Then append it to the end:
df1.append(df3)
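Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the equivalent is pd.concat (a sketch, reusing df1 and df3 from above):
df1 = pd.concat([df1, df3])  # add ignore_index=True for a fresh 0..n index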
Simply try this (replace test with your dataframe's name).
Row-wise sum (which you asked for):
test['Total'] = test[['col9', 'col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9', 'col10']].sum()
test.fillna('', inplace=True)
print(test)
IIUC, this is what you need (change the numbers 8 and 9 to suit your needs):
df['total'] = df.iloc[:, [8, 9]].sum(axis=1)   # horizontal (row-wise) sum of columns 8 & 9
df['total1'] = df.iloc[:, [8, 9]].sum().sum()  # grand total over columns 8 & 9
df.loc['total2'] = df.iloc[:, [8, 9]].sum()    # vertical sums in a new row, for columns 8 & 9 only
Example:
a = np.arange(0, 11, 1)
b = np.random.randint(10, size=(5, 11))
df = pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN

Merging two Pandas series using a key

I have two pandas dataframes, namely x and y.
x.head() gives:
user hotel rating id
0 1 1253 5 2783_1253
1 4 589 5 2783_589
2 5 1270 4 2783_1270
3 3 1274 4 2783_1274
4 2 741 5 2783_741
y.head() gives:
UserID Gender Age Occupation Zip Code
0 1.0 F 18.0 10.0 48067
1 2.0 M 56.0 16.0 70072
2 3.0 M 25.0 15.0 55117
3 4.0 M 45.0 7.0 2460
4 5.0 M 25.0 20.0 55455
What I need is to merge columns of these two where user = UserID.
So for example my first row should look like:
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
How can I get it?
I think you need to first convert the float column to int and then merge:
y['user'] = y.UserID.astype(int)
df = pd.merge(x, y, on='user')
print (df)
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
1 4 589 5 2783_589 4.0 M 45.0 7.0 2460
2 5 1270 4 2783_1270 5.0 M 25.0 20.0 55455
3 3 1274 4 2783_1274 3.0 M 25.0 15.0 55117
4 2 741 5 2783_741 2.0 M 56.0 16.0 70072
Or convert the integer column to float instead:
x['UserID'] = x.user.astype(float)
df = pd.merge(x, y, on='UserID')
print (df)
user hotel rating id UserID Gender Age Occupation Zip Code
0 1 1253 5 2783_1253 1.0 F 18.0 10.0 48067
1 4 589 5 2783_589 4.0 M 45.0 7.0 2460
2 5 1270 4 2783_1270 5.0 M 25.0 20.0 55455
3 3 1274 4 2783_1274 3.0 M 25.0 15.0 55117
4 2 741 5 2783_741 2.0 M 56.0 16.0 70072
What you are looking for is a join. You will find your answer here: http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.join.html (it works just like in SQL).
However, there might be some additional casting and renaming needed if you want to keep user as an integer and UserID as a float.
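For instance, a sketch of the join route (y2 is a hypothetical renamed copy; join matches x's user column against the other frame's index):
y2 = y.rename(columns={'UserID': 'user'})
y2['user'] = y2['user'].astype(int)  # cast the float key to int so it matches x
df = x.join(y2.set_index('user'), on='user')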
