I have this data frame:
o d r kz p
1 3 1 5 NaN
1 3 2 0 NaN
1 10 1 7 NaN
1 10 3 1 NaN
1 10 2 2 NaN
I would like to fill up the 'p' column by the proportions of 'kz' values for each pair of 'o' and 'd'. The result should look like:
o d r kz p
1 3 1 5 100%
1 3 2 0 0%
1 10 1 7 70%
1 10 3 1 10%
1 10 2 2 20%
I am thinking of looping through the data frame, assigning a list of lists of kz values, and then regressively filling up the p column.
Is there any elegant way to do it e.g. with groupby or Pivot table?
You can do it in several steps:
Compute the sum per group with groupby (doc) and agg (doc).
Merge these values with your current dataframe with merge (doc).
Compute the ratio.
Here is the code:
# Import modules
import pandas as pd
import numpy as np
# Data
df = pd.DataFrame(
    [[1, 3, 1, 5, np.nan],
     [1, 3, 2, 0, np.nan],
     [1, 10, 1, 7, np.nan],
     [1, 10, 3, 1, np.nan],
     [1, 10, 2, 2, np.nan]],
    columns=["o", "d", "r", "kz", "p"])
print(df)
# o d r kz p
# 0 1 3 1 5 NaN
# 1 1 3 2 0 NaN
# 2 1 10 1 7 NaN
# 3 1 10 3 1 NaN
# 4 1 10 2 2 NaN
# Compute the sum per group
sum_ = df.groupby(['o', 'd']).agg({'kz': 'sum'})
sum_.reset_index(inplace=True)
print(sum_)
# o d kz
# 0 1 3 5
# 1 1 10 10
# Merge these values with the current dataframe
df = df.merge(sum_, on=['o', 'd'], how="outer", suffixes=('', '_sum'))
print(df)
# o d r kz p kz_sum
# 0 1 3 1 5 NaN 5
# 1 1 3 2 0 NaN 5
# 2 1 10 1 7 NaN 10
# 3 1 10 3 1 NaN 10
# 4 1 10 2 2 NaN 10
# Compute the ratio
df.p = df.kz / df.kz_sum * 100
print(df)
# o d r kz p kz_sum
# 0 1 3 1 5 100.0 5
# 1 1 3 2 0 0.0 5
# 2 1 10 1 7 70.0 10
# 3 1 10 3 1 10.0 10
# 4 1 10 2 2 20.0 10
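For reference, the same result can be reached in a single step with transform (a sketch on the frame built above, not one of the merge steps): transform('sum') broadcasts each group's total back onto the rows of that group, so the ratio can be computed directly.
# One-step alternative: divide each 'kz' by the total of its ('o', 'd') group.
df['p'] = df['kz'] / df.groupby(['o', 'd'])['kz'].transform('sum') * 100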
First, sum the 'kz' column grouped by 'o' and 'd' and store the result in 'tmp'. Merge those two data frames. Then calculate the percentage value 'p' using the original 'kz' value and the summed 'kz' value. Finally, drop the summed 'kz' column and rename the original column back to 'kz'.
import pandas as pd
d = {'o': pd.Series([1, 1, 1, 1, 1]),
     'd': pd.Series([3, 3, 10, 10, 10]),
     'r': pd.Series([1, 2, 1, 3, 2]),
     'kz': pd.Series([5, 0, 7, 1, 2]),
     'p': pd.Series([None] * 5)}
# Create the DataFrame.
df = pd.DataFrame(d)
# Sum 'kz' per ('o', 'd') group.
tmp = df.groupby(['o', 'd'])["kz"].sum()
# Merge the group sums back onto the original frame.
merge_tmp = pd.merge(df, tmp, on=['o', 'd'], how='inner', suffixes=('_org', '_tmp'))
# Percentage of each 'kz' value within its group.
merge_tmp['p'] = (merge_tmp['kz_org'] / merge_tmp['kz_tmp']) * 100
# Drop the group sum and restore the original column name.
merge_tmp = merge_tmp.drop('kz_tmp', axis='columns')
merge_tmp = merge_tmp.rename({'kz_org': 'kz'}, axis='columns')
print(merge_tmp)
Let's say input was
d = {'col1': [1,2,3,4,5,6,7,8,9,10],
'col2': [1,2,3,4,5,6,7,8,9,10],
'col3': [1,2,3,4,5,6,7,8,9,10],
'offset': [1,2,3,1,2,3,1,2,3,1]}
df = pd.DataFrame(data=d)
I want to create an additional column that looks like this:
df['output'] = [1, 4, 9, 4, 10, 18, 7, 16, 27, 10]
Basically each number in offset is telling you the number of columns to sum over (from col1 as ref point).
Is there a vectorized way to do this without iterating through each value in offset?
You can use np.select. Create each of the possible column sums (over 1, 2, 3, ... columns, as needed) as the choices, and create a boolean mask for each value in the offset column as the conditions.
import numpy as np
# Get all possible values from offset
lOffset = df['offset'].unique()
# Get the result with np.select
df['output'] = np.select(
    # Create a mask for each value in offset
    condlist=[df['offset'].eq(i) for i in lOffset],
    # Create the sum over the first i columns for each offset value i
    choicelist=[df.iloc[:, :i].sum(axis=1) for i in lOffset]
)
print(df)
# col1 col2 col3 offset output
# 0 1 1 1 1 1
# 1 2 2 2 2 4
# 2 3 3 3 3 9
# 3 4 4 4 1 4
# 4 5 5 5 2 10
# 5 6 6 6 3 18
# 6 7 7 7 1 7
# 7 8 8 8 2 16
# 8 9 9 9 3 27
# 9 10 10 10 1 10
Note: this assumes that your offset column is the last one, so that the first i columns are exactly the data columns.
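If the column order cannot be relied on, a variant (a sketch; the 'col' prefix comes from the example above) selects the data columns by name first:
# Select the data columns explicitly so the position of the offset column no longer matters.
data_cols = df.filter(like='col')
df['output'] = np.select(
    condlist=[df['offset'].eq(i) for i in lOffset],
    choicelist=[data_cols.iloc[:, :i].sum(axis=1) for i in lOffset]
)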
It can be done with pd.crosstab: we mask all 0 values to NaN and back fill, which leaves 1 in every column that needs to be summed.
df['new'] = df.filter(like='col').where(pd.crosstab(df.index, df.offset).mask(lambda x: x == 0).bfill(axis=1).values == 1).sum(axis=1)
Out[710]:
0 1.0
1 4.0
2 9.0
3 4.0
4 10.0
5 18.0
6 7.0
7 16.0
8 27.0
9 10.0
dtype: float64
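The same one-liner, unpacked into steps for readability (a sketch using the example frame above):
# 1 where the row's offset equals the column label (labels 1, 2, 3), 0 elsewhere
flags = pd.crosstab(df.index, df.offset)
# Turn the 0s into NaN and back fill along each row, so columns 1..offset all hold 1
flags = flags.mask(lambda x: x == 0).bfill(axis=1)
# Keep only the first 'offset' data columns in each row and sum across the row
df['new'] = df.filter(like='col').where(flags.values == 1).sum(axis=1)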
I have an initial column with no missing data (A) but with repeated values. How do I fill the next column (B), which has missing data, so that it is completely filled and a given value on the left always has the same value on the right? I would also like any other columns to remain the same (C).
For example, this is what I have
A B C
1 1 20 4
2 2 NaN 8
3 3 NaN 2
4 2 30 9
5 3 40 1
6 1 NaN 3
And this is what I want
A B C
1 1 20 4
2 2 30* 8
3 3 40* 2
4 2 30 9
5 3 40 1
6 1 20* 3
Asterisk on filled values.
This needs to be scalable with a very large dataframe.
Additionally, if I had a value on the left column that has more than one value on the right side on separate observations, how would I fill with the mean?
You can use groupby on 'A' and use first to find the first corresponding value in 'B' (it will not select NaN).
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# find first 'B' value for each 'A'
lookup = df[['A', 'B']].groupby('A').first()['B']
# only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# replace NaN values in 'B' with lookup values
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)
print(df)
Which outputs:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3
If there are many NaN values in 'B' you might want to exclude them before you use groupby.
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,2,3,1],
'B':[20, None, None, 30, 40, None],
'C': [4,8,2,9,1,3]})
# Only use rows where 'B' is NaN
nan_mask = df['B'].isnull()
# Find first 'B' value for each 'A'
lookup = df[~nan_mask][['A', 'B']].groupby('A').first()['B']
df.loc[nan_mask, 'B'] = df.loc[nan_mask].apply(lambda x: lookup[x['A']], axis=1)
print(df)
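To the follow-up about an 'A' value that maps to several different non-missing 'B' values: a sketch that fills the gaps with the group mean instead of the first value, starting again from the original, unfilled frame:
# Fill missing 'B' values with the mean of the non-missing 'B' values in the same 'A' group.
df['B'] = df['B'].fillna(df.groupby('A')['B'].transform('mean'))
print(df)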
You could do sort_values first, then forward fill column B based on column A. The way to implement this would be:
import pandas as pd
import numpy as np
x = {'A':[1,2,3,2,3,1],
'B':[20,np.nan,np.nan,30,40,np.nan],
'C':[4,8,2,9,1,3]}
df = pd.DataFrame(x)
# sort_values by A and B first, then forward fill column B;
# this gets the right values while maintaining
# the original order of the dataframe
df['B'] = df.sort_values(by=['A','B'])['B'].ffill()
print (df)
Output will be:
Original data:
A B C
0 1 20.0 4
1 2 NaN 8
2 3 NaN 2
3 2 30.0 9
4 3 40.0 1
5 1 NaN 3
Updated data:
A B C
0 1 20.0 4
1 2 30.0 8
2 3 40.0 2
3 2 30.0 9
4 3 40.0 1
5 1 20.0 3
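Note that a plain ffill after sorting can pull a value across 'A' groups if an entire group has no 'B' value at all; a group-aware variant (a sketch, again starting from the unfilled frame) restricts the forward fill to each group:
# Forward fill within each 'A' group only; assignment realigns on the original index.
df['B'] = df.sort_values(['A', 'B']).groupby('A')['B'].ffill()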
My dataframe is shown as follows:
User Date Unit
1 A 2000-10-31 1
2 A 2001-10-31 2
3 A 2002-10-31 1
4 A 2003-10-31 2
5 B 2000-07-31 1
6 B 2000-08-31 2
7 B 2001-07-31 1
8 B 2002-06-30 1
9 B 2002-07-31 1
10 B 2002-08-31 1
I want to make the following judgement:
(1) If the 'User' had a 'Unit' in the same month in each of the past two consecutive years, the data should be classified as 'Routine' with a dummy variable 1.
(2) Otherwise, the data should be classified as 0 in the 'Routine' column.
(3) For data that do not have two past consecutive years of history, the 'Routine' column should show NaN.
My desired output is:
User Date Unit Routine
1 A 2000-10-31 1 NaN
2 A 2001-10-31 2 NaN
3 A 2002-10-31 1 1
4 A 2003-10-31 2 1
5 B 2000-07-31 1 NaN
6 B 2000-08-31 2 NaN
7 B 2001-07-31 1 NaN
8 B 2002-06-30 1 0
9 B 2002-07-31 1 1
10 B 2002-08-31 1 0
The code of the dataframe is shown as follows:
df=pd.DataFrame({'User':list('AAAABBBBBB'),
'Date':['2000-10-31','2001-10-31','2002-10-31','2003-10-31','2000-07-31',
'2000-08-31','2001-07-31','2002-06-30','2002-07-31','2002-08-31'],
'Unit':[1,2,1,2,1,2,1,1,1,1]})
df['Date']=pd.to_datetime(df['Date'])
I want to use groupby function since there are many users in the dataframe. Thank you.
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
'User': list('AAAABBBBBB'),
'Date': [
'2000-10-31', '2001-10-31', '2002-10-31', '2003-10-31',
'2000-07-31', '2000-08-31', '2001-07-31', '2002-06-30',
'2002-07-31', '2002-08-31'],
'Unit': [1, 2, 1, 2, 1, 2, 1, 1, 1, 1]})
df['Date'] = pd.to_datetime(df['Date'])
def routine(user, cdate, unit):
    # Default: not enough history.
    result = np.nan
    # The two years immediately preceding the current date.
    two_years = [cdate.year - 1, cdate.year - 2]
    # Rows of the same user that fall in those two years.
    mask = (df.User == user) & df.Date.dt.year.isin(two_years)
    sdf = df[mask]
    years = sdf.Date.dt.year.to_list()
    # True only if both preceding years are present.
    got_years = all(y in years for y in two_years)
    # Two consecutive past years exist, but no matching month/unit yet: 0.
    result = 0 if (sdf.shape[0] > 0) and got_years else result
    # A row in the same month with the same unit: 1.
    mask2 = (sdf.Date.dt.month == cdate.month) & (sdf.Unit == unit)
    sdf = sdf[mask2]
    result = 1 if (sdf.shape[0] > 0) and got_years else result
    return result
df['Routine'] = df.apply(
lambda row: routine(row['User'], row['Date'], row['Unit']), axis=1)
print(df)
Output:
User Date Unit Routine
0 A 2000-10-31 1 NaN
1 A 2001-10-31 2 NaN
2 A 2002-10-31 1 1.0
3 A 2003-10-31 2 1.0
4 B 2000-07-31 1 NaN
5 B 2000-08-31 2 NaN
6 B 2001-07-31 1 NaN
7 B 2002-06-30 1 0.0
8 B 2002-07-31 1 1.0
9 B 2002-08-31 1 0.0
I want to shift rows in a pandas df when values are equal to a specific value in a Column. For the df below, I'm trying to shift the values in Column B to Column A when values in A == x.
import pandas as pd
df = pd.DataFrame({
'A' : [1,'x','x','x',5],
'B' : ['x',2,3,4,'x'],
})
This is my attempt:
df = df.loc[df.A.shift(-1) == df.A.shift(1), 'x'] = df.A.shift(1)
Intended Output:
A B
0 1 x
1 2
2 3
3 4
4 5 x
You can use:
m = df.A.eq('x')
df[m]=df[m].shift(-1,axis=1)
print(df)
A B
0 1 x
1 2 NaN
2 3 NaN
3 4 NaN
4 5 x
You can use:
df[df.A=='x'] = df.shift(-1,axis=1)
print(df)
A B
0 1 x
1 2 NaN
2 3 NaN
3 4 NaN
4 5 x
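Both variants leave NaN in column B where a value was moved across; if the blanks shown in the intended output are wanted instead, a small follow-up step (a sketch) is:
# Replace the NaN left behind in column B with empty strings.
df['B'] = df['B'].fillna('')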
Is there a way in a pandas dataframe to locate all the row blocks of size n where the highest value is exactly in the middle? What I need is to create an extra column which only holds the value of the middle (highest) element of each such block.
Here is an example using a for loop and block size 5:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 1, 2, 1, 4, 3, 2, 1, 5, 2, 2, 5],
columns = ['number'])
for i in range(2, len(df) - 2):
    if (df.loc[i, 'number'] > df.loc[i - 1, 'number'] and
            df.loc[i, 'number'] > df.loc[i - 2, 'number'] and
            df.loc[i, 'number'] > df.loc[i + 1, 'number'] and
            df.loc[i, 'number'] > df.loc[i + 2, 'number']):
        df.loc[i, 'high'] = df.loc[i, 'number']
Output:
number high
0 1 None
1 2 None
2 3 3
3 2 None
4 1 None
5 2 None
6 1 None
7 4 4
8 3 None
9 2 None
10 1 None
11 5 5
12 2 None
13 2 None
14 5 None
You can use pd.DataFrame.rolling with the parameter center=True. Take the max of this, and compare it to your target.
def highest_in(s, n):
    # Keep a value only where it equals the max of the centred window of size n.
    test = s.rolling(window=n, center=True).max() == s
    return s.where(test, None)
df['high'] = highest_in(df.number, n=5)
print(df)
# number high
# 0 1 None
# 1 2 None
# 2 3 3
# 3 2 None
# 4 1 None
# 5 2 None
# 6 1 None
# 7 4 4
# 8 3 None
# 9 2 None
# 10 1 None
# 11 5 5
# 12 2 None
# 13 2 None
# 14 5 None
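One caveat (a sketch follows): rolling().max() also marks flat plateaus where several equal values share the window maximum. If the strict comparison from the loop in the question is required, explicit shifts reproduce it; the result is written to a separate illustrative column here.
# Mark a value only when it is strictly greater than the 2 neighbours on each side.
s = df['number']
strict = (s > s.shift(1)) & (s > s.shift(2)) & (s > s.shift(-1)) & (s > s.shift(-2))
df['high_strict'] = s.where(strict, None)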
We can also use argrelextrema from scipy to get the local maxima. Here order=2 considers 2 numbers above and 2 numbers below, so the block size is five.
import numpy as np
from scipy.signal import argrelextrema
# Indices of strict local maxima, comparing 2 neighbours on each side
maxInd = argrelextrema(df.number.values, np.greater, order=2)[0]
df['new'] = df.iloc[maxInd]['number']
Output:
number new
0 1 NaN
1 2 NaN
2 3 3.0
3 2 NaN
4 1 NaN
5 2 NaN
6 1 NaN
7 4 4.0
8 3 NaN
9 2 NaN
10 1 NaN
11 5 5.0
12 2 NaN
13 2 NaN
14 5 NaN