I have a data frame
df = pd.DataFrame([[3,2,1,5,'Stay',2],[4,5,6,10,'Leave',10],
[10,20,30,40,'Stay',11],[12,2,3,3,'Leave',15],
[31,23,31,45,'Stay',25],[12,21,17,6,'Stay',15],
[15,17,18,12,'Leave',10],[3,2,1,5,'Stay',3],
[12,2,3,3,'Leave',12]], columns = ['A','B','C','D','Status','E'])
A B C D Status E
0 3 2 1 5 Stay 2
1 4 5 6 10 Leave 10
2 10 20 30 40 Stay 11
3 12 2 3 3 Leave 15
4 31 23 31 45 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
I want to apply a condition: if Status is 'Stay' and column E is smaller than column A, rotate the row's values so that D takes C's value, C takes B's, B takes A's, and A takes E's.
If Status is 'Leave' and column E is larger than column A, apply the same rotation.
So the result is:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
My attempt:
if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
This runs into several problems, including an error when the if statement tries to evaluate a whole column. Your help is kindly appreciated.
I think you want boolean indexing:
s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])
s = s1 | s2
df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()
Output:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Using np.roll with .loc:
import numpy as np

# Roll the numeric columns one step right: [A,B,C,D,E] -> [E,A,B,C,D],
# then drop the last column to get the replacement values [E,A,B,C]
shift = np.roll(df.select_dtypes(exclude='object'), 1, axis=1)[:, :-1]
m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])
df.loc[m1|m2, ['A','B','C','D']] = shift[m1|m2]
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
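For reference, the boolean-mask approach can be run end-to-end as a self-contained sketch, rebuilding the sample frame from the question:

```python
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2], [4, 5, 6, 10, 'Leave', 10],
                   [10, 20, 30, 40, 'Stay', 11], [12, 2, 3, 3, 'Leave', 15],
                   [31, 23, 31, 45, 'Stay', 25], [12, 21, 17, 6, 'Stay', 15],
                   [15, 17, 18, 12, 'Leave', 10], [3, 2, 1, 5, 'Stay', 3],
                   [12, 2, 3, 3, 'Leave', 12]],
                  columns=['A', 'B', 'C', 'D', 'Status', 'E'])

# One mask per condition, then rotate E->A->B->C->D on the matching rows only
m = (df['Status'].eq('Stay') & df['E'].lt(df['A'])) | \
    (df['Status'].eq('Leave') & df['E'].gt(df['A']))
df.loc[m, ['A', 'B', 'C', 'D']] = df.loc[m, ['E', 'A', 'B', 'C']].to_numpy()
```

The `.to_numpy()` call matters: without it, pandas would align the right-hand side on column labels and the rotation would be lost.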
Use DataFrame.mask + DataFrame.shift:
# Move Status into the index so shift(axis=1) only touches the numeric columns
new_df = df.set_index('Status')
# Shifted frame used as the replacement values (E fills the first column)
df_modify = new_df.shift(axis=1, fill_value=df['E'])
# Boolean masks for the two conditions
under_mask = df.Status.eq('Stay') & (df.E < df.A)
over_mask = df.Status.eq('Leave') & (df.E > df.A)
# Replace the masked rows with the shifted values
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)
Output
Status A B C D E
0 Stay 2 3 2 1 5
1 Leave 10 4 5 6 10
2 Stay 10 20 30 40 11
3 Leave 15 12 2 3 3
4 Stay 25 31 23 31 45
5 Stay 12 21 17 6 15
6 Leave 15 17 18 12 10
7 Stay 3 2 1 5 3
8 Leave 12 2 3 3 12
It sounds like you want to apply this logic to each row, but your code compares whole Series at the top level, which is why the if test fails. You can iterate over the rows with DataFrame.iterrows:
for idx, row in df.iterrows():
    if row['Status'] == 'Stay' and row['E'] < row['A']:
        df.loc[idx, ['A', 'B', 'C', 'D']] = [row['E'], row['A'], row['B'], row['C']]
    elif row['Status'] == 'Leave' and row['E'] > row['A']:
        df.loc[idx, ['A', 'B', 'C', 'D']] = [row['E'], row['A'], row['B'], row['C']]
I work with pandas and I am trying to create a counter column whose value increases within each group and resets on a condition based on the Time column.
Input data:
ID Time Job Level Counter
0    1    17         a
1    1    18         a
2    1    19         a
3    1    20         a
4    1    21         a
5    1    22         b
6    1    23         b
7    1    24         b
8    2    10         a
9    2    11         a
10   2    12         a
11   2    13         a
12   2    14         b
13   2    15         b
14   2    16         b
15   2    17         c
16   2    18         c
I want to create a new column 'Counter' where, within each ID group, the value equals Time for the first Job Level run (i.e., before any change), and restarts from zero (0, 1, 2, ...) each time the Job Level changes.
What I would like to have:
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
This is what I tried:
df = df.sort_values(['ID']).reset_index(drop=True)
df['Counter'] = promo_details.groupby('ID')['job_level'].apply(lambda x: x.shift() != x)

def func(group):
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'time_in_level']
    return group

df = df.groupby('emp_id').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)
This does not produce a cumulative count after the first change while preserving the Time values before it.
Use GroupBy.cumcount for the counter, keeping the values from column Time for each ID's first Job Level run:
# Run id that increments whenever consecutive Job Level values change
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
# True while a row still belongs to its ID's first run
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print(df)
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
Or:
# If each Job Level forms a single consecutive run per ID
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
The difference shows up with this changed data:
print(df)
ID Time Job Level
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
10 2 12 a
11 2 18 a
12 2 19 b
13 2 20 b
# If consecutive duplicates need to be tested
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
# If each Job Level forms a single run per ID
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print(df)
ID Time Job Level Counter1 Counter2
12 2 14 b 14 14
13 2 15 b 15 15
14 2 16 b 16 16
15 2 17 c 0 0
16 2 18 c 1 1
10 2 12 a 0 0
11 2 18 a 1 1
12 2 19 b 0 19
13 2 20 b 1 20
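The consecutive-run approach can be checked with a minimal self-contained sketch, using only the ID=1 rows from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1],
                   'Time': [17, 18, 19, 20, 21, 22, 23, 24],
                   'Job Level': list('aaaaabbb')})

# Run id that increments on every change in consecutive Job Level values
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
# True while a row still belongs to its ID's first run
m = s.groupby(df['ID']).transform('first').eq(s)
# First run keeps Time; each later run restarts a 0-based counter
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
```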
I want to concatenate two dataframes along the columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has new rows with NaN values in some columns.
I can't find my error. What should I do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes have the same indices.
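Since pd.concat with axis=1 aligns on the index, stray NaN rows usually mean the two frames carry different index labels. A sketch (not from the original answer) reproducing the symptom and fixing it by resetting both indices first:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 4, 7, 10], 'B': [2, 5, 8, 11], 'C': [3, 6, 9, 12]})
# Give df2 a different index to reproduce the NaN-rows symptom
df2 = pd.DataFrame({'D': [13, 16, 19, 22], 'E': [14, 17, 20, 23],
                    'F': [15, 18, 21, 24]}, index=[10, 11, 12, 13])

# Misaligned indices: concat takes the union of labels, yielding 8 rows with NaNs
bad = pd.concat([df1, df2], axis=1)

# Resetting both indices makes the rows line up positionally, like merge on index
good = pd.concat([df1.reset_index(drop=True),
                  df2.reset_index(drop=True)], axis=1)
```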
The df looks like below:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
5 6 12
6 8 10
7 11 12
8 9 67
I want to create a new df with every occurrence of 8 in 'B' plus the row immediately following each 8.
New df:
A B C
1 8 23
2 8 22
3 8 45
4 9 45
6 8 10
7 11 12
Use boolean indexing, comparing against the shifted values and combining the masks with | (bitwise OR):
df = df[df.B.shift().eq(8) | df.B.eq(8)]
print (df)
A B C
0 1 8 23
1 2 8 22
2 3 8 45
3 4 9 45
5 6 8 10
6 7 11 12
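A self-contained version of the shift-based filter, rebuilding the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8],
                   'B': [8, 8, 8, 9, 6, 8, 11, 9],
                   'C': [23, 22, 45, 45, 12, 10, 12, 67]})

# Keep rows where B is 8, or where the previous row's B was 8
out = df[df['B'].eq(8) | df['B'].shift().eq(8)]
```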
I'm new to Python and have a query. I need the value in column B repeated until a change occurs in column A.
Here's the sample Data:
A B
18 1
18 0
18 0
24 2
24 0
24 0
24 0
10 3
10 0
10 0
How I want my output:
Column A Column B
18 1
18 1
18 1
18 1
24 2
24 2
10 3
10 3
10 3
10 3
Please help me through this. Thank you!
You can use transform with 'first' if you need to repeat the first value of each group:
df['Column B'] = df.groupby('Column A')['Column B'].transform('first')
print (df)
Column A Column B
0 18 1
1 18 1
2 18 1
3 18 1
4 24 2
5 24 2
6 10 3
7 10 3
8 10 3
9 10 3
Another solution, which doesn't depend on Column A: replace the 0 values with NaN, forward-fill with ffill, and finally cast back to int:
df['Column B'] = df['Column B'].replace(0,np.nan).ffill().astype(int)
print (df)
Column A Column B
0 18 1
1 18 1
2 18 1
3 18 1
4 24 2
5 24 2
6 10 3
7 10 3
8 10 3
9 10 3
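Using the input rows from the question, the ffill approach can be run as a self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column A': [18, 18, 18, 24, 24, 24, 24, 10, 10, 10],
                   'Column B': [1, 0, 0, 2, 0, 0, 0, 3, 0, 0]})

# Treat 0 as missing, carry the last real value forward, cast back to int
df['Column B'] = df['Column B'].replace(0, np.nan).ffill().astype(int)
```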
I have two dataframes; the first one, df1, contains only one row:
A B C D E
0 5 8 9 5 0
and the second one has multiple rows, but the same number of columns:
D C E A B
0 5 0 3 3 7
1 9 3 5 2 4
2 7 6 8 8 1
3 6 7 7 8 1
4 5 9 8 9 4
5 3 0 3 5 0
6 2 3 8 1 3
7 3 3 7 0 1
8 9 9 0 4 7
9 3 2 7 2 0
In the real example I have many more columns (more than 100). Both dataframes have the same number of columns and the same column names, but the order of the columns is different, as shown in the example.
I need to multiply the two dataframes element-wise, but I can't simply do df2.values * df1.values because the columns are not ordered the same way: for instance, column B is the second column of df1 but the fifth column of df2, so position 2 of df2 holds C instead of B.
Is there a simple, pythonic way to multiply the dataframes by column name rather than by column position?
df1[df2.columns] returns a dataframe where the columns are ordered as in df2:
df1
Out[91]:
A B C D E
0 3 8 9 5 0
df1[df2.columns]
Out[92]:
D C E A B
0 5 9 0 3 8
So, you just need:
df2.values * df1[df2.columns].values
This will raise a KeyError if df2 has columns that df1 lacks, and it selects only df2's columns even if df1 has extra ones.
As #MaxU noted, since you are operating on numpy arrays, to get back a dataframe structure you will need:
pd.DataFrame(df2.values * df1[df2.columns].values, columns=df2.columns)
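The column-alignment trick can be checked end-to-end with a small sketch (two rows of df2 are enough):

```python
import pandas as pd

df1 = pd.DataFrame([[5, 8, 9, 5, 0]], columns=list('ABCDE'))
df2 = pd.DataFrame([[5, 0, 3, 3, 7],
                    [9, 3, 5, 2, 4]], columns=list('DCEAB'))

# Reorder df1's columns to match df2, then multiply the raw arrays
result = pd.DataFrame(df2.values * df1[df2.columns].values,
                      columns=df2.columns)
```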
You can use mul; first select df1's single row as a Series with iloc (the original answer used .ix, which has since been removed from pandas):
print(df1.iloc[0])
A 5
B 8
C 9
D 5
E 0
Name: 0, dtype: int64
print(df2.mul(df1.iloc[0]))
A B C D E
0 15 56 0 25 0
1 10 32 27 45 0
2 40 8 54 35 0
3 40 8 63 30 0
4 45 32 81 25 0
5 25 0 0 15 0
6 5 24 27 10 0
7 0 8 27 15 0
8 20 56 81 45 0
9 10 0 18 15 0
If you need to change the column order of the final DataFrame, reindex it with df2's columns (reindex_axis has been removed in favor of reindex):
print(df2.mul(df1.iloc[0]).reindex(columns=df2.columns))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0
Another solution is to reorder the multiplier itself by reindexing the Series with df2.columns:
print(df2.mul(df1.iloc[0].reindex(df2.columns)))
D C E A B
0 25 0 0 15 56
1 45 27 0 10 32
2 35 54 0 40 8
3 30 63 0 40 8
4 25 81 0 45 32
5 15 0 0 25 0
6 10 27 0 5 24
7 15 27 0 0 8
8 45 81 0 20 56
9 15 18 0 10 0