Conditional Cumulative Count pandas while preserving values before first change - python

I work with pandas and I am trying to create a column whose value counts up and, in particular, resets on a condition, starting from the values in the Time column
Input data:
Out[73]:
ID Time Job Level Counter
0 1 17 a
1 1 18 a
2 1 19 a
3 1 20 a
4 1 21 a
5 1 22 b
6 1 23 b
7 1 24 b
8 2 10 a
9 2 11 a
10 2 12 a
11 2 13 a
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
I want to create a new column 'Counter' where, within each ID group, the value equals Time until the first change in Job Level, and then restarts from zero (counting up) every time a change in Job Level is encountered.
What I would like to have:
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
This is what I tried (column names unified to match the sample data):
df = df.sort_values(['ID']).reset_index(drop=True)
df['Counter'] = df.groupby('ID')['Job Level'].apply(lambda x: x.shift() != x)

def func(group):
    # put the Time value on the first row of each ID group
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'Time']
    return group

df = df.groupby('ID').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)
This does not produce a cumulative count after the first change while preserving the Time values before it.

Use GroupBy.cumcount for the counter, filtering out the first group per ID, whose rows take their values from the Time column instead:
# if the same Job Level can reappear, group by runs of consecutive duplicates
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()   # label each run of consecutive levels
m = s.groupby(df['ID']).transform('first').eq(s)           # True for the first run within each ID
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print (df)
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
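To make the dense one-liner easier to follow, here is a small sketch (assuming the usual import numpy as np / import pandas as pd and df as the sample frame above) that prints the intermediate helper columns:
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
# s labels each run of consecutive identical Job Levels; m is True only for
# the first run within each ID, i.e. the rows that keep their Time values
print (pd.concat([df[['ID', 'Job Level']], s.rename('s'), m.rename('m')], axis=1))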
Or:
# if each Job Level appears in only one block per ID
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
The difference appears with modified data, where Job Level 'b' reappears later within the same ID - the consecutive-duplicates version restarts its counter for the second run of 'b', while the unique-groups version matches it against the ID's first level and assigns Time again:
print (df)
ID Time Job Level
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
10 2 12 a
11 2 18 a
12 2 19 b
13 2 20 b
# if the same Job Level can reappear, group by runs of consecutive duplicates
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print (df)
ID Time Job Level Counter1 Counter2
12 2 14 b 14 14
13 2 15 b 15 15
14 2 16 b 16 16
15 2 17 c 0 0
16 2 18 c 1 1
10 2 12 a 0 0
11 2 18 a 1 1
12 2 19 b 0 19
13 2 20 b 1 20

Related

Functional approach to group DataFrame columns into MultiIndex

Is there a simpler functional way to group columns into a MultiIndex?
# Setup
l = [...]
l2,l3,l4 = do_things(l, [2,3,4])
d = {2:l2, 3:l3, 4:l4}
# Or,
l = l2 = l3 = l4 = list(range(20))
Problems with my approaches:
# Cons:
# * Complicated
# * Requires multiple iterations over the dictionary to occur
# in the same order. This is guaranteed as the dictionary is
# unchanged but I'm not happy with the implicit dependency.
df = pd.DataFrame\
( zip(*d.values())
, index=l
, columns=pd.MultiIndex.from_product([["group"], d.keys()])
).rename_axis("x").reset_index().reset_index()
# Cons:
# * Complicated
# * Multiple assignments
df = pd.DataFrame(d, index=l).rename_axis("x")
df.columns = pd.MultiIndex.from_product([["group"],df.columns])
df = df.reset_index().reset_index()
I'm looking for something like:
df =\
( pd.DataFrame(d, index=l)
. rename_axis("x")
. group_columns("group")
. reset_index().reset_index()
)
Result:
index x group
2 3 4
0 0 2 0 0 0
1 1 2 0 0 0
2 2 2 0 0 0
3 3 2 0 0 0
4 4 1 0 0 0
5 5 2 0 0 0
6 6 1 0 0 0
7 7 2 0 0 0
8 8 4 0 1 1
9 9 4 0 1 1
10 10 4 0 1 1
11 11 0 0 1 1
12 12 1 0 1 1
13 13 1 0 1 1
14 14 3 1 2 2
15 15 1 1 2 2
16 16 1 1 2 3
17 17 1 1 2 3
18 18 4 1 2 3
19 19 3 1 2 3
20 20 4 1 2 3
21 21 4 1 2 3
22 22 4 1 2 3
23 23 4 1 2 3
It is probably easiest just to reformat the dictionary and pass it to the DataFrame constructor:
# Sample Data
size = 5
lst = np.arange(size) + 10
d = {2: lst, 3: lst + size, 4: lst + (size * 2)}
df = pd.DataFrame(
    # Add a 'group' level by changing keys to tuples
    {('group', k): v for k, v in d.items()},
    index=lst
)
Output:
group
2 3 4
10 10 15 20
11 11 16 21
12 12 17 22
13 13 18 23
14 14 19 24
Notice that tuples get interpreted as a MultiIndex automatically.
This can be followed with whatever chain of operations is desired:
df = pd.DataFrame(
    {('group', k): v for k, v in d.items()},
    index=lst
).rename_axis('x').reset_index().reset_index()
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
It is also possible to combine steps and generate the complete DataFrame directly:
df = pd.DataFrame({
    ('index', ''): pd.RangeIndex(len(lst)),
    ('x', ''): lst,
    **{('group', k): v for k, v in d.items()}
})
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
Naturally any combination of dictionary comprehension and pandas operations can be used.
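If you want the chainable group_columns step from the question itself, a minimal sketch via DataFrame.pipe could look like the following (group_columns is a hypothetical helper written here, not a pandas method; it reuses d and lst from the sample above):
def group_columns(df, name):
    # hypothetical helper: nest all existing columns under one top level
    out = df.copy()
    out.columns = pd.MultiIndex.from_product([[name], out.columns])
    return out

df = (
    pd.DataFrame(d, index=lst)
      .rename_axis('x')
      .pipe(group_columns, 'group')
      .reset_index().reset_index()
)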

Several Layers of If Statements with String

I have a data frame
df = pd.DataFrame([[3, 2, 1, 5, 'Stay', 2], [4, 5, 6, 10, 'Leave', 10],
                   [10, 20, 30, 40, 'Stay', 11], [12, 2, 3, 3, 'Leave', 15],
                   [31, 23, 31, 45, 'Stay', 25], [12, 21, 17, 6, 'Stay', 15],
                   [15, 17, 18, 12, 'Leave', 10], [3, 2, 1, 5, 'Stay', 3],
                   [12, 2, 3, 3, 'Leave', 12]], columns=['A', 'B', 'C', 'D', 'Status', 'E'])
A B C D Status E
0 3 2 1 5 Stay 2
1 4 5 6 10 Leave 10
2 10 20 30 40 Stay 11
3 12 2 3 3 Leave 15
4 31 23 31 45 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
I want to apply a condition: if Status is 'Stay' and column E is smaller than column A, shift the data one column to the right - D takes C's value, C takes B's, B takes A's, and A takes E's.
If Status is 'Leave' and column E is larger than column A, apply the same right shift.
So the result is:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
My attempt:
if df['Status'] == 'Stay':
    if df['E'] < df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
elif df['Status'] == 'Leave':
    if df['E'] > df['A']:
        df['D'] = df['C']
        df['C'] = df['B']
        df['B'] = df['A']
        df['A'] = df['E']
This runs into several problems, including a ValueError because if on a whole boolean Series is ambiguous. Your help is kindly appreciated.
I think you want boolean indexing:
s1 = df.Status.eq('Stay') & df['E'].lt(df['A'])    # Stay and E < A
s2 = df.Status.eq('Leave') & df['E'].gt(df['A'])   # Leave and E > A
s = s1 | s2
# assign through to_numpy() so pandas doesn't align the swapped column labels
df.loc[s, ['A','B','C','D']] = df.loc[s, ['E','A','B','C']].to_numpy()
Output:
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Using np.roll with .loc:
# roll the numeric columns right by one, (A,B,C,D,E) -> (E,A,B,C,D), then drop the last
shift = np.roll(df.select_dtypes(exclude='object'), 1, axis=1)[:, :-1]
m1 = df['Status'].eq('Stay') & (df['E'] < df['A'])
m2 = df['Status'].eq('Leave') & (df['E'] > df['A'])
df.loc[m1 | m2, ['A','B','C','D']] = shift[m1 | m2]
A B C D Status E
0 2 3 2 1 Stay 2
1 10 4 5 6 Leave 10
2 10 20 30 40 Stay 11
3 15 12 2 3 Leave 15
4 25 31 23 31 Stay 25
5 12 21 17 6 Stay 15
6 15 17 18 12 Leave 10
7 3 2 1 5 Stay 3
8 12 2 3 3 Leave 12
Use DataFrame.mask + DataFrame.shift:
# Status as index so shift only touches the numeric columns
new_df = df.set_index('Status')
# DataFrame of replacement values: every column shifted right, A filled from E
df_modify = new_df.shift(axis=1, fill_value=df['E'])
# boolean masks for the two conditions
under_mask = (df.Status.eq('Stay')) & (df.E < df.A)
over_mask = (df.Status.eq('Leave')) & (df.E > df.A)
# DataFrame.mask swaps in the shifted rows where either condition holds
new_df = new_df.mask(under_mask | over_mask, df_modify).reset_index()
print(new_df)
Output
Status A B C D E
0 Stay 2 3 2 1 5
1 Leave 10 4 5 6 10
2 Stay 10 20 30 40 11
3 Leave 15 12 2 3 3
4 Stay 25 31 23 31 45
5 Stay 12 21 17 6 15
6 Leave 15 17 18 12 10
7 Stay 3 2 1 5 3
8 Leave 12 2 3 3 12
It sounds like you want to do this for each row of the data, but your code is written to operate on whole columns at once. You could iterate over the rows with DataFrame.iterrows:
for _, row in df.iterrows():
    if row['Status'] == 'Stay':
        ... etc ...
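For completeness, here is a minimal row-wise sketch of that idea (my own spelling via DataFrame.apply, not code from the question; the vectorized answers above will be much faster):
def shift_row(row):
    # shift values one column to the right when the Stay/Leave condition holds
    if (row['Status'] == 'Stay' and row['E'] < row['A']) or \
       (row['Status'] == 'Leave' and row['E'] > row['A']):
        row[['A', 'B', 'C', 'D']] = [row['E'], row['A'], row['B'], row['C']]
    return row

df = df.apply(shift_row, axis=1)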

Pandas reshape dataframe every N rows to columns

I have a dataframe as follows:
df1=pd.DataFrame(np.arange(24).reshape(6,-1),columns=['a','b','c','d'])
and I want to take sets of 3 rows and convert them to columns in the following order.
NumPy reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values, (3, -1)), columns=['a','b','c','d','e','f','g','h'])
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
Create a 3D array by reshape, then stack the blocks horizontally:
a = np.hstack(np.reshape(df1.values,(-1, 3, len(df1.columns))))
df = pd.DataFrame(a,columns=['a','b','c','d','e','f','g','h'])
print (df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If you want a pure pandas solution (index % 3 gives the row within each block, index // 3 the block number):
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
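All of the answers above hardcode the block count; here is a generalized sketch of the reshape/swapaxes idiom for an arbitrary block size n (the helper name and letter-based columns are my own choices, assuming len(df) is divisible by n):
import string

def blocks_to_columns(df, n):
    # place each consecutive block of n rows side by side as new columns
    a = df.to_numpy().reshape(-1, n, df.shape[1]).swapaxes(0, 1).reshape(n, -1)
    return pd.DataFrame(a, columns=list(string.ascii_lowercase[:a.shape[1]]))

print (blocks_to_columns(df1, 3))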

Repeat the value in column B until a change occurs in column A in Python

I'm new to Python and have a query. I need the value in column B repeated until a change occurs in column A.
Here's the sample Data:
A B
18 1
18 0
18 0
18 0
24 2
24 0
10 3
10 0
10 0
10 0
How I want my output
Column A Column B
18 1
18 1
18 1
18 1
24 2
24 2
10 3
10 3
10 3
10 3
Please help me through this. Thank you.
You can use transform with 'first' if you need to repeat the first value of each group:
df['Column B'] = df.groupby('Column A')['Column B'].transform('first')
print (df)
Column A Column B
0 18 1
1 18 1
2 18 1
3 18 1
4 24 2
5 24 2
6 10 3
7 10 3
8 10 3
9 10 3
Another solution, which doesn't depend on Column A: replace the 0 values with NaN, forward-fill with ffill, and finally cast back to int:
df['Column B'] = df['Column B'].replace(0,np.nan).ffill().astype(int)
print (df)
Column A Column B
0 18 1
1 18 1
2 18 1
3 18 1
4 24 2
5 24 2
6 10 3
7 10 3
8 10 3
9 10 3
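A hedged caveat on both solutions above: the groupby('Column A') version merges all rows that share a value in Column A, even if that value reappears later after other values. If that can happen in your data, label consecutive runs first (the same shift/cumsum trick used in the first question on this page):
# label consecutive runs of identical Column A values, then repeat each run's first B
grp = df['Column A'].ne(df['Column A'].shift()).cumsum()
df['Column B'] = df.groupby(grp)['Column B'].transform('first')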

Permute groups in Pandas

Say I have a Pandas DataFrame whose data look like
import numpy as np
import pandas as pd
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})
Question: how do I permute the groups (grouped by the b column)?
Not permutation within each group, but permutation at the group level?
Example
Before
a b c
1 0 1
2 0 2
3 1 3
4 1 4
5 2 5
6 2 6
After
a b c
3 1 3
4 1 4
1 0 1
2 0 2
5 2 5
6 2 6
Basically, before permutation df['b'].unique() == [0, 1, 2]; after permutation, df['b'].unique() == [1, 0, 2].
Here's an answer inspired by the accepted answer to this SO post, which uses a temporary Categorical column as a sorting key to do custom sort orderings. In this answer, I produce all permutations, but you can just take the first one if you are looking for only one.
import itertools

df_results = list()
orderings = itertools.permutations(df["b"].unique())
for ordering in orderings:
    df_2 = df.copy()
    df_2["b_key"] = pd.Categorical(df_2["b"], list(ordering))
    df_2.sort_values("b_key", inplace=True)
    df_2.drop(["b_key"], axis=1, inplace=True)
    df_results.append(df_2)

for df in df_results:
    print(df)
The idea here is that we create a new categorical variable each time, with a slightly different enumerated order, then sort by it. We discard it at the end once we no longer need it.
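If you only need a single random group-level ordering rather than every permutation, a shorter sketch along the same categorical lines (my own variant, assuming import numpy as np; the stable mergesort keeps the original row order within each group):
order = np.random.permutation(df['b'].unique())
df_shuffled = (df.assign(key=pd.Categorical(df['b'], categories=order))
                 .sort_values('key', kind='mergesort')
                 .drop(columns='key'))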
If I understood your question correctly, you can do it this way:
n = 30
df = pd.DataFrame({'a': np.arange(n),
                   'b': np.random.choice([0, 1, 2], n),
                   'c': np.arange(n)})

order = pd.Series([1, 0, 2])
cols = df.columns
df['idx'] = df.b.map(order)   # map each b value to its new group position
df = df.reset_index().sort_values(['idx', 'index'])[cols]
Step by step:
In [103]: df['idx'] = df.b.map(order)
In [104]: df
Out[104]:
a b c idx
0 0 2 0 2
1 1 0 1 1
2 2 1 2 0
3 3 0 3 1
4 4 1 4 0
5 5 1 5 0
6 6 1 6 0
7 7 2 7 2
8 8 0 8 1
9 9 1 9 0
10 10 0 10 1
11 11 1 11 0
12 12 0 12 1
13 13 2 13 2
14 14 0 14 1
15 15 2 15 2
16 16 1 16 0
17 17 2 17 2
18 18 1 18 0
19 19 1 19 0
20 20 0 20 1
21 21 0 21 1
22 22 1 22 0
23 23 1 23 0
24 24 2 24 2
25 25 0 25 1
26 26 0 26 1
27 27 0 27 1
28 28 1 28 0
29 29 1 29 0
In [105]: df.reset_index().sort_values(['idx', 'index'])
Out[105]:
index a b c idx
2 2 2 1 2 0
4 4 4 1 4 0
5 5 5 1 5 0
6 6 6 1 6 0
9 9 9 1 9 0
11 11 11 1 11 0
16 16 16 1 16 0
18 18 18 1 18 0
19 19 19 1 19 0
22 22 22 1 22 0
23 23 23 1 23 0
28 28 28 1 28 0
29 29 29 1 29 0
1 1 1 0 1 1
3 3 3 0 3 1
8 8 8 0 8 1
10 10 10 0 10 1
12 12 12 0 12 1
14 14 14 0 14 1
20 20 20 0 20 1
21 21 21 0 21 1
25 25 25 0 25 1
26 26 26 0 26 1
27 27 27 0 27 1
0 0 0 2 0 2
7 7 7 2 7 2
13 13 13 2 13 2
15 15 15 2 15 2
17 17 17 2 17 2
24 24 24 2 24 2
