Python Dataframe: Create columns based on another column

I have a dataframe with repeated values in one column (here, column 'A'), and I want to reshape it so that new columns are formed from the values of column 'A'.
Example
import pandas as pd

df = pd.DataFrame({'A': list(range(4)) * 3, 'B': range(12), 'C': range(12, 24)})
df
A B C
0 0 0 12
1 1 1 13
2 2 2 14
3 3 3 15
4 0 4 16
5 1 5 17
6 2 6 18
7 3 7 19
8 0 8 20
9 1 9 21
10 2 10 22
11 3 11 23
Note that the values of "A" column are repeated 3 times.
Now I want the simplest solution to convert it to another dataframe with this configuration (please ignore the column names, they are for description purposes only and could be anything):
B C
A0 A1 A2 A3 A0 A1 A2 A3
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23

This is a pivot problem, so use
df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
B C
A 0 1 2 3 0 1 2 3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If the headers are important, you can use MultiIndex.set_levels to fix them.
u = df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
u.columns = u.columns.set_levels('A' + u.columns.levels[1].astype(str), level=1)
u
u
B C
A A0 A1 A2 A3 A0 A1 A2 A3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
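If you would rather end up with flat, single-level column labels than a MultiIndex, a small sketch (the joined names like 'B_A0' are just one illustrative choice):
# Join the two column levels into single strings, e.g. ('B', 'A0') -> 'B_A0'
u_flat = u.copy()
u_flat.columns = [f'{top}_{sub}' for top, sub in u_flat.columns]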

You may need to assign a group helper key with cumcount, then just unstack:
yourdf = df.assign(D=df.groupby('A').cumcount(),
                   A='A' + df.A.astype(str)).set_index(['D', 'A']).unstack()
yourdf
B C
A A0 A1 A2 A3 A0 A1 A2 A3
D
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
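Both answers build the same frame; a quick sanity check, assuming u and yourdf were created as above (check_names=False ignores the differing axis names 'idx' vs 'D'):
# Compare values and labels only, ignoring index/column level names
pd.testing.assert_frame_equal(u, yourdf, check_names=False)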

Related

Dynamically create columns in a dataframe

I have a Dataframe like the following:
a b a1 b1
0 1 6 10 20
1 2 7 11 21
2 3 8 12 22
3 4 9 13 23
4 5 2 14 24
where a1 and b1 are dynamically created from a and b. Can we create percentage columns dynamically as well?
The one thing that is constant is that the created columns will have 1 suffixed after the name.
Expected output:
a b a1 b1 a% b%
0 0 6 10 20 0 30
1 2 7 11 21 29 33
2 3 8 12 22 38 36
3 4 9 13 23 44 39
4 5 2 14 24 250 8
Create a new DataFrame by dividing the columns, rename the result with DataFrame.add_suffix, and finally append it to the original with DataFrame.join:
cols = ['a', 'b']
new = [f'{x}1' for x in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print(df)
a b a1 b1 a% b%
0 1 6 10 20 10.000000 30.000000
1 2 7 11 21 18.181818 33.333333
2 3 8 12 22 25.000000 36.363636
3 4 9 13 23 30.769231 39.130435
4 5 2 14 24 35.714286 8.333333
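If the base columns are not known ahead of time, the pairs can also be discovered from the naming convention itself; a minimal sketch, assuming every base column x has exactly one matching x1 partner:
# Discover every column whose '<name>1' partner also exists
cols = [c for c in df.columns if f'{c}1' in df.columns]
new = [f'{c}1' for c in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))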

Conditional Cumulative Count pandas while preserving values before first change

I work with pandas and I am trying to create a column whose value counts up and, in particular, is reset by a condition based on the Time column.
Input data:
ID Time Job Level Counter
0 1 17 a
1 1 18 a
2 1 19 a
3 1 20 a
4 1 21 a
5 1 22 b
6 1 23 b
7 1 24 b
8 2 10 a
9 2 11 a
10 2 12 a
11 2 13 a
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
I want to create a new column 'Counter' where, within each ID group, the value stays equal to Time until the first change in Job Level, and then restarts from zero every time a change in Job Level is encountered.
What I would like to have:
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
This is what I tried:
df = df.sort_values(['ID']).reset_index(drop=True)
df['Counter'] = df.groupby('ID')['Job Level'].apply(lambda x: x.shift() != x)

def func(group):
    group.loc[group.index[0], 'Counter'] = group.loc[group.index[0], 'Time']
    return group

df = df.groupby('ID').apply(func)
df['Counter'] = df['Counter'].replace(True, 'a')
df['Counter'] = np.where(df.Counter == False, df['Time'], df['Counter'])
df['Counter'] = df['Counter'].replace('a', 0)
This does not produce a cumulative count after the first change while preserving the Time values before it.
Use GroupBy.cumcount for the counter and filter out the first group per ID, where the values from the Time column are used instead:
# use this version if the same Job Level can reappear later within an ID
# (it tests consecutive duplicates by labelling each run)
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())
print(df)
ID Time Job Level Counter
0 1 17 a 17
1 1 18 a 18
2 1 19 a 19
3 1 20 a 20
4 1 21 a 21
5 1 22 b 0
6 1 23 b 1
7 1 24 b 2
8 2 10 a 10
9 2 11 a 11
10 2 12 a 12
11 2 13 a 13
12 2 14 b 0
13 2 15 b 1
14 2 16 b 2
15 2 17 c 0
16 2 18 c 1
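To see what the helper objects hold, here is a minimal, self-contained inspection sketch on a cut-down version of the example data (values copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':        [1, 1, 1, 1, 2, 2, 2],
                   'Time':      [17, 18, 19, 22, 10, 14, 17],
                   'Job Level': ['a', 'a', 'a', 'b', 'a', 'b', 'c']})

# s labels each consecutive run of Job Level values with an increasing integer
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
# m is True for rows that belong to the first run within each ID
m = s.groupby(df['ID']).transform('first').eq(s)
print(pd.concat([df, s.rename('s'), m.rename('m')], axis=1))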
Or:
# use this version if each Job Level forms a single unique block per ID
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
The difference between the two shows up on modified data:
print (df)
ID Time Job Level
12 2 14 b
13 2 15 b
14 2 16 b
15 2 17 c
16 2 18 c
10 2 12 a
11 2 18 a
12 2 19 b
13 2 20 b
# version that treats consecutive duplicates as separate runs
s = df['Job Level'].ne(df['Job Level'].shift()).cumsum()
m = s.groupby(df['ID']).transform('first').eq(s)
df['Counter1'] = np.where(m, df['Time'], df.groupby(['ID', s]).cumcount())

# version that assumes each Job Level appears in a single block per ID
m = df.groupby('ID')['Job Level'].transform('first').eq(df['Job Level'])
df['Counter2'] = np.where(m, df['Time'], df.groupby(['ID', 'Job Level']).cumcount())
print(df)
ID Time Job Level Counter1 Counter2
12 2 14 b 14 14
13 2 15 b 15 15
14 2 16 b 16 16
15 2 17 c 0 0
16 2 18 c 1 1
10 2 12 a 0 0
11 2 18 a 1 1
12 2 19 b 0 19
13 2 20 b 1 20

Produce Unique value for duplicates in column using Pandas/Python

I have a dataset, df, where I would like to create unique ids for the values in the type column by placing numbers on the end.
Data
type total free use
a 10 5 5
a 10 4 6
a 10 1 9
a 10 8 2
a 10 3 7
b 20 5 5
b 20 3 7
b 20 2 8
b 20 6 4
b 20 2 8
Desired
type total free use
a 10 5 5
a1 10 4 6
a2 10 1 9
a3 10 8 2
a4 10 3 7
b 20 5 5
b1 20 3 7
b2 20 2 8
b3 20 6 4
b4 20 2 8
Doing
I was able to do this in R as follows, but I am unsure how to do it in Python:
library(data.table)
setDT(DT)
DT[ , run_id := rleid(ID)]
DT[DT[ , .SD[1L], by = run_id][duplicated(ID), ID := paste0('list', .I)],
on = 'run_id', ID := i.ID][]
I am researching this, any input is appreciated
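For a runnable setup, the sample data can be reconstructed like this (a minimal sketch copied from the table above):
import pandas as pd

df = pd.DataFrame({'type':  ['a'] * 5 + ['b'] * 5,
                   'total': [10] * 5 + [20] * 5,
                   'free':  [5, 4, 1, 8, 3, 5, 3, 2, 6, 2],
                   'use':   [5, 6, 9, 2, 7, 5, 7, 8, 4, 8]})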
You can use groupby.cumcount:
df['type'] += np.where(df['type'].duplicated(),
                       df.groupby('type').cumcount().astype(str),
                       '')
Or similarly with loc update:
df.loc[df['type'].duplicated(), 'type'] += df.groupby('type').cumcount().astype(str)
Output:
type total free use
0 a 10 5 5
1 a1 10 4 6
2 a2 10 1 9
3 a3 10 8 2
4 a4 10 3 7
5 b 20 5 5
6 b1 20 3 7
7 b2 20 2 8
8 b3 20 6 4
9 b4 20 2 8
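An equivalent formulation with Series.where, if you prefer to avoid the in-place +=, might look like this (same logic as above, sketched):
# Keep the first occurrence unchanged; append the per-group counter to duplicates
df['type'] = df['type'].where(~df['type'].duplicated(),
                              df['type'] + df.groupby('type').cumcount().astype(str))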

Concatenate dataframes along columns in a pandas dataframe

I want to concatenate two dataframes along the columns. Both have the same number of rows.
df1
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
df2
D E F
0 13 14 15
1 16 17 18
2 19 20 21
3 22 23 24
Expected:
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
I have done:
df_combined = pd.concat([df1,df2], axis=1)
But df_combined has extra rows with NaN values in some columns...
I can't find my error. What do I have to do? Thanks in advance!
In this case, merge() works.
pd.merge(df1, df2, left_index=True, right_index=True)
output
A B C D E F
0 1 2 3 13 14 15
1 4 5 6 16 17 18
2 7 8 9 19 20 21
3 10 11 12 22 23 24
This works only if both dataframes share the same index values.
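The extra NaN rows from pd.concat are the classic symptom of the two frames carrying different index values; a common fix, sketched here, is to reset both indexes so the concatenation aligns purely by row position:
import pandas as pd

# Align by row position instead of by index label
df_combined = pd.concat([df1.reset_index(drop=True),
                         df2.reset_index(drop=True)], axis=1)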

Pandas reshape dataframe every N rows to columns

I have a dataframe as follows:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])
and I want to take sets of 3 rows and convert them to columns in the following order. NumPy reshape doesn't give the intended answer:
pd.DataFrame(np.reshape(df1.values, (3, -1)), columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
In [258]: df = pd.DataFrame(np.hstack(np.split(df1, 2)))
In [259]: df
Out[259]:
0 1 2 3 4 5 6 7
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
In [260]: import string
In [261]: df.columns = list(string.ascii_lowercase[:len(df.columns)])
In [262]: df
Out[262]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
Create a 3D array by reshaping:
a = np.hstack(np.reshape(df1.values, (-1, 3, len(df1.columns))))
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This uses the reshape/swapaxes/reshape idiom for rearranging sub-blocks of NumPy arrays.
In [26]: pd.DataFrame(df1.values.reshape(2,3,4).swapaxes(0,1).reshape(3,-1), columns=['a','b','c','d','e','f','g','h'])
Out[26]:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If you want a pure pandas solution:
df1.set_index([df1.index % 3, df1.index // 3])\
   .unstack()\
   .sort_index(level=1, axis=1)\
   .set_axis(list('abcdefgh'), axis=1)
Output:
a b c d e f g h
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
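A parameterized version of the reshape/swapaxes idiom for an arbitrary chunk size might look like the sketch below, assuming the row count is an exact multiple of n (the helper name chunk_to_columns is made up for illustration):
import numpy as np
import pandas as pd

def chunk_to_columns(frame, n):
    """Stack every consecutive block of n rows side by side as new columns."""
    rows, cols = frame.shape
    assert rows % n == 0, 'row count must be divisible by n'
    # (rows//n, n, cols) -> (n, rows//n, cols) -> (n, rows//n * cols)
    blocks = frame.values.reshape(rows // n, n, cols).swapaxes(0, 1)
    return pd.DataFrame(blocks.reshape(n, -1))

df1 = pd.DataFrame(np.arange(24).reshape(6, -1), columns=['a', 'b', 'c', 'd'])
print(chunk_to_columns(df1, 3))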
