Produce unique values for duplicates in a column using Pandas/Python

I have a dataset, df, where I would like to create unique ids for the values in the type column by appending numbers to the end.
Data
type total free use
a 10 5 5
a 10 4 6
a 10 1 9
a 10 8 2
a 10 3 7
b 20 5 5
b 20 3 7
b 20 2 8
b 20 6 4
b 20 2 8
Desired
type total free use
a 10 5 5
a1 10 4 6
a2 10 1 9
a3 10 8 2
a4 10 3 7
b 20 5 5
b1 20 3 7
b2 20 2 8
b3 20 6 4
b4 20 2 8
Doing
I was able to do this in R with the following data.table code, but I'm unsure how to do it in Python:
library(data.table)
setDT(DT)
DT[ , run_id := rleid(ID)]
DT[DT[ , .SD[1L], by = run_id][duplicated(ID), ID := paste0('list', .I)],
on = 'run_id', ID := i.ID][]
I am researching this; any input is appreciated.

You can use groupby.cumcount (with numpy imported as np):
df['type'] += np.where(df['type'].duplicated(),
                       df.groupby('type').cumcount().astype(str),
                       '')
Or similarly with loc update:
df.loc[df['type'].duplicated(), 'type'] += df.groupby('type').cumcount().astype(str)
Output:
type total free use
0 a 10 5 5
1 a1 10 4 6
2 a2 10 1 9
3 a3 10 8 2
4 a4 10 3 7
5 b 20 5 5
6 b1 20 3 7
7 b2 20 2 8
8 b3 20 6 4
9 b4 20 2 8
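Putting the first approach together end-to-end, here is a minimal, self-contained sketch; the type values are taken from the question, with the free/use columns omitted for brevity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['a'] * 5 + ['b'] * 5,
                   'total': [10] * 5 + [20] * 5})

# where() leaves the first occurrence unchanged (appends '') and appends
# the per-group cumcount (1, 2, ...) to every repeat.
df['type'] += np.where(df['type'].duplicated(),
                       df.groupby('type').cumcount().astype(str),
                       '')
print(df['type'].tolist())
# ['a', 'a1', 'a2', 'a3', 'a4', 'b', 'b1', 'b2', 'b3', 'b4']
```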

Related

Dynamically create columns in a dataframe

I have a Dataframe like the following:
a b a1 b1
0 1 6 10 20
1 2 7 11 21
2 3 8 12 22
3 4 9 13 23
4 5 2 14 24
where a1 and b1 are dynamically created from a and b. Can we create percentage columns dynamically as well?
The one thing that is constant is that the created columns will have 1 suffixed after the name
Expected output:
a b a1 b1 a% b%
0 0 6 10 20 0 30
1 2 7 11 21 29 33
2 3 8 12 22 38 36
3 4 9 13 23 44 39
4 5 2 14 24 250 8
Create a new DataFrame by dividing the two column sets, rename the columns with DataFrame.add_suffix, and finally append the result to the original with DataFrame.join:
cols = ['a','b']
new = [f'{x}1' for x in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print (df)
a b a1 b1 a% b%
0 1 6 10 20 10.000000 30.000000
1 2 7 11 21 18.181818 33.333333
2 3 8 12 22 25.000000 36.363636
3 4 9 13 23 30.769231 39.130435
4 5 2 14 24 35.714286 8.333333
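If the base columns are not known ahead of time, the "1 suffixed" convention stated in the question lets you discover the pairs instead of hard-coding them; a sketch of that idea, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 7, 8, 9, 2],
                   'a1': [10, 11, 12, 13, 14], 'b1': [20, 21, 22, 23, 24]})

# Any column whose name with '1' appended also exists is treated as a base column.
cols = [c for c in df.columns if f'{c}1' in df.columns]
new = [f'{c}1' for c in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print(df.round(2))
```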

How does pandas convert one column of data into another?

I have a dataframe generated by pandas, as follows:
NO CODE
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 a
10 a
11 a
12 a
13 b
14 a
15 a
16 a
I want to convert the CODE column data to get the NUM column. The encoding rules are as follows:
NO CODE NUM
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 a 1
15 a 2
16 a 3
thank you!
Try:
a_group = df.CODE.eq('a')
df['NUM'] = np.where(a_group,
                     df.groupby(a_group.ne(a_group.shift()).cumsum())
                       .CODE.cumcount() + 1,
                     df.CODE)
Applied to
df = pd.DataFrame({'CODE':list('baaaaaabbaaaabbaa')})
this yields
CODE NUM
-- ------ -----
0 b b
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 b b
15 a 1
16 a 2
IIUC
s = df.CODE.eq('b').cumsum()
df['NUM'] = df.CODE.where(df.CODE.eq('b'),
                          s[~df.CODE.eq('b')].groupby(s).cumcount() + 1)
df
Out[514]:
NO CODE NUM
0 1 a 1
1 2 a 2
2 3 a 3
3 4 a 4
4 5 a 5
5 6 a 6
6 7 b b
7 8 b b
8 9 a 1
9 10 a 2
10 11 a 3
11 12 a 4
12 13 b b
13 14 a 1
14 15 a 2
15 16 a 3
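The run-length idea behind the first answer can be sketched end-to-end; the CODE sequence below mirrors the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'CODE': list('aaaaaabbaaaabaaa')})

# Each change of CODE starts a new run; cumsum over the change flags yields
# a run id, and cumcount restarts the numbering inside each run.
a_group = df.CODE.eq('a')
run_id = a_group.ne(a_group.shift()).cumsum()
df['NUM'] = np.where(a_group,
                     (df.groupby(run_id).CODE.cumcount() + 1).astype(str),
                     df.CODE)
print(df['NUM'].tolist())
# ['1', '2', '3', '4', '5', '6', 'b', 'b', '1', '2', '3', '4', 'b', '1', '2', '3']
```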

Python Dataframe: Create columns based on another column

I have a dataframe with repeated values for one column (here column 'A') and I want to convert this dataframe so that new columns are formed based on values of column 'A'.
Example
df = pd.DataFrame({'A': list(range(4)) * 3, 'B': range(12), 'C': range(12, 24)})
df
A B C
0 0 0 12
1 1 1 13
2 2 2 14
3 3 3 15
4 0 4 16
5 1 5 17
6 2 6 18
7 3 7 19
8 0 8 20
9 1 9 21
10 2 10 22
11 3 11 23
Note that the values of "A" column are repeated 3 times.
Now I want the simplest solution to convert it to another dataframe with this configuration (please ignore the naming of the columns, it is used for description purpose only, they could be anything):
B C
A0 A1 A2 A3 A0 A1 A2 A3
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This is a pivot problem, so use (keyword arguments are required for pivot in pandas 2.0+):
df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
B C
A 0 1 2 3 0 1 2 3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If the headers are important, you can use MultiIndex.set_levels to fix them.
u = df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
u.columns = u.columns.set_levels(
    ['A' + u.columns.levels[1].astype(str)], level=[1])
u
B C
A A0 A1 A2 A3 A0 A1 A2 A3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
You may need to assign a group helper key with cumcount, then just unstack:
yourdf = (df.assign(D=df.groupby('A').cumcount(), A='A' + df.A.astype(str))
            .set_index(['D', 'A'])
            .unstack())
B C
A A0 A1 A2 A3 A0 A1 A2 A3
D
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
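All of the answers above hinge on groupby.cumcount supplying the new row index; a self-contained sketch of the pivot variant (keyword arguments used so it runs on pandas >= 2.0):

```python
import pandas as pd

df = pd.DataFrame({'A': list(range(4)) * 3,
                   'B': range(12),
                   'C': range(12, 24)})

# cumcount numbers the repeats of each A value (0, 1, 2); that counter
# becomes the row index while A's values spread out into columns.
out = (df.assign(idx=df.groupby('A').cumcount())
         .pivot(index='idx', columns='A', values=['B', 'C']))
print(out)
```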

Dynamic pandas dataframe generation

Here is code I wrote to generate a dataframe that contains 4 columns (randint comes from the random module):
from random import randint

num_rows = 10
df = pd.DataFrame({'id_col': [x + 1 for x in range(num_rows)],
                   'c1': [randint(0, 9) for x in range(num_rows)],
                   'c2': [randint(0, 9) for x in range(num_rows)],
                   'c3': [randint(0, 9) for x in range(num_rows)]})
df
print(df) renders:
id_col c1 c2 c3
0 1 3 1 5
1 2 0 2 4
2 3 1 2 5
3 4 0 5 6
4 5 0 0 1
5 6 6 5 8
6 7 1 6 8
7 8 5 8 8
8 9 1 5 2
9 10 2 9 2
I've set the number of rows to be dynamically generated via the num_rows variable.
How can I dynamically generate 1000 columns, each name prefixed with 'c', so that columns c1, c2, c3, ..., c1000 are generated, each containing 10 rows?
For better performance I suggest creating the DataFrame with the numpy function numpy.random.randint, then renaming the columns with a list comprehension; to insert the new column by position, use DataFrame.insert:
np.random.seed(458)
N = 15
M = 10
df = pd.DataFrame(np.random.randint(10, size=(M, N)))
df.columns = ['c{}'.format(x+1) for x in df.columns]
df.insert(0, 'idcol', np.arange(M))
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
Another solution uses numpy.hstack to stack the id column onto the 2d array first:
np.random.seed(458)
arr = np.hstack([np.arange(M)[:, None], np.random.randint(10, size=(M, N))])
df = pd.DataFrame(arr)
df.columns = ['idcol'] + ['c{}'.format(x) for x in df.columns[1:]]
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
IIUC, use str.format and a dict comprehension (randint from the random module):
from random import randint

num_rows = 10
num_cols = 10
df = pd.DataFrame({'c{}'.format(n): [randint(0, 9) for x in range(num_rows)]
                   for n in range(num_cols)},
                  index=[x + 1 for x in range(num_rows)])
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
1 1 6 2 1 3 1 8 8 2 0
2 2 6 2 2 5 7 4 1 6 2
3 1 2 6 8 7 5 5 7 2 2
4 5 5 3 3 4 7 8 1 8 6
5 7 2 8 6 5 6 2 0 0 4
6 8 2 4 4 6 3 0 1 0 2
7 5 6 8 5 1 0 4 8 4 7
8 1 5 4 5 2 4 4 6 2 7
9 5 7 7 8 5 0 2 7 3 2
10 4 8 5 3 3 7 5 1 5 1
You can use np.random.randint to create a full array of random values, f-strings (Python 3.6+) with a list comprehension for column naming, and pd.DataFrame.assign with np.arange for defining "id_col":
import pandas as pd, numpy as np
rows = 10
cols = 5
minval, maxval = 0, 10
df = pd.DataFrame(np.random.randint(minval, maxval, (rows, cols)),
                  columns=[f'c{i}' for i in range(1, cols + 1)])\
       .assign(id_col=np.arange(1, rows + 1))
print(df)
c1 c2 c3 c4 c5 id_col
0 8 4 6 0 8 1
1 8 3 5 9 0 2
2 1 3 3 6 2 3
3 6 4 1 1 7 4
4 3 7 0 9 5 5
5 4 6 8 8 6 6
6 0 3 9 9 7 7
7 0 6 1 2 4 8
8 3 7 1 2 0 9
9 6 6 0 5 8 10
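For the full 1000-column case asked about in the question, the numpy approach scales directly; a minimal sketch (the sizes are illustrative):

```python
import numpy as np
import pandas as pd

num_rows, num_cols = 10, 1000

# One vectorized draw for the whole table, then named columns and an
# id column inserted at position 0.
df = pd.DataFrame(np.random.randint(0, 10, size=(num_rows, num_cols)),
                  columns=[f'c{i}' for i in range(1, num_cols + 1)])
df.insert(0, 'id_col', np.arange(1, num_rows + 1))
print(df.shape)  # (10, 1001)
```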

choose from one dataframe based on another dataframe

I have two data frames one like this:
point sector
1 1 4
2 2 5
3 3 2
4 4 1
5 5 5
6 6 1
7 7 4
8 8 3
10 10 5
11 11 2
12 12 1
13 13 3
14 14 1
15 15 4
16 16 3
17 17 2
18 18 1
19 19 1
20 20 1
21 alt 1 2
22 alt 3 3
23 alt 2 5
And the other like this, where the entry corresponds to the sector I want the point to come from.
p1 p2 p3 p4
1 2 3 4
1 2 3 5
1 2 4 5
1 3 4 5
2 3 4 5
What I want to do is create another data frame that will give me a randomly selected set of points from the first dataframe based on their sector.
For example:
p1 p2 p3 p4
lane 1: 12 3 8 7
As you can see the numbers from lane 1 all have sectors that are in line 1 of the 2nd dataframe. I have been trying to use df.loc but was wondering if there is a better way?
For each row, look up the matching sectors in the first dataframe (which must be indexed by sector, e.g. df = df.set_index('sector')) and take a random choice per sector:
df2.apply(lambda r: df.loc[r].groupby(level=0).point.apply(np.random.choice).values,
          axis=1)
Out[132]:
p1 p2 p3 p4
0 4 11 alt 3 1
1 6 11 13 alt 2
2 4 17 7 alt 2
3 19 alt 3 15 5
4 alt 1 13 7 10
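A self-contained sketch of this technique; the frames below are small toy versions of the question's data, and it assumes the first dataframe has been indexed by sector and that each lane row lists its sectors in ascending order (so groupby's sorted output lines up with the lane columns):

```python
import numpy as np
import pandas as pd

# Points labelled by sector (labels modeled on the question's data).
points = pd.DataFrame({'point': ['4', '12', 'alt 1', '3', '8', '15'],
                       'sector': [1, 1, 2, 2, 3, 4]}).set_index('sector')
lanes = pd.DataFrame({'p1': [1, 1], 'p2': [2, 2], 'p3': [3, 3], 'p4': [4, 4]})

# For each lane row: select all points in the requested sectors, then
# draw one random point per sector via groupby(level=0).
picked = lanes.apply(
    lambda r: points.loc[r].groupby(level=0).point.apply(np.random.choice).values,
    axis=1, result_type='expand')
picked.columns = lanes.columns
print(picked)
```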
