I have a dataset, df, where I would like to create unique ids for the values in the type column by appending numbers to the end.
Data
type total free use
a 10 5 5
a 10 4 6
a 10 1 9
a 10 8 2
a 10 3 7
b 20 5 5
b 20 3 7
b 20 2 8
b 20 6 4
b 20 2 8
Desired
type total free use
a 10 5 5
a1 10 4 6
a2 10 1 9
a3 10 8 2
a4 10 3 7
b 20 5 5
b1 20 3 7
b2 20 2 8
b3 20 6 4
b4 20 2 8
Doing
I was able to do this in R with the following, but I am unsure how to do it in Python:
library(data.table)
setDT(DT)
DT[ , run_id := rleid(ID)]
DT[DT[ , .SD[1L], by = run_id][duplicated(ID), ID := paste0('list', .I)],
on = 'run_id', ID := i.ID][]
I am researching this; any input is appreciated.
You can use groupby.cumcount:
import numpy as np

df['type'] += np.where(df['type'].duplicated(),
                       df.groupby('type').cumcount().astype(str),
                       '')
Or similarly with loc update:
df.loc[df['type'].duplicated(), 'type'] += df.groupby('type').cumcount().astype(str)
Output:
type total free use
0 a 10 5 5
1 a1 10 4 6
2 a2 10 1 9
3 a3 10 8 2
4 a4 10 3 7
5 b 20 5 5
6 b1 20 3 7
7 b2 20 2 8
8 b3 20 6 4
9 b4 20 2 8
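Putting the pieces together, a minimal self-contained sketch of this approach, with the sample data recreated from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'type': list('aaaaabbbbb'),
                   'total': [10]*5 + [20]*5,
                   'free': [5, 4, 1, 8, 3, 5, 3, 2, 6, 2],
                   'use': [5, 6, 9, 2, 7, 5, 7, 8, 4, 8]})

# Append the per-group occurrence number only to duplicates,
# so the first occurrence of each type keeps its bare label.
df['type'] += np.where(df['type'].duplicated(),
                       df.groupby('type').cumcount().astype(str),
                       '')
print(df)
```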
I have a Dataframe like the following:
a b a1 b1
0 1 6 10 20
1 2 7 11 21
2 3 8 12 22
3 4 9 13 23
4 5 2 14 24
where a1 and b1 are dynamically created from a and b. Can we create percentage columns dynamically as well?
The one thing that is constant is that each created column will have 1 suffixed after the base name.
Expected output:
a b a1 b1 a% b%
0 0 6 10 20 0 30
1 2 7 11 21 29 33
2 3 8 12 22 38 36
3 4 9 13 23 44 39
4 5 2 14 24 250 8
Create a new DataFrame by dividing the paired columns, rename the result with DataFrame.add_suffix, and finally append it to the original with DataFrame.join:
cols = ['a', 'b']
new = [f'{x}1' for x in cols]
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
print(df)
a b a1 b1 a% b%
0 1 6 10 20 10.000000 30.000000
1 2 7 11 21 18.181818 33.333333
2 3 8 12 22 25.000000 36.363636
3 4 9 13 23 30.769231 39.130435
4 5 2 14 24 35.714286 8.333333
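If the base columns aren't known ahead of time, one possible sketch detects the pairs first. This assumes (per the question) that every derived column is exactly the base name with a 1 suffix:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, 2],
                   'a1': [10, 11, 12, 13, 14],
                   'b1': [20, 21, 22, 23, 24]})

# Detect base columns whose '<name>1' partner also exists.
cols = [c for c in df.columns if not c.endswith('1') and f'{c}1' in df.columns]
new = [f'{c}1' for c in cols]

# Divide each base column by its partner and append the % columns.
df = df.join(df[cols].div(df[new].to_numpy()).mul(100).add_suffix('%'))
```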
I have a dataframe generated by pandas, as follows:
NO CODE
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 a
10 a
11 a
12 a
13 b
14 a
15 a
16 a
I want to convert the CODE column data to get the NUM column. The encoding rules are as follows:
NO CODE NUM
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 a 1
15 a 2
16 a 3
thank you!
Try:
import numpy as np

a_group = df.CODE.eq('a')
df['NUM'] = np.where(a_group,
                     df.groupby(a_group.ne(a_group.shift()).cumsum())
                       .CODE.cumcount() + 1,
                     df.CODE)
on
df = pd.DataFrame({'CODE':list('baaaaaabbaaaabbaa')})
yields
CODE NUM
-- ------ -----
0 b b
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 b b
15 a 1
16 a 2
IIUC
s = df.CODE.eq('b').cumsum()
df['NUM'] = df.CODE.where(df.CODE.eq('b'),
                          s[~df.CODE.eq('b')].groupby(s).cumcount() + 1)
df
Out[514]:
NO CODE NUM
0 1 a 1
1 2 a 2
2 3 a 3
3 4 a 4
4 5 a 5
5 6 a 6
6 7 b b
7 8 b b
8 9 a 1
9 10 a 2
10 11 a 3
11 12 a 4
12 13 b b
13 14 a 1
14 15 a 2
15 16 a 3
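Both answers rely on the same idea: build a run id with shift/cumsum and count within each run. A self-contained sketch using the question's data:

```python
import pandas as pd
import numpy as np

# CODE column from the question: 6 a's, 2 b's, 4 a's, 1 b, 3 a's.
df = pd.DataFrame({'CODE': list('aaaaaabbaaaabaaa')})

is_a = df['CODE'].eq('a')
# A new run starts wherever the value changes; cumsum turns that into a run id.
run_id = is_a.ne(is_a.shift()).cumsum()

# Count 1, 2, 3, ... within each run of a's; keep 'b' as-is elsewhere.
df['NUM'] = np.where(is_a,
                     df.groupby(run_id).cumcount() + 1,
                     df['CODE'])
```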
I have a dataframe with repeated values for one column (here column 'A') and I want to convert this dataframe so that new columns are formed based on values of column 'A'.
Example
df = pd.DataFrame({'A': list(range(4))*3, 'B': range(12), 'C': range(12, 24)})
df
A B C
0 0 0 12
1 1 1 13
2 2 2 14
3 3 3 15
4 0 4 16
5 1 5 17
6 2 6 18
7 3 7 19
8 0 8 20
9 1 9 21
10 2 10 22
11 3 11 23
Note that the values of "A" column are repeated 3 times.
Now I want the simplest solution to convert it to another dataframe with this configuration (please ignore the naming of the columns, it is used for description purpose only, they could be anything):
B C
A0 A1 A2 A3 A0 A1 A2 A3
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
This is a pivot problem, so use
df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
B C
A 0 1 2 3 0 1 2 3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
If the headers are important, you can use MultiIndex.set_levels to fix them.
u = df.assign(idx=df.groupby('A').cumcount()).pivot(index='idx', columns='A', values=['B', 'C'])
u.columns = u.columns.set_levels(
['A' + u.columns.levels[1].astype(str)], level=[1])
u
B C
A A0 A1 A2 A3 A0 A1 A2 A3
idx
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
You may need to assign a group helper key with cumcount, then just do unstack:
yourdf = (df.assign(D=df.groupby('A').cumcount(),
                    A='A' + df.A.astype(str))
            .set_index(['D', 'A'])
            .unstack())
B C
A A0 A1 A2 A3 A0 A1 A2 A3
D
0 0 1 2 3 12 13 14 15
1 4 5 6 7 16 17 18 19
2 8 9 10 11 20 21 22 23
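A self-contained version of the pivot approach; keyword arguments are used here, since newer pandas versions no longer accept positional pivot arguments:

```python
import pandas as pd

df = pd.DataFrame({'A': list(range(4)) * 3,
                   'B': range(12),
                   'C': range(12, 24)})

# Number each repeat of an 'A' value; that counter becomes the new row index.
out = (df.assign(idx=df.groupby('A').cumcount())
         .pivot(index='idx', columns='A', values=['B', 'C']))
```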
Here is code I wrote to generate a dataframe that contains 4 columns:
from random import randint

num_rows = 10
df = pd.DataFrame({'id_col': [x + 1 for x in range(num_rows)],
                   'c1': [randint(0, 9) for x in range(num_rows)],
                   'c2': [randint(0, 9) for x in range(num_rows)],
                   'c3': [randint(0, 9) for x in range(num_rows)]})
df
print(df) renders:
id_col c1 c2 c3
0 1 3 1 5
1 2 0 2 4
2 3 1 2 5
3 4 0 5 6
4 5 0 0 1
5 6 6 5 8
6 7 1 6 8
7 8 5 8 8
8 9 1 5 2
9 10 2 9 2
I've set the number of rows to be dynamically generated via the num_rows variable.
How do I dynamically generate 1000 columns, where each column name is prefixed with 'c'? So columns c1, c2, c3, ..., c1000 are generated, where each column contains 10 rows.
For better performance I suggest creating the DataFrame with the numpy function numpy.random.randint, then renaming the columns with a list comprehension; to add the id column at the first position, use DataFrame.insert:
np.random.seed(458)
N = 15
M = 10
df = pd.DataFrame(np.random.randint(10, size=(M, N)))
df.columns = ['c{}'.format(x+1) for x in df.columns]
df.insert(0, 'idcol', np.arange(M))
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
Another solution with numpy.hstack, which stacks the id column onto the 2d array first:
np.random.seed(458)
arr = np.hstack([np.arange(M)[:, None], np.random.randint(10, size=(M, N))])
df = pd.DataFrame(arr)
df.columns = ['idcol'] + ['c{}'.format(x) for x in df.columns[1:]]
print (df)
idcol c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15
0 0 8 2 1 6 2 1 0 9 7 8 0 5 5 6 0
1 1 0 2 5 0 0 2 5 2 9 2 1 0 0 5 0
2 2 5 1 3 5 4 5 3 0 2 1 7 8 9 5 4
3 3 8 7 7 0 1 3 6 7 5 8 8 9 8 5 5
4 4 2 8 1 7 3 7 4 6 0 7 0 9 4 0 4
5 5 9 2 1 6 1 9 5 6 7 4 6 1 7 3 7
6 6 1 9 3 9 7 7 2 7 9 8 2 7 2 5 5
7 7 7 6 6 6 4 2 9 0 6 5 7 0 0 4 9
8 8 6 4 2 1 3 1 7 0 4 3 0 5 4 7 7
9 9 1 3 5 7 2 2 1 5 6 1 9 5 9 6 3
IIUC, use str.format and dict comprehension
from random import randint

num_rows = 10
num_cols = 10
df = pd.DataFrame({'c{}'.format(n): [randint(0, 9) for x in range(num_rows)]
                   for n in range(num_cols)},
                  index=[x + 1 for x in range(num_rows)])
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9
1 1 6 2 1 3 1 8 8 2 0
2 2 6 2 2 5 7 4 1 6 2
3 1 2 6 8 7 5 5 7 2 2
4 5 5 3 3 4 7 8 1 8 6
5 7 2 8 6 5 6 2 0 0 4
6 8 2 4 4 6 3 0 1 0 2
7 5 6 8 5 1 0 4 8 4 7
8 1 5 4 5 2 4 4 6 2 7
9 5 7 7 8 5 0 2 7 3 2
10 4 8 5 3 3 7 5 1 5 1
You can use the np.random.randint to create a full array of random values, f-strings (Python 3.6+) with a list comprehension for column naming, and pd.DataFrame.assign with np.arange for defining "id_col":
import pandas as pd, numpy as np
rows = 10
cols = 5
minval, maxval = 0, 10
df = pd.DataFrame(np.random.randint(minval, maxval, (rows, cols)),
                  columns=[f'c{i}' for i in range(1, cols + 1)])\
       .assign(id_col=np.arange(1, rows + 1))
print(df)
c1 c2 c3 c4 c5 id_col
0 8 4 6 0 8 1
1 8 3 5 9 0 2
2 1 3 3 6 2 3
3 6 4 1 1 7 4
4 3 7 0 9 5 5
5 4 6 8 8 6 6
6 0 3 9 9 7 7
7 0 6 1 2 4 8
8 3 7 1 2 0 9
9 6 6 0 5 8 10
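A compact, fully reproducible variant of the answers above, scaled up to the 1000 columns the question asks for. It uses numpy's newer default_rng generator (a substitution on my part; any of the np.random.randint approaches above works the same way):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_rows, num_cols = 10, 1000

# One integers() call for the whole block, then f-string names c1..c1000.
df = pd.DataFrame(rng.integers(0, 10, size=(num_rows, num_cols)),
                  columns=[f'c{i}' for i in range(1, num_cols + 1)])

# Put the id column first.
df.insert(0, 'id_col', list(range(1, num_rows + 1)))
```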
I have two data frames, one like this:
point sector
1 1 4
2 2 5
3 3 2
4 4 1
5 5 5
6 6 1
7 7 4
8 8 3
10 10 5
11 11 2
12 12 1
13 13 3
14 14 1
15 15 4
16 16 3
17 17 2
18 18 1
19 19 1
20 20 1
21 alt 1 2
22 alt 3 3
23 alt 2 5
And the other like this, where the entry corresponds to the sector I want the point to come from.
p1 p2 p3 p4
1 2 3 4
1 2 3 5
1 2 4 5
1 3 4 5
2 3 4 5
What I want to do is create another data frame that will give me a randomly selected set of points from the first dataframe based on their sector.
For example:
p1 p2 p3 p4
lane 1: 12 3 8 7
As you can see, the numbers from lane 1 all have sectors that match line 1 of the 2nd dataframe. I have been trying to use df.loc, but was wondering if there is a better way?
For each row, fetch the rows for those sectors from the first dataframe (indexed by sector) and randomly choose one point per sector:
df2.apply(lambda r: df.loc[r].groupby(level=0).point.apply(np.random.choice).values, axis=1)
Out[132]:
p1 p2 p3 p4
0 4 11 alt 3 1
1 6 11 13 alt 2
2 4 17 7 alt 2
3 19 alt 3 15 5
4 alt 1 13 7 10
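The one-liner above assumes the first dataframe is indexed by sector. A self-contained sketch with a reduced version of the question's data; result_type='expand' is added so the result keeps the p1..p4 column shape on newer pandas:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Points table, indexed by sector so .loc can fetch every point in a sector.
points = pd.DataFrame({'point': ['1', '2', '3', '4', 'alt 1', 'alt 2', 'alt 3'],
                       'sector': [4, 5, 2, 1, 2, 5, 3]}).set_index('sector')

# Each row lists the sectors the four picked points should come from.
combos = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 5]],
                      columns=['p1', 'p2', 'p3', 'p4'])

# For each combo row: pull the rows for those sectors, then pick one random
# point per sector (groupby(level=0) groups on the sector index).
result = combos.apply(
    lambda r: points.loc[r].groupby(level=0)['point']
                    .apply(np.random.choice).values,
    axis=1, result_type='expand')
result.columns = combos.columns
```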