add incremental value for duplicates [duplicate] - python

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 3 years ago.
Suppose I have a dataframe looking something like
df = pd.DataFrame(np.array([[1, 2, 3, 2], [4, 5, 6, 3], [7, 8, 9, 5]]), columns=['a', 'b', 'c', 'repeater'])
a b c repeater
0 1 2 3 2
1 4 5 6 3
2 7 8 9 5
And I repeat every row based on the df['repeat'] like df = df.loc[df.index.repeat(df['repeater'])]
So I end up with a data frame
a b c repeater
0 1 2 3 2
0 1 2 3 2
1 4 5 6 3
1 4 5 6 3
1 4 5 6 3
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
2 7 8 9 5
How can I add an incremental value based on the index row? So a new column df['incremental'] with the output:
a b c repeater incremental
0 1 2 3 2 1
0 1 2 3 2 2
1 4 5 6 3 1
1 4 5 6 3 2
1 4 5 6 3 3
2 7 8 9 5 1
2 7 8 9 5 2
2 7 8 9 5 3
2 7 8 9 5 4
2 7 8 9 5 5

Try your code with an extra groupby and cumcount:
df = df.loc[df.index.repeat(df['repeater'])]
df['incremental'] = df.groupby(df.index).cumcount() + 1
print(df)
Output:
a b c repeater incremental
0 1 2 3 2 1
0 1 2 3 2 2
1 4 5 6 3 1
1 4 5 6 3 2
1 4 5 6 3 3
2 7 8 9 5 1
2 7 8 9 5 2
2 7 8 9 5 3
2 7 8 9 5 4
2 7 8 9 5 5

Related

Generating new column w value 1 to n with n depending on another column in Pandas [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 9 months ago.
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n where n depends on the value of column a as follow:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I can write it using a for loop, my data frame is huge and it's computationally costly, is there any efficient to generate such column? Thanks.
Use groupby_cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6

How to reduce pandas dataframe to only those individuals with all timepoints

I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3

Need to count repeating, consecutive values in python dataframe within a groupby

df = pd.DataFrame({'site':[1,1,1,1,1,1,1,1,1,1], 'parm':[8,8,8,8,8,9,9,9,9,9],
'date':[1,2,3,4,5,1,2,3,4,5], 'obs':[1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count. The gap is when the parm changes from 8 to 9 it includes the parm 9 in the parm 8 count. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
You need to throw site, parm as indicated in the question into groupby:
df['consecutive'] = (df.groupby([df.obs.ne(df.obs.shift()).cumsum(),
'site', 'parm']
)
['obs'].transform('size')
)
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2

pandas - pivot column names as values

I have a dataframe like this:
1 2 3 4 5 6
Ax Ax Ax Ax Ax Ax
delta delta delta delta delta delta
0 6 4 1 5 3 2
1 6 1 5 3 2 4
2 6 1 5 3 2 4
3 6 1 5 3 2 4
4 6 1 5 3 2 4
5 6 1 5 3 2 4
6 6 1 5 3 2 4
7 6 1 5 3 2 4
8 6 1 5 3 2 4
9 6 1 5 3 2 4
I would like to pivot this such that the values are the column, and the columns are the value.
So, the first two rows would become the following:
1 2 3 4 5 6
0 3 6 5 2 4 1
1 3 6 2 5 4 1
I hope that this makes sense. I have tried using pivot() and pivot_table() but it doesn't seem possible with that.
Try:
df1 = df.copy()
df1.columns = df1.columns.droplevel([1,2])
df1.stack().reset_index().pivot(index='level_0', columns=0)
Slice the columns by the sorted indices:
import numpy as np
import pandas as pd
cols = df.columns.get_level_values(0).to_numpy()
pd.DataFrame(cols[np.argsort(df.to_numpy(), 1)],
columns=list(range(1, df.shape[1]+1)))
1 2 3 4 5 6
0 3 6 5 2 4 1
1 2 5 4 6 3 1
2 2 5 4 6 3 1
3 2 5 4 6 3 1
4 2 5 4 6 3 1
5 2 5 4 6 3 1
6 2 5 4 6 3 1
7 2 5 4 6 3 1
8 2 5 4 6 3 1
9 2 5 4 6 3 1

pandas add variables according to variable value

Suppose the following pandas dataframe
Wafer_Id v1 v2
0 0 9 6
1 0 7 8
2 0 1 5
3 1 6 6
4 1 0 8
5 1 5 0
6 2 8 8
7 2 2 6
8 2 3 5
9 3 5 1
10 3 5 6
11 3 9 8
I want to group it according to WaferId and I would like to get something like
w
Out[60]:
Wafer_Id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3
0 0 9 7 1 6 ... ...
1 1 6 0 5 6
2 2 8 2 3 8
3 3 5 5 9 1
I think that I can obtain the result with the pivot function but I am not sure of how to do it
Possible solution
oes = pd.DataFrame()
oes['Wafer_Id'] = [0,0,0,1,1,1,2,2,2,3,3,3]
oes['v1'] = np.random.randint(0, 10, 12)
oes['v2'] = np.random.randint(0, 10, 12)
oes['id'] = [0, 1, 2] * 4
oes.pivot(index='Wafer_Id', columns='id')
oes
Out[74]:
Wafer_Id v1 v2 id
0 0 8 7 0
1 0 3 3 1
2 0 8 0 2
3 1 2 5 0
4 1 4 1 1
5 1 8 8 2
6 2 8 6 0
7 2 4 7 1
8 2 4 3 2
9 3 4 6 0
10 3 9 2 1
11 3 7 1 2
oes.pivot(index='Wafer_Id', columns='id')
Out[75]:
v1 v2
id 0 1 2 0 1 2
Wafer_Id
0 8 3 8 7 3 0
1 2 4 8 5 1 8
2 8 4 4 6 7 3
3 4 9 7 6 2 1

Categories

Resources