I am trying to conduct a mixed model analysis but would like to only include individuals who have data in all timepoints available. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc keeping only the ID's that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
Related
This question already has answers here:
How do I melt a pandas dataframe?
(3 answers)
Closed 4 months ago.
I have the following data:
df = pd.DataFrame({'id' : [1,2,3,4,5,6], 'category' : [1,3,1,4,3,2], 'day1' : [10,20,30,40,50,60], 'day2' : [1,2,3,4,5,7], 'day3' : [0,1,2,3,7,9] })
df
id category day1 day2 day3
0 1 1 10 1 0
1 2 3 20 2 1
2 3 1 30 3 2
3 4 4 40 4 3
4 5 3 50 5 7
5 6 2 60 7 9
It is time series data and I need to prepare the new DataFrame as records of ('id', 'category', 'day'):
df = pd.DataFrame({'id' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6], 'category' : [1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2], 'day' : [10,1,0,20,2,1,30,3,2,40,4,3,50,5,7,60,7,9]})
df
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
But I don't know how to do it without looping by every DataFrame cell
A possible solution:
df.set_index(['id', 'category']).stack().rename(
'day').reset_index().drop('level_2', axis=1)
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
You can use pandas.melt :
df_new = df.melt(id_vars=['id', 'category'], value_name='day'
).sort_values(['id', 'variable']
).drop('variable', axis=1
).reset_index(drop=True)
print(df_new)
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 9 months ago.
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n where n depends on the value of column a as follow:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I can write it using a for loop, my data frame is huge and it's computationally costly, is there any efficient to generate such column? Thanks.
Use groupby_cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
I loaded the data without header.
train = pd.read_csv('caravan.train', delimiter ='\t', header=None)
train.index = np.arange(1,len(train)+1)
train
0 1 2 3 4 5 6 7 8 9
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
but the header started from 0, and I want to create header starting with 1 insteade of 0
How can I do this?
In your case
df.columns = df.columns.astype(int)+1
df
Out[99]:
1 2 3 4 5 6 7 8 9 10
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
df = pd.DataFrame({'site':[1,1,1,1,1,1,1,1,1,1], 'parm':[8,8,8,8,8,9,9,9,9,9],
'date':[1,2,3,4,5,1,2,3,4,5], 'obs':[1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count. The gap is when the parm changes from 8 to 9 it includes the parm 9 in the parm 8 count. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
You need to throw site, parm as indicated in the question into groupby:
df['consecutive'] = (df.groupby([df.obs.ne(df.obs.shift()).cumsum(),
'site', 'parm']
)
['obs'].transform('size')
)
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
I'm figuring out how to assign a categorization from an increasing enumeration column. Here an example of my dataframe:
df = pd.DataFrame({'A':[1,1,1,1,1,1,2,2,3,3,3,3,3],'B':[1,2,3,12,13,14,1,2,5,6,7,8,50]})
This produce:
df
Out[9]:
A B
0 1 1
1 1 2
2 1 3
3 1 12
4 1 13
5 1 14
6 2 1
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 3 50
The column B has an increasing numerical serie, but sometimes the series is interrupted and keeps going with other numbers or start again. My desired output is:
Out[11]:
A B C
0 1 1 1
1 1 2 1
2 1 3 1
3 1 12 2
4 1 13 2
5 1 14 2
6 2 1 3
7 2 2 3
8 3 5 3
9 3 6 4
10 3 7 4
11 3 8 4
12 3 50 5
I appreciate your suggestions, because I can not find an ingenious way to
do it. Thanks
Is this what you need ?
df.B.diff().ne(1).cumsum()
Out[463]:
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
11 4
12 5
Name: B, dtype: int32