How to reduce pandas dataframe to only those individuals with all timepoints

I am trying to conduct a mixed model analysis, but I would like to include only individuals who have data at every available timepoint. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.

You can count the number of unique timepoints in the data, then use groupby with transform('nunique') and loc to keep only the IDs that have all 6 of them:
t = df['timepoint'].nunique()
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
print(res)
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
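
As a hedged alternative sketch, the same filter can be written with groupby(...).filter, comparing each individual's set of timepoints against the full set observed in the data. The names all_timepoints and complete are illustrative, and the approach assumes every timepoint of interest appears at least once somewhere in the dataframe.

```python
import pandas as pd

ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id': ids, 'timepoint': timepoint, 'outcome': outcome})

# Full set of timepoints seen anywhere in the data (assumed to be the target set).
all_timepoints = set(df['timepoint'])

# Keep only the groups whose own timepoints match the full set exactly.
complete = df.groupby('id').filter(
    lambda g: set(g['timepoint']) == all_timepoints
)
print(complete['id'].unique())
```

filter calls the lambda once per group, so on very large data the transform('nunique') version is likely to be faster; this form is mainly easier to read.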

Related

Convert column based time series data to regular DataFrame [duplicate]

I have the following data:
df = pd.DataFrame({'id' : [1,2,3,4,5,6], 'category' : [1,3,1,4,3,2], 'day1' : [10,20,30,40,50,60], 'day2' : [1,2,3,4,5,7], 'day3' : [0,1,2,3,7,9] })
df
id category day1 day2 day3
0 1 1 10 1 0
1 2 3 20 2 1
2 3 1 30 3 2
3 4 4 40 4 3
4 5 3 50 5 7
5 6 2 60 7 9
It is time series data and I need to prepare the new DataFrame as records of ('id', 'category', 'day'):
df = pd.DataFrame({'id' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6], 'category' : [1,1,1,3,3,3,1,1,1,4,4,4,3,3,3,2,2,2], 'day' : [10,1,0,20,2,1,30,3,2,40,4,3,50,5,7,60,7,9]})
df
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
But I don't know how to do it without looping over every DataFrame cell.

A possible solution:
df.set_index(['id', 'category']).stack().rename(
'day').reset_index().drop('level_2', axis=1)
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9

You can use pandas.melt :
df_new = df.melt(id_vars=['id', 'category'], value_name='day'
).sort_values(['id', 'variable']
).drop('variable', axis=1
).reset_index(drop=True)
print(df_new)
Output:
id category day
0 1 1 10
1 1 1 1
2 1 1 0
3 2 3 20
4 2 3 2
5 2 3 1
6 3 1 30
7 3 1 3
8 3 1 2
9 4 4 40
10 4 4 4
11 4 4 3
12 5 3 50
13 5 3 5
14 5 3 7
15 6 2 60
16 6 2 7
17 6 2 9
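
If the day label is needed later, a minimal sketch is to keep it as its own column through var_name instead of dropping it. The column names day_label and value are illustrative choices, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'category': [1, 3, 1, 4, 3, 2],
                   'day1': [10, 20, 30, 40, 50, 60],
                   'day2': [1, 2, 3, 4, 5, 7],
                   'day3': [0, 1, 2, 3, 7, 9]})

# Wide to long: one row per (id, day), keeping the source column name.
long_df = (df.melt(id_vars=['id', 'category'],
                   var_name='day_label', value_name='value')
             .sort_values(['id', 'day_label'])
             .reset_index(drop=True))
print(long_df.head())
```

The labels are sorted as strings here, so with ten or more day columns 'day10' would sort before 'day2'; in that case the number would need to be extracted from the label before sorting.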

Generating a new column with values 1 to n, with n depending on another column in Pandas [duplicate]

Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
I would like to make a new column c with values 1 to n, where n depends on the value of column a, as follows:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
I can write this with a for loop, but my data frame is huge and the loop is computationally costly. Is there an efficient way to generate such a column? Thanks.

Use groupby with cumcount, which numbers the rows of each group starting from 0, and add 1:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
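
A small sketch on made-up data (not from the question) shows that cumcount numbers rows in their original order within each group, so the groups do not have to be contiguous:

```python
import pandas as pd

# Illustrative data: the two groups in 'a' are interleaved.
df = pd.DataFrame({'a': [1, 2, 1, 2, 1],
                   'b': [10, 20, 30, 40, 50]})

# cumcount is zero-based, so add 1 to start each group's counter at 1.
df['c'] = df.groupby('a').cumcount() + 1
print(df)
```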

How can I create a header starting with 1

I loaded the data without header.
train = pd.read_csv('caravan.train', delimiter ='\t', header=None)
train.index = np.arange(1,len(train)+1)
train
0 1 2 3 4 5 6 7 8 9
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
The header starts from 0, but I want it to start from 1 instead.
How can I do this?

In your case, shift the column labels by one:
train.columns = train.columns.astype(int) + 1
train
Out[99]:
1 2 3 4 5 6 7 8 9 10
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
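
A hedged alternative is to assign a fresh range of labels directly. The sketch below reads a few made-up tab-separated rows from an in-memory string instead of the caravan.train file, which is not available here.

```python
import io

import pandas as pd

# Stand-in for the real file: three tab-separated columns, two rows.
raw = "33\t1\t3\n37\t1\t2\n"
train = pd.read_csv(io.StringIO(raw), delimiter='\t', header=None)

# Relabel both the rows and the columns so they start at 1.
train.index = range(1, len(train) + 1)
train.columns = range(1, train.shape[1] + 1)
print(train)
```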

Need to count repeating, consecutive values in python dataframe within a groupby

df = pd.DataFrame({'site':[1,1,1,1,1,1,1,1,1,1], 'parm':[8,8,8,8,8,9,9,9,9,9],
'date':[1,2,3,4,5,1,2,3,4,5], 'obs':[1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count. The problem is that when parm changes from 8 to 9, the first parm 9 row is counted as part of the parm 8 run. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2

You need to add site and parm to the groupby keys, as the question describes:
df['consecutive'] = (
    df.groupby([df.obs.ne(df.obs.shift()).cumsum(), 'site', 'parm'])['obs']
      .transform('size')
)
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
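
As a hedged variation on the same idea, the run identifier itself can be made to restart whenever site, parm or obs changes, so a single groupby key is enough. The names keys and run_id are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'site': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                   'parm': [8, 8, 8, 8, 8, 9, 9, 9, 9, 9],
                   'date': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'obs':  [1, 1, 2, 3, 3, 3, 5, 5, 6, 6]})

# A new run starts on any row where site, parm or obs differs from the row above.
keys = df[['site', 'parm', 'obs']]
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# The size of each run is the number of consecutive repeats.
df['consecutive'] = df.groupby(run_id)['obs'].transform('size')
print(df)
```

Like the original, this assumes the rows are already ordered by site, parm and date.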

categorize numerical series with python

I'm figuring out how to assign a categorization from an increasing enumeration column. Here an example of my dataframe:
df = pd.DataFrame({'A':[1,1,1,1,1,1,2,2,3,3,3,3,3],'B':[1,2,3,12,13,14,1,2,5,6,7,8,50]})
This produces:
df
Out[9]:
A B
0 1 1
1 1 2
2 1 3
3 1 12
4 1 13
5 1 14
6 2 1
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 3 50
Column B holds an increasing numerical series, but sometimes the series is interrupted and either continues from a different number or starts again. My desired output is:
Out[11]:
A B C
0 1 1 1
1 1 2 1
2 1 3 1
3 1 12 2
4 1 13 2
5 1 14 2
6 2 1 3
7 2 2 3
8 3 5 4
9 3 6 4
10 3 7 4
11 3 8 4
12 3 50 5
I would appreciate your suggestions, because I cannot find an ingenious way to do it. Thanks

Is this what you need? A new group starts wherever the difference from the previous value of B is not 1:
df.B.diff().ne(1).cumsum()
Out[463]:
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
11 4
12 5
Name: B, dtype: int32
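
A minimal sketch of storing the result as column C follows. It adds one assumption that is not in the answer above: a change in column A also starts a new group, which matters only if B happens to continue by exactly 1 across an A boundary. The name new_run is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
                   'B': [1, 2, 3, 12, 13, 14, 1, 2, 5, 6, 7, 8, 50]})

# True where B does not follow its predecessor by exactly 1,
# or where A changes (the extra assumption).
new_run = df['B'].diff().ne(1) | df['A'].ne(df['A'].shift())

# The running count of run starts gives the group label.
df['C'] = new_run.cumsum()
print(df)
```

On this data the extra A condition does not change the result, because every change in A already coincides with a break in B.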
