I loaded the data without a header.
import numpy as np
import pandas as pd

train = pd.read_csv('caravan.train', delimiter='\t', header=None)
train.index = np.arange(1, len(train) + 1)
train
0 1 2 3 4 5 6 7 8 9
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
But the header started from 0, and I want to create a header starting with 1 instead of 0. How can I do this?
In your case, add 1 to the existing integer column labels:
df.columns = df.columns.astype(int)+1
df
Out[99]:
1 2 3 4 5 6 7 8 9 10
1 33 1 3 2 8 0 5 1 3 7
2 37 1 2 2 8 1 4 1 4 6
3 37 1 2 2 8 0 4 2 4 3
4 9 1 3 3 3 2 3 2 4 5
5 40 1 4 2 10 1 4 1 4 7
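Alternatively, if you prefer to set the labels at load time, you can relabel both axes right after read_csv. A minimal sketch using the same tab-delimited file as in the question:
import numpy as np
import pandas as pd

train = pd.read_csv('caravan.train', delimiter='\t', header=None)
train.index = np.arange(1, len(train) + 1)          # rows numbered from 1
train.columns = np.arange(1, train.shape[1] + 1)    # columns numbered from 1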
I have a data set (redacted sample below). My goal is linear regression, and my question is: have I created unintended results, due to how I structured the df, using concat and/or div?
For example, predicting:
(2nd time rating) minus (base time rating) from
the ratio of (percent #1) over (percent #2).
From the df below, that is ((4 wk nprs rating) - (base nprs rating))
with predictor: ((active modalities) / (passive modalities)).
I've created the dataframes below, in hopes of efficiency, and run OLS.
Thank you for your insight.
What I've tried:
import pandas as pd

ctr = pd.read_csv('file_path/CTrial Data.csv')    # CTR trial data
ctr = ctr.apply(pd.to_numeric, errors='coerce')   # convert everything to numeric; unparseable values become NaN
ctr = ctr.fillna(0)                               # replace NaN with 0; dropping rows results in too much loss
Now compute the difference in pain perception over several time periods.
# OSWESTRY Pain Scale (lower-back function); this scale runs 0-50
# Using the absolute value can exceed the maximum, e.g. 4 - (-4) = 8; not using abs results in negative values
ctr['Trial_4wk_diff'] = (ctr['osw_4wk'] - ctr['osw_base']).abs()    # difference at 4 weeks
ctr['Trial_12wk_diff'] = (ctr['osw_12wk'] - ctr['osw_base']).abs()  # difference at 12 weeks
ctr['Trial_1yr_diff'] = (ctr['osw_1yr'] - ctr['osw_base']).abs()    # difference at 1 year
Group the treatment modalities into "active" and "passive", then calculate the ratio. These ratings are specifically the perceived benefit of each modality.
#perceived benefit of active treatment modalities over perceived benefit of passive modalities
ctr['active'] = pd.concat((ctr['perceived_ben_aerobic_ex'],
                           ctr['perceived_ben_strength_ex'],
                           ctr['perceived_ben_rom_ex']), ignore_index=True)
# 0    NaN
ctr['passive'] = pd.concat((ctr['perceived_ben_meds'], ctr['perceived_ben_rest'],
                            ctr['perceived_ben_surgery'], ctr['perceived_ben_massage'],
                            ctr['perceived_ben_manip'], ctr['perceived_ben_traction'],
                            ctr['perceived_ben_work_restrict']), ignore_index=True)
# 0    NaN
ctr['mods'] = ctr['active'].div(ctr['passive'])   # ratio of active vs passive perceived benefit, via .div()
ctr['mods'] = add_constant(ctr['mods'].fillna(0))
#ctr['mods'].isna().sum()
#4
ctr['mods'] = ctr['mods'].fillna(0)   # these steps remove the NaN
#ctr['mods'].isna().sum()
#0
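Note that pd.concat(..., ignore_index=True) stacks the Series end to end (lengths 3×n and 7×n) rather than combining them row-wise, so the assignments above may not line up with the original rows; that is likely the source of the unintended results the question asks about. If the goal is one active and one passive score per participant, a row-wise mean may be closer to the intent. A minimal sketch, assuming the column names from the question:
active_cols = ['perceived_ben_aerobic_ex', 'perceived_ben_strength_ex', 'perceived_ben_rom_ex']
passive_cols = ['perceived_ben_meds', 'perceived_ben_rest', 'perceived_ben_surgery',
                'perceived_ben_massage', 'perceived_ben_manip', 'perceived_ben_traction',
                'perceived_ben_work_restrict']

# Average each participant's perceived-benefit ratings row-wise, then take the ratio.
ctr['active'] = ctr[active_cols].mean(axis=1)
ctr['passive'] = ctr[passive_cols].mean(axis=1)
ctr['mods'] = ctr['active'].div(ctr['passive']).fillna(0)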
Now running OLS.
#OSW 4 week difference with average ratio
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

X = ctr['mods']              # modality ratio (predictor)
y = ctr['nprs_4wk'].abs()    # pain scale
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Output:
OLS Regression Results
Dep. Variable: nprs_4wk R-squared: -0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: nan
Date: Thu, 26 Jan 2023 Prob (F-statistic): nan
Time: 11:14:45 Log-Likelihood: -260.12
No. Observations: 119 AIC: 522.2
Df Residuals: 118 BIC: 525.0
Df Model: 0
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
active_avg 0.7351 0.059 12.549 0.000 0.619 0.851
Omnibus: 12.434 Durbin-Watson: 2.044
Prob(Omnibus): 0.002 Jarque-Bera (JB): 14.152
Skew: 0.833 Prob(JB): 0.000845
Kurtosis: 2.718 Cond. No. 1.00
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> fig = sm.graphics.plot_fit(model, 0, ax=ax)
>>> ax.set_ylabel("NPRS 4 wk diff")
>>> ax.set_xlabel("ratio of perceived ben active vs passive")
>>> ax.set_title("4 wk pain scale vs perceived ben ratio")
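As an aside, the Df Model: 0 and nan F-statistic in the summary suggest the fit ended up without separate intercept and slope terms; assigning the output of add_constant back into the single ctr['mods'] column cannot produce a two-column design matrix. A minimal sketch of the fit with an explicit intercept, assuming ctr['mods'] holds the plain ratio (for example as built in the row-wise sketch above):
import statsmodels.api as sm

X = sm.add_constant(ctr['mods'])   # design matrix: intercept plus the active/passive ratio
y = ctr['nprs_4wk'].abs()          # pain scale at 4 weeks, as in the question
model = sm.OLS(y, X).fit()
print(model.summary())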
Data below, with original missing values:
pb_meds, pb_rest, pb_surgery, pb_massage, pb_manip, pb_traction, pb_aerobic_ex, pb_rom_ex, pb_strength_ex, pb_work_restrict, osw_base, osw_4wk, osw_12wk, osw_1yr
4, 4, 1, 4, 4, 3, 4, 4, 4, 4, 14, 7, 19, 5
2, 4, 1, 4, 4, 2, 1, 2, 2, 5, 35, 13, 0, 0
4, 4, 4, 3, 3, 3, 2, 2, 2, 3, 18, 19, 14, 18
1, 3, 1, 5, 5, 5, 5, 5, 5, 1, 13, 6, 8, 14
4, 5, 1, 5, 3, 1, 3, 5, 3, 4, 18, 16, 0, 0
3, 2, 1, 4, 4, 5, 4, 4, 4, 1, 11, 8, 8, 5
... (remaining sample rows omitted; the original paste lists one value per line in this column order, and some rows have missing values)
This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe
(4 answers)
Closed 9 months ago.
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
                   'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n, where n depends on the value of column a, as follows:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I can write it using a for loop, my data frame is huge and it's computationally costly. Is there any efficient way to generate such a column? Thanks.
Use groupby with cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
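Note that cumcount numbers the rows within each group starting at 0, which is why the answer chains .add(1). A quick check, assuming the df above:
print(df.groupby('a').cumcount().tolist()[:3])   # [0, 1, 2] for the first three rows of group a == 1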
I am trying to conduct a mixed model analysis but would like to include only individuals who have data at all available timepoints. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id': ids,
                   'timepoint': timepoint,
                   'outcome': outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to keep only the individuals in the id column who have all 6 timepoints, i.e. IDs 1, 2, and 4 (and drop all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, and then filter your dataframe accordingly with transform('nunique') and loc, keeping only the IDs that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
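If readability matters more than speed on a large frame, a roughly equivalent alternative (a sketch, assuming the same df and t as above) is groupby.filter, which keeps only the groups that satisfy a condition:
# Keep the ids whose group contains all t distinct timepoints.
res = df.groupby('id').filter(lambda g: g['timepoint'].nunique() == t)
print(res)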
df = pd.DataFrame({'site': [1,1,1,1,1,1,1,1,1,1], 'parm': [8,8,8,8,8,9,9,9,9,9],
                   'date': [1,2,3,4,5,1,2,3,4,5], 'obs': [1,1,2,3,3,3,5,5,6,6]})
Output
site parm date obs
0 1 8 1 1
1 1 8 2 1
2 1 8 3 2
3 1 8 4 3
4 1 8 5 3
5 1 9 1 3
6 1 9 2 5
7 1 9 3 5
8 1 9 4 6
9 1 9 5 6
I want to count repeating, sequential "obs" values within a "site" and "parm". I have this code which is close:
df['consecutive'] = df.parm.groupby((df.obs != df.obs.shift()).cumsum()).transform('size')
Output
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 3
4 1 8 5 3 3
5 1 9 1 3 3
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
It creates the new column with the count, but the gap is that when parm changes from 8 to 9, the first parm 9 row is included in the parm 8 count. The expected output is:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
You need to include site and parm in the groupby, as hinted in the question, so that a run of equal obs values cannot span a change in parm:
df['consecutive'] = (df.groupby([df.obs.ne(df.obs.shift()).cumsum(),
                                 'site', 'parm'])
                       ['obs'].transform('size'))
Output:
site parm date obs consecutive
0 1 8 1 1 2
1 1 8 2 1 2
2 1 8 3 2 1
3 1 8 4 3 2
4 1 8 5 3 2
5 1 9 1 3 1
6 1 9 2 5 2
7 1 9 3 5 2
8 1 9 4 6 2
9 1 9 5 6 2
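The same result can also be written with the run key named explicitly, which may make the intent (runs of equal obs values that never cross a site/parm boundary) easier to read; a sketch assuming the df above:
# A new run starts whenever obs changes from the previous row.
run_id = df['obs'].ne(df['obs'].shift()).cumsum()
# Grouping on site and parm as well ensures a run cannot span a parameter change.
df['consecutive'] = df.groupby(['site', 'parm', run_id])['obs'].transform('size')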
Suppose the following pandas dataframe
Wafer_Id v1 v2
0 0 9 6
1 0 7 8
2 0 1 5
3 1 6 6
4 1 0 8
5 1 5 0
6 2 8 8
7 2 2 6
8 2 3 5
9 3 5 1
10 3 5 6
11 3 9 8
I want to group it according to Wafer_Id and I would like to get something like:
w
Out[60]:
Wafer_Id v1_1 v1_2 v1_3 v2_1 v2_2 v2_3
0 0 9 7 1 6 ... ...
1 1 6 0 5 6
2 2 8 2 3 8
3 3 5 5 9 1
I think I can obtain the result with the pivot function, but I am not sure how to do it.
Possible solution
import numpy as np
import pandas as pd

oes = pd.DataFrame()
oes['Wafer_Id'] = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
oes['v1'] = np.random.randint(0, 10, 12)
oes['v2'] = np.random.randint(0, 10, 12)
oes['id'] = [0, 1, 2] * 4
oes
Out[74]:
Wafer_Id v1 v2 id
0 0 8 7 0
1 0 3 3 1
2 0 8 0 2
3 1 2 5 0
4 1 4 1 1
5 1 8 8 2
6 2 8 6 0
7 2 4 7 1
8 2 4 3 2
9 3 4 6 0
10 3 9 2 1
11 3 7 1 2
oes.pivot(index='Wafer_Id', columns='id')
Out[75]:
v1 v2
id 0 1 2 0 1 2
Wafer_Id
0 8 3 8 7 3 0
1 2 4 8 5 1 8
2 8 4 4 6 7 3
3 4 9 7 6 2 1
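To get the flat v1_1, v1_2, ... names from the desired output, one option (a sketch, assuming the measurement order within each wafer defines the id) is to build the id column with cumcount and then flatten the resulting MultiIndex columns:
import pandas as pd

df = pd.DataFrame({'Wafer_Id': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'v1': [9, 7, 1, 6, 0, 5, 8, 2, 3, 5, 5, 9],
                   'v2': [6, 8, 5, 6, 8, 0, 8, 6, 5, 1, 6, 8]})

# Number the measurements within each wafer (1, 2, 3) to use as pivot columns.
df['id'] = df.groupby('Wafer_Id').cumcount() + 1
wide = df.pivot(index='Wafer_Id', columns='id')

# Flatten the (value, id) MultiIndex into names like v1_1, v1_2, v2_1, ...
wide.columns = [f'{value}_{i}' for value, i in wide.columns]
wide = wide.reset_index()
print(wide)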