Assume I have a dataframe like this, for example:
0 1 2 3 4 5 6 7 8 9
0 8 9 2 1 6 2 6 8 6 3
1 1 1 8 3 1 6 3 6 3 9
2 1 4 3 5 9 3 5 9 2 3
3 4 6 3 8 4 3 1 5 1 1
4 1 8 5 3 9 6 1 7 2 2
5 6 6 7 9 1 8 2 3 2 8
6 8 3 6 9 9 5 8 4 7 7
7 8 3 3 8 7 1 4 9 7 2
8 7 6 1 4 8 1 6 9 6 6
9 3 3 2 4 8 1 8 1 1 8
10 7 7 5 7 1 4 1 8 8 6
11 6 3 2 7 6 5 7 4 8 7
I would like to group the rows into "blocks" of a given length and then flatten each block into a single row. For example, if the block length were 3, the result here would be:
0 1 2 3 4 5 6 7 8 9 10 ... 19 20 21 22 23 24 25 26 27 28 29
2 8 9 2 1 6 2 6 8 6 3 1 ... 9 1 4 3 5 9 3 5 9 2 3
5 4 6 3 8 4 3 1 5 1 1 1 ... 2 6 6 7 9 1 8 2 3 2 8
8 8 3 6 9 9 5 8 4 7 7 8 ... 2 7 6 1 4 8 1 6 9 6 6
11 3 3 2 4 8 1 8 1 1 8 7 ... 6 6 3 2 7 6 5 7 4 8 7
How to achieve this?
I think you need reshape:
block_len = 3  # number of rows per block
df = pd.DataFrame(df.values.reshape(-1, block_len * df.shape[1]))
print(df)
0 1 2 3 4 5 6 7 8 9 ... 20 21 22 23 24 25 26 27 \
0 8 9 2 1 6 2 6 8 6 3 ... 1 4 3 5 9 3 5 9
1 4 6 3 8 4 3 1 5 1 1 ... 6 6 7 9 1 8 2 3
2 8 3 6 9 9 5 8 4 7 7 ... 7 6 1 4 8 1 6 9
3 3 3 2 4 8 1 8 1 1 8 ... 6 3 2 7 6 5 7 4
28 29
0 2 3
1 2 8
2 6 6
3 8 7
[4 rows x 30 columns]
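If the row labels 2, 5, 8, 11 from the expected output should be preserved, a small variation of the same reshape can reuse the original index. A minimal sketch, assuming df is still the original 12-row frame and its length is an exact multiple of block_len:
block_len = 3
out = pd.DataFrame(
    df.values.reshape(-1, block_len * df.shape[1]),
    index=df.index[block_len - 1::block_len],  # keep the last original label of each block
)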
I found this solution; maybe someone comes up with a better one:
def toBlocks(df, blocklen):
    # concatenate the frame with shifted copies of itself, so each row carries the
    # previous blocklen-1 rows as extra columns (a sliding window, newest row first)
    shifted = [df.shift(periods=p) for p in range(blocklen)]
    return pd.concat(shifted, axis=1)[blocklen-1:]
I have a data set; a redacted sample is below. My goal is linear regression. My question is: have I created unintended results through the way I structured the df, using concat and/or div?
For example, predicting:
(2nd time rating) minus (base time rating) from
the ratio of (percent #1) over (percent #2).
From the df below, that is predicting:
(4 wk nprs rating) - (base nprs rating)
with the predictor: (active modalities) / (passive modalities)
I've created the dataframes, in hopes of efficiency, and run OLS, all shown below.
Thank you for your insight.
What I've tried:
import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

ctr = pd.read_csv('file_path/CTrial Data.csv')  # CTR trial
ctr = ctr.apply(pd.to_numeric, errors='coerce')  # convert everything to numeric, NaN where it can't
ctr = ctr.fillna(0)  # replace NaN with 0; dropping rows results in too much loss
Now compute the difference in pain perception over several time periods.
# OSWESTRY Pain Scale (lower back function)
# THIS scale is 0-50
# using the absolute value can lead to more than the true change, e.g. 4 - (-4) = 8; not using abs results in negative values
ctr['Trial_4wk_diff'] = (ctr['osw_4wk'] - ctr['osw_base']).abs() #calculating the difference
ctr['Trial_12wk_diff'] = (ctr['osw_12wk'] - ctr['osw_base']).abs()
ctr['Trial_1yr_diff'] = (ctr['osw_1yr'] - ctr['osw_base']).abs()
Grouping the treatment modalities into "active" and "passive" and then
calculating the ratio. These ratings are specifically the perceived
benefit of each modality.
# perceived benefit of active treatment modalities over perceived benefit of passive modalities
ctr['active'] = pd.concat((ctr['perceived_ben_aerobic_ex'],
                           ctr['perceived_ben_strength_ex'],
                           ctr['perceived_ben_rom_ex']), ignore_index=True)
# 0    NaN
ctr['passive'] = pd.concat((ctr['perceived_ben_meds'], ctr['perceived_ben_rest'],
                            ctr['perceived_ben_surgery'], ctr['perceived_ben_massage'],
                            ctr['perceived_ben_manip'], ctr['perceived_ben_traction'],
                            ctr['perceived_ben_work_restrict']), ignore_index=True)
# 0    NaN
ctr['mods'] = ctr['active'].div(ctr['passive'])  # ratio of active vs passive perceived benefit; here I simply used .div()
ctr['mods'] = add_constant(ctr['mods'].fillna(0))
# ctr['mods'].isna().sum()
# 4
ctr['mods'] = ctr['mods'].fillna(0)  # these steps remove the NaN
# ctr['mods'].isna().sum()
# 0
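For comparison, here is a sketch of a row-wise version of the same ratio, using a per-row mean of the active and passive columns instead of stacking them with concat. This is only an assumption about the intent, not part of the original code:
active_cols = ['perceived_ben_aerobic_ex', 'perceived_ben_strength_ex', 'perceived_ben_rom_ex']
passive_cols = ['perceived_ben_meds', 'perceived_ben_rest', 'perceived_ben_surgery',
                'perceived_ben_massage', 'perceived_ben_manip', 'perceived_ben_traction',
                'perceived_ben_work_restrict']
# average perceived benefit per participant, then the active/passive ratio for that same row
# (rows where every passive rating is 0 would produce inf here and need separate handling)
ctr['mods_rowwise'] = ctr[active_cols].mean(axis=1) / ctr[passive_cols].mean(axis=1)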
Now running OLS.
# OSW 4 week difference with average ratio
from statsmodels.tools.tools import add_constant

X = ctr['mods']  # modality ratio
y = ctr['nprs_4wk'].abs()  # pain scale
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Output:
OLS Regression Results
Dep. Variable: nprs_4wk R-squared: -0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: nan
Date: Thu, 26 Jan 2023 Prob (F-statistic): nan
Time: 11:14:45 Log-Likelihood: -260.12
No. Observations: 119 AIC: 522.2
Df Residuals: 118 BIC: 525.0
Df Model: 0
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
active_avg 0.7351 0.059 12.549 0.000 0.619 0.851
Omnibus: 12.434 Durbin-Watson: 2.044
Prob(Omnibus): 0.002 Jarque-Bera (JB): 14.152
Skew: 0.833 Prob(JB): 0.000845
Kurtosis: 2.718 Cond. No. 1.00
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> fig = sm.graphics.plot_fit(model, 0, ax=ax)
>>> ax.set_ylabel("NPRS 4 wk diff")
>>> ax.set_xlabel("ratio of perceived ben active vs passive")
>>> ax.set_title("4 wk pain scale vs perceived ben ratio")
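For reference, the usual statsmodels pattern keeps the intercept as a separate 'const' column in the design matrix rather than assigning add_constant back into a single column. A minimal sketch using the column names from above (whether this is the intended model is an assumption):
mods = ctr['active'].div(ctr['passive']).fillna(0)  # raw ratio with NaN replaced by 0
X = sm.add_constant(mods)                           # two columns: 'const' and the ratio
y = ctr['nprs_4wk'].abs()
model_with_intercept = sm.OLS(y, X).fit()
print(model_with_intercept.summary())               # with a constant, the summary reports centered R-squared and a defined F-statistic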
Data below, with original missing values. The first rows of the redacted sample:
pb_meds  pb_rest  pb_surgery  pb_massage  pb_manip  pb_traction  pb_aerobic_ex  pb_rom_ex  pb_strength_ex  pb_work_restrict  osw_base  osw_4wk  osw_12wk  osw_1yr
4        4        1           4           4         3            4              4          4               4                 14        7        19        5
2        4        1           4           4         2            1              2          2               5                 35        13       0         0
4        4        4           3           3         3            2              2          2               3                 18        19       14        18
1        3        1           5           5         5            5              5          5               1                 13        6        8         14
4        5        1           5           3         1            3              5          3               4                 18        16       0         0
3        2        1           4           4         5            4              4          4               1                 11        8        8         5
4        4        2           4           1         3            2              4          4               2                 10        3        3         0
4        4        3           3           2         2            2              4          5               3                 15        15       8         4
4        2        1           4           4         2            2              4          4               2                 15        9        4         6
4        4        1           4           3         3            3              4          4               2                 10        15       9         3
...
Hi, I have a DataFrame with multiple columns that I want to combine into one, alongside several other columns that should be duplicated for each. An example dataframe:
df = pd.DataFrame(np.random.randint(10, size=60).reshape(6, 10))
df.columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'y1', 'y2', 'y3', 'y4', 'y5']
x1 x2 x3 x4 x5 y1 y2 y3 y4 y5
0 2 6 9 4 3 8 6 1 0 7
1 1 4 8 7 3 0 5 7 3 1
2 6 7 4 8 1 5 7 7 8 5
3 6 3 4 8 0 8 7 2 3 8
4 8 5 6 1 6 3 2 1 1 4
5 1 3 7 5 1 6 5 3 8 5
I would like a nice way to produce the following DataFrame:
x1 x2 x3 x4 x5 y
0 2 6 9 4 3 8
1 1 4 8 7 3 0
2 6 7 4 8 1 5
3 6 3 4 8 0 8
4 8 5 6 1 6 3
5 1 3 7 5 1 6
6 2 6 9 4 3 6
7 1 4 8 7 3 5
8 6 7 4 8 1 7
9 6 3 4 8 0 7
10 8 5 6 1 6 2
11 1 3 7 5 1 5
12 2 6 9 4 3 1
13 1 4 8 7 3 7
14 6 7 4 8 1 7
15 6 3 4 8 0 2
16 8 5 6 1 6 1
17 1 3 7 5 1 3
18 2 6 9 4 3 0
19 1 4 8 7 3 3
20 6 7 4 8 1 8
21 6 3 4 8 0 3
22 8 5 6 1 6 1
23 1 3 7 5 1 8
24 2 6 9 4 3 7
25 1 4 8 7 3 1
26 6 7 4 8 1 5
27 6 3 4 8 0 8
28 8 5 6 1 6 4
29 1 3 7 5 1 5
Is there a nice way to produce this DataFrame with Pandas functions or is it more complicated?
Thanks
You can do this with df.melt().
df.melt(
    id_vars=['x1', 'x2', 'x3', 'x4', 'x5'],
    value_vars=['y1', 'y2', 'y3', 'y4', 'y5'],
    value_name='y'
).drop(columns='variable')
df.melt() also produces a column called variable, which records which column each value originally came from (y1, y2, etc.), so you drop it as shown above.
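If it helps to see the same reshape without melt, an equivalent construction with plain concat (a sketch that stacks one block per y column, in the same order as the melt output) would be:
x_cols = ['x1', 'x2', 'x3', 'x4', 'x5']
pd.concat(
    [df[x_cols].assign(y=df[c]) for c in ['y1', 'y2', 'y3', 'y4', 'y5']],
    ignore_index=True,
)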
Suppose I have the following dataframe
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,4,4],
'b': [3,4,3,7,5,9,4,2,5,6,7,8,4,2,4,5,8,0]})
a b
0 1 3
1 1 4
2 1 3
3 2 7
4 2 5
5 2 9
6 2 4
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 4 4
13 4 2
14 4 4
15 4 5
16 4 8
17 4 0
And I would like to make a new column c with values 1 to n where n depends on the value of column a as follow:
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
While I could write this with a for loop, my data frame is huge and that would be computationally costly. Is there an efficient way to generate such a column? Thanks.
Use GroupBy.cumcount:
df['c'] = df.groupby('a').cumcount().add(1)
print(df)
# Output
a b c
0 1 3 1
1 1 4 2
2 1 3 3
3 2 7 1
4 2 5 2
5 2 9 3
6 2 4 4
7 2 2 5
8 3 5 1
9 3 6 2
10 3 7 3
11 3 8 4
12 4 4 1
13 4 2 2
14 4 4 3
15 4 5 4
16 4 8 5
17 4 0 6
Is there a quick pythonic way to transform this table
index = pd.date_range('2000-1-1', periods=36, freq='M')
df = pd.DataFrame(np.random.randn(36,4), index=index, columns=list('ABCD'))
In[1]: df
Out[1]:
A B C D
2000-01-31 H 1.368795 0.106294 2.108814
2000-02-29 -1.713401 0.557224 0.115956 -0.851140
2000-03-31 -1.454967 -0.791855 -0.461738 -0.410948
2000-04-30 1.688731 -0.216432 -0.690103 -0.319443
2000-05-31 -1.103961 0.181510 -0.600383 -0.164744
2000-06-30 0.216871 -1.018599 0.731617 -0.721986
2000-07-31 0.621375 0.790072 0.967000 1.347533
2000-08-31 0.588970 -0.360169 0.904809 0.606771
...
into this table
2001 2000
12 11 10 9 8 7 6 5 4 3 2 1 12 11 10 9 8 7 6 5 4 3 2 1
A H
B
C
D
Please excuse the missing values; I added the "H" manually. I hope it is clear what I am looking for.
For an easier check, I've created a dataframe of the same shape but with integers as values.
The core of the solution is pandas.DataFrame.transpose, but you need to use index.year + index.month as a new index:
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=index, columns=list('ABCD'))
>>> df.set_index(keys=[df.index.year, df.index.month]).transpose()
2000 2001 2002
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
A 0 0 8 7 8 0 7 1 5 1 5 4 2 1 9 5 2 0 5 3 6 4 9 3 5 1 7 3 1 7 6 5 6 8 4 1
B 4 9 9 5 2 0 8 0 9 5 2 7 5 6 3 6 8 8 8 8 0 6 3 7 5 9 6 3 9 7 1 4 7 8 3 3
C 3 2 4 3 1 9 7 6 9 6 8 6 3 5 3 2 2 1 3 1 1 2 8 2 2 6 9 6 1 5 6 5 4 6 7 5
D 8 1 3 9 2 3 8 7 3 2 1 0 1 3 9 1 8 6 4 7 4 6 3 2 9 8 9 9 0 7 4 7 3 6 5 2
Of course, this will not work properly if you have more than one record per year+month. In that case you need to group your data first:
>>> i = pd.date_range('2000-1-1', periods=36, freq='W') # weekly index
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=i, columns=list('ABCD'))
>>> df.groupby(by=[df.index.year, df.index.month]).sum().transpose()
2000
1 2 3 4 5 6 7 8 9
A 12 13 15 23 9 21 21 31 7
B 33 24 19 30 15 19 20 7 4
C 20 24 26 24 15 18 29 17 4
D 23 29 14 30 19 12 12 11 5
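The desired layout above lists the most recent year and month first; if that ordering matters, one option (a small sketch, not part of the original answer) is to sort the columns in descending order after transposing:
>>> df.set_index(keys=[df.index.year, df.index.month]).transpose().sort_index(axis=1, ascending=False)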
for j in range(10):
    for i in range(10):
        print(j, end=" ")
My results are bunched together on one line, and I need to have 10 numbers per line. I can't use print("0123456789"). I have tried print(j,j,j,j,j,j,j,j,j) and I get the results I'm looking for, but I'm sure this isn't the proper way to write the code.
If print(j,j,j,j,j,j,j,j,j) works, then you simply need to add another print() after each pass of the inner loop:
for j in range(10):
    for i in range(10):
        print(j, end=" ")
    print()
Output:
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Or simply:
for j in range(10):
    print(" ".join(str(j) * 10))
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Why are you using a nested for loop when you can use a single for loop:
for i in range(10):
    print('{} '.format(i) * 10)
This is similar to Malik Brahimi's solution, except it doesn't put a space after the last digit on each line:
for i in range(10):
    print(' '.join([str(i)] * 10))
Output:
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
3 3 3 3 3 3 3 3 3 3
4 4 4 4 4 4 4 4 4 4
5 5 5 5 5 5 5 5 5 5
6 6 6 6 6 6 6 6 6 6
7 7 7 7 7 7 7 7 7 7
8 8 8 8 8 8 8 8 8 8
9 9 9 9 9 9 9 9 9 9
Just for fun, here's another way to do it with a single loop, this time using a format string with numbered fields.
fmt = ('{0} ' * 10)[:-1]
for i in range(10):
    print(fmt.format(i))
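One more single-loop variant, offered only as a sketch, unpacks a list of repeated values and relies on print's default space separator:
for i in range(10):
    print(*[i] * 10)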