How to add a repeated column using pandas - python
I am doing my homework and I have run into a problem. I have a large matrix whose first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2, and 3 respectively. The other two columns, V96 and V97, are just numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote the code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index into the result to get the means for each column:
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as a new column of the original DataFrame, so that each row gets the mean matching its Y002 code (1, 2, or 3). I tried this code, but it only produces NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone could help me with this, many thanks :)
PS: I hope this makes sense. In R, we can do the same thing using
data2$group1 = with(data2, tapply(V97,Y002,mean))[data2$Y002]
But how can we implement this in Python with pandas?
You can use .transform(). Your original assignment produced NaN because group1 is indexed by the Y002 levels (1, 2, 3) rather than by data2's row labels, so pandas aligned on the index and found no matches; .transform() broadcasts each group's statistic back onto the original index for you.
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
# transform broadcasts the per-group mean back onto the original index,
# so every row receives the mean of its own Y002 group
# (the string 'mean' is preferred over np.mean in current pandas)
df['V96_mean'] = df.groupby('Y002')['V96'].transform('mean')
df['V97_mean'] = df.groupby('Y002')['V97'].transform('mean')
df
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
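An equivalent approach, closer in spirit to the R tapply-then-index idiom from the question, is to compute the per-group means once and map them back through the key column. A minimal sketch, assuming the same df as above (the name means is just an illustrative variable):

# one mean per Y002 level, then look each row's level up in that Series
means = df.groupby('Y002')['V96'].mean()
df['V96_mean'] = df['Y002'].map(means)

This produces the same column as .transform('mean'); the difference is that map works from an explicit lookup table, which is handy if you also want to inspect or reuse the per-group means.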
Related
How to Group by the mean of specific columns in Python
In the dataframe below:

import pandas as pd
import numpy as np

df = {
    'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
    'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
    'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
    'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
    'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
    'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
    'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
    'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
    'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
    'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
    'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df, columns=['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()

I want to do the following:

Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2, ..., Grwth_Time7).
Export all the outputs as one xlsx file (refer to the figure below).

The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated).

My attempt:

# Select the columns -> AIC_TRX, series, Grwth_Time1, Grwth_Time2, ..., Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series', 'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
          'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]

# Below is where I need help: I want to group by 'series' and 'AIC_TRX'
# for all of Grwth_Time1 to Grwth_Time7
df1.groupby('series').Grwth_Time1.agg(['mean'])

Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and find the mean of each Grwth_Time:

df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
                                   'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6',
                                   'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")

Output:

AIC_TRX       1       2       3       4
series
1        150.78  208.07  146.87  229.28
2        162.34  217.76  182.54  244.73
4        188.53  229.48  189.57  269.91
8        197.69  139.51  199.97  249.19

AIC_TRX       1       2       3       4
series
1        250.78  308.07  346.87  329.28
2        262.34  317.70  382.54  347.73
4        288.53  329.81  369.59  369.91
8        297.69  339.15  399.97  349.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1        270.84  318.73  398.75  494.85
2        282.14  327.47  432.18  509.39
4        298.53  369.63  449.78  515.52
8        306.69  389.59  473.55  539.23

AIC_TRX       1       2       3       4
series
1         25.78   30.07   34.87   29.28
2         22.34   17.70   32.54   34.73
4         28.53   29.81   36.59   36.91
8         27.69   33.15   39.97   34.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1         27.84   18.73   38.75   13.85
2         28.14   27.47   24.18    9.39
4         29.53   36.63   24.78   15.52
8         30.69   38.59   21.55   39.23
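If the goal is one workbook with each Grwth_Time on its own sheet (one reading of the figure described in the question), a variant is to loop over the columns and write each unstacked table separately. This is only a sketch: the grwth_cols helper, the sheet layout, and the "output.xlsx" file name are assumptions, and an Excel engine such as openpyxl must be installed:

import pandas as pd

grwth_cols = [f'Grwth_Time{i}' for i in range(1, 8)]
grouped = df.groupby(['series', 'AIC_TRX'])[grwth_cols].mean()

# one sheet per Grwth_Time column; each sheet is a series x AIC_TRX table
with pd.ExcelWriter("output.xlsx") as writer:
    for col in grwth_cols:
        grouped[col].unstack().to_excel(writer, sheet_name=col)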
Just use the apply method to average within each series/AIC_TRX group:

result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)

Note that axis=1 averages across all the columns of each row (including diff), which is different from the per-column group means computed in the other answer; use this only if a single row-wise average is really what you want.

Result:

series  AIC_TRX
1       1        0     120.738
        2        4     156.281
        3        8     170.285
        4        12    196.270
2       1        1     122.358
        2        5     152.758
        3        9     184.494
        4        13    205.175
4       1        2     135.471
        2        6     171.968
        3        10    187.825
        4        14    214.907
8       1        3     142.183
        2        7     162.849
        3        11    196.851
        4        15    216.455
dtype: float64
Converting a time format to seconds in a pandas dataframe
I have a df with time data and I would like to transform these data to seconds (see example below).

   Compression_level  Size (M) Real time (s) User time (s) Sys time (s)
0                  0       265      0:19.938      0:24.649      0:3.062
1                  1        76      0:17.910      0:25.929      0:3.098
2                  2        74      1:02.619      0:27.724      0:3.014
3                  3        73      0:20.607      0:27.937      0:3.193
4                  4        67      0:19.598      0:28.853      0:2.925
5                  5        67      0:21.032      0:30.119      0:3.206
6                  6        66      0:27.013      0:31.462      0:3.106
7                  7        65      0:27.337      0:36.226      0:3.060
8                  8        64      0:37.651      0:47.246      0:2.933
9                  9        64      0:59.241       1:8.333      0:3.027

This is the output I would like to obtain:

df["Real time (s)"]
0    19.938
1    17.910
2    62.619
...

I have some code that converts a single value, but I do not know how to apply it across the data frame:

x = time.strptime("00:01:00", "%H:%M:%S")
datetime.timedelta(hours=x.tm_hour, minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend '00:' (for zero hours) with radd, pass the result to to_timedelta, and then use Series.dt.total_seconds:

df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)

   Compression_level  Size (M)  Real time (s) User time (s) Sys time (s)
0                  0       265         19.938     0:24.649     0:3.062
1                  1        76         17.910     0:25.929     0:3.098
2                  2        74         62.619     0:27.724     0:3.014
3                  3        73         20.607     0:27.937     0:3.193
4                  4        67         19.598     0:28.853     0:2.925
5                  5        67         21.032     0:30.119     0:3.206
6                  6        66         27.013     0:31.462     0:3.106
7                  7        65         27.337     0:36.226     0:3.060
8                  8        64         37.651     0:47.246     0:2.933
9                  9        64         59.241      1:8.333     0:3.027

Solution for processing multiple columns:

def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()

cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)

   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265         19.938         24.649        3.062
1                  1        76         17.910         25.929        3.098
2                  2        74         62.619         27.724        3.014
3                  3        73         20.607         27.937        3.193
4                  4        67         19.598         28.853        2.925
5                  5        67         21.032         30.119        3.206
6                  6        66         27.013         31.462        3.106
7                  7        65         27.337         36.226        3.060
8                  8        64         37.651         47.246        2.933
9                  9        64         59.241         68.333        3.027
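If you would rather stay close to the strptime idea from the question, the same conversion can also be done by splitting each value manually. A minimal sketch, assuming every value has the form minutes:seconds.fraction (the to_seconds name is just illustrative):

def to_seconds(value):
    # "1:02.619" -> 62.619
    minutes, seconds = value.split(':')
    return int(minutes) * 60 + float(seconds)

df["Real time (s)"] = df["Real time (s)"].apply(to_seconds)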
How can I extract only numbers from this column?
Suppose you have a column in Excel with values like this... there are only 5500 numbers present, but it shows length 5602, which means 102 strings are present:

4         SELECTIO
6             N NO
14           37001
26           37002
38           37003
47           37004
60           37005
73           37006
82           37007
92           37008
105          37009
119          37010
132          37011
143          37012
157          37013
168          37014
184          37015
196          37016
207          37017
220          37018
236          37019
253          37020
267          37021
280          37022
287        Krishan
290          37023
300          37024
316          37025
337          37026
365          37027
            ...
74141        42471
74154        42472
74169        42473
74184        42474
74200        42475
74216        42476
74233        42477
74242        42478
74256        42479
74271        42480
74290        42481
74309        42482
74323        42483
74336        42484
74350        42485
74365        42486
74378        42487
74389        42488
74398        42489
74413        42490
74430        42491
74446        42492
74459        42493
74474        42494
74491        42495
74504        42496
74516        42497
74530        42498
74544        42499
74558        42500
Name: Selection No., Length: 5602, dtype: object

and I want to get only the numeric values, like this, in Python using pandas:

37001
37002
37003
37004
37005

How can I do this? I have attached my code:

def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)

and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({'SELECTIO': ['N NO', 37002, 37003, 'Krishan', 37004, 'singh', 37005],
                   'some_col': [4, 6, 14, 26, 38, 47, 60]})
df

  SELECTIO  some_col
0     N NO         4
1    37002         6
2    37003        14
3  Krishan        26
4    37004        38
5    singh        47
6    37005        60

>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]

  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60

Or, as another approach, import numbers and use a lambda:

import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]

  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60

Note: there is a problem with how you are extracting the column. You are using ['Selection No.'], but the name actually contains a space at the end, so it should be ['Selection No. ']; that is the reason you are getting a KeyError when executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): - it tries to find the column value sle IN a match object, which "always has a boolean value of True". But re.match returns None when there is no match, and None is not iterable, which is exactly where your "argument of type 'NoneType' is not iterable" error comes from.

I would suggest proceeding with the pd.Series.str.isnumeric function:

In [544]: df
Out[544]:
  Selection No.
0         37001
1         37002
2         37003
3         asnsh
4         37004
5         singh
6         37005

In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)

In [546]: df
Out[546]:
  Selection No.  Status
0         37001       1
1         37002       1
2         37003       1
3         asnsh       0
4         37004       1
5         singh       0
6         37005       1

If a strict regex pattern is required, use the pd.Series.str.contains function:

df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
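Since the asker ultimately wants only the numeric values rather than a status flag, one further option is pd.to_numeric with errors='coerce', which turns non-numbers into NaN so they can be dropped. A minimal sketch, assuming the asker's select DataFrame and the trailing-space column name 'Selection No. ' noted above:

import pandas as pd

# non-numeric entries become NaN and are dropped; the rest become integers
numeric_only = pd.to_numeric(select['Selection No. '], errors='coerce').dropna().astype(int)
print(numeric_only.head())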
Binning a data set using Pandas
Given a csv file of...

neg,,,,,,,
SAMPLE 1,,SAMPLE 2,,SAMPLE 3,,SAMPLE 4,
50.0261,2.17E+02,50.0224,3.31E+02,50.0007,5.38E+02,50.0199,2.39E+02
50.1057,2.65E+02,50.0435,3.92E+02,50.0657,5.52E+02,50.0465,3.37E+02
50.1514,2.90E+02,50.0781,3.88E+02,50.1115,5.75E+02,50.0584,2.58E+02
50.166,3.85E+02,50.1245,4.25E+02,50.1258,5.11E+02,50.0765,4.47E+02
50.1831,2.55E+02,50.1748,3.71E+02,50.1411,6.21E+02,50.1246,1.43E+02
50.2023,3.45E+02,50.2161,2.59E+02,50.1671,5.56E+02,50.1866,3.77E+02
50.223,4.02E+02,50.2381,4.33E+02,50.1968,6.31E+02,50.2276,3.41E+02
50.2631,1.89E+02,50.2826,4.63E+02,50.211,3.92E+02,50.2717,4.71E+02
50.2922,2.72E+02,50.3593,4.52E+02,50.2279,5.92E+02,50.376,3.09E+02
50.319,2.46E+02,50.4019,4.15E+02,50.2929,5.60E+02,50.3979,2.56E+02
50.3523,3.57E+02,50.423,3.31E+02,50.3659,4.84E+02,50.4237,3.28E+02
50.3968,4.67E+02,50.4402,1.76E+02,50.437,1.89E+02,50.4504,2.71E+02
50.4431,1.88E+02,50.479,4.85E+02,50.5137,6.63E+02,50.5078,2.54E+02
50.481,3.63E+02,50.5448,3.51E+02,50.5401,5.11E+02,50.5436,2.69E+02
50.506,3.73E+02,50.5872,4.03E+02,50.5593,6.56E+02,50.555,3.06E+02
50.5379,3.00E+02,50.6076,2.96E+02,50.6034,5.02E+02,50.6059,2.83E+02
50.5905,2.38E+02,50.6341,2.67E+02,50.6579,6.37E+02,50.6484,1.99E+02
50.6564,1.30E+02,50.662,3.53E+02,50.6888,7.37E+02,50.7945,4.84E+02
50.7428,2.38E+02,50.6952,4.21E+02,50.7132,6.71E+02,50.8044,4.41E+02
50.8052,3.67E+02,50.7397,1.99E+02,50.7421,6.29E+02,50.8213,1.69E+02
50.8459,2.80E+02,50.7685,3.73E+02,50.7872,5.30E+02,50.8401,3.88E+02
50.9021,3.56E+02,50.7757,4.54E+02,50.8251,4.13E+02,50.8472,3.61E+02
50.9425,3.89E+02,50.8027,7.20E+02,50.8418,5.73E+02,50.8893,1.18E+02
51.0117,2.29E+02,50.8206,2.93E+02,50.8775,4.34E+02,50.9285,2.64E+02
51.0244,5.19E+02,50.8364,4.80E+02,50.9101,4.25E+02,50.9591,1.64E+02
51.0319,3.62E+02,50.8619,2.90E+02,50.9222,5.11E+02,51.0034,2.70E+02
51.0439,4.24E+02,50.9098,3.22E+02,50.9675,4.33E+02,51.0577,2.88E+02
51.0961,3.59E+02,50.969,3.87E+02,51.0123,6.03E+02,51.0712,3.18E+02
51.1429,2.49E+02,51.0009,2.42E+02,51.0266,7.30E+02,51.1015,1.84E+02
51.1597,2.71E+02,51.0262,1.32E+02,51.0554,3.69E+02,51.1291,3.71E+02
51.177,2.84E+02,51.0778,1.58E+02,51.1113,4.50E+02,51.1378,3.54E+02
51.1924,2.00E+02,51.1313,4.07E+02,51.1464,3.86E+02,51.1871,1.55E+02
51.2055,2.25E+02,51.1844,2.08E+02,51.1826,7.06E+02,51.2511,2.05E+02
51.2302,3.81E+02,51.2197,5.49E+02,51.2284,7.00E+02,51.3036,2.60E+02
51.264,2.16E+02,51.2306,3.76E+02,51.271,3.83E+02,51.3432,1.99E+02
51.2919,2.29E+02,51.2468,2.87E+02,51.308,3.89E+02,51.3775,2.45E+02
51.3338,3.67E+02,51.2739,5.56E+02,51.3394,5.17E+02,51.3977,3.86E+02
51.3743,2.57E+02,51.3228,3.18E+02,51.3619,6.03E+02,51.4151,3.37E+02
51.3906,3.78E+02,51.3685,2.33E+02,51.3844,4.44E+02,51.4254,2.72E+02
51.4112,3.29E+02,51.3912,5.03E+02,51.4179,5.68E+02,51.4426,3.17E+02
51.4423,1.86E+02,51.4165,2.68E+02,51.4584,5.10E+02,51.4834,3.87E+02
51.537,3.48E+02,51.4645,3.76E+02,51.5179,5.75E+02,51.544,4.37E+02
51.637,4.51E+02,51.5078,2.76E+02,51.569,4.73E+02,51.5554,4.52E+02
51.665,2.27E+02,51.5388,2.51E+02,51.5894,4.57E+02,51.5958,1.96E+02
51.6925,5.60E+02,51.5486,2.79E+02,51.614,4.88E+02,51.6329,5.40E+02
51.7409,4.19E+02,51.5584,2.53E+02,51.6458,5.72E+02,51.6477,3.23E+02
51.7851,4.29E+02,51.5961,2.72E+02,51.7076,4.36E+02,51.6577,2.70E+02
51.8176,3.11E+02,51.6608,2.04E+02,51.776,5.59E+02,51.6699,3.89E+02
51.8764,3.94E+02,51.7093,5.14E+02,51.8157,6.66E+02,51.6788,2.83E+02
51.9135,3.26E+02,51.7396,1.88E+02,51.8514,4.26E+02,51.7201,3.91E+02
51.9592,2.66E+02,51.7931,2.72E+02,51.8791,5.61E+02,51.7546,3.41E+02
51.9954,2.97E+02,51.8428,5.96E+02,51.9129,5.14E+02,51.7646,2.27E+02
52.0751,2.24E+02,51.8923,3.94E+02,51.959,5.18E+02,51.7801,1.43E+02
52.1456,3.26E+02,51.9177,2.82E+02,52.0116,4.21E+02,51.8022,2.27E+02
52.1846,3.42E+02,51.9265,3.21E+02,52.0848,5.10E+02,51.83,2.66E+02
52.2284,2.66E+02,51.9413,3.56E+02,52.1412,6.20E+02,51.8698,1.74E+02
52.2666,5.32E+02,51.9616,2.19E+02,52.1722,5.72E+02,51.9084,2.89E+02
52.2936,4.24E+02,51.9845,1.53E+02,52.1821,5.18E+02,51.937,1.69E+02
52.3256,3.69E+02,52.0051,3.53E+02,52.2473,5.51E+02,51.9641,3.31E+02
52.3566,2.50E+02,52.0299,2.87E+02,52.3103,4.12E+02,52.0292,2.63E+02
52.4192,3.08E+02,52.0603,3.15E+02,52.35,8.76E+02,52.0633,3.94E+02
52.4757,2.99E+02,52.0988,3.45E+02,52.3807,6.95E+02,52.0797,2.88E+02
52.498,2.37E+02,52.1176,3.63E+02,52.4234,4.89E+02,52.1073,2.97E+02
52.57,2.58E+02,52.1698,3.11E+02,52.4451,4.54E+02,52.1546,3.41E+02
52.6178,4.29E+02,52.2352,3.96E+02,52.4627,5.38E+02,52.2219,3.68E+02

How can one split the samples using overlapping bins of 0.25 m/z, where the first column of each pair (SAMPLE n,,) contains an m/z value and the second contains the weight?

To load the file into a Pandas DataFrame I currently do:

import csv, pandas as pd

def load_raw_data():
    raw_data = []
    with open("negsmaller.csv", "rb") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        next(reader)
        for row in reader:
            raw_data.append(row)
    raw_data = pd.DataFrame(raw_data)
    return raw_data.T

if __name__ == '__main__':
    raw_data = load_raw_data()
    print raw_data

Which returns

          0         1         2         3        4         5         6
0  SAMPLE 1   50.0261   50.1057   50.1514   50.166   50.1831   50.2023
1            2.17E+02  2.65E+02  2.90E+02  3.85E+02  2.55E+02  3.45E+02
2  SAMPLE 2   50.0224   50.0435   50.0781  50.1245   50.1748   50.2161
3            3.31E+02  3.92E+02  3.88E+02  4.25E+02  3.71E+02  2.59E+02
4  SAMPLE 3   50.0007   50.0657   50.1115  50.1258   50.1411   50.1671
5            5.38E+02  5.52E+02  5.75E+02  5.11E+02  6.21E+02  5.56E+02
6  SAMPLE 4   50.0199   50.0465   50.0584  50.0765   50.1246   50.1866
7            2.39E+02  3.37E+02  2.58E+02  4.47E+02  1.43E+02  3.77E+02

          7         8         9  ...        56        57        58
0    50.223   50.2631   50.2922  ...   52.2284   52.2666   52.2936
1  4.02E+02  1.89E+02  2.72E+02  ...  2.66E+02  5.32E+02  4.24E+02
2   50.2381   50.2826   50.3593  ...   51.9413   51.9616   51.9845
3  4.33E+02  4.63E+02  4.52E+02  ...  3.56E+02  2.19E+02  1.53E+02
4   50.1968    50.211   50.2279  ...   52.1412   52.1722   52.1821
5  6.31E+02  3.92E+02  5.92E+02  ...  6.20E+02  5.72E+02  5.18E+02
6   50.2276   50.2717    50.376  ...   51.8698   51.9084    51.937
7  3.41E+02  4.71E+02  3.09E+02  ...  1.74E+02  2.89E+02  1.69E+02

         59        60        61        62        63        64        65
0   52.3256   52.3566   52.4192   52.4757    52.498     52.57   52.6178
1  3.69E+02  2.50E+02  3.08E+02  2.99E+02  2.37E+02  2.58E+02  4.29E+02
2   52.0051   52.0299   52.0603   52.0988   52.1176   52.1698   52.2352
3  3.53E+02  2.87E+02  3.15E+02  3.45E+02  3.63E+02  3.11E+02  3.96E+02
4   52.2473   52.3103     52.35   52.3807   52.4234   52.4451   52.4627
5  5.51E+02  4.12E+02  8.76E+02  6.95E+02  4.89E+02  4.54E+02  5.38E+02
6   51.9641   52.0292   52.0633   52.0797   52.1073   52.1546   52.2219
7  3.31E+02  2.63E+02  3.94E+02  2.88E+02  2.97E+02  3.41E+02  3.68E+02

[8 rows x 66 columns]

Process finished with exit code 0

My desired output: to take the overlapping 0.25 bins and then take the average of the column next to it and have it as one. So,

0.01 3
0.10 4
0.24 2

would become

.25 3
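No answer is included above, so the following is only a minimal sketch of one possible approach, not an accepted solution. It reads the file with pandas directly (Python 3 idioms rather than the Python 2 csv loop in the question), stacks the (m/z, weight) pairs into long form, and bins each m/z value on two staggered 0.25-wide grids to get overlapping bins; the column names, the 0.125 stagger offset, and the output shape are all assumptions:

import pandas as pd

# skip the 'neg' line; the next row holds the SAMPLE headers
raw = pd.read_csv("negsmaller.csv", skiprows=1, header=0)

# every even column is m/z, every odd column is the matching weight
pairs = []
for i in range(0, raw.shape[1] - 1, 2):
    sample = pd.DataFrame({
        'sample': raw.columns[i],
        'mz': pd.to_numeric(raw.iloc[:, i], errors='coerce'),
        'weight': pd.to_numeric(raw.iloc[:, i + 1], errors='coerce'),
    })
    pairs.append(sample.dropna())
long_df = pd.concat(pairs, ignore_index=True)

# two staggered 0.25-wide grids give overlapping bins;
# average the weights that fall into each bin
for offset in (0.0, 0.125):
    binned = (long_df.assign(bin=((long_df['mz'] - offset) // 0.25) * 0.25 + offset)
                     .groupby(['sample', 'bin'])['weight'].mean())
    print(binned.head())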
Pandas appending Series to DataFrame to write to a file
I have a list of DataFrames that I want to compute the mean on:

~ pieces[1].head()
    Sample Label  C_RUNTIMEN  N_TQ  N_TR  ...  N_GEAR1  N_GEAR2  N_GEAR3
301       manual   82.150833     7    69  ...    3.615    1.952    1.241
302       manual   82.150833     7    69  ...    3.615    1.952    1.241
303       manual   82.150833     7    69  ...    3.615    1.952    1.241
304       manual   82.150833     7    69  ...    3.615    1.952    1.241
305       manual   82.150833     7    69  ...    3.615    1.952    1.241

So I am looping through them:

pieces = np.array_split(df, size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp, ignore_index=True)

Unfortunately the output is sorted (the column names are alphabetical in the output) and I want to keep the original column order (as seen up top).

~ output.head()
    C_ABSHUM  C_ACCFUELGALN       C_AFR      C_AFRO  C_FRAIRWS  C_GEARRATIO
0  44.578937      66.183858   14.466816   14.113321  18.831117     6.677792
1  34.042593      66.231229   14.320409   14.113321  22.368983     6.677792
2  34.497194      66.309320   14.210066   14.113321  25.353414     6.677792
3  43.430931      66.376632   14.314854   14.113321  28.462130     6.677792
4  44.419204      66.516515   14.314653   14.113321  32.244107     6.677792

I have tried variations of concat etc. with no success. Is there a different way to think about this?
My recommendation would be to concat the list of dataframes using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df has a MultiIndex and behaves like a standard data frame; only the indexing and group-by are a little different:

x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index=list('ind')))

Now x contains a list of data frames of the shape

   a  b  c
i  2  3  4
n  2  3  4
d  2  3  4

And with

multi_df = pd.concat(x, keys=range(len(x)))
result = multi_df.groupby(level=[0]).apply(np.mean)

we get a data frame that looks like

    a   b   c
0   1   2   3
1   2   3   4
2   3   4   5
3   4   5   6
4   5   6   7
5   6   7   8
6   7   8   9
7   8   9  10
8   9  10  11
9  10  11  12

You can then just call result.to_csv('filepath') to write that out.
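A more direct fix for the column-ordering problem in the question is to build each group mean as a one-row DataFrame instead of appending a Series, since appending a Series row by row is what triggers the alphabetical sort (and DataFrame.append was removed in pandas 2.0 anyway). A minimal sketch, assuming pieces as defined in the question and an assumed 'output.csv' path:

import pandas as pd

# piece.mean() is a Series; to_frame().T turns it into a one-row
# DataFrame that keeps the original column order
output = pd.concat([piece.mean().to_frame().T for piece in pieces],
                   ignore_index=True)
output.to_csv('output.csv', index=False)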