How to add a repeated column using pandas - python
I am doing my homework and I have run into a problem. I have a large matrix whose first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2, and 3 respectively. The other two columns, V96 and V97, are just numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote the code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index into the result to get the means for each column:
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as a new column of the original DataFrame, so that each row gets the mean matching its Y002 code (1, 2, or 3). I tried this code, but it only produces NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
Hope someone could help me with this, many thanks :)
PS: I hope this makes sense. In R, we can do the same thing using
data2$group1 = with(data2, tapply(V97,Y002,mean))[data2$Y002]
But how can we implement this in Python with pandas?
You can use .transform(). Your original assignment produced NaN because group1 is indexed by the Y002 levels (1, 2, 3) rather than by data2's row labels, so pandas aligned on the index and found no matches; .transform() broadcasts each group's statistic back onto the original index for you.
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
# transform broadcasts the per-group mean back onto the original index,
# so every row receives the mean of its own Y002 group
# (the string 'mean' is preferred over np.mean in current pandas)
df['V96_mean'] = df.groupby('Y002')['V96'].transform('mean')
df['V97_mean'] = df.groupby('Y002')['V97'].transform('mean')
df
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
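An equivalent approach, closer in spirit to the R tapply-then-index idiom from the question, is to compute the per-group means once and map them back through the key column. A minimal sketch, assuming the same df as above (the name means is just an illustrative variable):

# one mean per Y002 level, then look each row's level up in that Series
means = df.groupby('Y002')['V96'].mean()
df['V96_mean'] = df['Y002'].map(means)

This produces the same column as .transform('mean'); the difference is that map works from an explicit lookup table, which is handy if you also want to inspect or reuse the per-group means.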
Related
How to Group by the mean of specific columns in Python
In the dataframe below:

import pandas as pd
import numpy as np

df = {
    'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
    'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
    'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
    'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
    'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
    'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
    'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
    'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
    'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
    'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
    'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
    'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
    'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df, columns=['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()

I want to do the following:

Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2, ..., Grwth_Time7).
Export all the outputs as one xlsx file (refer to the figure below).

The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated).

My attempt:

# Select the columns -> AIC_TRX, series, Grwth_Time1, Grwth_Time2, ..., Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series', 'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
          'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]

# Below is where I need help: I want to group by 'series' and 'AIC_TRX'
# for all of Grwth_Time1 to Grwth_Time7
df1.groupby('series').Grwth_Time1.agg(['mean'])

Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and find the mean of each Grwth_Time:

df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
                                   'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6',
                                   'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")

Output:

AIC_TRX       1       2       3       4
series
1        150.78  208.07  146.87  229.28
2        162.34  217.76  182.54  244.73
4        188.53  229.48  189.57  269.91
8        197.69  139.51  199.97  249.19

AIC_TRX       1       2       3       4
series
1        250.78  308.07  346.87  329.28
2        262.34  317.70  382.54  347.73
4        288.53  329.81  369.59  369.91
8        297.69  339.15  399.97  349.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1        270.84  318.73  398.75  494.85
2        282.14  327.47  432.18  509.39
4        298.53  369.63  449.78  515.52
8        306.69  389.59  473.55  539.23

AIC_TRX       1       2       3       4
series
1         25.78   30.07   34.87   29.28
2         22.34   17.70   32.54   34.73
4         28.53   29.81   36.59   36.91
8         27.69   33.15   39.97   34.12

AIC_TRX       1       2       3       4
series
1        240.18  338.07  365.87  429.08
2        232.14  307.74  392.48  448.39
4        258.53  359.16  399.97  465.15
8        276.69  339.25  410.75  469.33

AIC_TRX       1       2       3       4
series
1         27.84   18.73   38.75   13.85
2         28.14   27.47   24.18    9.39
4         29.53   36.63   24.78   15.52
8         30.69   38.59   21.55   39.23
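If the goal is one workbook with each Grwth_Time on its own sheet (one reading of the figure described in the question), a variant is to loop over the columns and write each unstacked table separately. This is only a sketch: the grwth_cols helper, the sheet layout, and the "output.xlsx" file name are assumptions, and an Excel engine such as openpyxl must be installed:

import pandas as pd

grwth_cols = [f'Grwth_Time{i}' for i in range(1, 8)]
grouped = df.groupby(['series', 'AIC_TRX'])[grwth_cols].mean()

# one sheet per Grwth_Time column; each sheet is a series x AIC_TRX table
with pd.ExcelWriter("output.xlsx") as writer:
    for col in grwth_cols:
        grouped[col].unstack().to_excel(writer, sheet_name=col)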
Just use the apply method to average within each series/AIC_TRX group:

result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)

Note that axis=1 averages across all the columns of each row (including diff), which is different from the per-column group means computed in the other answer; use this only if a single row-wise average is really what you want.

Result:

series  AIC_TRX
1       1        0     120.738
        2        4     156.281
        3        8     170.285
        4        12    196.270
2       1        1     122.358
        2        5     152.758
        3        9     184.494
        4        13    205.175
4       1        2     135.471
        2        6     171.968
        3        10    187.825
        4        14    214.907
8       1        3     142.183
        2        7     162.849
        3        11    196.851
        4        15    216.455
dtype: float64
Converting a time format to seconds in a pandas dataframe
I have a df with time data and I would like to transform these data to seconds (see example below).

   Compression_level  Size (M) Real time (s) User time (s) Sys time (s)
0                  0       265      0:19.938      0:24.649      0:3.062
1                  1        76      0:17.910      0:25.929      0:3.098
2                  2        74      1:02.619      0:27.724      0:3.014
3                  3        73      0:20.607      0:27.937      0:3.193
4                  4        67      0:19.598      0:28.853      0:2.925
5                  5        67      0:21.032      0:30.119      0:3.206
6                  6        66      0:27.013      0:31.462      0:3.106
7                  7        65      0:27.337      0:36.226      0:3.060
8                  8        64      0:37.651      0:47.246      0:2.933
9                  9        64      0:59.241       1:8.333      0:3.027

This is the output I would like to obtain:

df["Real time (s)"]
0    19.938
1    17.910
2    62.619
...

I have some code that converts a single value, but I do not know how to apply it across the data frame:

x = time.strptime("00:01:00", "%H:%M:%S")
datetime.timedelta(hours=x.tm_hour, minutes=x.tm_min, seconds=x.tm_sec).total_seconds()
Prepend '00:' (for zero hours) with radd, pass the result to to_timedelta, and then use Series.dt.total_seconds:

df["Real time (s)"] = pd.to_timedelta(df["Real time (s)"].radd('00:')).dt.total_seconds()
print (df)

   Compression_level  Size (M)  Real time (s) User time (s) Sys time (s)
0                  0       265         19.938     0:24.649     0:3.062
1                  1        76         17.910     0:25.929     0:3.098
2                  2        74         62.619     0:27.724     0:3.014
3                  3        73         20.607     0:27.937     0:3.193
4                  4        67         19.598     0:28.853     0:2.925
5                  5        67         21.032     0:30.119     0:3.206
6                  6        66         27.013     0:31.462     0:3.106
7                  7        65         27.337     0:36.226     0:3.060
8                  8        64         37.651     0:47.246     0:2.933
9                  9        64         59.241      1:8.333     0:3.027

Solution for processing multiple columns:

def to_td(x):
    return pd.to_timedelta(x.radd('00:')).dt.total_seconds()

cols = ["Real time (s)", "User time (s)", "Sys time (s)"]
df[cols] = df[cols].apply(to_td)
print (df)

   Compression_level  Size (M)  Real time (s)  User time (s)  Sys time (s)
0                  0       265         19.938         24.649        3.062
1                  1        76         17.910         25.929        3.098
2                  2        74         62.619         27.724        3.014
3                  3        73         20.607         27.937        3.193
4                  4        67         19.598         28.853        2.925
5                  5        67         21.032         30.119        3.206
6                  6        66         27.013         31.462        3.106
7                  7        65         27.337         36.226        3.060
8                  8        64         37.651         47.246        2.933
9                  9        64         59.241         68.333        3.027
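If you would rather stay close to the strptime idea from the question, the same conversion can also be done by splitting each value manually. A minimal sketch, assuming every value has the form minutes:seconds.fraction (the to_seconds name is just illustrative):

def to_seconds(value):
    # "1:02.619" -> 62.619
    minutes, seconds = value.split(':')
    return int(minutes) * 60 + float(seconds)

df["Real time (s)"] = df["Real time (s)"].apply(to_seconds)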
How can I extract only numbers from this column?
Suppose you have a column in Excel with values like this... there are only 5500 numbers present, but it shows length 5602, which means 102 strings are present:

4         SELECTIO
6             N NO
14           37001
26           37002
38           37003
47           37004
60           37005
73           37006
82           37007
92           37008
105          37009
119          37010
132          37011
143          37012
157          37013
168          37014
184          37015
196          37016
207          37017
220          37018
236          37019
253          37020
267          37021
280          37022
287        Krishan
290          37023
300          37024
316          37025
337          37026
365          37027
            ...
74141        42471
74154        42472
74169        42473
74184        42474
74200        42475
74216        42476
74233        42477
74242        42478
74256        42479
74271        42480
74290        42481
74309        42482
74323        42483
74336        42484
74350        42485
74365        42486
74378        42487
74389        42488
74398        42489
74413        42490
74430        42491
74446        42492
74459        42493
74474        42494
74491        42495
74504        42496
74516        42497
74530        42498
74544        42499
74558        42500
Name: Selection No., Length: 5602, dtype: object

and I want to get only the numeric values, like this, in Python using pandas:

37001
37002
37003
37004
37005

How can I do this? I have attached my code:

def selection(sle):
    if sle in re.match('[3-4][0-9]{4}', sle):
        return 1
    else:
        return 0

select['status'] = select['Selection No.'].apply(selection)

and now I am getting an "argument of type 'NoneType' is not iterable" error.
Try using NumPy with np.isreal to select only the numbers:

import pandas as pd
import numpy as np

df = pd.DataFrame({'SELECTIO': ['N NO', 37002, 37003, 'Krishan', 37004, 'singh', 37005],
                   'some_col': [4, 6, 14, 26, 38, 47, 60]})
df

  SELECTIO  some_col
0     N NO         4
1    37002         6
2    37003        14
3  Krishan        26
4    37004        38
5    singh        47
6    37005        60

>>> df[df[['SELECTIO']].applymap(np.isreal).all(1)]

  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60

Or, as another approach, import numbers and use a lambda:

import numbers
df[df[['SELECTIO']].applymap(lambda x: isinstance(x, numbers.Number)).all(1)]

  SELECTIO  some_col
1    37002         6
2    37003        14
4    37004        38
6    37005        60

Note: there is a problem with how you are extracting the column. You are using ['Selection No.'], but the name actually contains a space at the end, so it should be ['Selection No. ']; that is the reason you are getting a KeyError when executing it. Try it and see!
Your function contains a wrong expression: if sle in re.match('[3-4][0-9]{4}',sle): - it tries to find the column value sle IN a match object, which "always has a boolean value of True". But re.match returns None when there is no match, and None is not iterable, which is exactly where your "argument of type 'NoneType' is not iterable" error comes from.

I would suggest proceeding with the pd.Series.str.isnumeric function:

In [544]: df
Out[544]:
  Selection No.
0         37001
1         37002
2         37003
3         asnsh
4         37004
5         singh
6         37005

In [545]: df['Status'] = df['Selection No.'].str.isnumeric().astype(int)

In [546]: df
Out[546]:
  Selection No.  Status
0         37001       1
1         37002       1
2         37003       1
3         asnsh       0
4         37004       1
5         singh       0
6         37005       1

If a strict regex pattern is required, use the pd.Series.str.contains function:

df['Status'] = df['Selection No.'].str.contains('^[3-4][0-9]{4}$', regex=True).astype(int)
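Since the asker ultimately wants only the numeric values rather than a status flag, one further option is pd.to_numeric with errors='coerce', which turns non-numbers into NaN so they can be dropped. A minimal sketch, assuming the asker's select DataFrame and the trailing-space column name 'Selection No. ' noted above:

import pandas as pd

# non-numeric entries become NaN and are dropped; the rest become integers
numeric_only = pd.to_numeric(select['Selection No. '], errors='coerce').dropna().astype(int)
print(numeric_only.head())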
Binning a data set using Pandas
Given a csv file of...

neg,,,,,,,
SAMPLE 1,,SAMPLE 2,,SAMPLE 3,,SAMPLE 4,
50.0261,2.17E+02,50.0224,3.31E+02,50.0007,5.38E+02,50.0199,2.39E+02
50.1057,2.65E+02,50.0435,3.92E+02,50.0657,5.52E+02,50.0465,3.37E+02
50.1514,2.90E+02,50.0781,3.88E+02,50.1115,5.75E+02,50.0584,2.58E+02
50.166,3.85E+02,50.1245,4.25E+02,50.1258,5.11E+02,50.0765,4.47E+02
50.1831,2.55E+02,50.1748,3.71E+02,50.1411,6.21E+02,50.1246,1.43E+02
50.2023,3.45E+02,50.2161,2.59E+02,50.1671,5.56E+02,50.1866,3.77E+02
50.223,4.02E+02,50.2381,4.33E+02,50.1968,6.31E+02,50.2276,3.41E+02
50.2631,1.89E+02,50.2826,4.63E+02,50.211,3.92E+02,50.2717,4.71E+02
50.2922,2.72E+02,50.3593,4.52E+02,50.2279,5.92E+02,50.376,3.09E+02
50.319,2.46E+02,50.4019,4.15E+02,50.2929,5.60E+02,50.3979,2.56E+02
50.3523,3.57E+02,50.423,3.31E+02,50.3659,4.84E+02,50.4237,3.28E+02
50.3968,4.67E+02,50.4402,1.76E+02,50.437,1.89E+02,50.4504,2.71E+02
50.4431,1.88E+02,50.479,4.85E+02,50.5137,6.63E+02,50.5078,2.54E+02
50.481,3.63E+02,50.5448,3.51E+02,50.5401,5.11E+02,50.5436,2.69E+02
50.506,3.73E+02,50.5872,4.03E+02,50.5593,6.56E+02,50.555,3.06E+02
50.5379,3.00E+02,50.6076,2.96E+02,50.6034,5.02E+02,50.6059,2.83E+02
50.5905,2.38E+02,50.6341,2.67E+02,50.6579,6.37E+02,50.6484,1.99E+02
50.6564,1.30E+02,50.662,3.53E+02,50.6888,7.37E+02,50.7945,4.84E+02
50.7428,2.38E+02,50.6952,4.21E+02,50.7132,6.71E+02,50.8044,4.41E+02
50.8052,3.67E+02,50.7397,1.99E+02,50.7421,6.29E+02,50.8213,1.69E+02
50.8459,2.80E+02,50.7685,3.73E+02,50.7872,5.30E+02,50.8401,3.88E+02
50.9021,3.56E+02,50.7757,4.54E+02,50.8251,4.13E+02,50.8472,3.61E+02
50.9425,3.89E+02,50.8027,7.20E+02,50.8418,5.73E+02,50.8893,1.18E+02
51.0117,2.29E+02,50.8206,2.93E+02,50.8775,4.34E+02,50.9285,2.64E+02
51.0244,5.19E+02,50.8364,4.80E+02,50.9101,4.25E+02,50.9591,1.64E+02
51.0319,3.62E+02,50.8619,2.90E+02,50.9222,5.11E+02,51.0034,2.70E+02
51.0439,4.24E+02,50.9098,3.22E+02,50.9675,4.33E+02,51.0577,2.88E+02
51.0961,3.59E+02,50.969,3.87E+02,51.0123,6.03E+02,51.0712,3.18E+02
51.1429,2.49E+02,51.0009,2.42E+02,51.0266,7.30E+02,51.1015,1.84E+02
51.1597,2.71E+02,51.0262,1.32E+02,51.0554,3.69E+02,51.1291,3.71E+02
51.177,2.84E+02,51.0778,1.58E+02,51.1113,4.50E+02,51.1378,3.54E+02
51.1924,2.00E+02,51.1313,4.07E+02,51.1464,3.86E+02,51.1871,1.55E+02
51.2055,2.25E+02,51.1844,2.08E+02,51.1826,7.06E+02,51.2511,2.05E+02
51.2302,3.81E+02,51.2197,5.49E+02,51.2284,7.00E+02,51.3036,2.60E+02
51.264,2.16E+02,51.2306,3.76E+02,51.271,3.83E+02,51.3432,1.99E+02
51.2919,2.29E+02,51.2468,2.87E+02,51.308,3.89E+02,51.3775,2.45E+02
51.3338,3.67E+02,51.2739,5.56E+02,51.3394,5.17E+02,51.3977,3.86E+02
51.3743,2.57E+02,51.3228,3.18E+02,51.3619,6.03E+02,51.4151,3.37E+02
51.3906,3.78E+02,51.3685,2.33E+02,51.3844,4.44E+02,51.4254,2.72E+02
51.4112,3.29E+02,51.3912,5.03E+02,51.4179,5.68E+02,51.4426,3.17E+02
51.4423,1.86E+02,51.4165,2.68E+02,51.4584,5.10E+02,51.4834,3.87E+02
51.537,3.48E+02,51.4645,3.76E+02,51.5179,5.75E+02,51.544,4.37E+02
51.637,4.51E+02,51.5078,2.76E+02,51.569,4.73E+02,51.5554,4.52E+02
51.665,2.27E+02,51.5388,2.51E+02,51.5894,4.57E+02,51.5958,1.96E+02
51.6925,5.60E+02,51.5486,2.79E+02,51.614,4.88E+02,51.6329,5.40E+02
51.7409,4.19E+02,51.5584,2.53E+02,51.6458,5.72E+02,51.6477,3.23E+02
51.7851,4.29E+02,51.5961,2.72E+02,51.7076,4.36E+02,51.6577,2.70E+02
51.8176,3.11E+02,51.6608,2.04E+02,51.776,5.59E+02,51.6699,3.89E+02
51.8764,3.94E+02,51.7093,5.14E+02,51.8157,6.66E+02,51.6788,2.83E+02
51.9135,3.26E+02,51.7396,1.88E+02,51.8514,4.26E+02,51.7201,3.91E+02
51.9592,2.66E+02,51.7931,2.72E+02,51.8791,5.61E+02,51.7546,3.41E+02
51.9954,2.97E+02,51.8428,5.96E+02,51.9129,5.14E+02,51.7646,2.27E+02
52.0751,2.24E+02,51.8923,3.94E+02,51.959,5.18E+02,51.7801,1.43E+02
52.1456,3.26E+02,51.9177,2.82E+02,52.0116,4.21E+02,51.8022,2.27E+02
52.1846,3.42E+02,51.9265,3.21E+02,52.0848,5.10E+02,51.83,2.66E+02
52.2284,2.66E+02,51.9413,3.56E+02,52.1412,6.20E+02,51.8698,1.74E+02
52.2666,5.32E+02,51.9616,2.19E+02,52.1722,5.72E+02,51.9084,2.89E+02
52.2936,4.24E+02,51.9845,1.53E+02,52.1821,5.18E+02,51.937,1.69E+02
52.3256,3.69E+02,52.0051,3.53E+02,52.2473,5.51E+02,51.9641,3.31E+02
52.3566,2.50E+02,52.0299,2.87E+02,52.3103,4.12E+02,52.0292,2.63E+02
52.4192,3.08E+02,52.0603,3.15E+02,52.35,8.76E+02,52.0633,3.94E+02
52.4757,2.99E+02,52.0988,3.45E+02,52.3807,6.95E+02,52.0797,2.88E+02
52.498,2.37E+02,52.1176,3.63E+02,52.4234,4.89E+02,52.1073,2.97E+02
52.57,2.58E+02,52.1698,3.11E+02,52.4451,4.54E+02,52.1546,3.41E+02
52.6178,4.29E+02,52.2352,3.96E+02,52.4627,5.38E+02,52.2219,3.68E+02

How can one split the samples using overlapping bins of 0.25 m/z, where the first column of each pair (SAMPLE n,,) contains an m/z value and the second contains the weight?

To load the file into a Pandas DataFrame I currently do:

import csv, pandas as pd

def load_raw_data():
    raw_data = []
    with open("negsmaller.csv", "rb") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        next(reader)
        for row in reader:
            raw_data.append(row)
    raw_data = pd.DataFrame(raw_data)
    return raw_data.T

if __name__ == '__main__':
    raw_data = load_raw_data()
    print raw_data

Which returns

          0         1         2         3        4         5         6
0  SAMPLE 1   50.0261   50.1057   50.1514   50.166   50.1831   50.2023
1            2.17E+02  2.65E+02  2.90E+02  3.85E+02  2.55E+02  3.45E+02
2  SAMPLE 2   50.0224   50.0435   50.0781  50.1245   50.1748   50.2161
3            3.31E+02  3.92E+02  3.88E+02  4.25E+02  3.71E+02  2.59E+02
4  SAMPLE 3   50.0007   50.0657   50.1115  50.1258   50.1411   50.1671
5            5.38E+02  5.52E+02  5.75E+02  5.11E+02  6.21E+02  5.56E+02
6  SAMPLE 4   50.0199   50.0465   50.0584  50.0765   50.1246   50.1866
7            2.39E+02  3.37E+02  2.58E+02  4.47E+02  1.43E+02  3.77E+02

          7         8         9  ...        56        57        58
0    50.223   50.2631   50.2922  ...   52.2284   52.2666   52.2936
1  4.02E+02  1.89E+02  2.72E+02  ...  2.66E+02  5.32E+02  4.24E+02
2   50.2381   50.2826   50.3593  ...   51.9413   51.9616   51.9845
3  4.33E+02  4.63E+02  4.52E+02  ...  3.56E+02  2.19E+02  1.53E+02
4   50.1968    50.211   50.2279  ...   52.1412   52.1722   52.1821
5  6.31E+02  3.92E+02  5.92E+02  ...  6.20E+02  5.72E+02  5.18E+02
6   50.2276   50.2717    50.376  ...   51.8698   51.9084    51.937
7  3.41E+02  4.71E+02  3.09E+02  ...  1.74E+02  2.89E+02  1.69E+02

         59        60        61        62        63        64        65
0   52.3256   52.3566   52.4192   52.4757    52.498     52.57   52.6178
1  3.69E+02  2.50E+02  3.08E+02  2.99E+02  2.37E+02  2.58E+02  4.29E+02
2   52.0051   52.0299   52.0603   52.0988   52.1176   52.1698   52.2352
3  3.53E+02  2.87E+02  3.15E+02  3.45E+02  3.63E+02  3.11E+02  3.96E+02
4   52.2473   52.3103     52.35   52.3807   52.4234   52.4451   52.4627
5  5.51E+02  4.12E+02  8.76E+02  6.95E+02  4.89E+02  4.54E+02  5.38E+02
6   51.9641   52.0292   52.0633   52.0797   52.1073   52.1546   52.2219
7  3.31E+02  2.63E+02  3.94E+02  2.88E+02  2.97E+02  3.41E+02  3.68E+02

[8 rows x 66 columns]

Process finished with exit code 0

My desired output: to take the overlapping 0.25 bins and then take the average of the column next to it and have it as one. So,

0.01 3
0.10 4
0.24 2

would become

.25 3
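No answer is included above, so the following is only a minimal sketch of one possible approach, not an accepted solution. It reads the file with pandas directly (Python 3 idioms rather than the Python 2 csv loop in the question), stacks the (m/z, weight) pairs into long form, and bins each m/z value on two staggered 0.25-wide grids to get overlapping bins; the column names, the 0.125 stagger offset, and the output shape are all assumptions:

import pandas as pd

# skip the 'neg' line; the next row holds the SAMPLE headers
raw = pd.read_csv("negsmaller.csv", skiprows=1, header=0)

# every even column is m/z, every odd column is the matching weight
pairs = []
for i in range(0, raw.shape[1] - 1, 2):
    sample = pd.DataFrame({
        'sample': raw.columns[i],
        'mz': pd.to_numeric(raw.iloc[:, i], errors='coerce'),
        'weight': pd.to_numeric(raw.iloc[:, i + 1], errors='coerce'),
    })
    pairs.append(sample.dropna())
long_df = pd.concat(pairs, ignore_index=True)

# two staggered 0.25-wide grids give overlapping bins;
# average the weights that fall into each bin
for offset in (0.0, 0.125):
    binned = (long_df.assign(bin=((long_df['mz'] - offset) // 0.25) * 0.25 + offset)
                     .groupby(['sample', 'bin'])['weight'].mean())
    print(binned.head())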
Pandas appending Series to DataFrame to write to a file
I have a list of DataFrames that I want to compute the mean on:

~ pieces[1].head()
    Sample Label  C_RUNTIMEN  N_TQ  N_TR  ...  N_GEAR1  N_GEAR2  N_GEAR3
301       manual   82.150833     7    69  ...    3.615    1.952    1.241
302       manual   82.150833     7    69  ...    3.615    1.952    1.241
303       manual   82.150833     7    69  ...    3.615    1.952    1.241
304       manual   82.150833     7    69  ...    3.615    1.952    1.241
305       manual   82.150833     7    69  ...    3.615    1.952    1.241

So I am looping through them:

pieces = np.array_split(df, size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp, ignore_index=True)

Unfortunately the output is sorted (the column names are alphabetical in the output) and I want to keep the original column order (as seen up top).

~ output.head()
    C_ABSHUM  C_ACCFUELGALN       C_AFR      C_AFRO  C_FRAIRWS  C_GEARRATIO
0  44.578937      66.183858   14.466816   14.113321  18.831117     6.677792
1  34.042593      66.231229   14.320409   14.113321  22.368983     6.677792
2  34.497194      66.309320   14.210066   14.113321  25.353414     6.677792
3  43.430931      66.376632   14.314854   14.113321  28.462130     6.677792
4  44.419204      66.516515   14.314653   14.113321  32.244107     6.677792

I have tried variations of concat etc. with no success. Is there a different way to think about this?
My recommendation would be to concat the list of dataframes using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df has a MultiIndex and behaves like a standard data frame; only the indexing and group-by are a little different:

x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index=list('ind')))

Now x contains a list of data frames of the shape

   a  b  c
i  2  3  4
n  2  3  4
d  2  3  4

And with

multi_df = pd.concat(x, keys=range(len(x)))
result = multi_df.groupby(level=[0]).apply(np.mean)

we get a data frame that looks like

    a   b   c
0   1   2   3
1   2   3   4
2   3   4   5
3   4   5   6
4   5   6   7
5   6   7   8
6   7   8   9
7   8   9  10
8   9  10  11
9  10  11  12

You can then just call result.to_csv('filepath') to write that out.
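A more direct fix for the column-ordering problem in the question is to build each group mean as a one-row DataFrame instead of appending a Series, since appending a Series row by row is what triggers the alphabetical sort (and DataFrame.append was removed in pandas 2.0 anyway). A minimal sketch, assuming pieces as defined in the question and an assumed 'output.csv' path:

import pandas as pd

# piece.mean() is a Series; to_frame().T turns it into a one-row
# DataFrame that keeps the original column order
output = pd.concat([piece.mean().to_frame().T for piece in pieces],
                   ignore_index=True)
output.to_csv('output.csv', index=False)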