Pandas appending Series to DataFrame to write to a file - python
I have a list of DataFrames that I want to compute the mean of:
~ pieces[1].head()
Sample Label C_RUNTIMEN N_TQ N_TR ... N_GEAR1 N_GEAR2 N_GEAR3 \
301 manual 82.150833 7 69 ... 3.615 1.952 1.241
302 manual 82.150833 7 69 ... 3.615 1.952 1.241
303 manual 82.150833 7 69 ... 3.615 1.952 1.241
304 manual 82.150833 7 69 ... 3.615 1.952 1.241
305 manual 82.150833 7 69 ... 3.615 1.952 1.241
So I am looping through them:
pieces = np.array_split(df, size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp, ignore_index=True)
Unfortunately the output columns are sorted alphabetically, and I want to keep the original column order (as seen up top).
~ output.head()
C_ABSHUM C_ACCFUELGALN C_AFR C_AFRO C_FRAIRWS C_GEARRATIO \
0 44.578937 66.183858 14.466816 14.113321 18.831117 6.677792
1 34.042593 66.231229 14.320409 14.113321 22.368983 6.677792
2 34.497194 66.309320 14.210066 14.113321 25.353414 6.677792
3 43.430931 66.376632 14.314854 14.113321 28.462130 6.677792
4 44.419204 66.516515 14.314653 14.113321 32.244107 6.677792
I have tried variations of concat etc. with no success. Is there a different way to think about this?
My recommendation would be to concat the list of dataframes using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df has a MultiIndex, but it behaves like a standard data frame; only the indexing and group-by are a little different:
x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index=list('ind')))
Now x contains a list of data frames, each of the shape (shown here for i = 1)
a b c
i 2 3 4
n 2 3 4
d 2 3 4
And with
multi_df = pd.concat(x, keys = range(len(x)))
result = multi_df.groupby(level = [0]).apply(np.mean)
we get a data frame that looks like
a b c
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
You can then just call result.to_csv('filepath') to write that out.
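Applied to the question's own pieces list, the same pattern would look something like the following sketch; note that pd.concat preserves the original column order when all pieces share the same columns, which addresses the alphabetical-sorting issue (the output filename here is illustrative):

means = (pd.concat(pieces, keys=range(len(pieces)))
         .groupby(level=0)
         .mean())  # recent pandas may need mean(numeric_only=True) due to the string Label column
means.to_csv('piece_means.csv')  # illustrative path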
Related
Changing of Data format from Pivoted data in Dataframes using Pandas Python
The Scenario: My dataset was in the following format, which I refer to as the ACTUAL FORMAT:

uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013

While passing it to another function (KMeans clustering), it needs to be in a format like the one below, which I created using a pivot mapping and refer to as the MATRIX FORMAT:

uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5

The Problem: Since I need the resulting MATRIX FORMAT data to be passed back to the first algorithm, I need to convert it to the OLD FORMAT.

Conversion: For the conversion of OLD to MATRIX format I did:

Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')

I tried reversing and interchanging the values to get the OLD FORMAT back, which failed. Is there any way to convert the MATRIX back to the OLD FORMAT?
You need stack, plus rename_axis for the column names and a final reset_index:

df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())

   uid  iid       rat
0    4    1  4.332076
1    4    2  4.340775
2    4    3  4.311200
3    4    4  4.341143
4    5    1  4.000000
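For context, a minimal round trip might look like the sketch below; the small source_data frame is invented for illustration, not taken from the question:

import pandas as pd

# hypothetical long-format ratings data
source_data = pd.DataFrame({'uid': [4, 4, 5, 5],
                            'iid': [1, 2, 1, 2],
                            'rat': [4.33, 4.34, 4.00, 3.00]})

# long -> matrix (the question's pivot step)
matrix = source_data.pivot(values='rat', index='uid', columns='iid')

# matrix -> long again; stack drops the NaN cells automatically
restored = matrix.stack().rename_axis(('uid', 'iid')).reset_index(name='rat')
print(restored)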
Scale values of a particular column of python dataframe between 1-10
I have a dataframe which contains YouTube video views, and I want to scale these values to the range 1-10. Below is a sample of what the values look like. How do I normalize them to the range 1-10, or is there a more efficient way to do this?

rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is scaling with the max:

1 + df / df.max() * 9

       rating
0    6.500315
1    1.343433
2    1.592952
3    1.354073
4    1.880933
5    1.320412
6    5.128909
7    1.171046
8    1.997531
9    1.003670
10   1.276217
11   1.060946
12   6.835232
13   4.162121
14   1.415808
15  10.000000
16   1.404192
17   1.235536
18   1.197075
19   2.443451
20   1.986783
21   2.637193

A similar solution by Wen (now deleted) scales with both min and max:

1 + (df - df.min()) * 9 / (df.max() - df.min())

       rating
0    6.498887
1    1.339902
2    1.589522
3    1.350546
4    1.877621
5    1.316871
6    5.126922
7    1.167444
8    1.994266
9    1.000000
10   1.272658
11   1.057299
12   6.833941
13   4.159739
14   1.412306
15  10.000000
16   1.400685
17   1.231960
18   1.193484
19   2.440368
20   1.983514
21   2.634189
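To make the min-max variant reusable, here is a minimal self-contained sketch; the function name and sample values are illustrative, not from the original answer:

import pandas as pd

def scale_to_range(s, lo=1.0, hi=10.0):
    # linearly map a Series onto [lo, hi] via min-max scaling
    return lo + (s - s.min()) * (hi - lo) / (s.max() - s.min())

df = pd.DataFrame({'rating': [4394029, 274358, 7189818, 2932]})
df['scaled'] = scale_to_range(df['rating'])
print(df)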
Binning a data set using Pandas
Given a csv file of...

neg,,,,,,,
SAMPLE 1,,SAMPLE 2,,SAMPLE 3,,SAMPLE 4,
50.0261,2.17E+02,50.0224,3.31E+02,50.0007,5.38E+02,50.0199,2.39E+02
50.1057,2.65E+02,50.0435,3.92E+02,50.0657,5.52E+02,50.0465,3.37E+02
50.1514,2.90E+02,50.0781,3.88E+02,50.1115,5.75E+02,50.0584,2.58E+02
50.166,3.85E+02,50.1245,4.25E+02,50.1258,5.11E+02,50.0765,4.47E+02
50.1831,2.55E+02,50.1748,3.71E+02,50.1411,6.21E+02,50.1246,1.43E+02
50.2023,3.45E+02,50.2161,2.59E+02,50.1671,5.56E+02,50.1866,3.77E+02
50.223,4.02E+02,50.2381,4.33E+02,50.1968,6.31E+02,50.2276,3.41E+02
50.2631,1.89E+02,50.2826,4.63E+02,50.211,3.92E+02,50.2717,4.71E+02
50.2922,2.72E+02,50.3593,4.52E+02,50.2279,5.92E+02,50.376,3.09E+02
50.319,2.46E+02,50.4019,4.15E+02,50.2929,5.60E+02,50.3979,2.56E+02
50.3523,3.57E+02,50.423,3.31E+02,50.3659,4.84E+02,50.4237,3.28E+02
50.3968,4.67E+02,50.4402,1.76E+02,50.437,1.89E+02,50.4504,2.71E+02
50.4431,1.88E+02,50.479,4.85E+02,50.5137,6.63E+02,50.5078,2.54E+02
50.481,3.63E+02,50.5448,3.51E+02,50.5401,5.11E+02,50.5436,2.69E+02
50.506,3.73E+02,50.5872,4.03E+02,50.5593,6.56E+02,50.555,3.06E+02
50.5379,3.00E+02,50.6076,2.96E+02,50.6034,5.02E+02,50.6059,2.83E+02
50.5905,2.38E+02,50.6341,2.67E+02,50.6579,6.37E+02,50.6484,1.99E+02
50.6564,1.30E+02,50.662,3.53E+02,50.6888,7.37E+02,50.7945,4.84E+02
50.7428,2.38E+02,50.6952,4.21E+02,50.7132,6.71E+02,50.8044,4.41E+02
50.8052,3.67E+02,50.7397,1.99E+02,50.7421,6.29E+02,50.8213,1.69E+02
50.8459,2.80E+02,50.7685,3.73E+02,50.7872,5.30E+02,50.8401,3.88E+02
50.9021,3.56E+02,50.7757,4.54E+02,50.8251,4.13E+02,50.8472,3.61E+02
50.9425,3.89E+02,50.8027,7.20E+02,50.8418,5.73E+02,50.8893,1.18E+02
51.0117,2.29E+02,50.8206,2.93E+02,50.8775,4.34E+02,50.9285,2.64E+02
51.0244,5.19E+02,50.8364,4.80E+02,50.9101,4.25E+02,50.9591,1.64E+02
51.0319,3.62E+02,50.8619,2.90E+02,50.9222,5.11E+02,51.0034,2.70E+02
51.0439,4.24E+02,50.9098,3.22E+02,50.9675,4.33E+02,51.0577,2.88E+02
51.0961,3.59E+02,50.969,3.87E+02,51.0123,6.03E+02,51.0712,3.18E+02
51.1429,2.49E+02,51.0009,2.42E+02,51.0266,7.30E+02,51.1015,1.84E+02
51.1597,2.71E+02,51.0262,1.32E+02,51.0554,3.69E+02,51.1291,3.71E+02
51.177,2.84E+02,51.0778,1.58E+02,51.1113,4.50E+02,51.1378,3.54E+02
51.1924,2.00E+02,51.1313,4.07E+02,51.1464,3.86E+02,51.1871,1.55E+02
51.2055,2.25E+02,51.1844,2.08E+02,51.1826,7.06E+02,51.2511,2.05E+02
51.2302,3.81E+02,51.2197,5.49E+02,51.2284,7.00E+02,51.3036,2.60E+02
51.264,2.16E+02,51.2306,3.76E+02,51.271,3.83E+02,51.3432,1.99E+02
51.2919,2.29E+02,51.2468,2.87E+02,51.308,3.89E+02,51.3775,2.45E+02
51.3338,3.67E+02,51.2739,5.56E+02,51.3394,5.17E+02,51.3977,3.86E+02
51.3743,2.57E+02,51.3228,3.18E+02,51.3619,6.03E+02,51.4151,3.37E+02
51.3906,3.78E+02,51.3685,2.33E+02,51.3844,4.44E+02,51.4254,2.72E+02
51.4112,3.29E+02,51.3912,5.03E+02,51.4179,5.68E+02,51.4426,3.17E+02
51.4423,1.86E+02,51.4165,2.68E+02,51.4584,5.10E+02,51.4834,3.87E+02
51.537,3.48E+02,51.4645,3.76E+02,51.5179,5.75E+02,51.544,4.37E+02
51.637,4.51E+02,51.5078,2.76E+02,51.569,4.73E+02,51.5554,4.52E+02
51.665,2.27E+02,51.5388,2.51E+02,51.5894,4.57E+02,51.5958,1.96E+02
51.6925,5.60E+02,51.5486,2.79E+02,51.614,4.88E+02,51.6329,5.40E+02
51.7409,4.19E+02,51.5584,2.53E+02,51.6458,5.72E+02,51.6477,3.23E+02
51.7851,4.29E+02,51.5961,2.72E+02,51.7076,4.36E+02,51.6577,2.70E+02
51.8176,3.11E+02,51.6608,2.04E+02,51.776,5.59E+02,51.6699,3.89E+02
51.8764,3.94E+02,51.7093,5.14E+02,51.8157,6.66E+02,51.6788,2.83E+02
51.9135,3.26E+02,51.7396,1.88E+02,51.8514,4.26E+02,51.7201,3.91E+02
51.9592,2.66E+02,51.7931,2.72E+02,51.8791,5.61E+02,51.7546,3.41E+02
51.9954,2.97E+02,51.8428,5.96E+02,51.9129,5.14E+02,51.7646,2.27E+02
52.0751,2.24E+02,51.8923,3.94E+02,51.959,5.18E+02,51.7801,1.43E+02
52.1456,3.26E+02,51.9177,2.82E+02,52.0116,4.21E+02,51.8022,2.27E+02
52.1846,3.42E+02,51.9265,3.21E+02,52.0848,5.10E+02,51.83,2.66E+02
52.2284,2.66E+02,51.9413,3.56E+02,52.1412,6.20E+02,51.8698,1.74E+02
52.2666,5.32E+02,51.9616,2.19E+02,52.1722,5.72E+02,51.9084,2.89E+02
52.2936,4.24E+02,51.9845,1.53E+02,52.1821,5.18E+02,51.937,1.69E+02
52.3256,3.69E+02,52.0051,3.53E+02,52.2473,5.51E+02,51.9641,3.31E+02
52.3566,2.50E+02,52.0299,2.87E+02,52.3103,4.12E+02,52.0292,2.63E+02
52.4192,3.08E+02,52.0603,3.15E+02,52.35,8.76E+02,52.0633,3.94E+02
52.4757,2.99E+02,52.0988,3.45E+02,52.3807,6.95E+02,52.0797,2.88E+02
52.498,2.37E+02,52.1176,3.63E+02,52.4234,4.89E+02,52.1073,2.97E+02
52.57,2.58E+02,52.1698,3.11E+02,52.4451,4.54E+02,52.1546,3.41E+02
52.6178,4.29E+02,52.2352,3.96E+02,52.4627,5.38E+02,52.2219,3.68E+02

how can one split the samples using overlapping bins of 0.25 m/z, where the first column of each pair (SAMPLE n) contains an m/z value and the second contains the weight?

To load the file into a Pandas DataFrame I currently do:

import csv, pandas as pd

def load_raw_data():
    raw_data = []
    with open("negsmaller.csv", "rb") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        next(reader)
        for row in reader:
            raw_data.append(row)
    raw_data = pd.DataFrame(raw_data)
    return raw_data.T

if __name__ == '__main__':
    raw_data = load_raw_data()
    print raw_data

Which returns

          0         1         2         3         4         5         6  \
0  SAMPLE 1   50.0261   50.1057   50.1514    50.166   50.1831   50.2023
1            2.17E+02  2.65E+02  2.90E+02  3.85E+02  2.55E+02  3.45E+02
2  SAMPLE 2   50.0224   50.0435   50.0781   50.1245   50.1748   50.2161
3            3.31E+02  3.92E+02  3.88E+02  4.25E+02  3.71E+02  2.59E+02
4  SAMPLE 3   50.0007   50.0657   50.1115   50.1258   50.1411   50.1671
5            5.38E+02  5.52E+02  5.75E+02  5.11E+02  6.21E+02  5.56E+02
6  SAMPLE 4   50.0199   50.0465   50.0584   50.0765   50.1246   50.1866
7            2.39E+02  3.37E+02  2.58E+02  4.47E+02  1.43E+02  3.77E+02

          7         8         9  ...        56        57        58  \
0    50.223   50.2631   50.2922  ...   52.2284   52.2666   52.2936
1  4.02E+02  1.89E+02  2.72E+02  ...  2.66E+02  5.32E+02  4.24E+02
2   50.2381   50.2826   50.3593  ...   51.9413   51.9616   51.9845
3  4.33E+02  4.63E+02  4.52E+02  ...  3.56E+02  2.19E+02  1.53E+02
4   50.1968    50.211   50.2279  ...   52.1412   52.1722   52.1821
5  6.31E+02  3.92E+02  5.92E+02  ...  6.20E+02  5.72E+02  5.18E+02
6   50.2276   50.2717    50.376  ...   51.8698   51.9084    51.937
7  3.41E+02  4.71E+02  3.09E+02  ...  1.74E+02  2.89E+02  1.69E+02

         59        60        61        62        63        64        65
0   52.3256   52.3566   52.4192   52.4757    52.498     52.57   52.6178
1  3.69E+02  2.50E+02  3.08E+02  2.99E+02  2.37E+02  2.58E+02  4.29E+02
2   52.0051   52.0299   52.0603   52.0988   52.1176   52.1698   52.2352
3  3.53E+02  2.87E+02  3.15E+02  3.45E+02  3.63E+02  3.11E+02  3.96E+02
4   52.2473   52.3103     52.35   52.3807   52.4234   52.4451   52.4627
5  5.51E+02  4.12E+02  8.76E+02  6.95E+02  4.89E+02  4.54E+02  5.38E+02
6   51.9641   52.0292   52.0633   52.0797   52.1073   52.1546   52.2219
7  3.31E+02  2.63E+02  3.94E+02  2.88E+02  2.97E+02  3.41E+02  3.68E+02

[8 rows x 66 columns]

Process finished with exit code 0

My desired output: to take the overlapping 0.25 bins, then take the average of the adjacent weight column and report it as one value. So,

0.01 3
0.10 4
0.24 2

would become

.25 3
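No answer is shown here, but as a hedged sketch, one way to do the binning step uses pd.cut; column names, bin edges, and sample values below are all assumptions, the bins produced are non-overlapping, and truly overlapping bins would need a second pass with the edges offset by half a bin width:

import numpy as np
import pandas as pd

# hypothetical single sample: m/z values and their weights
sample = pd.DataFrame({'mz': [50.0261, 50.1057, 50.1514, 50.3523],
                       'weight': [217.0, 265.0, 290.0, 357.0]})

# assign each m/z value to a 0.25-wide bin, then average the weights per bin
edges = np.arange(50.0, 53.0 + 0.25, 0.25)
sample['bin'] = pd.cut(sample['mz'], bins=edges)
binned = sample.groupby('bin')['weight'].mean().dropna()
print(binned)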
How to add a repeated column using pandas
I am doing my homework and I encountered a problem. I have a large matrix whose first column Y002 is a nominal variable with 3 levels, encoded as 1, 2, 3 respectively. The other two columns, V96 and V97, are just numeric. Now I want to get the group means corresponding to the variable Y002. I wrote code like this:

group = data2.groupby(by=["Y002"]).mean()

Then I index to get each group mean using

group1 = group["V96"]
group2 = group["V97"]

Now I want to append these group means as new columns to the original dataframe, in which each mean matches the corresponding Y002 code (1, 2 or 3). I tried this code, but it only shows NaN:

data2["group1"] = pd.Series(group1, index=data2.index)

Hope someone could help me with this, many thanks :)

PS: Hope this makes sense. Just like in the R language, we can do the same thing using

data2$group1 = with(data2, tapply(V97, Y002, mean))[data2$Y002]

But how can we implement this in Python and pandas?
You can use .transform():

import pandas as pd
import numpy as np

# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)

        V96     V97  Y002
0   -0.6866 -0.1478     1
1    0.0149  1.6838     2
2   -0.3757  0.9718     1
3   -0.0382  1.6077     2
4    0.3680 -0.2571     2
5   -0.0447  1.8098     3
6   -0.3024  0.8923     1
7   -2.2244 -0.0966     3
8    0.7240 -0.3772     1
9    0.3590 -0.5053     1
..      ...     ...   ...
90  -0.6906  1.5567     2
91  -0.6815 -0.4189     3
92  -1.5122 -0.4097     1
93   2.1969  1.1164     2
94   1.0412 -0.2510     3
95  -0.0332 -0.4152     1
96   0.0656 -0.6391     3
97   0.2658  2.4978     1
98   1.1518 -3.0051     2
99   0.1380 -0.8740     3

# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform(np.mean)
df['V97_mean'] = df.groupby('Y002')['V97'].transform(np.mean)

df

        V96     V97  Y002  V96_mean  V97_mean
0   -0.6866 -0.1478     1   -0.1944    0.0837
1    0.0149  1.6838     2    0.0497   -0.0496
2   -0.3757  0.9718     1   -0.1944    0.0837
3   -0.0382  1.6077     2    0.0497   -0.0496
4    0.3680 -0.2571     2    0.0497   -0.0496
5   -0.0447  1.8098     3    0.0053   -0.0707
6   -0.3024  0.8923     1   -0.1944    0.0837
7   -2.2244 -0.0966     3    0.0053   -0.0707
8    0.7240 -0.3772     1   -0.1944    0.0837
9    0.3590 -0.5053     1   -0.1944    0.0837
..      ...     ...   ...       ...       ...
90  -0.6906  1.5567     2    0.0497   -0.0496
91  -0.6815 -0.4189     3    0.0053   -0.0707
92  -1.5122 -0.4097     1   -0.1944    0.0837
93   2.1969  1.1164     2    0.0497   -0.0496
94   1.0412 -0.2510     3    0.0053   -0.0707
95  -0.0332 -0.4152     1   -0.1944    0.0837
96   0.0656 -0.6391     3    0.0053   -0.0707
97   0.2658  2.4978     1   -0.1944    0.0837
98   1.1518 -3.0051     2    0.0497   -0.0496
99   0.1380 -0.8740     3    0.0053   -0.0707

[100 rows x 5 columns]
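For comparison with the R tapply-then-index idiom in the question, the same result can also be had by mapping the per-group means back onto the key column; this line is a sketch using the df built above, not part of the original answer:

# mirrors data2$group1 = with(data2, tapply(V97, Y002, mean))[data2$Y002]
df['V97_mean_alt'] = df['Y002'].map(df.groupby('Y002')['V97'].mean())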
Nested if loop with DataFrame is very, very slow
I have 10 million rows to go through and it will take many hours to process; I must be doing something wrong. I converted the names of my df variables for ease of typing:

Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']

The below code works fine, just ungodly slow; any suggestions?

print(datetime.datetime.now().time())
for i in range(2,1000):
    if delta[i]==90:
        if delta[i-1]==50:
            if delta[i-2]==10:
                if expiry[i]==expiry[i-2]:
                    df.Skew[i]=ivMid[i]-ivMid[i-2]
print(datetime.datetime.now().time())

14:02:11.014396
14:02:13.834275

df.head(100)

        Close  eqId        date  IntDate  expiry  delta   ivMid    conf    Skew
0   37.380005     7  2008-01-02    39447       1     50  0.3850  0.8663
1   37.380005     7  2008-01-02    39447       1     90  0.5053  0.7876
2   36.960007     7  2008-01-03    39448       1     50  0.3915  0.8597
3   36.960007     7  2008-01-03    39448       1     90  0.5119  0.7438
4   35.179993     7  2008-01-04    39449       1     50  0.4055  0.8454
5   35.179993     7  2008-01-04    39449       1     90  0.5183  0.7736
6   33.899994     7  2008-01-07    39452       1     50  0.4464  0.8400
7   33.899994     7  2008-01-07    39452       1     90  0.5230  0.7514
8   31.250000     7  2008-01-08    39453       1     10  0.4453  0.7086
9   31.250000     7  2008-01-08    39453       1     50  0.4826  0.8246
10  31.250000     7  2008-01-08    39453       1     90  0.5668  0.6474  0.1215
11  30.830002     7  2008-01-09    39454       1     10  0.4716  0.7186
12  30.830002     7  2008-01-09    39454       1     50  0.4963  0.8479
13  30.830002     7  2008-01-09    39454       1     90  0.5735  0.6704  0.1019
14  31.460007     7  2008-01-10    39455       1     10  0.4254  0.6737
15  31.460007     7  2008-01-10    39455       1     50  0.4929  0.8218
16  31.460007     7  2008-01-10    39455       1     90  0.5902  0.6411  0.1648
17  30.699997     7  2008-01-11    39456       1     10  0.4868  0.7183
18  30.699997     7  2008-01-11    39456       1     50  0.4965  0.8411
19  30.639999     7  2008-01-14    39459       1     10  0.5117  0.7620
20  30.639999     7  2008-01-14    39459       1     50  0.4989  0.8804
21  30.639999     7  2008-01-14    39459       1     90  0.5887  0.6845  0.077
22  29.309998     7  2008-01-15    39460       1     10  0.4956  0.7363
23  29.309998     7  2008-01-15    39460       1     50  0.5054  0.8643
24  30.080002     7  2008-01-16    39461       1     10  0.4983  0.6646

At this rate it will take 7.77 hrs to process.
Basically, the whole point of numpy & pandas is to avoid loops like the plague, and do things in a vectorial way. As you noticed, without that, speed is gone. Let's break your problem into steps.

The Conditions

Here, your first condition can be written like this:

df.delta == 90

(Note how this compares the entire column at once. This is much, much faster than your loop!)

And the second one can be written like this (using shift):

df.delta.shift(1) == 50

The rest of your conditions are similar. Note that to combine conditions, you need to use parentheses. So the first two conditions, together, should be written as:

(df.delta == 90) & (df.delta.shift(1) == 50)

You should now be able to write an expression combining all your conditions. Let's call it cond, i.e.,

cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...

The Assignment

To assign things to a new column, use

df['skew'] = ...

We just need to figure out what to put on the right-hand side.

The Right-Hand Side

Since we have cond, we can write the right-hand side as

np.where(cond, df.ivMid - df.ivMid.shift(2), 0)

What this says is: when the condition is true, take the second term; when it's not, take the third term (in this case I used 0, but do whatever you like). By combining all of this, you should be able to write a very efficient version of your code.
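Putting the answer's pieces together, the full vectorized replacement might look like this sketch (the 0 fill for non-matching rows is one choice; np.nan would instead preserve the blank cells shown in the question's output):

import numpy as np

cond = ((df.delta == 90)
        & (df.delta.shift(1) == 50)
        & (df.delta.shift(2) == 10)
        & (df.expiry == df.expiry.shift(2)))
# vectorized equivalent of the original nested-if loop
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), 0)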