Pandas appending Series to DataFrame to write to a file - python

I have a list of DataFrames that I want to compute the mean on:
~ pieces[1].head()
Sample Label C_RUNTIMEN N_TQ N_TR ... N_GEAR1 N_GEAR2 N_GEAR3 \
301 manual 82.150833 7 69 ... 3.615 1.952 1.241
302 manual 82.150833 7 69 ... 3.615 1.952 1.241
303 manual 82.150833 7 69 ... 3.615 1.952 1.241
304 manual 82.150833 7 69 ... 3.615 1.952 1.241
305 manual 82.150833 7 69 ... 3.615 1.952 1.241
So I am looping through them:
pieces = np.array_split(df, size)
output = pd.DataFrame()
for piece in pieces:
    dp = piece.mean()
    output = output.append(dp, ignore_index=True)
Unfortunately the output is sorted (the column names are alphabetical in the output) and I want to keep the original column order (as seen up top).
~ output.head()
C_ABSHUM C_ACCFUELGALN C_AFR C_AFRO C_FRAIRWS C_GEARRATIO \
0 44.578937 66.183858 14.466816 14.113321 18.831117 6.677792
1 34.042593 66.231229 14.320409 14.113321 22.368983 6.677792
2 34.497194 66.309320 14.210066 14.113321 25.353414 6.677792
3 43.430931 66.376632 14.314854 14.113321 28.462130 6.677792
4 44.419204 66.516515 14.314653 14.113321 32.244107 6.677792
I have tried variations of concat etc. with no success. Is there a different way to think about this?

My recommendation would be to concat the list of dataframes using pd.concat. This will allow you to use the standard group-by/apply. In this example, multi_df has a MultiIndex and behaves like a standard data frame; only the indexing and group-by are a little different:
x = []
for i in range(10):
    x.append(pd.DataFrame(dict(zip(list('abc'), [i + 1, i + 2, i + 3])), index=list('ind')))
Now x contains a list of data frames of the shape
a b c
i 2 3 4
n 2 3 4
d 2 3 4
And with
multi_df = pd.concat(x, keys=range(len(x)))
result = multi_df.groupby(level=0).apply(np.mean)
we get a data frame that looks like
a b c
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
8 9 10 11
9 10 11 12
You can then just call result.to_csv('filepath') to write that out.
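Applied to the question, a minimal sketch of the same pattern (it assumes df and size are defined as above, and that non-numeric columns like Label either get dropped by mean or are excluded first; the output path is hypothetical):
import numpy as np
import pandas as pd
# stack the pieces with keys, group on the outer level, take means;
# the column order of df is preserved, unlike append
pieces = np.array_split(df, size)
output = pd.concat(pieces, keys=range(len(pieces))).groupby(level=0).mean()
output.to_csv('piece_means.csv')  # hypothetical output path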

Related

Changing of Data format from Pivoted data in Dataframes using Pandas Python

The Scenario
My dataset was in the following format, which I refer to as the ACTUAL FORMAT:
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
and when passing it to another function (KMeans clustering) it needs to be in the format below, which I've created using a pivot mapping. I refer to this as the MATRIX FORMAT:
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Since the MATRIX FORMAT data needs to be passed back to the first algorithm, I need to convert it to the OLD FORMAT.
Conversion:
To convert from the OLD to the MATRIX format I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging values to get the OLD FORMAT back, but that failed. Is there any way to convert the MATRIX FORMAT back to the OLD FORMAT?
You need stack with rename_axis for the column names, and finally reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
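As a quick round-trip sketch with hypothetical sample values, pivot and stack are inverses here:
import pandas as pd
# a small frame in the ACTUAL FORMAT (values are made up)
source_data = pd.DataFrame({'uid': [4, 4, 5, 5],
                            'iid': [1, 2, 1, 2],
                            'rat': [4.33, 4.34, 4.00, 3.00]})
# ACTUAL -> MATRIX
matrix = source_data.pivot(values='rat', index='uid', columns='iid')
# MATRIX -> ACTUAL
restored = matrix.stack().rename_axis(('uid', 'iid')).reset_index(name='rat')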

Scale values of a particular column of python dataframe between 1-10

I have a dataframe which contains YouTube video view counts, and I want to scale these values to the range 1-10.
Below is a sample of how the values look. How do I normalize them to the range 1-10, and is there a more efficient way to do this?
rating
4394029
274358
473691
282858
703750
255967
3298456
136643
796896
2932
220661
48688
4661584
2526119
332176
7189818
322896
188162
157437
1153128
788310
1307902
One possibility is scaling by the maximum:
1 + df / df.max() * 9
rating
0 6.500315
1 1.343433
2 1.592952
3 1.354073
4 1.880933
5 1.320412
6 5.128909
7 1.171046
8 1.997531
9 1.003670
10 1.276217
11 1.060946
12 6.835232
13 4.162121
14 1.415808
15 10.000000
16 1.404192
17 1.235536
18 1.197075
19 2.443451
20 1.986783
21 2.637193
Similar solution by Wen (now deleted):
1 + (df - df.min()) * 9 / (df.max() - df.min())
rating
0 6.498887
1 1.339902
2 1.589522
3 1.350546
4 1.877621
5 1.316871
6 5.126922
7 1.167444
8 1.994266
9 1.000000
10 1.272658
11 1.057299
12 6.833941
13 4.159739
14 1.412306
15 10.000000
16 1.400685
17 1.231960
18 1.193484
19 2.440368
20 1.983514
21 2.634189
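The min-max variant generalizes to any target range; as a small sketch (the column name rating follows the question, and the helper name is hypothetical):
def scale_to_range(s, lo=1, hi=10):
    # min-max scale a Series into [lo, hi]
    return lo + (s - s.min()) * (hi - lo) / (s.max() - s.min())

df['scaled'] = scale_to_range(df['rating'])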

Binning a data set using Pandas

Given a csv file of...
neg,,,,,,,
SAMPLE 1,,SAMPLE 2,,SAMPLE 3,,SAMPLE 4,
50.0261,2.17E+02,50.0224,3.31E+02,50.0007,5.38E+02,50.0199,2.39E+02
50.1057,2.65E+02,50.0435,3.92E+02,50.0657,5.52E+02,50.0465,3.37E+02
50.1514,2.90E+02,50.0781,3.88E+02,50.1115,5.75E+02,50.0584,2.58E+02
50.166,3.85E+02,50.1245,4.25E+02,50.1258,5.11E+02,50.0765,4.47E+02
50.1831,2.55E+02,50.1748,3.71E+02,50.1411,6.21E+02,50.1246,1.43E+02
50.2023,3.45E+02,50.2161,2.59E+02,50.1671,5.56E+02,50.1866,3.77E+02
50.223,4.02E+02,50.2381,4.33E+02,50.1968,6.31E+02,50.2276,3.41E+02
50.2631,1.89E+02,50.2826,4.63E+02,50.211,3.92E+02,50.2717,4.71E+02
50.2922,2.72E+02,50.3593,4.52E+02,50.2279,5.92E+02,50.376,3.09E+02
50.319,2.46E+02,50.4019,4.15E+02,50.2929,5.60E+02,50.3979,2.56E+02
50.3523,3.57E+02,50.423,3.31E+02,50.3659,4.84E+02,50.4237,3.28E+02
50.3968,4.67E+02,50.4402,1.76E+02,50.437,1.89E+02,50.4504,2.71E+02
50.4431,1.88E+02,50.479,4.85E+02,50.5137,6.63E+02,50.5078,2.54E+02
50.481,3.63E+02,50.5448,3.51E+02,50.5401,5.11E+02,50.5436,2.69E+02
50.506,3.73E+02,50.5872,4.03E+02,50.5593,6.56E+02,50.555,3.06E+02
50.5379,3.00E+02,50.6076,2.96E+02,50.6034,5.02E+02,50.6059,2.83E+02
50.5905,2.38E+02,50.6341,2.67E+02,50.6579,6.37E+02,50.6484,1.99E+02
50.6564,1.30E+02,50.662,3.53E+02,50.6888,7.37E+02,50.7945,4.84E+02
50.7428,2.38E+02,50.6952,4.21E+02,50.7132,6.71E+02,50.8044,4.41E+02
50.8052,3.67E+02,50.7397,1.99E+02,50.7421,6.29E+02,50.8213,1.69E+02
50.8459,2.80E+02,50.7685,3.73E+02,50.7872,5.30E+02,50.8401,3.88E+02
50.9021,3.56E+02,50.7757,4.54E+02,50.8251,4.13E+02,50.8472,3.61E+02
50.9425,3.89E+02,50.8027,7.20E+02,50.8418,5.73E+02,50.8893,1.18E+02
51.0117,2.29E+02,50.8206,2.93E+02,50.8775,4.34E+02,50.9285,2.64E+02
51.0244,5.19E+02,50.8364,4.80E+02,50.9101,4.25E+02,50.9591,1.64E+02
51.0319,3.62E+02,50.8619,2.90E+02,50.9222,5.11E+02,51.0034,2.70E+02
51.0439,4.24E+02,50.9098,3.22E+02,50.9675,4.33E+02,51.0577,2.88E+02
51.0961,3.59E+02,50.969,3.87E+02,51.0123,6.03E+02,51.0712,3.18E+02
51.1429,2.49E+02,51.0009,2.42E+02,51.0266,7.30E+02,51.1015,1.84E+02
51.1597,2.71E+02,51.0262,1.32E+02,51.0554,3.69E+02,51.1291,3.71E+02
51.177,2.84E+02,51.0778,1.58E+02,51.1113,4.50E+02,51.1378,3.54E+02
51.1924,2.00E+02,51.1313,4.07E+02,51.1464,3.86E+02,51.1871,1.55E+02
51.2055,2.25E+02,51.1844,2.08E+02,51.1826,7.06E+02,51.2511,2.05E+02
51.2302,3.81E+02,51.2197,5.49E+02,51.2284,7.00E+02,51.3036,2.60E+02
51.264,2.16E+02,51.2306,3.76E+02,51.271,3.83E+02,51.3432,1.99E+02
51.2919,2.29E+02,51.2468,2.87E+02,51.308,3.89E+02,51.3775,2.45E+02
51.3338,3.67E+02,51.2739,5.56E+02,51.3394,5.17E+02,51.3977,3.86E+02
51.3743,2.57E+02,51.3228,3.18E+02,51.3619,6.03E+02,51.4151,3.37E+02
51.3906,3.78E+02,51.3685,2.33E+02,51.3844,4.44E+02,51.4254,2.72E+02
51.4112,3.29E+02,51.3912,5.03E+02,51.4179,5.68E+02,51.4426,3.17E+02
51.4423,1.86E+02,51.4165,2.68E+02,51.4584,5.10E+02,51.4834,3.87E+02
51.537,3.48E+02,51.4645,3.76E+02,51.5179,5.75E+02,51.544,4.37E+02
51.637,4.51E+02,51.5078,2.76E+02,51.569,4.73E+02,51.5554,4.52E+02
51.665,2.27E+02,51.5388,2.51E+02,51.5894,4.57E+02,51.5958,1.96E+02
51.6925,5.60E+02,51.5486,2.79E+02,51.614,4.88E+02,51.6329,5.40E+02
51.7409,4.19E+02,51.5584,2.53E+02,51.6458,5.72E+02,51.6477,3.23E+02
51.7851,4.29E+02,51.5961,2.72E+02,51.7076,4.36E+02,51.6577,2.70E+02
51.8176,3.11E+02,51.6608,2.04E+02,51.776,5.59E+02,51.6699,3.89E+02
51.8764,3.94E+02,51.7093,5.14E+02,51.8157,6.66E+02,51.6788,2.83E+02
51.9135,3.26E+02,51.7396,1.88E+02,51.8514,4.26E+02,51.7201,3.91E+02
51.9592,2.66E+02,51.7931,2.72E+02,51.8791,5.61E+02,51.7546,3.41E+02
51.9954,2.97E+02,51.8428,5.96E+02,51.9129,5.14E+02,51.7646,2.27E+02
52.0751,2.24E+02,51.8923,3.94E+02,51.959,5.18E+02,51.7801,1.43E+02
52.1456,3.26E+02,51.9177,2.82E+02,52.0116,4.21E+02,51.8022,2.27E+02
52.1846,3.42E+02,51.9265,3.21E+02,52.0848,5.10E+02,51.83,2.66E+02
52.2284,2.66E+02,51.9413,3.56E+02,52.1412,6.20E+02,51.8698,1.74E+02
52.2666,5.32E+02,51.9616,2.19E+02,52.1722,5.72E+02,51.9084,2.89E+02
52.2936,4.24E+02,51.9845,1.53E+02,52.1821,5.18E+02,51.937,1.69E+02
52.3256,3.69E+02,52.0051,3.53E+02,52.2473,5.51E+02,51.9641,3.31E+02
52.3566,2.50E+02,52.0299,2.87E+02,52.3103,4.12E+02,52.0292,2.63E+02
52.4192,3.08E+02,52.0603,3.15E+02,52.35,8.76E+02,52.0633,3.94E+02
52.4757,2.99E+02,52.0988,3.45E+02,52.3807,6.95E+02,52.0797,2.88E+02
52.498,2.37E+02,52.1176,3.63E+02,52.4234,4.89E+02,52.1073,2.97E+02
52.57,2.58E+02,52.1698,3.11E+02,52.4451,4.54E+02,52.1546,3.41E+02
52.6178,4.29E+02,52.2352,3.96E+02,52.4627,5.38E+02,52.2219,3.68E+02
How can one split the samples using overlapping bins of 0.25 m/z, where the first column of each sample pair (SAMPLE n) contains an m/z value and the second contains the weight?
To load the file into a Pandas DataFrame I currently do:
import csv, pandas as pd

def load_raw_data():
    raw_data = []
    with open("negsmaller.csv", "rb") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        next(reader)  # skip the leading 'neg' header line
        for row in reader:
            raw_data.append(row)
    raw_data = pd.DataFrame(raw_data)
    return raw_data.T

if __name__ == '__main__':
    raw_data = load_raw_data()
    print raw_data
Which returns
0 1 2 3 4 5 6 \
0 SAMPLE 1 50.0261 50.1057 50.1514 50.166 50.1831 50.2023
1 2.17E+02 2.65E+02 2.90E+02 3.85E+02 2.55E+02 3.45E+02
2 SAMPLE 2 50.0224 50.0435 50.0781 50.1245 50.1748 50.2161
3 3.31E+02 3.92E+02 3.88E+02 4.25E+02 3.71E+02 2.59E+02
4 SAMPLE 3 50.0007 50.0657 50.1115 50.1258 50.1411 50.1671
5 5.38E+02 5.52E+02 5.75E+02 5.11E+02 6.21E+02 5.56E+02
6 SAMPLE 4 50.0199 50.0465 50.0584 50.0765 50.1246 50.1866
7 2.39E+02 3.37E+02 2.58E+02 4.47E+02 1.43E+02 3.77E+02
7 8 9 ... 56 57 58 \
0 50.223 50.2631 50.2922 ... 52.2284 52.2666 52.2936
1 4.02E+02 1.89E+02 2.72E+02 ... 2.66E+02 5.32E+02 4.24E+02
2 50.2381 50.2826 50.3593 ... 51.9413 51.9616 51.9845
3 4.33E+02 4.63E+02 4.52E+02 ... 3.56E+02 2.19E+02 1.53E+02
4 50.1968 50.211 50.2279 ... 52.1412 52.1722 52.1821
5 6.31E+02 3.92E+02 5.92E+02 ... 6.20E+02 5.72E+02 5.18E+02
6 50.2276 50.2717 50.376 ... 51.8698 51.9084 51.937
7 3.41E+02 4.71E+02 3.09E+02 ... 1.74E+02 2.89E+02 1.69E+02
59 60 61 62 63 64 65
0 52.3256 52.3566 52.4192 52.4757 52.498 52.57 52.6178
1 3.69E+02 2.50E+02 3.08E+02 2.99E+02 2.37E+02 2.58E+02 4.29E+02
2 52.0051 52.0299 52.0603 52.0988 52.1176 52.1698 52.2352
3 3.53E+02 2.87E+02 3.15E+02 3.45E+02 3.63E+02 3.11E+02 3.96E+02
4 52.2473 52.3103 52.35 52.3807 52.4234 52.4451 52.4627
5 5.51E+02 4.12E+02 8.76E+02 6.95E+02 4.89E+02 4.54E+02 5.38E+02
6 51.9641 52.0292 52.0633 52.0797 52.1073 52.1546 52.2219
7 3.31E+02 2.63E+02 3.94E+02 2.88E+02 2.97E+02 3.41E+02 3.68E+02
[8 rows x 66 columns]
My desired output: take the overlapping 0.25 bins, average the weight column within each bin, and report one value per bin. So,
0.01 3
0.10 4
0.24 2
would become
0.25 3
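As a minimal sketch of the fixed-bin case (the column names mz and weight are hypothetical, and overlapping bins would additionally need a sliding window on top of this):
import pandas as pd
# one sample's values, matching the toy example above
sample = pd.DataFrame({'mz': [0.01, 0.10, 0.24], 'weight': [3, 4, 2]})
bins = pd.interval_range(start=0.0, end=0.5, freq=0.25)
binned = sample.groupby(pd.cut(sample['mz'], bins))['weight'].mean()
# the (0.0, 0.25] bin averages to 3.0, matching the desired output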

How to add a repeated column using pandas

I am doing my homework and I have encountered a problem. I have a large matrix; the first column, Y002, is a nominal variable with 3 levels, encoded as 1, 2, 3 respectively. The other two columns, V96 and V97, are just numeric.
Now I want to get the group means corresponding to the variable Y002. I wrote the code like this:
group = data2.groupby(by=["Y002"]).mean()
Then I index into it to get each group's mean using:
group1 = group["V96"]
group2 = group["V97"]
Now I want to append these group means as a new column of the original dataframe, so that each mean matches the corresponding Y002 code (1, 2, or 3). I tried this code, but it only produces NaN:
data2["group1"] = pd.Series(group1, index=data2.index)
I hope someone can help me with this, many thanks :)
PS: I hope this makes sense. In R, we can do the same thing using:
data2$group1 = with(data2, tapply(V97,Y002,mean))[data2$Y002]
But how can we implement this in Python and pandas?
You can use .transform()
import pandas as pd
import numpy as np
# your data
# ============================
np.random.seed(0)
df = pd.DataFrame({'Y002': np.random.randint(1,4,100), 'V96': np.random.randn(100), 'V97': np.random.randn(100)})
print(df)
V96 V97 Y002
0 -0.6866 -0.1478 1
1 0.0149 1.6838 2
2 -0.3757 0.9718 1
3 -0.0382 1.6077 2
4 0.3680 -0.2571 2
5 -0.0447 1.8098 3
6 -0.3024 0.8923 1
7 -2.2244 -0.0966 3
8 0.7240 -0.3772 1
9 0.3590 -0.5053 1
.. ... ... ...
90 -0.6906 1.5567 2
91 -0.6815 -0.4189 3
92 -1.5122 -0.4097 1
93 2.1969 1.1164 2
94 1.0412 -0.2510 3
95 -0.0332 -0.4152 1
96 0.0656 -0.6391 3
97 0.2658 2.4978 1
98 1.1518 -3.0051 2
99 0.1380 -0.8740 3
# processing
# ===========================
df['V96_mean'] = df.groupby('Y002')['V96'].transform(np.mean)
df['V97_mean'] = df.groupby('Y002')['V97'].transform(np.mean)
df
V96 V97 Y002 V96_mean V97_mean
0 -0.6866 -0.1478 1 -0.1944 0.0837
1 0.0149 1.6838 2 0.0497 -0.0496
2 -0.3757 0.9718 1 -0.1944 0.0837
3 -0.0382 1.6077 2 0.0497 -0.0496
4 0.3680 -0.2571 2 0.0497 -0.0496
5 -0.0447 1.8098 3 0.0053 -0.0707
6 -0.3024 0.8923 1 -0.1944 0.0837
7 -2.2244 -0.0966 3 0.0053 -0.0707
8 0.7240 -0.3772 1 -0.1944 0.0837
9 0.3590 -0.5053 1 -0.1944 0.0837
.. ... ... ... ... ...
90 -0.6906 1.5567 2 0.0497 -0.0496
91 -0.6815 -0.4189 3 0.0053 -0.0707
92 -1.5122 -0.4097 1 -0.1944 0.0837
93 2.1969 1.1164 2 0.0497 -0.0496
94 1.0412 -0.2510 3 0.0053 -0.0707
95 -0.0332 -0.4152 1 -0.1944 0.0837
96 0.0656 -0.6391 3 0.0053 -0.0707
97 0.2658 2.4978 1 -0.1944 0.0837
98 1.1518 -3.0051 2 0.0497 -0.0496
99 0.1380 -0.8740 3 0.0053 -0.0707
[100 rows x 5 columns]
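An alternative sketch that mirrors the R tapply-then-index idiom uses map on the group means:
# map each Y002 code to its group mean, analogous to the R one-liner
df['V96_mean'] = df['Y002'].map(df.groupby('Y002')['V96'].mean())
df['V97_mean'] = df['Y002'].map(df.groupby('Y002')['V97'].mean())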

Nested if loop with DataFrame is very, very slow

I have 10 million rows to go through, and it will take many hours to process. I must be doing something wrong.
I aliased my df columns for ease of typing:
Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']
The code below works fine, it's just ungodly slow. Any suggestions?
print(datetime.datetime.now().time())
for i in range(2, 1000):
    if delta[i] == 90:
        if delta[i-1] == 50:
            if delta[i-2] == 10:
                if expiry[i] == expiry[i-2]:
                    df.Skew[i] = ivMid[i] - ivMid[i-2]
print(datetime.datetime.now().time())
14:02:11.014396
14:02:13.834275
df.head(100)
Close eqId date IntDate expiry delta ivMid conf Skew
0 37.380005 7 2008-01-02 39447 1 50 0.3850 0.8663
1 37.380005 7 2008-01-02 39447 1 90 0.5053 0.7876
2 36.960007 7 2008-01-03 39448 1 50 0.3915 0.8597
3 36.960007 7 2008-01-03 39448 1 90 0.5119 0.7438
4 35.179993 7 2008-01-04 39449 1 50 0.4055 0.8454
5 35.179993 7 2008-01-04 39449 1 90 0.5183 0.7736
6 33.899994 7 2008-01-07 39452 1 50 0.4464 0.8400
7 33.899994 7 2008-01-07 39452 1 90 0.5230 0.7514
8 31.250000 7 2008-01-08 39453 1 10 0.4453 0.7086
9 31.250000 7 2008-01-08 39453 1 50 0.4826 0.8246
10 31.250000 7 2008-01-08 39453 1 90 0.5668 0.6474 0.1215
11 30.830002 7 2008-01-09 39454 1 10 0.4716 0.7186
12 30.830002 7 2008-01-09 39454 1 50 0.4963 0.8479
13 30.830002 7 2008-01-09 39454 1 90 0.5735 0.6704 0.1019
14 31.460007 7 2008-01-10 39455 1 10 0.4254 0.6737
15 31.460007 7 2008-01-10 39455 1 50 0.4929 0.8218
16 31.460007 7 2008-01-10 39455 1 90 0.5902 0.6411 0.1648
17 30.699997 7 2008-01-11 39456 1 10 0.4868 0.7183
18 30.699997 7 2008-01-11 39456 1 50 0.4965 0.8411
19 30.639999 7 2008-01-14 39459 1 10 0.5117 0.7620
20 30.639999 7 2008-01-14 39459 1 50 0.4989 0.8804
21 30.639999 7 2008-01-14 39459 1 90 0.5887 0.6845 0.077
22 29.309998 7 2008-01-15 39460 1 10 0.4956 0.7363
23 29.309998 7 2008-01-15 39460 1 50 0.5054 0.8643
24 30.080002 7 2008-01-16 39461 1 10 0.4983 0.6646
At this rate it will take 7.77 hrs to process
Basically, the whole point of numpy & pandas is to avoid loops like the plague and do things in a vectorized way. As you noticed, without that, speed is gone.
Let's break your problem into steps.
The Conditions
Here, your first condition can be written like this:
df.delta == 90
(Note how this compares the entire column at once. This is much, much faster than your loop!)
and the second one can be written like this (using shift):
df.delta.shift(1) == 50
The rest of your conditions are similar.
Note that to combine conditions, you need to use parentheses. So, the first two conditions, together, should be written as:
(df.delta == 90) & (df.delta.shift(1) == 50)
You should be able to now write an expression combining all your conditions. Let's call it cond, i.e.,
cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...
The assignment
To assign things to a new column, use
df['skew'] = ...
We just need to figure out what to put on the right-hand side.
The Right Hand Side
Since we have cond, we can write the right-hand-side as
np.where(cond, df.ivMid - df.ivMid.shift(2), 0)
What this says is: where the condition is true, take the second term; where it's not, take the third term (in this case I used 0, but use whatever you like).
By combining all of this, you should be able to write a very efficient version of your code.
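Putting the pieces together, a sketch of the full vectorized replacement (assuming the column names from the question):
import numpy as np
cond = ((df['delta'] == 90)
        & (df['delta'].shift(1) == 50)
        & (df['delta'].shift(2) == 10)
        & (df['expiry'] == df['expiry'].shift(2)))
# 0 as the fill value follows the answer; np.nan would leave non-matches blank
df['Skew'] = np.where(cond, df['ivMid'] - df['ivMid'].shift(2), 0)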
