I have df:
CU Parameters 1 2 3
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.3
I would like to create two separate dfs by grouping the index on level 0 (CU) and taking the mean of the column values:
df1: (379-H and 625-H)
Parameters 1 2 3
Output Energy, (Wh/h) 1.37 0.63 0.657
df2: (the rest)
Parameters 1 2 3
Output Energy, (Wh/h) 0.69 0.74 0.65
I can get the mean across all CUs by grouping on level 1:
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all').groupby(level=1).mean()
but how do I group these according to level 0?
SOLUTION:
lightsonly = ["379-H", "625-H"]
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all')
mask = df.index.get_level_values(0).isin(lightsonly)
df1 = df[mask].groupby(level=1).mean()
df2 = df[~mask].groupby(level=1).mean()
Use get_level_values + isin to build a True/False index, then take the mean and rename with a dict:
d = {True: '379-H and 625-H', False: 'the rest'}
df.index = df.index.get_level_values(0).isin(['379-H', '625-H'])
df = df.mean(level=0).rename(d)
print (df)
1 2 3
the rest 0.691 0.7485 0.650
379-H and 625-H 1.370 0.6395 0.657
For separate dfs it is also possible to use boolean indexing:
mask = df.index.get_level_values(0).isin(['379-H', '625-H'])
df1 = df[mask].mean().rename('379-H and 625-H').to_frame().T
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
df2 = df[~mask].mean().rename('the rest').to_frame().T
print (df2)
1 2 3
the rest 0.691 0.7485 0.65
Another NumPy solution with the DataFrame constructor:
a1 = df[mask].values.mean(axis=0)
#alternatively
#a1 = df.values[mask].mean(axis=0)
df1 = pd.DataFrame(a1.reshape(-1, len(a1)), index=['379-H and 625-H'], columns=df.columns)
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
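The same constructor pattern should give df2 for the remaining CUs, reusing mask from above:
# mean over the rows not in lightsonly, as a 1-row DataFrame
a2 = df[~mask].values.mean(axis=0)
df2 = pd.DataFrame(a2.reshape(-1, len(a2)), index=['the rest'], columns=df.columns)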
Consider the dataframe df where CU and Parameters are assumed to be in the index.
1 2 3
CU Parameters
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0.000
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.300
Then we can group by the truth values of whether the first-level values are in the list ['379-H', '625-H'].
m = {True: 'Main', False: 'Rest'}
l = ['379-H', '625-H']
g = df.index.get_level_values('CU').isin(l)
df.groupby(g).mean().rename(index=m)
1 2 3
Rest 0.691 0.7485 0.650
Main 1.370 0.6395 0.657
Use a lambda function to map the index to 2 groups and then group by the modified index.
df.groupby(by=lambda x:'379-H,625-H' if x[0] in ['379-H','625-H'] else 'Others').mean()
Out[22]:
1 2 3
379-H,625-H 1.370 0.6395 0.657
Others 0.691 0.7485 0.650
I have the following code. I am running a linear model on the dataframe 'x', with gender and highest education level achieved as categorical variables.
The aim is to assess how well age, gender and highest level of education achieved can predict 'weighteddistance'.
resultmodeldistancevariation2sleep = smf.ols(formula='weighteddistance ~ age + C(gender) + C(highest_education_level_acheived)',data=x).fit()
summarymodel = resultmodeldistancevariation2sleep.summary()
print(summarymodel)
This gives me the output:
0 1 2 3 4 5 6
0 coef std err t P>|t| [0.025 0.975]
1 Intercept 6.3693 1.391 4.580 0.000 3.638 9.100
2 C(gender)[T.2.0] 0.2301 0.155 1.489 0.137 -0.073 0.534
3 C(gender)[T.3.0] 0.0302 0.429 0.070 0.944 -0.812 0.872
4 C(highest_education_level_acheived)[T.3] 1.1292 0.501 2.252 0.025 0.145 2.114
5 C(highest_education_level_acheived)[T.4] 1.0876 0.513 2.118 0.035 0.079 2.096
6 C(highest_education_level_acheived)[T.5] 1.0692 0.498 2.145 0.032 0.090 2.048
7 C(highest_education_level_acheived)[T.6] 1.2995 0.525 2.476 0.014 0.269 2.330
8 C(highest_education_level_acheived)[T.7] 1.7391 0.605 2.873 0.004 0.550 2.928
However, I want to calculate the main effect of each categorical variable on distance, which is not shown in the model above, so I passed the model fit to an ANOVA using 'anova_lm'.
anovaoutput = sm.stats.anova_lm(resultmodeldistancevariation2sleep)
anovaoutput['PR(>F)'] = anovaoutput['PR(>F)'].round(4)
This gives me the following output, and, as I wanted, it does show the main effect of each categorical variable - gender and highest education level achieved - rather than the different groups within each variable (meaning that there is no gender[2.0] and gender[3.0] in the output below).
df sum_sq mean_sq F PR(>F)
C(gender) 2.0 4.227966 2.113983 5.681874 0.0036
C(highest_education_level_acheived) 5.0 11.425706 2.285141 6.141906 0.0000
age 1.0 8.274317 8.274317 22.239357 0.0000
Residual 647.0 240.721120 0.372057 NaN NaN
However, this output no longer shows me the confidence intervals or the coefficients for each variable.
In other words, I would like the bottom ANOVA table to have a 'coef' column and a '[0.025 0.975]' column like the first table.
How can I achieve this?
I would be so grateful for a helping hand!
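A minimal sketch of one possible approach, assuming the fitted result above: anova_lm reports a single F test per term, so per-level coefficients and confidence intervals have no natural row in that table, but you can keep the coefficient table from the fitted result alongside the ANOVA output.
import pandas as pd
# coefficients and 95% confidence intervals pulled from the fitted OLS result
coefs = resultmodeldistancevariation2sleep.params.rename('coef')
ci = resultmodeldistancevariation2sleep.conf_int().rename(columns={0: '[0.025', 1: '0.975]'})
coef_table = pd.concat([coefs, ci], axis=1)
print(coef_table)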
I have this file called 'test.txt' and it looks like this:
3.H5 5.40077
2.H8 7.75894
3.H6 7.60437
3.H5 5.40001
5.H5 5.70502
4.H8 7.55438
5.H1' 5.43574
5.H6 7.96472
""
""
""
""
""
""
6.H6 7.96178
6.H5 5.71068
""
""
7.H8 8.29385
7.H1' 6.01136
""
""
""
""
8.H5 5.51053
8.H6 7.67437
I want to see if the values in the first column are the same (i.e. if 8.H5 occurs more than once), and if they are, I want to count how many times they occur and take the average of their values. I want my output to look like this:
Atom nVa predppm avgppm stdev delta QPred QMulti qTotal
1.H1' 1 5.820 5.737 0.000 0.000 0.985 1.000 0.995
2.H1' 1 5.903 5.892 0.000 0.000 0.998 1.000 0.999
3.H1' 1 5.549 5.454 0.000 0.000 0.983 1.000 0.994
4.H1' 1 5.741 5.737 0.000 0.000 0.999 1.000 1.000
6.H1' 1 5.543 5.600 0.000 0.000 0.990 1.000 0.997
8.H1' 1 5.363 5.359 0.000 0.000 0.999 1.000 1.000
10.H1' 1 5.378 5.408 0.000 0.000 0.995 1.000 0.998
11.H1' 1 5.501 5.497 0.000 0.000 0.999 1.000 1.000
14.H1' 1 5.962 5.893 0.000 0.000 0.988 1.000 0.996
Right now, my code reads from test.txt and computes the count and the mean of the values and gives an output which looks like this (output.txt):
Atom nVa avgppm
1.H1' 1 5.737
2.H1' 1 5.892
3.H1' 1 5.454
4.H1' 1 5.737
6.H1' 1 5.600
But it does not account for the "" rows. How can I get my code to skip the lines that contain ""?
I also have a file called test2.txt which looks like this:
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
8.H6 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
12.H6 7.72805 0.3
13.H6 8.02977 0.3
14.H6 7.69624 0.3
17.H8 7.24899 0.3
16.H8 8.27957 0.3
18.H6 7.6439 0.3
19.H8 7.65501 0.3
20.H8 7.78512 0.3
21.H8 8.06057 0.3
22.H8 7.47677 0.3
23.H6 7.7306 0.3
24.H6 7.80104 0.3
I want to read the values from the first column of test.txt and the values from the first column of test2.txt and see if they are the same (i.e. if 20.H8 = 20.H8). If they are, I want to insert a column into my output.txt between the nVa column and the avgppm column and fill it with the values from test2.txt. How can I insert a column into this output file while also skipping the blank lines?
This is my current code:
import pandas as pd
import os
import sys
test = 'test.txt'
test2 = 'test2.txt'
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
gb = (df.groupby("Atom", as_index=False)
.agg({"ppm":["count","mean"]})
.rename(columns={"count":"nVa", "mean":"avgppm"}))
gb.head()
gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"":"Atom"})
gb.to_csv("output.txt", sep =" ", index=False)
df2 = pd.read_csv(test2, sep = r'\s+', header = None)
df2.columns = ["Atoms","ppms","error"]
shift1 = df2["Atoms"]
shift2 = df2["ppms"]
I'm not exactly sure how to proceed.
To drop the rows with "" as the values, use the dropna method of the data frame. You can follow this with reset_index to reset the row numbering.
df = pd.read_csv(test, sep = ' ', header = None)
df.columns = ["Atom","ppm"]
df = df.dropna().reset_index(drop=True)
gb = ...
To find matching values, you can use the merge method and compare the columns of interest.
df2 = pd.read_csv(test2, sep = r'\s+', header = None)
df2.columns = ["Atoms","ppms","error"]
gb.merge(df2, left_on='Atom', right_on='Atoms', how='left').drop(['Atoms','ppms'], axis=1)
This will leave you with NA values if the value in gb is not in df2.
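A minimal follow-up sketch, assuming the ppms values from test2.txt are the ones you want inserted: reorder the merged columns so the new column sits between nVa and avgppm before writing output.txt.
merged = gb.merge(df2, left_on='Atom', right_on='Atoms', how='left')
merged = merged[['Atom', 'nVa', 'ppms', 'avgppm']]  # place the test2.txt values between nVa and avgppm
merged.to_csv('output.txt', sep=' ', index=False)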
A left merge() should be able to bring df and df2 together the way you want.
df = pd.read_csv("test.txt", sep=" ", header=None, names=["Atom", "ppm"])
df2 = pd.read_csv("test2.txt", sep=" ", header=None, names=["Atom", "ppms", "error"])
gb = df.groupby("Atom").agg(["count", "mean"])
gb.merge(df2.set_index("Atom"), how="left", left_index=True, right_index=True)
(ppm, count) (ppm, mean) ppms error
Atom
2.H8 1 7.75894 NaN NaN
3.H5 2 5.40039 NaN NaN
3.H6 1 7.60437 NaN NaN
4.H8 1 7.55438 NaN NaN
5.H1' 1 5.43574 NaN NaN
5.H5 1 5.70502 NaN NaN
5.H6 1 7.96472 7.72158 0.3
6.H5 1 5.71068 NaN NaN
6.H6 1 7.96178 7.70272 0.3
7.H1' 1 6.01136 NaN NaN
7.H8 1 8.29385 8.16859 0.3
8.H5 1 5.51053 NaN NaN
8.H6 1 7.67437 7.65014 0.3
Note: It doesn't seem that you even need dropna() for the missing rows in df. read_csv() interprets the "" values as NaN, and groupby() ignores NaN when grouping.
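A quick check of that behaviour, assuming the file contents shown above:
df = pd.read_csv("test.txt", sep=" ", header=None, names=["Atom", "ppm"])
print(df["Atom"].isnull().sum())         # the "" lines are parsed as NaN
print(df.groupby("Atom").size().head())  # NaN never appears as a group key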
I have a pandas data frame as follows:
request_id crash_id counter num_acc_x num_acc_y num_acc_z
745109.0 670140638.0 0 0.010 0.000 -0.045
745109.0 670140638.0 1 0.016 -0.006 -0.034
745109.0 670140638.0 2 0.016 -0.006 -0.034
My id vars are "request_id" and "crash_id"; the target vars are num_acc_x, num_acc_y and num_acc_z.
I would like to create a new DataFrame where the target vars are reshaped to wide format, that is, adding max(counter)*3 new vars like num_acc_x_0, num_acc_x_1, ..., num_acc_y_0, num_acc_y_1, ..., num_acc_z_0, num_acc_z_1, preferably without ending up with a pivot table as the final result (I would like a true DataFrame, as in R).
Thanks in advance for your attention.
I think you need set_index with unstack, and last, create the column names from the MultiIndex with map:
df = df.set_index(['request_id','crash_id','counter']).unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
Another solution, aggregating duplicates with pivot_table:
df = df.pivot_table(index=['request_id','crash_id'], columns='counter', aggfunc='mean')
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
df = df.groupby(['request_id','crash_id','counter']).mean().unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
I need to create some new columns based on the value of a dataframe field and a lookup dataframe with some rates.
Having df1 as
zone hh hhind
0 14 112.0 3.4
1 15 5.0 4.4
2 16 0.0 1.0
and a look_up df as
ind per1 per2 per3 per4
0 1.0 1.000 0.000 0.000 0.000
24 3.4 0.145 0.233 0.165 0.457
34 4.4 0.060 0.114 0.075 0.751
How can I compute df1.hh1 by multiplying df1.hh by look_up.per1, matching df1.hhind to look_up.ind?
zone hh hhind hh1
0 14 112.0 3.4 16.240
1 15 5.0 4.4 0.300
2 16 0.0 1.0 0.000
At the moment I'm getting the result by merging the tables and then doing the arithmetic.
r = pd.merge(df1, look_up, left_on="hhind", right_on="ind")
r["hh1"] = r.hh *r.per1
I'd like to know if there is a more direct way to accomplish this without merging the tables.
You could first set hhind and ind as the index of the df1 and look_up dataframes respectively, then multiply hh and per1 element-wise.
Map these results onto the hhind column and assign them to a new column, as shown:
mapper = df1.set_index('hhind')['hh'].mul(look_up.set_index('ind')['per1'])
df1.assign(hh1=df1['hhind'].map(mapper))
Another solution:
df1['hh1'] = df1['hhind'].map(lambda x: look_up.loc[look_up["ind"] == x, "per1"].iloc[0]) * df1['hh']
I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key, we have a row for each of the 5000 possible dates.
There are 200 key_id * 5000 dates = 1,000,000 rows.
Using different variables var1, ..., var4, I compute a value for each row, and I want to extract the top 20 dates with best value for each key_id, and then compute the popularity of the set of variables used.
In the end, I want to find the variables which optimize this popularity.
def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i,j,k,l in product(xrange(10),xrange(10),xrange(10),xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)
optimize(my_dataset, 20)
I need to improve performance.
Here is a %prun output after running params_to_score 49 times:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into small dataframes by key_id to improve the sort time, since I only want the top 20 dates with the best value for each key_id, so sorting by key is just there to separate the different keys.
But I would appreciate any advice on how to improve the efficiency of this code, as I would need to run params_to_score thousands of times.
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the two following functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id','value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100000 rows, to_bench2 performs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls compared to before, but I do not know where they come from...
The way to make this significantly faster is as follows.
Create some sample data:
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define the value-computing function like your compute_value_col:
In [149]: def f(p):
   .....:     return DataFrame({ 'A' : df['A']*p, 'B' : df.B })
   .....:
These are the cases (you probably want a list of tuples here), e.g. the Cartesian product of all of the cases that you want to feed into the above function:
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (though it depends a bit on the group density). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
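A minimal sketch of how this pattern might map back to the original frame, shown for a small batch of parameter tuples and memory permitting; column names follow the question, my_dataset is assumed to be the original 1,000,000-row frame with a default RangeIndex, and nsmallest mirrors the ascending sort + head in params_to_score.
import pandas as pd
from itertools import product

def add_value(df, p):
    # hypothetical helper mirroring compute_value_col for one (val1..val4) tuple
    v1, v2, v3, v4 = p
    out = df[['key_id', 'pop']].copy()
    out['value'] = (df['price'] + v1 * df['par1'] + v2 * df['par2']
                    + v3 * df['par3'] + v4 * df['par4'])
    return out

parms = list(product([0, 10], repeat=4))   # small batch; the full 10**4 grid would need chunking

# broadcast: one frame holding every (parm, row) combination
big = pd.concat([add_value(my_dataset, p) for p in parms],
                keys=range(len(parms)), names=['parm', 'indexer']).reset_index('parm')

# one grouped partial sort instead of thousands of full sorts
scores = (big.set_index('pop')
             .groupby(['parm', 'key_id'])['value']
             .nsmallest(20)
             .reset_index()
             .groupby('parm')['pop'].sum())
# scores is indexed by position in parms; inspect it to pick the best tuple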