Create a rolling custom EWMA on a pandas dataframe - python

I am trying to create a rolling EWMA with the following decay= 1-ln(2)/3 on the last 13 values of a df such has :
factor
Out[36]:
EWMA
0 0.043
1 0.056
2 0.072
3 0.094
4 0.122
5 0.159
6 0.207
7 0.269
8 0.350
9 0.455
10 0.591
11 0.769
12 1.000
I have a df of monthly returns like this :
change.tail(5)
Out[41]:
date
2016-04-30 0.033 0.031 0.010 0.007 0.014 -0.006 -0.001 0.035 -0.004 0.020 0.011 0.003
2016-05-31 0.024 0.007 0.017 0.022 -0.012 0.034 0.019 0.001 0.006 0.032 -0.002 0.015
2016-06-30 -0.027 -0.004 -0.060 -0.057 -0.001 -0.096 -0.027 -0.096 -0.034 -0.024 0.044 0.001
2016-07-31 0.063 0.036 0.048 0.068 0.053 0.064 0.032 0.052 0.048 0.013 0.034 0.036
2016-08-31 -0.004 0.012 -0.005 0.009 0.028 0.005 -0.002 -0.003 -0.001 0.005 0.013 0.003
I am just trying to apply this rolling EWMA to each columns. I know that pandas has a EWMA method but I can't figure out how to pass the right 1-ln(2)/3 factor.
help would be appreciated! thanks!

#piRSquared 's answer is a good approximation, but values outside the last 13 also have weightings (albeit tiny), so it's not totally correct.
pandas could do rolling window calculations. However, amongst all the rolling function it supports, ewm is not one of them, which means we have to implement our own.
Assuming series is our time series to average:
from functools import partial
import numpy as np
window = 13
alpha = 1-np.log(2)/3 # This is ewma's decay factor.
weights = list(reversed([(1-alpha)**n for n in range(window)]))
ewma = partial(np.average, weights=weights)
rolling_average = series.rolling(window).apply(ewma)

use ewm with mean()
df.ewm(halflife=1 - np.log(2) / 3).mean()

Related

Input Variables With Inconsistent Numbers of Samples for Polynomial Regression

trying to do polynomial regression and having some trouble fitting the model.
Getting
ValueError: Found input variables with inconsistent numbers of samples: [1040, 260]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = BTCdata.iloc[:, [1, 2, 4, 5]]
y = BTCdata.iloc[:,3]
x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
model = LinearRegression()
model.fit(x_, y)
The problem comes from this line:
x = np.array(x).reshape((-1, 1))
By doing that you are transforming a dataframe of n rows and m columns into an array of n x m rows and 1 column. In your example, x ends up having 260 x 4 = 1040 rows whereas y has 260, raising this error.
If your goal is to convert your data to numpy arrays before using it in a model, then you can simply do:
x = x.to_numpy()
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
#
BTCdata = pd.read_excel('BitcoinRegression.xlsx', sheet_name='FinalBTC')
x = BTCdata.iloc[:, [1, 2, 4, 5]]
print(x.shape)
y = BTCdata.iloc[:,3]
print(y.shape)
#x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
#model = LinearRegression()
#model.fit(x_, y)
mod = sm.OLS(y, x_).fit()
mod.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: BTC R-squared: 0.886
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 46.86
Date: Wed, 17 Mar 2021 Prob (F-statistic): 2.63e-85
Time: 20:49:58 Log-Likelihood: -2299.3
No. Observations: 260 AIC: 4675.
Df Residuals: 222 BIC: 4810.
Df Model: 37
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0089 0.019 -0.468 0.640 -0.046 0.028
x2 0.0033 0.004 0.797 0.426 -0.005 0.012
x3 2.621e-05 3.55e-05 0.737 0.462 -4.38e-05 9.62e-05
x4 0.0005 0.001 0.789 0.431 -0.001 0.002
x5 -0.0238 0.067 -0.355 0.723 -0.156 0.108
x6 0.0790 0.688 0.115 0.909 -1.277 1.435
x7 0.0942 0.131 0.722 0.471 -0.163 0.352
x8 0.9679 1.276 0.759 0.449 -1.546 3.482
x9 0.0184 0.133 0.139 0.890 -0.243 0.280
x10 0.0093 0.013 0.726 0.469 -0.016 0.035
x11 0.0957 0.125 0.766 0.444 -0.150 0.342
x12 0.0001 0.000 0.864 0.389 -0.000 0.000
x13 0.0008 0.001 0.599 0.550 -0.002 0.003
x14 0.0207 0.026 0.783 0.435 -0.031 0.073
x15 3.594e-05 2.89e-05 1.245 0.214 -2.09e-05 9.28e-05
x16 -0.0004 0.001 -0.496 0.621 -0.002 0.001
x17 0.0158 0.010 1.621 0.106 -0.003 0.035
x18 -0.0068 0.002 -2.945 0.004 -0.011 -0.002
x19 -0.0014 0.007 -0.202 0.840 -0.015 0.012
x20 -0.0389 0.086 -0.454 0.650 -0.208 0.130
x21 0.1104 0.043 2.558 0.011 0.025 0.195
x22 0.7337 0.819 0.896 0.371 -0.881 2.348
x23 -1.4583 0.432 -3.378 0.001 -2.309 -0.607
x24 0.0601 0.031 1.913 0.057 -0.002 0.122
x25 0.0192 0.021 0.893 0.373 -0.023 0.061
x26 0.0403 0.091 0.445 0.657 -0.138 0.219
x27 -0.5110 0.224 -2.284 0.023 -0.952 -0.070
x28 0.0697 0.078 0.892 0.374 -0.084 0.224
x29 -0.1316 0.039 -3.397 0.001 -0.208 -0.055
x30 0.0054 0.103 0.052 0.958 -0.198 0.209
x31 0.0003 0.000 0.951 0.343 -0.000 0.001
x32 0.0060 0.007 0.856 0.393 -0.008 0.020
x33 -0.0124 0.012 -1.078 0.282 -0.035 0.010
x34 0.3317 0.394 0.842 0.400 -0.444 1.108
x35 -4.886e-09 1.1e-09 -4.439 0.000 -7.05e-09 -2.72e-09
x36 1.387e-07 3.68e-08 3.767 0.000 6.62e-08 2.11e-07
x37 5.106e-07 3.44e-06 0.148 0.882 -6.28e-06 7.3e-06
x38 4.652e-07 2.91e-07 1.601 0.111 -1.07e-07 1.04e-06
x39 -1.623e-06 5.17e-07 -3.138 0.002 -2.64e-06 -6.04e-07
x40 -8.446e-05 9.05e-05 -0.933 0.352 -0.000 9.39e-05
x41 -8.729e-06 7.38e-06 -1.182 0.238 -2.33e-05 5.82e-06
x42 -0.0017 0.002 -0.804 0.422 -0.006 0.002
x43 0.0007 0.000 1.705 0.090 -0.000 0.001
x44 -1.815e-05 2.11e-05 -0.862 0.390 -5.96e-05 2.33e-05
x45 9.562e-06 3.43e-06 2.788 0.006 2.8e-06 1.63e-05
x46 0.0012 0.001 1.413 0.159 -0.000 0.003
x47 5.405e-05 6.5e-05 0.831 0.407 -7.41e-05 0.000
x48 0.0069 0.044 0.156 0.876 -0.080 0.093
x49 -0.0078 0.006 -1.414 0.159 -0.019 0.003
x50 0.0001 0.000 0.307 0.759 -0.001 0.001
x51 0.1505 0.090 1.669 0.096 -0.027 0.328
x52 0.1555 0.046 3.410 0.001 0.066 0.245
x53 -0.0296 0.024 -1.210 0.227 -0.078 0.019
x54 0.0016 0.001 2.182 0.030 0.000 0.003
x55 -2.28e-05 8.77e-06 -2.600 0.010 -4.01e-05 -5.52e-06
x56 -0.0045 0.003 -1.594 0.112 -0.010 0.001
x57 -0.0002 0.000 -0.947 0.344 -0.001 0.000
x58 -0.0067 0.237 -0.028 0.977 -0.474 0.461
x59 0.0134 0.021 0.629 0.530 -0.029 0.055
x60 0.0020 0.002 1.123 0.262 -0.002 0.006
x61 0.0277 0.016 1.689 0.093 -0.005 0.060
x62 -0.3824 0.413 -0.926 0.355 -1.196 0.431
x63 0.3528 0.179 1.970 0.050 -0.000 0.706
x64 -0.0282 0.005 -5.708 0.000 -0.038 -0.018
x65 -0.0002 0.000 -0.695 0.488 -0.001 0.000
x66 0.0098 0.009 1.142 0.255 -0.007 0.027
x67 0.0901 0.103 0.873 0.384 -0.113 0.293
x68 -0.1941 0.648 -0.300 0.765 -1.471 1.083
x69 0.0237 0.021 1.128 0.261 -0.018 0.065
==============================================================================
Omnibus: 127.728 Durbin-Watson: 0.552
Prob(Omnibus): 0.000 Jarque-Bera (JB): 851.418
Skew: 1.861 Prob(JB): 1.31e-185
Kurtosis: 11.046 Cond. No. 4.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4e+16. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Odds Ratios in MN Logit regression in stats model

I have this Multi Numinal regression model done by statsmodel:
writer = pd.ExcelWriter(path=os.path.join(export_path, f'regression.xlsx'), engine='xlsxwriter')
vars_matrix_df = pd.read_csv(data_path, skipinitialspace=True)
corr_cols = ['sales_vs_service', 'agent_experience', 'minutes_passed_since_shift_started', 'stage_in_conv',
'current_cust_wait_time', 'prev_cust_line_words', 'total_cust_words_in_conv',
'agent_total_turns', 'sentiment_score', 'max_sentiment', 'min_sentiment', 'last_sentiment',
'agent_response_time', 'customer_response_rate', 'is_last_cust_answered',
'conversation_opening', 'queue_length', 'total_lines_from_rep',
'agent_number_of_conversations', 'concurrency', 'rep_shift_start_time', 'first_cust_line_num_of_words',
'queue_wait_time', 'day_of_week', 'time_of_day']
reg_equation = st.formula.mnlogit(f'visitor_was_answered ~C(day_of_week)+C(time_of_day)+{"+".join(corr_cols)} ',
vars_matrix_df).fit()
the reg results:
visitor_was_answered=1 coef std err z P>|z| \
0 C(time_of_day)[T.10] 0.0071 1910000.000 3.700000e-09 1.000
1 C(time_of_day)[T.11] 0.0067 698000.000 9.600000e-09 1.000
2 C(time_of_day)[T.12] 0.0016 1790000.000 9.200000e-10 1.000
3 C(time_of_day)[T.13] 0.0031 561000.000 5.570000e-09 1.000
4 C(time_of_day)[T.14] 0.0037 1310000.000 2.840000e-09 1.000
5 C(time_of_day)[T.15] 0.0011 548000.000 2.020000e-09 1.000
6 C(time_of_day)[T.17] 0.0044 814000.000 5.440000e-09 1.000
7 C(time_of_day)[T.18] 0.0009 1100000.000 8.270000e-10 1.000
8 C(time_of_day)[T.19] 0.0047 835000.000 5.640000e-09 1.000
9 C(time_of_day)[T.20] 0.0009 1140000.000 8.100000e-10 1.000
10 time_of_day[T.10] 0.0071 1930000.000 3.670000e-09 1.000
11 time_of_day[T.11] 0.0067 686000.000 9.770000e-09 1.000
12 time_of_day[T.12] 0.0016 1800000.000 9.150000e-10 1.000
13 time_of_day[T.13] 0.0031 556000.000 5.620000e-09 1.000
14 time_of_day[T.14] 0.0037 1240000.000 3.010000e-09 1.000
15 time_of_day[T.15] 0.0011 638000.000 1.740000e-09 1.000
16 time_of_day[T.17] 0.0044 1010000.000 4.400000e-09 1.000
17 time_of_day[T.18] 0.0009 1130000.000 8.020000e-10 1.000
18 time_of_day[T.19] 0.0047 860000.000 5.480000e-09 1.000
19 time_of_day[T.20] 0.0009 1120000.000 8.270000e-10 1.000
20 sales_vs_service -0.0448 0.006 -8.102000e+00 0.000
21 agent_experience -0.0414 0.008 -4.955000e+00 0.000
22 current_cust_wait_time -39.1333 0.414 -9.457400e+01 0.000
23 prev_cust_line_words 20.0439 0.236 8.494600e+01 0.000
24 agent_total_turns 0.1110 0.038 2.949000e+00 0.003
25 sentiment_score -4.3454 0.157 -2.759000e+01 0.000
26 agent_response_time -118.0821 2.205 -5.354600e+01 0.000
27 customer_response_rate -7.0865 0.630 -1.125500e+01 0.000
28 is_last_cust_answered -0.2537 0.005 -4.860800e+01 0.000
29 conversation_opening -0.4533 0.006 -7.206300e+01 0.000
30 queue_length -1.5427 0.018 -8.642700e+01 0.000
31 agent_number_of_conversations 0.0013 0.018 7.300000e-02 0.941
32 first_cust_line_num_of_words -3.7545 0.123 -3.056900e+01 0.000
33 queue_wait_time -0.3308 0.166 -1.997000e+00 0.046
To this regression, I want to add the odds ratio values of each variable. I think that the coefficients are already odds ratio but I didn't find any proof to that. Any idea how this can be done? and what are the coefficients represent here?
Thanks!

Replicating rows based on index multiplies the rows instead of replicating

I have dataframe, where i would like to replicate few rows:
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
I want to replicate 30DD 30 times and 20DD 20 times and 10DD 10 times with same index name.
I tried this, instead of replicating it multiplies
for i in range(4):
test1 = df.append(df.ix['30DD']*30)
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
30DD 900 0.918 0 330
Add new rows, but subtract 1, because append to original DataFrame:
vals = ['30DD'] * 29 + ['20DD'] * 19 + ['10DD'] * 9
df = df.append(df.loc[vals])
Last if want sorting values by numbers of index values:
df = df.iloc[df.index.str.extract('(\d+)').astype(int).squeeze().argsort()]
Using numpy.repeat, you can create a list of indices for rows you wish to append. Then feed to pd.DataFrame.loc and append to your original dataframe.
vals = ['30DD', '20DD', '10DD']
counts = [30, 20, 10]
df = df.append(df.loc[np.repeat(vals, counts)])

Improve Pandas Merge performance

I specifically dont have performace issue with Pands Merge, as other posts suggest, but I've a class in which there are lot of methods, which does a lot of merge on datasets.
The class has around 10 group by and around 15 merge. While groupby is pretty fast, out of total execution time of 1.5 seconds for class, around 0.7 seconds goes in those 15 merge calls.
I want to speed up performace in those merge calls. As I will have around 4000 iterations, hence saving .5 seconds overall in single iteration will lead to overall performance reduction by around 30min, which will be great.
Any suggestions I should try? I tried:
Cython
Numba, and Numba was slower.
Thanks
Edit 1:
Adding sample code snippets:
My merge statements:
tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')
And, by implementing Joins, I incorporate the following satatements:
dat = self.data.set_index('APPT_NBR')
t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)
tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')
tmpDf.reset_index(inplace=True)
Note, all are part of a function named: def merge_earlier_created_values(self):
And, when I did timedcall from profilehooks by following:
#timedcall(immediate=True)
def merge_earlier_created_values(self):
I get following results:
The result of profiling of that method gives:
#profile(immediate=True)
def merge_earlier_created_values(self):
The profiling of function, by using Merge is as follows:
*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times
71665 function calls (70588 primitive calls) in 0.524 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.012 0.012 0.524 0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
14 0.000 0.000 0.285 0.020 generic.py:1901(_update_inplace)
14 0.000 0.000 0.285 0.020 generic.py:1402(_maybe_update_cacher)
19 0.000 0.000 0.284 0.015 generic.py:1492(_check_setitem_copy)
7 0.283 0.040 0.283 0.040 {built-in method gc.collect}
15 0.000 0.000 0.181 0.012 generic.py:1842(drop)
10 0.000 0.000 0.153 0.015 merge.py:26(merge)
10 0.000 0.000 0.140 0.014 merge.py:201(get_result)
8/4 0.000 0.000 0.126 0.031 decorators.py:65(wrapper)
4 0.000 0.000 0.126 0.031 frame.py:3028(drop_duplicates)
1 0.000 0.000 0.102 0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
1 0.000 0.000 0.101 0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
1 0.000 0.000 0.098 0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
10 0.000 0.000 0.092 0.009 internals.py:4455(concatenate_block_managers)
10 0.001 0.000 0.088 0.009 internals.py:4471(<listcomp>)
120 0.001 0.000 0.084 0.001 internals.py:4559(concatenate_join_units)
266 0.004 0.000 0.067 0.000 common.py:733(take_nd)
120 0.000 0.000 0.061 0.001 internals.py:4569(<listcomp>)
120 0.003 0.000 0.061 0.001 internals.py:4814(get_reindexed_values)
1 0.000 0.000 0.059 0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
10 0.000 0.000 0.038 0.004 merge.py:322(_get_join_info)
10 0.001 0.000 0.036 0.004 merge.py:516(_get_join_indexers)
25 0.001 0.000 0.024 0.001 merge.py:687(_factorize_keys)
74 0.023 0.000 0.023 0.000 {pandas.algos.take_2d_axis1_object_object}
50 0.022 0.000 0.022 0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
120 0.003 0.000 0.022 0.000 internals.py:4479(get_empty_dtype_and_na)
88 0.000 0.000 0.021 0.000 frame.py:1969(__getitem__)
1 0.000 0.000 0.019 0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
39 0.000 0.000 0.018 0.000 internals.py:3495(reindex_indexer)
537 0.017 0.000 0.017 0.000 {built-in method numpy.core.multiarray.empty}
15 0.000 0.000 0.017 0.001 ops.py:725(wrapper)
15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array)
24 0.000 0.000 0.014 0.001 internals.py:3625(take)
10 0.000 0.000 0.014 0.001 merge.py:157(__init__)
10 0.000 0.000 0.014 0.001 merge.py:382(_get_merge_keys)
15 0.008 0.001 0.013 0.001 ops.py:662(na_op)
234 0.000 0.000 0.013 0.000 common.py:158(isnull)
234 0.001 0.000 0.013 0.000 common.py:179(_isnull_new)
15 0.000 0.000 0.012 0.001 generic.py:1609(take)
20 0.000 0.000 0.012 0.001 generic.py:2191(reindex)
The profiling by using Joins is as follows:
65079 function calls (63990 primitive calls) in 0.550 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.016 0.016 0.550 0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
14 0.000 0.000 0.295 0.021 generic.py:1901(_update_inplace)
14 0.000 0.000 0.295 0.021 generic.py:1402(_maybe_update_cacher)
19 0.000 0.000 0.294 0.015 generic.py:1492(_check_setitem_copy)
7 0.293 0.042 0.293 0.042 {built-in method gc.collect}
10 0.000 0.000 0.173 0.017 generic.py:1842(drop)
10 0.000 0.000 0.139 0.014 merge.py:26(merge)
8/4 0.000 0.000 0.138 0.034 decorators.py:65(wrapper)
4 0.000 0.000 0.138 0.034 frame.py:3028(drop_duplicates)
10 0.000 0.000 0.132 0.013 merge.py:201(get_result)
5 0.000 0.000 0.122 0.024 frame.py:4324(join)
5 0.000 0.000 0.122 0.024 frame.py:4371(_join_compat)
1 0.000 0.000 0.111 0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
1 0.000 0.000 0.103 0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
1 0.000 0.000 0.099 0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
10 0.000 0.000 0.093 0.009 internals.py:4455(concatenate_block_managers)
10 0.001 0.000 0.089 0.009 internals.py:4471(<listcomp>)
100 0.001 0.000 0.085 0.001 internals.py:4559(concatenate_join_units)
205 0.003 0.000 0.068 0.000 common.py:733(take_nd)
100 0.000 0.000 0.060 0.001 internals.py:4569(<listcomp>)
100 0.001 0.000 0.060 0.001 internals.py:4814(get_reindexed_values)
1 0.000 0.000 0.056 0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
10 0.000 0.000 0.033 0.003 merge.py:322(_get_join_info)
52 0.031 0.001 0.031 0.001 {pandas.algos.take_2d_axis1_object_object}
5 0.000 0.000 0.030 0.006 base.py:2329(join)
37 0.001 0.000 0.027 0.001 internals.py:2754(apply)
6 0.000 0.000 0.024 0.004 frame.py:2763(set_index)
7 0.000 0.000 0.023 0.003 merge.py:516(_get_join_indexers)
2 0.000 0.000 0.022 0.011 base.py:2483(_join_non_unique)
7 0.000 0.000 0.021 0.003 generic.py:2950(copy)
7 0.000 0.000 0.021 0.003 internals.py:3046(copy)
84 0.000 0.000 0.020 0.000 frame.py:1969(__getitem__)
19 0.001 0.000 0.019 0.001 merge.py:687(_factorize_keys)
100 0.002 0.000 0.019 0.000 internals.py:4479(get_empty_dtype_and_na)
1 0.000 0.000 0.018 0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
15 0.000 0.000 0.017 0.001 ops.py:725(wrapper)
34 0.001 0.000 0.017 0.000 internals.py:3495(reindex_indexer)
83 0.004 0.000 0.016 0.000 internals.py:3211(_consolidate_inplace)
68 0.015 0.000 0.015 0.000 {method 'copy' of 'numpy.ndarray' objects}
15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array)
As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.
Thanks
set_index on merging column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's Answer.
import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))
%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
#1 loop, best of 3: 664 ms per loop
%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
#1 loop, best of 3: 354 ms per loop
%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
#Wall time: 16 ms
%%timeit
x = df1.join(df2, how='left')
#10 loops, best of 3: 80.4 ms per loop
When the column to be joined has integers not in the same order on both tables you can still expect a great speed up of 8 times.
I suggest that you set your merge columns as index, and use df1.join(df2) instead of merge, it's much faster.
Here's some example including profiling:
In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))
Here's a regular left merge on A and A2:
In [2]: %%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
1 loop, best of 3: 441 ms per loop
Here's the same, using join:
In [3]: %%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
1 loop, best of 3: 184 ms per loop
Now obviously if you can set the index before looping, the gain in terms of time will be much greater:
# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms
Then in the loop, you'll get something that in this case is 30 times faster:
In [5]: %%timeit
x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop
I don't know if this deserved a new answer but personally, the following tricks helped me improve a bit more the joins I had to do on big DataFrames (millions of rows and hundreds of columns):
Beside using set_index(index, inplace=True), you may want to sort it using sort_index(inplace=True). This speeds up a lot the join if your index is not ordered.
For example, creating the DataFrames with
import random
import pandas as pd
import numpy as np
nbre_items = 100000
ids = np.arange(nbre_items)
random.shuffle(ids)
df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)
random.shuffle(ids)
df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)
I got the following results:
%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And after sorting the index (which takes a limited amount of time):
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can split one of your DataFrames in multiple ones with fewer columns. This trick gave me mixed results so be cautious when using it.
For example:
for i in range(0, df2.shape[1], 100):
df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')

How do I plot this DataFrame?

I've got a pandas.DataFrame that looks like this:
>>> print df
0 1 2 3 4 5 6 7 8 9 10 11 \
0 0.198 0.198 0.266 0.198 0.236 0.199 0.198 0.198 0.199 0.199 0.199 0.198
1 0.032 0.034 0.039 0.405 0.442 0.382 0.343 0.311 0.282 0.255 0.232 0.210
2 0.702 0.702 0.742 0.709 0.755 0.708 0.708 0.712 0.707 0.706 0.706 0.706
3 0.109 0.112 0.114 0.114 0.128 0.532 0.149 0.118 0.115 0.114 0.114 0.112
4 0.309 0.306 0.311 0.311 0.316 0.513 1.977 0.313 0.311 0.310 0.311 0.309
5 0.280 0.277 0.282 0.278 0.282 0.383 1.122 1.685 0.280 0.280 0.282 0.280
6 0.466 0.460 0.465 0.465 0.468 0.508 0.829 1.100 1.987 0.465 0.465 0.463
7 0.469 0.464 0.469 0.470 0.469 0.490 0.648 0.783 1.095 2.002 0.469 0.466
8 0.137 0.120 0.137 0.138 0.137 0.136 0.144 0.149 0.166 0.209 0.137 0.136
9 0.125 0.107 0.125 0.126 0.125 0.122 0.126 0.128 0.132 0.144 0.125 0.123
10 0.125 0.106 0.125 0.123 0.123 0.122 0.125 0.128 0.132 0.142 0.125 0.123
11 0.127 0.107 0.125 0.125 0.125 0.122 0.126 0.127 0.132 0.142 0.125 0.123
12 0.125 0.107 0.125 0.128 0.125 0.123 0.126 0.127 0.132 0.142 0.125 0.122
13 0.871 0.862 0.871 0.872 0.872 0.872 0.873 0.872 0.875 0.880 0.873 0.872
14 0.114 0.115 0.116 0.117 0.131 0.536 0.153 0.123 0.118 0.117 0.117 0.116
15 0.033 0.032 0.031 0.032 0.032 0.040 0.033 0.033 0.032 0.032 0.032 0.032
12 13
0 0.198 0.198
1 0.190 0.172
2 0.705 0.705
3 0.112 0.115
4 0.308 0.310
5 0.275 0.278
6 0.462 0.463
7 0.466 0.466
8 0.134 1.678
9 0.122 1.692
10 0.122 1.694
11 0.122 1.695
12 0.122 1.684
13 0.872 1.255
14 0.116 0.127
15 0.031 0.032
[16 rows x 14 columns]
Each row represents a measurement value for an analog port. Each column is a test case. Thus there's one measurement for each of the analog ports, in each column.
When I plot this with DataFrame.plot() I end up with the following plot:
But this presents my rows, the 16 analog ports on the x-axis. I would like to have the column numbers on the x-axis. I've tried to define the x-axis in plot() as below:
>>> df.plot(x=df.columns)
Which results in a
ValueError: Length mismatch: Expected axis has 16 elements, new values have 14 elements
How should I approach this? Below is an example image which shows the correct x-axis values.
You want something like
df.T.plot()
Plus some other formatting. But that will get you started.
the .T method transposes the DataFrame.

Categories

Resources