Input Variables With Inconsistent Numbers of Samples for Polynomial Regression


I'm trying to do polynomial regression and am having some trouble fitting the model.
I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [1040, 260]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = BTCdata.iloc[:, [1, 2, 4, 5]]
y = BTCdata.iloc[:,3]
x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
model = LinearRegression()
model.fit(x_, y)

The problem comes from this line:
x = np.array(x).reshape((-1, 1))
By doing that you are transforming a dataframe of n rows and m columns into an array of n x m rows and 1 column. In your example, x ends up having 260 x 4 = 1040 rows whereas y has 260, raising this error.
If your goal is to convert your data to numpy arrays before using it in a model, then you can simply do:
x = x.to_numpy()
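To see where the two numbers in the error message come from, here is a small sketch. It uses a made-up stand-in for BTCdata (260 rows of zeros with hypothetical column names), since only the shapes matter here:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for BTCdata: 260 rows, 6 columns of zeros.
BTCdata = pd.DataFrame(np.zeros((260, 6)), columns=list("abcdef"))

x = BTCdata.iloc[:, [1, 2, 4, 5]]
y = BTCdata.iloc[:, 3]

# reshape((-1, 1)) flattens all four feature columns into one long column.
x_flat = np.array(x).reshape((-1, 1))
y_col = np.array(y).reshape((-1, 1))
print(x_flat.shape, y_col.shape)  # (1040, 1) (260, 1)

# to_numpy() keeps one row per sample and one column per feature.
x_ok = x.to_numpy()
print(x_ok.shape)  # (260, 4)
```

The first dimension of x_ok matches y again, which is what fit() checks.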

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
#
BTCdata = pd.read_excel('BitcoinRegression.xlsx', sheet_name='FinalBTC')
x = BTCdata.iloc[:, [1, 2, 4, 5]]
print(x.shape)
y = BTCdata.iloc[:,3]
print(y.shape)
#x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
#model = LinearRegression()
#model.fit(x_, y)
mod = sm.OLS(y, x_).fit()
mod.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: BTC R-squared: 0.886
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 46.86
Date: Wed, 17 Mar 2021 Prob (F-statistic): 2.63e-85
Time: 20:49:58 Log-Likelihood: -2299.3
No. Observations: 260 AIC: 4675.
Df Residuals: 222 BIC: 4810.
Df Model: 37
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0089 0.019 -0.468 0.640 -0.046 0.028
x2 0.0033 0.004 0.797 0.426 -0.005 0.012
x3 2.621e-05 3.55e-05 0.737 0.462 -4.38e-05 9.62e-05
x4 0.0005 0.001 0.789 0.431 -0.001 0.002
x5 -0.0238 0.067 -0.355 0.723 -0.156 0.108
x6 0.0790 0.688 0.115 0.909 -1.277 1.435
x7 0.0942 0.131 0.722 0.471 -0.163 0.352
x8 0.9679 1.276 0.759 0.449 -1.546 3.482
x9 0.0184 0.133 0.139 0.890 -0.243 0.280
x10 0.0093 0.013 0.726 0.469 -0.016 0.035
x11 0.0957 0.125 0.766 0.444 -0.150 0.342
x12 0.0001 0.000 0.864 0.389 -0.000 0.000
x13 0.0008 0.001 0.599 0.550 -0.002 0.003
x14 0.0207 0.026 0.783 0.435 -0.031 0.073
x15 3.594e-05 2.89e-05 1.245 0.214 -2.09e-05 9.28e-05
x16 -0.0004 0.001 -0.496 0.621 -0.002 0.001
x17 0.0158 0.010 1.621 0.106 -0.003 0.035
x18 -0.0068 0.002 -2.945 0.004 -0.011 -0.002
x19 -0.0014 0.007 -0.202 0.840 -0.015 0.012
x20 -0.0389 0.086 -0.454 0.650 -0.208 0.130
x21 0.1104 0.043 2.558 0.011 0.025 0.195
x22 0.7337 0.819 0.896 0.371 -0.881 2.348
x23 -1.4583 0.432 -3.378 0.001 -2.309 -0.607
x24 0.0601 0.031 1.913 0.057 -0.002 0.122
x25 0.0192 0.021 0.893 0.373 -0.023 0.061
x26 0.0403 0.091 0.445 0.657 -0.138 0.219
x27 -0.5110 0.224 -2.284 0.023 -0.952 -0.070
x28 0.0697 0.078 0.892 0.374 -0.084 0.224
x29 -0.1316 0.039 -3.397 0.001 -0.208 -0.055
x30 0.0054 0.103 0.052 0.958 -0.198 0.209
x31 0.0003 0.000 0.951 0.343 -0.000 0.001
x32 0.0060 0.007 0.856 0.393 -0.008 0.020
x33 -0.0124 0.012 -1.078 0.282 -0.035 0.010
x34 0.3317 0.394 0.842 0.400 -0.444 1.108
x35 -4.886e-09 1.1e-09 -4.439 0.000 -7.05e-09 -2.72e-09
x36 1.387e-07 3.68e-08 3.767 0.000 6.62e-08 2.11e-07
x37 5.106e-07 3.44e-06 0.148 0.882 -6.28e-06 7.3e-06
x38 4.652e-07 2.91e-07 1.601 0.111 -1.07e-07 1.04e-06
x39 -1.623e-06 5.17e-07 -3.138 0.002 -2.64e-06 -6.04e-07
x40 -8.446e-05 9.05e-05 -0.933 0.352 -0.000 9.39e-05
x41 -8.729e-06 7.38e-06 -1.182 0.238 -2.33e-05 5.82e-06
x42 -0.0017 0.002 -0.804 0.422 -0.006 0.002
x43 0.0007 0.000 1.705 0.090 -0.000 0.001
x44 -1.815e-05 2.11e-05 -0.862 0.390 -5.96e-05 2.33e-05
x45 9.562e-06 3.43e-06 2.788 0.006 2.8e-06 1.63e-05
x46 0.0012 0.001 1.413 0.159 -0.000 0.003
x47 5.405e-05 6.5e-05 0.831 0.407 -7.41e-05 0.000
x48 0.0069 0.044 0.156 0.876 -0.080 0.093
x49 -0.0078 0.006 -1.414 0.159 -0.019 0.003
x50 0.0001 0.000 0.307 0.759 -0.001 0.001
x51 0.1505 0.090 1.669 0.096 -0.027 0.328
x52 0.1555 0.046 3.410 0.001 0.066 0.245
x53 -0.0296 0.024 -1.210 0.227 -0.078 0.019
x54 0.0016 0.001 2.182 0.030 0.000 0.003
x55 -2.28e-05 8.77e-06 -2.600 0.010 -4.01e-05 -5.52e-06
x56 -0.0045 0.003 -1.594 0.112 -0.010 0.001
x57 -0.0002 0.000 -0.947 0.344 -0.001 0.000
x58 -0.0067 0.237 -0.028 0.977 -0.474 0.461
x59 0.0134 0.021 0.629 0.530 -0.029 0.055
x60 0.0020 0.002 1.123 0.262 -0.002 0.006
x61 0.0277 0.016 1.689 0.093 -0.005 0.060
x62 -0.3824 0.413 -0.926 0.355 -1.196 0.431
x63 0.3528 0.179 1.970 0.050 -0.000 0.706
x64 -0.0282 0.005 -5.708 0.000 -0.038 -0.018
x65 -0.0002 0.000 -0.695 0.488 -0.001 0.000
x66 0.0098 0.009 1.142 0.255 -0.007 0.027
x67 0.0901 0.103 0.873 0.384 -0.113 0.293
x68 -0.1941 0.648 -0.300 0.765 -1.471 1.083
x69 0.0237 0.021 1.128 0.261 -0.018 0.065
==============================================================================
Omnibus: 127.728 Durbin-Watson: 0.552
Prob(Omnibus): 0.000 Jarque-Bera (JB): 851.418
Skew: 1.861 Prob(JB): 1.31e-185
Kurtosis: 11.046 Cond. No. 4.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4e+16. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Related

Odds Ratios in MN Logit regression in stats model

I have this multinomial logit regression model fitted with statsmodels:
writer = pd.ExcelWriter(path=os.path.join(export_path, f'regression.xlsx'), engine='xlsxwriter')
vars_matrix_df = pd.read_csv(data_path, skipinitialspace=True)
corr_cols = ['sales_vs_service', 'agent_experience', 'minutes_passed_since_shift_started', 'stage_in_conv',
'current_cust_wait_time', 'prev_cust_line_words', 'total_cust_words_in_conv',
'agent_total_turns', 'sentiment_score', 'max_sentiment', 'min_sentiment', 'last_sentiment',
'agent_response_time', 'customer_response_rate', 'is_last_cust_answered',
'conversation_opening', 'queue_length', 'total_lines_from_rep',
'agent_number_of_conversations', 'concurrency', 'rep_shift_start_time', 'first_cust_line_num_of_words',
'queue_wait_time', 'day_of_week', 'time_of_day']
reg_equation = st.formula.mnlogit(f'visitor_was_answered ~C(day_of_week)+C(time_of_day)+{"+".join(corr_cols)} ',
vars_matrix_df).fit()
The regression results:
visitor_was_answered=1 coef std err z P>|z| \
0 C(time_of_day)[T.10] 0.0071 1910000.000 3.700000e-09 1.000
1 C(time_of_day)[T.11] 0.0067 698000.000 9.600000e-09 1.000
2 C(time_of_day)[T.12] 0.0016 1790000.000 9.200000e-10 1.000
3 C(time_of_day)[T.13] 0.0031 561000.000 5.570000e-09 1.000
4 C(time_of_day)[T.14] 0.0037 1310000.000 2.840000e-09 1.000
5 C(time_of_day)[T.15] 0.0011 548000.000 2.020000e-09 1.000
6 C(time_of_day)[T.17] 0.0044 814000.000 5.440000e-09 1.000
7 C(time_of_day)[T.18] 0.0009 1100000.000 8.270000e-10 1.000
8 C(time_of_day)[T.19] 0.0047 835000.000 5.640000e-09 1.000
9 C(time_of_day)[T.20] 0.0009 1140000.000 8.100000e-10 1.000
10 time_of_day[T.10] 0.0071 1930000.000 3.670000e-09 1.000
11 time_of_day[T.11] 0.0067 686000.000 9.770000e-09 1.000
12 time_of_day[T.12] 0.0016 1800000.000 9.150000e-10 1.000
13 time_of_day[T.13] 0.0031 556000.000 5.620000e-09 1.000
14 time_of_day[T.14] 0.0037 1240000.000 3.010000e-09 1.000
15 time_of_day[T.15] 0.0011 638000.000 1.740000e-09 1.000
16 time_of_day[T.17] 0.0044 1010000.000 4.400000e-09 1.000
17 time_of_day[T.18] 0.0009 1130000.000 8.020000e-10 1.000
18 time_of_day[T.19] 0.0047 860000.000 5.480000e-09 1.000
19 time_of_day[T.20] 0.0009 1120000.000 8.270000e-10 1.000
20 sales_vs_service -0.0448 0.006 -8.102000e+00 0.000
21 agent_experience -0.0414 0.008 -4.955000e+00 0.000
22 current_cust_wait_time -39.1333 0.414 -9.457400e+01 0.000
23 prev_cust_line_words 20.0439 0.236 8.494600e+01 0.000
24 agent_total_turns 0.1110 0.038 2.949000e+00 0.003
25 sentiment_score -4.3454 0.157 -2.759000e+01 0.000
26 agent_response_time -118.0821 2.205 -5.354600e+01 0.000
27 customer_response_rate -7.0865 0.630 -1.125500e+01 0.000
28 is_last_cust_answered -0.2537 0.005 -4.860800e+01 0.000
29 conversation_opening -0.4533 0.006 -7.206300e+01 0.000
30 queue_length -1.5427 0.018 -8.642700e+01 0.000
31 agent_number_of_conversations 0.0013 0.018 7.300000e-02 0.941
32 first_cust_line_num_of_words -3.7545 0.123 -3.056900e+01 0.000
33 queue_wait_time -0.3308 0.166 -1.997000e+00 0.046
I want to add the odds ratio of each variable to this regression output. I think the coefficients are already odds ratios, but I couldn't find anything that confirms it. Any idea how this can be done? And what do the coefficients represent here?
Thanks!
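For what it's worth, in a logit-type model the reported coefficients are on the log-odds scale (for mnlogit, relative to the reference outcome), so they are not odds ratios themselves; exponentiating a coefficient gives the corresponding odds ratio. A minimal sketch using three coefficients copied from the table above (the name coefs is only for illustration; with a fitted statsmodels result you would typically apply np.exp to its params attribute in the same way):

```python
import numpy as np
import pandas as pd

# Coefficients (log-odds) copied from the regression table in the question.
coefs = pd.Series({
    "sales_vs_service": -0.0448,
    "agent_total_turns": 0.1110,
    "queue_length": -1.5427,
})

# exp(coefficient) is the multiplicative change in the odds
# for a one-unit increase in the variable.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(4))
```

An odds ratio below 1 means the odds of the outcome shrink as the variable grows, and a value above 1 means they grow.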

running VARMAX in Python with a different and separate regressor for the equations

I would like to assign a specific exogenous variable to a specific equation. Specifically, consider the code below. How can I restrict the beta.exog_only_for_inc_equation coefficient to be zero in equation dln_inv, and restrict the beta.exog_only_for_inv_equation coefficient to be zero in equation dln_inc?
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
dta = sm.datasets.webuse('lutkepohl2', 'https://www.stata-press.com/data/r12/')
dta.index = dta.qtr
endog = dta.loc['1960-04-01':'1978-10-01', ['dln_inv', 'dln_inc', 'dln_consump']]
endog['exog_only_for_inv_equation']=[endog.index[i].quarter for i in range(len(endog.index))]
endog['exog_only_for_inc_equation']=endog['dln_consump']
exog = endog[['exog_only_for_inv_equation','exog_only_for_inc_equation']]
mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(2,0), trend='n', exog=exog)
res = mod.fit(maxiter=1000, disp=False)
print(res.summary())
Statespace Model Results
==================================================================================
Dep. Variable: ['dln_inv', 'dln_inc'] No. Observations: 75
Model: VARX(2) Log Likelihood 363.197
Date: Wed, 22 Apr 2020 AIC -696.394
Time: 22:49:07 BIC -661.631
Sample: 04-01-1960 HQIC -682.513
- 10-01-1978
Covariance Type: opg
===================================================================================
Ljung-Box (Q): 60.43, 38.85 Jarque-Bera (JB): 8.31, 4.57
Prob(Q): 0.02, 0.52 Prob(JB): 0.02, 0.10
Heteroskedasticity (H): 0.46, 0.42 Skew: 0.13, -0.55
Prob(H) (two-sided): 0.06, 0.04 Kurtosis: 4.61, 3.48
Results for equation dln_inv
===================================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------
L1.dln_inv -0.2468 0.094 -2.624 0.009 -0.431 -0.062
L1.dln_inc 0.2937 0.481 0.610 0.542 -0.649 1.237
L2.dln_inv -0.1873 0.152 -1.235 0.217 -0.485 0.110
L2.dln_inc -0.0805 0.413 -0.195 0.846 -0.891 0.730
beta.exog_only_for_inv_equation -0.0007 0.004 -0.172 0.863 -0.009 0.008
beta.exog_only_for_inc_equation 1.2446 0.639 1.947 0.052 -0.008 2.497
Results for equation dln_inc
===================================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------------------------
L1.dln_inv 0.0606 0.033 1.830 0.067 -0.004 0.126
L1.dln_inc 0.0170 0.133 0.128 0.898 -0.243 0.277
L2.dln_inv 0.0116 0.035 0.333 0.739 -0.056 0.080
L2.dln_inc -0.0187 0.130 -0.143 0.886 -0.273 0.236
beta.exog_only_for_inv_equation 0.0018 0.001 1.756 0.079 -0.000 0.004
beta.exog_only_for_inc_equation 0.7046 0.111 6.321 0.000 0.486 0.923
Error covariance matrix
============================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
sqrt.var.dln_inv 0.0434 0.004 12.267 0.000 0.036 0.050
sqrt.cov.dln_inv.dln_inc 5.319e-06 0.002 0.003 0.998 -0.004 0.004
sqrt.var.dln_inc 0.0106 0.001 10.475 0.000 0.009 0.013
============================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

I had to dig through the documentation at statsmodels.tsa.statespace.dynamic_factor.DynamicFactor.
Right after defining mod, replace the fit call with the following lines:
with mod.fix_params({'beta.exog_only_for_inc_equation.dln_inv': 0, 'beta.exog_only_for_inv_equation.dln_inc': 0}):
    res = mod.fit()
print(res.summary())
which will yield:
Statespace Model Results
==================================================================================
Dep. Variable: ['dln_inv', 'dln_inc'] No. Observations: 75
Model: VARX(2) Log Likelihood 359.238
Date: Sat, 25 Apr 2020 AIC -692.475
Time: 00:52:20 BIC -662.348
Sample: 04-01-1960 HQIC -680.446
- 10-01-1978
Covariance Type: opg
===================================================================================
Ljung-Box (Q): 61.97, 39.25 Jarque-Bera (JB): 14.10, 2.67
Prob(Q): 0.01, 0.50 Prob(JB): 0.00, 0.26
Heteroskedasticity (H): 0.44, 0.39 Skew: 0.10, -0.40
Prob(H) (two-sided): 0.05, 0.02 Kurtosis: 5.11, 3.47
Results for equation dln_inv
===========================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
L1.dln_inv -0.2537 0.095 -2.663 0.008 -0.440 -0.067
L1.dln_inc 0.5490 0.442 1.243 0.214 -0.317 1.415
L2.dln_inv -0.1359 0.175 -0.778 0.436 -0.478 0.206
L2.dln_inc 0.4770 0.371 1.286 0.198 -0.250 1.204
beta.exog_only_for_inv_equation 0.0015 0.005 0.321 0.748 -0.008 0.011
beta.exog_only_for_inc_equation (fixed) 0 nan nan nan nan nan
Results for equation dln_inc
===========================================================================================================
coef std err z P>|z| [0.025 0.975]
-----------------------------------------------------------------------------------------------------------
L1.dln_inv 0.0615 0.035 1.737 0.082 -0.008 0.131
L1.dln_inc 0.0584 0.105 0.557 0.577 -0.147 0.264
L2.dln_inv 0.0091 0.031 0.289 0.773 -0.052 0.071
L2.dln_inc 0.0181 0.126 0.144 0.886 -0.229 0.265
beta.exog_only_for_inv_equation (fixed) 0 nan nan nan nan nan
beta.exog_only_for_inc_equation 0.8123 0.115 7.070 0.000 0.587 1.038
Error covariance matrix
============================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------
sqrt.var.dln_inv 0.0445 0.003 14.175 0.000 0.038 0.051
sqrt.cov.dln_inv.dln_inc -5.595e-05 0.002 -0.028 0.978 -0.004 0.004
sqrt.var.dln_inc 0.0108 0.001 11.536 0.000 0.009 0.013
============================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
To find the names of the parameters just type
res.param_names
which will show you all the param names you can use. For the above example,
['L1.dln_inv.dln_inv',
'L1.dln_inc.dln_inv',
'L2.dln_inv.dln_inv',
'L2.dln_inc.dln_inv',
'L1.dln_inv.dln_inc',
'L1.dln_inc.dln_inc',
'L2.dln_inv.dln_inc',
'L2.dln_inc.dln_inc',
'beta.exog_only_for_inv_equation.dln_inv',
'beta.exog_only_for_inc_equation.dln_inv',
'beta.exog_only_for_inv_equation.dln_inc',
'beta.exog_only_for_inc_equation.dln_inc',
'sqrt.var.dln_inv',
'sqrt.cov.dln_inv.dln_inc',
'sqrt.var.dln_inc']
Hope this proves to be useful.

Create a rolling custom EWMA on a pandas dataframe

I am trying to create a rolling EWMA with decay = 1-ln(2)/3 over the last 13 values of a df, with weights such as:
factor
Out[36]:
EWMA
0 0.043
1 0.056
2 0.072
3 0.094
4 0.122
5 0.159
6 0.207
7 0.269
8 0.350
9 0.455
10 0.591
11 0.769
12 1.000
I have a df of monthly returns like this :
change.tail(5)
Out[41]:
date
2016-04-30 0.033 0.031 0.010 0.007 0.014 -0.006 -0.001 0.035 -0.004 0.020 0.011 0.003
2016-05-31 0.024 0.007 0.017 0.022 -0.012 0.034 0.019 0.001 0.006 0.032 -0.002 0.015
2016-06-30 -0.027 -0.004 -0.060 -0.057 -0.001 -0.096 -0.027 -0.096 -0.034 -0.024 0.044 0.001
2016-07-31 0.063 0.036 0.048 0.068 0.053 0.064 0.032 0.052 0.048 0.013 0.034 0.036
2016-08-31 -0.004 0.012 -0.005 0.009 0.028 0.005 -0.002 -0.003 -0.001 0.005 0.013 0.003
I am just trying to apply this rolling EWMA to each column. I know that pandas has an EWMA method, but I can't figure out how to pass the right 1-ln(2)/3 factor.
Help would be appreciated! Thanks!

@piRSquared's answer is a good approximation, but values outside the last 13 also get weights (albeit tiny ones), so it's not totally correct.
pandas can do rolling window calculations. However, ewm is not among the rolling functions it supports, which means we have to implement our own.
Assuming series is the time series to average:
from functools import partial
import numpy as np
window = 13
alpha = 1-np.log(2)/3 # This is ewma's decay factor.
weights = list(reversed([(1-alpha)**n for n in range(window)]))
ewma = partial(np.average, weights=weights)
rolling_average = series.rolling(window).apply(ewma)
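One thing worth checking against the question: the weight table there (0.043 up to 1.000) equals powers of the decay itself, 0.769**n, whereas (1-alpha)**n with alpha = 1-ln(2)/3 gives powers of about 0.231. A sketch that reproduces the question's table, assuming that reading is the intended one and using a made-up series (decay, series and rolling_ewma are illustrative names):

```python
from functools import partial

import numpy as np
import pandas as pd

window = 13
decay = 1 - np.log(2) / 3  # about 0.769, as stated in the question

# Oldest observation first, newest (weight 1.0) last.
weights = decay ** np.arange(window - 1, -1, -1)
print(weights.round(3))  # starts at 0.043 and ends at 1.0

# Made-up data standing in for one column of monthly returns.
series = pd.Series(np.random.default_rng(0).normal(0, 0.03, 40))

# Weighted average over each 13-value window; the first 12 results are NaN.
ewma = partial(np.average, weights=weights)
rolling_ewma = series.rolling(window).apply(ewma, raw=True)
```

To apply this to every column of a DataFrame, the same rolling(window).apply(ewma, raw=True) call can be made on the frame instead of a single series.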

Use ewm with mean():
df.ewm(halflife=1 - np.log(2) / 3).mean()

Improve performance of MongoDB client (sockets)

I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro.
I have a database of articles with their respective topics.
I am building an application which queries textual phrases in my database and associates article topics to each queried phrase. The topics are assigned based on the relevance of the phrase for the article.
The bottleneck seems to be Python socket communication with the localhost.
Here are my cProfile outputs:
topics_fit (PhraseVectorizer_1_1.py:668)
function called 1 times
1930698 function calls (1929630 primitive calls) in 148.209 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 286 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.224 1.224 148.209 148.209 PhraseVectorizer_1_1.py:668(topics_fit)
206272 0.193 0.000 146.780 0.001 cursor.py:1041(next)
601 0.189 0.000 146.455 0.244 cursor.py:944(_refresh)
534 0.030 0.000 146.263 0.274 cursor.py:796(__send_message)
534 0.009 0.000 141.532 0.265 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 141.484 0.265 mongo_client.py:768(_reset_on_error)
534 0.019 0.000 141.482 0.265 server.py:69(send_message_with_response)
534 0.002 0.000 141.364 0.265 pool.py:225(receive_message)
535 0.083 0.000 141.362 0.264 network.py:106(receive_message)
1070 1.202 0.001 141.278 0.132 network.py:127(_receive_data_on_socket)
3340 140.074 0.042 140.074 0.042 {method 'recv' of '_socket.socket' objects}
535 0.778 0.001 4.700 0.009 helpers.py:88(_unpack_response)
535 3.828 0.007 3.920 0.007 {bson._cbson.decode_all}
67 0.099 0.001 0.196 0.003 {method 'sort' of 'list' objects}
206187 0.096 0.000 0.096 0.000 PhraseVectorizer_1_1.py:705(<lambda>)
206187 0.096 0.000 0.096 0.000 database.py:339(_fix_outgoing)
206187 0.074 0.000 0.092 0.000 objectid.py:68(__init__)
1068 0.005 0.000 0.054 0.000 server.py:135(get_socket)
1068/534 0.010 0.000 0.041 0.000 contextlib.py:21(__exit__)
1068 0.004 0.000 0.041 0.000 pool.py:501(get_socket)
534 0.003 0.000 0.028 0.000 pool.py:208(send_message)
534 0.009 0.000 0.026 0.000 pool.py:573(return_socket)
567 0.001 0.000 0.026 0.000 socket.py:227(meth)
535 0.024 0.000 0.024 0.000 {method 'sendall' of '_socket.socket' objects}
534 0.003 0.000 0.023 0.000 topology.py:134(select_server)
206806 0.020 0.000 0.020 0.000 collection.py:249(database)
418997 0.019 0.000 0.019 0.000 {len}
449 0.001 0.000 0.018 0.000 topology.py:143(select_server_by_address)
534 0.005 0.000 0.018 0.000 topology.py:82(select_servers)
1068/534 0.001 0.000 0.018 0.000 contextlib.py:15(__enter__)
534 0.002 0.000 0.013 0.000 thread_util.py:83(release)
207307 0.010 0.000 0.011 0.000 {isinstance}
534 0.005 0.000 0.011 0.000 pool.py:538(_get_socket_no_auth)
534 0.004 0.000 0.011 0.000 thread_util.py:63(release)
534 0.001 0.000 0.011 0.000 mongo_client.py:673(_get_topology)
535 0.003 0.000 0.010 0.000 topology.py:57(open)
206187 0.008 0.000 0.008 0.000 {method 'popleft' of 'collections.deque' objects}
535 0.002 0.000 0.007 0.000 topology.py:327(_apply_selector)
536 0.003 0.000 0.007 0.000 topology.py:286(_ensure_opened)
1071 0.004 0.000 0.007 0.000 periodic_executor.py:50(open)
In particular: {method 'recv' of '_socket.socket' objects} seems to cause trouble.
According to suggestions found in "What can I do to improve socket performance in Python 3?", I tried gevent.
I added this snippet at the beginning of my script (before importing anything):
from gevent import monkey
monkey.patch_all()
This resulted in even slower performance...
*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times
1956879 function calls (1951292 primitive calls) in 158.260 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 427 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 158.170 158.170 hub.py:358(run)
1 0.000 0.000 158.170 158.170 {method 'run' of 'gevent.core.loop' objects}
2/1 1.286 0.643 158.166 158.166 PhraseVectorizer_1_1.py:671(topics_fit)
206272 0.198 0.000 156.670 0.001 cursor.py:1041(next)
601 0.192 0.000 156.203 0.260 cursor.py:944(_refresh)
534 0.029 0.000 156.008 0.292 cursor.py:796(__send_message)
534 0.012 0.000 150.514 0.282 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 150.439 0.282 mongo_client.py:768(_reset_on_error)
534 0.017 0.000 150.437 0.282 server.py:69(send_message_with_response)
551/535 0.002 0.000 150.316 0.281 pool.py:225(receive_message)
552/536 0.079 0.000 150.314 0.280 network.py:106(receive_message)
1104/1072 0.815 0.001 150.234 0.140 network.py:127(_receive_data_on_socket)
2427/2395 0.019 0.000 149.418 0.062 socket.py:381(recv)
608/592 0.003 0.000 48.541 0.082 socket.py:284(_wait)
552 0.885 0.002 5.464 0.010 helpers.py:88(_unpack_response)
552 4.475 0.008 4.577 0.008 {bson._cbson.decode_all}
3033 2.021 0.001 2.021 0.001 {method 'recv' of '_socket.socket' objects}
7/4 0.000 0.000 0.221 0.055 hub.py:189(_import)
4 0.127 0.032 0.221 0.055 {__import__}
67 0.104 0.002 0.202 0.003 {method 'sort' of 'list' objects}
536/535 0.003 0.000 0.142 0.000 topology.py:57(open)
537/536 0.002 0.000 0.139 0.000 topology.py:286(_ensure_opened)
1072/1071 0.003 0.000 0.138 0.000 periodic_executor.py:50(open)
537/536 0.001 0.000 0.136 0.000 server.py:33(open)
537/536 0.001 0.000 0.135 0.000 monitor.py:69(open)
20/19 0.000 0.000 0.132 0.007 topology.py:342(_update_servers)
4 0.000 0.000 0.131 0.033 hub.py:418(_get_resolver)
1 0.000 0.000 0.122 0.122 resolver_thread.py:13(__init__)
1 0.000 0.000 0.122 0.122 hub.py:433(_get_threadpool)
206187 0.081 0.000 0.101 0.000 objectid.py:68(__init__)
206187 0.100 0.000 0.100 0.000 database.py:339(_fix_outgoing)
206187 0.098 0.000 0.098 0.000 PhraseVectorizer_1_1.py:708(<lambda>)
1 0.073 0.073 0.093 0.093 threadpool.py:2(<module>)
2037 0.003 0.000 0.092 0.000 hub.py:159(get_hub)
2 0.000 0.000 0.090 0.045 thread.py:39(start_new_thread)
2 0.000 0.000 0.090 0.045 greenlet.py:195(spawn)
2 0.000 0.000 0.090 0.045 greenlet.py:74(__init__)
1 0.000 0.000 0.090 0.090 hub.py:259(__init__)
1102 0.004 0.000 0.078 0.000 pool.py:501(get_socket)
1068 0.005 0.000 0.074 0.000 server.py:135(get_socket)
This performance is unacceptable for my application. I would like it to be much faster (this was timed and profiled on a subset of ~20 documents, and I need to process a few tens of thousands).
Any ideas on how to speed it up?
Much appreciated.
Edit:
Code snippet that I profiled:
# also tried monkey patching all here, see profiler
from collections import OrderedDict
from pymongo import MongoClient

def topics_fit(self):
    client = MongoClient()
    # tried motor for multithreading - also slow
    #client = motor.motor_tornado.MotorClient()
    # initialize DB cursors
    db_wiki = client.wiki
    # initialize topic feature dictionary
    self.topics = OrderedDict()
    self.topic_mapping = OrderedDict()
    vocabulary_keys = self.vocabulary.keys()
    num_categories = 0
    for phrase in vocabulary_keys:
        phrase_tokens = phrase.split()
        if len(phrase_tokens) > 1:
            # query for current phrase
            AND_phrase = "\"" + phrase + "\""
            cursor = db_wiki.categories.find({ "$text" : { "$search": AND_phrase } },{ "score": { "$meta": "textScore" } })
            cursor = list(cursor)
            if cursor:
                cursor.sort(key=lambda k: k["score"], reverse = True)
                added_categories = cursor[0]["category_ids"]
                for added_category in added_categories:
                    if not (added_category in self.topics):
                        self.topics[added_category] = num_categories
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                        num_categories+=1
                    else:
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
Edit 2: output of index_information():
{u'_id_':
{u'ns': u'wiki.categories', u'key': [(u'_id', 1)], u'v': 1},
u'article_title_text_article_body_text_category_names_text': {u'default_language': u'english', u'weights': SON([(u'article_body', 1), (u'article_title', 1), (u'category_names', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'ns': u'wiki.categories', u'textIndexVersion': 2}}

How do I plot this DataFrame?

I've got a pandas.DataFrame that looks like this:
>>> print df
0 1 2 3 4 5 6 7 8 9 10 11 \
0 0.198 0.198 0.266 0.198 0.236 0.199 0.198 0.198 0.199 0.199 0.199 0.198
1 0.032 0.034 0.039 0.405 0.442 0.382 0.343 0.311 0.282 0.255 0.232 0.210
2 0.702 0.702 0.742 0.709 0.755 0.708 0.708 0.712 0.707 0.706 0.706 0.706
3 0.109 0.112 0.114 0.114 0.128 0.532 0.149 0.118 0.115 0.114 0.114 0.112
4 0.309 0.306 0.311 0.311 0.316 0.513 1.977 0.313 0.311 0.310 0.311 0.309
5 0.280 0.277 0.282 0.278 0.282 0.383 1.122 1.685 0.280 0.280 0.282 0.280
6 0.466 0.460 0.465 0.465 0.468 0.508 0.829 1.100 1.987 0.465 0.465 0.463
7 0.469 0.464 0.469 0.470 0.469 0.490 0.648 0.783 1.095 2.002 0.469 0.466
8 0.137 0.120 0.137 0.138 0.137 0.136 0.144 0.149 0.166 0.209 0.137 0.136
9 0.125 0.107 0.125 0.126 0.125 0.122 0.126 0.128 0.132 0.144 0.125 0.123
10 0.125 0.106 0.125 0.123 0.123 0.122 0.125 0.128 0.132 0.142 0.125 0.123
11 0.127 0.107 0.125 0.125 0.125 0.122 0.126 0.127 0.132 0.142 0.125 0.123
12 0.125 0.107 0.125 0.128 0.125 0.123 0.126 0.127 0.132 0.142 0.125 0.122
13 0.871 0.862 0.871 0.872 0.872 0.872 0.873 0.872 0.875 0.880 0.873 0.872
14 0.114 0.115 0.116 0.117 0.131 0.536 0.153 0.123 0.118 0.117 0.117 0.116
15 0.033 0.032 0.031 0.032 0.032 0.040 0.033 0.033 0.032 0.032 0.032 0.032
12 13
0 0.198 0.198
1 0.190 0.172
2 0.705 0.705
3 0.112 0.115
4 0.308 0.310
5 0.275 0.278
6 0.462 0.463
7 0.466 0.466
8 0.134 1.678
9 0.122 1.692
10 0.122 1.694
11 0.122 1.695
12 0.122 1.684
13 0.872 1.255
14 0.116 0.127
15 0.031 0.032
[16 rows x 14 columns]
Each row represents a measurement value for an analog port. Each column is a test case. Thus there's one measurement for each of the analog ports, in each column.
When I plot this with DataFrame.plot() I end up with the following plot:
But this presents my rows, the 16 analog ports on the x-axis. I would like to have the column numbers on the x-axis. I've tried to define the x-axis in plot() as below:
>>> df.plot(x=df.columns)
Which results in a
ValueError: Length mismatch: Expected axis has 16 elements, new values have 14 elements
How should I approach this? Below is an example image which shows the correct x-axis values.

You want something like
df.T.plot()
plus some other formatting, but that will get you started.
The .T attribute transposes the DataFrame, so the columns (test cases) become the index and end up on the x-axis.
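As a quick check of what the transpose does to the layout, here is a sketch with random numbers in the same 16 x 14 shape as the question's frame (the data is made up; the plot call is left commented out because it needs matplotlib):

```python
import numpy as np
import pandas as pd

# Made-up data with the same layout as in the question:
# 16 rows (analog ports) x 14 columns (test cases).
df = pd.DataFrame(np.random.default_rng(0).random((16, 14)))

transposed = df.T
print(transposed.shape)  # (14, 16)

# The index of the transposed frame holds the old column labels (0..13).
# DataFrame.plot() puts the index on the x-axis and draws one line per column,
# so each analog port becomes a line across the 14 test cases.
# ax = transposed.plot()  # needs matplotlib installed
```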
