I have a huge data set, with many values. I would like to exclude certain rows I see as containing less accurate information. For instance say I have:
1 150.37265 1.6093940 11986.75879 4343.98486 6345.683 8.535 2458.348 3.069 2554.732 2.205 2011.244 1.855 1665.491 2.055 2229.020 11.092 1159.925 63.576 1238.034 63.029 1513.357 76.582 -99.999 -99.999 -99.999 -99.999 609.524 1.071 430.542 0.779 293.832 0.365 201.463 0.499 88.605 1.054 316.139 2.791 426.547 2.960 659.435 3.337 761.369 2.897 982.764 3.981 915.068 3.799 147.845 2.344 284.971 2.969 413.933 3.471 520.958 3.385 761.208 3.425 1299.578 4.812 27.115 0.127 32.692 0.134 3946.924 11.148 0.000 0.030 27.50304 1.00000 -1.00000 -1.00000 -1 0 0 2 230 1 1
2 150.40848 1.6075042 11126.90527 4298.73779 2326.038 3.374 1683.321 2.562 2624.063 2.233 2718.523 2.144 2892.133 2.693 140.665 61.195 281.988 20.099 427.518 22.779 735.361 37.903 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 -99.999 480.256 2.452 1503.665 6.085 1532.825 5.610 1883.756 5.638 2196.444 4.918 -99.999 -99.999 2087.671 5.736 892.003 5.755 1354.323 6.468 1339.161 6.241 1990.614 6.614 1823.208 5.300 -99.999 -99.999 0.522 0.225 16.993 0.240 -99.900 -99.900 0.000 0.750 12.51440 1.00000 -1.00000 -1.00000 -1 1 0 11 295 1 0
3 150.40550 1.6069111 11198.41992 4284.49414 223.931 3.299 111.582 0.887 94.436 0.678 67.895 0.511 61.085 0.507 64.002 6.935 55.312 8.437 65.572 4.568 88.131 5.368 46.054 0.342 36.760 0.223 20.608 0.206 11.796 0.140 8.360 0.086 6.925 0.100 4.889 0.251 8.405 0.461 10.009 0.460 22.655 0.625 28.231 0.567 34.231 0.754 37.358 0.781 6.587 0.501 7.931 0.507 9.492 0.535 15.271 0.591 30.671 0.695 38.314 0.841 1.864 0.125 4.507 0.130 142.376 9.231 0.000 0.030 17.73935 1.00000 -1.00000 -1.00000 -1 0 0 0 314 1 1
4 150.39050 1.6043303 11558.18359 4222.49707 33.437 1.502 23.667 0.681 16.188 0.566 11.345 0.410 8.666 0.358 6.252 7.394 16.608 6.876 12.765 1.795 25.299 2.120 6.197 0.216 4.550 0.115 1.558 0.082 0.789 0.064 0.392 0.062 0.305 0.044 0.183 0.065 0.463 0.131 0.906 0.157 1.353 0.177 2.328 0.190 3.503 0.273 4.320 0.300 0.098 0.099 0.257 0.142 0.455 0.152 0.721 0.172 3.101 0.241 5.155 0.342 0.047 0.304 -0.538 0.245 21.609 8.478 0.000 0.750 11.57455 1.00248 -1.00000 -1.00000 -1 0 0 0 322 1 1
as a sample of my data set, and I decide that rows 2 and 3 are not accurate enough. How would I import only rows 1 and 4? I would prefer a general technique rather than commenting out rows 2 and 3 by hand, since the data set is very large.
EDIT: The operation does not have to be done with genfromtxt. If there is another method that does the same thing as genfromtxt but can skip columns as well as rows, that would be great!
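One possible approach, sketched below, is to load the columns of interest with genfromtxt's usecols argument and then drop rows with a boolean mask. The sample here is a shortened, made-up version of the rows above, and both selection rules (an explicit list of bad IDs, and the -99.999 sentinel) are assumptions, since the question does not say what makes a row inaccurate.

```python
import io

import numpy as np

# Shortened, made-up sample: id, flux, flux error, second flux, its error, flag.
sample = io.StringIO(
    "1 6345.683 8.535 609.524 1.071 1\n"
    "2 2326.038 3.374 -99.999 -99.999 0\n"
    "3 223.931 3.299 46.054 0.342 1\n"
    "4 33.437 1.502 6.197 0.216 1\n"
)

# usecols skips columns at load time; here we keep id, both fluxes and the flag.
data = np.genfromtxt(sample, usecols=(0, 1, 3, 5))

# Option 1 (assumption): the bad rows are known by their id in column 0.
bad_ids = [2, 3]
selected = data[~np.isin(data[:, 0], bad_ids)]

# Option 2 (assumption): a row is bad if any loaded value is the -99.999 sentinel.
has_sentinel = np.any(np.isclose(data, -99.999), axis=1)
clean = data[~has_sentinel]

print(selected[:, 0])  # ids kept by option 1
print(clean[:, 0])     # ids kept by option 2
```

Any condition that can be written as a per-row boolean array can be used as the mask in the same way, so the rule does not have to be spelled out row by row.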
I'm trying to do polynomial regression and am having some trouble fitting the model.
I get:
ValueError: Found input variables with inconsistent numbers of samples: [1040, 260]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = BTCdata.iloc[:, [1, 2, 4, 5]]
y = BTCdata.iloc[:,3]
x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
model = LinearRegression()
model.fit(x_, y)
The problem comes from this line:
x = np.array(x).reshape((-1, 1))
By doing that you are transforming a dataframe of n rows and m columns into an array of n x m rows and 1 column. In your example, x ends up having 260 x 4 = 1040 rows whereas y has 260, raising this error.
If your goal is to convert your data to numpy arrays before using it in a model, then you can simply do:
x = x.to_numpy()
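The shape mismatch can be reproduced without the spreadsheet. The sketch below uses a random 260 x 6 frame as a hypothetical stand-in for BTCdata (the real data is not shown in the question) and compares the two conversions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for BTCdata: 260 rows, 6 columns of random numbers.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(260, 6)))

x = data.iloc[:, [1, 2, 4, 5]]
y = data.iloc[:, 3]

# reshape((-1, 1)) stacks all 260 * 4 values into a single column.
flattened = np.array(x).reshape((-1, 1))
print(flattened.shape)  # (1040, 1)

# to_numpy() keeps one row per sample, so x and y agree on the sample count.
x_arr = x.to_numpy()
y_arr = y.to_numpy()
print(x_arr.shape, y_arr.shape)  # (260, 4) (260,)
```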
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
#
BTCdata = pd.read_excel('BitcoinRegression.xlsx', sheet_name='FinalBTC')
x = BTCdata.iloc[:, [1, 2, 4, 5]]
print(x.shape)
y = BTCdata.iloc[:,3]
print(y.shape)
#x, y = np.array(x).reshape((-1, 1)), np.array(y).reshape((-1, 1))
poly_features= PolynomialFeatures(degree= 4, include_bias = False)
x_ = poly_features.fit_transform(x)
#model = LinearRegression()
#model.fit(x_, y)
mod = sm.OLS(y, x_).fit()
mod.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: BTC R-squared: 0.886
Model: OLS Adj. R-squared: 0.868
Method: Least Squares F-statistic: 46.86
Date: Wed, 17 Mar 2021 Prob (F-statistic): 2.63e-85
Time: 20:49:58 Log-Likelihood: -2299.3
No. Observations: 260 AIC: 4675.
Df Residuals: 222 BIC: 4810.
Df Model: 37
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0089 0.019 -0.468 0.640 -0.046 0.028
x2 0.0033 0.004 0.797 0.426 -0.005 0.012
x3 2.621e-05 3.55e-05 0.737 0.462 -4.38e-05 9.62e-05
x4 0.0005 0.001 0.789 0.431 -0.001 0.002
x5 -0.0238 0.067 -0.355 0.723 -0.156 0.108
x6 0.0790 0.688 0.115 0.909 -1.277 1.435
x7 0.0942 0.131 0.722 0.471 -0.163 0.352
x8 0.9679 1.276 0.759 0.449 -1.546 3.482
x9 0.0184 0.133 0.139 0.890 -0.243 0.280
x10 0.0093 0.013 0.726 0.469 -0.016 0.035
x11 0.0957 0.125 0.766 0.444 -0.150 0.342
x12 0.0001 0.000 0.864 0.389 -0.000 0.000
x13 0.0008 0.001 0.599 0.550 -0.002 0.003
x14 0.0207 0.026 0.783 0.435 -0.031 0.073
x15 3.594e-05 2.89e-05 1.245 0.214 -2.09e-05 9.28e-05
x16 -0.0004 0.001 -0.496 0.621 -0.002 0.001
x17 0.0158 0.010 1.621 0.106 -0.003 0.035
x18 -0.0068 0.002 -2.945 0.004 -0.011 -0.002
x19 -0.0014 0.007 -0.202 0.840 -0.015 0.012
x20 -0.0389 0.086 -0.454 0.650 -0.208 0.130
x21 0.1104 0.043 2.558 0.011 0.025 0.195
x22 0.7337 0.819 0.896 0.371 -0.881 2.348
x23 -1.4583 0.432 -3.378 0.001 -2.309 -0.607
x24 0.0601 0.031 1.913 0.057 -0.002 0.122
x25 0.0192 0.021 0.893 0.373 -0.023 0.061
x26 0.0403 0.091 0.445 0.657 -0.138 0.219
x27 -0.5110 0.224 -2.284 0.023 -0.952 -0.070
x28 0.0697 0.078 0.892 0.374 -0.084 0.224
x29 -0.1316 0.039 -3.397 0.001 -0.208 -0.055
x30 0.0054 0.103 0.052 0.958 -0.198 0.209
x31 0.0003 0.000 0.951 0.343 -0.000 0.001
x32 0.0060 0.007 0.856 0.393 -0.008 0.020
x33 -0.0124 0.012 -1.078 0.282 -0.035 0.010
x34 0.3317 0.394 0.842 0.400 -0.444 1.108
x35 -4.886e-09 1.1e-09 -4.439 0.000 -7.05e-09 -2.72e-09
x36 1.387e-07 3.68e-08 3.767 0.000 6.62e-08 2.11e-07
x37 5.106e-07 3.44e-06 0.148 0.882 -6.28e-06 7.3e-06
x38 4.652e-07 2.91e-07 1.601 0.111 -1.07e-07 1.04e-06
x39 -1.623e-06 5.17e-07 -3.138 0.002 -2.64e-06 -6.04e-07
x40 -8.446e-05 9.05e-05 -0.933 0.352 -0.000 9.39e-05
x41 -8.729e-06 7.38e-06 -1.182 0.238 -2.33e-05 5.82e-06
x42 -0.0017 0.002 -0.804 0.422 -0.006 0.002
x43 0.0007 0.000 1.705 0.090 -0.000 0.001
x44 -1.815e-05 2.11e-05 -0.862 0.390 -5.96e-05 2.33e-05
x45 9.562e-06 3.43e-06 2.788 0.006 2.8e-06 1.63e-05
x46 0.0012 0.001 1.413 0.159 -0.000 0.003
x47 5.405e-05 6.5e-05 0.831 0.407 -7.41e-05 0.000
x48 0.0069 0.044 0.156 0.876 -0.080 0.093
x49 -0.0078 0.006 -1.414 0.159 -0.019 0.003
x50 0.0001 0.000 0.307 0.759 -0.001 0.001
x51 0.1505 0.090 1.669 0.096 -0.027 0.328
x52 0.1555 0.046 3.410 0.001 0.066 0.245
x53 -0.0296 0.024 -1.210 0.227 -0.078 0.019
x54 0.0016 0.001 2.182 0.030 0.000 0.003
x55 -2.28e-05 8.77e-06 -2.600 0.010 -4.01e-05 -5.52e-06
x56 -0.0045 0.003 -1.594 0.112 -0.010 0.001
x57 -0.0002 0.000 -0.947 0.344 -0.001 0.000
x58 -0.0067 0.237 -0.028 0.977 -0.474 0.461
x59 0.0134 0.021 0.629 0.530 -0.029 0.055
x60 0.0020 0.002 1.123 0.262 -0.002 0.006
x61 0.0277 0.016 1.689 0.093 -0.005 0.060
x62 -0.3824 0.413 -0.926 0.355 -1.196 0.431
x63 0.3528 0.179 1.970 0.050 -0.000 0.706
x64 -0.0282 0.005 -5.708 0.000 -0.038 -0.018
x65 -0.0002 0.000 -0.695 0.488 -0.001 0.000
x66 0.0098 0.009 1.142 0.255 -0.007 0.027
x67 0.0901 0.103 0.873 0.384 -0.113 0.293
x68 -0.1941 0.648 -0.300 0.765 -1.471 1.083
x69 0.0237 0.021 1.128 0.261 -0.018 0.065
==============================================================================
Omnibus: 127.728 Durbin-Watson: 0.552
Prob(Omnibus): 0.000 Jarque-Bera (JB): 851.418
Skew: 1.861 Prob(JB): 1.31e-185
Kurtosis: 11.046 Cond. No. 4.00e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4e+16. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
I have this multinomial logistic regression model fitted with statsmodels:
writer = pd.ExcelWriter(path=os.path.join(export_path, f'regression.xlsx'), engine='xlsxwriter')
vars_matrix_df = pd.read_csv(data_path, skipinitialspace=True)
corr_cols = ['sales_vs_service', 'agent_experience', 'minutes_passed_since_shift_started', 'stage_in_conv',
'current_cust_wait_time', 'prev_cust_line_words', 'total_cust_words_in_conv',
'agent_total_turns', 'sentiment_score', 'max_sentiment', 'min_sentiment', 'last_sentiment',
'agent_response_time', 'customer_response_rate', 'is_last_cust_answered',
'conversation_opening', 'queue_length', 'total_lines_from_rep',
'agent_number_of_conversations', 'concurrency', 'rep_shift_start_time', 'first_cust_line_num_of_words',
'queue_wait_time', 'day_of_week', 'time_of_day']
reg_equation = st.formula.mnlogit(f'visitor_was_answered ~C(day_of_week)+C(time_of_day)+{"+".join(corr_cols)} ',
vars_matrix_df).fit()
The regression results:
visitor_was_answered=1 coef std err z P>|z| \
0 C(time_of_day)[T.10] 0.0071 1910000.000 3.700000e-09 1.000
1 C(time_of_day)[T.11] 0.0067 698000.000 9.600000e-09 1.000
2 C(time_of_day)[T.12] 0.0016 1790000.000 9.200000e-10 1.000
3 C(time_of_day)[T.13] 0.0031 561000.000 5.570000e-09 1.000
4 C(time_of_day)[T.14] 0.0037 1310000.000 2.840000e-09 1.000
5 C(time_of_day)[T.15] 0.0011 548000.000 2.020000e-09 1.000
6 C(time_of_day)[T.17] 0.0044 814000.000 5.440000e-09 1.000
7 C(time_of_day)[T.18] 0.0009 1100000.000 8.270000e-10 1.000
8 C(time_of_day)[T.19] 0.0047 835000.000 5.640000e-09 1.000
9 C(time_of_day)[T.20] 0.0009 1140000.000 8.100000e-10 1.000
10 time_of_day[T.10] 0.0071 1930000.000 3.670000e-09 1.000
11 time_of_day[T.11] 0.0067 686000.000 9.770000e-09 1.000
12 time_of_day[T.12] 0.0016 1800000.000 9.150000e-10 1.000
13 time_of_day[T.13] 0.0031 556000.000 5.620000e-09 1.000
14 time_of_day[T.14] 0.0037 1240000.000 3.010000e-09 1.000
15 time_of_day[T.15] 0.0011 638000.000 1.740000e-09 1.000
16 time_of_day[T.17] 0.0044 1010000.000 4.400000e-09 1.000
17 time_of_day[T.18] 0.0009 1130000.000 8.020000e-10 1.000
18 time_of_day[T.19] 0.0047 860000.000 5.480000e-09 1.000
19 time_of_day[T.20] 0.0009 1120000.000 8.270000e-10 1.000
20 sales_vs_service -0.0448 0.006 -8.102000e+00 0.000
21 agent_experience -0.0414 0.008 -4.955000e+00 0.000
22 current_cust_wait_time -39.1333 0.414 -9.457400e+01 0.000
23 prev_cust_line_words 20.0439 0.236 8.494600e+01 0.000
24 agent_total_turns 0.1110 0.038 2.949000e+00 0.003
25 sentiment_score -4.3454 0.157 -2.759000e+01 0.000
26 agent_response_time -118.0821 2.205 -5.354600e+01 0.000
27 customer_response_rate -7.0865 0.630 -1.125500e+01 0.000
28 is_last_cust_answered -0.2537 0.005 -4.860800e+01 0.000
29 conversation_opening -0.4533 0.006 -7.206300e+01 0.000
30 queue_length -1.5427 0.018 -8.642700e+01 0.000
31 agent_number_of_conversations 0.0013 0.018 7.300000e-02 0.941
32 first_cust_line_num_of_words -3.7545 0.123 -3.056900e+01 0.000
33 queue_wait_time -0.3308 0.166 -1.997000e+00 0.046
I want to add the odds ratio of each variable to these regression results. I think the coefficients are already odds ratios, but I couldn't find anything that confirms it. Any idea how this can be done? And what do the coefficients represent here?
Thanks!
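For what it's worth, in a logit-type model the fitted coefficients are on the log-odds scale, so the usual way to get odds ratios is to exponentiate them. The sketch below does this for three coefficients copied from the table above; on the fitted result itself the same idea would presumably be np.exp(reg_equation.params), which is an assumption about this particular model since the data is not available here.

```python
import numpy as np
import pandas as pd

# A few coefficients copied from the regression table in the question.
coefs = pd.Series(
    {
        "sales_vs_service": -0.0448,
        "agent_total_turns": 0.1110,
        "is_last_cust_answered": -0.2537,
    },
    name="coef",
)

# Coefficients are log-odds; exponentiating gives odds ratios.
odds_ratios = np.exp(coefs).rename("odds_ratio")

print(pd.concat([coefs, odds_ratios], axis=1))
```

An odds ratio below 1 means the odds of visitor_was_answered=1 go down as the variable increases by one unit, and a value above 1 means they go up, holding the other variables fixed.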
What's the agreed-upon Pythonic way to format columns in a DataFrame while maintaining the original data?
For example, I have a large DataFrame which contains floats. For display purposes only, I would like to format some columns as percents, some as dollars, and some others as numbers rounded to the hundredths place. The remainder would be unchanged. The original data would be preserved and only the display would be affected. The solution would start with Raw df and return the Formatted df below.
Raw df:
index percent dollars rounded2 float
0 0.524 0.787 1.202 0.133
1 0.166 0.291 0.208 0.483
2 0.815 0.319 0.205 1.350
3 0.421 0.634 1.380 1.352
4 1.144 0.790 0.279 0.636
5 0.215 0.258 0.895 0.949
6 0.796 0.834 0.809 1.194
7 0.920 0.176 0.589 1.036
8 1.012 0.790 1.224 1.279
9 1.231 1.175 1.232 0.496
10 0.494 1.319 0.912 0.088
11 0.400 0.291 0.491 1.041
Formatted df:
index percent dollars rounded2 float
0 52.4% $0.79 1.20 0.133
1 16.6% $0.29 0.21 0.483
2 81.5% $0.32 0.20 1.350
3 42.1% $0.63 1.38 1.352
4 114.4% $0.79 0.28 0.636
5 21.5% $0.26 0.90 0.949
6 79.6% $0.83 0.81 1.194
7 92.0% $0.18 0.59 1.036
8 101.2% $0.79 1.22 1.279
9 123.1% $1.17 1.23 0.496
10 49.4% $1.32 0.91 0.088
11 40.0% $0.29 0.49 1.041
This seems to be pretty routine, but the available solutions for similar tasks are neither simple nor user friendly. I'd appreciate anyone who can provide a parsimonious method.
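One simple option, sketched here, is to keep the numeric frame untouched and build a separate display copy from a dictionary of format strings. The helper name format_for_display is made up for this example, and only the first three rows of the raw data are used.

```python
import pandas as pd

# First three rows of the raw data from the question.
df = pd.DataFrame(
    {
        "percent": [0.524, 0.166, 0.815],
        "dollars": [0.787, 0.291, 0.319],
        "rounded2": [1.202, 0.208, 0.205],
        "float": [0.133, 0.483, 1.350],
    }
)

# Columns not listed here are left as they are.
formats = {"percent": "{:.1%}", "dollars": "${:.2f}", "rounded2": "{:.2f}"}


def format_for_display(frame, formats):
    """Return a copy with the listed columns rendered as strings (hypothetical helper)."""
    shown = frame.copy()
    for column, fmt in formats.items():
        shown[column] = frame[column].map(fmt.format)
    return shown


formatted = format_for_display(df, formats)
print(formatted)
print(df.dtypes)  # the original columns are still floats
```

The same kind of dictionary can also be passed to df.style.format(formats), which renders the formatting in notebooks and HTML output without creating a copy of the data.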
I have the following data:
INPUT
ID A
1 0.040
2 0.086
3 0.127
4 0.173
5 0.141
6 0.047
7 0.068
8 0.038
I want to create a column B in which each pair of consecutive rows holds the average of the corresponding two values in A, as follows:
OUTPUT
ID A B
1 0.040 0.063
2 0.086 0.063
3 0.127 0.150
4 0.173 0.150
5 0.141 0.094
6 0.047 0.094
7 0.068 0.053
8 0.038 0.053
I tried this code:
df["B"]= (df['A'] + df['A'].shift(-1))/2
I got the averages, but I can't make them repeat across each pair of rows.
You can do it this way:
In [10]: df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
In [11]: df
Out[11]:
ID A B
0 1 0.040 0.063
1 2 0.086 0.063
2 3 0.127 0.150
3 4 0.173 0.150
4 5 0.141 0.094
5 6 0.047 0.094
6 7 0.068 0.053
7 8 0.038 0.053
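To see why this works, it can help to look at the grouping key on its own. The sketch below rebuilds the frame from the question and prints the key; block_size is a name introduced here for illustration, so the same idea extends to blocks of any length.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": range(1, 9),
        "A": [0.040, 0.086, 0.127, 0.173, 0.141, 0.047, 0.068, 0.038],
    }
)

block_size = 2  # illustrative name: number of consecutive rows per group

# Integer division of the row position gives one label per block of rows.
block_id = np.arange(len(df)) // block_size
print(block_id)  # [0 0 1 1 2 2 3 3]

# transform('mean') broadcasts each block's mean back onto its rows.
df["B"] = df.groupby(block_id)["A"].transform("mean")
print(df)
```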
I've got a pandas.DataFrame that looks like this:
>>> print df
0 1 2 3 4 5 6 7 8 9 10 11 \
0 0.198 0.198 0.266 0.198 0.236 0.199 0.198 0.198 0.199 0.199 0.199 0.198
1 0.032 0.034 0.039 0.405 0.442 0.382 0.343 0.311 0.282 0.255 0.232 0.210
2 0.702 0.702 0.742 0.709 0.755 0.708 0.708 0.712 0.707 0.706 0.706 0.706
3 0.109 0.112 0.114 0.114 0.128 0.532 0.149 0.118 0.115 0.114 0.114 0.112
4 0.309 0.306 0.311 0.311 0.316 0.513 1.977 0.313 0.311 0.310 0.311 0.309
5 0.280 0.277 0.282 0.278 0.282 0.383 1.122 1.685 0.280 0.280 0.282 0.280
6 0.466 0.460 0.465 0.465 0.468 0.508 0.829 1.100 1.987 0.465 0.465 0.463
7 0.469 0.464 0.469 0.470 0.469 0.490 0.648 0.783 1.095 2.002 0.469 0.466
8 0.137 0.120 0.137 0.138 0.137 0.136 0.144 0.149 0.166 0.209 0.137 0.136
9 0.125 0.107 0.125 0.126 0.125 0.122 0.126 0.128 0.132 0.144 0.125 0.123
10 0.125 0.106 0.125 0.123 0.123 0.122 0.125 0.128 0.132 0.142 0.125 0.123
11 0.127 0.107 0.125 0.125 0.125 0.122 0.126 0.127 0.132 0.142 0.125 0.123
12 0.125 0.107 0.125 0.128 0.125 0.123 0.126 0.127 0.132 0.142 0.125 0.122
13 0.871 0.862 0.871 0.872 0.872 0.872 0.873 0.872 0.875 0.880 0.873 0.872
14 0.114 0.115 0.116 0.117 0.131 0.536 0.153 0.123 0.118 0.117 0.117 0.116
15 0.033 0.032 0.031 0.032 0.032 0.040 0.033 0.033 0.032 0.032 0.032 0.032
12 13
0 0.198 0.198
1 0.190 0.172
2 0.705 0.705
3 0.112 0.115
4 0.308 0.310
5 0.275 0.278
6 0.462 0.463
7 0.466 0.466
8 0.134 1.678
9 0.122 1.692
10 0.122 1.694
11 0.122 1.695
12 0.122 1.684
13 0.872 1.255
14 0.116 0.127
15 0.031 0.032
[16 rows x 14 columns]
Each row represents a measurement value for an analog port. Each column is a test case. Thus there's one measurement for each of the analog ports, in each column.
When I plot this with DataFrame.plot() I end up with the following plot:
But this puts my rows, the 16 analog ports, on the x-axis. I would like to have the column numbers on the x-axis instead. I've tried to define the x-axis in plot() as below:
>>> df.plot(x=df.columns)
Which results in a
ValueError: Length mismatch: Expected axis has 16 elements, new values have 14 elements
How should I approach this? Below is an example image which shows the correct x-axis values.
You want something like
df.T.plot()
Plus some other formatting. But that will get you started.
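The .T attribute transposes the DataFrame, so the columns (the test cases) become the index and end up on the x-axis, with one line per analog port.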
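A minimal sketch of the transposed plot with some basic formatting is shown below. It assumes matplotlib is installed and uses random numbers in place of the real measurements; the axis labels are only suggestions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in an interactive session

import numpy as np
import pandas as pd

# Random stand-in for the real data: 16 analog ports (rows) x 14 test cases (columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0.0, 2.0, size=(16, 14)))

# After transposing, the index holds the test cases, so they land on the x-axis.
ax = df.T.plot(legend=False)
ax.set_xlabel("test case")
ax.set_ylabel("measured value")
ax.set_xticks(range(df.shape[1]))
```

In an interactive session, calling matplotlib.pyplot.show() afterwards displays the figure.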