Replicating rows based on index multiplies the rows instead of replicating - python

I have a DataFrame in which I would like to replicate a few rows:
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
I want to replicate the 30DD row 30 times, the 20DD row 20 times and the 10DD row 10 times, keeping the same index label.
I tried the following, but it multiplies the values instead of replicating the row:
for i in range(4):
test1 = df.append(df.ix['30DD']*30)
X Y diff No
Index
1d 0.000 0.017 0.000e+00 0
2D 0.083 0.017 3.000e-03 1
3D 0.250 0.017 7.200e-03 2
6D 0.500 0.019 2.400e-03 3
1DD 1.000 0.020 2.400e-03 4
2DD 2.000 0.023 1.300e-03 5
3DD 3.000 0.024 1.000e-03 6
5DD 5.000 0.026 6.500e-04 7
7DD 7.000 0.027 2.667e-04 8
10DD 10.000 0.028 1.200e-04 9
20DD 20.000 0.029 1.200e-04 10
30DD 30.000 0.031 0.000e+00 11
30DD 900 0.918 0 330

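Append the new rows, using one fewer than the target count for each label because the original DataFrame already holds one copy. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
vals = ['30DD'] * 29 + ['20DD'] * 19 + ['10DD'] * 9
df = pd.concat([df, df.loc[vals]])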
Finally, to sort the rows by the numeric part of the index labels:
df = df.iloc[df.index.str.extract(r'(\d+)', expand=False).astype(int).argsort()]

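Using numpy.repeat, you can build the list of index labels for the rows you wish to append, pass it to pd.DataFrame.loc, and concatenate the result with your original dataframe. The counts are the number of extra copies appended, so the row already in the dataframe is not included in them.
import numpy as np
import pandas as pd
vals = ['30DD', '20DD', '10DD']
counts = [30, 20, 10]
df = pd.concat([df, df.loc[np.repeat(vals, counts)]])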
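For reference, here is a small self-contained sketch of the same idea that repeats rows in place, so the copies stay next to the original row and no sorting step is needed. The four-row frame and the `target` mapping are stand-ins made up for illustration; only the labels and X/Y values are taken from the question.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame (only a few of its rows).
df = pd.DataFrame(
    {"X": [7.0, 10.0, 20.0, 30.0], "Y": [0.027, 0.028, 0.029, 0.031]},
    index=pd.Index(["7DD", "10DD", "20DD", "30DD"], name="Index"),
)

# Hypothetical mapping: total number of copies wanted per label.
# Labels that are not listed keep a single copy.
target = {"10DD": 10, "20DD": 20, "30DD": 30}
repeats = [target.get(label, 1) for label in df.index]

# Repeat row positions, then select them; each row appears `repeats` times.
positions = np.repeat(np.arange(len(df)), repeats)
replicated = df.iloc[positions]

print(replicated.index.value_counts())
```

Because the counts here are totals rather than extra copies, there is no need to subtract one.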

Related

Odds ratios in MNLogit regression in statsmodels

I have this multinomial logit regression model fitted with statsmodels:
writer = pd.ExcelWriter(path=os.path.join(export_path, f'regression.xlsx'), engine='xlsxwriter')
vars_matrix_df = pd.read_csv(data_path, skipinitialspace=True)
corr_cols = ['sales_vs_service', 'agent_experience', 'minutes_passed_since_shift_started', 'stage_in_conv',
'current_cust_wait_time', 'prev_cust_line_words', 'total_cust_words_in_conv',
'agent_total_turns', 'sentiment_score', 'max_sentiment', 'min_sentiment', 'last_sentiment',
'agent_response_time', 'customer_response_rate', 'is_last_cust_answered',
'conversation_opening', 'queue_length', 'total_lines_from_rep',
'agent_number_of_conversations', 'concurrency', 'rep_shift_start_time', 'first_cust_line_num_of_words',
'queue_wait_time', 'day_of_week', 'time_of_day']
reg_equation = st.formula.mnlogit(f'visitor_was_answered ~C(day_of_week)+C(time_of_day)+{"+".join(corr_cols)} ',
vars_matrix_df).fit()
The regression results:
visitor_was_answered=1 coef std err z P>|z| \
0 C(time_of_day)[T.10] 0.0071 1910000.000 3.700000e-09 1.000
1 C(time_of_day)[T.11] 0.0067 698000.000 9.600000e-09 1.000
2 C(time_of_day)[T.12] 0.0016 1790000.000 9.200000e-10 1.000
3 C(time_of_day)[T.13] 0.0031 561000.000 5.570000e-09 1.000
4 C(time_of_day)[T.14] 0.0037 1310000.000 2.840000e-09 1.000
5 C(time_of_day)[T.15] 0.0011 548000.000 2.020000e-09 1.000
6 C(time_of_day)[T.17] 0.0044 814000.000 5.440000e-09 1.000
7 C(time_of_day)[T.18] 0.0009 1100000.000 8.270000e-10 1.000
8 C(time_of_day)[T.19] 0.0047 835000.000 5.640000e-09 1.000
9 C(time_of_day)[T.20] 0.0009 1140000.000 8.100000e-10 1.000
10 time_of_day[T.10] 0.0071 1930000.000 3.670000e-09 1.000
11 time_of_day[T.11] 0.0067 686000.000 9.770000e-09 1.000
12 time_of_day[T.12] 0.0016 1800000.000 9.150000e-10 1.000
13 time_of_day[T.13] 0.0031 556000.000 5.620000e-09 1.000
14 time_of_day[T.14] 0.0037 1240000.000 3.010000e-09 1.000
15 time_of_day[T.15] 0.0011 638000.000 1.740000e-09 1.000
16 time_of_day[T.17] 0.0044 1010000.000 4.400000e-09 1.000
17 time_of_day[T.18] 0.0009 1130000.000 8.020000e-10 1.000
18 time_of_day[T.19] 0.0047 860000.000 5.480000e-09 1.000
19 time_of_day[T.20] 0.0009 1120000.000 8.270000e-10 1.000
20 sales_vs_service -0.0448 0.006 -8.102000e+00 0.000
21 agent_experience -0.0414 0.008 -4.955000e+00 0.000
22 current_cust_wait_time -39.1333 0.414 -9.457400e+01 0.000
23 prev_cust_line_words 20.0439 0.236 8.494600e+01 0.000
24 agent_total_turns 0.1110 0.038 2.949000e+00 0.003
25 sentiment_score -4.3454 0.157 -2.759000e+01 0.000
26 agent_response_time -118.0821 2.205 -5.354600e+01 0.000
27 customer_response_rate -7.0865 0.630 -1.125500e+01 0.000
28 is_last_cust_answered -0.2537 0.005 -4.860800e+01 0.000
29 conversation_opening -0.4533 0.006 -7.206300e+01 0.000
30 queue_length -1.5427 0.018 -8.642700e+01 0.000
31 agent_number_of_conversations 0.0013 0.018 7.300000e-02 0.941
32 first_cust_line_num_of_words -3.7545 0.123 -3.056900e+01 0.000
33 queue_wait_time -0.3308 0.166 -1.997000e+00 0.046
I want to add the odds ratio of each variable to this regression output. I think the coefficients are already odds ratios, but I could not find anything that confirms this. How can this be done, and what do the coefficients represent here?
Thanks!
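One way to approach this, sketched under the assumption that visitor_was_answered has two levels with 0 as the reference: coefficients of a logit-type model are on the log-odds scale, so they are not odds ratios themselves, and exponentiating them gives the odds ratios. The example below copies three coefficients and standard errors from the table above instead of refitting the model. With a fitted statsmodels result, the same idea would be np.exp applied to its params attribute.

```python
import numpy as np
import pandas as pd

# Coefficients and standard errors copied from the results table above.
coefs = pd.Series(
    {"sales_vs_service": -0.0448, "agent_total_turns": 0.1110, "queue_wait_time": -0.3308}
)
std_err = pd.Series(
    {"sales_vs_service": 0.006, "agent_total_turns": 0.038, "queue_wait_time": 0.166}
)

# exp(coefficient) is the odds ratio for a one-unit increase in the variable.
# The interval is a 95% Wald interval, exponentiated to the odds-ratio scale.
summary = pd.DataFrame(
    {
        "odds_ratio": np.exp(coefs),
        "ci_low": np.exp(coefs - 1.96 * std_err),
        "ci_high": np.exp(coefs + 1.96 * std_err),
    }
)
print(summary.round(3))
```

An odds ratio below 1 means the odds of the outcome fall as the variable increases, and a value above 1 means they rise.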

Supervised learning for time series data

I have the following time series data and want to fit a classification model. As an independent variable, I want to pass an array of the previous 5 values of feature1/feature2, each multiplied by some weight. For example, on 06-03-2015 for ID 1: [a1 a2 a3 a4 a5] [0.053 0.036 0.044 0.087 0.02]
ID feature1 Date feature2
1 0.053 02-03-2015 0.0115
1 0.05 08-03-2015 0.0117
1 0.099 09-03-2015 0.00355
1 0.006 10-03-2015 0.0088
1 0.007 11-03-2015 0.0968
1 0.0045 12-03-2015 0.08325
1 0.068 13-03-2015 0.0055
1 0.097 14-03-2015 0.0668
1 0.082 18-03-2015 0.0635
2 0.053 21-03-2015 0.0115
2 0.05 26-03-2015 0.0117
2 0.099 27-03-2015 0.00355
2 0.006 28-03-2015 0.0088
2 0.007 29-03-2015 0.0968
2 0.068 31-03-2015 0.0055
2 0.097 01-04-2015 0.0668
2 0.017 02-04-2015 0.0145
2 0.049 06-04-2015 0.0556
How would I assign weights to the values on a rolling basis with window = 5? The weights can be between 0 and 1, so that I can multiply them with the values and use the result as one of the independent variables. How can I use an LSTM model for this kind of data?
This article on Machine Learning Mastery walks you through how to do that.
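For the weighting part of the question, a minimal sketch with pandas might look like the following. The weight vector is hypothetical (the question does not give one), and the frame only reuses the ID and feature1 columns from the data above. Each row gets the weighted sum of the 5 values strictly before it within the same ID.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1] * 9 + [2] * 9,
        "feature1": [0.053, 0.05, 0.099, 0.006, 0.007, 0.0045, 0.068, 0.097, 0.082,
                     0.053, 0.05, 0.099, 0.006, 0.007, 0.068, 0.097, 0.017, 0.049],
    }
)

window = 5
# Hypothetical weights between 0 and 1, ordered from oldest to newest value.
weights = np.array([0.1, 0.15, 0.2, 0.25, 0.3])

def weighted_sum(values):
    # `values` holds the window's observations, oldest first.
    return float(np.dot(values, weights))

# shift(1) excludes the current row, so only previous values enter the window;
# grouping by ID keeps windows from spilling across different IDs.
df["feature1_weighted"] = df.groupby("ID")["feature1"].transform(
    lambda s: s.shift(1).rolling(window).apply(weighted_sum, raw=True)
)
print(df)
```

The first 5 rows of every ID are NaN because they do not yet have 5 previous observations; such rows would need to be dropped or filled before training a model.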

Pandas compute average for two consecutive rows and save result in two cells

I have the following data:
INPUT
ID A
1 0.040
2 0.086
3 0.127
4 0.173
5 0.141
6 0.047
7 0.068
8 0.038
I want to create a column B in which each pair of consecutive rows holds the average of their two A values, as follows:
OUTPUT
ID A B
1 0.040 0.063
2 0.086 0.063
3 0.127 0.150
4 0.173 0.150
5 0.141 0.094
6 0.047 0.094
7 0.068 0.053
8 0.038 0.053
I tried this code:
df["B"]= (df['A'] + df['A'].shift(-1))/2
I get the average, but I can't make it repeat across each pair of rows.
You can do it this way:
In [10]: df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
In [11]: df
Out[11]:
ID A B
0 1 0.040 0.063
1 2 0.086 0.063
2 3 0.127 0.150
3 4 0.173 0.150
4 5 0.141 0.094
5 6 0.047 0.094
6 7 0.068 0.053
7 8 0.038 0.053

Pandas reshaping functions

To add to the many excellent examples of this, I'm trying to reshape my data into the format I want.
I currently have data indexed by customer, purchase category and date, with observations for each intra-day time period across the columns.
I want to aggregate by purchase category, and reshape so that my data is indexed by date and time, while customers appear across the columns.
What's the simplest way to achieve this?
In text form, the original data looks like this:
Customer Purchase Category date 00:30 01:00 01:30
1 A 01/07/2012 1.25 1.25 1.25
1 B 01/07/2012 0.855 0.786 0.604
1 C 01/07/2012 0 0 0
1 A 02/07/2012 1.25 1.25 1.125
1 B 02/07/2012 0.309 0.082 0.059
1 C 02/07/2012 0 0 0
2 A 01/07/2012 0 0 0
2 B 01/07/2012 0.167 0.108 0.119
2 C 01/07/2012 0 0 0
2 A 02/07/2012 0 0 0
2 B 02/07/2012 0.11 0.109 0.123
I think you need groupby with an aggregating sum, then a reshape with stack and unstack. Finally, pop the level_1 column, append it to date and convert with to_datetime:
print (df)
Customer Purchase Category date 00:30 01:00 01:30
0 1 A 01/07/2012 1.250 1.250 1.250
1 1 B 01/07/2012 0.855 0.786 0.604
2 1 C 01/07/2012 0.000 0.000 0.000
3 1 A 02/07/2012 1.250 1.250 1.125
4 1 B 02/07/2012 0.309 0.082 0.059
5 1 C 02/07/2012 0.000 0.000 0.000
6 2 A 01/07/2012 0.000 0.000 0.000
7 2 B 01/07/2012 0.167 0.108 0.119
8 2 C 01/07/2012 0.000 0.000 0.000
9 2 A 02/07/2012 0.000 0.000 0.000
10 2 B 02/07/2012 0.110 0.109 0.123
df1 = df.groupby(['Customer','date']).sum(numeric_only=True).stack().unstack(0).reset_index()
df1.date = pd.to_datetime(df1.date + df1.pop('level_1'), format='%d/%m/%Y%H:%M')
print (df1)
Customer date 1 2
0 2012-07-01 00:30:00 2.105 0.167
1 2012-07-01 01:00:00 2.036 0.108
2 2012-07-01 01:30:00 1.854 0.119
3 2012-07-02 00:30:00 1.559 0.110
4 2012-07-02 01:00:00 1.332 0.109
5 2012-07-02 01:30:00 1.184 0.123
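An alternative route to the same shape, shown here only as a sketch and not as the answer's method, is to melt the time columns into rows and then pivot with a sum. The frame below is a trimmed copy of the question's data (categories A and B on 01/07/2012 only), and the names long_df, time, value and timestamp are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Customer": [1, 1, 2, 2],
        "Purchase Category": ["A", "B", "A", "B"],
        "date": ["01/07/2012"] * 4,
        "00:30": [1.25, 0.855, 0.0, 0.167],
        "01:00": [1.25, 0.786, 0.0, 0.108],
        "01:30": [1.25, 0.604, 0.0, 0.119],
    }
)

# Turn the intra-day columns into rows: one row per customer, category, date and time.
long_df = df.melt(
    id_vars=["Customer", "Purchase Category", "date"],
    var_name="time",
    value_name="value",
)

# Combine date and time into a single timestamp.
long_df["timestamp"] = pd.to_datetime(
    long_df["date"] + " " + long_df["time"], format="%d/%m/%Y %H:%M"
)

# Sum over purchase categories, with customers across the columns.
wide = long_df.pivot_table(
    index="timestamp", columns="Customer", values="value", aggfunc="sum"
)
print(wide)
```

This keeps each step (unpivot, build the timestamp, aggregate) separate, which can be easier to adapt if the aggregation or the time format changes.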

Create a rolling custom EWMA on a pandas dataframe

I am trying to create a rolling EWMA with decay = 1-ln(2)/3 over the last 13 values of a df, using weights such as:
factor
Out[36]:
EWMA
0 0.043
1 0.056
2 0.072
3 0.094
4 0.122
5 0.159
6 0.207
7 0.269
8 0.350
9 0.455
10 0.591
11 0.769
12 1.000
I have a df of monthly returns like this :
change.tail(5)
Out[41]:
date
2016-04-30 0.033 0.031 0.010 0.007 0.014 -0.006 -0.001 0.035 -0.004 0.020 0.011 0.003
2016-05-31 0.024 0.007 0.017 0.022 -0.012 0.034 0.019 0.001 0.006 0.032 -0.002 0.015
2016-06-30 -0.027 -0.004 -0.060 -0.057 -0.001 -0.096 -0.027 -0.096 -0.034 -0.024 0.044 0.001
2016-07-31 0.063 0.036 0.048 0.068 0.053 0.064 0.032 0.052 0.048 0.013 0.034 0.036
2016-08-31 -0.004 0.012 -0.005 0.009 0.028 0.005 -0.002 -0.003 -0.001 0.005 0.013 0.003
I am just trying to apply this rolling EWMA to each column. I know that pandas has an EWMA method, but I can't figure out how to pass the right 1-ln(2)/3 factor.
Help would be appreciated! Thanks!
@piRSquared's answer is a good approximation, but values outside the last 13 also receive (tiny) weights, so it is not exactly what was asked.
pandas supports rolling window calculations, but ewm is not among the rolling functions it provides, so we have to implement our own.
Assuming series is the time series to average:
from functools import partial
import numpy as np
window = 13
decay = 1 - np.log(2) / 3  # the question's decay factor, about 0.769
# Oldest observation first: decay**12, ..., decay**1, decay**0, matching the factor table.
weights = decay ** np.arange(window)[::-1]
ewma = partial(np.average, weights=weights)
rolling_average = series.rolling(window).apply(ewma, raw=True)
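Use ewm with mean(). pandas weights an observation i steps back by (1 - alpha)**i, so a decay of 1 - ln(2)/3 corresponds to alpha = ln(2)/3:
df.ewm(alpha=np.log(2) / 3).mean()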
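To apply the 13-observation weighted average to every column at once, a sketch along these lines may help. The change frame here is filled with made-up random returns, and window_ewma is a helper name chosen for this example. The code also computes the full-history ewm result so the two can be compared.

```python
import numpy as np
import pandas as pd

window = 13
decay = 1 - np.log(2) / 3
# Oldest observation first, newest last (weight 1.0), as in the question's factor table.
weights = decay ** np.arange(window)[::-1]

# Made-up monthly returns standing in for the question's `change` frame.
rng = np.random.default_rng(0)
change = pd.DataFrame(rng.normal(0, 0.03, size=(40, 3)), columns=["a", "b", "c"])

def window_ewma(values):
    # Weighted average over exactly `window` observations.
    return np.average(values, weights=weights)

# DataFrame.rolling(...).apply runs the function on each column separately.
rolling_ewma = change.rolling(window).apply(window_ewma, raw=True)

# Full-history exponential weighting, for comparison.
full_ewma = change.ewm(alpha=1 - decay).mean()

print(rolling_ewma.tail(3))
print(full_ewma.tail(3))
```

The two results coincide on the 13th row, where the full history is exactly 13 observations long, and drift apart slightly afterwards because ewm keeps giving small weights to older values.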
