Pandas drop_duplicates not finding all duplicates - python

I have a problem with drop_duplicates in a pandas dataframe. I'm importing lots of mixed data from an Excel file into a dataframe and then doing various things to clean up the data. One of the stages is to remove any duplicates based on their coordinates.
In general this works pretty well and, importantly, it's very fast, but I've had some problems, and after an extensive search of the dataset I've found that pandas is always missing a few duplicates.
Here's my test dataset:
x y z radius scale type
0 128.798699 76.038331 0.000 1.172 1.000 Node_B
1 136.373699 78.068331 0.000 1.172 1.000 Node_B
2 133.171699 74.866331 0.000 1.172 1.000 Node_B
3 135.201699 76.038331 0.000 1.172 1.000 Node_B
4 135.201699 82.442331 0.000 1.172 1.000 Node_B
5 136.373699 80.412331 0.000 1.172 1.000 Node_B
6 133.171699 83.614331 0.000 1.172 1.000 Node_B
7 127.626699 78.068331 0.000 1.172 1.000 Node_B
8 131.999699 79.240331 0.000 2.750 1.000 Node_A
9 90.199699 94.795331 0.626 0.325 0.650 Rib_B
10 85.799699 95.445331 0.626 0.325 0.650 Rib_B
11 90.199699 95.445331 0.626 0.325 0.650 Rib_B
12 91.865699 95.557331 0.537 0.438 0.876 Rib_B
13 128.798699 82.442331 0.000 1.172 1.000 Node_B
14 136.373699 80.412331 0.000 1.172 1.000 Node_B
15 158.373699 38.448331 0.000 1.172 1.000 Node_B
16 152.827699 35.246331 0.000 1.172 1.000 Node_B
17 157.201699 36.418331 0.000 1.172 1.000 Node_B
18 155.171699 35.246331 0.000 1.172 1.000 Node_B
19 215.626699 80.412331 0.000 1.172 1.000 Node_B
20 218.827699 83.614331 0.000 1.172 1.000 Node_B
21 216.798699 82.442331 0.000 1.172 1.000 Node_B
22 131.999699 79.240331 0.000 2.750 1.000 Node_A
23 128.798699 76.038331 0.000 1.172 1.000 Node_B
24 136.373699 78.068331 0.000 1.172 1.000 Node_B
25 162.051699 70.180331 0.626 0.325 0.650 Rib_D
26 162.619699 70.496331 0.626 0.325 0.650 Rib_D
27 189.948699 70.180331 0.626 0.325 0.650 Rib_D
I'm finding duplicates based on the x, y, z coordinates, as these should be unique locations, so I use df.drop_duplicates(subset=['x', 'y', 'z'], inplace=True) to remove any duplicates from the dataframe. This removes about 90% of my duplicates, but it always seems to miss some.
In the example dataframe there are several duplicate pairs [0==23, 1==24, 5==14, 8==22], but pandas fails to remove them.
I found these using numpy and a very slow iterative loop that compares every point to every other point. That's fine for 50 or 100 points, but it takes 15-20 minutes when there are 100-200K records in the dataframe.
How do I fix this? drop_duplicates has no precision parameter, so why does it miss some?

You can use round as suggested by @mozway. The rows that look identical actually differ at decimal places beyond the displayed precision, so drop_duplicates, which compares values exactly, never sees them as equal; rounding before comparing fixes that:
PRECISION = 3
df.drop(df[['x', 'y', 'z']].round(PRECISION).duplicated().loc[lambda x: x].index, inplace=True)
print(df)
# Output
x y z radius scale type
0 128.798699 76.038331 0.000 1.172 1.000 Node_B
1 136.373699 78.068331 0.000 1.172 1.000 Node_B
2 133.171699 74.866331 0.000 1.172 1.000 Node_B
3 135.201699 76.038331 0.000 1.172 1.000 Node_B
4 135.201699 82.442331 0.000 1.172 1.000 Node_B
5 136.373699 80.412331 0.000 1.172 1.000 Node_B
6 133.171699 83.614331 0.000 1.172 1.000 Node_B
7 127.626699 78.068331 0.000 1.172 1.000 Node_B
8 131.999699 79.240331 0.000 2.750 1.000 Node_A
9 90.199699 94.795331 0.626 0.325 0.650 Rib_B
10 85.799699 95.445331 0.626 0.325 0.650 Rib_B
11 90.199699 95.445331 0.626 0.325 0.650 Rib_B
12 91.865699 95.557331 0.537 0.438 0.876 Rib_B
13 128.798699 82.442331 0.000 1.172 1.000 Node_B
15 158.373699 38.448331 0.000 1.172 1.000 Node_B
16 152.827699 35.246331 0.000 1.172 1.000 Node_B
17 157.201699 36.418331 0.000 1.172 1.000 Node_B
18 155.171699 35.246331 0.000 1.172 1.000 Node_B
19 215.626699 80.412331 0.000 1.172 1.000 Node_B
20 218.827699 83.614331 0.000 1.172 1.000 Node_B
21 216.798699 82.442331 0.000 1.172 1.000 Node_B
25 162.051699 70.180331 0.626 0.325 0.650 Rib_D
26 162.619699 70.496331 0.626 0.325 0.650 Rib_D
27 189.948699 70.180331 0.626 0.325 0.650 Rib_D
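An equivalent, arguably simpler formulation (a sketch, not from the original answer) builds the same rounded-duplicate mask but keeps the surviving rows directly instead of collecting an index to drop:
PRECISION = 3
df = df[~df[['x', 'y', 'z']].round(PRECISION).duplicated()]
Both versions keep the first occurrence of each rounded (x, y, z) triple; pick a PRECISION coarse enough to absorb your coordinate noise but fine enough not to merge genuinely distinct points.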

Related

Odds Ratios in MN Logit regression in stats model

I have this multinomial regression model built with statsmodels:
import os
import pandas as pd
import statsmodels.api as st  # so that st.formula.mnlogit resolves

writer = pd.ExcelWriter(path=os.path.join(export_path, f'regression.xlsx'), engine='xlsxwriter')
vars_matrix_df = pd.read_csv(data_path, skipinitialspace=True)
corr_cols = ['sales_vs_service', 'agent_experience', 'minutes_passed_since_shift_started', 'stage_in_conv',
'current_cust_wait_time', 'prev_cust_line_words', 'total_cust_words_in_conv',
'agent_total_turns', 'sentiment_score', 'max_sentiment', 'min_sentiment', 'last_sentiment',
'agent_response_time', 'customer_response_rate', 'is_last_cust_answered',
'conversation_opening', 'queue_length', 'total_lines_from_rep',
'agent_number_of_conversations', 'concurrency', 'rep_shift_start_time', 'first_cust_line_num_of_words',
'queue_wait_time', 'day_of_week', 'time_of_day']
reg_equation = st.formula.mnlogit(f'visitor_was_answered ~C(day_of_week)+C(time_of_day)+{"+".join(corr_cols)} ',
vars_matrix_df).fit()
The regression results:
visitor_was_answered=1 coef std err z P>|z| \
0 C(time_of_day)[T.10] 0.0071 1910000.000 3.700000e-09 1.000
1 C(time_of_day)[T.11] 0.0067 698000.000 9.600000e-09 1.000
2 C(time_of_day)[T.12] 0.0016 1790000.000 9.200000e-10 1.000
3 C(time_of_day)[T.13] 0.0031 561000.000 5.570000e-09 1.000
4 C(time_of_day)[T.14] 0.0037 1310000.000 2.840000e-09 1.000
5 C(time_of_day)[T.15] 0.0011 548000.000 2.020000e-09 1.000
6 C(time_of_day)[T.17] 0.0044 814000.000 5.440000e-09 1.000
7 C(time_of_day)[T.18] 0.0009 1100000.000 8.270000e-10 1.000
8 C(time_of_day)[T.19] 0.0047 835000.000 5.640000e-09 1.000
9 C(time_of_day)[T.20] 0.0009 1140000.000 8.100000e-10 1.000
10 time_of_day[T.10] 0.0071 1930000.000 3.670000e-09 1.000
11 time_of_day[T.11] 0.0067 686000.000 9.770000e-09 1.000
12 time_of_day[T.12] 0.0016 1800000.000 9.150000e-10 1.000
13 time_of_day[T.13] 0.0031 556000.000 5.620000e-09 1.000
14 time_of_day[T.14] 0.0037 1240000.000 3.010000e-09 1.000
15 time_of_day[T.15] 0.0011 638000.000 1.740000e-09 1.000
16 time_of_day[T.17] 0.0044 1010000.000 4.400000e-09 1.000
17 time_of_day[T.18] 0.0009 1130000.000 8.020000e-10 1.000
18 time_of_day[T.19] 0.0047 860000.000 5.480000e-09 1.000
19 time_of_day[T.20] 0.0009 1120000.000 8.270000e-10 1.000
20 sales_vs_service -0.0448 0.006 -8.102000e+00 0.000
21 agent_experience -0.0414 0.008 -4.955000e+00 0.000
22 current_cust_wait_time -39.1333 0.414 -9.457400e+01 0.000
23 prev_cust_line_words 20.0439 0.236 8.494600e+01 0.000
24 agent_total_turns 0.1110 0.038 2.949000e+00 0.003
25 sentiment_score -4.3454 0.157 -2.759000e+01 0.000
26 agent_response_time -118.0821 2.205 -5.354600e+01 0.000
27 customer_response_rate -7.0865 0.630 -1.125500e+01 0.000
28 is_last_cust_answered -0.2537 0.005 -4.860800e+01 0.000
29 conversation_opening -0.4533 0.006 -7.206300e+01 0.000
30 queue_length -1.5427 0.018 -8.642700e+01 0.000
31 agent_number_of_conversations 0.0013 0.018 7.300000e-02 0.941
32 first_cust_line_num_of_words -3.7545 0.123 -3.056900e+01 0.000
33 queue_wait_time -0.3308 0.166 -1.997000e+00 0.046
To this regression output I want to add the odds ratio for each variable. I think the coefficients are already odds ratios, but I couldn't find any confirmation of that. Any idea how this can be done, and what do the coefficients represent here?
Thanks!
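The coefficients mnlogit reports are log-odds relative to the reference category, not odds ratios, so exponentiating them gives the odds ratios. A minimal sketch, assuming reg_equation is the fitted result from above:
import numpy as np
# params holds log-odds coefficients; exp() converts them to odds ratios
odds_ratios = np.exp(reg_equation.params)
print(odds_ratios)
(The huge standard errors on the time_of_day terms also suggest the formula includes that variable twice, once as C(time_of_day) and once via corr_cols, making those columns collinear.)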

Pandas compute average for two consecutive rows and save result in two cells

I have the following data:
INPUT
ID A
1 0.040
2 0.086
3 0.127
4 0.173
5 0.141
6 0.047
7 0.068
8 0.038
I want to create column B, where each pair of rows in B holds the average of the corresponding pair of rows in A, as follows:
OUTPUT
ID A B
1 0.040 0.063
2 0.086 0.063
3 0.127 0.150
4 0.173 0.150
5 0.141 0.094
6 0.047 0.094
7 0.068 0.053
8 0.038 0.053
I tried this code:
df["B"] = (df['A'] + df['A'].shift(-1)) / 2
I got an average, but I can't make it distribute the same value across each pair of rows.
You can do it this way:
In [10]: df['B'] = df.groupby(np.arange(len(df)) // 2)['A'].transform('mean')
In [11]: df
Out[11]:
ID A B
0 1 0.040 0.063
1 2 0.086 0.063
2 3 0.127 0.150
3 4 0.173 0.150
4 5 0.141 0.094
5 6 0.047 0.094
6 7 0.068 0.053
7 8 0.038 0.053
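The trick is the grouping key: np.arange(len(df)) // 2 labels consecutive pairs of rows with the same group number, and transform('mean') broadcasts each pair's mean back onto both rows. A quick illustration of the key (assuming import numpy as np):
In [12]: np.arange(8) // 2
Out[12]: array([0, 0, 1, 1, 2, 2, 3, 3])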

Create a rolling custom EWMA on a pandas dataframe

I am trying to create a rolling EWMA with a decay of 1-ln(2)/3 over the last 13 values of a df, such as:
factor
Out[36]:
EWMA
0 0.043
1 0.056
2 0.072
3 0.094
4 0.122
5 0.159
6 0.207
7 0.269
8 0.350
9 0.455
10 0.591
11 0.769
12 1.000
I have a df of monthly returns like this:
change.tail(5)
Out[41]:
date
2016-04-30 0.033 0.031 0.010 0.007 0.014 -0.006 -0.001 0.035 -0.004 0.020 0.011 0.003
2016-05-31 0.024 0.007 0.017 0.022 -0.012 0.034 0.019 0.001 0.006 0.032 -0.002 0.015
2016-06-30 -0.027 -0.004 -0.060 -0.057 -0.001 -0.096 -0.027 -0.096 -0.034 -0.024 0.044 0.001
2016-07-31 0.063 0.036 0.048 0.068 0.053 0.064 0.032 0.052 0.048 0.013 0.034 0.036
2016-08-31 -0.004 0.012 -0.005 0.009 0.028 0.005 -0.002 -0.003 -0.001 0.005 0.013 0.003
I am just trying to apply this rolling EWMA to each column. I know that pandas has an EWMA method, but I can't figure out how to pass it the right 1-ln(2)/3 factor.
Help would be appreciated! Thanks!
@piRSquared's answer is a good approximation, but values outside the last 13 also carry weight (albeit tiny), so it's not totally correct.
pandas can do rolling window calculations, but among the aggregations rolling supports out of the box, an exponentially weighted mean is not one of them, so we have to implement our own.
Assuming series is our time series to average:
from functools import partial
import numpy as np

window = 13
alpha = 1 - np.log(2) / 3  # the decay factor from the question
# The weight of the value n steps back is alpha**n; reversing puts the
# most recent value last with weight 1, reproducing the question's
# factor column [0.043, ..., 0.769, 1.000].
weights = list(reversed([alpha ** n for n in range(window)]))
ewma = partial(np.average, weights=weights)
rolling_average = series.rolling(window).apply(ewma)
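rolling(...).apply works column by column on a DataFrame, so the same line covers the question's whole table of monthly returns in one go:
rolling_average = change.rolling(window).apply(ewma)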
Alternatively, use ewm with mean():
df.ewm(halflife=1 - np.log(2) / 3).mean()
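Note that ewm's halflife argument is not the same thing as the per-step decay factor. If the infinite-window approximation is acceptable, the same 0.769-per-step decay can be expressed through ewm's alpha smoothing parameter, whose weights are (1-alpha)**n, so alpha = ln(2)/3 (a sketch, not from the original answers):
df.ewm(alpha=np.log(2) / 3).mean()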

Tests are 4 times slower under PyPy

I am running tests for my project using nose2:
#!/bin/sh
nose2 --config=tests/nose2.cfg "$@"
Under CPython tests run 4 times faster than under PyPy:
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
...
Ran 58 tests in 25.369s
Python 2.7.9 (2.5.1+dfsg-1~ppa1+ubuntu14.04, Mar 27 2015, 19:19:42)
[PyPy 2.5.1 with GCC 4.8.2] on linux2
...
Ran 58 tests in 100.854s
What could be the cause?
Is there a way to tweak the PyPy configuration using environment variables or a configuration file on some standard path? In my case I am running a nose bootstrap script, so I cannot control the command-line options passed to PyPy.
Here is a profile of one specific test under CPython:
1272695 function calls (1261234 primitive calls) in 1.165 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1224 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.171 1.171 test_progress.py:37(test_progress)
15 0.000 0.000 1.169 0.078 __init__.py:52(api_request)
15 0.000 0.000 1.160 0.077 __init__.py:46(request)
15 0.000 0.000 1.152 0.077 test.py:695(open)
15 0.000 0.000 1.150 0.077 test.py:655(run_wsgi_app)
15 0.000 0.000 1.144 0.076 test.py:828(run_wsgi_app)
15 0.000 0.000 1.144 0.076 application.py:101(__call__)
15 0.000 0.000 1.138 0.076 sessions.py:329(__call__)
15 0.000 0.000 1.071 0.071 course_object.py:14(__call__)
15 0.000 0.000 1.005 0.067 user_auth.py:7(__call__)
15 0.000 0.000 0.938 0.063 application.py:27(application)
15 0.000 0.000 0.938 0.063 application.py:81(wsgi_app)
15 0.000 0.000 0.876 0.058 ember_backend.py:188(__call__)
15 0.000 0.000 0.875 0.058 ember_backend.py:233(handle_request)
176 0.002 0.000 0.738 0.004 __init__.py:42(instrumented_method)
176 0.003 0.000 0.623 0.004 __init__.py:58(get_stack)
176 0.000 0.000 0.619 0.004 inspect.py:1053(stack)
176 0.010 0.000 0.619 0.004 inspect.py:1026(getouterframes)
294 0.001 0.000 0.614 0.002 cursor.py:1072(next)
248 0.001 0.000 0.612 0.002 cursor.py:998(_refresh)
144 0.002 0.000 0.608 0.004 cursor.py:912(__send_message)
7248 0.041 0.000 0.607 0.000 inspect.py:988(getframeinfo)
8 0.000 0.000 0.544 0.068 test_progress.py:31(get_progress_data)
4 0.000 0.000 0.529 0.132 test_progress.py:25(finish_item)
249 0.001 0.000 0.511 0.002 base.py:1131(next)
7 0.000 0.000 0.449 0.064 ember_backend.py:240(_handle_request)
8 0.000 0.000 0.420 0.053 ember_backend.py:307(_handle_request)
8 0.000 0.000 0.420 0.053 user_state.py:13(list)
8 0.001 0.000 0.407 0.051 user_progress.py:28(update_progress)
4 0.000 0.000 0.397 0.099 entity.py:253(post)
7248 0.051 0.000 0.362 0.000 inspect.py:518(findsource)
14532 0.083 0.000 0.332 0.000 inspect.py:440(getsourcefile)
92/63 0.000 0.000 0.308 0.005 objects.py:22(__get__)
61 0.001 0.000 0.304 0.005 base.py:168(get)
139 0.000 0.000 0.250 0.002 queryset.py:65(_iter_results)
51 0.001 0.000 0.249 0.005 queryset.py:83(_populate_cache)
29 0.000 0.000 0.220 0.008 __init__.py:81(save)
29 0.001 0.000 0.219 0.008 document.py:181(save)
21780 0.051 0.000 0.140 0.000 inspect.py:398(getfile)
32 0.002 0.000 0.139 0.004 {pymongo._cmessage._do_batched_write_command}
And the same test with PyPy:
6037905 function calls (6012014 primitive calls) in 7.475 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1354 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.960 7.960 test_progress.py:37(test_progress)
15 0.000 0.000 7.948 0.530 __init__.py:52(api_request)
15 0.000 0.000 7.873 0.525 __init__.py:46(request)
15 0.000 0.000 7.860 0.524 test.py:695(open)
15 0.000 0.000 7.856 0.524 test.py:655(run_wsgi_app)
15 0.000 0.000 7.845 0.523 test.py:828(run_wsgi_app)
15 0.000 0.000 7.844 0.523 application.py:101(__call__)
15 0.000 0.000 7.827 0.522 sessions.py:329(__call__)
15 0.000 0.000 7.205 0.480 course_object.py:14(__call__)
176 0.004 0.000 6.605 0.038 __init__.py:42(instrumented_method)
15 0.000 0.000 6.591 0.439 user_auth.py:7(__call__)
176 0.008 0.000 6.314 0.036 __init__.py:58(get_stack)
176 0.001 0.000 6.305 0.036 inspect.py:1063(stack)
176 0.027 0.000 6.304 0.036 inspect.py:1036(getouterframes)
7839 0.081 0.000 6.274 0.001 inspect.py:998(getframeinfo)
15 0.000 0.000 5.983 0.399 application.py:27(application)
15 0.001 0.000 5.983 0.399 application.py:81(wsgi_app)
15 0.000 0.000 5.901 0.393 ember_backend.py:188(__call__)
15 0.000 0.000 5.899 0.393 ember_backend.py:233(handle_request)
15714/15713 0.189 0.000 5.828 0.000 inspect.py:441(getsourcefile)
294 0.002 0.000 5.473 0.019 cursor.py:1072(next)
248 0.002 0.000 5.470 0.022 cursor.py:998(_refresh)
144 0.004 0.000 5.445 0.038 cursor.py:912(__send_message)
8367 2.133 0.000 5.342 0.001 inspect.py:473(getmodule)
249 0.002 0.000 4.316 0.017 base.py:1131(next)
8 0.000 0.000 3.966 0.496 test_progress.py:31(get_progress_data)
4 0.000 0.000 3.209 0.802 test_progress.py:25(finish_item)
7839 0.098 0.000 3.185 0.000 inspect.py:519(findsource)
7 0.000 0.000 2.944 0.421 ember_backend.py:240(_handle_request)
8 0.000 0.000 2.898 0.362 ember_backend.py:307(_handle_request)
8 0.000 0.000 2.898 0.362 user_state.py:13(list)
8 0.001 0.000 2.820 0.352 user_progress.py:28(update_progress)
61 0.001 0.000 2.546 0.042 base.py:168(get)
4 0.000 0.000 2.534 0.633 entity.py:253(post)
850362/849305 2.315 0.000 2.344 0.000 {hasattr}
92/63 0.001 0.000 2.004 0.032 objects.py:22(__get__)
127 0.001 0.000 1.915 0.015 queryset.py:65(_iter_results)
51 0.001 0.000 1.914 0.038 queryset.py:83(_populate_cache)
29 0.000 0.000 1.607 0.055 __init__.py:81(save)
29 0.001 0.000 1.605 0.055 document.py:181(save)
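Comparing the two profiles, the biggest relative blow-up is in the stack-capturing instrumentation: inspect.stack() under __init__.py:58(get_stack) goes from roughly 0.6s to 6.3s cumulative, because getframeinfo()/findsource()/getmodule() are far more expensive on PyPy, where inspecting frames forces the JIT to materialise them and re-read source files. A sketch of a cheaper capture, assuming the instrumentation only needs filename, line, and function name rather than full source context:
import sys

def get_stack(skip=1, limit=None):
    # Walk the frame chain directly instead of calling inspect.stack(),
    # which runs getframeinfo()/findsource() for every frame and
    # dominates the PyPy profile above.
    frames = []
    frame = sys._getframe(skip)
    while frame is not None and (limit is None or len(frames) < limit):
        code = frame.f_code
        frames.append((code.co_filename, frame.f_lineno, code.co_name))
        frame = frame.f_back
    return frames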

How do I plot this DataFrame?

I've got a pandas.DataFrame that looks like this:
>>> print df
0 1 2 3 4 5 6 7 8 9 10 11 \
0 0.198 0.198 0.266 0.198 0.236 0.199 0.198 0.198 0.199 0.199 0.199 0.198
1 0.032 0.034 0.039 0.405 0.442 0.382 0.343 0.311 0.282 0.255 0.232 0.210
2 0.702 0.702 0.742 0.709 0.755 0.708 0.708 0.712 0.707 0.706 0.706 0.706
3 0.109 0.112 0.114 0.114 0.128 0.532 0.149 0.118 0.115 0.114 0.114 0.112
4 0.309 0.306 0.311 0.311 0.316 0.513 1.977 0.313 0.311 0.310 0.311 0.309
5 0.280 0.277 0.282 0.278 0.282 0.383 1.122 1.685 0.280 0.280 0.282 0.280
6 0.466 0.460 0.465 0.465 0.468 0.508 0.829 1.100 1.987 0.465 0.465 0.463
7 0.469 0.464 0.469 0.470 0.469 0.490 0.648 0.783 1.095 2.002 0.469 0.466
8 0.137 0.120 0.137 0.138 0.137 0.136 0.144 0.149 0.166 0.209 0.137 0.136
9 0.125 0.107 0.125 0.126 0.125 0.122 0.126 0.128 0.132 0.144 0.125 0.123
10 0.125 0.106 0.125 0.123 0.123 0.122 0.125 0.128 0.132 0.142 0.125 0.123
11 0.127 0.107 0.125 0.125 0.125 0.122 0.126 0.127 0.132 0.142 0.125 0.123
12 0.125 0.107 0.125 0.128 0.125 0.123 0.126 0.127 0.132 0.142 0.125 0.122
13 0.871 0.862 0.871 0.872 0.872 0.872 0.873 0.872 0.875 0.880 0.873 0.872
14 0.114 0.115 0.116 0.117 0.131 0.536 0.153 0.123 0.118 0.117 0.117 0.116
15 0.033 0.032 0.031 0.032 0.032 0.040 0.033 0.033 0.032 0.032 0.032 0.032
12 13
0 0.198 0.198
1 0.190 0.172
2 0.705 0.705
3 0.112 0.115
4 0.308 0.310
5 0.275 0.278
6 0.462 0.463
7 0.466 0.466
8 0.134 1.678
9 0.122 1.692
10 0.122 1.694
11 0.122 1.695
12 0.122 1.684
13 0.872 1.255
14 0.116 0.127
15 0.031 0.032
[16 rows x 14 columns]
Each row represents a measurement value for an analog port. Each column is a test case. Thus there's one measurement for each of the analog ports, in each column.
When I plot this with DataFrame.plot() I end up with a plot that puts my rows, the 16 analog ports, on the x-axis. I would like to have the column numbers on the x-axis instead. I've tried to define the x-axis in plot() as below:
>>> df.plot(x=df.columns)
Which results in a
ValueError: Length mismatch: Expected axis has 16 elements, new values have 14 elements
How should I approach this?
You want something like
df.T.plot()
Plus some other formatting, but that will get you started.
The .T attribute transposes the DataFrame, so the test-case columns end up on the x-axis.
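A slightly fuller sketch (the axis labels here are assumptions, not from the original answer):
import matplotlib.pyplot as plt

ax = df.T.plot(legend=False)  # transpose so the test cases run along the x-axis
ax.set_xlabel('test case')
ax.set_ylabel('measured value')
plt.show()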
