drop duplicates in Python Pandas DataFrame not removing duplicates - python

I have a problem with removing duplicates. My program is based around a loop which generates (x, y) tuples that are then used as nodes in a graph. The final array/matrix of nodes is:
[[ 1. 1. ]
[ 1.12273268 1.15322175]
[..........etc..........]
[ 0.94120695 0.77802849]
**[ 0.84301344 0.91660517]**
[ 0.93096269 1.21383287]
**[ 0.84301344 0.91660517]**
[ 0.75506418 1.0798641 ]]
The length of the array is 22. Now, I need to remove the duplicate entries (see **). So I used:
def urows(array):
    df = pandas.DataFrame(array)
    df.drop_duplicates(take_last=True)
    return df.drop_duplicates(take_last=True).values
Fantastic, but I still get:
0 1
0 1.000000 1.000000
....... etc...........
17 1.039400 1.030320
18 0.941207 0.778028
**19 0.843013 0.916605**
20 0.930963 1.213833
**21 0.843013 0.916605**
So drop_duplicates is not removing anything. I tested to see if the nodes were actually the same, and I get:
print urows(total_nodes)[19,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[21,:]
---> [ 0.84301344 0.91660517]
print urows(total_nodes)[12,:] - urows(total_nodes)[13,:]
---> [ 0. 0.]
Why is it not working? How can I remove those duplicate values?
One more question: say two values are "nearly" equal (call them x1 and x2). Is there any way to replace them so that they are both equal? What I want is to replace x2 with x1 if they are "nearly" equal.

If I copy-paste in your data, I get:
>>> df
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
5 0.843013 0.916605
6 0.755064 1.079864
>>> df.drop_duplicates()
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
so it is actually removed, and your problem is that the arrays aren't exactly equal (though their difference rounds to 0 for display).
One workaround would be to round the data to however many decimal places are applicable with something like df.apply(np.round, args=[4]), then drop the duplicates. If you want to keep the original data but remove rows that are duplicate up to rounding, you can use something like
df = df.ix[~df.apply(np.round, args=[4]).duplicated()]
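Note that take_last and .ix come from older pandas releases; a minimal sketch of the same rounding workaround with current pandas, assuming df holds the node array, could be:
df = pandas.DataFrame(array)
mask = ~df.round(4).duplicated()   # True for the first occurrence of each 4-decimal-rounded row
df_unique = df.loc[mask]           # keeps the original, un-rounded values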
Here's one really clumsy way to do what you're asking for with setting nearly-equal values to be actually equal:
grouped = df.groupby([df[i].round(4) for i in df.columns])
subbed = grouped.apply(lambda g: g.apply(lambda row: g.irow(0), axis=1))
subbed.reset_index(level=list(df.columns), drop=True, inplace=True)
This reorders the dataframe, but you can then call .sort() to get them back in the original order if you need that.
Explanation: the first line uses groupby to group the data frame by the rounded values. Unfortunately, if you give a function to groupby it applies it to the labels rather than the rows (so you could maybe do df.groupby(lambda k: np.round(df.ix[k], 4)), but that sucks too).
The second line uses the apply method on groupby to replace the dataframe of near-duplicate rows, g, with a new dataframe g.apply(lambda row: g.irow(0), axis=1). That uses the apply method on dataframes to replace each row with the first row of the group.
The result then looks like
                          0         1
0      1
0.7551 1.0799 6    0.755064  1.079864
0.8430 0.9166 3    0.843013  0.916605
              5    0.843013  0.916605
0.9310 1.2138 4    0.930963  1.213833
0.9412 0.7780 2    0.941207  0.778028
1.0000 1.0000 0    1.000000  1.000000
1.1227 1.1532 1    1.122733  1.153222
where groupby has inserted the rounded values as an index. The reset_index line then drops those columns.
Hopefully someone who knows pandas better than I do will drop by and show how to do this better.
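For what it's worth, on newer pandas the same "snap near-duplicates to one representative" idea can be written more directly with transform; a sketch, not the original answer's code:
key = [df[c].round(4) for c in df.columns]      # group rows by their rounded values
snapped = df.groupby(key).transform('first')    # every row becomes its group's first row
deduped = snapped.drop_duplicates()             # optionally drop the now-exact duplicates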

Similar to @Dougal's answer, but in a slightly different way:
In [20]: df.ix[~(df*1e6).astype('int64').duplicated(cols=[0])]
Out[20]:
0 1
0 1.000000 1.000000
1 1.122733 1.153222
2 0.941207 0.778028
3 0.843013 0.916605
4 0.930963 1.213833
6 0.755064 1.079864
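Note that .ix and the cols= keyword of duplicated are gone in current pandas; a sketch of the same integer-scaling trick (now checking both columns by default) might be:
keep = ~(df * 1e6).astype('int64').duplicated()   # compare rows after scaling to integers
df_unique = df.loc[keep]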

Related

Pandas reorder rows of dataframe

I stumbled upon a very peculiar problem in Pandas. I have this dataframe:
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,2.0,3
1,1600349033921620000,1,18.5371406,-14.224917,0,-0.0113912,1.443597,20,0.5,0.9,-1,7,2.0,3
2,1600349033921650000,2,19.808648100000006,-6.778450599999998,0,0.037289,-1.0557937,20,0.5,0.9,-1,7,2.0,3
3,1600349033921670000,3,22.1796988,-5.7078115999999985,0,0.2585675,-1.2431861000000002,20,0.5,0.9,-1,7,2.0,3
4,1600349033921670000,4,20.757325,-16.115366,0,-0.2528627,0.7889673,20,0.5,0.9,-1,7,2.0,3
5,1600349033921690000,5,20.9491012,-17.7806833,0,0.5062633,0.9386511,20,0.5,0.9,-1,7,2.0,3
6,1600349033921690000,6,20.6225258,-5.5344404,0,-0.1192678,-0.7889041,20,0.5,0.9,-1,7,2.0,3
7,1600349033921700000,7,21.8077004,-14.736984,0,-0.0295737,1.3084618,20,0.5,0.9,-1,7,2.0,3
8,1600349033954560000,0,23.206789800000006,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,2.0,3
9,1600349033954570000,1,18.555421300000006,-13.7440508,0,0.0548418,1.4426004,20,0.5,0.9,-1,7,2.0,3
10,1600349033954570000,2,19.8409748,-7.126075500000002,0,0.0969802,-1.0428747,20,0.5,0.9,-1,7,2.0,3
11,1600349033954580000,3,22.3263185,-5.9586202,0,0.4398591,-0.752425,20,0.5,0.9,-1,7,2.0,3
12,1600349033954590000,4,20.7154136,-15.842398800000002,0,-0.12573430000000002,0.8189016,20,0.5,0.9,-1,7,2.0,3
13,1600349033954590000,5,21.038901,-17.4111883,0,0.2693992,1.108485,20,0.5,0.9,-1,7,2.0,3
14,1600349033954600000,6,20.612499,-5.810969,0,-0.030080400000000007,-0.8295869,20,0.5,0.9,-1,7,2.0,3
15,1600349033954600000,7,21.7872537,-14.3011986,0,-0.0613401,1.3073578,20,0.5,0.9,-1,7,2.0,3
16,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
17,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
18,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
This is the input file.
Please note that id always runs from 0 up to 7 and then repeats, and the time column increases in sequential steps (which implies that the previous row's time should be smaller than or equal to the current one's).
I would like to reorder the rows of the dataframe as shown below.
,time,id,X,Y,theta,Vx,Vy,ANGLE_FR,DANGER_RAD,RISK_RAD,TTC_DAN_LOW,TTC_DAN_UP,TTC_STOP,SIM
0,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.0,2
1,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.0,2
2,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.0,2
3,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,1
4,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,1
5,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,1
6,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,2
7,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,2
8,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,2
9,1600349033921610000,0,23.2643889,-7.140948599999999,0,0.020961,-1.1414197,20,0.5,0.9,-1,7,1.5,3
10,1600349033954560000,0,23.206789800000003,-7.5171016,0,-0.1727971,-1.1284589,20,0.5,0.9,-1,7,1.5,3
11,1600349033988110000,0,23.21602,-7.897527,0,0.027693000000000002,-1.1412761999999999,20,0.5,0.9,-1,7,1.5,3
This is the desired result.
Please note that I need to reorder the dataframe rows based on these columns: id, time, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP, SIM.
As you can see from the desired result, the dataframe needs to be reordered so that the time column goes from smallest to largest, and the same holds for the remaining columns: id, SIM, ANGLE_FR, DANGER_RAD, RISK_RAD, TTC_DAN_LOW, TTC_DAN_UP, TTC_STOP.
I tried to sort by several columns without success. Moreover, I tried to use groupby but I failed.
Could you help me solve the problem? Any suggestions are welcome.
P.S.
I have pasted the dataframe as text so it can be read easily with the read_clipboard function, to keep the example reproducible.
I am attaching a pic as well.
What did you try when sorting by several columns? This seems to work:
In [10]: df.sort_values(['id', 'time', 'ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM'])
Out[10]:
Unnamed: 0 time id X Y theta Vx Vy ANGLE_FR DANGER_RAD RISK_RAD TTC_DAN_LOW TTC_DAN_UP TTC_STOP SIM
0 0 1600349033921610000 0 23.2644 -7.1409 0 0.0210 -1.1414 20 0.5 0.9 -1 7 2 3
8 8 1600349033954560000 0 23.2068 -7.5171 0 -0.1728 -1.1285 20 0.5 0.9 -1 7 2 3
1 1 1600349033921620000 1 18.5371 -14.2249 0 -0.0114 1.4436 20 0.5 0.9 -1 7 2 3
9 9 1600349033954570000 1 18.5554 -13.7441 0 0.0548 1.4426 20 0.5 0.9 -1 7 2 3
2 2 1600349033921650000 2 19.8086 -6.7785 0 0.0373 -1.0558 20 0.5 0.9 -1 7 2 3
How about this:
groupby_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW', 'TTC_DAN_UP', 'TTC_STOP', 'SIM']
# sort by id and time inside each group of identical parameter values
df = df.groupby(groupby_cols, group_keys=False).apply(lambda g: g.sort_values(['id', 'time'])).reset_index(drop=True)
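If the exact layout of the desired result is what matters (parameter columns first, then id, then time), a plain sort can also get there without groupby; a sketch under that assumption:
order_cols = ['ANGLE_FR', 'DANGER_RAD', 'RISK_RAD', 'TTC_DAN_LOW',
              'TTC_DAN_UP', 'TTC_STOP', 'SIM', 'id', 'time']
df_ordered = df.sort_values(order_cols).reset_index(drop=True)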

Merging two dataframes based on index

I've been on this all night, and just can't figure it out, even though I know it should be simple. So, my sincerest apologies for the following incantation from a sleep-deprived fellow:
So, I have four fields, Employee ID, Name, Station and Shift (ID is non-null integer, the rest are strings or null).
I have about 10 dataframes, all indexed by ID, each containing only two columns: either (Name and Station) or (Name and Shift).
Now of course, I want to combine all of this into one dataframe, which has a unique row for each ID.
But I'm really frustrated by it at this point (especially because I can't find a way to directly check how many unique indices my final dataframe ends up with).
After messing around with some very ugly ways of using .merge(), I finally found .concat(). But it keeps making multiple rows per ID; when I check in Excel, the indices are like Table1/1234, Table2/1234, etc. One row has the shift, the other has the station, which is precisely what I'm trying to avoid.
How do I compile all my data into one dataframe, having exactly one row per ID? Possibly without using 9 different merge statements, as I have to scale up later.
If I understand your question correctly, this is what you want.
For example, with these 3 dataframes...
In [1]: df1
Out[1]:
0 1 2
0 3.588843 3.566220 6.518865
1 7.585399 4.269357 4.781765
2 9.242681 7.228869 5.680521
3 3.600121 3.931781 4.616634
4 9.830029 9.177663 9.842953
5 2.738782 3.767870 0.925619
6 0.084544 6.677092 1.983105
7 5.229042 4.729659 8.638492
8 8.575547 6.453765 6.055660
9 4.386650 5.547295 8.475186
In [2]: df2
Out[2]:
0 1
0 95.013170 90.382886
2 1.317641 29.600709
4 89.908139 21.391058
6 31.233153 3.902560
8 17.186079 94.768480
In [3]: df
Out[3]:
0 1 2
0 0.777689 0.357484 0.753773
1 0.271929 0.571058 0.229887
2 0.417618 0.310950 0.450400
3 0.682350 0.364849 0.933218
4 0.738438 0.086243 0.397642
5 0.237481 0.051303 0.083431
6 0.543061 0.644624 0.288698
7 0.118142 0.536156 0.098139
8 0.892830 0.080694 0.084702
9 0.073194 0.462129 0.015707
You can do
pd.concat([df,df1,df2], axis=1)
This produces
In [6]: pd.concat([df,df1,df2], axis=1)
Out[6]:
0 1 2 0 1 2 0 1
0 0.777689 0.357484 0.753773 3.588843 3.566220 6.518865 95.013170 90.382886
1 0.271929 0.571058 0.229887 7.585399 4.269357 4.781765 NaN NaN
2 0.417618 0.310950 0.450400 9.242681 7.228869 5.680521 1.317641 29.600709
3 0.682350 0.364849 0.933218 3.600121 3.931781 4.616634 NaN NaN
4 0.738438 0.086243 0.397642 9.830029 9.177663 9.842953 89.908139 21.391058
5 0.237481 0.051303 0.083431 2.738782 3.767870 0.925619 NaN NaN
6 0.543061 0.644624 0.288698 0.084544 6.677092 1.983105 31.233153 3.902560
7 0.118142 0.536156 0.098139 5.229042 4.729659 8.638492 NaN NaN
8 0.892830 0.080694 0.084702 8.575547 6.453765 6.055660 17.186079 94.768480
9 0.073194 0.462129 0.015707 4.386650 5.547295 8.475186 NaN NaN
For more details you might want to see the documentation for pd.concat.
Just a tip: putting simple illustrative data in your question always helps in getting an answer.
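A hedged addition, since the question is really about ending up with one row per ID: if the index of every frame is the Employee ID, an index-aligned combine avoids the Table1/1234-style index labels entirely. The frame names below are hypothetical placeholders.
from functools import reduce
import pandas as pd

# frames is the list of ~10 ID-indexed dataframes (names are made up for illustration)
frames = [stations_day, shifts_day, stations_night, shifts_night]
combined = reduce(lambda left, right: left.combine_first(right), frames)
# combined now has one row per ID, with Name plus whichever of Station/Shift was available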

How to compare if any value is similar to any other using numpy

I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their value, and as an example, I would consider the first value in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose, however that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists (to the nearest integer by default, or to a given number of decimals).
The next thing I'd suggest might be too extreme: concatenate the arrays, put them into a pandas dataframe, and drop_duplicates().
This might not be the solution you want.
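A short sketch of that suggestion, assuming a, b and c are the coordinate lists above and taking 2 decimals as an assumed tolerance:
import numpy as np
import pandas as pd

all_pairs = pd.DataFrame(np.concatenate([a, b, c]), columns=['x', 'y'])
unique_pairs = all_pairs.round(2).drop_duplicates()   # one row per rounded coordinate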
You might want to take a look at numpy.testing if you allow for AssertionError handling.
import numpy as np
from numpy import testing as ts
a = np.array((1.001,3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1) # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
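Not part of the answer above, just a hedged sketch in the direction the question asks about: np.isclose can be broadcast to compare every coordinate of one array against every coordinate of another without a Python loop. The atol below is an assumed tolerance.
import numpy as np

arr_a = np.asarray(a)   # shape (n, 2)
arr_b = np.asarray(b)   # shape (m, 2)
# (n, 1, 2) against (1, m, 2) -> (n, m) matrix of "both coordinates are close"
close = np.isclose(arr_a[:, None, :], arr_b[None, :, :], atol=1e-2).all(axis=2)
dup_pairs = np.argwhere(close)   # rows of (i, j) index pairs that are near-duplicates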
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array and tag each one with the array it belongs to. I am rounding the results to 2 decimal places; you may use whatever tolerance you want.
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
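# dfb and dfc are built the same way from b and c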
Then, by concatenating, using duplicated and sorting, you get an intuitive DataFrame that might fulfill your needs:
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate, and is found on index 0 of arr a. Line 1 also indicates a dupe coordinate, found on index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])

Divide columns in a DataFrame by a Series (result is only NaNs?)

I'm trying to do a similar thing to what is posted in this question: Python Pandas - n X m DataFrame multiplied by 1 X m Dataframe
I have an n x m DataFrame, with all non-zero float values, and an n x 1 column, with all non-zero float values, and I'm trying to divide each column in the n x m dataframe by the values in that column.
So I've got:
a b c
1 2 3
4 5 6
7 8 9
and
x
11
12
13
and I'm looking to return:
a b c
1/11 2/11 3/11
4/12 5/12 6/12
7/13 8/13 9/13
I've tried a multiplication operation first, to see if I can make it work, so I tried applying the two solutions given in the answer to the question above.
df_prod = pd.DataFrame({c:df[c]* df_1[c].ix[0] for c in df.columns})
This produces a "Key Error 0"
And using the other solution:
df.mul(df_1.iloc[0])
This just gives me all NaN, although in the right shape.
The NaNs are caused by misalignment of your indexes. To get around this, you will either need to divide by numpy arrays:
# <=0.23
df.values / df2[['x']].values # or df2.values assuming there's only 1 column
# 0.24+
df.to_numpy() / df2[['x']].to_numpy()
array([[0.09090909, 0.18181818, 0.27272727],
[0.33333333, 0.41666667, 0.5 ],
[0.53846154, 0.61538462, 0.69230769]])
Or perform an axis aligned division using .div:
df.div(df2['x'], axis=0)
a b c
0 0.090909 0.181818 0.272727
1 0.333333 0.416667 0.500000
2 0.538462 0.615385 0.692308
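A minimal reproduction sketch, building df and df2 as in the question:
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
df2 = pd.DataFrame({'x': [11, 12, 13]})
result = df.div(df2['x'], axis=0)   # each row of df divided by the matching value of x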

simplifying routine in python with numpy array or pandas

The initial problem is the following: I have an initial matrix with, let's say, 10 rows and 12 columns. For every row, I want to sum pairs of adjacent columns, so at the end I must still have 10 rows but only 6 columns. Currently, I am doing the following for loop in Python (using initial, which is a pandas DataFrame):
coarse = pd.DataFrame(index=initial.index)           # result frame, filled column by column
for i in range(0, 12, 2):
    coarse[i] = initial.iloc[:, i:i+2].sum(axis=1)    # sum each pair of adjacent columns
In fact, I am quite sure that something more efficient is possible. I am thinking of something like a list comprehension, but for a DataFrame or a numpy array. Does anybody have an idea?
Moreover, I would like to know whether it is better to manipulate large numpy arrays or pandas DataFrames.
Let's create a small sample dataframe to illustrate the solution:
np.random.seed(0)
df = pd.DataFrame(np.random.rand(6, 3))
>>> df
0 1 2
0 0.548814 0.715189 0.602763
1 0.544883 0.423655 0.645894
2 0.437587 0.891773 0.963663
3 0.383442 0.791725 0.528895
4 0.568045 0.925597 0.071036
5 0.087129 0.020218 0.832620
You can use slice notation to select every other row starting from the first row (::2) and starting from the second row (1::2). iloc is for integer indexing. You need to select the values at these locations, and add them together. The result is a numpy array that you could then convert back into a DataFrame if required.
>>> df.iloc[::2].values + df.iloc[1::2].values
array([[ 1.09369669, 1.13884417, 1.24865749],
[ 0.82102873, 1.68349804, 1.49255768],
[ 0.65517386, 0.94581504, 0.9036559 ]])
You use values to remove the indexing. This is what happens otherwise:
>>> df.iloc[::2] + df.iloc[1::2].values
0 1 2
0 1.093697 1.138844 1.248657
2 0.821029 1.683498 1.492558
4 0.655174 0.945815 0.903656
>>> df.iloc[::2].values + df.iloc[1::2]
0 1 2
1 1.093697 1.138844 1.248657
3 0.821029 1.683498 1.492558
5 0.655174 0.945815 0.903656
For a more general solution:
df = pd.DataFrame(np.random.rand(9, 3))
n = 3 # Number of consecutive rows to group.
df['group'] = [idx // n for idx in range(len(df.index))]
df.groupby('group').sum()
0 1 2
group
0 1.531284 2.030617 2.212320
1 1.038615 1.737540 1.432551
2 1.695590 1.971413 1.902501
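A hedged sketch for the original shape in the question (10 rows by 12 columns, summing adjacent column pairs), assuming initial is the asker's DataFrame: the same pairing idea can be written as a reshape instead of a loop.
import pandas as pd

# initial is assumed to be the 10 x 12 DataFrame from the question
coarse = initial.to_numpy().reshape(len(initial), -1, 2).sum(axis=2)
coarse_df = pd.DataFrame(coarse, index=initial.index)   # 10 rows, 6 columns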
