Pandas find Duplicates in cross values - python

I have a dataframe and want to eliminate duplicate rows that have the same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
a b c d
1 x y e f
2 e f x y
3 w v s t
Rows [1] and [2] have the values {x,y,e,f}, but they are arranged in a cross, i.e. if you exchanged columns c,d with a,b in row [2] you would have a duplicate.
I want to drop these lines and keep only one, to get the final output:
df_new
Out[20]:
a b c d
1 x y e f
3 w v s t
How can I efficiently achieve that?

I think you need to filter by boolean indexing with a mask created by numpy.sort and duplicated; to invert it, use ~:
df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
a b c d
1 x y e f
3 w v s t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
['e' 'f' 'x' 'y']
['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
0 1 2 3
1 e f x y
2 e f x y
3 s t v w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 False
2 True
3 False
dtype: bool
print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 True
2 False
3 True
dtype: bool

Here's another solution, with a for loop:
data = df.to_numpy()  # .as_matrix() is deprecated; .to_numpy() (or .values) gives the same array
new = []
for row in data:
    if not new:
        new.append(row)
    else:
        # keep the row only if none of its values appear in any row kept so far
        if not any(c in nrow for nrow in new for c in row):
            new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
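Note the any() test above is broader than the question strictly requires: it drops a row if it shares even a single value with an already-kept row. If only exact cross-duplicates should go, a stricter loop could compare the full set of row values instead; a sketch, assuming the values within each row are distinct:
new = []
seen = set()
for row in df.to_numpy():
    key = frozenset(row)  # order-insensitive signature of the row's values
    if key not in seen:
        seen.add(key)
        new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)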

Use sorting (np.sort) and then get the duplicates (.duplicated()) out of it.
Later, use those duplicates to drop (df.drop) the required index:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
index_to_drop = [ind for ind in range(len(df_duplicated)) if df_duplicated[ind]]  # integer positions of the duplicated rows
df.drop(df.index[df_duplicated])
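df.drop returns a new frame rather than modifying df in place (unless inplace=True is passed), so with the sample data the result of the last line should be:
df_new = df.drop(df.index[df_duplicated])
print(df_new)
#    a  b  c  d
# 1  x  y  e  f
# 3  w  v  s  t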

Related

Use dataframe column containing "column name strings", to return values from dataframe based on column name and index without using .apply()

I have a dataframe as follows:
df=pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'],size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)
.apply() is slow over a large dataframe because it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also pick each column name (or give a list of possible names), apply it as a mask to filter your dataframe, then pick the values from the desired column and assign them to all rows matching the mask. Then repeat this for the other columns.
for column_name in df:  # or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
Rows that will not match any mask will have NaN values.
PS. According to my test with 10,000,000 random rows, the method with .apply() takes 2min 24s to finish, while my method takes only 4.3s.
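A rough way to reproduce that kind of comparison yourself (absolute times will vary by machine and row count; this is only a sketch):
import time
import numpy as np
import pandas as pd

n = 1_000_000  # smaller than the 10,000,000 rows above, just to keep the run short
df = pd.DataFrame(np.random.random((n, 3)), columns=['A', 'B', 'C'])
df['Col_name'] = np.random.choice(['A', 'B', 'C'], size=n)

start = time.perf_counter()
for column_name in ['A', 'B', 'C']:
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
print('mask loop :', time.perf_counter() - start)

start = time.perf_counter()
df['output_apply'] = df.apply(lambda x: x[x['Col_name']], axis=1)
print('.apply()  :', time.perf_counter() - start)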
Use melt to flatten your dataframe and keep the rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
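The ignore_index=False is what makes the single-line assignment work: melt keeps the original row labels, so the filtered value Series aligns back to df by index. A quick check (not part of the solution itself):
melted = df.melt('Col_name', ignore_index=False)
print(melted.index)  # the original labels 0..9, repeated once per melted column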
Update
Alternative with set_index and stack for @Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row, just to know where it breaks when it breaks. Here are two loops, just to satisfy myself. The output column is created ahead of time with NaN values because the loops need it to exist.
# each row, by integer position (chained assignment like df['output'][i] = ... would warn, so use .loc)
for i in range(len(df)):
    df.loc[i, 'output'] = df.loc[i, df.loc[i, 'Col_name']]
# each row, but picking the column by name
for i, col in zip(df.index, df['Col_name']):
    df.loc[i, 'output'] = df.loc[i, col]
Here are some "non-loop" ways to do so.
df["output"] = df.lookup(df.index, df.Col_name)
df['output'] = np.where(np.isnan(df['output']), df[df['Col_name']], np.nan)
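On pandas 2.x, where DataFrame.lookup no longer exists, the replacement pattern suggested in the pandas deprecation notes is plain NumPy indexing via factorize; a minimal sketch:
import numpy as np
import pandas as pd

# codes for each row's target column, plus the array of distinct column names
idx, cols = pd.factorize(df['Col_name'])
# reorder the value columns to match cols, then pick one value per row
df['output'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]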


Pandas DataFrame - list columns with lowest distinct values

I have the following code to find the columns in a data frame with the lowest number of distinct values and list them.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df)
unique_counts = df.nunique()
lowest_distinct = 100
#
#Find the lowest distinct count across all columns
#
for column_name, distinct_count in unique_counts.items():  # .iteritems() was removed in pandas 2.0
    if distinct_count < lowest_distinct:
        lowest_distinct = distinct_count
lowest_distinct_columns = []
#
#Collect the columns having that count
#
for column_name, distinct_count in unique_counts.items():
    if distinct_count == lowest_distinct:
        lowest_distinct_columns.append(column_name)
#
#Get the columns and values returned as a data frame
#
melted_df = df.melt(value_vars=lowest_distinct_columns,var_name='column', value_name='value')
print(melted_df)
It feels a bit clunky so I'm wondering if there is a better way to do it? Ultimately I'm trying to get a list of the columns and values that have the lowest number of distinct values.
Any thoughts or tips appreciated.
Cheers
David
Does this do what you want?
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
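With the sample df above, this should give:
print(lowest_distinct_columns)   # ['C', 'D']
print(result)
#    C  D
# 0  1  3
# 1  2  4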
Use
In [114]: df[unique_counts[unique_counts == unique_counts.min()].index].melt(
     ...:     var_name='column', value_name='value')
Out[114]:
column value
0 C 1
1 C 1
2 C 2
3 D 3
4 D 3
5 D 4
For older versions of pandas (< v0.20, which lack DataFrame.nunique), consider apply to return a Series:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
Thank you for the responses. The 3 solutions to the first part of the problem work equally well and the 2 responses to the second part of the problem also work very well.
I'll need to use them in practice to see if there is any material difference in performance or behaviour but to summarise the complete solutions:
@Parfait's solution:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
and @Priker's:
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
Use
df1 = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df1)
   A  B  C  D
0  1  2  1  3
1  2  3  1  3
2  3  4  2  4
unique_counts = df1.nunique()
unique_counts[unique_counts==unique_counts.min()]
C    2
D    2
dtype: int64
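To get from there back to the column/value frame the original question ultimately wants, the filtered index can feed straight into melt (a sketch building on the snippet above):
lowest_cols = unique_counts[unique_counts == unique_counts.min()].index.tolist()
melted_df = df1.melt(value_vars=lowest_cols, var_name='column', value_name='value')
print(melted_df)
#   column  value
# 0      C      1
# 1      C      1
# 2      C      2
# 3      D      3
# 4      D      3
# 5      D      4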

Apply function for two dataframes in pandas

I have two dataframes.
df0
a b
c 0.3 0.6
d 0.4 NaN
df1
a b
c 3 2
d 0 4
I have a custom function:
def concat(d0, d1):
    if d0 is not None and d1 is not None:
        return '%s,%s' % (d0, d1)
    return None
Result I expect:
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
How can I apply the function to those two dataframes?
Here is a solution.
The idea is first to reduce your dataframes to a flat list of values. This lets you loop over the values of the two dataframes with zip and apply your function.
Finally, you go back to the original shape using numpy reshape:
new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
result = pd.DataFrame(np.reshape(new_vals, (2, 2)), index=['c', 'd'], columns=['a', 'b'])
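To avoid hardcoding the (2, 2) shape and the labels, the same idea can reuse df0's shape, index and columns (a sketch reusing the question's concat, df0 and df1):
import numpy as np
import pandas as pd

new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
result = pd.DataFrame(np.reshape(new_vals, df0.shape), index=df0.index, columns=df0.columns)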
If it's for your specific application, you can do:
#Concatenate the two as String
df = df0.astype(str) + "," +df1.astype(str)
#Remove the nan
df = df.applymap(lambda x: x if 'nan' not in x else np.nan)
You'll get better performance than using apply.
output
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
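On pandas 2.1+, DataFrame.applymap is deprecated in favour of DataFrame.map, so the clean-up line can also be written as:
df = df.map(lambda x: x if 'nan' not in x else np.nan)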
Use add with applymap and mask:
df = df0.astype(str).add(',').add(df1.astype(str))
df = df.mask(df.applymap(lambda x: 'nan' in x))
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
Another solution is to replace NaN at the end using conditions with mask; by default, True values are replaced with NaN:
df = df0.astype(str).add(',').add(df1.astype(str))
m = df0.isnull() | df1.isnull()
print (m)
a b
c False False
d False True
df = df.mask(m)
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN

What is the dataset returned from dataframe.stack()?

I am trying to work on a dataframe on which I have used the .stack() function:
df = pd.read_csv('test.csv', usecols =['firstround','secondround','thirdround','fourthround','fifthround'])
sortedArray = df.stack().value_counts()
sortedArray = sortedArray.sort_index()
I need to retrieve the first index column values and the second index column values from the sortedArray, meaning I need the x and y values from the sorted array.
Any idea how I can do it?
I think you need Series.iloc, because the output from stack (followed by value_counts) is a Series:
x = sortedArray.iloc[0]
y = sortedArray.iloc[1]
Sample:
df = pd.DataFrame({'A':['a','a','s'],
                   'B':['a','s','a'],
                   'C':['s','d','a']})
print (df)
A B C
0 a a s
1 a s d
2 s a a
sortedArray = df.stack().value_counts()
print (sortedArray)
a 5
s 3
d 1
dtype: int64
sortedArray = sortedArray.sort_index()
print (sortedArray)
a 5
d 1
s 3
dtype: int64
x = sortedArray.iloc[0]
y = sortedArray.iloc[1]
print (x)
5
print (y)
1
print (sortedArray.tolist())
[5, 1, 3]
print (sortedArray.index.tolist())
['a', 'd', 's']
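For completeness, df.stack() itself returns a Series whose index is a MultiIndex of (original row label, column name); value_counts() then collapses that into a plain Series of counts:
stacked = df.stack()
print(type(stacked))          # <class 'pandas.core.series.Series'>
print(stacked.index.nlevels)  # 2: (row label, column name)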
