Assume I have the following dataframe in Python:
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
B = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
This gives me the dataframe above. I want to remove the rows that have the same value for Col1 but different values for Col3. I have tried the drop_duplicates command with different subsets of columns, but it does not give what I want. I could write a for loop, but that is not efficient at all (especially since there might be many more columns than this).
C= B.drop_duplicates(['Col1','Col3'],keep = False)
Can anyone help if there is any command in Python can do this without using for loop?
The expected output would contain only the B row, since the A and C rows are removed because they share a Col1 value but have different Col3 values.
A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns = ['Col1','Col2','Col3'])
output = df.drop_duplicates('Col1', keep=False)
print(output)
Output:
  Col1  Col2  Col3
2    B     8     9
This can do the job,
grouped_df = df.groupby("Col1")
groups = [grouped_df.get_group(key) for key in grouped_df.groups.keys() if len(grouped_df.get_group(key)["Col3"].unique()) == 1]
new_df = pd.concat(groups).reset_index(drop = True)
Output -

  Col1  Col2  Col3
0    B     8     9
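Equivalently, a shorter sketch of the same idea using groupby().transform, which keeps only the rows whose Col1 group has a single distinct Col3 value (using the sample data from the question):

```python
import pandas as pd

A = [['A',2,3],['A',5,4],['B',8,9],['C',8,10],['C',9,20],['C',10,20]]
df = pd.DataFrame(A, columns=['Col1','Col2','Col3'])

# Keep only rows whose Col1 group has exactly one distinct Col3 value
new_df = df[df.groupby('Col1')['Col3'].transform('nunique') == 1].reset_index(drop=True)
print(new_df)
```

This avoids building the groups list explicitly and works for any number of columns.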
I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop, which appears to work; however, I know it's super clumsy (and takes forever). Essentially it loops once per row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1' : [1,2], 'col2' : [3,4], 'col3' : [5,6]})
>>> df
col1 col2 col3
0 1 3 5
1 2 4 6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
col1 col2 col3 col4
0 1 3 5 22
1 2 4 6 28
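For larger frames, note that the apply version still calls np.dot once per Python-level row. If all the columns are numeric, a single matrix-vector product over the whole frame is typically much faster; a sketch using the same toy data as above:

```python
import numpy as np
import pandas as pd

fixed_vector = np.array([1, 2, 3])
rating_df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})

# One matrix-vector product over all rows at once, no per-row Python calls
dot_prod = rating_df.to_numpy() @ fixed_vector
print(dot_prod)  # [22 28]
```

The result can be assigned directly, e.g. business_df['dot_prod'] = rating_df.to_numpy() @ fixed_vector, provided the two frames share the same row order.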
I have the following question and need help applying a for loop that iterates through a dataframe column's unique values. For example, I have the following df.
col1 col2 col3
aaa 10 1
bbb 15 2
aaa 12 1
bbb 16 3
ccc 20 3
ccc 50 1
ddd 18 2
I had to apply some manipulation to the dataset for each unique value of col3. Therefore, what I did is I sliced out the df with col3=1 by:
df1 = df[df['col3']==1]
#added all processing here in df1#
Now I need to do the same slicing for col3==2 ... col3==10, and I will be applying the same manipulation as I did in col3==1. For ex I have to do:
df2 = df[df['col3']==2]
#add the same processing here in df2#
df3 = df[df['col3']==3]
#add the same processing here in df3#
Then I will need to append them into a list and then combine them at the end.
I couldn't figure out how to run a for loop that will go through col3 column and look at the unique values so I don't have to create manually ten dfs.
I tried to groupby then apply the manipulation but it didn't work.
I appreciate help on this. Thanks
A simple solution: just iterate over the unique values of the column and use .loc to select the rows with that value, like this:
dfs = []
for i in df["col3"].unique():
    df_i = df.loc[df["col3"] == i, :]
    # apply your processing to df_i here
    dfs.append(df_i.copy())
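Since the question mentions having tried groupby, here is a sketch of that approach as well; process is a hypothetical placeholder for whatever per-slice manipulation you need:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['aaa','bbb','aaa','bbb','ccc','ccc','ddd'],
                   'col2': [10, 15, 12, 16, 20, 50, 18],
                   'col3': [1, 2, 1, 3, 3, 1, 2]})

def process(group):
    # placeholder: apply your per-slice manipulation here
    return group

# Process each col3 slice separately and combine the results into one frame
result = pd.concat(process(g) for _, g in df.groupby('col3'))
```

This handles any number of distinct col3 values without creating df1 ... df10 by hand.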
This should do it, but it will be slow for large dataframes. (Note that DataFrame.append was removed in pandas 2.0, so each row is appended with pd.concat instead.)
df1 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df2 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
df3 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
for _, v in df.iterrows():
    if v['col3'] == 1:
        # add your code
        df1 = pd.concat([df1, v.to_frame().T])
    elif v['col3'] == 2:
        # add your code
        df2 = pd.concat([df2, v.to_frame().T])
    elif v['col3'] == 3:
        # add your code
        df3 = pd.concat([df3, v.to_frame().T])
You can then use pd.concat() to rebuild them into one df.
Output of df1
col1 col2 col3
0 aaa 10 1
2 aaa 12 1
5 ccc 50 1
I have such a dataframe df with two columns:
Col1 Col2
'abc-def-ghi' 1
'abc-opq-rst' 2
I created a new column Col3 like this:
import re
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)
And got such a dataframe afterwards:
Col1 Col2 Col3
'abc-def-ghi' 1 [abc]
'abc-opq-rst' 2 [abc]
What I want to do now is create a new column Col4 that contains a 1 if Col3 contains 'abc' and a 0 otherwise.
I tried to do this with a function:
def f(row):
    if row['Col3'] == '[abc]':
        val = 1
    else:
        val = 0
    return val
And applied this to my pandas dataframe:
df['Col4'] = df.apply(f, axis=1)
But I only get 0, even in rows that contain 'abc'. I think there is something wrong with my if-statement.
How can I solve this?
Just do
df['Col4'] = df.Col3.astype(bool).astype(int)
This works because Col3 holds lists and a non-empty list is truthy. Your comparison failed because row['Col3'] is the list ['abc'], not the string '[abc]'.
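As an aside, if the list column is only an intermediate step, Series.str.contains can produce the 0/1 flag directly from Col1 without findall; a sketch (the third row is made up to show the 0 case):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['abc-def-ghi', 'abc-opq-rst', 'xyz-opq-rst'],
                   'Col2': [1, 2, 3]})

# Flag rows whose Col1 contains 'abc' (case-insensitive), no findall needed
df['Col4'] = df['Col1'].str.contains('abc', case=False).astype(int)
print(df['Col4'].tolist())  # [1, 1, 0]
```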
I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
                    'col2': ['B','N','O'],
                    # plus many more
                    })
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                    # plus many more
                    })
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements of col4 for each pair of col3 values that occurs in one row of df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get ['M','P','Q'] for A and ['J','P','M'] for B. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over the rows and take the intersection, but I was wondering whether it's possible to solve this via merging techniques or other faster methods. So far, I haven't been able to think of one.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
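If you'd rather avoid the intermediate tuple columns, here is a merge-only sketch of the same idea (assuming col4 values are not repeated within a single col3 key; otherwise drop duplicates in df2 first):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'M', 'C'], 'col2': ['B', 'N', 'O']})
df2 = pd.DataFrame({'col3': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'col4': ['M', 'P', 'Q', 'J', 'P', 'M']})

# Attach col4 candidates for col1 and for col2 separately, then inner-merge:
# only col4 values present for both keys survive the final merge
left = df1.merge(df2, left_on='col1', right_on='col3')
right = df1.merge(df2, left_on='col2', right_on='col3')
out = left.merge(right, on=['col1', 'col2', 'col4'])[['col1', 'col2', 'col4']]
print(out)
```

The inner merge on ['col1', 'col2', 'col4'] plays the role of the set intersection, one output row per shared col4 value.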
I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to reshape to a Series, then drop_duplicates with keep=False to remove all duplicated values, reset_index to remove the first level, and finally reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is at most one unique value per column.
Here is a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information - you can find unique values for a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
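A variant of the stack idea that yields empty lists instead of NaN for columns with no unique values, by counting each value once across the whole frame (a sketch using the example data from the question):

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'B', 'F']},
                  index=[1, 2, 3])

# Count every value across the whole frame, then keep per-column values seen once
counts = df.stack().value_counts()
unique_per_col = {col: [v for v in df[col] if counts[v] == 1] for col in df.columns}
print(unique_per_col)  # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}
```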