Find out which value is the duplicate in a Python pandas data structure - python

I have a CSV file. It looks something like this:
name,id,
AAA,1111,
BBB,2222,
CCC,3333,
DDD,2222,
I want to find out whether there is a duplicate in column id. If yes, find out the duplicate. In this case, the answer is 2222.
I have the code to find out whether a duplicate exists. Here it is:
import pandas as pd
csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)
df['id'].duplicated().any()
The problem is: how can one find out which value is the duplicate?
I am using Python 2.7 and pandas.

I think you can use duplicated (keep is omitted because keep='first' is the default), and tolist if you need the values as a plain list:
print df['id'][df.duplicated(subset=['id'])]
3 2222
Name: id, dtype: int64
print df['id'][df.duplicated(subset=['id'])].tolist()
[2222]
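A small variant (a sketch assuming the same df as above): value_counts can also surface which id values occur more than once, which reads directly when you only care about the values themselves.
import pandas as pd

df = pd.DataFrame({'name': ['AAA', 'BBB', 'CCC', 'DDD'],
                   'id': [1111, 2222, 3333, 2222]})

# count occurrences of each id and keep those that appear more than once
counts = df['id'].value_counts()
print(counts[counts > 1].index.tolist())  # [2222]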
You can check duplicated with the different keep options:
print df.duplicated(subset=['id'], keep='first')
0 False
1 False
2 False
3 True
dtype: bool
print df.duplicated(subset=['id'], keep='last')
0 False
1 True
2 False
3 False
dtype: bool
print df.duplicated(subset=['id'], keep=False)
0 False
1 True
2 False
3 True
dtype: bool
And if you need the duplicated rows themselves, filter the DataFrame by boolean indexing:
print df[df.duplicated(subset=['id'], keep='first')]
name id
3 DDD 2222
print df[df.duplicated(subset=['id'], keep='last')]
name id
1 BBB 2222
print df[df.duplicated(subset=['id'], keep=False)]
name id
1 BBB 2222
3 DDD 2222
Use drop_duplicates for dropping:
print df.drop_duplicates(subset=['id'], keep='first')
name id
0 AAA 1111
1 BBB 2222
2 CCC 3333
print df.drop_duplicates(subset=['id'], keep='last')
name id
0 AAA 1111
2 CCC 3333
3 DDD 2222
print df.drop_duplicates(subset=['id'], keep=False)
name id
0 AAA 1111
2 CCC 3333
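Putting it together for the original question, a minimal end-to-end sketch (assuming the CSV path from the question):
import pandas as pd

csv_file = 'C:/test.csv'
df = pd.read_csv(csv_file)

if df['id'].duplicated().any():
    # keep=False flags every occurrence; unique() reports each duplicated value once
    print(df.loc[df['id'].duplicated(keep=False), 'id'].unique())  # [2222]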

Related

VLOOKUP in Python Pandas without using MERGE

I have two DataFrames with one common column as the key, and I want to perform a VLOOKUP-like operation to fetch values from the first DataFrame corresponding to the keys in the second DataFrame.
DataFrame 1
key value
0 aaa 111
1 bbb 222
2 ccc 333
DataFrame 2
key value
0 bbb None
1 ccc 333
2 aaa None
3 aaa 111
Desired Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
I do not want to use merge, as both of my DataFrames might have NULL values in the key column, and since a pandas merge behaves differently from a SQL join, all such rows might get joined with each other.
I tried the approach below
DF2['value'] = np.where(DF2['key'].isnull(), DF1.loc[DF2['key'].equals(DF1['key'])]['value'], DF2['value'])
but I keep getting a KeyError: False error.
You can use:
df2['value'] = df2['value'].fillna(df2['key'].map(df1.set_index('key')['value']))
print(df2)
# Output
key value
0 bbb 222
1 ccc 333
2 aaa 111
3 aaa 111
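For reference, a self-contained sketch of the same map-based lookup, reconstructing the two DataFrames from the question (fillna only touches the missing values, so existing values in df2 are kept):
import pandas as pd

df1 = pd.DataFrame({'key': ['aaa', 'bbb', 'ccc'], 'value': [111, 222, 333]})
df2 = pd.DataFrame({'key': ['bbb', 'ccc', 'aaa', 'aaa'],
                    'value': [None, 333, None, 111]})

# build a key -> value lookup from df1 and fill only the missing values in df2
lookup = df1.set_index('key')['value']
df2['value'] = df2['value'].fillna(df2['key'].map(lookup))
print(df2)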

I am stuck writing the Python code for the problem below

I have the data frame below.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS']})
I want to check whether the value in the vin column is unique; if it is, the Result column should say 'YES'. If the vin value is not unique, then the 'module' column should be checked and 'YES' returned for the row whose module has the higher priority.
I want output like the below data frame.
df = pd.DataFrame({'vin':['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],'module':['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY','2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS'],
'Result':['YES','YES','NO','NO','NO','YES','YES','YES','NO','YES']})
I have tried the code below, and it gives the correct result, but it involves too many steps.
df['count'] = df.groupby('vin').vin.transform('count')
def Check1(df):
    if (df["count"] == 1):
        return 1
    elif ((df["count"] != 1) & (df["module"] == '1ST_PRIORITY')):
        return 1
    elif ((df["count"] != 1) & (df["module"] == '2ND_PRIORITY')):
        return 2
    elif ((df["count"] != 1) & (df["module"] == '3RD_PRIORITY')):
        return 3
    else:
        return 4
df['Sort'] = df.apply(Check1, axis=1)
df = df.sort_values(by=['vin', 'Sort'])
df.drop_duplicates(subset=['vin'], keep='first',inplace = True)
df
Here's the trick: you need a custom order so the priority modules sort first:
from pandas.api.types import CategoricalDtype

# create your custom order (highest priority first)
custom_order = CategoricalDtype(
    ['1ST_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY', 'ABS', 'HELLO', 'NORMAL'],
    ordered=True)

# then assign it to the desired column
df['module'] = df['module'].astype(custom_order)
df['Result'] = ((~df.sort_values('module', ascending=True).duplicated('vin'))
                .replace({True: 'YES', False: 'NO'}))
Result:
  vin module Result
0 aaa NORMAL YES
1 bbb 1ST_PRIORITY YES
2 bbb 2ND_PRIORITY NO
3 bbb HELLO NO
4 ccc 3RD_PRIORITY NO
5 ccc 2ND_PRIORITY YES
6 ddd 2ND_PRIORITY YES
7 eee 3RD_PRIORITY YES
8 eee HELLO NO
9 fff ABS YES
IIUC, you can use duplicated after sort_values:
df['Result'] = ((~df.sort_values('module').duplicated('vin'))
                .replace({True: 'YES', False: 'NO'}))
print(df)
# Output
vin module Result
0 aaa NORMAL YES
1 bbb 1ST_PRIORITY YES
2 bbb 2ND_PRIORITY NO
3 bbb HELLO NO
4 ccc 3RD_PRIORITY NO
5 ccc 2ND_PRIORITY YES
6 ddd 2ND_PRIORITY YES
7 eee 3RD_PRIORITY YES
8 eee HELLO NO
9 fff ABS YES
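For copy-paste testing, a self-contained sketch of the categorical approach above (it reconstructs the question's frame and assumes the same priority order):
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'vin': ['aaa','bbb','bbb','bbb','ccc','ccc','ddd','eee','eee','fff'],
                   'module': ['NORMAL','1ST_PRIORITY','2ND_PRIORITY','HELLO','3RD_PRIORITY',
                              '2ND_PRIORITY','2ND_PRIORITY','3RD_PRIORITY','HELLO','ABS']})

# priority order: 1ST < 2ND < 3RD < everything else
order = CategoricalDtype(['1ST_PRIORITY', '2ND_PRIORITY', '3RD_PRIORITY',
                          'ABS', 'HELLO', 'NORMAL'], ordered=True)
df['module'] = df['module'].astype(order)

# after sorting by priority, the first row of each vin wins
df['Result'] = ((~df.sort_values('module').duplicated('vin'))
                .replace({True: 'YES', False: 'NO'}))
print(df)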

Pandas DataFrame: add columns based on existing data

I have a dataframe with hundreds of columns and thousands of rows, but the basic structure is:
Index 0 1 2
0 AAA NaN AAA
1 NaN BBB NaN
2 NaN NaN CCC
3 DDD DDD DDD
I would like to add two new columns: one would be an id, equal to the first value in each row, and the second would be a count of the values in each row. It would look like this. To be clear, all values in a row will always be the same.
Index id count 0 1 2
0 AAA 2 AAA NaN AAA
1 BBB 1 NaN BBB NaN
2 CCC 1 NaN NaN CCC
3 DDD 3 DDD DDD DDD
Any help in figuring out a way to do this would be greatly appreciated. Thanks
This should work.
df['id'] = df.bfill(axis=1).iloc[:, 0].fillna('All NANs')
df['count'] = df.drop(columns=["id"]).notnull().sum(axis=1)
To maintain the order of columns:
df = df[list(df.columns[-2:]) + list(df.columns[:-2])]
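A runnable sketch of this answer on the sample frame from the question (column names 0, 1, 2 as shown; 'All NANs' is just a placeholder for rows with no values at all):
import numpy as np
import pandas as pd

df = pd.DataFrame([['AAA', np.nan, 'AAA'],
                   [np.nan, 'BBB', np.nan],
                   [np.nan, np.nan, 'CCC'],
                   ['DDD', 'DDD', 'DDD']])

# backfill along each row so column 0 holds the first non-null value of that row
df['id'] = df.bfill(axis=1).iloc[:, 0].fillna('All NANs')
# count the non-null values over the original columns only
df['count'] = df.drop(columns=['id']).notnull().sum(axis=1)
# move id and count to the front
df = df[list(df.columns[-2:]) + list(df.columns[:-2])]
print(df)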
Create the DataFrame:
import numpy as np
import pandas as pd

test_df = pd.DataFrame([['AAA',np.nan,'AAA'], [np.nan,'BBB',np.nan], [np.nan,np.nan, 'CCC'], ['DDD','DDD','DDD']])
Count the non-NaN elements in each row as count
test_df['count'] = test_df.notna().sum(axis=1)
Option-1: Select the first element in the row as id (regardless of NaN value)
test_df['id'] = test_df[0]
Option-2: Select the first non-NaN element as id for each row
test_df['id'] = test_df.apply(lambda x: x[x.first_valid_index()], axis=1)
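If a row could be entirely NaN, first_valid_index returns None and Option-2 raises a KeyError; a small guarded variant, restricted to the original data columns 0, 1, 2 and reusing the illustrative 'All NANs' placeholder from the first answer:
test_df['id'] = test_df[[0, 1, 2]].apply(
    lambda x: x[x.first_valid_index()] if x.first_valid_index() is not None else 'All NANs',
    axis=1)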

Get all rows from the original DataFrame that have a number or alphanumeric value in a particular column?

I have a DataFrame like this in df_init:
column1
0 hi all, i am fine
1 How are you ? 123 a45
2 123444234324!!! (This is also string)
3 sdsfds sdfsdf 233423
5 adsfd xcvbb cbcvbcvcbc
I want to get all the values from this dataframe that contain a number or an alphanumeric token.
I am expecting df_final to look like this:
column1
0 How are you ? 123 a45
1 123444234324!!! (This is also string)
2 sdsfds sdfsdf 233423
Use str.contains with \d to match a digit and filter by boolean indexing:
df = df[df.column1.str.contains(r'\d')]
print (df)
column1
1 How are you ? 123 a45
2 123444234324!!! (This is also string)
3 sdsfds sdfsdf 233423
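A self-contained sketch, reconstructing df_init from the question (na=False is an extra assumption worth adding if the column can contain missing values, since str.contains returns NaN for them otherwise):
import pandas as pd

df_init = pd.DataFrame({'column1': ['hi all, i am fine',
                                    'How are you ? 123 a45',
                                    '123444234324!!! (This is also string)',
                                    'sdsfds sdfsdf 233423',
                                    'adsfd xcvbb cbcvbcvcbc']},
                       index=[0, 1, 2, 3, 5])

# keep rows whose text contains at least one digit
df_final = df_init[df_init['column1'].str.contains(r'\d', na=False)].reset_index(drop=True)
print(df_final)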
EDIT:
print (df)
column1
0 hi all, i am fine d78
1 How are you ? 123 a45
2 123444234324!!!
3 sdsfds sdfsdf 233423
4 adsfd xcvbb cbcvbcvcbc
5 234324#
6 123! vc
df = df[df.column1.str.contains(r'^\d+[<!\-[.*?\]>#]+$')]
print (df)
column1
2 123444234324!!!
5 234324#

Pandas: merge dataframes without creating new columns

I've got 2 dataframes with identical columns:
df1 = pd.DataFrame([['Abe','1','True'],['Ben','2','True'],['Charlie','3','True']], columns=['Name','Number','Other'])
df2 = pd.DataFrame([['Derek','4','False'],['Ben','5','False'],['Erik','6','False']], columns=['Name','Number','Other'])
which give:
Name Number Other
0 Abe 1 True
1 Ben 2 True
2 Charlie 3 True
and
Name Number Other
0 Derek 4 False
1 Ben 5 False
2 Erik 6 False
I want an output dataframe that is an intersection of the two based on "Name":
output_df =
Name Number Other
0 Ben 2 True
1 Ben 5 False
I've tried a basic pandas merge, but the result is not what I want:
pd.merge(df1,df2,how='inner',on='Name') =
Name Number_x Other_x Number_y Other_y
0 Ben 2 True 5 False
These dataframes are quite large so I'd prefer to use some pandas magic to keep things quick.
You can use concat and then filter with isin and boolean indexing, getting the common names with numpy.intersect1d:
import numpy as np

val = np.intersect1d(df1.Name, df2.Name)
print (val)
['Ben']
df = pd.concat([df1,df2], ignore_index=True)
print (df[df.Name.isin(val)])
Name Number Other
1 Ben 2 True
4 Ben 5 False
Another possible way to get val is the intersection of two sets:
val = set(df1.Name).intersection(set(df2.Name))
print (val)
{'Ben'}
Then it is possible to reset the index to a monotonic one:
df = pd.concat([df1,df2])
print (df[df.Name.isin(val)].reset_index(drop=True))
Name Number Other
0 Ben 2 True
1 Ben 5 False
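An equivalent sketch that skips the intermediate val and filters each frame directly with isin before concatenating (same result on the example data):
import pandas as pd

df1 = pd.DataFrame([['Abe','1','True'],['Ben','2','True'],['Charlie','3','True']],
                   columns=['Name','Number','Other'])
df2 = pd.DataFrame([['Derek','4','False'],['Ben','5','False'],['Erik','6','False']],
                   columns=['Name','Number','Other'])

# keep only rows whose Name appears in the other frame, then stack them
out = pd.concat([df1[df1.Name.isin(df2.Name)],
                 df2[df2.Name.isin(df1.Name)]],
                ignore_index=True)
print(out)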
