Python Pandas mark all but one specific duplicate row - python

I have a Pandas dataframe that has already been reduced to duplicates only and sorted.
Duplicates are identified by column "HASH" and then sorted by "HASH" and "SIZE"
df_out['is_duplicated'] = df.duplicated(['HASH'], keep=False)  # keep=False: mark all duplicates as True
df_out = df_out.loc[df_out['is_duplicated']]  # keep only duplicate records
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False])  # sort by "HASH" ascending, then by "SIZE" descending
Result:
HASH SIZE is_duplicated
1 5 TRUE
1 3 TRUE
1 2 TRUE
9 7 TRUE
9 5 TRUE
I would like to add 2 more columns.
First column would identify rows of data with the same "HASH" by an ID.
First set of rows with the same "HASH" would be 1, next set would be 2, etc...
Second column would mark a single row in each group that has the largest "SIZE":
HASH SIZE ID KEEP
1 5 1 TRUE
1 3 1 FALSE
1 2 1 FALSE
9 7 2 TRUE
9 5 2 FALSE

Perhaps use dicts and list comprehension:
import pandas as pd
df = pd.DataFrame([[1,1,1,9,9],[5,3,2,7,5]]).T
df.columns = ['HASH','SIZE']
# map each unique HASH to a group ID, in order of first appearance
hash_dict = dict(zip(df.HASH.unique(), range(1, df.HASH.nunique()+1)))
df['ID'] = [hash_dict[k] for k in df.HASH]
# mark rows whose SIZE equals the maximum SIZE of their HASH group
max_dict = dict(df.groupby('HASH')['SIZE'].max())
df['KEEP'] = [b == max_dict[a] for a, b in zip(df.HASH, df.SIZE)]
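For comparison, a groupby-based sketch of the same idea (not from the original answer; it reuses the df built above and, unlike the max dict, flags only one row per group even when sizes tie):
df['ID'] = df.groupby('HASH', sort=False).ngroup() + 1  # consecutive group number per HASH
df['KEEP'] = False
df.loc[df.groupby('HASH')['SIZE'].idxmax(), 'KEEP'] = True  # idxmax picks a single max row per group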

Related

Dropping rows that have only one non-zero value from a pandas dataframe in python

I have a pandas dataframe as shown below:
[image: pandas DataFrame]
I want to drop the rows that have only one non-zero value. What's the most efficient way to do this?
Try boolean indexing:
import numpy as np
import pandas as pd
# sample data
df = pd.DataFrame(np.zeros((10, 10)), columns=list('abcdefghij'))
df.iloc[2:5, 3] = 1
df.iloc[4:5, 4] = 1
# boolean indexing: keep rows whose count of non-zero values is not exactly 1
df[df.ne(0).sum(axis=1).ne(1)]
Only rows 2 and 3 are removed: each of them has exactly one non-zero value, while row 4 has two non-zero values and every other row has none. The counts per row are:
df.ne(0).sum(axis=1)
0 0
1 0
2 1
3 1
4 2
5 0
6 0
7 0
8 0
9 0
Not sure if this is the most efficient, but I'll try:
df[[(row != 0).sum() != 1 for _, row in df.iterrows()]]
Two passes per row here: one to check != 0 and one more to sum the boolean values up (it could break early once a second non-zero value is found).
Otherwise, you can define a custom function that checks a row without looping over it twice:
def check(row):
    already_has_one = False
    for value in row:
        if value != 0:
            if already_has_one:
                return False
            already_has_one = True
    return already_has_one
then:
df[[not check(row) for _, row in df.iterrows()]]
which can stop at the second non-zero value instead of always scanning the whole row.
Or like this (note that this keeps only rows with more than one non-zero value, so unlike the first approach it also drops all-zero rows):
df[(df.applymap(bool).sum(1) > 1).values]

Compare columns in a dictionary of dataframes

I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, column names, number of rows and row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with those of the last dataframe in the list to see whether they are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt checks whether the whole column is the same, but I would instead like it to go through each dataframe row by row.
I think you are making things too complicated here. You can write:
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This will produce as result:
>>> df_A
A True
0 1 True
1 0 False
2 1 False
3 0 True
>>> df_B
A True
0 1 True
1 1 True
2 0 True
3 0 True
So each df_i gets an extra column named 'True' (you may want to use a different name) that indicates, for each row, whether the value is the same as the one in series_last.
In case dfs contains something other than strings, we can first convert these to strings:
series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
Create a list:
l = [Di_1[i] for i in dfs]
Then, using isin(), you can compare the first and last df:
l[0].isin(l[-1]).astype(int)
A
0 1
1 0
2 0
3 1
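If the same 0/1 result is wanted for every dataframe in dfs rather than only the first, a minimal sketch combining the two answers (the 'Match' column name here is just an example):
series_last = Di_1[dfs[-1]]['A']
for name in dfs:
    Di_1[name]['Match'] = Di_1[name]['A'].eq(series_last).astype(int)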

Pandas: Trying to edit data in a row for a list of dataframes

I have a list of 3 DataFrames x, where each DataFrame has 3 columns. It looks like
1 2 T/F
4 7 False
4 11 True
4 20 False
4 25 True
4 40 False
What I want to do is set the value of each row in column 'T/F' to False for each DataFrame in list x
I attempted to do this with the following code
rang = list(range(len(x)))  # rang=[0,1,2]
for i in rang:
    x[i].iloc[:len(x), 'T/F'] = False
The code ran, but it didn't appear to work.
Much simpler. Just iterate over the actual dataframes and update the columns with:
for df in x:
    df['T/F'] = False
Also note that DataFrame.iloc is purely integer-location based indexing. If you want to index using the column names, use .loc.
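For example, the question's loop rewritten with label-based indexing (a minimal sketch, assuming x is the list of DataFrames from the question):
for df in x:
    df.loc[:, 'T/F'] = False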

I want to match two dataframe columns in python

I have two dataframes, df1 (35k records) and df2 (100k records). In df1['col1'] and df2['col3'] I have unique IDs. I want to match df1['col1'] with df2['col3']. If they match, I want to add one more column, df1['Match'], with the value True, and if they don't match, with the value False. I want to map these True and False values against the matching and non-matching records only.
I am using the .isin() function; I am getting the correct match and non-match counts, but I am not able to map them correctly.
Match = df1['col1'].isin(df2['col3'])
df1['match'] = Match
I have also used the merge function by passing the parameter how='right', but did not get the results.
You can simply do as follows:
df1['Match'] = df1['col1'].isin(df2['col3'])
For instance:
import pandas as pd
data1 = [1,2,3,4,5]
data2 = [2,3,5]
df1 = pd.DataFrame(data1, columns=['a'])
df2 = pd.DataFrame(data2,columns=['c'])
print (df1)
print (df2)
df1['Match'] = df1['a'].isin(df2['c']) # if matches it returns True else False
print (df1)
Output:
a
0 1
1 2
2 3
3 4
4 5
c
0 2
1 3
2 5
a Match
0 1 False
1 2 True
2 3 True
3 4 False
4 5 True
Use df.loc indexing:
df1['Match'] = False
df1.loc[df1['col1'].isin(df2['col3']), 'Match'] = True
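Since the question also mentions merge: a hedged sketch of a merge-based variant using indicator=True, assuming the IDs in df2['col3'] are unique as stated:
merged = df1.merge(df2[['col3']], how='left', left_on='col1', right_on='col3', indicator=True)
df1['Match'] = merged['_merge'].eq('both').to_numpy()  # True where col1 found a match in col3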

Pandas dataframe self-dependency in data to fill a column

I have a dataframe where the value of "relation" is determined from the codeid. Leather has "codeid"=11, which has already appeared against bag, so in "relation" we put the value bag.
The same happens for shoes.
To do: fill the value of "relation" by putting a check on codeid within the dataframe. Any help would be appreciated.
Edit: the same codeid, e.g. 11, can appear more than twice, but "relation" can only have the value bag, because bag is the first one to have codeid=11. I have updated the picture as well.
If you want the first occurrence's value filled in only for the later duplicated rows, use transform with 'first' and then set the remaining values to NaN using loc with duplicated:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4,5],
                   'name': list('brslp'),
                   'codeid': [11,12,13,11,13]})
df['relation'] = df.groupby('codeid')['name'].transform('first')
print (df)
id name codeid relation
0 1 b 11 b
1 2 r 12 r
2 3 s 13 s
3 4 l 11 b
4 5 p 13 s
#mark duplicated codeid values, except the last occurrence in each group
print (df['codeid'].duplicated(keep='last'))
0 True
1 False
2 True
3 False
4 False
Name: codeid, dtype: bool
#get all duplicated values of codeid, inverting the boolean mask with ~ to select the unique rows
print (~df['codeid'].duplicated(keep=False))
0 False
1 True
2 False
3 False
4 False
Name: codeid, dtype: bool
#chain the boolean masks together
print (df['codeid'].duplicated(keep='last') | ~df['codeid'].duplicated(keep=False))
0 True
1 True
2 True
3 False
4 False
Name: codeid, dtype: bool
#where the combined mask is True, replace 'relation' with NaN
df.loc[df['codeid'].duplicated(keep='last') |
       ~df['codeid'].duplicated(keep=False), 'relation'] = np.nan
print (df)
id name codeid relation
0 1 b 11 NaN
1 2 r 12 NaN
2 3 s 13 NaN
3 4 l 11 b
4 5 p 13 s
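An equivalent, more compact sketch of the same idea (not taken from the original answer) uses where to blank out the first occurrences directly:
first_name = df.groupby('codeid')['name'].transform('first')
df['relation'] = first_name.where(df['codeid'].duplicated(keep='first'))
duplicated(keep='first') is False for the first occurrence of each codeid, so those rows become NaN and only the later duplicates keep the group's first name.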
I think you want to do something like this:
import pandas as pd

df = pd.DataFrame([['bag', 11, 'null'],
                   ['shoes', 12, 'null'],
                   ['shopper', 13, 'null'],
                   ['leather', 11, 'bag'],
                   ['plastic', 13, 'shoes']], columns=['name', 'codeid', 'relation'])

def codeid_analysis(rows):
    if rows['codeid'] == 11:
        rows['relation'] = 'bag'
    elif rows['codeid'] == 12:
        rows['relation'] = 'shirt'  # for example; put what you want here
    elif rows['codeid'] == 13:
        rows['relation'] = 'pants'  # for example; put what you want here
    return rows

result = df.apply(codeid_analysis, axis=1)
print(result)
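If the codeid-to-relation pairs are known up front, the same idea can be written more compactly with a dictionary and map (the mapping values below are placeholders, as in the apply version above):
code_map = {11: 'bag', 12: 'shirt', 13: 'pants'}  # placeholder mapping; adjust as needed
df['relation'] = df['codeid'].map(code_map)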
It is not the optimal solution, since it is costly in memory, but here is my try. df1 is created to hold the rows with null values in the relation column, since it seems that the nulls are the first occurrences. After some cleaning, the two dataframes are merged into one.
import pandas as pd

df = pd.DataFrame([['bag', 11, 'null'],
                   ['shoes', 12, 'null'],
                   ['shopper', 13, 'null'],
                   ['leather', 11, 'bag'],
                   ['plastic', 13, 'shopper'],
                   ['something', 13, '']], columns=['name', 'codeid', 'relation'])

df1 = df.loc[df['relation'] == 'null'].copy()  # create a df with only null values in relation
df1.drop_duplicates(subset=['name'], inplace=True)  # drop duplicates and retain the first entry
df1 = df1.drop('relation', axis=1)  # drop the unneeded column
final_df = pd.merge(df, df1, on='codeid')  # merge the two dfs on the codeid column
