Finding duplicates in one column with non-dups in another - python

I am struggling with how to take a dataset and output the rows where one column has duplicate values but another column does not. If, say, columns 0 and 2 are exact duplicates, I don't care about that set of data; I only care about rows where a value in column 0 appears with more than one value in column 2. And if that is the case, I want all of the rows that match that column 0 value.
I am first using concat to narrow down the dataset to rows that have duplicates. My problem is now trying to get only the rows where column 2 is different.
My example dataset is:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF0723AFE8,device1
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFF862FAF74,device2
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFF2A8AA38,device3
"11111",,Prod_P,Device,"11111",Prod_P,,,,SEPFFFFD2C0A2C6,device4
"22334",,Prod_P,Device,"22334",Prod_P,,,,SEPFFFFCF87AB31,device5
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8
In this set, I want the result to be the last three rows, the "33333" ones, as they have more than one value in column 2. "11111" only ever matches Prod_P, so I don't care about it.
import pandas as pd
ignorelist = []
inputfile = "pandas-problem-data.txt"
data = pd.read_csv(inputfile)
data.columns = data.columns.str.replace(' ','_')
data = pd.concat(g for _, g in data.groupby("Pattern_or_URI") if len(g) > 1)
data = data.loc[(data["Pattern_Usage"]=="Device"), ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"]]
new_rows = []
tempdup = pd.DataFrame()
for i, row in data.iterrows():
    if row["Pattern_or_URI"] in ignorelist:
        continue
    ignorelist.append(row["Pattern_or_URI"])
    # testdup = pd.concat(h for _, h in (data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]).groupby("Partition") if len(h) > 1)
    # print(data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])])
    newrow = data.loc[(data["Pattern_or_URI"]==row["Pattern_or_URI"], ["Pattern_or_URI","Partition","Pattern_Usage","Device_Name","Device_Description"])]
If I uncomment the line where I try to use the same concat approach to find entries with more than one "Partition", I get ValueError: No objects to concatenate. I know it gets through the first iteration, because the print statement produces output when uncommented.
Is there an easier or better way of doing this? I'm new to pandas and keep thinking there is probably a way to find this that I haven't figured out.
Thank you.
Desired output:
Pattern or URI,Route Filter Clause,Partition,Pattern Usage,Owning Object,Owning Object Partition,Cluster ID,Catalog Name,Route String,Device Name,Device Description
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCF87AAEA,device6
"33333",,Dummy_P,Device,"33333",Dummy_P,,,,SEPFFFF18FF65A0,device7
"33333",,Prod_P,Device,"33333",Prod_P,,,,SEPFFFFCFCCAABB,device8

I think it's a bit misleading to say you're looking for duplicates. This is really a grouping problem.
You want to find groups of identical values in Pattern or URI that correspond with more than one unique value in your Partition Series.
transform + nunique
s = df.groupby('Pattern or URI')['Partition'].transform('nunique').gt(1)
df.loc[s]
Pattern or URI Route Filter Clause Partition Pattern Usage Owning Object Owning Object Partition Cluster ID Catalog Name Route String Device Name Device Description
5 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCF87AAEA device6
6 33333 NaN Dummy_P Device 33333 Dummy_P NaN NaN NaN SEPFFFF18FF65A0 device7
7 33333 NaN Prod_P Device 33333 Prod_P NaN NaN NaN SEPFFFFCFCCAABB device8
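Applied to the question's CSV, the same idea looks like this (a sketch, assuming the column-renaming step from the question so the column is Pattern_or_URI after spaces are replaced with underscores):
import pandas as pd

data = pd.read_csv("pandas-problem-data.txt")
data.columns = data.columns.str.replace(' ', '_')

# True for every row whose Pattern_or_URI value appears with more than one distinct Partition
mask = data.groupby('Pattern_or_URI')['Partition'].transform('nunique').gt(1)
print(data.loc[mask])
This prints only the "33333" rows, since that is the only pattern that appears with both Prod_P and Dummy_P.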

Using df.drop_duplicates() as follows:
df = pd.DataFrame({'a': [111, 111, 111, 222, 222, 333, 333, 333],
                   'b': ['a', 'a', 'a', 'b', 'b', 'a', 'b', 'c'],
                   'c': [12, 13, 14, 15, 61, 71, 81, 19]})
df
a b c
0 111 a 12
1 111 a 13
2 111 a 14
3 222 b 15
4 222 b 61
5 333 a 71
6 333 b 81
7 333 c 19
df1=df.drop_duplicates(['a','b'],keep=False)
df1
a b c
5 333 a 71
6 333 b 81
7 333 c 19
Note: instead of assigning the result to a new DataFrame, you can add inplace=True to apply it to the original.
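The in-place form would look like this (a sketch; it modifies df directly instead of returning a copy):
df.drop_duplicates(['a', 'b'], keep=False, inplace=True)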


Python(Pandas): Replacing specific NaN values conditional on information about other observations in the dataframe

I have a data frame where each observation has a UniqueID identifying the observation and an ObjectID identifying the object. There can be multiple observations for the same object, i.e. the ObjectID is not unique.
Some observations have a Null value for a variable which, however, only depends on the object. Thus, if an ObjectID appears multiple times and has the variable specified at least once, the Null values of the other observations should be replaced with this value.
I am using Python with the libraries Pandas (pd) and Numpy (np).
Example:
import numpy as np
import pandas as pd

sample_frame = {'UniqueID': [1, 2, 3, 4, 5, 6, 7], "PersonID": [3, 2, 2, 5, 5, 4, 4],
                "Name": ["Alice", np.nan, "Bob", "Joe", "Joe", np.nan, np.nan]}
sample_frame = pd.DataFrame(data=sample_frame)
sample_frame
Index  UniqueID  PersonID  Name
0      1         3         Alice
1      2         2         Bob
2      3         2         NaN
3      4         5         Joe
4      5         5         Joe
5      6         4         NaN
6      7         4         NaN
Thus, in the line with index 2, the NaN value for Name should be replaced with "Bob".
However, there is nothing to do for the observations below, since PersonID 4 never has a Name specified.
I found a solution, which works but seems somewhat complicated to me:
dup = sample_frame.loc[sample_frame.duplicated(subset=["PersonID"]), :]
dup_persId = dup["PersonID"].unique()
name_na = sample_frame[sample_frame["Name"].isna()]
name_na_persId = name_na["PersonID"].unique()
dup_name_av = dup[dup["Name"].isna() == False]
dup_name_av_persId = dup_name_av["PersonID"].unique()

for i in name_na_persId:
    if i in dup_name_av_persId:
        index = sample_frame.index[sample_frame["PersonID"] == i].tolist()
        for k in index:
            if sample_frame.at[k, "Name"] is not np.nan:
                name_temp = sample_frame.at[k, "Name"]
                continue
        for j in index:
            if sample_frame.at[j, "Name"] is np.nan:
                sample_frame.at[j, "Name"] = name_temp
            else:
                continue
Is there a simpler way to do this?
sample_frame['Name'].fillna(sample_frame.groupby('PersonID')['Name'].transform('first'))
Using groupby on PersonID and then calling .transform('first') on the Name column returns the first non-NaN value of Name within the group that each row belongs to.
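A short, self-contained version that writes the filled values back into the frame (a sketch, using the question's sample data as constructed in the code above):
import numpy as np
import pandas as pd

sample_frame = pd.DataFrame({'UniqueID': [1, 2, 3, 4, 5, 6, 7],
                             'PersonID': [3, 2, 2, 5, 5, 4, 4],
                             'Name': ['Alice', np.nan, 'Bob', 'Joe', 'Joe', np.nan, np.nan]})

# 'first' skips NaN, so each PersonID group's first non-missing Name fills that group's gaps;
# groups that never have a Name (PersonID 4) stay NaN.
sample_frame['Name'] = sample_frame['Name'].fillna(
    sample_frame.groupby('PersonID')['Name'].transform('first'))
print(sample_frame)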

Compare sums of multiple pandas dataframes in an effective way

I have multiple pandas dataframes (5) that share some common column names. They have different sizes. I need to sum at least 5 common columns (25 in total) from each dataframe and then compare the sums.
Data:
df_files = [df1, df2, df3, df4, df5]
df_files
out:
[ z name ... a b
0 10 DAD ... 4 4
1 10 DAD ... 5 4
2 10 DAD ... 3 6
3 10 DAD ... 9 2
4 10 DAD ... 11 1
... ... ... ... ... ...
7495 <NA> NaN ... 2 0
7496 <NA> NaN ... 5 3
7497 <NA> NaN ... 3 1
7498 <NA> NaN ... 2 0
7499 <NA> NaN ... 4 3
[7500 rows x 35 columns]  # The dataframes all look like this, but some vary in size.
What I need is to sum some specific common columns and then compare those sums to see if they match. If they all match, print an OK. If they don't, report which values disagree, e.g. "The value of 'column name' from df3 and df4 does not match the other values", and show the expected common value (the one the majority of dataframes agree on); the dataframes that do match don't need to be listed, just the expected common value. It may also happen that none of them match: in that case the expected value should be taken as the one repeated the most, and if no value repeats, print that the values need correction before proceeding and show the values that do not match.
I was beginning with something like this:
Example:
df = pd.concat([df1["a"].sum(), df2["a"].sum(), df3["a"].sum(), df4["a"].sum(), df5["a"].sum()])
df
out:
a a a a a
0 425 425 426 427 425
or maybe they can be compared as a list of integers.
I will appreciate your attention with this question. I hope I have been specific.
If I understand your problem correctly, you need to bring the names of the data frames and their respective column sums into one place to compare them. In that case I usually use a dictionary to keep the variable names, something like this:
df_files = {'df1': df1, 'df2': df2, 'df3': df3, 'df4': df4, 'df5': df5}
summary = pd.DataFrame()
for df in df_files.keys():
    cols = list(summary)
    summary = pd.concat([summary, df_files[df].sum()], axis=1)
    summary.columns = cols + [df]
summary = summary.dropna()
The summary will be a data frame with common column names as index, and data frame names as columns. If you have only 5 dfs with 5 common column names it will be an easy job to observe the results. Here is a sample result I ran for 3 dfs:
df1 df2 df3
a 6.0 10.0 6.0
b 15.0 14.0 15.0
But if the numbers grow, you can use the 'mode' of each row to find the most frequent result, and compare the rows (maybe divide all values and look for non-1 results)
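Building on that idea, one way to flag mismatches automatically could look roughly like this (a sketch; it assumes the summary frame and the df1...df5 column names produced by the loop above):
# Most frequent sum per row is taken as the expected value
expected = summary.mode(axis=1)[0]
# True wherever a dataframe's sum disagrees with the expected value
mismatch = summary.ne(expected, axis=0)

for col_name, flags in mismatch.iterrows():
    if flags.any():
        bad = flags[flags].index.tolist()
        print(f"{col_name}: expected {expected[col_name]}, "
              f"but {bad} have {summary.loc[col_name, bad].tolist()}")
    else:
        print(f"{col_name}: OK")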

Populating a column based on values in another column - pandas

After merging two data frames I have some gaps in my data frame that can be filled in based on neighboring rows (I have many more columns and rows in the DF, but I'm focusing on these three columns):
Example DF:
Unique ID | Type | Location
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land
Ultimately I want the three columns to be filled in:
Unique ID | Type | Location
A 1 Land
A 1 Land
B 2 sub
B 2 sub
C 3 Land
C 3 Land
I've tried:
df.loc[df.Type.isnull(), 'Type'] = df.loc[df.Type.isnull(), 'Unique ID'].map(df.loc[df.Type.notnull()].set_index('Unique ID')['Type'])
but it throws:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
What am I missing here? - Thanks
Your example indicates that you want to forward-fill. You can do it like this (complete code):
import pandas as pd
from io import StringIO
clientdata = '''ID N T
A 1 Land
A NaN NaN
B 2 sub
B NaN NaN
C 3 Land
C 3 Land'''
df = pd.read_csv(StringIO(clientdata), sep='\s+')
df["N"] = df["N"].fillna(method="ffill")
df["T"] = df["T"].fillna(method="ffill")
print(df)
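If rows belonging to the same ID are not guaranteed to be adjacent, a grouped forward-fill is a safer variant (a sketch, reusing the df built above):
# Forward-fill N and T only within each ID group, so values never leak across IDs
df[["N", "T"]] = df.groupby("ID")[["N", "T"]].ffill()
print(df)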
The best solution is probably to just get rid of the NaN rows instead of overwriting them. Pandas has a simple command for that:
df.dropna()
Here's the documentation for it: pandas.DataFrame.dropna

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know in advance how many atc_1 ... atc_7 ... atc_100 columns there will need to be. I just need to gather all associated atc_codes into one row for each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
Use cumcount first to number the values within each group; these numbers become the columns created by pivot. Then add missing columns with reindex_axis and change the column names with add_prefix. Finally, reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = (pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
        .reindex_axis(range(1, 8), 1)
        .add_prefix('atc_')
        .reset_index())
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN
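Note that reindex_axis has since been removed from pandas; on newer versions an equivalent could be sketched roughly like this (assuming the input frame has visit_id and atc_code columns, as in the question):
g = df.groupby('visit_id').cumcount() + 1
out = (df.set_index(['visit_id', g])['atc_code']
         .unstack()                      # cumcount values become columns 1..n
         .reindex(columns=range(1, 8))   # pad out to 7 columns
         .add_prefix('atc_')
         .reset_index())
print(out)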

NaNs after merging two dataframes

I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']] # var1 variable is currently labeled 0
b.columns = ['Item_id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name'] = np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is related to the code at all.
Thanks in advance!
The problem is that you need to convert column id in df2 to int, because the output of string functions is always string, even when it operates on numbers.
df2.id = df2.id.astype(int)
Another solution is convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match: str values don't match int values.
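Putting the whole flow together with the dtype fix (a sketch; it rebuilds small versions of df1 and df2 from the question, using user_id as the column name as in the poster's code, and uses explode, available in pandas 0.25+, instead of the tolist/stack reshape):
import pandas as pd

df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
df2 = pd.DataFrame({'user_id': [222087, 1343649, 404134, 1110200],
                    'id': ['27,26', '6,47,17',
                           '18,12,23,22,27,43,38,20,35,1',
                           '9,23,2,20,26,47,37']})

b = (df2.assign(id=df2['id'].str.split(','))   # list of ids per user
        .explode('id'))                        # one row per (user_id, id)
b['id'] = b['id'].astype(int)                  # str -> int so the keys match df1.id

merged = b.merge(df1, on='id', how='left')
print(merged.head())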
