Grouping and comparing groups using pandas - python

I have data that looks like:
Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. So, the approach I wanted to use was:
First group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (**Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.**) I have tried using the groupby feature of pandas; it does the grouping part well, but then I do not know how to compare the groups.
Compare each of the groups/sub-data frames.
One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing the arrays/vectors using a comparison metric (Manhattan distance, cosine similarity, etc.).
Any help is appreciated, I am very new to Python. Thanks in advance!

You could do something like the following:
import pandas as pd
input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']
duplicate_entries = {}
for group in input_file.groupby('Identifier'):
    # transforming to tuples so that they can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])
Then the values of duplicate_entries will hold the lists of duplicate Identifiers:
duplicate_entries.values()
> [[1000, 1002], [1001]]
EDIT:
To get only the entries that have duplicates, you could have something like:
all_dup = [dup for dup in duplicate_entries.values() if len(dup) > 1]
Explaining the indices (sorry I didn't explain it before): iterating through the df.groupby result gives tuples where the first entry is the key of the group (in this case the 'Identifier') and the second is the sub-DataFrame holding that group's rows. So to get the lines that contain the possibly duplicated entries we use group[1], and the 'Identifier' for that group is found at group[0]. Because in the duplicate_entries dict we want the identifier of that entry, group[0] gets us that.
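For illustration only (not part of the original answer), a minimal sketch of what each iteration yields, assuming the same input_file as above:

for key, sub_df in input_file.groupby('Identifier'):
    print(key)      # the group key, e.g. 1000
    print(sub_df)   # the sub-DataFrame holding all rows for that Identifier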

We could separate the data into groups with groupby, then sort each group (so we can detect equal groups even when their rows are in a different order) by all columns except "Identifier", and compare the groups:
Suppose that columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]
We can do:
groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    g_idfless = group[group.columns.difference(["Identifier"])]
    groups += [g_idfless.sort_values(columns[1:]).reset_index().drop("index", axis=1)]
And to compare them:
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))
# --> 1000 and 1001 equal?: False
# --> 1000 and 1002 equal?: True
# --> 1001 and 1002 equal?: False
EDIT: Added code to print the identifiers of the groups that match
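As a small variation (my sketch, not part of the original answer), the same loop can collect only the identifier pairs whose groups are equal, which matches the output the question asks for:

matching_pairs = []
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        if groups[i].equals(groups[j]):
            matching_pairs.append((pure_groups[i]["Identifier"].iloc[0],
                                   pure_groups[j]["Identifier"].iloc[0]))
# matching_pairs --> [(1000, 1002)] for the sample data above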

Related

pandas dataframe: how to select rows where one column-value is like 'values in a list'

I have a requirement where I need to select rows from a dataframe where one column's value is like the values in a list.
The requirement is for a large dataframe with millions of rows, and I need to search for rows where the column value is like any of a list of thousands of values.
Below is a sample data.
NAME,AGE
Amar,80
Rameshwar,60
Farzand,90
Naren,60
Sheikh,45
Ramesh,55
Narendra,85
Rakesh,86
Ram,85
Kajol,80
Naresh,86
Badri,85
Ramendra,80
My code is below. The problem is that I'm using a for loop, so as the number of values in the list I need to search (the variable names_like in my code) increases, the number of loop iterations and concat operations increases with it, and that makes the code run very slowly.
I can't use isin(), as isin() is for exact matches and what I have is not an exact match; it is a 'like' condition.
I am looking for a better, more performance-efficient way of getting the required result.
My Code:-
import pandas as pd
infile = "input.csv"
df = pd.read_csv(infile)
print(f"df=\n{df}")
names_like = ['Ram', 'Nar']
df_res = pd.DataFrame(columns=df.columns)
for name in names_like:
    df1 = df[df['NAME'].str.contains(name, na=False)]
    df_res = pd.concat([df_res, df1], axis=0)
print(f"df_res=\n{df_res}")
My Output:-
df_res=
NAME AGE
1 Rameshwar 60
5 Ramesh 55
8 Ram 85
12 Ramendra 80
3 Naren 60
6 Narendra 85
10 Naresh 86
You can join all the names with | and pass them as a regex "or" pattern; the loop is not necessary:
df_res = df[df['NAME'].str.contains('|'.join(names_like), na=False)]
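One caveat worth noting (my addition, not part of the original answer): str.contains treats the joined string as a regular expression, so if the search terms might contain regex metacharacters it is safer to escape them first, reusing the question's names_like:

import re

# escape any regex metacharacters before building the "or" pattern
pattern = '|'.join(map(re.escape, names_like))
df_res = df[df['NAME'].str.contains(pattern, na=False)]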

How to Ignore a few rows with a unique index in a pandas data frame while using groupby()?

I have a data frame df:
ID Height
A 168
A 170
A 190
A 159
B 172
B 173
C 185
I am trying to eliminate outliers in df from each ID separately using:
outliersfree = df[df.groupby("ID")['Height'].transform(lambda x : x < (x.quantile(0.95) + 5*(x.quantile(0.95) - x.quantile(0.05)))).eq(1)]
Here, I want to ignore the rows with a unique index, i.e., all the IDs that have only one corresponding entry. For instance, in the df given, ID C has only one entry, so I want to ignore C while eliminating outliers and keep it as it is in the new data frame outliersfree.
I am also interested in knowing how to ignore/skip IDs which have two entries (For example, B in the df).
One option is to create an OR condition in your lambda function such that if there is one element in your group, you return True.
df.groupby("ID")['Height'].transform(lambda x : (x.count() == 1) |
(x < (x.quantile(0.95) + 5*
(x.quantile(0.95) - x.quantile(0.05)))))
And you can use (x.count() < 3) for groups with two entries or fewer.
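Put together, a minimal sketch of how the combined condition might be applied; the frame below just reproduces the question's sample data for illustration:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
                   'Height': [168, 170, 190, 159, 172, 173, 185]})

mask = df.groupby("ID")['Height'].transform(
    lambda x: (x.count() == 1) |
              (x < (x.quantile(0.95) + 5 * (x.quantile(0.95) - x.quantile(0.05))))
)
outliersfree = df[mask]  # C is kept as is because its group has a single row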

Using fuzzywuzzy to fix discrepancies in large panda

I have a pandas dataframe with 500k rows that contains paid-out expenses. It looks like so:
As you can see, the 'LastName' column can contain entries that should be the same but in practice contain minor differences. My ultimate goal is to see how much was paid to each entity by doing a simple groupby and .sum(). However, for that to work, the entries under 'LastName' must be uniform.
I'm attempting to solve this problem using fuzzywuzzy.
First I take the unique values from 'LastName' and save them to a list for comparison:
choices = expenditures_df['LastName'].astype('str').unique()
This leaves me with 50k unique entries from 'LastName' that I now need to compare the full 500k against.
Then I run through every line in the dataframe and look at its similarity to each choice. If the similarity is high enough, I overwrite the data in the dataframe with the entity name from choices.
for choice in choices:
    word = str(choice)
    for i in expenditures_df.index:
        if fuzz.ratio(word, str(expenditures_df.loc[i, 'LastName'])) > 79:
            expenditures_df.loc[i, 'LastName'] = word
The problem, of course, is this is incredibly slow. So, I'd love some thoughts on accomplishing the same thing in a more efficient manner.
See: How to group words whose Levenshtein distance is more than 80 percent in Python
Based on this you can do something like:
import pandas as pd
from fuzzywuzzy import fuzz
df = pd.DataFrame({'expenditure':[1000,500,250,11,456,755],'last_name':['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']})
choices = df['last_name'].unique()
grs = list() # groups of names with distance > 80
for name in choices:
    for g in grs:
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])

name_map = []
for c, group in enumerate(grs):
    for name in group:
        name_map.append([c, name])

group_map = pd.DataFrame(name_map, columns=['group', 'name'])
df = df.merge(group_map, left_on='last_name', right_on='name')
df = df.groupby('group')['expenditure'].sum().reset_index()
df = df.merge(group_map.groupby('group')['name'].apply(list), on='group')
OUTPUT:
group expenditure name
0 0 1500 [rakesh, zakesh]
1 1 261 [bikash, zikash]
2 2 1211 [goldman LLC, oldman LLC]
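As a variation (my sketch, not part of the original answer), the grs list can also be turned into a plain name-to-canonical-name dict and mapped onto the frame before any merging; the column name canonical_name below is just illustrative:

# assumes df is the original frame (before the merges above) and grs is the list built above
canonical = {name: group[0] for group in grs for name in group}
df['canonical_name'] = df['last_name'].map(canonical)
totals = df.groupby('canonical_name')['expenditure'].sum()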

Use str.contains then fuzzy match remaining elements

I have a data frame which contains a column of strings, and a master list which I need to match them to. I want this match to be added as a separate column on the data frame. Since the data frame is about 1 million rows, I want to first do a "string contains" check; if the string does not contain an entry from the master list, I want to fuzzy match it at a threshold of 85, and if there is still no match, just return the value from the data frame column.
m_list = ['FOO', 'BAR', 'DOG']
df
Name Number_Purchased
ALL FOO 1
ALL FOO 4
BARKY 2
L.T. D.OG 1
PUMPKINS 3
I'm trying to achieve this outcome:
df2
Name Number_Purchased Match_Name Match_Score
ALL FOO 1 FOO 100
ALL FOO 4 FOO 100
BARKY 2 BAR 95
L.T. D.OG 1 DOG 90
PUMPKINS 3 PUMKINS 25
My code looks like this:
def matches(df, m_list):
    if df['Name'].contains('|'.join(m_list)):
        return m_list, 100
    else:
        new_name, score = process.extractOne(df.name, m_list, scorer=fuzz.token_set_ratio)
        if score > 85:
            return new_name, score
        else:
            return name, score

df['Match_Name'], df['Match_Score'] = zip(*df.apply(matches))
I've edited it several times and keep getting errors, either that "str" does not contain attribute "str" or that there is a difference in the shape of arrays causing problems. How can I adjust this code so it's functional but also scalable for a column with 1 million+ rows?
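For illustration, a hedged sketch of one way the described row-wise logic could be written, reusing the question's threshold and scorer; the sample data is reconstructed from the question, and the per-row apply is illustrative rather than optimized:

import pandas as pd
from fuzzywuzzy import fuzz, process

m_list = ['FOO', 'BAR', 'DOG']
df = pd.DataFrame({'Name': ['ALL FOO', 'ALL FOO', 'BARKY', 'L.T. D.OG', 'PUMPKINS'],
                   'Number_Purchased': [1, 4, 2, 1, 3]})

def match_one(name):
    # cheap "string contains" check first
    for m in m_list:
        if m in name:
            return m, 100
    # fall back to a fuzzy match against the master list
    best, score = process.extractOne(name, m_list, scorer=fuzz.token_set_ratio)
    if score > 85:
        return best, score
    # otherwise keep the original value
    return name, score

df['Match_Name'], df['Match_Score'] = zip(*df['Name'].apply(match_one))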

use of .apply() for comparing elements

I have a dataframe df of thousands of items where the value of the column "group" repeats from two to ten times. The dataframe has seven columns, one of them is named "url"; another one "flag". All of them are strings.
I would like to use pandas to traverse through these groups. For each group I would like to find the longest item in the "url" column and store a "0" or "1" in the "flag" column that corresponds to that item. I have tried the following but I cannot make it work. I would like to 1) get rid of the loop below, and 2) be able to compare all items in the group through df.apply(...)
all_groups = df["group"].drop_duplicates.tolist()
for item in all_groups:
df[df["group"]==item].apply(lambda x: Here I would like to compare the items within one group)
Can apply() and lambda be used in this context? Any faster way to implement this?
Thank you!
Using groupby() and .transform() you could do something like:
df['flag'] = df.groupby('group')['url'].transform(lambda x: x.str.len() == x.map(len).max())
Which provides a boolean value for df['flag']. If you need it as 0, 1 then just add .astype(int) to the end.
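For illustration only, a small sketch of that transform on made-up data shaped like the question describes (the column values are invented):

import pandas as pd

df = pd.DataFrame({'group': list('aaabb'),
                   'url': ['abc', 'de', 'fghi', 'jkl', 'm'],
                   'flag': 0})

df['flag'] = df.groupby('group')['url'].transform(
    lambda x: x.str.len() == x.map(len).max()
).astype(int)
# the longest url in each group now has flag == 1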
Unless you write code and find it's running slowly, don't sweat optimizing it. In the words of Donald Knuth, "Premature optimization is the root of all evil."
If you want to use apply and lambda (as mentioned in the question):
df = pd.DataFrame({'url': ['abc', 'de', 'fghi', 'jkl', 'm'], 'group': list('aaabb'), 'flag': 0})
Looks like:
flag group url
0 0 a abc
1 0 a de
2 0 a fghi
3 0 b jkl
4 0 b m
Then figure out which elements should have their flag variable set.
indices = df.groupby('group')['url'].apply(lambda s: s.str.len().idxmax())
df.loc[indices, 'flag'] = 1
Note this only gets the first url with maximal length. You can compare the url lengths to the maximum if you want different behavior.
So df now looks like:
flag group url
0 0 a abc
1 0 a de
2 1 a fghi
3 1 b jkl
4 0 b m
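If every url tying for the maximal length should be flagged instead of just the first one, the length comparison mentioned above could look like this (a sketch, not part of the original answer):

# broadcast each group's maximal url length back to its rows, then compare
max_len = df.groupby('group')['url'].transform(lambda s: s.str.len().max())
df['flag'] = (df['url'].str.len() == max_len).astype(int)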
