Use str.contains then fuzzy match remaining elements - python

I have a data frame with a column of strings and a master list that I need to match them against. I want the match added as a separate column of the data frame. Since the data frame is about 1 million rows, I first want to do a "string contains" check; if the name does not contain a string from the master list, I want to fuzzy match it at a threshold of 85, and if there is still no match, just return the value already in the column.
m_list = ['FOO', 'BAR', 'DOG']
df
Name Number_Purchased
ALL FOO 1
ALL FOO 4
BARKY 2
L.T. D.OG 1
PUMPKINS 3
I'm trying to achieve this outcome:
df2
Name Number_Purchased Match_Name Match_Score
ALL FOO 1 FOO 100
ALL FOO 4 FOO 100
BARKY 2 BAR 95
L.T. D.OG 1 DOG 90
PUMPKINS 3 PUMPKINS 25
My code looks like this:
def matches(df, m_list):
    if df['Name'].contains('|'.join(m_list)):
        return m_list, 100
    else:
        new_name, score = process.extractOne(df.name, m_list, scorer=fuzz.token_set_ratio)
        if score > 85:
            return new_name, score
        else:
            return name, score

df['Match_Name'], df['Match_Score'] = zip(*df.apply(matches))
I've edited it several times and keep getting errors, either "'str' object has no attribute 'str'" or a difference in the shape of arrays. How can I adjust this code so it's functional but also scalable for a column with 1 million+ rows?
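One possible reworking (a sketch, not a drop-in fix): do the cheap substring pass with a vectorised str.contains over the whole column, and only fuzzy match the rows that fail it. This assumes fuzzywuzzy's process.extractOne, which returns a (match, score) pair; rapidfuzz's version also returns a third element (the index), so the unpacking would need adjusting there.

import pandas as pd
from fuzzywuzzy import fuzz, process  # assumption: fuzzywuzzy-style API

m_list = ['FOO', 'BAR', 'DOG']
pattern = '|'.join(m_list)

# Pass 1 (vectorised): rows whose Name already contains a master entry
contains = df['Name'].str.contains(pattern)
df.loc[contains, 'Match_Name'] = df.loc[contains, 'Name'].str.extract(f'({pattern})')[0]
df.loc[contains, 'Match_Score'] = 100

# Pass 2 (row-wise, only on the remaining rows): fuzzy match, falling back to the original value
def fuzzy_match(name, choices=m_list, threshold=85):
    best, score = process.extractOne(name, choices, scorer=fuzz.token_set_ratio)
    return (best, score) if score > threshold else (name, score)

rest = ~contains
df.loc[rest, ['Match_Name', 'Match_Score']] = [fuzzy_match(n) for n in df.loc[rest, 'Name']]

Because the fuzzy scorer only runs on the rows that fail the substring test, the expensive part shrinks considerably on a million-row frame.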

Related

I am looking for a partial string in each row of my pandas DataFrame. I created a DataFrame where the rows are not all equal in length, so I cannot select by column

I read in a file and created a DataFrame from it. The problem is that not all of the information I read was separated properly, and the rows are not all the same length. I have a df with 1600 columns, but I do not need them all; I specifically need the information that is 3 columns to the left of a particular string in one of the other columns. For example:
In the 1st row, column number 1000 has a value of ['HFOBR'], and I need the column value that is 3 to the left.
In the 2nd row, the column with ['PQOBR'] might be number 799, but I still need the value that is 3 to the left.
In the 3rd row, the column might be number 400 with ['BBSOBR'], but I still need the value 3 to the left.
And so on. I am really trying to search each row for the partial string OBR, take the value 3 columns over from it, and put that value in a new df in a column of its own.
Here you will find a snapshot of the dataframe
Here you will see the code I used to create the DataFrame in the first place, where I read in an HL7 file and tried to convert it to a DataFrame. Each of the HL7 messages is not the same length, which is causing part of the problem I am having:
message = []
parsed_msg = []
with open(filename) as msgs:
    start = False
    for line in msgs.readlines():
        if line[:3] == 'MSH':
            if start:
                parsed_msg = hl7.parse_batch(msg)
                #print(parsed_msg)
                start = False
                message += parsed_msg
            msg = line
            start = True
        else:
            msg += line
df = pd.DataFrame(message)
Sample data:
df = pd.DataFrame([["HFOBR", "foo", "a", "b", "c"], ["foo", "PQOBR", "a", "b", "c"]])
df
0 1 2 3 4
0 HFOBR foo a b c
1 foo PQOBR a b c
Define a function to find the value three columns to the left of the first column containing a string with "OBR"
import numpy as np

def find_left_value(row):
    obr_col_idx = np.where(row.str.contains("OBR"))[0]
    left_col_idx = obr_col_idx + 3
    return row[left_col_idx].iloc[0]
Apply this function to your dataframe:
df['result'] = df.apply(find_left_value, axis=1)
Resulting dataframe:
0 1 2 3 4 result
0 HFOBR foo a b c b
1 foo PQOBR a b c c
FYI: making sample data like this that people can test answers on will help you 1) define your problem more clearly, and 2) get answers.
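If some rows might not contain "OBR" at all (an assumption on my part; every row in the sample data has one), a small guard in the same function avoids an IndexError:

def find_left_value(row):
    # positions of cells whose text contains "OBR"
    obr_col_idx = np.where(row.astype(str).str.contains("OBR"))[0]
    if len(obr_col_idx) == 0:
        return None  # no "OBR" anywhere in this row
    return row.iloc[obr_col_idx[0] + 3]

The astype(str) is there only so that non-string cells (e.g. NaN padding from the uneven HL7 messages) don't break the contains check.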

How to select top N columns in a dataframe with a criteria

Here is my dataframe; it has high dimensionality (a large number of columns), more than 10,000 columns.
The columns in my data are split into 3 categories
columns start with "Basic"
columns end with "_T"
and everything else
a sample of my dataframe is this
RowID Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1 2 8 4 3 1 5 6 7
2 8 3 5 0 9 7 0 5
I want to have in my dataframe all "Basic" & "_T" columns and only TOP N (variable could be 3, 5, 10, 100, etc) of other columns
I have this code to give me the top N for all columns, but what I am looking for is the top N only for the columns that are not "Basic" or "_T",
and by Top I mean the greatest values
Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
How can I achieve that?
Step 1: You can use .filter() with regex to filter the columns with the following 2 conditions:
start with "Basic", or
end with "_T"
The regex used is r'(?:^Basic)|(?:_T$)' where:
(?: ) is a non-capturing group of regex. It is for a temporary grouping.
^ is the start of text anchor to indicate start position of text
Basic matches with the text Basic (together with ^, this Basic must be at the beginning of column label)
| is the regex meta-character for or
_T matches the text _T
$ is the end of text anchor to indicate the end position of the text (together with _T, the pattern _T$ indicates _T at the end of the column name).
We name these columns as cols_Basic_T
Step 2: Then, use Index.difference() to find the other columns. We name these other columns cols_others.
Step 3: Then, we apply code similar to what you used for the top N over all columns, but only on these selected columns cols_others.
Full code:
## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns
## Step 2
cols_others = df.columns.difference(cols_Basic_T)
## Step 3
#Top = 20
Top = 3 # use fewer columns here for smaller sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
# To keep the original column sequence
df_others = df_others[df.columns.intersection(cols_others)]
Results:
cols_Basic_T
print(cols_Basic_T)
Index(['Basic1011', 'Basic2837', 'Car92_T', 'Basic383_T'], dtype='object')
cols_others
print(cols_others)
Index(['Brat82', 'Jot112', 'Lemon836', 'Manf3953', 'RowID'], dtype='object')
df_others
print(df_others)
## With Top 3 shown as non-zeros. Other non-Top3 masked as zeros
RowID Lemon836 Manf3953 Brat82 Jot112
0 0 4 0 5 7
1 0 0 9 7 5
Try something like this; you may have to play around with the column selection at the outset to be sure you're filtering correctly.
# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()
# the tilde takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]
#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
#then merge dfn with df_ranked
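The merge mentioned in the last comment is left to the reader; one possible way to finish it (a sketch, assuming the goal is to put the untouched Basic/_T columns back next to the ranked columns, in the original column order):

import pandas as pd

# columns to keep as-is, using the same regex idea as above
basic_t = df.filter(regex=r'(?:^Basic)|(?:_T$)')

# side by side with the top-N-masked columns, then restore the original order
result = pd.concat([basic_t, df_ranked], axis=1)
result = result[[c for c in df.columns if c in result.columns]]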

Grouping and comparing groups using pandas

I have data that looks like:
Identifier Category1 Category2 Category3 Category4 Category5
1000 foo bat 678 a.x ld
1000 foo bat 78 l.o op
1000 coo cat 678 p.o kt
1001 coo sat 89 a.x hd
1001 foo bat 78 l.o op
1002 foo bat 678 a.x ld
1002 foo bat 78 l.o op
1002 coo cat 678 p.o kt
What I am trying to do is compare 1000 to 1001, to 1002, and so on. The output I want the code to give is: 1000 is the same as 1002. So, the approach I wanted to use was:
First, group all the identifier items into separate dataframes (maybe?). For example, df1 would be all rows pertaining to identifier 1000 and df2 would be all rows pertaining to identifier 1002. (Please note that I want the code to do this itself, as there are millions of rows, as opposed to me writing code to manually compare identifiers.) I have tried using the groupby feature of pandas; it does the grouping well, but then I do not know how to compare the groups.
Compare each of the groups/sub-data frames.
One method I was thinking of was reading each row of a particular identifier into an array/vector and comparing arrays/vectors using a comparison metric (Manhattan distance, cosine similarity etc).
Any help is appreciated, I am very new to Python. Thanks in advance!
You could do something like the following:
import pandas as pd
input_file = pd.read_csv("input.csv")
columns = ['Category1','Category2','Category3','Category4','Category5']
duplicate_entries = {}
for group in input_file.groupby('Identifier'):
    # transforming to tuples so that it can be used as keys on a dict
    lines = [tuple(y) for y in group[1].loc[:, columns].values.tolist()]
    key = tuple(lines)
    if key not in duplicate_entries:
        duplicate_entries[key] = []
    duplicate_entries[key].append(group[0])
Then the duplicate_entries values will have the list of duplicate Identifiers
duplicate_entries.values()
> [[1000, 1002], [1001]]
EDIT:
To get only the entries that have duplicates, you could have something like:
all_dup = [ids for ids in duplicate_entries.values() if len(ids) > 1]
Explaining the indices (sorry I didn't explain it before): iterating through the df.groupby outcome gives a tuple where the first entry is the key of the group (in this case the 'Identifier') and the second is the grouped sub-DataFrame. So to get the lines that belong to a group we use group[1], and the 'Identifier' of that group is found at group[0]. Because in duplicate_entries we want to store the identifier of that entry, group[0] gets us that.
We could separate into groups with groupby, then sort every group (so we can detect equal groups even when their rows are in a different order) by all columns except "Identifier", and compare the groups:
Suppose that columns = ["Identifier", "Category1", "Category2", "Category3", "Category4", "Category5"]
We can do:
groups = []
pure_groups = []
for name, group in df.groupby("Identifier"):
    pure_groups += [group]
    g_idfless = group[group.columns.difference(["Identifier"])]
    groups += [g_idfless.sort_values(columns[1:]).reset_index().drop("index", axis=1)]
And to compare them:
for i in range(len(groups)):
    for j in range(i + 1, len(groups)):
        id1 = str(pure_groups[i]["Identifier"].iloc[0])
        id2 = str(pure_groups[j]["Identifier"].iloc[0])
        print(id1 + " and " + id2 + " equal?: " + str(groups[i].equals(groups[j])))
#-->1000 and 1001 equal?: False
#-->1000 and 1002 equal?: True
#-->1001 and 1002 equal?: False
EDIT: Added code to print the identifiers of the groups that match
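For very large inputs, a compact variant of the same idea (a sketch of my own, not from either answer above) is to build one hashable, order-independent signature per Identifier and bucket identical signatures together:

cat_cols = ['Category1', 'Category2', 'Category3', 'Category4', 'Category5']

# one sorted tuple of rows per Identifier, so row order inside a group doesn't matter
sig = {}
for identifier, g in df.groupby('Identifier'):
    sig[identifier] = tuple(sorted(map(tuple, g[cat_cols].values.tolist())))

matches = {}
for identifier, signature in sig.items():
    matches.setdefault(signature, []).append(identifier)

print([ids for ids in matches.values() if len(ids) > 1])  # -> [[1000, 1002]]

This stays roughly linear in the number of groups, instead of comparing every pair of groups.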

Using Python, Pandas and Apply/Lambda, how can I write a function which creates multiple new columns?

Apologies for the messy title: Problem as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match, are saved (Words_Present):
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into new columns, so that my eventual df1 looks like this:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i] + "Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries as Secondary or Undergrad, and save the keyword used to make the match. If you assume that each element of Entries has at most one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
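A sketch of what that loop might look like (assuming the column names of df2 double as the category labels, and that the keywords contain no regex metacharacters):

for category in df2.columns:
    pattern = '(' + '|'.join(df2[category].values) + ')'
    found = df.Entries.str.extract(pattern, expand=False)
    df[category.lower() + '_words_found'] = found
    df[category.lower()] = found.notnull() * 1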
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df.category:
    df[colname] = (df.Entries == colname) * 1  # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. The keyword column must have the same name as the matching column in df1; in SQL terms, this is the key you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify_row(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries, 'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 0, 'undergrad_words_found': ''})

# applied row by row, e.g. df1.apply(classify_row, axis=1)
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

How can I make a DataFrame containing half of the data from another DataFrame, distributed evenly across values in a column?

I'm trying to do some supervised machine learning on a data set.
My data is organized in a single DataFrame with samples as rows and features as columns. One of my columns contains the category to which the sample belongs.
I would like to split my data set in half such that samples are evenly distributed between categories. Is there a native pandas approach to doing so, or will I have to loop through each row and individually assign each sample to either the training or the testing group?
Here is an illustrative example of how my data is organized. The char column indicates the category to which each row belongs.
feature char
0 SimpleCV.Features.Blob.Blob object at (38, 74)... A
1 SimpleCV.Features.Blob.Blob object at (284, 26... A
2 SimpleCV.Features.Blob.Blob object at (87, 123... B
3 SimpleCV.Features.Blob.Blob object at (198, 37... B
4 SimpleCV.Features.Blob.Blob object at (345, 60... C
5 SimpleCV.Features.Blob.Blob object at (139, 92... C
6 SimpleCV.Features.Blob.Blob object at (167, 83... D
7 SimpleCV.Features.Blob.Blob object at (57, 54)... D
8 SimpleCV.Features.Blob.Blob object at (35, 77)... E
9 SimpleCV.Features.Blob.Blob object at (136, 73... E
Referring to the above example, I'd like to end up with two DataFrames, each containing half of the samples in each char category. In this example, there are two of each char type, so the resulting DataFrames would each have one A row, one B row, etc...
I should mention, however, that the number of rows in each char category in my actual data can vary.
Thanks very much in advance!
Here is one way:
>>> print d
A B Cat
0 -1.703752 0.659098 X
1 0.418694 0.507111 X
2 0.385922 1.055286 Y
3 -0.909748 -0.900903 Y
4 -0.845475 1.681000 Y
5 1.257767 2.465161 Y
>>> def whichHalf(t):
...     t['Div'] = 'Train'
...     t.loc[t.index[:len(t) // 2], 'Div'] = 'Test'
...     return t
>>> d.groupby('Cat').apply(whichHalf)
A B Cat Div
0 -1.703752 0.659098 X Test
1 0.418694 0.507111 X Train
2 0.385922 1.055286 Y Test
3 -0.909748 -0.900903 Y Test
4 -0.845475 1.681000 Y Train
5 1.257767 2.465161 Y Train
This assigns the first half of each group to the test set and the second half to the training set. You can then get the two sets by filtering on this new "Div" column. Note that this will only work if each category has an even number of data points. If a category doesn't have an even number of data points, then obviously you can't divide it equally into two parts.
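For reference, one alternative sketch that avoids the even-count restriction (assuming pandas 1.1 or newer, where GroupBy.sample is available): randomly take roughly half of each category, then treat everything that was not sampled as the other split.

test = d.groupby('Cat').sample(frac=0.5, random_state=0)
train = d.drop(test.index)

With an odd-sized category the two parts differ by at most one row.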
