Fuzzy Searching a Column in Pandas

Fuzzy Searching a Column in Pandas - python

Is there a way to search for a value in a dataframe column using FuzzyWuzzy or similar library?
I'm trying to find a value in one column that corresponds to the value in another while taking fuzzy matching into account. So
So for example, if I have State Names in one column and State Codes in another, how would I find the state code for Florida, which is FL while catering for abbreviations like "Flor"?
So in other words, I want to find a match for a State Name corresponding to "Flor" and get the corresponding State Code "FL".
Any help is greatly appreciated.

If the abbreviations are all prefixes, you can use the .startswith() string method against either the short or long version of the state.
>>> test_value = "Flor"
>>> test_value.upper().startswith("FL")
True
>>> "Florida".lower().startswith(test_value.lower())
True
However, if you have more complex abbreviations, difflib.get_close_matches will probably do what you want!
>>> import pandas as pd
>>> import difflib
>>> df = pd.DataFrame({"states": ("Florida", "Texas"), "st": ("FL", "TX")})
>>> df
states st
0 Florida FL
1 Texas TX
>>> difflib.get_close_matches("Flor", df["states"].to_list())
['Florida']
>>> difflib.get_close_matches("x", df["states"].to_list(), cutoff=0.2)
['Texas']
>>> df["st"][df.index[df["states"]=="Texas"]].iloc[0]
'TX'
You will probably want to try/except IndexError around reading the first member of the returned list from difflib and possibly tweak the cutoff to get less false matches with close states (perhaps offer all the states as possibilities to some user or require more letters for close states).
You may also see the best results combining the two; testing prefixes first before trying the fuzzy match.
Putting it all together
def state_from_partial(test_text, df, col_fullnames, col_shortnames):
if len(test_text) < 2:
raise ValueError("must have at least 2 characters")
# if there's exactly two characters, try to directly match short name
if len(test_text) == 2 and test_text.upper() in df[col_shortnames]:
return test_text.upper()
states = df[col_fullnames].to_list()
match = None
# this will definitely fail at least for states starting with M or New
#for state in states:
# if state.lower().startswith(test_text.lower())
# match = state
# break # leave loop and prepare to find the prefix
if not match:
try: # see if there's a fuzzy match
match = difflib.get_close_matches(test_text, states)[0] # cutoff=0.6
except IndexError:
pass # consider matching against a list of problematic states with different cutoff
if match:
return df[col_shortnames][df.index[df[col_fullnames]==match]].iloc[0]
raise ValueError("couldn't find a state matching partial: {}".format(test_text))
Beware of states which start with 'New' or 'M' (and probably others), which are all pretty close and will probably want special handling. Testing will do wonders here.

Related

Python pandas if statement based off of boolean qualifier

I am try to do an IF statement where it keeps my currency pairs in alphabetic ordering (i.e. USD/EUR would flip to EUR/USD because E alphabetically comes before U, however CHF/JPY would stay the same because C comes alphabetically before J.) Initially I was going to write code specific to that, but realized there were other fields I'd need to flip (mainly changing a sign for positive to negative or vice versa.)
So what I did was write a function to create a new column and make a boolean identifier as to whether or not the field needs action (True) or not (False).
def flipFx(ccypair):
first = ccypair[:3]
last = ccypair[-3:]
if(first > last):
return True
else:
return False
brsPosFwd['Flip?'] = brsPosFwd['Currency Pair'].apply(flipFx)
This works great and does what I want it to.
Then I try and write an IF statement to use that field to create two new columns:
if brsPosFwd['Flip?'] is True:
brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc'].apply(lambda x:
x.str[-3:]+"/"+x.str[:3])
brsPosFwd['NotionalFlip'] = -brsPosFwd['Current Face']
else:
brsPosFwd['CurrencyFlip'] = brsPosFwd['Sec Desc']
brsPosFwd['NotionalFlip'] = brsPosFwd['Current Face']
However, this is not working properly. It's creating the two new fields, CurrencyFlip and NotionalFlip but treating every record like it is False and just pasting what came before it.
Does anyone have any ideas?

Pandas uses vectorised functions. You are performing operations on entire series objects as if they were single elements.
You can use numpy.where to vectorise your calculations:
import numpy as np
brsPosFwd['CurrencyFlip'] = np.where(brsPosFwd['Flip?'],
brsPosFwd['Sec Desc'].str[-3:]+'/'+brsPosFwd['Sec Desc'].str[:3]),
brsPosFwd['Sec Desc'])
brsPosFwd['NotionalFlip'] = np.where(brsPosFwd['Flip?'],
-brsPosFwd['Current Face'],
brsPosFwd['Current Face'])
Note also that pd.Series.apply should be used as a last resort; since it is a thinly veiled inefficient loop. Here you can simply use the .str accessor.

Python regex issue with strings

Hi wonder if someone can help I've converted to Python from Perl and for the most part I love it. However I struggle with regex in Python this is not as strong or easy as perl for anyway. How do I use a list of exemption values(exemptions_list) to search another list which is being iterated in a for loop. Problem is that the values in the for loop are slightly different from the search exemptions.
i.e. one of the exemptions is the string "default" but the variable coming in to be search is default_10 or default_20. Likewise none is the search pattern but the share is called none_20 etc. I don't really want to iterate over the search patterns as I am already iterating over the shares which come from another subprocess output. So basically it never finds the string as it is looking for default_20 rather than default. How can break down the variable coming in from shared_list so that python uses default from the variable to search again the strings in the exemptions_list. The share variable is as stated generated differently for different systems subprocess output.
Many thanks
in Perl it would be easy.
if ( $share =~ /^.*_[\d\d]/ && $share !~ /$cust_id|$exemptions/ ) {
Python:
exemption_list = "none temp swap container"
shares_list [' this is dynamic and comes in with values such as none_20 temp_20, testtmp etc ]'
def process_share_information(shares_list, customer_id):
for share in shares_list:
share_match = re.search(share, exemption_list)
if not share_match:
print 'we have found a potentially bad share not in exemptions'

Strips last _\d\d from string
re.sub(r'_\d\d$', '', string)
So to check for exemption do
>>> re.sub(r'_\d\d$', '', "none_20") in exemption_list
True
If searched words are in more general format than name_\d\d, iterate over exemptions instead.
>>> exemptions = "none temp swap container".split()
>>> shares_list = "this is dynamic and comes in with values such as none_20 asdfnone anonea temp_20, testtmp etc"
>>> for e in exemptions:
... print(e)
... print(e in shares_list)
... print(re.findall(r'\b\S*?{}\S*?\b'.format(e), shares_list))
... print()
...
none
True
['none_20', 'asdfnone', 'anonea']
temp
True
['temp_20']
swap
False
[]
container
False
[]
Or if you only need one result for whole string
>>> any(e in shares_list for e in exemptions)
True

List Matching in Python using nested for loops

I have three lists, (1) treatments (2) medicine name and (3) medicine code symbol. I am trying to identify the respective medicine code symbol for each of 14,700 treatments. My current approach is to identify if any name in (2) is "in" (1), and then return the corresponding (3). However, I am returned an abitrary list (correct length) of medicine code symbols corresponding to the 14,700 treatments. Code for the method I've written is below:
codes = pandas.read_csv('Codes.csv', dtype=str)
codes_list = _codes.values.tolist()
names = pandas.read_csv('Names.csv', dtype=str)
names_list = names.values.tolist()
treatments = pandas.read_csv('Treatments.csv', dtype=str)
treatments_list = treatments.values.tolist()
matched_codes_list = range(len(treatments_list))
for i in range(len(treatments_list)):
for j in range(len(names_list)):
if names_list[j] in treatments_list[i]:
matched_codes_list[i]=codes_list_text[j]
print matched_codes_list
Any suggestions for where I am going wrong would be much appreciated!

I can't tell what you are expecting. You should replace the xxx_list code with examples instead, since you don't seem to have any problems with the csv reading.
Let's suppose you did that, and your result looks like this.
codes_list = ['shark', 'panda', 'horse']
names_list = ['fin', 'paw', 'hoof']
assert len(codes_list) == len(names_list)
treatments_list = ['tape up fin', 'reverse paw', 'stand on one hoof', 'pawn affinity maneuver', 'alert wing patrol']
it sounds like you are trying to determine the 'code' for each 'treatment', assuming that the number of codes and names are the same (and indicate some mapping). You plan to use the presence of the name to determine the code.
we can zip together the name and codes list to avoid using indexes there, and we can use iteration over the treatment list instead of indexes for pythonic readability
matched_codes_list = []
for treatment in treatment:
matched_codes = []
for name, code in zip(names_list, codes_list):
if name in treatment:
matched_codes.append(code)
matched_codes_list.append(matched_codes)
this would give something like
assert matched_codes_list == [
['shark'], # 'tape up fin'
['panda'], # 'reverse paw'
['horse'], # 'stand on one hoof'
['shark', 'panda', 'horse'], # 'pawn affinity maneuver'
[], # 'alert wing patrol'
]
note that the method used to do this is quite slow (and probably will give false positives, see 4th entry). You will traverse the text of all treatment descriptions once for each name/code pair.
You can use a dictionary like 'lookup = {name: code for name, code in zip(names_list, codes_list)}, or itertools.izip for minor gains. Otherwise something more clever might be needed, perhaps splitting treatments into a set containing words, or mapping words into multiple codes.

How can I match partial strings / is there a better way?

I am pulling hotel names through the Expedia API and cross referencing results with another travel service provider.
The problem I am encountering is that many of the hotel names appear differently on the Expedia API than they do with the other provider and I cannot figure out a good way to match them.
I am storing the results of both in separate dicts with room rates. So, for example, the results from Expedia on a search for Vilnius in Lithuania might look like this:
expediadict = {'Ramada Hotel & Suites Vilnius': 120, 'Hotel Rinno': 100,
'Vilnius Comfort Hotel': 110}
But the results from the other provider might look like this:
altproviderdict = {'Ramada Vilnius': 120, 'Rinno Hotel': 100,
'Comfort Hotel LT': 110}
The only thing I can think of doing is stripping out all instances of 'Hotel', 'Vilnius', 'LT' and 'Lithuania' and then testing whether part of the expediadict key matches part of an altprovderdict key. This seems messy and not very Pythonic, so I wondered if any of you had any cleaner ideas?

>>> def simple_clean(word):
... return word.lower().replace(" ","").replace("hotel","")
...
>>> a = "Ramada Hotel & Suites Vilnius"
>>> b = "Hotel Ramada Suites Vilnous"
>>> a = simple_clean(a)
>>> b = simple_clean(b)
>>> a
'ramada&suitesvilnius'
>>> b
'ramadasuitesvilnous'
>>> import difflib
>>> difflib.SequenceMatcher(None,a,b).ratio()
0.9230769230769231
Do cleaning and normalization of the words : eg. remove words like Hotel,The,Resort etc
, and convert to lower case without spaces etc
Then use a fuzzy string matching algorithm like leveinstein, eg from difflib module.
This method is pretty raw and just an example, you can enhance it to suit your needs for optimal results.

If you only want to match names when the words appear in the same order, you might want to use some longest common sub sequence algorithm like it's used in diff tools. But with words instead of characters or lines.
If order is not important, it's simpler: Put all the words of the name into a set like this:
set(name.split())
and in order to match two names, test the size of the intersection of these two sets. Or test if the symmetric_difference only contains unimportant words.

python: replacing regex with BNF or pyparsing

I am parsing a relatively simple text, where each line describes a game unit. I have little knowledge of parsing techniques, so I used the following ad hoc solution:
class Unit:
# rules is an ordered dictionary of tagged regex that is intended to be applied in the given order
# the group named V would correspond to the value (if any) for that particular tag
rules = (
('Level', r'Lv. (?P<V>\d+)'),
('DPS', r'DPS: (?P<V>\d+)'),
('Type', r'(?P<V>Tank|Infantry|Artillery'),
#the XXX will be expanded into a list of valid traits
#note: (XXX| )* wouldn't work; it will match the first space it finds,
#and stop at that if it's in front of something other than a trait
('Traits', r'(?P<V>(XXX)(XXX| )*)'),
# flavor text, if any, ends with a dot
('FlavorText', r'(?P<V>.*\."?$)'),
)
rules = collections.OrderedDict(rules)
traits = '|'.join('All-Terrain', 'Armored', 'Anti-Aircraft', 'Motorized')
rules['Traits'] = re.sub('XXX', effects, rules['Traits'])
for x in rules:
rules[x] = re.sub('<V>', '<'+x+'>', rules[x])
rules[x] = re.compile(rules[x])
def __init__(self, data)
# data looks like this:
# Lv. 5 Tank DPS: 55 Motorized Armored
for field, regex in Item.rules.items():
data = regex.sub(self.parse, data, 1)
if data:
raise ParserError('Could not parse part of the input: ' + data)
def parse(self, m):
if len(m.groupdict()) != 1:
Exception('Expected a single named group')
field, value = m.groupdict().popitem()
setattr(self, field, value)
return ''
It works fine, but I feel I reached the limit of regex power. Specifically, in the case of Traits, the value ends up being a string that I need to split and convert into a list at a later point: e.g., obj.Traits would be set to 'Motorized Armored' in this code, but in a later function changed to ('Motorized', 'Armored').
I'm thinking of converting this code to use either EBNF or pyparsing grammar or something like that. My goals are:
make this code neater and less error-prone
avoid the ugly treatment of the case with a list of values (where I need do replacement inside the regex first, and later post-process the result to convert a string into a list)
What would be your suggestions about what to use, and how to rewrite the code?
P.S. I skipped some parts of the code to avoid clutter; if I introduced any errors in the process, sorry - the original code does work :)

I started to write up a coaching guide for pyparsing, but looking at your rules, they translate pretty easily into pyparsing elements themselves, without dealing with EBNF, so I just cooked up a quick sample:
from pyparsing import Word, nums, oneOf, Group, OneOrMore, Regex, Optional
integer = Word(nums)
level = "Lv." + integer("Level")
dps = "DPS:" + integer("DPS")
type_ = oneOf("Tank Infantry Artillery")("Type")
traits = Group(OneOrMore(oneOf("All-Terrain Armored Anti-Aircraft Motorized")))("Traits")
flavortext = Regex(r".*\.$")("FlavorText")
rule = (Optional(level) & Optional(dps) & Optional(type_) &
Optional(traits) & Optional(flavortext))
I included the Regex example so you could see how a regular expression could be dropped in to an existing pyparsing grammar. The composition of rule using '&' operators means that the individual items could be found in any order (so the grammar takes care of the iterating over all the rules, instead of you doing it in your own code). Pyparsing uses operator overloading to build up complex parsers from simple ones: '+' for sequence, '|' and '^' for alternatives (first-match or longest-match), and so on.
Here is how the parsed results would look - note that I added results names, just as you used named groups in your regexen:
data = "Lv. 5 Tank DPS: 55 Motorized Armored"
parsed_data = rule.parseString(data)
print parsed_data.dump()
print parsed_data.DPS
print parsed_data.Type
print ' '.join(parsed_data.Traits)
prints:
['Lv.', '5', 'Tank', 'DPS:', '55', ['Motorized', 'Armored']]
- DPS: 55
- Level: 5
- Traits: ['Motorized', 'Armored']
- Type: Tank
55
Tank
Motorized Armored
Please stop by the wiki and see the other examples. You can easy_install to install pyparsing, but if you download the source distribution from SourceForge, there is a lot of additional documentation.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.