Matching partial cases to Python dictionary - python

At first glance I thought this to be a simple issue yet I cannot find an answer that accurately fits...
I have a dictionary of state names and abbreviations like so;
{(' ak', ',ak', ', ak', 'juneau', ',alaska', ', alaska'): 'alaska',
(' al', ',al', ', al', 'montgomery', ',alabama', ', alabama'): 'alabama',
(' ar', ',ar', ', ar', 'little rock', ',arkansas', ', arkansas'): 'arkansas',
(' az', ',az', ', az', 'phoenix', ',arizona', ', arizona'): 'arizona',
I am attempting to map this dictionary over various cases of self reported Twitter location that I have in a pandas dataframe to look for partial matches. For example, if one case read 'anchorage,ak' it would change the value to Alaska. I could see this being quite simple if it were a list, yet there must be another way to do this without looping. Any help is greatly appreciated!

I think timgeb has the right idea above. I would add two things:
1) You can also remove all whitespace from the given case before processing-- thus, there will be no need to include ' ak', ',ak', and ', ak' all as keys-- a simple 'ak' key would suffice.
2) Instead of repeating the state values in the dictionary, I would create an extra hash from integers to states i.e. {0: 'alaska, 1: 'alabama' ...} and store the corresponding integer key in your original dictionary.
Thus your resulting dictionary should look something like this:
A = {'ak': 0, 'juneau': 0, 'alaska': 0, 'al': 1, 'montgomery': 1, 'alabama': 1, ...}
And to access state names from integer values, you should have another dictionary like this for all 50 states:
B = {0: 'alaska', 1: 'alabama', ...}
so given a case...
case = 'anchorage,ak'
case_list = case.replace(' ', '').split(',') # remove all whitespace and split case by comma
for elem in case_list:
if elem in A:
# insert code to replace case with B[A[elem]]
break

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of a list in a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there are more than 1 pattern "region". For example, there can be "South reg." or "South r-n".
How can I combine a multiple patterns?
And digit 4 in list means building number. There can be onle didts, or smth like 4k1.
How can I extract building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2,column):
return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2,column):
return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1

how to use "in" condition with many text define in python

i want to get correct result from my condition, here is my condition
this is my database
and here is my code :
my define text
#define
country = ('america','indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')
c1="Country"
c2="City"
c3="<blank>"
and condition ("text" here is passing from select database, ofc using looping - for)
if str(text) in str(country) :
stat=c1
elif str(text) in str(city) :
stat=c2
else :
stat=c3
and i got wrong result for the condition, like this
any solution to make this code work ? it work when just contain 1 text when using "in", but this case define so many text condition.
If i understood you correctly you need.
text = "i was born in paris"
country = ('america','indonesia', 'england', 'france')
city = ('new york', 'jakarta', 'london', 'paris')
def check(text):
for i in country:
if i in text.lower():
return "Country"
for i in city:
if i in text.lower():
return "City"
return "<blank>"
print(check(text))
print(check("I dnt like vacation in america"))
Output:
City
Country
You could be better off using dictionaries. I assume that text is a list:
dict1 = {
"countries" : ['america','indonesia', 'england', 'france'],
"city" : ['new york', 'jakarta', 'london', 'paris']
}
for x in text:
for y in dict1['countries']:
if y in x:
print 'country: ' + x
for z in dict1['city']:
if z in x:
print 'city: ' + x
First of all, check what you are testing.
>>> country = ('america','indonesia', 'england', 'france')
>>> city = ('new york', 'jakarta', 'london', 'paris')
>>>
>>> c1="Country"
>>> c2="City"
>>> c3="<blank>"
Same as your setup. So, you are testing for the presence of a substring.
>>> str(country)
"('america', 'indonesia', 'england', 'france')"
Let's see if we can find a country.
>>> 'america' in str(country)
True
Yes! Unfortunately a simple string test such as the one above, besides involving an unnecessary conversion of the list to a string, also finds things that aren't countries.
>>> "ca', 'in" in str(country)
True
The in test for strings is true if the string to the right contains the substring on the left. The in test for lists is different, however, and is true when the tested list contains the value on the left as an element.
>>> 'america' in country
True
Nice! Have got got rid of the "weird other matches" bug?
>>> "ca', 'in" in country
False
It would appear so. However, using the list inclusion test you need to check every word in the input string rather than the whole string.
>>> "I don't like to vacation in america" in country
False
The above is similar to what you are doing now, but testing list elements rather than the list as a string. This expression generates a list of words in the input.
>>> [word for word in "I don't like to vacation in america".split()]
['I', "don't", 'like', 'to', 'vacation', 'in', 'america']
Note that you may have to be more careful than I have been in splitting the input. In the example above, "america, steve" when split would give ['america,', 'steve'] and neither word would match.
The any function iterates over a sequence of expressions, returning True at the first true member of the sequence (and False if no such element is found). (Here I use a generator expression instead of a list, but the same iterable sequence is generated).
>>> any(word in country for word in "I don't like to vacation in america".split())
True
For extra marks (and this is left as an exercise for the reader) you could write
a function that takes two arguments, a sentence and a list of possible matches,
and returns True if any of the words in the sentence are present in the list. Then you could use two different calls to that function to handle the countries and the
cities.
You could speed things up somewhat by using sets rather than lists, but the principles are the same.

Regex - Splitting Strings at full-stops unless it's part of an honorific [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I have a list containing all possible titles:
['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']
I need a Python 2.7 code that can replace all full-stops \. with newline \n unless it's one of the above titles.
Splitting it into a list of strings would be fine as well.
Sample Input:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India. The bill is set to pass.
Sample Output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
This should do the trick, here we use a list comprehension with a conditional statement to concatenate the words with a \n if they contain a full-stop, and are not in the list of key words. Otherwise just concatenate a space.
Finally the words in the sentence are joined using join(), and we use rstrip() to eliminate any newline remaining at the end of the string.
l = set(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.',
'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.',
'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.'] )
s = 'Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road
map for introduction of GST in India. The bill is set to pass.'
def split_at_period(input_str, keywords):
final = []
split_l = input_str.split(' ')
for word in split_l:
if '.' in word and word not in keywords:
final.append(word + '\n')
continue
final.append(word + ' ')
return ''.join(final).rstrip()
print split_at_period(s, l)
or a one liner :D
print ''.join([w + '\n' if '.' in w and w not in l else w + ' ' for w in s.split(' ')]).rstrip()
Sample output:
Modi is waiting in line to Thank Dr. Manmohan Singh for preparing a road map for introduction of GST in India.
The bill is set to pass.
How it works?
Firstly we split up our string with a space ' ' delimiter using the split() string function, thus returning the following list:
>>> ['Modi', 'is', 'waiting', 'in', 'line', 'to', 'Thank', 'Dr.',
'Manmohan', 'Singh', 'for', 'preparing', 'a', 'road', 'map', 'for',
'introduction', 'of', 'GST', 'in', 'India.', 'The', 'bill', 'is',
'set', 'to', 'pass.']
We then start to build up a new list by iterating through the split-up list. If we see a word that contains a period, but is not a keyword, (Ex: India. and pass. in this case) then we have to concatenate a newline \n to the word to begin the new sentence. We can then append() to our final list, and continue out of the current iteration.
If the word does not end off a sentence with a period, we can just concatenate a space to rebuild the original string.
This is what final looks like before it is built as a string using join().
>>> ['Modi ', 'is ', 'waiting ', 'in ', 'line ', 'to ', 'Thank ', 'Dr.
', 'Manmohan ', 'Singh ', 'for ', 'preparing ', 'a ', 'road ', 'map ',
'for ', 'introduction ', 'of ', 'GST ', 'in ', 'India.\n', 'The ', 'bill ',
'is ', 'set ', 'to ', 'pass.\n']
Excellent, we have spaces, and newlines where they need to be! Now, we can rebuild the string. Notice however, that the the last element in the list also happens to contain a \n, we can clean that up with calling rstrip() on our new string.
The initial solution did not support spaces in the keywords, I've included a new more robust solution below:
import re
def format_string(input_string, keywords):
regexes = '|'.join(keywords) # Combine all keywords into a regex.
split_list = re.split(regexes, input_string) # Split on keys.
removed = re.findall(regexes, input_string) # Find removed keys.
newly_joined = split_list + removed # Interleave removed and split.
newly_joined[::2] = split_list
newly_joined[1::2] = removed
space_regex = '\.\s*'
for index, section in enumerate(newly_joined):
if '.' in section and section not in removed:
newly_joined[index] = re.sub(space_regex, '.\n', section)
return ''.join(newly_joined).strip()
convert all titles (and sole dot) into a regular expression
use a replacement callback
code:
import re
l = "|".join(map(re.escape,['.','Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'Rev.', 'Capt.', 'Lt.-Col.', 'Col.', 'Lt.-Cmdr.', 'The Hon.', 'Cmdr.', 'Flt. Lt.', 'Brgdr.', 'Wng. Cmdr.', 'Group Capt.' ,'Rt.', 'Maj.-Gen.', 'Rear Admrl.', 'Esq.', 'Mx', 'Adv', 'Jr.']))
e="Dear Mr. Foo, I would like to thank you. Because Lt.-Col. Collins told me blah blah. Bye."
def do_repl(m):
s = m.group(1)
if s==".":
rval=".\n"
else:
rval = s
return rval
z = re.sub("("+l+")",do_repl,e)
# bonus: leading blanks should be stripped even that's not the question
z= re.sub(r"\s*\n\s*","\n",z,re.DOTALL)
print(z)
output:
Dear Mr. Foo, I would like to thank you.
Because Lt.-Col. Collins told me blah blah.
Bye.

Regex: pairing up numbers and '#'-preceded comments in a list of strings

So I have some lines of text that are stored in a list as follows:
lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
I want to create:
[{'1.9': 'comment'},
{'1.11*': ''},
{'1.5': 'another comment'},
{'1.23': ''},
{'3.10.3*': 'commennnnnt'},
{'1.2': ''} ]
In other words, I want to take the list apart and pair each decimal number with either the comment (starting with '#'; we can assume that no numbers occur in it) that appears right after it on the same line, or with an empty string if there is no comment (e.g., the next thing after it is another number).
Specifically, a 'decimal number' can be a single digit, followed by a dot and then either one or two digits, optionally followed by a dot and one or two more digits. A '*' may appear at the very end. So like this(?): r'\d\.\d{1,2}(\.\d{1,2})?\*?')
I've tried a few things with re.split() to get started. For example, splitting the first list item on either the crazy decimal regex or #, before worrying about the dict pairings:
>>> crazy=r'\d\.\d{1,2}(\.\d{1,2})?\*?'
>>> re.split(r'({0})|#'.format(crazy), results[0])
Result:
[u'',
u'1.9',
None,
u' ',
None,
None,
u'comment ',
u'1.11',
None,
u' ',
u'1.5',
None,
u' ',
None,
None,
u' test comment']
This looks like something I can filter and work with, but is there a better way? (also, wow...it seems the parentheses in my crazy regex allow me to keep the decimal number delimiters as desired!)
The following seems to work:
lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
entries = re.findall(r'([0-9.]+\*?)\s+((?:[\# ])?[a-zA-Z ]*)', " ".join(lines))
ldict = [{k: v.strip(" #")} for k,v in entries]
print ldict
This displays:
[{'1.9': 'comment'}, {'1.11*': ''}, {'1.5': 'another comment'}, {'1.23': ''}, {'3.10.3*': 'commennnnnt'}, {'1.2': ''}]

Iterating over multiple lists in Python

I have a list within a list, and I am trying to iterate through one list, and then in the inner list I want to search for a value, and if this value is present, place that list in a variable.
Here's what I have, which doesn't seem to be doing the job:
for z, g in range(len(tablerows), len(andrewlist)):
tablerowslist = tablerows[z]
if "Andrew Alexander" in tablerowslist:
andrewlist[g] = tablerowslist
Any ideas?
This is the list structure:
[['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Stocks</a>', ' ', 'Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Swing Trade Software</a>', ' ', 'FUP from dropbox message. Affiliate blog'], ['Kyle Bazzy', 'FUP dropbox message', '8/18/2011', 'Start Day Trading (Blog)</a>', ' ', 'FUP from dropbox message'], ['Kyle Bazzy', 'Call, be VERY NICE', '8/18/2011', ' ', 'r24867</a>', 'We have been very nice to him, but he wants to cancel, we need to keep being nice and seeing what is wrong now.'], ['Jason Raznick', 'Reach out', '8/18/2011', 'Lexis Nexis</a>', ' ', '-'], ['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-'], ['Aaron Wise', 'FUP with contract', '8/18/2011', 'YouTradeFX</a>', ' ', "Zisa is on vacation...FUP next week and then try again if she's still gone."], ['Aaron Wise', 'Email--JASON', '8/18/2011', 'Lexis Nexis</a>', ' ', 'email by today'], ['Sarah Knapp', '3rd FUP', '8/18/2011', 'Steven L. Pomeranz</a>', ' ', '-'], ['Sarah Knapp', 'Are we really interested in partnering?', '8/18/2011', 'Reverse Spins</a>', ' ', "V. political, doesn't seem like high quality content. Do we really want a partnership?"], ['Sarah Knapp', '2nd follow up', '8/18/2011', 'Business World</a>', ' ', '-'], ['Sarah Knapp', 'Determine whether we are actually interested in partnership', '8/18/2011', 'Fayrouz In Dallas</a>', ' ', "Hasn't updated since September 2010."], ['Sarah Knapp', 'See email exchange w/Autumn; what should happen', '8/18/2011', 'Graham and Doddsville</a>', ' ', "Wasn't sure if we could partner bc of regulations, but could do something meant simply to increase traffic both ways."], ['Sarah Knapp', '3rd follow up', '8/18/2011', 'Fund Action</a>', ' ', '-']]
For any value that has a particular value in it, say, Andrew Alexander, I want to make a separate list of these.
For example:
[['Andrew Alexander', 'Check on account in one week', '8/18/2011', ' ', 'r46876</a>', '-'], ['Andrew Alexander', 'Cancel him from 5 dollar feed', '8/18/2011', ' ', 'r37693</a>', '-']]
Assuming you have a list whose elements are lists, this is what I'd do:
andrewlist = [row for row in tablerows if "Andrew Alexander" in row]
>>> #I have a list within a list,
>>> lol = [[1, 2, 42, 3], [4, 5, 6], [7, 42, 8]]
>>> found = []
>>> #iterate through one list,
>>> for i in lol:
... #in the inner list I want to search for a value
... if 42 in i:
... #if this value is present, place that list in a variable
... found.append(i)
...
>>> found
[[1, 2, 42, 3], [7, 42, 8]]
for z, g in range(len(tablerows), len(andrewlist)):
This means "make a list of the numbers which are between the length of tablerows and the length of andrewlist, and then look at each of those numbers in turn, and treat those numbers as a list of two values, and assign the two values to z and g each time through the loop".
A number cannot be treated as a list of two values, so this fails.
You need to be much, much clearer about what you are doing. Show an example of the contents of tablerows before the loop, and the contents of andrewlist before the loop, and what it should look like afterwards. Your description is muddled: I can only guess that when you say "and then I want to iterate through one list" you mean one of the lists in your list-of-lists; but I can't tell whether you want one specific one, or each one in turn. And then when you next say "and then in the inner list I want to...", I have no idea what you're referring to.

Categories

Resources