I'm trying to execute a bunch of code only if the string I'm searching contains a comma.
Here's an example set of rows that I would need to parse (name is a column header for this tab-delimited file and the column (annoyingly) contains the name, degree, and area of practice):
name
Sam da Man J.D.,CEP
Green Eggs Jr. Ed.M.,CEP
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
My issue is that some of the rows contain a comma, which is followed by an acronym which represents an "area of practice" for the professional and some do not.
My code relies on the principle that each line contains a comma, and I will now have to modify the code in order to account for lines where there is no comma.
import re

def parse_ieca_gc(s):
    ########################## HANDLE NAME ELEMENT ###############################
    degrees = ['M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.', 'M.D.']
    degrees_list = []

    # separate area of practice from name and degree and bind this to var 'area'
    split_area_nmdeg = s['name'].split(',')
    area = split_area_nmdeg.pop()  # when there is no area of practice and hence no comma, this pops out the name + deg and leaves an empty list; that's why 'print split_area_nmdeg' returns nothing and 'area' returns the name and deg when there's no comma
    print 'split area nmdeg'
    print area
    print split_area_nmdeg

    # Split the name and deg by spaces. If there's a deg, it will match one of the elements and will be stored in the deg list. The deg is removed from the name_deg list and all that's left is the name.
    split_name_deg = re.split(r'\s', split_area_nmdeg[0])
    for word in split_name_deg:
        for deg in degrees:
            if deg == word:
                degrees_list.append(split_name_deg.pop())
    name = ' '.join(split_name_deg)

    # area of practice
    category = area
re.search() and re.match() both do not work, it appears, because they return instances and not a boolean, so what should I use to tell if there's a comma?
The easiest way in Python to see if a string contains a character is to use in. For example:
if ',' in s['name']:
Instead of looking for a boolean, test the result against None:
if re.match(...) is not None:
match() returns a MatchObject instance on success and None on failure.
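A quick illustration of both checks (the sample value is taken from the rows above):
import re

name = 'Sam da Man J.D.,CEP'

if ',' in name:                           # plain substring membership test
    print('comma found with in')

if re.search(',', name) is not None:      # regex version: search() returns a match object or None
    print('comma found with re.search')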
You are already searching for a comma. Just use the results of that search:
split_area_nmdeg = s['name'].split(',')
if len(split_area_nmdeg) > 1:    # the split produced more than one piece, so there was a comma
    print "Your old code goes here"
else:
    print "Your new code goes here"
I have a text document and I need to add two # symbols before the keywords present in an array.
Sample text and Array:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
Required Text:
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32"
Just use the replace function
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr = ['name','employee_id','blood_group','age']
for w in arr:
    str = str.replace(w, f'##{w}')
print(str)
You can simply loop over arr and use the str.replace function:
for repl in arr:
    strng = strng.replace(repl, '##' + repl)   # replace returns a new string, so reassign it
print(strng)
However, I urge you to change the variable name str, because it shadows the built-in str type.
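For example, once the name str has been rebound to a string value, the built-in type is no longer reachable through that name:
str = "some text"
str(42)   # TypeError: 'str' object is not callable; str now refers to your string, not the type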
You might use the re module for that task in the following way:
import re
txt = "This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
newtxt = re.sub('('+'|'.join(arr)+')',r'##\1',txt)
print(newtxt)
Output:
This is a sample text document which consists of all demographic information of employee here is the value you may need,##name: George ##employee_id:14296##blood_group:b positive this is the blood group of the employee##age:32
Explanation: here I used a regular expression to catch the words from your list and replace each one with ##word. This is a single pass, as opposed to X passes when using multiple str.replace calls (where X is the length of arr), so it should be more efficient when arr is long.
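One caveat on the snippet above (an aside, not part of the original answer): if the keywords could ever contain regex metacharacters, escaping them keeps the alternation literal:
newtxt = re.sub('(' + '|'.join(re.escape(w) for w in arr) + ')', r'##\1', txt)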
As an alternative, you can convert the lines below into a loop for a lengthier list. Note that this version also inserts a space before the ##.
str= str[:str.find(arr[0])] + ' ##' + str[str.find(arr[0]):]
str= str[:str.find(arr[1])] + ' ##' + str[str.find(arr[1]):]
str= str[:str.find(arr[2])] + ' ##' + str[str.find(arr[2]):]
str= str[:str.find(arr[3])] + ' ##' + str[str.find(arr[3]):]
You can replace each value with a space, ##, and the value, and then replace double spaces with single spaces in the result.
str ="This is a sample text document which consists of all demographic information of employee here is the value you may need,name: George employee_id:14296blood_group:b positive this is the blood group of the employeeage:32"
arr=['name','employee_id','blood_group','age']
for i in arr:
    str = str.replace(i, " ##{}".format(i))
print(str.replace("  ", " "))   # collapse the double spaces introduced by the replacement
Output
This is a sample text document which consists of all demographic information of employee here is the value you may need, ##name: George ##employee_id:14296 ##blood_group:b positive this is the blood group of the employee ##age:32
I am trying to find some email addresses in a source-code and match them with the first name and last name of the person they are associated with.
The first step in my process is to find the first name and the last name of someone. I have a function that does that very well and return a list of full name.
The second step is to find the email address which is the closest to that name (whether it is displayed before the name or after). So I am looking for both: email before and email after.
For that particular purpose I wrote this regex expression:
for name in full_name_list:
    # full name followed by the email
    print(re.findall(name + '.*?([A-z0-9_.-]+?#[A-z0-9_.-]+?\.[A-z]+)', source))
    # email followed by full name
    print(re.findall('([A-z0-9_.-]+?#[A-z0-9_.-]+?\.\w+?.+?' + name + ')', source))
Now here is the deal, assuming that my source looks like this and that my full_name_list = ['John Doe', 'James Henry', 'Jane Doe']:
" John Doe is part of our team and here is his email: johndoe#something.com. James Henry is also part of our team and here his email: jameshenry#something.com. Jane Doe is the team manager and you can contact her at that address: janedoe#something.com"
The first regex returns the name with the closest email after it, which is what I want.
However, the second regex always starts from the first email it finds and stops when it matches the name, which is odd since I am asking it to look for the least amount of characters between the email and the name... (or at least I think I am)
Is my assumption correct? If yes, what's happening? If not, how can I fix that?
The issue is that your pattern has .*? between the email and name patterns, and since the regex engine parses the string from left to right, matching starts at the first email and then runs up to the leftmost occurrence of name, potentially matching across any number of other emails.
You may use
import re
full_name_list=['John Doe', 'James Henry', 'Jane Doe']
source = r" John Doe is part of our team and here is his email: johndoe#something.com. James Henry is also part of our team and here his email: jameshenry#something.com. Jane Doe is the team manager and you can contact her at that address: janedoe#something.com"
for name in full_name_list:
    # full name followed by the email
    name_email = re.search(r'\b' + name + r'\b.*?([\w.-]+#[\w.-]+\.\w+)', source)
    if name_email:
        print( 'Email before "{}" keyword: {}'.format(name, name_email.group(1)) )
    # email followed by full name
    email_name = re.search(r'\b([\w.-]+#[\w.-]+\.\w+)(?:(?![\w.-]+#[\w.-]+\.\w).)*?\b' + name + r'\b', source, re.S)
    if email_name:
        print( 'Email after "{}" keyword: {}'.format(name, email_name.group(1)) )
Output:
Email after "James Henry" keyword: johndoe#something.com
Email after "Jane Doe" keyword: jameshenry#something.com
Notes:
[A-z] matches more than just ASCII letters; you most probably want to use \w instead of [A-Za-z0-9_] (\w also matches any Unicode letters and digits, but you can turn that behavior off by passing the re.ASCII flag to re.compile); see the sketch after these notes
\b is a word boundary, it is advisable to add it at the start and end of the name variable to match names as whole words
(?:(?![\w.-]+#[\w.-]+\.\w).)*? is the fix for your current issue: it matches the closest text between the email and the subsequent name. It matches any character (the .), zero or more times, lazily (*?), as long as that character does not start another email matching [\w.-]+#[\w.-]+\.\w.
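A quick illustration of the [A-z] note (the ASCII characters between Z and a, such as [, ^, _ and the backtick, also match):
import re

print(re.findall(r'[A-z]', 'a_Z[`^'))   # ['a', '_', 'Z', '[', '`', '^'] -- punctuation sneaks in
print(re.findall(r'\w', 'a_Z[`^'))      # ['a', '_', 'Z'] -- word characters only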
First, separate the email from the domain so that the #something.com is in a different column.
Next, it sounds like you're describing an algorithm for fuzzy matching called Levenshtein distance. You can use a module designed for this, or perhaps write a custom one:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate matrix of zeros with the indices of each character of both strings
    for i in range(1, rows):
        for k in range(1,cols):
            distance[i][0] = i
            distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                     distance[row][col-1] + 1,      # Cost of insertions
                                     distance[row-1][col-1] + cost) # Cost of substitutions

    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])
Now you can get a numerical value for how similar they are. You'll still need to establish a cutoff as to what number is acceptable to you.
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
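As a lighter-weight alternative (an aside, not from the tutorial linked below), the standard-library difflib module gives a similar 0-to-1 similarity ratio with no custom code:
import difflib

ratio = difflib.SequenceMatcher(None, 'Apple Inc.'.lower(), 'apple Inc'.lower()).ratio()
print(ratio)   # a float between 0 and 1; higher means more similar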
There are similarity algorithms other than Levenshtein. You might try Jaro-Winkler, or perhaps trigram similarity.
I got this code from: https://www.datacamp.com/community/tutorials/fuzzy-string-python
I have a function that returns me a list of strings.
I need the strings to be concatenated and returned in the form of a single string.
List of strings:
data_hold = ['ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes']
I tried concatenating them as follows -
data_hold[0] + '\n' + data_hold[1]
Actual result:
"ye la AAA TAM tat TE\n0042\n\non the mountain sta\nnding mute Saw hi\nm ply t VIC 3181ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',\n'YOUR USAGE BREAKDOWNAverage cost per day $1.57 kWh Tonnes'\n
Expected result:
'ye la AAA TAM tat TE
0042
on the mountain sta
nding mute Saw hi
m ply t VIC 3181',
'Page 2 of 3
ACCOUNT SUMMARY NEED TO GET IN TOUCH? ',
'YOUR USAGE BREAKDOWN
Average cost per day $1.57 kWh Tonnes'
Your 'expected result' is not a single string. However, running print('\n'.join(data_hold)) will produce the equivalent single string.
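A minimal sketch (the list here is just a stand-in for the data_hold from the question):
data_hold = ['first\nchunk', 'second chunk']   # stand-in for the list in the question
combined = '\n'.join(data_hold)                # one single string; pieces separated by real newlines
print(combined)                                # print() renders the newlines instead of showing \n escapes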
You misunderstand the difference between what the actual value of a string is, what is printed if you print() the string and how Python may represent the string to show you its value on screen.
For example, take a string with the value:
One line.
Another line, with a word in 'quotes'.
So, the string contains a single text, with two lines and some part of the string has the same quotes in it that you would use to mark the beginning and end of the string.
In code, there's various ways you can construct this string:
one_way = '''One line
Another line, with a word in 'quotes'.'''
another_way = 'One line\nAnother line, with a word in \'quotes\'.'
When you run this, you'll find that one_way and another_way contain the exact same string that, when printed, looks just like the example text above.
Python, when you ask it to show you the representation in code, will actually show you the string like it is specified in the code for another_way, except that it prefers to show it using double quotes to avoid having to escape the single quotes:
>>> one_way = '''One line
... Another line, with a word in 'quotes'.'''
>>> one_way
"One line\nAnother line, with a word in 'quotes'."
Compare:
>>> this = '''Some text
... continued here'''
>>> this
'Some text\ncontinued here'
Note how Python decides to use single quotes if there are no single quotes in the string itself. And if both types of quotes are in there, it'll escape like the example code above:
>>> more = '''Some 'text'
... continued "here"'''
>>> more
'Some \'text\'\ncontinued "here"'
But when printed, you get what you'd expect:
>>> print(more)
Some 'text'
continued "here"
I'm running a program which creates product labels based on csv data. The function which I am struggling with takes a data structure which consists of a number combination (width of a wooden plank) and a string (name of the product). Possible combinations I search for are as follows:
5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF
My function needs to take in the data, split the width from the name and return them both as separate variables as follows:
desc = row[1]
if filter.lower() in desc.lower():
    size = re.search(r'(\d{1})(\-*)(\d{0,1})(\/*)(\d{0,2})(\+*)(\d{0,1})(\-*)(\d{0,1})(\/*)(\d{0,2})', desc)
    if size:
        # remove size from description
        desc = re.sub(size.group(), '', desc)
        size = size.group() # extract match from obj
    else:
        size = "None"
The function does as intended with the first two samples; however, when it comes across the last product, it recognizes the size but does not remove it from the description, as I can see in the output when I print size + '\n' + desc.
Is there an issue with my re expression or elsewhere?
Thanks
re.sub() expects its first argument to be a regex. It works for the first two because they don't contain any characters that have special meaning in that context; however, the third contains +, which is special.
There's not actually any reason to use regex there... regular string replacement should work:
desc = desc.replace(size.group(), '')
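If you did want to keep re.sub (an aside, not part of the answer above), escaping the matched text neutralises the + and any other metacharacters; this assumes the re import and the size/desc variables from the question's function:
desc = re.sub(re.escape(size.group()), '', desc)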
Why replace and not simply match what you need?
import re
text = """5 MAPLE PEPPER-ANTIQUE
3-1/4 MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4 MAPLE TIMBERWOLF""".split('\n')
print(text)
for t in text:
    pattern = r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
    m = re.search(pattern, t)
    print(m.group('size'))
    print(m.group('species'))
Output:
5
MAPLE PEPPER-ANTIQUE
3-1/4
MAPLE CUMIN-ANTIQUE
2-1/4+4-1/4
MAPLE TIMBERWOLF
Regex:
r'(?P<size>[0-9-+/]+) *(?P<species>[^0123456789]*)'
2 named groups, between them 0-n spaces.
1st group only 0123456789-+/ allowed
2nd group any but 0123456789 allowed
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently is repr()s? Is there a better way? I was thinking I'd use a loop like (and this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
pat = re.compile('([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just access them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, which is parsed left to right:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless the string contains one or more 's, in which case it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
from geopy import geocoders  # from GeoPy
import feedparser            # from www.feedparser.org

us = geocoders.GeocoderDotUS()

feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))  # join needs strings, so convert the coordinates
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
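Since the end goal is a comma-delimited file, one optional aside (not part of the answer above): the standard csv module handles quoting for you if a band or venue name ever contains a comma. A minimal sketch using the sample rows from the question; the filename is just an example:
import csv

rows = [
    ('randy travis', 'Billy Bobs', '3/21', 1234.5678, 1234.5678),
    ('Michael Schenker Group', 'House of Blues Dallas', '3/26', 4321.8765, 4321.8765),
]

with open('concerts.csv', 'w', newline='') as f:   # Python 3; on Python 2 open with 'wb' and drop newline=''
    writer = csv.writer(f)
    writer.writerow(['band', 'venue', 'date', 'lat', 'long'])
    writer.writerows(rows)   # the writer quotes any field that contains a comma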