I got a string of such format:
"Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
so basicly it's list of actor's names (optionally followed by their role in parenthesis). The role itself can contain comma (actor's name can not, I strongly hope so).
My goal is to split this string into a list of pairs - (actor name, actor role).
One obvious solution would be to go through each character, check for occurances of '(', ')' and ',' and split it whenever a comma outside occures. But this seems a bit heavy...
I was thinking about spliting it using a regexp: first split the string by parenthesis:
import re
x = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
s = re.split(r'[()]', x)
# ['Wilbur Smith ', 'Billy, son of John', ', Eddie Murphy ', 'John', ', Elvis Presley, Jane Doe ', 'Jane Doe', '']
The odd elements here are actor names, even are the roles. Then I could split the names by commas and somehow extract the name-role pairs. But this seems even worse then my 1st approach.
Are there any easier / nicer ways to do this, either with a single regexp or a nice piece of code?
One way to do it is to use findall with a regex that greedily matches things that can go between separators. eg:
>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
The regex above matches one or more:
non-comma, non-open-paren characters
strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren
One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.
Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:
"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"
If you need to deal with nesting your best bet would be to partition the string into parens, commas, and everthing else (essentially tokenizing it -- this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (this keeping track of the nesting level is what regexes are incapable of doing on their own).
s = re.split(r',\s*(?=[^)]*(?:\(|$))', x)
The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.
I think the best way to approach this would be to use python's built-in csv module.
Because the csv module only allows a one character quotechar, you would need to do a replace on your inputs to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.
An attempt on human-readable regex:
import re
regex = re.compile(r"""
# name starts and ends on word boundary
# no '(' or commas in the name
(?P<name>\b[^(,]+\b)
\s*
# everything inside parentheses is a role
(?:\(
(?P<role>[^)]+)
\))? # role is optional
""", re.VERBOSE)
s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
"Jane Doe (Jane Doe)")
print re.findall(regex, s)
Output:
[('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'),
('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]
My answer will not use regex.
I think simple character scanner with state "in_actor_name" should work. Remember then state "in_actor_name" is terminated either by ')' or by comma in this state.
My try:
s = 'Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)'
in_actor_name = 1
role = ''
name = ''
for c in s:
if c == ')' or (c == ',' and in_actor_name):
in_actor_name = 1
name = name.strip()
if name:
print "%s: %s" % (name, role)
name = ''
role = ''
elif c == '(':
in_actor_name = 0
else:
if in_actor_name:
name += c
else:
role += c
if name:
print "%s: %s" % (name, role)
Output:
Wilbur Smith: Billy, son of John
Eddie Murphy: John
Elvis Presley:
Jane Doe: Jane Doe
Here's a general technique I've used in the past for such cases:
Use the sub function of the re module with a function as replacement argument. The function keeps track of opening and closing parens, brackets and braces, as well as single and double quotes, and performs a replacement only outside of such bracketed and quoted substrings. You can then replace the non-bracketed/quoted commas with another character which you're sure doesn't appear in the string (I use the ASCII/Unicode group-separator: chr(29) code), then do a simple string.split on that character. Here's the code:
import re
def srchrepl(srch, repl, string):
"""Replace non-bracketed/quoted occurrences of srch with repl in string"""
resrchrepl = re.compile(r"""(?P<lbrkt>[([{])|(?P<quote>['"])|(?P<sep>["""
+ srch + """])|(?P<rbrkt>[)\]}])""")
return resrchrepl.sub(_subfact(repl), string)
def _subfact(repl):
"""Replacement function factory for regex sub method in srchrepl."""
level = 0
qtflags = 0
def subf(mo):
nonlocal level, qtflags
sepfound = mo.group('sep')
if sepfound:
if level == 0 and qtflags == 0:
return repl
else:
return mo.group(0)
elif mo.group('lbrkt'):
level += 1
return mo.group(0)
elif mo.group('quote') == "'":
qtflags ^= 1 # toggle bit 1
return "'"
elif mo.group('quote') == '"':
qtflags ^= 2 # toggle bit 2
return '"'
elif mo.group('rbrkt'):
level -= 1
return mo.group(0)
return subf
If you don't have nonlocal in your version of Python, just change it to global and define level and qtflags at the module level.
Here's how it's used:
>>> GRPSEP = chr(29)
>>> string = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> lst = srchrepl(',', GRPSEP, string).split(GRPSEP)
>>> lst
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']
This post helped me a lot. I was looking to split a string by commas positioned outside quotes. I used this as a starter. My final line of code was regEx = re.compile(r'(?:[^,"]|"[^"]*")+') This did the trick. Thanks a ton.
I certainly agree with #Wogan above, that using the CSV moudle is a good approach. Having said that if you still want to try a regex solution give this a try, but you will have to adapt it to Python dialect
string.split(/,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))/)
HTH
split by ")"
>>> s="Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> s.split(")")
['Wilbur Smith (Billy, son of John', ', Eddie Murphy (John', ', Elvis Presley, Jane Doe (Jane Doe', '']
>>> for i in s.split(")"):
... print i.split("(")
...
['Wilbur Smith ', 'Billy, son of John']
[', Eddie Murphy ', 'John']
[', Elvis Presley, Jane Doe ', 'Jane Doe']
['']
you can do further checking to get those names that doesn't come with ().
None of the answers above are correct if there are any errors or noise in your data.
It's easy to come up with a good solution if you know the data is right every time. But what happens if there are formatting errors? What do you want to have happen?
Suppose there are nesting parentheses? Suppose there are unmatched parentheses? Suppose the string ends with or begins with a comma, or has two in a row?
All of the above solutions will produce more or less garbage and not report it to you.
Were it up to me, I'd start with a pretty strict restriction on what "correct" data was - no nesting parentheses, no unmatched parentheses, and no empty segments before, between or after comments - validate as I went, and then raise an exception if I wasn't able to validate.
Related
I was practicing with this tiny program with the hopes to capitalize the first letter of each word in: john Smith.
I wanted to capitalize the j in john so I would have an end result of John Smith and this is the code I used:
name = "john Smith"
if (name[0].islower()):
name = name.capitalize()
print(name)
Though, capitalizing the first letter caused an output of: John smith where the S was converted to a lowercase. How can I capitalize the letter j without messing with the rest of the name?
I thank you all for your time and future responses!
I appreciate it very much!!!
As #j1-lee pointed out, what you are looking for is the title method, which will capitalize each word (as opposed to capitalize, which will capitalize only the first word, as if it was a sentence).
So your code becomes
name = "john smith"
name = name.title()
print(name) #> John Smith
Of course you should be using str.title(). However, if you want to reinvent that functionality then you could do this:
name = 'john paul smith'
r = ' '.join(w[0].upper()+w[1:].lower() for w in name.split())
print(r)
Output:
John Paul Smith
Note:
This is not strictly equivalent to str.title() as it assumes all whitespace in the original string is replaced with a single space
I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
'Howard, Rob',
'Smith-Howard, Emily R',
'McDonald, Jim T',
'McCormick, Erica']})
I am currently using the following code, which works for all names except for McCormick, Erica. I first use regex to identify all capital letters. Then any rows with 3 or more capital letters, I remove [:-1] from the string (in an attempt to remove the middle initial and extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
This outputs the following:
As you can see, this properly removes the middle initial for all names except for McCormick, Erica. Reason being she has 3 capital letters but no middle initial, which incorrectly removes the 'a' in Erica.
You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
See the regex demo. Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
Another solution(not so pretty) would be to split then take 2 elements then join again
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object
I would use something like that :
def removeMaj(string):
tab=string.split(',')
tab[1]=lower(tab[1])
string=",".join(tab)
return(string)
I have a string with first and last names all separated with a space.
For example:
installers = "Joe Bloggs John Murphy Peter Smith"
I now need to replace every second space with ', ' (comma followed by a space) and output this as string.
The desired output is
print installers
Joe Bloggs, John Murphy, Peter Smith
You should be a able to do this with a regex that that finds the spaces and replaces the last one:
import re
installers = "Joe Bloggs John Murphy Peter Smith"
re.sub(r'(\s\S*?)\s', r'\1, ',installers)
# 'Joe Bloggs, John Murphy, Peter Smith'
This says, find a space followed by some non-spaces followed by a space and replace it with the found space followed by some non-spaces and ", ". You could add installers.strip() if there's a possibility of trailing spaces on the string.
One way to do this is to split the string into a space-separated list of names, get an iterator for the list, then loop over the iterator in a for loop, collecting the first name and then advancing to loop iterator to get the second name too.
names = installers.split()
it = iter(names)
out = []
for name in it:
next_name = next(it)
full_name = '{} {}'.format(name, next_name)
out.append(full_name)
fixed = ', '.join(out)
print fixed
'Joe Bloggs, John Murphy, Peter Smith'
The one line version of this would be
>>> ', '.join(' '.join(s) for s in zip(*[iter(installers.split())]*2))
'Joe Bloggs, John Murphy, Peter Smith'
this works by creating a list that contains the same iterator twice, so the zip function returns both parts of the name. See also the grouper recipe from the itertools recipes.
I would like reconstruct full names from photo captions using Regex in Python, by appending last name back to the first name in patterns "FirstName1 and FirstName2 LastName". We can rely on names starting with capital letter.
For example,
'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'
'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'
I would need to avoid matching patterns like this: 'Jay Smith and Albert Diamond' and generate a non-existent name 'Smith Diamond'
The photo captions may or may not have more words before this pattern, for example, 'It was a great day hanging out with John and Stephen Diamond.'
This is the code I have so far:
s = 'John and Albert McDonald'
so = re.search('([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
print so.group(1) + ' ' + so.group(2).split()[1]
print so.group(2)
This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' will result in a non-existent name 'Smith Diamond'.
An idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
Could you please let me know how I can correct my regex epression? Or is there a better way to do what I want? Thanks!
As you can rely on names starting with a capital letter, then you could do something like:
((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)
Live preview
Swapping out your current pattern, with this pattern should work with your current Python code. You do need to strip() the captured results though.
Which for your examples and current code would yield:
Input
First print
Second print
John and Albert McDonald
John McDonald
Albert McDonald
Stephen Stewart, John and Albert Diamond
John Diamond
Albert Diamond
It was a great day hanging out with John and Stephen Diamond.
John Diamond
Stephen Diamond
if one particular word does not end with another particular word, leave it. here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
i want to print and count all words between john and dead or death or died.
if john does not end with any of the died or dead or death words. leave it. start again with john word.
my code :
x = re.sub(r'[^\w]', ' ', x) # removed all dots, commas, special symbols
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
print i
print len([word for word in i.split()])
my output:
got shot
2
with his john got killed or
6
with his wife
3
output which i want:
got shot
2
got killed or
3
with his wife
3
i don't know where i am doing mistake.
it is just a sample input. i have to check with 20,000 inputs at a time.
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
... print i.strip()
... print len([word for word in i.split()])
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*? this regex is using (?:(?!john).)*? which will lazily match 0 or more of any characters only when john is not present in this match.
I also suggest using word boundaries to make it match complete words:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
Code Demo
I assume, you want to start over, when there is another john following in your string before dead|died|death occur.
Then, you can split your string by the word john and start matching on the resulting parts afterwards:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
x = re.sub('\W+', ' ', re.sub('[^\w ]', '', x)).strip()
for e in x.split('john'):
m = re.match('(.+?)(dead|died|death)', e)
if m:
print(m.group(1))
print(len(m.group(1).split()))
yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
I.e., there are no multiple whitespaces left in a sequence. You manage this by splitting by a whitespace later, but I feel this is a bit cleaner.