Regex Text Cleaning on Multiple forms of text formats

I have a dataframe with multiple forms of names:
JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R
For the first three, the rule is to take the last word. For the last three, or anything with a comma, the last name comes before the comma. However, for a name like ABDOU TCHOUSNOU, I only want the last word before the comma, which is TCHOUSNOU.
The expected output is
JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI
The first list holds the names, and the second is what I want to return. This is my current code:
str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)
Is there any way to solve this? The current code returns only the first name instead of the surname, which is what I want.

Something like this would work:
(.+(?=,)|\S+$)
( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - get everything which is not a whitespace before the end of the line
) - end capture group #1
https://regex101.com/r/myvyS0/1
Python:
str.extract(r'(.+(?=,)|\S+$)', expand=False)

You may use this regex to extract:
>>> print (df)
name
0 JOSEPH W. JASON
1 Ralph Landau
2 RAYMOND C ADAMS
3 ABD, SAMIR
4 ABDOU TCHOUSNOU, BOUBACAR
5 ABDL-ALI, OMAR R
>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0 JASON
1 Landau
2 ADAMS
3 ABD
4 ABDOU TCHOUSNOU
5 ABDL-ALI
RegEx Details:
(: Start capture group
[^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
|: OR
\w+: Match 1+ word characters
(?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
(?=$): Lookahead to assert that we are at the end of the line
): End capture group
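
Both answers above return ABDOU TCHOUSNOU for the fifth name, while the question asks for TCHOUSNOU only. A minimal sketch of one way to get that, assuming the surname is always the last run of word characters or hyphens directly before the comma (or before the end of the string when there is no comma). The names SURNAME_RE and surname are made up for this illustration, and the upper-casing is added only to match the expected output:

```python
import re

# Last run of word characters/hyphens that sits right before a comma or,
# when the name has no comma, right before the end of the string.
SURNAME_RE = re.compile(r"[\w-]+(?=,|$)")

def surname(name):
    """Return the upper-cased surname, or None if nothing matches."""
    match = SURNAME_RE.search(name)
    return match.group(0).upper() if match else None

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]
surnames = [surname(n) for n in names]
print(surnames)
# ['JASON', 'LANDAU', 'ADAMS', 'ABD', 'TCHOUSNOU', 'ABDL-ALI']
```

In pandas the same pattern should work wrapped in a capture group, for example df['name'].str.extract(r'([\w-]+)(?=,|$)', expand=False).str.upper().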

Related

Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was wondering whether plain Python string stripping or a regular-expression substitution would perform better for this.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
The regular expression I have so far is:
/^([^.]+[.]+[^.]+)[.]/g
That only matches the '<Person 1234567 ' part; I'm not sure how to also remove the '>' at the end.

Multiple ways, but you can use str.replace():
import pandas as pd

df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
# Remove the leading '<Person <digits> ' and the trailing '>'
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>)
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.

To keep things simple, you can first pull out every run of characters that is not a digit or an angle bracket:
import re
string = '<Person 1234567 Tom Brady>'
res = re.findall(r"[^<>0-9]+", string)
res[1].strip()
findall returns the list ['Person ', ' Tom Brady'], so the name of the person is the second element, res[1].
The pieces keep their surrounding spaces, which strip() removes.
You can read more about re.findall() in the documentation.

Python's re module has a function called search that finds the first match of a pattern in a string. With the examples given, you can use it to extract the names:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
# 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd

def extract_name(row):
    # Replace the raw value with the first run of capitalized words
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

df = pd.DataFrame({"Person": ["<Person 1234567 Tom Brady>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "<Person 92145 John Smith>"]})
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
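
One thing to watch with re.search: it returns None when nothing matches, so calling .group(0) directly raises an AttributeError on such rows. A small sketch of a guarded helper, using only the standard library. The third sample row and the default parameter are hypothetical additions for illustration:

```python
import re

# Same idea as the pattern above: two or more capitalized words in a row.
NAME_RE = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def extract_name(value, default=None):
    """Return the first capitalized multi-word name in value, else default."""
    match = NAME_RE.search(value)
    return match.group(0) if match else default

samples = [
    "<Person 1234567 Tom Brady>",
    "<Person 456789012 Mary Ann Thomas>",
    "<Person 92145>",  # hypothetical row without a name
]
names = [extract_name(s) for s in samples]
print(names)
# ['Tom Brady', 'Mary Ann Thomas', None]
```

On a DataFrame this helper could be applied per value with df["Person"].map(extract_name) instead of a row-wise apply.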

Python remove middle initial from the end of a name string

I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
'Howard, Rob',
'Smith-Howard, Emily R',
'McDonald, Jim T',
'McCormick, Erica']})
I am currently using the following code, which works for all names except for McCormick, Erica. I first use regex to identify all capital letters. Then, for any row with 3 or more capital letters, I slice the string with [:-1] (in an attempt to remove the middle initial and extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
In the output, the middle initial is removed correctly for every name except McCormick, Erica. Her name has 3 capital letters but no middle initial, so the code wrongly removes the 'a' in Erica.

You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
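
The same pattern can be tried outside pandas with re.sub. This is only a sketch: the optional \.? is an added assumption (not part of the answer) to also cover initials written with a trailing period, and the "Emily R." entry is altered from the question's data to show that case:

```python
import re

# \s+[A-Z]$ is the pattern from the answer; \.? additionally allows "R."
INITIAL_RE = re.compile(r"\s+[A-Z]\.?$")

names = [
    "Smith, Jake K",
    "Howard, Rob",
    "Smith-Howard, Emily R.",  # trailing period added for illustration
    "McDonald, Jim T",
    "McCormick, Erica",
]
cleaned = [INITIAL_RE.sub("", name) for name in names]
print(cleaned)
# ['Smith, Jake', 'Howard, Rob', 'Smith-Howard, Emily', 'McDonald, Jim', 'McCormick, Erica']
```

Because the pattern is anchored at the end of the string and requires whitespace before the capital letter, "McCormick, Erica" is left alone no matter how many capitals it contains.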

Another solution (not so pretty) would be to split on whitespace, take the first 2 elements, then join them again:
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object

I would use something like this:
def removeMaj(string):
    # Lowercase everything after the comma (the given-name part)
    tab = string.split(',')
    tab[1] = tab[1].lower()
    string = ",".join(tab)
    return string

How should I construct a regex match for a various strings within repeated delimiters?

I have a string formatted as:
GENESIS 1:1 In the beginning God created the heavens ...
the ground. 2:7 And the LORD ...
I buried Leah. 49:32 The purchase of the field and of the cave ...
and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...
Using only one regular expression, I want to match as groups:
the book names
the chapter numbers (as above 1, 2, 49, 1)
the verse numbers (as above 1, 7, 32, 1)
the verses themselves
Take the first as example:
(GENESIS)g1 (1)g2:(1)g3 (In the beginning God created the heavens ...)g4
This requires that I individually match everything within number-pair colons, while retaining my other groups, and with the limitation of fixed length lookaheads / lookbehinds. That last part specifically is what is proving difficult.
My expression up to now is (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), where BOOK1 and BOOK2 change as they iterate through a predetermined list. $ appears because the very last book will not have a BOOK2 after it. I call re.finditer() on this expression over the whole string and then I iterate through the match object to produce my groups.
The functional part of my expression is currently (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$), but by itself this in effect treats GENESIS as BOOK1 always, and matches everything from just after ^ to whatever BOOK2 may be.
Alternatively, keeping my full expression (%(BOOK1)s) (\d+):(\d+)\s?(.+?)\s?(?=\d|%(BOOK2)s|$) as is will only return the very first desired match.
I get the sense that some of my greedy / non-greedy terms are malformed, or that I could better use leading / trailing expressions. Any feedback will be highly appreciated.

One option could be to use the regex module from PyPI, which supports the \G anchor.
Capture group 1 contains the name of the book; the chapter number, the verse number and the verse text that follows are in groups 2, 3 and 4.
When looping over the results, you can check which groups are present.
\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)
Explanation
\b A word boundary
(?: Non capture group
([A-Z]{2,})(?= \d+:\d) Capture group 1, match 2 or more uppercase chars and assert what is directly at the right is a space, 1+ digits : and a digit
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close group
(?: Non capture group
(\d+):(\d+) Capture 1 or more digits in group 2 and group 3
)?\s* Close group and make it optional and match optional whitespace chars
( Capture group 4
(?: Non capture group
[^\dA-Z]+ Match 1+ times any char except a digit or A-Z
| Or
\d++(?!:\d) Match 1+ digits in a possessive way and assert what is at the right is not : followed by a digit
| Or
[A-Z](?![A-Z]+ \d+:\d) Match a char A-Z and assert what is directly at the right is not 1+ chars A-Z, space, 1+ digits : and a digit
)* Close group and repeat 0+ times
) Close group 4
For example
import regex

pattern = r"\b(?:([A-Z]{2,})(?= \d+:\d)|\G(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+|\d++(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))*)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = regex.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
    if match.group(1):
        print(f"Book name: {match.group(1)}")
        print("------------------------------")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")
Output
Book name: GENESIS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground.
Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah.
Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt.
Book name: EXODUS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...

I came up with a solution in pure python with re. Thanks to the above response, I was able to get on the right track. Turns out that wrench I was trying to throw in by testing LORD 2:8 ... wasn't actually an issue since non-title capitals never occur before digits that way without punctuation between [A-Z] and \d in the full string.
Using the same example with the derived pattern:
import re

pattern = r"(?:([A-Z]{2,})(?= \d+:\d)|(?!^))(?:(\d+):(\d+))?\s*((?:[^\dA-Z]+(?!:\d)|[A-Z](?![A-Z]+ \d+:\d))+)"
s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. 2:7 And the LORD ... I buried Leah. 49:32 The purchase of the field and of the cave ... and he was put in a coffin in Egypt. EXODUS 1:1 Now these are the names ...\n")
matches = re.finditer(pattern, s)
for matchNum, match in enumerate(matches, start=1):
    if match.group(1):
        print(f"Book name: {match.group(1)}")
        print("------------------------------")
    else:
        print(f"Chapter Nr: {match.group(2)}\nVerse Nr: {match.group(3)}\nThe verse: {match.group(4)}\n")
As with regex, the Output is:
Book name: GENESIS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: In the beginning God created the heavens ... the ground.
Chapter Nr: 2
Verse Nr: 7
The verse: And the LORD ... I buried Leah.
Chapter Nr: 49
Verse Nr: 32
The verse: The purchase of the field and of the cave ... and he was put in a coffin in Egypt.
Book name: EXODUS
------------------------------
Chapter Nr: 1
Verse Nr: 1
The verse: Now these are the names ...
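
If one row per verse with the book name attached is more convenient, a different arrangement is possible with the standard re module: make the book name an optional prefix of each verse match and carry the last seen book forward in plain Python. This is a sketch under the assumptions that book names are single all-caps words and that verse text never contains something shaped like "3:16". VERSE_RE and parse_verses are names invented here:

```python
import re

# Optional all-caps book name, then chapter:verse, then the verse text up to
# the next "chapter:verse" marker (optionally preceded by a book) or the end.
VERSE_RE = re.compile(
    r"(?:\b([A-Z]{2,}) )?(\d+):(\d+)\s*(.*?)\s*"
    r"(?=(?:\b[A-Z]{2,} )?\d+:\d+|$)",
    re.S,
)

def parse_verses(text):
    """Return a list of (book, chapter, verse, verse_text) tuples."""
    rows = []
    book = None
    for m in VERSE_RE.finditer(text):
        if m.group(1):  # a new book starts with this verse
            book = m.group(1)
        rows.append((book, int(m.group(2)), int(m.group(3)), m.group(4)))
    return rows

s = ("GENESIS 1:1 In the beginning God created the heavens ... the ground. "
     "2:7 And the LORD ... I buried Leah. "
     "49:32 The purchase of the field and of the cave ... "
     "and he was put in a coffin in Egypt. "
     "EXODUS 1:1 Now these are the names ...")
rows = parse_verses(s)
for row in rows:
    print(row)
```

The lookahead here is variable length, which the re module allows; only lookbehinds need a fixed width.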

Insert space after the second or third capital letter python

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quadrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks

You may use
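df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
(regex=True is required in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.)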
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
Here, [NS][EW]|[NESW] matches N or S followed by E or W, or a single N, E, S or W.
Pandas demo:
>>> import pandas as pd
>>> df = pd.DataFrame({'Test': ['481 Rogers Rd York ON',
...                             '101 9 Ave SWCalgary AB',
...                             '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0 481 Rogers Rd York ON
1 101 9 Ave SW Calgary AB
2 101 9 Ave S Calgary AB
Name: Test, dtype: object

You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use a lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space (s is the input string):
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', s)
https://regex101.com/r/TcB4Ph/1
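
To see how the generic pattern and the quadrant-specific pattern from the first answer differ, here is a small side-by-side sketch with plain re.sub. The fourth address is hypothetical and only there to show a case where the generic pattern splits a word that is not a quadrant; GENERIC, QUADRANT and add_space are names chosen for this example:

```python
import re

# One or two capitals glued to a capitalized word
GENERIC = re.compile(r"\b([A-Z]{1,2})([A-Z][a-z])")
# Only compass quadrants (N, E, S, W, NE, NW, SE, SW) glued to a capitalized word
QUADRANT = re.compile(r"\b([NS][EW]|[NESW])([A-Z][a-z])")

def add_space(address, pattern):
    """Insert a space between the captured prefix and the following word."""
    return pattern.sub(r"\1 \2", address)

addresses = [
    "481 Rogers Rd York ON",
    "101 9 Ave SWCalgary AB",
    "101 9 Ave SCalgary AB",
    "12 USAir Way York ON",  # hypothetical address
]
generic = [add_space(a, GENERIC) for a in addresses]
quadrant = [add_space(a, QUADRANT) for a in addresses]
for before, g, q in zip(addresses, generic, quadrant):
    print(f"{before!r} -> generic: {g!r}, quadrant: {q!r}")
```

Both patterns fix the two Calgary rows and leave the York row alone; only the generic one turns "USAir" into "US Air".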

Python Regex Inconsistency

For several different regular expressions I have found optional and conditional sections of the regex to behave differently for the first match and the subsequent matches. This is using python, but I found it to hold generically.
Here are two similar examples that illustrate the issue:
First Example:
expression:
(?:\w. )?([^,.]*).*(\d{4}\w?)
text:
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
matches:
Match 1
wang Wang
2002
Match 2
R
2002
Second example:
expression:
((?:\w\. )?[^,.]*).*(\d{4}\w?)
text:
J. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
R. wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating \nDenial-of-Service Attacks with a Proxy Network. In Proceedings of the USENIX Security Symposium, 2002.
matches:
Match 1
J. wang Wang
2002
Match 2
R
2002
What am I missing?
I would expect this to behave a bit differently, I would think the matches would be consistent. What I think it should be (and don't yet understand why it isn't):
Example 1
Match 1
wang Wang
2002
Match 2
wang Wang
2002
Example 2
Match 1
J. wang Wang
2002
Match 2
R. wang Wang
2002

In your first example you expect the second line to match 'wang Wang', but that is not what happens.
After the first match, which ends at '2002', the regex continues with the remaining text, where a dot and newlines come before 'R. wang Wang'. In your first regex the optional non-capturing group doesn't match there, so group 1 takes over and matches that, ending up with '\n\nR'
(?: # non-capturing group
\w. # word char, followed by 1 char, followed by space
)? # read 0 or 1 times
( # start group 1
[^,.]* # read anything that's not a comma or dot, 0 or more times
) # end group 1
.* # read anything
( # start group 2
\d{4} # until there's 4 digits
\w? # eventually followed by word char
) # end group 2
The same applies to your second regex: even here your non-capturing group (?:\w\. )? doesn't consume the R. because there are a dot and some newlines in front of the initials.
You could have solved it like this: ([A-Z]\.)\s([^.,]+).*(\d{4})
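
A quick way to check that suggested pattern is to run it with re.finditer over two citations. This sketch assumes each citation sits on a single line (the line break inside the citations from the question is removed here), because . does not match a newline unless re.DOTALL is set:

```python
import re

# The citation from the question, joined onto one line; {} is the initial.
citation = ("{} wang Wang, X. Liu, and A. A. Chien. Empirical Study of Tolerating "
            "Denial-of-Service Attacks with a Proxy Network. In Proceedings of the "
            "USENIX Security Symposium, 2002.")
text = citation.format("J.") + "\n" + citation.format("R.")

# Group 1: initial with dot, group 2: name up to the first comma or dot,
# group 3: the last four-digit number on the line.
pattern = re.compile(r"([A-Z]\.)\s([^.,]+).*(\d{4})")

results = [m.groups() for m in pattern.finditer(text)]
for groups in results:
    print(groups)
# ('J.', 'wang Wang', '2002')
# ('R.', 'wang Wang', '2002')
```

Because the pattern has to start at a capital letter followed by a dot, the leftover dot and newline after the first match can no longer end up in a capture group, and both lines produce the same groups.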
