I need some help extracting a substring from a column in my data frame and then replacing the column's values with that substring. I was wondering whether, in Python, stripping the string would perform better than using a regular expression to substitute/replace the string with the substring.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just gets this part '<Person 1234567 ', not sure how to get the '>' from the end.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
'<Person 456789012 Mary Ann Thomas>',
'<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), which breaks down as:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
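As an alternative to deleting the unwanted parts, the name can also be captured directly. The following is a minimal sketch, assuming every value has the '<Person digits name>' shape shown in the question; the anchored pattern and the use of str.extract are an illustration and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})

# Capture everything between the ID digits and the closing '>'.
# expand=False returns a Series, so it can be assigned back to one column.
df['Person'] = df['Person'].str.extract(r'^<Person\s+\d+\s+(.*)>$', expand=False)
print(df['Person'].tolist())
```

One difference worth knowing: with str.extract, a row that does not match the pattern becomes a missing value, whereas str.replace leaves such a row unchanged.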
To keep things simple, you can first collect every run of characters that is neither a digit nor an angle bracket:
import re
string = "<Person 1234567 Tom Brady>"
res = re.findall(r"[^<>0-9]+", string)
res[1].strip()
This returns the list ['Person ', ' Tom Brady'], so the name is the second element, res[1]; strip() removes the surrounding spaces.
You can read more about re.findall() online or through the documentation.
Python's re module has a function called search that finds the first match of a pattern in a string. With the examples given, you can use a regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
# 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd
def extract_name(row):
    # Replace the raw value with the matched name
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row
df = YOUR_DATAFRAME  # placeholder: substitute your own DataFrame
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
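Note that re.search returns None when nothing matches, so calling .group(0) would raise an AttributeError on such a row. Here is a hedged sketch of a guarded variant; the third sample value and the choice to keep unmatched values unchanged are assumptions made for illustration.

```python
import re
import pandas as pd

# Same name pattern as above, with a non-capturing group
NAME_PATTERN = re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)+")

def extract_name(value):
    """Return the matched name, or the original value if nothing matches."""
    match = NAME_PATTERN.search(value)
    return match.group(0) if match else value

df = pd.DataFrame({"Person": ["<Person 92145 John Smith>",
                              "<Person 456789012 Mary Ann Thomas>",
                              "no name here"]})  # last row: hypothetical bad input

# Series.apply works element by element, so no axis argument is needed
df["Person"] = df["Person"].apply(extract_name)
print(df["Person"].tolist())
```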
I have this data df where Names is a column name and below it are its data:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I return it to this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried with df.strip but there are certain numbers that are still in the DataFrame.
You can also extract all characters after digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)', expand=False)
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
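We can use str.replace here with the regex pattern ^\d+, which targets leading digits. Pass regex=True, since recent pandas versions treat the pattern as a literal string by default.
df["Names"] = df["Names"].str.replace(r'^\d+', '', regex=True)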
The answer by Tim certainly solves this but I usually feel uncomfortable using regex as I'm not proficient with it so I would approach it like this -
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]
df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the number of leading numeric characters and then return the string with those leading characters sliced off.
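Another regex-free option is pandas' own str.lstrip, which accepts a set of characters to remove from the left side of each value. This is a minimal sketch, assuming only ASCII digits need to be removed (isnumeric() in the function above also accepts other Unicode numerals).

```python
import pandas as pd

df = pd.DataFrame({"Names": ["23James", "0Sania", "4124Thomas", "101Craig", "8Rick"]})

# The argument is treated as a set of characters, not as a prefix:
# any leading run made up of these digits is stripped.
df["Names"] = df["Names"].str.lstrip("0123456789")
print(df["Names"].tolist())
```

Because this is vectorised, there is no need to go through apply with a Python-level function.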
I'm using pandas to analyze data from 3 different sources, which are imported into dataframes and need cleaning because the data was entered by hand and contains errors.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of digits, followed by a space, and then the first x alphabetic characters.
For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d+ )+\w{0,3}
How can I use pandas' .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched from regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r' (\d+ )+\w{0,3}', regex=True)
which might make 43 milford st. into "43 mil".
Thank you, please let me know if I'm being unclear.
You could use the extract method to overwrite the column with its own content:
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
Just an observation: the regex you shared, (\d+ )+\w{0,3}, also matches the following inputs and returns some funky results:
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but I'm not sure if that works for all your data points.
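To make the number of kept letters configurable, the pattern can be built from a variable. This is a rough sketch: the variable x, the new column name street_key, and the third sample row (which has no house number) are assumptions added for illustration.

```python
import pandas as pd

x = 3  # hypothetical: number of letters to keep after the house number
# Doubled braces produce literal braces in an f-string: the pattern ends in {3}
pat = rf'(\d+\s[a-zA-Z]{{{x}}})'

df = pd.DataFrame({'street': ['3762 pearl street', '43 milford st.', 'main street']})

# expand=False returns a Series; rows that do not match become missing values
df['street_key'] = df['street'].str.extract(pat, expand=False)
print(df)
```

Writing to a separate column first makes it easy to inspect which rows failed to match before overwriting the original data.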
I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
'Howard, Rob',
'Smith-Howard, Emily R',
'McDonald, Jim T',
'McCormick, Erica']})
I am currently using the following code, which works for all names except for McCormick, Erica. I first use regex to identify all capital letters. Then any rows with 3 or more capital letters, I remove [:-1] from the string (in an attempt to remove the middle initial and extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
This outputs the following:
As you can see, this properly removes the middle initial for all names except McCormick, Erica. The reason is that her name has 3 capital letters but no middle initial, so the 'a' in Erica is incorrectly removed.
You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
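The pattern can be checked on plain strings with re.sub before applying it to the DataFrame. The sketch below also tries a variant that tolerates a trailing period after the initial; that variant and the extra name 'Doe, Jane Q.' are hypothetical additions, not part of the answer above.

```python
import re

names = ['Smith, Jake K', 'Howard, Rob', 'Smith-Howard, Emily R',
         'McDonald, Jim T', 'McCormick, Erica']

# Pattern from the answer: whitespace plus one capital letter at the end
cleaned = [re.sub(r'\s+[A-Z]$', '', name) for name in names]
print(cleaned)

# Hypothetical variant: also accept an optional period after the initial
with_period = re.sub(r'\s+[A-Z]\.?$', '', 'Doe, Jane Q.')
print(with_period)
```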
Another solution (not so pretty) would be to split the string, take the first 2 elements, then join them again:
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object
I would use something like this:
def removeMaj(string):
    # Lowercase everything after the comma
    tab = string.split(',')
    tab[1] = tab[1].lower()
    string = ",".join(tab)
    return string
I'm having an issue with regex and don't really understand its usefulness right now.
I'm trying to extract data from a file. The file consists of first name, last name, and grade.
File:
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Opening file code:
## Regex code: r'([A-Za-z]+)(: B)'
regcode = r'([A-Za-z]+)(: B)'
answer=re.findall(regcode,file)
return answer
The expected result is first name last name. The given result is last name and letter grade. How do I just get the first name and last name for all B grades?
Since you must use regex for this task, here's a simple regex solution that returns the full name:
'(.*): B'
Which works in this case because:
(.*) captures all text on a line up to a match of ': B' (the dot does not match newlines, so the capture never spans two lines)
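One caveat with '(.*): B' is that it also accepts anything that merely starts with B after the colon. The sketch below shows this with a made-up 'B+' grade (the line 'Ann Lee: B+' is hypothetical and not in the original file) and one possible way to tighten the pattern with line anchors.

```python
import re

text = """Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Ann Lee: B+"""  # last line is a hypothetical extra record

# Unanchored: 'B+' also counts as a match
loose = re.findall(r'(.*): B', text)

# Anchored: with re.MULTILINE, ^ and $ match at the start and end of each line
strict = re.findall(r'^(.*): B$', text, re.MULTILINE)

print(loose)
print(strict)
```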
You can do it without regex:
students = '''Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B'''
for x in students.split('\n'):
    string = x.split(': ')
    if string[1] == 'B':
        print(string[0])
# Robert Right
# Jim Jim
or
[x[0:-3] for x in students.split('\n') if x[-1] == 'B']
If a regex solution is required (I personally like Roman Zhak's solution more), put what you are interested in, i.e. the first name and the last name, inside groups, followed by the colon and B:
import re
file = """
Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
"""
regcode = r'([A-Za-z]+) ([A-Za-z]+): B'
answer = re.findall(regcode, file)
print(answer) # [('Robert', 'Right'), ('Jim', 'Jim')]
Add a capturing group ('()') to your expression. Text outside the group must still match, but only the content of the group is returned.
re.findall(r'(\w+\s+\w+):\s+B', file)
#['Robert Right', 'Jim Jim']
'\w' is any alphanumeric character or underscore, '\s' is any whitespace character.
You can add two groups, one for the first name and one for the last name:
re.findall(r'(\w+)\s+(\w+):\s+B', file)
#[('Robert', 'Right'), ('Jim', 'Jim')]
The latter will not work if there are more than two names on one line.
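For lines with more than two names, one possible workaround is to capture the first name in one group and everything else before the colon in a second group. This is only a sketch: the line 'Mary Ann Thomas: B' is invented for the example, and treating the remainder as the "last name" is an assumption that may not suit every data set.

```python
import re

text = """Peter Jenkins: A
Robert Right: B
Kim Long: C
Jim Jim: B
Mary Ann Thomas: B"""  # last line is a hypothetical three-part name

# Group 1: first word on the line; group 2: the rest of the name
pattern = r'^(\w+)\s+(.+):\s+B$'
pairs = re.findall(pattern, text, re.MULTILINE)
print(pairs)
```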
I'm new to Python/pandas and I'm losing my hair with regex. I would like to use str.replace() to modify strings in a dataframe.
I have a 'Names' column in dataframe df which looks like this:
Jeffrey[1]
Mike[3]
Philip(1)
Jeffrey[2]
etc...
In each row of the column, I would like to remove the end of the string, starting at either the '[' or the '('...
I thought of using something like the line below, but I have a hard time understanding regex; any tip on a nice regex summary for beginners is welcome.
df['Names']=df['Names'].str.replace(r'REGEX??', '')
Thanks!
Extract only the alphabetic letters with Series.str.extract:
df['Names'] = df['Names'].str.extract('([A-Za-z]+)')
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
This regex would work, where $ indicates the end of the string:
df['Names'] = df['Names'].str.extract(r'(.*)[\[(]\d+[\])]$')
You could use split to take everything before the first [ or ( characters.
df['Names'].str.split(r'\[|\(').str[0]
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
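Since the question asked specifically about str.replace(), here is a minimal sketch of that route. The pattern is one possible choice, assuming everything from the first '[' or '(' to the end of the string should be dropped.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Jeffrey[1]', 'Mike[3]', 'Philip(1)', 'Jeffrey[2]']})

# [\[(] matches either opening bracket; .*$ consumes the rest of the string.
# regex=True is required for the pattern to be interpreted as a regex.
df['Names'] = df['Names'].str.replace(r'[\[(].*$', '', regex=True)
print(df['Names'].tolist())
```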