Remove leading numbers from strings - pandas / python

I have this data df where Names is a column name and below it are its data:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I turn it into this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried with df.strip but some of the numbers are still left in the DataFrame.

You can extract everything after the leading digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)')
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
See the pattern details on Regex101.
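If you also want to keep the leading digits in a separate column, the same idea extends to two capture groups; a minimal sketch (the 'Number' column name is just for illustration, not from the question):
df[['Number', 'Names']] = df['Names'].str.extract(r'^(\d+)(.*)')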

We can use str.replace here with the regex pattern ^\d+, which targets the leading digits (regex=True is needed in newer pandas versions, where the pattern is no longer treated as a regex by default).
df["Names"] = df["Names"].str.replace(r'^\d+', '', regex=True)

The answer by Tim certainly solves this, but I usually feel uncomfortable using regex since I'm not proficient with it, so I would approach it like this:
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]
df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the number of leading numeric characters and then return the string with those starting characters sliced off.
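If you'd rather avoid both regex and an explicit loop, stripping a fixed set of digit characters also works; a minimal sketch, assuming the prefixes only contain ASCII digits:
df["Names"] = df["Names"].str.lstrip("0123456789")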


Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing the column with that substring. I was wondering whether plain Python string stripping or a regular expression substitution would give better performance.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just gets this part '<Person 1234567 ', not sure how to get the '>' from the end.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), see an online demo:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
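Alternatively, you could capture just the name instead of deleting everything around it; a minimal sketch, assuming every value follows the '<Person digits name>' shape shown above:
df['Person'] = df['Person'].str.extract(r'<Person\s+\d+\s+(.*)>', expand=False)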
To keep things simple, you can first pull out all the non-numeric parts with this code:
res = re.findall(r"[^()0-9-]+", string)
res[1]
This should return a list of strings like ['Person', 'Tom Brady'], and you can then access the name with res[1].
**Remark: I have yet to try the code; in case it also returns spaces, you should be able to remove them easily with strip(), or the name may be the third string of the list, res[3], instead.
You can read more about re.findall() online or through the documentation.
Python's re module has a function called search that finds the first match of a pattern in a string. With the examples given, you can use regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd
def extract_name(row):
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

df = ...  # your DataFrame
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
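The same name-matching regex can also be applied without apply, using pandas' vectorized string methods; a minimal sketch (expand=False returns a Series rather than a one-column DataFrame):
df['Person'] = df['Person'].str.extract(r'([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)', expand=False)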

How to create a specific dummy variable using a regular expression?

I have a pandas dataframe:
col1
johns id is 81245678316
eric bought 82241624316 yesterday
mine is87721624316
frank is a genius
i accepted new 82891224316again
I want to create a new column with dummy variables (0, 1) depending on col1. If there are 11 digits in a row starting with 8, it should be 1, otherwise 0.
So I wrote this code:
df["is_number"] = df.col1.str.contains(r"\b8\d{10}").map({True: 1, False: 0})
However output is:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 0
frank is a genius 0
i accepted new 82891224316again 0
As you can see, the third and fifth rows have 0 in "is_number", but I want them to have 1, even though the space between the words and the numbers is missing in some places. How can I do that? I want:
col1 is_number
johns id is 81245678316 1
eric bought 82241624316 yesterday 1
mine is87721624316 1
frank is a genius 0
i accepted new 82891224316again 1
You can use digit boundaries instead, because the numbers in your input can be "glued" to letters (letters are word characters, so there is no word boundary between a letter and the 8):
df["is_number"] = df['col1'].str.contains(r"(?<!\d)8\d{10}(?!\d)").map({True: 1, False: 0})
Output:
>>> df
col1 is_number
0 johns id is 81245678316 1
1 eric bought 82241624316 yesterday 1
2 mine is87721624316 1
3 frank is a genius 0
4 i accepted new 82891224316again 1
See the regex demo.
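As a quick sanity check outside pandas, a minimal sketch using only the re module (the 12-digit test value is made up for illustration):
import re
print(bool(re.search(r"(?<!\d)8\d{10}(?!\d)", "mine is87721624316")))    # True: exactly 11 digits starting with 8
print(bool(re.search(r"(?<!\d)8\d{10}(?!\d)", "id 812345678901 here")))  # False: 12 digits in a row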
You just need to remove the \b, which stands for a word boundary, since you do not care whether there is a boundary or not.
df["is_number"] = df.col1.str.contains(r"8\d{10}").map({True: 1, False: 0})
The solution can be as simple as yours, except that \b must be removed, because it requires a word boundary at that position:
df.col1.str.contains(r"8\d{10}").astype(int)
If you want exactly 11 digits, not more, then demand that the symbols before and after the eleven digits either do not exist or are not digits:
df.col1.str.contains("(^|\D)8\d{10}($|\D)").astype(int)

Python remove middle initial from the end of a name string

I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
                            'Howard, Rob',
                            'Smith-Howard, Emily R',
                            'McDonald, Jim T',
                            'McCormick, Erica']})
I am currently using the following code, which works for all names except McCormick, Erica. I first use regex to find all capital letters. Then, for any row with 3 or more capital letters, I slice off the last character of the string with [:-1] (in an attempt to remove the middle initial and the extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
This properly removes the middle initial for all names except McCormick, Erica: she has 3 capital letters but no middle initial, so the slice incorrectly removes the 'a' in Erica.
You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
See the regex demo. Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
Another solution (not so pretty) would be to split, take the first 2 elements, then join again:
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object
I would use something like this:
def removeMaj(string):
    tab = string.split(',')
    tab[1] = tab[1].lower()
    string = ",".join(tab)
    return string
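If you prefer to avoid regex entirely, a minimal non-regex sketch (assuming the middle initial is always a single trailing uppercase letter) is to drop the last token only when it is one capital letter:
def remove_middle_initial(name):
    parts = name.split()
    # drop the last token only when it is a single uppercase letter
    if parts and len(parts[-1]) == 1 and parts[-1].isupper():
        parts = parts[:-1]
    return ' '.join(parts)

df['Name'] = df['Name'].apply(remove_middle_initial)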

Is there an easy way to remove the end of a string in the rows of a dataframe?

I'm new to Python/pandas and I'm losing my hair over regex. I would like to use str.replace() to modify strings in a dataframe.
I have a 'Names' column in dataframe df which looks like this:
Jeffrey[1]
Mike[3]
Philip(1)
Jeffrey[2]
etc...
In each row of the column, I would like to remove the end of the string, starting from the '[' or the '('...
I thought of using something like the line below, but I have a hard time understanding regex; any tip on a good regex summary for beginners is welcome.
df['Names']=df['Names'].str.replace(r'REGEX??', '')
Thanks!
Extract only the alphabetic letters with Series.str.extract:
df['Names'] = df['Names'].str.extract('([A-Za-z]+)')
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
This regex would also work; the $ indicates the end of the string:
df['Names'] = df['Names'].str.extract(r'(.*)[\[|\(]\d+[\]\)]$')
You could use split to take everything before the first [ or ( character:
df['Names'].str.split(r'\[|\(').str[0]
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
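Since the question asked about str.replace() specifically, here is a minimal sketch of that route, assuming the part to drop always starts at the first '[' or '(':
df['Names'] = df['Names'].str.replace(r'[\[(].*$', '', regex=True)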

Using Reg Ex to Match Strings in a Data Frame and Replace - python

I have a data frame that looks like this:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT
I want to be able to strip ,60 ,R-12,HT using regex and also delete the moreinfo and GoCats rows from the df.
My expected Results:
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
I first removed the strings
to_remove = ['hello', 'moreinfo']
for i in to_remove:
    df = df[df.value != i]
Can somebody suggest a way to use regex to match the A067-M4FL-CAA-020 or MZF8-050Z-AAB pattern and delete all the rows that don't match it, so I don't have to create a list of all possible cases?
I was able to strip a single line like this, but I want to be able to strip all matching cases in the dataframe:
pattern = r',\w+ \,\w+-\w+\,\w+ *'
line = 'MRF2-050A-TFC,60 ,R-12,HT'
for i in re.findall(pattern, line):
    line = line.replace(i, '')
>>> MRF2-050A-TFC
I tried adjusting my code but it prints out the same output for each row
pattern = r',\w+ \,\w+-\w+\,\w+ *'
for d in df:
    for i in re.findall(pattern, d):
        d = d.replace(i, '')
Any suggestions will be greatly appreciated. Thanks
You may try this
(?:\w+-){2,}[^,\n]*
Demo
A Python script could be as follows:
ss="""0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT"""
import re
regx = re.compile(r'(?:\w+-){2,}[^,\n]*')
m = regx.findall(ss)
for i in range(len(m)):
    print("%d %s" % (i, m[i]))
and the output is
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
2 MZF8-050Z-AAB
3 MZA2-0580-TFD
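To run the same pattern over the DataFrame column instead of a plain string, one option (a sketch, not the answerer's exact code) is str.extract followed by dropna:
df['value'] = df['value'].str.extract(r'((?:\w+-){2,}[^,\n]*)', expand=False)
df = df.dropna()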
Here's a simpler approach you can try with pandas' built-in string methods; pandas has many in-built functions for dealing with text data.
# remove unwanted values
df['value'] = df.value.str.replace(r'moreinfo|60|R-.*|HT|GoCats|\,', '', regex=True)
# drop na
df = df[(df != '')].dropna()
# print
print(df)
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC
3 MZF8-050Z-AAB
5 MZA2-0580-TFD
-----------
# data used
from io import StringIO
df = pd.read_fwf(StringIO(u'''
value
0 A067-M4FL-CAA-020
1 MRF2-050A-TFC,60 ,R-12,HT
2 moreinfo
3 MZF8-050Z-AAB
4 GoCats
5 MZA2-0580-TFD,60 ,R-669,LT'''),header=1)
I'd suggest capturing the data you DO want, since it's pretty particular, and the data you do NOT want could be anything.
Your pattern should look something like this:
^\w{4}-\w{4}-\w{3}(?:-\d{3})?
https://regex101.com/r/NtH2Ut/2
I'd recommend being more specific than \w where possible (like ^[A-Z]\w{3}), if you know the four-character chunk at the beginning should start with a letter.
edit
Sorry, I may not have read your input and output literally enough:
https://regex101.com/r/NtH2Ut/3
^(?:\d+\s+\w{4}-\w{4}-\w{3}(?:-\d{3})?)|^\s+.*
