Check if string column's last characters are numbers in Pandas (Python)

I have this dataframe:
Code Mark
0 Abd 43212312312
1 Charles de Gaulle
2 Carlitos 4132411
3 Antonio
If the last 5 characters of the string in the Code column are numbers, I want 'Mark' to be 'A', so it will look like this:
Code Mark
0 Abd 43212312312 A
1 Charles de Gaulle
2 Carlitos 4132411 A
3 Antonio
I'm trying to use isnumeric but I'm constantly getting AttributeError: 'Series' object has no attribute 'isnumeric'
Can someone help on that?

You are close. The trick is to use the .str accessor via pd.Series.str.isnumeric.
Then map to 'A' or an empty string via pd.Series.map:
df['Mark'] = df['Code'].str[-5:]\
                       .str.isnumeric()\
                       .map({True: 'A', False: ''})
print(df)
Code Mark
0 Abd 43212312312 A
1 Charles de Gaulle
2 Carlitos 4132411 A
3 Antonio

Using pd.Series.str.match, you can use
import numpy as np
df['Mark'] = np.where(df.Code.str.match(r'.*?\d{5}$'), 'A', '')
Note that '.*?' is a non-greedy regex match, '\d{5}' checks for 5 digits, and '$' matches a string end.
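Putting both answers together, here is a minimal end-to-end sketch (the frame is reconstructed from the question, so treat the exact values as an assumption):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Code': ['Abd 43212312312', 'Charles de Gaulle',
                            'Carlitos 4132411', 'Antonio']})

# Approach 1: slice the last 5 characters, test them, map to 'A' / ''
df['Mark'] = df['Code'].str[-5:].str.isnumeric().map({True: 'A', False: ''})

# Approach 2: regex anchored at the end of the string (same result)
df['Mark'] = np.where(df.Code.str.match(r'.*?\d{5}$'), 'A', '')

print(df)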

Related

Extract a substring from a column and replace column data frame

I need some help extracting a substring from a column in my data frame and then replacing that column with the substring. I was wondering whether plain Python string stripping or a regular-expression substitution would give better performance.
The string looks something like this in the column:
Person
------
<Person 1234567 Tom Brady>
<Person 456789012 Mary Ann Thomas>
<Person 92145 John Smith>
What I would like is this:
Person
------
Tom Brady
Mary Ann Thomas
John Smith
What I have so far as far as regular expressions go is this:
/^([^.]+[.]+[^.]+)[.]/g
And that just matches this part: '<Person 1234567 '; I'm not sure how to also handle the '>' at the end.
Multiple ways, but you can use str.replace():
import pandas as pd
df = pd.DataFrame({'Person': ['<Person 1234567 Tom Brady>',
                              '<Person 456789012 Mary Ann Thomas>',
                              '<Person 92145 John Smith>']})
df['Person'] = df['Person'].str.replace(r'(?:<Person[\d\s]+|>)', '', regex=True)
print(df)
Prints:
Person
0 Tom Brady
1 Mary Ann Thomas
2 John Smith
Pattern used: (?:<Person[\d\s]+|>), see an online demo:
(?: - Open non-capture group for alternation;
<Person[\d\s]+ - Match literal '<Person' followed by 1+ whitespace characters or digits;
| - Or;
> - A literal '>'
) - Close group.
Keeping things simple, you can first pull out the alphabetic parts with this code:
res = [part.strip() for part in re.findall(r"[^<>0-9]+", string)]
res[1]
This returns a list of strings ['Person', 'Tom Brady'], so you can then access the name with res[1].
**Remark: the pattern excludes the angle brackets and digits; the strip() call removes the leading/trailing spaces that the raw matches would otherwise keep.
You can read more about re.findall() online or through the documentation.
Python regex has a function called search that finds the matching pattern in a string. With the examples given, you can use regex to extract the names with:
import re
s = "<Person 1234567 John Smith>"
re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", s).group(0)
>>> 'John Smith'
The regular expression [A-Z][a-z]+(\s[A-Z][a-z]+)+ is just matching the names (Tom Brady, Mary Ann Thomas, etc.)
I like to use Pandas' apply function to apply an operation to each row, so the final result would look like this:
import re
import pandas as pd

def extract_name(row):
    row["Person"] = re.search(r"[A-Z][a-z]+(\s[A-Z][a-z]+)+", row["Person"]).group(0)
    return row

df = YOUR DATAFRAME
df2 = df.apply(extract_name, axis=1)
and df2 has the Person column with the extracted names.
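For reference, the same extraction can also be done without apply, using pandas' vectorized str.extract with the pattern from above (a sketch, assuming every row really contains a capitalized multi-word name):
# wrap the whole pattern in a capture group; the inner group is non-capturing
df['Person'] = df['Person'].str.extract(r'([A-Z][a-z]+(?:\s[A-Z][a-z]+)+)', expand=False)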

Remove leading words pandas

I have this data df where Names is a column name and below it are its data:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I return it to this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried df.strip, but certain numbers are still left in the DataFrame.
You can also extract all characters after digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)')
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
Details on Regex101
We can use str.replace here with the regex pattern ^\d+, which targets leading digits.
df["Names"] = df["Names"].str.replace(r'^\d+', '')
The answer by Tim certainly solves this, but I usually feel uncomfortable using regex since I'm not proficient with it, so I would approach it like this:
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]

df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the number of leading characters that are numeric, and then return the string with those starting characters sliced off.
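If you want to avoid both regex and an explicit loop, str.lstrip with a set of digit characters is a vectorized sketch of the same idea (assuming only leading ASCII digits ever need to be stripped):
# strips any run of these characters from the left end of each string
df["Names"] = df["Names"].str.lstrip('0123456789')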

How to filter rows with non Latin characters

I am stuck on a problem with a dataframe that has a column of film names, including a bunch of non-Latin names such as Japanese or Chinese (and maybe Russian too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I just want an output that removes every title containing non-Latin characters, i.e. drop every row like rows 3 and 4, so my desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
You can use str.match with a pattern that flags any character outside the Latin-1 range ([^\x00-\xFF]), and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
You can encode your title column with unicode_escape and then decode it as latin1. If this round trip does not match your original data, remove the row, because it contains some non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
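A quick illustration of why the round trip flags non-Latin titles: unicode_escape turns characters outside Latin-1 into literal \uXXXX sequences, so the round-tripped string no longer equals the original:
'dead sea'.encode('unicode_escape').decode('latin1')  # 'dead sea' (unchanged)
'アライヴ'.encode('unicode_escape').decode('latin1')  # '\\u30a2\\u30e9\\u30a4\\u30f4'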
You can use the isascii() method (if you're using Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
Even a single non-ASCII letter makes the isascii() method return False.
(Note that for strings like '34?#5' the method will return True, because those are all ASCII characters.)
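As a compact filter built on the same method (a sketch, assuming the column holds plain strings with no NaN values):
df = df[df['title'].map(str.isascii)]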
We can easily make a function that returns whether a string is ASCII, and then filter our dataframe based on it:
import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(lambda x: check_ascii(x))
df2 = df[df['is_eng'] == True]
df2
Output:
   col1          col2  is_eng
0     1   I am legend    True
1     2  wonder women    True
4     5      dead sea    True

Python remove middle initial from the end of a name string

I am trying to remove the middle initial at the end of a name string. An example of how the data looks:
df = pd.DataFrame({'Name': ['Smith, Jake K',
                            'Howard, Rob',
                            'Smith-Howard, Emily R',
                            'McDonald, Jim T',
                            'McCormick, Erica']})
I am currently using the following code, which works for all names except McCormick, Erica. I first use regex to find all capital letters; then, for any row with 3 or more capital letters, I take the string's [:-1] slice (in an attempt to remove the middle initial and the extra space).
df['Cap_Letters'] = df['Name'].str.findall(r'[A-Z]')
df.loc[df['Cap_Letters'].str.len() >= 3, 'Name'] = df['Name'].str[:-1]
This outputs the following:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Eric
As you can see, this properly removes the middle initial for all names except McCormick, Erica: that name has 3 capital letters but no middle initial, so the slice incorrectly removes the 'a' in Erica.
You can use Series.str.replace directly:
df['Name'] = df['Name'].str.replace(r'\s+[A-Z]$', '', regex=True)
Output:
0 Smith, Jake
1 Howard, Rob
2 Smith-Howard, Emily
3 McDonald, Jim
4 McCormick, Erica
Name: Name, dtype: object
See the regex demo. Regex details:
\s+ - one or more whitespaces
[A-Z] - an uppercase letter
$ - end of string.
Another solution (not so pretty) would be to split, take the first 2 elements, and join again:
df['Name'] = df['Name'].str.split().str[0:2].str.join(' ')
# 0 Smith, Jake
# 1 Howard, Rob
# 2 Smith-Howard, Emily
# 3 McDonald, Jim
# 4 McCormick, Erica
# Name: Name, dtype: object
I would use something like that:
def removeMaj(string):
    tab = string.split(',')
    tab[1] = tab[1].lower()  # lowercase the part after the comma
    string = ",".join(tab)
    return string
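A hypothetical usage sketch for the function above; note that as written it lowercases everything after the comma rather than dropping the trailing initial, so it only sets up the comma-splitting idea:
df['Name'] = df['Name'].apply(removeMaj)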

Regex not working properly for some cases (python)?

I have a data frame where one column has string values and the other has integers, but those columns come with special characters attached, or the string data has integers mixed in. I used regex to remove them, and my regex works fine mostly; but in the integer column, if the value is 'abc123' it does not remove the abc, and likewise in the string column, if the value is '123abc' it does not remove it. I don't know if the pattern is wrong or the code is wrong. Below is my code,
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)
str int
0 abc 123
1 gbc#* 23abc
2 abc123 abc200
3 124abc 1230&*
4 abcer£$%&*! 230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see, it removed them for '23abc', '1230&*', and '230!?*&', but not for 'abc200', since the string part comes first.
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
Here it removed them for all values, but sometimes it does not remove them when the value is like '124abc'.
Is my pattern wrong? I have also tried giving different patterns but nothing worked
I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int'
Expected output:
Once after cleaning and replacing with the old with the cleaned values the output would look like this.
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
You can do it with
df1['str'] = df1['str'].str.replace(r'[\d\W]+', '', regex=True)  # removes digits (\d) and non-word characters (\W)
df1['int'] = df1['int'].str.replace(r'\D+', '', regex=True)  # removes any non-digit character (like [^0-9])
Returns:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
Try the following:
'\D' represents any non-digit character; substitute those with the empty string '' in the int column
[^a-zA-Z] represents any character not in the ranges a-z and A-Z; substitute those with the empty string '' in the str column
Apply these transformations to both columns using .apply() and a lambda function
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str', 'int'])
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)
Output:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
