Regex not working properly for some cases (Python)?

I have a data frame where one column holds string values and the other integers, but both columns contain special characters, and the string data has integers mixed in. To remove them I used regex. My regex mostly works, but in the integer column a value like 'abc123' keeps its 'abc', and in the string column a value like '123abc' keeps its digits. I don't know whether the pattern or the code is wrong. Below is my code:
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str','int'])
print(df1)
           str      int
0          abc      123
1        gbc#*    23abc
2       abc123   abc200
3       124abc   1230&*
4  abcer£$%&*!  230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see, it cleaned '23abc', '1230&*' and '230!?*&', but not 'abc200', because the letters come first there.
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
Here it cleaned every row, but sometimes it fails to clean values like '124abc'.
Is my pattern wrong? I have also tried different patterns, but nothing worked.
I am removing the integers and special characters in the column 'str', and removing string values and special characters in the column 'int'.
Expected output:
After cleaning and replacing the old values with the cleaned ones, the output should look like this:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230

You can do it with
df1['str'] = df1['str'].str.replace(r"[\d\W]+", '', regex=True)  # removes digits (\d) and non-word characters (\W)
df1['int'] = df1['int'].str.replace(r"\D+", '', regex=True)  # removes any non-digit character (like [^0-9])
(regex=True is spelled out because pandas 2.0 changed the default of str.replace to regex=False.)
Returns:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
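An alternative sketch with str.extract, starting from the original df1 and pulling out the first run of wanted characters instead of deleting the unwanted ones:
df1['str'] = df1['str'].str.extract(r'([a-zA-Z]+)', expand=False)  # first run of letters
df1['int'] = df1['int'].str.extract(r'(\d+)', expand=False)        # first run of digits
Note that extract keeps only the first run, so a value like 'ab1cd' would become 'ab' rather than 'abcd'; the replace approach above is safer for interleaved values.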

Try the following:
'\D' matches any non-digit character; substitute those with the empty string '' in the int column
'[^a-zA-Z]' matches any character not in the ranges a-z and A-Z; substitute those with the empty string '' in the str column
Apply these transformations to both columns using .apply() and a lambda function
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)
Output:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
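As to why the original check missed 'abc200': r'\d+$' is anchored only at the end, and str.contains looks for a match anywhere in the value, so anything that merely ends in digits passes the filter and never lands in wrong. A quick sketch:
import re
print(bool(re.search(r'\d+$', 'abc200')))  # True  -> not flagged as wrong, so never cleaned
print(bool(re.search(r'\d+$', '1230&*')))  # False -> flagged and cleaned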

Related

How to remove numbers from a string column that starts with 4 zeros?

I have a column with names and information about products. I need to remove the codes from the names; every code starts with four or more zeros. Some names have four or more zeros in the weight, and some codes are joined to the name, as in the example below:
import pandas as pd
data = {
    'Name': ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
    'Name': ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code pattern, which is expressed by the regex (?<!\d)0{4,}. This pattern consumes four or more 0s that are not preceded by any digit. After splitting the string, take the first fragment; str.strip() gets rid of a possible trailing space.
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this works for the case where the codes are always at the end of your string.
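To write the cleaned values back into the frame, something like this should work (the regex= argument of str.split needs pandas 1.4 or newer):
testdf['Name'] = testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()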
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*', '', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4}\d*', '', regex=True)
Output:
                     Name
0               ANOA 250g
1             ANOA 10000g
2  80%c asjw 150000001568
3                Shivangi
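For reference, the first pattern reads like this when spelled out with re.VERBOSE (an annotated restatement, not a different regex):
import re
pattern = re.compile(r"""
    (?:
        (?<=\D)   # either the zeros directly follow a non-digit...
      | \s*\b     # ...or optional whitespace up to a word boundary
    )
    0{4}          # four zeros start the code
    \d*           # plus the rest of the code's digits
""", re.VERBOSE)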
Try splitting at each space and checking whether each item has 0000 in it, like:
answer = []
for i in testdf["Name"]:
    answer.append(" ".join([j for j in i.split() if "0000" not in j]))
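Note that this assumes the code is a separate, space-delimited token. A glued value is a single token that contains '0000', so the whole token gets dropped rather than just the code:
# caveat: 'Shivangi000000478761' is one token containing '0000',
# so the join drops the entire token instead of only the code part
print(" ".join(j for j in "Shivangi000000478761".split() if "0000" not in j))  # -> ''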

Removing numbers from strings in a Data frame column

I want to remove numbers from strings in a column, while keeping entries that consist of numbers only.
This is what the data looks like:
df =
id description
1 XG154LU
2 4562689
3 556
4 LE896E
5 65KKL4
This is how i want the output to look like:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
I used the code below, but when I run it, it removes all the entries in the description column and replaces them with blanks:
def clean_text_round1(text):
    text = re.sub('\w*\d\w*', '', text)
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\r', '', text)
    return text

round1 = lambda x: clean_text_round1(x)
df['description'] = df['description'].apply(round1)
Try:
import numpy as np
df['description'] = np.where(df.description.str.contains(r'^\d+$'),
                             df.description,
                             df.description.str.replace(r'\d+', '', regex=True))
Output:
id description
1 XGLU
2 4562689
3 556
4 LEE
5 KKL
Logic:
Check whether the string contains only digits; if so, do nothing and copy the number as it is. If the string has numbers mixed with letters, replace the digits with the empty string '', leaving only the characters without the numbers.
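If you'd rather stay in pandas without numpy, the same logic can be written with a boolean mask and .loc (a sketch, same df as above):
mixed = ~df['description'].str.contains(r'^\d+$')  # rows that are not purely numeric
df.loc[mixed, 'description'] = df.loc[mixed, 'description'].str.replace(r'\d+', '', regex=True)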
This should solve it for you:
def clean_text_round1(text):
    if type(text) == int:  # assumes purely numeric entries are stored as ints, not strings
        return text
    else:
        text = ''.join([i for i in text if not i.isdigit()])
        return text

df['description'] = df['description'].apply(clean_text_round1)
Let me know if this works for you. I am not sure about the speed; you could use a regex instead of the join.
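The regex variant mentioned above could look like this (a sketch keeping the same int check as the original):
import re
def clean_text_round1(text):
    if type(text) == int:  # purely numeric entries pass through untouched
        return text
    return re.sub(r'\d+', '', text)  # otherwise strip all digits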
def convert(v):
    # check if the string is composed of not only numbers
    if any([char.isalpha() for char in v]):
        va = [char for char in v if char.isalpha()]
        va = ''.join(va)
        return va
    else:
        return v

# apply() a function to a single column
df['description'] = df['description'].apply(convert)
print(df)
id description
0 XGLU
1 4562689
2 556
3 LEE
4 KKL

Search for regex in text using Python

I want to search for region in s1. I want to return 1 if the text contains "region", "région", "regions" or "régions", and 0 otherwise.
I wrote the code below, but it doesn't work:
s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
s1.str.contains('r.gion[s][^a-zA-Z]', regex=True).astype(int)
In this case the result must be
[1,1,0,1,1,1,1]
You may use
s1.str.contains(r'\br[ée]gions?\b').astype(int)
If you want to save the regex in a file, then read it in and use it as a variable, just write \br[ée]gions?\b there.
Test:
>>> import pandas as pd
>>> s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
>>> s1.str.contains(r'\br[ée]gions?\b').astype(int)
0 1
1 1
2 0
3 1
4 1
5 1
6 1
dtype: int32
Details
\b - a word boundary
r - r char
[ée] - one of the letters in the character class
gion - gion
s? - an optional s letter
\b - a word boundary.
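Reading the pattern from a file, as mentioned above, could look like this (pattern.txt is a hypothetical file holding the single line \br[ée]gions?\b):
with open('pattern.txt', encoding='utf-8') as f:
    pattern = f.read().strip()
s1.str.contains(pattern).astype(int)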

Regex: How to capture words with spaces/hyphens excluding numbers?

I have a dataset that looks like this:
Column1
-------
abcd - efghi 1234
aasdas - asdas 54321
asda-asd 2344
aasdas(asd) 5234
I want to pull out everything except the numbers, so it will look like this:
Column2
-------
abcd - efghi
aasdas - asdas
asda-asd
aasdas(asd)
This is my current regex:
df['Column2'] = df['Column1'].str.extract('([A-Z]\w{0,})', expand=True)
But it only extracts the first word, leaving out the parentheses and hyphens. Any help will be appreciated, thank you!
You can use replace:
df.Column1.str.replace(r'\d+', '', regex=True)
Out[775]:
0      abcd - efghi 
1    aasdas - asdas 
2          asda-asd 
3        aasdas(asd)
Name: Column1, dtype: object
# df.Column1 = df.Column1.str.replace(r'\d+', '', regex=True)
Just removing numbers will leave you with unwanted space characters. This list comprehension removes all digits and keeps interior space characters, but strips them from the outside:
df['Column2'] = df['Column1'].apply(
    lambda x: ''.join([i for i in x if not i.isdigit()]).strip())
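Combining the two points above, the replace approach plus a final strip should give the cleaned column in one chained call (a sketch):
df['Column2'] = df['Column1'].str.replace(r'\d+', '', regex=True).str.strip()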

Check if string column last characters are numbers in Pandas

I have this dataframe:
Code Mark
0 Abd 43212312312
1 Charles de Gaulle
2 Carlitos 4132411
3 Antonio
If the last 5 characters of the string in the Code column are numbers, I want that 'Mark' is 'A', so it will look like this:
Code Mark
0 Abd 43212312312 A
1 Charles de Gaulle
2 Carlitos 4132411 A
3 Antonio
I'm trying to use isnumeric but I'm constantly getting AttributeError: 'Series' object has no attribute 'isnumeric'
Can someone help on that?
You are close. The trick is to use the .str accessor via pd.Series.str.isnumeric.
Then map to 'A' or an empty string via pd.Series.map:
df['Mark'] = df['Code'].str[-5:]\
                       .str.isnumeric()\
                       .map({True: 'A', False: ''})
print(df)
                Code Mark
0    Abd 43212312312    A
1  Charles de Gaulle
2   Carlitos 4132411    A
3            Antonio
Using pd.Series.str.match, you can use
import numpy as np
df['Mark'] = np.where(df.Code.str.match(r'.*?\d{5}$'), 'A', '')
Note that '.*?' is a non-greedy match, '\d{5}' checks for five digits, and '$' anchors the match at the end of the string.
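Since only the end of the string matters, str.contains with an end-anchored pattern is an equivalent, slightly shorter check (a sketch, reusing np from above):
df['Mark'] = np.where(df.Code.str.contains(r'\d{5}$'), 'A', '')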
