Regex: How to capture words with spaces/hyphens excluding numbers? - python

I have a dataset that looks like this:
Column1
-------
abcd - efghi 1234
aasdas - asdas 54321
asda-asd 2344
aasdas(asd) 5234
I want to pull out everything except the numbers, so it will look like this:
Column2
-------
abcd - efghi
aasdas - asdas
asda-asd
aasdas(asd)
This is my current regex:
df['Column2'] = df['Column1'].str.extract('([A-Z]\w{0,})', expand=True)
But it only extracts the first word and drops the parentheses and hyphens. Any help will be appreciated... thank you!

You can use replace:
df.Column1.str.replace(r'\d+', '', regex=True)
Out[775]:
0    abcd - efghi
1    aasdas - asdas
2    asda-asd
3    aasdas(asd)
Name: Column1, dtype: object
#df.Column1 = df.Column1.str.replace(r'\d+', '', regex=True)

Just removing numbers will leave you with unwanted space characters.
This list comprehension removes all digits and keeps interior spaces, while strip() removes them on the outside:
df['Column2'] = df['Column1'].apply(
    lambda x: ''.join([i for i in x if not i.isdigit()]).strip())
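A variant of the replace approach (a sketch on the sample rows above) lets the pattern consume the whitespace in front of the digits too, so no separate strip step is needed:

```python
import pandas as pd

df = pd.DataFrame({'Column1': ['abcd - efghi 1234', 'aasdas - asdas 54321',
                               'asda-asd 2344', 'aasdas(asd) 5234']})

# \s* eats the space(s) before the number, \d+ eats the number itself
df['Column2'] = df['Column1'].str.replace(r'\s*\d+', '', regex=True)
```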

Related

How to remove numbers from a string column that starts with 4 zeros?

I have a column with names and information of products. I need to remove the codes from the names, and every code starts with four or more zeros. Some names also have four or more zeros in the weight, and some codes are joined directly to the name, as in the example below:
data = {
'Name' : ['ANOA 250g 00004689', 'ANOA 10000g 00000059884', '80%c asjw 150000001568 ', 'Shivangi000000478761'],
}
testdf = pd.DataFrame(data)
The correct output would be:
results = {
'Name' : ['ANOA 250g', 'ANOA 10000g', '80%c asjw 150000001568 ', 'Shivangi'],
}
results = pd.DataFrame(results)
You can split the strings at the start of the code, which is matched by the regex (?<!\d)0{4,}. This pattern consumes four or more 0s that are not preceded by any digit. After splitting the string, take the first fragment; str.strip gets rid of a possible trailing space:
testdf.Name.str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0].str.strip()
# outputs:
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Note that this works for the case where the codes are always at the end of your string.
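Assembled into a self-contained sketch on the sample frame from the question, the split approach looks like this:

```python
import pandas as pd

testdf = pd.DataFrame({'Name': ['ANOA 250g 00004689', 'ANOA 10000g 00000059884',
                                '80%c asjw 150000001568 ', 'Shivangi000000478761']})

# split at a run of four or more zeros not preceded by a digit,
# keep the fragment before the code, and strip trailing whitespace
testdf['Name'] = (testdf['Name']
                  .str.split(r'(?<!\d)0{4,}', regex=True, expand=True)[0]
                  .str.strip())
```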
Use a regex with str.replace:
testdf['Name'] = testdf['Name'].str.replace(r'(?:(?<=\D)|\s*\b)0{4}\d*', '', regex=True)
Or, similar to @HaleemurAli, with a negative lookbehind:
testdf['Name'] = testdf['Name'].str.replace(r'\s*(?<!\d)0{4,}\d*', '', regex=True)
Output:
Name
0 ANOA 250g
1 ANOA 10000g
2 80%c asjw 150000001568
3 Shivangi
Try splitting at each space and checking whether each item has 0000 in it:
answer = []
for i in testdf["Name"]:
    answer.append(" ".join([j for j in i.split() if "0000" not in j]))

Remove leading words pandas

I have this data df where Names is a column name and below it are its data:
Names
------
23James
0Sania
4124Thomas
101Craig
8Rick
How can I return it to this:
Names
------
James
Sania
Thomas
Craig
Rick
I tried with df.strip but there are certain numbers that are still in the DataFrame.
You can also extract all characters after the leading digits using a capture group:
df['Names'] = df['Names'].str.extract(r'^\d+(.*)')
print(df)
# Output
Names
0 James
1 Sania
2 Thomas
3 Craig
4 Rick
We can use str.replace here with the regex pattern ^\d+, which targets leading digits.
df["Names"] = df["Names"].str.replace(r'^\d+', '', regex=True)
The answer by Tim certainly solves this, but I usually feel uncomfortable using regex as I'm not proficient with it, so I would approach it like this:
def removeStartingNums(s):
    count = 0
    for i in s:
        if i.isnumeric():
            count += 1
        else:
            break
    return s[count:]

df["Names"] = df["Names"].apply(removeStartingNums)
What the function essentially does is count the number of leading characters which are numeric and then returns a string which has those starting characters sliced off
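If the prefix is always plain digits, str.lstrip works too, with no regex at all: it takes a set of characters and strips them from the left until it meets one outside the set (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Names': ['23James', '0Sania', '4124Thomas', '101Craig', '8Rick']})

# lstrip takes a set of characters to strip, not a prefix string
df['Names'] = df['Names'].str.lstrip('0123456789')
```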

Regex not working properly for some cases (python)?

I have a data frame where one column should hold string values and the other integers, but both columns contain special characters, and the values mix letters and digits. I used a regex to remove the unwanted characters; it mostly works, but in the integer column a value like 'abc123' keeps its 'abc', and in the string column a value like '123abc' keeps its digits. I don't know whether the pattern or the code is wrong. Below is my code:
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)
str int
0 abc 123
1 gbc#* 23abc
2 abc123 abc200
3 124abc 1230&*
4 abcer£$%&*! 230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see, it removed the junk from '23abc', '1230&*' and '230!?*&', but not from 'abc200', where the letters come first.
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
Now it removed them for all values, but sometimes it does not remove anything when the value is like '124abc'.
Is my pattern wrong? I have also tried giving different patterns but nothing worked
I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int'
Expected output:
Once after cleaning and replacing with the old with the cleaned values the output would look like this.
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
You can do it with:
df1['str'] = df1['str'].str.replace(r'[\d\W]+', '', regex=True)  # removes digits (\d) and non-word characters (\W)
df1['int'] = df1['int'].str.replace(r'\D+', '', regex=True)  # removes any non-digit character (like [^0-9])
Returns:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
Try the following:
\D represents any non-digit character; substitute those with the empty string '' in the int column.
[^a-zA-Z] represents any character not in the ranges a-z and A-Z; substitute those with the empty string '' in the str column.
Apply these transformations to both columns using .apply() and a lambda function:
import pandas as pd
import re

d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1 = pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)
print(df1)
Output:
str int
0 abc 123
1 gbc 23
2 abc 200
3 abc 1230
4 abcer 230
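The per-column patterns above can also be kept in a single dict and applied in a small loop, which scales if more columns need cleaning (a sketch on the question's data):

```python
import pandas as pd

d = [['abc', '123'], ['gbc#*', '23abc'], ['abc123', 'abc200'],
     ['124abc', '1230&*'], ['abcer£$%&*!', '230!?*&']]
df1 = pd.DataFrame(d, columns=['str', 'int'])

# one cleanup pattern per column: letters only in 'str', digits only in 'int'
patterns = {'str': r'[^a-zA-Z]+', 'int': r'\D+'}
for col, pat in patterns.items():
    df1[col] = df1[col].str.replace(pat, '', regex=True)
```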

Removing substring of from a list of strings

There are several countries with numbers and/or parentheses in my list. How do I remove these?
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Run just:
df.Country.replace(r'\d+|\s*\([^)]*\)', '', regex=True, inplace=True)
Assuming that the initial content of your DataFrame is:
Country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 United Kingdom
after the above replace you will have:
Country
0 Bolivia
1 Switzerland
2 United Kingdom
The above pattern contains:
the first alternative: a non-empty sequence of digits (\d+),
the second alternative, composed of:
an optional sequence of whitespace characters (\s*),
an opening parenthesis (escaped),
a sequence of characters other than ) (inside a character class no escaping is needed),
a closing parenthesis (also escaped).
Use Series.str.replace with a regex for the replacement: \s* matches possible spaces before (, \(.*\) matches the parenthesised part, | is the regex alternation (or), and \d+ matches one or more digits:
df = pd.DataFrame({'a':['Bolivia (Plurinational State of)','Switzerland17']})
df['a'] = df['a'].str.replace(r'(\s*\(.*\)|\d+)', '', regex=True)
print(df)
print (df)
a
0 Bolivia
1 Switzerland
You can remove them this way:
Remove numbers:
import re
a = 'Switzerland17'
pattern = '[0-9]'
res = re.sub(pattern, '', a)
print(res)
Output:
'Switzerland'
Remove parentheses:
b = 'Bolivia (Plurinational State of)'
pattern2 = r'(\s*\(.*\))'
res2 = re.sub(pattern2, '', b)
print(res2)
Output:
'Bolivia'
Using regex and a simple list operation
Go through the list items, find the regex match in each item, and replace the value in place. The regex [a-zA-Z]{2,} matches only a run of letters of length two or more, so it stops at digits and parentheses. A better approach with regex is to match what is valid in your input domain (country names, in your case): a country name cannot contain a number or parentheses, so you can use the following.
import re

list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
for index in range(len(list_of_country_strings)):
    x = re.match("[a-zA-Z]{2,}", string=list_of_country_strings[index])
    if x:
        list_of_country_strings[index] = list_of_country_strings[index][x.start():x.end()]
print(list_of_country_strings)
Output
['Switzerland', 'America', 'Korea']
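For a plain list rather than a DataFrame, the combined pattern from the first answer also works directly with re.sub in a list comprehension (a sketch):

```python
import re

countries = ['Bolivia (Plurinational State of)', 'Switzerland17', 'United Kingdom']

# \d+ removes runs of digits; \s*\([^)]*\) removes a parenthesised part
# together with any whitespace before it
cleaned = [re.sub(r'\d+|\s*\([^)]*\)', '', c) for c in countries]
```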

Is there an easy way to remove end of the string in rows of a dataframe?

I'm new to Python/pandas and I'm losing my hair over regex. I would like to use str.replace() to modify strings in a dataframe.
I have a 'Names' column into dataframe df which looks like this:
Jeffrey[1]
Mike[3]
Philip(1)
Jeffrey[2]
etc...
I would like to remove, in each row of the column, the end of the string that follows either '[' or '('.
I thought of using something like the line below, but I have a hard time understanding regex; any tip pointing to a good regex summary for beginners is welcome.
df['Names'] = df['Names'].str.replace(r'REGEX??', '')
Thanks!
Extract only the alphabetic letters with Series.str.extract:
df['Names'] = df['Names'].str.extract('([A-Za-z]+)')
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
This regex would also work, with $ anchoring the match at the end of the string:
df['Names'] = df['Names'].str.extract(r'(.*)[\[(]\d+[\])]$')
You could use split to take everything before the first [ or ( character:
df['Names'].str.split(r'\[|\(').str[0]
Names
0 Jeffrey
1 Mike
2 Philip
3 Jeffrey
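An equivalent str.replace version (a sketch) deletes everything from the first '[' or '(' through the end of the string:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Jeffrey[1]', 'Mike[3]', 'Philip(1)', 'Jeffrey[2]']})

# remove everything from the first '[' or '(' to the end of the string
df['Names'] = df['Names'].str.replace(r'[\[(].*$', '', regex=True)
```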
