Regex keep \w, hyphen between words and remove digits - python

I am struggling to extend the following expression so that it also matches and removes digits:
[^\w -]|_|-(?!\w)|(?<!\w)-
Example:
123 !"§$%&/()= äüöüÄÖÜÄßßßß hello-123, hello-hello, hello-.
Expected output:
äüöüÄÖÜÄßßßß hello hello-hello hello

You can use
-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-
-*\d+(?:\.\d+)?-*|[^\w -]|_|-(?!\w)|(?<!\w)-
See the regex demo.
The -?\d+-?| part matches
-? - an optional -
\d+ - one or more digits
-? - an optional -
| - or (the rest of the alternatives).
The -*\d+(?:\.\d+)?-* part matches float values, too, and matches zero or more hyphens on both ends of the number.
Replace the - in -? or -* with \W to match any non-word char next to the number.
See the Python demo:
import re
text = '123 !"§$%&/()= äüöüÄÖÜÄßßßß hello-123, hello-hello, hello-.'
print( re.sub(r'-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-', '', text).strip() )
# => äüöüÄÖÜÄßßßß hello hello-hello hello
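To see what each alternative contributes, the first pattern can be wrapped in a small helper and run on a few extra inputs. This is only a sketch: the `clean` function, the `CLEANUP` name, and the sample strings are made up for illustration, and collapsing the leftover spaces is an extra step that the answer above does not perform.

```python
import re

# Same alternatives as in the answer, compiled once for reuse.
CLEANUP = re.compile(r'-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-')

def clean(text):
    """Remove digits, punctuation, underscores and stray hyphens; keep in-word hyphens."""
    stripped = CLEANUP.sub('', text)
    # Collapse the runs of spaces left behind by removed tokens (extra step, not in the answer).
    return ' '.join(stripped.split())

samples = ['hello-123, hello-hello', 'snake_case -lead trail- 42', 'äö-üß 7-up']
for sample in samples:
    print(repr(sample), '->', repr(clean(sample)))
# 'hello-123, hello-hello' -> 'hello hello-hello'
# 'snake_case -lead trail- 42' -> 'snakecase lead trail'
# 'äö-üß 7-up' -> 'äö-üß up'
```

Note that removing the underscore joins `snake` and `case`, because the match is replaced with an empty string rather than a space.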

Related

Use Regex to replace dashes and keep hyphens?

I want to replace dashes with a full stop (.). If the dash appears as a hyphen, it should be ignored. E.g., -ac-ac should become .ac-ac
I started with the following regex: (?<!\s|\-)\-+|\-+(?!\s|\-)
You can use
\B-|-\B
See the regex demo.
The pattern matches
\B- - a hyphen that is preceded by a non-word char or is at the start of a string
| - or
-\B - a hyphen that is followed by a non-word char or is at the end of a string.
See the Python demo:
import re
text = "-ac-ac"
print( re.sub(r'\B-|-\B', '.', text) )
# => .ac-ac
If you only want letters (not digits or underscores) to count as the surrounding context, replace \B with negative lookarounds containing a letter pattern:
(?<![^\W\d_])-|-(?![^\W\d_])
See this regex and Python demo.
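The difference between the two patterns only shows up next to digits or underscores. A small comparison, using a made-up sample string, might look like this:

```python
import re

text = "a-1 -b c-d 2-"  # hypothetical sample mixing letters and digits

# \B version: digits are word chars, so the hyphen in "a-1" is kept.
word_context = re.sub(r'\B-|-\B', '.', text)

# Letter-only version: a hyphen survives only with a letter on both sides.
letter_context = re.sub(r'(?<![^\W\d_])-|-(?![^\W\d_])', '.', text)

print(word_context)    # a-1 .b c-d 2.
print(letter_context)  # a.1 .b c-d 2.
```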

How do I write a Regex in Python to remove leading zeros for a number in the middle of a string

I have a string composed of letters followed by a number, and I need to remove all the letters, as well as the leading zeros in the number.
For example: in the test string U012034, I want to match the U and the 0 at the beginning of 012034.
So far I have [^0-9] to match any characters that aren't digits, but I can't figure out how to also remove the leading zeros in the number.
I know I could do this in multiple steps with something like int(re.sub("[^0-9]", "", test_string)), but I need this process to be done in one regex.
You can use
re.sub(r'^\D*0*', '', text)
See the regex demo. Details
^ - start of string
\D* - any zero or more non-digit chars
0* - zero or more zeros.
See Python demo:
import re
text = "U012034"
print( re.sub(r'^\D*0*', '', text) )
# => 12034
If there is more text after the first number, use
print( re.sub(r'^\D*0*(\d+).*', r'\1', text) )
See this regex demo. Details:
^ - start of string
\D* - zero or more non-digits
0* - zero or more zeros
(\d+) - Group 1: one or more digits (use (\d+(?:\.\d+)?) to match float or int values)
.* - the rest of the string.
The replacement is the Group 1 value.
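As a rough illustration of the capture-group variant, the two forms of Group 1 can be put behind one helper. The function name `first_number`, the `allow_float` flag, and the sample inputs are assumptions made for this sketch, not part of the answer.

```python
import re

def first_number(text, allow_float=False):
    """Return the first number in text without its leading zeros."""
    # Group 1 is either an int or, optionally, an int/float as described above.
    number = r'\d+(?:\.\d+)?' if allow_float else r'\d+'
    return re.sub(rf'^\D*0*({number}).*', r'\1', text)

print(first_number("U012034 and more 007"))              # 12034
print(first_number("U000"))                              # 0
print(first_number("V003.50 units", allow_float=True))   # 3.50
print(first_number("no digits here"))                    # no digits here
```

Because `(\d+)` requires at least one digit, an all-zero number such as `U000` keeps a single `0`, whereas the shorter `^\D*0*` pattern would return an empty string. If the string contains no digits at all, nothing matches and the input is returned unchanged.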
You may use this re.sub in Python:
string = re.sub(r'^[a-zA-Z]*0*|[a-zA-Z]+', '', string)
RegEx Demo
Explanation:
^: Start
[a-zA-Z]*: Match 0 or more letters
0*: Match 0 or more zeroes
|: OR
[a-zA-Z]+: Match 1+ of letters
Does this do what you need?
>>> re.sub("[^0-9]+0*", "", "U0123")
'123'
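The three suggestions agree on the question's input but differ once letters or zeros appear again later in the string. The sketch below compares them on a made-up input (`AB0012C0034` is not from the question); which behavior is right depends on what the data can look like.

```python
import re

patterns = {
    'anchored prefix': r'^\D*0*',
    'prefix plus later letters': r'^[a-zA-Z]*0*|[a-zA-Z]+',
    'every non-digit run': r'[^0-9]+0*',
}

sample = 'AB0012C0034'  # hypothetical input with a second letter/zero run
results = {name: re.sub(pattern, '', sample) for name, pattern in patterns.items()}

for name, value in results.items():
    print(f'{name}: {value}')
# anchored prefix: 12C0034
# prefix plus later letters: 120034
# every non-digit run: 1234
```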

regex to match literally the characters of an abbreviation

I am new to regex and would like some help. I have the string below, and I want my regex to match the first character of the acronym literally, followed by any number of [a-z] characters, but only for that first character. The remaining characters should be matched exactly as they are. Any help on what to change in my regex to achieve this would be highly appreciated.
import re
s = 'nUSA stands for northern USA'
x = (f'({"nUSA"}).+?({" ".join( t[0] + "[a-z]+" + t[1:] for t in "nUSA")})(?: )')
print(x)
out: (nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
What I want to achieve with my regex is something like the pattern below, so that it can match northern USA.
(nUSA).+?(n[a-z]+ U + S + A)(?: )
instead of the one I get
(nUSA).+?(n[a-z]+ U[a-z]+ S[a-z]+ A[a-z]+)(?: )
I would like it to work for any arbitrary text, not only for this specific one. I am not sure if I have expressed my problem properly.
You may use
import re
s = 'nUSA stands for northern USA'
key='nUSA'
x = rf'\b({key})\b.+?\b({key[0]}[a-z]*\s*{key[1:]})(?!\S)'
# => print(x) => \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S)
# Or, if the key can contain special chars at the end:
# x = rf'\b({re.escape(key)})(?!\w).+?(?<!\w)({re.escape(key[0])}[a-z]*\s*{re.escape(key[1:])})(?!\S)'
print(re.findall(x, s))
# => [('nUSA', 'northern USA')]
See the Python demo. The resulting regex will look like \b(nUSA)\b.+?\b(n[a-z]*\s*USA)(?!\S), see its demo. Details:
\b - word boundary
(nUSA) - Group 1 capturing the key word
\b / (?!\w) - word boundary (right-hand word boundary)
.+? - any 1+ chars other than linebreak chars as few as possible
\b - word boundary
(n[a-z]*\s*USA) - Group 2: n (first char), then any 0+ lowercase ASCII letters, 0+ whitespaces and the rest of the key string.
(?!\S) - a right-hand whitespace boundary (you may consider using (?!\w) again here).
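Since the question asks for something that works on arbitrary text, the escaped variant from the comment above could be wrapped in a function. This is a minimal sketch: the name `find_expansion` and the extra sample sentences are invented here, and the key is assumed to be non-empty.

```python
import re

def find_expansion(key, text):
    """Find (key, expansion) pairs, expanding only the first char of key."""
    first = re.escape(key[0])   # first char may be followed by lowercase letters
    rest = re.escape(key[1:])   # the rest of the key is matched literally
    pattern = rf'\b({re.escape(key)})(?!\w).+?(?<!\w)({first}[a-z]*\s*{rest})(?!\S)'
    return re.findall(pattern, text)

print(find_expansion('nUSA', 'nUSA stands for northern USA'))
# [('nUSA', 'northern USA')]
print(find_expansion('sEU', 'sEU means southern EU here'))
# [('sEU', 'southern EU')]
print(find_expansion('nU.S.', 'nU.S. stands for northern U.S. today'))
# [('nU.S.', 'northern U.S.')]
```

Escaping each part with `re.escape` keeps characters such as `.` in the key from being interpreted as regex syntax.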

Regex return match and extended matches

Can a regex return matches and extended matches? What I mean is a single regular expression that returns a different number of found elements depending on the structure. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return that as well. But if it is not there, it should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Where match1 only matches the first occurrence and match2 only the second. What would be the regex to work in both cases?
The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
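The pattern above returns each line as one whole match. If the separate fields listed in the question are wanted instead, one possible variation is to put the capturing groups back and make only the semicolon part optional. The grouping below is an assumption about the asker's goal, not part of the answer.

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Whole-match form from the answer: no capturing groups.
whole = re.findall(r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?', s)
print(whole)
# ['AB : CDE / 123.456.1; 1', 'AC : DEF / 3.1.2']

# Variation with capturing groups; the last group is None when there is no "; N" part.
fields_re = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')
rows = [[g for g in m.groups() if g is not None] for m in fields_re.finditer(s)]
print(rows)
# [['AB', 'CDE', '123.456.1', '1'], ['AC', 'DEF', '3.1.2']]
```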

Regular expression to capture split strings by numbers, including those with these symbols? (Why doesn't this expression work?)

I want to split a string by its digits, including leading dollar signs, decimals, and trailing percentage signs and parentheses. So in this example
text = 'this is a string (0.7000) $0.9 80% 900000 0.9 chars not numbers.'
I would want my output to look like this
['this is a string', '(0.7000)', '$0.9', '80%', '900000', '0.9', 'chars not numbers.']
According to https://regex101.com/, this should work:
(\(?\$?[\.0-9,]+[.%)]?)
However, when I run this script on my string:
splitText = re.split(r'(\(?\$?[\.0-9,]+[.%)]?)', text)
print(splitText)
I get an empty list.
I have implemented this function successfully in other areas of my script, and so I am not sure why this one doesn't work. Any guidance would be appreciated.
EDIT: Sorry guys, I'm a bit sleep deprived and miswrote my own problem. I didn't want to split the words into characters, I wanted to maintain the words and only split the numbers. I've updated the output to its correct form.
If you are sure your pattern is matching the right entities, all you need to add is a filter(None, results) to get rid of empty elements and add \s* around the pattern to "trim" out whitespace only chunks:
import re
text = 'this is a string (0.7000) $0.9 80% 900000 0.9 chars not numbers.'
print(list(filter(None, re.split(r"\s*(\(?\$?[0-9.,]+[.%)]?)\s*",text))))  # list() is needed in Python 3, where filter() returns an iterator
# => ['this is a string', '(0.7000)', '$0.9', '80%', '900000', '0.9', 'chars not numbers', '.']
See the Python demo and a regex demo.
Details:
\s* - 0+ whitespaces
(\(?\$?[0-9.,]+[.%)]?) - Group 1:
\(? - an optional (
\$? - an optional $
[0-9.,]+ - 1+ digits, . or ,
[.%)]? - an optional ., % or ) symbol
\s* - 0+ whitespaces
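The empty elements come from how re.split() handles a capturing group: the captured text is kept in the result, and two adjacent matches leave an empty string between them. A reduced sketch with a simplified digits-only pattern (chosen here only to keep the output short) shows the effect:

```python
import re

# Simplified stand-in for the number pattern: digits with optional surrounding spaces.
pieces = re.split(r'\s*(\d+)\s*', '1 2 end')
print(pieces)
# ['', '1', '', '2', 'end']

# filter(None, ...) drops the falsy (empty) strings; list() materializes it in Python 3.
kept = list(filter(None, pieces))
print(kept)
# ['1', '2', 'end']
```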
You can use re.findall() to get better results, with whitespace trimmed, without the need for a lot of post-processing gyrations.
(?s)\s*((?:(?!\(?\$?(?:\d+(?:\.\d*)?|\.\d+)[.%)]?).)+(?<!\s)|\(?\$?(?:\d+(?:\.\d*)?|\.\d+)[.%)]?)\s*
http://rextester.com/FKXM26376
Expanded
(?s)
\s*
(                              # (1 start)
    (?:
        (?!
            \(? \$?
            (?:
                \d+
                (?: \. \d* )?
              | \. \d+
            )
            [.%)]?
        )
        .
    )+
    (?<! \s )
  |
    \(? \$?
    (?:
        \d+
        (?: \. \d* )?
      | \. \d+
    )
    [.%)]?
)                              # (1 end)
\s*
Python
import re
text = 'this is a string (0.7000)$0.9 80% 900000 0.9 chars not numbers.'
findText = re.findall(r'(?s)\s*((?:(?!\(?\$?(?:\d+(?:\.\d*)?|\.\d+)[.%)]?).)+(?<!\s)|\(?\$?(?:\d+(?:\.\d*)?|\.\d+)[.%)]?)\s*', text)
print(findText)
Output
['this is a string', '(0.7000)', '$0.9', '80%', '900000', '0.9', 'chars not numbers.']
Running (python2)
import re
text = 'this is a string (0.7000) $0.9 80% 900000 0.9 chars not numbers.'
regex1 = r"(\(?\$?[0-9]+\.?[0-9]*\%?\)?)"
st1 = re.split( regex1, text )
st2 = list( s.strip() for s in st1 if s.strip() != "" )
print(st2)  # the parenthesized form works in both Python 2 and Python 3
gives (edited to fit width)
['this is a string', '(0.7000)', '$0.9', '80%', '900000',
'0.9', 'chars not numbers.']
Parts of the regex are (enclosed in parentheses so that they appear in the result)
\(? optional opening parenthesis
\$? optional dollar sign
[0-9]+ digits before decimal point (at least one)
\.? optional decimal point
[0-9]* optional digits after decimal point
\%? optional percentage sign
\)? optional closing parenthesis
After that, strip extra spaces and remove the empty strings to get your desired output.
