I want to extract, from each string in a list, the part that comes before a space followed by a number, in Python.
# INPUT
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
# EXPECTED OUTPUT
output = ['bits', 'scrap', 'bits and pieces', 'junk']
I managed to do this using re.sub or re.split:
output = [re.sub(" [0-9].*", "", t) for t in text]
# OR
output = [re.split(r' \d', t)[0] for t in text]
When I tried to use re.search and re.findall, I got None or an empty list for the strings that contain no number.
[re.search(r'(.*) \d', t) for t in text]
#[None, <_sre.SRE_Match object; span=(0, 7), match='scrap 1'>, None, <_sre.SRE_Match object; span=(0, 6), match='junk 3'>]
[re.findall(r'(.*?) \d', t) for t in text]
#[[], ['scrap'], [], ['junk']]
Can anyone help me with the regex that can return expected output for re.search and re.findall?
You may remove the digit-and-dot substrings at the end of the string only with
import re
text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']
print([re.sub(r'\s+\d+(?:\.\d+)*$', '', x) for x in text])
# => output = ['bits', 'scrap', 'bits and pieces', 'junk']
See the Python demo
The pattern is
\s+ - 1+ whitespaces (note: if those digits can be "glued" to some other text, replace + (one or more occurrences) with * quantifier (zero or more occurrences))
\d+ - 1 or more digits
(?:\.\d+)* - 0 or more sequences of
\. - a dot
\d+ - 1 or more digits
$ - end of string.
See the regex demo.
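To illustrate what the $ anchor buys you, here is a minimal sketch using the pattern above. The last two sample strings are made up for illustration and are not from the question: only a number at the very end of the string is stripped, while numbers in the middle are left alone.

```python
import re

# Pattern from the answer above: whitespace, digits, optional ".digits" parts, end of string
TRAILING_NUMBER = re.compile(r'\s+\d+(?:\.\d+)*$')

# The last two samples are hypothetical, added only to show the effect of $
samples = ['scrap 1.2', 'junk 3.4.2', 'abc 5.6 def 6.8.9', 'part 7 left alone']

cleaned = [TRAILING_NUMBER.sub('', s) for s in samples]
print(cleaned)
# ['scrap', 'junk', 'abc 5.6 def', 'part 7 left alone']
```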
To do the same with re.findall, you can use
# To get 'abc 5.6 def' (not 'abc') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d[\d.]*)?$', x)
# To get 'abc' (not 'abc 5.6 def') from 'abc 5.6 def 6.8.9'
re.findall(r'^(.*?)(?: \d.*)?$', x)
See this regex demo.
However, this regex is not efficient enough due to the .*? construct. Here,
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars (use re.DOTALL to match all) as few as possible (so that the next optional group could be tested at each position)
(?: \d[\d.]*)? - an optional non-capturing group matching
' ' - a space
\d - a digit
[\d.]* - zero or more digits or . chars
(or, in the second pattern, .* - any 0+ chars other than line break chars, as many as possible)
$ - end of string.
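Since the question also asks about re.search, here is a minimal sketch that reuses the second pattern with both functions. It assumes single-line strings like the ones in the question; for those, the pattern always matches (the number part is optional), so reading group(1) is safe.

```python
import re

text = ['bits', 'scrap 1.2', 'bits and pieces', 'junk 3.4.2']

# Group 1 is the part before the optional " <digit>..." tail
pattern = re.compile(r'^(.*?)(?: \d.*)?$')

# re.search: take group 1 of the match object
output_search = [pattern.search(t).group(1) for t in text]

# re.findall: with a single capturing group it returns a list of group values,
# and the ^...$ anchors leave exactly one item per string here
output_findall = [pattern.findall(t)[0] for t in text]

print(output_search)
print(output_findall)
```

Both lists come out as ['bits', 'scrap', 'bits and pieces', 'junk'].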
Related
I have a regex that parses US phone numbers into 3 strings.
import re
s = ' 916-2221111 '  # this also works: '(916) 222-1111 '
reg_ph = re.match(r'^\s*\(?(\d{3})\)?-? *(\d{3})-? *-?(\d{4})', s)
if reg_ph:
    return reg_ph.groups()
else:
    raise ValueError('not a valid phone number')
it works perfectly on the numbers:
'(916) 222-1111 '
' 916-2221111 '
Now I need to extend the regex so that a ValueError is raised for numbers such as
s = '916 111-2222' # there are white spaces between the area code and a local number and NO ')'
I tried
reg_ph = re.match(r'^\s*\(?(\d{3})\)?\s*-? *(\d{3})-? *-?(\d{4})', s)
reg_ph = re.match(r'^\s*\(?(\d{3})\)?s*-? *(\d{3})-? *-?(\d{4})', s)
but neither rejects the string in question.
I will greatly appreciate any ideas. I am very new to Regex!
In Python re you could use a conditional to check for group 1 having the opening parenthesis.
If that is the case match the closing parenthesis, optional spaces and 3 digits. Else match - and 3 digits.
If you use re.match you can omit ^
^\s*(\()?\d+(?(1)\)\s*\d{3}|-\d{3})-?\d{4}
If you want to match the whole string and trailing whitespace chars:
^\s*(\()?\d+(?(1)\)\s*\d{3}|-\d{3})-?\d{4}\s*$
In parts, the pattern matches:
^ Start of string
\s* Match optional whitespace chars
(\()? Optional group 1, match (
\d+ Match 1+ digits
(? Conditional
(1)\)\s*\d{3} If group 1 exist, match the closing ), optional whitespace chars and 3 digits
| Or
- Match a - (required when group 1 did not match)
\d{3} Match 3 digits
) close conditional
-?\d{4} Match optional - and 4 digits
See a regex demo
For example, using capture groups in the pattern to get the digits:
import re
strings = [' (916) 111-2222',' 916-2221111 ', '916 111-2222']
pattern =r'\s*(\()?(\d+)(?(1)\)\s*(\d{3})|-(\d{3}))-?(\d{4})\s*$'
for item in strings:
    m = re.match(pattern, item)
    if m:
        t = tuple(s for s in m.groups() if s is not None and s.isdigit())
        print(t)
    else:
        print("no match for " + item)
Output
('916', '111', '2222')
('916', '222', '1111')
no match for 916 111-2222
Python demo
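To tie this back to the question, which wants a ValueError for rejected input, the same pattern could be wrapped in a small function. This is only a sketch: the name parse_phone is hypothetical, and the error message is copied from the question.

```python
import re

# Pattern from the example above; group 1 records whether "(" was present
PHONE = re.compile(r'\s*(\()?(\d+)(?(1)\)\s*(\d{3})|-(\d{3}))-?(\d{4})\s*$')

def parse_phone(s):
    """Return the digit groups of s, or raise ValueError (hypothetical helper)."""
    m = PHONE.match(s)
    if m is None:
        raise ValueError('not a valid phone number')
    # Drop the "(" group and the conditional branch that did not participate
    return tuple(g for g in m.groups() if g is not None and g.isdigit())

print(parse_phone(' (916) 111-2222'))  # ('916', '111', '2222')
print(parse_phone(' 916-2221111 '))    # ('916', '222', '1111')
try:
    parse_phone('916 111-2222')
except ValueError as exc:
    print(exc)                         # not a valid phone number
```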
I am struggling to extend the following expression so that it also matches and removes digits:
[^\w -]|_|-(?!\w)|(?<!\w)-
Example:
123 !"§$%&/()= äüöüÄÖÜÄßßßß hello-123, hello-hello, hello-.
Expected Output:
äüöüÄÖÜÄßßßß hello hello-hello hello
You can use
-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-
-*\d+(?:\.\d+)?-*|[^\w -]|_|-(?!\w)|(?<!\w)-
See the regex demo.
The -?\d+-?| part matches
-? - an optional -
\d+ - one or more digits
-? - an optional -
| - or (the rest of the alternatives).
The -*\d+(?:\.\d+)?-* part matches float values, too, and matches zero or more hyphens on both ends of the number.
Replace - with \W to match any non-word char.
See the Python demo:
import re
text = '123 !"§$%&/()= äüöüÄÖÜÄßßßß hello-123, hello-hello, hello-.'
print( re.sub(r'-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-', '', text).strip() )
# => äüöüÄÖÜÄßßßß hello hello-hello hello
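To see which alternative handles which case, here is a small sketch that runs the same pattern over a few made-up inputs (the inputs are assumptions chosen for illustration, one per case): a hyphen between two word characters survives, while dangling hyphens, underscores, and digits with their adjacent hyphen are removed.

```python
import re

CLEAN = re.compile(r'-?\d+-?|[^\w -]|_|-(?!\w)|(?<!\w)-')

# Hypothetical inputs, one per hyphen/underscore/digit case
cases = ['hello-hello', 'hello-', '-hello', 'snake_case', 'hello-123']

result = [CLEAN.sub('', c) for c in cases]
print(result)
# ['hello-hello', 'hello', 'hello', 'snakecase', 'hello']
```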
I have a python string:
s = 'filename 13.00 50850.8732503344475 37.11 abc'
In order to find the second string with format nn.nn, I know I can do:
re.findall(r'.*(\b\d+\.\d+)',s)[0]
which finds:
'37.11'
But I want to replace it with 99.99.
I tried:
re.sub(r'.*(\b\d+\.\d+)','99.99',s)
But that just yields:
'99.99 abc'
whereas I want:
'filename 13.00 50850.8732503344475 99.99 abc'
Clearly I don't yet understand how regex works. Could someone offer help please?
You should capture what you need to keep and use the unambiguous replacement backreference in the replacement pattern:
s = re.sub(r'(.*)\b\d+\.\d+',r'\g<1>99.99', s)
See the Python demo and the regex demo.
Pattern details
(.*) - Group 1 (its value is referred to with \g<1> backreference from the replacement pattern): any 0+ chars other than line break chars as many as possible
\b - a word boundary
\d+ - 1+ digits
\. - a dot
\d+ - 1+ digits.
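The reason for \g<1> rather than \1 is that the replacement text starts with a digit. A minimal sketch of the difference, assuming the string from the question: with a plain \1 the digit that follows is read as part of the group number, and Python raises re.error because that group does not exist.

```python
import re

s = 'filename 13.00 50850.8732503344475 37.11 abc'

# \g<1> re-inserts the captured prefix; the literal 99.99 follows it
fixed = re.sub(r'(.*)\b\d+\.\d+', r'\g<1>99.99', s)
print(fixed)  # filename 13.00 50850.8732503344475 99.99 abc

# With \1 directly followed by "99.99", the group number is misread
try:
    re.sub(r'(.*)\b\d+\.\d+', r'\199.99', s)
    plain_backreference_works = True
except re.error:
    plain_backreference_works = False
print(plain_backreference_works)  # False
```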
Alternatively, you can replace last occurrence of simple pattern (\d+\.\d+):
s = "filename 13.00 50850.8732503344475 37.11 abc"
*_, last = re.finditer(r"(\d+\.\d+)", s)
s = s[:last.start()] + "99.99" + s[last.end():]
It's a bit faster. Results of a timeit benchmark:
re.finditer() -> 11.30306268
re.sub() -> 15.613837582000002
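One caveat with the unpacking form: `*_, last = re.finditer(...)` raises ValueError when the pattern does not match at all. A sketch of a guarded variant follows; the helper name replace_last is hypothetical, and the replacement is inserted as literal text, without backreference processing.

```python
import re

def replace_last(pattern, repl, s):
    """Replace only the last match of pattern in s (hypothetical helper)."""
    last = None
    for last in re.finditer(pattern, s):
        pass  # keep iterating; `last` ends up holding the final match
    if last is None:
        return s  # no match: return the input unchanged
    return s[:last.start()] + repl + s[last.end():]

s = 'filename 13.00 50850.8732503344475 37.11 abc'
print(replace_last(r'\d+\.\d+', '99.99', s))
print(replace_last(r'\d+\.\d+', '99.99', 'no numbers here'))
```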
Can a regex return matches and extended matches? What I mean is a single regex that can return a different number of found elements depending on the structure. My text is:
AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2
My return (match) should be:
'AB', 'CDE', '123.456.1', '1'
'AC', 'DEF','3.1.2'
So if there is a value after a semicolon, the regex should match and return that as well. But if it is not there, the regex should still match and return the rest.
My code is:
import re
s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''
match1 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)', s)
print(match1[0])
match2 = re.search(r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*', s)
print(match2[0])
Where match1 only matches the first occurrence and match2 only the second. What would be the regex to work in both cases?
The r'((?:AB|AC))\s*:\s*(\w+)\s*\/\s*([\w.]+)\s*(;\s*\d+)' pattern contains an obligatory (;\s*\d+) pattern at the end. You need to make it optional and you may do it by adding a ? quantifier after it, so as to match 1 or 0 occurrences of the subpattern.
With other minor enhancements, you may use
r'A[BC]\s*:\s*\w+\s*/\s*[\w.]+\s*(?:;\s*\d+)?'
Note all capturing groups are removed, and non-capturing ones are introduced since you only get the whole match value in the end.
Details
A[BC] - AB or AC
\s*:\s* - a colon enclosed with 0+ whitespace chars
\w+ - 1 or more word chars
\s*/\s* - a / enclosed with 0+ whitespace chars
[\w.]+ - 1 or more word or . chars
\s* - 0+ whitespaces
(?:;\s*\d+)? - an optional sequence of
; - a ;
\s* - 0+ whitespaces
\d+ - 1+ digits
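If the separate fields are needed, as in the expected output in the question, the capturing groups can be put back and the optional part kept non-capturing around an inner group. This is a sketch under that assumption, not part of the pattern above; unmatched optional groups come back as None and are filtered out.

```python
import re

s = '''AB : CDE / 123.456.1; 1
AC : DEF / 3.1.2'''

# Same structure as the pattern above, with capturing groups added back;
# the digits after ";" sit in an optional non-capturing wrapper
pattern = re.compile(r'(A[BC])\s*:\s*(\w+)\s*/\s*([\w.]+)\s*(?:;\s*(\d+))?')

rows = [tuple(g for g in m.groups() if g is not None)
        for m in pattern.finditer(s)]
for row in rows:
    print(row)
# ('AB', 'CDE', '123.456.1', '1')
# ('AC', 'DEF', '3.1.2')
```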
Having this multiline variable:
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
The structure is always TAG = CONTENT, both strings are NOT fixed and CONTENT could contain new lines.
I need a regex to get:
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\n'), ('PARALLEL', '4')]
Tried multiple combinations but I'm not able to stop the regex engine at the right point for TABLES tag as its content is a multiline string delimited by the next tag.
Some attempts from the interpreter:
>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]
>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\nPARALLEL = 4\n')]
Thanks!
You can use a positive lookahead to make sure you lazily match the value correctly:
(\w+)\s=\s(.+?)(?=$|\n[A-Z])
               ^^^^^^^^^^^^^
To be used with a DOTALL modifier so that a . could match a newline symbol. The (?=$|\n[A-Z]) lookahead will require .+? to match up to the end of string, or up to the newline followed with an uppercase letter.
See the regex demo.
An alternative, faster regex (as it is an unrolled version of the expression above) - but the DOTALL modifier should NOT be used with it:
(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)
See another regex demo
Explanation:
(\w+) - Group 1 capturing 1+ word chars
\s*=\s* - a = symbol wrapped with optional (0+) whitespaces
(.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
.* - any 0+ characters other than a newline
(?:\n(?![A-Z]).*)* - 0+ sequences of:
\n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
.* - any 0+ characters other than a newline
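A minimal sketch of the unrolled pattern in use. The sample input is shortened, and the indentation of its continuation lines is an assumption. One detail worth noting: the last value also swallows the trailing newline, because a newline at the very end of the string is not followed by an uppercase letter, so the sketch strips each value.

```python
import re

# Sample input in the question's TAG = CONTENT layout (continuation indentation assumed)
raw = "\n".join([
    "",
    "CONTENT = ALL",
    "TABLES = TEST.RAW_1",
    "  , TEST.RAW_2",
    "  , TEST.RAW_3",
    "PARALLEL = 4",
    "",
])

# Unrolled pattern from above; note that re.DOTALL is NOT passed
p = re.compile(r'(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)')

# Strip each value to drop the trailing newline captured for the last tag
pairs = [(tag, value.strip()) for tag, value in p.findall(raw)]
print(pairs)
# [('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n  , TEST.RAW_2\n  , TEST.RAW_3'), ('PARALLEL', '4')]
```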
Python demo:
import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
print(p.findall(raw))