split the captured group results in re.sub() - python

InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="(\S+)">(.+?)</ENAMEX>'
p2 = '_'.join(r'\2'.split(' '))
plain_text = re.sub(p1,p2,InputString)
Expected output:
On August_17 , Tai_wan is investigation department.
Unfortunately, I get the result:
On August 17 , Tai wan is investigation department.
How can I split the captured group '\2'?

It seems you want to just replace the matches with the second group (text between ENAMEX tags) and replace all spaces with _.
You may use
import re
InputString = r'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , <ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
p1 = r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>'
plain_text = re.sub(p1,lambda p2: p2.group(1).replace(' ', '_'),InputString)
print(plain_text)
# => On August_17 , Tai_wan is investigation department.
See the Python demo.
Here, <ENAMEX TYPE="[^"]+">(.*?)</ENAMEX> matches <ENAMEX TYPE=", any 1+ chars other than " up to and including a ", then matches a > and then captures any 0+ chars other than line break chars into Group 1. Then, </ENAMEX> substring is matched. The lambda expression only pastes back the contents of Group 1 with literal spaces replaced with underscores. Note you may use re.sub(r'\s', '_', p2.group(1)) in case you want to replace any whitespace char with an underscore.
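The original attempt fails because '_'.join(r'\2'.split(' ')) is evaluated before re.sub ever runs: the template text \2 contains no space, so p2 is still just \2. A minimal sketch of the difference, reusing the question's input (the helper name underscore_spaces is only illustrative):

```python
import re

input_string = (
    'On <ENAMEX TYPE="DATE">August 17</ENAMEX> , '
    '<ENAMEX TYPE="GPE">Tai wan</ENAMEX> is investigation department.'
)
pattern = r'<ENAMEX TYPE="[^"]+">(.*?)</ENAMEX>'

# The split/join runs on the two-character template text, not on the match,
# so the replacement string is unchanged.
static_replacement = '_'.join(r'\2'.split(' '))
assert static_replacement == r'\2'


def underscore_spaces(match):
    # Hypothetical helper: re.sub calls it once per match with the match object.
    return match.group(1).replace(' ', '_')


plain_text = re.sub(pattern, underscore_spaces, input_string)
print(plain_text)
```

Passing a function (or lambda) as the replacement is what lets the code see the actual captured text.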

Related

Python Split Regex not split what I need

I have this in my file
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"#[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)
print(regex)
I want to split all words that start with #, so like this:
["Name: ", "#s", "\nOwner: ", "#a[tag=Admin]"]
But instead it gives this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
How can I separate it?
I would use re.findall here:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
parts = re.findall(r'#\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts) # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
The regex pattern used here says to match:
#\w+ a tag #some_tag
(?:\[.*?\])? followed by an optional [...] term
| OR
\s*\S+\s* any other non whitespace term,
including optional whitespace on both sides
If I understand the requirements correctly you could do that as follows:
import re
s = """Name: #s
Owner: #a[tag=Admin]
"""
rgx = r'(?=#.*)|(?=\r?\n[^#\r\n]*)'
re.split(rgx, s)  # splitting on zero-width matches requires Python 3.7+
#=> ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]', '\n']
Demo
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
#.* # match '#' followed by >= 0 chars other than line terminators
) # end positive lookahead
| # or
(?= # begin a positive lookahead
\r?\n # match a line terminator
[^#\r\n]* # match >= 0 characters other than '#' and line terminators
) # end positive lookahead
Notice that matches are zero-width.
re.split expects the regular expression to match the delimiters in the string. It only returns the parts of the delimiters which are captured. In the case of your regex, that's only the part between the brackets, if present.
If you want the whole delimiter to show up in the list, put parentheses around the whole regex:
target = r"(#[sae](\[[\w{}=, ]*\])?)"
But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"
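As a quick check of the non-capturing variant, here is a small sketch using the question's sample. Because the last delimiter ends the string, re.split still appends an empty string, which is filtered out here (the filtering step is an addition, not part of the answer above):

```python
import re

sample = "Name: #s\nOwner: #a[tag=Admin]"
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"

# The outer group makes re.split keep each whole delimiter in the result.
parts = re.split(target, sample)

# Drop the empty string produced after the final delimiter.
non_empty = [part for part in parts if part]
print(non_empty)
```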
In your output, you keep the [tag=Admin] as that part is in a capture group, and using split can also return empty strings.
Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.
(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)
The pattern matches:
( Capture group 1
\s*\w+:\s* Match 1+ word characters and : between optional whitespace chars
) Close group
( Capture group 2
#[sae] Match # followed by either s a e
(?:\[[\w{}=, ]*])? Optionally match [...]
) Close group
Example code:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)"
listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst)
Output
['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
See a regex demo and a Python demo.

Python Regex is not matching the first line

I have a text file and the content is,
Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15
I have written a Regex Pattern, to extract the value between first , and second ,
^\w+,(\w+),.*$
My Python code is
import re
f=r'sample.txt'
rePat = re.compile('^\w+,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print (f)
print (output)
Expected Output:
Assigned
10
20
30
But I am getting
10
20
30
Why is it missing the first line?
The problem is that \w+ only matches word chars (letters, digits and underscores), not spaces. The first field of the header line, Submitted By, contains a space, so ^\w+, cannot match that line. I suggest matching any chars between commas with [^,\n]+ (the \n here makes sure we stay within the same line).
You can use
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
Or, a bit simplified if you do not need to extract anything else:
rePat = re.compile(r'^[^,\n]+,([^,\n]+)', re.MULTILINE)
See this regex demo. Details:
^ - start of a line
[^,\n]+ - one or more chars other than , and LF
, - a comma
([^,\n]+) - Group 1: one or more chars other than , and LF.
See a Python demo:
import re
text = r"""Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15"""
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
output = re.findall(rePat, text)
print (output)
# => ['Assigned', '10', '20', '30']
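To see the cause in isolation, the following sketch runs both patterns against the header line and one data row. The variable names are illustrative only:

```python
import re

header = "Submitted By,Assigned,Closed"
row = "Name1,10,5"

word_only = re.compile(r'^\w+,(\w+),')        # the question's approach
any_field = re.compile(r'^[^,\n]+,([^,\n]+)')  # the suggested fix

# \w+ stops at the space in "Submitted By", so the comma is never reached.
header_word = word_only.search(header)
row_word = word_only.search(row)

# [^,\n]+ accepts the space, so the header line matches too.
header_any = any_field.search(header)

print(header_word, row_word.group(1), header_any.group(1))
```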
You could add matching optional spaces and word characters after the first \w+ to match till the first comma.
^\w+(?: \w+)*,(\w+),.*$
^ Start of a line (because of re.MULTILINE)
\w+ Match 1+ word chars
(?: \w+)* Optionally repeat matching a space and 1+ word chars
,(\w+), Match a comma, capture 1+ word chars in group 1, and match another comma
.*$ Match the rest of the line (you could omit this part)
Regex demo
import re
f = r'sample.txt'
rePat = re.compile(r'^\w+(?: \w+)*,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print(output)
Output
['Assigned', '10', '20', '30']

Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern r'cig[\s:.]*(\w{10})' to extract the 10 characters after 'cig' in each line of my dataframe. This pattern handles all cases except those where the 10-character substring itself contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow collapsed the whitespace in the string above; there should be 17 whitespace characters between F and 2, after CIG.
Could you help me to edit the regex pattern in order to account for the white spaces in that 10-characters substring? I am also using flags=re.I to ignore the case of the strings in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed with a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
    print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))
# => Z9F27D2198  found at  (32, 43)
# (end offset 43 is for the single space shown in text; a longer run of spaces moves it right)
See the Python demo.
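For use on many lines (for example, each row of a dataframe column), the same pattern can be wrapped in a small helper. This is only a sketch: extract_cig is a hypothetical name, and the 17-space sample is rebuilt from the description in the question.

```python
import re

CIG_PATTERN = re.compile(r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)', re.I)


def extract_cig(line):
    """Return the 10 non-space chars after 'cig' (inner whitespace removed), or None."""
    match = CIG_PATTERN.search(line)
    if match is None:
        return None
    return re.sub(r'\s+', '', match.group(1))


# Sample with a run of 17 spaces inside the code, as described in the question.
spaced = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"
compact = "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E"

print(extract_cig(spaced))
print(extract_cig(compact))
```

With pandas, such a helper could be passed to Series.apply, assuming the column holds strings.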
What about:
# removes all white spaces with replace()
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
This works if the "CIG" you want is the first one in the string; note that str.split is case-sensitive, so a lowercase "cig" is not found.

Python Regex, optional word in brackets?

I have a quick question on regex, I have a certain string to match. It is shown below:
"[someword] This Is My Name 2010"
or
"This Is My Name 2010"
or
"(someword) This Is My Name 2010"
Basically if given any of the strings above, I want to only keep "This Is My Name" and "2010".
What I have now, which I will use result = re.search and then result.group() to get the answer:
'[\]\)]? (.+) ([0-9]{4})\D'
Basically it works with the first and third case, by allowing me to optionally match the end bracket, have a space character, and then match "This Is My Name".
However, with the second case, it only matches "Is My Name". I think this is because of the space between the '?' and '(.+)'.
Is there a way to deal with this issue in pure regex?
One way I can think of is to add an "if" statement to determine if the word starts with a [ or ( before using the appropriate regex.
The pattern that you tried [\]\)]? (.+) ([0-9]{4})\D optionally matches a closing square bracket or parenthesis. Adding the \D at the end, it expects to match any character that is not a digit.
You can optionally match the (...) or [...] part before the first capturing group, as [])] only matches the optional closing one.
Then you can capture all that follows in group 1, followed by matching the last 4 digits in group 2 and add a word boundary.
(?:\([^()\n]*\) |\[[^][\n]*\] )?(.+) ([0-9]{4})\b
(?: Non-capturing group
\([^()\n]*\)  Match (...) and a space
| Or
\[[^][\n]*\]  Match [...] and a space
)? Close the group and make it optional
(.+)  Capture group 1: match 1+ chars other than a newline, followed by a space
([0-9]{4})\b Capture group 2: match 4 digits followed by a word boundary
Regex demo
Note that .+ will match until the end of the line and then backtrack to the last occurrence of a space followed by 4 digits. If you want the first occurrence instead, you could make it non-greedy with .+?
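A short sketch applying this pattern to the three strings from the question. The closing bracket inside the character class is escaped here for readability, and re.match is an assumption that the optional bracketed word sits at the start of the string:

```python
import re

pattern = re.compile(r'(?:\([^()\n]*\) |\[[^\][\n]*\] )?(.+) ([0-9]{4})\b')

samples = [
    "[someword] This Is My Name 2010",
    "This Is My Name 2010",
    "(someword) This Is My Name 2010",
]

# match() anchors at the start, so the optional bracketed prefix is consumed first.
results = [pattern.match(s).groups() for s in samples]
for name, year in results:
    print(name, year)
```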
You can use re.sub to replace the first portion of the sentence if it starts with (square or round) brackets, with an empty string. No if statement is needed:
import re
s1 = "[someword] This Is My Name 2010"
s2 = "This Is My Name 2010"
s3 = "(someword) This Is My Name 2010"
reg = r'\[.*?\] |\(.*?\) '
res1 = re.sub(reg, '', s1)
print(res1)
res2 = re.sub(reg, '', s2)
print(res2)
res3 = re.sub(reg, '', s3)
print(res3)
OUTPUT
This Is My Name 2010
This Is My Name 2010
This Is My Name 2010

Python multiline regex delimiter

Having this multiline variable:
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
The structure is always TAG = CONTENT, both strings are NOT fixed and CONTENT could contain new lines.
I need a regex to get:
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\n'), ('PARALLEL', '4')]
Tried multiple combinations but I'm not able to stop the regex engine at the right point for TABLES tag as its content is a multiline string delimited by the next tag.
Some attempts from the interpreter:
>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]
>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\nPARALLEL = 4\n')]
Thanks!
You can use a positive lookahead to make sure you lazily match the value correctly:
(\w+)\s=\s(.+?)(?=$|\n[A-Z])
               ^^^^^^^^^^^^^
To be used with a DOTALL modifier so that a . could match a newline symbol. The (?=$|\n[A-Z]) lookahead will require .+? to match up to the end of string, or up to the newline followed with an uppercase letter.
See the regex demo.
An alternative, faster regex (as it is an unrolled version of the expression above), but the DOTALL modifier should NOT be used with it:
(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)
See another regex demo
Explanation:
(\w+) - Group 1 capturing 1+ word chars
\s*=\s* - a = symbol wrapped with optional (0+) whitespaces
(.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
.* - any 0+ characters other than a newline
(?:\n(?![A-Z]).*)* - 0+ sequences of:
\n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
.* - any 0+ characters other than a newline
Python demo:
import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
print(p.findall(raw))
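The unrolled alternative can be tried the same way, without re.DOTALL. In this sketch the input is a shortened variant of the question's data with an assumed two-space indent on the continuation line. Because the input ends with a newline, the last value picks up that trailing newline, so the values are right-stripped (that step is an addition here):

```python
import re

raw = (
    "\n"
    "CONTENT = ALL\n"
    "TABLES = TEST.RAW_1\n"
    "  , TEST.RAW_2\n"
    "PARALLEL = 4\n"
)

# No DOTALL: '.' stays within a line, and '\n(?![A-Z])' decides whether to continue.
unrolled = re.compile(r'(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)')

pairs = [(tag, value.rstrip()) for tag, value in unrolled.findall(raw)]
print(pairs)
```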
