How to select for the longest match from regex - python

I have a long string that I need to split into separate strings. I have written up a regex pattern shown below. The problem right now is that my long string is split into the smaller strings but with duplicates.
That is what my code looks like:
import re
teststring = '''#first-error-type: cjjyr901-d374-jfh73kf8k,
#second-err, #some-other-error : jksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k
#third-errortype cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k'
'''
new = re.sub(r'\((.+)\)', '', teststring)
#remove parentheses
new = re.sub(':', '', new)
#removing stray colons
new = re.sub(r'([#][\w]*(-[\w]*)*[,]*)', r'\1:', new)
#adding colons
new = re.split(r'(([#][\w]*)*(-[\w]*)*[, :]([\w]*[-, ]*)*)', new)
Due to the inconsistencies of the text, I have to do preliminary cleaning in the beginning to remove stray colons and then add them back in later.
Currently, this is my output:
['', '#first-error-type: cjjyr901-d374-jfh73kf8k, ', '#first', '-type', '', '\n', '#second-err,', '#second', '-err', '', '', ': ', None, None, '', '', '#some-other-error: jksdf89-123r-e3-1345r , 99f7yr901-374-jfh73kf8k ', '#some', '-error', '', '\n', '#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k', '#third', '-errortype', '', "'\n"]
The output I am expecting is:
['#first-error-type: cjjyr901-d374-jfh73kf8k',
'#second-err, #some-other-error: jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k',
'#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k']
It seems like I made a few mistakes in the pattern, so I am not grouping two hashtag comments together to add just one colon. Also, the output is split into duplicated segments of varying length.

You can use
import re
teststring = '''#first-error-type: cjjyr901-d374-jfh73kf8k,
#second-err, #some-other-error : jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k
#third-errortype cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k'
'''
new = re.sub(
r'(#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)?|,[^\S\n]*(?=\n)',
lambda x: f'{x.group(1)}: ' if x.group(1) else '',
teststring)
print( list(map(lambda x: x.strip(), re.findall(r'#\w+(?:-\w+)*[,: ][^"\'\r\n]*', new))) )
See the Python demo.
Output:
['#first-error-type: cjjyr901-d374-jfh73kf8k', '#second-err, #some-other-error: jksdf89-123r-e3-1345r (some kind of note), 99f7yr901-374-jfh73kf8k', '#third-errortype: cjjyr901-d374-jfh73kf8k, ksdf89-123r-e3-1345r, 99f7yr901-374-jfh73kf8k']
The (#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)?|,[^\S\n]*(?=\n) regex matches
(#\w+(?:-\w+)*)(?=[:\s]+\w)[^\S\n]*(?::[^\S\n]*)? -
(#\w+(?:-\w+)*) - Group 1: a # char followed with one or more word chars followed with zero or more repetitions of a - and again one or more word chars
(?=[:\s]+\w) - immediately to the right of the current location, there must be one or more whitespace or : chars followed by a word char
[^\S\n]* - zero or more whitespace chars other than LF
(?::[^\S\n]*)? - an optional occurrence of a : and zero or more whitespace chars other than LF
| - or
,[^\S\n]*(?=\n) - a comma and zero or more whitespace chars other than LF, followed by an LF char
The match is removed if Group 1 is not matched, else, we return Group 1 plus a colon with a space.
The #\w+(?:-\w+)*[,: ][^"\'\r\n]* regex used in re.findall matches
#\w+(?:-\w+)* - a #, one or more word chars, and zero or more occurrences of a - and one or more word chars
[,: ] - a space, comma or colon
[^"\'\r\n]* - zero or more chars other than ", ', CR and LF.
The map(lambda x: x.strip(),...) is used to strip whitespace, mainly from the end of the matches, since the [^"\'\r\n]* negated character class can match it freely.
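As a side note on the duplicated segments in the question's output: re.split adds the text of every capturing group to the result (None for a group that did not take part in the match), which is most likely where the extra '#first', '-type' and None entries came from. A minimal sketch on a made-up string (the input and variable names are illustrative, not from the question):

```python
import re

text = "a, b,c"  # illustrative input, not from the question

# Capturing groups: each group's text (or None) is added to the result list.
with_groups = re.split(r"(,)( )?", text)

# Non-capturing groups: only the text between the delimiters is returned.
without_groups = re.split(r"(?:,)(?: )?", text)

print(with_groups)
print(without_groups)
```

Switching the groups to (?:...) is usually enough when the delimiters themselves are not needed in the result.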

Related

Python Split Regex not split what I need

I have this in my file
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"#[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)
print(regex)
I want to split all words that start with #, so like this:
["Name: ", "#s", "\nOwner: ", "#a[tag=Admin]"]
But instead it gives this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
How can I separate it?
I would use re.findall here:
sample = """Name: #s
Owner: #a[tag=Admin]"""
parts = re.findall(r'#\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts) # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
The regex pattern used here says to match:
#\w+ a tag #some_tag
(?:\[.*?\])? followed by an optional [...] term
| OR
\s*\S+\s* any other non whitespace term,
including optional whitespace on both sides
If I understand the requirements correctly you could do that as follows:
import re
s = """Name: #s
Owner: #a[tag=Admin]
"""
rgx = r'(?=#.*)|(?=\r?\n[^#\r\n]*)'
re.split(rgx, s)  # Python 3.7+ (older versions do not split on empty matches)
#=> ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]', '\n']
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
#.* # match '#' followed by >= 0 chars other than line terminators
) # end positive lookahead
| # or
(?= # begin a positive lookahead
\r?\n # match a line terminator
[^#\r\n]* # match >= 0 characters other than '#' and line terminators
) # end positive lookahead
Notice that matches are zero-width.
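A smaller illustration of splitting on a zero-width match, assuming Python 3.7 or later. The sample string is made up for the illustration:

```python
import re

line = "Owner: #a #b"  # illustrative input

# The lookahead matches the empty position before each '#',
# so no characters are consumed by the split.
parts = re.split(r"(?=#)", line)
print(parts)  # ['Owner: ', '#a ', '#b']
```

Because nothing is consumed, joining the parts gives back the original string.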
re.split expects the regular expression to match the delimiters in the string. It only returns the parts of the delimiters which are captured. In the case of your regex, that's only the part between the brackets, if present.
If you want the whole delimiter to show up in the list, put parentheses around the whole regex:
target = r"(#[sae](\[[\w{}=, ]*\])?)"
But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"
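To compare the three variants side by side, here is a small sketch using the sample from the question. The variable names are illustrative, and dropping empty strings at the end is an extra step that the answer above does not mention; it is needed here because the last delimiter sits at the very end of the string.

```python
import re

sample = """Name: #s
Owner: #a[tag=Admin]"""

inner_only = r"#[sae](\[[\w{}=, ]*\])?"      # only the optional [...] is captured
both = r"(#[sae](\[[\w{}=, ]*\])?)"          # whole delimiter and inner [...] captured
whole_only = r"(#[sae](?:\[[\w{}=, ]*\])?)"  # only the whole delimiter is captured

for pattern in (inner_only, both, whole_only):
    print(re.split(pattern, sample))

# Drop the empty string left after the final delimiter.
parts = [p for p in re.split(whole_only, sample) if p]
print(parts)  # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
```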
In your output, you keep the [tag=Admin] as that part is in a capture group, and using split can also return empty strings.
Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.
(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)
The pattern matches:
( Capture group 1
\s*\w+:\s* Match 1+ word characters and : between optional whitespace chars
) Close group
( Capture group 2
#[sae] Match # followed by one of s, a or e
(?:\[[\w{}=, ]*])? Optionally match [...]
) Close group
Example code:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)"
listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst)
Output
['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
See a regex demo and a Python demo.

Splitting a string by multiple possible delimiters

I want to parse a str into a list of float values, however I want to be flexible regarding my delimiters. Specifically, I would like to be able to use any of these
s = '3.14; 42.2' # delimiter is '; '
s = '3.14;42.2' # delimiter is ';'
s = '3.14, 42.2' # delimiter is ', '
s = '3.14,42.2' # delimiter is ','
s = '3.14 42.2' # delimiter is ' '
I thought about removing all spaces, but this would disable the last version; I tried the re.split()-function by doing re.split('[;, ]', s) which would work using a single character as delimiter but fails otherwise.
I can however do
s.replace('; ', ';').replace(', ', ';').replace(',', ';').replace(' ', ';')
s.split(';')
which works but seems not really like a good practice or useful - especially if I would add even more delimiters in the future. What would be a good approach to do this?
You can use re.split and split on (The [ ] is a space and the brackets are for display only)
[;,] ?|[ ]
The pattern matches
[;,] ? Match either ; or , followed by an optional space
| or
[ ] Match a single space
A slightly stricter pattern uses lookarounds to assert a digit on both sides of the delimiter.
(?<=\d)(?:[;,] ?| )(?=\d)
The pattern matches:
(?<=\d) Positive lookbehind, assert a digit to the left
(?: Non capture group for the alternation
[;,] ? Match either ; or , followed by an optional space
| Or
Match a space
) Close non capture group
(?=\d) Positive lookahead, assert a digit to the right
Example code
import re
strings = [
"3.14; 42.2",
"3.14;42.2",
"3.14, 42.2",
"3.14,42.2",
"3.14 42.2"
]
for s in strings:
    print(re.split(r"[;,] ?| ", s))
Output
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
['3.14', '42.2']
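Since the goal in the question is a list of float values, the split can be combined with float(). A minimal sketch, where parse_floats is a hypothetical helper name and the input is assumed to contain only numbers separated by one of the delimiters above:

```python
import re

def parse_floats(s):
    """Split on ';' or ',' (optionally followed by a space) or on a single space."""
    return [float(part) for part in re.split(r"[;,] ?| ", s.strip())]

for s in ["3.14; 42.2", "3.14;42.2", "3.14, 42.2", "3.14,42.2", "3.14 42.2"]:
    print(parse_floats(s))  # [3.14, 42.2] each time
```

A doubled space or a trailing delimiter would produce an empty piece and make float() raise ValueError, which may or may not be the behavior you want.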
I think you can account for the last space(s) like this:
re.split(r'[;,]\s*', s)
Here \s* consumes any spaces after the separator. Note that this pattern does not split on a space alone, so '3.14 42.2' would stay in one piece.
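You can also just list the delimiters as alternatives, with the longer ones first (alternatives are tried left to right, so ',' ahead of ', ' would leave an empty string in the result):
res = re.split('; |, |;|,| ', s)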
see https://www.geeksforgeeks.org/python-split-multiple-characters-from-string/
Assuming you would know the delimiter of the input ahead of time, you could write a function that takes your delimiter as an argument, replaces with a space, and splits it:
def split_on_delim(strng, delim):
    return strng.replace(delim, ' ').split()
for example:
>>> s = '3.14; 42.2'
>>> split_on_delim(s, '; ')
['3.14', '42.2']
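If the set of delimiters may grow, one option is to build the pattern from a list. This is a sketch rather than an established recipe; split_on_any and its parameters are hypothetical names. The delimiters are escaped with re.escape and sorted longest first so that '; ' is tried before ';'.

```python
import re

def split_on_any(s, delimiters):
    # Longest delimiters first, so '; ' is tried before ';' and ' '.
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, s)

delims = [";", "; ", ",", ", ", " "]
for s in ["3.14; 42.2", "3.14;42.2", "3.14, 42.2", "3.14,42.2", "3.14 42.2"]:
    print(split_on_any(s, delims))  # ['3.14', '42.2'] each time
```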

Regex pattern to find n non-space characters of x length after a certain substring

I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after the 'cig' contained in each line of my dataframe. This pattern covers all cases except the ones where that 10-character substring contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow seems to have collapsed the whitespace in the string above; there should be 17 whitespace characters between F and 2, after CIG.
Could you help me to edit the regex pattern in order to account for the white spaces in that 10-characters substring? I am also using flags=re.I to ignore the case of the strings in my re.findall calls.
To give an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
and it outputs what I want: 7826328A2B.
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed with a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
    print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))
# => Z9F27D2198 found at (32, 57)
See the Python demo.
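For use on many lines (for example, one call per dataframe row), the same pattern could be wrapped in a small helper. This is only a sketch: extract_cig and CIG_PATTERN are hypothetical names, and the wide gap is rebuilt with 17 spaces as described in the question.

```python
import re

CIG_PATTERN = re.compile(r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)', re.I)

def extract_cig(text):
    """Return the 10-character code after 'CIG' with inner whitespace removed, or None."""
    match = CIG_PATTERN.search(text)
    if match is None:
        return None
    return re.sub(r'\s+', '', match.group(1))

wide_gap = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"
print(extract_cig(wide_gap))                                       # Z9F27D2198
print(extract_cig("CIG7826328A2B FORNITURA ENERGIA ELETTRICA U"))  # 7826328A2B
print(extract_cig("no code here"))                                 # None
```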
What about:
# removes all white spaces with replace()
x = 'CIG7826328A2B FORNITURA ENERGIA ELETTRICA U'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = '7826328A2B'
x = '/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031'.replace(' ', '')
x = x.split("CIG")[1][:10]
# x = 'Z9F27D2198'
This works fine if there is only one "CIG" in the string.

remove empty quotes from string using regex

I want to remove punctuation such as " ", ' ', ` `, "", '', `` from my string using regex. The code I've written so far only removes the ones with a space between them. How do I remove the empty ones such as ''?
#Code
s = "hey how ' ' is the ` ` what are '' you doing `` how is everything"
s = re.sub("' '|` `|" "|""|''|``","",s)
print(s)
My expected outcome:
hey how is the what are you doing how is everything
You may use this regex to match all such quotes:
r'([\'"`])\s*\1\s*'
Code:
>>> s = "hey how ' ' is the ` ` what are '' you doing `` how is everything"
>>> print (re.sub(r'([\'"`])\s*\1\s*', '', s))
hey how is the what are you doing how is everything
RegEx Details:
([\'"`]): Match one of the given quotes and capture it in group #1
\s*: Match 0 or more whitespaces
\1: Using back-reference of group #1 make sure we match same closing quote
\s*: Match 0 or more whitespaces
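To see what the back-reference buys you, here is a short sketch with two made-up inputs: a pair made of the same quote character is removed, while two different quote characters next to each other are left alone.

```python
import re

pattern = r'([\'"`])\s*\1\s*'

matched_pair = re.sub(pattern, '', "say '' hello")  # same quote twice: removed
mixed_pair = re.sub(pattern, '', "say '\" hello")   # different quotes: left alone

print(matched_pair)  # say hello
print(mixed_pair)    # say '" hello
```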
In this case, why not match all word characters, and then join them?
' '.join(re.findall(r'\w+', s))
# 'hey how is the what are you doing how is everything'

Python regex : adding space after comma only if not followed by a number

I want to add spaces before and after commas in a string only if the following character isn't a digit (0-9). I tried the following code:
newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
But it seems like the \1 is taking the 2 matching characters and not just the comma.
Example:
>>> newLine = "abc,abc"
>>> newLine = re.sub(r'([,]+[^0-9])', r' \1 ', newLine)
"abc ,a bc"
Expected Output:
"abc , abc"
How can I tell the sub to take only the 'comma' ?
Use this one:
newLine = re.sub(r'[,]+(?![0-9])', r' , ', newLine)
Here using negative lookahead (?![0-9]) it is checking that the comma(s) are not followed by a digit.
Your regex didn't work because you picked the comma and the next character(using ([,]+[^0-9])) in a group and placed space on both sides.
UPDATE: If you need to handle other characters besides the comma, place them all inside the character class [] and capture them in group \1 using ():
newLine = re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', newLine)
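A quick check of both patterns on made-up inputs (the helper names space_commas and space_separators are hypothetical): separators followed by a digit are left untouched, everything else is padded with spaces.

```python
import re

def space_commas(line):
    # Pad runs of commas with spaces unless a digit follows.
    return re.sub(r'[,]+(?![0-9])', r' , ', line)

def space_separators(line):
    # Same idea for commas, slashes and backslashes; \1 re-inserts what was matched.
    return re.sub(r'([,/\\]+)(?![0-9])', r' \1 ', line)

print(space_commas("abc,abc"))        # abc , abc
print(space_commas("total 1,000"))    # total 1,000
print(space_separators("a/b,c 1/2"))  # a / b , c 1/2
```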
