how to split very long regular expression in python

how to split very long regular expression in python - python

i have a regular expression which is very long.
vpa_pattern = '(VAP) ([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}): (.*)'
My code to match group as follows:
class ReExpr:
def __init__(self):
self.string=None
def search(self,regexp,string):
self.string=string
self.rematch = re.search(regexp, self.string)
return bool(self.rematch)
def group(self,i):
return self.rematch.group(i)
m = ReExpr()
if m.search(vpa_pattern,line):
print m.group(1)
print m.group(2)
print m.group(3)
I tried to make the regular expression pattern to multiple line in following ways,
vpa_pattern = '(VAP) \
([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):\
(.*)'
Or Even i tried:
vpa_pattern = re.compile(('(VAP) \
([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):\
(.*)'))
But above methods are not working. For each group i have a space () after open and close parenthesis. I guess it is not picking up when i split to multiple lines.

Look at re.X flag. It allows comments and ignores white spaces in regex.
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)

Python allows writing text strings in parts if enclosed in parenthesis:
>>> text = ("alfa" "beta"
... "gama")
...
>>> text
'alfabetagama'
or in your code:
text = ("alfa" "beta"
"gama" "delta"
"omega")
print text
will print
"alfabetagamadeltaomega"

Its actually quite simple. You already use the {} notation. Use it again. So instead of:
'([0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}):'
which is just a repeat of [0-9A-Fa-f]{2}: 6 times, you can use:
'([0-9A-Fa-f]{2}:){6}'
We can even simplify it further by using \d to represent digits:
'([\dA-Fa-f]{2}:){6}'
NOTE: Depending on what re function you use, you can pass in re.IGNORE_CASE and simplify that chunk down to [\da-f]{2}:
So your final regex is:
'(VAP) ([\dA-Fa-f]{2}:){6} (.*)'

Related

How to parse values appear after the same string in python?

I have a input text like this (actual text file contains tons of garbage characters surrounding these 2 string too.)
(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)
I am trying to parse the text to store something like this:
value1="xxx" and value2="yyy".
I wrote python code as follows:
value1_start = content.find('value')
value1_end = content.find(';', value1_start)
value2_start = content.find('value')
value2_end = content.find(';', value2_start)
print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])
But it always returns:
value=xxx
value=xxx
Could anyone tell me how can I parse the text so that the output is:
value=xxx
value=yyy

Use a regex approach:
re.findall(r'\bvalue=[^;]*', s)
Or - if value can be any 1+ word (letter/digit/underscore) chars:
re.findall(r'\b\w+=[^;]*', s)
See the regex demo
Details:
\b - word boundary
value= - a literal char sequence value=
[^;]* - zero or more chars other than ;.
See the Python demo:
import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)

Use regex to filter the data you want from the "junk characters":
>>> import re
>>> _input = '#4#5%value=xxx38u952035983049;3^&^*(^%$3value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
print(match)
value=xxx
value=yyy
>>>
Summary or the regular expression:
[a-zA-Z0-9]+: One or more alphanumeric characters
=: literal equal sign
[a-zA-Z0-9]+: One or more alphanumeric characters

For this input:
content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'
use a simple regex and manually strip off the first and last two characters:
import re
values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
print(value)
Output:
value=xxx
value=yyy
Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.

You already have good answers based on the re module. That would certainly be the simplest way.
If for any reason (perfs?) you prefere to use str methods, it is indeed possible. But you must search the second string past the end of the first one :
value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)

Regular Expression (find matching characters in order)

Let us say that I have the following string variables:
welcome = "StackExchange 2016"
string_to_find = "Sx2016"
Here, I want to find the string string_to_find inside welcome using regular expressions. I want to see if each character in string_to_find comes in the same order as in welcome.
For instance, this expression would evaluate to True since the 'S' comes before the 'x' in both strings, the 'x' before the '2', the '2' before the 0, and so forth.
Is there a simple way to do this using regex?

Your answer is rather trivial. The .* character combination matches 0 or more characters. For your purpose, you would put it between all characters in there. As in S.*x.*2.*0.*1.*6. If this pattern is matched, then the string obeys your condition.
For a general string you would insert the .* pattern between characters, also taking care of escaping special characters like literal dots, stars etc. that may otherwise be interpreted by regex.

This function might fit your need
import re
def check_string(text, pattern):
return re.match('.*'.join(pattern), text)
'.*'.join(pattern) create a pattern with all you characters separated by '.*'. For instance
>> ".*".join("Sx2016")
'S.*x.*2.*0.*1.*6'

Use wildcard matches with ., repeating with *:
expression = 'S.*x.*2.*0.*1.*6'
You can also assemble this expression with join():
expression = '.*'.join('Sx2016')
Or just find it without a regular expression, checking whether the location of each of string_to_find's characters within welcome proceeds in ascending order, handling the case where a character in string_to_find is not present in welcome by catching the ValueError:
>>> welcome = "StackExchange 2016"
>>> string_to_find = "Sx2016"
>>> try:
... result = [welcome.index(c) for c in string_to_find]
... except ValueError:
... result = None
...
>>> print(result and result == sorted(result))
True

Actually having a sequence of chars like Sx2016 the pattern that best serve your purpose is a more specific:
S[^x]*x[^2]*2[^0]*0[^1]*1[^6]*6
You can obtain this kind of check defining a function like this:
import re
def contains_sequence(text, seq):
pattern = seq[0] + ''.join(map(lambda c: '[^' + c + ']*' + c, list(seq[1:])))
return re.search(pattern, text)
This approach add a layer of complexity but brings a couple of advantages as well:
It's the fastest one because the regex engine walk down the string only once while the dot-star approach go till the end of the sequence and back each time a .* is used. Compare on the same string (~1k chars):
Negated class -> 12 steps
Dot star -> 4426 step
It works on multiline strings in input as well.
Example code
>>> sequence = 'Sx2016'
>>> inputs = ['StackExchange2015','StackExchange2016','Stack\nExchange\n2015','Stach\nExchange\n2016']
>>> map(lambda x: x + ': yes' if contains_sequence(x,sequence) else x + ': no', inputs)
['StackExchange2015: no', 'StackExchange2016: yes', 'Stack\nExchange\n2015: no', 'Stach\nExchange\n2016: yes']

RegEx to match a term before OR after another specific term

I'm looking for a squaremeter term in some kind of text using this RegExpression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this thing should only be matched if before OR after it, a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists. In other words: the latter term is mandatory, however its positon is not.
Writing a RegEx for this is not the issue in general, my problem is how to write it most elegant. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
new search, kombined with 'or' statement (I'm using Python)
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)

You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is just not possible to use the restrictive strings optionally in the regex on both sides, the will just be ignored.
See the regex demo
IDEONE demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
m = pat.search(x)
if m:
if m.group(1): # First alternative found a match
print("{}".format(m.group(1), " - ", m.group(2)))
else: # Second alternative "won"
print("{}".format(m.group(3), " - ", m.group(4)))

Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap ( if they did, you'd get a false positive ). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
#
pa = re.compile ( r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]' )
pt = re.compile ( r'[.]*[Wohnfläche|Wohnfl|Wfl]' )
pt_anchored = re.compile ( r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$' )
if pa.match(<teststring>) and pt.match(<teststring>):
print 'Match found: '
else:
print 'No match'
...

replace wildcard numbers in pattern with additional text + same numbers

I need to find all parts of a large text string in this particular pattern:
"\t\t" + number (between 1-999) + "\t\t"
and then replace each occurrence with:
TEXT+"\t\t"+same number+"\t\t"
So, the end result is:
'TEXT\t\t24\t\tblah blah blahTEXT\t\t56\t\t'... and so on...
The various numbers are between 1-999 so it needs some kind of wildcard.
Please can somebody show me how to do it? Thanks!

You'll want to use Python's re library, and in particular the re.sub function:
import re # re is Python's regex library
SAMPLE_TEXT = "\t\t45\t\tbsadfd\t\t839\t\tds532\t\t0\t\t" # Test text to run the regex on
# Run the regex using re.sub (for substitute)
# re.sub takes three arguments: the regex expression,
# a function to return the substituted text,
# and the text you're running the regex on.
# The regex looks for substrings of the form:
# Two tabs ("\t\t"), followed by one to three digits 0-9 ("[0-9]{1,3}"),
# followed by two more tabs.
# The lambda function takes in a match object x,
# and returns the full text of that object (x.group(0))
# with "TEXT" prepended.
output = re.sub("\t\t[0-9]{1,3}\t\t",
lambda x: "TEXT" + x.group(0),
SAMPLE_TEXT)
print output # Print the resulting string.

Python complex regex replace

I'm trying to do a simple VB6 to c translator to help me port an open source game to the c language.
I want to be able to get "NpcList[NpcIndex]" from "With Npclist[NpcIndex]" using ragex and to replace it everywhere it has to be replaced. ("With" is used as a macro in VB6 that adds Npclist[NpcIndex] when ever it needs to until it founds "End With")
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perfom another regex replace between the "With" and the "End With" but I can't know the text the "With" is replacing (Npclist[NpcIndex]).
Thanks in advance

I personally wouldn't trust any single-regex solution to get it right on the first time nor feel like debugging it. Instead, I would parse the code line-to-line and cache any With expression to use it to replace any . directly preceded by whitespace or by any type of brackets (add use-cases as needed):
(?<=[\s[({])\. - positive lookbehind for any character from the set + escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing alternatives list if to-be-replaced . can occur on the beginning of line
import re
def convert_vb_to_c(vb_code_lines):
c_code = []
current_with = ""
for line in vb_code_lines:
if re.search(r'^\s*With', line) is not None:
current_with = line[5:] + "."
continue
elif re.search(r'^\s*End With', line) is not None:
current_with = "{error_outside_with_replacement}"
continue
line = re.sub(r'(?<=[\s[({])\.', current_with, line)
c_code.append(line)
return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
.goTo(245)
End With
With hatla
.matla.tatla[.matla.other] = .matla.other2
dont.mind.me(.do.mind.me)
.next()
End With
"""
# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))

You can pass a function to the sub method:
# just to give the idea of the regex
regex = re.compile(r'''With (.+)
(the-regex-for-the-VB-expression)+?
End With''')
def repl(match):
beginning = match.group(1) # NpcList[NpcIndex] in your example
return ''.join(beginning + line for line in match.group(2).splitlines())
re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
Note that you must be really careful to write the regex above. In particular using (.+) as I did matches all the line up to the newline excluded, which or may not be what you want(but I don't know VB and I have no idea which regex could go there instead to catch only what you want.
The same goes for the (the-regex-forthe-VB-expression)+. I have no idea what code could be in those lines, hence I leave to you the detail of implementing it. Maybe taking all the line can be okay, but I wouldn't trust something this simple(probably expressions can span multiple lines, right?).
Also doing all in one big regular expression is, in general, error prone and slow.
I'd strongly consider regexes only to find With and End With and use something else to do the replacements.

This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
... return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

how to split very long regular expression in python - python

Look at re.X flag. It allows comments and ignores white spaces in regex. a = re.compile(r"""\d + # the integral part \. # the decimal point \d * # some fractional digits""", re.X)

Python allows writing text strings in parts if enclosed in parenthesis: >>> text = ("alfa" "beta" ... "gama") ... >>> text 'alfabetagama' or in your code: text = ("alfa" "beta" "gama" "delta" "omega") print text will print "alfabetagamadeltaomega"

Related

How to parse values appear after the same string in python?

Regular Expression (find matching characters in order)

RegEx to match a term before OR after another specific term

replace wildcard numbers in pattern with additional text + same numbers

Python complex regex replace

Categories

Resources