Regular Expression with python need to select closest section

I need some help with a regex in Python.
I have this text:
part101_add(
name = "part101-1",
dev2_serial = "dev_l622_01",
serial_port = "/dev/tty-part101-1",
yok_serial = "YT8388"
)
yok_tar_add("YT8388", None)
part2_add(
name = "part2-1",
serial_number = "SERIALNUMBER",
serial_port = "/dev/tty-part2-1",
yok_serial = "YT03044",
yok_port_board = "N"
)
yok_tar_add("YT03044", None)
I need to select every part*_add call together with its content.
For example:
part101_add:
name = "part101-1",
dev2_serial = "dev_l622_01",
serial_port = "/dev/tty-part101-1",
yok_serial = "YT8388"
part2_add:
serial_number = "SERIALNUMBER",
serial_port = "/dev/tty-part2-1",
yok_serial = "YT03044",
yok_port_board = "N"
The problem is that I am unable to separate the results when using this pattern:
regex = r"(.*?_add)\([\s\S.]*\)"
Thanks for your help.

I would make the pattern more precise so that it only matches at the start and end of a line, and use a lazy quantifier with [\s\S]:
r"(?m)^(part\d+_add)\([\s\S]*?\)$"
Details:
(?m) - an inline re.MULTILINE modifier version to make ^ match the line start and $ to match the line end
^ - start of a line
(part\d+_add) - Group 1 capturing part, 1+ digits, _add
\( - a literal (
[\s\S]*? - any 0+ chars, as few as possible up to
\)$ - a ) at the end of the line.
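To get both the name and the content, as the question asks, a second capturing group can be wrapped around the lazy part. This is only a minimal sketch: the second group, the variable names and the indentation of the sample text are assumptions added for illustration, not part of the pattern above.

```python
import re

text = '''part101_add(
   name = "part101-1",
   dev2_serial = "dev_l622_01",
   serial_port = "/dev/tty-part101-1",
   yok_serial = "YT8388"
)
yok_tar_add("YT8388", None)
part2_add(
   name = "part2-1",
   serial_number = "SERIALNUMBER",
   serial_port = "/dev/tty-part2-1",
   yok_serial = "YT03044",
   yok_port_board = "N"
)
yok_tar_add("YT03044", None)
'''

# Group 1: the part*_add name.
# Group 2 (added for this sketch): everything between the parentheses.
pattern = re.compile(r"(?m)^(part\d+_add)\(([\s\S]*?)\)$")

sections = {}
for m in pattern.finditer(text):
    # Strip each line so only the "key = value" entries remain.
    sections[m.group(1)] = [ln.strip() for ln in m.group(2).splitlines() if ln.strip()]

for name, entries in sections.items():
    print(name + ":")
    for entry in entries:
        print(entry)
```

Because the quantifier is lazy and the closing ) must sit at the end of a line, each match stops at the first closing parenthesis of its own block instead of running on to the last one in the text.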

Related

How can I use a regex expression to identify the falses between braces on different lines?

I'm attempting to use Python's re.sub function to replace every instance of false in the example below with "false, \n".
local Mission = {
start_done = false
game_over = false
} --End Mission
I've attempted the following, but the replacement is not succeeding. The idea is to start and end with the anchor strings, skip over anything that isn't a "false", and return "false" + ',' when I get a match. Any help would be appreciated!
re.sub(r'(Mission = {)(.+?)(false)(.+?)(} --End Mission)', r'\1' + ',' + '\n')

You can use
re.sub(r'Mission = {.*?} --End Mission', lambda x: x.group().replace('false', 'false, \n'), text, flags=re.S)
Notes:
The Mission = {.*?} --End Mission regex matches Mission = {, then any zero or more characters, as few as possible, and then } --End Mission. The re.S flag lets . match newlines as well.
Then false is replaced with false, \n in each matched text.
A complete Python example:
import re
text = 'local Mission = {\n start_done = false\n game_over = false\n\n} --End Mission'
rx = r'Mission = {.*?} --End Mission'
print(re.sub(rx, lambda x: x.group().replace('false', 'false, \n'), text, flags=re.S))

Another option without regex:
your_string = 'local Mission = {\n start_done = false\n game_over = false\n\n} --End Mission'
print(your_string.replace(' = false\n', ' = false,\n'))
Output:
local Mission = {
start_done = false,
game_over = false,
} --End Mission

Provided that every "false" that is preceded by = and followed by \n has to be substituted, here is a regex:
re.sub(r'= (false)\n', r'= \1,\n', text)
Note: your regex introduces 5 groups, so you should have used \3 rather than \1 to refer to "false". Group numbering starts from 1; see the \number entry in the re documentation.
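The group-numbering point can be checked with a short sketch. The sample text and variable names are assumed for illustration; only the two patterns come from the posts above.

```python
import re

text = "local Mission = {\n  start_done = false\n  game_over = false\n} --End Mission"

# Groups are numbered from 1 by the position of their opening parenthesis,
# so in the first three groups of the question's pattern "false" is group 3.
m = re.search(r"(Mission = {)(.+?)(false)", text, flags=re.S)
third_group = m.group(3)

# With a single capturing group, \1 in the replacement refers to "false".
result = re.sub(r"= (false)\n", r"= \1,\n", text)
print(result)
```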

Dealing with ZeroOrMore in pyparsing

I'm trying to parse the output of pactl list with pyparsing. So far parsing mostly works, but I cannot get ZeroOrMore to behave correctly.
A line can be either foo: or foo: bar, and I tried to handle both with ZeroOrMore, but it doesn't work. I had to add a special case for "Argument:" to match entries without a value, but there are also Argument: foo entries (with a value), so that special case breaks, and I expect that any other property may appear without a value too.
With this definition, and a fixed pactl list output:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1)))))
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | ("Argument:") | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This gets:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
[[['Module'], ['6']],
[['Argument:'],
[[['Name'], ['module-alsa-card']]],
[[['Usage counter'], ['0']]],
['Properties:',
[[[['module.author'], ['"Lennart Poettering"']]],
[[['module.description'], ['"ALSA Card"']]],
[[['module.version'], ['"14.0-rebootstrapped"']]]]]]]
So far so good, but after removing the special case for Argument: the parser fails, because ZeroOrMore doesn't behave as expected:
#!/usr/bin/env python
#
# parsing pactl list
#
from pyparsing import *
import os
from subprocess import check_output
import sys
data = '''
Module #6
        Argument:
        Name: module-alsa-card
        Usage counter: 0
        Properties:
                module.author = "Lennart Poettering"
                module.description = "ALSA Card"
                module.version = "14.0-rebootstrapped"
'''
indentStack = [1]
stmt = Forward()
identifier = Word(alphanums+"-_.")
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums)))
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop = (prop_name + prop_section)
stmt << ( section | prop | value | prop_val )
syntax = OneOrMore(stmt)
parseTree = syntax.parseString(data)
parseTree.pprint()
This results in:
$ ./pactl.py
Module #6
Argument:
Name: module-alsa-card
Usage counter: 0
Properties:
module.author = "Lennart Poettering"
module.description = "ALSA Card"
module.version = "14.0-rebootstrapped"
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 19(3,9)
Matched Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) -> [[['Argument'], ['Name']]]
Match Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) at loc 1(2,1)
Exception raised:Expected ":", found '#' (at char 8), (line:2, col:8)
Traceback (most recent call last):
File "/home/alberto/projects/node/pacmd_list_json/./pactl.py", line 55, in <module>
parseTree = syntax.parseString(partial)
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 1955, in parseString
raise exc
File "/usr/local/lib/python3.9/site-packages/pyparsing.py", line 6336, in checkUnindent
raise ParseException(s, l, "not an unindent")
pyparsing.ParseException: Expected {{Group:({Group:(W:(ABCD...)) Suppress:("#") Group:(W:(0123...))}) indented block} | {"Properties:" indented block} | Group:({Group:(Combine:({{W:(ABCD...) | <SP>}}...)) Suppress:(":") Group:(Combine:([{W:(ABCD...) | <SP>}]...))}) | Group:({Group:(W:(ABCD...)) Suppress:("=") Group:(Combine:({{W:(ABCD...) | <SP><TAB>}}...))})}, found ':' (at char 41), (line:4, col:13)
The setDebug output for the value grammar shows that ZeroOrMore is consuming tokens from the next line: [[['Argument'], ['Name']]]
I tried LineEnd() and other tricks, but none of them worked.
Any idea how to make ZeroOrMore stop at LineEnd(), or how to do this without special cases?
NOTE: The real output can be retrieved using:
env = os.environ.copy()
env['LANG'] = 'C'
data = check_output(
['pactl', 'list'], universal_newlines=True, env=env)

indentedBlock is not the easiest pyparsing element to work with, but a few things you are doing are getting in your way.
To debug this, I broke down some of your more complex expressions, used setName() to give them names, and then added .setDebug(), like this:
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
This will tell pyparsing to output a message whenever this expression is about to be matched, if it matched successfully, or if not, the exception that was raised.
Match identifier at loc 1(2,1)
Matched identifier -> ['Module']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 15(3,5)
Matched identifier -> ['Argument']
Match identifier at loc 23(3,13)
Exception raised:Expected identifier, found ':' (at char 23), (line:3, col:13)
It looks like these expressions are messing up the indentedBlock matching, by processing whitespace that should be indentation space:
Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))
The " character in the Word and the whitespace lead me to believe you are trying to match quoted strings. I replaced this expression with:
Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString))
You also need to take care not to read past the end of the line, or you'll also mess up the indentedBlock indentation tracking. I added this expression for a newline at the top:
NL = LineEnd()
and then used it as the stopOn argument to OneOrMore and ZeroOrMore:
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
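The effect of stopOn can be seen in isolation with a much smaller grammar. This is a sketch with made-up input, assuming pyparsing is installed; it is not the pactl parser itself.

```python
from pyparsing import LineEnd, OneOrMore, Word, alphas

word = Word(alphas)
NL = LineEnd()
text = "alpha beta\ngamma delta"

# By default pyparsing skips whitespace, including newlines, between tokens,
# so the repetition runs on into the second line.
greedy = OneOrMore(word).parseString(text).asList()

# With stopOn, the repetition ends as soon as the next thing to parse is a line end.
bounded = OneOrMore(word, stopOn=NL).parseString(text).asList()

print(greedy)
print(bounded)
```

Keeping each repetition on its own line is what stops a value expression from consuming the label of the following line.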
Here is the parser I ended up with:
indentStack = [1]
stmt = Forward()
NL = LineEnd()
identifier = Word(alphas, alphanums+"-_.").setName("identifier").setDebug()
sect_def = Group(Group(identifier) + Suppress("#") + Group(Word(nums))).setName("sect_def")#.setDebug()
inner_section = indentedBlock(stmt, indentStack)
section = (sect_def + inner_section)
#~ value = Group(Group(Combine(OneOrMore(identifier|White(' ')))) + Suppress(":") + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_".')|White(' ', max=1))))).setDebug()
value_label = originalTextFor(OneOrMore(identifier)).setName("value_label")#.setDebug()
value = Group(value_label
              + Suppress(":")
              + Optional(~NL + Group(Combine(ZeroOrMore(Word(alphanums+'-/=_.') | quotedString(), stopOn=NL))))).setName("value")#.setDebug()
prop_name = Literal("Properties:")
prop_section = indentedBlock(stmt, indentStack)
#~ prop_val = Group(Group(identifier) + Suppress("=") + Group(Combine(OneOrMore(Word(alphanums+'-"/.')|White(' \t')))))
prop_val_value = Combine(OneOrMore(Word(alphas, alphanums+'-/.') | quotedString(), stopOn=NL)).setName("prop_val_value")#.setDebug()
prop_val = Group(identifier + Suppress("=") + Group(prop_val_value)).setName("prop_val")#.setDebug()
prop = (prop_name + prop_section).setName("prop")#.setDebug()
stmt << ( section | prop | value | prop_val )
Which gives this:
[[['Module'], ['6']],
[[['Argument']],
[['Name', ['module-alsa-card']]],
[['Usage counter', ['0']]],
['Properties:',
[[['module.author', ['"Lennart Poettering"']]],
[['module.description', ['"ALSA Card"']]],
[['module.version', ['"14.0-rebootstrapped"']]]]]]]

Replace the all content starting from (color to ;) and starting ? to )

this is my code so far:
import re
a = ["abc", " this is in blue color","(Refer: '(color:rgb(61, 142, 185); )Set the TEST VIN value'(color:rgb(0, 0, 0); ) in document: (color:rgb(61, 142, 185); )[UserGuide_Upgrade_2020_W10_final.pdf|CB:/displayDocument/UserGuide_Upgrade_2020_W10_final.pdf?task_id=12639618&artifact_id=48569866] )"]
p = re.compile(r'(color[\w]+\;)').sub('', a[i])
print(p)
Output required:
["abc", " this is in blue color","(Refer: 'Set the TEST VIN value' in document: [UserGuide_Upgrade_2020_W10_final.pdf|CB:/displayDocument/UserGuide_Upgrade_2020_W10_final.pdf)"]

There are 3 color parts to remove in the string, plus the part at the end from the question mark up to just before the closing ).
You could match all of these parts using an alternation with |:
\(color:\w+\([^()]*\); \)|\?[^?]+(?=\)$)
\(color: Match (color:
\w+\([^()]*\); \) Match 1+ word chars, then from ( to ), then a semicolon, a space and another )
| Or
\?[^?]+ Match ? and 1+ times all chars except ?
(?=\)$) Assert what is on the right is ) at the end of the string
Example code
import re
regex = r"\(color:\w+\([^()]*\); \)|\?[^?]+(?=\)$)"
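test_str = " this is in blue color\",\"(Refer: '(color:rgb(61, 142, 185); )Set the TEST VIN value'(color:rgb(0, 0, 0); ) in document: (color:rgb(61, 142, 185); )[UserGuide_Upgrade_2020_W10_final.pdf|CB:/displayDocument/UserGuide_Upgrade_2020_W10_final.pdf?task_id=12639618&artifact_id=48569866] )"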
result = re.sub(regex, "", test_str)
print (result)
Output
this is in blue color","(Refer: 'Set the TEST VIN value' in document: [UserGuide_Upgrade_2020_W10_final.pdf|CB:/displayDocument/UserGuide_Upgrade_2020_W10_final.pdf)
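Since the question works on a list rather than a single string, the same pattern can be applied to each element. A minimal sketch, with the list taken from the question and the names pattern and cleaned chosen for illustration:

```python
import re

pattern = re.compile(r"\(color:\w+\([^()]*\); \)|\?[^?]+(?=\)$)")

a = [
    "abc",
    " this is in blue color",
    "(Refer: '(color:rgb(61, 142, 185); )Set the TEST VIN value'(color:rgb(0, 0, 0); )"
    " in document: (color:rgb(61, 142, 185); )[UserGuide_Upgrade_2020_W10_final.pdf|"
    "CB:/displayDocument/UserGuide_Upgrade_2020_W10_final.pdf"
    "?task_id=12639618&artifact_id=48569866] )",
]

# Elements without a "(color:...; )" part or a trailing query string are left unchanged.
cleaned = [pattern.sub("", item) for item in a]
print(cleaned)
```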

Copy value in var from structured data

I have bulk data in the bulk_data variable and need to extract parts of it into separate variables as shown below. How can I do this in Python?
bulk_data = """F0142514RM/JRSE1420 Mod/4758
F0144758RM/JRSE935 Mod/23
F014GS4RM/JRSE10 Mod/445
"""
typeA1 = <start with RM/>"JRSE1420"<until space> in 1st line
typeA2 = <start with RM/>"JRSE935"<until space> in 2nd line
typeA3 = <start with RM/>"JRSE10"<until space> in 3rd line
typeB1 = <start with space after typeA1>"Mod/4758"<until end of the line> in 1st line
typeB2 = <start with space after typeA2>"Mod/23"<until end of the line> in 2nd line
typeB3 = <start with space after typeA3>"Mod/445"<until end of the line> in 3rd line
Overall result would be:
typeA1 = 'JRSE1420'
typeA2 = 'JRSE935'
typeA3 = 'JRSE10'
typeB1 = 'Mod/4758'
typeB2 = 'Mod/23'
typeB3 = 'Mod/445'
Also, is there any study guide for this kind of data manipulation?

You can use the re module
import re
bulk_data = '''F0142514RM/JRSE1420 Mod/4758
F0144758RM/JRSE935 Mod/23
F014GS4RM/JRSE10 Mod/445
'''
ptrn1 = re.compile(r'''
    ^       # matches the start of a line (re.MULTILINE)
    .*      # matches 0 or more of any character
    RM\/    # matches "RM" followed by "/"
    (\w+)   # captures one or more word characters (letters, digits, underscore)
    \b      # matches a word boundary
    .*      # matches the rest of the line
    $       # matches the end of a line (re.MULTILINE)
    ''', re.MULTILINE | re.VERBOSE)
ptrn2 = re.compile(r'''
    ^       # matches the start of a line (re.MULTILINE)
    .*      # matches 0 or more of any character
    \s      # matches a whitespace character
    (Mod.*) # captures "Mod" followed by 0 or more of any character
    $       # matches the end of a line (re.MULTILINE)
    ''', re.MULTILINE | re.VERBOSE)
typeA1, typeA2, typeA3 = ptrn1.findall(bulk_data)
typeB1, typeB2, typeB3 = ptrn2.findall(bulk_data)

Why re? Looks like everything is already properly separated by different characters.
lines = bulk_data.splitlines()
typeA1_, typeB1 = lines[0].split(' ')
typeA1 = typeA1_.split('/')[1]
...
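The same split-based idea can be written as a loop so that it handles any number of lines. This is a sketch that collects the values in two lists instead of numbered variables; the names type_a and type_b are assumptions for illustration.

```python
bulk_data = """F0142514RM/JRSE1420 Mod/4758
F0144758RM/JRSE935 Mod/23
F014GS4RM/JRSE10 Mod/445
"""

type_a = []
type_b = []
for line in bulk_data.splitlines():
    if not line.strip():
        continue  # skip blank lines
    left, right = line.split(" ", 1)       # "F0142514RM/JRSE1420", "Mod/4758"
    type_a.append(left.split("/", 1)[1])   # text after the first "/"
    type_b.append(right)

print(type_a)
print(type_b)
```

Lists avoid having to know the number of lines in advance; type_a[0] plays the role of typeA1, and so on.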

count = 1
li = []
with open('data') as f:
    for line in f:
        line = line.split()
        if line:
            a, b = line
            a = a[a.index('/')+1:]
            li.append("TypeA{} = {} ".format(count, a))
            li.append("TypeB{} = {} ".format(count, b))
            count += 1
for el in sorted(li):
    print(el)
Output:
TypeA1 = JRSE1420
TypeA2 = JRSE935
TypeA3 = JRSE10
TypeB1 = Mod/4758
TypeB2 = Mod/23
TypeB3 = Mod/445

Is it possible to use regular expressions with pdfquery?

Can we use regex to detect text within a pdf (using pdfquery or another tool)?
I know we can do this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:contains("Cash")')
left_corner = float(label.attr('x0'))
bottom_corner = float(label.attr('y0'))
cash = pdf.pq('LTTextLineHorizontal:in_bbox("%s, %s, %s, %s")' % \
    (left_corner, bottom_corner-30, \
     left_corner+150, bottom_corner)).text()
print(cash)
'179,000.00'
But we need something like this:
pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pdf.load()
label = pdf.pq('LTTextLineHorizontal:regex("\d{1,3}(?:,\d{3})*(?:\.\d{2})?")')
cash = str(label.attr('x0'))
print(cash)
'179,000.00'

This is not exactly a lookup for a regex, but it works to format/filter the possible extractions:
import re
import pdfquery

def regex_function(pattern, match):
    # Return the first capture group of the pattern, or None if there is no match.
    re_obj = re.search(pattern, match)
    if re_obj is not None and len(re_obj.groups()) > 0:
        return re_obj.group(1)
    return None

pdf = pdfquery.PDFQuery("tests/samples/IRS_1040A.pdf")
pattern = ''
pdf.extract([
    ('with_parent', 'LTPage[pageid=1]'),
    ('with_formatter', 'text'),
    # SOME_PATTERN_HERE is a placeholder for your own regex.
    ('year', 'LTTextLineHorizontal:contains("Form 1040A (")',
     lambda match: regex_function(SOME_PATTERN_HERE, match)),
])
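The helper itself can be tried on plain strings without pdfquery. The sketch below assumes the text has already been extracted as a str; the sample strings and the name first_group are made up for illustration, and the amount pattern is the one from the question with a capture group added.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    if m is not None and m.groups():
        return m.group(1)
    return None

# Hypothetical extracted strings, not taken from a real PDF.
year = first_group(r"\((\d{4})\)", "Form 1040A (2007)")
amount = first_group(r"(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)", "Cash 179,000.00")
missing = first_group(r"(\d+)", "Cash")

print(year, amount, missing)
```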
I didn't test this next one, but it might also work:
def regex_function_filter_here():
    # here you could use some regex.
    return float(this.get('width',0)) * float(this.get('height',0)) > 40000
pdf.pq('LTPage[page_index="1"] *').filter(regex_function_filter_here)
[<LTTextBoxHorizontal>, <LTRect>, <LTRect>]
