Parsing a string for nested patterns - python

What would be the best way to do this?
The input string is:
<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>
The expected output is:
{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],
'The other system worked for about 1 month': [116],
'on it then it started doing the same thing as the first one':[137]
}
That seems like a job for a recursive regexp search, but I can't figure out how exactly.
I can think of a tedious recursive function, but I have a feeling there should be a better way.
Related question: Can regular expressions be used to match nested patterns?

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.
However, note that XML element names can't start with a digit, as yours do in the example.
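If the data really is XML apart from the digit-leading names, one workaround (a sketch; the underscore prefix is an arbitrary choice) is to rewrite the tags before parsing:

```python
import re

raw = '<133_3><116_2>some text</116_2></133_3>'
# Prefix every tag name that starts with a digit with an underscore,
# handling both opening (<1...) and closing (</1...) tags.
fixed = re.sub(r'<(/?)(\d)', r'<\1_\2', raw)
# fixed is now '<_133_3><_116_2>some text</_116_2></_133_3>'
```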
Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:
from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()
    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data
    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)
    print result_dict

Take an XML parser, have it build a DOM (Document Object Model), and then write a recursive algorithm that traverses all the nodes, collects each node's text (which should include the text of all its children), and uses that text as a key in the dictionary.
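A minimal sketch of that idea using the standard library's minidom (the tag names here are invented, prefixed with a letter since XML names can't start with a digit):

```python
from xml.dom.minidom import parseString

def all_text(node):
    # Concatenate the text of this node and all of its descendants.
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return ''.join(all_text(child) for child in node.childNodes)

def collect(node, d):
    # Use each element's full text as a dict key, appending its tag name.
    if node.nodeType == node.ELEMENT_NODE:
        d.setdefault(all_text(node), []).append(node.tagName)
    for child in node.childNodes:
        collect(child, d)

dom = parseString('<e133_3><e116_2>hello </e116_2>world</e133_3>')
result = {}
collect(dom.documentElement, result)
# result == {'hello world': ['e133_3'], 'hello ': ['e116_2']}
```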

from cStringIO import StringIO
from collections import defaultdict
####from xml.etree import cElementTree as etree
from lxml import etree
xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"
d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))
print(dict(d.items()))
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}

I think a grammar would be the best option here. I found a link with some information:
http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.
Take the following mini-language:
A certain number of "(" followed by the same number of ")", no matter what the number.
You could very easily write a regular expression for a super-language of this mini-language (one that doesn't enforce the equality of the number of opening and closing parentheses). You could also easily write one for any finite sub-language (where you limit yourself to some maximum nesting depth). But you can never represent this exact language with a regular expression.
So you'd have to use a grammar, yes.
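For this particular mini-language a simple counter (the degenerate case of the stack a grammar-based parser maintains) is enough, and that counter is exactly the machinery a regular expression lacks:

```python
def in_language(s):
    # Accept exactly n '(' followed by exactly n ')', for any n >= 0.
    n = 0
    while n < len(s) and s[n] == '(':
        n += 1
    # Everything after the opening run must be exactly n closing parens.
    return s[n:] == ')' * n
```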

Here's an unreliable, inefficient recursive regexp solution:
import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None:
        yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)

txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))
print(d)
Output:
{'on it then it started doing the same thing as the first one': [137],
'The other system worked for about 1 month': [116],
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}


How to find JSON objects without actually parsing the JSON (regex)

I'm working on filtering a website's data, looking for keywords. The website serves a long JSON body, and I only need to parse everything before a base64-encoded image. I can't parse the JSON normally, as the structure changes often and it's sometimes cut off.
Here is a snippet of code I'm parsing:
<script id="__APP_DATA" type="application/json">{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]
As you can see, I'm trying to weed out everything except for these:
{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}
As the JSON is really long and it wouldn't be smart to parse the entire thing, is there a way to find strings like these without actually parsing the JSON object? Ideally, I'd like for everything to be in an array. Will regular expressions work?
The ID is 5 numbers long, the code is 32 characters long, and there is a title.
Thanks a lot in advance
The following will use string.find() to step through the string and, if it finds both the start AND end of your target string, it will extract it as a dictionary. If it only finds the start but not the end, it assumes it's a broken or interrupted string and breaks out of the loop as there's nothing further to do.
I'm using the ast module to convert the string to dictionary. This isn't strictly needed to answer the question but I think it makes the end result more usable.
import ast
testdata = '{"routeProps":{"b723":{"navDataResource":[{"catalogId":48,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/bbjy2x.png","catalogName":"New Crypto Listings","total":762,"articles":[{"id":54572,"code":"0ef69e1d334c4d8c9ffbd088843bf2dd","title":"Binance Will List GYEN"},{"id":54548,"code":"e5607624f4614c3f9fd2562c8beb8660","title":"BTG, DEXE \u0026 SHIB Enabled on Binance Isolated Margin"},{"id":54394,"code":"a176d4cfd4c74a7fb8238e63d71c062a","title":"Binance Futures Will Launch USDT-Margined ICP Perpetual Contracts with Up to 25X Leverage"},{"id":54392,"code":"4fa91d953fd0484ab9a48cca0a41c192","title":"Binance Will Open Trading for Internet Computer (ICP)"},{"id":54382,"code":"33b6e8116ce54705ac89e898d1a05510","title":"Binance Will List Internet Computer (ICP)"}],"catalogs":[]},{"catalogId":49,"parentCatalogId":null,"icon":"https://bin.bnbstatic.com/image/20200609/zxgg2x.png","catalogName":"Latest News","total":1164,"articles":[{"id":54649,"code":"2291f02b964f45b195fd6d4685db80bb","title":"Update on Trading Suspension for GYEN"},{"id":54646,"code":"724346d139b041198a441dc149133c7d","title":"Binance Liquid Swap Adds RAMP/BUSD Liquidity Pool"},{"id":54643,"code":"bc9f313c04cc40d2b7e598c831fd721f","title":"Notice on Trading Suspension for GYEN"},{"id":54591,"code":"b3c6998066af43078c63a5498bfd80b1","title":"Binance P2P Supports New Payment Methods for Mongolia"},{"id":54586,"code":"d4418be0b9ea4d1b8e92cbbfe8468a17","title":"Dual Investment (42nd Phase) - Earn Up to 56% APY"}]'
# Create a list to hold the dictionary objects
itemlist = []
# Create variable to keep track of our position in the string
strMarker = 0
# Neverending Loooooooooooooooooooooooooooooooop
while True:
    # Find the next occurrence of the beginning of a target string
    strStart = testdata.find('{"id":', strMarker)
    if strStart != -1:
        # If we've found the start, look for the end marker of the string,
        # starting from the location we identified as the beginning
        strEnd = testdata.find('}', strStart)
        # If it does not exist, this suggests an interrupted string,
        # so we do nothing further with it and allow the loop to break
        if strEnd != -1:
            # Save this marker as the starting point for the next search cycle
            strMarker = strEnd
            # Extract the substring based on the start and end positions (+1 to
            # capture the final '}'); as this string is already formatted like a
            # dict literal, ast.literal_eval() turns it into an actual usable
            # dictionary object
            itemlist.append(ast.literal_eval(testdata[strStart:strEnd+1]))
            # We're happy to keep searching, so jump to the next iteration
            continue
    # If nothing triggered a jump to the next iteration, break out of the loop
    break
# Print out the first entry in the list as a demo
print(str(itemlist[0]))
print(str(itemlist[0]["title"]))
Output from this code should be a nicely formatted dict:
{'id': 54572, 'code': '0ef69e1d334c4d8c9ffbd088843bf2dd', 'title': 'Binance Will List GYEN'}
Binance Will List GYEN
A regular expression should work here. Try matching with the following one; it matches the desired sections when I try it at https://regexr.com/. regexr also helps you understand the regular expression, in case you're new to them.
(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})
Here is a small sample python script to find all of the sections.
import re
pattern = r'(\{"id":\d{5},"code":".{32}","title":"[^"]*"\})'
string_to_parse='...'
sections = re.findall(pattern, string_to_parse, re.DOTALL)

python3.6 How do I regex a url from a .txt?

I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com"""
The code below catches all URLs in the text and returns them in a list.
import re

urls = re.findall(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?): the dot means any character (except the newline character), the star means zero or more occurrences, and the ? after the star makes the match non-greedy, so it stops at the first occurrence of whatever follows. The brackets place it all into a capture group. All of this together basically means it will find everything in between URL= and the first . (note the \. in the pattern: the dot has to be escaped to match a literal period).
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming you're working with a large text from which you want to extract all URLs, you'll want to put this logic into a function so you can reuse it and build around it (achieving iteration via while or for loops and, depending on how you iterate, keeping track of the position of the last extracted URL, and so on).
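A sketch of that generalization (the function name and marker arguments are placeholders, not anything from the question):

```python
def extract_between(text, start='URL=', end='.'):
    # Yield every substring found between a start marker and the
    # next occurrence of the end marker.
    pos = 0
    while True:
        s = text.find(start, pos)
        if s == -1:
            return  # no more start markers
        s += len(start)
        e = text.find(end, s)
        if e == -1:
            return  # start marker without a matching end marker
        yield text[s:e]
        pos = e + len(end)

urls = list(extract_between('URL=http://example.net URL=http://foo.org'))
# urls == ['http://example', 'http://foo']
```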
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that you'd be amazed. And these are not all, I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])

Using ElementTree to find a node - invalid predicate

I'm very new to this area so I'm sure it's just something obvious. I'm trying to change a python script so that it finds a node in a different way but I get an "invalid predicate" error.
import xml.etree.ElementTree as ET
tree = ET.parse("/tmp/failing.xml")
doc = tree.getroot()
thingy = doc.find(".//File/Diag[@id='53']")
print(thingy.attrib)
thingy = doc.find(".//File/Diag[BaseName = 'HTTPHeaders']")
print(thingy.attrib)
That should find the same node twice but the second find gets the error. Here is an extract of the XML:
<Diag id="53">
<Formatted>xyz</Formatted>
<BaseName>HTTPHeaders</BaseName>
<Column>17</Column>
I hope I've not cut it down too much. Basically, finding it with "@id" works, but I want to search on that BaseName tag instead.
Actually, I want to search on a combination of tags so I have a more complicated expression lined up but I can't get the simple one to work!
The code in the question works when using Python 3.7. If the spaces before and after the equals sign in the predicate are removed, it also works with earlier Python versions.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders']")
See https://bugs.python.org/issue31648.
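For the "combination of tags" case mentioned in the question, predicates can be chained; here is a sketch with an invented XML snippet so it's self-contained:

```python
import xml.etree.ElementTree as ET

xml = """<Root><File>
  <Diag id="53"><BaseName>HTTPHeaders</BaseName><Column>17</Column></Diag>
  <Diag id="54"><BaseName>Other</BaseName><Column>17</Column></Diag>
</File></Root>"""
doc = ET.fromstring(xml)
# Each [..] predicate must hold on the same Diag element.
thingy = doc.find(".//File/Diag[BaseName='HTTPHeaders'][Column='17']")
```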

Effective ways to extract XML tagged string in Python

My task is to grab a KML file, extract one tagged value, and send it to Mongo as GeoJSON.
I'm getting the file as a binary requests object.
doc = requests.get(file).content  # the raw XML document as bytes
My question is about the "best" approach to get the value from the tag. Consider that I have multiple sources that need to be scanned every minute, so even though one run might not take long, it builds up (keeping in mind that the actual file download will cost more than any extraction process).
The approaches I've tried are BeautifulSoup, slicing and regex. They all work fine, but I would love to get some input on alternatives and/or pros and cons.
def extractsubstring_soup(doc, start):
    soup = BeautifulSoup(doc, 'lxml-xml')
    return soup.find(start).string

def extractsubstring_re(doc, start, stop):
    return re.search('%s(.*)%s' % (start, stop), doc).group(1)

def extractsubstring_slice(doc, start, stop):
    substart = doc.index(start) + len(start)
    subend = doc.index(stop)
    return doc[substart:subend]
For performance, you can use http://lxml.de/ with an XPath query to extract the information you want.
BeautifulSoup is a wrapper around different parsers (you can choose which one), but it's usually used for parsing HTML, not XML.
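Here's a sketch of the XPath approach; lxml mirrors the ElementTree API, so the same find() call works with the standard library parser used here (the KML snippet and tag name are invented):

```python
import xml.etree.ElementTree as ET

doc = b'<kml><Placemark><coordinates>10.0,59.9</coordinates></Placemark></kml>'
root = ET.fromstring(doc)
# ElementTree supports a useful subset of XPath; lxml additionally
# offers full XPath 1.0 via root.xpath(...).
value = root.find('.//coordinates').text
```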

Regular Expression Python Variable

I have data like this:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
What I want to do is: if I have the ID >Px016979, get the data below it, like this:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
I am new to Python.
#coding:utf-8
import os,re
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
b = '>Px016979'
matchbj = re.match( r'$b(.*?)>',a,re.M|re.I)
print matchbj.group()
My code does not work. I have two questions:
I think my data has carriage returns, so my code can't work.
I don't know how to use variables in Python regular expressions. If I write re.match(r'>Px016979(.*?)>', a, re.M|re.I) it works, but I need to use a variable.
Thanks.
It looks like your data is a FASTA file with protein sequences. So instead of using regular expressions, you should consider installing BioPython. That is a library specifically for bioinformatics use and research.
The goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank,...), access to online services (NCBI, Expasy,...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.
Using BioPython, you would extract a sequence from a FASTA file for a given identifier in the following way:
from Bio import SeqIO
input_file = r'C:\path\to\proteins.fasta'
record_id = 'Px016979'
record_dict = SeqIO.to_dict(SeqIO.parse(input_file, 'fasta'))
record = record_dict[record_id]
sequence = str(record.seq)
print sequence
The following should work for each of the entries you have:
a = """
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
"""
for b in ['>Px016979', '>Px016980', '>Px002185', '>Px006321']:
    re_search = re.search(re.escape(b) + r'(.*?)(?:>|\Z)', a, re.M|re.I|re.S)
    print re_search.group()
This will display the following:
>Px016979
MSPWMKKVFLQCMPKLLMMRRTKYSLPDYDDTFVSNGYTNELEMSRDSLT
DAFGNSKEDSGDYRKSPAPEDDMVGAGAYQRPSVTESENMLPRHLSPEVA
AALQSVRFIAQHIKDADKDNEVVEDWKFMSMVLDRFFLWLFTIACFVGTF
GIIFQSPSLYDTRVPVDQQISSIPMRKNNFFYPKDIETIGIIS
>
>Px016980
MQFIKKVLLIALTLSGAMGISREKRGLIFPPTSLYGTFLAIAVPIDIPDK
NVFVSYNFESNYSTLNNITEIDEVLFPNLPVVTARHSRSITRELAYTVLE
TKFKEHGLGGRECLLRNICEAAETPLHHNGLLGHIMHIVFTPSSSAEEGL
DDEYYEAEASGRAGSCARYEELCPVGLFDLITRIVEFKHT"
>
>Px002185
MLSPSVAIKVQVLYIGKVRISQRKVPDTLIDDALVKFVHHEAEKVKANML
RRHSLLSSTGTSIYSSESAENLNEDKTKTDTSEHNIFLMMLLRAHCEAKQ
LRHVHDTAENRTEFLNQYLGGSTIFMKAKRSLSSGFDQLLKRKSSRDEGS
GLVLPVKKVT
>
>Px006321
MFPGRTIGIMITASHNLEPDNGVKLVDPDGEMLDGSWEEIATRMANVRYL
PMSLITKFLVNSYY
I would also consider installing Biopython and checking out the book Python for Biologists, which is free online (http://pythonforbiologists.com/). I have worked with FASTA files a lot, and for a quick and dirty solution you can just use this (leave the rest of the code as is):
matchbj = re.findall(r'>[^>]+', a)
for item in matchbj:
    print item
It matches across lines because the [^>] character class also matches newlines, grabbing everything from each '>' up to the next one.
Be advised, this will give them to you in a list, not a match object. In my experience, re.match is the first thing people learn, but what they are often looking for is the behavior of re.findall.
