Parse multiple similar field values from XML file with Python Regular Expression - python

I am trying to parse an XML file with a regular expression.
For every script tag that contains a line with the alias "catch", I need to collect the script's "type" and the line's "value".
<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>
I tried this regular expression with multiline and dotall:
>>> re.findall(r'script\s+type=\"(\w+)\".*alias=\"catch\"\s+value=\"(\d+)\"', a, re.MULTILINE|re.DOTALL)
The output I get is:
[('abc', '8')]
Expected output is:
[('abc', '4'), ('xyz', '8')]
Can someone help me in figuring out what I am missing here?

I recommend using BeautifulSoup. You can parse through the tags and, with a little bit of data re-structuring, easily check for the right alias values and store the related attributes of interest. Like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "lxml")
to_keep = []
for script in soup.find_all("script"):
    t = script["type"]
    # Split the raw text inside <script> into key=value pairs
    attrs = {
        k: v for k, v in [attr.split("=")
                          for attr in script.contents[0].split()
                          if "=" in attr]
    }
    # The values still carry their surrounding double quotes
    if attrs["alias"] == '"catch"':
        to_keep.append({"type": t, "value": attrs["value"]})
to_keep
# [{'type': 'abc', 'value': '"4"'}, {'type': 'xyz', 'value': '"8"'}]
Data:
data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""
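If the real file is well-formed XML, the standard library may already be enough. Here is a minimal sketch using xml.etree.ElementTree instead of BeautifulSoup. The wrapping element named "root" is my own assumption, added only because the sample has two top-level script elements.

```python
import xml.etree.ElementTree as ET

data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# Assumption: wrap the fragment in a synthetic root so it parses as one document.
root = ET.fromstring("<root>%s</root>" % data)

# Pair each script's type with the value of every "catch" line inside it.
pairs = [
    (script.get("type"), line.get("value"))
    for script in root.iter("script")
    for line in script.iter("line")
    if line.get("alias") == "catch"
]
print(pairs)  # [('abc', '4'), ('xyz', '8')]
```

Because the attributes are read through the parser, the values come back without the surrounding quotes.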

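Found the answer: the .* in my pattern was greedy, so it ran through to the last alias="catch" in the file. Making it non-greedy (.*?) and matching up to the closing </script> gives the expected result. Thanks, downvoter; I don't think there was any need to downvote this question.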
>>> re.findall(r'script\s+type=\"(\w+)\".*?alias=\"catch\"\s+value=\"(\d+)\".*?\<\/script\>', a, re.MULTILINE|re.DOTALL)
[('abc', '4'), ('xyz', '8')]
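To make the difference visible, here is a small sketch that runs a greedy and a non-greedy variant on the sample data. The variable names and the second sample (data_mixed) are made up for illustration.

```python
import re

a = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

greedy_pattern = r'script\s+type="(\w+)".*alias="catch"\s+value="(\d+)"'
lazy_pattern = r'script\s+type="(\w+)".*?alias="catch"\s+value="(\d+)"'

# Greedy .* backtracks from the end of the input, so it stops at the LAST alias="catch".
greedy = re.findall(greedy_pattern, a, re.DOTALL)
# Lazy .*? stops at the FIRST alias="catch" after each script tag.
lazy = re.findall(lazy_pattern, a, re.DOTALL)
print(greedy)  # [('abc', '8')]
print(lazy)    # [('abc', '4'), ('xyz', '8')]

# Hypothetical input where the first script has no "catch" line:
data_mixed = """<script type="one">
<line alias="skip" value="1"/>
</script>
<script type="two">
<line alias="catch" value="2"/>
</script>"""
mixed = re.findall(lazy_pattern, data_mixed, re.DOTALL)
print(mixed)   # [('one', '2')] -- the type comes from the wrong script
```

The last case is why a real XML parser is usually the safer choice: the lazy pattern still crosses script boundaries when a script has no matching line.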

Related

Beautiful soup, get into app.start(text)

I got the following code using BeautifulSoup with lxml:
<script>
app.start({
el: $('body'),
property: {
p_A: 'A',
p_B: 'B'
});
</script>
From this I would like to get the mapping {p_A : 'A', p_B : 'B'}, but I do not know how to get inside the app.start(...) call.
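One possible approach, sketched under assumptions: BeautifulSoup only gives you the script body as a string, so the values have to be pulled out of that text. Here script_text stands in for that string (for example soup.find("script").string), and the regular expressions assume a flat block of single-quoted values like the one shown.

```python
import re

# Assumed to be the text of the <script> tag, copied from the question as-is.
script_text = """
app.start({
el: $('body'),
property: {
p_A: 'A',
p_B: 'B'
});
"""

# Capture everything between "property: {" and the first closing brace.
block = re.search(r"property\s*:\s*\{(.*?)\}", script_text, re.DOTALL)

# Inside that block, collect name: 'value' pairs.
props = dict(re.findall(r"(\w+)\s*:\s*'([^']*)'", block.group(1))) if block else {}
print(props)  # {'p_A': 'A', 'p_B': 'B'}
```

This will not cope with nested objects or double-quoted values; for anything more complex a JavaScript-aware parser would be needed.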

lxml etree get all text before element

How can I get all the text before an element in an etree, separated from the text after the element?
from lxml import etree
tree = etree.fromstring('''
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
''')
What do I want? In this example, the <dd> tags are separators and for all of them
for el in tree.findall('.//dd'):
I would like to have all text before and after them:
[
    {
        el : <Element dd at 0xsomedistinctadress>,
        before : 'find the text',
        after : 'before and after'
    },
    {
        el : <Element dd at 0xsomeotherdistinctadress>,
        before : 'find the text before',
        after : 'and after'
    }
]
My idea was to use some kind of placeholders in the tree with which I replace the <dd> tags and then cut the string at that placeholder, but I need the correspondence with the actual element.
There might be a simpler way, but I would use the following XPath expressions:
preceding-sibling::*/text()|preceding-sibling::text()
following-sibling::*/text()|following-sibling::text()
Sample implementation (definitely violating the DRY principle):
def get_text_before(element):
    # Text of preceding sibling elements plus text nodes directly before the element
    for item in element.xpath("preceding-sibling::*/text()|preceding-sibling::text()"):
        item = item.strip()
        if item:
            yield item
def get_text_after(element):
    # Same idea, using the following-sibling axis
    for item in element.xpath("following-sibling::*/text()|following-sibling::text()"):
        item = item.strip()
        if item:
            yield item
for el in tree.findall('.//dd'):
    before = " ".join(get_text_before(el))
    after = " ".join(get_text_after(el))
    print({
        "el": el,
        "before": before,
        "after": after
    })
Prints:
{'el': <Element dd at 0x10af81488>, 'after': 'before and after', 'before': 'find the text'}
{'el': <Element dd at 0x10af81200>, 'after': 'and after', 'before': 'find the text before'}
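The placeholder idea from the question can also be sketched without XPath axes, using only the standard library. This is an alternative technique, not the approach from the answer above: it flattens the tree into a list of text pieces and element markers in document order, then splits that list at each separator. The helper names flatten and split_around are made up for this example.

```python
import xml.etree.ElementTree as ET

tree = ET.fromstring('''
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
''')

def flatten(elem, out):
    """Append text pieces and descendant elements of elem to out, in document order."""
    if elem.text and elem.text.strip():
        out.append(elem.text.strip())
    for child in elem:
        out.append(child)  # the element itself acts as a placeholder
        flatten(child, out)
        if child.tail and child.tail.strip():
            out.append(child.tail.strip())
    return out

def split_around(root, tag):
    """For every element named tag, join the text before and after it."""
    tokens = flatten(root, [])
    results = []
    for i, tok in enumerate(tokens):
        if isinstance(tok, ET.Element) and tok.tag == tag:
            before = " ".join(t for t in tokens[:i] if isinstance(t, str))
            after = " ".join(t for t in tokens[i + 1:] if isinstance(t, str))
            results.append({"el": tok, "before": before, "after": after})
    return results

results = split_around(tree, "dd")
for item in results:
    print(item["before"], "|", item["after"])
```

Each result keeps a reference to the actual element, which was the requirement that made plain string splitting awkward.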

lxml - Is there any hacky way to keep &quot;?

I noticed that lxml automatically converts the XML entity &quot; to the literal character it represents:
>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
<elem>"hello world"</elem>
</root>
>>>
I found one related old thread (2009-02-07):
s = cStringIO.StringIO("""&quot;She&apos;s the MAN!&quot;""")
e = etree.parse(s,etree.XMLParser(resolve_entities=False))
Note that there's also etree.fromstring().
etree.tostring(e)
'"She\'s the MAN!"'
I would have expected resolve_entities=False to have prevented the
translation of, e.g., &quot; to ".
The "resolve_entities" option is meant for entities defined in a DTD
of which you want to keep the reference instead of the resolved value.
The entities you mention are part of the XML spec, not of a DTD.
is there another way to prevent this behavior (or, if nothing else,
reverse it after the fact)?
Well, what you get is well-formed XML. May I ask why you need the
entity references in the output?
Still, the only response was to ask why I want this; there is no direct answer to the problem. I'm quite surprised that the etree parser forces the conversion without offering an option to disable it.
The following example shows why I need a solution. This XML is for the XBMC skinning parser:
>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>Close</onfocus>
<onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
</control>
</controls>
</window>
>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1]) # remove one element, just for demonstration
...
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read() #the file after the change
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
</control>
</controls>
</window>
>>>
As you can see in the last onfocus element of the control with id "103", the &quot; entities are no longer in their original form. This leads to a bug if the "$INFO[VideoPlayer.Album]" variable contains nested quotes and becomes ""test"", which is invalid and causes an error.
So is there any hacky way to keep &quot; in its original form?
[UPDATE]:
For anyone interested: the other three predefined XML entities (gt, lt and amp) are only converted when serializing with method="html" inside a script tag. lxml and xml.etree.ElementTree behave the same way on both Python 2 and Python 3, which is confusing:
>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
<script>"'&amp;&gt;&lt;</script>
<p>"'&amp;&gt;&lt;</p>
</root>
>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>
>>>
[UPDATE2]:
The following list of HTML tags is taken from the html5lib sanitizer:
#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
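from lxml import etree as et
for e in acceptable_elements:
    # Build e.g. <a>hello&amp;world</a> and parse it as XML
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    # encoding="unicode" returns text, so the substring test also works on Python 3
    s = et.tostring(r, pretty_print=1, method="html", encoding="unicode")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:
        print(s)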
Running this code prints the following:
<area>
<br>
<col>
<hr>
<img>
<input>
As you can see, only the opening tag is printed and the rest disappears. I tested all five XML entities and they all behave the same way, which is confusing. This does not happen when parsing with HTMLParser, so I guess there is a mismatch between the fromstring step (which defaults to XML parsing) and the tostring(method="html") step. It also has nothing to do with entities, because "<img>hello</img>" (without entities) is truncated to <img> as well; the text hello is dropped from the HTML output but still appears when serializing with method="xml".
from xml.sax.saxutils import escape
from lxml import etree
def to_string(xdoc):
    # Serialize the tree by hand so that quotes in text nodes are written as entities
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            text = escape(elem.text, {"'": "&apos;", "\"": "&quot;"}) if elem.text is not None else ""
            attrs = "".join([' %s="%s"' % (k, v) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            r += "</%s>%s" % (elem.tag, elem.tail if elem.tail else "\n")
    return r
xdoc = etree.fromstring(xml_text)  # xml_text: the XML document as a string
s = to_string(xdoc)
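The key piece of the function above is the second argument to xml.sax.saxutils.escape, which adds extra replacements on top of the default &amp;, &lt; and &gt;. A small standalone sketch of that call follows; the sample strings and the QUOTE_ENTITIES name are made up for illustration.

```python
from xml.sax.saxutils import escape

# Extra replacements applied after the default &, < and > escaping.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

text = 'RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]")'
escaped = escape(text, QUOTE_ENTITIES)
print(escaped)
# RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;)

mixed = escape("She's <b> & co", QUOTE_ENTITIES)
print(mixed)
# She&apos;s &lt;b&gt; &amp; co
```

Because escape replaces the ampersand first and the extra entities afterwards, the ampersands in the generated entities are not escaped a second time.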

python-requests - Sending via POST form with square brackets names doesn't work

I'm trying to send test[key1] = val1 and test[key2] = val42 to the server via an HTML form.
The corresponding HTML would be:
<input type="text" name="test[key1]" value="val1" />
<input type="text" name="test[key2]" value="val42" />
(By the way, I would like to know the correct name for this kind of form.)
>>> import requests, json
>>> params = { 'test' : { 'key1' : 'val1', 'key2' : 'val42' } }
>>> r = requests.post('http://httpbin.org/post', data=params)
>>> json.loads(r.text)['form']
{u'test': [u'key2', u'key1']}
The POST data has been flattened: we get the keys but lose the values val1 and val42.
I thought python-requests would automatically handle a params dict with nested keys, but that is not the case.
You need to write the params keys with the square brackets.
>>> params = { 'test[key1]' : 'val1', 'test[key2]' : 'val42' }
>>> r = requests.post('http://httpbin.org/post', data=params)
>>> json.loads(r.text)['form']
{u'test[key1]': u'val1', u'test[key2]': u'val42'}
Hope this will help someone.
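If the data starts out as a nested dict, the bracketed keys can be generated rather than typed by hand. A minimal sketch follows; flatten_form is a hypothetical helper, not part of requests, and it only handles nested dicts (not lists).

```python
def flatten_form(data, parent_key=""):
    """Flatten nested dicts into bracketed form keys such as test[key1]."""
    flat = {}
    for key, value in data.items():
        full_key = "%s[%s]" % (parent_key, key) if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse, carrying the bracketed prefix along
            flat.update(flatten_form(value, full_key))
        else:
            flat[full_key] = value
    return flat

params = flatten_form({"test": {"key1": "val1", "key2": "val42"}})
print(params)  # {'test[key1]': 'val1', 'test[key2]': 'val42'}
```

The resulting dict can then be passed as data=params to requests.post, as in the session above.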
HTML forms have no native notion of nesting, so nested data cannot be serialized directly. Use a library like formencode, especially its variabledecode module, to convert between flat form data and nested structures such as JSON.
https://github.com/formencode/formencode/blob/master/formencode/variabledecode.py

BeautifulSoup fails to parse nested <p> elements

Dependencies:
BeautifulSoup==3.2.1
In: from BeautifulSoup import BeautifulSoup
In: BeautifulSoup('<p><p>123</p></p>')
Out: <p></p><p>123</p>
Why are the two adjacent closing </p> tags not in the output?
That is just BS3's parser fixing your broken html.
The P element represents a paragraph. It cannot contain block-level
elements (including P itself).
This
<p><p>123</p></p>
is not valid HTML: <p> elements can't be nested, so BS tries to clean it up.
When BS encounters the second <p>, it considers the first <p> finished and inserts a closing </p>. The second </p> in your input then has no matching opening <p>, so it is removed.
This is because BeautifulSoup has this NESTABLE_TAGS concept/setting:
When Beautiful Soup is parsing a document, it keeps a stack of open
tags. Whenever it sees a new start tag, it tosses that tag on top of
the stack. But before it does, it might close some of the open tags
and remove them from the stack. Which tags it closes depends on the
qualities of tag it just found, and the qualities of the tags in the
stack.
So when Beautiful Soup encounters a <P> tag, it closes and pops all
the tags up to and including the previously encountered tag of the
same type. This is the default behavior, and this is how
BeautifulStoneSoup treats every tag. It's what you get when a tag is
not mentioned in either NESTABLE_TAGS or RESET_NESTING_TAGS. It's also
what you get when a tag shows up in RESET_NESTING_TAGS but has no
entry in NESTABLE_TAGS, the way the <P> tag does.
>>> pprint(BeautifulSoup.NESTABLE_TAGS)
{'bdo': [],
'blockquote': [],
'center': [],
'dd': ['dl'],
'del': [],
'div': [],
'dl': [],
'dt': ['dl'],
'fieldset': [],
'font': [],
'ins': [],
'li': ['ul', 'ol'],
'object': [],
'ol': [],
'q': [],
'span': [],
'sub': [],
'sup': [],
'table': [],
'tbody': ['table'],
'td': ['tr'],
'tfoot': ['table'],
'th': ['tr'],
'thead': ['table'],
'tr': ['table', 'tbody', 'tfoot', 'thead'],
'ul': []}
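The stack behaviour described in the quoted documentation can be illustrated with a toy model. This is only a sketch of the idea, not BeautifulSoup's actual code: simulate is a made-up function, the input is a hand-written token list rather than parsed HTML, and nestable is simplified to a plain set of tag names instead of the mapping shown above.

```python
def simulate(tokens, nestable=frozenset()):
    """Replay ('start'|'end'|'text', value) tokens against a stack of open tags."""
    stack, out = [], []

    def close_through(tag):
        # Pop and close open tags up to and including the most recent `tag`.
        while stack:
            popped = stack.pop()
            out.append("</%s>" % popped)
            if popped == tag:
                break

    for kind, value in tokens:
        if kind == "start":
            # A non-nestable tag closes the previous open tag of the same type.
            if value in stack and value not in nestable:
                close_through(value)
            stack.append(value)
            out.append("<%s>" % value)
        elif kind == "end":
            # An end tag with no matching open tag is silently dropped.
            if value in stack:
                close_through(value)
        else:
            out.append(value)
    while stack:
        out.append("</%s>" % stack.pop())
    return "".join(out)

tokens = [("start", "p"), ("start", "p"), ("text", "123"), ("end", "p"), ("end", "p")]
default_result = simulate(tokens)
nested_result = simulate(tokens, nestable={"p"})
print(default_result)  # <p></p><p>123</p>
print(nested_result)   # <p><p>123</p></p>
```

With the default settings the model reproduces the output from the question, and allowing p to nest keeps the input unchanged.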
As a workaround, you can allow p tag to be inside p:
>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup.NESTABLE_TAGS['p'] = ['p']
>>> BeautifulSoup('<p><p>123</p></p>')
<p><p>123</p></p>
Also, BeautifulSoup 3 is no longer maintained; you should switch to BeautifulSoup 4.
When using BeautifulSoup4, you can change the underlying parser to change the behavior:
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<p><p>123</p></p>')
<html><body><p></p><p>123</p></body></html>
>>> BeautifulSoup('<p><p>123</p></p>', 'html.parser')
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'xml')
<?xml version="1.0" encoding="utf-8"?>
<p><p>123</p></p>
>>> BeautifulSoup('<p><p>123</p></p>', 'html5lib')
<html><head></head><body><p></p><p>123</p><p></p></body></html>
