I have this list of hierarchical URLs:
data = ["https://python-rq.org/","https://python-rq.org/a","https://python-rq.org/a/b","https://python-rq.org/c"]
I want to dynamically build a nested dictionary in which every URL holds, as its children, the URLs that are subfolders of it.
I already tried the following, but it does not return what I expect:
import re

result = []
for key, d in enumerate(data):
    form_dict = {}
    r_pattern = re.search(r"(http(s)?://(.*?)/)(.*)", d)
    r = r_pattern.group(4)
    if r == "":
        parent_url = r_pattern.group(3)
    else:
        parent_url = r_pattern.group(3) + "/" + r
    print(parent_url)
    temp_list = data.copy()
    temp_list.pop(key)
    form_dict["name"] = parent_url
    form_dict["children"] = []
    for t in temp_list:
        child_dict = {}
        if parent_url in t:
            child_dict["name"] = t
            form_dict["children"].append(child_dict.copy())
    result.append(form_dict)
This is the expected output.
{
    "name": "https://python-rq.org/",
    "children": [
        {
            "name": "https://python-rq.org/a",
            "children": [
                {
                    "name": "https://python-rq.org/a/b",
                    "children": []
                }
            ]
        },
        {
            "name": "https://python-rq.org/c",
            "children": []
        }
    ]
}
Any advice?
This was a nice problem. I tried to continue with your regex approach but got stuck, and found that split is a better fit for this case. The following works:
data = ["https://python-rq.org/", "https://python-rq.org/a", "https://python-rq.org/a/b", "https://python-rq.org/c"]

# Strip a trailing "/" from each URL. This makes the URLs much easier to
# match, and the slash is not needed for the link to be valid.
data = [x[:-1] if x.endswith("/") else x for x in data]
print(data)

result = []

# Recursively search the tree for the node whose name equals d
def find_match(d, res):
    for t in res:
        if d == t["name"]:
            return t
        elif len(t["children"]) > 0:
            temp = find_match(d, t["children"])
            if temp:
                return temp
    return None

while len(data) > 0:
    d = data[0]
    form_dict = {}
    l = d.split("/")
    # I removed the regex, as matching the last group wasn't working out;
    # split does just what you need
    parent = "/".join(l[:-1])
    data.pop(0)
    form_dict["name"] = d
    form_dict["children"] = []
    option = find_match(parent, result)
    if option:
        option["children"].append(form_dict)
    else:
        result.append(form_dict)
print(result)
[{'name': 'https://python-rq.org', 'children': [{'name': 'https://python-rq.org/a', 'children': [{'name': 'https://python-rq.org/a/b', 'children': []}]}, {'name': 'https://python-rq.org/c', 'children': []}]}]
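As a variation on the same idea, here is a minimal sketch that keeps a dictionary from URL to node, so the parent is found by lookup instead of a recursive search. The function name build_tree is my own, and it assumes that sorting the URLs puts every parent before its children (true here, since a parent is a prefix of its children).

```python
def build_tree(urls):
    """Nest each URL under its closest ancestor that is also in the list."""
    nodes = {}   # normalized URL -> node dictionary
    roots = []   # nodes that have no listed ancestor
    # A parent is a string prefix of its children, so sorting puts it first
    for url in sorted(u.rstrip("/") for u in urls):
        node = {"name": url, "children": []}
        parent = url
        found = None
        # Drop one path segment at a time until a known URL turns up
        while "/" in parent:
            parent = parent.rsplit("/", 1)[0]
            if parent in nodes:
                found = nodes[parent]
                break
        if found is not None:
            found["children"].append(node)
        else:
            roots.append(node)
        nodes[url] = node
    return roots


data = ["https://python-rq.org/", "https://python-rq.org/a",
        "https://python-rq.org/a/b", "https://python-rq.org/c"]
tree = build_tree(data)
print(tree)
```

Unlike the loop above, this also attaches a URL to a grandparent when the direct parent is missing from the list, which may or may not be what you want.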
I am trying to parse an XML file with a regular expression.
For every script tag that contains a line with the alias "catch", I need to collect the "type" and the "value".
<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>
I tried this regular expression with multiline and dotall:
>>> re.findall(r'script\s+type=\"(\w+)\".*alias=\"catch\"\s+value=\"(\d+)\"', a, re.MULTILINE|re.DOTALL)
Output which I am getting is:
[('abc', '8')]
Expected output is:
[('abc', '4'), ('xyz', '8')]
Can someone help me in figuring out what I am missing here?
I recommend using BeautifulSoup. You can parse through the tags and, with a little bit of data re-structuring, easily check for the right alias values and store the related attributes of interest. Like this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")
to_keep = []
for script in soup.find_all("script"):
    t = script["type"]
    # The body of <script> comes back as plain text, so split it into
    # key="value" tokens by hand (the values keep their quotes)
    attrs = {
        k: v for k, v in [attr.split("=")
                          for attr in script.contents[0].split()
                          if "=" in attr]
    }
    if attrs["alias"] == '"catch"':
        to_keep.append({"type": t, "value": attrs["value"]})
to_keep
# [{'type': 'abc', 'value': '"4"'}, {'type': 'xyz', 'value': '"8"'}]
Data:
data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""
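If you would rather avoid splitting the tag text by hand, here is a rough sketch using only the standard library's xml.etree.ElementTree. It assumes the snippet is well-formed XML once wrapped in a single root element; the wrapper name root is my own choice, not something from the question.

```python
import xml.etree.ElementTree as ET

data = """<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>"""

# The snippet has two top-level elements, so wrap it in one root first
root = ET.fromstring("<root>" + data + "</root>")

pairs = [
    (script.get("type"), line.get("value"))
    for script in root.iter("script")
    for line in script.iter("line")
    if line.get("alias") == "catch"
]
print(pairs)  # [('abc', '4'), ('xyz', '8')]
```

Here the attribute values come back without the surrounding quotes, and the order of attributes inside the line tag no longer matters.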
Found the answer. The .* has to be non-greedy (.*?); otherwise, with DOTALL, the first match runs on to the last alias="catch" in the file:
>>> re.findall(r'script\s+type=\"(\w+)\".*?alias=\"catch\"\s+value=\"(\d+)\".*?\<\/script\>', a, re.MULTILINE|re.DOTALL)
[('abc', '4'), ('xyz', '8')]
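To make the difference visible, here is a small side-by-side sketch of the greedy and the non-greedy pattern on the sample from the question. The variable names greedy and lazy are mine; the patterns are the ones above, minus the unnecessary backslashes and the trailing part.

```python
import re

a = '''<script type="abc">
<line x="word" size="1" alias="catch" value="4" desc="description"/>
</script>
<script type="xyz">
<line x="state" size="5" alias="catch" value="8" desc="description"/>
</script>'''

# Greedy: .* swallows everything up to the LAST alias="catch"
greedy = re.findall(r'script\s+type="(\w+)".*alias="catch"\s+value="(\d+)"',
                    a, re.DOTALL)

# Non-greedy: .*? stops at the FIRST alias="catch" after each script tag
lazy = re.findall(r'script\s+type="(\w+)".*?alias="catch"\s+value="(\d+)"',
                  a, re.DOTALL)

print(greedy)  # [('abc', '8')]
print(lazy)    # [('abc', '4'), ('xyz', '8')]
```

Note that the non-greedy version can still cross into the next script block if a block has no "catch" line at all, so it is only safe when every block contains one.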
I am new to python and am trying to read a file and create a dictionary from it.
The format is as follows:
.1.3.6.1.4.1.14823.1.1.27 {
TYPE = Switch
VENDOR = Aruba
MODEL = ArubaS3500-48T
CERTIFICATION = CERTIFIED
CONT = Aruba-Switch
HEALTH = ARUBA-Controller
VLAN = Dot1q INSTRUMENTATION:
Card-Fault = ArubaController:DeviceID
CPU/Memory = ArubaController:DeviceID
Environment = ArubaSysExt:DeviceID
Interface-Fault = MIB2
Interface-Performance = MIB2
Port-Fault = MIB2
Port-Performance = MIB2
}
I want the OID on the first line (.1.3.6.1.4.1.14823.1.1.27) to be the key, and the remaining lines up to the closing } to be the values.
I have tried a few combinations but cannot get the regex to match these correctly.
Any help, please?
I have tried something like
lines = cache.readlines()
for line in lines:
    searchObj = re.search(r'(^.\d.*{)(.*)$', line)
    if searchObj:
        (oid, cert) = searchObj.groups()
        results[searchObj(oid)] = ", ".join(line[1:])
        print("searchObj.group() : ", searchObj.group(1))
        print("searchObj.group(1) : ", searchObj.group(2))
You can try this:
import re
data = open('filename.txt').read()
# Raw strings keep the backslashes in the patterns intact
the_key = re.findall(r"^\n*[\.\d]+", data)
values = [re.split(r"\s+\=\s+", i) for i in re.findall(r"[a-zA-Z0-9]+\s*\=\s*[a-zA-Z0-9]+", data)]
final_data = {the_key[0]: dict(values)}
Output:
{'\n.1.3.6.1.4.1.14823.1.1.27': {'VENDOR': 'Aruba', 'CERTIFICATION': 'CERTIFIED', 'Fault': 'MIB2', 'VLAN': 'Dot1q', 'Environment': 'ArubaSysExt', 'HEALTH': 'ARUBA', 'Memory': 'ArubaController', 'Performance': 'MIB2', 'CONT': 'Aruba', 'MODEL': 'ArubaS3500', 'TYPE': 'Switch'}}
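Note that in the output above, keys and values containing "-", "/" or ":" come out truncated (for example 'Fault' instead of 'Interface-Fault'). If that matters, a plain line-by-line parser may be simpler than a regex. This is only a sketch: the function name parse_blocks is hypothetical, and it assumes every block opens with a line ending in "{" and closes with a line containing only "}".

```python
def parse_blocks(text):
    """Return {oid: {key: value}} for blocks of the form 'oid { key = value ... }'."""
    result = {}
    current = None  # dictionary of the block being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            # Header line: everything before the brace is the key
            current = result.setdefault(line[:-1].strip(), {})
        elif line == "}":
            current = None
        elif current is not None and "=" in line:
            # Split on the first "=" only, so values may contain "="
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return result


sample = """.1.3.6.1.4.1.14823.1.1.27 {
    TYPE = Switch
    VENDOR = Aruba
    VLAN = Dot1q INSTRUMENTATION:
    CPU/Memory = ArubaController:DeviceID
    Interface-Fault = MIB2
}
"""
parsed = parse_blocks(sample)
print(parsed)
```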
You could use a nested dict comprehension along with an outer and inner regex.
Your blocks can be separated by
.numbers...numbers.. {
// values here
}
In terms of regular expression this can be formulated as
^\s* # start of line + whitespaces, eventually
(?P<key>\.[\d.]+)\s* # the key
{(?P<values>[^{}]+)} # everything between { and }
As you see, we split the parts into key/value pairs.
Your "inner" structure can be formulated like
(?P<key>\b[A-Z][-/\w]+\b) # the "inner" key
\s*=\s* # whitespaces, =, whitespaces
(?P<value>.+) # the value
Now let's build the "outer" and "inner" expressions together:
import re

# `string` is assumed to hold the full text of the file
rx_outer = re.compile(r'^\s*(?P<key>\.[\d.]+)\s*{(?P<values>[^{}]+)}', re.MULTILINE)
rx_inner = re.compile(r'(?P<key>\b[A-Z][-/\w]+\b)\s*=\s*(?P<value>.+)')
result = {item.group('key'):
          {match.group('key'): match.group('value')
           for match in rx_inner.finditer(item.group('values'))}
          for item in rx_outer.finditer(string)}
print(result)
A demo can be found on ideone.com.
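In case the demo link is unavailable, here is a self-contained sketch of the same two-level approach. The sample text is a shortened version of the block from the question, and the variable name string just mirrors the snippet above.

```python
import re

string = """.1.3.6.1.4.1.14823.1.1.27 {
    TYPE = Switch
    VENDOR = Aruba
    CPU/Memory = ArubaController:DeviceID
    Interface-Fault = MIB2
}
"""

# Outer pattern: the dotted OID, then everything between { and }
rx_outer = re.compile(r'^\s*(?P<key>\.[\d.]+)\s*{(?P<values>[^{}]+)}', re.MULTILINE)
# Inner pattern: one "KEY = value" pair per line
rx_inner = re.compile(r'(?P<key>\b[A-Z][-/\w]+\b)\s*=\s*(?P<value>.+)')

result = {
    item.group('key'): {
        match.group('key'): match.group('value')
        for match in rx_inner.finditer(item.group('values'))
    }
    for item in rx_outer.finditer(string)
}
print(result)
```

Because the inner key pattern allows "-" and "/", keys such as CPU/Memory and Interface-Fault are kept whole.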
How to get all text before an element in a etree separated from the text after the element?
from lxml import etree
tree = etree.fromstring('''
<a>
find
<b>
the
</b>
text
<dd></dd>
<c>
before
</c>
<dd></dd>
and after
</a>
''')
What do I want? In this example, the <dd> tags are separators and for all of them
for el in tree.findall('.//dd'):
I would like to have all text before and after them:
[
{
el : <Element dd at 0xsomedistinctadress>,
before : 'find the text',
after : 'before and after'
},
{
el : <Element dd at 0xsomeotherdistinctadress>,
before : 'find the text before',
after : 'and after'
}
]
My idea was to use some kind of placeholders in the tree with which I replace the <dd> tags and then cut the string at that placeholder, but I need the correspondence with the actual element.
There might be a simpler way, but I would use the following XPath expressions:
preceding-sibling::*/text()|preceding-sibling::text()
following-sibling::*/text()|following-sibling::text()
Sample implementation (definitely violating the DRY principle):
def get_text_before(element):
    # Text of earlier sibling elements plus text nodes directly before us
    for item in element.xpath("preceding-sibling::*/text()|preceding-sibling::text()"):
        item = item.strip()
        if item:
            yield item

def get_text_after(element):
    # Text of later sibling elements plus text nodes directly after us
    for item in element.xpath("following-sibling::*/text()|following-sibling::text()"):
        item = item.strip()
        if item:
            yield item

for el in tree.findall('.//dd'):
    before = " ".join(get_text_before(el))
    after = " ".join(get_text_after(el))
    print({
        "el": el,
        "before": before,
        "after": after
    })
Prints:
{'el': <Element dd at 0x10af81488>, 'before': 'find the text', 'after': 'before and after'}
{'el': <Element dd at 0x10af81200>, 'before': 'find the text before', 'after': 'and after'}
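If lxml's XPath axes are not available, a similar result can be had with the standard library by flattening the tree into a list of events in document order and cutting that list at each separator. This is a sketch under my own naming (walk and split_around are hypothetical helpers), and it assumes the separator elements themselves are empty, as the dd tags are here.

```python
import xml.etree.ElementTree as ET

def walk(elem):
    """Yield ('elem', element) and ('text', string) events in document order."""
    yield ("elem", elem)
    if elem.text and elem.text.strip():
        yield ("text", elem.text.strip())
    for child in elem:
        yield from walk(child)
        # The tail is the text that follows the child's closing tag
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())

def split_around(root, tag):
    """For each element named `tag`, collect the text before and after it."""
    events = list(walk(root))
    out = []
    for i, (kind, value) in enumerate(events):
        if kind == "elem" and value.tag == tag:
            before = " ".join(v for k, v in events[:i] if k == "text")
            after = " ".join(v for k, v in events[i + 1:] if k == "text")
            out.append({"el": value, "before": before, "after": after})
    return out

root = ET.fromstring("""
<a>
  find
  <b>
    the
  </b>
  text
  <dd></dd>
  <c>
    before
  </c>
  <dd></dd>
  and after
</a>
""")
results = split_around(root, "dd")
for entry in results:
    print(entry["before"], "|", entry["after"])
```

Unlike the sibling-based XPath, this version also picks up text from deeper levels of the tree, not only from siblings of the separator.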
I noticed that the XML entity &quot; is automatically converted to its literal character:
>>> from lxml import etree as et
>>> parser = et.XMLParser()
>>> xml = et.fromstring("<root><elem>&quot;hello world&quot;</elem></root>", parser)
>>> print et.tostring(xml, pretty_print=1)
<root>
<elem>"hello world"</elem>
</root>
>>>
I found one related old thread (2009-02-07):
s = cStringIO.StringIO("""&quot;She's the MAN!&quot;""")
e = etree.parse(s,etree.XMLParser(resolve_entities=False))
Note that there's also etree.fromstring().
etree.tostring(e)
'"She\'s the MAN!"'
I would have expected resolve_entities=False to have prevented the
translation of, eg, &quot; to ".
The "resolve_entities" option is meant for entities defined in a DTD
of which you want to keep the reference instead of the resolved value.
The entities you mention are part of the XML spec, not of a DTD.
is there another way to prevent this behavior (or, if nothing else,
reverse it after the fact)?
Well, what you get is well-formed XML. May I ask why you need the
entity references in the output?
Still, the only response asks why you would want to do that; there is no direct answer to the problem. I am quite surprised that the etree parser forces the conversion without offering an option to disable it.
The following example shows why I need a solution. This XML is for the XBMC skinning parser:
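The behavior is easy to reproduce with the standard library as well. The following sketch uses xml.etree.ElementTree rather than lxml, so it only illustrates the general mechanism: the parser resolves the predefined entity while reading, and the serializer has no record that an entity was ever there.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><elem>&quot;hello world&quot;</elem></root>")

# After parsing, the text already holds the literal quote characters
text = root.find("elem").text
print(text)        # "hello world"

# Serializing escapes only what XML requires in text (&, <, >), not quotes
serialized = ET.tostring(root, encoding="unicode")
print(serialized)  # <root><elem>"hello world"</elem></root>
```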
>>> print open("/tmp/so.xml").read() #the original file
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>Close</onfocus>
<onfocus>RunScript(&quot;/.xbmc/addons/script.hello.world/default.py&quot;,&quot;$INFO[VideoPlayer.Album]&quot;,&quot;$INFO[VideoPlayer.Genre]&quot;)</onfocus>
</control>
</controls>
</window>
>>> root = et.parse("/tmp/so.xml", parser)
>>> r = root.getroot()
>>> for c in r:
...     for cc in c:
...         if cc.attrib.get('id') == "103":
...             cc.remove(cc[1])  # remove one element, just to demonstrate
...
>>> o = open("/tmp/so.xml", "w")
>>> o.write(et.tostring(r, pretty_print=1)) #save it back
>>> o.close()
>>> print open("/tmp/so.xml").read() #the file after the change
<window id="1234">
<defaultcontrol>101</defaultcontrol>
<controls>
<control type="button" id="101">
<onfocus>Dialog.Close(212)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="102">
<visible>StringCompare(VideoPlayer.PlotOutline,Stream.IsPlaying) + !Skin.HasSetting(Stream.IsUpdated)</visible>
<onfocus>RunScript(script.test)</onfocus>
<onfocus>SetFocus(11)</onfocus>
</control>
<control type="button" id="103">
<visible>SubString(VideoPlayer.PlotOutline,Video.IsPlaying)</visible>
<onfocus>RunScript("/.xbmc/addons/script.hello.world/default.py","$INFO[VideoPlayer.Album]","$INFO[VideoPlayer.Genre]")</onfocus>
</control>
</controls>
</window>
>>>
As you can see in the onfocus element under id "103" at the end, the &quot; entities are no longer in their original form. This leads to a bug if the "$INFO[VideoPlayer.Album]" variable contains nested quotes and becomes ""test"", which is invalid and causes an error.
So is there any hacky way I can keep &quot; in its original form?
[UPDATE]:
For anyone interested: the other three predefined XML entities (gt, lt and amp) only get converted when using method="html" inside a script tag. lxml and xml.etree.ElementTree, on both Python 2 and Python 3, behave the same way here, which is confusing:
>>> from lxml import etree as et
>>> r = et.fromstring("<root><script>&quot;&apos;&amp;&gt;&lt;</script><p>&quot;&apos;&amp;&gt;&lt;</p></root>")
>>> print et.tostring(r, pretty_print=1, method="xml")
<root>
  <script>"'&amp;&gt;&lt;</script>
  <p>"'&amp;&gt;&lt;</p>
</root>
>>> print et.tostring(r, pretty_print=1, method="html")
<root><script>"'&><</script><p>"'&amp;&gt;&lt;</p></root>
>>>
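For comparison, here is the same experiment as a Python 3 sketch with the standard library's xml.etree.ElementTree. It is not lxml, so treat it as an illustration of the claim above rather than proof about lxml itself; the names as_xml and as_html are mine.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><script>&quot;&apos;&amp;&gt;&lt;</script>"
    "<p>&quot;&apos;&amp;&gt;&lt;</p></root>"
)

# method="xml": &, < and > are escaped again in every element
as_xml = ET.tostring(root, encoding="unicode", method="xml")
print(as_xml)

# method="html": the content of <script> is written out unescaped
as_html = ET.tostring(root, encoding="unicode", method="html")
print(as_html)
```

In both modes the quote characters stay literal; only the handling of &, < and > differs, and only inside script.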
[UPDATE2]:
The following is the list of all possible html tags:
#https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area',
'article', 'aside', 'audio', 'b', 'big', 'blockquote', 'br', 'button',
'canvas', 'caption', 'center', 'cite', 'code', 'col', 'colgroup',
'command', 'datagrid', 'datalist', 'dd', 'del', 'details', 'dfn',
'dialog', 'dir', 'div', 'dl', 'dt', 'em', 'event-source', 'fieldset',
'figcaption', 'figure', 'footer', 'font', 'form', 'header', 'h1',
'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 'input', 'ins',
'keygen', 'kbd', 'label', 'legend', 'li', 'm', 'map', 'menu', 'meter',
'multicol', 'nav', 'nextid', 'ol', 'output', 'optgroup', 'option',
'p', 'pre', 'progress', 'q', 's', 'samp', 'section', 'select',
'small', 'sound', 'source', 'spacer', 'span', 'strike', 'strong',
'sub', 'sup', 'table', 'tbody', 'td', 'textarea', 'time', 'tfoot',
'th', 'thead', 'tr', 'tt', 'u', 'ul', 'var', 'video']
from lxml import etree as et
for e in acceptable_elements:
    r = et.fromstring(e.join(["<", ">hello&amp;world</", ">"]))
    s = et.tostring(r, pretty_print=1, method="html")
    closed_tag = "</" + e + ">"
    if closed_tag not in s:
        print s
Running this code produces the following output:
<area>
<br>
<col>
<hr>
<img>
<input>
As you can see, only the opening tag is printed and the rest simply disappears. I tested all 5 XML entities and they all behave the same way, which is confusing. This does not happen when using HTMLParser, so I guess the problem lies between the fromstring step (whose parser defaults to XML) and the tostring(method="html") step. I also found that it has nothing to do with entities, because "<img>hello</img>" (without entities) is truncated to <img> as well. The hello text is not lost from the tree, since it shows up again when printing with method="xml".
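One possible explanation, offered as an assumption rather than a confirmed diagnosis of lxml: area, br, col, hr, img and input are void elements in HTML, which cannot have content or a closing tag, so an HTML serializer never writes one. The sketch below checks this with the standard library's serializer only; has_closing_tag is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def has_closing_tag(tag):
    """Serialize <tag>hello</tag> as HTML and report whether </tag> is emitted."""
    elem = ET.fromstring("<%s>hello</%s>" % (tag, tag))
    html = ET.tostring(elem, encoding="unicode", method="html")
    return ("</%s>" % tag) in html

for tag in ("area", "br", "col", "hr", "img", "input", "p"):
    print(tag, has_closing_tag(tag))
```

The six tags from the output above come back without a closing tag, while an ordinary element such as p keeps it.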
from xml.sax.saxutils import escape
from lxml import etree

def to_string(xdoc):
    r = ""
    for action, elem in etree.iterwalk(xdoc, events=("start", "end")):
        if action == 'start':
            # Escape &, < and > plus both quote characters in the text
            text = escape(elem.text, {"'": "&apos;", "\"": "&quot;"}) if elem.text is not None else ""
            attrs = "".join([' %s="%s"' % (k, v) for k, v in elem.attrib.items()])
            r += "<%s%s>%s" % (elem.tag, attrs, text)
        elif action == 'end':
            r += "</%s>%s" % (elem.tag, elem.tail if elem.tail else "\n")
    return r

xdoc = etree.fromstring(xml_text)
s = to_string(xdoc)
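The same idea can be sketched without lxml, using a recursive function over the standard library's ElementTree. This is a variation, not the code above: it also escapes the tail text and uses quoteattr for attribute values, and the sample document is a cut-down, made-up version of the XBMC file.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Extra replacements on top of the &, < and > that escape() always handles
QUOTES = {'"': "&quot;", "'": "&apos;"}

def to_string(elem):
    """Serialize elem, writing quotes in text back out as entities."""
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in elem.attrib.items())
    text = escape(elem.text or "", QUOTES)
    children = "".join(to_string(child) for child in elem)
    tail = escape(elem.tail or "", QUOTES)
    return "<%s%s>%s%s</%s>%s" % (elem.tag, attrs, text, children, elem.tag, tail)

root = ET.fromstring(
    '<control id="103"><onfocus>RunScript(&quot;a.py&quot;,'
    '&quot;$INFO[x]&quot;)</onfocus></control>'
)
s = to_string(root)
print(s)
```

Every quote in element text is turned into an entity, including quotes that were literal in the original file, so this does not restore the input byte for byte.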