I'm trying to read the values inside an XML document, but with this code I only get the text of the first name element. I want the value that belongs to each name.
XML Text:
<root><label_params><label_param><name>BranchName</name><value></value></label_param><label_param><name>CustomerCode</name><value></value></label_param><label_param><name>SealNumber</name><value>0110000000420</value></label_param><label_param><name>CustomerName</name><value>PUNTO EDUCATIVO LTDA</value></label_param><label_param><name>LpnTypeCode</name><value>LPN</value></label_param><label_param><name>OutboundNumber</name><value>1685147.1</value></label_param><label_param><name>ReferenceNumber</name><value>18072019_pall_cerr</value></label_param><label_param><name>DeliveryAddress1</name><value>Sin Direccion</value></label_param><label_param><name>NroCita</name><value></value></label_param><label_param><name>FechaEnt</name><value>19/07/2019</value></label_param><label_param><name>Porder</name><value>18072019_pall_cerr</value></label_param><label_param><name>Factura</name><value></value></label_param><label_param><name>IdLpnCode</name><value>C0000000015</value></label_param><label_param><name>TotalBultos</name><value></value></label_param><label_param><name>ANDENWMS</name><value>ANDEN15</value></label_param><label_param><name>LpnPadre</name><value>C0000000015</value></label_param><label_param><name>Cerrados</name><value>4</value></label_param><label_param><name>NoCerrados</name><value>2</value></label_param><label_param><name>TOTALPALLET</name><value></value></label_param></label_params></root>
Python Code
from xml.dom.minidom import parse
doc = parse("DataXML.xml")
my_node_list = doc.getElementsByTagName("name")
my_n_node = my_node_list[0]
my_child = my_n_node.firstChild
my_text = my_child.data
print(my_text)
Here you go:
from xml.dom.minidom import parse
doc = parse("../data/DataXML.xml")
my_node_list = doc.getElementsByTagName("label_param")
for node in my_node_list:
    name_node = node.getElementsByTagName("name")
    value_node = node.getElementsByTagName("value")
    print("Name: " + name_node[0].firstChild.data)
    # An empty <value></value> element has no text child at all
    if value_node[0].firstChild is not None:
        print("Value: " + value_node[0].firstChild.data)
    else:
        print("Value: Empty")
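If the goal is to look values up by name rather than print them, the same pairing can be collected into a dictionary. The following is only a sketch: it swaps minidom for the standard library's xml.etree.ElementTree, and the names SAMPLE and read_label_params are made up for illustration. SAMPLE is a shortened copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question; the real file has more entries.
SAMPLE = (
    "<root><label_params>"
    "<label_param><name>BranchName</name><value></value></label_param>"
    "<label_param><name>SealNumber</name><value>0110000000420</value></label_param>"
    "<label_param><name>LpnTypeCode</name><value>LPN</value></label_param>"
    "</label_params></root>"
)


def read_label_params(xml_text):
    """Map each <name> to the text of its sibling <value>; empty values become ''."""
    root = ET.fromstring(xml_text)
    params = {}
    for param in root.iter("label_param"):
        name = param.findtext("name")
        # findtext gives '' for an empty element and None for a missing one
        value = param.findtext("value") or ""
        params[name] = value
    return params


params = read_label_params(SAMPLE)
print(params["SealNumber"])
```

For a file on disk, ET.parse("DataXML.xml").getroot() would take the place of ET.fromstring.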
I wrote a script based on several existing Stack Overflow questions, because none of them quite fit my problem.
The user supplies an XPath to find a tag in a given XML file, and the script updates that tag's text from user input.
Below is my script for Python 3 (the most difficult part for me was handling the namespaces):
import xml.etree.ElementTree as ET
import sys
# user inputs and variables
filename = 'actors.xml'
xpath = 'actor/name'
value = 'test name'
temp_namespace = 'temp_namespace'
# get all namespaces
all_namespaces = dict([node for _, node in ET.iterparse(filename, events=['start-ns'])])
# register namespace
for key in all_namespaces.keys():
    ET.register_namespace(key, all_namespaces[key])
# remove all namespace from elements first
# and temp save it to tag attribute
# The below logic is copied from other Stackoverflow answers
# From **Python 3.8**, we can add the parser to insert comments
it = ET.iterparse(filename, parser=ET.XMLParser(target=ET.TreeBuilder(insert_comments=True)))
for _, el in it:
    prefix, has_namespace, postfix = el.tag.partition('}')
    if has_namespace:
        el.tag = postfix
        el.set(temp_namespace, prefix + has_namespace)
# find and update
root = it.root
for el in root.findall(xpath):
    el.text = str(value)
# get xml comments before root level
doc_comments = []
with open(filename, 'r') as f:
    lines = f.readlines()
    for line in lines:
        if line.startswith('<?xml'):
            continue
        if line.startswith('<' + root.tag):
            break
        else:
            doc_comments.append(line)
def add_tag_namespace(el):
    for sub_el in el:
        if temp_namespace in sub_el.attrib.keys():
            sub_el.tag = sub_el.attrib[temp_namespace] + sub_el.tag
            del sub_el.attrib[temp_namespace]
        add_tag_namespace(sub_el)
    if temp_namespace in el.attrib.keys():
        el.tag = el.attrib[temp_namespace] + el.tag
        del el.attrib[temp_namespace]
# add all namespace back
# and delete the temp namespace attribute
add_tag_namespace(root)
# write back to xml file
tree = ET.ElementTree(root)
tree.write(filename, encoding='unicode', xml_declaration=True)
if len(doc_comments) == 0:
    sys.exit()
# write xml comments before root back
lines = []
# first read all lines
with open(filename, 'r') as f:
    lines = f.readlines()
# second, insert xml comments back into memory
for i, line in enumerate(lines):
    if line.startswith('<?xml'):
        insert_at = i + 1
        for comment in doc_comments:
            lines.insert(insert_at, comment)
            insert_at += 1
        break
# finally, write all contents to file
with open(filename, 'w') as f:
    for line in lines:
        f.write(line)
actors.xml:
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>
</actors>
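For comparison, here is a sketch of a different approach to the namespace part: rather than stripping namespaces and restoring them afterwards, pass a prefix map to findall. This is not the script above. It assumes the caller can write the XPath with prefixes, the prefix "people" is invented for the default namespace, and ACTORS is a shortened copy of actors.xml. It does not attempt to preserve comments that appear before the root element.

```python
import xml.etree.ElementTree as ET

# Shortened copy of actors.xml, inlined so the sketch is self-contained.
ACTORS = """<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
    </actor>
</actors>"""

# Prefix map used only for searching; "people" is a made-up prefix
# standing for the document's default namespace.
NS = {
    "people": "http://people.example.com",
    "fictional": "http://characters.example.com",
}

# Registering "" as the prefix makes the serializer write that
# namespace as the default one again instead of ns0:, ns1:, ...
ET.register_namespace("", NS["people"])
ET.register_namespace("fictional", NS["fictional"])

root = ET.fromstring(ACTORS)
for el in root.findall("people:actor/people:name", NS):
    el.text = "test name"

updated = ET.tostring(root, encoding="unicode")
print(updated)
```

The trade-off is that the XPath has to name the namespace explicitly (people:actor/people:name instead of actor/name), but no temporary attributes are needed and the element tags are never rewritten.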
I wrote this code to replace some strings in an XML file with other text. I used BeautifulSoup for this exercise and, as instructed in the documentation, I called soup.prettify at the end to save the changed XML. However, the prettified XML does not work for me: I get errors when trying to import it back into the CMS.
Is there another way to save the updated XML without changing its structure and without rewriting the whole script? See my code below for reference. Thanks for any advice!
import openpyxl
import sys
#searching for Part Numbers and descriptions in xml
from bs4 import BeautifulSoup
infile = open('name of my file.xml', "r", encoding="utf8")
contents = infile.read()
infile.close()
soup = BeautifulSoup(contents,'xml')
all_Products = soup.find_all('Product')
#gathering all Part Numbers from xml
for i in all_Products:
    PN = i.find('Name')
    PN_Descr = i.find_all(AttributeID="PartNumberDescription")
    PN_Details = i.find_all(AttributeID="PartNumberDetails")
    for y in PN_Descr:
        PN_Descr_text = y.find("TranslatableText")
        try:
            string = PN_Descr_text.string
            PN_Descr_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Description for: ", PN)
            continue
    for z in PN_Details:
        PN_Details_text = z.find("TranslatableText")
        try:
            string = PN_Details_text.string
            PN_Details_text.find(text=string).replace_with("New string")
        except AttributeError:
            print("Attribute error in: PN Details for: ", PN)
            continue
xml = soup.prettify("utf-8")
with open('name of my file.xml', "wb") as file:
    file.write(xml)
I am using the following code to parse an article from a French news site. When I collect all the paragraphs, some of the text keeps going missing. Why is that?
Here is my code. The lines marked with XX are the most relevant; the other parts just put the results into my own structure.
def getWordList(sent,wordList):
    listOfWords = list((sent).split())
    for i in listOfWords:
        i = i.replace("."," ")
        i = i.replace(","," ")
        i = i.replace('\"'," ")
        valids = re.sub(r"[^A-Za-z]+", '', i)
        if(len(i) > 3 and (i.lower() not in stopWords) and i.isnumeric() != True and valids):
            wordList[valids] = {}
            wordList[valids]["definition"] = ""
            wordList[valids]["status"] = ""
def parse(link):
    page = requests.get(link)
    tree = html.fromstring(page.content)
    XXword = tree.xpath('//*[@class="article__content old__article-content-single"]')
    articleContent = {}
    articleContent["words"] = {}
    articleContent["language"] = "French";
    wordList = articleContent["words"]
    contentList = []
    XXpTag = word[0].xpath('//*')
    pText = {}
    for x in range(len(pTag)):
        #print(pTag[x].get("class"))
        if(pTag[x].text != None):
            if(pTag[x].tail != None):
                print("tail")
                XXtext = pTag[x].text + pTag[x].tail
            else:
                print("no tail")
                XXtext = pTag[x].text
            XXif(pTag[x].get("class") == "article__paragraph "):
                print(pTag[x].get("class"))
                print(text)
                getWordList(text,wordList)
                pText[text] = {}
                pText[text]["status"] = ""
                pText[text]["type"] = "p"
            XXelif(pTag[x].get("class") == "article__sub-title"):
                print(pTag[x].get("class"))
                getWordList(text,wordList)
                pText[text] = {}
                pText[text]["status"] = ""
                pText[text]["type"] = "h2"
Here is an example article link: https://www.lemonde.fr/economie/article/2019/05/23/vivendi-chercherait-a-ceder-universal-music-group-au-chinois-tencent_5466130_3234.html
I am successfully getting all the highlighted text, but the rest is missing. I don't mean the text in the middle, which I am deliberately skipping. I just want the text in between, which is not being included.
Thank you for your help!!
You're trying to get the content of tags containing other tags. For example, there are <em> emphasized text tags in the <p> paragraph tags.
Use the text_content() method instead of text to get the full content of your paragraphs:
text = pTag[x].text_content() + pTag[x].tail
and
text = pTag[x].text_content()
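To see why .text comes up short, here is a small sketch of the difference. It deliberately avoids lxml so that it runs with the standard library alone: ElementTree's itertext() stands in for lxml's text_content(), and the paragraph markup is a made-up stand-in for the article's HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph with a nested <em>, standing in for the article markup.
p = ET.fromstring(
    '<p class="article__paragraph">Le groupe <em>Vivendi</em> cherche un acheteur.</p>'
)

# .text holds only the text before the first child element
only_leading_text = p.text

# itertext() walks into child elements and includes their tails as well
full_text = "".join(p.itertext())

print(repr(only_leading_text))
print(repr(full_text))
```

The text after the closing em tag is stored as the tail of the em element, not as part of the paragraph's own .text, which is why reading .text alone drops it.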
I am in the process of stripping sensitive data from a couple of million XML documents. How can I add a try/except to get around this error, which seems to occur because a couple of XMLs out of the bunch are malformed?
xml.parsers.expat.ExpatError: mismatched tag: line 1, column 28691
#!/usr/bin/python
import sys
from xml.dom import minidom
def getCleanString(word):
    str = ""
    dummy = 0
    for character in word:
        try:
            character = character.encode('utf-8')
            str = str + character
        except:
            dummy += 1
    return str
def parsedelete(content):
    dom = minidom.parseString(content)
    for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
        parentNode = element.parentNode
        parentNode.removeChild(element)
    return dom.toxml()
for line in sys.stdin:
    if line > 1:
        line = line.strip()
        line = line.split(',', 2)
        if len(line) > 2:
            partition = line[0]
            id = line[1]
            xml = line[2]
            xml = getCleanString(xml)
            xml = parsedelete(xml)
            strng = '%s\t%s\t%s' %(partition, id, xml)
            sys.stdout.write(strng + '\n')
Catching the exception is straightforward. Add from xml.parsers.expat import ExpatError to your import statements and wrap the problem code in a try/except handler. Importing the name directly matters here: your main loop rebinds xml to a string, so a plain import xml would be shadowed by the time the handler runs.
def parsedelete(content):
    try:
        dom = minidom.parseString(content)
    except ExpatError as e:
        # not sure how you want to handle the error... so just passing back as string
        return str(e)
    for element in dom.getElementsByTagName('RI_RI51_ChPtIncAcctNumber'):
        parentNode = element.parentNode
        parentNode.removeChild(element)
    return dom.toxml()
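Returning the error text means it ends up in the output column where the cleaned XML would normally go. If that is not wanted, one possible variation (a Python 3 sketch, not part of the answer above) is to return None for malformed input and report the problem on stderr, so the caller can skip the record. The function name strip_sensitive and the two sample strings are invented for illustration.

```python
import sys
from xml.dom import minidom
from xml.parsers.expat import ExpatError

TAG = "RI_RI51_ChPtIncAcctNumber"  # tag name taken from the question


def strip_sensitive(content, tag=TAG):
    """Return the XML with every <tag> element removed, or None if it cannot be parsed."""
    try:
        dom = minidom.parseString(content)
    except ExpatError as err:
        # Report the bad record on stderr so stdout stays clean
        print("skipping malformed record: %s" % err, file=sys.stderr)
        return None
    for element in dom.getElementsByTagName(tag):
        element.parentNode.removeChild(element)
    return dom.toxml()


# Made-up sample records: one well-formed, one with a mismatched tag.
good = "<r><RI_RI51_ChPtIncAcctNumber>123</RI_RI51_ChPtIncAcctNumber><keep>x</keep></r>"
bad = "<r><a></b></r>"

cleaned = strip_sensitive(good)
skipped = strip_sensitive(bad)
```

In the main loop, a record would then be written only when the result is not None.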
I'm using the xml.etree.ElementTree module to create an XML document with Python 3.1 from another structured document.
What ElementTree function can I use that returns the index of an existing sub element?
The getchildren method returns a list of sub-elements of an Element object. You could then use the built-in index method of a list.
>>> import xml.etree.ElementTree as ET
>>> root = ET.Element("html")
>>> head = ET.SubElement(root, "head")
>>> body = ET.SubElement(root, "body")
>>> root.getchildren().index(body)
1
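Note that getchildren() was deprecated for a long time and was removed in Python 3.9. On newer versions the same lookup can be sketched by turning the element into a list, since an Element behaves like a sequence of its direct children:

```python
import xml.etree.ElementTree as ET

root = ET.Element("html")
head = ET.SubElement(root, "head")
body = ET.SubElement(root, "body")

# list(root) yields the direct children in document order,
# which is what getchildren() used to return.
body_index = list(root).index(body)
print(body_index)  # 1
```

list.index raises ValueError if the element is not a direct child of root, so a lookup that might fail would need a try/except around it.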
>>> import xml.etree.ElementTree as ET
>>> root = ET.Element(r'C:\Users\Administrator\Desktop\ValidationToolKit_15.9\ValidationToolKit_15.9\NE3S_VTK\webservice\history\ofas.2017-1-3.10-55-21-608.xml')
>>> childnew = ET.SubElement(root, "354")
>>> root.getchildren().index(childnew)
0
>>> list(root).index(childnew)
0
def alarms_validation(self, path, alarm_no, alarm_text):
    with open(path) as f:
        tree = et.parse(f)
        root = tree.getroot()
        try:
            for x in xrange(10000):
                print x
                for y in xrange(6):
                    print y
                    if root[x][y].text == alarm_no:
                        print "found"
                        if root[x][y+1].text != alarm_text:
                            print "Alarm text is not proper"
                        else:
                            print "Alarm Text is proper"
        except IndexError:
            pass