Parse XML with Python with title and value on different lines

I have the following XML document that I would like to write to a CSV file.
<items>
<item>
<attribute type="set" identifier="naadloos">
<name locale="nl_NL">Naadloos</name>
<value locale="nl_NL" identifier="nee">Nee</value>
</attribute>
<attribute type="asset" identifier="short_description">
<value locale="nl_NL">Tom beugel bh</value>
</attribute>
<attribute type="text" identifier="name">
<name locale="nl_NL">Naam</name>
<value>Marie Jo L'Aventure Tom beugel bh</value>
</attribute>
<attribute type="int" identifier="is_backorder">
<name locale="nl_NL">Backorder</name>
<value>2</value>
</attribute>
</item>
</items>
How can I retrieve the data from this format? I need the following output:
naadloos, short_description, name, is_backorder
Nee, Tom beugel bh, Marie Jo L'Aventure Tom beugel bh, 2
So I need the identifier from the attribute line and the text from the value line.
Any ideas?
Much appreciated.

This is my attempt: it collects the identifiers with XPath and writes the matching values to a file with csv.DictWriter.
import lxml.etree as et
import csv
#headers={}
xml= """<items>
<item>
<attribute type="set" identifier="naadloos">
<name locale="nl_NL">Naadloos</name>
<value locale="nl_NL" identifier="nee">Nee</value>
</attribute>
<attribute type="asset" identifier="short_description">
<value locale="nl_NL">Tom beugel bh</value>
</attribute>
<attribute type="text" identifier="name">
<name locale="nl_NL">Naam</name>
<value>Marie Jo L'Aventure Tom beugel bh</value>
</attribute>
<attribute type="int" identifier="is_backorder">
<name locale="nl_NL">Backorder</name>
<value>2</value>
</attribute>
</item>
</items>
"""
tree = et.fromstring(xml)
# Collect every identifier attribute, in document order, to use as the CSV header
header = []
for i in tree.xpath("//attribute/@identifier"):
    header.append(i)

def dicter(x):
    # Return [identifier, text of the matching <value>]
    exp = "//attribute[@identifier='%s']/value/text()" % x
    tmp = ''.join(tree.xpath(exp))
    d = [x, tmp]
    return d

data = dict(dicter(i) for i in header)
# Now write data into the file (Python 3: text mode with newline='')
with open(r"C:\Users\User_Name\Desktop\output.txt", 'w', newline='') as wrt:
    writer = csv.DictWriter(wrt, header)
    writer.writeheader()
    writer.writerow(data)
Written file content:
naadloos,short_description,name,is_backorder
Nee,Tom beugel bh,Marie Jo L'Aventure Tom beugel bh,2
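For comparison, here is a minimal sketch of the same idea using only the standard library (xml.etree.ElementTree instead of lxml). The helper name items_to_rows and the shortened sample are made up for illustration, and the sketch assumes every item carries the same identifiers. It writes to an in-memory buffer so it runs anywhere.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened sample based on the XML in the question
xml_text = """<items>
<item>
<attribute type="set" identifier="naadloos">
<name locale="nl_NL">Naadloos</name>
<value locale="nl_NL" identifier="nee">Nee</value>
</attribute>
<attribute type="asset" identifier="short_description">
<value locale="nl_NL">Tom beugel bh</value>
</attribute>
<attribute type="int" identifier="is_backorder">
<name locale="nl_NL">Backorder</name>
<value>2</value>
</attribute>
</item>
</items>"""


def items_to_rows(xml_string):
    """Hypothetical helper: one {identifier: value text} dict per <item>."""
    root = ET.fromstring(xml_string)
    rows = []
    for item in root.iter("item"):
        row = {}
        for attribute in item.findall("attribute"):
            # findtext falls back to the default when there is no <value> child
            row[attribute.get("identifier")] = attribute.findtext("value", default="")
        rows.append(row)
    return rows


rows = items_to_rows(xml_text)
header = list(rows[0])  # assumption: all items share the same identifiers

buffer = io.StringIO()  # swap for open("output.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Looping over item elements rather than the whole tree gives one CSV row per item if the document contains more than one.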

Related

Delete everything in file after last appearance string

I want to write a program that looks through files, finds every incomplete file (one without </Module> at the end), prints the last abnumber found in the file, and deletes every line from that point on (including the line with that abnumber).
So my file looks like this:
<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
<object id="1238" name="name2" abnumber="4">
<item name="item8" value="something12:
<item name="item9" value="233" />
and at the end it should look like this:
<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
with 4 printed.
I started with something like this, but I feel like I am doing everything wrong:
import os
Mainfile = 'path'
for filename in os.listdir(Mainfile):
    lines = filename.readlines()
    if not "</Module>" in lines:
        with open(filename, 'r+', encoding="utf-8") as file:
            line_list = list(file)
            line_list.reverse()
            for line in line_list:
                if line.find('absno') != -1:
                    print(line)

You can use re to get your result:
<object([\s\S]*?)<\/object> matches each complete <object> ... </object> block
abnumber=\"([0-9.]+) captures the abnumber of the incomplete block
<Module.*|<object(?:[\s\S]*?)<\/object> keeps the Module line and every complete block
import re
data = """<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object>
<object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object>
<object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
<object id="1238" name="name2" abnumber="4">
<item name="item8" value="something12:
<item name="item9" value="233" />"""
# Remove every complete <object> block; what remains holds the incomplete one
invalid_XML_Tag = re.sub(r"<object([\s\S]*?)<\/object>", '', data)
abnnumber_value = re.findall(r'abnumber="([0-9.]+)', invalid_XML_Tag)
print("abnumber of invalid tag => {0}".format(abnnumber_value))
# Keep the Module line and every complete <object> block
correct_xml_format = re.findall(r"<Module.*|<object(?:[\s\S]*?)<\/object>", data)
print("".join(correct_xml_format))
Output:
abnumber of invalid tag => ['4']
<Module bs="Mainfile_1"><object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
<item name="item00" value="100" />
</object><object id="1001" name="namey" abnumber="2">
<item name="item1" value="100" />
<item name="item00" value="100" />
</object><object id="1234" name="name1" abnumber="3">
<item name="item1" value="something11:
something11" />
<item name="item2" value="233" />
<item name="item3" value="233" />
<item name="item4" value="something12:
12something" />
</object>
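The regex approach above can also be wrapped in a function that keeps the original line breaks instead of joining the matches. This is only a sketch: the name truncate_incomplete is made up, and it assumes the incomplete object is always the last thing in the text.

```python
import re

COMPLETE_OBJECT = re.compile(r"<object\b[\s\S]*?</object>")
ABNUMBER = re.compile(r'abnumber="([0-9.]+)"')


def truncate_incomplete(text):
    """Hypothetical helper: return (abnumbers of the cut part, cleaned text).

    Everything from the first <object that follows the last complete
    <object>...</object> block is dropped. Text that already ends with
    </Module> is returned unchanged.
    """
    if text.rstrip().endswith("</Module>"):
        return [], text
    last_end = 0
    for match in COMPLETE_OBJECT.finditer(text):
        last_end = match.end()
    start = text.find("<object", last_end)  # start of the incomplete object
    if start == -1:
        return [], text  # nothing incomplete to remove
    return ABNUMBER.findall(text[start:]), text[:start].rstrip() + "\n"


sample = '''<Module bs="Mainfile_1">
<object id="1000" name="namex" abnumber="1">
<item name="item0" value="100" />
</object>
<object id="1238" name="name2" abnumber="4">
<item name="item8" value="something12:
<item name="item9" value="233" />'''

numbers, cleaned = truncate_incomplete(sample)
print(numbers)
print(cleaned)
```

To apply it to real files, one would read each file's text, call the function, and write the cleaned text back; that part is left out here because the directory layout is not known.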

How can I get attribute number

I use BS4 to parse an .xml file. I want to get the value of the res attribute, but I get None.
How can I do it?
Source XML
<digitizer id="1" integrated="true" csrmusttouch="falsehardprox="true"
physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080" />
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />
Python Code
a = data_xml.find('digitizer',id="1")
b = a.find('properties')
print(b.get('res'))
Result
None

I have treated your data as HTML:
html="""<digitizer id="1" integrated="true" csrmusttouch="falsehardprox="true"
physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080" />`
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"html.parser")
Code:
You can find every property tag and then read the res value of each one:
a = soup.find('digitizer', attrs={"id": "1"})
properties = a.find_all("property")
res_lst = [i['res'] for i in properties]
print(res_lst)
Output:
['621.7457275', '983.9639893', '0', '1', '1.861861944']

Your XML seems poorly formatted. After reformatting it:
<digitizer id="1" integrated="true" csrmusttouch="" falsehardprox="true" physidcsrs="false" pnpid="49154" kind="MULTI_TOUCH" maxcsrs="10">
<monitor left="0" top="0" right="1920" bottom="1080"/>
<properties>
<property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" hidusage="0x00010030" guid="{598A6A8F-52C0-4BA0-93AF-AF357411A561}" />
<property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" hidusage="0x00010031" guid="{B53F9F75-04E0-4498-A7EE-C30DBB5A9011}" />
<property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" hidusage="0x000d0042, 0x000d003c, 0x000d0044" guid="{6E0E07BF-AFE7-4CF7-87D1-AF6446208418}" />
<property name="time" logmin="0" logmax="2147483647" res="1" unit="DEFAULT" guid="{436510C5-FED3-45D1-8B76-71D3EA7A829D}" />
<property name="contactid" logmin="0" logmax="31" res="1.861861944" unit="cm" hidusage="0x000d0051" guid="{02585B91-049B-4750-9615-DF8948AB3C9C}" />
You can easily parse it like this:
from bs4 import BeautifulSoup

with open('data.xml') as raw_results:
    results = BeautifulSoup(raw_results, 'lxml')

for element in results.find_all("properties"):
    for property_tag in element.find_all("property"):
        print(property_tag['res'])
Output:
621.7457275
983.9639893
0
1
1.861861944
You can find more info about parsing attribute values from xml in the tutorial where the code is from.
Edit: Note that I slightly modified the code to fit your question.
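The reason the original code prints None is that res is an attribute of each property tag, not of the enclosing properties tag. A small sketch with the standard library's xml.etree.ElementTree (swapped in here instead of BS4 so it has no extra dependencies) shows the difference. The XML is a trimmed, well-formed version of the question's data, and res_by_name is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the data from the question
xml_text = """<digitizer id="1" kind="MULTI_TOUCH">
  <monitor left="0" top="0" right="1920" bottom="1080" />
  <properties>
    <property name="x" logmin="0" logmax="16383" res="621.7457275" unit="cm" />
    <property name="y" logmin="0" logmax="16383" res="983.9639893" unit="cm" />
    <property name="status" logmin="0" logmax="15" res="0" unit="DEFAULT" />
  </properties>
</digitizer>"""

root = ET.fromstring(xml_text)
properties = root.find("properties")

# <properties> itself carries no attributes, so this lookup returns None
print(properties.get("res"))

# The values live on the <property> children
res_by_name = {p.get("name"): float(p.get("res")) for p in properties.findall("property")}
print(res_by_name)
```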

Create new xml attributes from other attribute

I have the following XML
<icim source="source">
<object class="class_name" name="class_name">
<attribute name="Type">
<string>Type_Name</string>
</attribute>
<attribute name="DisplayName">
<string>DisplayName</string>
</attribute>
<attribute name="Vendor">
<string>Vendor_Name</string>
</attribute>
<attribute name="Model">
<string>Model_Name</string>
</attribute>
<attribute name="Description">
<string>Description_part1, Description_part2, Description_part3, Description_part4, Description_part5</string>
</attribute>
</object>
<object class="class_name" name="class_name">
<attribute name="Type">
<string>Type_Name</string>
</attribute>
<attribute name="DisplayName">
<DisplayName</string>
</attribute>
<attribute name="Vendor">
<string>Vendor_Name</string>
</attribute>
<attribute name="Model">
<string>Model_Name</string>
</attribute>
<attribute name="Description">
<string>Description_part1, Description_part2, Description_part3, Description_part4, Description_part5</string>
</attribute>
</object>
.
.
.
</icim>
and I want to transform it using Python's Element Tree to this:
<icim source="source">
<object class="class_name" name="class_name">
<attribute name="Type">
<string>Type_Name</string>
</attribute>
<attribute name="DisplayName">
<string>DisplayName</string>
</attribute>
<attribute name="Vendor">
<string>Vendor_Name</string>
</attribute>
<attribute name="Model">
<string>Model_Name</string>
</attribute>
<attribute name="String1">
<string>Description_part1</string>
</attribute>
</attribute>
<attribute name="String2">
<string>Description_part2</string>
</attribute>
</attribute>
<attribute name="String3">
<string>Description_part3</string>
</attribute>
<attribute name="Description">
<string>Description_part1, Description_part2, Description_part3, Description_part4, Description_part5</string>
</attribute>
</object>
<object class="class_name" name="class_name">
<attribute name="Type">
<string>Type_Name</string>
</attribute>
<attribute name="DisplayName">
<DisplayName</string>
</attribute>
<attribute name="Vendor">
<string>Vendor_Name</string>
</attribute>
<attribute name="Model">
<string>Model_Name</string>
</attribute>
</attribute>
<attribute name="String1">
<string>Description_part1</string>
</attribute>
</attribute>
<attribute name="String2">
<string>Description_part2</string>
</attribute>
</attribute>
<attribute name="String3">
<string>Description_part3</string>
</attribute>
<attribute name="Description">
<string>Description_part1, Description_part2, Description_part3, Description_part4, Description_part5</string>
</attribute>
</object>
.
.
.
</icim>
That is, I want to extract the first three string parts from each Description element (the Description always contains commas, so the parts can be split on those) and create a new attribute element for each of those three parts. Thoughts?

Your XML and expected XML aren't well formed (<DisplayName</string> should be <string>DisplayName</string>), but assuming that's fixed, and if I understand you correctly, the following should get you at least most of the way there:
from lxml import etree

display = """[your xml above, corrected]"""
doc = etree.XML(display)
objs = doc.xpath("//object")
for obj in objs:
    news = obj.xpath('.//attribute[@name="Description"]/string/text()')[0].split(',')[:3]
    counter = 3
    for new in reversed(news):  # this list needs to be reversed to get the new elements into the xml in the correct order
        ins = etree.fromstring(f'<attribute name="String{counter}">\n <string>{new.strip()}</string>\n</attribute>\n')
        obj.insert(4, ins)
        counter -= 1  # same reason for counting in reverse
print(etree.tostring(doc).decode())
The output should match your expected output.
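Since the question asked about ElementTree and the answer above uses lxml, here is a rough standard-library equivalent. It is a sketch under assumptions: the function name add_description_parts is made up, the sample is shortened, and the new elements are inserted directly before the Description attribute rather than at a fixed index.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed sample based on the question
xml_text = """<icim source="source">
<object class="class_name" name="class_name">
<attribute name="Model">
<string>Model_Name</string>
</attribute>
<attribute name="Description">
<string>Description_part1, Description_part2, Description_part3, Description_part4</string>
</attribute>
</object>
</icim>"""


def add_description_parts(root, count=3):
    """Hypothetical helper: add String1..StringN attributes before Description."""
    for obj in list(root.iter("object")):  # copy the list, since the tree is modified
        description = obj.find("attribute[@name='Description']")
        if description is None:
            continue
        text = description.findtext("string", default="")
        parts = [p.strip() for p in text.split(",") if p.strip()][:count]
        position = list(obj).index(description)
        for offset, part in enumerate(parts):
            new_attr = ET.Element("attribute", {"name": "String{}".format(offset + 1)})
            ET.SubElement(new_attr, "string").text = part
            # Each insert pushes Description one slot further down
            obj.insert(position + offset, new_attr)


root = ET.fromstring(xml_text)
add_description_parts(root)
print(ET.tostring(root, encoding="unicode"))
```

Looking up the position of Description avoids relying on every object having exactly four attributes before it.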

Python - Read an XML using minidom

I'm new to Python and I have a question.
I'm trying to parse this XML (it contains a lot of information; this is the first record I need to read):
<![CDATA[<?xml version="1.0" encoding="UTF-8"?><UDSObjectList>
<UDSObject>
<Handle>cr:908715</Handle>
<Attributes>
<Attribute DataType="2002">
<AttrName>ref_num</AttrName>
<AttrValue>497131</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>support_lev.sym</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="2004">
<AttrName>open_date</AttrName>
<AttrValue>1516290907</AttrValue>
</Attribute>
<Attribute DataType="58814636">
<AttrName>agt.id</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="2005">
<AttrName>priority</AttrName>
<AttrValue>3</AttrValue>
</Attribute>
<Attribute DataType="2009">
<AttrName>tenant.id</AttrName>
<AttrValue>F3CA8B5A2A456742B21EF8F3B5538623</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>tenant.name</AttrName>
<AttrValue>Ripley</AttrValue>
</Attribute>
<Attribute DataType="2005">
<AttrName>log_agent</AttrName>
<AttrValue>088966043F4D2944AA90067C52DA454F</AttrValue>
</Attribute>
<Attribute DataType="58826268">
<AttrName>request_by.first_name</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="58826268">
<AttrName>request_by.first_name</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="2002">
<AttrName>customer.first_name</AttrName>
<AttrValue>Juan Guillermo</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>customer.last_name</AttrName>
<AttrValue>Mendoza Montero</AttrValue>
</Attribute>
<Attribute DataType="2009">
<AttrName>customer.id</AttrName>
<AttrValue>8C020EBAD32035419D7654CDE510D312</AttrValue>
</Attribute>
<Attribute DataType="2001">
<AttrName>category.id</AttrName>
<AttrValue>1121021012</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>category.sym</AttrName>
<AttrValue>Ripley.Sistemas Financieros.Terminal Financiero.Mensaje de
Error</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>status.sym</AttrName>
<AttrValue>Suspended</AttrValue>
</Attribute>
<Attribute DataType="2009">
<AttrName>group.id</AttrName>
<AttrValue>099621F7BD77C545B65FB65BFE466550</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>group.last_name</AttrName>
<AttrValue>EUS_Zona V Region</AttrValue>
</Attribute>
<Attribute DataType="2001">
<AttrName>zreporting_met.id</AttrName>
<AttrValue>7300</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>zreporting_met.sym</AttrName>
<AttrValue>E-Mail</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>assignee.combo_name</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="2004">
<AttrName>open_date</AttrName>
<AttrValue>1516290907</AttrValue>
</Attribute>
<Attribute DataType="2004">
<AttrName>close_date</AttrName>
<AttrValue/>
</Attribute>
<Attribute DataType="2002">
<AttrName>description</AttrName>
<AttrValue>Asunto :Valaparaiso / Terminal Financiero Error
Nombre Completo :JUAN MENDOZA MONTERO
Ubicación :CCSS VALPARAISO Plaza victoria 1646, VALPARAISO
País :Chile
Telefono :ANEXO 2541
Correo :jmendozam#ripley.cl
Descripción :Error Terminal Financiero
Descartes :N/A</AttrValue>
</Attribute>
<Attribute DataType="2002">
<AttrName>summary</AttrName>
<AttrValue>Santiago / Modificación </AttrValue>
</Attribute>
</Attributes>
</UDSObject>
but when I read the file with this method:
from zeep import Client
import xml.dom.minidom
from xml.dom.minidom import Node
def select():
    resultado = []
    sid = _client.service.login("User","password")
    objectType = 'cr'
    whereClause = "group.last_name LIKE 'EUS_ZONA%' AND open_date > 1517454000 AND open_date < 1519786800"
    maxRows = -1
    attributes = ["ref_num"
        ,"agt.id"
        ,"priority"
        ,"pcat.id"
        ,"tenant.id"
        ,"tenant.name"
        ,"log_agent"
        ,"request_by.first_name"
        ,"request_by.last_name"
        ,"customer.first_name"
        ,"customer.last_name"
        ,"customer.id"
        ,"category.id"
        ,"category.sym"
        ,"status.sym"
        ,"group.id"
        ,"group.last_name"
        ,"zreporting_met.id"
        ,"zreporting_met.sym"
        ,"assignee.combo_name"
        ,"open_date"
        ,"close_date"
        ,"description"
        ,"summary"]
    minim = _client.service.doSelect(sid=sid, objectType=objectType,
        whereClause=whereClause, maxRows=maxRows, attributes=attributes)
    dom = xml.dom.minidom.parseString(minim)
    nodeList = dom.getElementsByTagName('AttrValue')
    for j in range(len(nodeList)):
        resultado.append(dom.getElementsByTagName('AttrValue')[j].firstChild.wholeText)
        print(resultado[j])
    logout = _client.service.logout(sid)
This only prints the first AttrValue (the ref_num value). What I need is to add every field of the XML to the resultado list and print each one. Can someone help me with that?

Please read and follow How to create a Minimal, Complete, and Verifiable example.
You should remove all the server stuff and reduce the size of your sample data.
This snippet follows your code and gets all attribute elements and then iterates those:
import xml.dom.minidom
from xml.dom.minidom import Node
minim = """<?xml version="1.0" encoding="UTF-8"?>
<udsobjectlist>
<udsobject>
<handle>cr:908715</handle>
<attributes>
<attribute datatype="2002">
<attrname>ref_num</attrname>
<attrvalue>497131</attrvalue>
</attribute>
<attribute datatype="2002">
<attrname>support_lev.sym</attrname>
<attrvalue/>
</attribute>
<attribute datatype="2004">
<attrname>open_date</attrname>
<attrvalue>1516290907</attrvalue>
</attribute>
</attributes>
</udsobject>
</udsobjectlist>
"""
dom = xml.dom.minidom.parseString(minim)
nodeList = dom.getElementsByTagName('attribute')
resultado = []
attributes = ["attrname", "attrvalue"]
for node in nodeList:
    a = []
    for attribute in attributes:
        try:
            a.append(node.getElementsByTagName(attribute)[0].firstChild.wholeText)
        except AttributeError:
            # empty element: firstChild is None
            a.append("")
    resultado.append(a)
print(resultado)
prints
[['ref_num', '497131'], ['support_lev.sym', ''], ['open_date', '1516290907']]
Even closer to your code:
nodeList = dom.getElementsByTagName('attrvalue')
resultado = []
for node in nodeList:
    try:
        v = node.firstChild.wholeText
        resultado.append(v)
        print(v)
    except AttributeError:
        # skip empty elements, where firstChild is None
        pass
print(resultado)
prints
497131
1516290907
['497131', '1516290907']
As suggested in the comments, with ET (although you probably should not access elements by index, but this might get you started):
import xml.etree.ElementTree as ET
root = ET.fromstring(minim)
for child in root[0][1]:
    try:
        print(child[0].text)
        print(child[1].text)
    except:
        pass
prints
ref_num
497131
support_lev.sym
None
open_date
1516290907
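To avoid indexing into the tree, the ElementTree version can look children up by tag name instead. The sketch below uses a made-up helper name, attributes_to_dict, and the lowercase tag names from the snippet above. XML tag names are case-sensitive, so against the original response they would be Attribute, AttrName and AttrValue.

```python
import xml.etree.ElementTree as ET

# Reduced sample with the same lowercase tags as the snippet above
minim = """<udsobjectlist>
<udsobject>
<handle>cr:908715</handle>
<attributes>
<attribute datatype="2002">
<attrname>ref_num</attrname>
<attrvalue>497131</attrvalue>
</attribute>
<attribute datatype="2002">
<attrname>support_lev.sym</attrname>
<attrvalue/>
</attribute>
<attribute datatype="2004">
<attrname>open_date</attrname>
<attrvalue>1516290907</attrvalue>
</attribute>
</attributes>
</udsobject>
</udsobjectlist>"""


def attributes_to_dict(xml_string):
    """Hypothetical helper: map each attrname to its attrvalue text."""
    root = ET.fromstring(xml_string)
    result = {}
    for attribute in root.iter("attribute"):
        name = attribute.findtext("attrname")
        if name is None:
            continue  # skip attributes without a name element
        # findtext gives "" for an empty <attrvalue/> and the default if it is missing
        result[name] = attribute.findtext("attrvalue", default="")
    return result


resultado = attributes_to_dict(minim)
print(resultado)
```

Note that a dict keeps only one value per name, so if a name repeats (open_date appears twice in the original data) the later one wins.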

How to remove only the parent element and not its child elements in Python?

A similar question to the one in JavaScript
I have an XML document and want to comment out just the parent tag without its children,
as in the example below:
<object id="12">
<process name="Developer">
<appdef>
<attributes>
<attribute name="X">
<ProcessValue datatype="number" value="15" />
</attribute>
<attribute name="Y">
<ProcessValue datatype="number" value="59" />
</attribute>
</attributes>
</appdef>
</process>
</object>
and comment out just the <object> tags:
<!--<object id="12">-->
<process name="Developer">
<appdef>
<attributes>
<attribute name="X">
<ProcessValue datatype="number" value="15" />
</attribute>
<attribute name="Y">
<ProcessValue datatype="number" value="59" />
</attribute>
</attributes>
</appdef>
</process>
<!--</object>-->
I have code that comments out the tag, but it also comments out all of its children.
Thank you very much; I appreciate any help.
To avoid confusion, I am attaching the whole code:
from xml.dom import minidom
xml = """\
<bpr:release xmlns:bpr="http://www.blueprism.co.uk/product/release">
<object id="0e694daf-836e-44a9-816a-9b8127abb7b2" name="Developer 2
ex" xmlns="http://www.blueprism.co.uk/product/process">
<process name="Developer 2 ex" version="1.0" bpversion="5.0.33.0"
narrative="BO for automation the HTML page
" type="object"
runmode="Exclusive">
<appdef>
<attributes>
<attribute name="X">
<ProcessValue datatype="number" value="15" />
</attribute>
<attribute name="Y">
<ProcessValue datatype="number" value="59" />
</attribute>
</attributes>
</appdef>
</process>
</object>
</bpr:release>
"""
def comment_node(node):
    comment = node.ownerDocument.createComment(node.toxml())
    print(comment)
    node.parentNode.replaceChild(comment, node)
    return comment

doc = minidom.parseString(xml).documentElement
comment_node(doc.getElementsByTagName('object')[-1])
xml = doc.toxml()
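One possible approach, sketched here with minidom, is to replace the element with two comment nodes (one holding the opening tag, one the closing tag) and move its children up to the parent in between. The function name comment_out_wrapper and the reduced sample are assumptions for illustration, and the opening tag is rebuilt by hand from the element's attributes without escaping. Also note that once the wrapper is commented out, a default namespace it declared (the xmlns attribute in the full example) no longer applies to the children in the serialized output.

```python
from xml.dom import minidom

# Reduced sample; the element to comment out must have a parent element
xml_text = '<root><object id="12"><process name="Developer"/></object></root>'


def comment_out_wrapper(node):
    """Hypothetical helper: comment out node's tags but keep its children."""
    doc = node.ownerDocument
    parent = node.parentNode
    # Rebuild the opening tag by hand (no escaping; values must not contain "--")
    attrs = "".join(' {}="{}"'.format(k, v) for k, v in node.attributes.items())
    open_comment = doc.createComment("<{}{}>".format(node.tagName, attrs))
    close_comment = doc.createComment("</{}>".format(node.tagName))

    parent.insertBefore(open_comment, node)
    for child in list(node.childNodes):
        # insertBefore detaches the child from its old parent first
        parent.insertBefore(child, node)
    parent.insertBefore(close_comment, node)
    parent.removeChild(node)


doc = minidom.parseString(xml_text)
comment_out_wrapper(doc.getElementsByTagName("object")[0])
result = doc.documentElement.toxml()
print(result)
```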
