I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
#Create XML Root
articles = etree.Element('root')
#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for t in t_list:
    for i in range(len(t_list)):
        #Create SubElements of XML Root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        #Add List Data to SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My current output comes out jumbled, all on a single line, as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print option in lxml is adding the indentation and \n line breaks I want, but they aren't being interpreted on output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
  <Article>
    <Title>title1</Title>
    <Summary>summary1</Summary>
    <Source>source1</Source>
    <Content>content1</Content>
  </Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the repr (Python's internal representation) of the bytestring returned by etree.tostring(); in Python 3, print(some_bytestring) prints that representation rather than the decoded text.
Fortunately the solution is quite simple: pass encoding="unicode" to etree.tostring() so that it returns a str, i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
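To make the bytes/str difference concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; that substitution is an assumption for illustration only, and the element names are made up. lxml's etree.tostring() takes the same encoding argument, as used above.

```python
import xml.etree.ElementTree as ET

# Illustrative tree; the names are not from the asker's real data.
root = ET.Element('root')
article = ET.SubElement(root, 'Article')
ET.SubElement(article, 'Title').text = 'title1'

as_bytes = ET.tostring(root)                     # bytes by default
as_text = ET.tostring(root, encoding='unicode')  # str when encoding is "unicode"

print(type(as_bytes).__name__)
print(type(as_text).__name__)
print(as_text)

# Decoding the bytes yourself gives the same text as asking for a str directly.
print(as_bytes.decode('ascii') == as_text)
```

Printing the str shows real line breaks where the bytes repr shows literal \n sequences.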
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) in order to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to "unicode" (a byte encoding such as "utf-8" would still give you bytes, which you would then have to decode).
I'll also mention that you don't need to set "i" before your for-loop (Python will create the "i" it needs for the for-loop), though personally I would zip the lists and iterate over their items directly (though that's not going to have any real impact on the code in this situation).
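A rough sketch of the zip approach, shortened to two items per list. It is written against the standard library's ElementTree so it runs without lxml (an assumption for illustration); with lxml the Element and SubElement calls look the same.

```python
import xml.etree.ElementTree as ET

t_list = ['title1', 'title2']
sum_list = ['summary1', 'summary2']
s_list = ['source1', 'source2']
c_list = ['content1', 'content2']

articles = ET.Element('root')

# zip() yields one (title, summary, source, content) tuple per article,
# so no index variable is needed.
for title, summary, source, content in zip(t_list, sum_list, s_list, c_list):
    article = ET.SubElement(articles, 'Article')
    ET.SubElement(article, 'Title').text = title
    ET.SubElement(article, 'Summary').text = summary
    ET.SubElement(article, 'Source').text = source
    ET.SubElement(article, 'Content').text = content

print(ET.tostring(articles, encoding='unicode'))
```

Note that zip() stops at the shortest list, so lists of unequal length would silently drop the extra items.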
Can anyone please explain how to modify an XML element in Python using ElementTree?
I want to keep the rego AD-4214 and change make 'Tata' into 'Nissan' and model 'Sumo' into 'Skyline'.
If rewriting the entire file is acceptable [1], the easiest way would be to turn the xml file into a dictionary (see for example here: How to convert an XML string to a dictionary?), do your modifications on that dictionary, and convert this dict back to xml (like for example here: https://pypi.org/project/dicttoxml/).
[1] Consider lost formatting: whitespace, number formats etc. may not be preserved by this.
This should work:
import xml.etree.ElementTree as ET
tree = ET.parse('your_xml_source.xml')
root = tree.getroot()
root[1][1].text = "Nissan"
root[1][2].text = "Skyline"
getroot() gives you the root element (<motorvehicle>), [1] selects its second child, the <vehicle> with rego AD-4214. The secondary indexing, [1] and [2], gives you AD-4214's <make> and <model> respectively. Then using the text attribute, you can change their text content.
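Positional indexing breaks as soon as the order of vehicles changes, so a lookup by rego may be more robust. The sketch below assumes a layout where each <vehicle> has <rego>, <make> and <model> child elements; the question's actual file is not shown, so the sample XML here is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's layout may differ.
SAMPLE = """<motorvehicle>
  <vehicle><rego>AB-1234</rego><make>Ford</make><model>Falcon</model></vehicle>
  <vehicle><rego>AD-4214</rego><make>Tata</make><model>Sumo</model></vehicle>
</motorvehicle>"""

root = ET.fromstring(SAMPLE)

for vehicle in root.findall('vehicle'):
    # Match on the rego value instead of relying on position.
    if vehicle.findtext('rego') == 'AD-4214':
        vehicle.find('make').text = 'Nissan'
        vehicle.find('model').text = 'Skyline'

print(ET.tostring(root, encoding='unicode'))
```

Either way, the changes only live in memory until the tree is saved, for example with tree.write('your_xml_source.xml') when the document was loaded through ET.parse().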
How could I efficiently pull data from the nested xml?
By efficiently, I mean for example using a for loop.
Would I need to make use of a new data structure?
Parsing function:
import xml.etree.ElementTree as ET
it = ET.iterparse('OTA_AirSeatMapRS.xml')
# This for loop removes the namespaces
for _, el in it:
    _, _, el.tag = el.tag.rpartition('}')
root = it.root
# I am not able to select data with this loop
for x in element.find(Service):
    print(x)
This is part of the XML file:
<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope
    xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Body>
    <ns:OTA_AirSeatMapRS Version="1"
        xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
      <ns:Success/>
      <ns:SeatMapResponses>
        <ns:SeatMapResponse>
          <ns:FlightSegmentInfo DepartureDateTime="2020-11-22T15:30:00" FlightNumber="1179">
            <ns:DepartureAirport LocationCode="LAS"/>
            <ns:ArrivalAirport LocationCode="IAH"/>
            <ns:Equipment AirEquipType="739"/>
          </ns:FlightSegmentInfo>
          <ns:SeatMapDetails>
            <ns:CabinClass Layout="AB EF" UpperDeckInd="false">
              <ns:RowInfo CabinType="First" OperableInd="true" RowNumber="1">
                <ns:SeatInfo BlockedInd="false" BulkheadInd="false" ColumnNumber="1" ExitRowInd="false" GalleyInd="false" GridNumber="1" PlaneSection="Left">
                  <ns:Summary AvailableInd="false" InoperativeInd="false" OccupiedInd="false" SeatNumber="1A"/>
                  <ns:Features>Window</ns:Features>
                </ns:SeatInfo>
My eventual goal is to store the parsed data as JSON.
Take a look at the following instruction in your code:
for x in element.find(Service):
The first flaw in your code sample is that Service is a variable, not a string literal. You probably initialized it to some string but left that instruction out of your code sample. (Likewise, element is never defined; the parsed tree is in root.)
The second flaw is that find returns only the first element matching the given path, so it is not something to loop over. You should also check that find did not return None, but that is a separate detail.
The third problem is that print(x) on an Element shows only its default representation (tag name and object address), not its text or attributes.
So to have a more general example, run:
Service = 'Summary'
x = root.find(f'.//{Service}')
print(f'{x.tag}, {x.text}, {x.attrib}')
The first instruction sets the tag name.
The second instruction invokes find, but note that I added './/'
to the XPath, to look at any depth of the source XML tree.
And the last instruction prints not only text of the element found,
but also the tag name and attributes.
The result I got (for your input XML) is:
Summary, None, {'AvailableInd': 'false', 'InoperativeInd': 'false', 'OccupiedInd': 'false', 'SeatNumber': '1A'}
(text is None because this <Summary> element has no text content; all of its data is in attributes.)
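To pull every seat rather than just the first match, findall() or iter() can be used in a for loop and the results collected into plain dicts, which json can then serialize. The following is only a sketch: the XML is a trimmed, closed-off stand-in for the response in the question, and the output key names (row, cabin, seat, available, features) are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed stand-in for the SOAP response in the question (namespaces kept).
SAMPLE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns:OTA_AirSeatMapRS xmlns:ns="http://www.opentravel.org/OTA/2003/05/common/">
      <ns:RowInfo CabinType="First" RowNumber="1">
        <ns:SeatInfo ColumnNumber="1" PlaneSection="Left">
          <ns:Summary AvailableInd="false" SeatNumber="1A"/>
          <ns:Features>Window</ns:Features>
        </ns:SeatInfo>
      </ns:RowInfo>
    </ns:OTA_AirSeatMapRS>
  </soapenv:Body>
</soapenv:Envelope>"""

root = ET.fromstring(SAMPLE)

# Strip the "{namespace}" prefix from every tag, as the question's loop does.
for el in root.iter():
    el.tag = el.tag.rpartition('}')[2]

seats = []
for row in root.iter('RowInfo'):            # every row, at any depth
    for seat in row.findall('SeatInfo'):    # direct children of this row
        summary = seat.find('Summary')
        seats.append({
            'row': row.get('RowNumber'),
            'cabin': row.get('CabinType'),
            'seat': summary.get('SeatNumber'),
            'available': summary.get('AvailableInd') == 'true',
            'features': [f.text for f in seat.findall('Features')],
        })

print(json.dumps(seats, indent=2))
```

A list of dicts is usually enough as the intermediate structure; no special data structure is needed before handing it to json.dumps().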
I have an XML file with several thousand records in it in the form of:
<custs>
<record cust_ID="B123456#Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="Q975697#Z2000" l_name="Freely" f_name="I" m_name="P" city="Yellow River" zip="03010" current="1" />
<record cust_ID="M7803#J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>
# (I know it's not normalized. This is just sample data)
How can I convert this into a CSV or tab-delimited file? I know I can hard-code it in Python using re.compile() statements, but there has to be something easier and more portable across different XML file layouts.
I've found a couple threads here about attribs, (Beautifulsoup unable to extract data using attrs=class, Extracting an attribute value with beautifulsoup) and they have gotten me almost there with:
# Python 3.30
#
from bs4 import BeautifulSoup
import fileinput
Input = open("C:/Python/XML Tut/MinGrp.xml", encoding = "utf-8", errors = "backslashreplace")
OutFile = open('C:/Python/XML Tut/MinGrp_Out.ttxt', 'w', encoding = "utf-8", errors = "backslashreplace")
soup = BeautifulSoup(Input, features="xml")
results = soup.findAll('custs', attrs={})
# output = results [0]#[0]
for each_tag in results:
    cust_attrb_value = results[0]
    # print (cust_attrb_value)
    OutFile.write(cust_attrb_value)
OutFile.close()
What's the next (last?) step?
If this data is formatted correctly (that is, it is well-formed XML), you should consider lxml rather than BeautifulSoup. With lxml, you read the file and can then navigate the parsed tree, including with XPath queries. Those queries give you the lxml objects representing each node you're interested in; extract the data you need from them and write it out in whatever format you choose, using something like the csv module.
Specifically, in the lxml documentation, check out these tutorials:
Parsing from Strings and Files
The Element Class: Using XPath to Find Text
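The parse, query, write-with-csv flow described above could look roughly like this. To keep the sketch runnable without extra installs it uses the standard library's ElementTree, whose findall() accepts a limited XPath subset; lxml's etree offers the same parse and findall calls plus full XPath. The sample data is cut down from the question, and the column list is chosen by hand.

```python
import csv
import io
import xml.etree.ElementTree as ET

SAMPLE = """<custs>
  <record cust_ID="B123456#Y1996" l_name="Jungle" f_name="George" city="Fairbanks" zip="00010" />
  <record cust_ID="Q975697#Z2000" l_name="Freely" f_name="I" city="Yellow River" zip="03010" />
</custs>"""

root = ET.fromstring(SAMPLE)
records = root.findall('./record')   # XPath-style query for every <record>
columns = ['cust_ID', 'l_name', 'f_name', 'city', 'zip']

out = io.StringIO()                  # stand-in for an open output file
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(columns)
for rec in records:
    # rec.get() reads an attribute, with '' as the fallback if it is absent.
    writer.writerow([rec.get(col, '') for col in columns])

print(out.getvalue())
```

Letting the csv module do the writing means quoting and escaping are handled for any value that happens to contain the delimiter.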
I (also) wouldn't use BeautifulSoup for this, and though I like lxml, that's an extra install, and if you don't want to bother, this is simple enough to do with the standard lib ElementTree module.
Something like:
import xml.etree.ElementTree as ET
import sys

tree = ET.parse('test.xml')
root = tree.getroot()
rs = list(root)  # child elements; getchildren() was removed in Python 3.9
keys = list(rs[0].attrib.keys())
# Header row: attribute names, tab-separated
for a in keys:
    sys.stdout.write(a)
    sys.stdout.write('\t')
sys.stdout.write('\n')
for r in rs:
    # Every record must carry the same set of attributes
    assert set(keys) == set(r.attrib.keys())
    for k in keys:
        sys.stdout.write(r.attrib[k])
        sys.stdout.write('\t')
    sys.stdout.write('\n')
On Python 3, this produces:
zip m_name current city cust_ID l_name f_name
00010 OfThe 1 Fairbanks B123456#Y1996 Jungle George
03010 P 1 Yellow River Q975697#Z2000 Freely I
07008 0 Fallen Arches M7803#J2323 Jungle Jim
Note that the attribute order above came from a Python 3 release in which dictionary order was arbitrary, and with Python 2.7 the order will be different again. (On Python 3.7 and later, dicts keep insertion order, so the columns should come out in document order.) If you want a specific order, sort or reorder the list "keys".
The assert checks that all rows have the same attributes. If your elements actually have missing or different attributes, you'll have to remove it and add some code to deal with the differences and supply defaults for missing values. (In your sample data, you have an empty value (m_name=""), rather than a missing one. You might want to check that the consumer of this output handles this case, or else add some special handling for it.)
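If records really do differ in which attributes they carry, one possible way to supply defaults is csv.DictWriter with its restval parameter. This is a sketch under assumptions: the sample below deliberately drops m_name from one record, and the 'N/A' placeholder is an arbitrary choice.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample where the second record has no m_name attribute at all.
SAMPLE = """<custs>
  <record cust_ID="B123456#Y1996" l_name="Jungle" m_name="OfThe" />
  <record cust_ID="M7803#J2323" l_name="Jungle" />
</custs>"""

records = ET.fromstring(SAMPLE).findall('record')

# Collect every attribute name seen across all records, first-seen order.
columns = []
for rec in records:
    for name in rec.attrib:
        if name not in columns:
            columns.append(name)

out = io.StringIO()                  # stand-in for an open output file
writer = csv.DictWriter(out, fieldnames=columns, restval='N/A',
                        delimiter='\t', lineterminator='\n')
writer.writeheader()
for rec in records:
    writer.writerow(rec.attrib)      # missing keys are filled with restval

print(out.getvalue())
```

An attribute that is present but empty (such as m_name="") is written as an empty field, so it stays distinguishable from a missing one.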
<product product_id='66656432' name='munch'><category>men</category></product>
In BeautifulSoup, find the element with:
product=soup.find("product",attrs={})
then access its attributes by key, e.g. product["name"].
I've got a large XML file that I need to parse and look for a specific node. Once it has been found, I need to make a copy, edit a couple of values and write the file again.
So far I've managed to get the DOM element that I want. There are actually two of these elements already in the XML, so after I'm finished there will be three. Once I've made a copy of the element and edited the values, how do I then write it into the DOM (and thus the file)?
I'm using Python's from xml.dom import minidom at the moment.
In minidom you start by creating a Document:
from xml.dom.minidom import Document
doc = Document()
To build an element, create it with:
elem = doc.createElement('your_elem')
If you need to set attributes, do:
elem.setAttribute("some-attribute", your_attr)
If it is a text node you want to add, create it and append it to an element (a text node cannot be appended directly to the Document):
text_node = doc.createTextNode(str(some_content))
elem.appendChild(text_node)
For example, to build <some_elem key="my value">some my text</some_elem>, do it like this:
elem = doc.createElement('some_elem')
text_node = doc.createTextNode('some my text')
elem.appendChild(text_node)
elem.setAttribute('key', 'my value')
If you need to append another element to it:
elem.appendChild(some_other_elem)
Then append the element to the document (a Document accepts only one root element):
doc.appendChild(elem)
If you need a string representation, do:
doc.toxml()
or
doc.toprettyxml()
From the minidom documentation:
from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
So I guess appendChild is what you are asking for?
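For the specific case in the question (copy an existing node, edit it, put it back), a minimal sketch could combine cloneNode() with insertBefore(). The document and the element and attribute names below are hypothetical, since the question does not show its XML.

```python
from xml.dom import minidom

# Hypothetical document with two <item> elements.
doc = minidom.parseString(
    '<items><item id="1"><name>first</name></item>'
    '<item id="2"><name>second</name></item></items>'
)

original = doc.getElementsByTagName('item')[1]
copy = original.cloneNode(True)     # deep copy, children included

# Edit a couple of values on the copy only.
copy.setAttribute('id', '3')
copy.getElementsByTagName('name')[0].firstChild.data = 'third'

# Insert the copy as a sibling right after the node it was cloned from.
# When nextSibling is None, insertBefore() appends at the end.
original.parentNode.insertBefore(copy, original.nextSibling)

print(doc.toxml())
```

To get the result back into a file, the document can be written out with doc.writexml(f) on a file opened for writing, or by writing the string returned by doc.toxml().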
I am new to Python. I want to create an XML tree with one parent, several children and several subchildren. I've stored the child tags in the list 'TAG' and the subchild tags in the list 'SUB'.
I have come up with the following code, but I am not able to achieve the desired result!
def make_xml(tag,sub):
    '''
    Takes in two lists and Returns a XML object.
    The first list has to contain all the tag objects
    The Second list has to contain child data's
    '''
    from xml.etree.ElementTree import Element, SubElement, Comment, tostring
    top = Element("Grand Parent")
    comment = Comment('This is the ccode parse tree')
    top.append(comment)
    i=0
    try:
        for ee in tag:
            child = SubElement(top, 'Tag'+str(i))
            child.text = str(tag[i]).encode('utf-8',errors = 'ignore')
            subchild = SubElement(child, 'Content'+str(i))
            subchild.text = str(sub[i]).encode('utf-8',errors = 'ignore')
            i = i+1;
    except UnicodeDecodeError:
        print 'oops'
    return top
EDIT:
I have two lists like these:
TAG = ['HAPPY','GO','LUCKY']
SUB = ['ED','EDD','EDDY']
What I want is:
<G_parent>
  <parent1>
    HAPPY
    <child1>
      ED
    </child1>
  </parent1>
  <parent2>
    GO
    <child2>
      EDD
    </child2>
  </parent2>
  <parent3>
    LUCKY
    <child3>
      EDDY
    </child3>
  </parent3>
</G_parent>
The actual lists have many more items than this. I want to achieve this using a for loop or similar.
EDIT:
Oops, my bad!
The code works as expected when I pass the example lists. But in my real application the lists are long and contain text fragments extracted from a PDF file. Somewhere in that text I get a UnicodeDecodeError (reason: the text extracted from the PDF is messy; proof: 'oops' gets printed once) and the returned XML object is incomplete.
So I need to figure out a way to process my complete list even when UnicodeDecodeErrors occur. Is that possible? I'm using .decode('utf-8',errors='ignore'), and even then the parsing does not complete!
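One thing that stands out in the code above is that the try wraps the whole loop, so the first bad item ends processing for every item after it. In Python 2, calling .encode() on a byte string also triggers an implicit ASCII decode first, which is a likely source of the UnicodeDecodeError. A possible restructuring, sketched here for Python 3 and assuming the items may arrive as either bytes or str, decodes each item explicitly and keeps the error handling inside the loop. The to_text helper and the sample input are made up for illustration.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def to_text(value):
    """Return value as str, dropping bytes that are not valid UTF-8."""
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='ignore')
    return str(value)

def make_xml(tags, subs):
    top = Element('G_parent')
    for i, (tag, sub) in enumerate(zip(tags, subs), start=1):
        try:
            parent_text, child_text = to_text(tag), to_text(sub)
        except UnicodeError:
            # Safety net: skip only the offending pair, not the rest of the list.
            continue
        parent = SubElement(top, 'parent%d' % i)
        parent.text = parent_text
        SubElement(parent, 'child%d' % i).text = child_text
    return top

# The second tag contains a byte that is not valid UTF-8.
tree = make_xml(['HAPPY', b'G\xffO', 'LUCKY'], ['ED', 'EDD', 'EDDY'])
print(tostring(tree, encoding='unicode'))
```

With errors='ignore' the invalid byte is silently dropped; errors='replace' would keep a visible marker in its place instead.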