How can I create XML files like this?
<?xml version="1.0" encoding="utf-8"?>
<data>
<li class= 'playlistItem' data-type='local' data-mp3='PATH' >
<a class='playlistNonSelected' href='#'>NAME</a>
</li>
...
</data>
I'd create this dynamically and for each item I have, I'd fill in the PATH and NAME variables with the values I need.
I'm trying to use lxml. This is what I've come up with so far, but I don't think it's correct:
from lxml import etree
for item in my_list:
root = etree.Element('li', class = 'playlistItem', data-type = 'local', data-mp3 = PATH)
child = etree.Element('a', class = 'playlistNonSelected', href ='#')
child.text = NAME
Even if the above was correct, I'm lost at this point, because if I have 20 items in the list, how can I do this for each of them and then write it all to an XML file? I've tried looking at other answers but most of the replies are to generate XML like this:
<root>
<child/>
<child>some text</child>
</root>
And I can't figure out how to generate the kind I need. Sorry if I've made obvious mistakes. I appreciate any help. Thank you!
You are on the right track save for a few minor syntax and usage issues:
class is a Python keyword, so you can't use it as a keyword argument name (which is essentially what class = 'playlistItem' is doing).
data-type is not a valid Python identifier; it is evaluated as data MINUS type. Consider using something like dataType or data_type. There are ways around this but, IMHO, they would make the code unnecessarily complicated without adding any value (please see Edit #1 on how to do this).
That being said, the following code snippet should give you something usable and you can move on from there. Please feel free to let me know if you need any additional help:
from lxml import etree
data_el = etree.Element('data')
# You can do this in a loop and keep adding new elements.
# Note: each SubElement call creates a new element, so subsequent items need no copying.
li_el = etree.SubElement(data_el, "li", class_name='playlistItem', data_type="local", data_mp3="PATH")
a_el = etree.SubElement(li_el, "a", class_name='playlistNotSelected', href='#')
# tostring() returns bytes when given an encoding, so decode before printing
print(etree.tostring(data_el, encoding='utf-8', xml_declaration=True, pretty_print=True).decode('utf-8'))
This will generate the following output (which you can write to a file):
<?xml version='1.0' encoding='utf-8'?>
<data>
<li class_name="playlistItem" data_mp3="PATH" data_type="local">
<a class_name="playlistNotSelected" href="#"/>
</li>
</data>
Edit #0:
Alternatively, you can also write to a file by converting it to an ElementTree first, e.g.
import sys
# Replace sys.stdout.buffer with a file object opened in binary mode for your output file:
etree.ElementTree(data_el).write(sys.stdout.buffer, encoding='utf-8', xml_declaration=True, pretty_print=True)
Edit #1:
Since an element's attributes behave like a dictionary, you can use set to specify attribute names that are not valid Python identifiers, e.g.
li_el.set('class', 'playlistItem')
li_el.set('data-type', 'local')
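Putting the pieces together for a whole list, here is a minimal sketch. It assumes the items are available as (NAME, PATH) pairs; the `items` list and its values are made up for illustration. It uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml.etree accepts the same Element/SubElement calls with an attribute dictionary.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; lxml.etree offers the same calls used here

# Hypothetical input: one (NAME, PATH) pair per playlist entry
items = [("First song", "music/first.mp3"), ("Second song", "music/second.mp3")]

data_el = ET.Element("data")
for name, path in items:
    # Attribute names that are not valid Python identifiers go in a dict
    li_el = ET.SubElement(data_el, "li", {
        "class": "playlistItem",
        "data-type": "local",
        "data-mp3": path,
    })
    a_el = ET.SubElement(li_el, "a", {"class": "playlistNonSelected", "href": "#"})
    a_el.text = name

xml_bytes = ET.tostring(data_el, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

To write a file instead of printing, something like ET.ElementTree(data_el).write(some_path, encoding="utf-8", xml_declaration=True) should work, where some_path is whatever output location you choose.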
Related
I'm working with the XML file in this link (Downloadable file of 40MB). In this file, I'm expecting data from 2 types of tags.
Those are: OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0.
I wrote the following code for that:
ARTICLE_TAGS = ['OpportunitySynopsisDetail_1_0', 'OpportunityForecastDetail_1_0']
for _tag in ARTICLE_TAGS:
f = open(xml_f)
context = etree.iterparse(f, tag = _tag)
for _, e in context:
_id = e.xpath('.//OpportunityID/text()')
text = e.xpath('.//OpportunityTitle/text()')
f.close()
With that code, etree.iterparse(f, tag = _tag) returns an object which is not iterable. I think this occurs when the tag is not found in the XML file.
So I added the namespace to the tag passed to iterparse, like this:
context = etree.iterparse(f, tag='{http://apply.grants.gov/system/OpportunityDetail-V1.0}'+_tag)
Now it creates an iterable object, but I'm not getting any text. I tried the other namespaces in that file, but that didn't work either.
Please tell me how to solve this problem. This is a sample snippet of the XML file. The OpportunityForecastDetail_1_0 and OpportunitySynopsisDetail_1_0 tags are repeated many times in the XML file.
<?xml version="1.0" encoding="UTF-8"?>
<Grants xsi:schemaLocation="http://apply.grants.gov/system/OpportunityDetail-V1.0 https://apply07.grants.gov/apply/system/schemas/OppotunityDetail-V1.0.xsd" xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instace">
<OpportunitySynopsisDetail_1_0>
<OpportunityID>262148</OpportunityID>
<OpportunityTitle>Establishment of the Edmund S. Muskie Graduate Internship Program</OpportunityTitle>
</OpportunitySynopsisDetail_1_0>
<OpportunityForecastDetail_1_0>
<OpportunityID>284765</OpportunityID>
<OpportunityTitle>PPHF 2015: Immunization Grants-CDC Partnership: Strengthening Public Health Laboratories-financed in part by 2015 Prevention and Public Health Funds</OpportunityTitle>
</OpportunityForecastDetail_1_0>
</Grants>
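First, when you are parsing XML that contains namespaces, you must use those namespaces when looking up tag names.
Second, if etree here is the standard library's xml.etree.ElementTree, its iterparse doesn't take an argument named tag (lxml.etree.iterparse does), so I don't see how your code could have worked as posted.
Finally, the standard-library elements returned from iterparse don't have a method called xpath (lxml elements do), and in either library the unprefixed paths would not match the namespaced child elements.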
Here is an example of how to parse XML using iterparse:
import xml.etree.ElementTree as etree
NS = '{http://apply.grants.gov/system/OpportunityDetail-V1.0}'
ARTICLE_TAGS = [NS + 'OpportunitySynopsisDetail_1_0', NS + 'OpportunityForecastDetail_1_0']
# Open in binary mode so the parser can honor the encoding declared in the file
with open(xml_f, 'rb') as f:
    context = etree.iterparse(f)
    for _, e in context:
        if e.tag in ARTICLE_TAGS:
            # Child lookups need the namespace too
            _id = e.find(NS + 'OpportunityID')
            text = e.find(NS + 'OpportunityTitle')
            print(_id.text, text.text)
As I said in my comment, the Python documentation is helpful, as is the Effbot page on ElementTree. There are lots of other resources available; put xml.etree.elementtree into Google and start reading!
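Since the real file is around 40MB, it may also be worth clearing each record once it has been read. The following is a rough sketch of that idea using the standard library; the in-memory `sample` (with shortened titles) is a stand-in for the real file, and the `records` list is only there for illustration.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://apply.grants.gov/system/OpportunityDetail-V1.0}"
WANTED = {NS + "OpportunitySynopsisDetail_1_0", NS + "OpportunityForecastDetail_1_0"}

# Small in-memory stand-in for the real file (titles shortened for the example)
sample = io.BytesIO(b"""<Grants xmlns="http://apply.grants.gov/system/OpportunityDetail-V1.0">
  <OpportunitySynopsisDetail_1_0>
    <OpportunityID>262148</OpportunityID>
    <OpportunityTitle>Muskie Internship</OpportunityTitle>
  </OpportunitySynopsisDetail_1_0>
  <OpportunityForecastDetail_1_0>
    <OpportunityID>284765</OpportunityID>
    <OpportunityTitle>Immunization Grants</OpportunityTitle>
  </OpportunityForecastDetail_1_0>
</Grants>""")

records = []
# "end" events fire once an element and all of its children have been parsed
for _event, elem in ET.iterparse(sample, events=("end",)):
    if elem.tag in WANTED:
        records.append((elem.findtext(NS + "OpportunityID"),
                        elem.findtext(NS + "OpportunityTitle")))
        elem.clear()  # drop the children that have already been read

print(records)
```

For the real data, pass the file path (or a file opened in binary mode) to iterparse in place of `sample`.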
I'm attempting to save data from several lists in XML format, but I cannot understand how to make the XML display properly. An example of my code right now is as follows:
from lxml import etree
#Create XML Root
articles = etree.Element('root')
#Create Lists & Data
t_list = ['title1', 'title2', 'title3', 'title4', 'title5']
c_list = ['content1', 'content2', 'content3', 'content4', 'content5']
sum_list = ['summary1', 'summary2', 'summary3', 'summary4', 'summary5']
s_list = ['source1', 'source2', 'source3', 'source4', 'source5']
i = 0
for t in t_list:
    for i in range(len(t_list)):
        #Create SubElements of XML Root
        article = etree.SubElement(articles, 'Article')
        titles = etree.SubElement(article, 'Title')
        summary = etree.SubElement(article, 'Summary')
        source = etree.SubElement(article, 'Source')
        content = etree.SubElement(article, 'Content')
        #Add List Data to SubElements
        titles.text = t_list[i]
        summary.text = sum_list[i]
        source.text = s_list[i]
        content.text = c_list[i]
print(etree.tostring(articles, pretty_print=True))
My current output is jumbled, all written on a single line as follows:
b'<root>\n <Article>\n <Title>title1</Title>\n <Summary>summary1</Summary>\n <Source>source1</Source>\n <Content>content1</Content>\n </Article>\n
It looks like the pretty_print option in lxml is adding proper indentation, as well as the \n breaks I want, but they don't seem to be interpreted during output; everything is written on a single line.
The output I'm trying to get is as follows:
<root>
<Article>
<Title>title1</Title>
<Summary>summary1</Summary>
<Source>source1</Source>
<Content>content1</Content>
</Article>
Ideally, I'd like for my output to be viewed as a valid XML document, and display in proper nested format.
Your "Current Output" is the representation (the internal Python representation) of the bytestring returned by etree.tostring(); in Python 3, print(somebytestring) prints that representation instead of the decoded text.
Fortunately the solution is quite simple: just pass the desired encoding to etree.tostring(), i.e.:
xml = etree.tostring(articles, encoding="unicode", pretty_print=True)
print(xml)
I've only used the base ET module in Python and can't find an lxml download for Python 3.5 (which I'm on) in order to test it, but the b before the line indicates bytes, and a quick glance at the documentation indicates that tostring() has an encoding keyword, so you should just need to set that to unicode (with utf-8 you still get bytes, which you would then have to decode).
I'll also mention that you don't need to set "i" before your for-loop (Python will create the "i" it needs for the for-loop), though I personally would zip the lists and iterate over the items themselves (though that's not going to have any real impact on the code in this situation).
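The zip suggestion might look something like the sketch below. It uses the standard library's xml.etree.ElementTree rather than lxml (so pretty_print is replaced by ET.indent, which only exists on Python 3.9+), and the lists are shortened to three entries for brevity.

```python
import xml.etree.ElementTree as ET

t_list = ['title1', 'title2', 'title3']
c_list = ['content1', 'content2', 'content3']
sum_list = ['summary1', 'summary2', 'summary3']
s_list = ['source1', 'source2', 'source3']

articles = ET.Element('root')
# zip walks the four lists in lockstep, so no index variable is needed
for title, summary, source, content in zip(t_list, sum_list, s_list, c_list):
    article = ET.SubElement(articles, 'Article')
    ET.SubElement(article, 'Title').text = title
    ET.SubElement(article, 'Summary').text = summary
    ET.SubElement(article, 'Source').text = source
    ET.SubElement(article, 'Content').text = content

if hasattr(ET, "indent"):  # available from Python 3.9
    ET.indent(articles)

# encoding="unicode" returns a str, so print shows real line breaks
xml_text = ET.tostring(articles, encoding="unicode")
print(xml_text)
```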
I have a nested XML that looks like this:
<data>foo <data1>hello</data1> bar</data>
I am using minidom, but no matter how I try to get the values between "data", I only get "foo" but not "bar".
It is even worse if the XML is like this:
<data><data1>hello</data1> bar</data>
I only get a "None", which is correct according to the logic above. So I came across this: http://levdev.wordpress.com/2011/07/29/get-xml-element-value-in-python-using-minidom and concluded that it is due to a limitation of minidom?
So I used the method in that blog and I now get
foo <data1>hello</data1> bar
and
<data1>hello</data1> bar
which is acceptable. However, if I try to create a new node (createTextNode) using the output above as the node value, the markup is escaped and the XML becomes:
<data>foo &lt;data1&gt;hello&lt;/data1&gt; bar</data>
and
<data>&lt;data1&gt;hello&lt;/data1&gt; bar</data>
Is there any way that I can create it so that it looks like the original? Thank you.
You can use ElementTree for XML; it is very efficient for both retrieving and creating nodes.
Have a look at the ElementTree tutorials and the material on mixed-content XML.
Some examples of creating nodes:
import xml.etree.ElementTree as ET
data = ET.Element('data')
data1= ET.SubElement(data, 'data1',attr="value")
data1.text="hello"
data.text="bar"
data1.tail="some code"
ET.dump(data)
output :<data>bar<data1 attr="value">hello</data1>some code</data>
First of all, use the following function to prettify your XML so it is a lot easier to read:
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET
def prettify(elem):
    """Return a pretty-printed XML string for the Element. Props go
    to Maxime from stackoverflow for this code."""
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")
That makes stepping through the tree visually a lot simpler.
Next, I would suggest a modification to your XML that I think will make your life a whole lot easier.
Instead of :
<data>foo
<data1>hello</data1>
bar
</data>
which is mixed content (valid XML, but awkward to work with), I would save your 'foo' and 'bar' as attributes of data, so
it looks like this:
<data var1='foo' var2='bar'>
<data1>hello</data1>
</data>
to do this using xml.etree.ElementTree:
import xml.etree.ElementTree as ET
data = ET.Element('data', {'var1': 'foo', 'var2': 'bar'})
data1 = ET.SubElement(data, 'data1')
data1.text = 'hello'
print(prettify(data))
So, as pointed out by @pandubear, the XML:
<data>foo <data1>hello</data1> bar</data>
does have two text nodes, containing "foo " and " bar", so what can be done is to iterate through all the child nodes of data and collect their values.
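A small minidom sketch of that idea, with the reverse direction included: to recreate the original structure, the text and the data1 element are appended as separate nodes rather than passed as one string to createTextNode. The variable names here are only for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<data>foo <data1>hello</data1> bar</data>")
data = doc.documentElement

# Collect only the text nodes that are direct children of <data>
texts = [node.data for node in data.childNodes if node.nodeType == node.TEXT_NODE]
print(texts)

# Rebuild the same mixed content: text and element nodes are appended separately
new_doc = minidom.Document()
new_data = new_doc.createElement("data")
new_doc.appendChild(new_data)
new_data.appendChild(new_doc.createTextNode("foo "))
child = new_doc.createElement("data1")
child.appendChild(new_doc.createTextNode("hello"))
new_data.appendChild(child)
new_data.appendChild(new_doc.createTextNode(" bar"))

rebuilt = new_data.toxml()
print(rebuilt)
```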
I must be doing something inherently wrong here; every example I've seen and searched for on SO seems to suggest this would work.
I'm trying to use an XPath search with lxml etree library to parse a garmin tcx file:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd">
<Workouts>
<Workout Sport="Biking">
<Name>3P2 WK16 - 3</Name>
<Step xsi:type="Step_t">
<StepId>1</StepId>
<Name>[MP19]6:28-6:38</Name>
<Duration xsi:type="Distance_t">
<Meters>13000</Meters>
</Duration>
<Intensity>Active</Intensity>
<Target xsi:type="Speed_t">
<SpeedZone xsi:type="PredefinedSpeedZone_t">
<Number>2</Number>
</SpeedZone>
</Target>
</Step>
......
</Workout>
</Workouts>
</TrainingCenterDatabase>
I'd like to return the SpeedZone Element only where the type is PredefinedSpeedZone_t. I thought I'd be able to do:
root = ET.parse(open('file.tcx'))
xsi = {'xsi': 'http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2'}
for speed_zone in root.xpath(".//xsi:SpeedZone[@xsi:type='PredefinedSpeedZone_t']", namespaces=xsi):
    print(speed_zone)
Though this doesn't seem to be the case. I've tried lots of combinations of removing/adding namespaces and to no avail. If I remove the attribute search and leave it as ".//xsi:SpeedZone" then this does return:
<Element {http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}SpeedZone at 0x2595188>
as I'd expect.
I guess I could do it inside the for loop but it just feels like it should be possible on one line!
I'm a bit late, but the other answers are confusing IMHO.
In the Python code in the question and in the two other answers, the xsi prefix is bound to the http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 URI. But in the XML document with the Garmin data, xsi is bound to http://www.w3.org/2001/XMLSchema-instance.
Since there are two namespaces at play here, I think the following code gives a clearer picture of what's going on. The namespace associated with the tcd prefix is the default namespace.
from lxml import etree
NSMAP = {"tcd": "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2",
         "xsi": "http://www.w3.org/2001/XMLSchema-instance"}
root = etree.parse('file.tcx')
for speed_zone in root.xpath(".//tcd:SpeedZone[@xsi:type='PredefinedSpeedZone_t']",
                             namespaces=NSMAP):
    print(speed_zone)
Output:
<Element {http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2}SpeedZone at 0x25b7e18>
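The same two-namespace point can be illustrated without XPath prefixes at all. The sketch below uses the standard library's xml.etree.ElementTree with fully qualified {uri}name keys; the trimmed `sample` document and the CustomSpeedZone_t value are made up for the example.

```python
import xml.etree.ElementTree as ET

TCD = "http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"  # default namespace (elements)
XSI = "http://www.w3.org/2001/XMLSchema-instance"                   # namespace of the type attribute

# Trimmed, made-up sample; CustomSpeedZone_t is only there as a non-matching value
sample = """<TrainingCenterDatabase xmlns="%s" xmlns:xsi="%s">
  <Target xsi:type="Speed_t">
    <SpeedZone xsi:type="PredefinedSpeedZone_t"><Number>2</Number></SpeedZone>
    <SpeedZone xsi:type="CustomSpeedZone_t"><Number>5</Number></SpeedZone>
  </Target>
</TrainingCenterDatabase>""" % (TCD, XSI)

root = ET.fromstring(sample)

# The element name lives in the Garmin namespace, the attribute name in the xsi namespace
matches = [zone for zone in root.iter("{%s}SpeedZone" % TCD)
           if zone.get("{%s}type" % XSI) == "PredefinedSpeedZone_t"]
numbers = [zone.findtext("{%s}Number" % TCD) for zone in matches]
print(numbers)
```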
One way to work around this is to avoid specifying the attribute name and use * instead:
.//xsi:SpeedZone[@*='PredefinedSpeedZone_t']
Another option (not as nice as the previous one) is to get all the SpeedZone tags and check the attribute value in the loop:
# root is an ElementTree here, so the namespace map comes from its root element
attribute_name = '{%s}type' % root.getroot().nsmap['xsi']
for speed_zone in root.xpath(".//xsi:SpeedZone", namespaces=xsi):
    if speed_zone.attrib.get(attribute_name) == 'PredefinedSpeedZone_t':
        print(speed_zone)
Hope that helps.
If all else fails you can still use
".//xsi:SpeedZone[@*[name() = 'xsi:type' and . = 'PredefinedSpeedZone_t']]"
Using name() is not as nice as directly addressing the namespaced attribute, but at least etree understands it.
My current code is
xml_obj = lxml.objectify.Element('root_name')
xml_obj[root_name] = str('text')
lxml.etree.tostring(xml_obj)
but this creates the following xml:
<root_name><root_name>text</root_name></root_name>
In the application I am using this for I could easily use text substitution to solve this problem, but it would be nice to know how to do it using the library.
I'm not that familiar with objectify, but I don't think that's the way it's intended to be used. The way it represents objects is that a node at any given level is, say, a class name, and the subnodes are field names (with types) and values. The normal way to use it would be something more like this:
xml_obj = lxml.objectify.Element('root_name')
xml_obj.root_name = 'text'
etree.dump(xml_obj)
<root_name xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE">
<root_name py:pytype="str">text</root_name>
</root_name>
What you want would be way easier to do with etree:
xml_obj = lxml.etree.Element('root_path')
xml_obj.text = 'text'
etree.dump(xml_obj)
<root_path>text</root_path>
If you really need it to be in objectify, it looks like while you shouldn't mix directly, you can use tostring to generate XML, then objectify.fromstring to bring it back. But probably, if this is what you want, you should just use etree to generate it.
I don't think you can write data into the root element. You may need to create a child element like this:
xml_obj = lxml.objectify.Element('root_name')
xml_obj.child_name = str('text')
lxml.etree.tostring(xml_obj)