I am trying to pull just a value from some XML in Python using Beautiful Soup (but I'll gleefully dump it for anything else if recommended). Consider the following bit of code:
global humidity, temperature, weatherdescription, winddescription
query = urllib2.urlopen('http://www.google.com/ig/api?weather="Aberdeen+Scotland"')
weatherxml = query.read()
weathersoup = BeautifulSoup(weatherxml)
query.close()
print weatherxml
This prints out the weather forecast for Aberdeen, Scotland as XML, currently like this (much XML removed to prevent giant wall of text syndrome):
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0"
tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"
><forecast_information><city data="Aberdeen, Aberdeen City"/><postal_code data=""Aberdeen Scotland""/><latitude_e6
data=""/><longitude_e6 data=""/><forecast_date
data="2012-07-31"/><current_date_time data="1970-01-01 00:00:00
+0000"/><unit_system data="US"/></forecast_information><current_conditions><condition
data="Clear"/><temp_f data="55"/><temp_c data="13"/><humidity
data="Humidity: 82%"/><icon
data="/ig/images/weather/sunny.gif"/><wind_condition data="Wind: SE at
8 mph"/></current_conditions>
Now I'd like, for example, to be able to populate variables with the values of the weather in this XML, for example make temperature = 13. Parsing it is proving a nightmare.
If I use any of the find functions on weathersoup, I get the entire tag (e.g. for temp_c it returns <temp_c data="13"/>); various other functions return nothing, the entire sheet, or parts of it.
How do I simply return the VALUE for any given XML tag, without a mess of "strip"s, or resorting to regex, or basically hacking it?
To access an attribute data in element temp_c:
weathersoup.temp_c['data']
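The same attribute lookup can be sketched with the standard library's ElementTree instead of Beautiful Soup. This is only an illustration: the XML string is a trimmed, well-formed fragment of the reply shown in the question (an assumption, since the posted reply has doubled quotes in postal_code), and the names conditions, temperature and humidity are made up for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed fragment of the reply from the question (assumed input).
XML = (
    '<xml_api_reply version="1"><weather><current_conditions>'
    '<condition data="Clear"/><temp_f data="55"/><temp_c data="13"/>'
    '<humidity data="Humidity: 82%"/>'
    '<wind_condition data="Wind: SE at 8 mph"/>'
    '</current_conditions></weather></xml_api_reply>'
)

root = ET.fromstring(XML)
current = root.find(".//current_conditions")

# Map each child tag name to the value of its "data" attribute.
conditions = {child.tag: child.get("data") for child in current}

temperature = int(conditions["temp_c"])
humidity = conditions["humidity"]
print(temperature, humidity)
```

Every element under current_conditions keeps its value in the data attribute, so one dictionary comprehension covers all of them without any string stripping.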
Use lxml, and get friendly with XPath. Some of this example doesn't make sense with the XML you provided, since it doesn't parse correctly... but hopefully it gives you an idea of how powerful XPath can be.
from lxml import etree
# xmlstr is the string of the input XML data
root = etree.fromstring(xmlstr)
# print the text in all current_date_time elements
for elem in root.xpath('//current_date_time'):
    print elem.text
# print the value of the data attribute of every temp_c element
for value in root.xpath('//temp_c/@data'):
    print value
# print the text for only the temp_c elements whose data attribute is 'Celsius'
for elem in root.xpath('//temp_c[@data="Celsius"]'):
    print elem.text
# print the text for only the temp_c elements that are directly under the temperatures element, which is the root.
for elem in root.xpath('/temperatures/temp_c'):
    print elem.text
Related
I am trying to write a new attribute value to an XML element while iterating over the file with ET.iterparse() in a for loop. Any suggestions on how to do this?
I want to avoid loading the whole XML file because it is quite large, which is why I only handle a single element at a time, on its start event.
Here is the code that I have:
import xml.etree.cElementTree as ET
def main_function():
    osmfile = 'sample.osm'
    osm_file = open(osmfile, 'r+')
    for event, elem in ET.iterparse(osm_file, events=('start',)):
        if elem.tag == 'node':
            for tag in elem.iter('tag'):
                if is_addr_street_tag(tag): # Function returns boolean
                    cleaned_street_name = cleaning_street(tag.attrib['v']) # Function returns cleaned street name
                    ##===================================================##
                    ## Write cleaned_street_name to XML tag attrib value ##
                    ##===================================================##
    osm_file.close()
BLUF: Apparently it is not possible to do that without opening the whole XML file and then later rewriting the whole XML file.
1) You can not write the attribute back to the element (although you actually can but it would be difficult, time consuming, and inelegant)
2) "It is physically impossible to replace a text in a file with a shorter or longer text without rewriting the entire file. (The very only exceptions being "exactly the same length text" and "the data is at the very end".)"
Here is the comment from usr2564301 on a question related to yours about changing an attribute value of an element without opening the whole XML document.
That cannot possibly work. The XML handling is unaware that the data came from a file and so it cannot "write back" the changed value at the exact same position in the file. Even if it could: it is physically impossible to replace a text in a file with a shorter or longer text without rewriting the entire file. (The very only exceptions being "exactly the same length text" and "the data is at the very end".) – usr2564301
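Since the file has to be rewritten anyway, one way to keep memory use low is to stream the input and write a corrected copy to a second file. The sketch below is only that, a sketch: rewrite_streets and its clean argument are hypothetical names, the check on k == "addr:street" is an assumption about what is_addr_street_tag does, and it assumes a file without namespaces, comments or a DOCTYPE.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def rewrite_streets(src_path, dst_path, clean):
    """Stream src_path and write a copy to dst_path with cleaned street names."""
    depth = 0
    with open(dst_path, "wb") as out:
        for event, elem in ET.iterparse(src_path, events=("start", "end")):
            if event == "start":
                depth += 1
                if depth == 1:
                    # Re-emit the root's opening tag by hand.
                    attrs = "".join(
                        " %s=%s" % (k, quoteattr(v)) for k, v in elem.attrib.items()
                    )
                    out.write(("<%s%s>\n" % (elem.tag, attrs)).encode("utf-8"))
                continue
            depth -= 1
            if depth == 1:
                # A direct child of the root is complete: fix it, write it, free it.
                for tag in elem.iter("tag"):
                    if tag.get("k") == "addr:street":  # assumed street-tag test
                        tag.set("v", clean(tag.get("v")))
                out.write(ET.tostring(elem, encoding="utf-8"))
                elem.clear()
            elif depth == 0:
                # The root element has ended: close it.
                out.write(("</%s>\n" % elem.tag).encode("utf-8"))
```

The changes are applied on the end event, when the element and its children are complete. Once the copy is written you can replace the original file with it. The emptied elements stay attached to the root, so for extremely large files you may also want to remove them from the root as you go.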
I am trying to parse an SAP results XML file (generated in soapUI) in Python using minidom, and everything goes smoothly until it comes to retrieving values.
No matter what type of node it is, the value printed is None or just an empty string.
Nodes have different types, and the only value I can get so far is the tag name for an element node. When it comes to its value I get None.
For a text node I get #text for nodeName, 3 for nodeType, but an empty string for nodeValue.
What's wrong with it?
The code is:
from xml.dom.minidom import parse, Node
def parseData():
    try:
        data = parse('data.xml')
    except (IOError):
        print 'No \'data.xml\' file found. Move or rename the file.'
        return
    Milestones = data.getElementsByTagName('IT_MILESTONES')
    for node in Milestones:
        item_list = node.getElementsByTagName('item')
        print(item_list[0].childNodes[1].nodeName)
        print(item_list[0].childNodes[1].nodeType)
        print(item_list[0].childNodes[1].nodeValue)
The important part of the XML structure looks like this:
<IT_MILESTONES>
<item>
<AUFNR>000070087734</AUFNR>
<INDEX_SEQUENCE>2300</INDEX_SEQUENCE>
<MLSTN>1</MLSTN>
<TEDAT>2012-08-01</TEDAT>
<TETIM>09:12:38</TETIM>
<LST_ACTDT>2012-08-01</LST_ACTDT>
<MOBILE>X</MOBILE>
<ONLY_SL/>
<VORNR>1292</VORNR>
<EINSA/>
<EINSE/>
<NOT_FOR_NEXT_MS>X</NOT_FOR_NEXT_MS>
</item>
</IT_MILESTONES>
You should have a look at item_list[0].childNodes[1].childNodes. These probably contain what you are looking for. For example:
item_list[0].childNodes[11].childNodes[0].nodeValue
is the date
u'2012-08-01'
Nodes of type 1 (element nodes) do not have a nodeValue, but they do have childNodes. Nodes of type 3 (text nodes) have a nodeValue.
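To make the element-versus-text-node distinction concrete, here is a small sketch that collects the text of every child element of an item into a dictionary. The helper name item_to_dict is made up for the example, and the XML is a shortened copy of the structure from the question.

```python
from xml.dom.minidom import parseString, Node

# Shortened copy of the structure from the question.
XML = """<IT_MILESTONES>
<item>
<AUFNR>000070087734</AUFNR>
<TEDAT>2012-08-01</TEDAT>
<ONLY_SL/>
</item>
</IT_MILESTONES>"""


def item_to_dict(item):
    """Return {tag name: text} for each element child of item."""
    values = {}
    for child in item.childNodes:
        if child.nodeType != Node.ELEMENT_NODE:
            continue  # skip the whitespace text nodes between elements
        # The value lives in the element's own text-node children.
        values[child.nodeName] = "".join(
            n.nodeValue for n in child.childNodes if n.nodeType == Node.TEXT_NODE
        )
    return values


doc = parseString(XML)
first = item_to_dict(doc.getElementsByTagName("item")[0])
print(first)
```

Filtering on nodeType also avoids hard-coded indexes such as childNodes[11], which shift whenever the whitespace in the file changes.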
I have an XML file with several thousand records in it in the form of:
<custs>
<record cust_ID="B123456#Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="Q975697#Z2000" l_name="Freely" f_name="I" m_name="P" city="Yellow River" zip="03010" current="1" />
<record cust_ID="M7803#J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>
# (I know it's not normalized. This is just sample data)
How can I convert this into a CSV or tab-delimited file? I know I can hard-code it in Python using re.compile() statements, but there has to be something easier, and more portable across different XML file layouts.
I've found a couple of threads here about attributes (Beautifulsoup unable to extract data using attrs=class, Extracting an attribute value with beautifulsoup), and they have gotten me almost there with:
# Python 3.30
#
from bs4 import BeautifulSoup
import fileinput
Input = open("C:/Python/XML Tut/MinGrp.xml", encoding = "utf-8", errors = "backslashreplace")
OutFile = open('C:/Python/XML Tut/MinGrp_Out.ttxt', 'w', encoding = "utf-8", errors = "backslashreplace")
soup = BeautifulSoup(Input, features="xml")
results = soup.findAll('custs', attrs={})
# output = results [0]#[0]
for each_tag in results:
    cust_attrb_value = results[0]
    # print (cust_attrb_value)
    OutFile.write(cust_attrb_value)
OutFile.close()
What's the next (last?) step?
If this data is formatted correctly (that is, it is well-formed XML), you should consider lxml rather than BeautifulSoup. With lxml, you read the file and can then apply DOM logic to it, including XPath queries. With XPath queries you get the lxml objects that represent each node you're interested in, extract the data you need from them, and write it out in an arbitrary format of your choosing using something like the csv module.
Specifically, in the lxml documentation, check out these tutorials:
Parsing from Strings and Files
The Element Class: Using XPath to Find Text
I (also) wouldn't use BeautifulSoup for this, and though I like lxml, that's an extra install, and if you don't want to bother, this is simple enough to do with the standard lib ElementTree module.
Something like:
import xml.etree.ElementTree as ET
import sys
tree = ET.parse('test.xml')
root = tree.getroot()
rs = list(root)  # direct children; getchildren() was removed in Python 3.9
keys = rs[0].attrib.keys()
for a in keys:
    sys.stdout.write(a)
    sys.stdout.write('\t')
sys.stdout.write('\n')
for r in rs:
    assert keys == r.attrib.keys()
    for k in keys:
        sys.stdout.write(r.attrib[k])
        sys.stdout.write('\t')
    sys.stdout.write('\n')
will, under Python 3, produce:
zip m_name current city cust_ID l_name f_name
00010 OfThe 1 Fairbanks B123456#Y1996 Jungle George
03010 P 1 Yellow River Q975697#Z2000 Freely I
07008 0 Fallen Arches M7803#J2323 Jungle Jim
Note that with Python-2.7, the order of the attributes will be different.
If you want them to output in a different specific order, you should sort or
order the list "keys" .
The assert is checking that all rows have the same attributes.
If you actually have missing or different attributes in the elements,
then you'll have to remove that and add some code to deal with the differences
and supply defaults for missing values. (In your sample data, you have a
null value (m_name=""), rather than a missing value. You might want to check
that this case is handled OK by the consumer of this output, or else add some
more special handling for this case.)
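The ordering and missing-value concerns above can also be handled by combining ElementTree with the csv module. This is a minimal sketch, assuming you know the column names up front; records_to_tsv and FIELDS are names invented for the example, and the XML is the sample data from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

XML = """<custs>
<record cust_ID="B123456#Y1996" l_name="Jungle" f_name="George" m_name="OfThe" city="Fairbanks" zip="00010" current="1" />
<record cust_ID="Q975697#Z2000" l_name="Freely" f_name="I" m_name="P" city="Yellow River" zip="03010" current="1" />
<record cust_ID="M7803#J2323" l_name="Jungle" f_name="Jim" m_name="" city="Fallen Arches" zip="07008" current="0" />
</custs>"""

# Column order is fixed here rather than taken from the first record.
FIELDS = ["cust_ID", "l_name", "f_name", "m_name", "city", "zip", "current"]


def records_to_tsv(xml_text, fieldnames):
    """Return the attributes of every <record> as tab-delimited text."""
    root = ET.fromstring(xml_text)
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        delimiter="\t",
        restval="",             # default for attributes missing from a record
        extrasaction="ignore",  # drop attributes not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for record in root.iter("record"):
        writer.writerow(record.attrib)
    return buf.getvalue()


tsv = records_to_tsv(XML, FIELDS)
print(tsv)
```

Passing an explicit fieldnames list fixes the column order regardless of the Python version, and restval supplies the default for any attribute a record lacks. To write a real file, open it with newline="" and hand that file object to DictWriter instead of the StringIO buffer.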
<product product_id='66656432' name='munch'><category>men</category></product>
In Beautiful Soup:
product = soup.find("product", attrs={})
Then use the attribute name to access the data, like product["name"].
I am a bit stuck on a project I am doing which uses Python, which I am very new to. I have been told to use ElementTree to get specified data out of an incoming XML file. It sounds simple but I am not great at programming. Below is a (very!) tiny example of an incoming file along with the code I am trying to use.
I would like any tips or places to go next with this. I have tried searching and following what other people have done but I can't seem to get the same results. My aim is to get the information contained in "Active", "Room" and "Direction", but later on I will need to get much more information.
I have tried using XPaths but that does not work too well, especially with the namespaces the XML uses and the fact that an XPath for everything I need would become too large. I have simplified the example so I can understand the principle, as after this it must be extended to gain more information from an "AssetEquipment" and from multiple instances of them. The end goal would be all information from one piece of equipment being saved to a dictionary so I can manipulate it later, with each new piece of equipment in its own separate dictionary.
Example XML:
<AssetData>
    <Equipment>
        <AssetEquipment ID="3" name="PC960">
            <Active>Yes</Active>
            <Location>
                <RoomLocation>
                    <Room>23</Room>
                    <Area>
                        <X-Area>-1</X-Area>
                        <Y-Area>2.4</Y-Area>
                    </Area>
                </RoomLocation>
            </Location>
            <Direction>Positive</Direction>
            <AssetSupport>12</AssetSupport>
        </AssetEquipment>
    </Equipment>
</AssetData>
Example Code:
import re
import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Temp\Example.xml')
root = tree.getroot()
ns = "{http://namespace.co.uk}"
for equipment in root.findall(ns + "Equipment//"):
    tagname = re.sub(r'\{.*?\}','',equipment.tag)
    name = equipment.get('name')
    if tagname == 'AssetEquipment':
        print "\tName: " + repr(name)
        for attributes in root.findall(ns + "Equipment/" + ns + "AssetEquipment//"):
            attname = re.sub(r'\{.*?\}','',attributes.tag)
            # This does not work but I need it to be found while
            # in this instance of "AssetEquipment" so it does not
            # call information from another asset instead.
            if tagname == 'Room':
                room = equipment.text
                print "\t\tRoom:", repr(room)
import xml.etree.cElementTree as ET
tree = ET.parse('test.xml')
for elem in tree.iter():
    if elem.tag=='{http://www.namespace.co.uk}AssetEquipment':
        output={}
        for elem1 in list(elem):
            if elem1.tag=='{http://www.namespace.co.uk}Active':
                output['Active']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Direction':
                output['Direction']=elem1.text
            if elem1.tag=='{http://www.namespace.co.uk}Location':
                for elem2 in list(elem1):
                    if elem2.tag=='{http://www.namespace.co.uk}RoomLocation':
                        for elem3 in list(elem2):
                            if elem3.tag=='{http://www.namespace.co.uk}Room':
                                output['Room']=elem3.text
        print output
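The nested loops above can be shortened by letting ElementTree resolve the namespace through a prefix mapping. The following is a sketch under assumptions: the namespace URI is the one used in the code above (the real file may use a different one), the XML is a reduced, namespaced version of the example, and equipment_dicts is a name invented here.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for find/findtext; the URI is assumed from the code above.
NS = {"a": "http://www.namespace.co.uk"}

XML = """<AssetData xmlns="http://www.namespace.co.uk">
  <Equipment>
    <AssetEquipment ID="3" name="PC960">
      <Active>Yes</Active>
      <Location><RoomLocation><Room>23</Room></RoomLocation></Location>
      <Direction>Positive</Direction>
    </AssetEquipment>
  </Equipment>
</AssetData>"""


def equipment_dicts(root):
    """Return one dictionary per AssetEquipment element."""
    results = []
    for eq in root.iterfind(".//a:AssetEquipment", NS):
        # Paths are relative to eq, so values never leak from another asset.
        results.append({
            "name": eq.get("name"),
            "Active": eq.findtext("a:Active", namespaces=NS),
            "Room": eq.findtext("a:Location/a:RoomLocation/a:Room", namespaces=NS),
            "Direction": eq.findtext("a:Direction", namespaces=NS),
        })
    return results


assets = equipment_dicts(ET.fromstring(XML))
print(assets)
```

Because each lookup starts from the current AssetEquipment element, adding more fields later only means adding more entries to the dictionary.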
I am new to Python. I want to create an XML tree with one parent, several children and several sub-children. The child tags are stored in the list 'TAG' and the sub-child tags in the list 'SUB'.
I have come up with the following code, but I am not able to achieve the desired result!
def make_xml(tag,sub):
    '''
    Takes in two lists and Returns a XML object.
    The first list has to contain all the tag objects
    The Second list has to contain child data's
    '''
    from xml.etree.ElementTree import Element, SubElement, Comment, tostring
    top = Element("Grand Parent")
    comment = Comment('This is the ccode parse tree')
    top.append(comment)
    i=0
    try:
        for ee in tag:
            child = SubElement(top, 'Tag'+str(i))
            child.text = str(tag[i]).encode('utf-8',errors = 'ignore')
            subchild = SubElement(child, 'Content'+str(i))
            subchild.text = str(sub[i]).encode('utf-8',errors = 'ignore')
            i = i+1;
    except UnicodeDecodeError:
        print 'oops'
    return top
EDIT:
I have two lists like these:
TAG = ['HAPPY','GO','LUCKY']
SUB = ['ED','EDD','EDDY']
What I want is:
<G_parent>
    <parent1>
        HAPPY
        <child1>
            ED
        </child1>
    </parent1>
    <parent2>
        GO
        <child2>
            EDD
        </child2>
    </parent2>
    <parent3>
        LUCKY
        <child3>
            EDDY
        </child3>
    </parent3>
</G_parent>
The actual list has many more entries than this. I want to achieve this using a for loop or similar.
EDIT:
Oops, my bad!
The code works as expected when I pass the example lists. But in my real application the list is long. It contains text fragments extracted from a PDF file. Somewhere in that text I get a UnicodeDecodeError (reason: the text extracted from the PDF is messy; proof: 'oops' gets printed once) and the returned XML object is incomplete.
So I need to figure out a way to process my complete list even when a UnicodeDecodeError occurs. Is that possible? I'm using .decode('utf-8',errors='ignore') and even then the parsing does not complete!
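One possible direction, sketched in Python 3 rather than the Python 2 used above: convert each item to text separately, so a single bad fragment cannot abort the whole loop. The helper to_text is a made-up name, the element names follow the desired output in the question, and the bytes item in the usage line is invented test data. Text extracted from a PDF may also contain control characters that are not legal in XML, which this sketch does not handle.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def to_text(value):
    """Return value as text, dropping bytes that are not valid UTF-8."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="ignore")
    return str(value)


def make_xml(tags, subs):
    """Build <G_parent> with one <parentN>/<childN> pair per list entry."""
    top = Element("G_parent")
    for i, (tag, sub) in enumerate(zip(tags, subs), start=1):
        parent = SubElement(top, "parent%d" % i)
        parent.text = to_text(tag)
        child = SubElement(parent, "child%d" % i)
        child.text = to_text(sub)
    return top


# The second tag contains an invalid byte on purpose.
tree = make_xml(["HAPPY", b"G\xffO", "LUCKY"], ["ED", "EDD", "EDDY"])
print(tostring(tree))
```

Because the conversion happens per item, the invalid byte is simply dropped from that one fragment and the remaining entries are still added to the tree.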