I'm trying to parse data from a website and cannot print the data.
import xml.etree.ElementTree as ET
from urllib import urlopen
link = urlopen('http://weather.aero/dataserver_current/httpparam?dataSource=metars&requestType=retrieve&format=xml&stationString=KSFO&hoursBeforeNow=1')
tree = ET.parse(link)
root = tree.getroot()
data = root.findall('data/metar')
for metar in data:
    print metar.find('temp_c').text
Element names are case sensitive, so the path has to spell the tag as METAR:
data = root.findall('data/METAR')
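A small sketch of why the lowercase path comes back empty. It is written for Python 3 and runs against an inline sample rather than the live feed; the sample's layout (a response root, a data wrapper, a station_id field) is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped roughly like the feed; the real response has more fields.
SAMPLE = """<response>
  <data num_results="1">
    <METAR>
      <station_id>KSFO</station_id>
      <temp_c>14.4</temp_c>
    </METAR>
  </data>
</response>"""

root = ET.fromstring(SAMPLE)

# The lowercase path matches nothing because the tag is spelled METAR.
lower_matches = root.findall('data/metar')
upper_matches = root.findall('data/METAR')

# findtext() returns the text of the first matching child.
temps = [metar.findtext('temp_c') for metar in upper_matches]
print(len(lower_matches), temps)
```

An empty list from findall() is not an error, so a misspelled tag just makes the loop body never run.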
I am learning how to parse documents using lxml. To practice, I'm trying to parse my LinkedIn page. It has plenty of information, and I thought it would be good training.
Enough context. Here is what I'm doing:
going to the URL https://www.linkedin.com/in/NAME/
opening the page source and saving it as "linkedin.html"
since I'm trying to extract my current job, doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is
But it always returns an empty list for my variable title.
I've been trying all day but still don't understand what I'm doing wrong.
I found the answer to my problem: I added an encoding parameter to the open() call.
Here is what I did:
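def parse_html_file(filename):
    # Read the saved page with an explicit encoding, then parse it as HTML.
    with open(filename, encoding="utf8") as f:
        text = f.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    return tree

tree = parse_html_file('linkedin.html')
# XPath selects attributes with @, and the class value must match exactly.
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())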
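For comparison, a relative search with an attribute predicate tends to be less brittle than a long absolute path copied from the browser. The sketch below uses the standard library's ElementTree instead of lxml, so it only works on well-formed markup; the sample page and the name in it are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the saved page.
PAGE = """<html><body><ul>
  <li class="inline t-24 t-black t-normal break-words">  Jane Doe  </li>
  <li class="other">ignored</li>
</ul></body></html>"""

root = ET.fromstring(PAGE)

# .// searches at any depth; [@class='...'] requires an exact attribute value.
matches = root.findall(".//li[@class='inline t-24 t-black t-normal break-words']")
names = [li.text.strip() for li in matches]
print(names)
```

The predicate compares the whole attribute string, so a page that adds or reorders class names would no longer match.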
I want to parse the response from this URL to get the text of the Roman elements:
http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for count in counts:
    print count.get('Roman')
But it didn't work.
Try tree.findall('.//{urn:yahoo:jp:jlp:FuriganaService}Word'). It seems you need to specify the namespace too.
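A minimal sketch of the namespace issue, assuming Python 3 and an inline sample instead of the live service. The namespace URI and the Result, WordList, Word and Roman names come from the thread; the ResultSet root, the Surface element and the romanized values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; only the namespace URI and some tag names are from the thread.
SAMPLE = """<ResultSet xmlns="urn:yahoo:jp:jlp:FuriganaService">
  <Result>
    <WordList>
      <Word><Surface>私</Surface><Roman>watasi</Roman></Word>
      <Word><Surface>は</Surface><Roman>ha</Roman></Word>
    </WordList>
  </Result>
</ResultSet>"""

root = ET.fromstring(SAMPLE)

# Map a prefix of our choosing to the document's default namespace.
ns = {'f': 'urn:yahoo:jp:jlp:FuriganaService'}

unqualified = root.findall('.//Word')      # no namespace given, so no matches
qualified = root.findall('.//f:Word', ns)  # prefix is resolved through the mapping
print(len(unqualified), len(qualified))
```

Passing a prefix mapping is equivalent to writing the {uri}Tag form by hand and keeps longer paths readable.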
I recently ran into a similar issue. It was because I was using an older version of the xml.etree package, and to work around it I had to write a loop for each level of the XML structure. For example:
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for result in tree.findall('Result'):
    for wordlist in result.findall('WordList'):
        for word in wordlist.findall('Word'):
            print(word.get('Roman'))
Edit:
With the suggestion from @omu_negru I was able to get this working. There was another issue: to read "Roman" you were calling get(), which returns an attribute of the tag. The element's text attribute gives you the text between the opening and closing tags. Also, if a Word has no Roman child, find() returns None, and accessing .text on None raises an AttributeError, so check for None first.
# encoding: utf-8
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
ns = '{urn:yahoo:jp:jlp:FuriganaService}'
counts = tree.findall('.//%sWord' % ns)
for count in counts:
    roman = count.find('%sRoman' % ns)
    if roman is None:
        print 'Not found'
    else:
        print roman.text
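The difference between get() and text can be seen on a small namespace-free sample. This is a Python 3 sketch; the id attribute and the sample values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second Word has no Roman child.
SAMPLE = """<WordList>
  <Word id="1"><Roman>gakusei</Roman></Word>
  <Word id="2"></Word>
</WordList>"""

root = ET.fromstring(SAMPLE)
results = []
for word in root.findall('Word'):
    # get() reads an attribute of the tag itself, here the made-up id attribute.
    word_id = word.get('id')
    # findtext() returns the child's text, or the default when the child is missing.
    roman = word.findtext('Roman', default='Not found')
    results.append((word_id, roman))

# Roman is a child element, not an attribute, so get() finds nothing.
roman_attribute = root.find('Word').get('Roman')
print(results, roman_attribute)
```

findtext() with a default folds the None check and the .text access into one call.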
I'm trying to get an XML file from the URL http://192.168.1.80/api/current and process its content one SUBSCRIBER at a time. I wrote code that fetches the XML as a string using Python's urllib2 module. I would like to convert the XML into objects and process them. How can I proceed?
import urllib2
from xml.dom import minidom
usock = urllib2.urlopen('http://192.168.1.80/api/current')
xmldoc = minidom.parse(usock)
usock.close()
data = xmldoc.toxml()
print data
XML content:
<NSE COMMAND="CURR_USERS_RSP">
<SUBSCRIBER>
<SUB_MAC_ADDR>
70:16:00:C1:12:76
</SUB_MAC_ADDR>
<SUB_IP>
192.168.1.20
</SUB_IP>
<LOCATION>
0
</LOCATION>
</SUBSCRIBER>
<SUBSCRIBER>
<SUB_MAC_ADDR>
58:E6:F6:E5:7B:78
</SUB_MAC_ADDR>
<SUB_IP>
192.168.1.21
</SUB_IP>
<LOCATION>
0
</LOCATION>
</SUBSCRIBER>
</NSE>
I finally figured out a way to solve the problem above:
import xml.etree.ElementTree as ET
import urllib2
from xml.dom import minidom
url = 'http://192.168.1.80/api/current'
try:
    usock = urllib2.urlopen(url)
    xmldoc = minidom.parse(usock)
    usock.close()
    data = xmldoc.toxml()
    root = ET.fromstring(data)
    # Each SUBSCRIBER element holds one MAC address, IP and location.
    for ele in root.findall('SUBSCRIBER'):
        print 'MAC = ' + ele.find('SUB_MAC_ADDR').text + ', IP = ' + ele.find('SUB_IP').text + ', Location = ' + ele.find('LOCATION').text
except urllib2.HTTPError as e:
    # Only HTTPError carries a status code, so catch it rather than Exception.
    print e.getcode()
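One possible way to turn each SUBSCRIBER into a plain object is a dict keyed by tag name. The sketch below assumes Python 3 and parses an inline copy of the first subscriber rather than fetching the URL; parse_subscribers is a hypothetical helper name. Because the values sit on their own lines in the XML, the text is stripped.

```python
import xml.etree.ElementTree as ET

# Inline copy of the first subscriber from the XML shown in the question.
SAMPLE = """<NSE COMMAND="CURR_USERS_RSP">
<SUBSCRIBER>
<SUB_MAC_ADDR>
70:16:00:C1:12:76
</SUB_MAC_ADDR>
<SUB_IP>
192.168.1.20
</SUB_IP>
<LOCATION>
0
</LOCATION>
</SUBSCRIBER>
</NSE>"""


def parse_subscribers(xml_text):
    """Return one dict per SUBSCRIBER, mapping child tag to stripped text."""
    root = ET.fromstring(xml_text)
    subscribers = []
    for sub in root.findall('SUBSCRIBER'):
        # Iterating an element yields its direct children.
        subscribers.append({child.tag: (child.text or '').strip() for child in sub})
    return subscribers


subscribers = parse_subscribers(SAMPLE)
for sub in subscribers:
    print(sub['SUB_MAC_ADDR'], sub['SUB_IP'], sub['LOCATION'])
```

ElementTree can parse the response directly, so the round trip through minidom and toxml() is not needed for this.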
You can use the BeautifulSoup4 library to parse the XML file. Do not forget to select an appropriate XML parser.
Below is my sample code. In the background I download statsxml.jsp with wget and then parse the XML. As you can see, the code handles a single file, but I now need to parse XML from multiple URLs. How can I accomplish this?
Example URL - http://www.trion1.com:6060/stat.xml,
http://www.trion2.com:6060/stat.xml, http://www.trion3.com:6060/stat.xml
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file='statsxml.jsp')
root = tree.getroot()
root.tag, root.attrib
print "root subelements: ", root.getchildren()
root.getchildren()[0][1]
root.getchildren()[0][4].getchildren()
for component in tree.iterfind('Component name'):
    print component.attrib['name']
You can use urllib2 to download and parse each file in the same way. For example, the first few lines would change to:
import xml.etree.cElementTree as ET
import urllib2
for i in range(1, 4):  # the hosts are numbered trion1 to trion3
    tree = ET.ElementTree(file=urllib2.urlopen('http://www.trion%i.com:6060/stat.xml' % i))
    root = tree.getroot()
    root.tag, root.attrib
    # Rest of your code goes here....
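A Python 3 sketch of the same loop that keeps going when one host fails. parse_all and fake_opener are hypothetical names, and the stats document returned by the fake opener is invented so the example runs without network access. In real use the default opener, urllib.request.urlopen, would fetch each URL.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen

# The three example hosts from the question.
URLS = ['http://www.trion%d.com:6060/stat.xml' % i for i in range(1, 4)]


def parse_all(urls, opener=urlopen):
    """Parse each URL and return {url: root element, or None on failure}."""
    roots = {}
    for url in urls:
        try:
            with opener(url) as response:
                roots[url] = ET.parse(response).getroot()
        except (OSError, ET.ParseError) as exc:
            # URLError is a subclass of OSError; one bad host should not stop the rest.
            roots[url] = None
            print('skipping %s: %s' % (url, exc))
    return roots


def fake_opener(url):
    # Stand-in for urlopen that returns a made-up document.
    return io.BytesIO(b'<stats><Component name="cpu"/></stats>')


roots = parse_all(URLS, opener=fake_opener)
for url, root in roots.items():
    print(url, root.tag)
```

Passing the opener as a parameter is only a convenience for trying the loop offline.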
xml =
<company>Mcd</company>
<Author>Dr.D</Author>
I want to fetch Mcd and Dr.D.
My attempt:
import xml.etree.ElementTree as et
e = et.parse(xml)
root = e.getroot()
for node in root.getiterator("company"):
    print node.tag
Hoping for some generous help.
Simply find the one tag that matches, then take the .text attribute:
company = root.find('.//company').text
author = root.find('.//Author').text
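A runnable Python 3 sketch of this lookup. The two elements in the question have no common parent, which a parser will reject, so the sketch wraps them in a root element; the wrapper name record is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

fragment = "<company>Mcd</company>\n<Author>Dr.D</Author>"

# A document needs exactly one root element, so wrap the fragment first.
root = ET.fromstring('<record>%s</record>' % fragment)

# findtext() is shorthand for find(...).text and returns None if nothing matches.
company = root.findtext('.//company')
author = root.findtext('.//Author')
print(company, author)
```

Note that ET.parse() expects a filename or file object, while ET.fromstring() takes the XML text itself.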
Try this.
from xml.etree import ElementTree as ET
xmlFile = ET.iterparse(open('some_file.xml','r'))
for event, elem in xmlFile:
    # iterparse yields (event, element) pairs; by default only "end" events.
    print elem.text
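A Python 3 sketch of the iterparse approach on an in-memory document. The record wrapper is an assumption, and the loop filters on tag names so the wrapper element itself is skipped.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for some_file.xml.
source = io.BytesIO(b'<record><company>Mcd</company><Author>Dr.D</Author></record>')

seen = []
# An "end" event fires once an element has been read completely, so its text is available.
for event, elem in ET.iterparse(source, events=('end',)):
    if elem.tag in ('company', 'Author'):
        seen.append((elem.tag, elem.text))
print(seen)
```

iterparse is mainly useful for large files, since elements can be handled as they are read instead of after the whole tree is built.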