I want to parse the response from this URL to get the text of the Roman elements:
http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for count in counts:
    print count.get('Roman')
But it didn't work.
Try tree.findall('.//{urn:yahoo:jp:jlp:FuriganaService}Word'). It seems you need to specify the namespace as well.
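A minimal sketch of the namespace lookup, written for Python 3. The XML below is a hand-made stand-in for the service response (the real payload may differ); only the namespace URI comes from the answer above. The prefix `f` in the `NS` mapping is an arbitrary label chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hand-made sample; assumed to resemble the service response.
SAMPLE = """<ResultSet xmlns="urn:yahoo:jp:jlp:FuriganaService">
  <Result><WordList>
    <Word><Surface>私</Surface><Roman>watasi</Roman></Word>
    <Word><Surface>は</Surface><Roman>ha</Roman></Word>
  </WordList></Result>
</ResultSet>"""

root = ET.fromstring(SAMPLE)

# Map an arbitrary prefix to the namespace URI used by the document.
NS = {"f": "urn:yahoo:jp:jlp:FuriganaService"}

# Without the namespace, nothing matches.
unqualified = root.findall(".//Word")

# With the prefix mapping, the Word elements are found.
qualified = root.findall(".//f:Word", NS)
romans = [w.findtext("f:Roman", namespaces=NS) for w in qualified]

print(len(unqualified), romans)
```

Passing a prefix mapping is equivalent to spelling out `{urn:yahoo:jp:jlp:FuriganaService}Word` in every path; it just keeps the paths shorter.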
I recently ran into a similar issue. I was using an older version of the xml.etree package, and to work around it I had to write a loop for each level of the XML structure. For example:
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for result in tree.findall('Result'):
    for wordlist in result.findall('WordList'):
        for word in wordlist.findall('Word'):
            print(word.get('Roman'))
Edit:
With the suggestion from @omu_negru I was able to get this working. There was another issue: to get the text of "Roman" you were using the get method, which returns attributes of the tag. The text attribute of an element gives you the text between its opening and closing tags. Also, if a Word has no 'Roman' child, find returns None, and you cannot read an attribute from None, so check for that first.
# encoding: utf-8
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
ns = '{urn:yahoo:jp:jlp:FuriganaService}'
counts = tree.findall('.//%sWord' % ns)
for count in counts:
    roman = count.find('%sRoman' % ns)
    if roman is None:
        print 'Not found'
    else:
        print roman.text
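The difference between attributes and element text can be sketched in isolation (Python 3). The sample document and its `id` attribute are made up for illustration and carry no namespace, so the paths stay short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one Word with a Roman child, one without.
words = ET.fromstring(
    '<WordList>'
    '<Word id="1"><Roman>gakusei</Roman></Word>'
    '<Word id="2"/>'
    '</WordList>'
)
first, second = words.findall("Word")

# get() reads attributes of the tag itself.
attr_value = first.get("id")
no_such_attr = first.get("Roman")      # Roman is a child element, not an attribute

# .text reads the text between the opening and closing tags.
child_text = first.find("Roman").text

# find() returns None when the child is missing ...
missing_child = second.find("Roman")

# ... so findtext() with a default avoids the explicit None check.
safe_text = second.findtext("Roman", default="Not found")

print(attr_value, no_such_attr, child_text, missing_child, safe_text)
```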
I am learning how to parse documents using lxml. As an exercise, I'm trying to parse my LinkedIn page. It has plenty of information, so I thought it would be good practice.
Enough context. Here is what I'm doing:
going to the URL: https://www.linkedin.com/in/NAME/
opening the page source and saving it as "linkedin.html"
since I'm trying to extract my current job, I'm doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is
But my title variable is always an empty list.
I've been trying all day but still don't understand what I'm doing wrong.
I found the answer to my problem: adding an encoding parameter to the open() call.
Here is what I did:
def parse_html_file(filename):
    # Read the saved page as UTF-8 text before handing it to the parser
    with open(filename, encoding="utf8") as fh:
        f = fh.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(f), parser)
    return tree

tree = parse_html_file('linkedin.html')
# Match the <li> by its exact class attribute value
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())
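Why the encoding argument matters can be shown with the standard library alone. This is only a sketch: the file name, the sample markup, and the use of latin-1 as the "wrong" codec are assumptions for illustration. When no encoding is given, open() falls back to a platform-dependent default, which may not match how the page was saved.

```python
import os
import tempfile

# Hypothetical page content containing a non-ASCII character.
page = "<html><body><h2>Ingénieur</h2></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.html")

    # Save the page as UTF-8 bytes, as a browser's "save source" typically would.
    with open(path, "wb") as fh:
        fh.write(page.encode("utf-8"))

    # Reading with the matching encoding round-trips the text.
    with open(path, encoding="utf-8") as fh:
        decoded_ok = fh.read()

    # Reading with a mismatched encoding silently garbles it.
    with open(path, encoding="latin-1") as fh:
        decoded_wrong = fh.read()

print(decoded_ok == page, "Ingénieur" in decoded_wrong)
```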
Using the 'bottle' library, I have to create my own API based on the website http://dblp.uni-trier.de, so I need to fetch the data for each author. To do that, I am using the following link format: http://dblp.uni-trier.de/pers/xx/'first letter of the last name'/'lastnamefirstname'.xml
Could you help me get the XML so that I can parse it and extract the information I need?
Thank you.
import bottle
import requests
import re
r = requests.get("https://dblp.uni-trier.de/")
#the format of my request is
#http://localhost:8080/lastname firstname
@bottle.route('/info/<name>')
def info(name):
    first_letter = name[:1]
    # convert to Lastname:Firstname format
    ...
    data = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")
    return data
bottle.run(host='localhost', port=8080)
from xml.etree import ElementTree
import requests
url = 'some url'
response = requests.get(url)
xml_root = ElementTree.fromstring(response.content)
fromstring: Parses an XML section from a string constant. This function can be used to embed "XML literals" in Python code. text is a string containing XML data. parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an Element instance.
HOW TO Load XML from a string into an ElementTree
from xml.etree import ElementTree
root = ElementTree.fromstring("<root><a>1</a></root>")
ElementTree.dump(root)
OUTPUT
<root><a>1</a></root>
The object returned from requests.get is not the raw data. You need to use its text property to get the contents.
Response Content Documentation
Note that:
response.text returns content as unicode
response.content returns content as bytes
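A short sketch of what handing bytes to the parser looks like (Python 3). The `content` value is a hand-made stand-in for `response.content`, and the `author` element is hypothetical. Because the parser receives raw bytes, it can honour the encoding named in the XML declaration.

```python
from xml.etree import ElementTree

# Stand-in for response.content: bytes in the encoding the declaration names.
content = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    '<author>Müller</author>'
).encode("iso-8859-1")

# fromstring accepts bytes and decodes them according to the declaration.
root = ElementTree.fromstring(content)
author = root.text

print(root.tag, author)
```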
Below is my sample code. In the background I download statsxml.jsp with wget and then parse the XML. Now I need to parse multiple XML URLs, but as you can see, the code below uses a single file. How can I accomplish this?
Example URLs: http://www.trion1.com:6060/stat.xml, http://www.trion2.com:6060/stat.xml, http://www.trion3.com:6060/stat.xml
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file='statsxml.jsp')
root = tree.getroot()
root.tag, root.attrib
print "root subelements: ", root.getchildren()
root.getchildren()[0][1]
root.getchildren()[0][4].getchildren()
for component in tree.iterfind('Component name'):
    print component.attrib['name']
You can use urllib2 to download and parse each file in the same way. For example, the first few lines would change to:
import xml.etree.cElementTree as ET
import urllib2
for i in range(1, 4):  # hosts are trion1 .. trion3
    tree = ET.ElementTree(file=urllib2.urlopen('http://www.trion%i.com:6060/stat.xml' % i))
    root = tree.getroot()
    root.tag, root.attrib
    # Rest of your code goes here....
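The same idea in Python 3, where urllib2 became urllib.request, might look like the sketch below. The helper names (`component_names`, `collect`, `fake_opener`) and the sample XML are hypothetical, and the `Component` element with a `name` attribute is only a guess based on the question's code. The opener is passed in as a parameter so the loop can be tried offline.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen

URLS = [
    "http://www.trion1.com:6060/stat.xml",
    "http://www.trion2.com:6060/stat.xml",
    "http://www.trion3.com:6060/stat.xml",
]


def component_names(source):
    """Parse one XML document from a file-like object and list Component names."""
    root = ET.parse(source).getroot()
    return [c.get("name") for c in root.iter("Component")]


def collect(urls, opener=urlopen):
    """Fetch and parse every URL, returning {url: [component names]}."""
    results = {}
    for url in urls:
        with opener(url) as response:
            results[url] = component_names(response)
    return results


def fake_opener(url):
    """Offline stand-in for urlopen that returns made-up XML."""
    return io.BytesIO(b'<Stats><Component name="cpu"/><Component name="disk"/></Stats>')


demo = collect(URLS, opener=fake_opener)
print(demo[URLS[0]])
```

Calling `collect(URLS)` without the `opener` argument would use the real `urlopen` and hit the network.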
xml =
<company>Mcd</company>
<Author>Dr.D</Author>
I want to fetch Mcd and Dr.D.
My attempt:
import xml.etree.ElementTree as et
e = et.parse(xml)
root = e.getroot()
for node in root.getiterator("company"):
    print node.tag
Hoping for some generous help.
Simply find the one tag that matches, then take the .text attribute:
company = root.find('.//company').text
author = root.find('.//Author').text
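As a runnable sketch (Python 3): the question's snippet has no root element, so the `record` wrapper here is an assumption added to make the XML well formed. `findtext` is a shortcut for `find(...).text` that also accepts a default for missing elements; the `publisher` tag is a made-up example of one.

```python
import xml.etree.ElementTree as ET

# The <record> wrapper is assumed; XML needs a single root element.
xml = "<record><company>Mcd</company><Author>Dr.D</Author></record>"
root = ET.fromstring(xml)

company = root.find(".//company").text
author = root.find(".//Author").text

# findtext returns the default instead of failing on a missing element.
publisher = root.findtext("publisher", default="n/a")

print(company, author, publisher)
```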
Try this.
from xml.etree import ElementTree as ET
xmlFile = ET.iterparse(open('some_file.xml','r'))
# iterparse yields (event, element) pairs; by default only "end" events
for event, elem in xmlFile:
    print elem.text
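A self-contained Python 3 sketch of the iterparse approach, reading from an in-memory buffer instead of `some_file.xml`. The `record` root element is an assumption added so the sample is well formed.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for the file; the <record> root is assumed.
source = io.BytesIO(b"<record><company>Mcd</company><Author>Dr.D</Author></record>")

seen = []
for event, elem in ET.iterparse(source, events=("start", "end")):
    # An element's text is only guaranteed to be complete at its "end" event.
    if event == "end" and elem.text and elem.text.strip():
        seen.append((elem.tag, elem.text))

print(seen)
```

iterparse is mainly useful for large files, since elements can be processed as they arrive; for a document this small, `find` is simpler.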
I'm trying to parse data from a website and cannot print the data.
import xml.etree.ElementTree as ET
from urllib import urlopen
link = urlopen('http://weather.aero/dataserver_current/httpparam?dataSource=metars&requestType=retrieve&format=xml&stationString=KSFO&hoursBeforeNow=1')
tree = ET.parse(link)
root = tree.getroot()
data = root.findall('data/metar')
for metar in data:
    print metar.find('temp_c').text
Tag names are case-sensitive:
data = root.findall('data/METAR')
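The effect is easy to reproduce offline (Python 3). The sample below is a hand-made fragment that only assumes the `data/METAR/temp_c` nesting used in the question; the temperature value is invented.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment; the real feed contains many more fields.
SAMPLE = "<response><data><METAR><temp_c>12.8</temp_c></METAR></data></response>"
root = ET.fromstring(SAMPLE)

# Lower-case path: no match, so the loop body would never run.
lower = root.findall("data/metar")

# Path with the tag's actual case: one match.
upper = root.findall("data/METAR")
temps = [m.findtext("temp_c") for m in upper]

print(len(lower), temps)
```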