Best way to read/parse XML url in python3 - python

I read a lot of different answers similar questions, but no one seems providing a simple solution.
Supposing to have a remote url like this https://www.emidius.eu/fdsnws/event/1/query?eventid=quakeml:eu.ahead/event/13270512_0000_000&format=xml the final aim is to get an usable python object (e.g. a dictionary or a json like object).
I did find different methods if the xml is save as a local file:
import xml.etree.ElementTree as ET
file = '/home/user/query.xml'
tree = ET.parse(file)
root = tree.getroot()
for c in root:
print(c.tag)
for i in c:
print(i.tag)
I did not find a method (with native python modules) to bump a url string and get an object.

OK I think the best solution is this one:
import xml.etree.ElementTree as ET
import urllib.request
opener = urllib.request.build_opener()
url = 'https://www.emidius.eu/fdsnws/event/1/query?eventid=quakeml:eu.ahead/event/13270512_0000_000&includeallorigins=true&includeallmagnitudes=true&format=xml'
tree = ET.parse(opener.open(url))

This works, but you don't need build_opener() for that.
You can build a custom opener for some specific case or protocol, but you use normal https. So you can just use
import urllib.request
import xml.etree.ElementTree as ET
url = 'https://www.emidius.eu/fdsnws/event/1/query?eventid=quakeml:eu.ahead/event/13270512_0000_000&includeallorigins=true&includeallmagnitudes=true&format=xml'
with urllib.request.urlopen(url) as response:
html = ET.fromstring(response.read().decode())

Related

Get xpath from html file using LXML - Python

I am learning how to parse documents using lxml. To do so, I'm trying to parse my linkedin page. It has plenty of information and I thought it would be a good training.
Enough with the context. Here what I'm doing:
going to the url: https://www.linkedin.com/in/NAME/
opening and saving the source code to as "linkedin.html"
as I'm trying to extract my current job, I'm doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is
But it always return an empty list for my variable title.
I've been trying all day but still don't understand what I'm doing wrong.
I've find the answer to my problem by adding an encoding parameter within the open() function.
Here what I've done:
def parse_html_file(filename):
f = open(filename, encoding="utf8").read()
parser = etree.HTMLParser()
tree = etree.parse(StringIO(f), parser)
return tree
tree = parse_html_file('linkedin.html')
name = tree.xpath('//li[#class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())

how to get an XML file from a website using python?

using 'bottle' library, I have to create my own API based on this website http://dblp.uni-trier.de so I have to get data for each author. For this reason I am using the following link format http://dblp.uni-trier.de/pers/xx/'first letter of the last name'/'lastnamefirstname'.xml
Could you help me get the XML format to be able to parse it and get the information I need.
thank you
import bottle
import requests
import re
r = requests.get("https://dblp.uni-trier.de/")
#the format of my request is
#http://localhost:8080/lastname firstname
#bottle.route('/info/<name>')
def info(name):
first_letter = name[:1]
#mettre au format Lastname:Firstname
...
data = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")
return data
bottle.run(host='localhost', port=8080)
from xml.etree import ElementTree
import requests
url = 'some url'
response = requests.get(url)
xml_root = ElementTree.fromstring(response.content)
fromstring Parses an XML section from a string constant. This function can be used to embed “XML literals” in Python code. text is a
string containing XML data. parser is an optional parser instance. If
not given, the standard XMLParser parser is used. Returns an Element
instance.
HOW TO Load XML from a string into an ElementTree
from xml.etree import ElementTree
root = ElementTree.fromstring("<root><a>1</a></root>")
ElementTree.dump(root)
OUTPUT
<root><a>1</a></root>
The object returned from requests.get is not the raw data. You need to use text property to get the contents
Response Content Documentation
Note that:
response.text returns content as unicode
response.content returns content as bytes

Parsing compressed xml feed into ElementTree

I'm trying to parse the following feed into ElementTree in python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning large file)
Here is what I have tried so far:
feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")
# feed is compressed
compressed_data = feed.read()
import StringIO
compressedstream = StringIO.StringIO(compressed_data)
import gzip
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()
# Parse XML
tree = ET.parse(data)
but it seems to just hang on compressed_data = feed.read(), infinitely maybe?? (I know it's a big file, but seems too long compared to other non-compressed feeds I parsed, and this large is killing any bandwidth gains from the gzip compression in the first place).
Next I tried requests, with
url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)
but now
tree=ET.parse(r.content)
or
tree=ET.parse(r.text)
but these raise exceptions.
What's the proper way to do this?
You can pass the value returned by urlopen() directly to GzipFile() and in turn you can pass it to ElementTree methods such as iterparse():
#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request
with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
headers={"Accept-Encoding": "gzip"})) as response, \
GzipFile(fileobj=response) as xml_file:
for elem in getelements(xml_file, 'interesting_tag'):
process(elem)
where getelements() allows to parse files that do not fit in memory.
def getelements(filename_or_file, tag):
"""Yield *tag* elements from *filename_or_file* xml incrementaly."""
context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
if event == 'end' and elem.tag == tag:
yield elem
root.clear() # free memory
To preserve memory, the constructed xml tree is cleared on each tag element.
The ET.parse function takes "a filename or file object containing XML data". You're giving it a string full of XML. It's going to try to open a file whose name is that big chunk of XML. There is probably no such file.
You want the fromstring function, or the XML constructor.
Or, if you prefer, you've already got a file object, gzipper; you could just pass that to parse instead of reading it into a string.
This is all covered by the short Tutorial in the docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)

Parsing a URL XML with the ElementTree XML API

Below is my sample code where in the background I am downloading statsxml.jsp with wget and then parsing the xml. My question is now I need to parse multiple XML URL and as you can see in the below code I am using one single file. How to accomplish this?
Example URL - http://www.trion1.com:6060/stat.xml,
http://www.trion2.com:6060/stat.xml, http://www.trion3.com:6060/stat.xml
import xml.etree.cElementTree as ET
tree = ET.ElementTree(file='statsxml.jsp')
root = tree.getroot()
root.tag, root.attrib
print "root subelements: ", root.getchildren()
root.getchildren()[0][1]
root.getchildren()[0][4].getchildren()
for component in tree.iterfind('Component name'):
print component.attrib['name']
You can use urllib2 to download and parse the file in the same way. For e.g. the first few lines will be changed to:
import xml.etree.cElementTree as ET
import urllib2
for i in range(3):
tree = ET.ElementTree(file=urllib2.urlopen('http://www.trion%i.com:6060/stat.xml' % i ))
root = tree.getroot()
root.tag, root.attrib
# Rest of your code goes here....

Trying to collect data from local files using BeautifulSoup

I want to run a python script to parse html files and collect a list of all the links with a target="_blank" attribute.
I've tried the following but it's not getting anything from bs4. SoupStrainer says in the docs it'll take args in the same way as findAll etc, should this work? Am I missing some stupid error?
import os
import sys
from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path
def main():
ROOT = Path(os.path.realpath(__file__)).ancestor(3)
src = ROOT.child("src")
templatedir = src.child("templates")
for (dirpath, dirs, files) in os.walk(templatedir):
for path in (Path(dirpath, f) for f in files):
if path.endswith(".html"):
for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
print link
if __name__ == "__main__":
sys.exit(main())
I think you need something like this
if path.endswith(".html"):
htmlfile = open(dirpath)
for link in BeautifulSoup(htmlfile,parse_only=SoupStrainer(target="_blank")):
print link
The usage BeautifulSoup is OK but you should pass in the html string, not just the path of the html file. BeautifulSoup accepts the html string as argument, not the file path. It will not open it and then read the content automatically. You should do it yourself. If you pass in a.html, the soup will be <html><body><p>a.html</p></body></html>. This is not the content of the file. Surely there is no links. You should use BeautifulSoup(open(path).read(), ...).
edit:
It also accepts the file descriptor. BeautifulSoup(open(path), ...) is enough.

Categories

Resources