I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) e.g. instead of //table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[#width="2700"]/foo:p[#id="4"][not(*)]/text().
I'm using an open source project call OpenTripPlanner which is a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all information about an itineraries is located. The XML is built upon request so the URL isn't static. The URL looks something like this :
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using python 3, but I can't find a way to read the files. I've tried to use urllib.request to download the file locally, but the file that I get from this is oddly formed. It looks something like this
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple, it looks like this
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the outputted XML files well-formed? Is there an other way besides urllib.request that I may want to try?
Thanks a lot
To import this file as JSON data (not XML) you need the JSON library
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")
data = json.load(open('file.json'))
pprint(data)
json.load reads the JSON data and convert into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "Pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)
when i try to edit my XML file using xml.etree.celement library and I then try to write the change to a new XML file, there a changes to my XML file.
I need to make changes to the XML file and then write an exact replica just with the changes i made to the text. I have tried to use standards such as c14n and use the correct encodeing and xml declaration but still no luck.
encoding='utf-8', xml_declaration=True
The file still has minor changes such as:
in original DOC:
<remote-name>
glossary_name </remote-name>
and in new doc I have:
<remote-name>
glossary_name </remote-name>
How can I make them EXACTLY the same with just the minor change
I use this little script:
from xml.dom import minidom
import os
import xml.etree.cElementTree as et
import tkFileDialog
from lxml import etree as tt
## Grab the Particular file to convert xml
xmldoc = tt.parse(newfile)
root = xmldoc.getroot()
##loop through to convert the images to the correct Path
##finds the intro dashboard and converts the image
for dashboard in root.iter('dashboard'):
if dashboard.get('name') == "Intro":
##Loops through the zones within a specified dashboard
for zone in list(dashboard.iter('zone')):
if zone.get('id') == ('88'):
zone.set('param', 'customerFiles/Image/Intro - Top.png')
else:
continue
xmldoc.write_c14n('output.xml')
#os.rename('output.xml',"output.twb")
With the lxml.etree python framework, is it more efficient to parse xml directly from a link to an online xml file or is it better to say, use a different framework (such as urllib2), to return a string and then parse from that? Or does it make no difference at all?
Method 1 - Parse directly from link
from lxml import etree as ET
parsed = ET.parse(url_link)
Method 2 - Parse from string
from lxml import etree as ET
import urllib2
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.parse.fromstring(xml_string)
# note: I do not have access to python
# at the moment, so not sure whether
# the .fromstring() function is correct
Or is there a more efficient method than either of these, e.g. save the xml to a .xml file on desktop then parse from those?
I ran the two methods with a simple timing rapper.
Method 1 - Parse XML Directly From Link
from lxml import etree as ET
#timing
def parseXMLFromLink():
parsed = ET.parse(url_link)
print parsed.getroot()
for n in range(0,100):
parseXMLFromLink()
Average of 100 = 98.4035 ms
Method 2 - Parse XML From String Returned By Urllib2
from lxml import etree as ET
import urllib2
#timing
def parseXMLFromString():
xml_string = urllib2.urlopen(url_link).read()
parsed = ET.fromstring(xml_string)
print parsed
for n in range(0,100):
parseXMLFromString()
Average of 100 = 286.9630 ms
So anecdotally it seems that using lxml to parse directly from the link is the more immediately quick method. It's not clear whether it would be faster to download then parse large xml documents from the hard drive, but presumably unless the document is huge and the parsing task more intensive, the parseXMLFromLink() function would still remain quicker as it is urllib2 that seems to slow the second function down.
I ran this a few times and the results stayed the same.
If by 'effective' you mean 'efficient', I'm relatively certain you will see no difference between the two at all (unless ET.parse(link) is horribly implemented).
The reason is that the network time is going to be the most significant part of parsing an online XML file, a lot longer than storing the file to disk or keeping it in memory, and a lot longer than actually parsing it.
I have a very large XML file (20GB to be exact, and yes, I need all of it). When I attempt to load the file, I receive this error:
Python(23358) malloc: *** mmap(size=140736680968192) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
File "file.py", line 5, in <module>
code = xml.read()
MemoryError
This is the current code I have, to read the XML file:
from bs4 import BeautifulSoup
xml = open('pages_full.xml', 'r')
code = xml.read()
xml.close()
soup = BeautifulSoup(code)
Now, how would I go about to eliminating this error and be able to continue working on the script. I would try splitting the file into separate files, but as I don't know how that would affect BeautifulSoup as well as the XML data, I'd rather not do this.
(The XML data is a database dump from a wiki I volunteer on, using it to import data from different time-periods, using the direct information from many pages)
Do not use BeautifulSoup to try and such a large parse XML file. Use the ElementTree API instead. Specifically, use the iterparse() function to parse your file as a stream, handle information as you are notified of elements, then delete the elements again:
from xml.etree import ElementTree as ET
parser = ET.iterparse(filename)
for event, element in parser:
# element is a whole element
if element.tag == 'yourelement'
# do something with this element
# then clean up
element.clear()
By using a event-driven approach, you never need to hold the whole XML document in memory, you only extract what you need and discard the rest.
See the iterparse() tutorial and documentation.
Alternatively, you can also use the lxml library; it offers the same API in a faster and more featurefull package.