I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) e.g. instead of //table[1]/tr/td[#width="2700"]/p[#id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[#width="2700"]/foo:p[#id="4"][not(*)]/text().
I'm using an open source project call OpenTripPlanner which is a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all information about an itineraries is located. The XML is built upon request so the URL isn't static. The URL looks something like this :
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using python 3, but I can't find a way to read the files. I've tried to use urllib.request to download the file locally, but the file that I get from this is oddly formed. It looks something like this
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple, it looks like this
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the outputted XML files well-formed? Is there an other way besides urllib.request that I may want to try?
Thanks a lot
To import this file as JSON data (not XML) you need the JSON library
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")
data = json.load(open('file.json'))
pprint(data)
json.load reads the JSON data and convert into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "Pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)
Basically what I am doing is using urllib.request to make an API call to pubmed, receive an XML file in return, and am trying to parse it with no luck.
I have tried using Element Tree and other modules with no luck. I believe there may be an issue with XML object itself.
#Imorting URL Request Modules for API Calls
#Also importing ElemenTree as it seems to be best for XML parsing
import urllib.request
import urllib.parse
import re
import xml.etree.ElementTree as ET
from urllib import request
#Now I can make the API call.
id_request = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568')
#id_request will be an object that I'm not sure I understand?
#id_request Returns: "<http.client.HTTPResponse object at 0x0000000003693FD0>"
#Let's now read this baby in XML format!
id_pubmed = id_request.read()
#If I look at the id_pubmed object, I not have the XML file I want to parse.
You can see what the XML file id_pubmed is calling/prints here: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568
My issue is I can't get Element Tree to parse this at all. I have tried:
tree = ET.parse(id_pubmed)
root = tree.getroot()
as well as various other suggestions from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
ET.parse() method requires either the location of the xml file (on local file system) or a file like object , but your id_pubmed seems to be a string .
In that case , you should use ET.fromstring() . Example -
root = ET.fromstring(id_pubmed)
when i try to edit my XML file using xml.etree.celement library and I then try to write the change to a new XML file, there a changes to my XML file.
I need to make changes to the XML file and then write an exact replica just with the changes i made to the text. I have tried to use standards such as c14n and use the correct encodeing and xml declaration but still no luck.
encoding='utf-8', xml_declaration=True
The file still has minor changes such as:
in original DOC:
<remote-name>
glossary_name </remote-name>
and in new doc I have:
<remote-name>
glossary_name </remote-name>
How can I make them EXACTLY the same with just the minor change
I use this little script:
from xml.dom import minidom
import os
import xml.etree.cElementTree as et
import tkFileDialog
from lxml import etree as tt
## Grab the Particular file to convert xml
xmldoc = tt.parse(newfile)
root = xmldoc.getroot()
##loop through to convert the images to the correct Path
##finds the intro dashboard and converts the image
for dashboard in root.iter('dashboard'):
if dashboard.get('name') == "Intro":
##Loops through the zones within a specified dashboard
for zone in list(dashboard.iter('zone')):
if zone.get('id') == ('88'):
zone.set('param', 'customerFiles/Image/Intro - Top.png')
else:
continue
xmldoc.write_c14n('output.xml')
#os.rename('output.xml',"output.twb")
I am using pypdf to extract text from pdf files . The problem is that the tables in the pdf files are not extracted. I have also tried using the pdfminer but i am having the same issue .
The problem is that tables in PDFs are generally made up of absolutely positioned lines and characters, and it is non-trivial to convert this into a sensible table representation.
In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpreting yourself by looking at the positions of lines (LTLine) and text boxes (LTTextBox). There's a little bit of documentation here.
Alternatively, PDFX attempts this (and often succeeds), but you have to use it as a web service (not ideal, but fine for the occasional job). To do this from Python, you could do something like the following:
import urllib2
import xml.etree.ElementTree as ET
# Make request to PDFX
pdfdata = open('example.pdf', 'rb').read()
request = urllib2.Request('http://pdfx.cs.man.ac.uk', pdfdata, headers={'Content-Type' : 'application/pdf'})
response = urllib2.urlopen(request).read()
# Parse the response
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[#class="DoCO:TableBox"]'):
src = ET.tostring(tbox.find('content/table'))
info = ET.tostring(tbox.find('region[#class="TableInfo"]'))
caption = ET.tostring(tbox.find('caption'))