Read XML file from URL in Python

I'm using an open source project called OpenTripPlanner, a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all the information about an itinerary is located. The XML is built upon request, so the URL isn't static. The URL looks something like this:
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using Python 3, but I can't find a way to read the files. I've tried using urllib.request to download the file locally, but the file that I get this way is oddly formed. It looks something like this:
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple; it looks like this:
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the outputted XML files well-formed? Is there another way besides urllib.request that I could try?
Thanks a lot

The file you are getting back is JSON data, not XML. To load it you need the json library:
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")
data = json.load(open('file.json'))
pprint(data)
json.load reads the JSON data and converts it into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "Pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)
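If you don't need the intermediate file on disk, a minimal sketch (assuming the same local OTP server is running) can parse the response directly:
import json
import urllib.request
from pprint import pprint

url = ('http://localhost:8080/otp/routers/default/plan'
       '?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996'
       '&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK')
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode('utf-8'))

# The itineraries sit under plan -> itineraries, as in the sample output above
for itinerary in data['plan']['itineraries']:
    pprint({'duration': itinerary['duration'], 'transfers': itinerary['transfers']})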

Related

Opening a draw.io file using python

I am trying to read the data from a draw.io drawing using python.
Apparently the format is XML with some portions in "mxfile" encoding.
(That is, a section of the xml is deflated, then base64 encoded.)
Here's the official TFM:
https://drawio-app.com/extracting-the-xml-from-mxfiles/
And their online decoder tool:
https://jgraph.github.io/drawio-tools/tools/convert.html
So I try to decode the mxfile portion using the standard Python tools:
import zlib
import base64
s="7VvbcuI4FPwaHpOybG55BHKZmc1kmSGb7KvAArTIFiuLEObr58jINxTATvA4IVSlKtaxLFvq1lGrbWpOz3u+EXg+/c5dwmq25T7XnMuabSOr3YR/KrJaR9pIByaCurpSEhjQXyS6UkcX1CVBpqLknEk6zwZH3PfJSGZiWAi+zFYbc5a96xxPiBEYjDAzo4/UlVPdC7uVxL8QOplGd0bNi/UZD0eVdU+CKXb5MhVyrmpOT3Au10fec48wNXjRuDx+XT2y21nz5tuP4H/8T/ev+7uHs3Vj10UuibsgiC9f3fSv2fj6y0P9v3/n/esfS+umM/x2pi+xnjBb6PHqExFwX/dYrqJhDJbUY9iHUnfMfTnQZ2AQupjRiQ/HI3g6IiDwRISkgEBHn5B8DtHRlDL3Fq/4QvUhkHg0i0rdKRf0FzSLGZxCEIDTQmoy2c1MjYG6EsIWRAUJoE4/GhgUh25xIHWdEWcMzwM6DB9YVfGwmFC/y6XkXtQQX/gucXUpRjosSMFnMXfU9Tnh0LCp0SDPKTJqeG4I94gUK6iiz8ZM01MNReVlQlzU1LFpmrROW08YPVkmcdvx7X7C5ML+BAYhuZ+zcb96zvvZzeztMAPgfSxJVw1jkKYhHKS6moRCchYgKjKIeoc9YtAURlqmKMnIWG4lZDDHI+pPbsM6l/Uk8lP3VIU4XDtmIRmm1HWJH5JFYonXfFIMmXPqy3AoGl34gwHrWeeNWgMeqAdllJThT1UXssd94BWmIYEIkHVJFGFfoNbOabufWqssYkWRTRMpA2lR/Gwz0Uy5r8h4t/CGkDaODckdGWUqPaYPy8K7YVeMt2PgfeVhqi7ruC7k6OAE+EEBb7UrBrxuAG4gzGioH/RooBfX1j3wewCkai7C+17R4fIMGZxwTE44L+DP8JCwPg+opFy1L9Z1N3hRVdZGVj0fqjuW/zeB2jCz9kKMpjhQiRtk1wyGNzw6wvlcGqio6tzcNFAdyIWruplT9Vsn1X841Y82VL/TLFf1ow3V77Tfr+pvbWfqserGnGmnmZtm72UH0Daw7MDTK/fGtr7DUnJ0SB5UEBbGu/IdwMVJEB4c1Lwqvyw9iEy/8CskfusK0AiXWtu656rsC65aO7IZndZA9bIwbledqJHptd0QteIOiEd9LBTg93hGTJP4o+NbFqTVS/7oAXZlY+K7HfXCBUpDxpXa7kJIy3FkrYvXlEUr1x69nF3+iDsh0dQhbMiXV0mgGwbgRMSUwmo74LAtJfshg/3FhOTYzamn3QnsS0AKwrCkT9n3Tju0eV8RN9HltpXV5bblZJtYd1JflX7RU7Sh9SgYDR3Mqje9v77gYxIE3JTrpx1m+TtMZ3PHl3eH2bL2kviFDaZTz7HBbL2PDSYybcsBZlhn3E+4tsWT9+NsLJHpUhroffadRnFY8+4fS9tqmC7lp1IsEWLvWrKgjUzfeqVkcTYaslsbz1K2ZDGNxm2vKU+CpXzB0rDaGTrk/hDGRjsWme2KpdH4QB/CmD7qQApCzJc3n0WxtHLT690oFtMb7VF5fJrzoA54cZwrt8Bt0y6FpC2P77O1ioGu/OMX27RMQdmrVdy2etw9AX5gwHN/GFMe4qah2oMxkUfoHFSNtfNKMXY4rE1D0wD50xsMxXFt5JRhZTkMtun9PQBE7jEu0OWh2Kw8E5v2398LOV8oe6Gj3lXeqnlwQjQ3oheV59ti1h+fh2NdzNyLfUFUvdWnx3av0xdhudfq0zgrKqVtjbp+oDe6fvH7nJgwdraJvK5fo76noS2un9HQ2eYbp412+HgckFKMQ9s0Dq3z8wj4hK6hGZdKBHvSzlBbcus1vItHs0nI3x5nXMB5nycGpHa77fw5IZpf+ieX+rFq8c/P8ht1Z29kVETMPwaXaZ7lxyrSTx8VrMPM/uib3D8OnemZMeiFWuDxVu8zJcc3UTVVcB4HP9bou7Eu5KK/kRgGAbZxJf86cXEYpjhZFz9K0m/hChSTH1yvqyc/W3eufgM="
result = zlib.decompress(base64.b64decode(s))
Throws the exception:
zlib.error: Error -3 while decompressing data: incorrect header check
Meanwhile their tool above returns xml just fine when given the exact same data.
What am I missing?
Try this:
import zlib
import base64
import xml.etree.ElementTree as ET
from urllib.parse import unquote
tree = ET.parse(filename)  # filename is the path to your .drawio / .xml file
data = base64.b64decode(tree.find('diagram').text)
xml = zlib.decompress(data, wbits=-15).decode('utf-8')
xml = unquote(xml)
If you read the source of their HTML tool, you will see this:
data = String.fromCharCode.apply(null, new Uint8Array(pako.deflateRaw(data)));
They are using a JS library called pako in 'raw' mode. From the GitHub source you can get the required setting: a raw deflate stream has no zlib header, which is exactly what wbits=-15 tells zlib.decompress to expect.
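Applied directly to the s string from the question, the same recipe looks like this (a sketch; the decode and unquote steps assume drawio's usual URL-encoded UTF-8 payload):
import zlib
import base64
from urllib.parse import unquote

# s is the base64 payload pasted in the question above
decoded = base64.b64decode(s)
inflated = zlib.decompress(decoded, wbits=-15)  # raw deflate, matching pako.deflateRaw
xml = unquote(inflated.decode('utf-8'))         # drawio URL-encodes the inflated XML
print(xml[:200])                                # peek at the start of the recovered XML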

python: getting npm package data from a couchdb endpoint

I want to fetch the npm package metadata. I found this endpoint which gives me all the metadata needed.
My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:
import requests
import json
import sys
db = 'https://replicate.npmjs.com'
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))
Notice that I don't even include the docs themselves, but it still gets stuck in an infinite loop. I think this is because the data is huge.
A look at the head of the output from - https://replicate.npmjs.com/_all_docs
gives me following output:
{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}},
Notice that all the documents start from the second line (i.e. all the documents are part of the "rows" key's value). Now, my question is how to get only the values of the "rows" key (i.e. all the documents). I found this repository for a similar purpose, but I can't use/convert it as I am a total beginner in JavaScript.
If there is no stream=True among the arguments of get(), then the whole response will be downloaded into memory before the loop over the lines even starts.
Then there is the problem that the individual lines are not valid JSON by themselves. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:
#!/usr/bin/env python3
from urllib.request import urlopen
import ijson

def main():
    with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
        for row in ijson.items(json_file, 'rows.item'):
            print(row)

if __name__ == '__main__':
    main()
Is there a reason why you aren't decoding the JSON before iterating over the lines?
Can you try this:
import requests
import json
r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs": "true"})
data = json.loads(r.text)  # a Response has no .decode(); use r.text or r.content.decode('utf-8')
for row in data["rows"]:   # dicts need subscript access, not attribute access
    print(row["key"])
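For completeness, requests can also feed ijson without loading the whole body into memory: with stream=True, the underlying urllib3 response in r.raw is a file-like object. A sketch of that variant against the same endpoint:
import requests
import ijson

# stream=True defers the download; r.raw is a file-like object that ijson can read
r = requests.get('https://replicate.npmjs.com/_all_docs', stream=True)
for row in ijson.items(r.raw, 'rows.item'):
    print(row['id'], row['value']['rev'])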

How to save a webpage's text content as a text file using Python

I wrote this Python script:
from string import punctuation
from collections import Counter
import urllib
from stripogram import html2text
myurl = urllib.urlopen("https://www.google.co.in/?gfe_rd=cr&ei=v-PPV5aYHs6L8Qfwwrlg#q=samsung%20j7")
html_string = myurl.read()
text = html2text( html_string )
file = open("/home/nextremer/Final_CF/contentBased/contentCount/hi.txt", "w")
file.write(text)
file.close()
Using this script I didn't get the output I wanted, only some HTML code.
I want to save all the webpage's text content in a text file.
I also tried urllib2 and bs4, but didn't get results.
I don't want the output as an HTML structure; I want all the text data from the webpage.
What do you mean by "webpage text"?
It seems you don't want the full HTML file. If you just want the text you see in your browser, that is not so easily solvable, as parsing an HTML document can be very complex, especially with JavaScript-rich pages.
It starts with assessing whether a string between "<" and ">" is a regular tag, and goes as far as analyzing the CSS properties changed by JavaScript behavior.
That is why people write very big and complex rendering engines for web browsers.
You don't need to write any hard algorithms to extract data from a search result. Google has an API for this.
Here is an example: https://github.com/google/google-api-python-client/blob/master/samples/customsearch/main.py
But to use it, you must first register with Google for an API key.
You can find all the information here: https://developers.google.com/api-client-library/python/start/get_started
import urllib
# Python 2; in Python 3 this is urllib.request.urlretrieve
urllib.urlretrieve("http://www.example.com/test.html", "test.txt")
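Note that urlretrieve saves the raw HTML, not the visible text. Since the question mentions bs4, here is a minimal Python 3 sketch of that route (URL and output path are placeholders):
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.example.com"  # placeholder for the page you want
html = urllib.request.urlopen(url).read()

# get_text() drops the tags; separator/strip keep the result readable
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text(separator="\n", strip=True)
with open("page.txt", "w", encoding="utf-8") as f:
    f.write(text)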

Parsing XML Object Python 3.4

Basically what I am doing is using urllib.request to make an API call to PubMed, receiving an XML file in return, and trying to parse it, with no luck.
I have tried using ElementTree and other modules, with no luck. I believe there may be an issue with the XML object itself.
#Importing URL request modules for API calls
#Also importing ElementTree as it seems to be best for XML parsing
import urllib.request
import urllib.parse
import re
import xml.etree.ElementTree as ET
from urllib import request
#Now I can make the API call.
id_request = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568')
#id_request will be an object that I'm not sure I understand?
#id_request returns: "<http.client.HTTPResponse object at 0x0000000003693FD0>"
#Let's now read this baby in XML format!
id_pubmed = id_request.read()
#If I look at the id_pubmed object, I now have the XML content I want to parse.
You can see the XML that id_pubmed contains here: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568
My issue is I can't get Element Tree to parse this at all. I have tried:
tree = ET.parse(id_pubmed)
root = tree.getroot()
as well as various other suggestions from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
The ET.parse() method requires either the location of an XML file (on the local file system) or a file-like object, but your id_pubmed is a string (bytes, in fact, since it comes from .read()).
In that case, you should use ET.fromstring(). Example:
root = ET.fromstring(id_pubmed)
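As a side note, ET.parse() can also consume the HTTP response directly, because HTTPResponse is itself a file-like object. A short end-to-end sketch (the Item/Name layout used below is the usual eSummary structure):
import urllib.request
import xml.etree.ElementTree as ET

url = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
       '?db=pubmed&id=17570568')
with urllib.request.urlopen(url) as response:
    root = ET.parse(response).getroot()  # the response object is file-like, no .read() needed

# eSummary wraps each record in a DocSum of <Item Name="..."> elements
for item in root.iter('Item'):
    print(item.get('Name'), '=', item.text)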

How should one go about collecting data from a .CSV file using Python?

I am attempting to use the Yahoo! Finance API to gather current stock quotes using Python, a language I am particularly new to. The Yahoo! Finance API appears to offer its data in the form of a .CSV file that can be downloaded.
I am wondering what the best way to use this data is. Is it inefficient to download the file and then read it? Is there a way to convert it to JSON or XML that I can parse using something like urllib?
The .CSV I am getting is being generated on the following page, in this case the quote for Microsoft (MSFT):
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1
Many thanks in advance.
Python has a built-in csv module.
http://docs.python.org/library/csv.html
To answer your downloading question:
import urllib
file = r"http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1"
text = urllib.urlopen(file).read()
>>> print text
"MSFT","Microsoft Corpora",29.51
Based on the csv module and DictReader, it is easier to parse the data after urlopen. I reuse the code above from @kreativitea:
import csv
import urllib
file = r"http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1"
quotefile = urllib.urlopen(file)
fieldheaders = ["abbr", "name", "index"]
reader = csv.DictReader(quotefile, fieldnames=fieldheaders)
for row in reader:
    print row
The result is:
$ quote.py
{'index': '29.51', 'abbr': 'MSFT', 'name': 'Microsoft Corpora'}
The row in the for loop is a dict (a hash table), which is easy to deal with.
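The snippets above are Python 2. A Python 3 equivalent is sketched below (the Yahoo! CSV endpoint has since been retired, so treat the URL as illustrative):
import csv
import io
import urllib.request

url = "http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1"
raw = urllib.request.urlopen(url).read().decode("utf-8")

# DictReader needs a text stream, so wrap the decoded body in StringIO
reader = csv.DictReader(io.StringIO(raw), fieldnames=["abbr", "name", "index"])
for row in reader:
    print(row)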
