How to get an XML file from a website using Python?

Using the 'bottle' library, I have to create my own API based on the website http://dblp.uni-trier.de, so I need to get data for each author. For this reason I am using the following link format: http://dblp.uni-trier.de/pers/xx/'first letter of the last name'/'lastnamefirstname'.xml
Could you help me get the XML so that I can parse it and extract the information I need?
Thank you.
import bottle
import requests
import re

r = requests.get("https://dblp.uni-trier.de/")

# the format of my request is
# http://localhost:8080/lastname firstname
@bottle.route('/info/<name>')
def info(name):
    first_letter = name[:1]
    # format as Lastname:Firstname
    ...
    data = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")
    return data

bottle.run(host='localhost', port=8080)

from xml.etree import ElementTree
import requests
url = 'some url'
response = requests.get(url)
xml_root = ElementTree.fromstring(response.content)
From the documentation: fromstring() parses an XML section from a string constant. This function can be used to embed "XML literals" in Python code. text is a string containing XML data. parser is an optional parser instance. If not given, the standard XMLParser parser is used. Returns an Element instance.
How to load XML from a string into an ElementTree:
from xml.etree import ElementTree
root = ElementTree.fromstring("<root><a>1</a></root>")
ElementTree.dump(root)
Output:
<root><a>1</a></root>

The object returned by requests.get is not the raw data. You need to use the text property to get the contents (see the Response Content documentation).
Note that:
response.text returns the content as unicode
response.content returns the content as bytes
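Putting the two answers together, a minimal sketch for the DBLP case might look like the following. The author key is only a placeholder built the way the question describes, and the title tag is an assumption about DBLP's person XML, so adjust both to what the service actually returns:

import requests
from xml.etree import ElementTree

# Placeholder author key, following the URL pattern from the question
family_name = "Lastname:Firstname"
first_letter = family_name[:1].lower()

response = requests.get("http://dblp.uni-trier.de/pers/xx/" + first_letter + "/" + family_name + ".xml")

# response.content is bytes, which is exactly what fromstring expects
root = ElementTree.fromstring(response.content)

# 'title' is an assumption about the elements in DBLP's person XML
for title in root.iter('title'):
    print(title.text)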

Related

Get xpath from html file using LXML - Python

I am learning how to parse documents using lxml. To do so, I'm trying to parse my LinkedIn page. It has plenty of information and I thought it would be good training.
Enough with the context. Here is what I'm doing:
going to the URL: https://www.linkedin.com/in/NAME/
opening and saving the source code as "linkedin.html"
as I'm trying to extract my current job, I'm doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is lxml.etree._ElementTree.
But it always returns an empty list for my title variable.
I've been trying all day but still don't understand what I'm doing wrong.
I found the answer to my problem: adding an encoding parameter to the open() function.
Here is what I've done:
def parse_html_file(filename):
    f = open(filename, encoding="utf8").read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(f), parser)
    return tree

tree = parse_html_file('linkedin.html')
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())
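As a side note, lxml can also read the file and handle the encoding itself, so the StringIO step is optional. A minimal sketch, assuming the same linkedin.html file and XPath as above:

from lxml import etree

# HTMLParser accepts an encoding, and etree.parse accepts a filename,
# so the file never has to be read into a string first.
parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse("linkedin.html", parser)
print(tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]'))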

Load JSON data from Google GitHub repo

I am trying to load the following JSON file (from the Google Github repo) in Python as follows:
import json
import requests
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
data = r.text.splitlines(True)
# remove the first n lines, which are not JSON (commented license)
data = ''.join(data[14:])
When I use json.loads(data) I get the following error:
JSONDecodeError: Expecting ',' delimiter: line 725 column 543 (char 54975)
As this has been saved as a .json file by the GitHub repo owner (Google), I'm wondering what I'm doing wrong here.
I found that the text obtained from the API call is plain text, not valid JSON (I checked at https://jsonformatter.curiousconcept.com/).
Here is the code that I used to filter out the valid JSON part from the response.
I used the re module to extract the JSON part.
import json
import requests
import re
url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)
text = r.text.strip()
m = re.search(r'\{(.|\s)*\}', text)  # find the valid JSON part of the obtained text
s = m.group(0).replace('false', 'False')  # Python uses 'False/True', not 'false/true'
d = eval(s)
print(d) # {...}
print(type(d)) # <class 'dict'>
References:
https://docs.python.org/3.6/library/re.html
https://jsonformatter.curiousconcept.com/
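As an alternative, if the part captured by the regular expression is itself valid JSON, json.loads can parse it directly: it already maps false/true/null to Python's False/True/None, so neither the string replacement nor eval is needed. Whether that holds for this particular file is not verified here, so treat this as a sketch:

import json
import re
import requests

url = "https://raw.githubusercontent.com/google/vsaq/master/questionnaires/webapp.json"
r = requests.get(url)

# Grab the outermost {...} block, skipping the commented license header
m = re.search(r'\{(.|\s)*\}', r.text.strip())

# json.loads understands false/true/null natively, so no replacement or
# eval is required as long as the captured text is valid JSON.
d = json.loads(m.group(0))
print(type(d))  # <class 'dict'>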

Convert html source code to json object

I am fetching the HTML source code of many pages from one website. I need to convert it into a JSON object and combine it with other elements in a JSON doc. I have seen many questions on the same topic but none of them were helpful.
My code:
url = "https://totalhash.cymru.com/analysis/?1ce201cf28c6dd738fd4e65da55242822111bd9f"
htmlContent = requests.get(url, verify=False)
data = htmlContent.text
print("data",data)
jsonD = json.dumps(htmlContent.text)
jsonL = json.loads(jsonD)
ContentUrl='{ \"url\" : \"'+str(urls)+'\" ,'+"\n"+' \"uid\" : \"'+str(uniqueID)+'\" ,\n\"page_content\" : \"'+jsonL+'\" , \n\"date\" : \"'+finalDate+'\"}'
The above code gives me a unicode type; however, when I put that output into JSONLint it gives me an invalid JSON error. Can somebody help me understand how I can convert the complete HTML into a JSON object?
jsonD = json.dumps(htmlContent.text) converts the raw HTML content into a JSON string representation.
jsonL = json.loads(jsonD) parses the JSON string back into a regular string/unicode object. This results in a no-op, as any escaping done by dumps() is reverted by loads(). jsonL contains the same data as htmlContent.text.
Try to use json.dumps to generate your final JSON instead of building the JSON by hand:
ContentUrl = json.dumps({
    'url': str(urls),
    'uid': str(uniqueID),
    'page_content': htmlContent.text,
    'date': finalDate
})
The correct way to convert HTML source code to a JSON file on the local system is as follows:
import json
import codecs
# Load the JSON file by specifying the location and filename
with codecs.open(filename="json_file.json", mode="r", encoding="utf-8") as jsonf:
    json_file = json.loads(jsonf.read())

# Load the HTML file by specifying the location and filename
with codecs.open(filename="html_file.html", mode='r', encoding="utf-8") as htmlf:
    html_file = htmlf.read()

# Choose the key name where the HTML source code will live as a string
json_file['Key1']['Key2'] = html_file

# Dump the dictionary to a JSON object and save it in a specific location
json_object = json.dumps(json_file, indent=4)
with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json_object)
Next, open the JSON file in your editor.
Press CTRL + H and replace the \n or \t characters with '' (nothing!).
Now you can parse your HTML file with the codecs.open() function and do your operations.
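If you would rather do that replacement in Python than in the editor, here is a minimal sketch of the same idea (the Key1/Key2 names are just the placeholders used above):

import codecs
import json

# Read the HTML and strip newlines/tabs up front, so the dumped JSON never
# contains the \n / \t escape sequences removed by the manual CTRL+H step.
with codecs.open(filename="html_file.html", mode="r", encoding="utf-8") as htmlf:
    html_file = htmlf.read().replace("\n", "").replace("\t", "")

# Placeholder structure; in practice this would be the loaded json_file dict
json_file = {"Key1": {"Key2": html_file}}

with codecs.open(filename="final_json_file.json", mode="w", encoding="utf-8") as ojsonf:
    ojsonf.write(json.dumps(json_file, indent=4))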

How to parse XML by using python

I want to parse this URL to get the text of the Roman element:
http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for count in counts:
    print count.get('Roman')
But it didn't work.
Try tree.findall('.//{urn:yahoo:jp:jlp:FuriganaService}Word'). It seems you need to specify the namespace too.
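In other words, because the document declares a default namespace, the tag name in findall() has to be qualified with it. A minimal sketch, using a made-up snippet that mirrors the feed's structure:

import xml.etree.ElementTree as ET

# Made-up snippet in the same namespace as the Yahoo response
data = '''<ResultSet xmlns="urn:yahoo:jp:jlp:FuriganaService">
  <Result><WordList><Word><Roman>watashi</Roman></Word></WordList></Result>
</ResultSet>'''

tree = ET.fromstring(data)
for word in tree.findall('.//{urn:yahoo:jp:jlp:FuriganaService}Word'):
    roman = word.find('{urn:yahoo:jp:jlp:FuriganaService}Roman')
    print(roman.text if roman is not None else 'Not found')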
I recently ran into a similar issue. It was because I was using an older version of the xml.etree package, and to work around it I had to create a loop for each level of the XML structure. For example:
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
counts = tree.findall('.//Word')
for result in tree.findall('Result'):
    for wordlist in result.findall('WordList'):
        for word in wordlist.findall('Word'):
            print(word.get('Roman'))
Edit:
With the suggestion from @omu_negru I was able to get this working. There was another issue: when getting the text for "Roman" you were using the "get" method, which retrieves attributes of the tag. Using the "text" attribute of the element you can get the text between the opening and closing tags. Also, if there is no 'Roman' tag, you'll get a None object and won't be able to read an attribute on None.
# encoding: utf-8
import urllib
import xml.etree.ElementTree as ET
url = 'http://jlp.yahooapis.jp/FuriganaService/V1/furigana?appid=dj0zaiZpPU5TV0Zwcm1vaFpIcCZzPWNvbnN1bWVyc2VjcmV0Jng9YTk-&grade=1&sentence=私は学生です'
uh = urllib.urlopen(url)
data = uh.read()
tree = ET.fromstring(data)
ns = '{urn:yahoo:jp:jlp:FuriganaService}'
counts = tree.findall('.//%sWord' % ns)
for count in counts:
    roman = count.find('%sRoman' % ns)
    if roman is None:
        print 'Not found'
    else:
        print roman.text

Parsing compressed xml feed into ElementTree

I'm trying to parse the following feed into ElementTree in Python: "http://smarkets.s3.amazonaws.com/oddsfeed.xml" (warning: large file)
Here is what I have tried so far:
feed = urllib.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")
# feed is compressed
compressed_data = feed.read()
import StringIO
compressedstream = StringIO.StringIO(compressed_data)
import gzip
gzipper = gzip.GzipFile(fileobj=compressedstream)
data = gzipper.read()
# Parse XML
tree = ET.parse(data)
but it seems to just hang on compressed_data = feed.read(), maybe infinitely? (I know it's a big file, but it takes far longer than the other, non-compressed feeds I've parsed, and a delay this large kills any bandwidth gains from the gzip compression in the first place).
Next I tried requests, with
url = "http://smarkets.s3.amazonaws.com/oddsfeed.xml"
headers = {'accept-encoding': 'gzip, deflate'}
r = requests.get(url, headers=headers, stream=True)
but now both
tree = ET.parse(r.content)
and
tree = ET.parse(r.text)
raise exceptions.
What's the proper way to do this?
You can pass the value returned by urlopen() directly to GzipFile() and in turn you can pass it to ElementTree methods such as iterparse():
#!/usr/bin/env python3
import xml.etree.ElementTree as etree
from gzip import GzipFile
from urllib.request import urlopen, Request
with urlopen(Request("http://smarkets.s3.amazonaws.com/oddsfeed.xml",
                     headers={"Accept-Encoding": "gzip"})) as response, \
     GzipFile(fileobj=response) as xml_file:
    for elem in getelements(xml_file, 'interesting_tag'):
        process(elem)
where getelements() allows parsing files that do not fit in memory.
def getelements(filename_or_file, tag):
    """Yield *tag* elements from *filename_or_file* xml incrementally."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # free memory
To preserve memory, the constructed xml tree is cleared on each tag element.
The ET.parse function takes "a filename or file object containing XML data". You're giving it a string full of XML. It's going to try to open a file whose name is that big chunk of XML. There is probably no such file.
You want the fromstring function, or the XML constructor.
Or, if you prefer, you've already got a file object, gzipper; you could just pass that to parse instead of reading it into a string.
This is all covered by the short Tutorial in the docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
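A minimal sketch of that last option: hand the gzip file object straight to parse() instead of reading it into a string first (this assumes, as the question does, that the response body is gzip-compressed):

import gzip
import urllib.request
import xml.etree.ElementTree as ET

# Wrap the HTTP response in a GzipFile and let ET.parse read from it directly
response = urllib.request.urlopen("http://smarkets.s3.amazonaws.com/oddsfeed.xml")
with gzip.GzipFile(fileobj=response) as gzipper:
    tree = ET.parse(gzipper)

root = tree.getroot()
print(root.tag)

Note that, unlike the iterparse() approach above, this builds the whole tree in memory, which may be slow for a feed of this size.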
