I need to convert a web page to XML (using Python 3.4.3). If I write the contents of the URL to a file then I can read and parse it perfectly but if I try to read directly from the web page I get the following error in my terminal:
File "./AnimeXML.py", line 22, in <module>
xml = ElementTree.parse (xmlData)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 1187, in parse
tree.parse(source, parser)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/xml/etree/ElementTree.py", line 587, in parse
source = open(source, "rb")
OSError: [Errno 36] File name too long:
My python code:
# AnimeXML.py
#! /usr/bin/Python
# Import xml parser.
import xml.etree.ElementTree as ElementTree
# Import urlopen to fetch the page.
from urllib.request import urlopen
# XML to parse.
sampleUrl = "http://cdn.animenewsnetwork.com/encyclopedia/api.xml?anime=16989"
# Read the xml as a file.
content = urlopen (sampleUrl)
# XML content is stored here to start working on it.
xmlData = content.readall().decode('utf-8')
# Close the file.
content.close()
# Start parsing XML.
xml = ElementTree.parse (xmlData)
# Get root of the XML file.
root = xml.getroot()
for info in root.iter("info"):
    print (info.attrib)
Is there any way I can fix my code so that I can read the web page directly into python without getting this error?
As explained in the Parsing XML section of the ElementTree docs:
We can import this data by reading from a file:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Or directly from a string:
root = ET.fromstring(country_data_as_string)
You're passing the whole XML contents as a giant pathname. Your XML file is probably bigger than 2K, or whatever the maximum pathname size is for your platform, hence the error. If it weren't, you'd just get a different error about there being no directory named [everything up to the first / in your XML file].
Just use fromstring instead of parse.
Or, notice that parse can take a file object, not just a filename. And the thing returned by urlopen is a file object.
Also notice the very next line in that section:
fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions may create an ElementTree.
So, you don't want that root = tree.getroot() either.
So:
# ...
content.close()
root = ElementTree.fromstring(xmlData)
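Both options can be sketched as follows. This is only a minimal illustration: xml_bytes is a made-up stand-in for the downloaded page, and io.BytesIO plays the role of the object returned by urlopen so that the example runs without network access.

```python
import io
import xml.etree.ElementTree as ElementTree

# Hypothetical stand-in for the bytes downloaded from the URL.
xml_bytes = b'<ann><anime id="1"><info type="Main title">Example</info></anime></ann>'

# Option 1: parse a string with fromstring(); the result is already the root Element.
root_from_string = ElementTree.fromstring(xml_bytes.decode("utf-8"))

# Option 2: hand parse() a file object; BytesIO stands in for urlopen()'s response here.
tree = ElementTree.parse(io.BytesIO(xml_bytes))
root_from_file = tree.getroot()

for info in root_from_string.iter("info"):
    print(info.attrib)
```

With option 2 you still call getroot(), because parse() returns an ElementTree rather than an Element.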
I'm trying to manipulate an XML file. I use a loop, and on each iteration I want the version number of the XML file to be increased. To manipulate the XML file I am using ETree. Here is what I have tried so far:
def main():
    import xml.etree.ElementTree as ET
    import os
    version = "0"
    while os.path.exists(f"/Users/tt/sumoTracefcdfile_{version}.xml"):
        #use parse() function to load and parse an xml file
        fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml"
        version=int(version)
        version+=1
        doc = ET.parse(fileDirect)
        .....
        #at the end after adding some data to xml file, I do the following to write the changes into the xml file:
        save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml"
        b_xml = ET.tostring(valeurs)
        with open(save_path_file, "wb") as f:
            f.write(b_xml)
However I get the following error for the line 'doc = ET.parse(fileDirect)':
FileNotFoundError: [Errno 2] No such file or directory:
'/Users/tt/sumoTracefcdfile_{version}.xml'
It looks like you wanted to use f-strings and forgot the "f" in 2 lines.
Changing fileDirect="/Users/tt/sumoTracefcdfile_{version}.xml" to fileDirect = f"/Users/tt/sumoTracefcdfile_{version}.xml" and save_path_file = "/Users/tt/sumoTracefcdfile_{version}.xml" to save_path_file = f"/Users/tt/sumoTracefcdfile_{version}.xml" might solve your issues.
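A quick way to see the difference is to build both strings side by side. This is just an illustrative sketch; the value 3 for version is arbitrary.

```python
version = 3

# Without the f prefix the braces are kept literally.
plain = "/Users/tt/sumoTracefcdfile_{version}.xml"
# With the f prefix the current value of version is substituted.
formatted = f"/Users/tt/sumoTracefcdfile_{version}.xml"

print(plain)
print(formatted)
```

Note that in the question's loop, version is incremented before save_path_file is built, so with the f prefix added the output would be written under the next version number.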
I am learning how to parse documents using lxml. To do so, I'm trying to parse my LinkedIn page. It has plenty of information and I thought it would be good practice.
Enough with the context. Here is what I'm doing:
going to the url: https://www.linkedin.com/in/NAME/
opening and saving the source code as "linkedin.html"
as I'm trying to extract my current job, I'm doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is lxml.etree._ElementTree.
But it always returns an empty list for my variable title.
I've been trying all day but still don't understand what I'm doing wrong.
I found the answer to my problem: adding an encoding parameter to the open() call.
Here is what I did:
def parse_html_file(filename):
    f = open(filename, encoding="utf8").read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(f), parser)
    return tree

tree = parse_html_file('linkedin.html')
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())
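Besides the encoding, the query above selects the element by its class attribute instead of by absolute position in the tree. A small sketch of that idea follows. It uses the standard library's ElementTree, which supports only a subset of XPath, rather than lxml, and the snippet is made up and well-formed, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for the saved page.
snippet = """
<html><body>
  <ul>
    <li class="inline t-24 t-black t-normal break-words"> Jane Doe </li>
    <li class="other">ignored</li>
  </ul>
</body></html>
"""

root = ET.fromstring(snippet)
# Select by attribute value instead of by absolute position in the tree.
matches = root.findall('.//li[@class="inline t-24 t-black t-normal break-words"]')
print(matches[0].text.strip())
```

An attribute-based query like this keeps working when wrapper elements are added or removed, whereas a long absolute path breaks as soon as one index changes.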
I am using the xmltodict library in Python (https://pypi.org/project/xmltodict/) to parse an XML file like this:
import xmltodict
with open("MyXML.xml") as MyXML:
    doc = xmltodict.parse(MyXML.read())
The xml file looks good but I get this error:
ExpatError: no element found: line 1, column 0
What should I do?
In my use of xmltodict, I have always parsed a string, and to get an XML string I use etree. Try this:
import xml.etree.ElementTree as ET
import xmltodict
tree = ET.parse("MyXML.xml")
root = tree.getroot()
data = xmltodict.parse(ET.tostring(root))
If MyXML.xml is in a different location than this script, you will need to build the path to it yourself, for example with __file__ and the os module.
Good Luck, Hope this helps.
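The message "no element found: line 1, column 0" is what the Expat parser reports when it is handed empty input, so it may also be worth checking that the file was actually read. Here is a rough sketch using the standard library's ElementTree (which is also built on Expat) rather than xmltodict. The helper name parse_nonempty is made up for illustration.

```python
import xml.etree.ElementTree as ET

def parse_nonempty(text):
    """Parse XML text, rejecting empty input with a clearer message (hypothetical helper)."""
    if not text.strip():
        raise ValueError("XML input is empty; check the file path and contents")
    return ET.fromstring(text)

# Empty input triggers a parser error like the one in the question.
try:
    ET.fromstring("")
except ET.ParseError as exc:
    print("parser said:", exc)

root = parse_nonempty("<doc><item>1</item></doc>")
print(root.tag)
```

If the check fires, the likely culprits are a wrong path, an empty file, or a file handle that has already been read to the end.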
I'm getting:
<error>You have an error in your XML syntax...
when I run this python script I just wrote (I'm a newbie)
import requests
xml = """xxx.xml"""
headers = {'Content-Type':'text/xml'}
r = requests.post('https://example.com/serverxml.asp', data=xml)
print (r.content);
Here is the content of the xxx.xml
<xml>
<API>4.0</API>
<action>login</action>
<password>xxxx</password>
<license_number>xxxxx</license_number>
<username>xxx@xyz.com</username>
<training>1</training>
</xml>
I know that the xml is valid because I use the same xml for a perl script and the contents are being printed back.
Any help will be greatly appreciated as I am very new to Python.
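You want to send the XML data from a file with requests.post, but that function will not open the file for you. A string passed as data is sent as the request body verbatim, so the server receives the text xxx.xml rather than the file's contents. Pass an open file object (or the file's contents) instead, which means opening the file before you call requests.post.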
Try this:
import requests
# Set the name of the XML file.
xml_file = "xxx.xml"
headers = {'Content-Type':'text/xml'}
# Open the XML file in binary mode so the bytes are sent unchanged.
with open(xml_file, 'rb') as xml:
    # Give the object representing the XML file to requests.post.
    r = requests.post('https://example.com/serverxml.asp', data=xml, headers=headers)
print(r.content)
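To see why the original script failed, it can help to compare the two request bodies without contacting the server. The sketch below uses the standard library's urllib.request.Request as a stand-in for requests (an assumption made only so the example runs offline, since the request is built but never sent) and a temporary file in place of xxx.xml.

```python
import os
import tempfile
import urllib.request

xml_text = "<xml><action>login</action></xml>"

# Write a stand-in for xxx.xml to a temporary directory.
tmp_dir = tempfile.mkdtemp()
xml_path = os.path.join(tmp_dir, "xxx.xml")
with open(xml_path, "w", encoding="utf-8") as fh:
    fh.write(xml_text)

headers = {"Content-Type": "text/xml"}
url = "https://example.com/serverxml.asp"

# What the original script effectively sent: the file name itself.
wrong = urllib.request.Request(url, data=b"xxx.xml", headers=headers, method="POST")

# What the server expects: the bytes stored in the file.
with open(xml_path, "rb") as fh:
    body = fh.read()
right = urllib.request.Request(url, data=body, headers=headers, method="POST")

print(wrong.data)
print(right.data)
```

The first body is just the seven characters of the file name, which is why the server complained about the XML syntax.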
I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks, but the first one is this: I can not get it to properly parse the file. Below is a brief explanation, the python code and the traceback info.
Using Python & Sublime to parse html files, I am getting several errors. What IS working: it runs fine up until if '.html' in file:. It does not execute that loop. It will iterate through print allFiles just fine. It also creates the csv file and creates the headers (though not in separate columns, but I can ask about that later).
It seems that the problem is in the tree = ET.parse(HTML_PATH+"/"+file) piece. I've written this several different ways (without "/" and/or "file", for example), but so far I have yet to resolve this problem.
If I can provide more information or if anyone can direct me to other documentation, it would be greatly appreciated. So far I have yet to find anything that addresses this issue.
Many thanks for your thoughts.
//C
# Parses out data from crawled html files under "html files"
# and places the output in output.csv.
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
    #print allFiles
    if '.html' in file:
        print "in html loop"
        tree = ET.parse(HTML_PATH+"/"+file)
        print '===================='
        print 'Parsing file: '+file
        print '===================='
        for node in tree.iter():
            print "tbody"
            # The tbody attribute spells it all (or does it):
            name = node.attrib.get('/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font')
            # Check common header stuff
            if name=='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
                #print ' ------------------'
                #print ' Category:'
                category=node.text
                print "category"
f.close()
Traceback:
File "/Users/C/data/Folder_NS/data_parse.py", line 34, in <module>
tree = ET.parse(HTML_PATH+"/"+file)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 63, column 2
You are trying to parse HTML with an XML parser, and valid HTML is not always valid XML. You would be better off using the HTML parsing library in the lxml package.
import xml.etree.ElementTree as ET
# ...
tree = ET.parse(HTML_PATH + '/' + file)
would be changed to
import lxml.html
# ...
tree = lxml.html.parse(HTML_PATH + '/' + file)
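The mismatch can be reproduced on a tiny made-up snippet. Since lxml is a third-party package, this sketch swaps in the standard library's html.parser (not the lxml API recommended above) just to show that an HTML-aware parser accepts markup the XML parser rejects. It is written for Python 3, and TextCollector is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Valid HTML, but not well-formed XML: <br> is never closed.
snippet = "<html><body><p>first<br>second</p></body></html>"

try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError as exc:
    xml_ok = False
    print("XML parser failed:", exc)

class TextCollector(HTMLParser):
    """Collect the text nodes seen while parsing (illustrative helper)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

collector = TextCollector()
collector.feed(snippet)
collector.close()
print(collector.chunks)
```

The XML parser stops at the first unclosed tag, while the HTML parser reads the whole snippet, which is the same reason lxml.html can handle the crawled files.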