How do I run an XPath query on generated XML? - python

I create an XML structure using lxml.etree.Element:
import lxml.etree
import lxml.html
parent = lxml.etree.Element('root')
child = lxml.etree.Element('sub')
child.text = 'text'
parent.append(child)
I need to do the following query:
doc = lxml.html.document_fromstring(parent)
text = doc.xpath('sub/text()')
print(text)
but I get the following error message:
Traceback (most recent call last):
  File "C:\VINT\OPENSERVER\OpenServer\domains\localhost\python\parse_html\6_first_store_names_cat_full_xml_nested\q.py", line 9, in <module>
    doc = lxml.html.document_fromstring(parent)
  File "C:\Python33\lib\site-packages\lxml\html\__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3003, in lxml.etree.fromstring (src\lxml\lxml.etree.c:67277)
  File "parser.pxi", line 1784, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:101615)
ValueError: can only parse strings
Please help.

lxml.html.document_fromstring() accepts a string, not an Element, which is what you are passing in. Serialize the element first with lxml.etree.tostring(parent):
s = lxml.etree.tostring(parent)
doc = lxml.html.document_fromstring(s)
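As a hedged illustration of the fix above, here is a minimal sketch that assumes lxml is installed and Python 3 (where tostring() returns bytes). It shows two options: querying the in-memory element directly with parent.xpath(), which avoids re-parsing, and the serialize-then-parse round trip. One caveat worth stating: the HTML parser wraps the markup in html and body elements, so the relative path sub/text() matches nothing from the document root, and the sketch uses //sub/text() instead.

```python
import lxml.etree
import lxml.html

parent = lxml.etree.Element('root')
child = lxml.etree.SubElement(parent, 'sub')
child.text = 'text'

# Option 1: query the in-memory element directly; no serialization needed.
direct = parent.xpath('sub/text()')

# Option 2: serialize to bytes, then re-parse with the HTML parser.
serialized = lxml.etree.tostring(parent)
doc = lxml.html.document_fromstring(serialized)
# The HTML parser adds <html><body> wrappers, so search at any depth.
reparsed = doc.xpath('//sub/text()')

print(direct, reparsed)
```

If the data is XML rather than HTML, option 1 is probably the simpler choice, because it keeps the tree exactly as it was built.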

Related

Error in parsing XML file using Python's ElementTree

I'm trying to import an XML file using ElementTree, but I get this error:
Traceback (most recent call last):
File "Object_Detection_Script.py", line 3, in <module>
tree = ET.parse('text.xml')
File "ElementTree.py", line 1196, in parse
tree.parse(source, parser)
File "ElementTree.py", line 597, in parse
self._root = parser._parse_whole(source)
File "<string>", line None
xml.etree.ElementTree.ParseError: mismatched tag: line 20, column 6
Here's a part of my XML:
<Stream canTime="729785232" itcMsgCounter="39506" pcTime1="729209" sourceInfo="29.00" streamNumber="5.000" streamRefIndex="22090" vehIndexUsed="0" versionInfo="41240">
<vision_failsafes blurImageFailsafe="false" blurredImageSeverityLevel="0" ddrRamCrcFailure="false" flrMisalignment="false" foggySpotsSeverityLevel="0" frameIndex="0" fullBlockageSeverityLevel="0" heavyRainFailsafe="false" imageIndex="0" invalidAhbcSensitivityParams="false" invalidSpdYawDetected="false" lowSunSeverityLevel="0" lowVisibilitySeverityLevel="0" outOfCalibrationSeverityLevel="0" outOfFocusSeverityLevel="0" partialSolidBlockageFailsafe="false" radarCommErrorCounter="0" radarMisalignSeverityLevel="0" radarVisCorrelationFailsafe="false" rccMisalignment="false" rollAngleFailsafe="false" selfGlareSeverityLevel="0" smearImageSeverityLevel="0" smearedSpotsSeverityLevel="0" splashesSeverityLevel="0" spotRaysSeverityLevel="0" sunRaySeverityLevel="0"/>
<extraInfo>
XML file
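A "mismatched tag" ParseError generally means that a closing tag does not match the most recently opened element. The sketch below is not taken from the thread: it uses a made-up, deliberately broken document and a hypothetical helper, find_parse_error, to show how the position attribute of ParseError can be read to locate the offending line.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately broken document: <b> is closed by </a>.
BROKEN_XML = "<a>\n  <b>text</a>\n</a>"


def find_parse_error(xml_text):
    """Return (line, column) of the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # position is a (line, column) tuple reported by the parser.
        return exc.position
    return None


position = find_parse_error(BROKEN_XML)
print(position)
```

With the reported line number in hand, the element opened just before that point is usually the one missing its closing tag.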

Usage of python-readability

(https://github.com/buriy/python-readability)
I am struggling to use this library and I can't find any documentation for it. (Is there any?)
Calling help(Document) gives some usable information, but something is still going wrong.
My code so far:
from readability.readability import Document
import requests
url = 'http://www.somepage.com'
html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()
with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)
According to the help(Document) output, it should be possible to pass a list as negative_keywords:
readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()
This gives me a bunch of errors I don't understand:
Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'
Could someone please give me a hint about this error or how to deal with it?
There's an error in the library code. If you look at compile_pattern:
def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)
You can see that it only compiles a regex when elements is a non-empty string. A list or tuple is returned as a plain list, and an existing regular expression is returned unchanged.
Later on, though, the code assumes that self.negative_keywords is a regular expression and calls .search() on it. So I suggest you pass your keywords as a single comma-separated string, "test_keyword1,test_keyword2". That makes compile_pattern return a regular expression, which should fix the error.
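To make the difference concrete, the sketch below uses a local copy of the compile_pattern function quoted above (so it runs without the library) and a hypothetical helper, keywords_to_csv, that is not part of python-readability. It assumes that no keyword contains a comma and that the installed library version behaves like the quoted code; other versions may differ.

```python
import re

# Local copy of the library helper quoted above, for demonstration only.
regexp_type = type(re.compile(''))


def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile('|'.join(re.escape(x.lower()) for x in elements), re.U)


def keywords_to_csv(keywords):
    """Hypothetical helper: turn ['a', 'b'] into the string 'a,b'."""
    return ','.join(keywords)


keywords = ['test_keyword1', 'test_keyword2']
as_list = compile_pattern(keywords)                   # plain list, no .search()
as_regex = compile_pattern(keywords_to_csv(keywords)) # compiled pattern

print(hasattr(as_list, 'search'), hasattr(as_regex, 'search'))
```

The same comma-separated string can then be passed as negative_keywords when constructing Document.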

Python HTMLParser Not Reading Whole File

from HTMLParser import HTMLParser
class HTMLParserDos(HTMLParser):
    full_text = ""
    def handle_data(self, data):
        # Accumulate every chunk of text found between tags.
        self.full_text += data
        return self.full_text
h = HTMLParserDos()
file = open('emails.txt', 'r')
h.feed(file.read())
file.close()
print h.full_text
This code is getting an error:
Traceback (most recent call last):
  File "/Users/laurenstrom/Google Drive/PYTHON/RANDO_CALRISSIAN/html_parse", line 15, in <module>
    h.feed(file.read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 7, column 18
I'm not sure what I'm missing about .feed() but I can't seem to find anything about why it won't just read the whole file.
You are asking the HTML parser to parse a file, most of which isn't HTML. It is tripping over line 7 of your file, which is:
Return-Path: <Tom#sjnetworkconsulting.com>
I would imagine it sees the < and assumes that an HTML tag follows, which of course is not the case.
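One possible way around this, sketched below under several assumptions, is to separate the mail headers from the body with the standard email package and feed only the HTML body to the parser. The sketch is written for Python 3 (html.parser instead of the Python 2 HTMLParser module), and the message text, the address and the TextCollector class are made up for illustration. A real emails.txt with several messages or multipart bodies would need more handling.

```python
from email import message_from_string
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    @property
    def full_text(self):
        return "".join(self.chunks)


# Made-up single-part message; the header line would confuse an HTML parser.
RAW_EMAIL = (
    "Return-Path: <tom@example.com>\n"
    "Subject: Hello\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>Hi there</p></body></html>\n"
)

message = message_from_string(RAW_EMAIL)
html_body = message.get_payload()  # body only, headers are stripped

collector = TextCollector()
collector.feed(html_body)
collector.close()
text = collector.full_text.strip()
print(text)
```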

Writing Python ElementTree to file throws TypeError

I'm trying to write an XML file using Python's ElementTree package. Basically I make a root element called allDepts, and then in each iteration of my for loop I call a function that returns a deptElement containing a bunch of information about a university department. I add every deptElement to allDepts, make an ElementTree out of allDepts, and try to write it to a file.
def crawl(year, season, campus):
    departments = getAllDepartments(year, season, campus)
    allDepts = ET.Element('depts')
    for dept in departments:
        deptElement = getDeptElement(allDepts, dept, year, season, campus)
        print ET.tostring(deptElement) #Prints fine here!
        ET.SubElement(allDepts, deptElement)
        if deptElement == None:
            print "ERROR: " + dept
    with open(str(year) + season + "_" + campus + "_courses.xml", 'w') as f:
        tree = ET.ElementTree(allDepts)
        tree.write(f)
For some reason, at the tree.write(f) line, I get this error: "TypeError: cannot concatenate 'str' and 'instance' objects". Each deptElement prints out fine in the for loop, making me think that getDeptElement() is working fine. I never get my "ERROR" message printed out. Does anyone know what I'm doing wrong?
EDIT: Here's the full stack trace:
File "./CourseInfoCrawl.py", line 210, in <module>
crawl("2013", "S", "UBC")
File "./CourseInfoCrawl.py", line 207, in crawl
tree.write(f)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 663, in write
self._write(file, self._root, encoding, {})
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/xml/etree/ElementTree.py", line 681, in _write
file.write("<" + _encode(tag, encoding))
The following line seems to be the cause:
print "ERROR: " + dept
Change it as follows and retry:
print "ERROR: ", dept
or
print "ERROR: " + str(dept)
ADD
The second argument to ET.SubElement should be a str (the tag name). Is deptElement a str?
If deptElement is an Element, use allDepts.append(deptElement) instead.
http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.SubElement
http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.append
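The difference between the two calls can be sketched as follows. This is a minimal Python 3 illustration with made-up department codes, not the asker's actual data: append() attaches an Element that already exists, while SubElement() takes a tag name and creates the child itself.

```python
import xml.etree.ElementTree as ET

all_depts = ET.Element("depts")

# Case 1: the child already exists as an Element, so attach it with append().
dept_element = ET.Element("dept", {"code": "CPSC"})
all_depts.append(dept_element)

# Case 2: no child exists yet; SubElement() takes a tag *name* and creates it.
ET.SubElement(all_depts, "dept", {"code": "MATH"})

# Serialization now succeeds because every tag is a plain string.
xml_bytes = ET.tostring(all_depts)
print(xml_bytes)
```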
ADD 2
To reproduce the error (Python 2.6):
>>> from xml.etree import ElementTree as ET
>>> allDepts = ET.Element('depts')
>>> ET.SubElement(allDepts, ET.Element('a'))
<Element <Element a at b727b96c> at b727b22c>
>>> with open('a', 'wb') as f:
...     tree = ET.ElementTree(allDepts)
...     tree.write(f)
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/home/falsetru/t/Python-2.6/Lib/xml/etree/ElementTree.py", line 663, in write
self._write(file, self._root, encoding, {})
File "/home/falsetru/t/Python-2.6/Lib/xml/etree/ElementTree.py", line 707, in _write
self._write(file, n, encoding, namespaces)
File "/home/falsetru/t/Python-2.6/Lib/xml/etree/ElementTree.py", line 681, in _write
file.write("<" + _encode(tag, encoding))
TypeError: cannot concatenate 'str' and 'instance' objects
To reproduce the error (Python 2.7, different error message):
>>> from xml.etree import ElementTree as ET
>>> allDepts = ET.Element('depts')
>>> ET.SubElement(allDepts, ET.Element('a'))
<Element <Element 'a' at 0xb745a8ec> at 0xb74601ac>
>>> with open('a', 'wb') as f:
...     tree = ET.ElementTree(allDepts)
...     tree.write(f)
...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 817, in write
self._root, encoding, default_namespace
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 886, in _namespaces
_raise_serialization_error(tag)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1052, in _raise_serialization_error
"cannot serialize %r (type %s)" % (text, type(text).__name__)
TypeError: cannot serialize <Element 'a' at 0xb745a8ec> (type Element)

Trouble using gdata and Unicode Cyrillic in Python

I have this code:
# -*- coding: utf8 -*-
__author__ = 'user'
import gdata.youtube.service
yt_service = gdata.youtube.service.YouTubeService()
query = gdata.youtube.service.YouTubeVideoQuery()
query.vq = u"не"
feed = yt_service.YouTubeQuery(query)
for yt_item in feed.entry:
    print yt_item.GetSwfUrl()
And I am getting this error:
Traceback (most recent call last):
File "cyr_search.py", line 7, in <module>
feed = yt_service.YouTubeQuery(query)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/gdata/youtube/service.py", line 1346, in YouTubeQuery
result = self.Query(query.ToUri())
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/gdata/service.py", line 1715, in ToUri
return atom.service.BuildUri(q_feed, self)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/atom/service.py", line 584, in BuildUri
parameter_list = DictionaryToParamList(url_params, escape_params)
File "/Users/user/Documents/GrabaHeroku/graba_h_ve/lib/python2.7/site-packages/atom/service.py", line 551, in DictionaryToParamList
for param, value in (url_parameters or {}).items()]
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1275, in quote_plus
return quote(s, safe)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1268, in quote
return ''.join(map(quoter, s))
KeyError: u'\u043d'
How do I search for non-ASCII text? Do I need to URL-encode the query? I thought the library would do that on its own.
Change to:
query.vq = u"не".encode('utf8')
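The unicode string needs to be encoded to a UTF-8 byte string before being sent, because Python 2's urllib.quote (seen in the traceback) cannot handle non-ASCII unicode characters.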
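To show what the encoded query looks like on the wire, here is a small Python 3 sketch using urllib.parse.quote_plus from the standard library. It is only an illustration of the encode-then-escape step and does not use gdata. Note that in Python 3, quote_plus also accepts a str and applies UTF-8 by default, so the KeyError above is specific to Python 2.

```python
from urllib.parse import quote_plus

query_text = "не"

# Encode to UTF-8 bytes first, then percent-encode each byte.
encoded_bytes = query_text.encode("utf-8")
escaped = quote_plus(encoded_bytes)

print(escaped)  # %D0%BD%D0%B5
```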
