Problem Background:
I have an XML file that I'm importing into BeautifulSoup and parsing through. One node has the following:
<DIAttribute name="ObjectDesc" value="Line1&#13;&#10;Line2&#13;&#10;Line3"/>
Notice that the value has &#13; and &#10; within the text. I understand those are the XML representation of carriage return and line feed.
When I import into BeautifulSoup, the value gets converted into the following:
<DIAttribute name="ObjectDesc" value="Line1
Line2
Line3"/>
You'll notice that the &#13;&#10; gets converted to a newline.
My use case requires that the value remains as the original. Any idea how to get that to stay? Or convert it back?
Source Code:
python: (2.7.11)
from bs4 import BeautifulSoup #version 4.4.0
s = BeautifulSoup(open('test.xml'),'lxml-xml',from_encoding="ansi")
print s.DIAttribute
#XML file looks like
'''
<?xml version="1.0" encoding="UTF-8" ?>
<DIAttribute name="ObjectDesc" value="Line1&#13;&#10;Line2&#13;&#10;Line3"/>
'''
Notepad++ says the encoding of the source XML file is ANSI.
Things I've Tried:
I've scoured the documentation without any success.
Variations for line 3:
print s.DIAttribute.prettify('ascii')
print s.DIAttribute.prettify('windows-1252')
print s.DIAttribute.prettify('ansi')
print s.DIAttribute.prettify('utf-8')
print s.DIAttribute['value'].replace('\r','&#13;').replace('\n','&#10;') #This works, but it feels like a band-aid, and other problems will likely remain.
Any ideas anyone? I appreciate any comments/suggestions.
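A minimal sketch of the behaviour described above, using the standard library's xml.etree.ElementTree rather than BeautifulSoup (an assumption made so the example is self-contained). An XML parser resolves &#13; and &#10; in an attribute value to real carriage-return and line-feed characters, so the references have to be restored when the value is written back out. The helper name escape_line_breaks is hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = '<DIAttribute name="ObjectDesc" value="Line1&#13;&#10;Line2&#13;&#10;Line3"/>'


def escape_line_breaks(value):
    """Turn CR and LF characters back into numeric character references."""
    return value.replace('\r', '&#13;').replace('\n', '&#10;')


# The parser hands back the decoded characters, not the references.
parsed_value = ET.fromstring(SOURCE).get('value')
print(repr(parsed_value))

# Re-escape by hand before writing the value out again.
restored_value = escape_line_breaks(parsed_value)
print(restored_value)
```

If the re-escaped string is then handed to a serializer that escapes & itself, the references will be escaped a second time, so this approach only suits cases where the markup is written out by hand.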
Just for the record, first the libraries that do NOT handle the entity properly: BeautifulSoup(data, convertEntities=BeautifulSoup.HTML_ENTITIES), lxml.html.soupparser.unescape, xml.sax.saxutils.unescape
And this is what works (in Python 2.x):
import sys
import HTMLParser
## accept a file name as argument, or read stdin if nothing is passed
data = open(sys.argv[1]).read() if len(sys.argv) > 1 else sys.stdin.read()
parser = HTMLParser.HTMLParser()
print parser.unescape(data)
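In Python 3 the HTMLParser module was renamed to html.parser, and the standard library offers html.unescape (available since Python 3.4) for the same job. A minimal sketch, assuming the input is already a decoded text string; the sample string is made up for illustration:

```python
import html

# Sample input invented for illustration; named and numeric references mixed.
escaped = 'Sanjeev Sax&ouml;na &amp; co. &#169;'

# html.unescape resolves HTML named entities and numeric character references.
unescaped = html.unescape(escaped)
print(unescaped)
```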
Related
This question appears related to this one from 2013, but it didn't help me.
I'm about to parse a large (2GB) XML file, and plan to do it with Python 3.5.2 and ElementTree. I'm new to Python, but it works well until reaching any escape character, such as:
<author>Sanjeev Sax&ouml;na</author>
returning:
test.xml
File "<string>", line unknown
ParseError: undefined entity &ouml;: line 5, column 19
My code looks something like this:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse('test_esc.xml'):
    pass  # do something with the node
What's the best way to deal with this? Parsing the unescaped 'ö' actually works fine:
<author>Sanjeev Saxöna</author>
Is there an easy way to programmatically unescape the whole XML file?
As suggested by the answer linked by Soulaimane Sahmi, I added an inline DTD to the XML file. It is maybe not the best solution out there, but it works for now.
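A minimal sketch of the inline-DTD approach, assuming only the &ouml; entity needs declaring; the root element name records and the sample document are hypothetical. Each named entity the file uses needs its own <!ENTITY> declaration in the internal subset of the DOCTYPE:

```python
import xml.etree.ElementTree as etree

# Hypothetical sample; the internal DTD subset declares the ouml entity.
xml_text = '''<?xml version="1.0"?>
<!DOCTYPE records [
  <!ENTITY ouml "&#246;">
]>
<records>
  <author>Sanjeev Sax&ouml;na</author>
</records>'''

root = etree.fromstring(xml_text)
author = root.find('author').text
print(author)
```

For a large file read with iterparse, the DOCTYPE block would have to be part of the file itself (or be prepended to the stream) rather than live in an in-memory string as it does here.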
I'm quite new to Python and am trying to learn as much as I can by watching videos/reading tutorials.
I was following this video on how to take data from Quandl. I know there is a specific module for python already, but I wanted to learn how to take it from the website if necessary. My issue is that when I try to emulate the code around 9:50 and print the result, python doesn't split the lines in the CSV file. I understand he's using python 2.x, while I'm using 3.4.
Here's the code I use:
import urllib
from urllib.request import urlopen
def grabQuandl(ticker):
    endLink = 'sort_order=desc'#without authtoken
    try:
        salesRev = urllib.request.urlopen('https://www.quandl.com/api/v1/datasets/SEC/'+ticker+'_SALESREVENUENET_Q.csv?&'+endLink).read()
        print (salesRev)
    except Exception as e:
        print ('failed the main quandl loop for reason of', str(e))
grabQuandl('AAPL')
And this is what gets printed:
b'Date,Value\n2009-06-27,8337000000.0\n2009-12-26,15683000000.0\n2010-03-27,13499000000.0\n2010-06-26,15700000000.0\n2010-09-25,20343000000.0\n2010-12-25,26741000000.0\n2011-03-26,24667000000.0\n2011-06-25,28571000000.0\n2011-09-24,28270000000.0\n2011-12-31,46333000000.0\n2012-03-31,39186000000.0\n2012-06-30,35023000000.0\n2012-09-29,35966000000.0\n2012-12-29,54512000000.0\n2013-03-30,43603000000.0\n2013-06-29,35323000000.0\n2013-09-28,37472000000.0\n2013-12-28,57594000000.0\n2014-03-29,45646000000.0\n2014-06-28,37432000000.0\n2014-09-27,42123000000.0\n2014-12-27,74599000000.0\n2015-03-28,58010000000.0\n'
I get that the \n is some sort of line splitter, but it's not working like in the video. I've googled for possible solutions, such as doing a for loop, using read().split(), but at best they simply remove the \n. I can't get the output into a table like in the video. What am I doing wrong?
.read() gives you back a byte string; when you print it directly, you get the result you saw. Notice the b before the opening quote: it indicates a byte string.
You should decode the bytes before printing (or directly when calling .read()). An example:
import urllib
from urllib.request import urlopen
def grabQuandl(ticker):
    endLink = 'sort_order=desc'#without authtoken
    try:
        salesRev = urllib.request.urlopen('https://www.quandl.com/api/v1/datasets/SEC/'+ticker+'_SALESREVENUENET_Q.csv?&'+endLink).read().decode('utf-8')
        print (salesRev)
    except Exception as e:
        print ('failed the main quandl loop for reason of', str(e))
grabQuandl('AAPL')
The above decodes the returned data as UTF-8; use whichever encoding the data is actually in.
Example to show the print behavior -
>>> s = b'asd\nbcd\n'
>>> print(s)
b'asd\nbcd\n'
>>> print(s.decode('utf-8'))
asd
bcd
>>> type(s)
<class 'bytes'>
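Once the bytes are decoded, the text can be split into rows and columns with the standard csv module. A minimal sketch, assuming the response looks like the first lines of the output shown in the question; the hard-coded raw value stands in for the result of .read(), so no network access is needed:

```python
import csv
import io

# Stand-in for the bytes returned by urlopen(...).read().
raw = b'Date,Value\n2009-06-27,8337000000.0\n2009-12-26,15683000000.0\n'

# Decode first, then let csv.reader split lines and fields.
text = raw.decode('utf-8')
rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

print(header)
for date, value in records:
    # Values arrive as strings; convert them where numbers are needed.
    print(date, float(value))
```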
As I saw, when we run
from xml.dom.minidom import parse
myXML = parse('anything.xml')
in a Python script, it loads the contents of "anything.xml", until you leave the script or Ctrl+D your Python session.
Is it possible to add attribute values to this loaded version of the XML in Python?
The parse method returns you an instance of xml.dom.minidom.Document, on which you can invoke the plethora of methods listed in the documentation of xml.dom. Here's a small example:
import xml.dom.minidom
d = xml.dom.minidom.parseString('<head>hello</head>')
d.getElementsByTagName('head')[0].setAttribute('joe', '2')
print d.toxml()
This adds a joe="2" attribute to the head tag:
<?xml version="1.0" ?><head joe="2">hello</head>
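Changes made this way live only in the in-memory Document; the file on disk is untouched until the document is serialized again. A minimal Python 3 sketch, assuming an in-memory buffer stands in for the output file:

```python
import io
import xml.dom.minidom

doc = xml.dom.minidom.parseString('<head>hello</head>')
head = doc.getElementsByTagName('head')[0]
head.setAttribute('joe', '2')       # add a new attribute
head.setAttribute('joe', '3')       # setting it again overwrites the value

# writexml accepts any object with a write() method, e.g. an open file.
buffer = io.StringIO()
doc.writexml(buffer)
serialized = buffer.getvalue()
print(serialized)
```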
Ok, so I am trying to write a Python script for XCHAT that will allow me to type "/hookcommand filename" and then will print that file line by line into my irc buffer.
EDIT: Here is what I have now
__module_name__ = "scroll.py"
__module_version__ = "1.0"
__module_description__ = "script to scroll contents of txt file on irc"
import xchat, random, os, glob, string
def gg(ascii):
    # raw string, so that "\a" is not read as the BEL escape
    ascii = glob.glob(r"F:\irc\as\*.txt")
    for textfile in ascii:
        f = open(textfile, 'r')
def gg_cb(word, word_eol, userdata):
    ascii = gg(word[0])
    xchat.command("msg %s %s"%(xchat.get_info('channel'), ascii))
    return xchat.EAT_ALL
xchat.hook_command("gg", gg_cb, help="/gg filename to use")
Well, your first problem is that you're referring to a variable ascii before you define it:
ascii = gg(ascii)
Try making that:
ascii = gg(word[0])
Next, you're opening each file returned by glob... only to do absolutely nothing with them. I'm not going to give you the code for this: please try to work out what it's doing or not doing for yourself. One tip: the xchat interface is an extra complication. Try to get it working in plain Python first, then connect it to xchat.
There may well be other problems - I don't know the xchat api.
When you say "not working", try to specify exactly how it's not working. Is there an error message? Does it do the wrong thing? What have you tried?
I'm trying to query a database, then convert the file-like object it returns to an XML document. Here's what I've been doing:
>>> import urllib, xml.dom.minidom
>>> query = "http://sbol.bhi.washington.edu/openrdf-sesame/repositories/sbol_test?query=select%20distinct%20%3Fname%20%3Ffeaturename%20where%20%7B%3Fpart%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23annotation%3E%20%3Fannotation%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23status%3E%20'Available'%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Fname.%3Fannotation%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23feature%3E%20%3Ffeature.%3Ffeature%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E%20%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23binding%3E%3B%3Chttp%3A%2F%2Fsbol.bhi.washington.edu%2Frdf%2Fsbol.owl%23name%3E%20%3Ffeaturename%7D"
>>> raw_result = urllib.urlopen(query)
>>> xml_result = xml.dom.minidom.parse(raw_result)
That last command gives me
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 4
Almost the same thing happens if I use xml.etree.ElementTree to do the parsing. I think they both use Expat. The weird part is, if instead of loading the file in python I just paste the query into Firefox, the resulting file can be read in perfectly well using open(path_to_file, "r").
Any ideas what this could be?
UPDATE:
This is the first line of the file:
<?xml version='1.0' encoding='UTF-8'?>
However, that may not be what's in raw_result; it's what you get after downloading query-result.srx and changing the extension to .txt. The file extension doesn't matter, does it? Also, I'm pretty new to this whole XML thing: why is column 4 the 8th character?
Your server is picky about the accept header in deciding what to send back and in which format. The following should work:
In [265]: import urllib2; from xml.dom import minidom
In [266]: req = urllib2.Request(query, headers={'Accept':'application/xml'})
In [267]: rsp = urllib2.urlopen(req)
In [268]: xml = minidom.parse(rsp)
In [269]: xml.toxml()[:64]
Out[269]: u'<?xml version="1.0" ?><sparql xmlns="http://www.w3.org/2005/spar'
Note the accept header in urllib2.Request.
Any chance you could post the XML snippet? The parser is indicating that the error is happening at the very first line. My guess is the formatting is off or reporting incorrectly, which is causing EXPAT to pitch an exception right off the bat.
My guess is that the first line violates the well-formedness rules for XML anyway. For reference, you might compare against http://en.wikipedia.org/wiki/XML
Looks like something is wrong with your XML file, right about line 1, column 4.
I tried this, and what I got doesn't look like XML to me. Here are the first eight characters, as Alex suggested:
>>> raw_result.read(8)
'BRTR\x00\x00\x00\x03'
It seems that the RDF server is delivering plain text to your urllib.urlopen call.
With the right request header you should be able to get the XML response:
Accept: application/sparql-results+xml, */*;q=0.5
See the openRDF protocol specification for details; openRDF supports more than one result format.
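A minimal Python 3 sketch of sending that header with urllib.request, the successor of urllib2. The function names build_request and fetch_xml and the example.org URL are hypothetical; the header value is the one quoted above. Only the request object is built here, so nothing is sent over the network:

```python
import urllib.request
from xml.dom import minidom

# Header value taken from the answer above.
SPARQL_XML = 'application/sparql-results+xml, */*;q=0.5'


def build_request(url):
    """Build a request that asks the server for SPARQL results as XML."""
    return urllib.request.Request(url, headers={'Accept': SPARQL_XML})


def fetch_xml(url):
    """Send the request and parse the response body (needs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return minidom.parse(response)


# Placeholder URL for illustration only; no request is sent here.
request = build_request('http://example.org/repositories/demo?query=...')
print(request.get_header('Accept'))
```

Calling fetch_xml(query) with the real query URL would then return a minidom Document, provided the server honours the header.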