Saving tables from html document

Saving tables from html document - python

I tried to save tables from a document as a file under a directory as follows:
for table in tables:
tableString = html.tostring(table)
fileref=open('c:\\Users\\ahn_133\\Desktop\\appleTables\\Apple-' + str(count) + '.htm', 'w')
fileref.write(tableString)
fileref.close()
count+=1
But, I keep getting an error as follows:
Traceback (most recent call last):
File "<pyshell#27>", line 4, in <module>
fileref.write(tableString)
TypeError: must be str, not bytes
I am using Python 3.3 and installed lxml-3.0.1.win32-py3.3.‌exe
How can I fix this error?

lxml's tostring method returns a bytestring (bytes), because it is already encoded. This is necessary because the XML/HTML document can specify its own encoding, and that better be right!
Simply open the file in binary mode:
for table in tables:
tableString = html.tostring(table)
filename = r'c:\Users\ahn_133\Desktop\appleTables\Apple-' +str(count)+ '.htm'
with open(filename, 'wb') as fileref:
# ^
fileref.write(tableString)
count+=1

Related

Python Minidom Parsing File Objects

I wrote a code using minidom which takes an xml script, opens it as a file object and then parses that file object. Not only that, but I want the script to open multiple files that are all contained in a folder, and parse each one individually.
An example of the xml script is:
<?xml version="1.0"?>
<Data>
<data1>1</data1>
<data2>2</data2>
<data3>3</data3>
<Sub_data>
<sub_data1>0.1111111111111</sub_data1>
<sub_data2>0.2222222222222</sub_data2>
... and so on.
i.e., it's pretty standard.
Now, my code looks like this:
import os
import io
from xml.dom import minidom
#folder where xml files are located
indir = '/foo/bar/docs/'
masterlist = []
for root, dirs, filenames in os.walk(indir):
for f in filenames:
row = []
fsock = io.open(indir + f, mode = 'rt', encoding = 'cp1252')
xmldoc = minidom.parse(fsock)
...
and the error I am getting is:
Traceback (most recent call last): File "kgp_2.py", line 34, in
<module> xmldoc = minidom.parse(fsock) File
"/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse return
expatbuilder.parse(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file) File
"/usr/lib/python2.7/xml/dom/expatbuilder.py", line 211, in parseFile
parser.Parse("", True) xml.parsers.expat.ExpatError: no element found:
line 203, column 1381
Now, when I make the change:
fsock = io.open(indir + filenames[0], mode = 'rt', encoding = 'cp1252')
this works fine, that is, it opens the first file in the folder; but I want to parse all the files in the folder. When I do a loop like:
m = 0
... in loop:
fsock = io.open(indir + filenames[m], mode = 'rt', encoding = 'cp1252')
...
m = m+1
I get the original error.
The reason I am using the io library instead of the usual file open function is that a previous stack overflow article recommended it. Using:
fsock = open(indir + filenames[0])
like before, gets no error, but:
fsock = open(indir + f)
or
#with a loop over m, like above
fsock = open(infir + filenames[m])
get the same error as above.
A strange problem. When I print the filenames they are correct. And they are being opened, there's no error there. It's the parser that just won't parse the object files, even with filenames[m] where m = 0, surely this should be no problem?
EDIT:
Parsing document with python minidom
in this post they had a similar problem, the resolution was to use
xmldoc.seek(0)
however, for me this returns
Traceback (most recent call last):
File "kgp_2.py", line 45, in <module>
xmldoc.seek(0)
AttributeError: Document instance has no attribute 'seek'
EDIT 2: THIS HAS BEEN RESOLVED. IT WAS A CASE OF A CORRUPTED INPUT XML FILE.

Are you sure the XML data contained in all XML files is correct? Perhaps one is empty an you have to handle such Exception. Anyhow I recommend you to use xml.etree doc.

save base64 image python

I am trying to save an image with python that is Base64 encoded. Here the string is to large to post but here is the image
And when received by python the last 2 characters are == although the string is not formatted so I do this
import base64
data = "data:image/png;base64," + photo_base64.replace(" ", "+")
And then I do this
imgdata = base64.b64decode(data)
filename = 'some_image.jpg' # I assume you have a way of picking unique filenames
with open(filename, 'wb') as f:
f.write(imgdata)
But this causes this error
Traceback (most recent call last):
File "/var/www/cgi-bin/save_info.py", line 83, in <module>
imgdata = base64.b64decode(data)
File "/usr/lib64/python2.7/base64.py", line 76, in b64decode
raise TypeError(msg)
TypeError: Incorrect padding
I also printed out the length of the string once the data:image/png;base64, has been added and the spaces replace with + and it has a length of 34354, I have tried a bunch of different images but all of them when I try to open the saved file say that the file is damaged.
What is happening and why is the file corrupt?
Thanks
EDIT
Here is some base64 that also failed
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAADBQTFRFA6b1q Ci5/f2lt/9yu3 Y8v2cMpb1/DSJbz5i9R2NLwfLrWbw m T8I8////////SvMAbAAAABB0Uk5T////////////////////AOAjXRkAAACYSURBVHjaLI8JDgMgCAQ5BVG3//9t0XYTE2Y5BPq0IGpwtxtTP4G5IFNMnmEKuCopPKUN8VTNpEylNgmCxjZa2c1kafpHSvMkX6sWe7PTkwRX1dY7gdyMRHZdZ98CF6NZT2ecMVaL9tmzTtMYcwbP y3XeTgZkF5s1OSHwRzo1fkILgWC5R0X4BHYu7t/136wO71DbvwVYADUkQegpokSjwAAAABJRU5ErkJggg==
This is what I receive in my python script from the POST Request
Note I have not replace the spaces with +'s

There is no need to add data:image/png;base64, before, I tried using the code below, it works fine.
import base64
data = 'iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAADBQTFRFA6b1q Ci5/f2lt/9yu3 Y8v2cMpb1/DSJbz5i9R2NLwfLrWbw m T8I8////////SvMAbAAAABB0Uk5T////////////////////AOAjXRkAAACYSURBVHjaLI8JDgMgCAQ5BVG3//9t0XYTE2Y5BPq0IGpwtxtTP4G5IFNMnmEKuCopPKUN8VTNpEylNgmCxjZa2c1kafpHSvMkX6sWe7PTkwRX1dY7gdyMRHZdZ98CF6NZT2ecMVaL9tmzTtMYcwbP y3XeTgZkF5s1OSHwRzo1fkILgWC5R0X4BHYu7t/136wO71DbvwVYADUkQegpokSjwAAAABJRU5ErkJggg=='.replace(' ', '+')
imgdata = base64.b64decode(data)
filename = 'some_image.jpg' # I assume you have a way of picking unique filenames
with open(filename, 'wb') as f:
f.write(imgdata)

If you append data:image/png;base64, to data, then you get error. If You have this, you must replace it.
new_data = initial_data.replace('data:image/png;base64,', '')

lxml not performing xslt transform

With this code:
from lxml import etree
with open( 'C:\\Python33\\projects\\xslt', 'r' ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+' ) as result, open( 'C:\\Python33\\projects\\xml', 'r' ) as xml:
s_xml = xml.read()
s_xslt = xslt.read()
transform = etree.XSLT(etree.XML(s_xslt))
out = transform(etree.XML(s_xml))
result.write(out)
I get this error:
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
from projects.xslt_transform import trans
File ".\projects\xslt_transform.py", line 17, in <module>
transform = etree.XSLT(etree.XML(s_xslt))
File "xslt.pxi", line 409, in lxml.etree.XSLT.__init__ (src\lxml\lxml.etree.c:150256)
lxml.etree.XSLTParseError: Invalid expression
this couple xml/xslt files works with other tools.
Also I had to get rid of the encoding attribute in the top declarations for both files in order not to get:
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
can it be related ?
EDIT:
this does not work either (i get the same error):
with open( 'C:\\Python33\\projects\\xslt', 'r',encoding="utf-8" ) as xslt, open( 'C:\\Python33\\projects\\result', 'a+',encoding="utf-8" ) as result, open( 'C:\\Python33\\projects\\xml', 'r',encoding="utf-8" ) as xml:
s_xml = etree.parse(BytesIO(bytes(xml.read(),'UTF-8')))
s_xslt = etree.parse(BytesIO(bytes(xslt.read(),'UTF-8')))
transform = etree.XSLT(s_xslt)
out = transform(s_xml)
print(out.tostring())
reading lxml source code: this returns an exception:
xslt.xsltParseStylesheetDoc(c_doc)
so it seems an actual parse error. Can it be namespace related ?
EDIT SOLVED:
s_xml = etree.parse(xml.read())
s_xslt = etree.parse(xslt.read())
thanks tomalak

Parsing XML is more complicated than "open a text file, stuff the resulting string into etree".
XML files are serialized representations of a DOM tree. They are not to be handled as text even though they come in the shape of a text file. They come in multiple byte encodings and finding out which encoding a certain file uses is anything but trivial.
XML parsers have proper detection mechanisms built in and therefore they should be used to open XML files. The the basic open() + read() calls are not enough to correctly handle the file contents.
lxml.etree provides the parse() function that can accept a number of argument types:
an open file object (make sure to open it in binary mode)
a file-like object that has a .read(byte_count) method returning a byte string on each call
a filename string
an HTTP or FTP URL string
and then will correctly parse the associated document back into a DOM tree.
Your code should look more like this:
from lxml import etree
f_xsl = 'C:\\Python33\\projects\\xslt'
f_xml = 'C:\\Python33\\projects\\xml'
f_out = 'C:\\Python33\\projects\\result'
transform = etree.XSLT(etree.parse(f_xsl))
result = transform(etree.parse(f_xml))
result.write(f_out)

pypdf for lists of pdfs

I have got pypdf to work just fine for a single pdf file, but I can not seem to get it to work for a lits of files, or in a for loop for multiple pdfs, without failing because of the string not being callable. Any ideas I can use as a work around?
def getPDFContent(path):
content = ""
# Load PDF into pyPDF
pdf = pyPdf.PdfFileReader(file(path, "rb"))
# Iterate pages
for i in range(0, pdf.getNumPages()):
# Extract text from page and add to content
content += pdf.getPage(i).extractText() + "\n"
# Collapse whitespace
content = " ".join(content.replace(u"\xa0", " ").strip().split())
return content
#print getPDFContent(r"Z:\GIS\MasterPermits\12300983.pdf").encode("ascii", "ignore")
#find pdfs
for root, dirs, files in os.walk(folder1):
for file in files:
if file.endswith(('.pdf')):
d=os.path.join(root, file)
print getPDFContent(d).encode("ascii", "ignore")
Traceback (most recent call last):
File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 50, in <module>
print getPDFContent(d).encode("ascii", "ignore")
File "C:\Documents and Settings\dknight\Desktop\readpdf.py", line 32, in getPDFContent
pdf = pyPdf.PdfFileReader(file(path, "rb"))
TypeError: 'str' object is not callable
I was using a list, but I got the exact same error, I didnt think this would be a big deal, but as of right now it is becoming one. I know I was able to work around similar issues in arcpy, but this is nothing close

Try not to use built-in types for variable names:
Don't do this:
for file in files:
Do this instead:
for myfile in files:

XML (.xsd) feed validation against a schema

I have a XML file and I have a XML schema. I want to validate the file against that schema and check if it adheres to that. I am using python but am open to any language for that matter if there is no such useful library in python.
What would be my best options here? I would worry about the how fast I can get this up and running.

Definitely lxml.
Define an XMLParser with a predefined schema, load the the file fromstring() and catch any XML Schema errors:
from lxml import etree
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'r') as f:
etree.fromstring(f.read(), xmlparser)
return True
except etree.XMLSchemaError:
return False
schema_file = 'schema.xsd'
with open(schema_file, 'r') as f:
schema_root = etree.XML(f.read())
schema = etree.XMLSchema(schema_root)
xmlparser = etree.XMLParser(schema=schema)
filenames = ['input1.xml', 'input2.xml', 'input3.xml']
for filename in filenames:
if validate(xmlparser, filename):
print("%s validates" % filename)
else:
print("%s doesn't validate" % filename)
Note about encoding
If the schema file contains an xml tag with an encoding (e.g. <?xml version="1.0" encoding="UTF-8"?>), the code above will generate the following error:
Traceback (most recent call last):
File "<input>", line 2, in <module>
schema_root = etree.XML(f.read())
File "src/lxml/etree.pyx", line 3192, in lxml.etree.XML
File "src/lxml/parser.pxi", line 1872, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
A solution is to open the files in byte mode: open(..., 'rb')
[...]
def validate(xmlparser, xmlfilename):
try:
with open(xmlfilename, 'rb') as f:
[...]
with open(schema_file, 'rb') as f:
[...]

The python snippet is good, but an alternative is to use xmllint:
xmllint -schema sample.xsd --noout sample.xml

import xmlschema
def get_validation_errors(xml_file, xsd_file):
schema = xmlschema.XMLSchema(xsd_file)
validation_error_iterator = schema.iter_errors(xml_file)
errors = list()
for idx, validation_error in enumerate(validation_error_iterator, start=1):
err = validation_error.__str__()
errors.append(err)
print(err)
return errors

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Saving tables from html document - python

Related

Python Minidom Parsing File Objects

save base64 image python

lxml not performing xslt transform

pypdf for lists of pdfs

XML (.xsd) feed validation against a schema

Categories

Resources