How to remove xml header in beautifulsoup?

I have imported and modified some XML, but when I write it out using test.prettify(), the top line of the XML changes from
<?xml version="1.0"?>
to
<?xml version="1.0" encoding="utf-8"?>
I don't want this change. How can I just keep the first line unchanged? What is the easiest way to do this?
If it matters, I'm using the xml parser.
soup = BeautifulSoup(r.text,'xml')

I'm sure there's a more elegant way to do this using BeautifulSoup's built-ins, but based on your comment, I'll give you the "strip it out" version:
xml_string = '<?xml version="1.0" encoding="utf-8"?>'
print(xml_string[:xml_string.find("encoding") - 1] + "?>")
This is general enough to strip out any encoding from the header (not just utf-8).
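A slightly more defensive variant of the same "strip it out" idea, as a sketch only: it uses a regular expression so that other parts of the declaration (such as standalone) survive, and it leaves text without a declaration untouched. The helper name strip_declaration_encoding is made up for illustration; you would apply it to the string returned by prettify().

```python
import re

# Matches a leading XML declaration up to and including its encoding part.
_DECL_ENCODING = re.compile(
    r'^(<\?xml\s[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2'
)


def strip_declaration_encoding(xml_text):
    """Return xml_text with encoding="..." removed from a leading declaration."""
    # Keep everything captured before the encoding part, drop the rest of the match.
    return _DECL_ENCODING.sub(r"\1", xml_text, count=1)


print(strip_declaration_encoding('<?xml version="1.0" encoding="utf-8"?>\n<a/>'))
```

The pattern is anchored at the start of the string, so it only touches a declaration on the first line.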

You could find the XML declaration and use replace_with() (spelled replaceWith() in older Beautiful Soup versions) to replace it with the value you want.

Related

ParseError: junk after document element: line 7, column 0, (Python, XML parsing)

I have a dummy xml file,
<?xml version="1.0" encoding="UTF-8"?>
<hello xmlns="abc">
<inside>
<ok>xyz</ok>
</inside>
</hello>
<?xml version="1.0" encoding="UTF-8"?>
<xyz xmlns="acxd">
</xyz>
<?xml version="1.0" encoding="UTF-8"?>
<zz xmlns="zmrt">
</zz>
]]>]]>
I am trying to parse this XML file using the following code:
import xml.etree.ElementTree as ET
mytree = ET.parse(temp_xml)
The error I am getting is "ParseError: junk after document element: line 7, column 0".
I tried removing ']]>]]>' (line 7), but I still get essentially the same error: "ParseError: junk after document element: line 8, column 0". Is there a way to deal with this error, or to skip the lines that contain junk data?

An XML document may have only a single root element. Yours has three and is therefore not well-formed. If you wish to parse it using XML tools, you'll first have to separate the root elements into their own documents, either manually or programmatically.
Note that an XML document also can have at most a single XML declaration (<?xml version="1.0" encoding="UTF-8"?>), and if it exists, it must be at the top of the file.
See also
Why must XML documents have a single root element?
How to parse invalid (bad / not well-formed) XML?
Are multiple XML declarations in a document well-formed XML?
Parse a xml file with multiple root element in python
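As a rough sketch of the "programmatically" route for a file small enough to read into memory: split the text just before each XML declaration and parse every piece on its own. The function name parse_concatenated is hypothetical, and the code assumes that ']]>]]>' is only a trailing delimiter in your data (it is removed blindly, which would be wrong if that sequence could legitimately appear inside the content). Splitting on a zero-width match needs Python 3.7 or later.

```python
import re
import xml.etree.ElementTree as ET


def parse_concatenated(text):
    """Parse a string holding several XML documents, one root element per document."""
    roots = []
    # Split right before every "<?xml " so each chunk starts with its own declaration.
    for chunk in re.split(r"(?=<\?xml\s)", text):
        # Assumption: "]]>]]>" is a delimiter, not real content.
        chunk = chunk.replace("]]>]]>", "").strip()
        if chunk:
            roots.append(ET.fromstring(chunk))
    return roots


with_three_docs = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<hello xmlns="abc"><ok>xyz</ok></hello>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<xyz xmlns="acxd"></xyz>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<zz xmlns="zmrt"></zz>\n]]>]]>\n'
)
for root in parse_concatenated(with_three_docs):
    print(root.tag)
```

Each returned item is the root Element of one of the embedded documents, so the usual find()/iter() calls work on it.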

Writing XML formatted text to output file in Python

I am having issues while writing the below XML to output file.
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
<root>
<document>
<sentences>
<sentence id="1">
<tokens>
<token id="1">
<word>
Pusheen
</word>
<CharacterOffsetBegin>
0
</CharacterOffsetBegin>
<CharacterOffsetEnd>
7
</CharacterOffsetEnd>
<POS>
NNP
</POS>
</token>
</tokens>
</sentence>
</sentences>
</document>
</root>
How to write this to output file in xml format? I tried using below write statement
tree.write(open('person.xml', 'w'), encoding='unicode')
But, I am getting the below error
AttributeError: 'str' object has no attribute 'write'
I don't have to build XML here as I already have the data in XML format. I just need it to write it to a XML file.

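Assuming that tree holds your XML, it is a string, so you probably want something like:
with open("person.xml", "w", encoding="utf-8") as outfile:
    outfile.write(tree)
(It is good practice to use with for files; it closes them automatically afterwards.)
Note that encoding="unicode" is accepted only by ElementTree's own write() and tostring(), not by open(), so use a real codec such as "utf-8" here.
The error occurs because tree is a str, and strings have no write() method; that method belongs to ElementTree objects and file objects.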

I recommend using the lxml module to check the format first and then write it to a file. I notice that you've got two elements with the same id, which caught my eye. It doesn't flag an error in XML, but it could cause trouble on an HTML page, where each id is supposed to be unique.
Here's the simple code to do what I described above:
from lxml import etree

try:
    # Checks that the XML is well-formed; returns an Element if it is
    root = etree.fromstring(your_xml_data)
    if root is not None:
        tree = etree.ElementTree(root)  # wrap the Element in an ElementTree
        tree.write('person.xml')  # the ElementTree provides write()
except etree.XMLSyntaxError as exc:
    print('Oops!', exc)
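If you would rather not install lxml, roughly the same check can be sketched with the standard library alone. This is only an illustration of the idea: the function name write_if_well_formed is made up, and it assumes the XML is a str that should be saved as UTF-8.

```python
import xml.etree.ElementTree as ET


def write_if_well_formed(xml_text, path):
    """Write xml_text to path only if it parses; return True on success."""
    try:
        ET.fromstring(xml_text)  # raises ET.ParseError on malformed input
    except ET.ParseError as exc:
        print(f"Not well-formed, nothing written: {exc}")
        return False
    # The original text is written unchanged, so its layout is preserved.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(xml_text)
    return True
```

Because the string itself is written rather than a re-serialized tree, the declaration and the xml-stylesheet line from the question are kept exactly as they were.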

How do I get Python's ElementTree to pretty print to an XML file?

Background
I am using SQLite to access a database and retrieve the desired information. I'm using ElementTree in Python version 2.6 to create an XML file with that information.
Code
import sqlite3
import xml.etree.ElementTree as ET
# NOTE: Omitted code where I access the database,
# pull data, and add elements to the tree
tree = ET.ElementTree(root)
# Pretty printing to Python shell for testing purposes
from xml.dom import minidom
print minidom.parseString(ET.tostring(root)).toprettyxml(indent = " ")
####### Here lies my problem #######
tree.write("New_Database.xml")
Attempts
I've tried using tree.write("New_Database.xml", "utf-8") in place of the last line of code above, but it did not edit the XML's layout at all - it's still a jumbled mess.
I also decided to fiddle around and tried doing:
tree = minidom.parseString(ET.tostring(root)).toprettyxml(indent = " ") instead of printing this to the Python shell, which gives the error AttributeError: 'unicode' object has no attribute 'write'.
Questions
When I write my tree to an XML file on the last line, is there a way to pretty print to the XML file as it does to the Python shell?
Can I use toprettyxml() here or is there a different way to do this?

Whatever your XML string is, you can write it to the file of your choice by opening a file for writing and writing the string to the file.
import xml.etree.ElementTree as ET
from xml.dom import minidom

xmlstr = minidom.parseString(ET.tostring(root)).toprettyxml(indent="  ")
with open("New_Database.xml", "w") as f:
    f.write(xmlstr)
There is one possible complication, especially in Python 2, which is both less strict and less sophisticated about Unicode characters in strings. If toprettyxml() hands back a Unicode string (u"something"), you may want to encode it to a suitable file encoding, such as UTF-8. For example, replace the write line with:
f.write(xmlstr.encode('utf-8'))
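In Python 3 another way around the encoding question, sketched here under the assumption that UTF-8 output is what you want, is to pass encoding to toprettyxml(). It then returns bytes whose declaration names that encoding, and the file is opened in binary mode. The helper name write_pretty is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def write_pretty(root, path):
    """Pretty print an ElementTree Element to path as UTF-8 bytes."""
    dom = minidom.parseString(ET.tostring(root))
    # With encoding given, toprettyxml() returns bytes instead of str.
    pretty_bytes = dom.toprettyxml(indent="  ", encoding="utf-8")
    with open(path, "wb") as f:
        f.write(pretty_bytes)


root = ET.Element("root")
ET.SubElement(root, "item").text = "caf\u00e9"  # non-ASCII text on purpose
```

Calling write_pretty(root, "New_Database.xml") would then produce the indented file without any manual encode() step.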

I simply solved it with the indent() function:
xml.etree.ElementTree.indent(tree, space="  ", level=0)
Appends whitespace to the subtree to indent the tree visually. This can be used to generate pretty-printed XML output. tree can be an Element or ElementTree. space is the whitespace string that will be inserted for each indentation level, two space characters by default. For indenting partial subtrees inside of an already indented tree, pass the initial indentation level as level.
tree = ET.ElementTree(root)
ET.indent(tree, space="\t", level=0)
tree.write(file_name, encoding="utf-8")
Note, the indent() function was added in Python 3.9.
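To see what indent() actually does to the tree, here is a small self-contained sketch (Python 3.9+). The element names are invented for the demonstration; the point is that indent() edits the text and tail whitespace in place and returns nothing.

```python
import xml.etree.ElementTree as ET

# A tree with no whitespace between elements at all.
root = ET.fromstring("<root><data version='1'><value>76939</value></data></root>")

ET.indent(root, space="  ")  # modifies root in place

# encoding="unicode" makes tostring() return a str rather than bytes.
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Since the tree itself is modified, any later tree.write() or tostring() call picks up the indentation automatically.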

I found a way using straight ElementTree, but it is rather complex.
ElementTree has functions that edit the text and tail of elements, for example, element.text="text" and element.tail="tail". You have to use these in a specific way to get things to line up, so make sure you know your escape characters.
As a basic example:
I have the following file:
<?xml version='1.0' encoding='utf-8'?>
<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data>
</root>
To place a third element in and keep it pretty, you need the following code:
addElement = ET.Element("data") # Make a new element
addElement.set("version", "3") # Set the element's attribute
addElement.tail = "\n" # Edit the element's tail
addElement.text = "\n\t\t" # Edit the element's text
newData = ET.SubElement(addElement, "data") # Make a subelement and attach it to our element
newData.tail = "\n\t" # Edit the subelement's tail
newData.text = "5431" # Edit the subelement's text
root[-1].tail = "\n\t" # Edit the previous element's tail, so that our new element is properly placed
root.append(addElement) # Add the element to the tree.
To indent the internal tags (like the internal data tag), you have to add it to the text of the parent element. If you want to indent anything after an element (usually after subelements), you put it in the tail.
This code gives the following result when you write it to a file:
<?xml version='1.0' encoding='utf-8'?>
<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data> <!--root[-1].tail-->
    <data version="3"> <!--addElement's text-->
        <data>5431</data> <!--newData's tail-->
    </data> <!--addElement's tail-->
</root>
As another note, if you want the file to use \t uniformly, you may want to read the file as a string first and replace all of the indentation spaces with \t.
This code was made in Python3.7, but still works in Python2.7.

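Riffing on Ben Anderson's answer, here it is as a function:
def _pretty_print(current, parent=None, index=-1, depth=0):
    # Recurse into the children first, one level deeper
    for i, node in enumerate(current):
        _pretty_print(node, current, i, depth + 1)
    if parent is not None:
        if index == 0:
            # First child: the indentation lives in the parent's text
            parent.text = '\n' + ('\t' * depth)
        else:
            # Later children: it lives in the previous sibling's tail
            parent[index - 1].tail = '\n' + ('\t' * depth)
        if index == len(parent) - 1:
            # Last child: dedent before the parent's closing tag
            current.tail = '\n' + ('\t' * (depth - 1))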
So running the test on unpretty data:
import xml.etree.ElementTree as ET
root = ET.fromstring('''<?xml version='1.0' encoding='utf-8'?>
<root>
<data version="1"><data>76939</data>
</data><data version="2">
<data>266720</data><newdata>3569</newdata>
</data> <!--root[-1].tail-->
<data version="3"> <!--addElement's text-->
<data>5431</data> <!--newData's tail-->
</data> <!--addElement's tail-->
</root>
''')
_pretty_print(root)
tree = ET.ElementTree(root)
tree.write("pretty.xml")
with open("pretty.xml", 'r') as f:
    print(f.read())
We get:
<root>
    <data version="1">
        <data>76939</data>
    </data>
    <data version="2">
        <data>266720</data>
        <newdata>3569</newdata>
    </data>
    <data version="3">
        <data>5431</data>
    </data>
</root>

Install bs4 (the "xml" parser also requires lxml):
pip install bs4 lxml
Use this code to pretty print:
from bs4 import BeautifulSoup
x = "<root><item>text</item></root>"  # replace with your XML string
print(BeautifulSoup(x, "xml").prettify())

If one wants to use lxml, it could be done in the following way:
from lxml import etree
xml_object = etree.tostring(root,
                            pretty_print=True,
                            xml_declaration=True,
                            encoding='UTF-8')
# tostring() returns bytes here, so open the file in binary mode
with open("xmlfile.xml", "wb") as writer:
    writer.write(xml_object)
If you see XML namespace annotations, e.g. py:pytype="TREE", you might want to add the following call before creating xml_object:
etree.cleanup_namespaces(root)
This should be sufficient for any adaptation in your code.

One liner(*) to read, parse (once) and pretty print XML from file named fname:
from xml.dom import minidom
print(minidom.parseString(open(fname).read()).toprettyxml(indent=" "))
(* not counting import)

Using pure ElementTree and Python 3.9+:
import xml.etree.ElementTree as ET

def prettyPrint(element):
    encoding = 'UTF-8'
    # Create a copy of the input element: convert to string, then parse again
    copy = ET.fromstring(ET.tostring(element))
    # Format the copy. This needs Python 3.9+
    ET.indent(copy, space="  ", level=0)
    # tostring() returns bytes, so decode it to get a str
    return ET.tostring(copy, encoding=encoding).decode(encoding)
If you need a file, replace the last line with ET.ElementTree(copy).write(...) to avoid the extra overhead; copy is an Element, and only ElementTree objects have write().

Modify XML declaration with python

I have an XML document for which I need to add a couple of things to the XML declaration using minidom. The declaration looks like this:
<?xml version="1.0"?>
And I need it to look like this:
<?xml version="1.0" encoding="UTF-16" standalone="no"?>
I know how to change or add attributes using minidom, which will not work here.
What is the easiest way of doing this? For reference, I am running python 3.3.3.

I'm not sure if this can be done with minidom. But you could try lxml.
from lxml import etree
tree = etree.parse("test.xml")
string = etree.tostring(tree.getroot(), pretty_print = True, xml_declaration = True, standalone = False, encoding = "UTF-16")
with open("test2.xml", "wb") as f:
f.write(string)
More or less taken from here.
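For what it's worth, newer Python versions can also do this with minidom itself. The sketch below assumes Python 3.9 or later, where toxml() accepts a standalone argument in addition to encoding, so it would not run on the 3.3.3 mentioned in the question. The function name with_full_declaration is made up for illustration.

```python
from xml.dom import minidom


def with_full_declaration(xml_text):
    """Re-serialize xml_text with encoding and standalone in the declaration."""
    doc = minidom.parseString(xml_text)
    # Passing encoding makes toxml() return bytes; standalone needs Python 3.9+.
    return doc.toxml(encoding="UTF-16", standalone=False)


data = with_full_declaration('<?xml version="1.0"?><root><item>1</item></root>')
print(data.decode("utf-16"))
```

The result is bytes, so it would be written with open("test2.xml", "wb"), just as in the lxml version.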

Parsing large combined XML document with Python

I have one large document (400 mb), which contains hundreds of XML documents, each with their own declarations. I am trying to parse each document using ElementTree in Python. I am having a lot of trouble with splitting each XML document in order to parse out the information. Here is an example of what the document looks like:
<?xml version="1.0"?>
<data>
<more>
<p></p>
</more>
</data>
<?xml version="1.0"?>
<different data>
<etc>
<p></p>
</etc>
</different data>
<?xml version="1.0"?>
<continues.....>
Ideally I would like to read through each XML declaration, parse the data, and continue on with the next XML document. Any suggestions will help.

You'll need to read in the documents separately; here is a generator function that'll yield complete XML documents from a given file object:
def xml_documents(fileobj):
    document = []
    for line in fileobj:
        # A new XML declaration marks the start of the next document
        if line.strip().startswith('<?xml') and document:
            yield ''.join(document)
            document = []
        document.append(line)
    if document:
        yield ''.join(document)
Then use ElementTree.fromstring() to load and parse these:
from xml.etree import ElementTree

with open('file_with_multiple_xmldocuments') as fileobj:
    for xml in xml_documents(fileobj):
        tree = ElementTree.fromstring(xml)  # root Element of this document
