How to add a newline to a long multi-line string? - python

I have a string like this :
my_xml = '''
<?xml version="1.0" encoding="utf-8" ?> \n
<entries> \n
<entry>1</entry> \n
<entry>2</entry> \n
</entries>
'''
return HttpResponse(my_xml)
This is my output it was printed without new line and without xml tags:
1 2
How would we add a newline in Python and make browser interpreter xml tags and print them?

You are looking at the output in a browser; HTML browsers consider newlines whitespace, collapsing successive whitespace characters to form one space.
Your browser is interpreting the response as HTML-formatted data, use <br/> tags instead:
s = """ this is line 1<br/>
this is line 2<br/>
and other lines ..."""
If you expected to see just newlines, look at the response source code instead; the newlines are there.
If you wanted to see XML output in a browser, you should set a content type header (text/xml) so the browser knows you are sending XML instead of HTML:
return HttpResponse(s, content_type='text/xml') # Assumes you are using Django
Your browser will use a default stylesheet to display XML data (usually as a tree with collapsible sections). You can use a XML stylesheet (XSLT) to override that behaviour. Add a stylesheet header:
<?xml-stylesheet type="text/xsl" href="script.xsl" ?>
the browser will fetch the named stylesheet and apply it to your XML.

Related

Ahref html issue, html tags not showing

I don't seem to be having much luck solving this issue, i am pulling data from an xml url that looks like:
<?xml version="1.0" encoding="utf-8"?>
<task>
<tasks>
<taskId>46</taskId>
<taskUserId>4</taskUserId>
<taskName>test</taskName>
<taskMode>2</taskMode>
<taskSite>thetest.org</taskSite>
<taskUser>NULL</taskUser>
<taskPass>NULL</taskPass>
<taskEmail>NULL</taskEmail>
<taskUrl>https://www.thetest.com/</taskUrl>
<taskTitle>test</taskTitle>
<taskBody>This is a test using html tags.</taskBody>
<taskCredentials>...</taskProxy>
</tasks>
</task>
This part is where i'm having issues:
<taskBody>This is a test using html tags.</taskBody>
I pull the data using BeautifulSoup like:
# beautifulsoup setup
soup = BeautifulSoup(projects.text, 'xml')
xml_task_id = soup.find('taskId')
xml_task_user_id = soup.find('taskUserId')
xml_task_name = soup.find('taskName')
xml_mode = soup.find('taskMode')
xml_site_name = soup.find('taskSite')
xml_username = soup.find('taskUser')
xml_password = soup.find('taskPass')
xml_email = soup.find('taskEmail')
xml_url = soup.find('taskUrl')
xml_content_title = soup.find('taskTitle')
xml_content_body = soup.find('taskBody')
xml_credentials = soup.find('taskCredentials')
xml_proxy = soup.find('taskProxy')
print(xml_content_body.get_text())
When i print out this part, it prints like: This is a test using html tags.
Instead of showing the ahref tag in full like: This is a test
I literally need the full string printed as is, but it keeps executing the html code instead of printing the string.
Any help would be appreciated.
Python doesn't "execute" HTML code. I'm guessing you're viewing it in a web browser, and that's interpreting the <a> tags just like it's designed to.
Use the html.escape method to turn all tags into escape sequences (with > and < and the like), which stops the browser from interpreting them.
For Phython - Escape HTML Tags: https://wiki.python.org/moin/EscapingHtml
For PHP - Escape HTMl Tags:
https://www.w3schools.com/php/showphp.asp?filename=demo_func_string_strip_tags2

How to deal with xmlns values while parsing an XML file?

I have the following toy example of an XML file. I have thousands of these. I have difficulty parsing this file.
Look at the text in second line. All my original files contain this text. When I delete i:type="Record" xmlns="http://schemas.datacontract.org/Storage" from second line (retaining the remaining text), I am able to get accelx and accely values using the code given below.
How can I parse this file with the original text?
<?xml version="1.0" encoding="utf-8"?>
<ArrayOfRecord xmlns:i="http://www.w3.org/2001/XMLSchema-instance" i:type="Record" xmlns="http://schemas.datacontract.org/Storage">
<AvailableCharts>
<Accelerometer>true</Accelerometer>
<Velocity>false</Velocity>
</AvailableCharts>
<Trics>
<Trick>
<EndOffset>PT2M21.835S</EndOffset>
<Values>
<TrickValue>
<Acceleration>26.505801694441629</Acceleration>
<Rotation>0.023379150593228679</Rotation>
</TrickValue>
</Values>
</Trick>
</Trics>
<Values>
<SensorValue>
<accelx>-3.593643144</accelx>
<accely>7.316485176</accely>
</SensorValue>
<SensorValue>
<accelx>0.31103436</accelx>
<accely>7.70408184</accely>
</SensorValue>
</Values>
</ArrayOfRecord>
Code to parse the data:
import lxml.etree as etree
tree = etree.parse(r"C:\testdel.xml")
root = tree.getroot()
val_of_interest = root.findall('./Values/SensorValue')
for sensor_val in val_of_interest:
print sensor_val.find('accelx').text
print sensor_val.find('accely').text
I asked related question here: How to extract data from xml file that is deep down the tag
Thanks
The confusion was caused by the following default namespace (namespace declared without prefix) :
xmlns="http://schemas.datacontract.org/Storage"
Note that descendants elements without prefix inherit default namespace from ancestor, implicitly. Now, to reference element in namespace, you need to map a prefix to the namespace URI, and use that prefix in your XPath :
ns = {'d': 'http://schemas.datacontract.org/Storage' }
val_of_interest = root.findall('./d:Values/d:SensorValue', ns)
for sensor_val in val_of_interest:
print sensor_val.find('d:accelx', ns).text
print sensor_val.find('d:accely', ns).text

Adding html tags to text of XML.ElementTree Elements in Python

I am trying to use a python script to generate an HTML document with text from a data table using the XML.etree.ElementTree module. I would like to format some of the cells to include html tags, typically either <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML parser is converting these tags to individual characters. The output the shows the tags as text rather than processing them as tags. Here is a trivial example:
import xml.etree.ElementTree as ET
root = ET.Element('html')
#extraneous code removed
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'
tree = ET.tostring(root)
out = open('test.html', 'w+')
out.write(tree)
out.close()
When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.
The HTML document itself shows the problem in the source. It appears that the parser substitutes the "less than" and "greater than" symbols in the tag to the HTML representations of those symbols:
<!--Extraneous code removed-->
<td>This is the first line %lt;br /> and the second</td>
Clearly, my intent is to have the document process the tag itself, not display it as text. I'm not sure if there are different parser options I can pass to get this to work, or if there is a different method I should be using. I am open to using other modules (e.g. lxml) if that will solve the problem. I am mainly using the built-in XML module for convenience.
The only thing I've figured out that works is to modify the final string with re substitutions before I write the file:
tree = ET.tostring(root)
tree = re.sub(r'<','<',tree)
tree = re.sub(r'>','>',tree)
This works, but seems like it should be avoidable by using a different setting in xml. Any suggestions?
You can use tail attribute with td and br to construct the text exactly what you want:
import xml.etree.ElementTree as ET
root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
# note how to end td tail
td.tail = None
br = ET.SubElement(td, 'br')
# now continue your text with br.tail
br.tail = " and the second"
tree = ET.tostring(root)
# see the string
tree
'<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>'
with open('test.html', 'w+') as f:
f.write(tree)
# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
As a side note, to include the <sup></sup> and append text but still within <td>, use tail will have the desire effect too:
...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"
print ET.tostring(root)
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>

xml parsing with beautifulsoup4, namespaces issue

While parsing .docx file contents in form of xml (word/document.xml) with beautifulsoup4 (with lxml installed, as required) I encountered one problem. This part from xml:
...
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
...
becomes this:
...
<graphic>
<graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic>
...
Even when I just parse file and save it, without any modifications. Like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(filepath_in), 'xml')
with open(filepath_out, "w+") as fd:
fd.write(str(soup))
Or parse xml from python console.
For me it looks like namespaces, declared like this, not in root document node, get eaten by parser.
Is this a bug, or feature? And is there a way to preserve these while parsing with beautifulesoup4? Or do I need to switch to something else for that?
UPDATE 1: if with some regex and text replacement I add these namespace declarations to the root document node, then beautifulsoup parses it just fine. But I'm still interested if this can be solved without modification of xml before parsing.
UPDATE 2: after playing with beutifulsoup a bit, I figured out that namespace declarations are parsed only in first occurrence. Means that if tag declares namespace, then if its children have namespace declarations, they will not be parsed. Below is code example with output to illustrate that.
from bs4 import BeautifulSoup
xmls = []
xmls.append("""<name1:tag xmlns:name1="namespace1" xmlns:name2="namespace2">
<name2:intag>
text
</name2:intag>
</name1:tag>
""")
xmls.append("""<tag>
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</tag>
""")
xmls.append("""<name1:tag xmlns:name1="namespace1">
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</name1:tag>
""")
for i, xml in enumerate(xmls):
print "============== xml {} ==============".format(i)
soup = BeautifulSoup(xml, "xml")
print soup
Will produce output:
============== xml 0 ==============
<?xml version="1.0" encoding="utf-8"?>
<name1:tag xmlns:name1="namespace1" xmlns:name2="namespace2">
<name2:intag>
text
</name2:intag>
</name1:tag>
============== xml 1 ==============
<?xml version="1.0" encoding="utf-8"?>
<tag>
<name2:intag xmlns:name2="namespace2">
text
</name2:intag>
</tag>
============== xml 2 ==============
<?xml version="1.0" encoding="utf-8"?>
<name1:tag xmlns:name1="namespace1">
<intag>
text
</intag>
</name1:tag>
See, how first two xmls are parsed correctly, while second declaration in third one gets eaten.
Actually this problem does not involve docx anymore. And my question is rounded to such: is this behavior hardcoded in beautifulsoup4, and if not, then how can I change it?
From the W3C recommendation:
The Prefix provides the namespace prefix part of the qualified name, and MUST be associated with a namespace URI reference in a namespace declaration.
https://www.w3.org/TR/REC-xml-names/#ns-qualnames
So I think it is the expected behaviour: discard the namespaces not declared to gracefully allow some parsing on documents that not respect the recommendation.
change this line:
soup = BeautifulSoup(open(filepath_in), 'xml')
to
soup = BeautifulSoup(open(filepath_in), 'lxml')
or
soup = BeautifulSoup(open(filepath_in), 'html.parser')

Matching the XML Req and Res

I need advice on the below
Below are the request and response XML's. Request XML contains the words to be translated in the Foriegn language [String attribute inside Texts node] and the response XML contains the translation of these words in English [inside ].
REQUEST XML
<TranslateArrayRequest>
<AppId />
<From>ru</From>
<Options>
<Category xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Category>
<ContentType xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2">text/plain</ContentType>
<ReservedFlags xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" />
<State xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></State>
<Uri xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></Uri>
<User xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" ></User>
</Options>
<Texts>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">вк азиза и ринат</string>
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/Arrays">скачать кайда кайдк кайрат нуртас бесплатно</string>
</Texts>
<To>en</To>
</TranslateArrayRequest>
RESPONSE XML
<ArrayOfTranslateArrayResponse xmlns="http://schemas.datacontract.org/2004/07/Microsoft.MT.Web.Service.V2" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>16</a:int>
</OriginalTextSentenceLengths>
<State/>
<TranslatedText>BK Aziza and Rinat</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>18</a:int>
</TranslatedTextSentenceLengths>
</TranslateArrayResponse>
<TranslateArrayResponse>
<From>ru</From>
<OriginalTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>43</a:int> </OriginalTextSentenceLengths>
<State/>
<TranslatedText>Kairat kajdk Qaeda nurtas download free</TranslatedText>
<TranslatedTextSentenceLengths xmlns:a="http://schemas.microsoft.com/2003/10/Serialization/Arrays"><a:int>39</a:int></TranslatedTextSentenceLengths>
</TranslateArrayResponse
</ArrayOfTranslateArrayResponse>
So there are two ways to relate the translated text to the original text:
Length of the original text; and
Order in the XML file
Relating by length being the probably unreliable because the probability of translating 2 or more phrases with the same number of characters is relatively significant.
So it comes down to order. I think it is relatively safe to assume that the files were processed and written in the same order. So I'll show you a way to relate the phrases using the order of the XML files.
This is relatively simple. We simply iterate through the trees and grab the words in the list. Also, for the translated XML due to its structure, we need to grab the root's namespace:
import re
import xml.etree.ElementTree as ElementTree
def map_translations(origin_file, translate_file):
origin_tree = ElementTree.parse(origin_file)
origin_root = origin_tree.getroot()
origin_text = [string.text for text_elem in origin_root.iter('Texts')
for string in text_elem]
translate_tree = ElementTree.parse(translate_file)
translate_root = translate_tree.getroot()
namespace = re.match('{.*}', translate_root.tag).group()
translate_text = [text.text for text in translate_root.findall(
'.//{}TranslatedText'.format(namespace))]
return dict(zip(origin_text, translate_text))
origin_file = 'some_file_path.xml'
translate_file = 'some_other_path.xml'
mapping = map_translations(origin_file, translate_file)
print(mapping)
Update
The above code is applicable for Python 2.7+. In Python 2.6 it changes slightly:
ElementTree objects do not have an iter function. Instead they have a getiterator function.
Change the appropriate line above to this:
origin_text = [string.text for text_elem in origin_root.iter('Texts')
for string in text_elem]
XPath syntax is (most likely) not supported. In order to get down to the TranslatedText nodes we need to use the same strategy as we do above:
Change the appropriate line above to this:
translate_text = [string.text for text in translate_root.getiterator(
'{0}TranslateArrayResponse'.format(namespace))
for string in text.getiterator(
'{0}TranslatedText'.format(namespace))]

Categories

Resources