lxml adds urlencoding in xml? - python

I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a small, odd problem that I'm hoping somebody can help with. I haven't found a solution online; perhaps I'm not searching for the right thing.
Anyway, I'm programmatically building some XML with lxml and then writing it to a text file. The problem is that lxml converts carriage returns to the text &#13;, almost like urlencoding, but I'm not using HTML, I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree
textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
    textstr += line
root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This&#13;
is&#13;
my&#13;
text</some_element></root>
For my purposes the line breaks are fine, but the &#13; character references are not.
What I have been able to figure out is that this happens because I'm opening the text file in binary mode "rb" (which I actually need to do, as my app is indexing a large text file). If I open the file in text mode "r" instead, the output does not contain &#13; (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the XML to a string and then replace the &#13; artifacts. However, I was hoping for a more elegant solution, because the text files I parse are not under my control and I'm worried that other parts of the text file might be converted to URL-style encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?

Windows uses \r\n to represent a line ending, Unix uses \n.
The code below removes the \r at the end of each line, if there is one (so it works with Unix text files too). It removes at most one \r per line, so an \r anywhere else in the line is preserved.
import re
textstr = ""
with open("mytext.txt", "rb") as fh:
    for line in fh:
        textstr += re.sub(r'\r$', '', line)
print(repr(textstr))
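Some background on why this helps: XML parsers normalize a literal CR LF pair to a single LF when reading a document, so a serializer has to write a carriage return as the character reference &#13; if it is to survive a round trip. That is likely why the reference shows up only when the file is read in binary mode. If the whole file is already in memory, a minimal sketch of the same cleanup without a regular expression could look like this (the function name and sample data are hypothetical, chosen only for illustration):

```python
def strip_carriage_returns(data):
    """Convert Windows (CRLF) line endings to Unix (LF) in a bytes object.

    A lone carriage return that is not followed by a line feed is left alone.
    """
    return data.replace(b"\r\n", b"\n")


# Hypothetical sample mirroring the Notepad file from the question.
raw = b"This\r\nis\r\nmy\r\ntext"
cleaned = strip_carriage_returns(raw)
print(repr(cleaned))
```

Note that this changes the byte length of the data, so any byte offsets used for indexing should be computed against the original bytes, not the cleaned copy.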

Related

Forward slash "/" in string converted to "&#47", is that platform independent behaviour?

I have a Python script that reads an html into lines, and then filters out the relevant lines before saving those lines back as html file. I had some problems till I figured out that a / in the page text was being converted to &#47 when saved as a string.
The source html that I'm parsing through has the following line:
<h3 style="text-align:left">SYDNEY/KINGSFORD SMITH (YSSY)</h3>
which when passing through the file.readlines() would come out as:
<h3 style='text-align:left'>SYDNEY&#47BANKSTOWN (YSBK)</h3>
which then trips up BeautifulSoup, because it gets confused by the "&" symbol, which in turn breaks all subsequent tags.
What I'm interested in is to know if this replacement value "&#47" is platform independent or not?
It's not hard to run a .replace prior to saving each string, avoiding the issue now that I'm coding and testing on windows, but will it still work if I deploy my script on a linux server?
Here's what I have now, which works fine when run under windows:
def getHTML(self, html_source):
    with open(html_source, 'r') as file:
        source_lines = file.readlines()
    relevant = False
    relevant_lines = []
    for line in source_lines:
        if "</table>" in line:
            relevant = False
        if self.airport in line:
            relevant = True
        if relevant:
            line = line.replace("&#47", " ")
            relevant_lines.append(line)
    relevant_lines.append("</table>")
    filename = f"{html_source[:-5]}_{self.airport}.html"
    with open(filename, 'w') as file:
        file.writelines(relevant_lines)
    with open(filename, 'r') as file:
        relevant_html = file.read()
    return relevant_html
Can anyone tell me, without having to install a virtual machine with linux, if this will work cross-platform? I tried to look for documentation on this, but all I could find was about ways to explicitly escape a / when entering a string, nothing documenting how to deal with / or other invalid characters being read when reading a source file into strings.
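It should be OK everywhere: &#47 is the standard HTML numeric character reference for "/" (ASCII code 47), so it does not depend on the platform.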
See https://www.w3schools.com/charsets/ref_html_ascii.asp
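Instead of replacing one specific reference by hand, it may be simpler to decode all character references in a line at once. A minimal sketch, assuming Python 3 and its standard html module (the function name and sample line are hypothetical, taken from the question for illustration):

```python
import html


def decode_character_references(line):
    """Turn HTML character references such as &#47 back into plain characters."""
    return html.unescape(line)


# Sample line from the question; note the reference has no trailing semicolon.
sample = "<h3 style='text-align:left'>SYDNEY&#47BANKSTOWN (YSBK)</h3>"
decoded = decode_character_references(sample)
print(decoded)
```

This also decodes references such as &amp; and &lt;, so it is best applied to text content rather than to markup that will be parsed again afterwards.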

Trying to understand how to get import re to work in pycharm

I'm going through a course at work for Python. We're using Pycharm, and I'm not sure if that's what the problem is.
Basically, I have to read in a text file, scrub it, then count the frequency of specific words. The counting is not an issue. (I looped through a scrubbed list, checked the scrubbed list for the specific words, then added the specific words to a dictionary as I looped through the list. It works fine).
My issue is really about scrubbing the data. I ended up doing successive scrubs to get to a final clean list. But when I read the documentation, I should be able to use regex or re and scrub my file with one line of code. No matter what I do, importing re, or regex I get errors that stop my code.
How can I write the below code pythonically?
# Open the file in read mode
with open('chocolate.txt', 'r') as file:
    input_col = file.read().replace(',', '')
text3 = input_col.replace('.', '')
text2 = text3.replace('"', '')
text = text2.split()
You could try using a regular expression, something like this:
import re
result = re.sub(r'("|\.|,)', "", text)
print(result)
Here text is the string you read from the text file. Note that the dot has to be escaped as \. because an unescaped . matches any character.
Hope this helps!
x = re.sub(r'("|\.|,)', "", str)
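The same scrub can also be written with a character class, which avoids the escaping question because a dot inside square brackets is literal. A small sketch of the read, scrub and split steps combined (the function name and the sample sentence are hypothetical; in the question the text would come from chocolate.txt):

```python
import re


def scrub_and_split(text):
    """Remove commas, periods and double quotes, then split on whitespace."""
    cleaned = re.sub(r'[",.]', '', text)
    return cleaned.split()


# Hypothetical sample text standing in for the contents of the file.
words = scrub_and_split('I like "dark" chocolate, milk chocolate.')
print(words)
```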

Python - failing to read correctly the first line of a text file to a list

I'm having a problem understanding why my python program does what it does when reading (first) lines from files and adding the lines into a list. For some reason the first line needs to be empty or it'll not read the first line correctly. If the first line is empty, it's not empty (at least not according to python).
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of both files that gets added to a list in my program, is "text" and "text_file.txt" but any code that for example tries to say
if something == "text":
    ...
will not get executed even if the "something" is the same as the "text".
So I'm assuming that my problem is that somewhere in the machine code (or something), my computer writes some invisible code in the beginning of the text file and that makes the first line not what it is. Maybe? I have actually found a solution for the problem simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
    ...
and in the other filetype:
if not ":" in line:
    ...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find a real reason for why my program is behaving as it is. Also, I would like to not have to do this kind of a workaround if there's an easier solution that doesn't involve me editing all my files and adding an if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. It is only a problem in the files in the first character of the first line.
Edit2:
It seems the problem is having a Byte order mark in the beginning of my text files. After a quick googling I didn't find a solution as to how I could remove it. I'm creating my files with just windows notepad.
Final edit:
Apparently notepad is not a real text editor. I guess I'll just swap over from notepad to notepad++ to avoid this problem. However, just in case I'll have to handle my files in notepad: If I open a textfile in notepad and add some text in it, will it add a BOM or should it do that only in the creating of the file?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python (this is Python 2 syntax, where line is a byte string) with:
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
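In Python 3, the same codec can be passed straight to open(), so the BOM never reaches the program. A minimal sketch under that assumption (the function name and the temporary demo file are hypothetical; the demo file only simulates what Notepad writes):

```python
import os
import tempfile


def read_lines_without_bom(path):
    """Read a UTF-8 text file, skipping a leading BOM if one is present."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return [line.rstrip("\n") for line in f]


# Create a small file that starts with the three BOM bytes 0xef 0xbb 0xbf.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbftext_file.txt\nanothertext_file.txt\n")
    demo_path = tmp.name

lines = read_lines_without_bom(demo_path)
os.remove(demo_path)
print(lines[0] == "text_file.txt")
```

The utf-8-sig codec also reads files that have no BOM, so it should be safe to use for files from either Notepad or Notepad++.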
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had similar issue: python readlines() reports invalid chars heading the first line, something like . I have tried all suggestions i can google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip that line with:
if len(line[i]) > len(line[0]):
    do things
else:
    skipping
In my case, len(line[0]) is 4, and all other lines are longer than 4.

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, so before Python 3.8 the order in which these attributes are listed in the XML text will NOT be preserved. Instead, they will be output in alphabetical order. Since Python 3.8, attributes are written in the order in which they were set.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b> (the attributes are sorted alphabetically)
This means that when committing the original and changed files to a revision control system (such as SVN, CVS, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
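with open(filename, "r") as my_file:
    lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line\n")
# Reopen in write mode; a file opened with "r" cannot be written to
with open(filename, "w") as my_file:
    my_file.writelines(lines_of_file)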
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing. Instead, I would read the file into memory, edit it, then write it out:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is a snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
    new_rank = int(rank.text) + 1
    rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is an example tag; which tag you use depends on the contents of your XML file.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
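As a rough illustration of the minidom approach suggested above, the sketch below parses a document, appends a new element just before the root's closing tag, and serializes the result. The function name, tag names and sample XML are hypothetical; for a real file you would use minidom.parse on the path and write the result back out.

```python
from xml.dom import minidom


def append_entry(xml_text, tag, text):
    """Append <tag>text</tag> as the last child of the root element."""
    doc = minidom.parseString(xml_text)
    new_node = doc.createElement(tag)
    new_node.appendChild(doc.createTextNode(text))
    doc.documentElement.appendChild(new_node)
    # Serialize only the root element (no XML declaration) to keep the demo short.
    return doc.documentElement.toxml()


# Hypothetical document standing in for the existing XML file.
updated = append_entry("<root><a>1</a></root>", "b", "2")
print(updated)
```

Because the new node is added through the DOM API, the root's closing tag is handled by the serializer and never has to be deleted or rewritten by hand.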
As Edan Maor explained, there is a quick and dirty way to do it (for UTF-16 encoded .xml files), which you should not use for the reasons Edan Maor gave. If time constraints do not allow you to learn (proper) XML parsers, it can be done with the following Python 2.7 code.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
import io
# Note: domain and user_name must be defined before this point.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index = 0  # set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
    line_index = line_index + 1
    if line_index == len(lines):
        # 1. & 2. Delete the last line by not writing it, and add another line in its place
        outFile.writelines("Write extra line here" + '\n')
        # 4. Close root tag:
        outFile.writelines("</phonebook>")  # as in:
        # http://tizag.com/xmlTutorial/xmldocument.php
    else:
        # 3. Substitute a line if it contains the following substring:
        pattern = '<Author>'
        subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
        if pattern in line:
            line = subst
            print line
        outFile.writelines(line)  # copy all lines from the original XML except the last
file.close()
outFile.close()

python opens text file with a space between every character

Whenever I try to open a .csv file with the python command
fread = open('input.csv', 'r')
it always opens the file with spaces between every single character. I'm guessing it's something wrong with the text file because I can open other text files with the same command and they are loaded correctly. Does anyone know why a text file would load like this in python?
Thanks.
Update
Ok, I got it with the help of Jarret Hardie's post
this is the code that I used to convert the file to ascii
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!
The post by recursive is probably right... the contents of the file are likely encoded with a multi-byte charset. If this is, in fact, the case you can likely read the file in python itself without having to convert it first outside of python.
Try something like:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
The 'b' flag ensures the file is read as binary data. You'll need to know (or guess) the original encoding... in this example, I've used utf-16, but YMMV. This will convert the file to unicode. If you truly have a file with multi-byte chars, I don't recommend converting it to ascii as you may end up losing a lot of the characters in the process.
EDIT: Thanks for uploading the file. There are two bytes at the front of the file which indicates that it does, indeed, use a wide charset. If you're curious, open the file in a hex editor as some have suggested... you'll see something in the text version like 'I.D.|.' (etc). The dot is the extra byte for each char.
The code snippet above seems to work on my machine with that file.
The file is encoded in some unicode encoding, but you are reading it as ascii. Try to convert the file to ascii before using it in python.
Isn't a CSV just a plain text file with values separated by commas?
Just try to open it with a text editor to see if the file is correctly formed.
To read an encoded file, you can simply replace open with codecs.open.
import codecs
fread = codecs.open('input.csv', 'r', 'utf-16')
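In Python 3, the built-in open() accepts an encoding argument directly, so the file can be decoded and parsed in one step without converting it to ASCII first. A minimal sketch, assuming the file really is UTF-16 as in the answers above (the function name and the temporary demo file are hypothetical):

```python
import csv
import os
import tempfile


def read_utf16_csv(path):
    """Read a UTF-16 encoded CSV file and return its rows as lists of strings."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, "r", encoding="utf-16", newline="") as f:
        return list(csv.reader(f))


# Create a small UTF-16 file (with BOM) to stand in for input.csv.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("ID,Name\r\n1,Alice\r\n".encode("utf-16"))
    demo_path = tmp.name

rows = read_utf16_csv(demo_path)
os.remove(demo_path)
print(rows)
```

Unlike encoding to ASCII with 'ignore', this keeps any non-ASCII characters in the data intact.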
It had never occurred to me, but as truppo said, there must be something wrong with the file.
Try opening the file in Excel/BrOffice Calc and using Save As to save it as CSV again.
If the problem persists, try a subset of the data: the first 10, last 10, or an intermediate 10 lines of the file.
Open the file in binary mode, 'rb'. Check it in a HEX Editor and check for null padding '00'. Open the file in something like Scintilla Text Editor to check the characters present in the file.
Here's the quick and easy way, esp if python won't parse the input correctly
sed 's/ \(.\)/\1/g'
