Downsides to reading strings from Excel in python using encode('utf-8') - python

I am reading a large amount of data from an Excel spreadsheet, reformatting it, and writing it out to another file using the following general structure:
from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
where x and y are arbitrary column indices; the x column is the one known to contain UTF-8 characters.
So far I have only been using .encode('utf-8') on cells where I know there will be errors otherwise, or where I foresee errors without it.
My question is basically this: is there a disadvantage to using .encode('utf-8') on every cell even when it is unnecessary? Efficiency is not an issue. The main concern is that the code keeps working even if a UTF-8 character turns up somewhere I did not expect it. If no errors would result, I will probably just tack .encode('utf-8') onto every cell read.
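For reference, here is a minimal sketch of the failure I am trying to avoid; the cell value is just a made-up example:
value = u'Aeron\xe1utica'    # hypothetical cell value with a non-ASCII character
try:
    str(value)               # implicit ASCII encode in Python 2
except UnicodeEncodeError as e:
    print 'str() failed:', e
print value.encode('utf-8')  # succeeds for ASCII and non-ASCII alike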

The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are most likely reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore better to keep the content of these cells as Unicode within Python and not convert them to ASCII (which is what the str() function does). Use the code below:
from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you are writing Unicode encoded as UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))

This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.
(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:
toprint = u"".join([
u"formatting of the data im writing. "
u"important stuff is to the right -> ",
unicode(sheettwo.cell(z,y).value),
u" more formatting! ",
unicode(sheettwo.cell(z,x).value),
u" and done\n"
])
out.write(toprint.encode('UTF-8'))
(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:
>>> unicode(123456.78901234567)
u'123456.789012'
If that is a bother, you might like to try something like this:
>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
(3) xlrd builds Cell objects on the fly when demanded.
sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
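For example, a whole-sheet pass with cell_value() might look like this (a sketch; the column indices here are placeholders, much as x and y were in the question):
from xlrd import open_workbook

book = open_workbook('file.xls')
sheet = book.sheet_by_index(1)
for rowx in range(sheet.nrows):
    # cell_value() returns the raw value without building a Cell object
    first = sheet.cell_value(rowx, 0)
    second = sheet.cell_value(rowx, 1)
    print (u'%s | %s' % (first, second)).encode('utf-8')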

Related

Ignore newline character in binary file with Python?

I open my file like so:
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data in fixed 4-byte chunks (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which then ends up as the first byte of the next int I read. I don't want to simply use .splitlines() because some of the data could contain a \n and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b flag "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
It seems like the file is in a binary format and the newlines are just mis-interpreted values. This can happen when, for example, the integer 10 is written to the file, since 10 is the byte value of \n.
This doesn't mean that a newline was intended, and it probably was not. You can just ignore it being printed as \n and use it as data.
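Concretely, just keep reading fixed-size records and let struct interpret whatever bytes come back; a minimal sketch using the 4-byte record size and file name from the question:
import struct

with open("filename.ext", "rb") as f:
    while True:
        chunk = f.read(4)
        if len(chunk) < 4:          # end of file (or a truncated record)
            break
        (value,) = struct.unpack("i", chunk)
        print value                 # a chunk starting with 0x0A simply decodes as 10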
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
... d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
... diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
the values of some columns get a zero as their first character (there are no zeroes in the input), and others get a few zeroes, which are not visible when viewing the file in Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print lines[51220][8]
>>> '0378983561' #should have been '378983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print lines[51220][8]
>>> '0378983561'
If you want to use this as an integer, you should parse it.
print int(lines[51220][8])
>>> 378983561
If you then want it back as a string (without the stray leading zero), convert the integer back to a string.
print repr(str(int(lines[51220][8])))
>>> '378983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you, as in:
print int(lines[51220][8])
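For completeness, a sketch of reading the semicolon-separated file directly with csv.reader and converting that column (assuming it always holds digits); the delimiter, column index, and file name are taken from the question:
import csv

with open('/home/foo/data.csv', 'rb') as f:      # 'rb' is recommended for csv in Python 2
    reader = csv.reader(f, delimiter=';')
    for row in reader:
        phone = int(row[8])                      # every field comes back as a string
        print phone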

Write bitarray to file in python?

I am trying to write a bit array to a file in Python as in this example: python bitarray to and from file
however, I get garbage in my actual test file:
test_1 = ^#^#
test_2 = ^#^#
code:
from bitarray import bitarray

def test_function(myBitArray):
    test_bitarray = bitarray(10)
    test_bitarray.setall(0)
    with open('test_file.inp','w') as output_file:
        output_file.write('test_1 = ')
        myBitArray.tofile(output_file)
        output_file.write('\ntest_2 = ')
        test_bitarray.tofile(output_file)
Any help with what's going wrong would be appreciated.
That's not garbage. The tofile function writes binary data to a binary file. A 10-bit-long bitarray with all 0's will be output as two bytes of 0. (The docs explain that when the length is not a multiple of 8, it's padded with 0 bits.) When you read that as text, two 0 bytes will look like ^#^#, because ^# is the way (many) programs represent a 0 byte as text.
If you want a human-readable, text-friendly representation, use the to01 method, which returns a string of '0' and '1' characters. For example:
with open('test_file.inp','w') as output_file:
    output_file.write('test_1 = ')
    output_file.write(myBitArray.to01())
    output_file.write('\ntest_2 = ')
    output_file.write(test_bitarray.to01())
Or maybe you want this instead:
output_file.write(str(test_bitarray))
… which will give you something like:
bitarray('0000000000')
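If the data only has to round-trip through bitarray itself, the binary tofile()/fromfile() pair also works fine; a rough sketch (assuming the bitarray package is installed and the length is a multiple of 8, so no padding bits are added):
from bitarray import bitarray

bits = bitarray('0110100001101001')        # 16 bits

with open('bits.bin', 'wb') as f:          # note 'wb': tofile() writes raw bytes
    bits.tofile(f)

restored = bitarray()
with open('bits.bin', 'rb') as f:
    restored.fromfile(f)

print restored == bits                     # True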

Read Unicode from CSV [duplicate]

This question already has answers here:
General Unicode/UTF-8 support for csv files in Python 2.6
I have a problem reading Unicode characters from a CSV file. The CSV file originally had elements written as Python unicode literals:
"[u'Aeron\xe1utica']"
"[u'Ni\u0161']"
"[u'K\xfcnste']"
...
from which I removed the u'' wrappers, leaving a CSV with
Aeron\xe1utica
Ni\u0161
K\xfcnste
....
Now I want to read the CSV and write the actual characters out to a file, i.e.
Aeronáutica
Niš
Künste
....
I tried using the UnicodeWriter from the csv docs, but it gives the same output as the second list.
Here's what I did to read and write:
import csv
import codecs

c = open('foo.csv','r')
reader = csv.reader(c)
p = []
for row in reader:
    p = p + row
#The elements in p were ['Aeron\\xe1utica', 'Ni\\u0161', 'K\\xfcnste'...]

c = open('bar.csv','w')
c.write(codecs.BOM_UTF8)
writer = UnicodeWriter(c)  # UnicodeWriter recipe from the csv docs
for row in p:
    writer.writerow([row])
I also tried codecs.open('','','UTF-8') for both reading and writing, but it didn't help
It appears you have written Python lists directly to your CSV file, resulting in the [...] literal syntax instead of normal columns. You then removed most of the information that would have let you turn the rows back into Python lists of unicode strings.
What you have left are Python unicode literals, but without the quotes. Use the unicode_escape codec to decode the values back to Unicode:
with open('foo.csv','r') as b0rken:
    for line in b0rken:
        value = line.rstrip('\r\n').decode('unicode_escape')
        print value
or add back the u'..' quoting, using a triple-quoted string in an attempt to avoid needing to escape embedded quotes:
from ast import literal_eval

with open('foo.csv','r') as b0rken:
    for line in b0rken:
        value = literal_eval("u'''{}'''".format(line.rstrip('\r\n')))
        print value
If you still have the original file (with the [u'...'] formatted lines), use the ast.literal_eval() function to turn those back into Python lists. No point in using the CSV module here:
from ast import literal_eval
with open('foo.csv','r') as b0rken:
    for line in b0rken:
        lis = literal_eval(line)
        value = lis[0]
        print value
Demo with unicode_escape:
>>> for line in b0rken:
... print line.rstrip('\r\n').decode('unicode_escape')
...
Aeronáutica
Niš
Künste
École de l'Air
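To get those characters into a file rather than just printed, encode the decoded values as UTF-8 on the way out; a rough sketch of the first approach, reusing the bar.csv output name from the question:
import codecs

with open('foo.csv', 'r') as b0rken, codecs.open('bar.csv', 'w', 'utf-8') as out:
    for line in b0rken:
        value = line.rstrip('\r\n').decode('unicode_escape')
        out.write(value + u'\n')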

Convert binary data to web-safe text and back - Python

I want to convert a binary file (such as a jpg, mp3, etc) to web-safe text and then back into binary data. I've researched a few modules and I think I'm really close but I keep getting data corruption.
After looking at the documentation for binascii I came up with this:
from binascii import *
raw_bytes = open('test.jpg','rb').read()
text = b2a_qp(raw_bytes,quotetabs=True,header=False)
bytesback = a2b_qp(text,header=False)
f = open('converted.jpg','wb')
f.write(bytesback)
f.close()
When I try to open the converted.jpg I get data corruption :-/
I also tried using b2a_base64 with 57-byte blocks of binary data. I took each block, converted it to a string, concatenated them all, and then converted back with a2b_base64, and got corruption again.
Can anyone help? I'm not super knowledgeable on all the intricacies of bytes and file formats. I'm using Python on Windows if that makes a difference with the \r\n stuff
Your code looks quite complicated. Try this:
#!/usr/bin/env python
from binascii import *
raw_bytes = open('28.jpg','rb').read()
i = 0
str_one = b2a_base64(raw_bytes) # 1
str_list = b2a_base64(raw_bytes).split("\n") #2
bytesBackAll = a2b_base64(''.join(str_list)) #2
print bytesBackAll == raw_bytes #True #2
bytesBackAll = a2b_base64(str_one) #1
print bytesBackAll == raw_bytes #True #1
Lines tagged with #1 and #2 represent alternatives to each other. #1 seems most straightforward to me - just make it one string, process it and convert it back.
You should use base64 encoding instead of quoted-printable. Use b2a_base64() and a2b_base64().
Quoted-printable output is much bigger for binary data like pictures. In this encoding each non-printable byte is expanded to =HEX (three characters). It is meant for text that consists mostly of alphanumeric characters, such as email subjects.
Base64 is much better suited to mostly-binary data. It packs the bits of three input bytes into four output characters, 6 bits per character, and the output can usually be recognized by the = padding at the end.
As an example I took a .jpeg of 271,700 bytes. In quoted-printable it is 627,857 bytes, while in base64 it is 362,269 bytes. The size of quoted-printable output depends on the data: text that is all letters does not grow at all. The size of base64 output is roughly orig_size * 8 / 6.
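A quick way to check the overhead on your own data; a sketch reusing the 28.jpg file name from elsewhere in this thread:
from binascii import b2a_base64, b2a_qp

raw_bytes = open('28.jpg', 'rb').read()
b64 = b2a_base64(raw_bytes)
qp = b2a_qp(raw_bytes, quotetabs=True, header=False)
# base64 grows by roughly 4/3 (plus a trailing newline); quoted-printable
# can roughly triple data that is mostly non-printable bytes.
print len(raw_bytes), len(b64), len(qp)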
Your documentation reference is for Python 3.0.1. There is no good reason to be using Python 3.0. You should be using 3.2 or 2.7. Which exactly are you using?
Suggestion: (1) change bytes to raw_bytes to avoid confusion with the bytes built-in (2) check for raw_bytes == bytes_back in your test script (3) while your test should work with quoted-printable, it is very inefficient for binary data; use base64 instead.
Update: Base64 encoding produces 4 output bytes for every 3 input bytes. Your base64 code doesn't work with 56-byte chunks because 56 is not an integral multiple of 3; each chunk is padded out to a multiple of 3. Then you join the chunks and attempt to decode, which is guaranteed not to work.
Your chunking loop would be much better written as:
output_string = ''.join(
    b2a_base64(raw_bytes[i:i+57]) for i in xrange(0, len(raw_bytes), 57)
)
In any case, chunking is rather slow and pointless; just do b2a_base64(raw_bytes)
@PMC's answer, copied from the question:
Here's what works:
from binascii import *
raw_bytes = open('28.jpg','rb').read()
str_list = []
i = 0
while i < len(raw_bytes):
    byteSegment = raw_bytes[i:i+57]
    str_list.append(b2a_base64(byteSegment))
    i += 57
bytesBackAll = a2b_base64(''.join(str_list))
print bytesBackAll == raw_bytes #True
Thanks for the help guys. I'm not sure why this would fail with [0:56] instead of [0:57] but I'll leave that as an exercise for the reader :P
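(For the curious: the 56-versus-57 difference is the multiple-of-3 rule explained above. A quick sketch that makes it visible, reusing the same file; mid-stream '=' padding makes the decoder stop early:)
from binascii import a2b_base64, b2a_base64

raw_bytes = open('28.jpg', 'rb').read()

def chunked_b64(data, size):
    # Encode in fixed-size chunks and join the encoded pieces.
    return ''.join(b2a_base64(data[i:i + size])
                   for i in xrange(0, len(data), size))

print a2b_base64(chunked_b64(raw_bytes, 57)) == raw_bytes   # True: 57 is a multiple of 3
print a2b_base64(chunked_b64(raw_bytes, 56)) == raw_bytes   # False: each chunk gets '=' padding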
