I have code that reads a .txt file into a variable as a string.
Then I try to use .replace() on it to change the character "ó" to "o", but it is not working! The console prints the same thing.
Code:
def normalize(filename):
    #Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    #File says: "Es una rubrica de evaluación." (among many emojis)
    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()
    #Here, only the "o" is replaced. In the real code, I use a for loop to iterate through all chrs.
    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)
    return
Expected output:
"Es una rubrica de evaluacion."
Current Output:
"Es una rubrica de evaluación."
It does not print an error or anything, it just prints it as it is.
I believe the problem lies in the fact that the string comes from a file: when I create a string literal and run the same code on it, it works, but it does not work when I get the string from a file.
EDIT: SOLUTION!
Thanks to #juanpa.arrivillaga and #das-g I came up with this solution:
from unidecode import unidecode
def get_txt(filename):
    txt_raw = open(filename, "r", encoding="utf8")
    txt_read = txt_raw.read()
    txt_decode = unidecode(txt_read)
    print(txt_decode)
    return txt_decode
Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:
>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']
Just normalize your strings:
>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
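Putting the normalization together with the question's replace call, a minimal sketch for Python 3 could look like this. The function name replace_accented_o is made up for illustration, and escape sequences are used so the two forms of the character cannot be confused:

```python
import unicodedata

def replace_accented_o(text):
    """Normalize to NFC so 'o' + combining accent becomes one code point, then replace it."""
    composed = unicodedata.normalize("NFC", text)
    return composed.replace("\u00f3", "o")  # U+00F3 LATIN SMALL LETTER O WITH ACUTE

decomposed = "evaluacio\u0301n"  # 'o' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "evaluaci\u00f3n"  # single precomposed code point

print(replace_accented_o(decomposed))   # evaluacion
print(replace_accented_o(precomposed))  # evaluacion
```

Without the normalize call, the replace would silently do nothing on the decomposed input, which matches the behaviour described in the question.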
Although, taking a step back, do you really want to remove accents? Or do you just want to normalize to composed, like the above?
As an aside, you shouldn't ignore the errors when reading your text file. You should use the correct encoding. I suspect what is happening is that you are writing your text file using an incorrect encoding, because you should be able to handle emojis just fine, they aren't anything special in unicode.
>>> emoji = "😀"
>>> print(emoji)
😀
>>>
>>> unicodedata.name(emoji)
'GRINNING FACE'
I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
The thing is, I want to read this data four bytes at a time (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n, which is then used as the first byte of the next int I read. I don't want to simply use .splitlines() because some data could contain a \n byte and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the newline/end-of-line characters; why is that not the case for me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
It seems the file is in a binary format and the "newlines" are just misinterpreted values: the byte 0x0A prints as \n. This can happen when the integer 10 is written to the file, for example.
This doesn't mean that a newline was intended, and it probably was not. You can ignore the fact that it prints as \n and just use it as data.
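A small sketch of that point, assuming (as the sample data in the question suggests) little-endian 4-byte signed integers. The helper name read_ints and the sample values are made up for illustration:

```python
import struct

# Three little-endian 32-bit ints; the middle one is 10, whose low byte is 0x0A.
data = struct.pack("<3i", 9, 10, 11)
print(b"\n" in data)  # True: the "newline" is simply part of the integer 10

def read_ints(buf):
    """Decode a bytes object as consecutive little-endian 4-byte signed ints."""
    return [struct.unpack_from("<i", buf, offset)[0]
            for offset in range(0, len(buf) - 3, 4)]

values = read_ints(data)
print(values)  # [9, 10, 11]
```

Reading in fixed 4-byte steps keeps the alignment intact; stripping or splitting on \n would shift every value that follows.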
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
...     d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
...     diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")
So I've been working on this python code for a few days now. I'm trying to decode a zero-one code I made previously. Simply put it hides genomic code...
binary = raw_input ('Enter binary code:')
binary = binary.replace('00', 'A')
binary = binary.replace('01', 'C')
binary = binary.replace('10', 'G')
binary = binary.replace('11', 'T')
print binary
My issue is, it will accept something like 0110 = CG. But when I add any characters after that it messes up, like 011011 should be CGT instead it's C1CC1. If anyone could identify this issue, or even solve it that would be great.
Repeatedly take off two characters and decode them
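s = "100101001010101010110"
decode = {'00':'A', '01':'C', '10':'G', '11':'T'}
# Stop once fewer than two characters remain: this sample has an odd length,
# so the trailing "0" is left undecoded instead of raising KeyError.
while len(s) >= 2:
    (code, s) = (s[:2], s[2:])
    print decode[code]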
An alternative solution to ForceBru's, using the re module:
import re
dna = '100010101001010111000'
base_pairs = {'00': 'A', '01': 'C', '10':'G', '11': 'T'}
# '..' matches only complete pairs, so the unpaired trailing '0' is skipped
alpha_dna = ''.join([base_pairs[x] for x in re.findall('..', dna)])
# alpha_dna == 'GAGGGCCCTA'
In the following code error checking is omitted for the sake of clarity
a='011011'
rep={'00': 'A','01': 'C','10':'G','11': 'T'}
res=''
for x in xrange(0,len(a),2):
    res+=rep[a[x]+a[x+1]]
print res
Here you just have to split the string into some blocks of length 2 and then use each of these blocks as a key of a dictionary.
One liner! Probably not useful, but just for fun:
"".join([ {'00':'A', '01':'C', '10':'G', '11':'T'}[code] for code in [ binary[i:i+2] for i in range(0,len(binary),2) ]])
Yes, I'm addicted to list comprehensions.
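For reference, the chained .replace() calls in the question fail because each call matches its pair at any offset, not only at even positions, and later calls then run on a string that already contains letters. A minimal Python 3 sketch of the chunk-and-lookup approach from the answers above, with a length check added (the names CODES and decode_bases are made up for illustration):

```python
CODES = {"00": "A", "01": "C", "10": "G", "11": "T"}

def decode_bases(bits):
    """Decode a string of '0'/'1' characters two at a time, left to right."""
    if len(bits) % 2:
        raise ValueError("expected an even number of binary digits")
    # A character other than '0' or '1' raises KeyError from the lookup.
    return "".join(CODES[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(decode_bases("0110"))    # CG
print(decode_bases("011011"))  # CGT
```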
I am reading a large amount of data from an Excel spreadsheet, then reformatting and rewriting it, using the following general structure:
book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
where x and y are arbitrary cells in this case, with x being less arbitrary and containing utf-8 characters.
So far I have only been using .encode('utf-8') on cells where I know there would otherwise be errors, or where I foresee one.
My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells even if it is unnecessary? Efficiency is not an issue. The main concern is that the code keeps working even if there is a utf-8 character in a place where there shouldn't be one. If no errors would occur from adding .encode('utf-8') to every cell read, I will probably end up doing that.
The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are likely reading files newer than Excel 97, they contain Unicode code points anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert it to ASCII (which is what the str() function does). Use the code below:
book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
#Make sure you're writing Unicode encoded as UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
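The same idea (keep text as Unicode and encode once, at the output boundary) can be sketched in Python 3, where opening the file with an encoding argument does the encoding on write. The rows list and the format_row helper below are hypothetical stand-ins for values read through xlrd, not part of its API:

```python
import io
import os
import tempfile

def format_row(number_cell, text_cell):
    """Build one output line from a numeric cell value and a text cell value."""
    return "value: {} text: {}\n".format(number_cell, text_cell)

# Stand-ins for cell values: floats for numbers, Unicode strings for text.
rows = [(1.5, "caf\u00e9"), (2.0, "\u0416")]

path = os.path.join(tempfile.mkdtemp(), "output.file")
with io.open(path, "w", encoding="utf-8") as out:
    for number_cell, text_cell in rows:
        out.write(format_row(number_cell, text_cell))  # encoded to UTF-8 on write

with io.open(path, "r", encoding="utf-8") as check:
    written = check.read()
```

With this arrangement no individual cell needs its own .encode() call, so a non-ASCII character turning up in an unexpected column cannot cause an encoding error.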
This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.
(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:
toprint = u"".join([
u"formatting of the data im writing. "
u"important stuff is to the right -> ",
unicode(sheettwo.cell(z,y).value),
u" more formatting! ",
unicode(sheettwo.cell(z,x).value),
u" and done\n"
])
out.write(toprint.encode('UTF-8'))
(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:
>>> unicode(123456.78901234567)
u'123456.789012'
If that is a bother, you might like to try something like this:
>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
(3) xlrd builds Cell objects on the fly when demanded.
sheettwo.cell(z,y).value # slower
sheettwo.cell_value(z,y) # faster
I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'。',
'、',
'「',
'」',
'(',
')',
'!',
'?',
'・',
'/',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
def main():
    if len(sys.argv) > 1:
        filename = sys.argv[1]
    else:
        print("Please specify a file to check.")
        return
    try:
        f = open(filename, 'r')
    except IOError as e:
        print("Sorry! The file doesn't exist.")
        return
    filecontent = f.read()
    f.close()
    #y = zk[29]
    #print y.decode('utf-8')
    for f in filecontent:
        for z in zk:
            if f == z.decode('utf-8'):
                print f
    print filename
if __name__ == "__main__":
    main()
Am I missing a step?
Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.
Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
    for line in f:
        print line.translate(translation_table),
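The translation-table approach carries over to Python 3, where every str is already Unicode and str.translate accepts a dict keyed by code point. In this sketch the three-entry zk/hk lists are a hypothetical excerpt written as escapes, and to_halfwidth is a made-up name:

```python
# Hypothetical excerpt of the zk/hk lists from the question.
zk = ["\u30a2", "\u30ac", "\u3002"]        # fullwidth A, GA, ideographic full stop
hk = ["\uff71", "\uff76\uff9e", "\uff61"]  # halfwidth A, KA + voiced mark, full stop

# Keys are code points; values may be strings of any length.
table = {ord(full): half for full, half in zip(zk, hk)}

def to_halfwidth(text):
    """Replace each fullwidth character found in the table; leave the rest alone."""
    return text.translate(table)

converted = to_halfwidth("\u30a2\u30ac\u3002")
print(converted == "\uff71\uff76\uff9e\uff61")  # True
```

Because table values can be longer than one character, voiced katakana that need a base character plus a separate halfwidth voicing mark fit the same mapping.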
You need to convert everything to the same form, and that form is Unicode strings. Unicode strings have no encoding in the sense of .encode() or .decode(). A non-Unicode string is actually a sequence of bytes that expresses the value in some encoding. To convert it to Unicode, you have to .decode() it. To store a Unicode string as a sequence of bytes, you have to .encode() the abstraction into concrete bytes.
So, when loading Unicode strings from a UTF-8 encoded file, either you read it into old-style strings (non-Unicode, sequences of bytes) and then call .decode('utf-8'), or you use codecs.open(..., encoding='utf-8') and get Unicode strings automatically.
The form # coding=utf-8 is not the usual one, but it is OK... provided the editor (I mean the tool that you use to write the text) also treats the file as UTF-8. Then the old strings are displayed by the editor correctly. In that case they should be passed through .decode('utf-8') to get Unicode. Old strings containing only ASCII characters in the same source can also be converted to Unicode using .decode('utf-8').
To summarize: you decode from bytes to Unicode, and you encode Unicode strings into a sequence of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
    for z in zk:
        if f == z.decode('utf-8'):
            print f
because filecontent is the result of f.read(), which makes it a sequence of bytes. The f in the loop is one byte, while z.decode('utf-8') returns one Unicode character, so the two never compare equal. (By the way, f is a rather misleading name for a byte value.)
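The byte-versus-character mismatch can be seen in a few lines. This sketch uses Python 3, where iterating over bytes yields integers rather than the one-character byte strings of Python 2, but the outcome is the same: pieces of undecoded bytes never equal a Unicode character. The variable names are made up for illustration:

```python
raw = "\u30a2\u3002".encode("utf-8")  # bytes, as a binary read would return them
text = raw.decode("utf-8")            # Unicode string

print(len(raw), len(text))  # 6 2

# Comparing pieces of the undecoded bytes with a character never matches...
byte_hits = [b for b in raw if b == "\u30a2"]
# ...while comparing decoded characters does.
char_hits = [ch for ch in text if ch == "\u30a2"]

print(byte_hits)       # []
print(len(char_hits))  # 1
```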