Identify string ascii status in a if/else loop with Python - python

I might be asking this wrong, but please help if I am. I need to establish whether a string contains non-ascii characters in order to separate them from the ones that is purely ascii.
I gather a string from multiple separate files and need to remove the non-ascii containing ones so that I can place the strings in a list to be used further. Without any filtering I get the following error while extracting the strings:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 40: ordinal not in range(128)
I would like to achieve the following:
Read string
if string contains non-ascii
->add to list
else
->do not add to list.
All I need to do is determine the how to filter, I have the rest of the code in tact.

You can attempt to encode the string and use a try/except to detect those that contain non-ascii characters. Something like this might work for you:
ascii_strings = []
non_ascii_strings = []
for s in sequence_of_strings:
try:
if isinstance(s, bytes): # handle Python 3 byte strings
_ = s.decode('ascii')
else:
_ = s.encode('ascii')
ascii_strings.append(s)
except UnicodeError:
non_ascii_strings.append(s)
That's the general idea and it should work in Python 2 and 3.

Related

Encoding and decoding for chars are not treated the same for polish letters

From other source i get two names with two polish letter (ń and ó), like below:
piaseczyński
zielonogórski
Of course these names is more then two.
The 1st should be looks like piaseczyński and the 2nd looks good. But when I use some operation to fix it using:
str(entity_name).encode('1252').decode('utf-8') then 1st is fixed, but 2nd return error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
Why polish letter are not treated the same?
How to fix it?
As you probably realise already, those strings have different encodings. The best approach is to fix it at the source, so that it always returns UTF-8 (or at least some consistent, known encoding).
If you really can't do that, you should try to decode as UTF-8 first, because it's more strict: not every string of bytes is valid UTF-8. If you get UnicodeDecodeError, try to decode it as some other encoding:
def decode_crappy_bytes(b):
try:
return b.decode('utf-8')
except UnicodeDecodeError:
return b.decode('1252')
Note that this can still fail, in two ways:
If you get a string in some non-UTF-8 encoding that happens to be decodable as UTF-8 as well.
If you get a string in a non-UTF-8 encoding that's not Windows codepage 1252. Another common one in Europe is ISO-8859-1 (Latin-1). Every bytestring that's valid in one is also valid in the other.
If you do need to deal with multiple different non-UTF-8 encodings and you know that it should be Polish, you could count the number of non-ASCII Polish letters in each possible decoding, and return the one with the highest score. Still not infallible, so really, it's best to fix it at the source.
#Thomas I added another except then now works perfectly:
try:
entity_name = entity_name.encode('1252').decode('utf-8')
except UnicodeDecodeError:
pass
except UnicodeEncodeError:
pass
Passed for żarski.

'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)

I am running into this problem where when I try to decode a string I run into one error,when I try to encode I run into another error,errors below,is there a permanent solution for this?
P.S please note that you may not be able to reproduce the encoding error with the string I provided as I couldnt copy/paste some errors
text = "sometext"
string = '\n'.join(list(set(text)))
try:
print "decode"
text = string.decode('UTF-8')
except Exception as e:
print e
text = string.encode('UTF-8')
Errors:-
error while using string.decode('UTF-8')
'ascii' codec can't encode character u'\u2602' in position 438: ordinal not in range(128)
Error while using string.encode('UTF-8')
Exception All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The First Error
The code you have provided will work as the text is a a bytestring (as you are using Python 2). But what you're trying to do is to decode from a UTF-8 string to
an ASCII one, which is possible, but only if that Unicode string contains only characters that have an ASCII equivalent (you can see the list of ASCII characters here). In your case, it's encountering a unicode character (specifically ☂) which has no ASCII equivalent. You can get around this behaviour by using:
string.decode('UTF-8', 'ignore')
Which will just ignore (i.e. replace with nothing) the characters that cannot be encoded into ASCII.
The Second Error
This error is more interesting. It appears the text you are trying to encode into UTF-8 contains either NULL bytes or specific control characters, which are not allowed by the version of Unicode (UTF-8) that you are trying to encode into. Again, the code that you have actually provided works, but something in the text that you are trying to encode is violating the encoding. You can try the same trick as above:
string.encode('UTF-8', 'ignore')
Which will simply remove the offending characters, or you can look into what it is in your specific text input that is causing the problem.

Python 3 Decoding Strings

I understand that this is likely a repeat question, but I'm having trouble finding a solution.
In short I have a string I'd like to decode:
raw = "\x94my quote\x94"
string = decode(raw)
expected from string
'"my quote"'
Last point of note is that I'm working with Python 3 so raw is unicode, and thus is already decoded. Given that, what exactly do I need to do to "decode" the "\x94" characters?
string = "\x22my quote\x22"
print(string)
You don't need to decode, Python 3 does that for you, but you need the correct control character for the double quote "
If however you have a different character set, it appears you have Windows-1252, then you need to decode the byte string from that character set:
str(b"\x94my quote\x94", "windows-1252")
If your string isn't a byte string you have to encode it first, I found the latin-1 encoding to work:
string = "\x94my quote\x94"
str(string.encode("latin-1"), "windows-1252")
I don't know if you mean to this, but this works:
some_binary = a = b"\x94my quote\x94"
result = some_binary.decode()
And you got the result...
If you don't know which encoding to choose, you can use chardet.detect:
import chardet
chardet.detect(some_binary)
Did you try it like this? I think you need to call decode as a method of the byte class, and pass utf-8 as the argument. Add b in front of the string too.
string = b"\x94my quote\x94"
decoded_str = string.decode('utf-8', 'ignore')
print(decoded_str)

How can I encode/decode \xbe in python?

I have an excel file I am reading in python using the xlrd module. I am extracting the values from each row, adding some additional data and writing it all out to a new text file. However I am running into an issue with cells that contain text with the fraction 3/4. Python reads the value as \xbe, and each time I encounter it, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xbe' in position 317: ordinal not in range(128)
I am converting my list of values from each row into a string, have tried the following without success:
row_vals_str = [unicode(str(val), 'utf-8') for val in row_vals]
row_vals_str = [str(val).encode('utf-8') for val in row_vals]
row_vals_str = [str(val).decode() for val in row_vals]
Each time I hit the first occurrence of the 3/4 fraction I get the same error.
How can I convert this to something that can be written to text?
I came across this thread but didn't find an answer: How to convert \xXY encoded characters to UTF-8 in Python?
It is latin-1 group. you can use latin1 to decode the char or replace to different one if you do not need it.
http://www.codetable.net/hex/be
>>> '\xbe'.decode('latin1')
u'\xbe'
>>> '\xbe'.decode('cp1252')
u'\xbe'
>>> '\xbe this is a test'.replace('\xbe','3/4')
'3/4 this is a test'
What actually ending up working was to to decode the string, then encode it, then replace:
row_vals_str = [str(val).decode('latin1').encode('utf8').replace(r'\xbe', '3/4') for val in row_vals]

UnicodeDecodeError when using a Python string handling function

I'm doing this:
word.rstrip(s)
Where word and s are strings containing unicode characters.
I'm getting this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
There's a bug report where this error happens on some Windows Django systems. However, my situation seems unrelated to that case.
What could be the problem?
EDIT: The code is like this:
def Strip(word):
for s in suffixes:
return word.rstrip(s)
The issue is that s is a bytestring, while word is a unicode string - so, Python tries to turn s into a unicode string so that the rstrip makes sense. The issue is, it assumes s is encoded in ASCII, which it clearly isn't (since it contains a character outside the ASCII range).
So, since you intitialise it as a literal, it is very easy to turn it into a unicode string by putting a u in front of it:
suffixes = [u'ি']
Will work. As you add more suffixes, you'll need the u in front of all of them individually.
I guess this happens because of implicit conversion in python2.
It's explained in this document, but I recommend you to read the whole presentation about handling unicode in python 2 and 3 (and why python3 is better ;-))
So, I think the solution to your problem would be to force the decoding of strings as utf8 before striping.
Something like :
def Strip(word):
word = word.decode("utf8")
for s in suffixes:
return word.rstrip(s.decode("utf8")
Second try :
def Strip(word):
if type(word) == str:
word = word.decode("utf8")
for s in suffixes:
if type(s) == str:
s = s.decode("utf8")
return word.rstrip(s)

Categories

Resources