Decode individual octal characters in string variable - python

A string variable sometimes includes octal escapes that need to be un-octaled. Example: oct_var = "String\302\240with\302\240octals"; the value of oct_var should be "String with octals", with non-breaking spaces.
The codecs module doesn't support octal, and I failed to find a working solution with encode(). The strings originate upstream, outside my control.
Python 3.9.8
Edited to add:
It doesn't have to scale or be ultra fast, so maybe the idea from here (#6) can work (not tested yet):
import re

def decode(encoded):
    # Replace each literal \ooo escape with the character it names.
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    # Python 3 str has no .decode(); round-trip via Latin-1 to get UTF-8 bytes.
    return encoded.encode('latin-1').decode('utf8')
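As a quick sanity check of the fixed version above (a sketch, using the sample value rather than the real upstream data):
>>> decode(r"String\302\240with\302\240octals")
'String\xa0with\xa0octals'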

You forgot to indicate that oct_var should be given as bytes:
>>> oct_var = b"String\302\240with\302\240octals"
>>> oct_var.decode()
'String\xa0with\xa0octals'
>>> print(oct_var.decode())
String with octals
Note: if your value already arrives as a string (beyond your control), you can try to convert it to bytes:
>>> oct_str = "String\302\240with\302\240octals" # as a string
>>> oct_var = bytes([ord(c) for c in oct_str])
# often equivalent to:
>>> oct_var = oct_str.encode('Latin1')
and then proceed as above.
Note: if the string also contains chars beyond ASCII (e.g., with Latin-1, accented chars like 'é'), the subsequent .decode() will fail, as in UTF-8 those are represented as multibyte sequences (e.g. 'é'.encode() == b'\xc3\xa9', but 'é'.encode('Latin1') == b'\xe9'). If the string contains Unicode chars beyond Latin-1 (e.g. '你好'), you will get a ValueError or a UnicodeEncodeError, depending on which of the two conversion methods you choose.
In short: don't fly anything expensive, heavy, or with people inside with that -- this is hacky. At the very least, surround your code with try ... except (ValueError, UnicodeEncodeError, UnicodeDecodeError) and handle these exceptions accordingly.
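A minimal sketch of that defensive wrapping, assuming the string-to-bytes conversion above (the function name and the fallback are illustrative):
def octal_str_to_text(oct_str):
    # Hacky round-trip as described above; may raise on chars beyond Latin-1.
    try:
        return bytes(ord(c) for c in oct_str).decode('utf-8')
    except (ValueError, UnicodeEncodeError, UnicodeDecodeError):
        return oct_str  # illustrative fallback: keep the raw value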

Putting your ideas and pointers together, and accepting the risks that come with using an undocumented function[*], i.e., codecs.escape_decode, this line works:
value = (codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8"))
[*] "Internal function means: you can use it on your risk but the function can be changed or even removed in any Python release."
Explanations for codecs.escape_decode:
https://stackoverflow.com/a/37059682/5309571
Examples for its use:
https://www.programcreek.com/python/example/8498/codecs.escape_decode
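For illustration, here is what the escape_decode round trip looks like interactively on Python 3.9 (a sketch; note that escape_decode returns a (bytes, length) tuple, hence the [0]):
>>> import codecs
>>> oct_var = r"String\302\240with\302\240octals"  # literal backslash escapes
>>> codecs.escape_decode(bytes(oct_var, "latin-1"))[0]
b'String\xc2\xa0with\xc2\xa0octals'
>>> codecs.escape_decode(bytes(oct_var, "latin-1"))[0].decode("utf-8")
'String\xa0with\xa0octals'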
Other approaches that may turn out to be more future-proof than codecs.escape_decode (no warranty, I have not tried them):
https://stackoverflow.com/a/58829514/5309571
https://bytes.com/topic/python/answers/743965-converting-octal-escaped-utf-8-a

Check if a character is Unicode in a way compatible with Python 2 and 3

EDIT: Clarifying the allowed character set based on comments
The allowed characters from the ASCII character set are a-z, A-Z, 0-9, -, _, ., and /. Any other character from the ASCII set should not be allowed.
Unicode characters apart from the disallowed ASCII set defined above are also allowed.
End of Edit
I am processing some text data where the only allowed ASCII characters are a-z, A-Z, 0-9, and -, _, ., /. Apart from these, Unicode characters are also allowed. I need to make sure that the incoming data contains only this set of characters.
Checking for the allowed ASCII characters is easy:
from string import ascii_letters, digits

VALID_CHARSET = set(ascii_letters + digits + "-_./")

def is_valid_string(string):
    for c in string:
        if c not in VALID_CHARSET:
            return False
    return True
But I am wondering about how to allow unicode characters apart from the above. I guess in Python-2.7 I could add a check like so:
if isinstance(c, unicode):
    return True
if c not in VALID_CHARSET:
    return False
But strings in Python 3 are Unicode by default and there is no separate unicode type, so this would not work there. Is there a cleaner way of doing this which works in both versions of Python?
As I read the question, you want to allow any non-ASCII character, plus the whitelisted ASCII characters. Since making a set of all valid characters is impractical (it would have over a million entries), the simplest solution is to make a set of invalid characters and verify that your strings contain none of them:
VALID_CHARSET = frozenset(ascii_letters + digits + "-_./")
INVALID_CHARSET = frozenset(map(chr, range(128))) - VALID_CHARSET
Once you have that, is_valid_string becomes trivial:
def is_valid_string(string):
    return INVALID_CHARSET.isdisjoint(string)
If you felt like it, you could even avoid defining the Python-level function at all, saving a little call overhead (at the expense of not being able to define your own docstring), by just making an alias to the bound isdisjoint method:
is_valid_string = INVALID_CHARSET.isdisjoint
You're not going to get any faster than that; set/frozenset's isdisjoint method pushes all the work to the C layer (no bytecode processing overhead per character), short-circuits (as soon as an invalid character is seen, it returns immediately), and performs each lookup in ~O(1) (so testing a string is O(n) in the length of the string).
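A few illustrative checks on Python 3, assuming the definitions above:
>>> is_valid_string("path/to/file_1.txt")
True
>>> is_valid_string("two words")        # space is ASCII but not whitelisted
False
>>> is_valid_string("データ.txt")        # non-ASCII characters are allowed
True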
If you aren't concerned with checking, but rather, want to strip out invalid characters, you'd want to use str.translate/unicode.translate to bulk delete the invalid characters, but given the API differs between the types (Py3 str and Py2 unicode use one form, Py3 bytes and Py2 str another), you'd have to go to some trouble to make it work on Py2 and Py3 on the same code base.
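For example, a Python 3-only sketch of the stripping variant, assuming the INVALID_CHARSET defined above (str.translate deletes any ordinal mapped to None):
DELETE_INVALID = {ord(c): None for c in INVALID_CHARSET}

def strip_invalid(string):
    return string.translate(DELETE_INVALID)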

Unicode (Cyrillic) character indexing, re-writing in python

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except that many (but not all) of the Cyrillic characters are encoded as two characters when in a str. For instance:
>>> print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":
>>> print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
So... how do I get around this? If it helps, I am using python 2.7
There are two possible situations here.
Either your str represents valid UTF-8 encoded data, or it does not.
If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...
Perhaps the easiest way to work around the issue would be to pass errors='ignore' to decode(). This will entirely discard invalid byte sequences and only give you the "correct" portions of the string.
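For instance (a sketch; the trailing \xff stands in for an arbitrary invalid byte):
>>> '\xd1\x91\xff'.decode('utf-8', 'ignore')
u'\u0451'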
These are actually different encodings:
>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']
What you're seeing are the __repr__s of the elements in the lists, not the __str__ versions of the objects.
But the strs are being passed around as variables, and so can't be
prefixed with u
You mean the data are strings, and need to be converted into the unicode type:
>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'
You need to coerce the two-byte strings into double-byte width unicode:
>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'
And you'll see with this transform they're perfectly fine.
To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:
>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'
The encoding depends on the data source. It can be anything; e.g., if the data comes from a web page, see A good way to get the charset/encoding of an HTTP response in Python.
Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.
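A sketch of the effect of that future import on Python 2 (assuming a UTF-8 source/terminal):
>>> from __future__ import unicode_literals
>>> "ё"            # now a unicode literal: one code point, not two bytes
u'\u0451'
>>> len("ё")
1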
Note: a single user-perceived character may span several Unicode codepoints e.g.:
>>> print(u'\u0435\u0308')
ё

How many displayable characters in a unicode string (Japanese / Chinese)

I'd need to know how many displayable characters are in a unicode string containing japanese / chinese characters.
Sample code to make the question very obvious:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)   # prints 12
print str        # prints 睡眠時間  <<< note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes; len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know what encoding was used to encode them. The most popular encoding is UTF-8. In UTF-8, one code point can take from 1 to 4 bytes. But you don't have to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points.
Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of code points:
>>> len(str.decode('utf8'))
4
UPDATE: Also look at abarnert's answer, which covers all the possible cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode doesn't really define anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata

def displayable(c):
    # General category 'C*' covers control, format, private-use, unassigned.
    return not unicodedata.category(c).startswith('C')

p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
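As a small illustration of why normalization matters here (a sketch using the standard example above):
>>> import unicodedata
>>> s = u'C\u0327'   # 'C' plus combining cedilla: one user-perceived character
>>> len(s)
2
>>> len(unicodedata.normalize('NFKC', s))   # composes to u'\xc7'
1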
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not; for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for every distinction anyone might want to make.

python: extended ASCII codes

Hi, I want to know how I can append and then print extended ASCII codes in Python.
I have the following.
code = chr(247)
li = []
li.append(code)
print li
The result Python prints out is ['\xf7'] when it should be a division symbol. If I simply print code directly ("print code") then I get the division symbol, but not if I append it to a list. What am I doing wrong?
Thanks.
When you print a list, it outputs the default representation of all its elements, i.e. by calling repr() on each of them. The repr() of a string is its escaped code, by design. If you want to output all the elements of the list properly you should convert them to a string, e.g. via ', '.join(li).
Note that as those in the comments have stated, there isn't really any such thing as "extended ASCII", there are just various different encodings.
You probably want the charmap encoding, which lets you turn unicode into bytes without 'magic' conversions.
s = '\xf7'
b = s.encode('charmap')
with open('/dev/stdout', 'wb') as f:
    f.write(b)
    f.flush()
Will print ÷ on my system.
Note that 'extended ASCII' refers to any of a number of proprietary extensions to ASCII, none of which were ever officially adopted and all of which are incompatible with each other. As a result, the symbol output by that code will vary based on the controlling terminal's choice of how to interpret it.
There's no single defined standard named "extended ASCII codes" - there are, however, plenty of characters, tens of thousands, as defined in the Unicode standards.
You may be limited to the charset encoding of your text terminal, which you might think of as "extended ASCII", but which could be "latin-1", for example. (If you are on a Unix system such as Linux or Mac OS X, your text terminal will likely use UTF-8 encoding, and be able to display any of the tens of thousands of chars available in Unicode.)
So, you should read this piece in order to understand what text is after 1992: http://www.joelonsoftware.com/articles/Unicode.html. If you try to write any production application believing in "extended ASCII", you are harming yourself, your users, and the whole ecosystem at once.
That said, Python 2's (and Python 3's) print will implicitly call str on the objects passed in. If you pass a list, this conversion does not recursively call str for each list element; instead, it uses each element's repr, which displays non-ASCII characters as their numeric representation or other unsuitable notations.
You can simply join your desired characters in a unicode string, for example, and then print them normally, using the terminal encoding:
import sys
mytext = u""
mytext += unichr(247) #check the codes for unicode chars here: http://en.wikipedia.org/wiki/List_of_Unicode_characters
print mytext.encode(sys.stdout.encoding, errors="replace")
You are doing nothing wrong.
What you do is to add a string of length 1 to a list.
This string contains a character outside the range of printable characters, and outside of ASCII (which is only 7 bit). That's why its representation looks like '\xf7'.
If you print it, it will be transformed as well as the system can manage.
In Python 2, the byte will be just printed. The resulting output may be the division symbol, or any other thing, according to what your system's encoding is.
In Python 3, it is a unicode character and will be processed according to how stdout is set up. Normally, this indeed should be the division symbol.
In a representation of a list, the __repr__() of the string is called, leading to what you see.
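A quick Python 2 illustration of that difference (a sketch; whether the lone byte renders as ÷ depends on your terminal's encoding):
>>> code = chr(247)
>>> print code      # the byte goes to the terminal as-is
÷
>>> print [code]    # the list shows repr() of the element
['\xf7']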

What is the fool proof way to convert some string (utf-8 or else) to a simple ASCII string in python

Inside my Python script, I get some string back from a function which I didn't write. The encoding of it varies. I need to convert it to ASCII format. Is there some fool-proof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...
If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:
Don't muck about with encode/decode, use the repr() function (Python 2.X) or the ascii() function (Python 3.x).
You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.
Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:
encoded_string = i_have_no_control()
the_encoding = 'utf-8' # for the sake of example
text = unicode(encoded_string, the_encoding)
Then you can re-encode your unicode object as ASCII, if you like.
ascii_garbage = text.encode('ascii', 'replace')
* There are heuristic methods for guessing encodings, but they are slow and unreliable. Here's one excellent attempt in Python.
I'd try to normalize the string, then encode it. What about:
import unicodedata
s = u"éèêàùçÇ"
print unicodedata.normalize('NFKD',s).encode('ascii','ignore')
This works only if you have unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it. If you don't, there are encoding-detection heuristics, but on short strings they are not reliable.
Of course, you could get lucky: the function's output may rely on various unknown encodings that use ASCII as a common base, so they would allocate the same values for the bytes from 0 to 127 (like UTF-8 does).
In that case, you can just get rid of the unwanted chars by filtering them out:
import string  # string.printable holds the printable ASCII chars
print "".join(c for c in s if c in string.printable)
Or if you want blanks instead:
print "".join((char if char in string.printable else " ") for char in s)
"translate" can help you to do the same.
The only way to know if you are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev needs :-)
What's meant by "foolproof" is that the function does not fail with even the most obscure, impossible input -- meaning, you could feed the function random binary data and IT WOULD NEVER FAIL, NO MATTER WHAT. That's what "foolproof" means.
The function should then proceed to do its best to convert to the destination encoding. If it has to throw away all the trash it does not understand, then that is perfectly fine and is in fact the most desirable result. Why try to salvage all the junk? Just discard the junk. Tell the user he's not merely a moron for using Microsoft anything, but a non-standard moron for using non-standard Microsoft anything... or for attempting to send in binary data!
I have just precisely this same need (though my need is in PHP), and I also have users who are at least as moronic as I am, sometimes more so; however, they are definitely nicer and no doubt more patient.
The best, bottom-line thing I've found so far is (in PHP 5.3):
$fixed_string = iconv( 'ISO-8859-1', 'UTF-8//IGNORE//TRANSLATE', $in_string );
This attempts to translate whatever it can and simply throws away all the junk, resulting in a legal UTF-8 string output. I've also not been able to break it or cause it to fail or reject any incoming text or data, even by feeding it gobs of binary junk data.
Finding the iconv() and getting it to work is easy; what's so maddening and wasteful is reading through all the total garbage and bend-over-backwards idiocy that so many programmers seem to espouse when dealing with this encoding fiasco. What's become of the enviable (and respectable) "Flail and Burn The Idiots" mentality of old school programming? Let's get back to basics. Use iconv() and throw away their garbage, and don't be bashful when telling them you threw away their garbage -- in short, don't fail to flail the morons who feed you garbage. And you can tell them I told you so.
If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.
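A one-line sketch of that idea for a Python 2 byte string s (assuming you simply want to drop the high-bit bytes):
ascii_only = ''.join(c for c in s if ord(c) < 128)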
If it's a normal string (i.e., not unicode), you need to decode it in an arbitrary character set (such as iso-8859-1 because it accepts any byte values) and then encode in ascii, using the ignore or replace option for errors:
>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'
The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:
>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1??2????3????4??'
