Get a UTF-16 string length from memory in Python

I need to read a UTF-16 encoded string that is stored in memory from a Python script for LLDB. According to the LLDB documentation I can use ReadMemory(address, length, error), but I need to know the string's length in advance.
Without the correct length, Python's decode function fails when it stumbles upon a character it cannot decode (even with the 'ignore' option) and the process stops:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u018e' in position 12: ordinal not in range(128)
Can anyone suggest a way of achieving this? (either using a "python" or "lldb python" implementation). I don't have the original string's length.
Thanks.

Is the string 0-terminated? If so, you could read 2 bytes at a time, until you encounter 0x0000, and then you'd know you have a complete string.
If you do this, you'd want to give yourself a constraint (e.g. "I will give up after reading - say - 1MB of data", in case you're running into corrupted memory).
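A minimal sketch of that loop, assuming the string is little-endian UTF-16 and terminated by 0x0000. Here `read_memory` is a hypothetical callable that returns raw bytes for an address and size; in an LLDB script it could wrap the ReadMemory(address, length, error) call mentioned in the question. The helper name and the 1 MB default cap are illustrative only.

```python
def read_utf16_cstring(read_memory, address, max_bytes=1024 * 1024):
    """Read 2 bytes at a time until 0x0000, then decode as UTF-16LE.

    read_memory(address, size) is a hypothetical callable returning bytes.
    """
    raw = bytearray()
    while len(raw) < max_bytes:          # give up after max_bytes of data
        unit = read_memory(address + len(raw), 2)
        if len(unit) < 2 or unit == b"\x00\x00":
            break                        # short read or terminator found
        raw += unit
    # errors="replace" keeps going past unpaired surrogates or other garbage
    return raw.decode("utf-16-le", errors="replace")


# Stand-in for process memory so the sketch runs outside a debugger
fake_memory = "caf\u00e9 \u018e".encode("utf-16-le") + b"\x00\x00" + b"\xff\xfe junk"

def fake_read(address, size):
    return fake_memory[address:address + size]

text = read_utf16_cstring(fake_read, 0)
print(len(text), repr(text))
```

The string's length falls out as a side effect: it is the number of bytes collected before the terminator, so no second read is needed.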

Related

How to encode unicode surrogate pairs, to write to file

I'm creating a compression / encryption program that utilises (nearly) all Unicode characters and then writes the data to a file. To write to the file, I need to encode the characters to bytes, but when I do so, it gives this error:
unicodeencodeerror: 'utf-8' codec can't encode character '\udd7d' in position 1323: surrogates not allowed
I've tried all of the built-in Python codecs; none of them work except 'utf-7', which simply encodes the Unicode to base64 and so defeats the object of what I'm trying to achieve.
with open(str(file_name.capitalize()) + ".Unicode_File", "wb") as file:
    file.write(unicode_madness.encode("utf-8"))
I expect it to write the variable 'unicode_madness' to the file, which it does, however sometimes it tries to use a surrogate unicode character.
To resolve this, I either need to be able to avoid the surrogate characters (while keeping the compression lossless), or I need to find out which unicode characters use surrogates, and I can adjust the program accordingly.
Thanks for any help!
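On the second option: the surrogate code points are the fixed range U+D800 to U+DFFF, so they can be detected or left out of the alphabet the program draws from. The sketch below assumes Python 3, and the helper names are illustrative. It also shows the `surrogatepass` error handler as one way to write lone surrogates anyway, with the caveat that the resulting bytes are not valid UTF-8 for other readers.

```python
def is_surrogate(ch):
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def usable_code_points():
    """Every code point except the surrogates, usable as a lossless alphabet."""
    return [cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]


data = "ok\udd7d"
print([is_surrogate(ch) for ch in data])

# Lone surrogates only round-trip through UTF-8 with the surrogatepass handler
encoded = data.encode("utf-8", errors="surrogatepass")
decoded = encoded.decode("utf-8", errors="surrogatepass")
```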

Write unmapped characters to file?

For example: the character \x80, or 128 in decimal, has no UTF-8 character assigned to it. But if I understand text files correctly, I should still be able to create a file that contains that character, even if nothing can display it. However, when I try to print an array that contains one of these characters, it writes as '\x80', and when I try to write it directly as a chr, I get the error "UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 0: character maps to <undefined>". Am I doing something fundamentally wrong, or is there a fix I just don't know about here?
The bytes type in Python is what I should have been using for this. Though I didn't quite understand it when I was posting the question, I needed a list of single-byte variables. This is exactly what the bytes object does, and even better, it can be used exactly like a string.
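A small sketch of that approach, assuming Python 3. The file name is made up for the example, and a temporary directory is used so nothing is left behind in the working directory.

```python
import os
import tempfile

# A bytes object holds raw values 0..255 with no text encoding involved
payload = bytes([0x48, 0x69, 0x80])

path = os.path.join(tempfile.mkdtemp(), "raw.bin")  # hypothetical file name
with open(path, "wb") as f:      # binary mode: bytes are written verbatim
    f.write(payload)

with open(path, "rb") as f:
    round_trip = f.read()

print(round_trip)       # the repr shows non-ASCII values as \x escapes
print(round_trip[2])    # indexing a bytes object yields an integer
```

Opening the file in binary mode is what avoids the 'charmap' error: no codec is asked to turn the value into text.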

python ascii codes to utf

So when I post a name or text in mod_python in my native language I get:
&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;
And I also get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
When I use:
hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))
How can I decode it?
It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of
Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)
In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character[1]. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- well over a hundred thousand of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.
The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.
To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.
[1] Sort of.
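To make the bytes versus code points distinction concrete, here is a short Python 3 sketch (the rest of this answer uses Python 2.7). The variable names are illustrative.

```python
text = "македонија"                 # str: a sequence of code points
utf8 = text.encode("utf-8")         # bytes: one possible encoding
utf16 = text.encode("utf-16-le")    # bytes: a different encoding of the same text

print(len(text), len(utf8), len(utf16))

# Decoding with the matching encoding restores the original code points
same_from_utf8 = utf8.decode("utf-8")
same_from_utf16 = utf16.decode("utf-16-le")

# str and bytes never mix implicitly in Python 3
try:
    text + utf8
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Where Python 2 would silently attempt an ASCII conversion, Python 3 raises a TypeError, so the mistake shows up at the line that mixes the two forms.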
EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.
Anyway, if you step through what you're doing in the shell you'll see
>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'
I'm using Python 2.7 here, so that's a Unicode string i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like
>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'
But we could also pick a different encoding!
>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'
You'll need to decide what encoding you want to use.
What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try
>>> text.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
you just get an error, because you can't encode those code points in ASCII.
When you do req.write, you are trying to write a list of code points down the request. But HTTP doesn't understand code points: it only carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.
So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
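In Python 3 the same step might look like the sketch below. It uses html.unescape from the standard library and assumes UTF-8 is the encoding the response declares. `render_body` is an illustrative name, and the entity string is an assumed input.

```python
import html

def render_body(text, encoding="utf-8"):
    """Turn character references into code points, then into bytes to send."""
    return html.unescape(text).encode(encoding)

body = render_body("&#1084;&#1072;&#1082;")
print(body)
```

Whatever encoding is chosen here has to match the charset announced in the response's Content-Type header, otherwise the browser will decode the bytes incorrectly.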

utf-8 plus question marks

I have a site that displays user input by decoding it to unicode using utf-8. However, user input can include binary data, which is obviously not always able to be 'decoded' by utf-8.
I'm using Python, and I get an error saying:
'utf8' codec can't decode byte 0xbf in position 0: unexpected code byte. You passed in '\xbf\xcd...
Is there a standard efficient way to convert those undecodable characters into question marks?
It would be most helpful if the answer uses Python.
Try:
inputstring.decode("utf8", "replace")
See here for reference
I think what you are looking for is:
str.decode('utf8','ignore')
which should drop invalid bytes rather than raising an exception.
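The two handlers behave differently, which the Python 3 sketch below illustrates (in Python 3 the decode method lives on bytes rather than str). Note that 'replace' inserts U+FFFD, the Unicode replacement character, rather than a literal question mark, so a final substitution is needed if '?' is what should be displayed.

```python
raw = b"\xbfabc"   # 0xbf cannot start a UTF-8 sequence

replaced = raw.decode("utf-8", errors="replace")   # inserts U+FFFD
ignored = raw.decode("utf-8", errors="ignore")     # drops the bad byte

# Swap U+FFFD for '?' if literal question marks are wanted
question_marks = replaced.replace("\ufffd", "?")

print(repr(replaced), repr(ignored), repr(question_marks))
```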

set the implicit default encoding\decoding error handling in python

I am working with external data that's encoded in latin1, so I've added a sitecustomize.py containing
sys.setdefaultencoding('latin_1')
Sure enough, working with latin1 strings now works fine.
But in case I encounter something that is not encoded in latin1:
s=str(u'abc\u2013')
I get UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 3: ordinal not in range(256)
What I would like is for the unencodable characters to simply be replaced, i.e. I would get s=='abc?' in the above example, without explicitly calling decode() or encode() each time, i.e. not s.decode(..., 'replace') on each call.
I tried doing different things with codecs.register_error but to no avail.
please help?
There is a reason scripts can't call sys.setdefaultencoding. Don't do that: some libraries (including standard libraries included with Python) expect the default to be 'ascii'.
Instead, explicitly decode strings to Unicode when read into your program (via file, stdin, socket, etc.) and explicitly encode strings when writing them out.
Explicit decoding takes a parameter specifying behavior for undecodable bytes.
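A sketch of what decoding and encoding at the boundaries could look like, written for Python 3. The two helper names are hypothetical; wrapping the calls in one place is what avoids repeating the error-handling argument at every call site.

```python
def from_wire(raw):
    """Decode incoming latin-1 bytes; every byte value maps to a code point."""
    return raw.decode("latin-1")

def to_wire(text):
    """Encode for output, substituting '?' for anything latin-1 lacks."""
    return text.encode("latin-1", errors="replace")

print(from_wire(b"caf\xe9"))
print(to_wire("abc\u2013"))
```

With the 'replace' handler on the encoding side, the example from the question comes out as 'abc?' without touching the interpreter-wide default.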
You can define your own custom handler and use it instead to do as you please. See this example:
import codecs
from logging import getLogger

log = getLogger()

def custom_character_handler(exception):
    # Log what failed, then substitute '?' and resume after the bad span
    log.error("%s for %s on %s from position %s to %s. Using '?' in-place of it!",
              exception.reason,
              exception.object[exception.start:exception.end],
              exception.encoding,
              exception.start,
              exception.end)
    return ("?", exception.end)

codecs.register_error("custom_character_handler", custom_character_handler)

print(b'F\xc3\xb8\xc3\xb6\xbbB\xc3\xa5r'.decode('utf8', 'custom_character_handler'))
print(codecs.encode(u"abc\u03c0de", "ascii", "custom_character_handler"))
Running it, you will see:
invalid start byte for b'\xbb' on utf-8 from position 5 to 6. Using '?' in-place of it!
Føö?Bår
ordinal not in range(128) for π on ascii from position 3 to 4. Using '?' in-place of it!
b'abc?de'
References:
https://docs.python.org/3/library/codecs.html#codecs.register_error
https://docs.python.org/3/library/exceptions.html#UnicodeError
How to ignore invalid lines in a file?
'str' object has no attribute 'decode'. Python 3 error?
How to replace invalid unicode characters in a string in Python?
UnicodeDecodeError in Python when reading a file, how to ignore the error and jump to the next line?
