Displaying Unprintable Characters in Pygame - python

I'm trying to create a game with ASCII art in Pygame 2.7. If I go to the Idle console and simply type:
for i in range(255):
    print str(i) + ' - ' + str(chr(i))
I get nearly 255 distinct characters. However, if I try a similar stunt in Pygame:
import pygame, os, string, sys
from pygame.locals import *
pygame.init()
class Prog:
    def __init__(self):
        self.title = 'Text'
        self.screen_size = (800, 600)
        self.screen = pygame.display.set_mode(self.screen_size)
        self.bg_color = (255, 255, 255)
        self.text_color = (0,0,0)
        self.text_font = 'Times New Roman'
        self.text_size = 20
        self.font = pygame.font.SysFont(self.text_font, self.text_size)
    def draw_text(self, text, x, y):
        textobj = self.font.render(text, 1, self.text_color)
        textrect = textobj.get_rect()
        textrect.center = (x,y)
        self.screen.blit(textobj, textrect)
    def main(self):
        rec = self.screen.get_rect()
        done = False
        while not done:
            for event in pygame.event.get():
                if event.type == QUIT:
                    done = True
            self.screen.fill(self.bg_color)
            x = 20
            y = 100
            n = 1
            for i in range(1, 255):
                self.draw_text(chr(i), x, y)
                x += 20
                n += 1
                if n > 25:
                    n = 1
                    x = 20
                    y += 30
            pygame.display.flip()
Most of it is just empty boxes. Why the discrepancy? I've tried changing fonts, even using the one that Idle uses; I've tried parsing it as unicode; nothing seems to work. It wouldn't bother me so much if not for the fact that, as I said, it prints in Idle just fine, and some of the characters I can't get are present in other ASCII games I play, so it must somehow be renderable.
Can anyone advise? I'm a self/google-taught amateur, and would honestly prefer not to have to download extra modules if at all avoidable. If nothing else, I'll settle for an explanation of my computer's apparent double-standard on this issue.
Thanks so much.

I think the main issue is that Pygame and Idle are using different default encodings to interpret your strings. Strings in Python 2 are tough to understand, so bear with me a little. Also, correct me if I'm mistaken, but I'm assuming you're using Python 2, which handles strings much differently than Python 3.
Strings in Python 2 are just series of bytes. In Python 2, bytes is simply another name for str, so a bytes literal (a string prefixed with b) and a bytes object are the same thing as a plain string.
>>> 'a' == b'a' == bytes('a')
True
chr(i) "Returns a string of one character whose ASCII code is the integer i." (Docs) If a string is a series of bytes, chr returns bytes. Let's look at what chr returns if we call it in the Python 2 interpreter:
>>> chr(97)
'a'
>>> chr(120)
'x'
The reason those letters show up in the interpreter output is that Python 2's default encoding is ASCII, so when it sees a byte value that corresponds to a symbol on the ASCII chart, it replaces the byte with that symbol. ASCII is limited to symbols corresponding to values 0 through 127. This chart shows what symbol each value corresponds to. 97 is 'a' and 120 is 'x'.
You'll notice that 0-31 and 127 are "control characters" that don't represent symbols, but communicate things to the machine reading the text. For example, in the C language, the end of a string is denoted by the null character, which is codepoint 0 in ASCII.
chr accepts an argument from 0 through 255. What happens if we pass an integer to chr that represents a control character in ASCII? What if we pass it an integer greater than 127?
>>> chr(4)
'\x04'
>>> chr(163)
'\xa3'
\x in a string in Python tells the interpreter that the next two characters represent a hexadecimal value. When the byte value doesn't have a corresponding symbol in ASCII, the interpreter just shows us the byte value as a hexadecimal number. Behind the scenes, 'a' is just '\x61', but the interpreter uses ASCII to show us what '\x61' represents.
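For readers on Python 3, where chr returns a text character rather than a byte, the same behaviour can be reproduced with bytes objects. The following is only a small sketch (the variable names are illustrative and not part of the original answer) showing that the interpreter falls back to a \x escape whenever a byte has no printable ASCII symbol:

```python
# Python 3 sketch: how the interpreter displays single bytes.
# The names below are illustrative, not from the original answer.
printable_byte = bytes([97])    # value with a printable ASCII symbol
control_byte = bytes([4])       # ASCII control character
high_byte = bytes([163])        # outside ASCII's 0-127 range

for b in (printable_byte, control_byte, high_byte):
    # b[0] is the integer value, repr(b) is what the interpreter shows
    print(b[0], repr(b))
```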
There are lots of encodings besides ASCII. UTF-8 is a very common one you've probably heard of. Like ASCII, it relates numerical values to symbols. This chart shows the first 256 Unicode code points, which are the same values that Latin-1 (ISO-8859-1) assigns to single bytes. They correspond exactly to what is being output by your pygame code.
So you're passing bytes to the draw_text function, and Pygame is interpreting each byte as one of those code points (its font documentation says byte strings are assumed to be Latin-1). When it gets a value that corresponds to a control character, it simply outputs the square instead.
The reason Idle shows you different symbols is because it uses a different default encoding than Pygame. Without knowing what symbols it outputs, I can't say which encoding it is. ISO-8859-1 and Windows-1252 are two other very common encodings. You can check which encoding Idle is using by going to Preferences > General and looking at the "Default Source Encoding" option.
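To make the "different default encodings" point concrete, here is a hedged Python 3 sketch that decodes one and the same byte value under three encodings. The choice of encodings is only an example; which one Idle or a terminal actually uses depends on the system.

```python
# Python 3 sketch: one byte value, several interpretations.
raw = bytes([0x80])  # the byte value 128

as_latin1 = raw.decode("latin-1")   # U+0080, a control character
as_cp1252 = raw.decode("cp1252")    # the euro sign
as_cp437 = raw.decode("cp437")      # C with cedilla (old DOS code page)

for name, text in [("latin-1", as_latin1), ("cp1252", as_cp1252), ("cp437", as_cp437)]:
    # ascii() keeps the output readable on any console
    print(name, ascii(text))
```

The same number therefore shows up as an empty box in one program and as a visible symbol in another, without either program being wrong.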
So what do you do about it? If you truly want to limit yourself to ASCII symbols, you can use the string module, which contains useful lists of characters. You can get all non-whitespace ASCII characters with the following code:
import string
ascii_characters = string.digits + string.letters + string.punctuation
print ascii_characters
If you want to use more than just those characters, keep in mind that Pygame will never be able to render control characters. Here are 2 options I see.
Option 1: Pass a unicode string containing the character you want. This option requires you to declare the source file's encoding on the first line of your file using a magic comment:
# -*- coding: utf-8 -*-
...
self.draw_text(u"€", x, y)
Option 2: Pass a unicode string containing the unicode escape for the symbol you want. This does not require the magic comment from option 1.
self.draw_text(u'\u20ac', x, y)
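If the goal is to draw a whole grid of symbols, it may help to skip the control characters up front instead of rendering boxes. The sketch below is written for Python 3 and uses the standard unicodedata module; renderable_chars is a hypothetical helper name, and whether a glyph actually exists still depends on the font you load.

```python
import unicodedata

def renderable_chars(limit=256):
    """Return the characters below `limit` that are not control characters.

    Hypothetical helper: it only filters by Unicode category "Cc";
    it cannot know whether a given font has a glyph for each character.
    """
    return [chr(i) for i in range(limit)
            if unicodedata.category(chr(i)) != "Cc"]

chars = renderable_chars()
print(len(chars))
```

Each element of chars could then be passed to draw_text in place of chr(i). On Python 2 the equivalent call would be unichr rather than chr.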

Related

Why could a single-byte corruption happen while manipulating plain text files?

I am really confused here, and I cannot even state the topic of the question more clearly. While manipulating plain text files I encountered a weird replacement of symbols (bytes).
For example, I had a file with about 20000 strings, one of which is:
MIEPTLIRVGEAFYDITHLAPTRHTVPVLVRGNFAKVPVRISYTNHCYSRTPRAGEQVPTGHEIKDGAKLRMFCEQRHRLSSYLPQILIDLLQGETSLWQAAGGNFLQVELVDDVDGEPPTKIEYNVILRMERLKPEGDQKHIMIRVETAYPEDIEYDKPFRKKSYKVSRILAAKWEDRDHREPEPKPGKGKGKAKKK
I merge about 1000 such files by simply writing them one after another with Python (using the plain open(filename) method). In the resulting file, the corresponding string looked like this (all other strings were fine):
MIEPTLIRVGEAFYDITHLAPTRHTVPVLVRGNFAKVPVRISYTNHCYSRTPRAGEQVPTGHEIKDGAKLRMFCEQRHRLSSYLPQILIDLLQGETSLWQAAGGNFLQVELVDDVDGEPPTKIEYNVILRMERLKPEGDQKHIMIRVETAYPEDIEYDKPFRKKSЩKVSRILAAKWEDRDHREPEPKPGKGKGKAKKK
Thus, "Y" (hex 59) was replaced by the letter "Щ" (hex D9). If I repeat the procedure, no replacement occurs in this place, so it seems to be random (?). I also noticed the same kind of replacement of "P" (hex 50) by the Russian letter "Р" (hex D0) in another case. What unites these cases is that the two letters in each pair have the same offset if we count from position 0 and from position 128 of the ASCII table: English P has position 80 and Russian Р has position 128+80=208; the letter Y has position 89 and the letter "Щ" has position 128+89=217. I guess this is a kind of file corruption, but how and why does it happen? Any ideas?
I should have guessed it myself before even asking: it looks like a single bit flip, which could plausibly occur at random as an error while reading from or writing to disk. If the most significant bit of a byte coding a letter flips, the replacement becomes visible because the letter is no longer among the first 128 symbols of the ASCII table and some software becomes cranky about it.
"Y" = 01011001
"Щ" = 11011001
"P" = 01010000
"Р" = 11010000

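The bit patterns above can be checked with a few lines of Python 3. This is only a sketch: flip_high_bit is a hypothetical helper, and Windows-1251 (cp1251) is an assumption, since the post does not say which single-byte encoding the files were viewed in; it is simply an encoding in which D9 and D0 are the two Cyrillic letters mentioned.

```python
def flip_high_bit(ch, encoding="cp1251"):
    """Flip the most significant bit of a single-byte character.

    The cp1251 default is an assumption, not something the post states.
    """
    (value,) = ch.encode(encoding)          # exactly one byte expected
    return bytes([value ^ 0x80]).decode(encoding)

for letter in "YP":
    flipped = flip_high_bit(letter)
    # Show the byte before and after the flip as 8-bit binary strings
    print(format(ord(letter), "08b"), "->",
          format(flipped.encode("cp1251")[0], "08b"), ascii(flipped))
```
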
Printing characters that are not in ASCII 0-127 (Python 3)

So, I've got an algorithm whereby I take a character, take its character code, increase that code by a variable, and then print that new character. However, I'd also like it to work for characters not in the default ASCII table. Currently it's not printing 'special' characters like € (for example). How can I make it print certain special characters?
#!/usr/bin/python3
# -*- coding: utf-8 -*-
def generateKey(name):
    i = 0
    result = ""
    for char in name:
        newOrd = ord(char) + i
        newChar = chr(newOrd)
        print(newChar)
        result += newChar
        i += 1
    print("Serial key for name: ", result)
generateKey(input("Enter name: "))
Whenever I give an input that forces special characters (like |||||), it works fine for the first four characters (including DEL where it gives the transparent rectangle icon), but the fifth character (meant to be €) is also an error char, which is not what I want. How can I fix this?
Here's the output from |||||:
Enter name: |||||
|
}
~
Serial key for name: |}~
But the last char should be €, not a blank. (BTW the fourth char, DEL, becomes a transparent rectangle when I copy it into Windows)
In Python 3, chr(128) is the Unicode code point U+0080 whatever the default encoding (utf-8) is, and that is not the euro symbol. It's a control character. See this Unicode table. So indeed it should be blank, not €.
You can verify the default encoding with sys.getdefaultencoding().
If you want to reinterpret chr(128) as the euro symbol, you should use the windows-1252 encoding. There, it is indeed the euro symbol. (Different encodings disagree on how to represent values beyond ASCII's 0–127.)
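As a small illustration of the difference (a sketch, with variable names chosen here for the example), the euro sign can be reached either by decoding the byte 128 as windows-1252 or by asking for its Unicode code point directly:

```python
# chr() works with Unicode code points, not with bytes in some code page.
euro_from_cp1252 = bytes([128]).decode("windows-1252")  # byte 0x80 -> euro sign
euro_from_unicode = chr(0x20AC)                          # code point U+20AC

print(euro_from_cp1252 == euro_from_unicode)
print(repr(chr(128)))  # a control character, shown as an escape
```

For the key generator in the question, that would mean doing the arithmetic on windows-1252 bytes instead of on ord values, if windows-1252 symbols are what is wanted.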

Bytes operations in Python

I'm working on a project in which I have to perform some byte operations using Python, and I'd like to understand some basic principles before I go on with it.
t1 = b"\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"
print("Adding b character before: ",t1)
print("Using bytes(str): ",bytes(t2,"utf-8"))
print("Using str.encode: ",t2.encode())
In particular, I cannot understand why the console prints this when I run the code above:
C:\Users\Marco\PycharmProjects\codeTest\venv\Scripts\python.exe C:/Users/Marco/PycharmProjects/codeTest/msgPack/temp.py
Adding b character before: b'\xacBLETCHINGLEY'
Using bytes(str): b'\xc2\xacBLETCHINGLEY'
Using str.encode: b'\xc2\xacBLETCHINGLEY'
What I would like to understand is why, if I use bytes() or encode, I get an extra "\xc2" in front of the value. What does it mean? Is this supposed to appear? And if so, how can I get rid of it without using the first method?
Because bytes objects and str objects are two different things. The former represents a sequence of bytes, the latter represents a sequence of unicode code points. There's a huge difference between the byte 172 and the unicode code point 172.
In particular, the byte 172 doesn't encode anything in particular in unicode. On the other hand, unicode code point 172 refers to the following character:
>>> c = chr(172)
>>> print(c)
¬
And of course, the actual raw bytes this would correspond to depend on the encoding. Using utf-8 it is a two-byte encoding:
>>> c.encode()
b'\xc2\xac'
In the latin-1 encoding, it is 1 byte:
>>> c.encode('latin')
b'\xac'
If you want raw bytes, the most precise/easy way then is to use a bytes-literal.
In a string literal, \xhh (h being a hex digit) selects the corresponding unicode character U+0000 to U+00FF, with U+00AC being the ¬ "not sign". When encoding to utf-8, all code points above 0x7F take two or more bytes. \xc2\xac is the utf-8 encoding of U+00AC.
>>> "\u00AC" == "\xAC"
True
>>> "\u00AC" == "¬"
True
>>> "\xAC" == "¬"
True
>>> "\u00AC".encode('utf-8')
b'\xc2\xac'
>>> "¬".encode("utf-8")
b'\xc2\xac'
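Applied to the string from the question, the two encodings give the following. This is a sketch for the case where a str really has to be converted; it relies on latin-1 mapping the code points U+0000 to U+00FF one-to-one onto single bytes, so it fails for any character above U+00FF.

```python
t2 = "\xAC\x42\x4C\x45\x54\x43\x48\x49\x4E\x47\x4C\x45\x59"

utf8_bytes = t2.encode("utf-8")      # U+00AC becomes two bytes: c2 ac
latin1_bytes = t2.encode("latin-1")  # U+00AC becomes the single byte ac

print(utf8_bytes)
print(latin1_bytes)
```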

How to extract numerical values after expression '\xb' in byte string

I am attempting to extract numerical values from a byte string transmitted from an RS-232 port. Here is an example:
b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
If I attempt to decode the byte string as 'utf-8', I receive the following output:
x = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
x.decode('utf-8', errors='ignore')
>>> 'SS3.6\n'
What I ideally want is 23.67, which is observed after every \xb pattern. How could I extract 23.67 from this byte string?
As mentioned in https://stackoverflow.com/a/59416410/3319460, your input doesn't really represent the output you seek. But just to fulfil your requirements, we can impose the following semantics on the input:
digits and the '.' sign are kept, everything else is skipped
if a byte is a non-ASCII character, check whether its first four bits are 0xB; if so, take only the ASCII part of the byte (b & 0b01111111)
That is quite easily done in Python.
def _filter(char):
    # Keep bytes whose high nibble is 0xB, the '.' sign, and ASCII digits (48-57)
    return char & 0xF0 == 0xB0 or chr(char) == "." or 48 <= char <= 57
def filter_xbchars(value: bytes) -> str:
    # Clear the top bit of every kept byte so that 0xB2 becomes '2', 0xB7 becomes '7'
    return "".join(chr(ch & 0b01111111) for ch in value if _filter(ch))
import pytest
@pytest.mark.parametrize(
    "value, expected",
    [(b"S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n", "23.67")],
)
def test_simple(value, expected):
    assert filter_xbchars(value) == expected
Please be aware that even though the code above satisfies the requirements, it is an example of a poorly described task and, as a result, a quite nonsensical solution. The code solves the task as you asked for it, but we should first reconsider whether it even makes sense. I advise you to check the data you will test against and the meaning of the data (protocol).
Good luck :)
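One way to follow the advice about checking the protocol is to apply the same mask (b & 0b01111111) to every byte and look at what comes out. The sketch below does that under a loud assumption: that the top bit of each byte carries no data (for instance a parity bit on a 7-bit serial link). The question does not state the protocol, so this is a guess to be verified against the device's documentation; first_number is a hypothetical helper.

```python
raw = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'

# Assumption: the top bit is not data, so clear it on every byte.
masked = bytes(b & 0x7F for b in raw).decode("ascii")
print(repr(masked))

def first_number(text):
    """Return the first whitespace-separated field that parses as a float."""
    for field in text.split():
        try:
            return float(field)
        except ValueError:
            continue
    return None

print(first_number(masked))
```

If the assumption holds, the serial port itself can usually be configured for 7 data bits with parity, which would remove the need for masking in code.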
If you just want to get 23.67 from that byte string try this:
a = b'S\xa0S\xa0\xa0\xa0\xa0\xa0\xa0\xb23.6\xb7\xa0\xe7\x8d\n'
b = repr(a)[2:-1]
c = b.split("\\")
d = ''
e = []
for i in c:
    if "xb" in i:
        e.append(i[2:])
d = "".join(e)
print(d)
Please notice that \xHH is an escape code representing the hexadecimal value HH, and as such your string '\xb23.6\xb7' does not contain "23.67" but rather "(0xB2)3.6(0xB7)". Those values cannot be extracted using a regular expression because they are not present in the string in the first place.
'\xb23.6\xb7' is not a valid UTF-8 sequence, and in Latin-1 extended ASCII it would represent "²3.6·". The presence of many 0xA0 values suggests a Latin-1 encoding, as 0xA0 represents a non-breaking space in that encoding (a fairly common character), while in UTF-8 it does not encode a meaningful sequence on its own.
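Both claims are easy to check in Python 3. The snippet below is a minimal sketch with variable names chosen for the example:

```python
fragment = b'\xb23.6\xb7'

# Latin-1 maps every byte to a character, so this always succeeds.
as_latin1 = fragment.decode("latin-1")   # superscript two, "3.6", middle dot

# UTF-8 rejects the fragment because 0xB2 cannot start a sequence.
try:
    fragment.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(ascii(as_latin1), utf8_ok)
```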

Python store non numeric string as number

I am currently trying to find a way to convert any sort of text to a number, so that it can later be converted back to text.
So something like this:
text = "some string"
number = somefunction(text)
text = someotherfunction(number)
print(text) #output "some string"
If you're using Python 3, it's pretty easy. First, convert the str to bytes in a chosen encoding (utf-8 is usually appropriate), then use int.from_bytes to convert to an int:
number = int.from_bytes(mystring.encode('utf-8'), 'little')
Converting back is slightly trickier (and will lose trailing NUL bytes unless you've stored how long the resulting string should be somewhere else; if you switch to 'big' endianness, you lose leading NUL bytes instead of trailing):
recoveredstring = number.to_bytes((number.bit_length() + 7) // 8, 'little').decode('utf-8')
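Wrapped into two functions, the round trip looks like this. The function names text_to_int and int_to_text are placeholders chosen for this sketch; the logic is the same two calls shown above, using 'little' byte order.

```python
def text_to_int(text, encoding="utf-8"):
    """Encode text to bytes, then read those bytes as one integer."""
    return int.from_bytes(text.encode(encoding), "little")

def int_to_text(number, encoding="utf-8"):
    """Recover the bytes from the integer and decode them back to text."""
    length = (number.bit_length() + 7) // 8   # bytes needed to hold the value
    return number.to_bytes(length, "little").decode(encoding)

number = text_to_int("some string")
print(number)
print(int_to_text(number))
```

As noted, a trailing NUL character does not survive: int_to_text(text_to_int("a\x00")) gives back just "a".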
You can do something similar in Python 2, but it's less efficient/direct:
import binascii
number = int(binascii.hexlify(mystring.encode('utf-8')), 16)
hx = '%x' % number
hx = hx.zfill(len(hx) + (len(hx) & 1)) # Make even length hex nibbles
recoveredstring = binascii.unhexlify(hx).decode('utf-8')
That's equivalent to the 'big' endian approach in Python 3; reversing the intermediate bytes as you go in each direction would get the 'little' effect.
You can use the ASCII values to do this:
ASCII to int:
ord('a') # = 97
Back to a string:
str(unichr(97)) # = 'a'
From there you could iterate over the string one character at a time and store these in another string. Assuming you are using standard ASCII characters, you would need to zero pad the numbers (because some are two digits and some three) like so:
s = 'My string'
number_string = ''
for c in s:
number_string += str(ord(c)).zfill(3)
To decode this, you will read the new string three characters at a time and decode them into a new string.
This assumes a few things:
all characters can be represented by ASCII (you could use Unicode code points if not)
you are storing the numeric value as a string, not as an actual int type (not a big deal in Python—saves you from having to deal with maximum values for int on different systems)
you absolutely must have a numeric value, i.e. some kind of hexadecimal representation (which could be converted into an int) and cryptographic algorithms won't work
we're not talking about GB+ of text that needs to be converted in this manner
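The zero-padded scheme, including the decoding step described in words above, could be sketched as follows in Python 3 (where chr replaces unichr). The names encode_padded and decode_padded are made up for this example, and the fixed width of three digits only works for code points below 1000.

```python
def encode_padded(text):
    """Turn each character into a three-digit decimal code."""
    return "".join(str(ord(c)).zfill(3) for c in text)

def decode_padded(number_string):
    """Read three digits at a time and map each group back to a character."""
    return "".join(chr(int(number_string[i:i + 3]))
                   for i in range(0, len(number_string), 3))

encoded = encode_padded("My string")
print(encoded)
print(decode_padded(encoded))
```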
