Comparing special characters in Python - python

I have a string whose value is 'Opérations'. In my script I will read a file and do some comparisons. While comparing strings, the string that I have copied from the same source and placed in my python script DOES not equal to the same string that I receive when reading the same file in my script. Printing both strings give me 'Opérations'. However, when I encode it to utf-8 I notice the difference.
b'Ope\xcc\x81rations'
b'Op\xc3\xa9rations'
My question is what do I do to ensure that the special character in my python script is the same as the file content's when comparing such strings.

Good to know:
You are talking about two type of strings, byte string and unicode string. Each have a method to convert it to the other type of string. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. It means:
unicode.enocde() ----> bytes
and
bytes.decode() -----> unicode
and UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point. The higher the code point value, the more bytes it needs in UTF-8.
Get to the point:
If you redefine your string to two Byte strings and unicode strings, as follwos:
a_byte = b'Ope\xcc\x81rations'
a_unicode = u'Ope\xcc\x81rations'
and
b_byte = b'Op\xc3\xa9rations'
b_unicode = u'Op\xc3\xa9rations'
you w'll see:
print 'a_byte lenght is: ', len(a_byte.decode("utf-8"))
#print 'a_unicode lenght is: ',len(a_unicode.encode("utf-8"))
print 'b_byte lenght is: ',len(b_byte.decode("utf-8"))
#print 'b_unicode lenght is: ', len(b_unicode.encode("utf-8"))
output:
a_byte lenght is: 11
b_byte lenght is: 10
So you see they are not the same.
My solution:
If You don't want to be confused, then you can use repr(), and while print a_byte, b_byte printes Opérations as output, but:
print repr(a_byte),repr(b_byte)
will return:
'Ope\xcc\x81rations','Op\xc3\xa9rations'
You can also normalize the unicode before comparison as #Daniel's answer, as follows:
from unicodedata import normalize
from functools import partial
a_byte = 'Opérations'
norm = partial(normalize, 'NFC')
your_string = norm(a_byte.decode('utf8'))

Related

subprocess.check_output is returning an enclosing b' ' string [duplicate]

Apparently, the following is the valid syntax:
b'The string'
I would like to know:
What does this b character in front of the string mean?
What are the effects of using it?
What are appropriate situations to use it?
I found a related question right here on SO, but that question is about PHP though, and it states the b is used to indicate the string is binary, as opposed to Unicode, which was needed for code to be compatible from version of PHP < 6, when migrating to PHP 6. I don't think this applies to Python.
I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.
Also, just out of curiosity, are there more symbols than the b and u that do other things?
Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of Unicode characters (Latin-1, UCS-2 or UCS-4, depending on the widest character in the string)
bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
If you're familiar with:
Java or C#, think of str as String and bytes as byte[];
SQL, think of str as NVARCHAR and bytes as BINARY or BLOB;
Windows registry, think of str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
Usually text, encoded in some unspecified encoding.
But also used to represent binary data like struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
To quote the Python 2.x documentation:
A prefix of 'b' or 'B' is ignored in
Python 2; it indicates that the
literal should become a bytes literal
in Python 3 (e.g. when code is
automatically converted with 2to3). A
'u' or 'b' prefix may be followed by
an 'r' prefix.
The Python 3 documentation states:
Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b denotes a byte string.
Bytes are the actual data. Strings are an abstraction.
If you had multi-character string object and you took a single character, it would be a string, and it might be more than 1 byte in size depending on encoding.
If took 1 byte with a byte string, you'd get a single 8-bit value from 0-255 and it might not represent a complete character if those characters due to encoding were > 1 byte.
TBH I'd use strings unless I had some specific low level reason to use bytes.
From server side, if we send any response, it will be sent in the form of byte type, so it will appear in the client as b'Response from server'
In order get rid of b'....' simply use below code:
Server file:
stri="Response from server"
c.send(stri.encode())
Client file:
print(s.recv(1024).decode())
then it will print Response from server
The answer to the question is that, it does:
data.encode()
and in order to decode it(remove the b, because sometimes you don't need it)
use:
data.decode()
Here's an example where the absence of b would throw a TypeError exception in Python 3.x
>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface
Adding a b prefix would fix the problem.
It turns it into a bytes literal (or str in 2.x), and is valid for 2.6+.
The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).
In addition to what others have said, note that a single character in unicode can consist of multiple bytes.
The way unicode works is that it took the old ASCII format (7-bit code that looks like 0xxx xxxx) and added multi-bytes sequences where all bytes start with 1 (1xxx xxxx) to represent characters beyond ASCII so that Unicode would be backwards-compatible with ASCII.
>>> len('Öl') # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8') # convert str to bytes
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8')) # 3 bytes encode 2 characters !
3
You can use JSON to convert it to dictionary
import json
data = b'{"key":"value"}'
print(json.loads(data))
{"key":"value"}
FLASK:
This is an example from flask. Run this on terminal line:
import requests
requests.post(url='http://localhost(example)/',json={'key':'value'})
In flask/routes.py
#app.route('/', methods=['POST'])
def api_script_add():
print(request.data) # --> b'{"hi":"Hello"}'
print(json.loads(request.data))
return json.loads(request.data)
{'key':'value'}
b"hello" is not a string (even though it looks like one), but a byte sequence. It is a sequence of 5 numbers, which, if you mapped them to a character table, would look like h e l l o. However the value itself is not a string, Python just has a convenient syntax for defining byte sequences using text characters rather than the numbers itself. This saves you some typing, and also often byte sequences are meant to be interpreted as characters. However, this is not always the case - for example, reading a JPG file will produce a sequence of nonsense letters inside b"..." because JPGs have a non-text structure.
.encode() and .decode() convert between strings and bytes.
bytes(somestring.encode()) is the solution that worked for me in python 3.
def compare_types():
output = b'sometext'
print(output)
print(type(output))
somestring = 'sometext'
encoded_string = somestring.encode()
output = bytes(encoded_string)
print(output)
print(type(output))
compare_types()

Converting escaped characters to utf in Python

Is there an elegant way to convert "test\207\128" into "testπ" in python?
My issue stems from using avahi-browse on Linux, which has a -p flag to output information in an easy to parse format. However the problem is that it outputs non alpha-numeric characters as escaped sequences. So a service published as "name#id" gets output by avahi-browse as "name\035id". This can be dealt with by splitting on the \, dropping a leading zero and using chr(35) to recover the #. This solution breaks on multi-byte utf characters such as "π" which gets output as "\207\128".
The input string you have is an encoding of a UTF-8 string, in a format that Python can't deal with natively. This means you'll need to write a simple decoder, then use Python to translate the UTF-8 string to a string object:
import re
value = r"test\207\128"
# First off turn this into a byte array, since it's not a unicode string
value = value.encode("utf-8")
# Now replace any "\###" with a byte character based off
# the decimal number captured
value = re.sub(b"\\\\([0-9]{3})", lambda m: bytes([int(m.group(1))]), value)
# And now that we have a normal UTF-8 string, decode it back to a string
value = value.decode("utf-8")
print(value)
# Outputs: testπ

Convert ascii string to base64 without the "b" and quotation marks

I wanted to convert an ascii string (well just text to be precise) towards base64.
So I know how to do that, I just use the following code:
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string)
Which gives me
b'c3RyaW5n'
However the problem is, I'd like it to just print
c3RyaW5n
Is it possible to print the string without the "b" and the '' quotation marks?
Thanks!
The b prefix denotes that it is a binary string. A binary string is not a string: it is a sequence of bytes (values in the 0 to 255 range). It is simply typesetted as a string to make it more compact.
In case of base64 however, all characters are valid ASCII characters, you can thus simply decode it like:
print(string.decode('ascii'))
So here we will decode each byte to its ASCII equivalent. Since base64 guarantees that every byte it produces is in the ASCII range 'A' to '/') we will always produce a valid string. Mind however that this is not guaranteed with an arbitrary binary string.
A simple .decode("utf-8") would do
import base64
string = base64.b64encode(bytes("string", 'utf-8'))
print (string.decode("utf-8"))

Python-3.x - Converting a string representation of a bytearray back to a string

The back-story here is a little verbose, but basically I want to take a string like b'\x04\x0e\x1d' and cast it back into a bytearray.
I am working on a basic implementation of a one time pad, where I take a plaintext A and shared key B to generate a ciphertext C accoring to the equation A⊕B=C. Then I reverse the process with the equation C⊕B=A.
I've already found plenty of python3 functions to encode strings as bytes and then xor the bytes, such as the following:
def xor_strings(xs, ys):
return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
A call to xor_strings() then returns a bytearray:
print( xor_strings("foo", "bar"))
But when I print it to the screen, what I'm shown is actually a string. So I'm assuming that python is just calling some str() function on the bytearray, and I get something that looks like the following:
b'\x04\x0e\x1d'
Herein lies the problem. I want to create a new bytearray from that string. Normally I would just call decode() on the bytearray. But if I enter `b'\x04\x0e\x1d' as input, python sees it as a string, not a bytearray!
How can I take a string like b'\x04\x0e\x1d' as user input and cast it back into a bytearray?
As discussed in the comments, use base64 to send binary data in text form.
import base64
def xor_strings(xs, ys):
return "".join(chr(ord(x) ^ ord(y)) for x, y in zip(xs, ys)).encode()
# ciphertext is bytes
ciphertext = xor_strings("foo", "bar")
# >>> b'\x04\x0e\x1d'
# ciphertext_b64 is *still* bytes, but only "safe" ones (in the printable ASCII range)
ciphertext_b64 = base64.encodebytes(ciphertext)
# >>> b'BA4d\n'
Now we can transfer the bytes:
# ...we could interpret them as ASCII and print them somewhere
safe_string = ciphertext_b64.decode('ascii')
# >>> BA4d
# ...or write them to a file (or a network socket)
with open('/tmp/output', 'wb') as f:
f.write(ciphertext_b64)
And the recipient can retrieve the original message by:
# ...reading bytes from a file (or a network socket)
with open('/tmp/output', 'rb') as f:
ciphertext_b64_2 = f.read()
# ...or by reading bytes from a string
ciphertext_b64_2 = safe_string.encode('ascii')
# >>> b'BA4d\n'
# and finally decoding them into the original nessage
ciphertext_2 = base64.decodestring(ciphertext_b64_2)
# >>> b'\x04\x0e\x1d'
Of course when it comes to writing bytes to a file or to the network, encoding them as base64 first is superfluous. You can write/read the ciphertext directly if it's the only file content. Only if the ciphertext it is part of a higher structure (JSON, XML, a config file...) encoding it as base64 becomes necessary again.
A note on the use of the words "decode" and "encode".
To encode a string means to turn it from its abstract meaning ("a list of characters") into a storable representation ("a list of bytes"). The exact result of this operation depends on the byte encoding that is being used. For example:
ASCII encoding maps one character to one byte (as a trade-off it can't map all characters that can exist in a Python string).
UTF-8 encoding maps one character to 1-5 bytes, depending on the character.
To decode a byte array means turning it from "a list of bytes" back into "a list of characters" again. This of course requires prior knowledge of what the byte encoding originally was.
ciphertext_b64 above is a list of bytes and is represented as b'BA4d\n' on the Python console.
Its string equivalent, safe_string, looks very similar 'BA4d\n' when printed to the console due to the fact that base64 is a sub-set of ASCII.
The data types however are still fundamentally different. Don't let the console output deceive you.
Responding to that final question only.
>>> type(b'\x04\x0e\x1d')
<class 'bytes'>
>>> bytearray(b'\x04\x0e\x1d')
bytearray(b'\x04\x0e\x1d')
>>> type(bytearray(b'\x04\x0e\x1d'))
<class 'bytearray'>

How does source encoding apply within string literals?

PEP-263 specifies that encoding specified in the source is applied in the following order:
read the file
decode it into Unicode assuming a fixed per-file encoding
convert it into a UTF-8 byte string
tokenize the UTF-8 content
compile it, creating Unicode objects from the given Unicode data
and creating string objects from the Unicode literal data
by first reencoding the UTF-8 data into 8-bit string data
using the given file encoding
So, if I take this code:
print 'abcdefgh'
print u'abcdefgh'
And convert it to ROT-13:
# coding: rot13
cevag 'nopqrstu'
cevag h'nopqrstu'
I would expect that it is first decoded and then becomes identical to the original, printing:
abcdefgh
abcdefgh
But instead, it prints:
nopqrstu
abcdefgh
So, the unicode literal works as expeced, but str remains unconverted. Why?
Eliminating some possibilities:
I confirmed that the problem is not in a later phase (printing to console), but immediately at parsing, becuase this code produces "ValueError: unsupported format character 'q' (0x71) at index 1":
x = '%q' % 1 # that is %d !
I guess the last point actually explains what happens quite accurately:
compile it, creating Unicode objects from the given Unicode data and
creating string objects from the Unicode literal data by first
reencoding the UTF-8 data into 8-bit string data using the given file
encoding
After the first 4 steps, the contents of the source file are a tokenized unicode version of the following string:
print 'abcdefgh'
print u'abcdefgh'
After that, in step 5, the string object 'abcdefgh' is reencoded into 8-bit string data using the given file encoding (which is rot13), so the contents become:
print 'nopqrstu'
print u'abcdefgh'

Categories

Resources