Are null bytes allowed in unicode strings in PostgreSQL via Python? - python

Are null bytes allowed in unicode strings?
I don't ask about utf8, I mean the high level object representation of a unicode string.
Background
We store unicode strings containing null bytes via Python in PostgreSQL.
The strings get cut off at the null byte when we read them back.

On the database side, PostgreSQL itself does not allow the null byte ('\0') in char/text/varchar fields, so if you try to store a string containing one you receive an error. Example:
postgres=# SELECT convert_from('foo\000bar'::bytea, 'unicode');
ERROR: 22021: invalid byte sequence for encoding "UTF8": 0x00
If you really need to store such information, you can use the bytea data type on the PostgreSQL side. Make sure to encode it correctly.

Python itself is perfectly capable of having both byte strings and Unicode strings with null characters having a value of zero. However if you call out to a library implemented in C, that library may use the C convention of stopping at the first null character.
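A minimal sketch illustrating that point: on the Python side, both text strings and byte strings preserve embedded null characters without truncation.

```python
# Python preserves embedded null characters in both str and bytes.
text = "foo\x00bar"        # a str with an embedded U+0000
data = b"foo\x00bar"       # a bytes object with an embedded 0x00 byte

print(len(text))           # 7 - the null does not terminate the string
print(text.split("\x00"))  # ['foo', 'bar']
print(data.find(b"\x00"))  # 3 - the null is ordinary data
```

The truncation you see therefore happens in the database layer, not in Python.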

Since a C string is just data behind a pointer, you can technically store a null byte in it. However, since null conventionally marks the end of the string (the "null terminator"), there is no way to read beyond it without knowing the size ahead of time.
Therefore, it seems you ought to store your data as binary and read it back as a buffer.
Good luck!

Related

Correct way to pass python bytes to Presto query (and retrieve and decode those bytes)

I want to store numpy bytes with Presto. I have the following
import numpy as np
array = np.array([1.0,3.4,5.1])
these_bytes = array.tobytes()
and then I want to store them in presto, using a query like this
query = f"INSERT INTO some_table VALUES ({these_bytes},'2021-03-11')"
where the {these_bytes} entry is a VARBINARY column. Of course, Presto gives an error ('b' not recognized), because these_bytes is actually a bytes object rather than a string, so it renders as b'...'. It seems I should be decoding this object and storing the decoded result. What is the correct way to store Python binary bytes with Presto, and will any transformation be required upon retrieval? Assume my Python Presto client just passes the query through without additional transformations.
The expanded fstring looks like
INSERT INTO imu_test_table_1000 VALUES (b'\x00\x00\x00\x00\x00\x00\xf0?333333\x0b#ffffff\x14#','2021-03-11')
which is not right.
I guess you need these_bytes.decode() (although, if you have unprintable characters that is probably going to be an issue and will fail). You would also need to know what encoding your client wants the characters to be in (utf8, utf16, etc.). If you don't know that, then I would be confused how your client interprets what characters it is sent, or why it wants to receive characters instead of bytes.
In general, most data transfer is done in bytes (I don't know about your particular client). Things like subprocesses, sockets, requests, etc. all use bytes, since that is what computers talk in. A byte string can contain any byte value and is simply data ready for transfer. A string is a collection of characters (each character may occupy several bytes in memory) representing some kind of human-readable text. In particular, not every byte string can be decoded into characters, so bytes.decode() will not work unless the data actually forms valid UTF-8.
If you want to send arbitrary bytes (not representing any particular format or characters), then you CANNOT use a string; you must keep the data in a byte array, like
query = b"INSERT INTO some_table VALUES (" + these_bytes + b",'2021-03-11')"
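An alternative sketch, assuming your Presto installation provides the built-in from_hex function: hex-encode the bytes in Python so the query stays a plain ASCII string, and let Presto convert the hex text back to VARBINARY.

```python
# Sketch: these_bytes stands in for array.tobytes() from the question.
these_bytes = b"\x00\x00\x00\x00\x00\x00\xf0?"  # arbitrary binary data

hex_text = these_bytes.hex()   # pure ASCII, safe to embed in a query string
query = f"INSERT INTO some_table VALUES (from_hex('{hex_text}'), '2021-03-11')"

# Round-tripping locally shows hex encoding is lossless:
assert bytes.fromhex(hex_text) == these_bytes
```

On retrieval the column comes back as binary, which `np.frombuffer` can read directly.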

how to determine the type of string

I am getting the following error when writing into a sql-alchemy varchar string element
....You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str)....
How can I determine the type of a string in python?
An "8-bit bytestring" is a plain byte string rather than a Unicode string; more information on this is found in the Unicode HOWTO.
Before sending your data to SQL, make sure you decode it using data.decode('utf-8'). That should remove the error you're having.
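To answer the "how can I determine the type" part, a quick sketch for Python 3, where the two kinds are str and bytes (on Python 2 the corresponding pair is unicode and str):

```python
def describe(value):
    """Report whether a value is text or raw bytes."""
    if isinstance(value, str):
        return "text (str)"
    if isinstance(value, bytes):
        return "raw bytes - decode('utf-8') before handing it to the DB"
    return type(value).__name__

print(describe("hello"))                    # text (str)
print(describe(b"hello"))                   # raw bytes - decode first
print(describe(b"hello".decode("utf-8")))   # text (str)
```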

Python encoding issue

I'm working with a Python script that receives a raw UTF-8-encoded string. First I decode it from UTF-8, then some processing is done, and at the end I encode it back to UTF-8 and insert it into the DB (MySQL), but the characters in the DB are not stored in their real form.
str = '<term>Beiträge</term>'
str = str.decode('utf8')
...
...
...
str = str.encode('utf8')
After that, the string appears in its real form in a text file, but in the MySQL DB I find it like this:
<term>"Beiträge</term>
any idea why this happened? :-(
Assuming you are using the MySQLdb library, you need to create connections using the keyword arguments:
use_unicode
If True, text-like columns are returned as unicode objects using the connection's character set. Otherwise, text-like columns are returned as normal strings. Unicode objects will always be encoded to the connection's character set regardless of this setting.
and
charset
If supplied, the connection character set will be changed to this character set (MySQL-4.1 and newer). This implies use_unicode=True.
You should also check the encoding of your db tables.
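A sketch of such a connection; the host, user, password, and database names below are placeholders, and the commented-out line assumes the MySQLdb driver is installed and a server is running:

```python
# Placeholder credentials - adjust for your own server.
conn_kwargs = {
    "host": "localhost",
    "user": "app",
    "passwd": "secret",
    "db": "mydb",
    "charset": "utf8",     # set the connection character set...
    "use_unicode": True,   # ...and return text columns as unicode objects
}
# conn = MySQLdb.connect(**conn_kwargs)  # requires a running MySQL server
```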
To make a string literal a Unicode string you should use the string prefix 'u'. See also http://docs.python.org/reference/lexical_analysis.html#literals
Maybe your example works just by adding the prefix in the initial assignment.

Python String Comparison--Problems With Special/Unicode Characters

I'm writing a Python script to process some music data. It's supposed to merge two separate databases by comparing their entries and matching them up. It's almost working, but fails when comparing strings containing special characters (i.e. accented letters). I'm pretty sure it's a ASCII vs. Unicode encoding issue, as I get the error:
"Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal"
I realize I could use regular expressions to remove the offending characters, but I'm processing a lot of data and relying too much on regexes makes my program grindingly slow. Is there a way to have Python properly compare these strings? What is going on here--is there a way to tell whether it's storing my strings as ASCII or Unicode?
EDIT 1: I'm using Python v2.6.6. After checking the types, I've discovered that one database gives me Unicode strings and the other gives ASCII (byte) strings. That's probably the problem. I'm trying to convert the ASCII strings from the second database to Unicode with a line like
line = unicode(f.readline().decode('latin_1').encode('utf_8'))
but this gives an error like:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)
I'm not sure why the 'ascii' codec is complaining since I'm trying to decode from ASCII. Can anyone help?
Unicode vs Bytes
First, some terminology. There are two types of strings, encoded and decoded:
Encoded. This is what's stored on disk. To Python, it's a bunch of 0's and 1's that you might treat like ASCII, but it could be anything -- binary data, a JPEG image, whatever. In Python 2.x, this is called a "string" variable. In Python 3.x, it's more accurately called a "bytes" variable.
Decoded. This is a string of actual characters. They could be encoded to 8-bit ASCII strings, or it could be encoded to 32-bit Chinese characters. But until it's time to convert to an encoded variable, it's just a Unicode string of characters.
What this means to you
So here's the thing. You said you were getting one ASCII variable and one Unicode variable. That's actually not true.
You have one variable that's a string of bytes -- ones and zeros, presumably in sets of 8. This is the variable you assumed, incorrectly, to be ASCII.
You have another variable that's Unicode data -- numbers, letters, and symbols.
Before you compare the string of bytes to a Unicode string of characters, you have to make some assumptions. In your case, Python (and you) assumed that the string of bytes was ASCII encoded. That worked fine until you came across a character that wasn't ASCII -- a character with an accent mark.
So you need to find out what that string of bytes is encoded as. It might be latin1. If it is, you want to do this:
if unicode_variable == string_variable.decode('latin1')
Latin1 is basically ASCII plus some extended characters like Ç and Â.
If your data is in Latin1, that's all you need to do. But if your string of bytes is encoded in something else, you'll need to figure out what encoding that is and pass it to decode().
The bottom line is, there's no easy answer, unless you know (or make some assumptions) about the encoding of your input data.
What I would do
Try running var.decode('latin1') on your string of bytes. That will give you a Unicode variable. If that works, and the data looks correct (ie, characters with accent marks look like they belong), roll with it.
Oh, and if latin1 doesn't parse or doesn't look right, try utf8 -- another common encoding.
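Those two attempts can be combined into one helper (a sketch; note that Latin-1 maps every possible byte value, so it never raises and must therefore be tried last):

```python
def to_text(raw):
    """Decode a byte string, preferring UTF-8 and falling back to Latin-1."""
    for encoding in ("utf-8", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next encoding

print(to_text(b"Beitr\xe4ge"))              # Latin-1 input -> 'Beiträge'
print(to_text("Beiträge".encode("utf-8")))  # UTF-8 input   -> 'Beiträge'
```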
You might need to preprocess the databases and convert everything into UTF-8. My guess is that you've got Latin-1 accented characters in some entries.
As to your question, the only way to know for sure is to look. Have your script spit out those that don't compare, and look up the character codes. Or just try string.decode('latin1').encode('utf8') and see what happens.
Converting both to unicode should help:
if unicode(str1) == unicode(str2):
print "same"
To find out whether YOU (not it) are storing your strings as str objects or unicode objects, print type(your_string).
You can use print repr(your_string) to show yourself (and us) unambiguously what is in your string.
By the way, exactly what version of Python are you using, on what OS? If Python 3.x, use ascii() instead of repr().

How do I convert a string to a buffer in Python 3.1?

I am attempting to pipe something to a subprocess using the following line:
p.communicate("insert into egg values ('egg');");
TypeError: must be bytes or buffer, not str
How can I convert the string to a buffer?
The correct answer is:
p.communicate(b"insert into egg values ('egg');");
Note the leading b, telling you that it's a string of bytes, not a string of unicode characters. Also, if you are reading this from a file:
value = open('thefile', 'rt').read()
p.communicate(value);
Then change that to:
value = open('thefile', 'rb').read()
p.communicate(value);
Again, note the 'b'.
Now if your value is a string you get from an API that only returns strings no matter what, then you need to encode it.
p.communicate(value.encode('latin-1'));
Latin-1, because unlike ASCII it supports all 256 bytes. But that said, having binary data in unicode is asking for trouble. It's better if you can make it binary from the start.
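A self-contained sketch of the whole round trip, using a Python child process that echoes its stdin as a stand-in for the real subprocess, shows that communicate() wants bytes on Python 3:

```python
import subprocess
import sys

# Child that echoes its stdin back - a stand-in for the real subprocess.
child = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read())"]

p = subprocess.Popen(child, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(b"insert into egg values ('egg');")  # bytes, not str
print(out)  # b"insert into egg values ('egg');"
```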
You can convert it to bytes with encode method:
>>> "insert into egg values ('egg');".encode('ascii') # ascii is just an example
b"insert into egg values ('egg');"
