I need to insert a series of names (like 'Alam\xc3\xa9') into a list, and then I have to save them into a SQLite database.
I know that I can render these names correctly by typing:
print eval(repr(NAME)).decode("utf-8")
But I have to insert them into a list, so I can't use print.
Is there another way to do this without print?
Lots and lots of misconceptions here.
The string you quote is not Unicode. It is a byte string, encoded in UTF-8.
You can convert it to Unicode by decoding it:
unicode_name = name.decode('utf-8')
When you print the value of unicode_name to the console, you will see one of two things:
>>> unicode_name
u'Alam\xe9'
>>> print unicode_name
Alamé
Here, you can see that just typing the name and pressing enter shows a representation of the Unicode code points. This is the same as typing print repr(unicode_name). However, print unicode_name prints the actual characters: behind the scenes, Python encodes the string to the correct encoding for your terminal and prints the result.
But this is all irrelevant, because Unicode strings can only be represented internally. As soon as you want to store it in a database, or a file, or anywhere, you need to encode it. And the most likely encoding to choose is UTF-8 - which is what it was in originally.
>>> name
'Alam\xc3\xa9'
>>> print name
Alamé
As you can see, using the original non-decoded version of the name, repr and print once again show the codes and the characters. So it's not that converting it to Unicode actually makes it any more "really" the correct character.
So, what should you do if you want to store it in a database? Nothing. Nothing at all. SQLite accepts UTF-8 input and stores its data in UTF-8 format on disk, so there is no conversion needed to store the original value of name in the database.
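For concreteness, here is a minimal Python 2 sketch using the sqlite3 standard library module (the table and column names are made up for illustration). One caveat: the sqlite3 driver itself prefers unicode parameters, so the safest route is to decode once on the way in and let the driver re-encode to UTF-8 on disk:
import sqlite3

names = ['Alam\xc3\xa9']  # UTF-8 byte strings, as in the question

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT)')

for name in names:
    # decode to unicode for the driver; SQLite stores the text as UTF-8 anyway
    conn.execute('INSERT INTO people (name) VALUES (?)', (name.decode('utf-8'),))

for (stored,) in conn.execute('SELECT name FROM people'):
    print stored  # a unicode object, encoded for your terminal on print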
Are you looking for something like this?
[n.decode("utf-8") for n in ['Alam\xc3\xa9', 'Alam\xc3\xa9', 'Alam\xc3\xa9']]
Related
I'm currently working on a password storage program in Python, though C would likely be faster. I've been trying for the past hour or so to find a way to store a bytes object in a CSV file. I'm hashing the passwords with their own salt, and then storing that, and grabbing it again to check the password. It works perfectly well when it's stored in memory.
import hashlib
import os

password = 'example'  # stands in for the user-supplied password

salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash

salt_from_store = storage[:64]
hash_from_store = storage[64:]
However, when I try storing it in a CSV file, so the program doesn't have to be constantly running, I get an error:
TypeError: write() argument must be str, not bytes
So, I converted it to a string using
str(storage)
and that wrote just fine. But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars). It's also never consistent. I don't know the encoding, so I can't convert it back that way. When I print the bytes, it's a bunch of characters with backslashes and x's, like
b'\xfd\x3a'
and occasionally some random special characters. I'm not sure if there's a way to convert that to an int, and let it be converted back. Another issue is that I've found a way to do it, by changing
b"\xf1\x96"
to
"b\xf1\x96"
which prints the encoded text rather than the bytes it's made up of. However, I don't know if that's a good way of changing it, and, if it is, whether there's a way to do it without something like
bytes[0] = '"'
bytes[1] = 'b'
If you want to save bytes as a string, you should probably encode them in a format made for this like base64. This is more efficient with space than directly writing hex.
Trying to convert arbitrary bytes to an encoding like utf-8 directly will likely result in UnicodeDecodeError errors.
In your case, you could do something like:
import os, hashlib, base64

password = "top_secret"
salt = os.urandom(64)
hash = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode('utf-8'),
    salt,
    1000000
)
storage = salt + hash
# convert to a base64 string:
s = base64.b64encode(storage).decode('utf-8')
print(s)  # <-- a string you can save to a file
# after reading it back from a file convert back to bytes
the_bytes = base64.b64decode(s)
the_bytes == storage
# True
To write bytes, either write to something that expects to contain bytes, or write text that represents the bytes in some way. CSV is fundamentally a text-based format. If you're going to use a CSV file, then you're going to open it in text mode, and write text to it.
Fundamentally, every file on the hard drive consists of bytes. This implies that, when you open the CSV file, you will be choosing (or using a default) text encoding scheme. So your bytes object will have to be converted twice (to text, and then into the underlying bytes in the file - which you could verify for example with a hex editor) on writing, and twice again on reading. That's just the reality of dealing with mixed data. Thankfully, half that work is taken care of for you automatically (by the open call, or wrappers for that like csv.Reader).
So, I converted it to a string using str(storage)
This is not actually a conversion in the sense you're most likely interested in. This is asking for a printable, human-readable representation of the object. (There is also repr, which asks for a more technically-oriented representation; for str and bytes objects, that's where the enclosing quotation marks come from, among other adjustments. When you print something, its str is used; when you evaluate something at the REPL, you see the repr of the result, except that when the result is None, nothing is shown at all.) Specifically for dealing with bytes and str objects, Python has a concept of encoding and decoding, which uses explicit .encode (str -> bytes) and .decode (bytes -> str) methods. These are topics you can easily look up in the documentation (or in previous Stack Overflow questions, or on the Internet in general).
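A quick illustration of that encode/decode round trip at a Python 3 REPL (the sample text is made up):
>>> 'café'.encode('utf-8')
b'caf\xc3\xa9'
>>> b'caf\xc3\xa9'.decode('utf-8')
'café'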
when I print the bytes, it's a bunch of characters with backslashes and X's
Yes, this is the form that Python uses to tell you what data exists inside the bytes object. What you're saying here is basically the same as "when I print the list, it's a bunch of list elements with commas surrounded by square brackets", or "when I print the integer, it's a bunch of digit symbols".
But then, when I get it from the file, it's still a string, and the length goes from 128 (bytes) to 300+ (chars).
So decode it again. Of course you do need to encode properly. Everything that you get from the file will be a string, because you are opening the file in text mode, because CSV is a text format. (Incidentally, you are using the csv standard library module for this, right?)
It's also never consistent. I don't know the encoding
So tell it which encoding to use; and if you need to use a consistent amount of text, choose an encoding that consistently maps one byte to one Unicode code point (such as latin-1, also named iso-8859-1). But I suspect you don't actually care how long the text is (if anything, you'd care about the amount of bytes used in the file).
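For illustration, a small sketch of that one-byte-per-code-point property:
>>> data = b'\xfd\x3a'
>>> text = data.decode('latin-1')
>>> len(text)
2
>>> text.encode('latin-1') == data
True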
I've found a way to do it, by changing
You can only do this with literal data. Do not think in these terms. The b is part of the language syntax. It is not data.
You could use hex. Let's get some data:
>>> import os
>>> b = os.urandom(10)
>>> b
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'
As a hex string that you can write to CSV:
>>> b.hex()
'c5e27bdfd213a70bef07'
Back to bytes:
>>> bytes.fromhex(b.hex())
b'\xc5\xe2{\xdf\xd2\x13\xa7\x0b\xef\x07'
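Tying it together, here is a hedged sketch of the full round trip through a CSV file (the filename and the username field are made up for illustration):
import csv
import os

salt = os.urandom(64)

# write the hex form as ordinary text
with open('vault.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['alice', salt.hex()])

# read it back and recover the original bytes
with open('vault.csv', newline='') as f:
    user, salt_hex = next(csv.reader(f))

assert bytes.fromhex(salt_hex) == salt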
I have an NVARCHAR-type column in my database, and I am unable to convert the content of this column to a plain string in my code. (I am using pyodbc for the database connection.)
# This unicode string is returned by the database
>>> my_string = u'\u4157\u4347\u6e65\u6574\u2d72\u3430\u3931\u3530\u3731\u3539\u3533\u3631\u3630\u3530\u3330\u322d\u3130\u3036\u3036\u3135\u3432\u3538\u2d37\u3134\u3039\u352d'
# prints something in Chinese
>>> print my_string
䅗䍇湥整㐰㤱㔰㜱㔹㔳㘱㘰㔰㌰㈭〶〶ㄵ㐲㔸ⴷㄴ〹㔭
The closest I have got is by encoding it to utf-16:
>>> my_string.encode('utf-16')
'\xff\xfeWAGCenter-04190517953516060503-20160605124857-4190-5'
>>> print my_string.encode('utf-16')
��WAGCenter-04190517953516060503-20160605124857-4190-5
But the actual value that I need, as per the value stored in the database, is:
WAGCenter-04190517953516060503-20160605124857-4190-51
I tried encoding it to utf-8, utf-16, ascii, and utf-32, but nothing seemed to work.
Does anyone have an idea what I am missing, and how to get the desired result from my_string?
Edit: On converting it to utf-16-le, I am able to remove the unwanted characters from the start, but one character is still missing from the end
>>> print my_string.encode('utf-16-le')
WAGCenter-04190517953516060503-20160605124857-4190-5
On trying some other columns, it works. What might be the cause of this intermittent issue?
You have a major problem in your database definition, in the way you store values in it, or in the way you read values from it. I can only explain what you are seeing; I cannot say why it happens, or how to fix it, without knowing:
the type of the database
the way you input values into it
the way you extract values to obtain your pseudo-unicode string
the actual content if you use direct (native) database access
What you get is an ASCII string in which the 8-bit characters have been grouped in pairs to build 16-bit Unicode characters in little-endian order. As the expected string has an odd number of characters, the last character had no partner byte and was irretrievably lost in translation: the original string ends with u'\u352d', where 0x2d is the ASCII code for '-' and 0x35 for '5', so the trailing '1' never made it into a 16-bit unit. Demo:
def cvt(ustring):
    l = []
    for uc in ustring:
        l.append(chr(ord(uc) & 0xFF))         # low-order byte
        l.append(chr((ord(uc) >> 8) & 0xFF))  # high-order byte
    return ''.join(l)

>>> cvt(my_string)
'WAGCenter-04190517953516060503-20160605124857-4190-5'
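Incidentally, the built-in codec performs the same unpacking, which you can verify (a Python 2 sketch):
>>> my_string.encode('utf-16-le') == cvt(my_string)
True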
The issue was that I was using UTF-16 in my odbcinst.ini file, whereas I had to use the UTF-8 character encoding.
Earlier I was changing it as an OPTION parameter while making the connection with PyODBC, but changing it in the odbcinst.ini file instead fixed the issue.
Hi, I want to know how I can append and then print extended ASCII codes in Python.
I have the following.
code = chr(247)
li = []
li.append(code)
print li
The result Python prints out is ['\xf7'] when it should be a division symbol. If I simply print code directly ("print code") then I get the division symbol, but not if I append it to a list. What am I doing wrong?
Thanks.
When you print a list, it outputs the default representation of all its elements, i.e. by calling repr() on each of them. The repr() of a string is its escaped code, by design. If you want to output all the elements of the list properly, you should first convert the list to a string, e.g. via ', '.join(li).
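A small Python 2 sketch, assuming a terminal whose encoding actually maps byte 0xf7 to the division sign (e.g. latin-1):
>>> li = [chr(247)]
>>> print li
['\xf7']
>>> print ', '.join(li)
÷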
Note that as those in the comments have stated, there isn't really any such thing as "extended ASCII", there are just various different encodings.
You probably want the charmap encoding, which lets you turn unicode into bytes without 'magic' conversions.
s = '\xf7'
b = s.encode('charmap')
with open('/dev/stdout', 'wb') as f:
    f.write(b)
    f.flush()
Will print ÷ on my system.
Note that 'extended ASCII' refers to any of a number of proprietary extensions to ASCII, none of which were ever officially adopted and all of which are incompatible with each other. As a result, the symbol output by that code will vary based on the controlling terminal's choice of how to interpret it.
There's no single defined standard named "extended ASCII"; there are, however, plenty of characters, tens of thousands, as defined in the Unicode standards.
You may be limited to the charset encoding of your text terminal, which you may think of as "extended ASCII" but which might be "latin-1", for example. (If you are on a Unix system such as Linux or Mac OS X, your text terminal will likely use UTF-8 encoding and be able to display any of the tens of thousands of characters available in Unicode.)
So, you should read this piece in order to understand what text is after 1992: http://www.joelonsoftware.com/articles/Unicode.html. If you build any production application believing in "extended ASCII", you are harming yourself, your users, and the whole ecosystem at once.
That said, Python 2's (and Python 3's) print performs an implicit str conversion on the objects passed in. If you print a list, this conversion does not recursively call str on each list element; instead, it uses each element's repr, which displays non-ASCII characters as their numeric representation or other unsuitable notations.
You can simply join your desired characters in a unicode string, for example, and then print them normally, using the terminal encoding:
import sys

mytext = u""
mytext += unichr(247)  # check the codes for Unicode chars here: http://en.wikipedia.org/wiki/List_of_Unicode_characters
# note: Python 2's encode() takes no keyword arguments, so pass the error handler positionally
print mytext.encode(sys.stdout.encoding, "replace")
You are doing nothing wrong.
What you are doing is adding a string of length 1 to a list.
This string contains a character outside the range of printable characters, and outside of ASCII (which is only 7-bit). That's why its representation looks like '\xf7'.
If you print it, it will be rendered as well as the system can manage.
In Python 2, the byte is printed as-is; the resulting output may be the division symbol, or anything else, depending on your system's encoding.
In Python 3, it is a Unicode character and is processed according to how stdout is set up. Normally, this should indeed be the division symbol.
In the representation of a list, the __repr__() of each string is called, which leads to what you see.
I am trying to make a random wiki page generator that asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters, and I would like to display them in Git Bash when I run the code. I am using the cmd module to allow for user input. Right now, the way I display titles is:
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeEncodeError when an unencodable character is encountered
'ignore' - drops the unencodable characters
'replace' - replaces the unencodable characters with a ?
'xmlcharrefreplace' - replaces unencodable characters with an XML character reference
'backslashreplace' - replaces unencodable characters with an escaped Unicode code point value
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
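For illustration, a hedged Python 3 sketch using an en dash as the character the target encoding can't represent:
>>> 'abc\u2013def'.encode('ascii', errors='replace')
b'abc?def'
>>> 'abc\u2013def'.encode('ascii', errors='xmlcharrefreplace')
b'abc&#8211;def'
>>> 'abc\u2013def'.encode('ascii', errors='backslashreplace')
b'abc\\u2013def'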
Check the Python documentation on codec error handlers for more detail.
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely the confusion is that you read some code written for Python 2.x, where bytes and str are the same type (and the type that print actually wants), and tried to use it in Python 3.x.
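A minimal sketch of that fix, reusing the request from the question:
import json

import requests

r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
title = json.loads(r_site.text)["query"]["random"][0]["title"]
print(title)  # pass the str straight through; no .encode('utf-8')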
I have written a basic script that imports several thousand values into a Django database. Here's how it looks: link.
Those locations are in Cyrillic letters and are represented as unicode literals. However, as soon as I save them to the database and read them back, they come out as what seem to be plain byte strings, displayed with hex escapes:
>>> Region.objects.all()[0].parent
'\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
Surprisingly, they appear correctly in the admin panel, but I have trouble when trying to use them. How do I store and retrieve them as unicode?
I'm running Django 1.4.0 on top of MySQL, collation set to utf8_bin.
This is a Django/MySQL "bug". See issue #16052; it's actually documented behaviour.
It looks like the data is being returned as a UTF-8 byte string rather than a Unicode string. Try decoding it:
>>> x='\xd0\xbe\xd0\xb1\xd0\xbb\xd0\xb0\xd1\x81\xd1\x82 \xd0\xa1\xd0\xbb\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd'
>>> x.decode('utf-8')
u'\u043e\u0431\u043b\u0430\u0441\u0442 \u0421\u043b\u0438\u0432\u0435\u043d'
>>> print x.decode('utf-8')
област Сливен