Recover UTF-8 encoding in string

I'm working on a Python script to extract multiple strings from a .csv file, but I can't recover the Spanish characters (like á, é, í) after I open the file and read the lines.
This is my code so far:
import csv
list_text=[]
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        print row[0]
        list_text.extend(row[0])
print list_text
And I get something like this:
'Vivió el sueño, ESPAÑOL...'
['Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...']
I don't know why it prints correctly, but once I append it to the list it is no longer correct.
Edited:
The problem is that I need to recover the characters. After I read the file, the list has thousands of words in it, and I don't need to print it; I need to use a regex to get rid of the punctuation. But the regex also deletes the backslashes, so the words end up incomplete.

The Python 2.x csv module doesn't support unicode, and you did the right thing by opening the file in binary mode and parsing the UTF-8 encoded strings instead of decoded unicode strings. Python 2 is kinda strange in that the str type (as opposed to the unicode type) holds either text or binary data. You got 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...', which is the UTF-8 byte encoding of the unicode text.
We can decode it to get the unicode version...
>>> encoded_text = 'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
>>> text = encoded_text.decode('utf-8')
>>> print repr(text)
u'Vivi\xf3 el sue\xf1o, ESPA\xd1OL...'
>>> print text
Vivió el sueño, ESPAÑOL...
...but wait a second, the encoded text prints the same
>>> print encoded_text
Vivió el sueño, ESPAÑOL...
what's up with that? That has everything to do with your display surface, which is a UTF-8 encoded terminal. In the first case (print text), text is a unicode string, so Python has to encode it before sending it to the terminal, which then sees the UTF-8 encoded version. In the second case it's just a regular string and Python sent it without conversion... but it just so happens that it was holding UTF-8 encoded text, which the terminal decoded.
Finally, when a string is in a list, Python prints its repr representation, not its str value, as in
>>> print repr(encoded_text)
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'
To make things right, convert the cells in your rows to unicode after the csv module is done with them.
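import csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        # decode each UTF-8 byte string into a unicode object
        row = [cell.decode('utf-8') for cell in row]
        print row[0]
        # append keeps the cell whole; extend would add it character by character
        list_text.append(row[0])
print list_text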
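For readers on Python 3 (an assumption, since the question uses Python 2), the decoding step moves into open(), and the punctuation-stripping regex the asker mentions then works on real text rather than on escaped bytes. The following is only a sketch: the function names, the sample line, and the temporary file are made up for illustration.

```python
import csv
import os
import re
import tempfile

def read_first_column(path):
    """Return the first cell of every row in a tab-delimited UTF-8 file."""
    # In Python 3 the csv module works on text, so decoding happens in open().
    with open(path, newline='', encoding='utf-8') as data:
        reader = csv.reader(data, delimiter='\t')
        return [row[0] for row in reader if row]

def strip_punctuation(text):
    """Remove everything that is neither a word character nor whitespace."""
    # For str patterns \w is Unicode-aware, so accented letters are kept.
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample data, written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', encoding='utf-8',
                                 newline='', delete=False) as tmp:
    tmp.write('Vivió el sueño, ESPAÑOL...\tsecond column\n')
    sample_path = tmp.name

list_text = read_first_column(sample_path)
cleaned = [strip_punctuation(cell) for cell in list_text]
os.remove(sample_path)
```

Here list_text holds the decoded cell unchanged, and cleaned holds the same text with the comma and the dots removed while ó, ñ and Ñ survive.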

When you print a list, Python shows the escaped form (the repr) of each string, so that \n and other special characters don't throw off the list display. If you print the string itself, it displays properly:
'Vivi\xc3\xb3 el sue\xc3\xb1o, ESPA\xc3\x91OL...'.decode('utf-8')

Use unicodecsv instead of csv; the Python 2 csv module doesn't support unicode well.
Open the file in binary mode and pass the encoding ('utf-8') to the reader, which then decodes each cell for you.
See the code below:
import unicodecsv as csv
list_text = []
with open(file, 'rb') as data:
    reader = csv.reader(data, delimiter='\t', encoding='utf-8')
    for row in reader:
        print row[0]
        list_text.append(row[0])
print list_text

Related

Python adding extra text and braces while reading from CSV file

I want to read data from a CSV file using Python, but with the following code the output contains some extra characters and brackets that are not in the original data.
Please help me remove them.
import csv
with open("data.csv",encoding="utf8") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print(row)
What is displayed after reading is:
['\ufeffwww.aslteramo.it']

This is UTF-8 encoding with a Byte Order Mark (BOM), which is used as a signature on Windows.
Open the file using the utf-8-sig encoding instead of utf8.

\ufeff is the byte order mark (BOM) code point, U+FEFF (its Unicode name is 'ZERO WIDTH NO-BREAK SPACE').
It's sometimes placed at the start of a file to indicate that the file is in UTF-8 format.
You could use str.replace('\ufeff', '') in your code to get rid of it. Like this:
import csv
with open("data.csv",encoding="utf8") as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        print([col.replace('\ufeff', '') for col in row])
Another solution is to open the file with 'utf-8-sig' encoding instead of 'utf-8' encoding.
By the way, the brackets are added because row is a list. If your CSV file only has one column, you could select the first item from each row like this:
print(row[0].replace('\ufeff', ''))
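As a rough illustration of the 'utf-8-sig' suggestion, the sketch below writes a small file that starts with the three BOM bytes and reads it back with both encodings. The temporary file and the read_rows helper are assumptions made for the demonstration, not part of the original question.

```python
import csv
import os
import tempfile

# Hypothetical sample: a one-column CSV saved as UTF-8 with a BOM.
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'\xef\xbb\xbfwww.aslteramo.it\r\n')
    csv_path = tmp.name

def read_rows(path, encoding):
    """Read every row of a CSV file using the given text encoding."""
    with open(path, newline='', encoding=encoding) as csv_file:
        return list(csv.reader(csv_file))

rows_plain = read_rows(csv_path, 'utf-8')      # BOM stays in the first cell
rows_sig = read_rows(csv_path, 'utf-8-sig')    # BOM is stripped on decode
os.remove(csv_path)

print(ascii(rows_plain))  # [['\ufeffwww.aslteramo.it']]
print(ascii(rows_sig))    # [['www.aslteramo.it']]
```

Using 'utf-8-sig' is usually the tidier choice because it also reads files without a BOM correctly, so no per-cell cleanup is needed.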

Python 3 : keep string of hex commands intact while printing and formatting

I have a string of hex commands for a mifare reader, an example of one of the commands being
'\xE0\x03\x06\x01\x00'
This will give a response of 16 bytes, an example being:
'\x0F\x02\x02\x07\x05\x06\x07\x01\x09\x0A\x09\x0F\x03\x0D\x0E\xFF'
I need to store these values in a text document, but whenever I try to turn the hex commands into a string, the string changes and doesn't keep its original format. It turns into
'\x0f\x02\x02\x07\x05\x06\x07\x01\t\n\t\x0f\x03\r\x0eÿ'
I have tried to change the formatting of this, by using
d = d.encode()
print("".join("\\x{:02x}".format(c) for c in d))
# Result
'\x0f\x02\x02\x07\x05\x06\x07\x01\t\n\t\x0f\x03\r\x0eÿ'
I also tried changing the encoding of the string, but that doesn't give back the original string after decoding either. What I would like to get as a result is
'\x0F\x02\x02\x07\x05\x06\x07\x01\x09\x0A\x09\x0F\x03\x0D\x0E\xFF'
This is so the mifare reader can use this string as data to write to a new tag if necessary. Any help would be appreciated

I think the problem you are experiencing is that python is trying to interpret your data as UTF-8 text, but it is raw data, so it is not printable. What I would do is hex encode the data to print it in the file, and hex decode it back when reading.
Something like:
import binascii
s = b'\xE0\x03\x06\x01\x00'  # the b prefix tells Python this is binary data
# write to the file encoded as hex text
with open('file.txt', 'w') as f:
    # hexlify() returns a bytes object containing ASCII characters;
    # convert it to a regular string to avoid an exception in write()
    f.write(str(binascii.hexlify(s), 'ascii'))
# read back as hex text
with open('file.txt', 'r') as f:
    ss = f.read()
# hex decode
x = binascii.unhexlify(ss)
And then
# test
>>> x == s
True
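The same round trip can also be sketched with the built-in bytes.hex() and bytes.fromhex() methods, and a small formatter can produce the uppercase \xNN text the asker wanted. The helper names to_escaped and from_escaped are hypothetical, and this assumes the reader data is held as a bytes object rather than a str.

```python
response = b'\x0F\x02\x02\x07\x05\x06\x07\x01\x09\x0A\x09\x0F\x03\x0D\x0E\xFF'

def to_escaped(data):
    """Format bytes as literal \\xNN text with uppercase hex digits."""
    return ''.join('\\x{:02X}'.format(byte) for byte in data)

def from_escaped(text):
    """Turn literal \\xNN text back into the bytes it describes."""
    return bytes.fromhex(text.replace('\\x', ''))

hex_text = response.hex()             # compact form without the \x prefixes
escaped_text = to_escaped(response)   # text that is safe to store in a file

print(escaped_text)
# \x0F\x02\x02\x07\x05\x06\x07\x01\x09\x0A\x09\x0F\x03\x0D\x0E\xFF
```

Either text form can be written to a file as ordinary text; from_escaped(escaped_text) or bytes.fromhex(hex_text) gives back the exact bytes to send to the reader.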

Python encoding unicode<>utf-8

So I am getting lost somewhere in converting unicode to UTF-8. I am trying to define some JSON containing unicode characters and write it to a file. When printing to the terminal, the character is represented as '\u2606'. When I look at the file, the character is encoded as '\\u2606'; note the double backslash. Could someone point me in the right direction regarding these encoding issues?
# encoding=utf8
import json
data = {"summary" : u"This is a unicode character: ☆"}
print data
decoded_data = unicode(data)
print decoded_data
with open('decoded_data.json', 'w') as outfile:
    json.dump(decoded_data, outfile)
I tried adding the following snippet to the head of my file, but this had no success either.
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
sys.stderr = codecs.getwriter('utf8')(sys.stderr)

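First, you are printing the representation of a dictionary, and Python 2 uses only ASCII characters there, escaping any other character as \uxxxx.
json.dump does the same by default. The double backslash in your file appears because unicode(data) turns the dictionary into its repr string, which already contains the literal text \u2606, and json.dump then escapes that backslash. Pass the dictionary itself, and force json to keep the unicode characters with:
json_data = json.dumps(data, ensure_ascii=False)
with open('decoded_data.json', 'w') as outfile:
    outfile.write(json_data.encode('utf8'))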
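On Python 3 (an assumption; the thread itself is Python 2) the same idea might look like the sketch below, where the file is opened with an explicit encoding and the dictionary is passed to json.dump directly. The temporary directory is only there to keep the example self-contained.

```python
import json
import os
import tempfile

data = {"summary": "This is a unicode character: \u2606"}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output location for the demonstration.
    out_path = os.path.join(tmp_dir, 'decoded_data.json')

    # Pass the dict itself (not str(data)) and keep non-ASCII characters as-is.
    with open(out_path, 'w', encoding='utf-8') as outfile:
        json.dump(data, outfile, ensure_ascii=False)

    # Read the raw bytes back to see what actually landed in the file.
    with open(out_path, 'rb') as infile:
        raw = infile.read()

# For comparison: the default ensure_ascii=True writes a \u2606 escape instead.
escaped = json.dumps(data)
print(escaped)  # {"summary": "This is a unicode character: \u2606"}
```

Both forms are valid JSON and load back to the same dictionary; ensure_ascii=False only changes how the file looks on disk.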

I think you can also refer to this link; it is really useful:
Set Default Encoding

How do I convert unicode to unicode-escaped text [duplicate]

I'm loading a file with a bunch of unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped-unicode form (\u91cb) in Python. I've found a couple of similar questions here on StackOverflow including this one Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't work out how to save the data.
For example:
Input file:
\xe9\x87\x8b
Python Script
file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()
Output File (That I would like):
\u91cb

You need to encode it again with unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
Code modified (binary mode is used to avoid unnecessary encodes/decodes):
with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing whitespace
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq

\xe9\x87\x8b is not a Unicode character. It looks like the representation of a bytestring that holds the Unicode character 釋 encoded with the UTF-8 character encoding. \u91cb is a representation of the 釋 character in Python source code (or in JSON format). Don't confuse the text representation with the character itself:
>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
To read text encoded as utf-8 from a file, specify the character encoding explicitly:
with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()
It is exactly the same for saving Unicode text to a file:
with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)
If you omit the explicit encoding parameter, then locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.
If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you need to use 'unicode-escape', something is broken.

It looks as if your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, as per your reference):
with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()
text will contain the content of the file as a str (i.e. a unicode string). Now you can write it in unicode-escaped form directly to a file by specifying encoding='unicode-escape':
with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)
Your file will now contain unicode-escaped literals:
$ cat output.txt
\u91cb
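To make the difference between the two input cases in this thread concrete, here is a small Python 3 sketch without any file I/O. It first escapes a real character, then handles the case where the input literally contains backslash-x text. The variable names are illustrative only.

```python
# Case 1: the input is real UTF-8 bytes for one character, U+91CB.
text = b'\xe9\x87\x8b'.decode('utf-8')

escaped = text.encode('unicode-escape')       # ASCII-only bytes
restored = escaped.decode('unicode-escape')   # back to the character

# Case 2: the input literally holds backslash-x escapes of the UTF-8 bytes.
# An extra latin-1 round trip recovers the original bytes first.
literal = rb'\xe9\x87\x8b'                    # 12 ASCII bytes
recovered = literal.decode('unicode-escape').encode('latin-1').decode('utf-8')

print(escaped)  # b'\\u91cb'
```

The printed value is the repr of the bytes object, so the doubled backslash stands for a single backslash in the data, which is what ends up in the output file.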

Removing non-ascii characters in a csv file

I am currently inserting data into my Django models using a CSV file. Below is the simple save function that I am using:
def save(self):
    myfile = file.csv
    data = csv.reader(myfile, delimiter=',', quotechar='"')
    i=0
    for row in data:
        if i == 0:
            i = i + 1
            continue #skipping the header row
        b=MyModel()
        b.create_from_csv_row(row) # calls a method to save in models
The function works perfectly with ASCII characters. However, if the CSV file has some non-ASCII characters, an error is raised: UnicodeDecodeError
'ascii' codec can't decode byte 0x93 in position 1526: ordinal not in range(128)
My question is: how can I remove non-ASCII characters before saving my CSV file to avoid this error?
Thanks in advance.

If you really want to strip it, try:
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii','ignore')
* WARNING THIS WILL MODIFY YOUR DATA *
It attempts to find a close match - i.e. ć -> c
Perhaps a better answer is to use unicodecsv instead.
----- EDIT -----
Okay, if you don't care how the data ends up being represented, try the following:
# If row references a unicode string
b.create_from_csv_row(row.encode('ascii', 'ignore'))
If row is a collection, not a unicode string, you will need to iterate over the collection to the string level to re-serialize it.

If you want to remove non-ASCII characters from your data, then iterate through your data and keep only the ASCII ones.
kept = []
for item in data:
    if ord(item) < 128:  # code points 0-127 are ASCII
        kept.append(item)  # or write, print, whatever
If you want to convert unicode characters to ASCII, then the response above by DivinusVox is accurate.
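The two approaches above (dropping non-ASCII characters versus folding them to a close ASCII match) can be compared side by side in a short Python 3 sketch. The function names and the sample string are made up for illustration, and either approach loses information.

```python
import unicodedata

def strip_non_ascii(text):
    """Drop every character outside the ASCII range (code points 0-127)."""
    return ''.join(ch for ch in text if ord(ch) < 128)

def fold_to_ascii(text):
    """Decompose accented letters (NFKD) and keep only their ASCII base."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Sample with an accented e, curly quotes, and an accented c.
sample = 'caf\u00e9 \u201cquoted\u201d \u0107'

print(repr(strip_non_ascii(sample)))  # 'caf quoted '
print(repr(fold_to_ascii(sample)))    # 'cafe quoted c'
```

Folding keeps the base letters of accented characters, while characters with no ASCII decomposition, such as the curly quotes, are dropped in both cases.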

The pandas CSV parser (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html) supports different encodings:
import pandas
data = pandas.read_csv(myfile, encoding='utf-8', quotechar='"', delimiter=',')
