How to read Unicode file as Unicode string in Python [closed]

Closed 6 years ago.
I have a file that is encoded in Unicode or UTF-8 (I don't know which). When I read the file in Python 3.4, the resulting string is interpreted as an ASCII string. How do I convert it to a Unicode string like u"text"?

The term "Unicode" refers to the standard, not to a particular encoding.
Since files in computers are binary, there exist different ways of encoding Unicode data in binary files. One of them is "UTF-8".
You can consult https://docs.python.org/3/howto/unicode.html
An example taken from this document (in the section "Reading and Writing Unicode Data")
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
In Python 3, unlike Python 2, every string literal is already a Unicode string, so the "u" prefix is not needed (it is still accepted, since Python 3.3, for compatibility).
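As a minimal sketch of the difference between text mode and binary mode in Python 3, the following writes a temporary UTF-8 file and reads it back both ways. The sample text and the temporary file are made up for illustration; they are not from the question.

```python
import os
import tempfile

# Hypothetical sample text; any non-ASCII content would do.
sample = "Universit\u00e9\n"

# Write the text to a temporary file as UTF-8 bytes.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Text mode with an explicit encoding yields str, which is Unicode in Python 3.
with open(path, encoding="utf-8") as f:
    text = f.read()

# Binary mode yields the raw encoded bytes instead.
with open(path, "rb") as f:
    raw = f.read()

os.remove(path)

print(type(text).__name__, repr(text))
print(type(raw).__name__, raw)
```

If the encoding really is unknown, passing the wrong one to open() typically either raises UnicodeDecodeError or silently produces wrong characters, so it is worth finding out how the file was written.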

Related

Encoding issue during reading excel file in Python [closed]

Closed 1 year ago.
I use read_excel from the pandas library to read Excel content and convert it to JSON. I am struggling with an encoding issue: non-English characters are encoded like "\u652f\u63f4\u8cc7\u8a0a".
How can I resolve this issue?
I tried
wb = xlrd.open_workbook(excel_filePath, encoding_override='ISO-8859-1')
new_data = pd.read_excel(wb)
Also
with open(excel_filePath, mode="r", encoding="utf-8") as file:
    new_data = pd.read_excel(excel_filePath)
I tried this code with encodings like: utf-8, utf-16, latin1...

From the docs of the json module:
The RFC requires that JSON be represented using either UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability.
As permitted, though not required, by the RFC, this module’s serializer sets ensure_ascii=True by default, thus escaping the output so that the resulting strings only contain ASCII characters.
It may be surprising that the module still defaults to escaping non-ASCII characters (probably for backwards compatibility), so override that behavior with ensure_ascii=False. Opening the output file with an explicit UTF-8 encoding avoids depending on the platform default:
with open(json_filePath, 'w', encoding='utf-8') as f:
    json.dump(new_json, f, ensure_ascii=False)
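To see that the escaped form is only a serialization detail and not data corruption, here is a small sketch comparing both settings. The dictionary is made-up sample data that reuses the characters from the question.

```python
import json

# Made-up sample data reusing the characters from the question.
data = {"text": "\u652f\u63f4\u8cc7\u8a0a"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keep non-ASCII characters as-is

print(escaped)
print(readable)

# Both forms decode back to the same Python object.
roundtrip_ok = json.loads(escaped) == json.loads(readable) == data
print(roundtrip_ok)
```

Any conforming JSON parser should read the escaped form back as the original characters, so ensure_ascii=False mainly matters when people read the file directly.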

How do I decode a string with utf-8? [closed]

Closed 2 years ago.
I have a string that is already encoded with utf-8 (ex. "No\xf0\x9f\x92\x80"). I would like to decode it so it becomes No💀. However, when I use .decode('utf-8') it says that str has no decode method.
The string is from a txt file that I am reading with pandas.

If the length is 6, that doesn't quite make sense if you read the file with encoding='utf8'. It should have decoded the UTF-8 bytes correctly, but this would fix it if it is really what you have:
>>> s='No\xf0\x9f\x92\x80'
>>> len(s)
6
>>> s.encode('latin1').decode('utf8')
'No💀'
Instead, if you have literal backslashes and numbers in the string, this would work:
>>> s=r'No\xf0\x9f\x92\x80'
>>> s
'No\\xf0\\x9f\\x92\\x80'
>>> len(s)
18
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'No💀'
The unicode-escape codec translates escape codes to Unicode code points, but decoding only works on bytes objects, so the str has to be encoded first. .encode('latin1') translates Unicode code points 1:1 to their byte equivalents (this only works for U+0000 to U+00FF, of course).
The code above converts the str to bytes, decodes the escapes, converts to bytes again, and finally decodes the result correctly as UTF-8.
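The two cases above can be wrapped in small helpers. This is only a sketch: the function names are made up, and both helpers assume the input really contains UTF-8 byte values stored as code points below U+0100.

```python
def fix_mojibake(s: str) -> str:
    """Reinterpret a str whose code points are really UTF-8 byte values."""
    return s.encode("latin1").decode("utf8")


def unescape_then_fix(s: str) -> str:
    """Handle literal backslash escapes (the four characters \\xf0, etc.)."""
    return fix_mojibake(s.encode("latin1").decode("unicode-escape"))


# Case 1: six characters, four of which are byte values in disguise.
fixed = fix_mojibake("No\xf0\x9f\x92\x80")

# Case 2: eighteen characters containing literal backslashes.
unescaped = unescape_then_fix(r"No\xf0\x9f\x92\x80")

print(fixed, unescaped)
```

Both helpers raise UnicodeEncodeError if the string contains characters above U+00FF, which is a useful signal that the text was not damaged in this particular way.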

Use u'string' on string stored as variable in Python [closed]

Closed 4 years ago.
As a French user of Python 2.7, I'm trying to properly print strings containing accents such as "é", "è", "à", etc. in the Python console.
I already know the trick of putting u before a string literal, such as:
print(u'Université')
which properly prints the last character.
Now, my question is: how can I do the same for a string that is stored as a variable?
Indeed, I know that I could do the following:
mystring = u'Université'
print(mystring)
but the problem is that the value of mystring is bound to be passed into a SQL query (using psycopg2), and therefore I can't afford to store the u inside the value of mystring.
so how could I do something like
"print the unicode value of mystring" ?

The u sigil is not part of the value, it's just a type indicator. To convert a string into a Unicode string, you need to know the encoding.
unicodestring = mystring.decode('utf-8') # or 'latin-1' or ... whatever
and to print it you typically (in Python 2) need to convert back to whatever the system accepts on the output filehandle:
print(unicodestring.encode('utf-8')) # or 'latin-1' or ... whatever
Python 3 clarifies (though not directly simplifies) the situation by keeping Unicode strings and (what is now called) bytes objects separate.
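For comparison, here is a short sketch of the same round trip in Python 3, where the two types cannot be mixed by accident. The sample word is taken from the question; the variable names are illustrative.

```python
# Python 3: bytes and str are distinct types.
raw = "Universit\u00e9".encode("utf-8")  # bytes, e.g. as read from a file or socket
text = raw.decode("utf-8")               # str (Unicode)

print(type(raw).__name__, len(raw))    # length counts bytes
print(type(text).__name__, len(text))  # length counts code points
print(text)
```

The accented character occupies two bytes in UTF-8, which is why the two lengths differ.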

Python 3 multiple replacements [closed]

Closed 6 years ago.
I don't see why this has been placed on hold as "off topic." I am asking for programming help, not a reference (which is what the explanation of why it was closed said). Here's my original question:
I have a python 3 script to send emails in HTML to a list of about 600 people. Sendmail apparently can't send non-ascii characters above 127 (decimal) unless I jump through hoops with MIME. So I'm considering doing a bulk replace of all accented characters with their HTML &#...; equivalents.
I'd rather not use regex, since I'm not proficient with them. Is there a way to do this without using a loop, or at least not a complicated one?

Googled "python encode html entities", first result: https://wiki.python.org/moin/EscapingHtml:
Builtin HTML/XML escaping via ASCII encoding
A very easy way to transform non-ASCII characters like German umlauts or letters with accents into their HTML equivalents is simply encoding them from unicode to ASCII and use the xmlcharrefreplace encoding error handling:
>>> a = u"äöüßáà"
>>> a.encode('ascii', 'xmlcharrefreplace')
'&#228;&#246;&#252;&#223;&#225;&#224;'
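The quoted example is written in Python 2 style. In Python 3, which the question uses, encode() returns bytes, so one more decode step is needed to get a str back. A minimal sketch:

```python
# Same characters as in the wiki example, written as escapes for portability.
text = "\u00e4\u00f6\u00fc\u00df\u00e1\u00e0"

# In Python 3 the result of encode() is a bytes object.
encoded = text.encode("ascii", "xmlcharrefreplace")

# The bytes are pure ASCII, so decoding them back to str is always safe.
html_safe = encoded.decode("ascii")

print(html_safe)
```

ASCII characters, including HTML markup such as < and >, pass through unchanged, so this can be applied to a whole HTML body.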

You can use str.translate() and the entities from the html package:
import html.entities
text = "a text with ä and ö"
ent = {k: '&{};'.format(v) for k, v in html.entities.codepoint2name.items()}
print(text.translate(ent))
Output:
a text with &auml; and &ouml;
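One caveat when the input is already HTML: codepoint2name also contains entries for <, >, & and ", so the table above would escape the markup itself. The sketch below leaves those four characters out. The sample HTML string and the names MARKUP and table are made up for illustration.

```python
from html.entities import codepoint2name

# Code points that are part of HTML syntax and should be left alone.
MARKUP = {ord(c) for c in '<>&"'}

# Map every other named code point to its "&name;" entity.
table = {
    cp: "&{};".format(name)
    for cp, name in codepoint2name.items()
    if cp not in MARKUP
}

# Hypothetical HTML fragment with two accented letters.
html_text = "<p>caf\u00e9 &amp; cr\u00e8me</p>"
converted = html_text.translate(table)

print(converted)
```

Characters that have no named entity in codepoint2name are left unchanged by translate(), so the xmlcharrefreplace approach is the more complete option if every non-ASCII character must be removed.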

Converting string to number in python [closed]

Closed 8 years ago.
I have a file in .ktx format, which I have opened in 'rb' mode. I want to modify particular bytes in that file. I read the bytes with read(4) (each number I want is 4 bytes long) and want to convert each chunk into a number. I then want to increase that number by a specific amount and write it back into the file stream. Is there a function in Python that converts a byte string to an integer? I tried int(), but it prints some binary data.
my code:
bytes=file.read(4)
for char in bytes:
    print hex(ord(char))

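import struct

data = file.read(4)
# "<i" is a little-endian signed 32-bit integer; native "l" may be 8 bytes on some platforms
(bytesAsInt,) = struct.unpack("<i", data)  # unpack always returns a tuple
do_something_with_int(bytesAsInt)
I think this might be what you are looking for, although it is hard to tell from the question.
The documentation for the struct module is at https://docs.python.org/3/library/struct.html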

Try the related question "How can I convert a character to an integer in Python, and vice versa?"
Here is a suggested workflow for what you seem to be wanting to do
Read the data
Convert the data to integer
Add X to the integer, where X is the value you want to increase by
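The steps above, plus writing the result back, can be sketched as follows. The helper name, the offset, and the little-endian unsigned format "<I" are assumptions for illustration; the correct byte order and field positions depend on the actual file. An in-memory buffer stands in for the real file here.

```python
import io
import struct


def add_to_uint32(stream, offset, delta, fmt="<I"):
    """Read a 4-byte integer at `offset`, add `delta`, and write it back in place."""
    stream.seek(offset)
    (value,) = struct.unpack(fmt, stream.read(4))
    stream.seek(offset)
    stream.write(struct.pack(fmt, value + delta))
    return value + delta


# In-memory stand-in for a file; the two integers are made-up sample data.
buf = io.BytesIO(struct.pack("<II", 7, 1000))

# Increase the second integer (at byte offset 4) by 24.
new_value = add_to_uint32(buf, offset=4, delta=24)

print(new_value, struct.unpack("<II", buf.getvalue()))
```

With a real file, it would need to be opened with open(path, "r+b") rather than "rb", since "rb" does not allow writing.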
