I use read_excel from the pandas library to read Excel content and convert it to JSON. I am struggling with an encoding issue: non-English characters come out escaped like "\u652f\u63f4\u8cc7\u8a0a".
How can I resolve this issue?
I tried
wb = xlrd.open_workbook(excel_filePath, encoding_override='ISO-8859-1')
new_data = pd.read_excel(wb)
Also
with open(excel_filePath, mode="r", encoding="utf-8") as file:
    new_data = pd.read_excel(excel_filePath)
I tried this code with encodings like utf-8, utf-16, latin1...
From the docs of the json module:
The RFC requires that JSON be represented using either UTF-8, UTF-16, or UTF-32, with UTF-8 being the recommended default for maximum interoperability.
As permitted, though not required, by the RFC, this module’s serializer sets ensure_ascii=True by default, thus escaping the output so that the resulting strings only contain ASCII characters.
It may be surprising that in this day and age the module still defaults to escaping non-ASCII characters (probably for backwards compatibility), so just override that behavior with ensure_ascii=False (and open the file with an explicit encoding='utf-8' so the characters can actually be written):
with open(json_filePath, 'w', encoding='utf-8') as f:
    json.dump(new_json, f, ensure_ascii=False)
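If the JSON is produced straight from the DataFrame, pandas can do the same thing itself; a minimal sketch, with placeholder file names (not taken from the question):

import pandas as pd

# Sketch: read the workbook and write JSON without escaping non-ASCII text.
# "data.xlsx" and "output.json" are placeholder names.
new_data = pd.read_excel("data.xlsx")
new_data.to_json("output.json", orient="records", force_ascii=False)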
I have some invalid characters in my file that I'm trying to remove. But I ran into a strange problem with one of them.
When I try to use the replace function, I get the error SyntaxError: EOL while scanning string literal.
I found that I was dealing with \x1d which is a group separator. I have this code to remove it:
import pandas as pd
df = pd.read_csv('C:/Users/tkp/Desktop/Holdings_Download/dws/example.csv',index_col=False, sep=';', encoding='utf-8')
print(df['col'][0])
df = df['col'][0].encode("utf-8").replace(b"\x1d", b"").decode()
df = pd.DataFrame([x.split(';') for x in df.split('\n')])
print(df[0][0])
Output:
Is there another way to do this? Because it seems to me that I couldn't have done it any worse than this.
Notice that you are getting a SyntaxError. This means that Python never gets as far as actually running your program, because it can't figure out what the program is!
To be honest, I'm not quite sure why this happens in this case, but using "exotic" characters in string constants is always a bit iffy, because it makes you dependent on what the character encoding of the source code is, and puts you at the mercy of all sorts of buggy editors. Therefore, I would recommend using the '\uXXXX' syntax to explicitly write the Unicode number for the character you wish to replace. (It looks like what you have here is U+2194 LEFT RIGHT ARROW, so '\u2194' should do it.)
Having said that, I would first verify that this is actually the problem, by changing the '↔' bit to something more mundane, like 'x' and seeing whether that causes the same error. If it does, then your problem is somewhere else...
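For example, a minimal sketch of that suggestion, assuming the unwanted character lives in the 'col' column from the question:

# Remove the character via its Unicode escape rather than pasting the literal
# character into the source file.
df["col"] = df["col"].str.replace("\u2194", "", regex=False)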
You can also drop the character across the whole DataFrame with replace and a regular expression (DataFrame.replace has no encoding argument):
df = df.replace('\x1d', '', regex=True)
Basically, I have a txt file which is full of numbers 1-5000, in no order. I am trying to import them into a python script to manipulate them and find info on averages, and whatnot.
I've tried many different methods of importing the list, but it always errors with "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"
list = []
with open('numbers.txt', 'r') as f:
    content = f.readlines()
for x in content:
    row = x.split()
    list.append(int(row[0]))
print(list)
The expected result is a list of numbers, in int format
However, I either get that error or, in certain executions, a list filled with \x00 between every character.
You can try decoding the file as UTF-16 and then splitting it as in your code.
My code is below:
with open(path_to_file, 'rb') as f:
    contents = f.read()
contents = contents.decode("utf-16").rstrip("\n")
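For example, a fuller sketch that builds the list of ints, assuming the file is UTF-16 encoded with one number per line (neither assumption is confirmed by the question):

numbers = []
with open("numbers.txt", "r", encoding="utf-16") as f:
    for line in f:
        line = line.strip()
        if line:                       # skip blank lines
            numbers.append(int(line))
print(numbers)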
Hope it helps.
MV
I'm using pymysql to query a database that has an entry like 'name': 'Te\xtCorp'; it's a name that I need to preserve. I'm sending it somewhere else with json.dumps(), and when it hits this entry it fails to escape the \x.
What's the proper way to escape the \x without double escaping everything else?
Two options here:
You escape the backslash, like:
'Te\\xtCorp'
You can use a raw string:
r'Te\xtCorp'
Both generate:
>>> 'Te\\xtCorp'
'Te\\xtCorp'
>>> r'Te\xtCorp'
'Te\\xtCorp'
Or printed:
>>> print(r'Te\xtCorp')
Te\xtCorp
Note that in order to inspect the content of the string, you should use a print(..) statement, otherwise you get the repr(..)esentation of that string. For example:
>>> print(json.dumps(r'te\xt'))
"te\\xt"
>>> print(json.loads(json.dumps(r'te\xt')))
te\xt
As one can read in the documentation on String literals:
\xhh...: ASCII character with hex value hh...
So it is used to encode any ASCII character, by specifying the code as a hexadecimal value.
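Worth adding: if the value really arrives from the database via pymysql rather than being typed as a literal in source code, the backslash is already an ordinary character in the string, and json.dumps escapes it correctly on its own. A minimal sketch with a stand-in value:

import json

name = "Te\\xtCorp"  # stands in for the value fetched with pymysql
print(json.dumps({"name": name}))                      # {"name": "Te\\xtCorp"}
print(json.loads(json.dumps({"name": name}))["name"])  # Te\xtCorp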
I have a file that is encoded in Unicode or UTF-8 (I don't know which). When I read the file in Python 3.4, the resulting string is interpreted as an ASCII string. How do I convert it to a Unicode string like u"text"?
The term "Unicode" refers to the standard, not to a particular encoding.
Since files in computers are binary, there exist different ways of encoding Unicode data in binary files. One of them is "UTF-8".
You can consult https://docs.python.org/3/howto/unicode.html
An example taken from this document (in the section "Reading and Writing Unicode Data"):
with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
In Python 3, unlike Python 2, string constants are Unicode by default and do not need the "u" prefix.
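If you prefer, you can also read the raw bytes and decode them yourself; a minimal sketch reusing the same file name:

with open('unicode.txt', 'rb') as f:
    data = f.read()              # bytes
text = data.decode('utf-8')      # str (Unicode) in Python 3
print(type(text), repr(text[:20]))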
I think I'm running into encoding issues. When I change to UTF-16, the error moves to the first line, "import temperature".
I installed Python 3.x thinking it might be a version issue, but I see the same symptoms.
Other exercise scripts I've been running have worked fine. Any ideas?
Python indicates the syntax error occurs at "temp = 0"
##### modules.py file
import temperature
temp = 212
convTemp = temperature.ftoc(temp)
print("The converted temp is " + str(convTemp))
temp = 0
convTemp = temperature.ctof(temp)
print("The converted temp is " + str(convTemp))
#### temperature.py file contents
def ftoc(temp):
    return (5.0/9.0) * (temp - 32.0)

def ctof(temp):
    return (9.0/5.0) * temp + 32.0
Results after correcting the code from the original post:
hostname$ python modules.py
Traceback (most recent call last):
  File "modules.py", line 1, in <module>
    import temperature
  File "/Users/[myusername]/Dropbox/python/temperature.py", line 1
SyntaxError: Non-ASCII character '\xfe' in file /Users/[myusername]/Dropbox/python/temperature.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
hostname$
This seems to be an encoding problem, indeed. The \xFE is part of the BOM (\xFE\xFF) for UTF-16 encodings.
Using UTF-16 for Python source code is not a good idea. You cannot give the Python parser a hint about the source file's encoding with an encoding declaration such as
# encoding: utf-8
See PEP 0263 for a detailed explanation; below is the important part:
Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding, this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters like e.g. UTF-16. The reason for this is to keep the encoding detection algorithm in the tokenizer simple.
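One way out is simply to re-save the file as UTF-8 (or plain ASCII). A minimal sketch, assuming temperature.py really is UTF-16 with a BOM (back up the file first):

# Re-encode temperature.py from UTF-16 to UTF-8 so the Python parser can read it.
with open("temperature.py", "r", encoding="utf-16") as src:
    text = src.read()
with open("temperature.py", "w", encoding="utf-8") as dst:
    dst.write(text)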
from temperature import ftoc
from temperature import ctof
This is redundant because you imported all the temperature functions with
import temperature
You have syntax errors: missing parentheses at the end of your print statements.
Also, you have str(temperature) where you want str(convTemp). Fix these things and I think it will work fine.