UnicodeEncodeError: 'ascii' codec can't encode character error - python

I am reading some files from Google Cloud Storage using Python:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('aggs').getOrCreate()
df = spark.read.option("sep", "\t").option("encoding", "UTF-8").csv('gs://path/', inferSchema=True, header=True, encoding='utf-8')
df.count()
df.show(10)
However, I keep getting an error that complains about the df.show(10) line:
df.show(10)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 350, in show
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 162: ordinal not in range(128)
I searched and found that this seems to be a common error, and that the suggested solution is to add the "UTF-8" encoding option to spark.read, which I have already done. Since that does not help and I am still getting the error, could anyone help? Thanks in advance.

How about exporting PYTHONIOENCODING before running your Spark job:
export PYTHONIOENCODING=utf8
For Python 3.7+, the following should also do the trick:
import sys
sys.stdout.reconfigure(encoding='utf-8')
For Python 2.x you can use the following:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
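To see why changing the output encoding helps, here is a minimal sketch (Python 3) that reproduces the failure without Spark. It assumes the error comes from writing non-ASCII text to a stream whose encoding is ASCII, which is what the traceback suggests. The in-memory streams stand in for sys.stdout and are not part of the original job.

```python
import io

text = u"bad byte \ufffd here"  # U+FFFD, the character named in the traceback

def write_with(encoding):
    """Write `text` to an in-memory text stream that uses `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

try:
    write_with("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # same class of error as the one raised in df.show()

utf8_bytes = write_with("utf-8")  # succeeds: UTF-8 can encode U+FFFD
```

Setting PYTHONIOENCODING should have the equivalent effect on the real sys.stdout, since it overrides the encoding Python picks for the standard streams.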

Related

How to solve 'charmap' codec can't encode character '\u300b'?

I'm trying to start a script, but I have run into a problem.
[ERROR] 'charmap' codec can't encode character '\u300b' in position 11: character maps to <undefined>
Your code is printing a character that the output stream's encoding cannot represent. Try changing the output encoding to "utf-8", or just put the code below at the very top of your script (Python 3.7+):
import sys
sys.stdin.reconfigure(encoding='utf-8')
sys.stdout.reconfigure(encoding='utf-8')
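If changing the stream encoding is not an option, another approach is to make unencodable characters degrade to escapes instead of raising. The sketch below assumes the console uses a legacy code page such as cp1252 (the error message does not say which one), and `to_console_safe` is a hypothetical helper, not part of any library.

```python
def to_console_safe(text, encoding="cp1252"):
    """Return `text` with characters that `encoding` cannot represent
    replaced by backslash escapes, so writing it out cannot raise."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

message = "title \u300b end"  # U+300B is the character from the error

try:
    message.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # 'charmap' codec can't encode character '\u300b'

safe = to_console_safe(message)  # contains the literal text \u300b
```

In a real script the encoding argument would more likely be taken from sys.stdout.encoding than hard-coded.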

Trouble loading CSV files

import xlrd
import pandas as pd
data = pd.read_csv("/Milk_Papers_Estimated_Class.csv")
path ='/Milk_Papers_Estimated_Class.csv'
I get the following error when trying to load the .csv file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 504: invalid continuation byte.
I do not know why I am facing this error. Can anyone help me out with this?
By default, read_csv decodes the file as UTF-8. Try passing latin-1 as the encoding instead:
data = pd.read_csv("/Milk_Papers_Estimated_Class.csv", encoding='latin-1')
That might work.
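One caveat: latin-1 maps every possible byte to some character, so it never fails, but it can silently produce the wrong text. A slightly more careful sketch is to try a short list of candidate encodings in strict mode and use the first one that fits. The helper name `first_working_encoding`, the candidate list, and the demo file are assumptions for illustration; only the standard library is used.

```python
import os
import tempfile

def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first encoding in `candidates` that decodes the whole file."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# Demo file: 0xe2 followed by a space is not valid UTF-8, as in the error above.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"name,value\ncr\xe2 me,1\n")

chosen = first_working_encoding(demo_path)
os.remove(demo_path)
```

The result could then be passed along as pd.read_csv(path, encoding=chosen). A strict match only shows that the bytes are decodable, not that the text is what the author intended, so it is still worth checking a few rows by eye.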

UnicodeDecodeError with nltk

I am working with Python 2.7 and NLTK on a large .txt file of content scraped from various websites. However, I am getting various Unicode errors such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)
My question is not so much how I can 'fix' this with Python. Instead, is there anything I can do to the .txt file (as in formatting) before 'feeding' it to Python, such as 'make plain text', to avoid this issue entirely?
Update:
I looked around and found a solution within Python that seems to work perfectly:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Try opening the file with:
f = open(fname, encoding="ascii", errors="surrogateescape")
Replace "ascii" with the desired encoding. Note that the built-in open only accepts these arguments on Python 3.
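As a small illustration of what the surrogateescape handler does (Python 3 only): bytes that the chosen encoding cannot decode are kept as placeholder code points and restored exactly when the text is encoded again with the same handler. The sample bytes are made up for the demo; 0xe2 0x80 0x99 is the UTF-8 encoding of a right single quotation mark, a common source of byte 0xe2 in scraped text.

```python
raw = b"it\xe2\x80\x99s scraped"  # UTF-8 bytes containing a curly apostrophe

# Strict ASCII decoding fails on 0xe2, like the error in the question.
try:
    raw.decode("ascii")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps each undecodable byte as a lone surrogate...
text = raw.decode("ascii", errors="surrogateescape")
# ...and encoding with the same handler restores the original bytes.
round_trip = text.encode("ascii", errors="surrogateescape")

# If the file is really UTF-8, decoding it as UTF-8 gives the intended text.
decoded = raw.decode("utf-8")
```

The round trip preserves the data, but the surrogate placeholders are not meaningful characters, so for text analysis it is usually better to decode with the file's real encoding if it is known.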

Python script utf-8 issue

I am trying to use this Python text-to-speech converter to convert Greek text into MP3.
The project's repository says UTF-8 is supported, but when I try to convert text like "Γεια σου" it throws the error shown below:
What I type on cmd: gtts-cli.py "Γεια σου" -l el -o hi.mp3
What I get:
'ascii' codec can't decode byte 0xf4 in position 0: ordinal not in range(128)
Any ideas?
Update:
I added UTF-8 support as shown below. I even updated to Python 3. I am still getting a similar error:
'utf8' codec can't decode byte 0xc3 in position 0: invalid continuation byte
What I added:
text = args.text.decode('utf-8')
Any ideas?
There is a related open issue in this project; please take a look.
It looks like somebody has already created a fix, but it has not been merged yet.
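One possible explanation for the second error, offered as an assumption rather than a diagnosis: on a Greek Windows console the argument may arrive encoded in a code page such as cp1253, not UTF-8. In cp1253 the letter Γ is the single byte 0xc3, which UTF-8 treats as the start of a two-byte sequence, so decoding as UTF-8 fails at position 0. The sketch below reproduces that with plain byte strings; `decode_argument` is a hypothetical helper, not part of gTTS.

```python
import locale

def decode_argument(raw, encoding=None):
    """Decode a raw command-line argument using the given encoding,
    falling back to the locale's preferred encoding."""
    return raw.decode(encoding or locale.getpreferredencoding(False))

greek = "Γεια σου"
raw_arg = greek.encode("cp1253")  # what a cp1253 console might hand over

try:
    raw_arg.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # byte 0xc3 in position 0: invalid continuation byte

text = decode_argument(raw_arg, "cp1253")
```

On Python 3, sys.argv entries are already str, so no decode call is needed there; the helper only matters where the argument is still a byte string, as on Python 2.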

python pyinstaller UnicodeDecodeError cp949

I get a UnicodeDecodeError when I try to install PyInstaller.
The error message reads:
UnicodeDecodeError: 'cp949' codec can't decode byte 0xe2 in position 208687: illegal multibyte sequence
When I googled this error, it looked like a problem with the codec used to read a file.
I tried some of the solutions I found online, but they didn't work.
How can I fix this?
I think your code has a function that prints some data using a codec that the Windows shell cannot display. Remove those calls and try again. (I cannot comment because I do not have enough reputation, so I wrote this as an answer.)
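Another possibility worth checking, stated here as an assumption and not a confirmed diagnosis of the PyInstaller install: a decode error like this typically appears when a UTF-8 file is read with the system's default encoding, which is cp949 on Korean Windows. The sketch below shows the same bytes failing under cp949 and decoding fine once the encoding is given explicitly. The file contents are made up for the demo.

```python
import os
import tempfile

# A UTF-8 file containing a curly apostrophe (bytes 0xe2 0x80 0x99).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write("PyInstaller\u2019s README".encode("utf-8"))

def read_text(path, encoding):
    """Read a whole text file using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()

try:
    read_text(demo_path, "cp949")
    cp949_ok = True
except UnicodeDecodeError:
    cp949_ok = False  # 'cp949' codec can't decode byte 0xe2

content = read_text(demo_path, "utf-8")
os.remove(demo_path)
```

If the failing open() call is in third-party code that cannot be edited, setting the environment variable PYTHONUTF8=1 (Python 3.7+) before running pip may also help, since it makes UTF-8 the default text encoding.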
