python : unicodeEncodeError: 'charpmap' codec can't encode character '\u2026' - python

I try to analyse some tweets I got from tweeter, but It seems I have a probleme of encoding, if you have any idea..
import json
#Next we will read the data in into an array that we call tweets.
tweets_data_path = 'C:/Python34/TESTS/twitter_data.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
try:
tweet = json.loads(line)
tweets_data.append(tweet)
except:
continue
print(len(tweets_data))#412 tweets
print(tweet)
I got the mistake :
File "C:\Python34\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0]
unicodeEncodeError: 'charpmap' codec can't encode character '\u2026' in position 1345: character maps to undefined
At work, I didn't get the error, but I have python 3.3, does it make a difference, do you think ?
-----EDIT
The comment from #MarkRamson answered my question

You should specify the encoding when opening the file:
tweets_file = open(tweets_data_path, "r", encoding="utf-8-sig")

Related

Reading CSV file. UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 336: character maps to

Trying to read two CSV files based on a function but when reading one (yelp.csv) I encounter an error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 336: character maps to
I tried the encoding but the error persists. I had identified the issue is when using .readlines(). Not sure how to fix this issue.
def readDataFromFile(fileName, seperator, encoding="utf8"):
with open(fileName, 'r') as panelf:
panelf.readline() # skip header
lines = []
data = panelf.readlines()
for line in data:
line = line.strip("\n").split(seperator)
lines.append(line)
return lines
panelData = readDataFromFile("Desktop/panel.csv", ",", encoding="utf-8")
yelpData = readDataFromFile("Desktop/yelp.csv", ",", encoding="utf-8")
The encoding variable is not used. Try:
def readDataFromFile(fileName, seperator, encoding="utf8"):
with open(fileName, 'r', encoding=encoding) as panelf:
panelf.readline() # skip header
lines = []
data = panelf.readlines()
for line in data:
line = line.strip("\n").split(seperator)
lines.append(line)
return lines
panelData = readDataFromFile("Desktop/panel.csv", ",", encoding="utf-8")
yelpData = readDataFromFile("Desktop/yelp.csv", ",", encoding="utf-8")

Cannot parse through a file with latin-1 encoding

I'm trying to parse a large file of tweets from the Stanford Sentiment Database (see here: http://help.sentiment140.com/for-students/), with the following being my code:
def init_process(fin, fout):
outfile = open(fout, 'a')
with open(fin, buffering=200000, encoding='latin-1') as f:
try:
for line in f:
line = line.replace('"', '')
initial_polarity = line.split(',')[0]
if initial_polarity == '0':
initial_polarity = [1, 0]
elif initial_polarity == '4':
initial_polarity = [0, 1]
tweet = line.split(',')[-1]
outline = str(initial_polarity) + ':::' + tweet
outfile.write(outline)
except Exception as e:
print(str(e))
outfile.close()
init_process('training.1600000.processed.noemoticon.csv','train_set.csv')
I've run into this following issue:
'ascii' codec can't encode characters in position 12-14: ordinal not in range(128)
which doesn't make sense since I'm opening the file with a latin-1 encoding. How do I stop this error and successfully parse through the file?
It's probably the outfile encoding that's still ASCII. You should open it with the proper encoding, too (doesn't have to be latin-1, probably utf-8 is more appropriate depending on your environment).
Per comment from Åsmund: the file encoding is locale-specific, you should probably consider changing your locale to something that can handle non-ASCII text.

return codecs.ascii_decode(input, self.errors)[0]

I am reading a songs file in csv format and I do not know what I am doing wrong.
import csv
import os
import random
file = open("songs.csv", "rU")
reader = csv.reader(file)
for song in reader:
print(song[0], song[1], song[2])
file.close()
This is the error:
Traceback (most recent call last):
File "/Users/kuku/Desktop/hey/mine/test.py", line 10, in <module>
for song in reader:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 414: ordinal not in range(128)
try
for song in [unicode(song, 'utf-8') for song in reader]:
print(...)
With this bit of your code:
for song in reader:
print( song[0], song[1],song[2])
you are printing elements 0, 1 and 2 of the lines in reader during each iteration of the loop. This will cause a (different) error if there are fewer than 3 elements in total.
If you don't know that there will be at least 3 elements in each line, you could include the code in a try, except block:
with open("songs.csv", "r") as f:
song_reader = csv.reader(f)
for song_line in song_reader:
lyric = song_line
try:
print(lyric[0], lyric[1], lyric[2])
except:
pass # ...or preferably do something better
It's worth noting that in most cases it is preferable to open a file within a with block, as shown above. This negates the need for file.close().
You can open the file in utf-8 encoding.
file = open("songs.csv", "rU", encoding="utf-8")

codecs.ascii_decode(input, self.errors)[0] UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 318: ordinal not in range(128)

I am trying to open and readlines a .txt file that contains a large amount of text. Below is my code, i dont know how to solve this problem. Any help would be very appreciated.
file = input("Please enter a .txt file: ")
myfile = open(file)
x = myfile.readlines()
print (x)
when i enter the .txt file this is the full error message is displayed below:
line 10, in <module> x = myfile.readlines()
line 26, in decode return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 318: ordinal not in range(128)
Instead of using codecs, I solve it this way:
def test():
path = './test.log'
file = open(path, 'r+', encoding='utf-8')
while True:
lines = file.readlines()
if not lines:
break
for line in lines:
print(line)
You must give encoding param precisely.
You can also try to encode :
with open(file) as f:
for line in f:
line = line.encode('ascii','ignore').decode('UTF-8','ignore')
print(line)
#AndriiAbramamov is right, your shoud check that question, here is a way you can open your file which is also on that link
import codecs
f = codecs.open('words.txt', 'r', 'UTF-8')
for line in f:
print(line)
Another way is to use regex, so when you open the file you can remove any special character like double quotes and so on.

python encoding issue, searching tweets

I have written the following code to crawl tweets with 'utf-8' encoding:
kws=[]
f=codecs.open("keywords", encoding='utf-8')
kws = f.readlines()
f.close()
print kws
for kw in kws:
timeline_endpoint ='https://api.twitter.com/1.1/search/tweets.json?q='+kw+'&count=100&lang=fr'
print timeline_endpoint
response, data = client.request(timeline_endpoint)
tweets = json.loads(data)
for tweet in tweets['statuses']:
my_on_data(json.dumps(tweet.encode('utf-8')))
time.sleep(3)
but I am getting the following error:
response, data = client.request(timeline_endpoint)
File "build/bdist.linux-x86_64/egg/oauth2/__init__.py", line 676, in request
File "build/bdist.linux-x86_64/egg/oauth2/__init__.py", line 440, in to_url
File "/usr/lib/python2.7/urllib.py", line 1357, in urlencode
l.append(k + '=' + quote_plus(str(elt)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
I would appreciate any help.
Okay, here is the solution using a different search approach:
auth = tweepy.OAuthHandler("k1", "k2")
auth.set_access_token("k3", "k4")
api = tweepy.API(auth)
for kw in kws:
max_tweets = 10
searched_tweets = [status for status in tweepy.Cursor(api.search, q=kw.encode('utf-8')).items(max_tweets)]
for tweet in searched_tweets:
my_on_data(json.dumps(tweet._json))
time.sleep(3)

Categories

Resources