Python script receiving a UnicodeEncodeError: 'ascii' codec can't encode character

I have a simple Python script that pulls posts from Reddit and posts them on Twitter. Unfortunately, tonight it began having issues that I assume are caused by a formatting problem in someone's post title on Reddit. The error that I'm receiving is:
  File "redditbot.py", line 82, in <module>
    main()
  File "redditbot.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "redditbot.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)
And here is my script:
# encoding=utf8
import praw
import json
import requests
import tweepy
import time
import urllib2
import sys
reload(sys)
sys.setdefaultencoding('utf8')
access_token = 'hidden'
access_token_secret = 'hidden'
consumer_key = 'hidden'
consumer_secret = 'hidden'
def strip_title(title):
    if len(title) < 75:
        return title
    else:
        return title[:74] + "..."
def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=2000):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]
        mini_post_dict[post_title] = post_link
    return mini_post_dict, post_ids
def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('PythonReddit PyReTw'
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit('python')
    return subreddit
def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found
def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")
def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)
def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(3000)
        else:
            print "[bot] Already posted"
if __name__ == '__main__':
    main()
Any help would be very much appreciated - thanks in advance!

Consider this simple program:
print(u'\u201c' + "python")
If you try printing to a terminal (with an appropriate character encoding), you get
“python
However, if you try redirecting output to a file, you get a UnicodeEncodeError.
script.py > /tmp/out
Traceback (most recent call last):
  File "/home/unutbu/pybin/script.py", line 4, in <module>
    print(u'\u201c' + "python")
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
When you print to a terminal, Python uses the terminal's character encoding to encode unicode. (Terminals can only print bytes, so unicode must be encoded in order to be printed.)
When you redirect output to a file, Python cannot determine the character encoding, since files have no declared encoding. So by default Python 2 implicitly encodes all unicode using the ascii encoding before writing to the file. Since u'\u201c' cannot be ascii encoded, a UnicodeEncodeError is raised. (Only the first 128 Unicode code points, 0 through 127, can be encoded with ascii.)
This issue is explained in detail in the Why Print Fails wiki.
To fix the problem, first, avoid adding unicode and byte strings. This causes implicit conversion using the ascii codec in Python 2, and an exception in Python 3. To future-proof your code, it is better to be explicit. For example, build the whole line as unicode, then encode it explicitly before printing the bytes:
line = u'{} {} #python'.format(post, post_dict[post])
print(line.encode('utf-8'))
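The failure and the fix can be reproduced without a terminal. The following is a minimal sketch, not the poster's code: the title, the URL and the fake_stdout byte stream (which stands in for a redirected stdout) are made up for illustration. It uses u'' literals so that it runs on Python 2.7 as well as Python 3.

```python
import io

# Hypothetical title containing curly quotes, like the one in the traceback.
title = u'\u201cHello\u201d from reddit'
line = u'{} {} #python'.format(title, u'http://example.com/post')

# Encoding with the ascii codec fails, just like the implicit conversion does.
try:
    line.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly with UTF-8 succeeds for any valid text.
encoded = line.encode('utf-8')

# A byte stream stands in for stdout redirected to a file.
fake_stdout = io.BytesIO()
fake_stdout.write(encoded + b'\n')
```

The point of the sketch is that the text stays unicode until the very last step, and the codec is chosen explicitly at the moment the bytes leave the program.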

You are trying to print a unicode string to your terminal (or possibly a file by IO redirection), but the encoding used by your terminal (or file system) is ASCII. Because of this Python attempts to convert it from the unicode representation to ASCII, but fails because codepoint u'\u201c' (“) can not be represented in ASCII. Effectively your code is doing this:
>>> print u'\u201c'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)
You could try converting to UTF-8:
print (post + " " + post_dict[post] + " #python").encode('utf8')
or convert to ASCII like this:
print (post + " " + post_dict[post] + " #python").encode('ascii', 'replace')
which will replace every character that cannot be represented in ASCII with ?.
Another way, which is useful if you are printing for debugging purposes, is to print the repr of the string:
print repr(post + " " + post_dict[post] + " #python")
which would output something like this:
>>> s = u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
>>> print repr(s)
u'string with \u201cLEFT DOUBLE QUOTATION MARK\u201c'
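To compare the options side by side, here is a small sketch using a made-up string. The 'replace' handler is the one mentioned above; 'backslashreplace' is another standard error handler that keeps the code point visible instead of substituting ?, which can be handy for logs.

```python
# Hypothetical sample text with curly quotes.
s = u'\u201cquoted\u201d title'

# Lossy: every non-ASCII character becomes '?'.
replaced = s.encode('ascii', 'replace')

# Lossless for debugging: non-ASCII characters become \uXXXX escapes.
escaped = s.encode('ascii', 'backslashreplace')

# Lossless and readable by anything that understands UTF-8.
utf8 = s.encode('utf-8')
```

Only the UTF-8 result can be decoded back into the original text, so the ASCII variants are best kept for display or debugging output.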

The problem likely arises from mixing bytestrings and unicode strings on concatenation. As an alternative to prefixing all string literals with u, maybe
from __future__ import unicode_literals
fixes things for you. See here for a deeper explanation and to decide whether it's an option for you or not.

Related

How to get a webpage with unicode chars in python

I am trying to get and parse a webpage that contains non-ASCII characters (the URL is http://www.one.co.il). This is what I have:
url = "http://www.one.co.il"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
encoding = response.headers.getparam('charset') # windows-1255
html = response.read() # The length of this is valid - about 31000-32000,
# but printing the first characters shows garbage -
# '\x1f\x8b\x08\x00\x00\x00\x00\x00', instead of
# '<!DOCTYPE'
html_decoded = html.decode(encoding)
The last line gives me an exception:
  File "C:/Users/....\WebGetter.py", line 16, in get_page
    html_decoded = html.decode(encoding)
  File "C:\Python27\lib\encodings\cp1255.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0xdb in position 14: character maps to <undefined>
I tried looking at other related questions such as urllib2 read to Unicode and How to handle response encoding from urllib.request.urlopen() , but didn't find anything helpful about this.
Can someone please shed some light and guide me in this subject? Thanks!
0x1f 0x8b 0x08 is the magic number for a gzipped file. You will need to decompress it before you can use the contents.
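A rough sketch of that decompression step, assuming the body really is gzip data: decode_body is a hypothetical helper name, and the compressed page is built locally here so the example does not depend on the network. It checks for the gzip magic number before decoding with the declared charset.

```python
import gzip
import io

GZIP_MAGIC = b'\x1f\x8b'

def decode_body(raw, encoding):
    """Return text from a response body that may be gzip-compressed."""
    if raw[:2] == GZIP_MAGIC:
        # Decompress first; the charset only applies to the uncompressed bytes.
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(encoding)

# Simulate a gzipped windows-1255 page like the one in the question.
page = u'<!DOCTYPE html><title>\u05e9\u05dc\u05d5\u05dd</title>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(page.encode('windows-1255'))
compressed = buf.getvalue()

text = decode_body(compressed, 'windows-1255')
```

In the question's code, the same helper would be called as decode_body(response.read(), encoding). Checking the response's Content-Encoding header is another way to detect compression, if the server sets it.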

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

I'm new to Python and have to build an application to get historical data from Twitter. I can see the tweets in my console and all the information I need!
But the problem I have now is that I need to write this information to a .csv file, and I encounter the following error while running the code:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)". I know the problem is that the tweets I'm collecting are written in Swedish, so the letters Å, Ä and Ö appear frequently.
Can anyone please help me with this, or give me pointers as to where I should start looking for a solution?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from TwitterSearch import *
import csv
import codecs
try:
    tuo = TwitterUserOrder('Vismaspcs') # create a TwitterUserOrder
    ts = TwitterSearch(
        consumer_key = '',
        consumer_secret = '',
        access_token = '',
        access_token_secret = ''
    )
    # start asking Twitter about the timeline
    for tweet in ts.search_tweets_iterable(tuo):
        print( '#%s tweeted: %s' % ( tweet['user']['screen_name'], tweet['text']) )
        print (tweet['created_at'],tweet['favorite_count'],tweet ['retweet_count'])
        with open('visma.csv','w') as fout:
            writer=csv.writer(fout)
            writer.writerows([tweet['user']['screen_name'],tweet['text'],tweet['created_at'],tweet['favorite_count'],tweet['retweet_count']])
except TwitterSearchException as e: # catch all those ugly errors
    print(e)
The csv module cannot handle unicode in python2:
Note This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
You can use tweet['user']['screen_name'].encode("utf-8")...
Thanks,
I used file.write(word['value'].encode("utf-8")) and it works too :)
If you are not writing to a file, you can still call .encode('utf8') on the string itself.
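Putting the encoding advice together, a sketch of the CSV step might look like the following. The tweets list, the temporary file and the helper names open_csv_for_writing and prepare_row are invented for illustration. On Python 2 the cells are encoded to UTF-8 byte strings as suggested above, while on Python 3 the csv module takes text and the file itself is opened as UTF-8.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2

def open_csv_for_writing(path):
    """Open a CSV file the way each Python version's csv module expects."""
    if PY2:
        return open(path, 'wb')  # Python 2: csv writes byte strings
    return io.open(path, 'w', encoding='utf-8', newline='')  # Python 3: text

def prepare_row(row):
    """On Python 2, encode text cells to UTF-8; on Python 3, pass through."""
    if PY2:
        return [c.encode('utf-8') if isinstance(c, unicode) else c for c in row]
    return row

# Hypothetical tweets standing in for the TwitterSearch results.
tweets = [
    {'user': u'Vismaspcs', 'text': u'Hej p\xe5 dig', 'retweets': 3},
    {'user': u'Vismaspcs', 'text': u'\xc5\xc4\xd6 fungerar', 'retweets': 0},
]

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)

# Open the file once, outside the loop, and write one row per tweet.
with open_csv_for_writing(path) as fout:
    writer = csv.writer(fout)
    for tweet in tweets:
        writer.writerow(prepare_row([tweet['user'], tweet['text'], tweet['retweets']]))

with open(path, 'rb') as f:
    written = f.read().decode('utf-8')
os.remove(path)
```

Two details in the sketch go beyond the encoding question: writerow takes a single row, whereas writerows expects a list of rows, and opening the file in 'w' mode inside the loop would truncate it on every tweet.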

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 39: ordinal not in range(128)

I'm new to Python and I've been trying to fix it for two hours now.
Here's the code:
import praw
import json
import requests
import tweepy
import time
access_token = 'REDACTED'
access_token_secret = 'REDACTED'
consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'
def strip_title(title):
    if len(title) < 94:
        return title
    else:
        return title[:93] + "..."
def tweet_creator(subreddit_info):
    post_dict = {}
    post_ids = []
    print "[bot] Getting posts from Reddit"
    for submission in subreddit_info.get_hot(limit=20):
        post_dict[strip_title(submission.title)] = submission.url
        post_ids.append(submission.id)
    print "[bot] Generating short link using goo.gl"
    mini_post_dict = {}
    for post in post_dict:
        post_title = post
        post_link = post_dict[post]
        short_link = shorten(post_link)
        mini_post_dict[post_title] = short_link
    return mini_post_dict, post_ids
def setup_connection_reddit(subreddit):
    print "[bot] setting up connection with Reddit"
    r = praw.Reddit('yasoob_python reddit twitter bot '
                    'monitoring %s' %(subreddit))
    subreddit = r.get_subreddit(subreddit)
    return subreddit
def shorten(url):
    headers = {'content-type': 'application/json'}
    payload = {"longUrl": url}
    url = "https://www.googleapis.com/urlshortener/v1/url"
    r = requests.post(url, data=json.dumps(payload), headers=headers)
    link = json.loads(r.text)['id']
    return link
def duplicate_check(id):
    found = 0
    with open('posted_posts.txt', 'r') as file:
        for line in file:
            if id in line:
                found = 1
    return found
def add_id_to_file(id):
    with open('posted_posts.txt', 'a') as file:
        file.write(str(id) + "\n")
def main():
    subreddit = setup_connection_reddit('python')
    post_dict, post_ids = tweet_creator(subreddit)
    tweeter(post_dict, post_ids)
def tweeter(post_dict, post_ids):
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)
    for post, post_id in zip(post_dict, post_ids):
        found = duplicate_check(post_id)
        if found == 0:
            print "[bot] Posting this link on twitter"
            print post+" "+post_dict[post]+" #python"
            api.update_status(post+" "+post_dict[post]+" #python")
            add_id_to_file(post_id)
            time.sleep(30)
        else:
            print "[bot] Already posted"
if __name__ == '__main__':
    main()
Traceback:
root#li732-134:~# python twitter.py
[bot] setting up connection with Reddit
[bot] Getting posts from Reddit
[bot] Generating short link using goo.gl
[bot] Already posted
[bot] Already posted
[bot] Already posted
[bot] Posting this link on twitter
Traceback (most recent call last):
  File "twitter.py", line 82, in <module>
    main()
  File "twitter.py", line 64, in main
    tweeter(post_dict, post_ids)
  File "twitter.py", line 74, in tweeter
    print post+" "+post_dict[post]+" #python"
UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 39: ordinal not in range(128)
I really have no idea what to do. Could someone point me in the right direction?
Edit: Added code and traceback.
Even if you call decode(), the bytes you're receiving have to be in an expected, properly encoded form.
If \xea is encountered in a UTF-8 string, it must be followed by two bytes, and not just any bytes, they have to be in the valid range. Otherwise, it's not valid UTF-8.
E.g. here are two Unicode code points. The first code point U+56 takes only a single byte. The next one, U+a000 requires three bytes, and the way we know that is because we encounter \xea:
http://hexutf8.com/?q=0x560xea0x800x80
Simply remove the last of the continuation bytes in the above, and this ceases to be valid UTF-8:
http://hexutf8.com/?q=0x560xea0x80
I don't see where you've posted the value you're failing on, but I'd double-check that and make sure you're actually getting valid UTF-8 data.
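The two byte sequences from the links above can also be checked directly in Python. This is only a small illustration of the UTF-8 rule being described:

```python
# 0x56 is 'V' (one byte); 0xea 0x80 0x80 is U+A000 (three bytes).
valid = b'\x56\xea\x80\x80'
text = valid.decode('utf-8')

# Drop the last continuation byte and the data is no longer valid UTF-8.
truncated = b'\x56\xea\x80'
try:
    truncated.decode('utf-8')
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```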
The error happens here:
print post+" "+post_dict[post]+" #python"
The problem seems to be that you're concatenating ASCII strings and Unicode strings in this line. That's causing a problem here. Try concatenating only Unicode strings:
print post + u" " + post_dict[post] + u" #python"
If you're still having problems, look at the output of type(post) and type(post_dict[post]) which should both be Unicode strings. If either of them isn't then you'll need to convert them to be a Unicode string using the correct encoding (most likely UTF-8). That can be done as follows:
post.decode('UTF-8')
or:
post_dict[post].decode('UTF-8')
The above would convert a string to a Unicode string in Python 2. Once you've done that you can safely concatenate the Unicode strings together. The key thing in Python 2 is to never mix regular strings with Unicode strings as that'll cause problems.
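One way to apply this consistently is to normalise every value to a Unicode string at the point where it enters the program. A minimal sketch follows, in which to_text is a hypothetical helper and the sample title and URL are invented:

```python
def to_text(value, encoding='utf-8'):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

# Hypothetical inputs: one byte string (UTF-8) and one Unicode string.
title = b'caf\xc3\xa9 \xc3\xaa'
url = u'http://example.com/x'

# Every operand is Unicode, so no implicit ascii conversion takes place.
tweet = to_text(title) + u' ' + to_text(url) + u' #python'
```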

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)

I have been working on a program to retrieve questions from Stack Overflow. Till yesterday the program was working fine, but since today I'm getting the error
"Message File Name Line Position
Traceback
<module> C:\Users\DPT\Desktop\questions.py 13
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 34: ordinal not in range(128)"
Currently, the questions are being displayed, but I seem to be unable to copy the output to a new text file.
import sys
sys.path.append('.')
import stackexchange
so = stackexchange.Site(stackexchange.StackOverflow)
term= raw_input("Enter the keyword for Stack Exchange")
print 'Searching for %s...' % term,
sys.stdout.flush()
qs = so.search(intitle=term)
print '\r--- questions with "%s" in title ---' % (term)
for q in qs:
    print '%8d %s' % (q.id, q.title)
    with open('E:\questi.txt', 'a+') as question:
        question.write(q.title)
time.sleep(10)
with open('E:\questi.txt') as intxt:
    data = intxt.read()
regular = re.findall('[aA-zZ]+', data)
print(regular)
tokens = set(regular)
with open('D:\Dictionary.txt', 'r') as keywords:
    keyset = set(keywords.read().split())
with open('D:\Questionmatches.txt', 'w') as matches:
    for word in keyset:
        if word in tokens:
            matches.write(word + '\n')
q.title is a Unicode string. When writing it to a file, you need to encode it first, preferably with a fully Unicode-capable encoding such as UTF-8 (if you don't, Python will default to the ASCII codec, which doesn't support any code point above 127).
question.write(q.title.encode("utf-8"))
should fix the problem.
By the way, the program tripped up on character “ (U+201C).
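As an alternative to calling .encode() on every write, the file can be opened with an explicit encoding so that it accepts Unicode strings directly. The sketch below uses io.open, which is available in Python 2.6+ and is the same function as the built-in open in Python 3. The title and the temporary file are placeholders for q.title and the poster's output file.

```python
import io
import os
import tempfile

# Hypothetical title containing the character that tripped up the program.
title = u'\u201cWhy does this fail?\u201d'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# The file object encodes to UTF-8 on write; no manual .encode() is needed.
with io.open(path, 'a', encoding='utf-8') as question:
    question.write(title + u'\n')

# Reading with the same encoding gives the original text back.
with io.open(path, encoding='utf-8') as intxt:
    read_back = intxt.read()
os.remove(path)
```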
I ran into this as well using Transifex API
response['source_string']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 3: ordinal not in range(128)
Fixed with response['source_string'].encode("utf-8")
import requests
username = "api"
password = "PASSWORD"
AUTH = (username, password)
url = 'https://www.transifex.com/api/2/project/project-site/resource/name-of-resource/translation/en/strings/?details'
response = requests.get(url, auth=AUTH).json()
print response['key'], response['context']
print response['source_string'].encode("utf-8")

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, python2.7 told me it didn't recognize the encoding, despite no special characters. Don't know if that's useful info.
My code looks like this:
from urllib import FancyURLopener
import os
class MyOpener(FancyURLopener): #spoofs a real browser on Window
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
print "What is the webaddress?"
webaddress = raw_input("8::>")
print "Folder Name?"
foldername = raw_input("8::>")
if not os.path.exists(foldername):
    os.makedirs(foldername)
def urlpuller(start, page):
    while page[start]!= '"':
        start += 1
    close = start
    while page[close]!='"':
        close += 1
    return page[start:close]
myopener = MyOpener()
response = myopener.open(webaddress)
site = response.read()
nexturl = ''
counter = 0
while(nexturl!=webaddress):
    counter += 1
    start = 0
    for i in range(len(site)-35):
        if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
            start = i + 40
            break
    else:
        print "Something's broken, chief. Error = 1"
    next = 0
    for i in range(start, 8, -1):
        if site[i:i+8] == u'<a href=':
            next = i
            break
    else:
        print "Something's broken, chief. Error = 2"
    nexturl = urlpuller(next, site)
    myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')
print("Retrieval of "+foldername+" completed.")
When I try to run it using the site I'm using, it returns the error:
Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data
When pointed at http://google.com, it worked just fine.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
but when I try to decode using utf-8, as you can see, it does not work.
Any suggestions?
site[i:i+35].decode('utf-8')
You cannot slice the bytes you've received at arbitrary offsets and then ask the UTF-8 codec to decode each slice. UTF-8 is a multibyte encoding, meaning a single character can take anywhere from 1 to 4 bytes. If a slice ends partway through a character and you ask Python to decode it, you get the unexpected end of data error.
Look into a tool that has this built for you. BeautifulSoup or lxml are two alternatives.
Open the csv file in sublime and "Save with Encoding" -> UTF-8.
Instead of your for-loop do something like:
start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40
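The difference between slicing bytes and decoding first can be seen with a tiny made-up page. In this sketch the HTML is invented; only the marker string comes from the question.

```python
marker = u'<img id="imgSized" class="slideImg"'

# Hypothetical page: a non-ASCII character appears before the marker.
site = (u'<p>caf\xe9</p>' + marker + u' src="a.jpg">').encode('utf-8')

# Slicing the raw bytes can cut the two-byte sequence for U+00E9 in half.
try:
    site[:7].decode('utf-8')
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False

# Decoding the whole document once and searching the text is safe.
text = site.decode('utf-8')
start = text.find(marker)
```

Note that find returns -1 when the marker is absent, so the result is worth checking before adding an offset to it.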
