I am trying to scrape through Twitter bios using the Twitter API with Python.
However I get this error:
newFile.writerow(info)
UnicodeEncodeError: 'ascii' codec can't
encode characters in position 0-4: ordinal not in range(128)
I assume this occurs when someone has an emoji in their bio or screen name, however none of the following solutions seem to stop the error:
.encode('unicode_escape')
.encode('UTF8')
.encode('UTF-8')
Here is the current code
for follower in followers.items():
info=[]
name =follower.name.encode('unicode_escape')
screen_name = follower.screen_name.encode('unicode_escape')
userId = userId + 1
#add values to array
values.append(userId)
values.append(name)
values.append(screen_name)
csvFile = open('followers.csv','a')
newFile =csv.writer(csvFile) #imported csv
#add list of headers as a new row
newFile.writerow(info)
#close file
csvFile.close()
A major problem is that Python's CSV module is not Unicode safe - See the warnings in https://docs.python.org/2/library/csv.html
The work around, as you've found is to encode all values to UTF-8 first:
name = follower.name.encode('UTF-8')
screen_name = follower.screen_name.encode('UTF-8')
The problem you're hitting now is Python is still trying to encode your values to ASCII. This is due to the way you've opened the file for writing. Add b for binary writing:
csvFile = open('followers.csv','ab')
In its complete form:
for follower in followers.items():
info=[]
name = follower.name.encode('UTF-8')
screen_name = follower.screen_name.encode('UTF-8')
userId = userId + 1
#add values to array
values.append(userId)
values.append(name)
values.append(screen_name)
csvFile = open('followers.csv','ab')
newFile =csv.writer(csvFile) #imported csv
#add list of headers as a new row
newFile.writerow(info)
#close file
csvFile.close()
Related
I have experienced a code problem in Python 2.7, I already used UTF-8, but it still got the exception
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 81: ordinal not in range(128)"
My files and contains so many this kind of shit, but for some reason, I'm not allowed to delete it.
desktop,[Search] Store | Automated Titles,google / cpc,Titles > Kesäkaverit,275285048,13
I have tried the below method to avoid, but still, haven't fix it. Can anyone help me ?
1.With "#!/usr/bin/python" in my file header
2.Set setdefaultencoding
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
3.content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf-8", "ignore")
My code below
content = unicode(s3core.download_file_to_memory(S3_PROFILE, S3_RAW + file), "utf8", "ignore")
rows = content.split('\n')[1:]
for row in rows:
if not row:
continue
try:
# fetch variables
cols = row.rstrip('\n').split(',')
transaction = cols[0]
device_category = cols[1]
campaign = cols[2]
source = cols[3].split('/')[0].strip()
medium = cols[3].split('/')[1].strip()
ad_group = cols[4]
transactions = cols[5]
data_list.append('\t'.join(
['-'.join([dt[:4], dt[4:6], dt[6:]]), country, transaction, device_category, campaign, source,
medium, ad_group, transactions]))
except:
print 'ignoring row: ' + row
I am using facebook graph API but getting error when I try to run graph.py
How should I resolve this problem of charmap. I am facing unicode decode error.
enter image description here
In graph.py :
table = json2html.convert(json = variable)
htmlfile=table.encode('utf-8')
f = open('Table.html','wb')
f.write(htmlfile)
f.close()
# replacing '>' with '>' and '<' with '<'
f = open('Table.html','r')
s=f.read()
s=s.replace(">",">")
s=s.replace("<","<")
f.close()
# writting content to html file
f = open('Table.html','w')
f.write(s)
f.close()
# output
webbrowser.open("Table.html")
else:
print("We couldn't find anything for",PageName)
I could not understand why I am facing this issue. Also getting some error with 's=f.read()'
In error message I see it tries to guess encoding used in file when you read it and finally it uses encoding cp1250 to read it (probably because Windows use cp1250 as default in system) but it is incorrect encoding becuse you saved it as 'utf-8'.
So you have to use open( ..., encoding='utf-8') and it will not have to guess encoding.
# replacing '>' with '>' and '<' with '<'
f = open('Table.html','r', encoding='utf-8')
s = f.read()
f.close()
s = s.replace(">",">")
s = s.replace("<","<")
# writting content to html file
f = open('Table.html','w', encoding='utf-8')
f.write(s)
f.close()
But you could change it before you save it. And then you don't have to open it again.
table = json2html.convert(json=variable)
table = table.replace(">",">").replace("<","<")
f = open('Table.html', 'w', encoding='utf-8')
f.write(table)
f.close()
# output
webbrowser.open("Table.html")
BTW: python has function html.unescape(text) to replace all "chars" like > (so called entity)
import html
table = json2html.convert(json=variable)
table = html.unescape(table)
f = open('Table.html', 'w', encoding='utf-8')
f.write(table)
f.close()
# output
webbrowser.open("Table.html")
This question already has answers here:
Read and Write CSV files including unicode with Python 2.7
(7 answers)
Closed 5 years ago.
I have some variables in Unicode.
title
u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
type(title)
unicode
If I print this vaiable, I get:
print (title)
Администратор интернет-магазин
When I try to write this data (Cyrillic symbols) to CSV file:
with open('avito.csv','a') as f:
writer=csv.writer(f)
writer.writerow((title))
This error occurs:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0410' in position 0: ordinal not in range(128)
How can I write this variable as Cyrillic symbols to a CSV?
You have to write to the file with the correct encoding, and from your comment I guess, it is cp1251:
import io
title = u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
with io.open('avito.csv', 'a', encoding='cp1251') as output:
output.write(title + '\n')
Three ways on Python 2.7. Note that to open the files in Excel that program likes a UTF-8 BOM encoded at the start of the file. I write it manually in the brute force method, but the utf-8-sig codec will handle it for you otherwise. Skip the BOM signature if you aren't dealing with lame editors (Windows Notepad) or Excel.
import csv
import codecs
import cStringIO
title = u'\u0410\u0434\u043c\u0438\u043d\u0438\u0441\u0442\u0440\u0430\u0442\u043e\u0440 \u0438\u043d\u0442\u0435\u0440\u043d\u0435\u0442-\u043c\u0430\u0433\u0430\u0437\u0438\u043d\u0430'
print(title)
# Brute force
with open('avito.csv','wb') as f:
f.write(u'\ufeff'.encode('utf8')) # writes "byte order mark" UTF-8 signature
writer=csv.writer(f)
writer.writerow([title.encode('utf8')])
# Example from the documentation for csv module
class UnicodeWriter:
"""
A CSV writer which will write rows to CSV file "f",
which is encoded in the given encoding.
"""
def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
# Redirect output to a queue
self.queue = cStringIO.StringIO()
self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
self.stream = f
self.encoder = codecs.getincrementalencoder(encoding)()
def writerow(self, row):
self.writer.writerow([s.encode("utf-8") for s in row])
# Fetch UTF-8 output from the queue ...
data = self.queue.getvalue()
data = data.decode("utf-8")
# ... and reencode it into the target encoding
data = self.encoder.encode(data)
# write to the target stream
self.stream.write(data)
# empty queue
self.queue.truncate(0)
def writerows(self, rows):
for row in rows:
self.writerow(row)
with open('avito2.csv','wb') as f:
w = UnicodeWriter(f)
w.writerow([title])
# 3rd party module, install from pip
import unicodecsv
with open('avito3.csv','wb') as f:
w = unicodecsv.writer(f,encoding='utf-8-sig')
w.writerow([title])
I am having an encoding issue when I run my script below:
Here is the error code:
-UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
Here is my script:
import logging
import urllib
import csv
import json
import io
import codecs
with open('/home/local/apple.csv',
'rb') as csvinput:
reader = csv.reader(csvinput, delimiter=',')
firstline = True
for row in reader:
if firstline:
firstline = False
continue
address1 = row[0]
print row[0]
locality = row[1]
admin_area = row[2]
query = ' '.join(str(x) for x in (address1, locality, admin_area))
normalized = query.replace(" ", "+")
BaseURL = 'http://localhost:8080/verify?country=JP&freeform='
URL = BaseURL + normalized
print URL
data = urllib.urlopen(URL)
response = data.getcode()
print response
if response == 200:
file= json.load(data)
print file
output_f=open('output.csv','wb')
csvwriter=csv.writer(output_f)
count = 0
for f in file:
if count == 0:
header= f.keys()
csvwriter.writerow(header)
count += 1
csvwriter.writerow(f.values())
output_f.close()
else:
print 'error'
can anyone help me fix this its getting really annoying. I need to encode to utf8
Looks like you are using Python 2.x, instead of python's standard open, use codecs.open where you can optionally pass an encoding to use and what to do when there are errors. Gets a little less confusing in Python 3 where the standard Python open can do this.
So in your two lines where you are opening, do:
with codecs.open('/home/local/apple.csv',
'rb', 'utf-8') as csvinput:
output_f = codecs.open('output.csv','wb', 'utf-8')
The optional error parm defaults to "strict" which raises an exception if the bytes can't be mapped to the given encoding. In some contexts you may want to use 'ignore' or 'replace'.
See the python doc for a bit more info.
I have some simple code to ingest some JSON Twitter data, and output some specific fields into separate columns of a CSV file. My problem is that I cannot for the life of me figure out the proper way to encode the output as UTF-8. Below is the closest I've been able to get, with the help of a member here, but I still it still isn't running correctly and fails because of the unique characters in the tweet text field.
import json
import sys
import csv
import codecs
def main():
writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
for line in sys.stdin:
line = line.strip()
data = []
try:
data.append(json.loads(line))
except ValueError as detail:
continue
for tweet in data:
## deletes any rate limited data
if tweet.has_key('limit'):
pass
else:
writer.writerow([
tweet['id_str'],
tweet['user']['screen_name'],
tweet['text']
])
if __name__ == '__main__':
main()
From Docs:
https://docs.python.org/2/howto/unicode.html
a = "string"
encodedstring = a.encode('utf-8')
If that does not work:
Python DictWriter writing UTF-8 encoded CSV files
I have had the same problem. I have a large amount of data from twitter firehose so every possible complication case (and has arisen)!
I've solved it as follows using try / except:
if the dict value is a string: if isinstance(value,basestring) I try to encode it straight away. If not a string, I make it a string and then encode it.
If this fails, it's because some joker is tweeting odd symbols to mess up my script. If that is the case, firstly I decode then re-encode value.decode('utf-8').encode('utf-8') for strings and decode, make into a string and re-encode for non-strings value.decode('utf-8').encode('utf-8')
Have a go with this:
import csv
def export_to_csv(list_of_tweet_dicts,export_name="flat_twitter_output.csv"):
utf8_flat_tweets=[]
keys = []
for tweet in list_of_tweet_dicts:
tmp_tweet = tweet
for key,value in tweet.iteritems():
if key not in keys: keys.append(key)
# convert fields to utf-8 if text
try:
if isinstance(value,basestring):
tmp_tweet[key] = value.encode('utf-8')
else:
tmp_tweet[key] = str(value).encode('utf-8')
except:
if isinstance(value,basestring):
tmp_tweet[key] = value.decode('utf-8').encode('utf-8')
else:
tmp_tweet[key] = str(value.decode('utf-8')).encode('utf-8')
utf8_flat_tweets.append(tmp_tweet)
del tmp_tweet
list_of_tweet_dicts = utf8_flat_tweets
del utf8_flat_tweets
with open(export_name, 'w') as f:
dict_writer = csv.DictWriter(f, fieldnames=keys,quoting=csv.QUOTE_ALL)
dict_writer.writeheader()
dict_writer.writerows(list_of_tweet_dicts)
print "exported tweets to '"+export_name+"'"
return list_of_tweet_dicts
hope that helps you.