Encoding UTF-8 when writing to CSV - python

I have some simple code to ingest some JSON Twitter data and output some specific fields into separate columns of a CSV file. My problem is that I cannot for the life of me figure out the proper way to encode the output as UTF-8. Below is the closest I've been able to get, with the help of a member here, but it still isn't running correctly and fails because of the unique characters in the tweet text field.
import json
import sys
import csv
import codecs

def main():
    writer = csv.writer(codecs.getwriter("utf-8")(sys.stdout), delimiter="\t")
    for line in sys.stdin:
        line = line.strip()
        data = []
        try:
            data.append(json.loads(line))
        except ValueError as detail:
            continue
        for tweet in data:
            ## deletes any rate limited data
            if tweet.has_key('limit'):
                pass
            else:
                writer.writerow([
                    tweet['id_str'],
                    tweet['user']['screen_name'],
                    tweet['text']
                ])

if __name__ == '__main__':
    main()

From Docs:
https://docs.python.org/2/howto/unicode.html
a = "string"
encodedstring = a.encode('utf-8')
If that does not work:
Python DictWriter writing UTF-8 encoded CSV files
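Applied to the loop in the question, a minimal sketch (Python 2; the enc helper name is my own) would encode each field just before writing it, since json.loads returns unicode strings:

def enc(value):
    # Encode unicode to UTF-8 bytes; leave byte strings and other types alone.
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

writer.writerow([
    enc(tweet['id_str']),
    enc(tweet['user']['screen_name']),
    enc(tweet['text'])
])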

I have had the same problem. I have a large amount of data from the Twitter firehose, so every possible complication case has arisen!
I've solved it as follows using try / except:
If the dict value is a string (isinstance(value, basestring)), I try to encode it straight away. If it is not a string, I make it a string and then encode it.
If this fails, it's because some joker is tweeting odd symbols to mess up my script. In that case, I first decode and then re-encode: value.decode('utf-8').encode('utf-8') for strings, and decode, make into a string, and re-encode for non-strings: str(value.decode('utf-8')).encode('utf-8')
Have a go with this:
import csv

def export_to_csv(list_of_tweet_dicts, export_name="flat_twitter_output.csv"):
    utf8_flat_tweets = []
    keys = []
    for tweet in list_of_tweet_dicts:
        tmp_tweet = tweet
        for key, value in tweet.iteritems():
            if key not in keys: keys.append(key)
            # convert fields to utf-8 if text
            try:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.encode('utf-8')
                else:
                    tmp_tweet[key] = str(value).encode('utf-8')
            except:
                if isinstance(value, basestring):
                    tmp_tweet[key] = value.decode('utf-8').encode('utf-8')
                else:
                    tmp_tweet[key] = str(value.decode('utf-8')).encode('utf-8')
        utf8_flat_tweets.append(tmp_tweet)
        del tmp_tweet
    list_of_tweet_dicts = utf8_flat_tweets
    del utf8_flat_tweets
    with open(export_name, 'w') as f:
        dict_writer = csv.DictWriter(f, fieldnames=keys, quoting=csv.QUOTE_ALL)
        dict_writer.writeheader()
        dict_writer.writerows(list_of_tweet_dicts)
    print "exported tweets to '" + export_name + "'"
    return list_of_tweet_dicts
hope that helps you.
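For example (hypothetical input; these field names are made up for illustration):

tweets = [
    {"id_str": "1", "screen_name": "alice", "text": u"caf\xe9 tweet"},
    {"id_str": "2", "screen_name": "bob", "text": "plain ascii"},
]
export_to_csv(tweets, export_name="test_output.csv")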

Related

Tweepy, how to pull count integer and use it

I trust all is well with everyone here. My apologies if this has been answered before, but I am trying to do the following.
cursor = tweepy.Cursor(
    api.search_tweets,
    q = '"Hello"',
    lang = 'en',
    result_type = 'recent',
    count = 2
)
I want to match the number in count to the number of JSON objects I will be iterating through.
for tweet in cursor.items():
    tweet_payload = json.dumps(tweet._json, indent=4, sort_keys=True)
I have tried several different ways to write the data, but it would appear that the following does not work (currently it only ever saves a single tweet):
with open("Tweet_Payload.json", "w") as outfile:
outfile.write(tweet_payload)
time.sleep(.25)
outfile.close()
This is what it looks like put together.
import time
import tweepy
from tweepy import cursor
import Auth_Codes
import json

twitter_auth_keys = {
    "consumer_key": Auth_Codes.consumer_key,
    "consumer_secret": Auth_Codes.consumer_secret,
    "access_token": Auth_Codes.access_token,
    "access_token_secret": Auth_Codes.access_token_secret
}

auth = tweepy.OAuthHandler(
    twitter_auth_keys["consumer_key"],
    twitter_auth_keys["consumer_secret"]
)
auth.set_access_token(
    twitter_auth_keys["access_token"],
    twitter_auth_keys["access_token_secret"]
)
api = tweepy.API(auth)

cursor = tweepy.Cursor(
    api.search_tweets,
    q = '"Hello"',
    lang = 'en',
    result_type = 'recent',
    count = 2
)

for tweet in cursor.items():
    tweet_payload = json.dumps(tweet._json, indent=4, sort_keys=True)
    with open("Tweet_Payload.json", "w") as outfile:
        outfile.write(tweet_payload)
    time.sleep(.25)
    outfile.close()
Edit:
Using the suggestion by Mickael, this is also the current code:
tweet_payload = []
for tweet in cursor.items():
    tweet_payload.append(tweet._json)
    print(json.dumps(tweet_payload, indent=4, sort_keys=True))
    with open("Tweet_Payload.json", "w") as outfile:
        outfile.write(json.dumps(tweet_payload, indent=4, sort_keys=True))
    time.sleep(.25)
It just loops, and I am not sure why that's the case when the count is 10. I thought it would make just one call for 10 results or fewer, then end.
Opening the file in write mode erases its previous data, so if you want to add each new tweet to the file, you should use append mode instead.
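A minimal sketch of that idea (the .jsonl file name and one-object-per-line layout are my own choices, so each appended object stays parseable line by line):

for tweet in cursor.items():
    # "a" appends instead of overwriting on each iteration
    with open("Tweet_Payload.jsonl", "a") as outfile:
        outfile.write(json.dumps(tweet._json, sort_keys=True) + "\n")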
As an alternative, you could also store all the tweets' JSON in a list and write them all at once. That should be more efficient, and the list at the root of your JSON file will make it valid.
json_tweets = []
for tweet in cursor.items():
    json_tweets.append(tweet._json)

with open("Tweet_Payload.json", "w") as outfile:
    outfile.write(json.dumps(json_tweets, indent=4, sort_keys=True))
On a side note, the with statement closes the file automatically; you don't need to call close() yourself.
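As for the looping: count only sets the page size per API request, not a total. To cap the total number of tweets, pass a limit to items(); this is standard Tweepy Cursor behavior:

# Stop after 10 tweets total, regardless of the per-request page size.
for tweet in cursor.items(10):
    json_tweets.append(tweet._json)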

JSON formatting adding \ characters when I append file, but not to string in output

I am using the following function to get json from the flickr API. The string it returns is a properly formatted chunk of JSON:
def get_photo_data(photo_id):
    para = {}
    para["photo_id"] = photo_id
    para["method"] = "flickr.photos.getInfo"
    para["format"] = "json"
    para["api_key"] = FLICKR_KEY
    request_data = params_unique_combination("https://api.flickr.com/services/rest/", para)
    if request_data in CACHE_DICTION:
        return CACHE_DICTION[request_data]
    else:
        response = requests.get("https://api.flickr.com/services/rest/", para)
        CACHE_DICTION[request_data] = response.text[14:-1]
        cache_file = open(CACHE_FNAME, 'w')
        cache_file.write(json.dumps(CACHE_DICTION))
        cache_file.close()
        return response.text[14:-1]
The issue I am having is that when I go to write the json to my cache file it keeps adding in backslashes, like this example:
"https://api.flickr.com/services/rest/format-json_method-flickr.photos.getInfo_photo_id-34869402493": "{\"photo\":{\"id\":\"34869402493\",\"secret\":\"56fcf0342c\",\"server\":\"4057\",\"farm\":5,\"dateuploaded\":\"1499030213\",\"isfavorite\":0,\"license\":\"0\",\"safety_level\":\"0\",\"rotation\":0,\"originalsecret\":\"c4d1d316ed\",\"originalformat\":\"jpg\",\"owner\":{\"nsid\":\"150544082#N05\",\"username\":\"ankitrana_\",\"realname\":\"Ankit Rana\",\"location\":\"Cincinnati, USA\",\"iconserver\":\"4236\",\"iconfarm\":5,\"path_alias\":\"ankitrana_\"},\"title\":{\"_content\":\"7\"},\"description\":{\"_content\":\"\"},\"visibility\":{\"ispublic\":1,\"isfriend\":0,\"isfamily\":0},\"dates\":{\"posted\":\"1499030213\",\"taken\":\"2017-06-19 13:43:38\",\"takengranularity\":\"0\",\"takenunknown\":\"0\",\"lastupdate\":\"1499041020\"},\"views\":\"41\",\"editability\":{\"cancomment\":0,\"canaddmeta\":0},\"publiceditability\":{\"cancomment\":1,\"canaddmeta\":0},\"usage\":{\"candownload\":1,\"canblog\":0,\"canprint\":0,\"canshare\":1},\"comments\":{\"_content\":\"0\"},\"notes\":{\"note\":[]},\"people\":{\"haspeople\":0},\"tags\":{\"tag\":[{\"id\":\"150538742-34869402493-5630\",\"author\":\"150544082#N05\",\"authorname\":\"ankitrana_\",\"raw\":\"cincinnati\",\"_content\":\"cincinnati\",\"machine_tag\":0},{\"id\":\"150538742-34869402493-226\",\"author\":\"150544082#N05\",\"authorname\":\"ankitrana_\",\"raw\":\"ohio\",\"_content\":\"ohio\",\"machine_tag\":false},
... etc., etc.}
How can I store the JSON to the existing file without these additional \ characters, as it is represented when I print the string?
Use your_string.decode('string_escape') to unescape \" to ".
Update:
Your string is escaped because of json.dumps(): it converts the object to a string, and when you later read it back with json.loads(), the result is unescaped.
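A quick round-trip demonstration of that point (Python 2):

import json
s = '{"photo": {"id": "123"}}'          # a JSON document held as a plain string
dumped = json.dumps({"k": s})           # embedding it escapes the inner quotes
print dumped                            # {"k": "{\"photo\": {\"id\": \"123\"}}"}
print json.loads(dumped)["k"] == s      # True: loading unescapes them again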
You can save it without backslashes using str():
cache_file.write(str(CACHE_DICTION))
# {'myparam' :'"162000","photo":...'
But the problem is that this saves the file with single quotes, which is not valid JSON and not compatible with json.loads().
My suggestion is to keep your code as above, unless you just want to store the raw response to the file CACHE_FNAME:
cache_file = open(CACHE_FNAME, 'w')
cache_file.write(response.text)
cache_file.close()
# {"photos":{"page":1,"pages":6478,..}
You could try replacing the "\" with the str.replace function in Python.
Add the code after the following line:
cache_file = open(CACHE_FNAME, 'w')
json_item = str(json.dumps(CACHE_DICTION))
json_item = json_item.replace("\\", "")  # str.replace returns a new string, so reassign
and change this line:
cache_file.write(json.dumps(CACHE_DICTION))
to:
cache_file.write(json_item)
Let me know if this works for you.
Just replace \ with an empty string.
I did the same thing while I was working with JSON.
json_new = json.replace('\\', '')

Python Encoding Issue with JSON and CSV

I am having an encoding issue when I run my script below:
Here is the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 9: ordinal not in range(128)
Here is my script:
import logging
import urllib
import csv
import json
import io
import codecs
with open('/home/local/apple.csv', 'rb') as csvinput:
    reader = csv.reader(csvinput, delimiter=',')
    firstline = True
    for row in reader:
        if firstline:
            firstline = False
            continue
        address1 = row[0]
        print row[0]
        locality = row[1]
        admin_area = row[2]
        query = ' '.join(str(x) for x in (address1, locality, admin_area))
        normalized = query.replace(" ", "+")
        BaseURL = 'http://localhost:8080/verify?country=JP&freeform='
        URL = BaseURL + normalized
        print URL
        data = urllib.urlopen(URL)
        response = data.getcode()
        print response
        if response == 200:
            file = json.load(data)
            print file
            output_f = open('output.csv', 'wb')
            csvwriter = csv.writer(output_f)
            count = 0
            for f in file:
                if count == 0:
                    header = f.keys()
                    csvwriter.writerow(header)
                    count += 1
                csvwriter.writerow(f.values())
            output_f.close()
        else:
            print 'error'
Can anyone help me fix this? It's getting really annoying. I need to encode to UTF-8.
Looks like you are using Python 2.x. Instead of Python's standard open, use codecs.open, which lets you pass an encoding to use and say what to do when there are errors. This gets a little less confusing in Python 3, where the standard open can do this.
So in your two lines where you are opening, do:
with codecs.open('/home/local/apple.csv', 'rb', 'utf-8') as csvinput:
output_f = codecs.open('output.csv', 'wb', 'utf-8')
The optional errors parameter defaults to "strict", which raises an exception if the bytes can't be mapped to the given encoding. In some contexts you may want to use 'ignore' or 'replace'.
See the python doc for a bit more info.
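One caveat: in Python 2 the csv module itself does not handle unicode input, so the workaround recommended by the Python 2 csv docs is to encode each value to UTF-8 bytes before writing. A sketch against the loop in the question (the to_utf8 helper name is my own):

# Python 2: csv.writer wants byte strings, so encode unicode values first.
def to_utf8(value):
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

csvwriter.writerow([to_utf8(v) for v in f.values()])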

KeyError: u'somestring' Json

I am trying to make a point system for my Twitch bot and I am encountering KeyErrors when trying to make a new entry for some odd reason. Here is my code:
import urllib2, json

def updateUsers(chan):
    j = urllib2.urlopen('http://tmi.twitch.tv/group/user/' + chan + '/chatters')
    j_obj = json.load(j)
    with open('dat.dat', 'r') as data_file:
        data = json.load(data_file)
    for usr in j_obj['chatters']['viewers']:
        data[usr]['Points'] = "0"  # Where the KeyError: u'someguysusername' occurs
    with open('dat.dat', 'w') as out_file:
        json.dump(data, out_file)

updateUsers('tryhard_clan')
If you want to see the Json itself go to http://tmi.twitch.tv/group/user/tryhard_clan/chatters
I'm storing user data in a file in this format:
{"users": {"cupcake": {"Points": "0"}}}
A slightly more concise form than @Raunak suggested:
data.setdefault(usr, {})['Points'] = "0"
That will set data[usr] to an empty dict if it's not already there, and set the 'Points' element in either case.
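A quick illustration of the setdefault behavior (Python 2):

data = {}
data.setdefault("cupcake", {})["Points"] = "0"   # key missing: inserts {} first
data.setdefault("cupcake", {})["Points"] = "5"   # key present: reuses existing dict
print data                                       # {'cupcake': {'Points': '5'}}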
What happens is that the variable usr doesn't resolve to an existing key in data. Do this instead:
if usr not in data:
    data[usr] = {}
data[usr]['Points'] = "0"

How to create an index using Whoosh

I am trying to use Whoosh for text searching for the first time. I want to search for documents containing the word "XML". But because I am new to Whoosh, I just wrote a program that searches for a word in a document, where the document is a text file (myRoko.txt).
import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *
if not os.path.exists("indexdir3"):
os.mkdir("indexdir3")
schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()
path = "myRoko.txt"
with open(path, "r") as f:
content = f.read()
f.close()
writer.add_document(name=path, content= content)
writer.commit()
ix = open_dir("indexdir3")
query_b = QueryParser('content', ix.schema).parse('XML')
with ix.searcher() as srch:
res_b = srch.search(query_b)
print res_b[0]
The above code is supposed to print the document that contains the word "XML". However, the code returns the following error:
raise ValueError("%r is not unicode or sequence" % value)
ValueError: 'A large number of documents are now represented and stored
as XML document on the web. Thus ................
What could be the cause of this error?
You have a Unicode problem. You should pass unicode strings to the indexer. For that, you need to open the text file as unicode:
import codecs
with codecs.open(path, "r", "utf-8") as f:
    content = f.read()
and use a unicode string for the file name:
path = u"myRoko.txt"
After fixes I got this result:
<Hit {'name': u'myRoko.txt'}>
writer.add_document(name=unicode(path), content=unicode(content))
It has to be UNICODE
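Putting both answers together, a minimal sketch of the fixed indexing step (Python 2; Whoosh TEXT and ID fields expect unicode values, and the writer is the one from the question):

import codecs

path = u"myRoko.txt"                        # unicode file name
with codecs.open(path, "r", "utf-8") as f:
    content = f.read()                      # codecs.open returns unicode

writer.add_document(name=path, content=content)
writer.commit()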
