I'm trying to write my tweepy results to a CSV file in Arabic, but the output file always contains Unicode escape sequences, for example:
u0646 \u060c \u0627\u0644\u0645\u0646\u062a\u062e\u0628 \u0627\u0644\u0633\u0648\u0631\u064a \u0628\u064a\u0631\u062c\u0639 \u0639\u0628\u064a\u062a\u0648 \u0648\u0645\u0627 \u0628\u0634\u0645 \u0627\u0644\u0645\u0644\u062d\u0642 \u062d\u062a\u0649
My code:
def on_data(self, data):
    try:
        tweet = json.loads(data)['text']
        tweet = data.split(',"text":"')[1].split('","source')[0]
        savefile = str(time.time()) + "::" + tweet
        save = open('twitterDB4.csv', 'a')
        save.write(savefile)
        save.write("\n\n")
        save.close()
        return True
    except KeyError:
        pass
I am currently extracting trade data from a JSON file to create graphs. I have made a function to load in my JSON data, but I want to create a function that allows me to extract specific data points (like a getter). How would I go about this?
So the function is meant to store the data, but I'm not sure how to connect it back to my loaded JSON file.
This is the class I have so far:
class TradeInfo():
    def __init__(self, sym, vol, pChange, gcount):
        self.symbol = sym
        self.volume = vol
        self.pChange = pChange
        self.gcount = gcount

    def getSymbol(self):
        return self.symbol

    def getVolume(self):
        return self.volume

    def getPriceChange(self):
        return self.pChange

    def getCount(self):
        return self.gcount
Below is the output I receive when I load my JSON file in a separate function (posted as a screenshot, omitted here).
This is the code I use to load my JSON file:
def loadfile(infileName, biDir=True):
    try:
        filename = infileName
        with open(filename) as f:
            fileObj = json.load(f)
            fileObj = json.dumps(fileObj, indent=4)
    except IOError as e:
        print("Error in file processing: " + str(e))
    return fileObj
Let's say your JSON looks like this:
{
  "marketId": "LTC-AUD",
  "bestBid": "67.62",
  "bestAsk": "68.15",
  "lastPrice": "67.75",
  "volume24h": "190.19169781",
  "volumeQte24h": "12885.48752662",
  "price24h": "1.37",
  "pricePct24h": "2.06",
  "low24h": "65.89",
  "high24h": "69.48",
  "timestamp": "2020-10-10T11:14:19.270000Z"
}
So your loadfile function should look something like this:
import json

def load_file(infile_name) -> dict:
    try:
        with open(infile_name) as f:
            return json.load(f)
    except IOError as e:
        print("Error in file processing: " + str(e))

data = load_file("sample_json.json")
print(json.dumps(data, indent=2, sort_keys=True))
print(data['timestamp'])
Output:
{
  "bestAsk": "68.15",
  "bestBid": "67.62",
  "high24h": "69.48",
  "lastPrice": "67.75",
  "low24h": "65.89",
  "marketId": "LTC-AUD",
  "price24h": "1.37",
  "pricePct24h": "2.06",
  "timestamp": "2020-10-10T11:14:19.270000Z",
  "volume24h": "190.19169781",
  "volumeQte24h": "12885.48752662"
}
2020-10-10T11:14:19.270000Z
I've simplified your function and removed the redundant argument biDir, because you're not using it anywhere.
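To connect the loaded data back to your TradeInfo class (your original question), a minimal sketch could look like the following. The mapping of JSON keys to constructor arguments is an assumption based on the sample above, and the sample has no obvious count field, so 0 is used as a placeholder:
import json

def load_file(infile_name) -> dict:
    with open(infile_name) as f:
        return json.load(f)

# Hypothetical key-to-field mapping; adjust to whatever your real data uses.
data = load_file("sample_json.json")
trade = TradeInfo(
    sym=data["marketId"],
    vol=float(data["volume24h"]),
    pChange=float(data["pricePct24h"]),
    gcount=0,  # placeholder: the sample JSON has no count field
)
print(trade.getSymbol(), trade.getVolume(), trade.getPriceChange())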
I am trying to extract values from JSON-LD to CSV, exactly as they appear in the file. There are a couple of issues I am facing:
1. The values being read for different fields are getting truncated in most cases; in the remaining cases, the value of one field appears under another field.
2. I am also getting an error, 'Additional data', after some 4,000 lines.
The file is quite big (half a GB). I am attaching a shortened version of my code. Please tell me where I am going wrong.
The input file: I have shortened it and put it here, as there was no way of including it in the question.
https://github.com/Architsi/json-ld-issue
I tried writing this script myself, and I also tried multiple online converters.
import csv, sys, math, operator, re, os, json, ijson
from pprint import pprint

filelist = []
for file in os.listdir("."):
    if file.endswith(".json"):
        filelist.append(file)

for input in filelist:
    newCsv = []
    splitlist = input.split(".")
    output = splitlist[0] + '.csv'
    newFile = open(output, 'w', newline='')  # wb for windows, else you'll see newlines added to csv
    # initialize csv writer
    writer = csv.writer(newFile)
    # Name of the columns
    header_row = ('Format', 'Description', 'Object', 'DataProvider')
    writer.writerow(header_row)
    with open(input, encoding="utf8") as json_file:
        data = ijson.items(json_file, 'item')
        # passing all the values through try except
        for s in data:
            source = s['_source']
            try:
                source_resource = source['sourceResource']
            except:
                print("Warning: No source resource in record ID: " + id)
            try:
                data_provider = source['dataProvider'].encode()
            except:
                data_provider = "N/A"
            try:
                _object = source['object'].encode()
            except:
                _object = "N/A"
            try:
                descriptions = source_resource['description']
                string = ""
                for item in descriptions:
                    if len(descriptions) > 1:
                        description = item.encode()  # + " | "
                    else:
                        description = item.encode()
                    string = string + description
                description = string.encode()
            except:
                description = "N/A"
            created = ""
            # writing it to csv
            write_tuple = ('format', description, _object, data_provider)
            writer.writerow(write_tuple)
    print("File written to " + output)
    newFile.close()
The error that I am getting is this: raise common.JSONError('Additional data')
The expected result is a CSV file with all the columns and the correct values.
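A note on the 'Additional data' error: with ijson's pure-Python backend, that usually means the file contains more than one top-level JSON document back to back rather than a single array. Below is a minimal sketch of the extraction loop, assuming hypothetical input.json/output.csv names and that description is a list of strings as in your loop; newer ijson releases also accept a multiple_values=True flag for concatenated documents, so check your installed version:
import csv
import ijson

with open("input.json", encoding="utf8") as json_file, \
        open("output.csv", "w", newline="", encoding="utf8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(("Format", "Description", "Object", "DataProvider"))
    # iterate over the array elements under the 'item' prefix, one at a time
    for record in ijson.items(json_file, "item"):
        source = record.get("_source", {})
        source_resource = source.get("sourceResource", {})
        descriptions = source_resource.get("description", [])
        writer.writerow((
            source_resource.get("format", "N/A"),
            " | ".join(descriptions) if descriptions else "N/A",
            source.get("object", "N/A"),
            source.get("dataProvider", "N/A"),
        ))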
As the title says, I have tried hard to figure out how to save tweets using tweepy in Python 3.6. I found a solution that works for English, but not for Arabic. Does anyone have any ideas how?
The output I get in the CSV file for Arabic tweets looks like this:
1510123361.875904::\u0623\u0639\u0648\u0630 \u0628\u0643\u0644\u0645\u0627\u062a \u0627\u0644\u0644\u0647 \u0627\u0644/FMsjMi2nvF
Thank you in advance.
This is my code:
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
save = open('ExampleNumber4.csv', mode='w', encoding="utf8", newline=None)

class listener(StreamListener):
    def on_data(self, data):
        try:
            tweet = json.loads(data)['text']
            print(tweet.translate(non_bmp_map))
            tweet = data.split(',"text":"')[1].split('","source')[0]
            savefile = str(time.time()) + "::" + tweet
            save.write(savefile)
            save.write("\n\n")
            return True
        except KeyError:
            pass

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ConsumerKey, ConsumerSecret)
auth.set_access_token(AccessToken, AccessTokenSecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=[u'سيارة'])
save.close()
Here's a working solution. The key change is parsing the payload with json.loads instead of slicing the raw JSON string, so the \uXXXX escapes are decoded into real Arabic characters, and rows are written with the csv module. Next time, please try to provide a working example that reproduces the error in your question, by including some sample JSON data and skipping the Twitter code that we can't run as is.
# coding: utf8
import sys
import json
import time
import csv

data = r'{"text": "\u0633\u064a\u0627\u0631\u0629\ud83d\ude00"}'  # ASCII JSON
# data = '{"text": "سيارة😀"}'  # non-ASCII JSON

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

with open('ExampleNumber4.csv', mode='w', encoding="utf-8-sig", newline='') as save:
    writer = csv.writer(save)
    tweet = json.loads(data)['text']
    print(tweet.translate(non_bmp_map))
    savefile = [time.time(), tweet]
    writer.writerow(savefile)
Output:
1510208283.7488384,سيارة�
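To fold this back into the stream listener from the question, a sketch of on_data could look like this (assuming the pre-4.0 tweepy API you are already using, where StreamListener lives in tweepy.streaming):
import csv
import json
import sys
import time

from tweepy.streaming import StreamListener  # pre-4.0 tweepy layout, as in the question

non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)

class listener(StreamListener):
    def on_data(self, data):
        try:
            tweet = json.loads(data)['text']  # decode the JSON payload instead of slicing the raw string
            print(tweet.translate(non_bmp_map))
            # append one row per tweet; utf-8-sig keeps the Arabic readable in Excel
            with open('twitterDB4.csv', 'a', encoding='utf-8-sig', newline='') as save:
                csv.writer(save).writerow([time.time(), tweet])
            return True
        except KeyError:
            pass

    def on_error(self, status):
        print(status)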
I am not sure why I am getting the error mentioned.
Here is the code, which uses numpy's loadtxt:
import numpy as np

def load(name):
    print("start reading file with target")
    wfile = open(name, "r")
    line = wfile.readline().replace("\n", "")
    print(line)
    splits = line.split(",")
    print(splits)
    datalen = len(splits)
    print(datalen)
    wfile.close()
    X = np.loadtxt(open(name), delimiter=',', usecols=range(0, datalen), skiprows=0)
    print("done")
    return np.array(X)
Here is a sample of the CSV file's header row. Note: I am not listing all of it, as there are 501 items in the CSV file.
Id,asmSize,bytesSize,asmCompressionRate,bytesCompressionRate,ab_ratio,abc_ratio,ab2abc_ratio,sp_,...
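Since the traceback itself isn't shown, this is only a guess, but np.loadtxt commonly fails on a file like this because of the header row and the non-numeric Id column. A minimal sketch that skips both, assuming every remaining column is numeric:
import numpy as np

def load(name):
    # read the header once to count the columns, as the original function does
    with open(name, "r") as wfile:
        datalen = len(wfile.readline().strip().split(","))
    # assumption: skip the header row and the first (non-numeric) Id column
    return np.loadtxt(name, delimiter=",", skiprows=1, usecols=range(1, datalen))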
So I'm having trouble with decoding. I found in other threads how to do it for simple strings, with u'string'.encode, but I can't find a way to make it work with files.
Any help would be appreciated!
Here's the code.
text = file.read()
text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
file.seek(0) # rewind
file.write(text.encode('utf-8'))
And here's the whole code, in case it helps.
#!/usr/bin/env python
# coding: utf-8
"""
Script to help translate some of the code's methods from
Portuguese to English.
"""
from multiprocessing import Pool
from mock import MagicMock
from goslate import Goslate
import fnmatch
import logging
import os
import re
import urllib2

_MAX_PEERS = 1

try:
    os.remove('traducoes.log')
except OSError:
    pass

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
handler = logging.FileHandler('traducoes.log')
logger.addHandler(handler)


def fileWalker(ext, dirname, names):
    """
    Find the files with the correct extension
    """
    pat = "*" + ext[0]
    for f in names:
        if fnmatch.fnmatch(f, pat):
            ext[1].append(os.path.join(dirname, f))


def encontre_text(file):
    """
    Find in the string the words which have '_' in them
    """
    text = file.read().decode('utf-8')
    return re.findall(r"\w+(?<=_)\w+", text)
    # return re.findall(r"\"\w+\"", text)


def traduza_palavra(txt):
    """
    Translate the word/phrase to English
    """
    try:
        # try to connect to Google
        response = urllib2.urlopen('http://google.com', timeout=2)
        pass
    except urllib2.URLError as err:
        print "No network connection"
        exit(-1)
    if txt[0] != '_':
        txt = txt.replace('_', ' ')
    txt = txt.replace('media'.decode('utf-8'), 'média'.decode('utf-8'))
    gs = Goslate()
    # txt = gs.translate(txt, 'en', gs.detect(txt))
    txt = gs.translate(txt, 'en', 'pt-br')  # forcing Brazilian Portuguese as the source language
    txt = txt.replace(' en ', ' br ')
    return txt.replace(' ', '_')  # .lower()


def subistitua(file, txt, novo_txt):
    """
    should rewrite the file with the new text in the future
    """
    text = file.read()
    text.replace(txt.encode('utf-8'), novo_txt.encode('utf-8'))
    file.seek(0)  # rewind
    file.write(text.encode('utf-8'))


def magica(File):
    """
    Thread pool. Every single thread should play around here with
    one element from the list of files
    """
    global _DONE
    if _MAX_PEERS == 1:  # not feasible in multithread
        logger.info('\n---- File %s' % File)
        with open(File, "r+") as file:
            list_txt = encontre_text(file)
            for txt in list_txt:
                novo_txt = traduza_palavra(txt)
                if txt != novo_txt:
                    logger.info('%s -> %s [%s]' % (txt, novo_txt, File))
                    subistitua(file, txt, novo_txt)
            file.close()
    print File.ljust(70) + '[OK]'.rjust(5)


if __name__ == '__main__':
    try:
        response = urllib2.urlopen('http://www.google.com.br', timeout=1)
    except urllib2.URLError as err:
        print "No network connection"
        exit(-1)

    root = './app'
    ex = ".py"
    files = []
    os.path.walk(root, fileWalker, [ex, files])
    print '%d files found to be translated' % len(files)

    try:
        if _MAX_PEERS > 1:
            _pool = Pool(processes=_MAX_PEERS)
            result = _pool.map_async(magica, files)
            result.wait()
        else:
            result = MagicMock()
            result.successful.return_value = False
            for f in files:
                pass
                magica(f)
            result.successful.return_value = True
    except AssertionError, e:
        print e
    else:
        pass
    finally:
        if result.successful():
            print 'Translated all files'
        else:
            print 'Some files were not translated'
Thank you all for the help!
In Python 2, reading from files produces regular (byte) string objects, not unicode objects. There is no need to call .encode() on these; in fact, that'll only trigger an automatic decode to Unicode first, which can fail.
Rule of thumb: use a unicode sandwich. Whenever you read data, you decode to unicode at that stage. Use unicode values throughout your code. Whenever you write data, encode at that point. You can use io.open() to open file objects that encode and decode automatically for you.
That also means you can use unicode literals everywhere: for your regular expressions and for your string literals. So use:
def encontre_text(file):
    text = file.read()  # assume `io.open()` was used
    return re.findall(ur"\w+(?<=_)\w+", text)  # use a unicode pattern
and
def subistitua(file, txt, novo_txt):
    text = file.read()  # assume `io.open()` was used
    text = text.replace(txt, novo_txt)
    file.seek(0)  # rewind
    file.write(text)
as all string values in the program are already unicode, and
txt = txt.replace(u'media', u'média')
as u'..' unicode string literals don't need decoding anymore.
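Putting it together, a minimal Python 2 sketch of that unicode sandwich with io.open() might look like this; example.py is just a hypothetical file name:
# -*- coding: utf-8 -*-
import io

# io.open() decodes on read and encodes on write, so the code in between
# deals only in unicode strings.
with io.open('example.py', 'r+', encoding='utf-8') as f:
    text = f.read()                      # unicode in
    text = text.replace(u'media', u'média')
    f.seek(0)                            # rewind
    f.truncate()                         # drop any leftover tail if the new text is shorter
    f.write(text)                        # unicode out, encoded to UTF-8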