I am trying to take a Twitter stream, save it to a file, and then analyze the contents. However, I am having an issue with files generated by the program as opposed to those created via the CLI.
Twitter Analysis program:
import json
import pandas as pd
import matplotlib.pyplot as plt

tweets_data = []
tweets_file = open("test.txt", "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

tweets = pd.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
However, with the last line I keep getting "KeyError: 'text'", which I understand means it can't find the key.
When I first run the Twitter search program, if I redirect the output to a file from the CLI, it works fine with no issues. If I save the output to a file from inside the program, it gives me the error.
Twitter Search program:
class Output(StreamListener):
    def on_data(self, data):
        with open("test.txt", "a") as tf:
            tf.write(data)

    def on_error(self, status):
        print status

L = Output()
auth = OAuthHandler(consKey, consSecret)
auth.set_access_token(Tok1, Tok2)
stream = Stream(auth, L)
stream.filter(track=['cyber'])
If I run the above as is, analyzing test.txt gives me the error. But if I remove the file-writing line and instead run the program as:
python TwitterSearch.py > test.txt
then test.txt runs through the analysis program with no problem.
I have tried changing the file handling from append to write, which was of no help.
I also added the line:

print tweet['text']
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
This worked, showing that the program can see a value for the text key. I also compared the output file from the program with the one from the CLI and could not see any difference. Please help me understand and resolve the problem.
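A minimal defensive variant of the analysis loop, assuming the file written from inside on_data also captures stream messages (limit or deletion notices) that parse as valid JSON but carry no 'text' key; only entries with a 'text' field are kept, which sidesteps the KeyError whatever the exact cause:

import json
import pandas as pd

tweets_data = []
with open("test.txt", "r") as tweets_file:
    for line in tweets_file:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines in the stream
        try:
            parsed = json.loads(line)
        except ValueError:
            continue  # skip partial or malformed lines
        if "text" in parsed:  # drop limit/deletion notices
            tweets_data.append(parsed)

tweets = pd.DataFrame()
tweets['text'] = [tweet['text'] for tweet in tweets_data]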
Related
I've built a script that runs two parallel processes over the same text files, each one saving its results into a new text file with a proper file name. One of these processes works fine, while the other raises the error message from the question title.
I thought it was because I saved the results in a buffer and then wrote it to the file all at once, so I changed this part of the code so that each new result line is written to the file immediately; nonetheless, the error message still appears, and I can't get the results saved to file.
I'm now testing a version of the script with the processes unparallelized, but how could I solve this problem while keeping the processes parallelized?
Here's the sample code:
from concurrent.futures import ProcessPoolExecutor, as_completed

def process():
    counter_object_results = make_data_processing()
    txt_file_name = f'file_name.txt'
    with open(txt_file_name, 'a') as txt_file:
        for key, count in counter_object_results.items():
            txt_file_content = f'{key}\t{count}\n'
            txt_file.write(txt_file_content)

def process_2():
    counter_object_results = make_data_processing()
    txt_file_name = f'file_name.txt'
    with open(txt_file_name, 'a') as txt_file:
        for key, count in counter_object_results.items():
            txt_file_content = f'{key}\t{count}\n'
            txt_file.write(txt_file_content)

with ProcessPoolExecutor() as executor:
    worker_a = executor.submit(process)
    worker_b = executor.submit(process_2)
    futures = [worker_a, worker_b]
    for worker in as_completed(futures):
        resp = worker.result()
        print(f'Results saved on {resp}')
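A sketch of one way to keep the two processes parallel while avoiding both workers appending to the same file; make_data_processing is the question's own (undefined here) helper, and the distinct file names, the 'w' mode, the returned path, and the __main__ guard are my additions:

from concurrent.futures import ProcessPoolExecutor, as_completed

def save_counts(txt_file_name):
    counter_object_results = make_data_processing()
    # Each worker writes to its own file, so the two
    # processes never contend for the same file.
    with open(txt_file_name, 'w') as txt_file:
        for key, count in counter_object_results.items():
            txt_file.write(f'{key}\t{count}\n')
    # Return the path so the parent process can report it.
    return txt_file_name

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        futures = [
            executor.submit(save_counts, 'results_a.txt'),
            executor.submit(save_counts, 'results_b.txt'),
        ]
        for worker in as_completed(futures):
            print(f'Results saved on {worker.result()}')

The __main__ guard matters with ProcessPoolExecutor: worker processes re-import the module, and unguarded top-level submissions would recurse.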
I'm building a Python program to parse some calls to a social media API into CSV and I'm running into an issue with a key that has two keys above it in the hierarchy. I get this error when I run the code with PyDev in Eclipse.
Traceback (most recent call last):
line 413, in <module>
main()
line 390, in main
postAgeDemos(monitorID)
line 171, in postAgeDemos
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
KeyError: 'ZERO_TO_SEVENTEEN'
Here's the section of the code I'm using for it. I have a few other functions built already that work with two layers of keys.
import urllib.request
import json

def postAgeDemos(monitorID):
    print("Enter the date you'd like the data to start on")
    startDate = input('The date must be in the format YYYY-MM-DD. ')
    print("Enter the date you'd like the data to end on")
    endDate = input('The date must be in the format YYYY-MM-DD. ')
    dates = "&start="+startDate+"&end="+endDate
    urlStart = getURL()
    authToken = getAuthToken()
    endpoint = "/monitor/demographics/age?id="
    urlData = urlStart+endpoint+monitorID+authToken+dates
    webURL = urllib.request.urlopen(urlData)
    fPath = getFilePath()+"AgeDemographics"+startDate+"&"+endDate+".csv"
    print("Connecting...")
    if (webURL.getcode() == 200):
        print("Connected to "+urlData)
        print("This query returns information in a CSV file.")
        csvFile = open(fPath, "w+")
        csvFile.write("postDate,totalPosts,totalPostsWithIdentifiableAge,0-17,18-24,25-34,35+\n")
        data = webURL.read().decode('utf8')
        theJSON = json.loads(data)
        for i in theJSON["ageCounts"]:
            postDate = i["startDate"]
            totalDocs = str(i["numberOfDocuments"])
            totalAged = str(i["ageCount"]["totalAgeCount"])
            age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
            age18To24 = str(i["ageCount"]["sortedAgeCounts"]["EIGHTEEN_TO_TWENTYFOUR"])
            age25To34 = str(i["ageCount"]["sortedAgeCounts"]["TWENTYFIVE_TO_THIRTYFOUR"])
            age35Over = str(i["ageCount"]["sortedAgeCounts"]["THIRTYFIVE_AND_OVER"])
            csvFile.write(postDate+","+totalDocs+","+totalAged+","+age0To17+","+age18To24+","+age25To34+","+age35Over+"\n")
        print("File printed to "+fPath)
        csvFile.close()
    else:
        print("Server Error, No Data" + str(webURL.getcode()))
Here's a sample of the JSON I'm trying to parse.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
Here it is again with line breaks so it's a little easier to read.
{"ageCounts":[{"startDate":"2016-01-01T00:00:00","endDate":"2016-01-02T00:00:00","numberOfDocuments":520813,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3245,"EIGHTEEN_TO_TWENTYFOUR":4289,"TWENTYFIVE_TO_THIRTYFOUR":2318,"THIRTYFIVE_AND_OVER":70249},"totalAgeCount":80101}},
{"startDate":"2016-01-02T00:00:00","endDate":"2016-01-03T00:00:00","numberOfDocuments":633709,"ageCount":
{"sortedAgeCounts":{"ZERO_TO_SEVENTEEN":3560,"EIGHTEEN_TO_TWENTYFOUR":1702,"TWENTYFIVE_TO_THIRTYFOUR":2786,"THIRTYFIVE_AND_OVER":119657},"totalAgeCount":127705}}],"status":"success"}
I've tried removing the ["sortedAgeCounts"] from the middle of
age0To17 = str(i["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])
but I still get the same error. I've removed the 0-17 section to test the other age ranges and I get the same error for them as well. I've tried removing all the underscores from the JSON and then using keys without the underscores.
I've also tried moving the str() conversion from the call to where the output is printed, but the error persists.
Any ideas? Is this section not actually a JSON key, is it maybe a problem with the all caps, or am I just doing something dumb? Any other code improvements are welcome as well, but I'm stuck on this one.
Let me know if you need to see anything else. Thanks in advance for your help.
Edit (this works):

JSON = json.loads(s)
for i in JSON:
    print str(JSON[i][0]["ageCount"]["sortedAgeCounts"]["ZERO_TO_SEVENTEEN"])

Here, s is a string containing your JSON.
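For robustness, a small sketch of defensive access for the nested keys, assuming some entries might omit an age bucket; the .get() defaults to 0 are my addition:

for i in theJSON["ageCounts"]:
    sorted_counts = i.get("ageCount", {}).get("sortedAgeCounts", {})
    # Default to 0 when a bucket is missing instead of raising KeyError.
    age0To17 = str(sorted_counts.get("ZERO_TO_SEVENTEEN", 0))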
I've got a function (see below) that fetches data from Google Analytics to my computer.
I would like to store the result in a CSV file but I don't know how; please help me out.
I can print the result on screen, but I can't save it.
def print_top_pages(service, site, start_date, end_date, max_results=10000):
    """Print out top X pages for a given site."""
    query = service.data().ga().get(
        ids=site,
        start_date=start_date,
        end_date=end_date,
        dimensions='ga:dimension20,ga:date',
        metrics='ga:sessions,ga:sessionDuration',
        # sort='-ga:pageviews',
        # filters='ga:hostname!=localhost',
        samplingLevel='HIGHER_PRECISION',
        max_results=max_results).execute()
    return print_data_table(query)
Replace the return with this:

with open("output.txt", "w") as out:
    out.write(print_data_table(query))

You should get the same printed output in a file named output.txt.
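Since the goal is a CSV file, a sketch using the csv module on the raw response may be closer to what's wanted; save_top_pages_csv and the output file name are my own, and I'm assuming the response follows the Core Reporting shape with 'columnHeaders' and 'rows':

import csv

def save_top_pages_csv(query, csv_path="top_pages.csv"):
    # Header names mirror the query's dimensions and metrics.
    headers = [h["name"] for h in query.get("columnHeaders", [])]
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(headers)
        # Each row is a list of string values in header order.
        writer.writerows(query.get("rows", []))
    return csv_path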
So I have a simple reddit bot set up which I wrote using the praw framework. The code is as follows:
import praw
import time
import numpy
import pickle

r = praw.Reddit(user_agent="Gets the Daily General Thread from subreddit.")
print("Logging in...")
r.login()

words_to_match = ['sdfghm']
cache = []

def run_bot():
    print("Grabbing subreddit...")
    subreddit = r.get_subreddit("test")
    print("Grabbing thread titles...")
    threads = subreddit.get_hot(limit=10)
    for submission in threads:
        thread_title = submission.title.lower()
        isMatch = any(string in thread_title for string in words_to_match)
        if submission.id not in cache and isMatch:
            print("Match found! Thread ID is " + submission.id)
            r.send_message('FlameDraBot', 'DGT has been posted!', 'You are awesome!')
            print("Message sent!")
            cache.append(submission.id)
    print("Comment loop finished. Restarting...")

# Run the script
while True:
    run_bot()
    time.sleep(20)
I want to create a file (a text file, XML, or something else) through which the user can change the fields for the various pieces of information being queried. For example, I want a file with lines such as:
Words to Search for = sdfghm
Subreddit to Search in = text
Send message to = FlameDraBot
I want the info to be read from these fields, so that the script takes the value after Words to Search for = instead of the whole line. After the information has been entered into the file and saved, I want my script to pull it from the file, store it in variables, and use those variables in the appropriate functions, such as:
words_to_match = ['sdfghm']
subreddit = r.get_subreddit("test")
r.send_message('FlameDraBot'....
So basically like a config file for the script. How do I go about making it so that my script can take input from a .txt or another appropriate file and use it in my code?
Yes, that's just a plain old Python config, which you can implement in an ASCII file, or else YAML or JSON.
Create a subdirectory ./config, put your settings in ./config/__init__.py
Then import config.
Using PEP 8 compliant names, the file ./config/__init__.py would look like:
search_string = ['sdfghm']
subreddit_to_search = 'text'
notify = ['FlameDraBot']
If you want more complicated config, just read the many other posts on that.
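If you'd rather keep the settings in a plain text file like the one sketched in the question, the standard library's configparser (ConfigParser in Python 2) handles it; the settings.ini layout and the names below are my own choices:

# settings.ini:
# [bot]
# words_to_match = sdfghm, dgt
# subreddit = test
# notify = FlameDraBot

import configparser

config = configparser.ConfigParser()
config.read("settings.ini")

# Split the comma-separated keywords into a list.
words_to_match = [w.strip() for w in config["bot"]["words_to_match"].split(",")]
subreddit_name = config["bot"]["subreddit"]
recipient = config["bot"]["notify"]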
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
from pprint import pprint

# JSON file with all the ckey, csecret, atoken, and asecret
data_file = open('twitter.json')
data = json.load(data_file)
pprint(data)

# consumer key, consumer secret, access token, access secret.
ckey = data["ckey"]
csecret = data["csecret"]
atoken = data["atoken"]
asecret = data["asecret"]

class listener(StreamListener):
    def on_data(self, data):
        all_data = json.loads(data)
        tweet = all_data["text"]
        username = all_data["user"]["screen_name"]
        print((username, tweet))
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
The code above is all standard for accessing the Twitter API. However, I need to write the tweets obtained from Twitter to a .txt file. I tried using the code below:
twitterStream = Stream(auth, listener())
fid = open("cats based tweets.txt", "w")
for tweet in twitterStream.filter(track=[cats]):
    fid.write(tweet)
fid.close()
I intend to find all tweets/retweets that include the keyword cats, which it does. However, it is supposed to also write a .txt file that includes all the tweets, but it doesn't. Can anyone tell me what I need to do to fix it?
EDIT: I used the code that you have written, but it doesn't return all of the tweets. It prints out five or six, then the error
RuntimeError: No active exception to reraise
appears, and I have no idea why. Why does this occur? I know it shouldn't.
I've done this in a project and my method involves changing the on_data method within the StreamListener object.
My code looks like this:
class Listener(StreamListener):
    def __init__(self, api=None, path=None):
        # I don't remember exactly why I defined this.
        self.api = api
        # We'll need this later.
        self.path = path

    def on_data(self, data):
        all_data = json.loads(data)
        tweet = all_data["text"]
        username = all_data["user"]["screen_name"]
        print((username, tweet))
        # Open, write and close your file.
        savefile = open(self.path, 'ab')
        savefile.write(tweet)
        savefile.close()
        return True
A few things in the actual code, not where you redefined Listener or on_data. In order:
Define the file where you want to save. Let's call that variable file_path. Don't forget to add the .txt extension here.
Call the Stream and the Listener:
twitterStream = Stream(authorization, Listener(path=file_path))
Use your filters. Mine are coordinates, and I put the filter in a try/except so that my code doesn't stop. Here it is adapted for you:
try:
    twitterStream.filter(track=[cats])
except Exception, e:
    print 'Failed filter() with this error:', str(e)
Now the text of the tweet should be written to the file whenever a tweet appears in the stream. Take a look at your file size and you should see it increase, particularly if your filter is about cats. The Internet loves cats.
I guess there is a slight indentation error in the snippet you provided. However, I will try to fix your error with two approaches: the first corrects the indentation, and the second changes your on_data method.
Approach 1:
fid = open("cats based tweets.txt", "w")
for tweet in twitterStream.filter(track=[cats]):
    fid.write(tweet + "\n")
fid.close()
Or you could simply write the above code as :
with open("cats based tweets.txt", "w") as fid:
    for tweet in twitterStream.filter(track=[cats]):
        fid.write(tweet + "\n")
Approach 2:
In the second approach we change the on_data method so that when the program receives a new tweet, it opens the file and writes to it directly. For this we need to open the file in append mode, since opening it in w (write) mode would overwrite its contents again and again.
def on_data(self, data):
    all_data = json.loads(data)
    tweet = all_data["text"]
    username = all_data["user"]["screen_name"]
    print((username, tweet))
    with open("cats based tweets.txt", "a") as fid:
        fid.write(tweet + "\n")
    return True
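For completeness, a minimal sketch of wiring this up, with the keyword quoted as a string; note that track=[cats] in the earlier snippets references an undefined name and would raise a NameError:

twitterStream = Stream(auth, listener())
# The track list must contain strings, e.g. "cats", not a bare name.
twitterStream.filter(track=["cats"])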
See the link below to learn how to save the tweets to a database as well as to a local file:
https://github.com/anandstarz/Scrapee/blob/master/tweets