I am trying to get the below code to exclude any tweets that include restricted words from a list. What is the best way to do this?
This code is also returning only the final tweet once I break out of the stream. Is there a way to write all applicable tweets to the CSV?
import sys
import tweepy
import csv

# Pass security information to variables
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

# Use variables to access Twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# Create a listener class called 'CustomStreamListener'
class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.author.screen_name, status.created_at, status.text)
        # Writing status data
        with open('OutputStreaming.csv', 'w') as f:
            writer = csv.writer(f)
            writer.writerow([status.author.screen_name, status.created_at, status.text])

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

# Writing CSV titles
with open('OutputStreaming.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Author', 'Date', 'Text'])

streamingAPI = tweepy.streaming.Stream(auth, CustomStreamListener())
streamingAPI.filter(track=['Hasbro', 'Mattel', 'Lego'])
The documentation for the track parameter in the Twitter API indicates that it is not possible to exclude terms from the filter, only to include words and phrases. You'll have to implement an additional filter inside your code to discard Tweets that contain words you don't want in your result set.
It's not possible to exclude terms from the filter function, but you can implement a custom selection.
Basically, the idea is to check whether the tweet's words contain any disallowed words.
You can simply tokenize the tweet's text using the nltk module.
A simple example from the nltk homepage:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
Obviously, in your case the sentence is status.text.
So change your code to something similar to this:
def on_status(self, status):
    print(status.author.screen_name, status.created_at, status.text)
    is_allowed = True
    banned_words = ['word_1', 'word2', 'another_bad_word']
    words_text = nltk.word_tokenize(status.text)
    # Loop over banned_words and check whether each item is in words_text
    for word in banned_words:
        if word in words_text:
            # Discard this tweet
            is_allowed = False
            break
    if is_allowed:
        # Stuff for writing status data goes here
        pass  # ...
This code has not been tested, but it shows you a way to reach your goal. Let me know how it goes.
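As for the second problem (only the final tweet appearing in the file): opening OutputStreaming.csv with mode 'w' inside on_status truncates the file on every status, so each tweet overwrites the previous one. A minimal, untested sketch combining both fixes — the header row is still written once at startup as in your code, and banned_words is a placeholder list:

import csv
import nltk
import tweepy

class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        banned_words = ['word_1', 'word2', 'another_bad_word']
        words_text = nltk.word_tokenize(status.text)
        # Skip the tweet entirely if it contains any banned word
        if any(word in words_text for word in banned_words):
            return
        # Mode 'a' appends one row per tweet instead of overwriting the file
        with open('OutputStreaming.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow([status.author.screen_name, status.created_at, status.text])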
I am trying to run a simple script that will stream live Tweets. Several attempts to filter out retweets have been unsuccessful. I still get manual retweets (with the text "RT @") in my stream.
I've tried other methods including link and link.
As I am learning, my code is very similar to the following: link
What can I do to ignore retweets?
Here is a snippet of my code:
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if (status.retweeted) and ('RT @' not in status.text):
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        # Initialize TextBlob class on text of each tweet
        # to get a sentiment score from each class
        blob = TextBlob(text)
        sent = blob.sentiment
What you could do is create another function to call inside of on_status in your StreamListener. Here is something that worked for me:
def analyze_status(text):
    if 'RT' in text[0:3]:
        print("This status was retweeted!")
        print(text)
    else:
        print("This status was not retweeted!")
        print(text)

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        analyze_status(status.text)

    def on_error(self, status_code):
        print(status_code)

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_api.auth, listener=myStreamListener)
myStream.filter(track=['Trump'])
That yields the following:
This status was not retweeted!
@baseballcrank @seanmdav But they won't, cause Trump's name is on it. I can already hear their stupidity, "I hate D…
This status was retweeted!
RT @OvenThelllegals: I'm about to end the Trump administration with a single tweet
This status was retweeted!
RT @kylegriffin1: FLASHBACK: April 2016
SAVANNAH GUTHRIE: "Do you believe in raising taxes on the wealthy?"
TRUMP: "I do. I do. Inc…
This is not the most elegant solution, but I do believe it addresses the issue that you were facing.
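If you want to drop retweets rather than just label them, you can also return early from on_status before any processing. A short, untested sketch along the same lines: streamed retweets carry a retweeted_status attribute, which is a more reliable signal than the retweeted flag (that flag only indicates whether the authenticating user retweeted the status), and the startswith check catches manual retweets:

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        # Skip native retweets and manual "RT @..." retweets
        if hasattr(status, 'retweeted_status') or status.text.startswith('RT @'):
            return
        print(status.text)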
I am currently working on a project where I want to extract the tweet text and the time of creation and put this data in a CSV file. The files I am analysing are large text files (~800MB-1.5GB) containing JSON data. I have used the program below to get this data and piped its output into a text file.
import tweepy as tp
import sys
import pandas as pd

# Variables that contain the user credentials to access the Twitter API
access_token = "..."
access_token_secret = "..."
consumer_key = "..."
consumer_secret = "..."

tweets_data = []

# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(tp.StreamListener):
    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = tp.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = tp.Stream(auth, l)
    stream.filter(track=['Manchester United'])
EDIT: Here is a sample output of the above program.
{"created_at":"Mon Feb 09 07:58:51 +0000 2015","id":564694906233307137,"id_str":"564694906233307137","text":"RT #ManUtd: Take an alternative look at United's starting line-up today, courtesy of #MUTV. #mufclive\nhttps:\/\/t.co\/m1n1JkgRYq","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":306297595,"id_str":"306297595","name":"Agus Wiratama","screen_name":"KunirKm","location":"Bali","url":null,"description":"girls that are uniqe and beautiful in their own way|| #GGMU #Libra #IG : #Kunirkm","protected":false,"verified":false,"followers_count":176,"friends_count":102,"listed_count":1,"favourites_count":39,"statuses_count":4810,"created_at":"Fri May 27 16:45:02 +0000 2011","utc_offset":-32400,"time_zone":"Alaska","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"022330","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme15\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme15\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"A8C7F7","profile_sidebar_fill_color":"C0DFEC","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/561223265025138688\/J3SFBWV4_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/561223265025138688\/J3SFBWV4_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/306297595\/1400412027","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Sun Feb 08 15:52:42 +0000 2015","id":564451764460474369,"id_str":"564451764460474369","text":"Take an alternative look at United's starting line-up today, courtesy of #MUTV. 
#mufclive\nhttps:\/\/t.co\/m1n1JkgRYq","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":558797310,"id_str":"558797310","name":"Manchester United","screen_name":"ManUtd","location":"#mufc","url":"http:\/\/www.manutd.com","description":"Official Twitter of Manchester United FC","protected":false,"verified":true,"followers_count":4388116,"friends_count":84,"listed_count":12006,"favourites_count":0,"statuses_count":11840,"created_at":"Fri Apr 20 15:17:43 +0000 2012","utc_offset":0,"time_zone":"Casablanca","geo_enabled":false,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/pbs.twimg.com\/profile_background_images\/491881264232677376\/VcPcDO7o.jpeg","profile_background_image_url_https":"https:\/\/pbs.twimg.com\/profile_background_images\/491881264232677376\/VcPcDO7o.jpeg","profile_background_tile":false,"profile_link_color":"B30000","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"EFEFEF","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/563854496074194947\/p74gErkN_normal.png","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/563854496074194947\/p74gErkN_normal.png","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/558797310\/1423268331","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":1338,"favorite_count":752,"entities":{"hashtags":[{"text":"MUTV","indices":[73,78]},{"text":"mufclive","indices":[80,89]}],"trends":[],"urls":[{"url":"https:\/\/t.co\/m1n1JkgRYq","expanded_url":"https:\/\/amp.twimg.com\/v\/c79db33a-7fa9-4993-be9d-12990ee17b6b","display_url":"amp.twimg.com\/v\/c79db33a-7fa\u2026","indices":[90,113]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en"},"retweet_count":0,"favorite_count":0,"entities":
I have then tried to read this file to extract the information I need using this program.
import simplejson as json
from pandas import DataFrame as df
import time

if __name__ == "__main__":
    tweets_data_path = '/input.txt'  # Input file path
    tweets_data = []
    tweets_file = open(tweets_data_path, "r")
    tweets = df()
    tweets1 = df()
    i = 0
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
            i += 1
            print(i)
            if i > 10000:
                i = 0
                tweets['CreatedAt'] = [tweet["created_at"] for tweet in tweets_data]
                tweets['text'] = [tweet["text"] for tweet in tweets_data]
                print(tweets_data)
                timestr = time.strftime("%Y%m%d-%H%M%S")
                filename = 'Out' + timestr + '.csv'
                print(filename)
                tweets.to_csv(filename, index=True)
                tweets_data.clear()
                tweets.drop()
                print(tweets)
                # print('Here I am')
        except:
            continue
    try:
        # Creating a new data frame as the old one creates a conflict with the size of the index
        tweets1['CreatedAt'] = [tweet["created_at"] for tweet in tweets_data]
        tweets1['text'] = [tweet["text"] for tweet in tweets_data]
        timestr1 = time.strftime("%Y%m%d-%H%M%S")
        filename1 = 'Out' + timestr1 + '.csv'
        tweets1.to_csv(filename, index=True)
    except:
        print('Excepti')
The problem is that I am not able to save more than one CSV file, even though the code continues to run for the entire length of the JSON file. Have I made an error in the looping or in managing the exceptions?
I am pretty new to Python and programming in general. I appreciate your help.
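It's hard to say exactly what fails here, because both bare except blocks swallow the real error, but two concrete problems stand out: tweets.drop() requires labels and raises when called with no arguments (so it never clears the frame), and the final block writes to the stale filename instead of filename1, overwriting the last file from the loop. A minimal, untested sketch of the batching logic that builds a fresh DataFrame per batch and prints failures instead of hiding them:

import simplejson as json
import pandas as pd
import time

def write_batch(batch):
    # A fresh DataFrame per batch avoids any index-size conflict
    frame = pd.DataFrame({
        'CreatedAt': [t["created_at"] for t in batch],
        'text': [t["text"] for t in batch],
    })
    filename = 'Out' + time.strftime("%Y%m%d-%H%M%S") + '.csv'
    frame.to_csv(filename, index=True)
    print('wrote', filename)

batch = []
with open('/input.txt') as tweets_file:
    for line in tweets_file:
        try:
            batch.append(json.loads(line))
        except ValueError as e:
            print('skipping line:', e)  # don't hide errors with a bare except
            continue
        if len(batch) > 10000:
            write_batch(batch)
            batch = []

if batch:
    write_batch(batch)  # flush the remainder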
I'm new to Python and having trouble thinking about this problem Pythonically. I have a text file of SMS messages. There are multi-line statements I'd like to capture.
import fileinput

parsed = {}

for linenum, line in enumerate(fileinput.input()):
    ### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None

###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################

for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
But here's where I hit a wall. I'm not sure what's the best way to contain the strings I get here. I'd love some expert input. Using Python 2.7.3 but glad to move to 3.
Goal: have a human-readable file full of three-line quotes from these SMS.
Example text:
12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
--
(Yes, before you ask, that's a haiku about poo. I'm trying to capture them from the last 5 years of texting my best friend.)
Ideally resulting in something like:
Haipu 3
2011-03-19
More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

def get_haikus(lines):
    haiku = None
    for line in lines:
        try:
            ID, timestamp, txt = line.split('|')
            t = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            ID = int(ID)
            if haiku and len(haiku[1]) == 3:
                yield haiku
            haiku = (timestamp, [txt])
        except ValueError:  # happens on error with split(), time or int conversion
            haiku[1].append(line)
    else:
        yield haiku

# now get_haikus() returns tuples (timestamp, [lines])
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print """{d}\n{txt}""".format(d=date, txt=text)
A good start might be something like the following. I'm reading data from a file named data2 but the read_messages generator will consume lines from any iterable.
#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
            message = []
        else:
            message.append(line)
    if message:
        yield message

with open('data2') as file_input:
    for msg in read_messages(file_input):
        print msg
This expects input to look something like the following:
sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
I am doing content analysis on tweets. I'm using tweepy to return tweets that match certain terms and then writing N tweets to a CSV file for analysis. Creating the files and getting data is not an issue, but I would like to reduce data collection time. Currently I iterate through a list of terms from a file; once N is reached (e.g. 500 tweets), it moves to the next filter term.
I would like to input all my terms (fewer than 400) into a single variable and have all the results match. This works too. What I cannot get is a return value from Twitter indicating which term matched in the status.
class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, output_file, api=None):
        super(CustomStreamListener, self).__init__()
        self.num_tweets = 0
        self.output_file = output_file

    def on_status(self, status):
        cleaned = status.text.replace('\'', '').replace('&', '').replace('>', '').replace(',', '').replace("\n", '')
        self.num_tweets = self.num_tweets + 1
        if self.num_tweets < 500:
            self.output_file.write(topicName + ',' + status.user.location.encode("UTF-8") + ',' + cleaned.encode("UTF-8") + "\n")
            print("capturing tweet number " + str(self.num_tweets) + " for search term: " + topicName)
            return True
        else:
            return False
            sys.exit("terminating")

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

with open('termList.txt', 'r') as f:
    topics = [line.strip() for line in f]

for topicName in topics:
    stamp = datetime.datetime.now().strftime(topicName + '-%Y-%m-%d-%H%M%S')
    with open(stamp + '.csv', 'w+') as topicFile:
        sapi = tweepy.streaming.Stream(auth, CustomStreamListener(topicFile))
        sapi.filter(track=[topicName])
Specifically, my issue is this: how do I get what matched if the track variable has multiple entries? I will also state that I am relatively new to Python and tweepy.
Thanks in advance for any advice and assistance!
You could check the tweet text against your matching terms. Something like:
>>> a = "hello this is a tweet"
>>> terms = [ "this "]
>>> matches = []
>>> for i, term in enumerate( terms ):
... if( term in a ):
... matches.append( i )
...
>>> matches
[0]
>>>
That would give you all of the terms that the specific tweet, a, matched, which in this case was just the "this" term.
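To tie that back into a listener, here is a minimal, untested sketch: stream all terms at once and record, per status, which tracked terms appear in the text ('all-topics.csv' is a hypothetical output name). Two caveats: plain substring checks also hit partial words (the earlier nltk answer tokenizes for exactly this reason), and Twitter's track matching also considers things like expanded URLs and screen names, so a text-only check can occasionally miss why a tweet matched.

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, output_file, topics, api=None):
        super(CustomStreamListener, self).__init__()
        self.output_file = output_file
        self.topics = topics

    def on_status(self, status):
        text = status.text.lower()
        # Record every tracked term that appears in this tweet's text
        matched = [term for term in self.topics if term.lower() in text]
        cleaned = status.text.replace(',', '').replace('\n', ' ')
        self.output_file.write(';'.join(matched) + ',' + cleaned + '\n')
        return True

with open('termList.txt', 'r') as f:
    topics = [line.strip() for line in f]

with open('all-topics.csv', 'w') as output_file:
    sapi = tweepy.streaming.Stream(auth, CustomStreamListener(output_file, topics))
    sapi.filter(track=topics)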
I was writing a Twitter program using tweepy. When I run this code, it prints the Python object representations of the statuses, like
<tweepy.models.Status object at 0x95ff8cc>
which is not good. How do I get the actual tweet?
import tweepy, tweepy.api

key = 'XXXXX'
sec = 'XXXXX'
tok = 'XXXXX'
tsec = 'XXXXX'

auth = tweepy.OAuthHandler(key, sec)
auth.set_access_token(tok, tsec)
api = tweepy.API(auth)
pub = api.home_timeline()
for i in pub:
    print str(i)
In general, you can use the dir() builtin in Python to inspect an object.
It would seem the Tweepy documentation is very lacking here, but I would imagine the Status objects mirror the structure of Twitter's REST status format, see (for example) https://dev.twitter.com/docs/api/1/get/statuses/home_timeline
So -- try
print dir(status)
to see what lives in the status object
or just, say,
print status.text
print status.user.screen_name
Have a look at the __getstate__() method, which can be used to inspect the returned object:
for i in pub:
    print i.__getstate__()
The api.home_timeline() method returns a list of 20 tweepy.models.Status objects which correspond to the top 20 tweets. That is, each Tweet is considered as an object of Status class. Each Status object has a number of attributes like id, text, user, place, created_at, etc.
The following code would print the tweet id and the text:
tweets = api.home_timeline()
for tweet in tweets:
    print tweet.id, " : ", tweet.text
If you want a specific tweet from the actual tweets, you must have its tweet ID, and then use

tweets = api.statuses_lookup(tweetIDs)
for tweet in tweets:
    # tweet obtained
    print(str(tweet.id) + str(tweet.text))

or, if you want tweets in general, use the Twitter streaming API:
import json
import pymongo
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def __init__(self, outputDatabaseName, collectionName):
        try:
            print("Connecting to database")
            conn = pymongo.MongoClient()
            outputDB = conn[outputDatabaseName]
            self.collection = outputDB[collectionName]
            self.counter = 0
        except pymongo.errors.ConnectionFailure as e:
            print("Could not connect to MongoDB:")

    def on_data(self, data):
        datajson = json.loads(data)
        if "lang" in datajson and datajson["lang"] == "en" and "text" in datajson:
            self.collection.insert(datajson)
            text = datajson["text"].encode("utf-8")  # The text of the tweet
            self.counter += 1
            print(str(self.counter) + " " + str(text))

    def on_error(self, status):
        print("ERROR")
        print(status)

    def on_connect(self):
        print("You're connected to the streaming server.")

l = StdOutListener(dbname, cname)
auth = OAuthHandler(Auth.consumer_key, Auth.consumer_secret)
auth.set_access_token(Auth.access_token, Auth.access_token_secret)
stream = Stream(auth, l)
stream.filter(track=stopWords)
Create a class StdOutListener that inherits from StreamListener and override on_data; the tweet arrives in JSON format, and this function runs every time a tweet is obtained. The tweets are filtered according to stopWords, which is the list of words you want in your tweets.
On a tweepy Status instance you can can access the _json attribute, which returns a dict representing the original Tweet contents.
For example:
type(status)
# tweepy.models.Status
type(status._json)
# dict
status._json.keys()
# dict_keys(['favorite_count', 'contributors', 'id', 'user', ...])
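Since _json is a plain dict, it also gives you an easy way to persist the raw tweet. A minimal sketch, assuming status is a Status object from any of the snippets above ('tweets.jsonl' is a hypothetical output path):

import json

# Append each raw tweet as one JSON document per line (JSON Lines format)
with open('tweets.jsonl', 'a') as f:
    f.write(json.dumps(status._json) + '\n')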