Ignoring Retweets When Streaming Twitter Tweets - python

I am trying to run a simple script that will stream live Tweets. Several attempts to filter out retweets have been unsuccessful. I still get manual retweets (with the text "RT @") in my stream.
I've tried other methods including link and link.
As I am learning, my code is very similar to the following: link
What can I do to ignore retweets?
Here is a snippet of my code:
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if (status.retweeted) and ('RT @' not in status.text):
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        # Initialize a TextBlob on the text of each tweet
        # to get a sentiment score for it
        blob = TextBlob(text)
        sent = blob.sentiment

What you could do is create another function to call inside of the on_status in your StreamListener. Here is something that worked for me:
def analyze_status(text):
    if 'RT' in text[0:3]:
        print("This status was retweeted!")
        print(text)
    else:
        print("This status was not retweeted!")
        print(text)

class MyStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        analyze_status(status.text)

    def on_error(self, status_code):
        print(status_code)

myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_api.auth, listener=myStreamListener)
myStream.filter(track=['Trump'])
That yields the following:
This status was not retweeted!
@baseballcrank @seanmdav But they won't, cause Trump's name is on it. I can already hear their stupidity, "I hate D…
This status was retweeted!
RT @OvenThelllegals: I'm about to end the Trump administration with a single tweet
This status was retweeted!
RT @kylegriffin1: FLASHBACK: April 2016
SAVANNAH GUTHRIE: "Do you believe in raising taxes on the wealthy?"
TRUMP: "I do. I do. Inc…
This is not the most elegant solution, but I do believe it addresses the issue that you were facing.
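If all you need is to drop retweets, a slightly more direct check is to look for the retweeted_status attribute that Tweepy sets on native retweets, plus the "RT @" prefix for manual ones. A minimal sketch (untested against your exact stream setup):

class NoRetweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Native retweets carry a retweeted_status attribute;
        # manual retweets usually start with "RT @".
        if hasattr(status, 'retweeted_status') or status.text.startswith('RT @'):
            return  # skip retweets entirely
        analyze_status(status.text)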

Related

How to get full text in Twitter search API?

I have used the Twitter search API and also passed tweet_mode='extended' and read the full_text attribute, but I am still getting a truncated string from the API.
Here is my code:
results = t.search(q='tuberculosis', count=50, lang='en', result_type='popular', tweet_mode='extended')
all_tweets = results['statuses']
for tweet in all_tweets:
    tweetString = tweet["full_text"]
    userMentionList = tweet["entities"]["user_mentions"]
    if len(userMentionList) > 0:
        for eachUserMention in userMentionList:
            name = eachUserMention["screen_name"]
            time = tweet["created_at"]
            wks.insert_rows(wks.rows, values=[tweetString, name, time], inherit=True)
If you are using TwitterSearch, the following should work:
tso = TwitterSearchOrder()
tso.set_keywords(['tuberculosis'])  # set_keywords expects a list of keywords
for tweet in ts.search_tweets_iterable(tso):
    print(tweet['text'])
You can of course also set other attributes, such as the language and the number of results.
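A slightly fuller sketch of that approach, with placeholder credentials and the result count capped manually (the iterator otherwise pages through everything it can fetch):

from TwitterSearch import TwitterSearch, TwitterSearchOrder, TwitterSearchException

try:
    tso = TwitterSearchOrder()
    tso.set_keywords(['tuberculosis'])
    tso.set_language('en')  # restrict results to English

    ts = TwitterSearch(
        consumer_key='...',
        consumer_secret='...',
        access_token='...',
        access_token_secret='...'
    )

    for i, tweet in enumerate(ts.search_tweets_iterable(tso)):
        if i >= 50:  # stop after 50 tweets
            break
        print('@%s: %s' % (tweet['user']['screen_name'], tweet['text']))
except TwitterSearchException as e:
    print(e)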

Most efficient way to Twitter Stream?

My partner and I started learning Python at the beginning of the year. I am at the point where a) my partner and I are almost finished with our code, but b) are pulling our hair out trying to get it to work.
Assignment: Pull 250 tweets based on a certain topic, geocode location of tweets, analyze based on sentiment, then display them on a web-map. We have accomplished almost all of that except the 250 tweets requirement.
And I do not know how to pull the tweets more efficiently. The code works, but it writes only around seven to twelve rows of information to a CSV before it times out.
I tried setting a track parameter, but received this error: TypeError: 'NoneType' object is not subscriptable
I tried expanding the locations parameter to stream.filter(locations=[-180,-90,180,90]), but received a similar problem: TypeError: 'NoneType' object has no attribute 'latitude'
I really do not know what I am missing and I was wondering if anyone has any ideas.
CODE BELOW:
from geopy import geocoders
from geopy.exc import GeocoderTimedOut
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
import json
import csv

def geo(location):
    g = geocoders.Nominatim(user_agent='USER')
    if location is not None:
        loc = g.geocode(location, timeout=None)
        if loc.latitude and loc.longitude is not None:
            return loc.latitude, loc.longitude

def WriteCSV(user, text, sentiment, lat, long):
    f = open('D:/PATHWAY/TO/tweets.csv', 'a', encoding="utf-8")
    write = csv.writer(f)
    write.writerow([user, text, sentiment, lat, long])
    f.close()

CK = ''
CS = ''
AK = ''
AS = ''
auth = tweepy.OAuthHandler(CK, CS)
auth.set_access_token(AK, AS)
# By setting these values to true, our code will automatically wait as it hits its limits
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Now I'm going to set up a stream listener
# https://stackoverflow.com/questions/20863486/tweepy-streaming-stop-collecting-tweets-at-x-amount
# https://wafawaheedas.gitbooks.io/twitter-sentiment-analysis-visualization-tutorial/sentiment-analysis-using-textblob.html
class StdOutListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super(StdOutListener, self).__init__()
        self.num_tweets = 0

    def on_data(self, data):
        Data = json.loads(data)
        Author = Data['user']['screen_name']
        Text = Data['text']
        Tweet = TextBlob(Data["text"])
        Sentiment = Tweet.sentiment.polarity
        x, y = geo(Data['place']['full_name'])
        if "coronavirus" in Text:
            WriteCSV(Author, Text, Sentiment, x, y)
        self.num_tweets += 1
        if self.num_tweets < 50:
            return True
        else:
            return False

stream = tweepy.Stream(auth=api.auth, listener=StdOutListener())
stream.filter(locations=[-122.441, 47.255, -122.329, 47.603])
The Twitter and Geolocation API returns all kinds of data. Some of the fields may be missing.
TypeError: 'NoneType' object has no attribute 'latitude'
This error comes from here:
loc = g.geocode(location, timeout=None)
if loc.latitude and loc.longitude is not None:
    return loc.latitude, loc.longitude
You provide a location string and geocode searches for it, but when it cannot find that location it returns None, so None gets written into loc.
Consequently loc.latitude won't work because loc is None.
You should check loc first before accessing any of its attributes.
x,y = geo(Data['place']['full_name'])
I know you are filtering tweets by location, and consequently your Twitter Status object should have Data['place']['full_name']. But this is not always the case. You should check that the keys really do exist before accessing the values.
This applies generally and should be applied to your whole code: write robust code. You will have an easier time debugging mistakes if you add some try/except handling and print out the objects to see how they are built. Maybe set a breakpoint in your except block and do some live inspection.
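For example, a more defensive version of the geocoding helper and the place lookup might look like this (just a sketch, reusing the names from the code above):

def geo(location):
    g = geocoders.Nominatim(user_agent='USER')
    if location is None:
        return None
    try:
        loc = g.geocode(location, timeout=None)
    except GeocoderTimedOut:
        return None
    if loc is None:  # Nominatim found nothing for this string
        return None
    return loc.latitude, loc.longitude

# inside on_data: 'place' can be missing or null, so guard before using it
place = Data.get('place') or {}
coords = geo(place.get('full_name'))
if coords is not None:
    x, y = coords
    if "coronavirus" in Text:
        WriteCSV(Author, Text, Sentiment, x, y)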

Tweepy - Limiting certain tweets

I am trying to get the below code to exclude any tweets that include restricted words from a list. What is the best way to do this?
This code is also returning only the final tweet once I break out of the stream. Is there a way to print all applicable tweets to a CSV?
import sys
import tweepy
import csv

# pass security information to variables
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''

# use variables to access twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

# create an object called 'customStreamListener'
class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.author.screen_name, status.created_at, status.text)
        # Writing status data
        with open('OutputStreaming.csv', 'w') as f:
            writer = csv.writer(f)
            writer.writerow([status.author.screen_name, status.created_at, status.text])

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

# Writing csv titles
with open('OutputStreaming.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Author', 'Date', 'Text'])

streamingAPI = tweepy.streaming.Stream(auth, CustomStreamListener())
streamingAPI.filter(track=['Hasbro', 'Mattel', 'Lego'])
The documentation for the track parameter in the Twitter API indicates that it is not possible to exclude terms from the filter, only to include words and phrases. You'll have to implement an additional filter inside your code to discard Tweets that contain words you don't want to include in your result set.
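A minimal in-code filter could look like this (the banned-word list here is hypothetical; substitute your own restricted terms):

BANNED_WORDS = {'word_1', 'word_2'}  # hypothetical restricted terms

def is_allowed(text):
    # Discard the Tweet if any banned word appears in its text
    lowered = text.lower()
    return not any(word in lowered for word in BANNED_WORDS)

You would call is_allowed(status.text) at the top of on_status and return early when it comes back False.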
It's not possible to exclude terms from the filter function, but you can implement a custom selection.
Basically, the idea is to check whether the tweet's words contain any disallowed words.
You can simply tokenize the tweet's text using the nltk module.
A simple example from nltk homepage:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
Obviously, in your case sentence is status.text.
So change your code in something similar to this:
def on_status(self, status):
    print(status.author.screen_name, status.created_at, status.text)
    is_allowed = True
    banned_words = ['word_1', 'word2', 'another_bad_word']
    words_text = nltk.word_tokenize(status.text)
    # loop over banned_words and check whether each item is in words_text
    for word in banned_words:
        if word in words_text:
            # discard this tweet
            is_allowed = False
            break
    if is_allowed:
        # stuff for writing status data
        # ...
        pass
This code has not been tested, but shows you a way to reach your goal.
Let me know
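As a side note on the second part of the question (only the last Tweet surviving in the CSV): opening the file with mode 'w' inside on_status truncates it on every status, so each Tweet overwrites the previous one. Opening in append mode keeps every row; a small sketch, assuming Python 3:

def on_status(self, status):
    # 'a' appends to the file instead of overwriting it for each Tweet
    with open('OutputStreaming.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow([status.author.screen_name, status.created_at, status.text])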

Twilio conference moderation and participant name

I would like to set the first user who joins the conference as moderator. I'm using the twilio-python docs to help me, but I didn't see anything about this.
The first participant should be the moderator, in order to mute, kick, etc. the other one, but to be honest I don't know if this is really required, so I'm open to a "you don't need a moderator for this".
Also, I would like to know whether the name related to the token is stored in the participant, so I can retrieve the participant with that name instead of the SID (I didn't see anything in the docs).
Here is the server-side code:
@app.route('/call', methods=['GET', 'POST'])
def call():
    resp = twilio.twiml.Response()
    from_value = request.values.get('From')
    to = request.values.get('To')
    conferenceName = request.values.get('conferenceName')
    account_sid = os.environ.get("ACCOUNT_SID", ACCOUNT_SID)
    auth_token = os.environ.get("AUTH_TOKEN", AUTH_TOKEN)
    app_sid = os.environ.get("APP_SID", APP_SID)
    clientTwilio = TwilioRestClient(account_sid, auth_token)
    # ... (earlier branches omitted)
    elif to.startswith("conference:"):
        # allows the user to make a conference call
        # client -> conference
        conferencesList = clientTwilio.conferences.list(friendly_name=conferenceName)
        # there's no conference with this conferenceName, so the first person
        # should become the moderator and join it
        if len(conferencesList) == 0:
            # do some stuff to set a moderator [...]
            resp.dial(callerId=from_value).conference(to[11:])
        else:
            # there's already a conference, just join it
            resp.dial(callerId=from_value).conference(to[11:])
and for the "name" related to the token/client I want to use to retrieve a participant :
// http://foo.herokuapp.com/token?client=someName
self.phone = [[TCDevice alloc] initWithCapabilityToken:token delegate:self];
NSDictionary *params = @{@"To": @"conference:foo"};
self.connection = [self.phone connect:params delegate:self];
[self closeNoddersView:nil];
// the user is connected as a participant in the conference; is it possible to retrieve them with the "someName"? (a server-side route which takes a "someName" as a param)
any clue ? :/
I found a workaround to use the client:name, with no need for a moderator:
a conference contains a list of participants;
a participant is related to a specific call;
a call contains the information in its to and from_ fields: client:name.
@app.route('/conference_kick', methods=['GET', 'POST'])
def conference():
    client = TwilioRestClient(account_sid, auth_token)
    conferenceName = request.values.get('conferenceName')
    participantName = request.values.get('participantName')
    index = 0
    call = ""
    # A list of conference objects
    conferencesList = client.conferences.list(status="in-progress", friendly_name=conferenceName)
    if len(conferencesList) == 1:
        if conferencesList[0].participants:
            participants = conferencesList[0].participants.list()
            while index < len(participants):
                call = client.calls.get(participants[index].call_sid)
                array = call.from_.split(':')
                if participantName == array[1]:
                    participants[index].kick()
                    return json.dumps({'code': 200, 'success': 1, 'message': participantName + ' kicked'})
                index += 1
            return json.dumps({'code': 101, 'success': 0, 'message': participantName + ' not found'})
        else:
            return json.dumps({'code': 102, 'success': 0, 'message': 'no participants'})
    else:
        return json.dumps({'code': 103, 'success': 0, 'message': 'no conference'})
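Calling that route could then look something like this (hypothetical host and parameter values, using the requests library):

import requests

# Kick the participant who connected as client:someName from the 'foo' conference
resp = requests.post('http://foo.herokuapp.com/conference_kick',
                     data={'conferenceName': 'foo', 'participantName': 'someName'})
print(resp.json())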

return actual tweets in tweepy?

I was writing a twitter program using tweepy. When I run this code, it prints the Python ... values for them, like
<tweepy.models.Status object at 0x95ff8cc>
Which is not good. How do I get the actual tweet?
import tweepy, tweepy.api
key = XXXXX
sec = XXXXX
tok  = XXXXX
tsec = XXXXX
auth = tweepy.OAuthHandler(key, sec)
auth.set_access_token(tok, tsec)
api = tweepy.API(auth)
pub = api.home_timeline()
for i in pub:
        print str(i)
In general, you can use the dir() builtin in Python to inspect an object.
It would seem the Tweepy documentation is very lacking here, but I would imagine the Status objects mirror the structure of Twitter's REST status format, see (for example) https://dev.twitter.com/docs/api/1/get/statuses/home_timeline
So -- try
print dir(status)
to see what lives in the status object
or just, say,
print status.text
print status.user.screen_name
Have a look at the __getstate__() method, which can be used to inspect the returned object:
for i in pub:
    print i.__getstate__()
The api.home_timeline() method returns a list of 20 tweepy.models.Status objects which correspond to the top 20 tweets. That is, each Tweet is considered as an object of Status class. Each Status object has a number of attributes like id, text, user, place, created_at, etc.
The following code would print the tweet id and the text :
tweets = api.home_timeline()
for tweet in tweets:
    print tweet.id, " : ", tweet.text
For actual tweets: if you want a specific tweet, you must have its tweet ID, and use
tweets = self.api.statuses_lookup(tweetIDs)
for tweet in tweets:
    # tweet obtained; statuses_lookup returns Status objects, so use attribute access
    print(str(tweet.id) + " " + str(tweet.text))
Or, if you want tweets in general, use the Twitter streaming API:
import json
import pymongo
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def __init__(self, outputDatabaseName, collectionName):
        try:
            print("Connecting to database")
            conn = pymongo.MongoClient()
            outputDB = conn[outputDatabaseName]
            self.collection = outputDB[collectionName]
            self.counter = 0
        except pymongo.errors.ConnectionFailure as e:
            print("Could not connect to MongoDB:")

    def on_data(self, data):
        datajson = json.loads(data)
        if "lang" in datajson and datajson["lang"] == "en" and "text" in datajson:
            self.collection.insert(datajson)
            text = datajson["text"].encode("utf-8")  # The text of the tweet
            self.counter += 1
            print(str(self.counter) + " " + str(text))

    def on_error(self, status):
        print("ERROR")
        print(status)

    def on_connect(self):
        print("You're connected to the streaming server.")

l = StdOutListener(dbname, cname)
auth = OAuthHandler(Auth.consumer_key, Auth.consumer_secret)
auth.set_access_token(Auth.access_token, Auth.access_token_secret)
stream = Stream(auth, l)
stream.filter(track=stopWords)
Create a class StdOutListener which inherits from StreamListener, and override the on_data function; the tweet arrives in JSON format, and this function runs every time a tweet is obtained. Tweets are filtered according to stopWords, which is a list of the words you want in your tweets.
On a tweepy Status instance you can access the _json attribute, which returns a dict representing the original Tweet contents.
For example:
type(status)
# tweepy.models.Status
type(status._json)
# dict
status._json.keys()
# dict_keys(['favorite_count', 'contributors', 'id', 'user', ...])
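For example, to dump the raw Tweet payload or pull a single field out of it (a small sketch; status is any Status object returned by the API):

import json

# Pretty-print the original Tweet dict, then read one field directly
print(json.dumps(status._json, indent=2))
print(status._json['text'])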
