Logical Operators in Tweepy Filter - python

I'm hoping to track tweets that contain a certain set of words, but not others. For example, if my filter is: "taco" AND ("chicken" OR "beef").
It should return these tweets:
- I am eating a chicken taco.
- I am eating a beef taco.
It should not return these tweets:
- I am eating a taco.
- I am eating a pork taco.
Here is the code I'm currently running:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
import json
# authentication data- get this info from twitter after you create your application
ckey = '...' # consumer key, AKA API key
csecret = '...' # consumer secret, AKA API secret
atoken = '...' # access token
asecret = '...' # access secret
# define listener class
class listener(StreamListener):
    def on_data(self, data):
        try:
            print data  # write the whole tweet to terminal
            return True
        except BaseException, e:
            print 'failed on data, ', str(e)  # if there is an error, show what it is
            time.sleep(5)  # one error could be that you're rate-limited; this will cause the script to pause for 5 seconds

    def on_error(self, status):
        print status
# authenticate yourself
auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["taco"]) # track what you want to search for!
The last line of the code is the part I'm struggling with; if I use:
twitterStream.filter(track=["taco","chicken","beef"])
it will return all tweets containing any of the three words. Other things I've tried, such as:
twitterStream.filter(track=(["taco"&&("chicken","beef")])
return a syntax error.
I'm fairly new to both Python and Tweepy. Both this and this seem like similar queries, but they are related to tracking multiple terms simultaneously, rather than tracking a subset of tweets containing a term. I haven't been able to find anything in the tweepy documentation.
I know another option would be tracking all tweets containing "taco" then filtering by "chicken" or "beef" into my database, but I'm worried about running up against the 1% streaming rate limit if I do a general search and then filter it down within Python, so I'd prefer only streaming the terms I want in the first place from Twitter.
Thanks in advance-
Sam

Twitter does not allow you to be very precise in how keywords are matched. However, the track parameter documentation states that spaces within a keyword are equivalent to logical ANDs. All of the terms you specify are ORed together.
So, to achieve your "taco" AND ("chicken" OR "beef") example, you could try the parameters [taco chicken, taco beef]. This would match tweets containing the words taco and chicken, or taco and beef. However, this isn't a perfect solution, as a tweet containing taco, chicken, and beef would also be matched.
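As a minimal sketch, reusing the twitterStream object and listener from the question, the last line would become (spaces inside a phrase act as AND, separate list items are ORed):

twitterStream.filter(track=["taco chicken", "taco beef"])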

Related

Tweepy lookup of extended tweets for multiple tweets at a time?

I'm using tweepy to access a large number of tweets. Many tweets are truncated, so I want to get the full text of some tweets whose IDs I have.
My problem is: the tweepy API instance has a method for downloading multiple tweets at once (api.statuses_lookup), but it returns truncated tweets.
It also has a method that includes the full tweet text (api.get_status), but as far as I know that one only takes one tweet at a time.
Is there a way to get the full text of multiple tweets at once?
import tweepy
consumer_key = "XXX"
secret = "XXX"
auth = tweepy.AppAuthHandler(consumer_key, secret)
auth.secure = True
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
ids = [1108360183586140161, 1108474125486641153]
# Finds tweets (up to 100 at a time), but doesn't contain extended text
foo = api.statuses_lookup(ids)
# Returns tweet, including extended text, but only for one at a time
bar = api.get_status(1108449077937635328, tweet_mode='extended')
As pointed out by Andy Piper, the issue was fixed in a recent update of the Tweepy library, so running
pip install tweepy --upgrade
solves this.
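After upgrading, a short sketch of what should work (it is an assumption on my part that your Tweepy version passes tweet_mode through to the lookup endpoint; check the version you have installed):

# look up several tweets at once and keep the full, untruncated text
tweets = api.statuses_lookup(ids, tweet_mode='extended')
for tweet in tweets:
    print(tweet.full_text)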

Twitter API: Get top tweets by query and WOEID place

Preferably via Tweepy in Python, I want to obtain from the Twitter API a list of top tweets for a given search query and WOEID place identifier (Yahoo's Where On Earth IDentifier).
In my example, I obtain trending queries for a WOEID id via Tweepy's API.trends_place(id) wrapper for the Twitter REST API's GET trends/place; I then want to print the top tweets for each trending query within this place (same WOEID).
Currently, I obtain tweets for the trending query, but:
- not within the given place;
- not necessarily the "top" tweets (as opposed to, for example, "recent").
How can I add these two restrictions to my search?
MWE:
import tweepy
from tweepy import OAuthHandler
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
locationid = 23424775 # WOEID for Canada
trendqueries = [trend['query'] for trend in api.trends_place(locationid)[0]['trends']]
for trendquery in trendqueries:
    print(api.search(q=trendquery))
What I have tried:
I can search by longitude/latitude using Tweepy's API.search(q, geocode), but I do not see an obvious way to search by WOEID.
Partial answer
API.search(q[, lang][, locale][, rpp][, page][, since_id][, geocode][, show_user])
Returns tweets that match a specified query.
Parameters:
geocode – Returns tweets by users located within a given radius of the given latitude/longitude. The location is preferentially taken from the Geotagging API, but will fall back to their Twitter profile. The parameter value is specified by “latitude,longitude,radius”, where radius units must be specified as either “mi” (miles) or “km” (kilometers). Note that you cannot use the near operator via the API to geocode arbitrary locations; however you can use this geocode parameter to search near geocodes directly.
show_user – When true, prepends “:” to the beginning of the tweet. This is useful for readers that do not display Atom’s author field. The default is false.
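A minimal sketch of the geocode approach, reusing the trendqueries list from the question (the coordinates and radius below are illustrative placeholders for a point in Canada, not values from the original post):

# search each trending query near a latitude/longitude instead of a WOEID
for trendquery in trendqueries:
    results = api.search(q=trendquery, geocode='56.13,-106.35,1000km')
    for tweet in results:
        print(tweet.text)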

Full list of twitter "friends" using python and tweepy

By friends I mean all of the twitter users who I am following.
Is it possible using tweepy with python 2.7.6 to display a full list of all friends?
I have found it possible to display a list which contains some of my friends with the following code. After handling authorization of course.
api = tweepy.API(auth)
user = api.get_user('MyTwitterHandle')
print "My Twitter Handle:" , user.screen_name
ct = 0
for friend in user.friends():
    print friend.screen_name
    ct = ct + 1
print "\n\nFinal Count:", ct
This code successfully prints what appears to be my 20 most recent friends on Twitter, the ct variable is equal to 20. This method excludes the rest of the users I am following on Twitter.
Is it possible to display all of the users I am following on twitter? Or at least a way to adjust a parameter to allow me to include more friends?
According to the source code, friends() maps to the GET friends/list Twitter endpoint, which allows a count parameter to be passed in:
The number of users to return per page, up to a maximum of 200. Defaults to 20.
This would allow you to get 200 friends via friends().
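For example, a small sketch reusing the user object from the question (that friends() forwards count to the endpoint is an assumption based on the documentation quoted above; verify with your tweepy version):

# request up to 200 friends in a single page instead of the default 20
for friend in user.friends(count=200):
    print(friend.screen_name)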
Or, a better approach would be to use a Cursor, which is a paginated way to get all of the friends:
for friend in tweepy.Cursor(api.friends).items():
    # Process the friend here
    process_friend(friend)
See also:
incomplete friends list
Tweepy Cursor vs Iterative for low API calls

how to get English tweets alone using python?

Here is my current code
from twitter import *
t = Twitter(auth=OAuth(TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET,
                       ACCESS_TOKEN, ACCESS_TOKEN_SECRET))
t.statuses.home_timeline()
query=raw_input("enter the query \n")
data = t.search.tweets(q=query)
for i in range (0,1000):
    print data['statuses'][i]['text']
    print '\n'
Here, I fetch tweets from all the languages. Is there a way to restrict myself to fetching tweets only in English?
There are at least 4 ways... I put them in the order of simplicity.
After you collect the tweets, the JSON output has a key/value pair that identifies the language. So you can use something like this to take tweets in all languages and select only the ones that are from English accounts.
for i in range (0,1000):
    if data['statuses'][i][u'lang']==u'en':
        print data['statuses'][i]['text']
        print '\n'
Another way to collect only tweets that are identified as English is to use the optional 'lang' parameter to request from the API only English (self-identified) tweets; see the sketch after this list of options. See details here. If you are using the python-twitter library, you can set the 'lang' parameter in twitter.py.
Use a language recognition package like guess-language.
Or if you want to recognize English text without using the self-identified twitter data (i.e. a chinese account that is writing in English), then you have to do Natural Language Processing. One option. This method will recognize common English words and then mark the text as English.
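A minimal sketch of the 'lang' parameter approach from the second option above, reusing the t Twitter object and query variable from the question (passing lang as a keyword is an assumption about how the library forwards request parameters to the search endpoint):

# ask the API to return only tweets self-identified as English
data = t.search.tweets(q=query, lang='en')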
I tried this for Farsi:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
res = api.search(q='your query terms', lang='fa')  # pass the search term as q and restrict results with lang
for i in res:
    print(i.lang)

Tracking keywords in a live stream of tweets

I installed and tried out tweepy, and I am using the following function right now:
from API Reference
API.public_timeline()
Returns the 20 most recent statuses from
non-protected users who have set a custom user icon. The public
timeline is cached for 60 seconds so requesting it more often than
that is a waste of resources.
However, I want to extract all tweets that match a certain regular expression from the complete live stream. I could put public_timeline() inside a while True loop, but that would probably run into problems with rate limiting. Either way, I don't really think it can cover all current tweets.
How could that be done? If not all tweets, then I want to extract as many tweets that match a certain keyword.
The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:
def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = []
    for key in args:
        filters.append(str(key))
    if len(filters) == 0:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'], tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure. This bundles the original tweet. Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'], tweet['retweeted_status']['user']['screen_name'], \
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError, e:
        print 'Disconnected from Twitter at '+time.strftime("%d %b %Y %H:%M:%S", time.localtime()) \
            +'. Reason: ', e.reason
I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.
Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.
The streaming API rate limiting works differently: you get 1 connection per IP, and a maximum number of events per second. If more events occur than that, then you only get the maximum anyways, with a notification regarding how many events you missed because of rate limiting.
My understanding is that the streaming API is most suitable for servers that will redistribute the content to your users as needed, instead of being accessed directly by your users - the standing connections are expensive and Twitter starts blacklisting IPs after too many failed connections and re-connections, and possibly your API key afterwards.
