I installed and tried out tweepy; right now I am using the following function, from the API Reference:

API.public_timeline()
Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds so requesting it more often than that is a waste of resources.
However, I want to extract all tweets that match a certain regular expression from the complete live stream. I could put public_timeline() inside a while True loop, but that would probably run into rate limiting, and in any case I don't think it can cover all current tweets.
How could that be done? If not all tweets, then I want to extract as many tweets as possible that match a certain keyword.
The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:
import time
import tweetstream

def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.
    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = []
    for key in args:
        filters.append(str(key))
    if len(filters) == 0:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure. This bundles the original tweet. Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        print 'Disconnected from Twitter at ' + time.strftime("%d %b %Y %H:%M:%S", time.localtime()) \
            + '. Reason: ', e.reason
I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.
Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
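To tie this back to the regex requirement in the question, here's a minimal sketch of filtering the stream client-side with a compiled pattern, built on the same tweetstream FilterStream call used above. The credentials and the keyword are placeholders, and the with-statement usage follows tweetstream's README:

import re
import tweetstream

username, password = "user", "pass"             # placeholder credentials
pattern = re.compile(r"kitten", re.IGNORECASE)  # whatever regex you need

# track narrows the stream server-side by keyword; the regex then does
# the finer client-side match on each status that comes through.
with tweetstream.FilterStream(username, password, track=["kitten"]) as stream:
    for tweet in stream:
        text = tweet.get("text")
        if text and pattern.search(text):
            print(text)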
Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.
The streaming API's rate limiting works differently: you get one connection per IP and a maximum number of events per second. If more events occur than that, you only get the maximum anyway, along with a notification of how many events you missed because of rate limiting.
My understanding is that the streaming API is most suitable for servers that will redistribute the content to your users as needed, instead of being accessed directly by your users - the standing connections are expensive and Twitter starts blacklisting IPs after too many failed connections and re-connections, and possibly your API key afterwards.
I am trying to get the number of tweets containing a hashtag (let's say "#kitten") in Python.
I am using tweepy.
However, all the code I have found is of this form:
query = "kitten"
for i, status in enumerate(tweepy.Cursor(api.search, q=query).items(50)):
print(i, status)
I get this error: 'API' object has no attribute 'search'
Tweepy no longer seems to contain this attribute. Is there any way to solve my problem?
Sorry for my bad English.
After browsing the web and the Twitter documentation I found the answer.
If you want the historic counts of all tweets since 2006 you need Academic authorization. That is not my case, so I can only get 7 days of tracking, which is enough for me. Here is the code:
import tweepy

# bearer_token is your app's Bearer Token from the developer portal
query = "kitten -is:retweet"
client = tweepy.Client(bearer_token)
counts = client.get_recent_tweets_count(query=query, granularity='day')
for i in counts.data:
    print(i["tweet_count"])
The "-is:retweet" is here to not count the retweets. You need to remove it if you want to count them.
Since we're not pulling any tweets (only the volume of them) we are not increasing our MONTHLY TWEET CAP USAGE.
Be carefull when using symbols in your query such as "$" it might give you an error. For a list of valid operators see : list of valid operators for query
As said here Twitter counts introduction, you only need "read only" authorization to perform a recent count request. (see Recent Tweet counts)
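If you only need the total over the whole seven-day window rather than the per-day numbers, the response metadata should already carry an aggregate. A small sketch continuing the snippet above; it assumes the v2 recent counts endpoint's meta includes total_tweet_count, as its documentation describes:

# The recent-counts response also carries an aggregate total in its metadata.
total = counts.meta["total_tweet_count"]
print(total, "tweets matched", query, "over the last 7 days")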
I am using the SoundCloud API through the Python SDK.
When I get track data through 'Search', the track attribute 'playback_count' seems to be smaller than the actual count seen on the web.
How can I avoid this problem and get the actual playback_count?
(For example, this track's playback_count gives me 2700, but it's actually 15k when displayed on the web: https://soundcloud.com/drumandbassarena/ltj-bukem-soundcrash-mix-march-2016)
Note: this problem does not occur for comments or likes.
Following is my code:
## Search ##
tracks = client.get('/tracks', q=querytext, created_at={'from': startdate},
                    duration={'from': startdur}, limit=200)
outputlist = []
resultnum = 0
for t in tracks:
    trackinfo = {}
    resultnum += 1
    trackinfo["id"] = resultnum
    trackinfo["title"] = t.title
    trackinfo["username"] = t.user["username"]
    trackinfo["created_at"] = t.created_at[:-5]
    trackinfo["genre"] = t.genre
    trackinfo["plays"] = t.playback_count
    trackinfo["comments"] = t.comment_count
    trackinfo["likes"] = t.likes_count
    trackinfo["url"] = t.permalink_url
    outputlist.append(trackinfo)
There is an issue with the playback count being incorrect when reported via the API.
I have encountered this when getting data via the /me endpoint, for activity and likes among others.
[Image: information returned for the sound of the currently playing track in the SoundCloud widget]
[Image: information returned via the API for the me/activities endpoint]
Looking at the SoundCloud website, they actually call a second version of the API to populate the track list on the user page. It's similar to the documented version, but not quite the same.
If you issue a request to https://api-v2.soundcloud.com/stream/users/[userid]?limit=20&client_id=[clientid] then you'll get back a JSON object showing the same numbers you see on the web.
Since this is an undocumented version, I'm sure it'll change the next time they update their website.
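For illustration, a minimal sketch of hitting that endpoint with requests. The user id and client id are placeholders, and the collection/track response shape is an assumption based on typical v2 responses, since the endpoint is undocumented:

import requests

user_id = "12345"             # placeholder: the numeric user id
client_id = "YOUR_CLIENT_ID"  # placeholder: your API client id

# Undocumented v2 endpoint the SoundCloud site itself uses; it may
# change without notice whenever they update the website.
url = "https://api-v2.soundcloud.com/stream/users/%s" % user_id
resp = requests.get(url, params={"limit": 20, "client_id": client_id})
resp.raise_for_status()

# Assumed response shape: a "collection" of stream items, each wrapping
# a "track" object whose playback_count matches the numbers on the web.
for item in resp.json().get("collection", []):
    track = item.get("track") or {}
    print(track.get("title"), track.get("playback_count"))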
I'm a newbie when it comes to Python. I literally just started today and have little understanding of programming. I have managed to make the following code work:
from twitter import *

config = {}
execfile("config.py", config)

twitter = Twitter(
    auth = OAuth(config["access_key"], config["access_secret"],
                 config["consumer_key"], config["consumer_secret"]))

user = "skiftetse"
results = twitter.statuses.user_timeline(screen_name = user)

for status in results:
    print "(%s) %s" % (status["created_at"], status["text"].encode("ascii", "ignore"))
The problem is that it's only printing 20 results. The Twitter page I'd like to get data from has 22k posts, so something is wrong with the last line of code.
I would really appreciate help with this! I'm doing this for research on sentiment analysis, so I need several hundred tweets to analyze. Beyond that, it would be great if retweets and information about how many people retweeted the posts were included. I need to get better at all this, but right now I just need to meet my deadline at the end of the month.
You need to understand how the Twitter API works. Specifically, the user_timeline documentation.
By default, a request will only return 20 Tweets. If you want more, you will need to set the count parameter to, say, 50.
e.g.
results = twitter.statuses.user_timeline(screen_name = user, count = 50)
Note, count:
Specifies the number of tweets to try and retrieve, up to a maximum of 200 per distinct request.
In addition, the API will only let you retrieve the most recent 3,200 Tweets.
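If you need more than one request's worth, the usual approach on this endpoint is max_id paging: request up to 200 at a time and step max_id below the oldest tweet seen. A rough sketch with the same library, reusing user and twitter from above; it assumes the standard v1.1 max_id convention:

all_statuses = []
max_id = None
while len(all_statuses) < 3200:  # the API's ceiling for a user timeline
    kwargs = {"screen_name": user, "count": 200}
    if max_id is not None:
        kwargs["max_id"] = max_id
    page = twitter.statuses.user_timeline(**kwargs)
    if not page:
        break
    all_statuses.extend(page)
    max_id = page[-1]["id"] - 1  # step past the oldest tweet seen so far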
I'm trying to use the Python-Twitter library (https://github.com/bear/python-twitter) to extract mentions of a Twitter account using the GetMentions() function. The script populates a database and runs periodically on a cron job, so I don't want to extract every mention, only those since the last time the script was run.
The code below extracts the mentions fine, but for some reason the 'since_id' argument doesn't seem to do anything: the function returns all the mentions every time I run it, rather than filtering for only the most recent ones. For reference, the documentation is here: https://python-twitter.googlecode.com/hg/doc/twitter.html#Api-GetMentions
What is the correct way to use the GetMentions() function? (I've looked but I can't find any examples online.) Alternatively, is there a different or more elegant way of extracting Twitter mentions that I'm overlooking?
def scan_timeline():
    ''' Scans the timeline and populates the database with the results '''
    FN_NAME = "scan_timeline"
    # Establish the api connection
    api = twitter.Api(
        consumer_key = "consumerkey",
        consumer_secret = "consumersecret",
        access_token_key = "accesskey",
        access_token_secret = "accesssecret"
    )
    # Tweet ID of most recent mention from the last time the function was run
    # (In actual code this is dynamic and extracted from a database)
    since_id = 498404931028938752
    # Retrieve all mentions created since the last scan of the timeline
    length_of_response = 20
    page_number = 0
    while length_of_response == 20:
        # Retrieve most recent mentions
        results = api.GetMentions(since_id, None, page_number)
        ### Additional code inserts the tweets into a database ###
Your syntax looks consistent with what the Python-Twitter library documents. What I think is happening is the following:
If more Tweets than the limit have occurred since the since_id, the since_id will be forced to the oldest ID available.
That would return all the tweets starting from the oldest available ID. Try working with a more recent since_id value, and also check whether the since_id you're passing is appropriate.
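One way to keep each run incremental is to pass since_id by keyword and advance it from the newest mention you see. A sketch; it assumes GetMentions returns Status objects with an id attribute, and that you persist since_id between runs:

results = api.GetMentions(since_id=since_id)
if results:
    # Advance the cursor to the newest mention and store it in the
    # database, so the next cron run only fetches newer mentions.
    since_id = max(status.id for status in results)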
I am trying to loop through some 50-odd files in a directory. Each file has some text for which I am trying to find the keywords using the Yahoo Term Extractor. I am able to extract text from each file, but I am not able to iteratively call the API using the text as input. Only the keywords for the first file are displayed.
Here is my code snippet:
In the 'comments' list, I have extracted and stored the text from each file.
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Well, I don't know anything about the Yahoo Term Extractor, but I'd presume that your call to request.add_data(queryData) simply tacks another data set onto the request with each iteration of your loop, and that the call to OPENER.open(request).read() then only processes the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query; it's as simple as that.
Actually, a third reason comes to mind now that I've read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by using many request variables instead of just one, or by creating a new request with every iteration of your loop. If you're not worried about storing your results and are just trying to debug, you could try:
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request = urllib2.Request()  # I don't know how to initialize this variable, do it yourself
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it), so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes the request and OPENER objects come from), I might be able to elaborate.
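For completeness, a hedged sketch of the "fresh request per iteration" idea with the POST data passed at construction time. The endpoint URL is my assumption of the V1 Term Extraction address and should be checked against the service docs:

import time
import urllib
import urllib2

# Assumed endpoint for Yahoo's V1 Term Extraction service -- verify this.
API_URL = "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction"

for c in comments:
    queryData = urllib.urlencode([('appid', appid), ('context', c)])
    request = urllib2.Request(API_URL, data=queryData)  # fresh request each time
    print OPENER.open(request).read()
    time.sleep(1)  # stay well under the 5,000 queries/day limit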