Python-Twitter: Retrieve most recent mentions

I'm trying to use the Python-Twitter library (https://github.com/bear/python-twitter) to extract mentions of a Twitter account using the GetMentions() function. The script populates a database and runs periodically on a cron job, so I don't want to extract every mention, only those since the last time the script was run.
The code below extracts the mentions fine, but for some reason the since_id argument doesn't seem to do anything: the function returns all the mentions every time I run it, rather than filtering for only the most recent ones. For reference, the documentation is here: https://python-twitter.googlecode.com/hg/doc/twitter.html#Api-GetMentions
What is the correct way to implement the GetMentions() function? (I've looked, but I can't find any examples online.) Alternatively, is there a different/more elegant way of extracting Twitter mentions that I'm overlooking?
def scan_timeline():
    ''' Scans the timeline and populates the database with the results '''
    FN_NAME = "scan_timeline"
    # Establish the API connection
    api = twitter.Api(
        consumer_key = "consumerkey",
        consumer_secret = "consumersecret",
        access_token_key = "accesskey",
        access_token_secret = "accesssecret"
    )
    # Tweet ID of most recent mention from the last time the function was run
    # (In actual code this is dynamic and extracted from a database)
    since_id = 498404931028938752
    # Retrieve all mentions created since the last scan of the timeline
    length_of_response = 20
    page_number = 0
    while length_of_response == 20:
        # Retrieve most recent mentions
        results = api.GetMentions(since_id, None, page_number)
        ### Additional code inserts the tweets into a database ###

Your syntax is consistent with what the Python-Twitter library documents. What I think is happening is described in Twitter's own API notes:
If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
That would lead to all the tweets starting from the oldest available ID being returned. Try working with a more recent since_id value, and also check that the since_id you're passing is appropriate in the first place.
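One thing worth checking is how the arguments are passed: depending on the library version, since_id may not be the first positional parameter of GetMentions, so passing it by keyword is safer. Here is a minimal sketch along those lines; the load/save/insert helpers are hypothetical placeholders standing in for the database code mentioned in the question:

import twitter

api = twitter.Api(
    consumer_key="consumerkey",
    consumer_secret="consumersecret",
    access_token_key="accesskey",
    access_token_secret="accesssecret"
)

# Load the highest tweet ID seen so far (hypothetical helper;
# in the real script this comes from the database).
since_id = load_last_seen_id()

# Pass since_id explicitly by keyword so it cannot be mistaken
# for a different positional parameter.
mentions = api.GetMentions(since_id=since_id)

for status in mentions:
    insert_into_database(status)  # hypothetical helper
    since_id = max(since_id, status.id)

# Persist the new high-water mark for the next cron run.
save_last_seen_id(since_id)  # hypothetical helper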

Related

How to get the number of tweets from a hashtag in python?

I am trying to get the number of tweets containing a hashtag (let's say "#kitten") in Python.
I am using tweepy.
However, all the code examples I have found are of this form:
query = "kitten"
for i, status in enumerate(tweepy.Cursor(api.search, q=query).items(50)):
    print(i, status)
I get this error: 'API' object has no attribute 'search'
It seems tweepy no longer contains this method. Is there any way to solve my problem?
Sorry for my bad English.
After browsing the web and the Twitter documentation, I found the answer.
If you want the historical counts of all tweets back to 2006, you need Academic authorization. That is not my case, so I can only get 7 days of tracking, which is enough for me. Here is the code:
import tweepy

query = "kitten -is:retweet"
client = tweepy.Client(bearer_token)
counts = client.get_recent_tweets_count(query=query, granularity='day')
for i in counts.data:
    print(i["tweet_count"])
The "-is:retweet" is there so retweets are not counted; remove it if you want to include them.
Since we're not pulling any tweets (only the volume of them), we are not increasing our MONTHLY TWEET CAP USAGE.
Be careful when using symbols such as "$" in your query, as they might give you an error. For a list of valid operators, see: list of valid operators for query
As explained in the Twitter counts introduction, you only need "read only" authorization to perform a recent count request. (See Recent Tweet counts.)
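If you want a single figure rather than per-day buckets, you can simply sum them. A minimal sketch, reusing the client and query from above (the v2 counts response also carries a meta section that may include a total, but summing the buckets avoids relying on that):

counts = client.get_recent_tweets_count(query="kitten -is:retweet", granularity='day')

# Each entry in counts.data covers one day of the 7-day window.
total = sum(bucket["tweet_count"] for bucket in counts.data)
print("total tweets in the last 7 days:", total)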

Tweepy not returning full tweet: tweet_mode = 'extended' not working

Hello, I am trying to scrape the tweets of a certain user using tweepy.
Here is my code:
tweets = []
username = 'example'
count = 140  # number of tweets
try:
    # Pulling individual tweets from query
    for tweet in api.user_timeline(id=username, count=count, include_rts=False):
        # Adding to list that contains all tweets
        tweets.append((tweet.text))
except BaseException as e:
    print('failed on_status,', str(e))
    time.sleep(3)
The problem I am having is that the tweets come back unfinished, with "..." at the end.
I think I've looked at all the other similar problems on Stack Overflow and elsewhere, but nothing works. Most do not concern me because I am NOT dealing with retweets.
I have tried putting tweet_mode = 'extended' and/or tweet.full_text or tweet._json['extended_tweet']['full_text'] in different combinations.
I don't get an error message, but nothing works; I just get an empty list in return.
And it looks like the documentation is out of date, because it says nothing about the 'tweet_mode' or the 'include_rts' parameter.
Has anyone managed to get the full text of each tweet? I'm really stuck on this seemingly simple problem and am losing my hair, so I would appreciate any advice :D
Thanks in advance!!!
TL;DR: You're most likely running into a Rate Limiting issue. And use the full_text attribute.
Long version:
First,
The problem I am having is that the tweets come back unfinished, with "..." at the end.
From the Tweepy documentation on Extended Tweets, this is expected:
Compatibility mode
... It will also be discernible that the text attribute of the Status object is truncated as it will be suffixed with an ellipsis character, a space, and a shortened self-permalink URL to the Tweet.
Regarding:
And it looks like the documentation is out of date, because it says nothing about the 'tweet_mode' or the 'include_rts' parameter.
They haven't explicitly added it to the documentation of each method; however, they do specify that tweet_mode is accepted as a parameter:
Standard API methods
Any tweepy.API method that returns a Status object accepts a new tweet_mode parameter. Valid values for this parameter are compat and extended, which give compatibility mode and extended mode, respectively. The default mode (if no parameter is provided) is compatibility mode.
So without tweet_mode added to the call, you do get the tweets (with partial text), and with it, all you get is an empty list? If so, remove it and immediately retry; i.e., once you get an empty-list result, check whether you keep getting an empty list even when you change the params back to the ones that worked.
Based on bug #1329 - API.user_timeline sometimes returns an empty list - it appears to be a Rate Limiting issue:
Harmon758 commented on Feb 13
This API limitation would manifest itself as exactly the issue you're describing.
Even if it were working, the text is in the full_text attribute, not the usual text. So the line
tweets.append((tweet.text))
should be
tweets.append(tweet.full_text)
(and you can skip the extra enclosing ())
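Putting the two fixes together, here is a minimal sketch of the corrected loop, assuming the same authenticated api object and imports as in the question:

tweets = []
username = 'example'
count = 140  # number of tweets

# tweet_mode='extended' makes the API return the untruncated text,
# which is then exposed via the full_text attribute.
for tweet in api.user_timeline(id=username, count=count,
                               include_rts=False, tweet_mode='extended'):
    tweets.append(tweet.full_text)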
Btw, if you're not interested in retweets, see this example for the correct way to handle them:
Given an existing tweepy.API object and id for a Tweet, the following can be used to print the full text of the Tweet, or if it’s a Retweet, the full text of the Retweeted Tweet:
status = api.get_status(id, tweet_mode="extended")
try:
    print(status.retweeted_status.full_text)
except AttributeError:  # Not a Retweet
    print(status.full_text)
If status is a Retweet, status.full_text could be truncated.
As per the Twitter API v2:
tweet_mode does not work at all there. You need to add expansions=referenced_tweets.id. Then, in the response, look at includes: you can find all the truncated tweets as full tweets there. You will still see the truncated tweets in the response itself, but do not worry about that.
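For instance, with tweepy's v2 Client, something along these lines should work. This is a sketch, not a definitive implementation: bearer_token and user_id are placeholders, and the layout follows tweepy's Response object with its data and includes fields:

import tweepy

client = tweepy.Client(bearer_token)

# Ask for the referenced (retweeted/quoted) tweets to be delivered in full.
response = client.get_users_tweets(user_id, expansions=["referenced_tweets.id"])

# Untruncated versions of referenced tweets arrive under includes["tweets"].
full_text = {tweet.id: tweet.text for tweet in response.includes.get("tweets", [])}

for tweet in response.data:
    if tweet.referenced_tweets:
        # Prefer the full text of the referenced tweet over the truncated stub.
        print(full_text.get(tweet.referenced_tweets[0].id, tweet.text))
    else:
        print(tweet.text)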

Soundcloud API python issues with linked partitioning

On the SoundCloud API guide (https://developers.soundcloud.com/docs/api/guide#pagination), the example given for reading more than 100 pieces of data is as follows:
# get first 100 tracks
tracks = client.get('/tracks', order='created_at', limit=page_size)
for track in tracks:
    print track.title

# start paging through results, 100 at a time
tracks = client.get('/tracks', order='created_at', limit=page_size,
                    linked_partitioning=1)
for track in tracks:
    print track.title
I'm pretty certain this is wrong, as I found that tracks.collection needs referencing rather than just tracks. Based on the GitHub python soundcloud API wiki, it should look more like this:
tracks = client.get('/tracks', order='created_at', limit=10, linked_partitioning=1)
while tracks.collection != None:
    for track in tracks.collection:
        print(track.playback_count)
    tracks = tracks.GetNextPartition()
Here I have removed the indent from the last line (I think there is an error on the wiki: it is inside the for loop, which makes no sense to me). This works for the first loop. However, it doesn't work for successive pages because the GetNextPartition() function is not found. I've tried the last line as:
tracks = tracks.collection.GetNextPartition()
...but no success.
Maybe I'm getting versions mixed up? But I'm trying to run this with Python 3.4 after downloading the version from here: https://github.com/soundcloud/soundcloud-python
Any help much appreciated!
For anyone who cares, I found this solution on the SoundCloud developer forum. It is slightly modified from the original case (searching for tracks) to list my own followers. The trick is to call the client.get function repeatedly, passing the previously returned users.next_href as the request that points to the next page of results. Hooray!
pgsize = 200
c = 1
me = client.get('/me')

# first call to get a page of followers
users = client.get('/users/%d/followers' % me.id, limit=pgsize, order='id',
                   linked_partitioning=1)
for user in users.collection:
    print(c, user.username)
    c = c + 1

# linked_partitioning means .next_href exists
while users.next_href != None:
    # pass the contents of users.next_href that contains 'cursor=' to
    # locate the next page of results
    users = client.get(users.next_href, limit=pgsize, order='id',
                       linked_partitioning=1)
    for user in users.collection:
        print(c, user.username)
        c = c + 1

SoundCloud API - Playback Count is smaller than actual count

I am using the SoundCloud API through the Python SDK.
When I get the tracks data through a search, the track attribute playback_count seems to be smaller than the actual count seen on the web.
How can I avoid this problem and get the actual playback_count?
(For example, this track's playback_count gives me 2700, but it's actually 15k when displayed on the web: https://soundcloud.com/drumandbassarena/ltj-bukem-soundcrash-mix-march-2016)
Note: this problem does not occur for comments or likes.
Following is my code:
## Search ##
tracks = client.get('/tracks', q=querytext, created_at={'from': startdate},
                    duration={'from': startdur}, limit=200)
outputlist = []
trackinfo = {}
resultnum = 0
for t in tracks:
    trackinfo = {}
    resultnum += 1
    trackinfo["id"] = resultnum
    trackinfo["title"] = t.title
    trackinfo["username"] = t.user["username"]
    trackinfo["created_at"] = t.created_at[:-5]
    trackinfo["genre"] = t.genre
    trackinfo["plays"] = t.playback_count
    trackinfo["comments"] = t.comment_count
    trackinfo["likes"] = t.likes_count
    trackinfo["url"] = t.permalink_url
    outputlist.append(trackinfo)
There is an issue with the playback count being incorrect when reported via the API. I have encountered this when getting data via the /me endpoint, for activity and likes to mention a couple. (The original post included screenshots comparing the sound data shown for the currently playing track in the SoundCloud widget with the information returned via the API for the me/activities endpoint.)
Looking at the SoundCloud website, they actually call a second version of the API to populate the track list on the user page. It's similar to the documented version, but not quite the same.
If you issue a request to https://api-v2.soundcloud.com/stream/users/[userid]?limit=20&client_id=[clientid] then you'll get back a JSON object showing the same numbers you see on the web.
Since this is an undocumented version, I'm sure it'll change the next time they update their website.
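A quick way to try this from Python is with the requests library. A minimal sketch, assuming you have a user ID and client ID to hand, and bearing in mind the endpoint is undocumented and may change or be blocked at any time; the response shape ("collection" of items wrapping a "track") is an assumption based on how the v2 stream endpoints generally look, so inspect the raw JSON first:

import requests

user_id = 123456        # placeholder
client_id = "YOUR_ID"   # placeholder

# Undocumented v2 endpoint used by the SoundCloud website itself.
url = "https://api-v2.soundcloud.com/stream/users/%d" % user_id
resp = requests.get(url, params={"limit": 20, "client_id": client_id})
resp.raise_for_status()

for item in resp.json().get("collection", []):
    track = item.get("track", {})
    print(track.get("title"), track.get("playback_count"))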

Tracking keywords in a live stream of tweets

I installed and tried out tweepy. I am using the following function right now:
From the API Reference:
API.public_timeline()
Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.
However, I want to extract all tweets that match a certain regular expression from the complete live stream. I could put public_timeline() inside a while True loop, but that would probably run into problems with rate limiting. Either way, I don't really think it can cover all current tweets.
How could that be done? If not all tweets, then I want to extract as many tweets matching a certain keyword as possible.
The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:
def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = []
    for key in args:
        filters.append(str(key))
    if len(filters) == 0:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure. This bundles the original tweet. Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError, e:
        print 'Disconnected from Twitter at ' + time.strftime("%d %b %Y %H:%M:%S", time.localtime()) \
            + '. Reason: ', e.reason
I haven't looked in a while, but I'm pretty sure that this library is just accessing the sample stream (as opposed to the firehose). HTH.
Edit to add: you say you want the "complete live stream", aka the firehose. That's fiscally and technically expensive and only very large companies are allowed to have it. Look at the docs and you'll see that the sample is basically representative.
Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets that match those words are returned.
The streaming API rate limiting works differently: you get one connection per IP and a maximum number of events per second. If more events occur than that, you only get the maximum anyway, along with a notification of how many events you missed because of rate limiting.
My understanding is that the streaming API is most suitable for servers that will redistribute the content to your users as needed, instead of being accessed directly by your users - the standing connections are expensive and Twitter starts blacklisting IPs after too many failed connections and re-connections, and possibly your API key afterwards.
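Since the question mentions tweepy, here is a minimal sketch of keyword tracking with tweepy's streaming support from the same era (tweepy 3.x; StreamListener was later removed in v4, so treat the exact class names as version-dependent, and the credentials below as placeholders):

import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

class KeywordListener(tweepy.StreamListener):
    def on_status(self, status):
        # Called once per tweet matching the tracked keywords.
        print(status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream (420 = rate limited).
        if status_code == 420:
            return False

stream = tweepy.Stream(auth=auth, listener=KeywordListener())
# Only tweets containing these keywords are delivered.
stream.filter(track=['kitten'])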
