I'm doing research on users' social relations on Twitter, in Python.
My question is: what is the fastest way to crawl the followers of a given user's followers?
I have searched a lot and am currently using Tweepy:
c = tweepy.Cursor(api.followers_ids, id=centre, count=5000).items()
while True:
    try:
        followers_ids_list.append(c.next())
    except tweepy.TweepError:
        # hit rate limit, sleep for 15 minutes
        time.sleep(15 * 60 + 15)
        continue
    except StopIteration:
        # no more follower ids to fetch
        break
After that I use /users/lookup to fetch the User objects for the ids collected above.
However, this approach is quite slow. Is there a faster way than what I am doing now?
I want to map user relations, so followers at depth 2 are not enough.
Say I have 100 followers, and each of those 100 followers has 200 followers of their own. The time needed to grab this social network (depth=3) would be:
(1 + 100 + 100*200) calls / 15 calls per window * 15 min / 60 min = about 335 hours = about 14 days!
1 call: request my follower ids (100 ids)
100 calls: request the follower ids of my 100 followers (100*200 ids)
100*200 calls (at least): request the follower ids of those 100*200 followers' followers.
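The estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name `crawl_hours` and its parameters are made up for illustration, and it assumes every account at a given depth has the same number of followers and that one call returns all of an account's follower ids.

```python
def crawl_hours(fanouts, calls_per_window=15, window_minutes=15):
    """Estimate calls and hours needed to fetch follower ids level by level.

    fanouts[i] is the assumed number of followers per account at depth i.
    One follower-ids call is made per account visited.
    """
    calls = 0
    accounts = 1  # start from the single centre account
    for fanout in fanouts:
        calls += accounts   # one call per account at this level
        accounts *= fanout  # accounts discovered for the next level
    calls += accounts       # one more call per account at the last level
    windows = calls / calls_per_window
    return calls, windows * window_minutes / 60


calls, hours = crawl_hours([100, 200])
print(calls, round(hours), round(hours / 24))  # 20101 335 14
```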
The only alternative I can think of is to crawl the twitter.com website without the API (but I suspect this would get my IP or account banned from Twitter).
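For the /users/lookup step mentioned in the question, the endpoint accepts up to 100 user ids per request, so the collected ids can be batched rather than looked up one at a time. A minimal sketch of the batching; the helper name `chunked` and the placeholder ids are illustrative only.

```python
def chunked(ids, size=100):
    """Yield successive lists of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


follower_ids = list(range(1, 251))  # placeholder ids for illustration
batches = list(chunked(follower_ids))
print([len(batch) for batch in batches])  # [100, 100, 50]
```

Each batch would then go into a single lookup call (in Tweepy, `api.lookup_users`; the keyword argument name differs between Tweepy versions, so check the one you have installed).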
The API rate limits prevent you from going any faster.
You could set up multiple apps and distribute the problem through them - but that's likely to get noticed by Twitter if they're all running from the same IP address.
You cannot do this quickly with the Twitter API because of the 15-minute rate-limit window.
I'm also working on one author's followers. However, I need millions of followers' names, which is even worse.
My solution was to write my own crawler, and it does work faster than the API. It can crawl 100*1000 per night (tested on my local machine). This rate is lower than I expected, so I have to think of other ways to increase its speed.
I hope this gives you some ideas.
I have academic research access to the Twitter API and have been using the Twarc Python library to scrape tweets.
For actual tweet scraping it works really well. However, when scraping the followers of accounts it seems incredibly slow.
My understanding is the rate limit for queries is 15 queries every 15 minutes with 1,000 followers pulled per query. That should lead to a maximum of around 60,000 followers pulled in an hour. However, the actual speeds seem much lower.
For instance, scraping followers for an account with just under 15,000 followers took 5 hours yesterday (instead of the best scenario 15 minutes).
Below is the code I have been using for this.
Is there anything in my code that may be causing the slow speeds? Is there a better (faster) way to pull the followers of accounts using Python and the Twitter API?
Thanks to anyone able to help.
from twarc.client2 import Twarc2
from twarc.expansions import ensure_flattened

twarc = Twarc2(...)  # various API login info here
search = twarc.followers("TwitterAccountToScrape", max_results=50, user_fields=['id'])
followers = []
for page in search:
    for follower in ensure_flattened(page):
        followers.append(follower)
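A rough check of the numbers, assuming the limit of 15 requests per 15-minute window quoted above: with `max_results=50`, 15,000 followers take 300 requests, which is 20 windows, or about the 5 hours observed. The sketch below compares that with a page size of 1,000. The helper name `pull_minutes` is made up for illustration, and it counts whole windows consumed, so it is an upper bound rather than a measurement.

```python
import math


def pull_minutes(total_followers, page_size, requests_per_window=15, window_minutes=15):
    """Return (requests, minutes of rate-limit windows) to page through all followers."""
    requests = math.ceil(total_followers / page_size)
    windows = math.ceil(requests / requests_per_window)
    return requests, windows * window_minutes


print(pull_minutes(15000, 50))    # (300, 300) -> 300 minutes = 5 hours
print(pull_minutes(15000, 1000))  # (15, 15)
```

If the page size is indeed the cause, raising `max_results` toward the 1,000 per request mentioned above would be worth trying.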
I am trying to run the following code to search for two pieces of text in tweets:
search_words = ['Samsung', 'Amazon']
tweets = []
for tweet in tweepy.Cursor(api.search, q=search_words).items():
    tweets.append(tweet.text)
But this keeps returning:
TweepError: Twitter error response: status code = 429
According to the documentation, this means that I am exceeding the rate limit:
Rate limiting of the API is primarily on a per-user basis — or more accurately described, per user access token. If a method allows for 15 requests per rate limit window, then it allows 15 requests per window per access token.
Rate limits are divided into 15 minute intervals. All endpoints require authentication, so there is no concept of unauthenticated calls and rate limits.
Is my search term too broad? Even when I wait for 15 minutes, I still get the same error when I re-run the script, and even when I try to narrow the words used (just to test).
As a related aside: how many tweets, and how far back in time, will api.search return?
EDIT: Looking into this further, I think that the Cursor loop is making me hit the limit (180 per 15 minutes) after the 180th loop. Is there a more efficient way of searching all tweets in one block rather than having to iterate through?
api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
You can try this and see if it helps. With these options Tweepy waits automatically when the rate limit is reached instead of raising an error.
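The same wait-and-retry idea can also be sketched by hand, which may help on a client that has no such option. Everything named here is an assumption for illustration: `RateLimitError` is a stand-in exception defined in the snippet (not Tweepy's class), and `limit_handled` and `FlakySource` are hypothetical helpers. The sleep function is injectable so the demo does not actually wait 15 minutes.

```python
import time


class RateLimitError(Exception):
    """Stand-in for whatever exception the client raises on HTTP 429."""


def limit_handled(iterator, wait_seconds=15 * 60, sleep=time.sleep):
    """Yield items from `iterator`, pausing and retrying on RateLimitError."""
    while True:
        try:
            yield next(iterator)
        except RateLimitError:
            sleep(wait_seconds)  # wait out the window, then retry
        except StopIteration:
            return  # source exhausted


class FlakySource:
    """Fake paginated source that hits the rate limit once (demo only)."""

    def __init__(self, items, fail_at):
        self.items, self.fail_at = items, fail_at
        self.pos, self.failed = 0, False

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos == self.fail_at and not self.failed:
            self.failed = True
            raise RateLimitError()
        if self.pos >= len(self.items):
            raise StopIteration
        self.pos += 1
        return self.items[self.pos - 1]


waits = []  # records requested sleeps instead of really sleeping
result = list(limit_handled(FlakySource(["a", "b", "c"], fail_at=2), sleep=waits.append))
print(result, waits)  # ['a', 'b', 'c'] [900]
```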
OK, so I'm really new to Python and I am trying to create a script to assist me in marketing my music via social media. I want it to compare another user's followers with my followers and, if I am not following one of their followers, automatically follow them. Here is what I have:
import twitter
import time

now = time.time
username = raw_input("whos followers")
api = twitter.Api(...)
friendslist = api.GetFollowersPaged(screen_name=username, count=1,)
myfollowers = api.GetFollowersPaged(user_id=821151801785405441, count=1)
for u in friendslist:
    if u not in myfollowers:
        api.CreateFriendship(u.friendslist)
        print 'you followed new people'
        time.sleep(15)
I am using Python 2.7 and the python-twitter API wrapper. My error seems to start at the api.CreateFriendship line. Also, I set the count to 1 to try to avoid rate limiting, but I have had it as high as 150 (200 being the max).
The Twitter API has fairly subjective controls in place for write operations. There are daily follow limits, and they are designed to limit exactly the sort of thing you are doing.
See https://support.twitter.com/articles/15364
If you do reach a limit, we'll let you know with an error message
telling you which limit you've hit. For limits that are time-based
(like the direct messages, Tweets, changes to account email, and API
request limits), you'll be able to try again after the time limit has
elapsed.
I want to collect data from Twitter using the Python Tweepy library.
I looked up the rate limit for the Twitter API, which is 180 requests per 15 minutes.
What I want to know is how much data I can get for one specific keyword. Put another way: when I use tweepy.Cursor, when will it stop?
I am not asking about the calculated maximum (100 count * 180 requests * 4 times/hour, etc.) but about real experience. I found the following claim:
"With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. "
http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
Is this correct? (If so, I only need to run the program for 5 minutes or so.) Or do I need to keep fetching as many tweets as exist (which may keep the program running for a very long time)?
You will definitely not be getting as many tweets as exist. The way Twitter limits how far back you can go (and therefore how many tweets are available) is with a minimum since_id parameter passed to the GET search/tweets call to the Twitter API. In Tweepy, the API.search function interfaces with the Twitter API. Twitter's GET search/tweets documentation has a lot of good info:
There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occured since the since_id, the since_id will be forced to the oldest ID available.
In practical terms, Tweepy's API.search should not take long to get all the available tweets. Note that not all tweets are available per the Twitter API, but I've never had a search take up more than 10 minutes.
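For reference, the arithmetic ceiling mentioned in the question (100 tweets per request, 180 requests per 15-minute window) works out as below. These constants simply restate the figures from the question; the result is an upper bound on throughput, not the number of tweets the search will actually return.

```python
TWEETS_PER_REQUEST = 100   # `count` value quoted in the question
REQUESTS_PER_WINDOW = 180  # rate limit quoted in the question
WINDOWS_PER_HOUR = 4       # 15-minute windows per hour

per_window = TWEETS_PER_REQUEST * REQUESTS_PER_WINDOW
per_hour = per_window * WINDOWS_PER_HOUR
print(per_window, per_hour)  # 18000 72000
```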
I am trying to retrieve a user's friend network using the python-twitter API. I am using the GetFriendIDs() method, which retrieves the ids of all the accounts a particular Twitter user is following. The following is a small snippet of my test code:
for item in IdList:
    aDict[item] = api.GetFriendIDs(user_id=item,count=4999)
    print "sleeping 60"
    time.sleep(66)
    print str(api.MaximumHitFrequency())+" The maximum hit frequency"
    print api.GetRateLimitStatus()['resources']['friends']['/friends/ids']['remaining']
There are 35 ids (of Twitter user accounts) in IdList, and for each item I retrieve up to 4999 ids that the user with id 'item' is following. I am aware of Twitter's new rate limiting, in which the rate-limit window changed from 60 minutes to 15 minutes, and of the advice not to make more than one request to the server per minute (api.MaximumHitFrequency()). So basically 15 requests in 15 minutes. That is exactly what I'm doing; in fact, I make a request every 66 seconds rather than every 60, yet I get a rate-limit error after 6 requests. I cannot figure out why this is happening. Please let me know if anyone else has had this problem.
Have a look at https://github.com/bear/python-twitter/wiki/Rate-Limited-API---How-to-deal-with.
Also, it might help to use a newer version of the python-twitter code. The MaximumHitFrequency and GetRateLimitStatus methods have been modified with https://github.com/bear/python-twitter/commit/25cccb81fbeb4c630a0024981bc98f7fb41f3933.
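One way to avoid guessing a fixed sleep is to read the endpoint's entry from the rate-limit status (the same dictionary indexed in the question) and wait until its reset time when no calls remain. A minimal sketch, assuming the entry carries `remaining` and `reset` (epoch seconds) fields as in Twitter's v1.1 rate-limit status response; `seconds_to_wait` is a hypothetical helper, not part of python-twitter, and the sample values are made up.

```python
import time


def seconds_to_wait(endpoint_status, now=None):
    """Return how long to sleep before the next call to this endpoint.

    endpoint_status: dict with 'remaining' (calls left in the window)
    and 'reset' (epoch seconds at which the window resets).
    """
    if now is None:
        now = time.time()
    if endpoint_status["remaining"] > 0:
        return 0.0  # calls are still available, no need to wait
    return max(0.0, float(endpoint_status["reset"] - now))


# Example with made-up values
print(seconds_to_wait({"remaining": 3, "reset": 1000}, now=400))  # 0.0
print(seconds_to_wait({"remaining": 0, "reset": 1000}, now=400))  # 600.0
```

In the loop from the question, the argument would be `api.GetRateLimitStatus()['resources']['friends']['/friends/ids']`.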