I have academic research access to the Twitter API and have been using the Twarc Python library to scrape tweets.
For actual tweet scraping it works really well. However, when scraping the followers of accounts it seems incredibly slow.
My understanding is that the rate limit is 15 queries every 15 minutes, with up to 1,000 followers pulled per query, which should allow a maximum of around 60,000 followers per hour. However, the actual speeds seem much lower.
For instance, scraping the followers of an account with just under 15,000 followers took 5 hours yesterday (instead of the best-case 15 minutes).
Below is the code I have been using for this.
Is there anything with my code that may be causing the slow speeds? Is there a better (faster) way to pull the followers of accounts using Python and the Twitter API?
Thanks to anyone able to help.
from twarc.client2 import Twarc2
from twarc.expansions import ensure_flattened
twarc = Twarc2(#VARIOUS API LOGIN INFO HERE#)
search = twarc.followers("TwitterAccountToScrape",max_results=50,user_fields=['id'])
followers = []
for page in search:
    for follower in ensure_flattened(page):
        followers.append(follower)
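For comparison, here is a minimal sketch of the same pull with the page size raised to the v2 followers endpoint's documented maximum of 1,000 users per request (assuming your twarc version's followers() accepts max_results, as recent versions do; the bearer token and account name are placeholders). With 15 requests per 15-minute window, the page size bounds how many followers each window can return, so 50 per page caps you at roughly 750 followers per window.

from twarc.client2 import Twarc2
from twarc.expansions import ensure_flattened

twarc = Twarc2(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credentials

followers = []
# Request the per-page maximum so each rate-limited call returns as many
# followers as possible (up to 1,000 instead of 50).
for page in twarc.followers("TwitterAccountToScrape",
                            max_results=1000,
                            user_fields=["id"]):
    followers.extend(ensure_flattened(page))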
Related
I am working on a project for which I want to extract the timelines of around 500 different Twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once; no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets to be retrieved, when I use the basic UserTimeline method of the R twitteR package, I only seem to fetch about 20 tweets per attempt (for users with significantly more recent tweets than that). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting, is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API/twitteR package to extract tweets using a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python).
It looks like the twitteR documentation suggests using the maxID argument for pagination. So when you get the first batch of results, you could use the minimum ID in that set minus one as the maxID for the next request, until you get no more results back (meaning you've gotten to the beginning of a user's timeline).
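Since the question mentions basic Python skills, here is a rough sketch of the same max_id pagination idea using Tweepy against the v1.1 user_timeline endpoint (assumes Tweepy 3.x and an already-authenticated api object; the helper name is illustrative):

import tweepy

def fetch_full_timeline(api, screen_name):
    """Page backwards through a user's timeline with max_id (up to ~3,200 tweets)."""
    all_tweets = []
    max_id = None
    while True:
        params = {"screen_name": screen_name, "count": 200}  # 200 is the per-request maximum
        if max_id is not None:
            params["max_id"] = max_id
        batch = api.user_timeline(**params)
        if not batch:
            break  # reached the start of what the API will return
        all_tweets.extend(batch)
        # Next page: everything strictly older than the oldest tweet seen so far.
        max_id = min(tweet.id for tweet in batch) - 1
    return all_tweets

Creating the api object with tweepy.API(auth, wait_on_rate_limit=True) makes Tweepy sleep through the rate-limit windows automatically, which covers the iterating-over-500-users part without manual bookkeeping.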
I'm doing research on user social relations on Twitter, in Python.
The problem is: what is the fastest way to crawl the follower information of a certain user's followers?
I searched a lot and am currently using Tweepy:
import time
import tweepy

followers_ids_list = []
c = tweepy.Cursor(api.followers_ids, id=centre, count=5000).items()
while True:
    try:
        followers_ids_list.append(c.next())
    except tweepy.TweepError:
        # hit rate limit, sleep for 15 minutes
        time.sleep(15 * 60 + 15)
        continue
    except StopIteration:
        # no more follower IDs to fetch
        break
After that I use /users/lookup to get the User() objects for the IDs gathered above.
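Roughly, that lookup step looks like the following sketch (assuming Tweepy 3.x, where users/lookup accepts up to 100 IDs per call):

# Hydrate the collected IDs into User objects, 100 at a time
# (the per-request maximum for users/lookup).
users = []
for i in range(0, len(followers_ids_list), 100):
    batch = followers_ids_list[i:i + 100]
    users.extend(api.lookup_users(user_ids=batch))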
However, this approach is quite slow... I was wondering if there is any faster way than what I am currently doing.
I want to map user relations, which means followers at depth 2 are not enough.
Say I have 100 followers, and each of those 100 followers has 200 followers of their own; then the time needed to grab this social network (depth = 3) would be:
(1 + 100 + 100*200) calls / 15 calls per 15-minute window * 15 min / 60 min per hour ≈ 335 hours ≈ about 14 days!
1 call: request my follower IDs (100 IDs)
100 calls: request each of those 100 followers' follower IDs (100*200 IDs)
at least 100*200 calls: request the follower IDs of each of those 100*200 followers' followers
The only alternative I can think of is to crawl the twitter.com website without the API (but I figure that would get my IP or account banned by Twitter...).
The API Limits prevent you from going any faster.
You could set up multiple apps and distribute the problem through them - but that's likely to get noticed by Twitter if they're all running from the same IP address.
You can never do this quickly with the Twitter API because of the 15-minute rate-limit window.
I'm also doing some work related to one author's followers. However, I need millions of followers' names, which is even worse.
My solution was to write my own crawler, and it does work faster than the API: it can crawl about 100*1000 records per night (tested on my local machine). That rate is still lower than I expected, so I have to think about other ways to increase its speed.
Hope this could give you some inspirations.
I want to collect data from Twitter using the Python Tweepy library.
I surveyed the rate limits for the Twitter API, which is 180 requests per 15-minute window.
What I want to know is how much data I can get for one specific keyword. Put another way, when I use tweepy.Cursor, when will it stop?
I'm not asking about the theoretical calculation (100 count * 180 requests * 4 times/hour, etc.) but about real experience. I found the following claim:
"With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. "
http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
Is this correct (if so, I only need to run the program for 5 minutes or so)? Or do I need to keep fetching as many tweets as there are (which may keep the program running for a very long time)?
You will definitely not be getting as many tweets as exist. The way Twitter limits how far back you can go (and therefore how many tweets are available) is with a minimum since_id parameter passed to the GET search/tweets call to the Twitter API. In Tweepy, the API.search function interfaces with the Twitter API. Twitter's GET search/tweets documentation has a lot of good info:
There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occured since the since_id, the since_id will be forced to the oldest ID available.
In practical terms, Tweepy's API.search should not take long to get all the available tweets. Note that not all tweets are available via the Twitter API, but I've never had a search take more than 10 minutes.
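As a rough sketch of what that looks like in practice (assuming Tweepy 3.x; credentials and the keyword are placeholders):

import tweepy

# Placeholder credentials; wait_on_rate_limit makes Tweepy sleep through the
# 180-requests-per-15-minute search window instead of raising an error.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# count=100 is the per-request maximum for search/tweets; the cursor stops on
# its own once Twitter has no older results left to return for the keyword.
tweets = [status for status in tweepy.Cursor(api.search, q="your_keyword", count=100).items()]
print(len(tweets), "tweets retrieved")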
I am building a project in Python that needs to scrape huge amounts of Twitter data: something like 1 million users and all their tweets.
Previously I have used Tweepy and Twython, but hit Twitter's limits very quickly.
How do sentiment analysis companies and the like get their data? How do they get all those tweets? Do you buy this somewhere, or build something that rotates through different proxies?
How do companies like Infochimps, with for example Trst rank, get all their data?
* http://www.infochimps.com/datasets/twitter-census-trst-rank
If you want the latest tweets from specific users, Twitter offers the Streaming API.
The Streaming API is the real-time sample of the Twitter Firehose. This API is for those developers with data intensive needs. If you're looking to build a data mining product or are interested in analytics research, the Streaming API is most suited for such things.
If you're trying to access old information, the REST API with its severe request limits is the only way to go.
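For example, a minimal streaming sketch with Tweepy (assuming a Tweepy 3.x-style StreamListener; the credentials and user IDs are placeholders):

import tweepy

class TweetCollector(tweepy.StreamListener):
    def on_status(self, status):
        # Each status arrives in real time as the followed users tweet.
        print(status.id_str, status.text)

    def on_error(self, status_code):
        if status_code == 420:
            return False  # disconnect when rate limited rather than hammering the endpoint

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=TweetCollector())
stream.filter(follow=["12345678", "87654321"])  # numeric user ID strings to track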
I don't know if this will work for what you're trying to do, but the Tweets2011 dataset was recently released.
From the description:
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere, i.e. both important and spam tweets are included.
I am looking to create a simple graph showing 2 numbers over time for my personal Twitter account. They are:
Number of followers per day
Number of mentions per day
From my research so far, the search API does not provide a date, so I am not able to do a GROUP BY. The only way I can get access to dates is through the OAuth API, but that requires interaction from the end user, which I am trying to avoid.
Can someone point me in the right direction in order to achieve this? Thanks.
The best way is to use a cron job to record the data daily.
However, you can query the mentions using the search API with an until parameter, which should do the trick.
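A rough sketch of the daily recorder half of that, meant to be scheduled once a day via cron (assuming Tweepy 3.x; credentials, screen name, and the CSV path are placeholders):

import csv
import datetime
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

today = datetime.date.today()

# Current follower count for the account.
followers_count = api.get_user(screen_name="your_screen_name").followers_count

# Count today's mentions from the most recent 200 the API returns per call
# (UTC vs. local time differences ignored for simplicity).
mentions = api.mentions_timeline(count=200)
mentions_today = sum(1 for m in mentions if m.created_at.date() == today)

# Append one row per day; plotting the CSV later gives both graphs.
with open("twitter_daily_stats.csv", "a", newline="") as f:
    csv.writer(f).writerow([today.isoformat(), followers_count, mentions_today])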
Although we can use the search API to fetch mentions, there is a limit on it.
At any given point in time you can only fetch 200 mentions.
Does anyone know how to get the total mentions count?