I am building a project in Python that needs to scrape huge amounts of Twitter data. Something like 1 million users and all their tweets need to be scraped.
Previously I have used Tweepy and Twython, but hit Twitter's rate limits very fast.
How do sentiment analysis companies and the like get their data? How do they get all those tweets? Do you buy this somewhere, or build something that rotates through different proxies?
How do companies like Infochimps, with for example Trst Rank, get all their data?
* http://www.infochimps.com/datasets/twitter-census-trst-rank
If you want the latest tweets from specific users, Twitter offers the Streaming API.
The Streaming API is the real-time sample of the Twitter Firehose. This API is for developers with data-intensive needs. If you're looking to build a data mining product or are interested in analytics research, the Streaming API is best suited for that.
If you're trying to access old information, the REST API with its severe request limits is the only way to go.
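If you do go with the Streaming API through Tweepy, a minimal sketch is below. It assumes the older Tweepy 3.x StreamListener interface; the credentials, output file name, and user ID are placeholders, so treat it as an outline rather than a drop-in solution.

```python
import tweepy

# Placeholder credentials -- substitute your own application's keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

class SaveListener(tweepy.StreamListener):
    """Append each incoming tweet to a file as it arrives."""
    def on_status(self, status):
        with open("tweets.txt", "a") as f:
            f.write(status.id_str + "\t" + status.text.replace("\n", " ") + "\n")

    def on_error(self, status_code):
        # Returning False on a 420 disconnects instead of retrying aggressively.
        if status_code == 420:
            return False

# follow= takes user ID strings; track= takes keywords or phrases.
stream = tweepy.Stream(auth=auth, listener=SaveListener())
stream.filter(follow=["12345678"])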
I don't know if this will work for what you're trying to do, but the Tweets2011 dataset was recently released.
From the description:
As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere - i.e. both important and spam tweets are included.
I am working on a project using social media data. An important part of it is to understand the spatial distribution of the social data, such as tweets. I used GetOldTweets3 to scrape tweets, but all of them are devoid of geo data. I just wonder if there is a way to get the spatial features of tweets, or whether there are other tools to scrape other social media data with geospatial features.
First of all, the library you are using is not accessing the Twitter API; you probably know this already. It's not official and it could stop working at any moment.
Secondly, and more importantly for your question, only about 1% of tweets in the API are geo-pinned. There's no special way around this; it is up to each user to enable geo on their tweets.
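If you do want to keep whatever location information the official API exposes, one rough Tweepy sketch (the credentials and search keyword are placeholders) is to check each status's coordinates and place fields:

```python
import tweepy

# Placeholder credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

geotagged = []
for status in tweepy.Cursor(api.search, q="some keyword", count=100).items(1000):
    if status.coordinates is not None:
        # Exact GeoJSON point: [longitude, latitude]. Rare (roughly 1% of tweets).
        geotagged.append((status.id_str, status.coordinates["coordinates"]))
    elif status.place is not None:
        # Coarser "place" (city or neighbourhood) attached by the user.
        geotagged.append((status.id_str, status.place.full_name))

print(len(geotagged), "of 1000 tweets carried any location information")
```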
Does this answer your question?
I am doing a project on aspect-based sentiment analysis of tweets from Twitter. I am using Tweepy to fetch the tweets. I noticed that I only get a few tweets. What is the maximum number of tweets I can get using Tweepy when a product name is searched? Is there any way I can increase this number? This is the code I am referring to: https://pastebin.com/vqbkFhtf
The number of tweets will depend on how many people are talking about the product.
Also, I would suggest you use the Twitter Streaming API so that you get the tweets as they appear.
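For the search route you are already on, you can at least request the maximum page size and let the cursor run until the standard search index (roughly the last week of tweets) is exhausted. A sketch, with placeholder credentials and product name:

```python
import tweepy

# Placeholder credentials and product name.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = []
# count=100 is the maximum per request on the standard search endpoint;
# the Cursor keeps paging until no older results are returned.
for status in tweepy.Cursor(api.search, q="product name", lang="en",
                            count=100, tweet_mode="extended").items():
    tweets.append(status.full_text)

print(len(tweets), "tweets collected")
```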
I am working on a project for which I want to extract the timelines of around 500 different Twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once; no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets of a timeline to be retrieved, when I use the basic userTimeline function of the R twitteR package I only seem to fetch about 20 each time I try (for users with significantly more recent tweets). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting; is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API/twitteR package to extract tweets using a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python).
It looks like the twitteR documentation suggests using the maxID argument for pagination. So when you get the first batch of results, you could use the minimum ID in that set minus one as the maxID for the next request, until you get no more results back (meaning you've gotten to the beginning of a user's timeline).
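Since you mention basic Python skills: the same max_id idea in Python with Tweepy would look roughly like the sketch below (credentials and the screen name are placeholders). It pages backwards with max_id until Twitter stops returning tweets, which in practice is the roughly 3,200-tweet cap.

```python
import tweepy

# Placeholder credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def fetch_timeline(screen_name):
    """Page backwards through a user's timeline until no more tweets are returned."""
    all_tweets = []
    max_id = None
    while True:
        kwargs = {"screen_name": screen_name, "count": 200, "include_rts": True}
        if max_id is not None:
            kwargs["max_id"] = max_id
        batch = api.user_timeline(**kwargs)
        if not batch:
            break
        all_tweets.extend(batch)
        # Next page: everything strictly older than the oldest tweet just seen.
        max_id = min(tweet.id for tweet in batch) - 1
    return all_tweets

timeline = fetch_timeline("some_user")
print(len(timeline), "tweets retrieved")
```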
I want to collect data from Twitter using the Python Tweepy library.
I surveyed the rate limits for the Twitter API, which are 180 requests per 15-minute window.
What I want to know is how much data I can get for one specific keyword. Put another way, when I use Tweepy.Cursor, when will it stop?
I am not asking about the maths (100 count * 180 requests * 4 times/hour, etc.) but about real experience. I found the following claim:
"With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. "
http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
Is this correct (if so, I only need to run the program for about 5 minutes)? Or do I need to keep fetching as many tweets as there are (which may keep the program running for a very long time)?
You will definitely not be getting as many tweets as exist. The way Twitter limits how far back you can go (and therefore how many tweets are available) is with a minimum since_id parameter passed to the GET search/tweets call to the Twitter API. In Tweepy, the API.search function interfaces with the Twitter API. Twitter's GET search/tweets documentation has a lot of good info:
There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
In practical terms, Tweepy's API.search should not take long to get all the available tweets. Note that not all tweets are available per the Twitter API, but I've never had a search take more than 10 minutes.
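As a concrete illustration of the since_id behaviour quoted above, a rough polling sketch with Tweepy (placeholder credentials and keyword) is shown below; each pass only asks for tweets newer than the newest ID already seen:

```python
import time
import tweepy

# Placeholder credentials and keyword.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

since_id = None
while True:
    # Ask only for tweets newer than the newest ID seen so far. If more
    # tweets matched than the search index retains, Twitter silently moves
    # since_id forward to the oldest ID still available.
    results = api.search(q="some keyword", count=100, since_id=since_id)
    if results:
        since_id = max(status.id for status in results)
        for status in results:
            print(status.id, status.created_at, status.text)
    time.sleep(60)  # one request a minute stays well inside 180 per 15 minutes
```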
I am at an absolutely basic level in Python 3 (everything I currently know is from TheMonkeyLords) and my main focus is to integrate Python 3 with XBRLware so that I can extract financial information from the SEC EDGAR database with accuracy and reliability.
How can I use the xbrlware framework with Python 3? I have absolutely no idea how to use such a framework with Python 3.
Any suggestions on what I should learn, code for me to study, clues, etc. would be a great help!
Thank you
Don't do it. Based on personal experience, it is very difficult to extract useful financial data from XBRL. XBRLWare does work, but there is a lot of work to do afterwards to extract the data into something useful.
XBRL has over 100 definitions of "revenue". Each industry reports differently. Each company makes 100s of filings and you have to string together data from different reports. It's an incredibly frustrating process.
I have used XBRLWare as a Ruby Gem on Windows. (It is no longer supported.) It does "work". It downloads and formats the reports nicely, but it operates as a viewer. Most filings contain two quarters of data. (Probably not the quarters you want either.)
You can use the XBRL viewer on the SEC's website to accomplish the same thing. Or you can go to the company's 10-Qs.
Also, XBRL uses CIK codes for the companies. As far as I know, the SEC doesn't have a central database to match CIK codes to ticker symbols (if you can believe that!). So it can be frustrating to find the companies you want to download.
If you want to download all the XBRL filings, I've been told it's something like 6 TB a month.
You can't pull historical financial data from XBRL either; you have to string two quarters at a time together. So, to build a history for IBM, you would pull every IBM filing and string together all the 10-Qs. And XBRL is only about three years old for the large accelerated filers, so historical data is limited.
There is a reason why Wall Street still charges $25k/year for financial data. XBRL is very difficult to use and difficult to extract data from.
You could try: XBRLcloud.com or findynamics.com