Tweets scraping - how to measure tweeting intensity? - python

I am looking for a method to get information about a "trend" regarding some hashtag/keyword on Twitter. Let's say I want to measure how often the hashtag/keyword "Python" is tweeted over time. For instance, today "Python" is tweeted on average every 1 minute, but yesterday it was tweeted on average every 2 minutes.
I have tried various options, but I keep bouncing off the Twitter API limitations, i.e. if I try to download all tweets for a hashtag during the last (for example) day, only a certain fraction of the tweets is downloaded (via tweepy.Cursor).
Do you have any ideas / script examples for achieving similar results? Libraries or guides to recommend? I did not find any help searching the internet. Thank you.

You should check out the twint repository.
Can fetch almost all Tweets (the Twitter API limits you to the last 3200 Tweets only);
Fast initial setup;
Can be used anonymously and without a Twitter sign-up.
Here is some sample code:
import twint

def scrapeData(search):
    c = twint.Config()
    c.Search = search
    c.Since = '2021-03-05 00:00:00'
    c.Until = '2021-03-06 00:00:00'
    c.Pandas = True
    c.Store_csv = True
    c.Hide_output = True
    c.Output = f'{search}.csv'
    c.Limit = 10  # number of tweets to fetch

    print(f"\n#### Scraping from {c.Since} to {c.Until}")
    twint.run.Search(c)

    print("\n#### Preview: ")
    print(twint.storage.panda.Tweets_df.head())

if __name__ == "__main__":
    scrapeData(search="python")
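To get back to the original question about tweeting intensity, you could then aggregate the timestamps that twint writes out. A rough sketch, assuming the resulting CSV has a date column with tweet timestamps (which twint normally writes when Store_csv is enabled):

import pandas as pd

# Rough sketch: turn the scraped timestamps into an intensity measure.
# Assumes 'python.csv' was produced by the scrapeData() call above and has a 'date' column.
df = pd.read_csv('python.csv')
df['date'] = pd.to_datetime(df['date'])

# Tweets per hour
per_hour = df.set_index('date').resample('1H').size()
print(per_hour)

# Average number of seconds between consecutive tweets
gaps = df['date'].sort_values().diff().dt.total_seconds()
print('average seconds between tweets:', gaps.mean())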

Try a library called:
GetOldTweets or GetOldTweets3
Twitter Search, and by extension its API, is not meant to be an exhaustive source of tweets. The standard Twitter Search API places a limit of just one week on how far back tweets matching the input parameters can be extracted. So in order to extract all historical tweets relevant to a set of search parameters for analysis, the official Twitter API needs to be bypassed and custom libraries that mimic the Twitter search engine need to be used.
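For reference, a rough sketch of how GetOldTweets3 was typically used; since the library mimics the web search rather than the official API, it tends to break whenever Twitter changes its frontend, so treat this as illustrative only:

import GetOldTweets3 as got

# Rough sketch of typical GetOldTweets3 usage; may no longer work if Twitter
# has changed the endpoints the library scrapes.
criteria = (got.manager.TweetCriteria()
            .setQuerySearch('python')
            .setSince('2019-01-01')
            .setUntil('2019-02-01')
            .setMaxTweets(100))

for tweet in got.manager.TweetManager.getTweets(criteria):
    print(tweet.date, tweet.text)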

Related

Any ways to retrieve tweets based on a hashtag for a time frame of more than a year?

I'm looking for ways to retrieve tweets from Twitter which contain certain hashtags.
I tried to use the official API and the tweepy package in Python, but even with academic access I was only able to retrieve tweets that are 7 days old. I want to retrieve tweets from 2019 till 2020, but I'm not able to do so with tweepy.
I tried the packages GetOldTweets3 and twint, but none of them seem to work due to some changes Twitter made last year.
Can someone suggest a way to get old tweets with certain hashtags? Thanks in advance for any help or suggestions.
If you have academic access, you are able to use the full archive search API available in the Twitter API v2. Tweepy has support for this via the tweepy.Client class. There's a full tutorial on DEV, but the code will be something like this:
import tweepy

client = tweepy.Client(bearer_token='REPLACE_ME')

# Replace with your own search query
query = 'from:andypiper -is:retweet'

tweets = client.search_all_tweets(query=query, tweet_fields=['context_annotations', 'created_at'], max_results=100)

for tweet in tweets.data:
    print(tweet.text)
    if len(tweet.context_annotations) > 0:
        print(tweet.context_annotations)
You can use the start_time and end_time search parameters to specify the date range, for example:
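A minimal sketch, reusing the client from above and assuming academic access (the query and dates are just placeholders):

# Minimal sketch: restrict the full-archive search to a date range.
# Assumes the `client` from above, authenticated with an academic-access bearer token.
tweets = client.search_all_tweets(
    query='#python -is:retweet',
    start_time='2019-01-01T00:00:00Z',
    end_time='2020-12-31T23:59:59Z',
    tweet_fields=['created_at'],
    max_results=100,
)
for tweet in tweets.data:
    print(tweet.created_at, tweet.text)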

Understanding Twitter Premium API Sandbox in python

I already have the Twitter Standard API (I got approved recently and have not used the Twitter API yet) because I need to collect historical tweets.
So I have to upgrade to the Premium API, but should I choose the API sandbox to test my code before paying and upgrading to the Premium full-archive API? I am afraid of losing some tweets and wasting requests.
I am a little confused about some operators:
results_per_call=100 .. max_results=100 .. what do they mean?
Can I choose any numbers to get more tweets?
How many requests can I use per day?
I found code in Python that I will use to collect tweets. Is it correct? I am a beginner in Python.
Where can I find the JSON file on my computer, and how do I convert this file to .csv?
!pip install searchtweets
!pip install pyyaml

import yaml

config = dict(
    search_tweets_api = dict(
        account_type = 'premium',
        endpoint = 'https://api.twitter.com/1.1/tweets/search/fullarchive/YOUR_LABEL.json',
        consumer_key = 'YOUR_CONSUMER_KEY',
        consumer_secret = 'YOUR_CONSUMER_SECRET'
    )
)

with open('twitter_keys_fullarchive.yaml', 'w') as config_file:
    yaml.dump(config, config_file, default_flow_style=False)

from searchtweets import load_credentials
premium_search_args = load_credentials("twitter_keys_fullarchive.yaml",
                                       yaml_key="search_tweets_api",
                                       env_overwrite=False)
print(premium_search_args)

from searchtweets import gen_rule_payload
query = "(#COVID19 OR #Corona_virus) (pandemic OR corona OR infected OR vaccine)"
rule = gen_rule_payload(query, results_per_call=100, from_date="2020-01-01", to_date="2020-01-30")

from searchtweets import ResultStream
rs = ResultStream(rule_payload=rule,
                  max_results=100,
                  **premium_search_args)
print(rs)

import json
with open('twitter_premium_api_demo.jsonl', 'a', encoding='utf-8') as f:
    n = 0
    for tweet in rs.stream():
        n += 1
        if n % 10 == 0:
            print('{0}: {1}'.format(str(n), tweet['created_at']))
        json.dump(tweet, f)
        f.write('\n')
print('done')
Thank you very much in advance.
I once had the same task of collecting Twitter data under different conditions. After a lot of searching and testing, I had to create a completely separate Python Twitter client for my task. This is what I know regarding the API (the documentation is a little bit confusing).
The Twitter API has 3 versions for searching and downloading data:
Standard (free version with limitations)
Premium (paid version with some extended features)
Enterprise (paid version with customization options for large-scale operations)
Standard API
Free to use with correct authentication
Only returns the past 7 days of data
Can use standard search operators
You can send a limited number of requests within a given time period (e.g. 180 requests in a 15-minute window for user auth and 450 requests in a 15-minute window for app auth)
One request returns 100 data objects (100 tweets)
Premium API
The Premium API includes 2 versions:
30-day endpoint - provides tweets posted within the last 30 days
Full-archive endpoint - provides tweets starting from 2006
These 2 versions work the same way; the only difference is the timeframe you can search.
The Premium package returns a maximum of 500 data objects per request; still, you can limit the return count according to your use case.
You select requests per month by subscription (for example 50 or 250 requests per month).
Answering your questions:
results_per_call=100 means how many tweet objects the API returns per request, and max_results=100 is how many objects you need in total.
Should I choose the API sandbox to test my code before paying and upgrading to the Premium full-archive API?
Yes, you can test basic logic and some search queries and check the returned objects using the free service. But if you need to search a date range of more than 7 days, or use premium operators, you have to use the Premium API.
These are some useful links:
https://developer.twitter.com/en/docs/tweets/search/overview
operators
https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators
https://developer.twitter.com/en/docs/tweets/search/guides/premium-operators
API
https://developer.twitter.com/en/docs/tweets/search/api-reference/premium-search
https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
There is more hidden information in the documentation; please add to this answer if you find anything useful.
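On the last part of the question (where the JSON ends up and how to turn it into CSV): the script above appends one tweet per line to twitter_premium_api_demo.jsonl in the current working directory. A sketch of one way to convert that file, using pandas (the output filename is just illustrative):

import pandas as pd

# Sketch: read the line-delimited JSON written by the script above
# and write it back out as CSV.
df = pd.read_json('twitter_premium_api_demo.jsonl', lines=True)
df.to_csv('twitter_premium_api_demo.csv', index=False)
print(df.head())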

Twitter API - Obtain user tweets and parse into a table/database

This is a small project I'd like to get started on in the near future. It's still in the planning stage, so this post is more about being steered in the right direction.
Essentially, I'd like to obtain tweets from a user and parse the tweets into a table/database, with the aim of being able to run this program in real time.
My initial plan to tackle this was to use Beautiful Soup, a Python-specific library; however, I believe the Twitter API is the better approach (advice on this subject would be appreciated).
There are still 3 unknowns:
Where do I store the tweets once obtained?
How to parse the tweets?
Where to store the parsed data?
To answer (3), I suppose it depends on what I want to do with the data. I still haven't decided how I'll use the parsed data, but I know that I'd like it put into categories, so my thinking is probably a database/table/Excel?
A few questions still to answer, and I'd like you guys to steer me in the right direction. My programming language knowledge is limited to just C for now, but as this project means a great deal to me, I'm willing to put in the effort and learn the necessary languages/APIs.
What languages/APIs will I need to gain an understanding of to accomplish this project? From where I stand, it seems to be Twitter API and Python.
EDIT: So I have a basic script going which obtains a user's tweets. It works better than expected. However, I'd like to take it another step: I'd like to only obtain the user's tweets if there is a hashtag inside the tweet. All other tweets should be ignored. How best to do this?
Here is a snippet of the basic code I have going:
import tweepy
import twitter_credentials

auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

stuff = api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False)

for status in stuff:
    print(status.text)
Scraping Twitter (or any other social network) with, for example, Beautiful Soup, as you said, is not a good idea for 2 reasons:
if the source pages change (name attributes, div ids...), you have to keep your code up to date;
your script can be banned, because scraping is not allowed.
To answer your questions:
1) You can store the tweets wherever you want: CSV, MySQL, SQLite, Redis, Neo4j...
2) With the official API, you get JSON. Here is a Tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html . With tweepy, for example, status.text will give you the text of the tweet (see the sketch after this list for filtering on hashtags, as in your EDIT).
3) Same as #1. If you don't know yet what you will do with the data, store the full JSONs. You will be able to parse them later.
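Regarding the EDIT and points 1 and 2 above, here is a rough sketch (not part of the original answer) that reuses the same credentials module, keeps only timeline tweets whose entities contain at least one hashtag, and stores them as CSV. The filename and columns are just illustrative:

import csv
import tweepy
import twitter_credentials  # same credentials module as in the question

auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

with open('tweets_with_hashtags.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'created_at', 'text', 'hashtags'])
    for status in api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False):
        hashtags = [h['text'] for h in status.entities['hashtags']]
        if hashtags:  # ignore tweets without any hashtag
            writer.writerow([status.id, status.created_at, status.text, ' '.join(hashtags)])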
I suggest tweepy/Python (http://www.tweepy.org/) or twit/Node.js (https://www.npmjs.com/package/twit). And read the official docs: https://developer.twitter.com/en/docs/api-reference-index

Using Python, how to collect tweets (using tweepy) between two dates?

How can I use Python and tweepy in order to collect tweets from Twitter that fall between two given dates?
Is there a way to pass from...until... values to the search API?
Note:
I need to be able to search back, but WITHOUT limitation to a specific user.
I am using Python and I know that the code should be something like this, but I need help making it work.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth)

collection = {}
for tweet in tweepy.Cursor(api.search, ???????).items():
    collection[tweet.id] = tweet._json
After long hours of investigation and stabilization, I can gladly share my findings.
Search by geocode: pass the geocode in the 'q' parameter in this format: geocode:"37.781157,-122.398720,500mi" (the double quotes are important). Notice that the near parameter is not supported anymore by this API; geocode gives more flexibility.
Search by time range: use the parameters "since" and "until" in the following format: "since:2016-08-01 until:2016-08-02".
There is one more important note: Twitter doesn't allow queries with dates that are too old. I am not sure, but I think they only give 10-14 days back, so you cannot query this way for tweets from last month.
===================================
for status in tweepy.Cursor(api.search,
                            q='geocode:"37.781157,-122.398720,1mi" since:2016-08-01 until:2016-08-02 include:retweets',
                            result_type='recent',
                            include_entities=True,
                            monitor_rate_limit=False,
                            wait_on_rate_limit=False).items(300):
    tweet_id = status.id
    tweet_json = status._json
As of now, Tweepy is not the best solution. The best solution is the Python library snscrape, which scrapes Twitter and can therefore get tweets from beyond the cap Twitter puts on its search API. The code below only scrapes 100 English tweets between two dates and only gets the tweet ID, but it can easily be extended for more specific searches, more or fewer tweets, or to get more information about each tweet.
import snscrape.modules.twitter as sntwitter

tweetslist = []
params = "'" + "lang:en " + "since:2020-11-1" + " until:2021-03-13" + "'"
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(params).get_items()):
    if i > 100:
        break
    tweetslist.append([tweet.id])
print(tweetslist)
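As a hypothetical extension of the snippet above, you could keep more fields per tweet; in snscrape versions from around this time, the tweet objects also expose attributes such as date, content and user:

import snscrape.modules.twitter as sntwitter

# Hypothetical extension: also keep timestamp, text and username for each tweet.
query = 'lang:en since:2020-11-01 until:2021-03-13'
rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 100:
        break
    rows.append([tweet.id, tweet.date, tweet.content, tweet.user.username])
print(rows[:5])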
You have to use the max_id parameter, as described in the Twitter documentation.
tweepy is a wrapper around the Twitter API, so you should be able to use this parameter.
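A minimal sketch of that idea (not from the Twitter docs): page backwards by passing max_id to api.search, assuming an authenticated Tweepy v3 api object like the one set up in the question, and a placeholder query of 'python':

# Minimal sketch: page backwards through search results with max_id.
# Assumes `api` is an authenticated tweepy.API instance as in the question.
collection = {}
max_id = None
while True:
    kwargs = {'q': 'python', 'count': 100}
    if max_id is not None:
        kwargs['max_id'] = max_id
    page = api.search(**kwargs)
    if not page:
        break
    for tweet in page:
        collection[tweet.id] = tweet._json
    # Ask for tweets strictly older than the oldest one seen so far.
    max_id = min(tweet.id for tweet in page) - 1
print(len(collection), 'tweets collected')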
As for geolocation, take a look at The Search API: Tweets by Place. It uses the same search API, with customized keys.

How much data can I get with the Twitter Search API for one specific keyword?

I want to collect data from Twitter using the Python Tweepy library.
I surveyed the rate limits for the Twitter API, which are 180 requests per 15-minute window.
What I want to know is how much data I can get for one specific keyword. Put another way, when I use tweepy.Cursor, when will it stop?
I am not asking about the maths (100 count * 180 requests * 4 times/hour etc.) but about real experience. I found a view as follows:
"With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. "
http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
Is this correct (if it is, I only need to run the program for 5 minutes or so)? Or do I need to keep getting as many tweets as there are (which may make the program run for a very long time)?
You will definitely not be getting as many tweets as exist. The way Twitter limits how far back you can go (and therefore how many tweets are available) is with a minimum since_id parameter passed to the GET search/tweets call to the Twitter API. In Tweepy, the API.search function interfaces with the Twitter API. Twitter's GET search/tweets documentation has a lot of good info:
There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occurred since the since_id, the since_id will be forced to the oldest ID available.
In practical terms, Tweepy's API.search should not take long to get all the available tweets. Note that not all tweets are available via the Twitter API, but I've never had a search take more than 10 minutes.
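If you want to see for yourself where the cursor stops for a given keyword, a small sketch that simply counts what the standard search hands back (credentials and the keyword are placeholders):

import tweepy

# Sketch: count how many tweets the standard search actually returns for one keyword.
# wait_on_rate_limit makes the cursor sleep through rate-limit windows instead of failing.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

count = 0
for status in tweepy.Cursor(api.search, q='python', count=100).items():
    count += 1
print('tweets available for this keyword:', count)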
