I want scheduled to run my python script every hour and save the data in elasticsearch index. So that I used a function I wrote, set_interval which uses the tweepy library. But it doesn't work as I need it to work. It runs every minute and save the data in index. Even after the set that seconds equal to 3600 it runs in every minute. But I want to configure this to run on an hourly basis.
How can I fix this? Heres my python script:
def call_at_interval(time, callback, args):
while True:
timer = Timer(time, callback, args=args)
timer.start()
timer.join()
def set_interval(time, callback, *args):
Thread(target=call_at_interval, args=(time, callback, args)).start()
def get_all_tweets(screen_name):
# authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
screen_name = ""
# initialize a list to hold all the tweepy Tweets
alltweets = []
# make initial request for most recent tweets (200 is the maximum allowed count)
new_tweets = api.user_timeline(screen_name=screen_name, count=200)
# save most recent tweets
alltweets.extend(new_tweets)
# save the id of the oldest tweet less one
oldest = alltweets[-1].id - 1
# keep grabbing tweets until there are no tweets left to grab
while len(new_tweets) > 0:
#print
#"getting tweets before %s" % (oldest)
# all subsiquent requests use the max_id param to prevent duplicates
new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
# save most recent tweets
alltweets.extend(new_tweets)
# update the id of the oldest tweet less one
oldest = alltweets[-1].id - 1
#print
#"...%s tweets downloaded so far" % (len(alltweets))
outtweets = [{'ID': tweet.id_str, 'Text': tweet.text, 'Date': tweet.created_at, 'author': tweet.user.screen_name} for tweet in alltweets]
def save_es(outtweets, es): # Peps8 convention
data = [ # Please without s in data
{
"_index": "index name",
"_type": "type name",
"_id": index,
"_source": ID
}
for index, ID in enumerate(outtweets)
]
helpers.bulk(es, data)
save_es(outtweets, es)
print('Run at:')
print(datetime.now())
print("\n")
set_interval(3600, get_all_tweets(screen_name))
Why do you need so much complexity to do some task every hour? You can run script every one hour this way below, note that it is runned 1 hour + time to do work:
import time
def do_some_work():
print("Do some work")
time.sleep(1)
print("Some work is done!")
if __name__ == "__main__":
time.sleep(60) # imagine you would like to start work in 1 minute first time
while True:
do_some_work()
time.sleep(3600) # do work every one hour
If you want to run script exactly every one hour, do the following code below:
import time
import threading
def do_some_work():
print("Do some work")
time.sleep(4)
print("Some work is done!")
if __name__ == "__main__":
time.sleep(60) # imagine you would like to start work in 1 minute first time
while True:
thr = threading.Thread(target=do_some_work)
thr.start()
time.sleep(3600) # do work every one hour
In this case thr is supposed to finish it's work faster than 3600 seconds, though it does not, you'll still get results, but results will be from another attempt, see the example below:
import time
import threading
class AttemptCount:
def __init__(self, attempt_number):
self.attempt_number = attempt_number
def do_some_work(_attempt_number):
print(f"Do some work {_attempt_number.attempt_number}")
time.sleep(4)
print(f"Some work is done! {_attempt_number.attempt_number}")
_attempt_number.attempt_number += 1
if __name__ == "__main__":
attempt_number = AttemptCount(1)
time.sleep(1) # imagine you would like to start work in 1 minute first time
while True:
thr = threading.Thread(target=do_some_work, args=(attempt_number, ),)
thr.start()
time.sleep(1) # do work every one hour
The result you'll gey in the case is:
Do some work 1
Do some work 1
Do some work 1
Do some work 1
Some work is done! 1
Do some work 2
Some work is done! 2
Do some work 3
Some work is done! 3
Do some work 4
Some work is done! 4
Do some work 5
Some work is done! 5
Do some work 6
Some work is done! 6
Do some work 7
Some work is done! 7
Do some work 8
Some work is done! 8
Do some work 9
I like using subprocess.Popen for such tasks, if the child subprocess did not finish it's work within one hour due to any reason, you just terminate it and start a new one.
You also can use CRON to schedule some process to run every one hour.
Get rid of all timer code just write the logic and
cron will do the job for you add this to the end of the file after crontab -e
0 * * * * /path/to/python /path/to/script.py
0 * * * * means run at every zero minute you can find more explanation here
And also I noticed you are recursively calling get_all_tweets(screen_name) I think you might have to call it from outside
Just keep your script this much
def get_all_tweets(screen_name):
# authorize twitter, initialize tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
screen_name = ""
# initialize a list to hold all the tweepy Tweets
alltweets = []
# make initial request for most recent tweets (200 is the maximum allowed count)
new_tweets = api.user_timeline(screen_name=screen_name, count=200)
# save most recent tweets
alltweets.extend(new_tweets)
# save the id of the oldest tweet less one
oldest = alltweets[-1].id - 1
# keep grabbing tweets until there are no tweets left to grab
while len(new_tweets) > 0:
#print
#"getting tweets before %s" % (oldest)
# all subsiquent requests use the max_id param to prevent duplicates
new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
# save most recent tweets
alltweets.extend(new_tweets)
# update the id of the oldest tweet less one
oldest = alltweets[-1].id - 1
#print
#"...%s tweets downloaded so far" % (len(alltweets))
outtweets = [{'ID': tweet.id_str, 'Text': tweet.text, 'Date': tweet.created_at, 'author': tweet.user.screen_name} for tweet in alltweets]
def save_es(outtweets, es): # Peps8 convention
data = [ # Please without s in data
{
"_index": "index name",
"_type": "type name",
"_id": index,
"_source": ID
}
for index, ID in enumerate(outtweets)
]
helpers.bulk(es, data)
save_es(outtweets, es)
get_all_tweets("") #your screen name here
Related
I want to return all hashtags that match my search, but it is currently taking a very long time to return all the data. In a perfect world I would like to return data where hashtag matches my search query. Get the count of how many times it was mentioned, and then see who tweeted it. Currently just to count the hashtags within a day takes a long time. Here is my current code.
def main():
consumer_key= 'key'
consumer_secret= 'key'
access_token= 'key'
access_token_secret= 'key'
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
search_words = "#search"
date_since = "2021-04-26"
tweets = tw.Cursor(api.search,
q=search_words,
lang="en",
fromDate=date_since).items()
count = 0
for tweet in tweets:
count = count + 1
#print(tweet.text)
print(count)
if __name__ == "__main__":
main()
EDIT: I found out it is sleeping on the Wait Rate limit. Is their anyay around the wait limit?
in current version of tweepy counts are done like this:
import tweepy
client = tweepy.Client(bearer_token="TWITTER_API_BEARER")
# Replace with your own search query
query = 'search -is:retweet'
counts = client.get_recent_tweets_count(query=query, granularity='day')
for count in counts.data:
print(count)
I am trying to stream live tweets with a given hashtag using tweepy library. I am using the following code taken from https://galeascience.wordpress.com/2016/03/18/collecting-twitter-data-with-python/
I am new to python coding and APIs
import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys
def load_api():
''' Function that loads the twitter API after authorizing the user. '''
consumer_key = 'xxx'
consumer_secret = 'xxx'
access_token = 'yyy'
access_secret = 'yyy'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
# load the twitter API via tweepy
return tweepy.API(auth)
def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
''' Function that takes in a search string 'query', the maximum
number of tweets 'max_tweets', and the minimum (i.e., starting)
tweet id. It returns a list of tweepy.models.Status objects. '''
searched_tweets = []
while len(searched_tweets) < max_tweets:
remaining_tweets = max_tweets - len(searched_tweets)
try:
new_tweets = api.search(q=query, count=remaining_tweets,
since_id=str(since_id),
max_id=str(max_id-1))
# geocode=geocode)
print('found',len(new_tweets),'tweets')
if not new_tweets:
print('no tweets found')
break
searched_tweets.extend(new_tweets)
max_id = new_tweets[-1].id
except tweepy.TweepError:
print('exception raised, waiting 15 minutes')
print('(until:', dt.datetime.now()+dt.timedelta(minutes=15), ')')
time.sleep(15*60)
break # stop the loop
return searched_tweets, max_id
def get_tweet_id(api, date='', days_ago=7, query='a'):
''' Function that gets the ID of a tweet. This ID can then be
used as a 'starting point' from which to search. The query is
required and has been set to a commonly used word by default.
The variable 'days_ago' has been initialized to the maximum
amount we are able to search back in time (9).'''
if date:
# return an ID from the start of the given day
td = date + dt.timedelta(days=1)
tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
tweet = api.search(q=query, count=1, until=tweet_date)
else:
# return an ID from __ days ago
td = dt.datetime.now() - dt.timedelta(days=days_ago)
tweet_date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
# get list of up to 10 tweets
tweet = api.search(q=query, count=10, until=tweet_date)
print('search limit (start/stop):',tweet[0].created_at)
# return the id of the first tweet in the list
return tweet[0].id
def write_tweets(tweets, filename):
''' Function that appends tweets to a file. '''
with open(filename, 'a') as f:
for tweet in tweets:
json.dump(tweet._json, f)
f.write('\n')
def main():
''' This is a script that continuously searches for tweets
that were created over a given number of days. The search
dates and search phrase can be changed below. '''
''' search variables: '''
search_phrases = ['#Messi']
time_limit = 1.5 # runtime limit in hours
max_tweets = 200 # number of tweets per search (will be
# iterated over) - maximum is 100
min_days_old, max_days_old = 1, 5 # search limits e.g., from 7 to 8
# gives current weekday from last week,
# min_days_old=0 will search from right now
# loop over search items,
# creating a new file for each
for search_phrase in search_phrases:
print('Search phrase =', search_phrase)
''' other variables '''
name = search_phrase.split()[0]
json_file_root = name + '/' + name
os.makedirs(os.path.dirname(json_file_root), exist_ok=True)
read_IDs = False
# open a file in which to store the tweets
if max_days_old - min_days_old == 1:
d = dt.datetime.now() - dt.timedelta(days=min_days_old)
day = '{0}-{1:0>2}-{2:0>2}'.format(d.year, d.month, d.day)
else:
d1 = dt.datetime.now() - dt.timedelta(days=max_days_old-1)
d2 = dt.datetime.now() - dt.timedelta(days=min_days_old)
day = '{0}-{1:0>2}-{2:0>2}_to_{3}-{4:0>2}-{5:0>2}'.format(
d1.year, d1.month, d1.day, d2.year, d2.month, d2.day)
json_file = json_file_root + '_' + day + '.json'
if os.path.isfile(json_file):
print('Appending tweets to file named: ',json_file)
read_IDs = True
# authorize and load the twitter API
api = load_api()
# set the 'starting point' ID for tweet collection
if read_IDs:
# open the json file and get the latest tweet ID
with open(json_file, 'r') as f:
lines = f.readlines()
max_id = json.loads(lines[-1])['id']
print('Searching from the bottom ID in file')
else:
# get the ID of a tweet that is min_days_old
if min_days_old == 0:
max_id = -1
else:
max_id = get_tweet_id(api, days_ago=(min_days_old-1))
# set the smallest ID to search for
since_id = get_tweet_id(api, days_ago=(max_days_old-1))
print('max id (starting point) =', max_id)
print('since id (ending point) =', since_id)
''' tweet gathering loop '''
start = dt.datetime.now()
end = start + dt.timedelta(hours=time_limit)
count, exitcount = 0, 0
while dt.datetime.now() < end:
count += 1
print('count =',count)
# collect tweets and update max_id
tweets, max_id = tweet_search(api, search_phrase, max_tweets,
max_id=max_id, since_id=since_id,
geocode=USA)
# write tweets to file in JSON format
if tweets:
write_tweets(tweets, json_file)
exitcount = 0
else:
exitcount += 1
if exitcount == 3:
if search_phrase == search_phrases[-1]:
sys.exit('Maximum number of empty tweet strings reached - exiting')
else:
print('Maximum number of empty tweet strings reached - breaking')
break
if __name__ == "__main__":
main()
It throws the following error:
Traceback (most recent call last):
File "search.py", line 189, in <module>
main()
File "search.py", line 157, in main
since_id = get_tweet_id(api, days_ago=(max_days_old-1))
File "search.py", line 80, in get_tweet_id
tweet = api.search(q=query, count=10, until=tweet_date)
File "/usr/local/lib/python3.5/dist-packages/tweepy/binder.py", line 245, in _call
return method.execute()
File "/usr/local/lib/python3.5/dist-packages/tweepy/binder.py", line 229, in execute
raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{'code': 215, 'message': 'Bad Authentication data.'}]
I entered the relevant tokens but still it doesn't work. Any help will be appreciated.
It's rare, but it happens sometimes that the application keys need to be regenerated because of something (?) on the back end. I don't know if that's your issue, but it's worth trying.
Also, you are not actually streaming tweets. There is another request for that. You are using Twitter's REST API for searching for tweets that have already occurred.
I'm scraping Twitter now. (Python)
It is a crawl that extracts tweets in real time with famous keywords.
If you do not have a tweet with keywords, it will be terminated.
(It has been successful so far.)
I wanted to create code that automatically executes after I enter the code, not when I quit by hand.
(Cumulative re-execution on the current result)
My first thought was to create a repeatable crawl code every 12 hours.
But I can not run it ... I thought I had to set the code that I created to end when it was 12 hours.
(Repeat every 12 hours -> 12 hours to pause and run again)
I also think that it is not cumulative, but only tweets that are duplicated from the beginning.
I am writing this article to get advice on my code or advice on my thoughts
import tweepy
import time
import os
import json
import simplejson
API_key = "x"
API_secret = "x"
Access_token = "x"
Access_token_secret = "x"
auth = tweepy.OAuthHandler(API_key, API_secret)
auth.set_access_token(Access_token, Access_token_secret)
api = tweepy.API(auth)
search_term = 'x'
search_term2= 'x'
search_term3='x'
search_term4='x'
search_term5='x'
lat = "x"
lon = "x"
radius = "x"
location = "%s,%s,%s" % (lat, lon, radius)
c=tweepy.Cursor(api.search,
q="{}+OR+{}".format(search_term, search_term2, search_term3, search_term4, search_term5),
rpp=1000,
geocode=location,
include_entities=True)
data = {}
i = 1
for tweet in c.items():
data['text'] = tweet.text
print(i, ":", data)
i += 1
time.sleep(0.35)
wfile = open(os.getcwd()+"/wtt2.txt", mode='w')
data = {}
i = 0
for tweet in c.items():
data['text'] = tweet.text
wfile.write(data['text']+'\n')
i += 1
wfile.close()
from apscheduler.schedulers.blocking import BlockingScheduler
sched = BlockingScheduler()
#sched.scheduled_job('interval', hours=12)
def timed_job():
print('This job is run every 12 hours.')
sched.configure(options_from_ini_file)
sched.start()
I am wondering how I can automate my program to fetch tweets at the max rate of 180 requests per 15 minutes, which is equivalent to the max count of 100 per request totaling 18,000 tweets. I am creating this program for an independent case study at school.
I would like my program to avoid being rate limited and end up being terminated. So, what I would like it to do is constantly use the max number of requests per 15 minutes and be able to leave it running for 24 hours without user interaction to retrieve all tweets possible for analysis.
Here is my code. It gets tweets of query and puts it into a text file but eventually gets rate limited. Would really appreciate the help
import logging
import time
import csv
import twython
import json
app_key = ""
app_secret = ""
oauth_token = ""
oauth_token_secret = ""
twitter = twython.Twython(app_key, app_secret, oauth_token, oauth_token_secret)
tweets = []
MAX_ATTEMPTS = 1000000
# Max Number of tweets per 15 minutes
COUNT_OF_TWEETS_TO_BE_FETCHED = 18000
for i in range(0,MAX_ATTEMPTS):
if(COUNT_OF_TWEETS_TO_BE_FETCHED < len(tweets)):
break
if(0 == i):
results = twitter.search(q="$AAPL",count='100',lang='en',)
else:
results = twitter.search(q="$AAPL",include_entities='true',max_id=next_max_id)
for result in results['statuses']:
print result
with open('tweets.txt', 'a') as outfile:
json.dump(result, outfile, sort_keys = True, indent = 4)
try:
next_results_url_params = results['search_metadata']['next_results']
next_max_id = next_results_url_params.split('max_id=')[1].split('&')[0]
except:
break
You should be using Twitter's Streaming API.
This will allow you to receive a near-realtime feed of your search. You can write those tweets to a file just as fast as they come in.
Using the track parameter you will be able to receive only the specific tweets you're interested in.
You'll need to use Twython Streamer - and your code will look something like this:
from twython import TwythonStreamer
class MyStreamer(TwythonStreamer):
def on_success(self, data):
if 'text' in data:
print data['text'].encode('utf-8')
def on_error(self, status_code, data):
print status_code
stream = MyStreamer(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
stream.statuses.filter(track='$AAPL')
Is it possible to get the full follower list of an account who has more than one million followers, like McDonald's?
I use Tweepy and follow the code:
c = tweepy.Cursor(api.followers_ids, id = 'McDonalds')
ids = []
for page in c.pages():
ids.append(page)
I also try this:
for id in c.items():
ids.append(id)
But I always got the 'Rate limit exceeded' error and there were only 5000 follower ids.
In order to avoid rate limit, you can/should wait before the next follower page request. Looks hacky, but works:
import time
import tweepy
auth = tweepy.OAuthHandler(..., ...)
auth.set_access_token(..., ...)
api = tweepy.API(auth)
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name="McDonalds").pages():
ids.extend(page)
time.sleep(60)
print len(ids)
Hope that helps.
Use the rate limiting arguments when making the connection. The api will self control within the rate limit.
The sleep pause is not bad, I use that to simulate a human and to spread out activity over a time frame with the api rate limiting as a final control.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True, compression=True)
also add try/except to capture and control errors.
example code
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/sample_rate_limit_w_cursor.py
I put my keys in an external file to make management easier.
https://github.com/aspiringguru/twitterDataAnalyse/blob/master/keys.py
I use this code and it works for a large number of followers :
there are two functions one for saving followers id after every sleep period and another one to get the list :
it is a little missy but I hope to be useful.
def save_followers_status(filename,foloowersid):
path='//content//drive//My Drive//Colab Notebooks//twitter//'+filename
if not (os.path.isfile(path+'_followers_status.csv')):
with open(path+'_followers_status.csv', 'wb') as csvfile:
filewriter = csv.writer(csvfile, delimiter=',')
if len(foloowersid)>0:
print("save followers status of ", filename)
file = path + '_followers_status.csv'
# https: // stackoverflow.com / questions / 3348460 / csv - file - written -with-python - has - blank - lines - between - each - row
with open(file, mode='a', newline='') as csv_file:
writer = csv.writer(csv_file, delimiter=',')
for row in foloowersid:
writer.writerow(np.array(row))
csv_file.closed
def get_followers_id(person):
foloowersid = []
count=0
influencer=api.get_user( screen_name=person)
influencer_id=influencer.id
number_of_followers=influencer.followers_count
print("number of followers count : ",number_of_followers,'\n','user id : ',influencer_id)
status = tweepy.Cursor(api.followers_ids, screen_name=person, tweet_mode="extended").items()
for i in range(0,number_of_followers):
try:
user=next(status)
foloowersid.append([user])
count += 1
except tweepy.TweepError:
print('error limite of twiter sleep for 15 min')
timestamp = time.strftime("%d.%m.%Y %H:%M:%S", time.localtime())
print(timestamp)
if len(foloowersid)>0 :
print('the number get until this time :', count,'all folloers count is : ',number_of_followers)
foloowersid = np.array(str(foloowersid))
save_followers_status(person, foloowersid)
foloowersid = []
time.sleep(15*60)
next(status)
except :
print('end of foloowers ', count, 'all followers count is : ', number_of_followers)
foloowersid = np.array(str(foloowersid))
save_followers_status(person, foloowersid)
foloowersid = []
save_followers_status(person, foloowersid)
# foloowersid = np.array(map(str,foloowersid))
return foloowersid
The answer from alecxe is good, however no one has referred to the docs. The correct information and explanation to answer the question lives in the Twitter API documentation. From the documentation :
Results are given in groups of 5,000 user IDs and multiple “pages” of results can be navigated through using the next_cursor value in subsequent requests.
Tweepy's "get_follower_ids()" uses https://api.twitter.com/1.1/followers/ids.json endpoint. This endpoint has a rate limit (15 requests per 15 min).
You are getting the 'Rate limit exceeded' error, cause you are crossing that threshold.
Instead of manually putting the sleep in your code you can use wait_on_rate_limit=True when creating the Tweepy API object.
Moreover, the endpoint has an optional parameter count which specifies the number of users to return per page. The Twitter API documentation does not says anything about its default value. Its maximum value is 5000.
To get the most ids per request explicitly set it to the maximum. So that you need fewer requests.
Here is my code for getting all the followers' ids:
auth = tweepy.OAuth1UserHandler(consumer_key = '', consumer_secret = '',
access_token= '', access_token_secret= '')
api = tweepy.API(auth, wait_on_rate_limit=True)
account_id = 71026122 # instead of account_id you can also use screen_name
follower_ids = []
for page in tweepy.Cursor(api.get_follower_ids, user_id = account_id, count = 5000).pages():
follower_ids.extend(page)