I'm trying to scrape tweets from specific users using tweepy. But instead of giving me the tweets from the user I specify, tweepy returns my own timeline tweets. What am I doing wrong?
import time

import pandas as pd
import tweepy

# api is an authenticated tweepy.API instance (auth code omitted)

tweets = []

def username_tweets_to_csv(username, count):
    try:
        tweets = tweepy.Cursor(api.user_timeline, user_id=username).items(count)
        tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]
        tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 'Tweet Id', 'Text'])
        tweets_df.to_csv('/Users/Carla/Documents/CODE_local/{}-tweets.csv'.format(username), sep=',', index=False)
    except BaseException as e:
        print('failed on_status,', str(e))
        time.sleep(3)

username = "jack"
count = 100
username_tweets_to_csv(username, count)
I was just able to fix it by exchanging the "user_id" parameter for "screen_name". Works perfectly now!
tweets = tweepy.Cursor(api.user_timeline, screen_name=username).items(count)
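For context, user_id expects the account's numeric ID while screen_name expects the handle, so a handle passed as user_id appears to be ignored, which is why the call fell back to your own timeline. A minimal sketch of both working variants (assuming the same authenticated api object; to my knowledge 12 is @jack's numeric ID):

# Both calls should fetch @jack's timeline
tweets_by_handle = tweepy.Cursor(api.user_timeline, screen_name="jack").items(count)
tweets_by_id = tweepy.Cursor(api.user_timeline, user_id=12).items(count)  # numeric ID, not a handle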
I am trying to use the Tweepy Python package to get the actual replies to a Tweet. I'll break down the process I have worked on so far:
I import my modules and configure authentication variables with tweepy:
import time

import pandas as pd
import tweepy

client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
I then search for the hashtag I want, set the parameters of the Tweepy Paginator, and append each response to an empty list with a for loop:
covid_tweets = []

for mytweets in tweepy.Paginator(client.search_all_tweets, query='#COVID lang:en',
                                 user_fields=['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields=['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions='author_id',
                                 start_time='2019-12-30T00:00:00Z',
                                 end_time='2020-01-15T00:00:00Z',
                                 max_results=10):
    time.sleep(2)
    covid_tweets.append(mytweets)
Then I convert this list into a DataFrame, extracting certain key fields (a user dictionary keyed by user id, built from the expanded user objects):
# Convert COVID-19 tweets to a DataFrame
result = []
user_dict = {}

# Loop through each response object
for response in covid_tweets:
    # Build a lookup of author info from the expanded user objects
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username,
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location}
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # check for condition: skip retweets
        if 'RT @' not in tweet.text:
            # Put all information we want to keep in a single dictionary for each tweet
            result.append({'author_id': tweet.author_id,
                           'tweet_id': tweet.id,
                           'username': author_info['username'],
                           'author_followers': author_info['followers'],
                           'author_tweets': author_info['tweets'],
                           'author_description': author_info['description'],
                           'author_location': author_info['location'],
                           'text': tweet.text,
                           'created_at': tweet.created_at,
                           'retweets': tweet.public_metrics['retweet_count'],
                           'replies': tweet.public_metrics['reply_count'],
                           'likes': tweet.public_metrics['like_count'],
                           'quote_count': tweet.public_metrics['quote_count']})

# Change this list of dictionaries into a DataFrame
df_1 = pd.DataFrame(result)
Now my issue: from the DataFrame I can see the tweets, and the reply_count is non-zero for many of them (screenshot of the DataFrame omitted).
So I looked into how to get the replies to those tweets, did some checks, and wanted to follow this function's flow:
def get_all_replies(tweet, api, fout, depth=10, Verbose=False):
    global rep
    if depth < 1:
        if Verbose:
            print('Max depth reached')
        return
    user = tweet.user.screen_name
    tweet_id = tweet.id
    search_query = '@' + user
    # filter out retweets (note the leading space so the operator stays a separate token)
    retweet_filter = ' -filter:retweets'
    query = search_query + retweet_filter
    try:
        myCursor = tweepy.Cursor(api.search_tweets, q=query,
                                 since_id=tweet_id,
                                 max_id=None,
                                 tweet_mode='extended').items()
        rep = [reply for reply in myCursor if reply.in_reply_to_status_id == tweet_id]
    except tweepy.TweepyException as e:
        sys.stderr.write(('Error get_all_replies: {}\n').format(e))
        time.sleep(60)
    if len(rep) != 0:
        if Verbose:
            if hasattr(tweet, 'full_text'):
                print('Saving replies to: %s' % tweet.full_text)
            elif hasattr(tweet, 'text'):
                print('Saving replies to: %s' % tweet.text)
            print("Output path: %s" % fout)
        # save to file
        with open(fout, 'a+') as f:
            for reply in rep:
                data_to_file = json.dumps(reply._json)
                f.write(data_to_file + '\n')
                # recursive call
                get_all_replies(reply, api, fout, depth=depth - 1, Verbose=False)
    return
So basically, with this function I loop through the DataFrame, pick the "tweet_id" and "screen_name" for each tweet, and build a search query. But the "rep" list comes back empty, and debugging line by line showed that in_reply_to_status_id differs from the tweet_id, which is the cause of the empty list, even though the reply count in the DataFrame is non-zero.
I know this is long, but I really wanted to show what I have done so far and explain each step. Thank you.
NB: I have access to the Academic Research API.
OK, so for everyone trying to clear this hurdle: I finally found a way to get the tweet replies. In my use case, I have the Academic Research API from Twitter.
The code provided by geduldig on his GitHub finally solved my issue with a few tweaks. A heads-up: with the TwitterAPI package, if you omit the "start_time" or "end_time" parameter, you might get only the parent tweet, so structure it like this:
from TwitterAPI import TwitterAPI, TwitterPager, HydrateType

# api is an authenticated TwitterAPI instance (api_version='2')
pager = TwitterPager(api, 'tweets/search/all',
                     {
                         'query': f'conversation_id:{CONVERSATION_ID}',
                         'start_time': '2019-12-30T00:00:00Z',
                         'end_time': '2021-12-31T00:00:00Z',
                         'expansions': 'author_id',
                         'tweet.fields': 'author_id,conversation_id,created_at,referenced_tweets'
                     },
                     hydrate_type=HydrateType.REPLACE)
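To actually collect the replies, you can then iterate the pager; a minimal sketch, reusing the pager object above (get_iterator is TwitterPager's method for walking result pages):

replies = []
for item in pager.get_iterator(wait=2):
    # Every tweet in the conversation other than the parent is a reply
    replies.append(item)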
I hope this helps the community. Thanks.
What if I want to take the Tweet IDs and media_keys and only get the Tweet IDs that have media_keys?
I was trying to do it with this sample but I got stuck:
https://docs.tweepy.org/en/stable/examples.html
client = tweepy.Client(consumer_key=API_KEY, consumer_secret=API_SECRET,
                       access_token=ACCESS_TOKEN, access_token_secret=ACCESS_TOKEN_SECRET,
                       bearer_token=Bearer_token)
counts = 10
search_result = client.get_list_tweets(id='list id', max_results=counts,
                                       expansions=["attachments.media_keys"])
tweets = search_result.data
includes = search_result.includes
medias = includes['media']

for tweetid in tweets:
    tid = tweetid.id
    mediass = {media['media_key']: media for media in medias}

for tweet in tweets:
    print(tweet.id, mediass)
You can check the attachments field of Tweet objects to obtain media keys for media attached to the Tweet.
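For example, a minimal sketch reusing the search_result from above (attachments is None for Tweets without media, so the check filters those out):

media_by_key = {media.media_key: media for media in (search_result.includes.get('media') or [])}

for tweet in search_result.data or []:
    # Only Tweets with media carry a 'media_keys' entry in attachments
    if tweet.attachments and 'media_keys' in tweet.attachments:
        for key in tweet.attachments['media_keys']:
            print(tweet.id, key, media_by_key.get(key))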
I created this bot with Tweepy and Python; basically, it can retweet and like the most recent tweets that contain a certain keyword. I want to get the status of a tweet that has that keyword so that I know whether I have already retweeted it or not.
import time

import tweepy

import config

# Search / Like / Retweet

def get_client():
    client = tweepy.Client(bearer_token=config.BEARER_TOKEN,
                           consumer_key=config.CONSUMER_KEY,
                           consumer_secret=config.CONSUMER_SECRET,
                           access_token=config.ACCESS_TOKEN,
                           access_token_secret=config.ACCESS_TOKEN_SECRET)
    return client

def search_tweets(query):
    client = get_client()
    tweets = client.search_recent_tweets(query=query, max_results=20)
    tweet_data = tweets.data
    results = []
    if tweet_data is not None and len(tweet_data) > 0:
        for tweet in tweet_data:
            obj = {'id': tweet.id, 'text': tweet.text}
            results.append(obj)
    else:
        return 'There are no tweets with that keyword!'
    return results

client = get_client()
tweets = search_tweets('#vinu')

for tweet in tweets:
    client.retweet(tweet["id"])
    client.like(tweet['id'])
    time.sleep(2)
This is the code. I want to add an if statement that checks with API v2 whether I already retweeted a tweet and, if so, continues to the next item in the loop. I know that I can use api.get_status with API v1, but I can't find how to do it with v2. Please help me out.
if tweet_data is not None and len(tweet_data) > 0:
    for tweet in tweet_data:
        status = tweepy.api(client.access_token).get_status(tweet.id)
        if status.retweeted:
            continue
        else:
            obj = {'id': tweet.id, 'text': tweet.text}
            results.append(obj)
else:
    return ''
return results
This should work in v1; please help me do the same thing in v2. Thanks!
For Tweepy API v2 you can use the get_retweeters() method for each tweet, then compare your user id with the retrieved retweeters' ids.
if tweet_data is not None and len(tweet_data) > 0:
    for tweet in tweet_data:
        retweeters = client.get_retweeters(tweet.id, max_results=3)
        # Skip the tweet if our own id shows up among the retweeters
        retweeter_ids = [user.id for user in (retweeters.data or [])]
        if YOUR_ID_HERE in retweeter_ids:
            continue
        obj = {'id': tweet.id, 'text': tweet.text}
        results.append(obj)
else:
    return 'There are no tweets with that keyword!'
return results
You can change max_results to whatever limit you'd like, as long as you're managing your rate limits correctly. If you're getting None results, try changing tweet.id to a static id that has retweets and test just that one; many tweets don't have any retweets at all.
You can find an easy way to look up your Twitter id from your username here.
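Alternatively, if your client is authenticated with user context (as in get_client() above), a minimal sketch for looking your own id up programmatically:

me = client.get_me()
print(me.data.id, me.data.username)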
I have found a Python script for extracting tweets and storing them in a CSV file. I am not familiar with Python yet. Besides the tweet text, I also need to extract the date and time of each tweet. I have found how to extract other characteristics, such as "retweeted" and "retweet_count", but I am still stuck on the date and time.
The script is here:
#!/usr/bin/env python
# encoding: utf-8

import tweepy  # https://github.com/tweepy/tweepy
import csv

# Twitter API credentials
consumer_key = "..........................."
consumer_secret = "..........................."
access_key = "..........................."
access_secret = "..........................."

screename = "@realDonaldTrump"

def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method

    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)

    # initialize a list to hold all the tweepy Tweets
    alltweets = []

    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screename, count=200)
    screen_name = "Donald J. Trump"

    # save most recent tweets
    alltweets.extend(new_tweets)

    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print "getting tweets before %s" % (oldest)

        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screename, count=200, max_id=oldest)

        # save most recent tweets
        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1

        print "...%s tweets downloaded so far" % (len(alltweets))

    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text.encode("utf-8"), tweet.favorite_count,
                  tweet.retweet_count, tweet.favorited, tweet.retweeted] for tweet in alltweets]

    # write the csv
    with open('%s_tweets.csv' % screen_name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text", "favorite_count", "retweet_count", "favorited", "retweeted"])
        writer.writerows(outtweets)

    pass

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets(screename)
The Tweepy Tweet model has created_at:

created_at — Creation time of the Tweet. Type: datetime.datetime | None
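So each fetched status already carries the timestamp. A minimal sketch of pulling the date and time out, assuming the alltweets list from the script above:

for tweet in alltweets:
    # created_at is a datetime.datetime, so the date and time can be split out directly
    print("%s %s %s" % (tweet.id_str, tweet.created_at.date(), tweet.created_at.time()))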
An interesting fact: you can also derive the time from the tweet id itself. Tweet IDs are k-sorted within a second bound. We can extract the timestamp for a tweet ID by right-shifting the tweet ID by 22 bits and adding the Twitter epoch time of 1288834974657 milliseconds, as sketched below.
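A minimal sketch of that derivation (the example id at the end is hypothetical):

from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657

def tweet_id_to_datetime(tweet_id):
    # The high bits of a Snowflake id encode milliseconds since the Twitter epoch
    ms_since_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms_since_epoch / 1000.0, tz=timezone.utc)

# Example with a hypothetical tweet id
print(tweet_id_to_datetime(1212092628029698048))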
This is driving me crazy. As you can see below, I am trying to use a simple while loop to perform a couple of Tweepy searches and append the results to a data frame. For some reason, however, after pulling the first set of 100 tweets, it just repeats that set instead of performing a new search. Any advice would be greatly appreciated.
import sys
import csv

import pandas as pd
import tweepy
from tweepy import OAuthHandler

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

num_results = 200
result_count = 0
last_id = None
df = pd.DataFrame(columns=['Name', 'Location', 'Followers', 'Text', 'Coordinates'])

while result_count < num_results:
    result = api.search(q='', count=100, geocode="38.996918,-104.995826,190mi", since_id=last_id)
    for tweet in result:
        user = tweet.user
        last_id = tweet.id_str
        name = user.name
        friends = user.friends_count
        followers = user.followers_count
        text = tweet.text.encode('utf-8')
        location = user.location
        coordinates = tweet.coordinates
        df.loc[result_count] = pd.Series({'Name': name, 'Location': location, 'Followers': followers,
                                          'Text': text, 'Coordinates': coordinates})
        print(text)
        result_count += 1

# Save to Excel
print("Writing all tables to Excel...")
df.to_csv('out.csv')
print("Excel Export Complete.")
The API.search method returns tweets that match a specified query. It's not a streaming API, so it returns all the data at once.
Furthermore, in your query parameters you have added count, which specifies the number of statuses to retrieve.
So the problem is that each while iteration returns the first 100 results of the complete set.
I suggest you change the code to something like this:
result = api.search(q='', geocode="38.996918,-104.995826,190mi", since_id=last_id)

for tweet in result:
    user = tweet.user
    last_id = tweet.id_str
    name = user.name
    friends = user.friends_count
    followers = user.followers_count
    text = tweet.text.encode('utf-8')
    location = user.location
    coordinates = tweet.coordinates
    df.loc[result_count] = pd.Series({'Name': name, 'Location': location, 'Followers': followers,
                                      'Text': text, 'Coordinates': coordinates})
    print(text)
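A further thought, as an assumption on my part rather than part of the fix above: since_id asks Twitter for results newer than the given id, so if you mean to walk backwards through older results, a sketch with max_id (which API.search also accepts) could look like this:

last_id = None
while result_count < num_results:
    # max_id pages backwards: only return results older than the oldest seen so far
    result = api.search(q='', count=100,
                        geocode="38.996918,-104.995826,190mi",
                        max_id=last_id)
    if not result:
        break
    for tweet in result:
        # ... same per-tweet handling as above ...
        result_count += 1
    last_id = result[-1].id - 1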
Let me know.