Twitter API - Get Tweets from a list of users


I am using the Twitter Academic Research V2 API and want to get tweets from a list of users and store them in a dataframe.
My code works for one single user, but not for a list of users. See the code here:
import time
import pandas as pd
import tweepy
from twitter_authentication import bearer_token
client = tweepy.Client(bearer_token, wait_on_rate_limit=True)
# list of twitter users
csu = ["Markus_Soeder", "DoroBaer", "andreasscheuer"]
csu_tweets = []
start = time.time()  # start the timer used in the print below
for politician in csu:
    for response in tweepy.Paginator(client.search_all_tweets,
                                     query=f'from:{politician} -is:retweet lang:de',
                                     user_fields=['username', 'public_metrics', 'description', 'location'],
                                     tweet_fields=['created_at', 'geo', 'public_metrics', 'text'],
                                     expansions='author_id',
                                     start_time='2022-12-01T00:00:00Z',
                                     end_time='2022-12-03T00:00:00Z'):
        time.sleep(1)
        csu_tweets.append(response)
end = time.time()
print(f"Scraping of {csu} needed {(end - start)/60} minutes.")
result = []
user_dict = {}
# Loop through each response object
for response in csu_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username,
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                              }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id,
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'quote_count': tweet.public_metrics['quote_count'],
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       })
# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_25716/2249018491.py in <module>
4 for response in csu_tweets:
5 # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
----> 6 for user in response.includes['users']:
7 user_dict[user.id] = {'username': user.username,
8 'followers': user.public_metrics['followers_count'],
KeyError: 'users'
So I get this KeyError: 'users'. I don't get the error if I scrape tweets from a single user only, i.e. if I replace csu = ["Markus_Soeder", "DoroBaer", "andreasscheuer"] with csu = "Markus_Soeder".
Does anyone know what could be the issue?
Thanks in advance!

I found the answer to this issue. The KeyError occurred because some users had no tweets in this time range, so their response came back empty (stored as None). Looping over response.includes['users'] then failed for those responses.
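One way to guard against this is to skip pages without data and to read the users with .get() instead of indexing. This is only a sketch: the helper name responses_to_rows is made up for illustration, the row is trimmed to three fields, and the SimpleNamespace objects are stand-ins that mimic the attributes used here, not real tweepy responses.

```python
from types import SimpleNamespace


def responses_to_rows(responses):
    """Flatten paginated responses into one dict per tweet, skipping empty pages."""
    rows = []
    user_dict = {}
    for response in responses:
        # A page without tweets has no data, so there is nothing to loop over.
        if not response.data:
            continue
        # .get() avoids the KeyError when the 'users' expansion is missing.
        for user in response.includes.get('users', []):
            user_dict[user.id] = {'username': user.username}
        for tweet in response.data:
            # Fall back to an empty dict if the author was not expanded.
            author_info = user_dict.get(tweet.author_id, {})
            rows.append({'author_id': tweet.author_id,
                         'username': author_info.get('username'),
                         'text': tweet.text})
    return rows


# Stand-in objects (assumption: they only mimic .data and .includes).
empty_page = SimpleNamespace(data=None, includes={})
full_page = SimpleNamespace(
    data=[SimpleNamespace(author_id=1, text='Servus')],
    includes={'users': [SimpleNamespace(id=1, username='Markus_Soeder')]},
)
rows = responses_to_rows([empty_page, full_page])
print(rows)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as result in the question.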

Related

Long tweet message got cut using tweepy in Python

With the code below I cannot retrieve long tweets in full; somehow the text gets cut off. Can anyone suggest a way to retrieve the full tweet message?
tweets = []
#for month in range(8,13,1):
for i, j in zip(start_time, end_time):
    print(i)
    print(j)
    for response in tweepy.Paginator(client.search_all_tweets,
                                     query='สมอง -is:retweet lang:th',
                                     user_fields=['username', 'public_metrics', 'description', 'location'],
                                     tweet_fields=['created_at', 'geo', 'public_metrics', 'text'],
                                     expansions=['author_id', 'geo.place_id'],
                                     start_time=i,
                                     end_time=j):
                                     #max_results=500):
        time.sleep(1)
        tweets.append(response)

I can't reproduce your snippet since I don't have the credentials, but can you try something like this:
import tweepy
client = tweepy.Client("paste your token")
query = 'สมอง -is:retweet lang:th'
tweets = tweepy.Paginator(client.search_recent_tweets, query=query,
                          user_fields=['username', 'public_metrics', 'description', 'location'],
                          tweet_fields=['created_at', 'geo', 'public_metrics', 'text'],
                          max_results=100).flatten(limit=1000)
for tweet in tweets:
    print(tweet.text)  # this should give you the full tweet
Note that I used search_recent_tweets here, so change it back to your search method (search_all_tweets). I couldn't use that one since it was asking for credentials for v2. You can also drop the flatten() call if you want the full response pages rather than only the data part of each tweet.

Get Replies to a Tweet using Tweepy's (4.10.0) Tweet_id

I am trying to use the Tweepy python package to get the actual replies to a Tweet. I break down the process I have worked on so far:
I import my modules and configure authentication with tweepy:
import time
import pandas as pd
import tweepy
# bearer_token is assumed to be defined elsewhere (e.g. loaded from a config file)
client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
I then search for the hashtags I want, passing the parameters to the Tweepy Paginator in a for loop and appending each response to an empty list:
covid_tweets = []
for mytweets in tweepy.Paginator(client.search_all_tweets,
                                 query='#COVID lang:en',
                                 user_fields=['username', 'public_metrics', 'description', 'location'],
                                 tweet_fields=['created_at', 'geo', 'public_metrics', 'text'],
                                 expansions='author_id',
                                 start_time='2019-12-30T00:00:00Z',
                                 end_time='2020-01-15T00:00:00Z',
                                 max_results=10):
    time.sleep(2)
    covid_tweets.append(mytweets)
Then I convert this list into a dataFrame by extracting certain key fields[a user dictionary, user_object]:
#Convert Covid-19 tweets to a DF
result = []
user_dict = {}
# Loop through each response object
for response in covid_tweets:
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username,
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location
                              }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        #check for condition
        if ('RT #' not in tweet.text):
            # Put all information we want to keep in a single dictionary for each tweet
            result.append({'author_id': tweet.author_id,
                           'tweet_id': tweet.id,
                           'username': author_info['username'],
                           'author_followers': author_info['followers'],
                           'author_tweets': author_info['tweets'],
                           'author_description': author_info['description'],
                           'author_location': author_info['location'],
                           'text': tweet.text,
                           'created_at': tweet.created_at,
                           'retweets': tweet.public_metrics['retweet_count'],
                           'replies': tweet.public_metrics['reply_count'],
                           'likes': tweet.public_metrics['like_count'],
                           'quote_count': tweet.public_metrics['quote_count']
                           })
# Change this list of dictionaries into a dataframe
df_1 = pd.DataFrame(result)
Now my issue is that the dataframe shows the tweets and the reply_count for each tweet, but not the replies themselves.
So I looked into how to get the replies to the tweets and wanted to follow this function:
def get_all_replies(tweet, api, fout, depth=10, Verbose=False):
    global rep
    if depth < 1:
        if Verbose:
            print('Max depth reached')
        return
    user = tweet.user.screen_name
    tweet_id = tweet.id
    search_query = '#' + user
    # filter out retweets
    retweet_filter = '-filter:retweets'
    query = search_query + retweet_filter
    try:
        myCursor = tweepy.Cursor(api.search_tweets, q=query,
                                 since_id=tweet_id,
                                 max_id=None,
                                 tweet_mode='extended').items()
        rep = [reply for reply in myCursor if reply.in_reply_to_status_id == tweet_id]
    except tweepy.TweepyException as e:
        sys.stderr.write(('Error get_all_replies: {}\n').format(e))
        time.sleep(60)
    if len(rep) != 0:
        if Verbose:
            if hasattr(tweet, 'full_text'):
                print('Saving replies to: %s' % tweet.full_text)
            elif hasattr(tweet, 'text'):
                print('Saving replies to: %s' % tweet.text)
            print("Output path: %s" % fout)
        # save to file
        with open(fout, 'a+') as (f):
            for reply in rep:
                data_to_file = json.dumps(reply._json)
                f.write(data_to_file + '\n')
                # recursive call
                get_all_replies(reply, api, fout, depth=depth - 1, Verbose=False)
    return
So basically, with this function I loop through the dataframe, pick the tweet_id and the screen_name for each tweet, and build a search query. However, the rep list comes back empty. Debugging line by line showed that in_reply_to_status_id never matches the tweet_id, which is why the list is empty even though the reply count in the dataframe is non-zero.
I know this is long but I really wanted to show what I have done so far and explain each process. Thank you
NB: I have access to Academic Research API

OK, so for everyone trying to get past this hurdle: I finally found a way to get the tweet replies. In my use case, I have the Academic Research API from Twitter.
The code provided by geduldig on his GitHub finally solved my issue with a few tweaks. A heads-up: with the TwitterAPI package, if you omit the start_time or end_time parameter, you might get only the parent tweet, so structure it like this:
pager = TwitterPager(api, 'tweets/search/all',
                     {
                         'query': f'conversation_id:{CONVERSATION_ID}',
                         'start_time': '2019-12-30T00:00:00Z',
                         'end_time': '2021-12-31T00:00:00Z',
                         'expansions': 'author_id',
                         'tweet.fields': 'author_id,conversation_id,created_at,referenced_tweets'
                     },
                     hydrate_type=HydrateType.REPLACE)
I hope this helps the community. Thanks.
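Once the conversation has been fetched, the referenced_tweets field requested above can be used to separate direct replies from replies further down the thread. Here is a minimal sketch, assuming each tweet has already been converted to a plain dict shaped like the v2 JSON payload. The helper name direct_replies and the sample ids are made up for illustration.

```python
def direct_replies(tweets, parent_id):
    """Return the tweets whose 'replied_to' reference points at parent_id."""
    parent_id = str(parent_id)  # v2 JSON payloads carry ids as strings
    replies = []
    for tweet in tweets:
        for ref in tweet.get('referenced_tweets', []):
            if ref.get('type') == 'replied_to' and str(ref.get('id')) == parent_id:
                replies.append(tweet)
                break
    return replies


# Hypothetical conversation: one parent, one direct reply, one nested reply, one quote.
sample = [
    {'id': '100', 'text': 'parent tweet'},
    {'id': '101', 'text': 'first reply',
     'referenced_tweets': [{'type': 'replied_to', 'id': '100'}]},
    {'id': '102', 'text': 'reply to the reply',
     'referenced_tweets': [{'type': 'replied_to', 'id': '101'}]},
    {'id': '103', 'text': 'a quote',
     'referenced_tweets': [{'type': 'quoted', 'id': '100'}]},
]
reply_ids = [t['id'] for t in direct_replies(sample, 100)]
print(reply_ids)
```

Quote tweets also reference the parent, so checking the reference type keeps them out of the reply list.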

How to use tweepy api v2 to get status?

I created this bot with tweepy and Python; basically I can retweet and like the most recent tweets that contain a certain keyword. I want to get the status of a tweet that has that keyword so that I know whether I have already retweeted it.
import time
import tweepy
import config
# Search/ Like/ Retweet
def get_client():
    client = tweepy.Client(bearer_token=config.BEARER_TOKEN,
                           consumer_key=config.CONSUMER_KEY,
                           consumer_secret=config.CONSUMER_SECRET,
                           access_token=config.ACCESS_TOKEN,
                           access_token_secret=config.ACCESS_TOKEN_SECRET, )
    return client
def search_tweets(query):
    client = get_client()
    tweets = client.search_recent_tweets(query=query, max_results=20)
    tweet_data = tweets.data
    results = []
    if tweet_data is not None and len(tweet_data) > 0:
        for tweet in tweet_data:
            obj = {'id': tweet.id, 'text': tweet.text}
            results.append(obj)
    else:
        return 'There are no tweets with that keyword!'
    return results
client = get_client()
tweets = search_tweets('#vinu')
for tweet in tweets:
    client.retweet(tweet["id"])
    client.like(tweet['id'])
    time.sleep(2)
This is the code. I want to add an if statement that checks with API v2 whether I have already retweeted a tweet and, if so, continues to the next item in the loop. I know that I can use api.get_status with API v1, but I can't find how to do it with v2. Please help me out.
if tweet_data is not None and len(tweet_data) > 0:
    for tweet in tweet_data:
        status = tweepy.api(client.access_token).get_status(tweet.id)
        if status.retweeted:
            continue
        else:
            obj = {'id': tweet.id, 'text': tweet.text}
            results.append(obj)
else:
    return ''
return results
This should work in v1; please help me do the same thing in v2. Thanks!

For Tweepy API v2 you can use the get_retweeters() method for each tweet, then compare your user id with the ids of the retrieved retweeters.
if tweet_data is not None and len(tweet_data) > 0:
    for tweet in tweet_data:
        status = client.get_retweeters(tweet.id, max_results=3)
        # status.data is None when the tweet has no retweets
        retweeters = status.data or []
        if any(stat.id == YOUR_ID_HERE for stat in retweeters):
            # already retweeted, skip this tweet
            continue
        obj = {'id': tweet.id, 'text': tweet.text}
        results.append(obj)
else:
    return 'There are no tweets with that keyword!'
return results
You can change max_results to whatever limit you'd like as long as you're managing your rate limits correctly. If you're getting None results, try replacing tweet.id with a static id of a tweet that you know has retweets and test just that one; many tweets have no retweets at all, so None is common.
To fill in YOUR_ID_HERE, look up the numeric Twitter id that belongs to your Twitter username.
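The check can also be wrapped in a small helper so it is easy to try offline. This is only a sketch: already_retweeted is a made-up name, and FakeClient is a stand-in that mimics just the get_retweeters call, not a real tweepy.Client. It also looks only at the first page of retweeters, so a tweet with more retweeters than max_results could be missed.

```python
from types import SimpleNamespace


def already_retweeted(client, tweet_id, my_user_id, max_results=100):
    """Return True if my_user_id is among the retweeters fetched for tweet_id."""
    response = client.get_retweeters(tweet_id, max_results=max_results)
    # response.data is None when the tweet has no retweets
    return any(user.id == my_user_id for user in (response.data or []))


class FakeClient:
    """Offline stand-in for tweepy.Client, used only to exercise the helper."""

    def __init__(self, retweeters_by_tweet):
        self._retweeters = retweeters_by_tweet

    def get_retweeters(self, tweet_id, max_results=100):
        ids = self._retweeters.get(tweet_id)
        data = [SimpleNamespace(id=i) for i in ids] if ids else None
        return SimpleNamespace(data=data)


# Hypothetical ids: tweet 111 was retweeted by users 7 and 42, tweet 222 by nobody.
fake = FakeClient({111: [7, 42], 222: []})
checks = [already_retweeted(fake, 111, 42),
          already_retweeted(fake, 222, 42),
          already_retweeted(fake, 333, 42)]
print(checks)
```

With a real client you would pass the tweepy.Client instance and your own numeric user id instead of the fake values.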

Tweepy is accessing my own timeline instead of the specified user's

I'm trying to scrape tweets from specific users using tweepy. But instead of giving me the tweets from the user I specify, tweepy returns my own timeline tweets. What am I doing wrong?
tweets = []
def username_tweets_to_csv(username, count):
    try:
        tweets = tweepy.Cursor(api.user_timeline, user_id=username).items(count)
        tweets_list = [[tweet.created_at, tweet.id, tweet.text] for tweet in tweets]
        tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 'Tweet Id', 'Text'])
        tweets_df.to_csv('/Users/Carla/Documents/CODE_local/{}-tweets.csv'.format(username), sep=',', index=False)
    except BaseException as e:
        print('failed on_status,', str(e))
        time.sleep(3)
username = "jack"
count = 100
username_tweets_to_csv(username, count)

I was just able to fix it by replacing the user_id parameter with screen_name. user_id expects the numeric account id, whereas screen_name takes the handle (such as "jack"). Works perfectly now!
tweets = tweepy.Cursor(api.user_timeline,screen_name=username).items(count)
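If the same function has to accept both handles and numeric ids, the keyword can be chosen based on the value. A small sketch; the helper name timeline_kwargs is made up for illustration, and it assumes that an all-digit value is always meant as a numeric id, so a handle made up only of digits would be misread.

```python
def timeline_kwargs(identifier):
    """Pick the user_timeline keyword that matches the kind of identifier."""
    text = str(identifier).lstrip('@')
    if text.isdigit():
        # All digits: treat it as a numeric account id (assumption, see above).
        return {'user_id': int(text)}
    return {'screen_name': text}


print(timeline_kwargs('jack'))
print(timeline_kwargs(12))
```

The result can then be unpacked into the call, e.g. tweepy.Cursor(api.user_timeline, **timeline_kwargs(username)).items(count).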

Inserting Twitter's json by field into MongoDB using python

I have been working on this for hours and need some help. This mostly works. I am able to connect to Twitter, pull the JSON data and store it in MongoDB; however, not all the data that I see in my 'print(tweet)' line shows up in MongoDB. Specifically, I don't see the screen_name (or name, for that matter) field. I really just need these fields: "id", "text", "created_at", "screen_name", "retweet_count", "favourites_count", "lang", and I get them all but the name. I am not sure why it is not being inserted into the DB with all the other fields. Any help would be greatly appreciated!
from twython import Twython
from pymongo import MongoClient
ConsumerKey = "XXXXX"
ConsumerSecret = "XXXXX"
AccessToken = "XXXXX-XXXXX"
AccessTokenSecret = "XXXXX"
twitter = Twython(ConsumerKey,
                  ConsumerSecret,
                  AccessToken,
                  AccessTokenSecret)
result = twitter.search(q="drexel", count='100')
result1 = result['statuses']
for tweet in result1:
    print(tweet) #prints tweets so I know I got data
client = MongoClient('mongodb://localhost:27017/')
db = client.twitterdb
tweet_collection = db.twitter_search
#Fields I need ["id", "text", "created_at", "screen_name", "retweet_count", "favourites_count", "lang"]
for tweet in result1:
    try:
        tweet_collection.insert(tweet)
    except:
        pass
print("The number of tweets in English: ")
print(tweet_collection.count(lang="en"))

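You can build a document with only the fields you need and insert that instead:
def get_document(post):
    # screen_name and favourites_count live inside the nested 'user' object
    return {
        'id': post['id_str'],
        'text': post['text'],
        'created_at': post['created_at'],
        'retweet_count': post['retweet_count'],
        'favourites_count': post['user']['favourites_count'],
        'lang': post['lang'],
        'screen_name': post['user']['screen_name']
    }
for tweet in result1:
    try:
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        tweet_collection.insert_one(get_document(tweet))
    except Exception:
        pass
It should work.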

The "screen_name" field is nested inside the "user" part of the tweet metadata. Make sure you're drilling down far enough.
