Putting several tweets in dataframe - python

I am trying to download the last 10 tweets from BarackObama. However, when I try to put them into a DataFrame, it only includes the 10th tweet (so only 1). Does someone know how to solve this problem? I tried the top part of the code first with print instead of building data, and then I got all 10 tweets, so I don't know where it goes wrong. I also don't get an error message.
user = 'BarackObama'
posts = tweepy.Cursor(api.user_timeline, screen_name=user).items(10)

for status in posts:
    if status.lang == 'en':
        data = {'User': [status.user.name],
                'Account name': ['#' + status.user.screen_name],
                'Tweet': [status.text],
                'Time': [status.created_at],
                'Nr of retweets': [status.retweet_count],
                'Nr of favorited': [status.favorite_count]}

df = pd.DataFrame(data)
df.head()

It seems you have to collect the tweets in a list first, and then put that list into a DataFrame:

user = 'BarackObama'
posts = tweepy.Cursor(api.user_timeline, screen_name=user).items(10)

tweets = []
for status in posts:
    if status.lang == 'en':
        data = {'User': [status.user.name],
                'Account name': ['#' + status.user.screen_name],
                'Tweet': [status.text],
                'Time': [status.created_at],
                'Nr of retweets': [status.retweet_count],
                'Nr of favorited': [status.favorite_count]}
        tweets.append(data)

df = pd.DataFrame(tweets)
df.head()
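A small aside, not from the original answer: because every dict value above is wrapped in a one-element list, each cell of the resulting DataFrame will itself hold a list. If plain values are preferred, the same loop can append unwrapped scalars (a minimal sketch using the same Tweepy status attributes):

tweets = []
for status in posts:
    if status.lang == 'en':
        tweets.append({'User': status.user.name,  # scalar values, not one-element lists
                       'Account name': '#' + status.user.screen_name,
                       'Tweet': status.text,
                       'Time': status.created_at,
                       'Nr of retweets': status.retweet_count,
                       'Nr of favorited': status.favorite_count})

df = pd.DataFrame(tweets)  # one row per tweet, one plain value per cell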


How do I generate a new column on Pandas with Python to Generate Tweet Hyperlinks with Conversation ID

I am using Tweepy to scrape tweets. I cannot get the tweet URL using Tweepy, but I can get the conversation ID. I want to generate a new column that is essentially twitter.com/user/status/(conversation_id) for every row before saving it.
How can I do this? My current code after the scraping cursor is:
columns = ['conversation id',
           'created_at',
           'likes',
           'full_text',
           'retweet count',
           'user location',
           'user name',
           'user verified',
           'in reply to status?',
           'language']

data = []
for tweet in tweets:
    data.append([tweet.id_str,
                 tweet.created_at,
                 tweet.favorite_count,
                 tweet.full_text,
                 tweet.retweet_count,
                 tweet.user.location,
                 tweet.user.screen_name,
                 tweet.user.verified,
                 tweet.in_reply_to_status_id,
                 tweet.lang])

df = pd.DataFrame(data, columns=columns)
print(df)
df.to_csv('testrun.csv')
Fixed it.
for tweet in tweets:
    data.append([tweet.created_at,
                 tweet.favorite_count,
                 tweet.full_text,
                 tweet.retweet_count,
                 "https://twitter.com/user/status/" + tweet.id_str,
                 tweet.user.location,
                 tweet.user.screen_name,
                 tweet.user.verified,
                 tweet.in_reply_to_status_id,
                 tweet.lang])
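Note that the columns list from the question then has to match the new row layout (the id column is gone and the link now sits after the retweet count). Alternatively, if the DataFrame has already been built with the original columns, the link column can be added afterwards with a vectorised string concatenation. A minimal sketch, assuming the 'conversation id' column holds the IDs as strings:

# hypothetical follow-up step on the DataFrame built in the question
df['tweet url'] = 'https://twitter.com/user/status/' + df['conversation id']
df.to_csv('testrun.csv')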

Want to get twitter data using tweepy but in trouble

I am trying to retrieve Twitter data using Tweepy with the code below, but I'm having difficulty collecting the media_fields data. In particular, I want to get the type of each media attachment, but I have not managed to.
As you can see in the screenshot, the same value is copied into cells that should be empty.
(screenshot: https://i.stack.imgur.com/AxCcl.png)
import tweepy
from twitter_authentication import bearer_token
import time
import pandas as pd
client = tweepy.Client(bearer_token, wait_on_rate_limit=True)
hoax_tweets = []
for response in tweepy.Paginator(client.search_all_tweets,
                                 query='Covid hoax -is:retweet lang:en',
                                 user_fields=['username', 'public_metrics', 'description', 'location', 'verified', 'entities'],
                                 tweet_fields=['id', 'in_reply_to_user_id', 'referenced_tweets', 'context_annotations',
                                               'source', 'created_at', 'entities', 'geo', 'withheld', 'public_metrics',
                                               'text'],
                                 media_fields=['media_key', 'type', 'url', 'alt_text',
                                               'public_metrics', 'preview_image_url'],
                                 expansions=['author_id', 'in_reply_to_user_id', 'geo.place_id',
                                             'attachments.media_keys', 'referenced_tweets.id', 'referenced_tweets.id.author_id'],
                                 place_fields=['id', 'name', 'country_code', 'place_type', 'full_name', 'country',
                                               'geo', 'contained_within'],
                                 start_time='2021-01-20T00:00:00Z',
                                 end_time='2021-01-21T00:00:00Z',
                                 max_results=100):
    time.sleep(1)
    hoax_tweets.append(response)
result = []
user_dict = {}
media_dict = {}

# Loop through each response object
for response in hoax_tweets:
    # Take all of the users, and put them into a dictionary of dictionaries with the info we want to keep
    for user in response.includes['users']:
        user_dict[user.id] = {'username': user.username,
                              'followers': user.public_metrics['followers_count'],
                              'tweets': user.public_metrics['tweet_count'],
                              'description': user.description,
                              'location': user.location,
                              'verified': user.verified
                              }
    for media in response.includes['media']:
        media_dict[tweet.id] = {'media_key': media.media_key,
                                'type': media.type
                                }
    for tweet in response.data:
        # For each tweet, find the author's information
        author_info = user_dict[tweet.author_id]
        # Put all of the information we want to keep in a single dictionary for each tweet
        result.append({'author_id': tweet.author_id,
                       'username': author_info['username'],
                       'author_followers': author_info['followers'],
                       'author_tweets': author_info['tweets'],
                       'author_description': author_info['description'],
                       'author_location': author_info['location'],
                       'author_verified': author_info['verified'],
                       'tweet_id': tweet.id,
                       'text': tweet.text,
                       'created_at': tweet.created_at,
                       'retweets': tweet.public_metrics['retweet_count'],
                       'replies': tweet.public_metrics['reply_count'],
                       'likes': tweet.public_metrics['like_count'],
                       'quote_count': tweet.public_metrics['quote_count'],
                       'in_reply_to_user_id': tweet.in_reply_to_user_id,
                       'media': tweet.attachments,
                       'media_type': media,
                       'conversation': tweet.referenced_tweets
                       })

# Change this list of dictionaries into a dataframe
df = pd.DataFrame(result)
Also, when I change 'media': tweet.attachments in the code to 'media': tweet.attachments[0] to get the 'media_key' data, I get the following error message: "TypeError: 'NoneType' object is not subscriptable".
What am I doing wrong? Any suggestions would be appreciated.
The subscriptable error comes from the fact that tweet.attachments is None, hence the NoneType part of the message. To make it work, you can add a check for None:
'media': tweet.attachments[0] if tweet.attachments else None
I have never used the Twitter API, so one thing to verify is whether the tweet attachments are always present or whether they may sometimes be absent.
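Going one step beyond the original answer (a rough sketch, not tested against the API): the media objects in response.includes['media'] are identified by media_key, so the lookup table would normally be keyed by that value rather than by a tweet id, and each tweet would then resolve its own media through its attachments. This assumes, as in the Twitter API v2 payload, that tweet.attachments, when present, is a dict containing a 'media_keys' list:

# build the lookup keyed by media_key instead of tweet.id
media_dict = {}
for media in response.includes['media']:
    media_dict[media.media_key] = {'media_key': media.media_key, 'type': media.type}

for tweet in response.data:
    # tweet.attachments is None for tweets without media, so guard before indexing
    media_keys = (tweet.attachments or {}).get('media_keys', [])
    media_info = media_dict.get(media_keys[0]) if media_keys else None
    media_type = media_info['type'] if media_info else None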

PRAW Loop With HTTP Exceptions

I am using a Python script with PRAW to loop through a list of subreddits and pull their posts. The list is quite long, however, and occasionally a subreddit on it will be deleted, resulting in an HTTP exception (403, 404, etc.). My code is below; does anyone know a line or two I can add to skip the subreddits that give errors?
df = pd.read_csv('reddits.csv', sep=',')
df.head()

Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain', 'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])

data = []
for i in df.reddits:
    subreddit = reddit.subreddit(i)
    for submission in subreddit.new(limit=10):
        time = datetime.utcfromtimestamp(submission.created_utc)
        score = submission.score
        title = submission.title
        text = submission.selftext
        author = submission.author
        comments = submission.num_comments
        url = submission.url
        domain = submission.domain
        permalink = submission.permalink
        ups = submission.ups
        downs = submission.downs
        likes = submission.likes
        crosspost = submission.num_crossposts
        duplicates = submission.num_duplicates
        views = submission.view_count
        data.append(Submission(time, score, title, text, author, comments, url, domain, permalink, ups, downs, likes, crosspost, duplicates, views))
    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
You need to catch the exception; then you can continue:

df = pd.read_csv('reddits.csv', sep=',')
df.head()

Submission = namedtuple('Submission', ['time', 'score', 'title', 'text', 'author', 'comments', 'url', 'domain', 'permalink', 'ups', 'downs', 'likes', 'crosspost', 'duplicates', 'views'])

data = []
for i in df.reddits:
    try:
        subreddit = reddit.subreddit(i)
    except HTTPError as e:
        print(f"Got {e} retrieving {i}")
        continue  # control passes back to the next iteration of the outer loop
    for submission in subreddit.new(limit=10):
        submission = Submission(
            datetime.utcfromtimestamp(submission.created_utc),
            submission.score,
            submission.title,
            submission.selftext,
            submission.author,
            submission.num_comments,
            submission.url,
            submission.domain,
            submission.permalink,
            submission.ups,
            submission.downs,
            submission.likes,
            submission.num_crossposts,
            submission.num_duplicates,
            submission.view_count,
        )
        data.append(submission)
    df = pd.DataFrame(data)
    os.chdir('wd')
    filename = i + str(datetime.now()) + '.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
Also, unrelated: i is not a good name for that value; it traditionally stands for "index", which is not what it contains here. e would be the corresponding generic name, standing for "element", but a descriptive name such as subreddit_name would be the idiomatic choice in Python (reddit itself is already taken by the API client).
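One caveat worth adding (this is not from the original answer): PRAW creates subreddit objects lazily, so a 403/404 for a banned or deleted subreddit typically surfaces only while iterating subreddit.new(...), not when reddit.subreddit(i) is called. A rough sketch of the same skip-and-continue idea with the try moved around the actual fetch, assuming the HTTP errors come from the prawcore package that PRAW uses under the hood:

import prawcore

for i in df.reddits:
    try:
        # the network request happens here, so catch around the iteration itself
        submissions = list(reddit.subreddit(i).new(limit=10))
    except prawcore.PrawcoreException as e:  # base class covering NotFound, Forbidden, etc.
        print(f"Skipping {i}: {e}")
        continue
    for submission in submissions:
        ...  # build and append the Submission tuples as above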

Exporting tweets to a dataframe

I can't export the user information as a data frame, even though it appears fine in the console when I print(users_info). Can someone help? Thanks.
# Define the search term and the date_since date as variables
search_words = "#sunrise"
date_since = "2019-09-01"
# Collect tweets
tweets = tw.Cursor(api.search,
                   q=search_words,
                   lang="en",
                   since=date_since).items(5)

users_info = [[tweet.user.screen_name, tweet.user.location, tweet.text, tweet.created_at, tweet.retweet_count, tweet.source] for tweet in tweets]

df = pd.DataFrame(users_info, columns=['user_name', 'user_location', 'text', 'date', 'retweet_count', 'url'])
df.to_excel=('sunrise_tweets.xlsx')
There should not be an = after df.to_excel, as that assigns your filename to df.to_excel instead of calling the df.to_excel method:
df.to_excel('sunrise_tweets.xlsx')
Also ensure you have installed openpyxl or XlsxWriter. See the docs for further information.
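For completeness, a minimal corrected ending would look like this (the engine argument is optional and only shown to make the dependency explicit):

# requires an Excel writer backend, e.g.  pip install openpyxl
df.to_excel('sunrise_tweets.xlsx', index=False, engine='openpyxl')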

Python - iterate through a list

I'm trying to automate email reporting using Python. My problem is that I can't pull the subject from the data that my email client outputs.
Abbreviated dataset:
[(messageObject){
     id = "0bd503eb00000000000000000000000d0f67"
     name = "11.26.17 AM [TXT-CAT]{Shoppers:2}"
     status = "active"
     messageFolderId = "0bd503ef0000000000000000000000007296"
     content[] =
        (messageContentObject){
           type = "html"
           subject = "Early Cyber Monday – 60% Off Sitewide "
        }
   }
]
I can pull the other fields like this:
messageId = []
messageName = []
subject = []

for info in messages:
    messageId.append(str(info['id']))
    messageName.append(str(info['name']))
    subject.append(str(info[content['subject']]))

data = pd.DataFrame({
    'id': messageId,
    'name': messageName,
    'subject': subject
})

data.head()
I've been trying to iterate through content[] using a for loop, but I can't get it to work. Let me know if you have any suggestions.
@FamousJameous gave the correct answer:
That format is called SOAP. My guess for the syntax would be info['content']['subject'] or maybe info['content'][0]['subject']
info['content'][0]['subject'] worked with my data.
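Applied to the loop from the question, the working version looks roughly like this (a sketch, assuming messages is the list of messageObject records shown above):

messageId = []
messageName = []
subject = []
for info in messages:
    messageId.append(str(info['id']))
    messageName.append(str(info['name']))
    # content is a list, so take the first content entry and read its subject
    subject.append(str(info['content'][0]['subject']))

data = pd.DataFrame({'id': messageId, 'name': messageName, 'subject': subject})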
