How to handle errors during tweet extraction using Python?

I'm trying to extract a dataset using Tweepy. I have a set of tweet IDs that I use to retrieve full-text tweets. I loop over the IDs and call the Tweepy functions to get the tweet texts, but my program keeps crashing because a few of the tweet IDs in my list belong to suspended accounts.
This is the related code snippet I'm using:
# Creating the DataFrame using pandas
import pandas as pd

# `api` is assumed to be an already-authenticated tweepy.API instance
db = pd.DataFrame(columns=['username', 'description', 'location', 'following',
                           'followers', 'totaltweets', 'retweetcount', 'text', 'hashtags'])

# Reading tweet IDs from file
df = pd.read_excel('dataid.xlsx')
mylist = df['tweet_id'].tolist()

# Tweet counter
n = 1

# Looping to extract tweets
for i in mylist:
    tweets = api.get_status(i, tweet_mode="extended")
    username = tweets.user.screen_name
    description = tweets.user.description
    location = tweets.user.location
    following = tweets.user.friends_count
    followers = tweets.user.followers_count
    totaltweets = tweets.user.statuses_count
    retweetcount = tweets.retweet_count
    text = tweets.full_text
    hashtext = list()
    ith_tweet = [username, description, location, following, followers,
                 totaltweets, retweetcount, text, hashtext]
    db.loc[len(db)] = ith_tweet
    n = n + 1

filename = 'scraped_tweets.csv'
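One common way to handle this is to wrap the api.get_status call in a try/except so that IDs from suspended or deleted accounts are skipped instead of crashing the loop. A minimal sketch, assuming Tweepy v3 (in Tweepy 4.x the exception class is tweepy.errors.TweepyException):

import tweepy

for i in mylist:
    try:
        tweets = api.get_status(i, tweet_mode="extended")
    except tweepy.TweepError as e:  # tweepy.errors.TweepyException in Tweepy >= 4.0
        # suspended account, deleted tweet, etc. -- record the ID and move on
        print("Skipping tweet", i, ":", e)
        continue
    # ... build ith_tweet and append it to db exactly as before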

Related

How to search a specific country's tweets with Tweepy client.search_recent_tweets()

I'm trying to figure out how to filter for a specific country's tweets using search_recent_tweets. I take a country name as input, use pycountry to get the two-character country code, and then try to apply a location filter either in my query string or in the search_recent_tweets parameters. Nothing I have tried in either place has worked so far.
import tweepy
from tweepy import OAuthHandler
from tweepy import API
import pycountry as pyc

# bearer token
BEARER_TOKEN = 'XXXXXXXXX'

# create the client
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# take user input
countryQuery = input("Find recent tweets about travel in a certain country (input country name): ")
keyword = 'women safe'  # gets tweets containing "women" and "safe" for that country ("safe" will also catch "safety")

# get the country code to plug in as a param in search_recent_tweets
country_code = str(pyc.countries.search_fuzzy(countryQuery)[0].alpha_2)

# get 100 recent tweets containing the keywords and from location = countryQuery
query = str(keyword + ' place_country=' + str(countryQuery) + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])
# expansions=geo.place_id, place.fields=[country_code],
# filter posts to remove retweets

# export tweets to json
import json

with open('twitter.json', 'w') as fp:
    for tweet in posts.data:
        json.dump(tweet.data, fp)
        fp.write('\n')
        print("* " + str(tweet.text))
I have tried variations of:
query = str(keyword + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, place_fields=[str(countryQuery), country_code],
                                    max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
and:
query = str(keyword + ' place.fields=' + str(countryQuery) + ',' + country_code + ' -is:retweet')
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])
These attempts either returned NoneType (i.e., no tweets at all) or raised:
"The place.fields query parameter value [Germany] is not one of [contained_within,country,country_code,full_name,geo,id,name,place_type]"
The documentation for search_recent_tweets makes it seem like place.fields / place_fields / place_country should be supported.
Any advice would help!!!
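For what it's worth, place_country is a search operator, not a request parameter, so it belongs inside the query string with a colon and the ISO alpha-2 code. A minimal sketch, assuming your API access level allows the geo operators (they are not available on every tier) and reusing the country_code computed above:

query = keyword + ' place_country:' + country_code + ' -is:retweet'
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])

Note that this only matches tweets that carry place data, which many tweets do not.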

Twitter API to fetch Tweets in Python

I'm trying to fetch tweets from multiple Twitter accounts and then build a database with each tweet and its source (the user name), using the following code:
posts = api.user_timeline(screen_name='AlArabiya_Brk', count=100, lang="ar",
                          tweet_mode="extended")
df = pd.DataFrame([tweet.full_text for tweet in posts], columns=['Tweets'])
but I have a question: how can I add more than one account? I tried doing:
posts = api.user_timeline(screen_name=['AlArabiya_Brk', 'AJABreaking'], count=100,
                          lang="ar", tweet_mode="extended")
but that didn't give the desired output.
You'll need to make multiple calls with that method.
That API endpoint only allows a single screen name input.
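A minimal sketch of that approach, looping over the screen names and tagging each tweet with its source account (the column names are just illustrative):

screen_names = ['AlArabiya_Brk', 'AJABreaking']
rows = []
for name in screen_names:
    posts = api.user_timeline(screen_name=name, count=100, lang="ar",
                              tweet_mode="extended")
    for tweet in posts:
        rows.append({'Tweets': tweet.full_text, 'User': name})
df = pd.DataFrame(rows)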

Running multiple queries on the YouTube API by looping through title columns of a CSV in Python

I am using the YouTube API to get comment data for a list of music videos. The way I have it working right now is by manually typing in my query, writing the data to a CSV file, and repeating for each song, like so:
query = "song title"
query_results = service.search().list(
part = 'snippet',
q = query,
order = 'relevance', # You can consider using viewCount
maxResults = 20,
type = 'video', # Channels might appear in search results
relevanceLanguage = 'en',
safeSearch = 'moderate',
).execute()
What I would like to do is use the title and artist columns from a CSV file containing the songs I am gathering data for, so I can run the program once without having to type in each song manually.
A friend suggested using something like this
import pandas as pd

data = pd.read_csv("metadata.csv")

def songtitle():
    for i in data.index:
        title = data.loc[i, 'title']
        title = '"' + title + '"'
        artist = data.loc[i, 'artist']
    return (artist, title)
But I am not sure how to make this work: when I run it, it only returns the final row of data (the return sits outside the loop), and even if it ran correctly, I don't know how to make the whole program repeat for each new song.
You can save the song titles and artists to lists, then loop over the titles to run a query for each one.
def get_songTitles():
    data = pd.read_csv("metadata.csv")
    return data['artist'].tolist(), data['title'].tolist()

artist, song_titles = get_songTitles()

for song in song_titles:
    query_results = service.search().list(
        part='snippet',
        q=song,
        order='relevance',       # you could also consider viewCount
        maxResults=20,
        type='video',            # channels might otherwise appear in search results
        relevanceLanguage='en',
        safeSearch='moderate',
    ).execute()
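If you also want the artist in each query (the question reads both columns), a hypothetical variation zips the two lists together:

artists, song_titles = get_songTitles()
for artist, song in zip(artists, song_titles):
    query = artist + ' ' + song  # search on artist and title together
    query_results = service.search().list(
        part='snippet',
        q=query,
        order='relevance',
        maxResults=20,
        type='video',
        relevanceLanguage='en',
        safeSearch='moderate',
    ).execute()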

How to access place and geo objects in tweet JSON object

I am currently trying to access the place names and coordinates of tweets from a JSON file created by Twitter's API. Not all of my tweets include these attributes, but some do, and I'd like to collect them. My current approach is:
for line in tweets_json:
    try:
        tweet = json.loads(line.strip())  # only messages containing a 'text' field are tweets
        tweet_id = tweet['id']             # the tweet's id
        created_at = tweet['created_at']   # when the tweet was posted
        text = tweet['text']               # content of the tweet
        user_id = tweet['user']['id']      # id of the user who posted the tweet
        hashtags = []
        for hashtag in tweet['entities']['hashtags']:
            hashtags.append(hashtag['text'])
        lat = []
        long = []
        for coordinates in tweet['coordinates']['coordinates']:
            lat.append(coordinates[0])
            long.append(coordinates[1])
        country_code = []
        place_name = []
        for place in tweet['place']:
            country_code.append(place['country_code'])
            place_name.append(place['full_name'])
    except:
        # the line read in is not in JSON format (errors sometimes occur)
        continue
As of right now, no attributes past hashtags are being collected. Am I accessing the attributes wrong? More information regarding the JSON object can be found here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
By wrapping all your code in a single try/except block, you're silently passing over every error that occurs, including the errors raised when you index into a 'coordinates' value that is missing or null.
If some of the parsed tweet dictionaries contain a key you want to collect, you can do something like this:
import json
from json import JSONDecodeError

for line in tweets_json:
    # try to parse the JSON first, and only catch decoding errors here
    try:
        tweet = json.loads(line.strip())  # only messages containing a 'text' field are tweets
    except JSONDecodeError:
        print('bad json')
        continue
    tweet_id = tweet['id']             # the tweet's id
    created_at = tweet['created_at']   # when the tweet was posted
    text = tweet['text']               # content of the tweet
    user_id = tweet['user']['id']      # id of the user who posted the tweet
    hashtags = []
    for hashtag in tweet['entities']['hashtags']:
        hashtags.append(hashtag['text'])
    lat = []
    long = []
    # 'coordinates' is always present but usually null; when set it is a GeoJSON
    # point of the form {'type': 'Point', 'coordinates': [longitude, latitude]}
    if tweet.get('coordinates'):
        lon_val, lat_val = tweet['coordinates']['coordinates']
        long.append(lon_val)
        lat.append(lat_val)
    country_code = []
    place_name = []
    # 'place' is a dict (or null), not a list, so index it directly
    if tweet.get('place'):
        country_code.append(tweet['place']['country_code'])
        place_name.append(tweet['place']['full_name'])
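The same guard works for any optional field; for example, a hypothetical safe lookup with dict.get covers both the missing-key and the null cases in one expression:

place = tweet.get('place') or {}       # {} when 'place' is absent or null
country = place.get('country_code')    # None when the tweet has no place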

Validate whether a Twitter user ID exists in order to scrape tweets

I created a scraper with Python that gets all the followers of a particular Twitter user. The issue is that when I use this list of user IDs to get their tweets with Logstash, I get an error.
I used http://gettwitterid.com/ to manually check whether these IDs are valid, and they are, but the list is really long to check one by one.
Is there a way with Python to split the IDs into two lists, one containing the valid IDs and the other the invalid ones, so that I can use the valid list as input for Logstash?
The first 10 rows of the CSV file look like this:
"id"
"602169027"
"95104995"
"874339739557670912"
"2981270769"
"93054327"
"870723159011545088"
"3008493180"
"874804469082533888"
"756339889092829184"
"1077712806"
I tried this code to get tweets using the IDs imported from CSV, but unfortunately it raises error 144 (Not found):
import tweepy
import pandas as pd

consumer_key = ""
consumer_secret = ""
access_token_key = "-"
access_token_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

dfuids = pd.read_csv('Uids.csv')
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(dfuids['id'])
Try to change your code to this:
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(row['id'])  # pass the row's id, not the whole column
To guard against potential errors, you can add a try/except around the call, as in the sketch below.
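For example, a minimal sketch of that guard (tweepy.TweepError is the Tweepy v3 exception class; v4 renamed it to tweepy.errors.TweepyException):

for index, row in dfuids.iterrows():
    try:
        tweet = api.get_status(row['id'])
    except tweepy.TweepError as e:  # tweepy.errors.TweepyException in Tweepy >= 4.0
        print(row['id'], 'failed:', e)
        continue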
I got the solution after some experiments:
dfuids = pd.read_csv('Uids.csv')
valid = []
notvalid = []
for index, row in dfuids.iterrows():
    print(index)
    x = str(row.id)
    try:
        tweet = api.user_timeline(row.id)
        valid.append(x)      # the id returned a timeline, so it is valid
    except:
        notvalid.append(x)   # any failure marks the id as not valid
This part of the code was what I needed: it loops over all the IDs and tests whether each user ID returns tweets from the timeline. If it does, the ID is appended as a string to a list called valid; if an exception is raised for any reason, it is appended to notvalid instead.
We can then save these lists into DataFrames and export them to CSV:
df = pd.DataFrame(valid)
dfnotv = pd.DataFrame(notvalid)
df.to_csv('valid.csv', index=False, encoding='utf-8')
dfnotv.to_csv('notvalid.csv', index=False, encoding='utf-8')
