How to handle errors during tweet extraction using Python?

I'm trying to extract a dataset using Tweepy. I have a set of tweet IDs that I use to retrieve full-text tweets. I loop over the IDs and call the Tweepy functions to get the tweet texts, but my program keeps crashing because a few of the tweet IDs in my list belong to suspended accounts.
This is the related code snippet I'm using:
# Creating the DataFrame using pandas
import pandas as pd

# `api` is assumed to be an already-authenticated tweepy.API instance
db = pd.DataFrame(columns=['username', 'description', 'location', 'following',
                           'followers', 'totaltweets', 'retweetcount', 'text', 'hashtags'])

# Reading tweet IDs from file
df = pd.read_excel('dataid.xlsx')
mylist = df['tweet_id'].tolist()

# Tweet counter
n = 1

# Looping to extract tweets
for i in mylist:
    tweets = api.get_status(i, tweet_mode="extended")
    username = tweets.user.screen_name
    description = tweets.user.description
    location = tweets.user.location
    following = tweets.user.friends_count
    followers = tweets.user.followers_count
    totaltweets = tweets.user.statuses_count
    retweetcount = tweets.retweet_count
    text = tweets.full_text
    hashtext = list()
    ith_tweet = [username, description, location, following, followers,
                 totaltweets, retweetcount, text, hashtext]
    db.loc[len(db)] = ith_tweet
    n = n + 1

filename = 'scraped_tweets.csv'
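One common way to handle this is to wrap the api.get_status call in a try/except so that IDs from suspended or deleted accounts are skipped instead of crashing the loop. A minimal sketch, assuming Tweepy v3 (in Tweepy 4.x the exception class is tweepy.errors.TweepyException):

import tweepy

for i in mylist:
    try:
        tweets = api.get_status(i, tweet_mode="extended")
    except tweepy.TweepError as e:  # tweepy.errors.TweepyException in Tweepy >= 4.0
        # suspended account, deleted tweet, etc. -- record the ID and move on
        print("Skipping tweet", i, ":", e)
        continue
    # ... build ith_tweet and append it to db exactly as before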

Related

How to search a specific country's tweets with Tweepy client.search_recent_tweets()

I'm trying to figure out how to filter for a specific country's tweets using search_recent_tweets. I take a country name as input, use pycountry to get the two-character country code, and then try to apply a location filter either in my query string or in the search_recent_tweets parameters. Nothing I have tried in either place has worked so far.
import tweepy
from tweepy import OAuthHandler
from tweepy import API
import pycountry as pyc

# bearer token
BEARER_TOKEN = 'XXXXXXXXX'

# create the client
client = tweepy.Client(bearer_token=BEARER_TOKEN)

# take user input
countryQuery = input("Find recent tweets about travel in a certain country (input country name): ")
keyword = 'women safe'  # gets tweets containing "women" and "safe" for that country ("safe" will also catch "safety")

# get the country code to plug in as a param in search_recent_tweets
country_code = str(pyc.countries.search_fuzzy(countryQuery)[0].alpha_2)

# get 100 recent tweets containing the keywords and from location = countryQuery
query = str(keyword + ' place_country=' + str(countryQuery) + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])
# expansions=geo.place_id, place.fields=[country_code],
# filter posts to remove retweets

# export tweets to json
import json

with open('twitter.json', 'w') as fp:
    for tweet in posts.data:
        json.dump(tweet.data, fp)
        fp.write('\n')
        print("* " + str(tweet.text))
I have tried variations of:
query = str(keyword + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, place_fields=[str(countryQuery), country_code],
                                    max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
and:
query = str(keyword + ' place.fields=' + str(countryQuery) + ',' + country_code + ' -is:retweet')
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])
These attempts either returned NoneType (i.e., no tweets at all) or raised:
"The place.fields query parameter value [Germany] is not one of [contained_within,country,country_code,full_name,geo,id,name,place_type]"
The documentation for search_recent_tweets makes it seem like place.fields / place_fields / place_country should be supported.
Any advice would help!!!
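For what it's worth, place_country is a search operator, not a request parameter, so it belongs inside the query string with a colon and the ISO alpha-2 code. A minimal sketch, assuming your API access level allows the geo operators (they are not available on every tier) and reusing the country_code computed above:

query = keyword + ' place_country:' + country_code + ' -is:retweet'
posts = client.search_recent_tweets(query=query, max_results=100,
                                    tweet_fields=['id', 'text', 'entities', 'author_id'])

Note that this only matches tweets that carry place data, which many tweets do not.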

Twitter API to fetch Tweets in Python

I'm trying to fetch tweets from multiple Twitter accounts and then build a database with each tweet and its source (the user name), using the following code:
posts = api.user_timeline(screen_name='AlArabiya_Brk', count=100, lang="ar",
                          tweet_mode="extended")
df = pd.DataFrame([tweet.full_text for tweet in posts], columns=['Tweets'])
but I have a question: how can I add more than one account? I tried doing:
posts = api.user_timeline(screen_name=['AlArabiya_Brk', 'AJABreaking'], count=100,
                          lang="ar", tweet_mode="extended")
but that didn't give the desired output.
You'll need to make multiple calls with that method.
That API endpoint only allows a single screen name input.
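A minimal sketch of that approach, looping over the screen names and tagging each tweet with its source account (the column names are just illustrative):

screen_names = ['AlArabiya_Brk', 'AJABreaking']
rows = []
for name in screen_names:
    posts = api.user_timeline(screen_name=name, count=100, lang="ar",
                              tweet_mode="extended")
    for tweet in posts:
        rows.append({'Tweets': tweet.full_text, 'User': name})
df = pd.DataFrame(rows)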

Running multiple queries on the YouTube API by looping through title columns of a CSV in Python

I am using the YouTube API to get comment data for a list of music videos. The way I have it working right now is by manually typing in my query, writing the data to a CSV file, and repeating for each song, like so:
query = "song title"
query_results = service.search().list(
part = 'snippet',
q = query,
order = 'relevance', # You can consider using viewCount
maxResults = 20,
type = 'video', # Channels might appear in search results
relevanceLanguage = 'en',
safeSearch = 'moderate',
).execute()
What I would like to do is use the title and artist columns from a CSV file containing the songs I am gathering data for, so I can run the program once without having to type in each song manually.
A friend suggested using something like this
import pandas as pd

data = pd.read_csv("metadata.csv")

def songtitle():
    for i in data.index:
        title = data.loc[i, 'title']
        title = '"' + title + '"'
        artist = data.loc[i, 'artist']
    return (artist, title)
But I am not sure how to make this work: when I run it, it only returns the final row of data (the return sits outside the loop), and even if it ran correctly, I don't know how to make the whole program repeat for each new song.
You can save the song titles and artists to lists, then loop over the titles to run a query for each one.
def get_songTitles():
    data = pd.read_csv("metadata.csv")
    return data['artist'].tolist(), data['title'].tolist()

artist, song_titles = get_songTitles()

for song in song_titles:
    query_results = service.search().list(
        part='snippet',
        q=song,
        order='relevance',       # you could also consider viewCount
        maxResults=20,
        type='video',            # channels might otherwise appear in search results
        relevanceLanguage='en',
        safeSearch='moderate',
    ).execute()
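If you also want the artist in each query (the question reads both columns), a hypothetical variation zips the two lists together:

artists, song_titles = get_songTitles()
for artist, song in zip(artists, song_titles):
    query = artist + ' ' + song  # search on artist and title together
    query_results = service.search().list(
        part='snippet',
        q=query,
        order='relevance',
        maxResults=20,
        type='video',
        relevanceLanguage='en',
        safeSearch='moderate',
    ).execute()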

How to access place and geo objects in tweet JSON object

I am currently trying to access the place names and coordinates of tweets from a JSON file created by Twitter's API. Not all of my tweets include these attributes, but some do, and I'd like to collect them. My current approach is:
for line in tweets_json:
    try:
        tweet = json.loads(line.strip())  # only messages containing a 'text' field are tweets
        tweet_id = tweet['id']             # the tweet's id
        created_at = tweet['created_at']   # when the tweet was posted
        text = tweet['text']               # content of the tweet
        user_id = tweet['user']['id']      # id of the user who posted the tweet
        hashtags = []
        for hashtag in tweet['entities']['hashtags']:
            hashtags.append(hashtag['text'])
        lat = []
        long = []
        for coordinates in tweet['coordinates']['coordinates']:
            lat.append(coordinates[0])
            long.append(coordinates[1])
        country_code = []
        place_name = []
        for place in tweet['place']:
            country_code.append(place['country_code'])
            place_name.append(place['full_name'])
    except:
        # the line read in is not in JSON format (errors sometimes occur)
        continue
As of right now, no attributes past hashtags are being collected. Am I accessing the attributes wrong? More information regarding the JSON object can be found here: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
By wrapping all your code in a single try/except block, you're silently passing over every error that occurs, including the errors raised when you index into a 'coordinates' value that is missing or null.
If some of the parsed tweet dictionaries contain a key you want to collect, you can do something like this:
import json
from json import JSONDecodeError

for line in tweets_json:
    # try to parse the JSON first, and only catch decoding errors here
    try:
        tweet = json.loads(line.strip())  # only messages containing a 'text' field are tweets
    except JSONDecodeError:
        print('bad json')
        continue
    tweet_id = tweet['id']             # the tweet's id
    created_at = tweet['created_at']   # when the tweet was posted
    text = tweet['text']               # content of the tweet
    user_id = tweet['user']['id']      # id of the user who posted the tweet
    hashtags = []
    for hashtag in tweet['entities']['hashtags']:
        hashtags.append(hashtag['text'])
    lat = []
    long = []
    # 'coordinates' is always present but usually null; when set it is a GeoJSON
    # point of the form {'type': 'Point', 'coordinates': [longitude, latitude]}
    if tweet.get('coordinates'):
        lon_val, lat_val = tweet['coordinates']['coordinates']
        long.append(lon_val)
        lat.append(lat_val)
    country_code = []
    place_name = []
    # 'place' is a dict (or null), not a list, so index it directly
    if tweet.get('place'):
        country_code.append(tweet['place']['country_code'])
        place_name.append(tweet['place']['full_name'])
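The same guard works for any optional field; for example, a hypothetical safe lookup with dict.get covers both the missing-key and the null cases in one expression:

place = tweet.get('place') or {}       # {} when 'place' is absent or null
country = place.get('country_code')    # None when the tweet has no place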

Validate whether a Twitter user ID exists in order to scrape tweets

I created a scraper with Python that gets all the followers of a particular Twitter user. The issue is that when I use this list of user IDs to get their tweets with Logstash, I get an error.
I used http://gettwitterid.com/ to manually check whether these IDs are valid, and they are, but the list is really long to check one by one.
Is there a way with Python to split the IDs into two lists, one containing the valid IDs and the other the invalid ones, so that I can use the valid list as input for Logstash?
The first 10 rows of the CSV file look like this:
"id"
"602169027"
"95104995"
"874339739557670912"
"2981270769"
"93054327"
"870723159011545088"
"3008493180"
"874804469082533888"
"756339889092829184"
"1077712806"
I tried this code to get tweets using the IDs imported from CSV, but unfortunately it raises error 144 (Not found):
import tweepy
import pandas as pd

consumer_key = ""
consumer_secret = ""
access_token_key = "-"
access_token_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

dfuids = pd.read_csv('Uids.csv')
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(dfuids['id'])
Try to change your code to this:
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(row['id'])  # pass the row's id, not the whole column
To guard against potential errors, you can add a try/except around the call, as in the sketch below.
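For example, a minimal sketch of that guard (tweepy.TweepError is the Tweepy v3 exception class; v4 renamed it to tweepy.errors.TweepyException):

for index, row in dfuids.iterrows():
    try:
        tweet = api.get_status(row['id'])
    except tweepy.TweepError as e:  # tweepy.errors.TweepyException in Tweepy >= 4.0
        print(row['id'], 'failed:', e)
        continue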
I got the solution after some experiments:
dfuids = pd.read_csv('Uids.csv')
valid = []
notvalid = []
for index, row in dfuids.iterrows():
    print(index)
    x = str(row.id)
    try:
        tweet = api.user_timeline(row.id)
        valid.append(x)      # the id returned a timeline, so it is valid
    except:
        notvalid.append(x)   # any failure marks the id as not valid
This part of the code was what I needed: it loops over all the IDs and tests whether each user ID returns tweets from the timeline. If it does, the ID is appended as a string to a list called valid; if an exception is raised for any reason, it is appended to notvalid instead.
We can then save these lists into DataFrames and export them to CSV:
df = pd.DataFrame(valid)
dfnotv = pd.DataFrame(notvalid)
df.to_csv('valid.csv', index=False, encoding='utf-8')
dfnotv.to_csv('notvalid.csv', index=False, encoding='utf-8')
