Validate if User Id on Twitter to be able to scrape tweets - python

I created a scraper with python that gets all the followers of a particular twitter user. The issue is that when I use this list of user Ids to get their tweets with logstash, I have an Error.
I used http://gettwitterid.com/ to manually check if these Ids are working, and they are but the list is really long to check it one by one.
Is there a solution with python to split the Ids into two lists, one containing Valid Ids and the other contains the Not valid ones, thet I use the Valid list as input for logstash?
The first 10 rows of the csv file is like this :
"id"
"602169027"
"95104995"
"874339739557670912"
"2981270769"
"93054327"
"870723159011545088"
"3008493180"
"874804469082533888"
"756339889092829184"
"1077712806"
I tried this code to get tweets using Ids imported from csv, but unfortunetly it's raising 144 (Not found)
import tweepy
import pandas as pd
consumer_key = ""
consumer_secret = ""
access_token_key = "-"
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
dfuids = pandas.read_csv('Uids.csv')
for index, row in dfuids.iterrows():
print row['id']
tweet = api.get_status(dfuids['id'])
importing ids from csv

Try to change your code to this:
for index, row in dfuids.iterrows():
print row['id']
tweet = api.get_status(row['id'])
To escape potential errors, you can add a try / except loop later.

I got the solution after some experiments:
dfuids = pd.read_csv('Uids.csv')
valid = []
notvalid = []
for index, row in dfuids.iterrows():
print index
x = str(row.id)
#print x , type(x)
try:
tweet = api.user_timeline(row.id)
#print "Fine :",row.id
valid.append(x)
#print x, "added to valid"
except:
#print "NotOk :",row.id
notvalid.append(x)
#print x, "added to valid"
This Part of the code was what I needed, so it loops for all the Ids, and test if that user id give us some tweets from the timeline, if correct then it's appended as string to a list called (valid) else if we have an exception for any reason then it's appended to (notvalid).
We can save this list into a dataframe and export csv :
df = pd.DataFrame(valid)
dfnotv = pd.DataFrame(notvalid)
df.to_csv('valid.csv', index=False, encoding='utf-8')
dfnotv.to_csv('notvalid.csv', index=False, encoding='utf-8')

Related

How to search a specific country's tweets with Tweepy client.search_recent_tweets()

y'all. I'm trying to figure out how to sort for a specific country's tweets using search_recent_tweets. I take a country name as input, use pycountry to get the 2-character country code, and then I can either put some sort of location filter in my query or in search_recent_tweets params. Nothing I have tried so far in either has worked.
######
import tweepy
from tweepy import OAuthHandler
from tweepy import API
import pycountry as pyc
# upload token
BEARER_TOKEN='XXXXXXXXX'
# get tweets
client = tweepy.Client(bearer_token=BEARER_TOKEN)
# TAKE USER INPUT
countryQuery = input("Find recent tweets about travel in a certain country (input country name): ")
keyword = 'women safe' # gets tweets containing women and safe for that country (safe will catch safety)
# get country code to plug in as param in search_recent_tweets
country_code = str(pyc.countries.search_fuzzy(countryQuery)[0].alpha_2)
# get 100 recent tweets containing keywords and from location = countryQuery
query = str(keyword+' place_country='+str(countryQuery)+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
# expansions=geo.place_id, place.fields=[country_code],
# filter posts to remove retweets
# export tweets to json
import json
with open('twitter.json', 'w') as fp:
for tweet in posts.data:
json.dump(tweet.data, fp)
fp.write('\n')
print("* " + str(tweet.text))
I have tried variations of:
query = str(keyword+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, place_fields=[str(countryQuery), country_code], max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
and:
query = str(keyword+' place.fields='+str(countryQuery)+','+country_code+' -is:retweet') # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
These either ended up pulling me NoneType tweets aka nothing or causing a
"The place.fields query parameter value [Germany] is not one of [contained_within,country,country_code,full_name,geo,id,name,place_type]"
The documentation for search_recent_tweets seems like place.fields / place_fields / place_country should be supported.
Any advice would help!!!

parse github api.. getting string indices must be integers error

I need to loop through commits and get name, date, and messages info from
GitHub API.
https://api.github.com/repos/droptable461/Project-Project-Management/commits
I have many different things but I keep getting stuck at string indices must be integers error:
def git():
#name , date , message
#https://api.github.com/repos/droptable461/Project-Project-Management/commits
#commit { author { name and date
#commit { message
#with urlopen('https://api.github.com/repos/droptable461/Project Project-Management/commits') as response:
#source = response.read()
#data = json.loads(source)
#state = []
#for state in data['committer']:
#state.append(state['name'])
#print(state)
link = 'https://api.github.com/repos/droptable461/Project-Project-Management/events'
r = requests.get('https://api.github.com/repos/droptable461/Project-Project-Management/commits')
#print(r)
#one = r['commit']
#print(one)
for item in r.json():
for c in item['commit']['committer']:
print(c['name'],c['date'])
return 'suc'
Need to get person who did the commit, date and their message.
item['commit']['committer'] is a dictionary object, and therefore the line:
for c in item['commit']['committer']: is transiting dictionary keys.
Since you are calling [] on a string (the dictionary key), you are getting the error.
Instead that code should look more like:
def git():
link = 'https://api.github.com/repos/droptable461/Project-Project-Management/events'
r = requests.get('https://api.github.com/repos/droptable461/Project-Project-Management/commits')
for item in r.json():
for key in item['commit']['committer']:
print(item['commit']['committer']['name'])
print(item['commit']['committer']['date'])
print(item['commit']['message'])
return 'suc'

How to access place and geo objects in tweet JSON object

I am currently trying to access the place names and coordinates of tweets from a json file created by twitter's API. While not all of my tweets include these attributes, some do and id like to collect them. my current approach is:
for line in tweets_json:
try:
tweet = json.loads(line.strip()) # only messages contains 'text' field is a tweet
tweet_id = (tweet['id']) # This is the tweet's id
created_at = (tweet['created_at']) # when the tweet posted
text = (tweet['text']) # content of the tweet
user_id = (tweet['user']['id']) # id of the user who posted the tweet
hashtags = []
for hashtag in tweet['entities']['hashtags']:
hashtags.append(hashtag['text'])
lat = []
long = []
for coordinates in tweet['coordinates']['coordinates']:
lat.append(coordinates[0])
long.append(coordinates[1])
country_code = []
place_name = []
for place in tweet['place']:
country_code.append(place['country_code'])
place_name.append(place['full_name'])
except:
# read in a line is not in JSON format (sometimes error occured)
continue
As of right now, no attribute past Hashtags are being collected, Am I trying to access the attributes wrong? more information regarding the JSON object can be found here https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
By wrapping all your code in a Try/Except block, you're passing over every error that occurs, including KeyErrors when trying to access a 'coordinates' that doesn't exist
If some of the parsed tweet dictionaries contain a key, and you want to collect them, you can do something like this:
from json import JSONDecodeError
for line in tweets_json:
# try to parse json
try:
tweet = json.loads(line.strip()) # only messages contains 'text' field is a tweet
except JSONDecodeError:
print('bad json')
continue
tweet_id = (tweet['id']) # This is the tweet's id
created_at = (tweet['created_at']) # when the tweet posted
text = (tweet['text']) # content of the tweet
user_id = (tweet['user']['id']) # id of the user who posted the tweet
hashtags = []
for hashtag in tweet['entities']['hashtags']:
hashtags.append(hashtag['text'])
lat = []
long = []
# this is how you check for the presence of coordinates
if 'coordinates' in tweet and 'coordinates' in tweet['coordinates']:
for coordinates in tweet['coordinates']['coordinates']:
lat.append(coordinates[0])
long.append(coordinates[1])
country_code = []
place_name = []
for place in tweet['place']:
country_code.append(place['country_code'])
place_name.append(place['full_name'])

How to create single variable in Python with two columns either as tuple or Pandas dataframe?

I need to consolidate these two tweet datasets into a single variable. The variable needs to have two "columns," one for the text of the tweets, the other a binary indicator of the source (e.g. 0 for the first source, 1 for the second). I can use a list of tuples or a Pandas dataframe. I am brand new to coding, so I am not sure how to proceed. I understand that I could create two dictionaries and combine them, but not sure how to add the column that contains the binary indicator. This is where I am now:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
userNRA = api.get_user("NRA")
userCSGV = api.get_user("CSGV")
c_nra = tweepy.Cursor(api.user_timeline, id="NRA")
NRAtweet_store = []
for status in c_nra.items(500):
NRAtweet_store.append(status.text)
c_csgv = tweepy.Cursor(api.user_timeline, id="CSGV")
CSGVtweet_store = []
for status in c_csgv.items(500):
CSGVtweet_store.append(status.text)
Rather than appending just the text, append the text and a flag:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
userNRA = api.get_user("NRA")
userCSGV = api.get_user("CSGV")
tweets = []
c_nra = tweepy.Cursor(api.user_timeline, id="NRA")
for status in c_nra.items(500):
tweets.append((status.text, 0))
c_csgv = tweepy.Cursor(api.user_timeline, id="CSGV")
for status in c_csgv.items(500):
tweets.append((status.text, 1))
This will leave you with one list of tuples, with the second entry in each tuple indicating the source of the first entry.

I am trying to extract tweets from a twitter query on Python using Twython

I am trying to go through a list of tweets related to a specific search term and trying to extract all the hashtags. I wish to make a python list which includes all the hashtags. I started by using Twython as follows
from twython import Twython
api_key = 'xxxx'
api_secret = 'xxxx'
acces_token = 'xxxx'
ak_secret = 'xxxx'
t = Twython(app_key = api_key, app_secret = api_secret, oauth_token = acces_token, oauth_token_secret = ak_secret)
search = t.search(q = 'Python', count = 10)
tweets = search['statuses']
hashtags = []
for tweet in tweets:
b = (tweet['text'],"\n")
if b.startswith('#'):
hastags.append(b)
It doesn't seem to be working. I get the error that
'tuple object has no attribute startswith'
I am not sure if I am meant to make a list of all the statuses first and extract using the mentioned method. Or it is okay to proceed without making the list of statuses first.
Thank you
That is correct, strings have the startswith attribute and tuples do not.
Change the last three lines to this:
b = (tweet['text'])
if b.startswith("#") is True:
hashtags.append(b)
If you really want that line break then it would be:
b = (tweet['text'] + "\n")
if b.startswith("#") is True:
hashtags.append(b)

Categories

Resources