Extract date from tweets (Tweepy, Python)

I'm new to Python, so I'm struggling a bit with this. The code below gets the text of tweets containing the hashtag #bitcoin, and I want to extract the date and author as well as the text. I've tried different things but am stuck right now.
I'd greatly appreciate any help with this.
import pandas as pd
import numpy as np
import tweepy
api_key = '*'
api_secret_key = '*'
access_token = '*'
access_token_secret = '*'
authentication = tweepy.OAuthHandler(api_key, api_secret_key)
authentication.set_access_token(access_token, access_token_secret)
api = tweepy.API(authentication, wait_on_rate_limit=True)
#Get tweets about Bitcoin and filter out any retweets
search_term = '#bitcoin -filter:retweets'
tweets = tweepy.Cursor(api.search_tweets, q=search_term, lang='en', since='2018-11-01', tweet_mode='extended').items(50)
all_tweets = [tweet.full_text for tweet in tweets]
df = pd.DataFrame(all_tweets, columns=['Tweets'])
df.head()

If you use dir(tweet), then you see all the attributes and methods of the tweet object:
author
contributors
coordinates
created_at
destroy
display_text_range
entities
extended_entities
favorite
favorite_count
favorited
full_text
geo
id
id_str
in_reply_to_screen_name
in_reply_to_status_id
in_reply_to_status_id_str
in_reply_to_user_id
in_reply_to_user_id_str
is_quote_status
lang
metadata
parse
parse_list
place
possibly_sensitive
retweet
retweet_count
retweeted
retweets
source
source_url
truncated
user
And there is created_at:
all_tweets = []
for tweet in tweets:
    #print('\n'.join(dir(tweet)))
    all_tweets.append([tweet.full_text, tweet.created_at])
df = pd.DataFrame(all_tweets, columns=['Tweets', 'Created At'])
df.head()
Result:
Tweets Created At
0 #Ralvero Of course $KAWA ready for 100x 🚀#ETH ... 2022-03-26 13:51:06+00:00
1 Pairs:1INCHUSDT \n SELL:1.58500\n Time :3/26/2... 2022-03-26 13:51:06+00:00
2 #hotcrosscom #iSafePal 🌐 First LIVE Dapp: Cylu... 2022-03-26 13:51:04+00:00
3 #Justdoitalex #Isabel_Schnabel Finally a truth... 2022-03-26 13:51:03+00:00
4 #Bitcoin has rejected for the fourth time the ... 2022-03-26 13:50:55+00:00
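The question also asked for the author. With Tweepy the author's handle lives on the nested User object (tweet.user.screen_name; tweet.author is the same User object). A minimal sketch, using SimpleNamespace stand-ins for real Status objects since only the attribute layout matters here:

```python
from types import SimpleNamespace

def tweet_row(tweet):
    # The author's handle lives on the nested User object
    # (tweet.user.screen_name); tweet.author is the same User.
    return [tweet.full_text, tweet.created_at, tweet.user.screen_name]

# Stand-in for a Tweepy Status object
sample = SimpleNamespace(
    full_text="#bitcoin to the moon",
    created_at="2022-03-26 13:51:06+00:00",
    user=SimpleNamespace(screen_name="alice"),
)
row = tweet_row(sample)
```

With real results you would build the DataFrame from these rows with columns=['Tweets', 'Created At', 'Author'].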
But your code has a problem with since, because it seems it was removed in version 3.8.
See: Collect tweets in a specific time period in Tweepy, until and since doesn't work
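Since the since operator no longer works, one workaround is to fetch recent tweets and filter on created_at client-side. A minimal sketch, using plain dicts as stand-ins for Tweepy Status objects (real ones expose .created_at as an attribute rather than a key):

```python
from datetime import datetime, timezone

def filter_since(tweets, cutoff):
    # Keep only tweets created on or after the cutoff datetime.
    return [t for t in tweets if t["created_at"] >= cutoff]

# Hypothetical sample data standing in for Tweepy results
sample = [
    {"full_text": "old", "created_at": datetime(2018, 10, 1, tzinfo=timezone.utc)},
    {"full_text": "new", "created_at": datetime(2018, 11, 2, tzinfo=timezone.utc)},
]
kept = filter_since(sample, datetime(2018, 11, 1, tzinfo=timezone.utc))
```

Note this only filters what the standard search endpoint already returns (roughly the last 7 days), so it cannot reach back to 2018.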

Related

How to search a specific country's tweets with Tweepy client.search_recent_tweets()

I'm trying to figure out how to filter for a specific country's tweets using search_recent_tweets. I take a country name as input, use pycountry to get the 2-character country code, and then try to put some sort of location filter either in my query or in the search_recent_tweets params. Nothing I have tried so far has worked.
import tweepy
from tweepy import OAuthHandler
from tweepy import API
import pycountry as pyc
# upload token
BEARER_TOKEN = 'XXXXXXXXX'
# get tweets
client = tweepy.Client(bearer_token=BEARER_TOKEN)
# take user input
countryQuery = input("Find recent tweets about travel in a certain country (input country name): ")
keyword = 'women safe'  # gets tweets containing women and safe for that country (safe will catch safety)
# get country code to plug in as param in search_recent_tweets
country_code = str(pyc.countries.search_fuzzy(countryQuery)[0].alpha_2)
# get 100 recent tweets containing keywords and from location = countryQuery
query = str(keyword + ' place_country=' + str(countryQuery) + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
# expansions=geo.place_id, place.fields=[country_code],
# filter posts to remove retweets
# export tweets to json
import json
with open('twitter.json', 'w') as fp:
    for tweet in posts.data:
        json.dump(tweet.data, fp)
        fp.write('\n')
        print("* " + str(tweet.text))
I have tried variations of:
query = str(keyword + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, place_fields=[str(countryQuery), country_code], max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
and:
query = str(keyword + ' place.fields=' + str(countryQuery) + ',' + country_code + ' -is:retweet')  # search for keyword and no retweets
posts = client.search_recent_tweets(query=query, max_results=100, tweet_fields=['id', 'text', 'entities', 'author_id'])
These either pulled NoneType tweets (i.e. nothing) or caused a
"The place.fields query parameter value [Germany] is not one of [contained_within,country,country_code,full_name,geo,id,name,place_type]"
error. The documentation for search_recent_tweets makes it seem like place.fields / place_fields / place_country should be supported.
Any advice would help!
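For what it's worth, location filtering in v2 search is done with the place_country: operator inside the query string itself, not via place_fields (which only selects which place fields are returned for matched tweets). As far as I know, place_country is only available on elevated/Academic access tiers, which may explain the empty results. A hedged sketch of building such a query:

```python
def build_query(keyword, country_code, exclude_retweets=True):
    # place_country: is a v2 search *query operator*, so it belongs
    # inside the query string (note: it requires an elevated access tier).
    parts = [keyword, f"place_country:{country_code}"]
    if exclude_retweets:
        parts.append("-is:retweet")
    return " ".join(parts)

q = build_query("women safe", "DE")
# pass q as the query= argument to client.search_recent_tweets
```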

Unexpected Parameter: id on my python jupyter code

I built code to do sentiment analysis in Python; my code was like this:
import tweepy
api_key = "sdlksadksa;ldksald"
api_secret_key = "sakdlas,mcsdmv,dlv"
access_token = "alskdklamlas"
access_token_secret = "salkdjklmclqm"
auth = tweepy.OAuthHandler(api_key, api_secret_key)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
hasilUser = api.user_timeline(id="jokowi" , count = 10)
In case you're wondering, my api_key through access_token_secret values are dummies, not the real ones.
When I run hasilUser, I don't know why it turns out like this:
Unexpected parameter: id
I have no idea what is going on or what I should do.
I think the input parameters of API.user_timeline changed (maybe from id to user_id).
You can see the actual parameters in the code below, around line 530:
https://github.com/tweepy/tweepy/blob/master/tweepy/api.py
@pagination(mode='id')
@payload('status', list=True)
def user_timeline(self, **kwargs):
    """user_timeline(*, user_id, screen_name, since_id, count, max_id, \
                     trim_user, exclude_replies, include_rts)

    Returns the 20 most recent statuses posted from the authenticating user
    or the user specified. It's also possible to request another user's
    timeline via the id parameter.

    Parameters
    ----------
    user_id
        |user_id|
    screen_name
        |screen_name|
    since_id
        |since_id|
    count
        |count|
    max_id
        |max_id|
    trim_user
        |trim_user|
    exclude_replies
        |exclude_replies|
    include_rts
        When set to ``false``, the timeline will strip any native retweets
        (though they will still count toward both the maximal length of the
        timeline and the slice selected by the count parameter). Note: If
        you're using the trim_user parameter in conjunction with
        include_rts, the retweets will still contain a full user object.

    Returns
    -------
    :py:class:`List`\[:class:`~tweepy.models.Status`]

    References
    ----------
    https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline
    """
    return self.request(
        'GET', 'statuses/user_timeline', endpoint_parameters=(
            'user_id', 'screen_name', 'since_id', 'count', 'max_id',
            'trim_user', 'exclude_replies', 'include_rts'
        ), **kwargs
    )
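So the fix is to call user_timeline with screen_name (or user_id for numeric IDs) instead of id, e.g. api.user_timeline(screen_name="jokowi", count=10). A small helper that picks the right keyword could look like this (the helper name is hypothetical, not part of Tweepy):

```python
def timeline_params(identifier, count=10):
    # Choose user_id for numeric IDs and screen_name otherwise,
    # since `id` is no longer an accepted endpoint parameter.
    key = "user_id" if str(identifier).isdigit() else "screen_name"
    return {key: identifier, "count": count}

params = timeline_params("jokowi")
# hypothetical usage: hasilUser = api.user_timeline(**params)
```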

Twitter API: How to search tweets based on query words and predetermined time span + tweets characteristics

Novice programmer here seeking help. I have a list of hashtags for which I want to get all the historical tweets from 01-01-2015 to 31-12-2018.
I tried to use the Tweepy library, but it only allows access to the last 7 days of tweets. I also tried GetOldTweets, as it gives access to historical tweets, but it kept crashing. So now I have acquired premium API access for Twitter, which also gives me access to the full tweet history. In order to do my query with the premium API, I cannot use the Tweepy library (as it does not have a link to the premium APIs, right?), and my choices are between TwitterAPI and Search-Tweets.
1- Do TwitterAPI and Search-Tweets supply information regarding the user name, user location, whether the user is verified, the language of the tweet, the source of the tweet, the count of retweets and favourites, and the date for each tweet (as Tweepy does)? I could not find any information about this.
2- Can I supply a time span in my query?
3- How do I do all of this?
This was my code for the Tweepy library:
hashtags = ["#AAPL", "#FB", "#KO", "#ABT", "#PEPCO", ...]
df = pd.DataFrame(columns=["Hashtag", "Tweets", "User", "User_Followers",
                           "User_Location", "User_Verified", "User_Lang", "User_Status",
                           "User_Method", "Fav_Count", "RT_Count", "Tweet_date"])
def tweepy_df(df, tags):
    for cash in tags:
        i = len(df) + 1
        for tweet in tweepy.Cursor(api.search, q=cash, since="2015-01-01", until="2018-12-31").items():
            print(i, end='\r')
            df.loc[i, "Hashtag"] = cash
            df.loc[i, "Tweets"] = tweet.text
            df.loc[i, "User"] = tweet.user.name
            df.loc[i, "User_Followers"] = tweet.user.followers_count
            df.loc[i, "User_Location"] = tweet.user.location
            df.loc[i, "User_Verified"] = tweet.user.verified
            df.loc[i, "User_Lang"] = tweet.lang
            df.loc[i, "User_Status"] = tweet.user.statuses_count
            df.loc[i, "User_Method"] = tweet.source
            df.loc[i, "Fav_Count"] = tweet.favorite_count
            df.loc[i, "RT_Count"] = tweet.retweet_count
            df.loc[i, "Tweet_date"] = tweet.created_at
            i += 1
    return df
How do I adapt this for, for example, the TwitterAPI library?
I know that it should be adapted to something like this:
for tweet in api.request('search/tweets', {'q': cash}):
But it is still missing the desired timespan, and I'm not sure whether the names of the characteristics match the ones for these libraries.
Using TwitterAPI, you can make Premium Search requests this way:
from TwitterAPI import TwitterAPI
SEARCH_TERM = '#AAPL OR #FB OR #KO OR #ABT OR #PEPCO'
PRODUCT = 'fullarchive'
LABEL = 'your label'
api = TwitterAPI('consumer key', 'consumer secret', 'access token key', 'access token secret')
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), {'query': SEARCH_TERM})
for item in r:
    if 'text' in item:
        print(item['text'])
        print(item['user']['name'])
        print(item['user']['followers_count'])
        print(item['user']['location'])
        print(item['user']['verified'])
        print(item['lang'])
        print(item['user']['statuses_count'])
        print(item['source'])
        print(item['favorite_count'])
        print(item['retweet_count'])
        print(item['created_at'])
The Premium search doc explains the supported request arguments. To do a date range, use this:
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM, 'fromDate': 201501010000, 'toDate': 201812310000})
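The fromDate/toDate values are minute-granularity YYYYMMDDhhmm timestamps. A small helper to build them from datetime objects (a hypothetical convenience, not part of TwitterAPI):

```python
from datetime import datetime

def premium_date(dt):
    # Premium search expects minute-granularity YYYYMMDDhhmm values.
    return dt.strftime("%Y%m%d%H%M")

params = {
    "query": "#AAPL OR #FB",
    "fromDate": premium_date(datetime(2015, 1, 1)),
    "toDate": premium_date(datetime(2018, 12, 31)),
}
# pass params as the second argument to api.request(...)
```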

Validate if User Id on Twitter to be able to scrape tweets

I created a scraper with Python that gets all the followers of a particular Twitter user. The issue is that when I use this list of user IDs to get their tweets with Logstash, I get an error.
I used http://gettwitterid.com/ to manually check whether these IDs are working, and they are, but the list is really long to check one by one.
Is there a solution with Python to split the IDs into two lists, one containing valid IDs and the other containing the invalid ones, so that I can use the valid list as input for Logstash?
The first 10 rows of the csv file are like this:
"id"
"602169027"
"95104995"
"874339739557670912"
"2981270769"
"93054327"
"870723159011545088"
"3008493180"
"874804469082533888"
"756339889092829184"
"1077712806"
I tried this code to get tweets using IDs imported from the CSV, but unfortunately it raises error 144 (Not Found):
import tweepy
import pandas as pd
consumer_key = ""
consumer_secret = ""
access_token_key = "-"
access_token_secret = ""
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
dfuids = pd.read_csv('Uids.csv')
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(dfuids['id'])
importing ids from csv
Try changing your code to this:
for index, row in dfuids.iterrows():
    print(row['id'])
    tweet = api.get_status(row['id'])
To avoid potential errors, you can add a try/except block later.
I got the solution after some experiments:
dfuids = pd.read_csv('Uids.csv')
valid = []
notvalid = []
for index, row in dfuids.iterrows():
    print(index)
    x = str(row.id)
    #print(x, type(x))
    try:
        tweet = api.user_timeline(row.id)
        #print("Fine:", row.id)
        valid.append(x)
        #print(x, "added to valid")
    except:
        #print("NotOk:", row.id)
        notvalid.append(x)
        #print(x, "added to notvalid")
This part of the code was what I needed: it loops over all the IDs and tests whether each user ID gives us some tweets from the timeline. If it does, the ID is appended as a string to a list called valid; if we get an exception for any reason, it is appended to notvalid.
We can save these lists into dataframes and export them to CSV:
df = pd.DataFrame(valid)
dfnotv = pd.DataFrame(notvalid)
df.to_csv('valid.csv', index=False, encoding='utf-8')
dfnotv.to_csv('notvalid.csv', index=False, encoding='utf-8')
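The same valid/invalid split can be factored into a function that takes the fetch call as a parameter, which makes it testable without hitting the API; with Tweepy you would pass api.user_timeline as fetch. The broad except mirrors the answer above, though catching Tweepy's specific error class would be stricter:

```python
def split_valid(ids, fetch):
    # fetch(uid) should raise on a bad ID; any exception sends
    # the ID to the notvalid list, mirroring the loop above.
    valid, notvalid = [], []
    for uid in ids:
        try:
            fetch(uid)
            valid.append(str(uid))
        except Exception:
            notvalid.append(str(uid))
    return valid, notvalid

# Offline demo with a stub standing in for api.user_timeline
def fake_fetch(uid):
    if uid == 2:
        raise ValueError("not found")

valid, notvalid = split_valid([1, 2, 3], fake_fetch)
```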

I am trying to extract tweets from a twitter query on Python using Twython

I am trying to go through a list of tweets related to a specific search term and extract all the hashtags into a Python list. I started by using Twython as follows:
from twython import Twython
api_key = 'xxxx'
api_secret = 'xxxx'
acces_token = 'xxxx'
ak_secret = 'xxxx'
t = Twython(app_key=api_key, app_secret=api_secret, oauth_token=acces_token, oauth_token_secret=ak_secret)
search = t.search(q='Python', count=10)
tweets = search['statuses']
hashtags = []
for tweet in tweets:
    b = (tweet['text'], "\n")
    if b.startswith('#'):
        hastags.append(b)
It doesn't seem to be working; I get the error
'tuple' object has no attribute 'startswith'
I am not sure if I am meant to make a list of all the statuses first and extract using the mentioned method, or whether it is okay to proceed without making the list of statuses first.
Thank you
That is correct: strings have the startswith method and tuples do not.
Change the last three lines to this:
b = tweet['text']
if b.startswith("#"):
    hashtags.append(b)
If you really want that line break, then it would be:
b = tweet['text'] + "\n"
if b.startswith("#"):
    hashtags.append(b)
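As an aside, startswith only catches tweets whose text begins with a hashtag. The search response already includes parsed hashtags under each status's entities key, so iterating over those is more robust; a sketch using a sample status dict in the shape the API returns:

```python
def extract_hashtags(status):
    # Twitter returns parsed hashtags under entities -> hashtags,
    # which also catches tags that are not at the start of the text.
    return ["#" + h["text"] for h in status.get("entities", {}).get("hashtags", [])]

sample = {
    "text": "Learning #Python with #Twython",
    "entities": {"hashtags": [{"text": "Python"}, {"text": "Twython"}]},
}
tags = extract_hashtags(sample)
```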
