Tweepy double scraping

I have been using Tweepy to scrape Twitter for about 9 months. On Friday of last week my scraper stopped working correctly in two ways: 1) it returns an empty list even when tweets are present on the user's profile, and 2) it scrapes old tweets when only the most recent tweet should be scraped. Has anyone experienced the same issues? Any suggested fixes would be appreciated!
def get_tweets(username):
    # Authorization to consumer key and consumer secret
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    # Access to user's access key and access secret
    auth.set_access_token(access_key, access_secret)
    # Calling api
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    text_of_tweet = None
    tweet_id = None
    number_of_tweets = 1
    # Scrape the most recent tweet on the user's timeline
    tweet = api.user_timeline(screen_name=username, count=number_of_tweets, include_rts=False)
    # Check if string all ascii
    for item in tweet:
        text_of_tweet = item.text
        tweet_id = item.id
        if (all(ord(c) < 128 for c in text_of_tweet)) == False:
            text_of_tweet = conv_true_ascii(text_of_tweet)
        list_of_sentences = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', text_of_tweet)
        text_of_tweet = list_of_sentences[0]
        text_of_tweet = text_of_tweet.split('\n')[0]
        # Write to CSV
        # csvWriter.writerow([text_of_tweet, tweet_time, tweet_id])
    # Return tweet
    return text_of_tweet, tweet_id

def conv_true_ascii(single_tweet):
    edit_start = single_tweet.encode('ascii', errors='ignore')
    edited_tweet = edit_start + b'' * (len(single_tweet) - len(edit_start))
    edited_tweet = str(edited_tweet)
    edited_tweet = edited_tweet.replace("b'", '')
    edited_tweet = edited_tweet.replace(edited_tweet[-1], '')
    return edited_tweet
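As a side note on the helper above: conv_true_ascii appears to be meant to drop non-ASCII characters, but converting bytes with str() and then stripping the quote character also removes apostrophes from the tweet text. A minimal sketch of a simpler approach, assuming that dropping non-ASCII characters is the only goal (the name strip_non_ascii is made up for illustration and is not part of the original script):

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    # encode() discards non-ASCII characters; decode() turns the bytes back into str
    return text.encode("ascii", errors="ignore").decode("ascii")


# Illustrative input only
cleaned = strip_non_ascii("caf\u00e9 isn't bad")
print(cleaned)
```

This does not address the empty-list problem itself; it only removes one source of mangled text.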

Related

How to get the likes of every tweet containing a specific hashtag with tweepy

I can retrieve tweets with a specific hashtag using tweepy:
Code:
import tweepy
import configparser
import pandas as pd

# config = configparser.ConfigParser()
# config.read('config.ini')
api_key = ''
api_key_secret = ''
access_token = ''
access_token_secret = ''
auth = tweepy.OAuthHandler(api_key, api_key_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# user = '#veritasium'
keywords = '#SheHulk'
limit = 1200
tweets = tweepy.Cursor(api.search_tweets, q=keywords, count=100, tweet_mode='extended').items(limit)
columns = ['User', 'Tweet']
data = []
for tweet in tweets:
    data.append([tweet.user.screen_name, tweet.full_text])
df = pd.DataFrame(data, columns=columns)
df.to_excel("output.xlsx")
What I want to know is whether I can also get the number of likes for every tweet that is retrieved. Any help would be appreciated.

In the Twitter API v1.1 (see the API documentation), that field was called favorite_count.
for tweet in tweets:
    print(f"That tweet has {tweet.favorite_count} likes")
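To carry the like count into the spreadsheet rather than printing it, the field can be appended to each row. The sketch below is an illustration under assumptions: the helper name tweet_rows is hypothetical, and the sample objects are stand-ins that only mimic the three attributes used in the question so the code runs without credentials.

```python
from types import SimpleNamespace


def tweet_rows(tweets):
    """Build [user, text, likes] rows from tweet-like objects."""
    rows = []
    for tweet in tweets:
        rows.append([tweet.user.screen_name, tweet.full_text, tweet.favorite_count])
    return rows


# Stand-in objects; real ones would come from the Cursor in the question
sample = [
    SimpleNamespace(user=SimpleNamespace(screen_name="alice"),
                    full_text="Loved #SheHulk", favorite_count=12),
    SimpleNamespace(user=SimpleNamespace(screen_name="bob"),
                    full_text="Meh", favorite_count=0),
]
rows = tweet_rows(sample)
print(rows)
```

The rows can then be passed to pd.DataFrame with columns such as ['User', 'Tweet', 'Likes'].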

Retrieving specific tweets of a user

I am trying to scrape tweets from a specified user based on a specific keyword using Tweepy. I tried using
if api.search(q="$"):
but I am running into an error. How can I solve this problem?
# Import the libraries
import tweepy

api_key = ""
api_key_secret = ""
access_token = ""
access_token_secret = ""
auth_handler = tweepy.OAuthHandler(consumer_key=api_key, consumer_secret=api_key_secret)
auth_handler.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth_handler, wait_on_rate_limit=True)
user = api.get_user("TheShual")
print("User details:")
print(user.name)
print(user.description)
print(user.location)
userID = "TheShual"
tweets = api.user_timeline(screen_name=userID,
                           # 200 is the maximum allowed count
                           count=20,
                           include_rts=False,
                           # Necessary to keep full_text
                           # otherwise only the first 140 characters are extracted
                           tweet_mode='extended'
                           )
for info in tweets[:10]:
    if api.search(q="$"):
        print(info.created_at)
        print(info.full_text)
        print("\n")
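The error message is not shown, so the cause can only be guessed at. Two things stand out: Tweepy 4 renamed API.search to API.search_tweets, and in any version that call runs a separate search across Twitter, so its result says nothing about whether the current timeline tweet contains the keyword. One possible approach is to filter the timeline results locally. A minimal sketch, where filter_by_keyword is a hypothetical helper and the sample objects are stand-ins exposing only full_text:

```python
from types import SimpleNamespace


def filter_by_keyword(tweets, keyword):
    """Keep only the tweets whose full_text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [tweet for tweet in tweets if needle in tweet.full_text.lower()]


# Stand-in objects; real ones would come from api.user_timeline(..., tweet_mode='extended')
timeline = [
    SimpleNamespace(full_text="Buying more $TSLA today"),
    SimpleNamespace(full_text="Good morning everyone"),
]
matches = filter_by_keyword(timeline, "$")
for info in matches:
    print(info.full_text)
```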

How to extract 1000 tweets using Python?

I’m trying to extract tweets based on the country name but the code always retrieves small amounts of tweets (about 23, 50 and 70, not more than that). Does anyone know how to retrieve tweets around (1000-5000)?
# these are not my real credentials
# Consumer:
CONSUMER_KEY = 'xxx'
CONSUMER_SECRET = 'ttt'
# Access:
ACCESS_TOKEN = 'rffg'
ACCESS_SECRET = 'mmvvvt'
import tweepy
import csv

# get authorization
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)
# get tweets from country
place = api.geo_search(query="Saudi Arabia", granularity="country", since='10')
place_id = place[0].id
# print tweets and save to csv file
with open('tweets.csv', 'w', newline='', encoding='utf-8') as csvFile:
    tweetWriter = csv.writer(csvFile, delimiter=',')
    tweets = api.search(q='place:%s' % place_id, count=100, since='1')
    count = 0
    for tweet in tweets:
        count += 1
        # tweet.id = unique id for tweet, text = text, place.name = where it was posted, created_at = UTC time
        tweetData = [tweet.id, tweet.user.name, tweet.text, tweet.place.name, tweet.created_at]
        tweetWriter.writerow(tweetData)
    print(count)

Extract date and time of a tweet tweepy Python

I have found a Python script that extracts tweets and stores them in a CSV file. I am not familiar with Python yet. Besides the tweets, I also need to extract the date and time of each tweet. I have found how to extract other characteristics, such as "retweeted" and "retweet_count", but I am still stuck on the date and time.
The script is here:
#!/usr/bin/env python
# encoding: utf-8
import tweepy  # https://github.com/tweepy/tweepy
import csv

# Twitter API credentials
consumer_key = "..........................."
consumer_secret = "..........................."
access_key = "..........................."
access_secret = "..........................."
screename = "#realDonaldTrump"

def get_all_tweets(screen_name):
    # Twitter only allows access to a user's most recent 3240 tweets with this method
    # authorize twitter, initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    # initialize a list to hold all the tweepy Tweets
    alltweets = []
    # make initial request for most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screename, count=200)
    screen_name = "Donald J. Trump"
    # save most recent tweets
    alltweets.extend(new_tweets)
    # save the id of the oldest tweet less one
    oldest = alltweets[-1].id - 1
    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        print("getting tweets before %s" % (oldest))
        # all subsequent requests use the max_id param to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screename, count=200, max_id=oldest)
        # save most recent tweets
        alltweets.extend(new_tweets)
        # update the id of the oldest tweet less one
        oldest = alltweets[-1].id - 1
        print("...%s tweets downloaded so far" % (len(alltweets)))
    # transform the tweepy tweets into a 2D array that will populate the csv
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text, tweet.favorite_count, tweet.retweet_count, tweet.favorited, tweet.retweeted] for tweet in alltweets]
    # write the csv (text mode with newline='' is what the csv module expects on Python 3)
    with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text", "favorite_count", "retweet_count", "favorited", "retweeted"])
        writer.writerows(outtweets)
    pass

if __name__ == '__main__':
    # pass in the username of the account you want to download
    get_all_tweets(screename)

The tweepy tweet model has created_at:
created_at: Creation time of the Tweet.
Type: datetime.datetime | None
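Because created_at is a datetime.datetime, the date and the time can be written as separate values with strftime. A minimal sketch; the helper name split_created_at and the sample value are made up for illustration:

```python
from datetime import datetime, timezone


def split_created_at(created_at):
    """Return (date, time) strings for a tweet's created_at value."""
    if created_at is None:
        # The documented type allows None, so pass that through
        return None, None
    return created_at.strftime("%Y-%m-%d"), created_at.strftime("%H:%M:%S")


# Illustrative value; a real one would be tweet.created_at
sample = datetime(2022, 8, 18, 14, 30, 5, tzinfo=timezone.utc)
date_part, time_part = split_created_at(sample)
print(date_part, time_part)
```

In the script from the question, the two strings could replace the single tweet.created_at entry in each outtweets row, with the header row adjusted to match.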
An interesting fact is that you can also derive the time from the tweet ID. Tweet IDs are k-sorted within a second bound. You can extract the timestamp from a tweet ID by right-shifting the ID by 22 bits and adding the Twitter epoch of 1288834974657; the result is in milliseconds since the Unix epoch.
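The shift-and-add rule can be sketched as follows. The function names are made up for illustration, the example ID is synthetic rather than a real tweet, and the rule only applies to IDs generated under this scheme (roughly, tweets from late 2010 onward):

```python
from datetime import datetime, timedelta, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter epoch in milliseconds, as quoted above
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tweet_id_to_millis(tweet_id):
    """Milliseconds since the Unix epoch encoded in a tweet ID."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS


def tweet_id_to_datetime(tweet_id):
    """UTC datetime encoded in a tweet ID."""
    # Integer timedelta arithmetic avoids float rounding of the milliseconds
    return UNIX_EPOCH + timedelta(milliseconds=tweet_id_to_millis(tweet_id))


# Synthetic ID whose timestamp part is 1000 ms after the Twitter epoch
example_id = 1000 << 22
created = tweet_id_to_datetime(example_id)
print(created.isoformat())
```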

Tweepy Search w/ While Loop

This is driving me crazy. As you can see below I am trying to use a simple while loop to perform a couple of tweepy searches and append them into a data frame. For some reason however after pulling the first set of 100 tweets it just repeats that set instead of performing a new search. Any advice would be greatly appreciated.
import sys
import csv
import pandas as pd
import tweepy
from tweepy import OAuthHandler

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
num_results = 200
result_count = 0
last_id = None
df = pd.DataFrame(columns=['Name', 'Location', 'Followers', 'Text', 'Coordinates'])
while result_count < num_results:
    result = api.search(q='', count=100, geocode="38.996918,-104.995826,190mi", since_id=last_id)
    for tweet in result:
        user = tweet.user
        last_id = tweet.id_str
        name = user.name
        friends = user.friends_count
        followers = user.followers_count
        text = tweet.text.encode('utf-8')
        location = user.location
        coordinates = tweet.coordinates
        df.loc[result_count] = pd.Series({'Name': name, 'Location': location, 'Followers': followers, 'Text': text, 'Coordinates': coordinates})
        print(text)
        result_count += 1
# Save to Excel
print("Writing all tables to Excel...")
df.to_csv('out.csv')
print("Excel Export Complete.")

The API.search method returns tweets that match a specified query. It is not a streaming API, so it returns all of its data at once.
Furthermore, your query parameters include count, which specifies the number of statuses to retrieve.
So the problem is that each iteration of the while loop returns the first 100 results of the complete set.
I suggest changing the code to something like this:
result = api.search(q='', geocode="38.996918,-104.995826,190mi", since_id=last_id)
for tweet in result:
    user = tweet.user
    last_id = tweet.id_str
    name = user.name
    friends = user.friends_count
    followers = user.followers_count
    text = tweet.text.encode('utf-8')
    location = user.location
    coordinates = tweet.coordinates
    df.loc[result_count] = pd.Series({'Name': name, 'Location': location, 'Followers': followers, 'Text': text, 'Coordinates': coordinates})
    print(text)
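Another possible cause of the repeat, assuming the standard v1.1 search semantics: since_id only excludes tweets at or below that ID, so every request still returns the newest matches, whereas max_id returns tweets at or below the given ID. Paging backwards therefore needs max_id set just below the oldest ID seen so far, which is the same idea the user_timeline script earlier on this page uses. The sketch below shows that loop in isolation; collect_pages, fetch_page and fake_fetch are hypothetical names, and the fake data stands in for the API so the example runs without credentials.

```python
from types import SimpleNamespace


def collect_pages(fetch_page, limit):
    """Collect up to `limit` tweets by walking backwards with max_id."""
    collected = []
    max_id = None  # None means "start from the newest tweets"
    while len(collected) < limit:
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # Next request: only tweets strictly older than the oldest one seen
        max_id = min(tweet.id for tweet in page) - 1
    return collected[:limit]


# Fake backend: 250 tweets, newest first, at most 100 per request
FAKE_TWEETS = [SimpleNamespace(id=i) for i in range(250, 0, -1)]


def fake_fetch(max_id, count=100):
    eligible = [t for t in FAKE_TWEETS if max_id is None or t.id <= max_id]
    return eligible[:count]


tweets = collect_pages(fake_fetch, limit=200)
print(len(tweets), tweets[0].id, tweets[-1].id)
```

With Tweepy, fetch_page would wrap the search call from the question and forward max_id to it; how the library treats a max_id of None is not checked here, so the first request may need to omit the parameter.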
