Downloading tweets with Tweepy - python

I have a script that downloads a number of tweets using Tweepy's Cursor. The issue is that if I specify the number of tweets to be downloaded, Tweepy downloads many, many tweets, of which 90 percent are duplicates. Below is my exact code snippet.
qw = ['Pele']
tweet_dataset = pd.DataFrame(columns=['Tweet_id', 'Author'])
for tweet in tw.Cursor(api.search_tweets, tweet_mode='extended', q=qw).items(5):
    appending_dataframe = pd.DataFrame([[tweet.id, tweet.author.screen_name]],
                                       columns=['Tweet_id', 'Author'])
    tweet_dataset = tweet_dataset.append(appending_dataframe)
    print(tweet_dataset[['Author', 'Tweet_id']].head())
From the above script I only want to get back 5 tweets; instead it loops: the first time it prints 1 tweet, the second time 2 tweets, and so on until the fifth time, when it prints 5 tweets. Please see the snippet of the results below:
(https://i.stack.imgur.com/Dnm7y.png)
I only want, say, 5 tweets from the cursor, not 5 growing groups of tweets as shown above.

The head method returns the first 5 rows by default.
Therefore, at every iteration you are printing the first 5 rows of the DataFrame built so far: 1 row in the first iteration (there is only one row yet), 2 rows in the second, and so on.
.head(1) would instead return one row at a time.
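For completeness, here is a minimal sketch of one way to fix it (reusing api, tw, pd and qw from the question): collect the rows in a plain list, build the DataFrame once after the loop, and print once at the end, so only the final 5 rows are shown.
rows = []
for tweet in tw.Cursor(api.search_tweets, tweet_mode='extended', q=qw).items(5):
    # keep only the fields needed for each of the 5 tweets
    rows.append({'Tweet_id': tweet.id, 'Author': tweet.author.screen_name})

tweet_dataset = pd.DataFrame(rows, columns=['Tweet_id', 'Author'])
print(tweet_dataset[['Author', 'Tweet_id']])  # printed once, outside the loop
Building the frame in one go is also generally cheaper than growing a DataFrame row by row inside the loop.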

Related

Using program in another file gives different output

I'm having a rather unique issue with my code that I have not experienced before and could use some guidance.
Here is an attempt at a short explanation:
Basically, I have a program with many functions that are tied to one main one. It takes in data from files sent to it and gives output based on many factors. Running this function in the file itself gives the proper results, however, if I import this function and run it in the main.py, it gives very, very incorrect output.
I am going to do my best to show the least amount of code in this post, so here is the GitHub. Please use it for further reference and understanding of what is happening. I don't know any websites that I can use to link and run my code for these purposes.
sentiment_analysis.py is the file with all of the functions. main.py is the file that utilizes it all, and driver.py is the file given by my prof to test this assignment.
Basic assignment explanation (skip if not needed for answering the question): take in Twitter data from the given files along with keywords that have an associated happiness value. Take all the data, split it into timezone regions (an approximation based on given point values, not real timezones), and then give back basic information about the data in the files, i.e. average happiness per timezone, total keyword tweets, and total tweets for each region.
Running sentiment_analysis will currently give correct output based on heavy testing.
Running main and driver will give incorrect output. Ex. tweets2 has 25 total lines of twitter data, but using driver will return 91 total tweets and keyword tweets (eastern data, 4th test scenario in driver.py) instead of the expected 15 total tweets in that region.
I've spent about 3 hours testing scenarios and outputting different information to try and debug but have had no luck. If anyone has any idea why it's returning different outputs when called in a different file, that would be great.
The following are the three most important functions in the file, with the first being the one called in another file.
def compute_tweets(tweets, keywords):
    try:
        with open(tweets, encoding="utf-8", errors="ignore") as f:  # opens the file
            tweet_list = f.read().splitlines()  # reads and splitlines the file. Gets rid of the \n
        print(tweet_list)
        with open(keywords, encoding="utf-8", errors="ignore") as f:
            keyword_dict = {k: int(v) for line in f for k, v in [line.strip().split(',')]}
            # instead of opening this file normally i am using dictionary comprehension to turn the entire file into a dictionary
            # instead of the standard list which would come from using the readlines() function.
        determine_timezone(tweet_list)  # this will run the function to split all pieces of the file into region specific ones
        eastern = calculations(keyword_dict, eastern_list)
        central = calculations(keyword_dict, central_list)
        mountain = calculations(keyword_dict, mountain_list)
        pacific = calculations(keyword_dict, pacific_list)
        return final_calculation(eastern, central, mountain, pacific)
    except FileNotFoundError as excpt:
        empty_list = []
        print(excpt)
        print("One or more of the files you entered does not exist.")
        return empty_list
# Constants for Timezone Detection
# eastern begin
p1 = [49.189787, -67.444574]
p2 = [24.660845, -67.444574]
# Central begin, eastern end
p3 = [49.189787, -87.518395]
# p4 = [24.660845, -87.518395] - Not needed
# Mountain begin, central end
p5 = [49.189787, -101.998892]
# p6 = [24.660845, -101.998892] - Not needed
# Pacific begin, mountain end
p7 = [49.189787, -115.236428]
# p8 = [24.660845, -115.236428] - Not needed
# pacific end, still pacific
p9 = [49.189787, -125.242264]
# p10 = [24.660845, -125.242264]
def determine_timezone(tweet_list):
    for index, tweet in enumerate(tweet_list):  # takes in index and tweet data and creates a for loop
        long_lat = get_longlat(tweet)  # determines the longlat for the tweet that is currently being worked on
        if float(long_lat[0]) <= float(p1[0]) and float(long_lat[0]) >= float(p2[0]):
            if float(long_lat[1]) <= float(p1[1]) and float(long_lat[1]) > float(p3[1]):
                # this is testing for the eastern region
                eastern_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p3[1]) and float(long_lat[1]) > float(p5[1]):
                # testing for the central region
                central_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p5[1]) and float(long_lat[1]) > float(p7[1]):
                # testing for mountain region
                mountain_list.append(tweet_list[index])
            elif float(long_lat[1]) <= float(p7[1]) and float(long_lat[1]) >= float(p9[1]):
                # testing for pacific region
                pacific_list.append(tweet_list[index])
            else:
                # if nothing is found, continue to the next element in the tweet data and do nothing
                continue
        else:
            # if nothing is found for the longitude, then also continue
            continue
def calculations(keyword_dict, tweet_list):
    # - Constants for calculations and returns
    total_tweets = 0
    total_keyword_tweets = 0
    average_happiness = 0
    happiness_sum = 0
    for entry in tweet_list:  # for each piece of the tweet list
        word_list = input_splitting(entry)  # run through the input splitting for list of words
        total_tweets += 1  # add one to total tweets
        keyword_happened_counter = 0  # used to know if this tweet has already been counted as a keyword tweet.
                                      # Needs to be reset to 0 again in this spot.
        for word in word_list:  # for each word in that word list
            for key, value in keyword_dict.items():  # take the key and respective value for each item in the dict
                # print("key:", key, "val:", value)
                if word == key:  # if the word we got is the same as the key value
                    if keyword_happened_counter == 0:  # and the keyword counter hasn't gone up
                        total_keyword_tweets += 1  # add one to the total keyword tweets
                        keyword_happened_counter += 1  # then add one to keyword happened counter
                    happiness_sum += value  # and, if we have a keyword tweet, no matter what add to the happiness sum
                else:
                    continue  # if we don't have a word == key, continue iterating.
    if total_keyword_tweets != 0:
        average_happiness = happiness_sum / total_keyword_tweets  # calculation for the average happiness value
    else:
        average_happiness = 0
    return [average_happiness, total_keyword_tweets, total_tweets]  # returning a list of info in proper order
My apologies for the wall of both text and code. I'm new to making posts on here and am trying to include all relevant information... If anyone knows of a better way to do this aside from using github and code blocks, please do let me know.
Thanks in advance.

For loop cycle order

I am creating a short script which tweets automatically via the Twitter API. Besides setting up the API credentials (out of the scope of this question), I import the following library:
import os
I have set my working directory to be a folder where I have 3 photos. If I run os.listdir('.') I get the following list.
['Image_1.PNG',
'Image_2.PNG',
'Image_3.jpg',]
"mylist" is a list of strings, practically 3 tweets.
The code that posts in Twitter automatically looks like that:
for image in os.listdir('.'):
    for num in range(len(mylist)):
        api.update_with_media(image, mylist[num])
The code basically pairs the first image with the first tweet and posts it, then the same image with the second tweet, then the first image with the third tweet. It then continues the cycle for the second and third images, making 3*3 = 9 posts in total.
However, what I want to achieve is to take the first image with the first tweet and post it, then the second image with the second tweet, then the third image with the third tweet. Then I want to run the cycle one more time: 1st image - 1st tweet, 2nd image - 2nd tweet, etc.
Use zip to iterate through two (or more) collections in parallel
for tweet, image in zip(mylist, os.listdir('.')):
    api.update_with_media(image, tweet)
To repeat it more times, you can put this cycle inside another for
Assuming the lengths of os.listdir('.') and mylist are equal:
length = len(mylist)  # If len(os.listdir('.')) is greater than len(mylist),
                      # replace mylist with os.listdir('.')
imageList = os.listdir('.')
iterations = 2  # The number of times you want this to run
for i in range(0, iterations):
    for x in range(0, length):
        api.update_with_media(imageList[x], mylist[x])
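If you want to combine both ideas, here is a minimal sketch (api and mylist as in the question, iterations as in the answer above) that pairs each image with its tweet via zip and repeats the whole cycle with an outer loop:
import os

images = os.listdir('.')
iterations = 2  # how many times to repeat the whole image/tweet cycle
for _ in range(iterations):
    for image, tweet in zip(images, mylist):
        api.update_with_media(image, tweet)  # 1st image with 1st tweet, 2nd with 2nd, ...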

How to return only exact matches in Elasticsearch Python API

s = Search(using=client, index=set_index).source(['metadata.Filename'])\
    .query('match', Filename=date)
total = s.count()
return total
I want to find the total number of instances where '20180511' appears in metadata.Filename in a particular index.
This query is returning a higher number of hits than I would have expected. The data format in metadata.Filename is GEOSCATCAT20180507_12+20180511_0900.V01.nc4. My date variable is in the format '20180511'.
I think the problem is that match queries are analyzed and scored, so they can return a hit even when it's not an exact match. I was wondering if you had any insight regarding this issue.
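One possible direction, assuming the index uses Elasticsearch's default dynamic mapping so that metadata.Filename also has a non-analyzed metadata.Filename.keyword sub-field (an assumption, not something shown in the question): query that sub-field with a wildcard, so hits are literal substring matches on the raw filename rather than analyzed match hits. client, set_index and date are reused from the snippet above.
# hedged sketch: wildcard substring match on the raw (keyword) filename field
s = Search(using=client, index=set_index) \
    .source(['metadata.Filename']) \
    .query('wildcard', **{'metadata.Filename.keyword': '*' + date + '*'})
total = s.count()
Wildcard queries with a leading '*' can be slow on large indices; if the embedded date is queried often, indexing it as its own field and using a term query would be more robust.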

Iterating over list with while and for loop in python - issues

I'm trying to query the Twitter API with a list of names and get their friends list. The API part is fine, but I can't figure out how to go through the first 5 names, pull the results, wait for a while to respect the rate limit, then do it again for the next 5 until the list is over. The bit of the code I'm having trouble with is this:
first = 0
last = 5
while last < 15:  # while last group of 5 items is lower than number of items in list
    for item in list[first:last]:  # parses each n twitter IDs in the list
        results = item
        text_file = open("output.txt", "a")  # creates empty txt output / change path to desired output
        text_file.write(str(item) + "," + results + "\n")  # adds twitter ID, resulting friends list, and a line skip to the txt output
        text_file.close()
        first = first + 5  # updates list navigation to move on to next group of 5
        last = last + 5
        time.sleep(5)  # suspends activities for x seconds to respect rate limit
Shouldn't this script go through the first 5 items in the list, add them to the output file, then change the first:last argument and loop it until the "last" variable is 15 or higher?
No, because your indentation is wrong. Everything happens inside the for loop, so it'll process one item, then change first and last, then sleep...
Move the last three lines back one indent, so that they line up with the for statement. That way they'll be executed once the first five have been done.
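Concretely, a sketch of the asker's loop with the last three lines moved back one indent (variables reused from the question, comments trimmed):
first = 0
last = 5
while last < 15:
    for item in list[first:last]:
        results = item
        text_file = open("output.txt", "a")
        text_file.write(str(item) + "," + results + "\n")
        text_file.close()
    first = first + 5    # now runs once per batch of 5
    last = last + 5
    time.sleep(5)        # sleep between batches, not between items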
Daniel found the issue, but here are some code improvement suggestions:
first, last = 0, 5
with open("output.txt", "a") as text_file:
    while last < 15:
        for twitter_ID in twitter_IDs[first:last]:
            text_file.write("{0},{0}\n".format(twitter_ID))
        first += 5
        last += 5
        time.sleep(5)
As you can see, I removed the results = item as it seemed redundant, leveraged with open..., also used += for increments.
Can you explain why you were doing results = item?

How to get large list of followers Tweepy

I'm trying to use Tweepy to get the full list of followers from an account with around 500k followers. I have code that gives me the usernames for smaller accounts (under 100 followers or so), but if I try one with even around 110 followers, it doesn't work. Any help figuring out how to make it work with larger numbers is greatly appreciated!
Here's the code I have right now:
import tweepy
import time
key1 = "..."
key2 = "..."
key3 = "..."
key4 = "..."
accountvar = raw_input("Account name: ")
auth = tweepy.OAuthHandler(key1, key2)
auth.set_access_token(key3, key4)
api = tweepy.API(auth)
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name=accountvar).pages():
    ids.extend(page)
    time.sleep(60)
users = api.lookup_users(user_ids=ids)
for u in users:
    print u.screen_name
The error I keep getting is:
Traceback (most recent call last):
File "test.py", line 24, in <module>
users = api.lookup_users(user_ids=ids)
File "/Library/Python/2.7/site-packages/tweepy/api.py", line 321, in lookup_users
return self._lookup_users(post_data=post_data)
File "/Library/Python/2.7/site-packages/tweepy/binder.py", line 239, in _call
return method.execute()
File "/Library/Python/2.7/site-packages/tweepy/binder.py", line 223, in execute
raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{u'message': u'Too many terms specified in query.', u'code': 18}]
I've looked at a bunch of other questions about this type of problem, but none I could find had a solution that worked for me. If someone has a link to a solution, please send it to me!
I actually figured it out, so I'll post the solution here just for reference.
import tweepy
import time
key1 = "..."
key2 = "..."
key3 = "..."
key4 = "..."
accountvar = raw_input("Account name: ")
auth = tweepy.OAuthHandler(key1, key2)
auth.set_access_token(key3, key4)
api = tweepy.API(auth)
users = tweepy.Cursor(api.followers, screen_name=accountvar).items()
while True:
    try:
        user = next(users)
    except tweepy.TweepError:
        time.sleep(60 * 15)
        user = next(users)
    except StopIteration:
        break
    print "#" + user.screen_name
This stops after every 300 names for 15 minutes, and then continues. This makes sure that it doesn't run into problems. This will obviously take ages for large accounts, but as Leb mentioned:
The twitter API only allows 100 users to be searched for at a time...[so] what you'll need to do is iterate through each 100 users but staying within the rate limit.
You basically just have to leave the program running if you want the next set. I don't know why mine is giving 300 at a time instead of 100, but as I mentioned about my program earlier, it was giving me 100 earlier as well.
Hope this helps anyone else that had the same problem as me, and shoutout to Leb for reminding me to focus on the rate limit.
To extend upon this:
You can harvest 3,000 users per 15 minutes by adding a count parameter:
users = tweepy.Cursor(api.followers, screen_name=accountvar, count=200).items()
This will call the Twitter API 15 times as per your version, but rather than the default count=20, each API call will return 200 (i.e. you get 3000 rather than 300).
Twitter provides two ways to fetch the followers:
Fetching the full followers list (using followers/list in the Twitter API, or api.followers in tweepy) - Alec and mataxu have provided the approach to fetch using this way in their answers. The rate limit with this is that you can get at most 200 * 15 = 3000 followers in every 15-minute window.
The second approach involves two stages:
a) Fetching only the follower ids first (using followers/ids in the Twitter API, or api.followers_ids in tweepy). You can get 5000 * 15 = 75K follower ids in each 15-minute window.
b) Looking up their usernames or other data (using users/lookup in the Twitter API, or api.lookup_users in tweepy). This has a rate limitation of about 100 * 180 = 18K lookups in each 15-minute window.
Considering the rate limits, the second approach gives followers data 6 times faster compared to the first approach.
Below is the code which could be used to do it using the 2nd approach:
# First, make sure you have set wait_on_rate_limit to True while connecting through Tweepy
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# The code below requests 5000 follower ids per call and therefore gives 75K ids
# in every 15 minute window (as 15 requests can be made in each window).
followerids = []
for user in tweepy.Cursor(api.followers_ids, screen_name=accountvar, count=5000).items():
    followerids.append(user)

print(len(followerids))

# The function below makes lookup requests for ids 100 at a time, leading to 18K lookups in each 15 minute window
def get_usernames(userids, api):
    fullusers = []
    u_count = len(userids)
    print(u_count)
    try:
        for i in range(int(u_count/100) + 1):
            end_loc = min((i + 1) * 100, u_count)
            fullusers.extend(
                api.lookup_users(user_ids=userids[i * 100:end_loc])
            )
        return fullusers
    except:
        import traceback
        traceback.print_exc()
        print('Something went wrong, quitting...')

# Calling the function with the list of follower ids and the tweepy api connection
fullusers = get_usernames(followerids, api)
Hope this helps.
A similar approach can be followed for fetching friends' details by using api.friends_ids in place of api.followers_ids, as in the sketch below.
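For instance, a minimal sketch of the friends variant, reusing the api connection, accountvar and the get_usernames helper from the code above:
# Same two-stage pattern, but for the accounts the user follows
friendids = []
for fid in tweepy.Cursor(api.friends_ids, screen_name=accountvar, count=5000).items():
    friendids.append(fid)

fullfriends = get_usernames(friendids, api)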
If you need more resources on the rate limit comparison and the 2nd approach, check the links below:
https://github.com/tweepy/tweepy/issues/627
https://labsblog.f-secure.com/2018/02/27/how-to-get-twitter-follower-data-using-python-and-tweepy/
The Twitter API only allows 100 users to be searched for at a time. That's why no matter how many you input, you'll get 100. The followers_ids call is giving you the correct number of users, but you're being limited by GET users/lookup.
What you'll need to do is iterate through each 100 users but staying within the rate limit.
