Getting tweet text from a Selenium element for Twitter - Python

I am trying to scrape the tweets from a trending tag on Twitter. I tried to find the XPath of the text in a tweet, but it doesn't work.
browser = webdriver.Chrome('/Users/Suraj/Desktop/twitter/chromedriver')
url = 'https://twitter.com/search?q=%23'+'Swastika'+'&src=trend_click'
browser.get(url)
time.sleep(1)
The following piece of code doesn't give any results.
browser.find_elements_by_xpath('//*[#id="tweet-text"]')
Other elements which I was able to find were:
browser.find_elements_by_css_selector("[data-testid=\"tweet\"]") # works
browser.find_elements_by_xpath("/html/body/div[1]/div/div/div[2]/main/div/div/div/div[1]/div/div[2]/div/div/section/div/div/div/div/div/div/article/div/div/div/div[2]/div[2]/div[1]/div/div") # works
I want to know how I can select the text from the tweet.

You can use Selenium to scrape Twitter, but it would be much easier, faster, and more efficient to use the Twitter API with Tweepy. You can sign up for a developer account here: https://developer.twitter.com/en/docs
Once you have signed up, get your access keys and use Tweepy like so:
import tweepy
import datetime as dt
import pytz

# connects to Twitter and authenticates your requests
auth = tweepy.OAuthHandler(TWapiKey, TWapiSecretKey)
auth.set_access_token(TWaccessToken, TWaccessTokenSecret)

# wait_on_rate_limit makes Tweepy pause at the rate limit instead of letting Twitter block you
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

data = []
# Loops through every tweet that tweepy.Cursor pulls. api.search is the endpoint,
# q is the search term, result_type can be 'recent', 'popular', or 'mixed',
# max_id/since_id are snowflake IDs (Twitter's way of representing time),
# and count is the maximum number of tweets returned per request.
for tweet in tweepy.Cursor(api.search, q=YourSearchTerm, result_type='recent', max_id=snowFlakeCurrent, since_id=snowFlakeEnd, count=100).items(500):
    createdTime = tweet.created_at.strftime('%Y-%m-%d %H:%M')
    createdTime = dt.datetime.strptime(createdTime, '%Y-%m-%d %H:%M').replace(tzinfo=pytz.UTC)
    data.append(createdTime)
This example pulls 500 recent tweets matching YourSearchTerm and appends each tweet's creation time to a list. You can check out the Tweepy documentation here: http://docs.tweepy.org/en/latest/
Each tweet that you pull with tweepy.Cursor() has many attributes that you can pick out, append to a list, or process however you like. Even though it is possible to scrape Twitter with Selenium, it's really not recommended: it will be very slow, whereas Tweepy returns results in seconds.
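For instance, a minimal sketch of collecting a few common attributes into dictionaries (attribute names are from the Tweepy Status model; the search term is whatever you used above):
records = []
for tweet in tweepy.Cursor(api.search, q=YourSearchTerm, result_type='recent', count=100).items(500):
    records.append({
        'text': tweet.text,                  # the tweet body
        'user': tweet.user.screen_name,      # who posted it
        'favorites': tweet.favorite_count,   # like count
        'created': tweet.created_at,         # UTC datetime
    })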

Applying for API access is not always successful. I used Twint, which provides a way to scrape quickly; in this case, to a CSV output.
import twint

def search_twitter(terms, start_date, filename, lang):
    c = twint.Config()
    c.Search = terms
    c.Custom_csv = ["id", "user_id", "username", "tweet"]
    c.Output = filename
    c.Store_csv = True
    c.Lang = lang
    c.Since = start_date
    twint.run.Search(c)
    return
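A hypothetical call, collecting English-language tweets since the start of 2020 into a CSV (the search term and filename here are made up):
search_twitter("#Swastika", "2020-01-01", "tweets.csv", "en")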

Related

I need a way to pull 100 of the most popular channels and their IDs from YouTube using the YouTube API in Python.

I've already checked the YouTube API developer info (http://www.youtube.com/dev/) and have not found anything pertaining to this.
Obtaining a YouTube API key
To be able to make requests of this type you must have an API key:
Go to https://console.developers.google.com/
Create a new project
Find YouTube Data API v3 and click on it
Enable the API
Go to Credentials and create one for the project
Write the key down and insert it in the script below
This script uses the API key created previously to make channel search requests, generating query names at random, and inserts the data into two files: the first stores all the info, the second only the ID, the channel name, and the link to the channel. I hope it is what you are looking for ;)
import json
import urllib.request
import string
import random

channels_to_extract = 100
API_KEY = ''  # your API key

while True:
    # random channel name to search for
    random_name = ''.join(random.choice(string.ascii_uppercase) for _ in range(random.randint(3, 10)))
    urlData = "https://www.googleapis.com/youtube/v3/search?key={}&maxResults={}&part=snippet&type=channel&q={}".format(API_KEY, channels_to_extract, random_name)
    webURL = urllib.request.urlopen(urlData)
    data = webURL.read()
    encoding = webURL.info().get_content_charset('utf-8')
    results = json.loads(data.decode(encoding))
    results_id = {}
    if results['pageInfo']["totalResults"] >= channels_to_extract:  # a random name may return 0 results
        break  # stop once a query returns enough results

for result in results['items']:
    # store the channel ID, title, and link for every result
    results_id[result['id']['channelId']] = [result["snippet"]["title"], 'https://www.youtube.com/channel/' + result['id']['channelId']]

with open("all_info_channels.json", "w") as f:  # write the full response to a JSON file
    json.dump(results, f, indent=4)

with open("only_id_channels.json", "w") as f:  # write only IDs, names, and links to a JSON file
    json.dump(results_id, f, indent=4)

for channelId in results_id.keys():
    print('Link --> https://www.youtube.com/channel/' + channelId)
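One caveat: search.list caps maxResults at 50 per page, so a single request cannot return 100 channels even if you ask for them. To actually collect 100, you would follow nextPageToken across pages; a minimal sketch using the same endpoint and key as above:
collected = {}
page_token = None
while len(collected) < channels_to_extract:
    url = ("https://www.googleapis.com/youtube/v3/search?key={}&maxResults=50"
           "&part=snippet&type=channel&q={}").format(API_KEY, random_name)
    if page_token:
        url += "&pageToken=" + page_token
    page = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
    for item in page.get('items', []):
        collected[item['id']['channelId']] = item['snippet']['title']
    page_token = page.get('nextPageToken')
    if not page_token:  # no more pages for this query
        break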

Unable to retrieve tweets using Tweepy library

import pandas as pd
import tweepy as tw  # To extract Twitter data using Twitter's official API
from tqdm import tqdm, notebook
import os
pd.set_option('display.max_columns' , None)
pd.set_option('display.max_rows' , None)
pd.set_option('display.max_colwidth' , None)
pd.set_option('display.width' , None)
consumer_api_key = 'XXXX'
consumer_api_secret = 'XXXX'
auth = tw.OAuthHandler(consumer_api_key, consumer_api_secret)
api = tw.API(auth, wait_on_rate_limit=True)
search_words = "#Ethereum -filter:retweets"
# We type in our keyword to search for relevant tweets that contain the hashtag.
# You can fix a time frame with the since/until date parameters.
date_until = "2021-05-01"
# Collect tweets
tweets = tw.Cursor(api.search_tweets,
                   q=search_words,
                   lang="en",
                   until=date_until).items(15000)
tweets_copy = []
for tweet in tqdm(tweets):
    tweets_copy.append(tweet)
print(f"New tweets retrieved: {len(tweets_copy)}")
I am trying to extract tweets with the keyword #Ethereum from a specific time frame, but when I run the code I keep getting a red bar in Jupyter Notebook that says "0it [00:00, ?it/s]", and this leads to no tweets being retrieved. Can anyone help?
From the Twitter search documentation:
The Twitter Search API searches against a sampling of recent Tweets published in the past 7 days.
And from the until parameter documentation for this endpoint:
Keep in mind that the search index has a 7-day limit. In other words, no tweets will be found for a date older than one week.
This is also clearly written in the Tweepy method documentation.
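So the until date has to fall inside that seven-day window (and the docs describe it as returning tweets created before the given date). A minimal sketch of computing a valid value instead of hard-coding one:
import datetime as dt

# keep `until` within the 7-day search index; it is formatted YYYY-MM-DD
date_until = (dt.date.today() - dt.timedelta(days=1)).strftime("%Y-%m-%d")
tweets = tw.Cursor(api.search_tweets, q=search_words, lang="en", until=date_until).items(1000)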

Return and save 800 friend's tweets using Tweepy

I am trying to extract the tweets of my friends using api.home_timeline. I don't want to stream it, but I want to save 800 tweets, the screen names, and their likes/favorites count to a csv file. Twitter only allows 200 tweets at a time. Given my keys as already specified, this is what I have so far:
def data_set(handle):
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth = set_access_token(ACCESS_KEY, ACCESS_SECRET)
    api = tweepy.API(auth)
    count_tweets = api.home_timeline(screen_name=handle, count=200)
    twits = []
    tweet_data = [tweet.text for tweet in count_tweets]
    for t in count_tweets:
        twits.append(t)

if __name__ == '__main__':
    tweet_data('my twitter name')
my original plan was to have multiple count_tweets such as count_tweet1, etc. I am unsure how to proceed with the rest. Any suggestions are greatly appreciated.
Twitter paginates its results. For every request you make, it returns a maximum of 200 tweets (in the case of home_timeline). The 200 tweets you get are based on popularity. You can fetch all the tweets from the user's timeline by iterating over the pages; Tweepy provides Cursor functionality for that.
Edited code for your case:
def data_set(handle):
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
    api = tweepy.API(auth)
    tweet_data = []
    for page in tweepy.Cursor(api.user_timeline, screen_name=handle, count=200, tweet_mode='extended').pages():
        for tweet in page:
            tweet_data.append(tweet.full_text)
    return tweet_data

## Not sure why the following lines are needed
# twits = []
# tweet_data = [tweet.text for tweet in count_tweets]
# for t in count_tweets:
#     twits.append(t)

if __name__ == '__main__':
    print(data_set('my twitter name'))
I have used api.user_timeline instead of api.home_timeline in the code, since you said you are trying to fetch tweets from your friends' timelines. If your use case is satisfied by api.home_timeline, you can swap it back in.
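Since the original goal was 800 home-timeline tweets saved to a CSV with screen names and favorite counts, here is a minimal sketch using the csv module (the column choice is mine; .items(800) lets the Cursor handle the 200-per-request paging):
import csv
import tweepy

def save_timeline_csv(api, filename):
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['screen_name', 'favorite_count', 'text'])
        # Cursor pages through the API transparently, 200 tweets per request
        for tweet in tweepy.Cursor(api.home_timeline, count=200, tweet_mode='extended').items(800):
            writer.writerow([tweet.user.screen_name, tweet.favorite_count, tweet.full_text])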

Converting WebElement.text to a string

I am trying to use machine learning to perform sentiment analysis on data from Twitter. To aggregate the data, I've made a class which will mine and pre-process data. In order to clean and pre-process the data, I'd like to convert each tweet's text to a string. However, when the line of code in the inner for loop in the massMine method is called, I get a WebDriverException: no such session. The relevant bits of code are below; any input is appreciated, thanks.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
import pandas
import re

class TweetMiner(object):

    def __init__(self):
        self.base_url = u'https://twitter.com/search?q=from%3A'
        self.raw_data = []

    def mineTweets(self, query, tweet_quota):
        '''
        Mine data from a single Twitter account.
        Input consists of a Twitter handle and a
        value indicating how much data to mine.
        Ex: "#diddy" should be inputted as "diddy"
        '''
        browser = webdriver.Chrome()
        url = self.base_url + query
        browser.get(url)
        time.sleep(1)
        body = browser.find_element_by_tag_name('body')
        for _ in range(tweet_quota):
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
        tweets = browser.find_elements_by_class_name('tweet-text')
        for tweet in tweets:
            print(tweet.text)
        browser.close()
        return tweets

    def massMine(self, inputArray, dataSize):
        '''
        Mine data from an array of Twitter
        accounts. The input array consists of Twitter
        handles and a value indicating how much
        data to mine.
        Ex: "#diddy" should be inputted as "diddy"
        '''
        for user in inputArray:
            rtn = ""
            tweets = self.mineTweets(user, dataSize)
            for tweet in tweets:
                rtn += tweet.text
            return rtn
EDIT: I don't know what caused this error, but if anyone stumbles across this post with a similar error, I was able to work around it by simply writing each tweet to a text file.
I used to get this error when I had too many browser instances open and hadn't closed them properly (whether opened by the automation script or manually). Once the other browser instances were killed one by one, the error went away. I found that the C:\Users\(yourAccountName)\AppData\Local\Temp directory was completely filled up, which was causing the NoSuchSession error.
The preferred solution is to check whether too many browsers/tabs are open and close them, or to manually remove all the contents of the Temp path above and try again.
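Another possible cause, judging from the code in the question: mineTweets closes the browser and then massMine reads tweet.text from the returned WebElements, whose session no longer exists. A safer pattern (my own rearrangement, not the original code) is to copy the strings out before closing:
def mineTweets(self, query, tweet_quota):
    browser = webdriver.Chrome()
    try:
        browser.get(self.base_url + query)
        time.sleep(1)
        body = browser.find_element_by_tag_name('body')
        for _ in range(tweet_quota):
            body.send_keys(Keys.PAGE_DOWN)
            time.sleep(0.2)
        # extract plain strings while the session is still alive
        return [el.text for el in browser.find_elements_by_class_name('tweet-text')]
    finally:
        browser.quit()  # quit() ends the session cleanly even on errors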

Weird issue using the Twitter API to search for tweets

I've set up code in Python to search for tweets using only the oauth2 and urllib2 libraries. (I'm not using any particular Twitter library.)
I'm able to search for tweets based on keywords. However, I get zero tweets when I search for this particular keyword: "Jurgen%20Mayer-Hermann". (This is a problem because my ultimate goal is to search for this keyword only.)
On the other hand, when I search for the same thing on the Twitter web interface, I get plenty of tweets: https://twitter.com/search?q=Jurgen%20Mayer-Hermann&src=typd
Can someone please help identify the issue?
The code is as follows:
def getfeed(mystr, tweetcount):
    url = "https://api.twitter.com/1.1/search/tweets.json?q=" + mystr + "&count=" + tweetcount
    parameters = []
    response = twitterreq(url, "GET", parameters)
    res = json.load(response)
    return res

search_str = "Jurgen Mayer-Hermann"
search_str = '%22' + search_str + '%22'
search = search_str.replace(" ", "%20")
search = search.replace("#", "%23")
tweetcount = str(50)
res = getfeed(search, tweetcount)
When I print the constructed URL, I get:
https://api.twitter.com/1.1/search/tweets.json?q=%22Jurgen%20Mayer-Hermann%22&count=50
I have actually never worked with the Twitter API, but it looks like the count parameter only applies to timeline requests, as a way to limit the number of tweets per page of results. In other words, you use it with the GET statuses/home_timeline, GET statuses/mentions, and GET statuses/user_timeline endpoints.
Try without count and see what happens.
Please use urllib.urlencode to encode your query parameters, like so:
import urllib
query = urllib.urlencode({'q': '"Jurgen Mayer-Hermann"', 'count': 50})
This produces 'q=%22Jurgen+Mayer-Hermann%22&count=50', which might bring you more luck...
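A hypothetical way to plug that into the getfeed helper above (keeping the same Python 2 urllib as the answer):
import urllib

params = urllib.urlencode({'q': '"Jurgen Mayer-Hermann"', 'count': 50})
url = "https://api.twitter.com/1.1/search/tweets.json?" + params
response = twitterreq(url, "GET", [])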
