I am trying to use machine learning to perform sentiment analysis on data from twitter. To aggregate the data, I've made a class which will mine and
pre-process data. In order to clean and pre-process the data, I'd like to convert each tweet's text to a string. However, when the line of code in the inner for loop in the massMine method is called, i get a WebDriverException: no such session. The relevant bits of code are below, any input is appreciated, thanks.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import numpy as np
import pandas
import re
class TweetMiner(object):
def __init__(self):
self.base_url = u'https://twitter.com/search?q=from%3A'
self.raw_data = []
def mineTweets(self, query, tweet_quota):
'''
Mine data from a singular twitter account,
input consists of a twitter handle, and a
value indicating how much data to mine
Ex: “#diddy” should be inputted as “diddy”
'''
browser = webdriver.Chrome()
url = self.base_url + query
browser.get(url)
time.sleep(1)
body = browser.find_element_by_tag_name('body')
for _ in range(tweet_quota):
body.send_keys(Keys.PAGE_DOWN)
time.sleep(0.2)
tweets = browser.find_elements_by_class_name('tweet-text')
for tweet in tweets:
print(tweet.text)
browser.close()
return tweets
def massMine(self, inputArray, dataSize):
'''
Mine data from an array of twitter
accounts, input array consists of twitter
handles and a value indicating how much
data to mine
Ex: “#diddy” should be inputted as “diddy”
'''
for user in inputArray:
rtn = ""
tweets = self.mineTweets(user,dataSize)
for tweet in tweets:
rtn += (tweet.text)
return rtn
EDIT: I don't know what caused this error - but if anyone stumbles across this post with a similar error I was able to workaround by simply writing each tweet to a text file.
I use to get this error when I have opened too many browser instances and haven't closed it properly (both via automation script or manually). When other browser instances were killed one by one this error is removed. I found that C:\Users\(yourAccountName)\AppData\Local\Temp directory is totally filled up and hence causing the NoSuchSession error.
Preferred solution will be to see if too many browsers/tabs are open. Remove them. Or manually remove all the contents inside above Temp path and try.
Related
I have the following program in python that reads in a list of 1390680 URLS of Instagram accounts and gets the follower count for each user. It utilizes the instaloader. Here's the code:
import pandas as pd
from instaloader import Instaloader, Profile
# 1. Loading in the data
# Reading the data from the csv
data = pd.read_csv('IG_Audience.csv')
# Getting the profile urls
urls = data['Profile URL']
def getFollowerCount(PROFILE):
# using the instaloader module to get follower counts from this programmer
# https://stackoverflow.com/questions/52225334/webscraping-instagram-follower-count-beautifulsoup
try:
L = Instaloader()
profile = Profile.from_username(L.context, PROFILE)
print(PROFILE, 'has', profile.followers, 'followers')
return(profile.followers)
except Exception as exception :
print(exception, False)
return(0)
# Follower count List
followerCounts = []
# This loop will fetch the follower count for each user
for url in urls:
# Getting the profile username from the URL by removing the instagram.com
# portion and the backslash at the end of the url
url_dirty = url.replace('https://www.instagram.com/', '')
url_clean = url_dirty[:-1]
followerCounts.append(getFollowerCount(url_clean))
# Converting the list to a series, adding it to the dataframe, and writing it to
# a csv
data['Follower Count'] = pd.Series(followerCounts)
data.to_csv('IG_Audience.csv')
The main issue I have with this is that it is taking a very long time to read through the entire list. It took 14 hours just to get the follower counts for 3035 users. Is there any way to speed up this process?
First I wanna say I'm sorry for being VERY late but hopefully this can help someone in the future. I'm having a similar issue and I believe I found out why, when you get the followers you instaloader doesn't just go to the profiles page and read the number but it gets the URL and profile ID for each account and can only get so many at a time, the best way I can think to get around this would be to make a request to the page and just read the follower count on their main page issue with this however is after I believe 9999 followers it will start saying "10k" or "10.1k" so you'll be off by 100 and just gets worse if the person has over a million because then its off by even more.
I am trying to scrape the tweets from a trending tag in twitter. I tried to find the xpath of the text in a tweet, but it doesn't work.
browser = webdriver.Chrome('/Users/Suraj/Desktop/twitter/chromedriver')
url = 'https://twitter.com/search?q=%23'+'Swastika'+'&src=trend_click'
browser.get(url)
time.sleep(1)
The following piece of code doesn't give any results.
browser.find_elements_by_xpath('//*[#id="tweet-text"]')
Other content which I was able to find where :
browser.find_elements_by_css_selector("[data-testid=\"tweet\"]") # works
browser.find_elements_by_xpath("/html/body/div[1]/div/div/div[2]/main/div/div/div/div[1]/div/div[2]/div/div/section/div/div/div/div/div/div/article/div/div/div/div[2]/div[2]/div[1]/div/div") # works
I want to know how I can select the text from the tweet.
You can use Selenium to scrape twitter but it would be much easier/faster/efficient to use the twitter API with tweepy. You can sign up for a developer account here: https://developer.twitter.com/en/docs
Once you have signed up get your access keys and use tweepy like so:
import tweepy
# connects to twitter and authenticates your requests
auth = tweepy.OAuthHandler(TWapiKey, TWapiSecretKey)
auth.set_access_token(TWaccessToken, TWaccessTokenSecret)
# wait_on_rate_limit prevents you from requesting too many times and having twitter block you
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
# loops through every tweet that tweepy.Cursor pulls -- api.search tells cursor
# what to do, q is the search term, result_type can be recent popular or mixed,
# and the max_id/since_id are snowflake ids which are twitters way of
# representing time and finally count is the maximum amount of tweets you can return per request.
for tweet in tweepy.Cursor(api.search, q=YourSearchTerm, result_type='recent', max_id=snowFlakeCurrent, since_id=snowFlakeEnd, count=100).items(500):
createdTime = tweet.created_at.strftime('%Y-%m-%d %H:%M')
createdTime = dt.datetime.strptime(createdTime, '%Y-%m-%d %H:%M').replace(tzinfo=pytz.UTC)
data.append(createdTime)
This code is an example of a script that pulls 500 tweets from YourSearchTerm recent tweets and then appends the time each was created to a list. You can check out the tweepy documentation here: http://docs.tweepy.org/en/latest/
Each tweet that you pull with the tweepy.Cursor() will have many attributes that you can choose and append to a list and or do something else. Even though it is possible to scrape twitter with Selenium it's realllly not recommended as it will be very slow whereas tweepy returns result in mere seconds.
Applying for the API is not always successful. I used Twint, which provides a means to scrape quickly. In this case to a CSV output.
def search_twitter(terms, start_date, filename, lang):
c = twint.Config()
c.Search = terms
c.Custom_csv = ["id", "user_id", "username", "tweet"]
c.Output = filename
c.Store_csv = True
c.Lang = lang
c.Since = start_date
twint.run.Search(c)
return
I am trying to develop a script with python to web scraping some information on a specific website for learning purposes.
I went over a lot of different tutorials and posts, trying to gather some insights from them, they are very useful but still didn't help me to find a way to log in the website and do searches with different keywords.
I tried to use different APIs, such as requests and urllib, maybe I didn't find the right way to solve it.
The steps lists as follow:
login information set up
Send login information to the website and get response for future use
keywords setup
import header
set up cookiejar
from login response, do the search
After I tried, it will work randomly, and
here is the code:
import getpass
# marvin
# date:2018/2/7
# login stage preparation
def login_values():
login="https://www.****.com/login"
username = input("Please insert your username: ")
password = getpass.getpass("Please type in your password: ")
host="www.****.com"
#store login screts
data = {
"username": username,
"password": password,
}
return login,host,data
The following is for getting the HTML file from a website
import requests
import random
import http.cookiejar
import socket
# Set up web scraping function to output the html text file
def webscrape(login_url,host_url,login_data,target_url):
#static values preparation
##import header
user_agents = [
***
]
agent = random.choice(user_agents)
headers={'User-agent':agent,
'Accept':'*/*',
'Accept-Language':'en-US,en;q=0.9;zh-cmn-Hans',
'Host':host_url,
'charset':'utf-8',
}
##set up cookie jar
cj = http.cookiejar.CookieJar()
#
# get the html file
socket.setdefaulttimeout(20)
s=requests.Session()
req=s.post(login_url, data=login_data)
res = s.get(target_url, cookies=cj,headers=headers)
html=res.text
return html
Here is the code to get each links from html:
from bs4 import BeautifulSoup
#set up html parsing function for parsing all the list links
def getlist(keyword,loginurl,hosturl,valuesurl,html_lists):
page=1
pagenum=10# set up maximum page num
links=[]
soup=BeautifulSoup(html_lists,"lxml")
try:
for li in soup.find("div",class_="search_pager human_pager in-block").ul.find_all('li'):
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
page+=1
if page<=pagenum:
try:
nexturl=soup.find('div',class_='search_pager human_pager in-block').ul.find('li',class_='pagination-next ng-scope ').a['href'] #next page
except AttributeError:
print("{}'s links are all stored!".format(keyword))
return links
else:
chs_html=webscrape(loginurl,hosturl,valuesurl,nexturl)
soup=BeautifulSoup(chs_html,"lxml")
except AttributeError:
target_part=soup.find_all("div",class_="search_result_single search-2017 pb25 pt25 pl30 pr30 ")
[links.append(link.find('a')['href']) for link in target_part]
print("There is only one page")
return links
The test code is:
keyword="****"
myurl="https://www.****.com/search/os2?key={}".format(keyword)
chs_html=webscrape(login,host,values,myurl)
chs_links=getlist(keyword,login,host,values,chs_html)
targethtml=webscrape(login,host,values,chs_links[1])
There are total 22 links and one page containing 19 links, so it is supposed to have more than one page, if the result "There is only one page" shown up, it indicates a failure.
Problems:
The login_values function is to secure my login information by combining all functions to a final function, but apparently, the username and password are still really easy to show just by print() command.
This the main problem!! Like I mentioned before, this method works randomly. By the way, what I mean not working, it is that the HTML file is only the login page instead of the searching result. I want to get a better control to make it work most of the time. I checked user-agents by print agent every time to see if they are relevant, and it is not! I cleared cookies with suspicious to full storage memory, and it is not.
There are sometimes I facing max trial error or OS error, I guess it is the error from the server I was trying to reach, is there a way I can set up a wait timer for me to prevent these errors from happening?
the resource at "BacDive" - ( http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters such as growth temperature optimums.
I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them 1 by 1 against the Bacdive database (which doesnt allow a flat file to be downloaded) and retrieve the relevent information and populate my text file accordingly.
What are the main modules (such as beautifulsoups) that I would need to accomplish this? Is it straight forward? Is it allowed to programmatically access webpages ? Do I need permission?
A bacteria name would be "Pseudomonas putida" . Searching this would give 60 hits on bacdive. Clicking one of the hits, takes us to the specific page, where the line : "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C " is the most important.
The script would have to access bacdive (which i have tried accessing using requests, but I feel they do not allow programmatic access, I have asked the moderator about this, and they said I should register for their API first).
I now have the API access. This is the page (http://www.bacdive.dsmz.de/api/bacdive/). This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
Here is the solution...
import re
import urllib
from bs4 import BeautifulSoup
def get_growth_temp(url):
soup = BeautifulSoup(urllib.urlopen(url).read())
no_hits = int(map(float, re.findall(r'[+-]?[0-9]+',str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
if no_hits > 1 :
letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
all_urls = []
for i in letters:
all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
max_temp = []
for ind_url in all_urls:
soup = BeautifulSoup(urllib.urlopen(ind_url).read())
a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
if a:
max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
print "Recommended growth temperature : %d °C:\t" % max(max_temp)
url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'
if __name__ == "__main__":
# TO Open file then iterate thru the urls/bacterias
# with open('file.txt', 'rU') as f:
# for url in f:
# get_growth_temp(url)
get_growth_temp(url)
Edit:
Here I am passing single url. if you want to pass multiple urls to get their growth temperature. call the function(url) by opening file. code is commented.
Hope it helped you..
Thanks
I've setup a code in python to search for tweets using the oauth2 and urllib2 libraries only. (I'm not using any particular twitter library)
I'm able to search for tweets based on keywords. However, I'm getting zero number of tweets when I search for this particular keyword - "Jurgen%20Mayer-Hermann". (this is challenge because my ultimate goal is to search for this keyword only.
On the other hand when I search for the same thing online (twitter interface, I'm getting enough tweets). - https://twitter.com/search?q=Jurgen%20Mayer-Hermann&src=typd
Can someone please see if we can identify the issue?
The code is as follows:
def getfeed(mystr, tweetcount):
url = "https://api.twitter.com/1.1/search/tweets.json?q=" + mystr + "&count=" + tweetcount
parameters = []
response = twitterreq(url, "GET", parameters)
res = json.load(response)
return res
search_str = "Jurgen Mayer-Hermann"
search_str = '%22'+search_str+'%22'
search = search_str.replace(" ","%20")
search = search.replace("#","%23")
tweetcount = str(50)
res = getfeed(search, tweetcount)
When I print the constructed url, I get
https://api.twitter.com/1.1/search/tweets.json?q=%22Jurgen%20Mayer-Hermann%22&count=50
I have actually never worked with the Twitter API, but it looks like the count parameter only applies to searches on timelines as a way to limit the amount of tweets per page of results. In other words, you use it with the GET statuses/home_timeline, GET statuses/mentions, and GET statuses/user_timeline endpoints.
Try without count and see what happens.
Please use urllib.urlencode to encode your query parameters, like so:
import urllib
query = urllib.urlencode({'q': '"Jurgen Mayer-Hermann"', count: 50})
This produces 'q=%22Jurgen+Mayer-Hermann%22&count=50'. Which might bring you more luck...