I am trying to scrape a list of specific movies from IMDB using this tutorial.
The code works fine except for the click that should open the URL and save it into content. It is not working: nothing changes in Chrome when the code runs. I would really appreciate it if anyone can help.
content = driver.find_element_by_class_name("tF2Cxc").click()
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time
movie = 'Wolf Totem'
driver = webdriver.Chrome(executable_path=r"D:\chromedriver.exe")
#Go to Google
driver.get("https://www.google.com/")
#Enter the keyword
driver.find_element_by_name("q").send_keys(movie + " imdb")
time.sleep(1)
#Click the google search button
driver.find_element_by_name("btnK").send_keys(Keys.ENTER)
time.sleep(1)
You are using the wrong locator.
To open a search result on the Google results page you should use this:
driver.find_element_by_xpath("//div[@class='yuRUbf']/a").click()
This locator matches all 10 search results; the first match, which is the one that gets clicked, is the first search result.
Also, clicking that element will not put any content into a variable; it just opens the link of the first search result.
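If the goal is to end up with the loaded page's HTML in content, here is a minimal sketch picking up right after the search in the question's code (this assumes the first Google result really is the IMDB page and keeps the question's fixed sleeps and old find_element_by_* API):
# ...continues after driver.find_element_by_name("btnK").send_keys(Keys.ENTER)
time.sleep(1)
# Open the first search result (assumed to be the IMDB page)
driver.find_element_by_xpath("//div[@class='yuRUbf']/a").click()
time.sleep(2)
# The click navigates the same tab, so read the loaded page's HTML from the driver
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
print(soup.title.get_text())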
I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results despite having 100 search results loaded.
Ideally, I would like for it to scrape all search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected: it scrapes 10 results, which is the default for a Google search page. You can use methods like find_element_by_xpath to find the Next button and click it.
This needs to be done in a loop until enough results have been collected. Refer to the Selenium documentation on locating elements for more details.
For how to use the Selenium commands you can look around the web; I found one similar question which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = 'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed to the for loop or replace the for loop with a while loop entirely, so instead of
for i in range(2):
you could do:
while len(emails) < 100:
Make sure to manage the timing around page navigation: wait for the next page to load before extracting the available emails and then clicking the Next button on the results page.
Make sure to refer to the docs to get a clear idea of how to achieve what you want. Happy Hacking!!
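For example, here is a self-contained sketch of the while-loop variant with an explicit wait for the Next button (the 100-email target, the 10-second timeout and the shortened search URL are assumptions; pnnext is the id used in the code above):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40"  # shortened for readability
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
driver = webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
while len(emails) < 100:
    # collect emails from the current results page
    emails.extend(re.findall(email_pattern, driver.page_source))
    try:
        # wait until the "Next" button is clickable before moving on
        next_btn = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "pnnext"))
        )
    except TimeoutException:
        break  # no further results pages
    next_btn.click()
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv', index=False)
driver.close()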
Selenium launches its own clean browser profile, so your Google setting for 100 results per page does not apply; the default of 10 results is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described here:
tldevtech.com/how-to-show-100-results-per-page-in-google-search
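For example, a minimal sketch of that approach, reusing the driver, re and email_pattern from the code above (the num parameter is the assumption here; Google may cap or ignore it):
# ask Google for up to 100 results on a single page via the num query parameter
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&num=100"
driver.get(target_url)
emails = re.findall(email_pattern, driver.page_source)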
I am scraping news articles related to Infosys at the end of the page but am getting this error:
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector .
I want to scrape all articles related to Infosys.
from bs4 import BeautifulSoup
import re
from selenium import webdriver
import chromedriver_binary
import string
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")
for i in range(20):  # adjust integer value for need
    # you can change right side number for scroll convenience or destination
    driver.execute_script("window.scrollBy(0, 250)")
    # you can change time integer to float or remove
    time.sleep(1)
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a/text()').text())
You could use a less detailed XPath, using // instead of /div/div/div[2].
And if you want the last item, get all li elements as a list and then use [-1] to get the last element of the list.
from selenium import webdriver
import time
driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
#driver = webdriver.Firefox()
driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")
for i in range(20):
    driver.execute_script("window.scrollBy(0, 250)")
    time.sleep(1)
all_items = driver.find_elements_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li')
#for item in all_items:
#    print(item.find_element_by_xpath('.//h3/a').text)
#    print(item.find_element_by_xpath('.//p').text)
#    print('---')
print(all_items[-1].find_element_by_xpath('.//h3/a').text)
print(all_items[-1].find_element_by_xpath('.//p').text)
The XPath you provided does not exist on the page.
Download the XPath Finder Chrome extension to find the correct XPath for the articles.
Here is an example XPath for the articles list; you need to loop through the ID:
/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[5]/div/div/div/ul/li[ID]/div/div/div[2]/h3/a/u
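A rough sketch of looping through that ID, reusing the driver from the question (the upper bound of 20 is an assumption, and the trailing /u from the example XPath is dropped so .text reads the link text):
from selenium.common.exceptions import NoSuchElementException
for i in range(1, 21):  # assumed maximum number of loaded articles
    xpath = ('/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div'
             '/div[5]/div/div/div/ul/li[{}]/div/div/div[2]/h3/a'.format(i))
    try:
        print(driver.find_element_by_xpath(xpath).text)
    except NoSuchElementException:
        break  # no article at this index, stop looping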
I think your code is fine, just one thing: there are a few differences in how text and links are retrieved with XPath in Selenium compared to Scrapy or lxml's fromstring, so here is something that should work for you.
#use this code for printing instead
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a').text)
Even if you do this it will work the same way, since there is only one element with this id, so simply use:
#This should also work fine
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]').text)
I am trying to scrape a public Facebook group using BeautifulSoup; I am using the mobile site because it doesn't rely on JavaScript. This script is supposed to get the link from the 'More' keyword and get the text from the p tag there, but it just gets the text from the current page's p tag. Can someone point out the problem? I am new to Python and everything in this code.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import requests
browser = webdriver.Firefox()
browser.get('https://mobile.facebook.com/groups/22012931789?refid=27')
for elem in browser.find_elements_by_link_text('More'):
    page = requests.get(elem.get_attribute("href"))
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.find_all('p')[0].get_text())
It's always useful to see what your script is actually doing; a quick way of doing this is by printing your results at certain steps along the way.
For example, using your code:
for elem in browser.find_elements_by_link_text('More'):
    print("elem's href attribute: {}".format(elem.get_attribute("href")))
You'll notice that the first one's blank. We should test for this before trying to get requests to fetch it:
for elem in browser.find_elements_by_link_text('More'):
    if elem.get_attribute("href"):
        print("Trying to get {}".format(elem.get_attribute("href")))
        page = requests.get(elem.get_attribute("href"))
        soup = BeautifulSoup(page.content, 'html.parser')
        print(soup.find_all('p')[0].get_text())
Note that an empty elem.get_attribute("href") returns an empty unicode string, u'' - but Python considers an empty string to be false, which is why that if works.
Which works fine on my machine. Hope that helps!
I do not understand why Selenium will not input my data into the Amazon search. I know it opens the Chrome browser to Amazon, but it will not fill in the search bar. Any ideas what's wrong with my code?
from lxml import html, etree
import csv,os,json
import requests
from time import sleep
from selenium import webdriver
textsearch = "Taco Bell Sauce"
browser = webdriver.Chrome('/home/path/Documents/Selenium/chromedriver')
browser.get("http://www.amazon.com/")
content = browser.page_source
doc = html.fromstring(content)
search = selenium.find_element_by_id("twotabsearchtextbox")
search.send_keys(textsearch)
search.selenium.find_element_by_id("nav-search-submit-text").click()
Any corrections on how I can make this work?
This is simply because you should use the WebDriver instance that you've created, browser, instead of selenium, which is the Python library that contains webdriver...
So just replace
search = selenium.find_element_by_id("twotabsearchtextbox")
with
search = browser.find_element_by_id("twotabsearchtextbox")
P.S. Also replace
search.selenium.find_element_by_id("nav-search-submit-text").click()
with
browser.find_element_by_id("nav-search-submit-text").click()
or
search.submit()
You need to make a couple of adjustments to your code as follows:
The webdriver instance gets assigned to browser, so when using find_element you need to use browser. The Search Box and the Search Button are within input tags, so it is better to construct an XPath or a CSS selector as follows:
from lxml import html, etree
import csv,os,json
import requests
from time import sleep
from selenium import webdriver
textsearch = "Taco Bell Sauce"
browser = webdriver.Chrome(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
browser.get("http://www.amazon.com/")
content = browser.page_source
doc = html.fromstring(content)
search = browser.find_element_by_xpath("//input[@id='twotabsearchtextbox']")
search.send_keys(textsearch)
search.find_element_by_xpath("//input[@class='nav-input']").click()
Windows 10 Home 64 Bit
Python 2.7 (also tried in 3.3)
Pycharm Community 2016.3.1
Very new to Python so bear with me.
I want to write a script that will go to Google, enter a Search Phrase, click the Search button, look through the search results for a URL (or any string), if there is no result on that page, click the Next button and repeat on subsequent pages until it finds the URL, stops and Prints what page the result was found on.
I honestly don't care if it just runs in the background and gives me the result. At first I was trying to have it literally open the browser, find the browser objects (search field and search button) via XPath and execute it that way.
You can see the modules I've installed and tried. And I have tried almost every code example I've found on StackOverflow for 2 days so listing everything I've tried would be quite wordy.
If anyone can just tell me which modules would work best, that and any other direction would be very much appreciated!
Specific modules I've tried for this were Selenium, clipboard, MechanicalSoup, BeautifulSoup, webbrowser, urllib, unittest and Popen.
Thank you in advance!
Chantz
import clipboard
import json as m_json
import mechanicalsoup
import random
import sys
import os
import mechanize
import re
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest
import webbrowser
from mechanize import Browser
from bs4 import BeautifulSoup
from subprocess import Popen
######################################################
######################################################
# Xpath Google Search Box
# //*[@id="lst-ib"]
# Xpath Google Search Button
# //*[@id="tsf"]/div[2]/div[3]/center/input[1]
######################################################
######################################################
webbrowser.open('http://www.google.com')
time.sleep(3)
clipboard.copy("abc") # now the clipboard content will be string "abc"
driver = webdriver.Firefox()
driver.get('http://www.google.com/')
driver.find_element_by_id('//*[@id="lst-ib"]')
text = clipboard.paste("abc") # text will have the content of clipboard
print('text')
# browser = mechanize.Browser()
# url = raw_input("http://www.google.com")
# username = driver.find_element_by_xpath("//form[input/#name='username']")
# username = driver.find_element_by_xpath("//form[#id='loginForm']/input[1]")
# username = driver.find_element_by_xpath("//*[#id="lst-ib"]")
# elements = driver.find_elements_by_xpath("//*[#id="lst-ib"]")
# username = driver.find_element_by_xpath("//input[#name='username']")
# CLICK BUTTON ON PAGE
# http://stackoverflow.com/questions/27869225/python-clicking-a-button-on-a-webpage
Selenium would actually be a straightforward/good module to use for this script; you don't need anything else in this case. The easiest way to reach your goal is probably something like this:
from selenium import webdriver
import time
driver = webdriver.Firefox()
url = 'https://www.google.nl/'
linkList = []
driver.get(url)
string ='search phrase'
text = driver.find_element_by_xpath('//*[@id="lst-ib"]')
text.send_keys(string)
time.sleep(2)
linkBox = driver.find_element_by_xpath('//*[@id="nav"]/tbody/tr')
links = linkBox.find_elements_by_css_selector('a')
for link in links:
    linkList.append(link.get_attribute('href'))
print(linkList)
This code will open your browser, enter your search phrase and then get the links for the different page numbers.
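As a rough sketch of that last loop, continuing from the code above (the target string and the assumption that linkList holds the links for page 2 onward are mine, not from the original answer):
target = 'example.com/some-page'  # hypothetical URL/string you are looking for
for pageNumber, link in enumerate(linkList, start=2):  # assumes linkList starts at page 2
    driver.get(link)
    time.sleep(2)
    if target in driver.page_source:
        print('Found "{}" on results page {}'.format(target, pageNumber))
        break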
I hope this helps; if you have further questions let me know.