Python Scraper Won't Complete

I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results, despite having 100 search results loaded.
Ideally, I would like it to scrape all of the search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()

The code is working as expected and scraping 10 results, which is the default page size for Google search. You can use methods like find_element_by_xpath to find the "Next" button and click it.
This operation needs to be repeated in a loop until sufficient results are collected. Refer to the Selenium documentation on locating elements for more details.
For how to use the Selenium commands, you can look it up on the web; I found one similar question which can provide some reference.

Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed to the for loop, or replace the for loop entirely with a while loop, so instead of
for i in range(2):
you could do:
while len(emails) < 100:
Make sure to manage the timing of the page navigation: wait for the next page to load before extracting the available emails, and only then click the next button on the search results page.
Make sure to refer to the docs to get a clear idea of what you should do to achieve what you want. Happy Hacking!!
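For illustration, here is a minimal sketch of that while-loop approach with an explicit wait instead of a fixed sleep; it assumes the same driver, email_pattern, and "pnnext" next-button id as the code above, and is a starting point rather than a drop-in replacement:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

emails = []
while len(emails) < 100:
    # collect whatever emails are visible on the current results page
    emails.extend(re.findall(email_pattern, driver.page_source))
    try:
        # wait until the "Next" button is clickable, then move to the next page
        next_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "pnnext"))
        )
        next_button.click()
    except Exception:
        # no further result pages (or the button never appeared), so stop early
        break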

Selenium launches its own clean browser profile, so your Google setting for 100 results per page has to be applied in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described at
tldevtech.com/how-to-show-100-results-per-page-in-google-search
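As a rough illustration of that second method, the idea is to append the num query parameter to the search URL; note that Google has at times capped or ignored this parameter, so treat the snippet below as an assumption to verify rather than a guarantee:
# ask Google for up to 100 results on a single page via the num parameter
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&num=100"
driver.get(target_url)
# then a single page_source scrape can see all returned results
emails = re.findall(email_pattern, driver.page_source)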

Related

Using Python (Selenium) to Scrape IMDB (.click() is not working)

I am trying to scrape a list of specific movies from IMDB using this tutorial.
The code is working fine except for the click that is supposed to open the result so its URL can be saved in content. It is not working: nothing changes in Chrome when the code runs. I would really appreciate it if anyone can help.
content = driver.find_element_by_class_name("tF2Cxc").click()
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import time
movie = 'Wolf Totem'
driver = webdriver.Chrome(executable_path=r"D:\chromedriver.exe")
#Go to Google
driver.get("https://www.google.com/")
#Enter the keyword
driver.find_element_by_name("q").send_keys(movie + " imdb")
time.sleep(1)
#Click the google search button
driver.find_element_by_name("btnK").send_keys(Keys.ENTER)
time.sleep(1)
You are using a wrong locator.
To open a search result on the Google page you should use this:
driver.find_element_by_xpath("//div[@class='yuRUbf']/a").click()
This locator matches all 10 search results, so the first match is the first search result.
Also, clicking on that element will not give you any content; it will just open the first link below the title of the first search result.
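If the goal is to capture the result's URL (and its page content) rather than just trigger a click, one possible sketch, assuming the same yuRUbf container class used above, is to read the href attribute and navigate to it explicitly:
# locate the first result's anchor instead of clicking it blindly
first_result = driver.find_element_by_xpath("//div[@class='yuRUbf']/a")
imdb_url = first_result.get_attribute('href')  # URL of the first search result
driver.get(imdb_url)              # open the IMDb page itself
content = driver.page_source      # keep its HTML for later parsing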

Gathering data from table using Pandas and Beautiful Soup after logging in using Selenium

I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account. I've decided to approach this using Selenium to log in. I then hope to be able to read the table into a Pandas DataFrame, with BeautifulSoup as a go-between.
Here is my code:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.example.com/userarea"
driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)
username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz#email.com")
password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')
driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)
driver.find_element_by_xpath('//span[text()="Text"]').click()
driver.find_element_by_xpath('//span[text()="Text"]').click()
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
try:
    tables = soup.find_all('th')
    print(tables)  # Returns an empty list
    df = pd.read_html(str(tables))
    df.head()
except:
    driver.close()
driver.close()
Unfortunately, this is only printing an empty list. I've tried using lxml too, but no joy.
Using the inspection tools it does seem that there aren't any table tags, so I tried to find all <th> tags instead (which definitely are present). Again no joy. I've not yet tried to work through the individual pages; I only mention the pagination in case it offers a clue to the issue.
Any idea what I'm missing?
Thank you to those that offered suggestions. In the end furas' suggestion was the most helpful: it turned out the script was running too quickly. I paused Python for 6 seconds after clicking through to the page with the table on it. The table seems to be populated by JavaScript, and I can actually see the values pop into place as the script works through the pagination.
import time
#Navigate to page, then let it load using:
time.sleep(6)
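As a more robust alternative to a fixed sleep, an explicit wait can block until the table headers actually appear before parsing; this is only a sketch and assumes the headers are rendered as <th> elements, as described above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for at least one <th> to be rendered by the JavaScript
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.TAG_NAME, "th"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')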

Python Web Scraper - Issue grabbing links from href

I've been following along with this guide to web scraping LinkedIn and Google searches. There have been some changes in the HTML of Google's search results since the guide was created, so I've had to tinker with the code a bit. I'm at the point where I need to grab the links from the search results, but I've run into an issue where the program doesn't return anything, even after implementing a code fix from this post for an error. I'm not sure what I'm doing wrong here.
import Parameters
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from parsel import Selector
import csv
# defining new variable passing two parameters
writer = csv.writer(open(Parameters.file_name, 'w'))
# writerow() method to write the header row to the file object
writer.writerow(['Name', 'Job Title', 'Company', 'College', 'Location', 'URL'])
# specifies the path to the chromedriver.exe
driver = webdriver.Chrome('/Users/.../Python Scripts/chromedriver')
driver.get('https://www.linkedin.com')
sleep(0.5)
# locate email form by_class_name then send_keys() to simulate key strokes
username = driver.find_element_by_id('session_key')
username.send_keys(Parameters.linkedin_username)
sleep(0.5)
password = driver.find_element_by_id('session_password')
password.send_keys(Parameters.linkedin_password)
sleep(0.5)
sign_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
sign_in_button.click()
sleep(3)
driver.get('https://www.google.com')
sleep(3)
search_query = driver.find_element_by_name('q')
search_query.send_keys(Parameters.search_query)
sleep(0.5)
search_query.send_keys(Keys.RETURN)
sleep(3)
################# HERE IS WHERE THE ISSUE LIES ######################
#linkedin_urls = driver.find_elements_by_class_name('iUh30')
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
for url_prep in linkedin_urls:
    url_prep.get_attribute('href')
#linkedin_urls = [url.text for url in linkedin_urls]
sleep(0.5)
print('Supposed to be URLs')
print(linkedin_urls)
The search parameter is
search_query = 'site:linkedin.com/in/ AND "python developer" AND "London"'
Results in an empty list.
(The original post included a screenshot of the HTML section I want to grab, and a screenshot of the output when using .find_elements_by_class_name or Sector97's first edits.)
Found an alternative solution that might make it a bit easier to achieve what you're after. Credit to A.Pond at
https://stackoverflow.com/a/62050505
Use the google search api to get the links from the results.
You may need to install the library first
pip install google
You can then use the api to quickly extract an arbitrary number of links:
from googlesearch import search
links = []
query = 'site:linkedin.com/in AND "python developer" AND "London"'
for j in search(query, tld='com', start=0, stop=100, pause=4):
    links.append(j)
I got the first 100 results but you can play around with the parameters to get more or less as you need.
You can see more about this api here:
https://www.geeksforgeeks.org/performing-google-search-using-python-code/
I think I found the error in your code.
Instead of using
linkedin_urls = driver.find_elements_by_css_selector("yuRUbf > a")
Try this instead:
web_elements = driver.find_elements_by_class_name("yuRUbf")
That gets you the parent elements. You can then extract the url text using a simple list comprehension:
linkedin_urls = [elem.find_element_by_css_selector('a').get_attribute('href') for elem in web_elements]
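If you also want those links written out with the csv writer defined at the top of the script, a small hedged follow-up might look like this (the other columns are left blank since only the URL is known at this point):
for url in linkedin_urls:
    # pad the unknown columns so each row lines up with the header written earlier
    writer.writerow(['', '', '', '', '', url])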

Use Python to go through Google Search Results for given Search Phrase and URL

Windows 10 Home 64 Bit
Python 2.7 (also tried in 3.3)
Pycharm Community 2006.3.1
Very new to Python so bear with me.
I want to write a script that will go to Google, enter a search phrase, click the Search button and look through the search results for a URL (or any string); if there is no result on that page, it should click the Next button and repeat on subsequent pages until it finds the URL, then stop and print what page the result was found on.
I honestly don't care if it just runs in the background and gives me the result. At first I was trying to have it literally open the browser, find the browser objects (search field and search button) via XPath, and execute it that way.
You can see the modules I've installed and tried. I have tried almost every code example I've found on Stack Overflow for two days, so listing everything I've tried would be quite wordy.
If anyone can just tell me the modules that would work best, any other direction would be very much appreciated!
Specific modules I've tried for this were Selenium, clipboard, MechanicalSoup, BeautifulSoup, webbrowser, urllib, unittest and Popen.
Thank you in advance!
Chantz
import clipboard
import json as m_json
import mechanicalsoup
import random
import sys
import os
import mechanize
import re
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import unittest
import webbrowser
from mechanize import Browser
from bs4 import BeautifulSoup
from subprocess import Popen
######################################################
######################################################
# Xpath Google Search Box
# //*[#id="lst-ib"]
# Xpath Google Search Button
# //*[#id="tsf"]/div[2]/div[3]/center/input[1]
######################################################
######################################################
webbrowser.open('http://www.google.com')
time.sleep(3)
clipboard.copy("abc") # now the clipboard content will be string "abc"
driver = webdriver.Firefox()
driver.get('http://www.google.com/')
driver.find_element_by_id('//*[@id="lst-ib"]')
text = clipboard.paste("abc") # text will have the content of clipboard
print('text')
# browser = mechanize.Browser()
# url = raw_input("http://www.google.com")
# username = driver.find_element_by_xpath("//form[input/#name='username']")
# username = driver.find_element_by_xpath("//form[#id='loginForm']/input[1]")
# username = driver.find_element_by_xpath("//*[#id="lst-ib"]")
# elements = driver.find_elements_by_xpath("//*[#id="lst-ib"]")
# username = driver.find_element_by_xpath("//input[#name='username']")
# CLICK BUTTON ON PAGE
# http://stackoverflow.com/questions/27869225/python-clicking-a-button-on-a-webpage
Selenium would actually be a straightforward, good module to use for this script; you don't need anything else in this case. The easiest way to reach your goal is probably something like this:
from selenium import webdriver
import time
driver = webdriver.Firefox()
url = 'https://www.google.nl/'
linkList = []
driver.get(url)
string ='search phrase'
text = driver.find_element_by_xpath('//*[@id="lst-ib"]')
text.send_keys(string)
time.sleep(2)
linkBox = driver.find_element_by_xpath('//*[@id="nav"]/tbody/tr')
links = linkBox.find_elements_by_css_selector('a')
for link in links:
    linkList.append(link.get_attribute('href'))
print(linkList)
This code will open your browser, enter your search phrase and then collect the links for the different page numbers. From here you only need to write a loop that opens every link in your browser and checks whether the search phrase is there.
I hope this helps; if you have further questions let me know.
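A minimal sketch of that follow-up loop, reusing driver, time, and linkList from above; the target string and the assumption that linkList holds the page-2-onwards navigation links are mine:
target = 'http://www.example.com/'  # hypothetical URL or string to look for
for page_number, link in enumerate(linkList, start=2):  # page 1 is already open
    driver.get(link)
    time.sleep(2)  # crude wait for the results page to render
    if target in driver.page_source:
        print('Found "%s" on results page %d' % (target, page_number))
        break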

python selenium - how can I get all visible text on the page (i.e., not the page source) [duplicate]

I've been googling this all day without finding the answer, so apologies in advance if this is already answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs
filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; it does not, however, work as planned on all webpages (it also makes the script a lot slower).
I'm guessing the reason for my problem is that, when asking for the inner text of an element, I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element whose inner text I could grab? Or a completely different way that would enable me to reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and BeautifulSoup is because I wanted JavaScript-rendered text.
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean
url="http://www.yahoo.com"
ignore_tags=('script','noscript','style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing
import lxml.html as html # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox # pip install selenium
from werkzeug.contrib.cache import FileSystemCache # pip install werkzeug
cache = FileSystemCache('.cachedir', threshold=100000)
url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"
# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds
# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>,<style>, etc
Cleaner(kill_tags=['noscript'], style=True)(root) # lxml >= 2.3.1
print(root.text_content())  # extract text
I've separated your task into two parts:
get the page (including elements generated by JavaScript)
extract the text
The code is connected only through the cache. You can fetch pages in one process and extract text in another process, or defer it and do it later using a different algorithm.
