I am trying to iterate over a list of links on a [website][1], but Selenium is not able to locate particular, seemingly random ones. In particular, I am trying to click on each of the cities and extract the number of stores using a for loop, but it always skips, say, "Alameda" among some other cities, even though I see nothing different about their HTML.
driver = webdriver.Chrome(path)
driver.set_window_size(1120, 1000)
driver.get("https://locations.traderjoes.com/ca/")

cities = driver.find_elements_by_class_name('itemlist')
for i in range(0, len(cities)):
    print(city_list[i])
    if cities[i].is_displayed():
        cities[i].click()
        num = len(driver.find_elements_by_class_name('address-left'))
        num_stores_by_city.append(num)
        driver.find_element_by_xpath('//*[@id="content"]/a[2]').click()
    else:
        time.sleep(3)
        cities[i].click()
        num = len(driver.find_elements_by_class_name('address-left'))
        num_stores_by_city.append(num)
        driver.find_element_by_xpath('//*[@id="content"]/a[2]').click()
This will determine the cities and then loop through each one, gathering the number of stores and adding the information to a dictionary-type object:
driver = webdriver.Chrome(path)
url = 'https://locations.traderjoes.com/ca/'
driver.get(url)

city_list = {}
city_index = 0
processing_cities = True
while processing_cities:
    cities = driver.find_elements_by_css_selector('.itemlist a')
    if city_index < len(cities):
        city_text = cities[city_index].text
        cities[city_index].click()
        store_locations = driver.find_elements_by_css_selector('.itemlist')
        city_list[city_text] = len(store_locations)
        driver.get(url)
        city_index += 1
    else:
        processing_cities = False
print(city_list)
One of the issues you were running into was that once you click on an element and the page changes, your previously found elements become stale. You need to re-find them before interacting with them again.
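As a rough sketch of that pattern (reusing the selectors from the code above; untested against the live site), you could re-find the list on every pass:

url = 'https://locations.traderjoes.com/ca/'
city_count = len(driver.find_elements_by_css_selector('.itemlist a'))
for index in range(city_count):
    # re-find the links on every iteration; the references from the previous
    # pass went stale as soon as we clicked and the page changed
    cities = driver.find_elements_by_css_selector('.itemlist a')
    cities[index].click()
    # ...read whatever you need from the city page here...
    driver.get(url)  # navigating back invalidates the elements again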
Related
I need to scrape all the Google reviews. There are 90,564 reviews on my page. However, the code I wrote can scrape only the top 9 reviews; the other reviews are not scraped.
The code is given below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# specify the url of the business page on Google
url = 'https://www.google.com/maps/place/ISKCON+temple+Bangalore/@13.0098328,77.5510964,15z/data=!4m7!3m6!1s0x0:0x7a7fb24a41a6b2b3!8m2!3d13.0098328!4d77.5510964!9m1!1b1'
# create an instance of the Chrome driver
driver = webdriver.Chrome()
# navigate to the specified url
driver.get(url)
# Wait for the reviews to load
wait = WebDriverWait(driver, 20) # increased the waiting time
review_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'wiI7pd')))
# extract the text of each review
reviews = [element.text for element in review_elements]
# print the reviews
print(reviews)
# close the browser
driver.quit()
What should I edit/modify in the code to extract all the reviews?
Here is the working code for you, to be run after launching the URL:
# imports needed in addition to those already in the question
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import time

totalRev = "div div.fontBodySmall"
username = ".d4r55"
reviews = "wiI7pd"

wait = WebDriverWait(driver, 20)
totalRevCount = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, totalRev))).get_attribute("textContent").split(' ')[0].replace(',', '').replace('.', '')
print("totalRevCount - ", totalRevCount)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, totalRev))).click()

mydict = {}
found = 0
while found < int(totalRevCount):
    review_elements = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, reviews)))
    reviewer_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, username)))
    found = len(mydict)
    for rev, name in zip(review_elements, reviewer_names):
        mydict[name.text] = rev.text
        if len(rev.text) == 0:
            found = int(totalRevCount) + 1
            break
    for i in range(8):
        ActionChains(driver).key_down(Keys.ARROW_DOWN).perform()
    print("found - ", found)
    print(mydict)
    time.sleep(2)
Explanation -
Get the locators for the user name and the review, since we are going to create a key-value pair; this is useful for producing a non-duplicate result (a tiny illustration follows this list).
You need to first get the total number of reviews/ratings that are present for the given location.
Get the username and review for the "visible" part of the webpage and store them in the dictionary.
Scroll down the page and wait a few seconds.
Get the username and review again and add them to the dictionary. Only new ones will be added.
As soon as a review with no text (only a rating) is reached, the loop closes and you have your results.
NOTE - If you want all reviews irrespective of whether review text is present or not, you can remove the "if" block.
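A tiny illustration of the dictionary-based de-duplication (the reviewer name here is made up): writing the same key twice simply overwrites the previous entry, so re-reading reviews that are already on screen does not inflate the result. The one caveat is that two different reviewers with the same display name would collide.

mydict = {}
mydict['Some Reviewer'] = 'Great place'
mydict['Some Reviewer'] = 'Great place'  # same reviewer read again after scrolling
print(len(mydict))  # 1 - still a single entry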
I think you'll need to scroll down first, and then get all the reviews.
scroll_value = 230
driver.execute_script( 'window.scrollBy( 0, '+str(scroll_value)+ ' )' ) # to scroll by value
# to get the current scroll value on the y axis
scroll_Y = driver.execute_script( 'return window.scrollY' )
That might be because the elements don't get loaded otherwise.
Since there are over 90,000 of them, you might consider scrolling down a little, then getting the reviews, and repeating.
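A rough sketch of that scroll-then-collect loop (the 'wiI7pd' class comes from the question; the loop bound, scroll step, and sleep are assumptions, not tested against the live page):

import time
from selenium.webdriver.common.by import By

collected = {}
scroll_value = 230
for _ in range(400):  # scroll a little, read what is visible, repeat
    driver.execute_script('window.scrollBy(0, arguments[0])', scroll_value)
    time.sleep(1)  # give newly loaded reviews a moment to render
    for el in driver.find_elements(By.CLASS_NAME, 'wiI7pd'):
        if el.text:
            collected[el.text] = True  # keyed by text, so re-reads are not duplicated
print(len(collected), 'reviews collected')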
Resource: https://stackoverflow.com/a/74508235/20443541
I am attempting to scrape data across multiple pages (36) of a website to gather the document number and the revision number for each available document and save them to two different lists. If I run the code block below for each individual page, it works perfectly. However, when I added the while loop to loop through all 36 pages, it loops, but only the data from the first page is saved.
#sam.gov website
url = 'https://sam.gov/search/?index=sca&page=1&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'

#webdriver
driver = webdriver.Chrome(options=options_, executable_path=r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe')
driver.get(url)

#get rid of pop up window
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()

#list of revision numbers
revision_num = []
#empty list for all the WD links
WD_num = []

substring = '2015'
current_page = 0
while True:
    current_page += 1
    if current_page == 36:
        #find all elements on the page named "field name". For each one, get the text. If the text is 'Revision Number',
        #get the 'sibling' element, which is the actual revision number, and append its text to the revision_num list.
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #find all links that contain the partial text 2015 and put the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        print('Last Page Complete!')
        break
    else:
        #same extraction as above for every page before the last one
        elements = driver.find_elements_by_class_name('sds-field__name')
        wd_links = driver.find_elements_by_class_name('usa-link')
        for i in elements:
            element = i.text
            if element == 'Revision Number':
                revision_numbers = i.find_elements_by_xpath("./following-sibling::div")
                for x in revision_numbers:
                    a = x.text
                    revision_num.append(a)
        #find all links that contain the partial text 2015 and put the wd text into the WD_num list
        for link in wd_links:
            wd = link.text
            if substring in wd:
                WD_num.append(wd)
        #click on next page
        click_icon = WebDriverWait(driver, 5, 0.25).until(EC.visibility_of_element_located([By.ID, 'bottomPagination-nextPage']))
        click_icon.click()
        WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'main-container')))
Things I've tried:
I added the WebDriverWait in order to slow the script down so the page can load and/or elements become clickable/locatable.
I declared the empty lists outside the loop so they are not overwritten on each iteration.
I have edited the while loop multiple times, either counting up to 36 (while current_page < 37) or moving the counter to the top or bottom of the loop.
Any ideas? TIA.
EDIT: added screenshot of 'field name'
I have refactored your code and simplified things considerably.
driver = webdriver.Chrome(options=options_, executable_path=r'C:/Users/439528/Python Scripts/Spyder/chromedriver.exe')
revision_num = []
WD_num = []

for page in range(1, 37):
    url = 'https://sam.gov/search/?index=sca&page={}&sort=-modifiedDate&pageSize=25&sfm%5Bstatus%5D%5Bis_active%5D=true&sfm%5BwdPreviouslyPerformedWrapper%5D%5BpreviouslyPeformed%5D=prevPerfNo%2F'.format(page)
    driver.get(url)
    if page == 1:
        #dismiss the pop-up window only on the first page
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#sds-dialog-0 > button > usa-icon > i-bs > svg'))).click()
    #WD links contain the text 2015; the revision number sits in the div following the 'Revision Number' field name
    wd_links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class,'usa-link') and contains(.,'2015')]")))
    revision_fields = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='sds-field__name' and text()='Revision Number']/following-sibling::div")))
    for wd_link in wd_links:
        WD_num.append(wd_link.text)
    for revision in revision_fields:
        revision_num.append(revision.text)

print(revision_num)
print(WD_num)
If you know there are only 36 pages to iterate, you can pass the page value in the URL.
Wait for elements to become visible using WebDriverWait.
Construct your XPath in such a way that it identifies the element uniquely, without the if/else logic (see the sketch below).
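For example, the 'Revision Number' values can be targeted directly with a single XPath instead of looping over every field name and checking its text with an if (the selector is the one used in the refactored code above):

revision_values = driver.find_elements_by_xpath(
    "//div[@class='sds-field__name' and text()='Revision Number']"
    "/following-sibling::div"
)
revision_num = [el.text for el in revision_values]  # one value per document on the page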
console output on my terminal:
Very new to Python and Selenium, looking to scrape a few data points. I'm struggling in three areas:
I don't understand how to loop through multiple URLs properly
I can't figure out why the script is iterating twice over each URL
I can't figure out why it's only outputting the data for the second URL
Much thanks for taking a look!
Here's my current script:
urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

driver = webdriver.Chrome(executable_path='/Library/Frameworks/Python.framework/Versions/3.9/bin/chromedriver')

for url in urls:
    for page in range(0, 1):
        driver.get(url)
        wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
        df = pd.DataFrame(columns=['Title', 'Core Web Vitals', 'FCP', 'FID', 'CLS', 'TTI', 'TBT', 'Total Score'])
        company = driver.find_elements_by_class_name("audited-url__link")
        data = []
        for i in company:
            data.append(i.get_attribute('href'))
        for x in data:
            #Get URL name
            title = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[2]/h1/a')
            co_name = title.text
            #Get Core Web Vitals text pass/fail
            cwv = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[1]/span[2]')
            core_web = cwv.text
            #Get FCP
            fcp = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[1]/div')
            first_content = fcp.text
            #Get FID
            fid = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[3]/div[1]/div')
            first_input = fid.text
            #Get CLS
            cls = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[2]/div/div[1]/div[1]/div[2]/div[4]/div[1]/div')
            layout_shift = cls.text
            #Get TTI
            tti = driver.find_element_by_xpath('//*[@id="interactive"]/div/div[1]')
            time_interactive = tti.text
            #Get TBT
            tbt = driver.find_element_by_xpath('//*[@id="total-blocking-time"]/div/div[1]')
            total_block = tbt.text
            #Get Total Score
            total_score = driver.find_element_by_xpath('//*[@id="page-speed-insights"]/div[2]/div[3]/div[2]/div[1]/div[1]/div/div[1]/a/div[2]')
            score = total_score.text
            #Adding all columns to dataframe
            df.loc[len(df)] = [co_name, core_web, first_content, first_input, layout_shift, time_interactive, total_block, score]

driver.close()
#df.to_csv('Double Page Speed Test 9-10.csv')
print(df)
Q1: I don't understand how to loop through multiple URLs properly?
Ans: for url in urls:
Q2: I can't figure out why the script is iterating twice over each URL.
Ans: Because you have for page in range(0, 1):
Update 1:
I did not run your entire code with the DataFrame. Also, sometimes one of the pages does not show the number and href, but when I run the code below,
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)

urls = [
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.crutchfield.com/%2F&tab=mobile',
    'https://developers.google.com/speed/pagespeed/insights/?url=https://www.lastpass.com%2F&tab=mobile'
]

data = []
for url in urls:
    driver.get(url)
    wait = WebDriverWait(driver, 120).until(EC.presence_of_element_located((By.CLASS_NAME, 'origin-field-data')))
    company = driver.find_elements_by_css_selector("h1.audited-url a")
    for i in company:
        data.append(i.get_attribute('href'))
print(data)
it gives this output:
['https://www.crutchfield.com//', 'https://www.lastpass.com/', 'https://www.lastpass.com/']
which is correct, because the element locator we used matches 1 element on page 1 and 2 elements on page 2.
I suspect this isn't very complicated, but I can't seem to figure it out. I'm using Selenium and Beautiful Soup to parse Petango.com. The data will be used to help a local shelter understand how they compare on different metrics to other area shelters, so the next step will be taking these data frames and doing some analysis.
I grab detail URLs from a different module and import the list here.
My issue is that my lists only show the value from the HTML for the first dog. I was stepping through and noticed the len values differ across the soup iterations, so I realize my error is somewhere after that, but I can't figure out how to fix it.
Here is my code so far (running the whole process vs. using a cached page):
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from Petango import pet_links

headings = []
values = []
ShelterInfo = []
ShelterInfoWebsite = []
ShelterInfoEmail = []
ShelterInfoPhone = []
ShelterInfoAddress = []
Breed = []
Age = []
Color = []
SpayedNeutered = []
Size = []
Declawed = []
AdoptionDate = []

# to access the live sites, change url_list to pet_links (break out as needed) and keep "if True";
# "if False" reads the cached HTML file instead
url_list = (pet_links[4], pet_links[6], pet_links[8])
#url_list = ("Petango.html", "Petango.html", "Petango.html")

for link in url_list:
    page_source = None
    if True:
        # PetPage = link populates the detail pages from the list above; the hard-coded link was for
        # one detail page, and the HTML file was for the cached site
        PetPage = link
        #PetPage = 'https://www.petango.com/Adopt/Dog-Terrier-American-Pit-Bull-45569732'
        #PetPage = Petango.html
        PetDriver = webdriver.Chrome(executable_path='/Users/paulcarson/Downloads/chromedriver')
        PetDriver.implicitly_wait(30)
        PetDriver.get(link)
        page_source = PetDriver.page_source
        PetDriver.close()
    else:
        with open("Petango.html", 'r') as f:
            page_source = f.read()

    PetSoup = BeautifulSoup(page_source, 'html.parser')
    print(len(PetSoup.text))

    #get the details about the shelter and add them to the lists
    ShelterInfo.append(PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find('h4').text)
    ShelterInfoParagraphs = PetSoup.find('div', class_="DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find_all('p')

    First_Paragraph = ShelterInfoParagraphs[0]
    if "Website" not in First_Paragraph.text:
        raise AssertionError("first paragraph is not about site")
    ShelterInfoWebsite.append(First_Paragraph.find('a').text)

    Second_Paragraph = ShelterInfoParagraphs[1]
    ShelterInfoEmail.append(Second_Paragraph.find('a')['href'])

    Third_Paragraph = ShelterInfoParagraphs[2]
    ShelterInfoPhone.append(Third_Paragraph.find('span').text)

    Fourth_Paragraph = ShelterInfoParagraphs[3]
    ShelterInfoAddress.append(Fourth_Paragraph.find('span').text)

    #get the details about the pet
    ul = PetSoup.find('div', class_='group details-list').ul  # gets the ul tag
    li_items = ul.find_all('li')  # finds all the li tags within the ul tag
    for li in li_items:
        heading = li.strong.text
        headings.append(heading)
        value = li.span.text
        if value:
            values.append(value)
        else:
            values.append(None)

    Breed.append(values[0])
    Age.append(values[1])
    print(Age)
    Color.append(values[2])
    SpayedNeutered.append(values[3])
    Size.append(values[4])
    Declawed.append(values[5])
    AdoptionDate.append(values[6])

ShelterDF = pd.DataFrame(
    {
        'Shelter': ShelterInfo,
        'Shelter Website': ShelterInfoWebsite,
        'Shelter Email': ShelterInfoEmail,
        'Shelter Phone Number': ShelterInfoPhone,
        'Shelter Address': ShelterInfoAddress
    })

PetDF = pd.DataFrame(
    {
        'Breed': Breed,
        'Age': Age,
        'Color': Color,
        'Spayed/Neutered': SpayedNeutered,
        'Size': Size,
        'Declawed': Declawed,
        'Adoption Date': AdoptionDate
    })

print(PetDF)
print(ShelterDF)
Output from print(len(PetSoup.text)) and printing the Age list as the loop progresses:
12783
['6y 7m']
10687
['6y 7m', '6y 7m']
10705
['6y 7m', '6y 7m', '6y 7m']
Could someone please point me in the right direction?
Thank you for your help!
Paul
You need to change the find() method to find_all() in BeautifulSoup so it locates all the matching elements.
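For illustration (the tag name here is a placeholder, not the actual Petango markup):

first_match = PetSoup.find('li')      # returns only the first matching tag, or None
all_matches = PetSoup.find_all('li')  # returns a list of every matching tag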
values is global and keeps growing across iterations, so you only ever append the first dog's value at that index to Age:
Age.append(values[1])
The same problem applies to your other global lists (a static index, whether 1 or 2, etc.).
You need a way to track the appropriate index to use, perhaps through a counter, or other logic to ensure the current value is added, e.g. for the current Age, is it the second li in the loop? Or just append PetSoup.select_one("[data-bind='text: age']").text
It looks like each item of interest, e.g. colour or spayed, carries a data-bind attribute, so you can use those with the appropriate attribute value to select each value and avoid the loop over li elements,
e.g. current_colour = PetSoup.select_one("[data-bind='text: color']").text
It is best to assign the result to a variable and test that it is not None before accessing .text (a short sketch follows).
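A minimal sketch of that pattern, assuming Petango's markup exposes the data-bind attribute suggested above:

colour_tag = PetSoup.select_one("[data-bind='text: color']")
if colour_tag is not None:
    Color.append(colour_tag.text)
else:
    Color.append(None)  # field missing on this detail page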
I am new to programming and need some help with my web crawler.
At the moment, my code opens every web page in the list. However, I wish to extract information from each one as it loads. This is what I have:
from selenium import webdriver
import csv

driver = webdriver.Firefox()

links_code = driver.find_elements_by_xpath('//a[@class="in-match"]')
first_two = links_code[0:2]
first_two_links = []
for i in first_two:
    link = i.get_attribute("href")
    first_two_links.append(link)
for i in first_two_links:
    driver.get(i)
This loops through the first two pages but scrapes no info, so I tried adding to the for loop as follows:
odds = []
for i in first_two_links:
    driver.get(i)
    driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.append(odd)
However, this runs into an error.
Any help much appreciated.
You are not actually appending anything! You need to assign a variable to
driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
and then append it to the list:
from selenium import webdriver
import csv

driver = webdriver.Firefox()

links_code = driver.find_elements_by_xpath('//a[@class="in-match"]')
first_two = links_code[0:2]
first_two_links = []
for i in first_two:
    link = i.get_attribute("href")
    first_two_links.append(link)
for i in first_two_links:
    driver.get(i)

odds = []
for i in first_two_links:
    driver.get(i)
    o = driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.append(o)
First, after you start the driver you need to go to a website...
Second, in the second for loop you are trying to append the wrong object: use i, not odd, or make odd = driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
If you can provide the URL or the HTML, we can help more!
Try this (I have used Google as an example; you will need to change the code...):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.google.com")

links_code = driver.find_elements_by_xpath('//a')
first_two = links_code[0:2]
first_two_links = []
for i in first_two:
    link = i.get_attribute("href")
    first_two_links.append(link)
    print(link)

odds = []
for i in first_two_links:
    driver.get(i)
    odd = driver.page_source
    print(odd)
    # driver.find_element_by_xpath('//span[@class="table-main__detail-odds--hasarchive"]')
    odds.append(odd)