I have the following loop to navigate through multiple dynamic drop-down boxes, submit to reach another webpage, then return to the initial page to iterate through all the options.
The problem is that I need to insert a line directing the browser back to the initial page before every loop iteration, hence the slow execution speed. Any ideas on improving the code?
Example of the code below:
select = Select(driver.find_element_by_id('ID'))
options = select.options
for index in range(1, len(options)):
    driver.get('webpage')  # issue for slow execution speed
    select = Select(driver.find_element_by_id('ID'))  # re-find after the reload
    select.select_by_index(index)
    select2 = Select(driver.find_element_by_id('ID2'))  # can only select after selecting the first drop down box
    options2 = select2.options
    for index2 in range(1, len(options2)):
        driver.get('webpage')  # issue for slow execution speed
        select = Select(driver.find_element_by_id('ID'))  # re-find after the reload
        select.select_by_index(index)
        select2 = Select(driver.find_element_by_id('ID2'))
        options2 = select2.options
        select2.select_by_index(index2)
        submit()  # click on button to navigate to another page, read data then back to loop
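One possible restructuring, as a sketch only ('webpage', 'ID', 'ID2' and submit() are the question's placeholders): since submit() navigates away, one reload per inner iteration is unavoidable, but the extra reload at the top of the outer loop can be dropped, and driver.back() may be faster than driver.get() when the initial page is cacheable.

driver.get('webpage')
outer_count = len(Select(driver.find_element_by_id('ID')).options)
for index in range(1, outer_count):
    # we are already on the form page here; pick the first box and read the second
    select = Select(driver.find_element_by_id('ID'))
    select.select_by_index(index)
    inner_count = len(Select(driver.find_element_by_id('ID2')).options)
    for index2 in range(1, inner_count):
        select = Select(driver.find_element_by_id('ID'))
        select.select_by_index(index)
        select2 = Select(driver.find_element_by_id('ID2'))
        select2.select_by_index(index2)
        submit()        # navigates to the results page
        # ... read the data ...
        driver.back()   # return to the form; often faster than a fresh driver.get()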
Hi, I'm new to StackOverflow. Apologies in advance if the post is not well structured.
I have been learning web scraping in Python, and as part of a hobby project I was trying to scrape Google Jobs and extract specific data to store in a pandas data frame.
I'm using Selenium with Python to achieve this.
So, the main challenge for me was to figure out a way to scrape all the job records returned by the search query (url = Google Jobs). This was difficult only because Google Jobs loads dynamically, i.e. with infinite scrolling: the page initially loads only 10 results in the side panel, and each scroll down loads only 10 more.
Website preview
I used Selenium to help me with this. I figured I could automate the scrolling by instructing Selenium to scroll into view the list element (<li>) associated with the last job entry in the side panel, and run a for loop to repeat that until all results are loaded onto the page.
Then I just had to extract the list elements and store their text into a data frame.
The problem is that each job entry has anywhere between 3 and 6 lines of text, each line representing some attribute like job title, company name, or location, so some entries have more lines than others.
Different number of lines for each job entry
So when I split the text into a Python list using '\n' as the separator, it results in lists with different lengths.
This becomes a problem when I use pd.DataFrame(list) to generate a dataframe, resulting in records with a jumbled order of fields.
Different Length Lists 😓
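To illustrate (a made-up example of my own, not the asker's data): when the rows have different lengths, pandas pads the shorter ones with NaN on the right, so fields drift across columns.

import pandas as pd

# hypothetical job entries after splitting on '\n'; the second one has no "Posted" line
rows = [
    ['Data Analyst', 'Google', 'via LinkedIn', '2 days ago', 'Full-time'],
    ['Data Analyst', 'Acme Corp', 'via Indeed', 'Full-time'],
]
print(pd.DataFrame(rows))
# in the second row, 'Full-time' ends up under the "Posted" column and the last column is NaN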
Below is the code I have come up with:
#imports
import pandas as pd
import numpy as np
from serpapi import GoogleSearch
import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
#using selenium to launch and scroll through the Google Jobs page
url = "https://www.google.com/search?q=google+jobs+data+analyst&oq=google+jobs+data+analyst&aqs=chrome..69i57j69i59j0i512j0i22i30i625l4j69i60.4543j0j7&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&ved=2ahUKEwjXsv-_iZP9AhVPRmwGHX5xDEsQutcGKAF6BAgPEAU&sxsrf=AJOqlzWGHNISzgpAUCZBmQA1mWXXt3I7gA:1676311105893#htivrt=jobs&htidocid=GS94rKdYQqQAAAAAAAAAAA%3D%3D&fpstate=tldetail"
driver = webdriver.Chrome()
driver.get(url)
joblist = []
#pointing to the html element to scroll to
elementxpath = '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
#capturing all the job list objects in the first page
datas = driver.find_elements(By.XPATH,'//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li')
joblist.append([da.text for da in datas])
#adding 3s delay for website to load after scrolling before executing code
time.sleep(3)
#capturing all the job list objects in the second set of 10 results loaded after 1st scroll down
elementxpath = '//*[#id="VoQFxe"]/div/div/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div/div/ul/li')
joblist.append([da.text for da in datas])
x=2
time.sleep(3)
#using a while loop to scroll and capture the remaining batches, since this element xpath follows an iterable pattern unlike the previous two xpaths
while True:
    elementxpath = '//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li[10]'
    element = driver.find_element(By.XPATH,elementxpath)
    driver.execute_script('arguments[0].scrollIntoView(true)',element)
    x+=1
    time.sleep(3)
    datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li')
    joblist.append([da.text for da in datas])
    if x>1000:
        break
    else:
        continue
#unpacking and cleaning the captured values from joblist into a new list of lists in the desired format for creating a dataframe
jlist = []
for n in joblist:
    for a in range(0,len(n)-1):
        if n[a]!='':
            jlist.append(n[a].split('\n'))
jobdf = pd.DataFrame(jlist)
jobdf.columns = ['Logo','Role', 'Company', 'Source','Posted','Full / Part Time', 'Waste']
jobdf
This is the output data frame:
Jumbled mess 😶
Men and Women of culture, I implore your help to get an ordered DataFrame that makes sense. Thank you!
You can usually get away with .split('\n') in simple cases, but here it is a bad idea. A better practice is to use a unique xpath for each element you want to scrape: one for the logo, one for the role, etc.
Another good practice is to initialize a dictionary at the beginning with one key for each element you want to scrape, and then append data as you loop over the jobs.
The following code does exactly this. It is not optimized for speed: it scrolls to each job and scrapes it, while a faster approach would be to scrape all the currently displayed jobs, scroll to the bottom, scrape the newly loaded jobs, and so on (a sketch of that batched variant appears right after the code).
# import libraries...
# load webpage...
from selenium.common.exceptions import NoSuchElementException
xpaths = {
'Logo' :"./div[1]//img",
'Role' :"./div[2]",
'Company' :"./div[4]/div/div[1]",
'Location' :"./div[4]/div/div[2]",
'Source' :"./div[4]/div/div[3]",
'Posted' :"./div[4]/div/div[4]/div[1]",
'Full / Part Time':"./div[4]/div/div[4]/div[2]",
}
data = {key:[] for key in xpaths}
jobs_to_do = 100
jobs_done = 0
while jobs_done < jobs_to_do:
    lis = driver.find_elements(By.XPATH, "//li[@data-ved]//div[@role='treeitem']/div/div")
    for li in lis[jobs_done:]:
        driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', li)
        for key in xpaths:
            try:
                t = li.find_element(By.XPATH, xpaths[key]).get_attribute('src' if key=='Logo' else 'innerText')
            except NoSuchElementException:
                t = '*missing data*'
            data[key].append(t)
        jobs_done += 1
        print(f'{jobs_done=}', end='\r')
        time.sleep(.2)
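For reference, the batched variant mentioned above might look roughly like this (my sketch, not part of the original answer; it reuses the same xpaths, data, jobs_done and jobs_to_do variables and the same locator):

while jobs_done < jobs_to_do:
    lis = driver.find_elements(By.XPATH, "//li[@data-ved]//div[@role='treeitem']/div/div")
    # scrape every job already in the DOM without scrolling to each one individually
    for li in lis[jobs_done:]:
        for key in xpaths:
            try:
                t = li.find_element(By.XPATH, xpaths[key]).get_attribute('src' if key == 'Logo' else 'innerText')
            except NoSuchElementException:
                t = '*missing data*'
            data[key].append(t)
        jobs_done += 1
    # one scroll per batch: bring the last job into view so the next batch of results loads
    driver.execute_script('arguments[0].scrollIntoView({block: "center"});', lis[-1])
    time.sleep(3)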
Then by running pd.DataFrame(data) you get something like this
As you can see from the image, some values in the "Posted" column should instead be in the "Full / Part Time" column. This happens because some jobs have no info about the posted time. I also noticed that some jobs have not only "posted" and "full/part time" data but also a "salary". So you should adjust the code to take these cases into account. It is not so easy, because the HTML objects don't have specific classes for each element, so I think you have to exploit the svg symbols (clock, bag and banknote) shown in this image
UPDATE
I tried using the svg paths to correctly scrape "posted", "full/part time" and "salary" and it works! Here are the paths
xpaths = {
'Logo' :"./div[1]//img",
'Role' :"./div[2]",
'Company' :"./div[4]/div/div[1]",
'Location' :"./div[4]/div/div[2]",
'Source' :"./div[4]/div/div[3]",
'Posted' :".//*[name()='path'][contains(@d,'M11.99')]/ancestor::div[1]",
'Full / Part Time':".//*[name()='path'][contains(@d,'M20 6')]/ancestor::div[1]",
'Salary' :".//*[name()='path'][@fill-rule='evenodd']/ancestor::div[1]"
}
Replace the old paths with the new ones and it will work as expected, as shown in the picture below
In my project, I am downloading all the reports by clicking each link, where each link is a date. Below is an image of the table.
I have to extract a report for each date mentioned in the table column "Payment Date". Each date is a link to a report, so I am clicking all the dates one by one to get the reports downloaded.
for dt in driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span'):
    dt.click()
    time.sleep(random.randint(5, 10))
The process is: when I click one date, it downloads the report for that date; then I click the next date to get its report. So I made a for loop to go through all the links and download a report for every date.
But it gives me a StaleElementReferenceException. After clicking the first date, it cannot click the next date; I get the error and the code stops.
How can I solve this?
You're getting a stale element exception because the DOM is updating elements in your selection on each click.
An example: on click, a "clicked" value is appended to an element's class attribute. Since the list you've selected contains elements which have changed (the first element has a new class), it throws an error.
A quick and dirty solution is to re-perform your query after each iteration. This is especially helpful if the list of values grows or shrinks with clicks.
# Create an anonymous function to re-use
# This function can contain any selector
get_elements = lambda: driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')
i = 0
while True:
    elements = get_elements()
    # Exit if you're finished iterating
    if not elements or i >= len(elements):
        break
    # This should always work, because the list was re-queried just above
    elements[i].click()
    # sleep
    time.sleep(random.randint(5, 10))
    # Update your counter
    i += 1
The simplest way to solve it is to get a specific link each time before clicking on it.
links = driver.find_elements_by_xpath('//*[@id="tr-undefined"]/td[1]/span')
for i in range(len(links)):
    # re-locate the i-th link just before clicking so the reference is never stale
    element = driver.find_element_by_xpath(f'(//*[@id="tr-undefined"]/td[1]/span)[{i + 1}]')
    element.click()
    time.sleep(random.randint(5, 10))
set-up
I use Python + Selenium to scrape info about companies from this site.
Since the website doesn't allow me to simply load page URLs, I plan to click the next-page arrow element at the bottom of the list, using a while loop with a counter.
the code
browser.get('https://new.abb.com/channel-partners/search#')
wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'abb-pagination')))
# start while loop and counter
c = 1
while c < 65:
    c += 1
    # obtain list of companies element
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,'#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')))
    resultlist = el_css('#PublicWrapper > main > section:nth-child(7) > div:nth-child(2)')
    # loop over companies in list
    for company in resultlist.find_elements_by_xpath('div'):
        # company name
        name = company.find_element_by_xpath('h3/a/span').text
        # code to capture more company info follows
    # next page arrow element
    next_page_arrow = el_cn('abb-pagination__item--next')
    next_page_arrow.click()
issue
The code captures the company info just fine outside of the while loop, i.e. just the first page.
However, when inserted in the while loop to iterate over the pages, I get the following error: StaleElementReferenceException: stale element reference: element is not attached to the page document (Session info: chrome=88.0.4324.192)
If I step through it, resultlist for the subsequent page does seem to get captured, but the loop over the companies in resultlist raises this error.
What to do?
The simplest solution would be to add a short fixed wait on each page:
driver.get('https://new.abb.com/channel-partners/search#')
company_name = []
while True:
    time.sleep(1)
    company_name += [elem.text for elem in wait.until(EC.presence_of_all_elements_located((By.XPATH,'//span[@property="name"]')))]
    # if next page arrow element still available, click, else break while
    if driver.find_elements_by_xpath('//li[@class="abb-pagination__item--next"]/a[contains(@href,"#page")]'):
        wait.until(EC.presence_of_element_located((By.XPATH,'//li[@class="abb-pagination__item--next"]/a'))).click()
    else:
        break
len(company_name)
output:
951
You don't need the counter: you can check whether the next-page arrow URL is still available, so if pages 65, 66, [...] were added, your logic would still work.
The problem in your version is that the while loop runs too fast and the page does not load in time. Alternatively, you could save the first list of company names, click the next arrow, and compare with the new list; if both are the same, keep waiting until the new list differs from the previous one.
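A sketch of that idea (assuming the same //span[@property="name"] locator as in the code above):

from selenium.webdriver.support.ui import WebDriverWait

def names_on_page(d):
    return [elem.text for elem in d.find_elements_by_xpath('//span[@property="name"]')]

previous = names_on_page(driver)
# ... click the next-page arrow ...
# then wait until the visible list of names differs from the one before the click
WebDriverWait(driver, 10).until(lambda d: names_on_page(d) != previous)
company_name += names_on_page(driver)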
I am working on a project to scroll through a feed and get every post that everyone posted. The problem is that my code does not read every post (just 2 or 3, then it skips ahead). Below is my code; I would like it to read every post. I also tried changing the sleep() duration, the pixel count while scrolling, and the scroll-into-view options, but with no improvement.
# scrolling and grabbing data
for i in range(1000):
    element = driver.find_element_by_xpath('//div[contains(@class, "mnk10 copy-txt")]')
    # driver.execute_script("return document.body.scrollHeight / 2",element)
    driver.execute_script("arguments[0].scrollIntoView(true)",element)
    # driver.execute_script("arguments[0].scrollBy(0, -300)",element)
    # driver.execute_script("return, document.body.scrollHeight/4",element)
    data1 = driver.find_element_by_xpath('//div[contains(@class, "mnk10 copy-txt")]').get_attribute('dat-plin-txt')
    print(data1)
    time.sleep(2)
You could try scrolling by a set amount, roughly the height of one post, instead of using "scrollIntoView(true)", with the following script call:
driver.execute_script("arguments[0].scrollBy(0, 500)", element)
you might or might not have to change the "500" part
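Folded into the original loop, that could look like this (a sketch; de-duplicating with a set and scrolling the window rather than the element are my additions, and the 500-pixel step may need tuning):

import time

seen = set()
for i in range(100):
    # read every post currently loaded, not just the first match
    for post in driver.find_elements_by_xpath('//div[contains(@class, "mnk10 copy-txt")]'):
        text = post.get_attribute('dat-plin-txt')
        if text and text not in seen:
            seen.add(text)
            print(text)
    # scroll down by a fixed amount instead of scrollIntoView(true)
    driver.execute_script("window.scrollBy(0, 500);")
    time.sleep(2)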
I am trying to read a text value from a page using Selenium with Python. The element is visible at first, then goes invisible, then becomes visible again, and its text changes rapidly until it reaches the final value. I think the page uses some JavaScript to compute the value before displaying the final result.
The page is https://www.junglescout.com/estimator/
I am trying to get the result of the estimate from the element js-magic-result.
I'm able to use Selenium to fill the three forms and click the 'calculates sales' button
I am using the chromedriver in python selenium to read the value.
All my attempts get me one of the intermediate values before it finishes loading.
I tried using the following in place of the browser2.implicitly_wait(5):
driver.implicitly_wait(5)
wait = WebDriverWait(browser2,3)
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-magic-result')))
wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-magic-result')))
Here is the full code I am using
browser2 = webdriver.Chrome(options=options,executable_path=driverPath)
url = 'https://www.junglescout.com/estimator/'
browser2.get(url)
container = browser2.find_element_by_class_name('js-estimation-section')
rankField = container.find_element_by_name('theRankInput')
rankField.send_keys('345')
# Click for storesList drop down
storeGroup = container.find_element_by_class_name('js-est-stores-list-input')
storeGroup.find_element_by_class_name('x-icon-caret-down').click()
# Get Stores List items
wait = WebDriverWait(browser2,3)
stores = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-stores-list')))
stores.find_elements_by_tag_name('span')[0].click()
# Wait for storesList is invisible
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-est-stores-list')))
# Click for Categories list drop down
catGroup = container.find_element_by_class_name('js-est-categories-list-input')
catGroup.find_element_by_tag_name('input').click()
# Get Categories List items
cats = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-categories-list')))
# Get Categories List items
for cat in cats.find_elements_by_class_name('us-available'):
    if (cat.find_element_by_tag_name('span').text == 'Electronics'):
        cat.click()
        break
# Wait for storesList is invisible
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-est-categories-list')))
# wait5 = WebDriverWait(browser2,3)
submit = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-btn')))
submit.click()
browser2.implicitly_wait(5)
print(container.find_element_by_class_name('js-magic-result').text)
What I expect is to get the final value returned but what I get is one of the intermediate values from the element.
Instead of
print(container.find_element_by_class_name('js-magic-result').text)
Please try this.
print(browser2.find_element_by_xpath("//table[@class='js-estimation-section']//tbody//tr/td[2]/p").text)
Make sure there is some delay before the print, so please replace only the print code.
Let me know if this works.
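One way to add that delay without guessing a fixed sleep (my sketch, reusing the XPath from the answer above): poll the element until its text stops changing.

import time
from selenium.webdriver.support.ui import WebDriverWait

xpath = "//table[@class='js-estimation-section']//tbody//tr/td[2]/p"  # from the answer above

def text_has_settled(driver):
    # read the value twice, half a second apart; accept it only once it has stopped changing
    first = driver.find_element_by_xpath(xpath).text
    time.sleep(0.5)
    second = driver.find_element_by_xpath(xpath).text
    return second if (second and second == first) else False

print(WebDriverWait(browser2, 20).until(text_has_settled))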