Scroll and read every data - python selenium

I am working on a project to scroll and get every post that everyone posted. The problem is that my code does not read every post (just 2 or 3, then skips to the next). Below is my code; I would like it to read every post. I also tried changing the sleep() duration, the pixel count while scrolling, and the scroll-into-view options, but there was no improvement.
# scrolling and grabbing data
for i in range(1000):
    element = driver.find_element_by_xpath('//div[contains(@class, "mnk10 copy-txt")]')
    # driver.execute_script("return document.body.scrollHeight / 2", element)
    driver.execute_script("arguments[0].scrollIntoView(true)", element)
    # driver.execute_script("arguments[0].scrollBy(0, -300)", element)
    # driver.execute_script("return document.body.scrollHeight / 4", element)
    data1 = driver.find_element_by_xpath('//div[contains(@class, "mnk10 copy-txt")]').get_attribute('dat-plin-txt')
    print(data1)
    time.sleep(2)

You could try scrolling by a fixed amount (roughly the height of one post) instead of using "scrollIntoView(true)", with the following script:
driver.execute_script("arguments[0].scrollBy(0, 500)", element)
You might or might not have to change the "500" part.
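For illustration, a minimal sketch of that idea, assuming the posts all match the question's XPath and that the dat-plin-txt attribute holds the post text: grab every post currently in the DOM on each pass, print only the new ones, then scroll down by roughly one post's height so the next batch loads.
seen = set()
for _ in range(1000):
    # grab all matching posts currently loaded, not just the first one
    posts = driver.find_elements_by_xpath('//div[contains(@class, "mnk10 copy-txt")]')
    for post in posts:
        text = post.get_attribute('dat-plin-txt')
        if text not in seen:  # print each post only once
            seen.add(text)
            print(text)
    # scroll the window down by roughly one post's height; adjust 500 as needed
    driver.execute_script("window.scrollBy(0, 500);")
    time.sleep(2)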

Related

Scrape and extract job data from Google Jobs using selenium and store in Pandas DataFrame

Hi, I'm new to StackOverflow. Apologies in advance if the post is not well structured.
I have been learning web scraping in Python and, as part of a hobby project, I was trying to scrape Google Jobs and extract specific data to be stored in a pandas data frame.
I'm using selenium on Python to achieve this.
So, the main challenge for me was to figure out a way to scrape all the job records from the site obtained from the search query (url = Google Jobs). This was difficult only because Google Jobs loads dynamically, i.e. infinite scrolling: the page initially loads only 10 results in the side panel, and with each scroll down only 10 more results are loaded.
Website preview
I used selenium to help me with this. I figured that I could automate the scrolling by instructing selenium to scroll into view the list element (<li>) associated with the last job entry in the side panel, and run a for loop to repeat it till all results are loaded onto the page.
Then I just had to extract the list elements and store their text into a data frame.
The problem is that each job entry has anywhere between 3 and 6 lines of text, with each line representing some attribute like Job Title, Company name, Location, etc.; since the number of lines differs, some entries have more lines than others.
Different number of lines for each job entry
So when I split the text into a python list using '\n' as the separator, it results in lists with different lengths.
This becomes a problem for me when I use pd.DataFrame(list) to generate a dataframe, resulting in records with a jumbled order of fields.
Different Length Lists 😓
Below is the code I have come up with:
#imports
import pandas as pd
import numpy as np
from serpapi import GoogleSearch
import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
#using selenium to launch and scroll through the Google Jobs page
url = "https://www.google.com/search?q=google+jobs+data+analyst&oq=google+jobs+data+analyst&aqs=chrome..69i57j69i59j0i512j0i22i30i625l4j69i60.4543j0j7&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&ved=2ahUKEwjXsv-_iZP9AhVPRmwGHX5xDEsQutcGKAF6BAgPEAU&sxsrf=AJOqlzWGHNISzgpAUCZBmQA1mWXXt3I7gA:1676311105893#htivrt=jobs&htidocid=GS94rKdYQqQAAAAAAAAAAA%3D%3D&fpstate=tldetail"
driver = webdriver.Chrome()
driver.get(url)
joblist = []
#pointing to the html element to scroll to
elementxpath = '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[10]'
element = driver.find_element(By.XPATH, elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)', element)
#capturing all the job list objects in the first page
datas = driver.find_elements(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li')
joblist.append([da.text for da in datas])
#adding 3s delay for website to load after scrolling before executing code
time.sleep(3)
#capturing all the job list objects in the second set of 10 results loaded after 1st scroll down
elementxpath = '//*[#id="VoQFxe"]/div/div/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
datas = driver.find_elements(By.XPATH,'//*[#id="VoQFxe"]/div/div/ul/li')
joblist.append([da.text for da in datas])
x=2
time.sleep(3)
#using a while loop to scroll and capture the remaining scroll-downs, as the element xpath is in an iterable format unlike the previous 2 xpaths
while True:
    elementxpath = '//*[@id="VoQFxe"]/div[' + str(x) + ']/div/ul/li[10]'
    element = driver.find_element(By.XPATH, elementxpath)
    driver.execute_script('arguments[0].scrollIntoView(true)', element)
    x += 1
    time.sleep(3)
    datas = driver.find_elements(By.XPATH, '//*[@id="VoQFxe"]/div[' + str(x) + ']/div/ul/li')
    joblist.append([da.text for da in datas])
    if x > 1000:
        break
#unpacking and cleaning captured values from joblist into a new list of lists in the desired format for creating a dataframe
jlist = []
for n in joblist:
    for a in range(0, len(n) - 1):
        if n[a] != '':
            jlist.append(n[a].split('\n'))
jobdf = pd.DataFrame(jlist)
jobdf.columns = ['Logo','Role', 'Company', 'Source','Posted','Full / Part Time', 'Waste']
jobdf
This is the output data frame:
Jumbled mess 😶
Men and Women of culture, I implore your help to get an ordered DataFrame that makes sense. Thank you!
Usually you can use .split('\n') only in simple cases, but in this case it is a bad idea. A better practice is to use a unique xpath for each element you want to scrape: one for the logo, one for the role, etc.
Another good practice is to initialize a dictionary at the beginning, with one key for each element you want to scrape, and then append data as you loop over the jobs.
The following code does exactly this. It is not optimized for speed; in fact it scrolls to each job and scrapes it, while the best way would be to scrape the data of all the displayed jobs, then scroll to the bottom, then scrape all the new jobs and scroll again, and so on.
# import libraries...
# load webpage...
from selenium.common.exceptions import NoSuchElementException
xpaths = {
'Logo' :"./div[1]//img",
'Role' :"./div[2]",
'Company' :"./div[4]/div/div[1]",
'Location' :"./div[4]/div/div[2]",
'Source' :"./div[4]/div/div[3]",
'Posted' :"./div[4]/div/div[4]/div[1]",
'Full / Part Time':"./div[4]/div/div[4]/div[2]",
}
data = {key:[] for key in xpaths}
jobs_to_do = 100
jobs_done = 0
while jobs_done < jobs_to_do:
    lis = driver.find_elements(By.XPATH, "//li[@data-ved]//div[@role='treeitem']/div/div")
    for li in lis[jobs_done:]:
        driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', li)
        for key in xpaths:
            try:
                t = li.find_element(By.XPATH, xpaths[key]).get_attribute('src' if key=='Logo' else 'innerText')
            except NoSuchElementException:
                t = '*missing data*'
            data[key].append(t)
        jobs_done += 1
        print(f'{jobs_done=}', end='\r')
        time.sleep(.2)
Then by running pd.DataFrame(data) you get something like this
As you can see from the image, some values in the "Posted" column should instead be in the "Full / Part Time" column. This happens because some jobs don't have info about the posted time. I also noticed that some jobs have not only "posted" and "full/part time" data but also the "salary". So you should adjust the code to take these facts into account. It is not so easy, because the HTML objects don't have specific classes for each element, so I think you have to exploit the svg symbols (clock, bag and banknote) shown in this image.
UPDATE
I tried using the svg paths to correctly scrape "posted", "full/part time" and "salary" and it works! Here are the paths
xpaths = {
'Logo' :"./div[1]//img",
'Role' :"./div[2]",
'Company' :"./div[4]/div/div[1]",
'Location' :"./div[4]/div/div[2]",
'Source' :"./div[4]/div/div[3]",
'Posted'          :".//*[name()='path'][contains(@d,'M11.99')]/ancestor::div[1]",
'Full / Part Time':".//*[name()='path'][contains(@d,'M20 6')]/ancestor::div[1]",
'Salary'          :".//*[name()='path'][@fill-rule='evenodd']/ancestor::div[1]"
}
Replace the old paths with the new ones and it will work as expected, as shown in the picture below

Clicking in selenium python after using move_by_offset()

I am trying to navigate through a miniature window containing a grid (1000+ rows) which lies inside a website, and my goal requires me to click each row of the grid. Initially I tried locating each row using XPath, but there is no reliable way to do this, because when I scroll down the grid the XPath changes (e.g. row 1 -> ...div/div[1]/..., row n -> ...div/div[6]/...).
Therefore, to get around this issue, I used ActionChains' move_by_offset and located the row I want this way, with a loop that adds on some grid-width each time until hitting the maximum number of rows that can appear on screen, then scrolling down using a.send_keys(Keys.PAGE_DOWN).
This is all working well and I can see the hover animation above the rows when I want it. However, I cannot get the ActionChains to click on the rows while hovering. I have tried click, double click, and click_and_hold + release, with no success.
Any help on either making move_by_offset + click work, or on using XPaths more effectively, would be great.
For context, I am using this on a website for work, so I cannot release the URL/photos of it. This is my first time using selenium.
for s in range(n):
    scrollLim = 7
    base = 180
    for x in range(scrollLim):
        print(str(x))
        a.reset_actions()
        a.move_by_offset(60, base + 60 * x).double_click()
        a.perform()
        time.sleep(1)
        if x == scrollLim - 1:
            a.send_keys(Keys.PAGE_DOWN).perform()
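One pattern that may help, sketched under the assumption that the grid sits inside a locatable container (the ".grid-container" selector below is a placeholder): anchor the offsets to that element with move_to_element_with_offset and chain the click into the same action, so the click lands where the hover does. Note that in Selenium 4 these offsets are measured from the element's center rather than its top-left corner.
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

grid = driver.find_element(By.CSS_SELECTOR, ".grid-container")  # placeholder selector
for x in range(scrollLim):
    # move relative to the grid element rather than the viewport, then click in one chain
    ActionChains(driver) \
        .move_to_element_with_offset(grid, 60, base + 60 * x) \
        .click() \
        .perform()
    time.sleep(1)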

Selenium Python: How to wait until an element finishes changing its text before reading the value?

I am trying to read the value of a text from a page using Selenium Python. The element is always visible, then goes invisible, then becomes visible again, and the text value changes rapidly until it reaches the final value. I think it is using some form of javascript to compute the value before displaying the final one.
The page is https://www.junglescout.com/estimator/
I am trying to get the result of the estimate from the element js-magic-result.
I'm able to use Selenium to fill the three forms and click the 'calculates sales' button
I am using the chromedriver in python selenium to read the value.
All the tests get me one of the intermediate values before it finishes loading.
I tried using the following in place of the browser2.implicitly_wait(5):
driver.implicitly_wait(5)
wait = WebDriverWait(browser2,3)
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-magic-result')))
wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-magic-result')))
Here is the full code I am using
browser2 = webdriver.Chrome(options=options,executable_path=driverPath)
url = 'https://www.junglescout.com/estimator/'
browser2.get(url)
container = browser2.find_element_by_class_name('js-estimation-section')
rankField = container.find_element_by_name('theRankInput')
rankField.send_keys('345')
# Click for storesList drop down
storeGroup = container.find_element_by_class_name('js-est-stores-list-input')
storeGroup.find_element_by_class_name('x-icon-caret-down').click()
# Get Stores List items
wait = WebDriverWait(browser2,3)
stores = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-stores-list')))
stores.find_elements_by_tag_name('span')[0].click()
# Wait for storesList is invisible
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-est-stores-list')))
# Click for Categories list drop down
catGroup = container.find_element_by_class_name('js-est-categories-list-input')
catGroup.find_element_by_tag_name('input').click()
# Get Categories List items
cats = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-categories-list')))
# Get Categories List items
for cat in cats.find_elements_by_class_name('us-available'):
    if (cat.find_element_by_tag_name('span').text == 'Electronics'):
        cat.click()
        break
# Wait for storesList is invisible
wait.until(EC.invisibility_of_element_located((By.CLASS_NAME,'js-est-categories-list')))
# wait5 = WebDriverWait(browser2,3)
submit = wait.until(EC.visibility_of_element_located((By.CLASS_NAME,'js-est-btn')))
submit.click()
browser2.implicitly_wait(5)
print(container.find_element_by_class_name('js-magic-result').text)
What I expect is to get the final value returned but what I get is one of the intermediate values from the element.
Instead of
print(container.find_element_by_class_name('js-magic-result').text)
please try this:
print(browser2.find_element_by_xpath("//table[@class='js-estimation-section']//tbody//tr/td[2]/p").text)
Make sure there is some delay before the print, so please replace only the print code.
Let me know if this works.
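If a fixed delay still returns intermediate values, a more robust option (a sketch of a custom wait condition, not part of the answer above) is to poll until the element's text stops changing between two consecutive checks:
def text_has_stabilized(locator):
    # resolves to the text once two consecutive polls see the same non-empty value
    state = {'last': None}
    def _condition(drv):
        current = drv.find_element(*locator).text
        if current and current == state['last']:
            return current
        state['last'] = current
        return False
    return _condition

result = WebDriverWait(browser2, 10, poll_frequency=1).until(
    text_has_stabilized((By.CLASS_NAME, 'js-magic-result')))
print(result)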

selenium + python loop for dynamic drop down boxes

I have the following loop to navigate through multiple dynamic drop-down boxes, submitting the form to reach another webpage, then returning to the initial page to iterate through all the options.
The problem is that I need to insert a line to direct the browser back to the initial page before every loop iteration, hence the slow execution speed. Any ideas on improving the code?
Example of the code below:
select = Select(driver.find_element_by_id('ID'))
options = select.options
for index in range(1, len(options)):
    driver.get('webpage') # issue for slow execution speed
    select = Select(driver.find_element_by_id('ID')) # re-locate after the reload
    select.select_by_index(index)
    select2 = Select(driver.find_element_by_id('ID2')) # can only select after selecting the first drop down box
    options2 = select2.options
    for index2 in range(1, len(options2)):
        driver.get('webpage') # issue for slow execution speed
        select = Select(driver.find_element_by_id('ID')) # re-locate after the reload
        select.select_by_index(index)
        select2 = Select(driver.find_element_by_id('ID2'))
        options2 = select2.options
        select2.select_by_index(index2)
        submit() # click on button to navigate to another page, read data then back to loop
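If the double reload is the bottleneck, one hedged idea, assuming the site restores the initial page correctly on history navigation, is to return with driver.back() after reading the result page rather than issuing a fresh driver.get on every pass:
driver.get('webpage')  # load the initial page once
n_options = len(Select(driver.find_element_by_id('ID')).options)
for index in range(1, n_options):
    Select(driver.find_element_by_id('ID')).select_by_index(index)
    n_options2 = len(Select(driver.find_element_by_id('ID2')).options)
    for index2 in range(1, n_options2):
        # re-locate both selects each pass to avoid stale element references
        Select(driver.find_element_by_id('ID')).select_by_index(index)
        Select(driver.find_element_by_id('ID2')).select_by_index(index2)
        submit()       # navigate to the result page and read the data
        driver.back()  # return via history, possibly cheaper than driver.get('webpage')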

How to scroll a page to the end

I have tried to do this:
driver_1.execute_script("window.scrollTo(0, document.body.scrollHeight);")
but it does nothing, so I made a loop to scroll the page in steps:
initial_value = 0
end = 300000
for i in range(1000, end, 1000):
    driver_1.execute_script("window.scrollTo(" + str(initial_value) + ', ' + str(i) + ")")
    time.sleep(0.5)
    initial_value = i
    print('scrolling >>>>')
It kind of works, but I don't know how long a given page is, so I have to put a big number as the max height, which gives me two problems. First, even a big number might not be large enough to scroll some pages; second, if the page is shorter than that limit, I lose quite a lot of time waiting for the script to finish while it is doing nothing.
You need something to rely on, some indicator telling you to stop scrolling. Here is an example use case where we would stop scrolling once more than N particular elements are already loaded:
Slow scrolling down the page using Selenium
Similar use case:
Headless endless scroll selenium
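For illustration, a minimal sketch of that stop condition, assuming the items of interest match a hypothetical ".post" CSS class:
N = 100           # stop once at least N items have loaded
max_scrolls = 50  # safety cap so shorter pages don't loop forever
for _ in range(max_scrolls):
    if len(driver.find_elements_by_css_selector(".post")) >= N:
        break
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)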
FYI, you may have noticed another way to scroll to the bottom - scrolling into view of a footer:
footer = driver.find_element_by_tag_name("footer")
driver.execute_script("arguments[0].scrollIntoView();", footer)
To scroll the page to the end, you could simply send the END key:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com/search?tab=newest&q=selenium")
driver.find_element_by_tag_name("body").send_keys(Keys.END)
You could also scroll the full height:
driver = webdriver.Firefox()
driver.get("http://stackoverflow.com/search?tab=newest&q=selenium")
driver.execute_script("window.scrollBy(0, document.documentElement.scrollHeight)")
Hey I found another solution that worked perfectly for me. Check this answer here.
Also, this implementation:
driver.find_element_by_tag_name("body").send_keys(Keys.END)
does not work for pages that use an infinite scrolling design.
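For infinite-scrolling pages, a common pattern (sketched here on the assumption that the page grows document.body.scrollHeight as new content loads) is to scroll to the bottom repeatedly and stop once the height stops increasing:
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give new content time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # height stopped growing; we reached the end
    last_height = new_height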
