Looping through a series of web elements of common class name - python

I'm writing code to do the following using Python and Selenium:
1. Go to Google Maps and search for London restaurants.
2. Click on the first restaurant to view its details, then go back to the previous page and click the next restaurant (i, i+1, i+2, etc.).
Note that all restaurant result entries share a common class name ('section-result').
However, when I run the code, the driver is for some reason not clicking on the restaurant to open the details page.
I have tried the following code, which was also suggested in another forum post for this problem, but so far without success.
I have also tried a for loop, which I have included in the code section as option 2.
from selenium import webdriver
import random
import time
import pandas as pd
driver = webdriver.Chrome(executable_path="C:/users/usr/Desktop/chromedriver.exe")
UrlA = "https://www.google.com/maps/search/"
UrlB = "London"
UrlC = "Restaurant"
UrlD = UrlA + UrlB + '+' + UrlC
driver.get("http://www.google.com/ncr")  # to load the page in English
driver.get(UrlD)
time.sleep(2)
driver.maximize_window()
elements = driver.find_elements_by_class_name('section-result')
Option 1:
for i in elements:
    i.click()
    driver.back()
Option 2:
for i in range(1, 20):
    elements[i].click
    driver.back
The i.click() line is not responding; instead the driver just goes back to the previous page. Please advise the correct modification for the code.
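Two things stand out here. In option 2, elements[i].click and driver.back are missing their parentheses, so neither method is ever called. In option 1, the references stored in elements go stale as soon as driver.back() reloads the results page, so later iterations fail. Below is a minimal sketch of re-locating the results on every pass (assuming 'section-result' is still the class Google Maps uses for each result card, and that clicking the card itself triggers navigation):
results = driver.find_elements_by_class_name('section-result')
for i in range(len(results)):
    # Re-locate the result cards each time; the elements found before
    # driver.back() are stale once the page reloads.
    results = driver.find_elements_by_class_name('section-result')
    results[i].click()
    time.sleep(2)  # crude wait for the details page to render
    driver.back()
    time.sleep(2)  # wait for the results list to reappear
If the click still does not navigate, the click target may need to be a child element of the card, which would have to be confirmed in the live page.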

Scrape and extract job data from Google Jobs using selenium and store in Pandas DataFrame

Hi, I'm new to Stack Overflow. Apologies in advance if the post is not well structured.
I have been learning web scraping in Python, and as part of a hobby project I was trying to scrape Google Jobs and extract specific data to store in a pandas DataFrame.
I'm using Selenium with Python to achieve this.
The main challenge was to figure out a way to scrape all the job records returned by the search query (url = Google Jobs). This was difficult because Google Jobs loads dynamically, i.e. infinite scrolling: the page initially loads only 10 results in the side panel, and each scroll down loads only 10 more.
Website preview
I used Selenium to help me with this. I figured I could automate the scrolling by instructing Selenium to scroll into view the list element (<li>) associated with the last job entry in the side panel, and run a for loop to repeat this until all results are loaded onto the page.
Then I just had to extract the list elements and store their text into a data frame.
The problem is that each job entry has anywhere between 3 and 6 lines of text, with each line representing some attribute like job title, company name, or location. Since the number of lines differs per entry, some entries have more lines than others.
Different number of lines for each job entry
So when I split the text into a Python list using '\n' as the separator, it results in lists of different lengths.
This becomes a problem when I use pd.DataFrame(list) to generate a DataFrame, resulting in records with a jumbled order of fields.
Different Length Lists 😓
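To illustrate the misalignment with a made-up pair of entries (the row values here are hypothetical, not scraped):
import pandas as pd

# The second entry is missing its company line, so 'via Indeed' shifts left
# into the company column and the last column is padded with a missing value.
rows = [['Data Analyst', 'Acme Ltd', 'London', 'via LinkedIn'],
        ['Data Engineer', 'Remote', 'via Indeed']]
print(pd.DataFrame(rows))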
Below is the code I have come up with:
#imports
import pandas as pd
import numpy as np
from serpapi import GoogleSearch
import requests
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
#using selenium to launch and scroll through the Google Jobs page
url = "https://www.google.com/search?q=google+jobs+data+analyst&oq=google+jobs+data+analyst&aqs=chrome..69i57j69i59j0i512j0i22i30i625l4j69i60.4543j0j7&sourceid=chrome&ie=UTF-8&ibp=htl;jobs&sa=X&ved=2ahUKEwjXsv-_iZP9AhVPRmwGHX5xDEsQutcGKAF6BAgPEAU&sxsrf=AJOqlzWGHNISzgpAUCZBmQA1mWXXt3I7gA:1676311105893#htivrt=jobs&htidocid=GS94rKdYQqQAAAAAAAAAAA%3D%3D&fpstate=tldetail"
driver = webdriver.Chrome()
driver.get(url)
joblist =[]
#pointing to the html element to scroll to
elementxpath = '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
#capturing all the job list objects in the first page
datas = driver.find_elements(By.XPATH,'//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li')
joblist.append([da.text for da in datas])
#adding 3s delay for website to load after scrolling before executing code
time.sleep(3)
#capturing all the job list objects in the second set of 10 results loaded after 1st scroll down
elementxpath = '//*[@id="VoQFxe"]/div/div/ul/li[10]'
element = driver.find_element(By.XPATH,elementxpath)
driver.execute_script('arguments[0].scrollIntoView(true)',element)
datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div/div/ul/li')
joblist.append([da.text for da in datas])
x=2
time.sleep(3)
#using a while loop to scroll and capture the remaining scroll-downs, as the element xpath is in an iterable format unlike the previous 2 xpaths
while True:
    elementxpath = '//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li[10]'
    element = driver.find_element(By.XPATH,elementxpath)
    driver.execute_script('arguments[0].scrollIntoView(true)',element)
    x+=1
    time.sleep(3)
    datas = driver.find_elements(By.XPATH,'//*[@id="VoQFxe"]/div['+str(1*x)+']/div/ul/li')
    joblist.append([da.text for da in datas])
    if x>1000:
        break
    else:
        continue
#unpacking and cleaning captured values from joblist into a new list of lists in the desired format for creating a dataframe
jlist = []
for n in joblist:
    for a in range(0,len(n)-1):
        if n[a]!='':
            jlist.append(n[a].split('\n'))
jobdf = pd.DataFrame(jlist)
jobdf.columns = ['Logo','Role', 'Company', 'Source','Posted','Full / Part Time', 'Waste']
jobdf
This is the output data frame:
Jumbled mess 😶
Men and women of culture, I implore your help to get an ordered DataFrame that makes sense. Thank you!
Usually you can use .split('\n') only in simple cases; here it is a bad idea. A better practice is to use a unique XPath for each element you want to scrape: one for the logo, one for the role, etc.
Another good practice is to initialize a dictionary at the beginning with one key for each element you want to scrape, and then append data as you loop over the jobs.
The following code does exactly this. It is not optimized for speed: it scrolls to each job and scrapes it, whereas the faster way would be to scrape all the displayed jobs, scroll to the bottom, scrape the newly loaded jobs, scroll again, and so on.
# import libraries...
# load webpage...
from selenium.common.exceptions import NoSuchElementException
xpaths = {
    'Logo'            : "./div[1]//img",
    'Role'            : "./div[2]",
    'Company'         : "./div[4]/div/div[1]",
    'Location'        : "./div[4]/div/div[2]",
    'Source'          : "./div[4]/div/div[3]",
    'Posted'          : "./div[4]/div/div[4]/div[1]",
    'Full / Part Time': "./div[4]/div/div[4]/div[2]",
}
data = {key:[] for key in xpaths}
jobs_to_do = 100
jobs_done = 0
while jobs_done < jobs_to_do:
    lis = driver.find_elements(By.XPATH, "//li[@data-ved]//div[@role='treeitem']/div/div")
    for li in lis[jobs_done:]:
        driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', li)
        for key in xpaths:
            try:
                t = li.find_element(By.XPATH, xpaths[key]).get_attribute('src' if key=='Logo' else 'innerText')
            except NoSuchElementException:
                t = '*missing data*'
            data[key].append(t)
        jobs_done += 1
        print(f'{jobs_done=}', end='\r')
        time.sleep(.2)
Then by running pd.DataFrame(data) you get something like this
As you can see from the image, some values in the "Posted" column should instead be in the "Full / Part Time" column. This happens because some jobs don't have info about the posted time. I also noticed that some jobs have not only "posted" and "full/part time" data but also a "salary". So you should adjust the code to take these cases into account. It is not so easy, because the HTML objects don't have specific classes for each element, so I think you have to exploit the svg symbols (clock, bag and banknote) shown in this image.
UPDATE
I tried using the svg paths to correctly scrape "posted", "full/part time" and "salary" and it works! Here are the paths
xpaths = {
    'Logo'            : "./div[1]//img",
    'Role'            : "./div[2]",
    'Company'         : "./div[4]/div/div[1]",
    'Location'        : "./div[4]/div/div[2]",
    'Source'          : "./div[4]/div/div[3]",
    'Posted'          : ".//*[name()='path'][contains(@d,'M11.99')]/ancestor::div[1]",
    'Full / Part Time': ".//*[name()='path'][contains(@d,'M20 6')]/ancestor::div[1]",
    'Salary'          : ".//*[name()='path'][@fill-rule='evenodd']/ancestor::div[1]"
}
Replace the old paths with the new ones and it will work as expected, as shown in the picture below

How can I do web scraping through XPath from dynamic charts with multiple criteria options?

I am very new to scraping and programming in general.
That's why I am asking for help with the following issue.
There is a website at the URL below.
I need to get data from its dynamic charts.
The code has to be able to loop through all the days for which data is presented, and to loop through all the elements containing the data.
The first issue is that I somehow need to get the data via the XPath.
The second is that I have to write the loop to get all the required information.
url = "https://www.oree.com.ua/index.php/control/results_mo/DAM"
from selenium import webdriver
import requests
import pandas as pd
import time
browser = webdriver.PhantomJS(executable_path = "C:/ProgramData/Anaconda3/Lib/site-packages/phantomjs-2.1.1-windows/bin/phantomjs")
browser.get(url)
time.sleep(2)
elements = browser.find_elements_by_xpath("html/body/div[5]/div[1]/div[3]/div[3]/div/div/table/tbody/tr[1]/td[3]/text()")
for element in elements:
    print(element)
browser.quit()
Not sure Selenium is required here. You can get the data directly from this mixed html/json object (change the date according to your needs):
https://www.oree.com.ua/index.php/PXS/get_pxs_hdata/04.04.2020/DAM/1
Then request with :
//tbody//tr/td[i]
where i is the column of interest (range 3-7). Column 3 is "Sales volume, MW.h", column 4 is "Purchase volume, MW.h", etc.
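A minimal sketch of that approach with requests and lxml (assuming the endpoint still returns the table fragment and that lxml can parse the mixed response):
import requests
from lxml import html

# Endpoint from above; the date segment would be changed as needed.
url = "https://www.oree.com.ua/index.php/PXS/get_pxs_hdata/04.04.2020/DAM/1"
tree = html.fromstring(requests.get(url).content)
# Column 3 is "Sales volume, MW.h"; columns 4-7 hold the other series.
sales_volume = tree.xpath("//tbody//tr/td[3]/text()")
print(sales_volume)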
Output for sales volume (04/04/2020):

Getting an element by attribute and using driver to click on a child element in webscraping - Python

Webscraping results from Indeed.com
-Searching 'Junior Python' in 'Los Angeles, CA' (Done)
-Sometimes a popup window opens. Close the window if a popup occurs. (Done)
-The top 3 results are sponsored, so skip these and go to the real results
-Click on the result summary section, which opens up a side panel with the full summary
-Scrape the full summary
-When the result summary is clicked, the URL changes. Rather than opening a new window, I would like to scrape the side panel full summary
-Each real result is under ('div':{'data-tn-component':'organicJob'}). I am able to get the job title, company, and short summary using BeautifulSoup. However, I would like to get the full summary in the side panel.
Problem
1) When I try to click on the link using Selenium (the job title or the short summary, which opens up the side panel), the code only ends up clicking on the 1st link, which is 'sponsored'. I am unable to locate and click on a real result under id='jobOrganic'.
2) Once a real result is clicked on (manually), I can see that the full summary side panel is under <td id='auxCol'> and, within this, under <div id='vjs-desc'>. The full summary is contained within the <p> tags. When I try to scrape the full summary using findAll('div',{'id':'vjs-desc'}), all I get is a blank result [].
3) The URL also changes when the side panel is opened. I tried using Selenium to have the driver get the new URL and then soup that URL to get results, but all I'm getting is the 1st sponsored result, which is not what I want. I'm not sure why BeautifulSoup keeps getting the sponsored result, even when I'm running the code under the id='jobOrganic' real results.
Here is my code. I've been working on this for almost two days, and have gone through Stack Overflow, documentation, and Google, but I am unable to find the answer. I'm hoping someone can point out what I'm doing wrong and why I'm unable to get the full summary.
Thanks and sorry for this being so long.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as bs
url = 'https://www.indeed.com/'
driver = webdriver.Chrome()
driver.get(url)
whatinput = driver.find_element_by_id('text-input-what')
whatinput.send_keys('Junior Python')
whereinput = driver.find_element_by_id('text-input-where')
whereinput.click()
whereinput.clear()
whereinput.send_keys('Los Angeles, CA')
findbutton = driver.find_element_by_xpath('//*[@id="whatWhere"]/form/div[3]/button')
findbutton.click()
try:
    popup = driver.find_element_by_id('prime-popover-close-button')
    popup.click()
except:
    pass
This is where I'm stuck. The result summary is under {'data-tn-component':'organicJob'}, span class='summary'. Once I click on this, the side panel opens up.
soup = bs(driver.page_source,'html.parser')
contents = soup.findAll('div',{"data-tn-component":"organicJob"})
for each in contents:
    summary = driver.find_element_by_class_name('summary')
    summary.click()
This opens the side panel, but it clicks the first sponsored link on the whole page, not the real result. It basically goes outside the 'organicJob' result set for some reason.
url = driver.current_url
driver.get(url)
I tried to get the new URL after clicking on the (sponsored) link, to test whether I could even get the side panel full summary (albeit the sponsored one, for testing purposes).
soup=bs(driver.page_source,'html.parser')
fullsum = soup.findAll('div',{"id":"vjs-desc"})
print(fullsum)
This actually prints out the full summary from the side panel, but it keeps printing the same 1st result over and over throughout the loop, instead of moving on to the next one.
The problem is that you are fetching the divs using BeautifulSoup but clicking using Selenium, which is not aware of your collected divs.
You are using the find_element_by_class_name() method on the driver object, so it searches the whole page instead of your intended div object each (in the for loop). Thus, it ends up fetching the same first result from the whole page in each iteration.
One quick workaround is possible using only Selenium (this will be slower, though):
elements = driver.find_elements_by_tag_name('div')
for element in elements:
    # get_attribute() returns None for divs without this attribute,
    # so guard against that before the membership test
    attr = element.get_attribute("data-tn-component")
    if attr and "organicJob" in attr:
        summary = element.find_element_by_class_name('summary')
        summary.click()
The above code will search for all the divs and iterate over them to find those whose data-tn-component attribute contains organicJob. Once it finds one, it will search for the element with class name summary and click on that element.
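A quicker variant of the same idea, left as a sketch since it has not been re-tested against the current Indeed markup, is to select the organicJob containers directly instead of scanning every div on the page:
# Attribute selector narrows the search to the organic results only.
for job in driver.find_elements_by_css_selector('div[data-tn-component="organicJob"]'):
    summary = job.find_element_by_class_name('summary')
    summary.click()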

Navigate through all the members of Research Gate Python Selenium

I am a rookie in Python Selenium. I have to navigate through all the members on the members page of an institution on ResearchGate, which means I have to click the first member to go to their profile page, then go back to the members page to click the next member. I tried a for loop, but every time it clicks only on the first member. Could anyone please guide me? Here is what I have tried.
from selenium import webdriver
import urllib
driver = webdriver.Firefox("/usr/local/bin/")
university="Lawrence_Technological_University"
members= driver.get('https://www.researchgate.net/institution/' + university +'/members')
membersList = driver.find_element_by_tag_name("ul")
list = membersList.find_elements_by_tag_name("li")
for member in list:
    driver.find_element_by_class_name('display-name').click()
    print(driver.current_url)
    driver.back()
You are not doing anything with the list member inside your for loop. Also, the state of the page changes after navigating to a different page and coming back, so you need to find the elements again. One approach to handle this is given below:
for i in range(len(list)):
    membersList = driver.find_element_by_tag_name("ul")
    element = membersList.find_elements_by_tag_name("li")[i]
    element.click()
    driver.back()
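If clicking the whole list item does not navigate, a close variant of the same re-locating pattern clicks the member's name link inside each item instead (a sketch, assuming the 'display-name' class from the question is still present in each item):
for i in range(len(list)):
    # Re-locate the list on every pass; references collected before
    # driver.back() go stale once the members page reloads.
    membersList = driver.find_element_by_tag_name("ul")
    item = membersList.find_elements_by_tag_name("li")[i]
    item.find_element_by_class_name('display-name').click()
    print(driver.current_url)
    driver.back()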

Search results don't change URL - Web Scraping with Python and Selenium

I am trying to create a Python script to scrape the public county records website. I ultimately want to have a list of owner names and have the script run through all the names and pull the most recent deed of trust information (lender name and date filed). For the code below, I just wrote the owner name as the string 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate entering the owner name into the form boxes, but when the return key is pressed and my results are shown, the website URL does not change. I try to locate the specific text in the table using XPath, but the path does not exist when I look for it. I have concluded the path does not exist because it is searching for the XPath on the first page, with no results shown. BeautifulSoup4 wouldn't work in this situation because parsing the URL would only return the blank search form HTML.
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
I have commented out the variable that is giving me trouble. Please help!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.
I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[@id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text
Have you used the following XPath?
//table[contains(@summary,"Search Results")]/tbody/tr
I have checked that it works perfectly. With it, you have to iterate over each tr.
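A short sketch of that iteration (assuming the results table has rendered in the page before the lookup, hence the crude wait):
import time

time.sleep(3)  # give the in-page results table a moment to render after pressing RETURN
rows = browser.find_elements_by_xpath('//table[contains(@summary,"Search Results")]/tbody/tr')
for row in rows:
    # Each cell holds one field of the filing record.
    cells = [td.text for td in row.find_elements_by_tag_name('td')]
    print(cells)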
