I'm scraping a job board using Selenium. In the "job summary" section, not every field is mandatory, which means that with a positional XPath, if a field (e.g. salary) is missing, the next field ends up stored in its place.
I am trying to use the icon ("emoji") shown before each field to tell which field it is. The problem is that the text belonging to the icon lives in the next sibling of the icon's parent, and I'm not able to retrieve it. Here's the HTML:
<li class="sc-16yjgsd-0 haea-DT">
<span role="img" class="sc-16yjgsd-1 cnWfGH">
<i class="sc-ACYlI bCYQRQ wui-icon-font" name="salary"></i></span>
<span class="sc-16yjgsd-3 bCCdzk">Salaire entre 40,5K € et 46K €</span></li>
And here is my latest attempt:
jobinfos = {"salary": [], "location": [], 'contract' : [], 'contract' : [], 'suitcase' : []}
for i in links[0:4]:
    driver.get(i)
    jobsummary = driver.find_element_by_xpath("//*[@id='prc-1-1-1']/main/div[1]/div/div[1]/div[2]/ul")
    emoji_elements = jobsummary.find_elements_by_xpath("/html/body/div[1]/div[1]/div/div/div/div/main/div[1]/div/div[1]/div[2]/ul/li[1]/span[1]/i")
    print("Number of emoji elements found:", len(emoji_elements))
    for emoji_element in emoji_elements:
        emoji_type = emoji_element.get_attribute("name")
        print(emoji_type)
        info_element = emoji_element.find_element_by_xpath("..//i[@name='{}']/following-sibling::i").format(emoji_type)
        info = info_element.text
        jobinfos[emoji_type].append(info)
This fails with an error saying it is unable to locate the following sibling.
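One possible way around this, as a minimal sketch: walk the li elements instead of addressing fixed positions, read the icon's name attribute, and take the text of the span that follows the icon's wrapper. The snippet uses the standard-library xml.etree.ElementTree on a copy of the HTML above so that it runs without a browser. The second li (a location row) and the function name parse_job_summary are made up for illustration. With Selenium, the equivalent step starting from the i element would probably be a relative XPath such as ./parent::span/following-sibling::span, which has not been run against the real page.

```python
import xml.etree.ElementTree as ET

# Copy of the markup from the question, plus one hypothetical "location" row.
SAMPLE = """
<ul>
  <li class="sc-16yjgsd-0 haea-DT">
    <span role="img" class="sc-16yjgsd-1 cnWfGH">
      <i class="sc-ACYlI bCYQRQ wui-icon-font" name="salary"></i></span>
    <span class="sc-16yjgsd-3 bCCdzk">Salaire entre 40,5K € et 46K €</span></li>
  <li class="sc-16yjgsd-0 haea-DT">
    <span role="img" class="sc-16yjgsd-1 cnWfGH">
      <i class="sc-ACYlI bCYQRQ wui-icon-font" name="location"></i></span>
    <span class="sc-16yjgsd-3 bCCdzk">Paris</span></li>
</ul>
"""


def parse_job_summary(markup):
    """Map each icon name to the text of the span that follows the icon wrapper."""
    root = ET.fromstring(markup)
    infos = {}
    for li in root.iter("li"):
        icon = li.find("./span/i")      # the icon inside the first span
        spans = li.findall("./span")    # expected: [icon wrapper, text span]
        if icon is None or len(spans) < 2:
            continue                    # skip rows that do not have the expected shape
        infos[icon.get("name")] = (spans[1].text or "").strip()
    return infos


job_info = parse_job_summary(SAMPLE)
print(job_info)
```

Because each value is keyed by the icon name, a missing salary row simply leaves the "salary" key absent instead of shifting the other fields.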
The 'a' element below contains two text strings, "First Name" and "View First Names's profile". With the Python code below, get_text() returns both strings, but I only want the first one, "First Name". Please let me know how to drop the second string, "View First Names's profile".
all_classes = src.find_all('div', {'class':'mb1'})
for linkClass in all_classes:
    linkClass = linkClass.find_all('a', {'class': 'app-aware-link'})
    for element in linkClass:
        name = element.get_text().strip()
        Name.append(name)
HTML
<a class="app-aware-link" href="https://www.linkedin.com/in/shreyansjain-iitdhn?miniProfileUrn=urn%3Ali%3Afs_miniProfile%3AACoAABpqUi4Bg1wC5QB22-ydCRRB580Zd4gutQ8">
<span dir="ltr">
<span aria-hidden="true"><!-- -->First Name<!-- --></span><span class="visually-hidden"><!-- -->View First Names’s profile<!-- --></span>
</span>
</a>
To extract the first name with Selenium, I would do the following.
Use this CSS selector for the first name:
.app-aware-link span[dir='ltr'] span:first-of-type
And this one for the profile name:
.app-aware-link span[dir='ltr'] span:last-of-type
Then extract the text of the matching elements like this.
First name:
for name in driver.find_elements(By.CSS_SELECTOR, ".app-aware-link span[dir='ltr'] span:first-of-type"):
    print(name.text)
Profile name:
for profile_name in driver.find_elements(By.CSS_SELECTOR, ".app-aware-link span[dir='ltr'] span:last-of-type"):
    print(profile_name.text)
Try this:
all_classes = src.find_all('div', {'class':'mb1'})
for linkClass in all_classes:
    links = linkClass.find_all('a', {'class': 'app-aware-link'})
    for element in links:
        # BeautifulSoup tags have no find_elements_by_xpath (that is Selenium),
        # so pick the visible span by its aria-hidden attribute instead
        first_name = element.find('span', {'aria-hidden': 'true'})
        if first_name is not None:
            name = first_name.get_text().strip()
            Name.append(name)
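For anyone who wants to try the idea without BeautifulSoup or a browser, here is a small sketch using only the standard library. It selects the span by its aria-hidden attribute, so the screen-reader-only span is never read. The sample is a simplified copy of the anchor from the question (the href is a placeholder and the HTML comments are dropped), and visible_name is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the anchor from the question; the href is a placeholder.
SAMPLE = """
<a class="app-aware-link" href="https://www.linkedin.com/in/example">
  <span dir="ltr">
    <span aria-hidden="true">First Name</span><span class="visually-hidden">View First Names's profile</span>
  </span>
</a>
"""


def visible_name(markup):
    """Return the text of the aria-hidden span, ignoring the screen-reader-only one."""
    link = ET.fromstring(markup)
    span = link.find(".//span[@aria-hidden='true']")
    if span is None or not span.text:
        return None
    return span.text.strip()


name = visible_name(SAMPLE)
print(name)
```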
How can I scroll to text only when its element has the min class?
<div>
<div class="item filter_2 firstPart">
<div class="date">16/10/2018</div>
<div class="time">04:00</div>
<div class="event">Ningbo, China</div>
<div class="subevent">HE, Yecong - Kecmanovic, Miomir</div>
<div class="odds">
<div class="odd" idq="2998675069">
<div class="tq">1HH</div>
<div class="value">8.00</div>
</div>
<div class="odd min" idq="2998675068">
<div class="tq">2HH</div>
<div class="value">1.03</div>
</div>
</div>
</div>
</div>
I would like to scroll to the text only if the min class is present.
Here is what I have tried:
new_text = ['2.10', '2.15', '2.20', '2.25', '2.30', '2.35', '2.40',
            '2.45', '2.50', '2.55', '2.60', '2.65', '2.70',
            '2.75', '2.80', '2.85', '2.90', '2.95', '3.10']
for text in new_text:
    if text in driver.page_source:
        parent = driver.find_element_by_css_selector(".odd.min")
        child = parent.find_element_by_xpath("//div[@class='value' and text()='" + text + "']")
        if child:
            print(text)
            element = child
            driver.execute_script('arguments[0].scrollIntoView();', element)
            driver.save_screenshot('lo7.png')
            break
    else:
        print("No odd found")
        continue
The problem with this code is that it also scrolls to text whose element does not have the min class.
//div[@class='odd min']/div[@class='tq']/text()
You can try this XPath expression to get the value "2HH".
The problem is with your XPath locator. You locate parent and then call parent.find_element_by_xpath("//div...") expecting it to search only parent's descendants. An XPath that starts with // always searches from the document root, even when it is called on an element. If you want the XPath to start from the parent context, you need to add a . at the start, e.g. ".//div[@class='value' and ...". If you don't include that ., then your XPath looks at the entire page, as you discovered.
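To make the scoping point concrete, here is a small sketch on the HTML from the question, using the standard-library xml.etree.ElementTree rather than Selenium so it runs offline. ElementTree only accepts relative paths on an element, so the "whole page" search is done from the root and the scoped search from the .odd.min container. It illustrates the difference in results and is not a drop-in replacement for the Selenium call.

```python
import xml.etree.ElementTree as ET

# The odds block from the question.
SAMPLE = """
<div class="odds">
  <div class="odd" idq="2998675069">
    <div class="tq">1HH</div>
    <div class="value">8.00</div>
  </div>
  <div class="odd min" idq="2998675068">
    <div class="tq">2HH</div>
    <div class="value">1.03</div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)

# Document-wide search: matches every value, whether or not its parent has "min".
all_values = [v.text for v in root.findall(".//div[@class='value']")]

# Scoped search: locate the "odd min" container first, then look only beneath it.
parent = root.find(".//div[@class='odd min']")
min_values = [v.text for v in parent.findall("./div[@class='value']")]

print(all_values)
print(min_values)
```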
There is a better way to do this... don't print a bunch of screenshots, just pull the odds that you want and compare them to your desired list.
values_from_page = driver.find_elements_by_css_selector(".odd.min > div.value") # all odds elements from the page
odds = {e.text for e in values_from_page if e.is_displayed()} # keep only visible elements and collect their text in a set
print(odds)
new_text = ['2.10', '2.15', '2.20', '2.25', '2.30', '2.35', '2.40',
            '2.45', '2.50', '2.55', '2.60', '2.65', '2.70',
            '2.75', '2.80', '2.85', '2.90', '2.95', '3.10']
missing_odds = set(new_text).difference(odds) # lists have no difference(), so convert to a set first
print(missing_odds)
This is untested code but should be pretty close. With my code, it should run WAY faster because you are only scraping the page once (and only once) instead of scraping twice per item in new_text plus scrolling the page and taking a screenshot for each one.
When you take a screenshot, someone has to look at it to verify. That takes manual work and time... avoid this whenever possible. Let the automation do the validation for you and only report when something is wrong or missing. If missing_odds is empty (len(missing_odds) == 0), then all the items in new_text were found. Anything that is printed was missing from the page.
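The comparison step on its own needs no browser at all. Here is a minimal sketch of it in plain Python. The found_on_page list stands in for the text scraped from the .odd.min elements and is made up for illustration, as is the helper name find_missing_odds.

```python
def find_missing_odds(expected, found_on_page):
    """Return the expected odds that do not appear on the page, sorted for stable output."""
    return sorted(set(expected) - set(found_on_page))


expected = ['2.10', '2.15', '2.20']
found_on_page = ['2.15', '1.03', '2.20']  # hypothetical scraped values

missing = find_missing_odds(expected, found_on_page)
if missing:
    print("Missing odds:", missing)
else:
    print("All expected odds were found")
```

An empty result means nothing needs a human to look at it; only a non-empty list is worth reporting.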
Hopefully that helps get you started in the right direction.
I'm trying to get a list of variables (date, size, medium, etc.) from this page (https://www.artprice.com/artist/844/hans-arp/lots/pasts) using python/selenium.
For the titles it was easy enough to use :
titles = driver.find_elements_by_class_name("sln_lot_show")
for title in titles:
    print(title.text)
However the other variables seem to be text within the source code which have no identifiable id or class.
For example, to fetch the dates made I have tried:
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]")
for date_made in dates_made:
    print(date_made.get_attribute("date"))
and
dates_made = driver.find_elements_by_xpath("//div[@class='col-sm-6']/p[1]/date")
for date_made in dates_made:
    print(date_made.text)
which both produce no error, but are not printing any results.
Is there some way to get this text, which has no specific class or id?
Specific html here :
......
<div class="col-xs-8 col-sm-6">
<p>
<i><a id="sln_16564482" class="sln_lot_show" href="/artist/844/hans-arp/print-multiple/16564482/vers-le-blanc-infini" title="&quot;Vers le Blanc Infini&quot;" ng-click="send_ga_event_now('artist_past_lots_search', 'select_lot_position', 'title', {eventValue: 1})">
"Vers le Blanc Infini"
</a></i>
<date>
(1960)
</date>
</p>
<p>
Print-Multiple, Etching, aquatint,
<span ng-show="unite_to == 'in'" class="ng-hide">15 3/4 x 18 in</span>
<span ng-show="unite_to == 'cm'">39 x 45 cm</span>
</p>
Progressive mode: the JavaScript below returns a two-dimensional array (one row per lot, one entry per detail; 0, 1, 2, 8 and 9 are the indexes you need):
lots = driver.execute_script("return [...document.querySelectorAll('.lot .row')].map(e => [...e.querySelectorAll('p')].map(e1 => e1.textContent.trim()))")
Classic mode:
lots = driver.find_elements_by_css_selector(".lot .row")
for lot in lots:
    lotNo = lot.find_element_by_xpath("./div[1]/p[1]").get_attribute("textContent").strip()
    title = lot.find_element_by_xpath("./div[2]/i").get_attribute("textContent").strip()
    details = lot.find_element_by_xpath("./div[2]/p[2]").get_attribute("textContent").strip()
    date = lot.find_element_by_xpath("./div[3]/p[1]").get_attribute("textContent").strip()
    country = lot.find_element_by_xpath("./div[3]/p[2]").get_attribute("textContent").strip()
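One likely reason the attempts in the question print nothing is that [@class='col-sm-6'] is an exact string comparison, while the div's class attribute is "col-xs-8 col-sm-6". Also, date is a child tag here, not an attribute, so get_attribute("date") has nothing to return. The sketch below shows both points on a trimmed copy of the HTML from the question (the href and ng-* attributes are omitted), using the standard-library xml.etree.ElementTree so it runs offline. With Selenium the usual equivalent would be contains(@class, 'col-sm-6') or a CSS selector such as div.col-sm-6.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question, wrapped in a body element.
SAMPLE = """
<body>
<div class="col-xs-8 col-sm-6">
  <p>
    <i><a id="sln_16564482" class="sln_lot_show">
      "Vers le Blanc Infini"
    </a></i>
    <date>
      (1960)
    </date>
  </p>
  <p>
    Print-Multiple, Etching, aquatint,
    <span class="ng-hide">15 3/4 x 18 in</span>
    <span>39 x 45 cm</span>
  </p>
</div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Exact match on the whole attribute value finds nothing.
exact = root.findall(".//div[@class='col-sm-6']")

# Matching one class token among several does find the div.
cols = [d for d in root.iter("div") if "col-sm-6" in d.get("class", "").split()]

# <date> is an element, so read its text rather than an attribute.
date_made = cols[0].find("./p/date").text.strip()
title = cols[0].find("./p/i/a").text.strip()

print(len(exact), len(cols), title, date_made)
```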
I need to extract the value of a property of a span tag using Selenium.
This is the HTML code:
<small class="time">
<a title="2015" class="class2 class3 class4 class5" href="url">
<span data-long-form="true" data-time-ms="1438835437000" data-time="1438835437" data-aria-label-part="last" class="class6 class7">Aug 5</span>
</a>
</small>
I need to extract the value of the "date-time" property of the span tag. Here is the Python code I am trying to use:
try:
    timestamp = element.find_element_by_xpath(".//small[contains(@class, 'time')]/a[1]/span[1]")
    print("timestamp", timestamp.value_of_css_property("data-time"))
except exp.NoSuchElementException:
    print("Timestamp location not proper")
I also tried:
timestamp = element.find_element_by_css_selector(".class2.class3.class4.class5").value_of_css_property("date-time")
but both return a blank result.
Any idea what is causing this problem?
Use get_attribute():
element = driver.find_element_by_css_selector("small.time span[data-time]")
element.get_attribute("data-time")
Note that in your second attempt, you've used date-time instead of data-time.
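The underlying reason is that data-time is an HTML attribute, not a CSS property. value_of_css_property() reads style values such as color or font-size, so it comes back empty for it. As an offline illustration of reading the attribute, here is a small sketch on the HTML from the question using the standard-library xml.etree.ElementTree. Converting the value to a UTC datetime is an extra step added for illustration and is not part of the question.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The markup from the question.
SAMPLE = """
<small class="time">
  <a title="2015" class="class2 class3 class4 class5" href="url">
    <span data-long-form="true" data-time-ms="1438835437000" data-time="1438835437" data-aria-label-part="last" class="class6 class7">Aug 5</span>
  </a>
</small>
"""

root = ET.fromstring(SAMPLE)
span = root.find("./a/span")

# Attributes are read by name; this mirrors get_attribute("data-time") in Selenium.
raw = span.get("data-time")

# Assumption for illustration: data-time holds seconds since the Unix epoch.
posted_at = datetime.fromtimestamp(int(raw), tz=timezone.utc)

print(raw, posted_at.isoformat())
```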
The issue I'm having is scraping out the element itself. I'm able to scrape the first two (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to scrape that element in a way that is dynamic enough that I'm not parsing for the literal "1300 Dunn Ave" but for whatever the element contains. Here is the source code:
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
And here is my code:
from lxml import html
import requests
page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)
callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')
print('Call Signal: ', callSignal)
print("Dispatch Time: ", dispatchTime)
print("Location: ", location)
And this is my output:
Call Signal: ['150318182198']
Dispatch Time: ['3-18 10:25']
Location: []
Any idea on how I can scrape out the address?
First of all, it is an a element, not a span. And you need a double slash before the text():
//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()
Why a double slash? This is because in reality this a element has no direct text node children:
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
You could also reach the text through u tag:
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
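The difference between the direct text of a and the text of its descendants can be seen without hitting the site. This sketch uses the standard-library xml.etree.ElementTree on the a element shown above. Its .text plays roughly the role of /text() (only the text sitting directly under a), while itertext() plays roughly the role of //text() (all nested text).

```python
import xml.etree.ElementTree as ET

# The anchor as it appears in the real page, with the text wrapped in <u>.
SAMPLE = """
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
"""

link = ET.fromstring(SAMPLE)

# Text directly under <a>: only the whitespace before <u>.
direct_text = (link.text or "").strip()

# Text of <a> and everything nested inside it.
nested_text = "".join(link.itertext()).strip()

print(repr(direct_text))
print(repr(nested_text))
```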
Besides, to scale the solution into multiple results:
iterate over table rows
for every row find the cell values using a partial id attribute match using contains()
use text_content() method to get the text
Implementation:
for item in tree.xpath('//tr[@class="closedCall"]'):
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()
    print('Call Signal: ', callSignal)
    print("Dispatch Time: ", dispatchTime)
    print("Location: ", location)
    print("------")
Prints:
Call Signal: 150318182333
Dispatch Time: 3-18 11:22
Location: 9600 APPLECROSS RD
------
Call Signal: 150318182263
Dispatch Time: 3-18 11:12
Location: 1100 E 1ST ST
------
...
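The same row-by-row idea can be tried offline. This is a rough sketch with the standard-library xml.etree.ElementTree, which has no contains(), so the partial id match is done in Python with endswith(). The table below is hand-built from the output above. The ctrl1 ids on the second row and the helper name field are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-built sample rows; the ctrl1 ids are assumed, not taken from the real page.
SAMPLE = """
<table>
  <tr class="closedCall">
    <td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182333</span></td>
    <td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 11:22</span></nobr></td>
    <td><a id="lstCallsForService_ctrl0_lnkAddress"><u>9600 APPLECROSS RD</u></a></td>
  </tr>
  <tr class="closedCall">
    <td><span id="lstCallsForService_ctrl1_lblIncidentNbr">150318182263</span></td>
    <td><nobr><span id="lstCallsForService_ctrl1_lblDispatchTime">3-18 11:12</span></nobr></td>
    <td><a id="lstCallsForService_ctrl1_lnkAddress"><u>1100 E 1ST ST</u></a></td>
  </tr>
</table>
"""


def field(row, suffix):
    """Return the full text of the first element in the row whose id ends with suffix."""
    for el in row.iter():
        if el.get("id", "").endswith(suffix):
            return "".join(el.itertext()).strip()
    return None


calls = [
    {
        "call_signal": field(row, "lblIncidentNbr"),
        "dispatch_time": field(row, "lblDispatchTime"),
        "location": field(row, "lnkAddress"),
    }
    for row in ET.fromstring(SAMPLE).findall(".//tr[@class='closedCall']")
]

for call in calls:
    print(call)
```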
This is the element you are looking for:
<a id="lstCallsForService_ctrl0_lnkAddress"
href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
As you can see, it is not a span element. Your current XPath expression:
//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
is looking for a span element with this ID, when it should actually be selecting an a element. Use
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
instead. Then, the result should be
Location: ['1300 DUNN AVE']
Please also read alecxe's answer which has more practical advice than mine.