Find sibling node by xpath with Python selenium - python

Here is a fragment of xml. I need to use selenium to find the quote id value 1616968600, but I'm new to xpath and I could use some help.
<div class="row">
.....
</div>
<div class="row">
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Business_Partner_Id">Business partner name: </div>
<div class="col-md-2 ng-binding">Avnet Hall-Mark</div>
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Quote_Id">Quote ID: </div>
<div class="col-md-3 ng-binding">1616968600</div>
</div>

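Locate the div containing the "Quote ID" text and take its following sibling div:
//div[contains(., "Quote ID")]/following-sibling::div
Usage (Selenium 4 removed find_element_by_xpath, so pass By.XPATH to find_element instead):
from selenium.webdriver.common.by import By
quote_id_elm = driver.find_element(By.XPATH, '//div[contains(., "Quote ID")]/following-sibling::div')
print(quote_id_elm.text)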
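If you want to check the "find the label, then take the next sibling" idea without a browser, here is a minimal sketch using only the standard library. Assumptions: the fragment is wrapped in a made-up <root> element and trimmed so it parses as XML, and value_after_label is a hypothetical helper name. ElementTree does not support the following-sibling axis, so the sibling step is done in plain Python, and the helper only looks at the immediately following element.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, wrapped in a root element so it parses as XML.
FRAGMENT = """
<root>
  <div class="row">
    <div class="col-md-2 ng-scope" translate="Business_Partner_Id">Business partner name: </div>
    <div class="col-md-2 ng-binding">Avnet Hall-Mark</div>
    <div class="col-md-2 ng-scope" translate="Quote_Id">Quote ID: </div>
    <div class="col-md-3 ng-binding">1616968600</div>
  </div>
</root>
"""


def value_after_label(root, label_text):
    """Return the text of the element right after the label div, or None."""
    for parent in root.iter():
        children = list(parent)
        # The last child has no following sibling, so it is never a candidate label.
        for index, child in enumerate(children[:-1]):
            if child.tag == "div" and label_text in (child.text or ""):
                return (children[index + 1].text or "").strip()
    return None


quote_id = value_after_label(ET.fromstring(FRAGMENT), "Quote ID")
print(quote_id)
```

On a live page you would still use the XPath with Selenium; this is only a way to reason about the sibling relationship offline.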

Related

python scrape chrome web-store comment

I am trying to scrape reviews from the Chrome Web Store and am having trouble distinguishing between a comment and the replies to that comment.
Below is an example of such HTML, where the user "John Smith" has a comment and a reply.
I am currently using pyppeteer to scrape the content.
I tried querySelectorAll for .ba-bc-Xb-K and .ba-bc-Xb and several other ways, but was not able to tell the two apart reliably.
<div class="ba-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a/default-user=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">Lucy</span><span class="ba-Eb-Nf">Jun 26, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb"></div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span><br>
<div class="ba-Eb-ba" dir="auto">I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
</div>
Generally, I'd avoid relying on these class names because they change too often.
Comments on this page appear to be only one level deep. A parent comment always has a <textarea> (its reply box), so a reply never has one. You can use that to tell parents from replies:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your snippet from the question
out = {}
for div in soup.select("div:has(>.comment-thread-displayname)"):
    # a block without a <textarea> next to it is a reply, so skip it here
    if not div.parent.find("textarea"):
        continue
    replies = []
    for s in div.find_next_siblings():
        # find_all (rather than find) so that every reply in the thread is collected
        for reply in s.find_all(class_="comment-thread-displayname"):
            name, date, text = reply.parent.get_text(
                strip=True, separator="\n"
            ).split("\n", maxsplit=2)
            replies.append((name, date, text))
    name, date, text = div.get_text(strip=True, separator="\n").split(
        "\n", maxsplit=2
    )
    out[(name, date, text)] = replies
print(out)
This prints a dictionary whose keys are 3-item tuples (name, date, text) for each parent comment and whose values are lists of 3-item tuples for its replies:
{
(
"Lucy",
"Jun 26, 2022",
"We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!",
): [],
(
"John Smith",
"May 24, 2022",
"Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.",
): [
(
"John Smith",
"May 24, 2022",
"I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.",
)
],
}
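The unpacking into (name, date, text) relies on split("\n", maxsplit=2). A small sketch of just that step, using a made-up string (the two-line comment body is a hypothetical sample, not taken from the page):

```python
# Text as get_text(strip=True, separator="\n") might return it for one comment
# (hypothetical sample with a line break inside the comment body).
raw = "John Smith\nMay 24, 2022\nFirst line of the comment\nSecond line"

# maxsplit=2 stops after two splits, so the body keeps its own line breaks
# and the unpacking into exactly three names still works.
name, date, text = raw.split("\n", maxsplit=2)

print(name)
print(date)
print(repr(text))
```

Without maxsplit, a comment body containing several text nodes would produce more than three pieces and the unpacking would raise a ValueError.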

Python, Selenium: I need to collect urls but there are no a tags in the element

Good day. I need to collect the name and email of each person listed on this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I am using Selenium with Python to scrape it, but I can't find the profile URL for each person. A sample person card looks like this:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
And when you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, which is what I need to access. How can I do this with Selenium, or is there another solution that would help?
Thanks!
If you look closely, all the profile URLs have the form
https://www.espeakers.com/s/nsas/profile/id
where id is a 5-digit number such as 27397. Each card's speaker-tile div carries that number in its id attribute after an "sid" prefix (for example id="sid12026"), so you just need to strip the prefix and append the number to the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# [3:] drops the leading "sid" from ids such as "sid12026"
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names; profile_urls is a list containing the corresponding profile URLs.
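If you'd rather not depend on the prefix being exactly three characters, here is a small sketch of the id-to-URL step on its own. The helper name profile_url and the regular expression are my assumptions; only the base URL and the sample id "sid12026" come from the question.

```python
import re

BASE_URL = "https://www.espeakers.com/s/nsas/profile/"


def profile_url(tile_id):
    """Build a profile URL from a tile id such as 'sid12026'; None if it does not match."""
    match = re.fullmatch(r"sid(\d+)", tile_id)
    if match is None:
        return None
    return BASE_URL + match.group(1)


print(profile_url("sid12026"))
print(profile_url("speaker-tile"))
```

You would call it with el.get_attribute('id') for each .speaker-tile element and drop any None results.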

Exclude div from css selector

I am trying to select only text that is not inside the div element. In this case I want to exclude the div with class bbCodeBlock. How can I do that? The idea is exclude the citation.
demo
<li id="post-6062713">
<div class="uix_message ">
<div class="messageInfo">
<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<br>
<br>
<br>
<br>
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
</div>
</div>
</div>
</li>
This is a basic demo that replicates the page I am scraping with Scrapy. I need to exclude the quotes, so I am looking for a one-line selector to use in something like
'text': quote.css('article blockquote').extract()
EDITED after comment:
I just realized that the content you want to affect is not inside a div, but at the same level as the div with class .bbCodeBlock, i.e. it is direct content of the blockquote element.
Therefore you need two selectors/rules: one sets the color on the blockquote, the other resets the color on the div elements inside it that do not have the .bbCodeBlock class (the attribution and quote divs nested in each citation), so the quoted text is left alone:
article blockquote {
color: red;
}
article blockquote div:not(.bbCodeBlock) {
color: initial;
}
<li id="post-6062713">
<div class="uix_message ">
<div class="messageInfo">
<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<br>
<br>
<br>
<br>
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
</div>
</div>
</div>
</li>
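The CSS above only changes styling. For the scraping side of the question, the same idea (keep the blockquote's own text, skip everything inside div.bbCodeBlock) can be sketched with the standard library's HTMLParser. This is a sketch under assumptions, not Scrapy code: the class name UnquotedText is made up and the markup is a trimmed copy of the demo. In Scrapy itself, an XPath such as quote.xpath('//article/blockquote/text()').getall() would be the usual way to get only direct text nodes, though I have not run it against the real page.

```python
from html.parser import HTMLParser

# Trimmed copy of the demo markup from the question.
HTML = """
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<br>
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
"""


class UnquotedText(HTMLParser):
    """Collect text inside a blockquote, skipping every div.bbCodeBlock subtree."""

    def __init__(self):
        super().__init__()
        self.blockquote_depth = 0  # > 0 while inside any blockquote
        self.skip_depth = 0        # > 0 while inside a div.bbCodeBlock
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.blockquote_depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs too, so the matching </div> ends the skip.
            if self.skip_depth or "bbCodeBlock" in classes:
                self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.blockquote_depth -= 1
        elif tag == "div" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.blockquote_depth and not self.skip_depth:
            self.chunks.append(text)


parser = UnquotedText()
parser.feed(HTML)
parser.close()
print(parser.chunks)
```

The two depth counters are the whole trick: text is kept only while a blockquote is open and no citation div is open.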

beautifulsoup: finding specific class name in nested div

I am trying to get reviews from the Agoda site for analysis using BeautifulSoup.
I inspected the page and see that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this element is nested inside 10+ other divs.
I have tried
for div in soup.findAll('div', attrs={"class":"comment-detail"}):
    print(div)
but it returns nothing.
Is there a method to match exactly on class="comment-detail" data-selenium="reviews-comments", or do you have any other suggestion?
Thank you.

How to select data within multiple div when no unique id associated with it

I am trying to select "Users Interact, Digital Purchases" from the HTML below with BeautifulSoup, but I have not managed to. Please help.
<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="title"> Interactive Elements </div>
<div class="content">Users Interact, Digital Purchases</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>
You can rely on the class attribute:
soup.find("div", class_="content")
Or, with a CSS selector:
soup.select_one("div.content")
If the content class does not uniquely identify the element and you know the preceding "Interactive Elements" label:
import re
# string= is the current name of the older text= argument in Beautiful Soup
label = soup.find("div", class_="title", string=re.compile("Interactive Elements"))
print(label.find_next_sibling("div", class_="content"))
You can achieve this in various ways:
1. Plain JavaScript: document.querySelector('.content').innerHTML;
2. jQuery: $('.content').text(); or $('.content').html();
3. BeautifulSoup: soup.find("div", class_="content")
4. BeautifulSoup with a CSS selector: soup.select_one("div.content")
