I am trying to select only the text that is not inside a div element. In this case I want to exclude the div with class bbCodeBlock. How can I do that? The idea is to exclude the quoted citation.
demo
<li id="post-6062713">
<div class="uix_message ">
<div class="messageInfo">
<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<br>
<br>
<br>
<br>
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
</div>
</div>
</div>
</li>
This is a basic demo that replicates the page. I am scraping it with Scrapy and need to exclude the quotes, so I am looking for a one-line selector to apply to something like
'text': quote.css('article blockquote').extract()
EDITED after comment:
I just realized that the content you want to affect is not inside a div but at the same level as the div with class .bbCodeBlock, i.e. it is direct content of the blockquote element.
Therefore you need two rules: one sets the color for the blockquote, the other resets it for the div elements nested inside the blockquote (every descendant div except the .bbCodeBlock wrappers themselves), which covers the quoted text:
article blockquote {
  color: red;
}
article blockquote div:not(.bbCodeBlock) {
  color: initial;
}
<li id="post-6062713">
<div class="uix_message ">
<div class="messageInfo">
<div class="messageContent">
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<br>
<br>
<br>
<br>
<div class="bbCodeBlock bbCodeQuote">
<aside>
<div class="attribution type">user said:
↑
</div>
<blockquote class="quoteContainer">
<div class="quote">text to ignore</div>
</blockquote>
</aside>
</div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
</div>
</div>
</div>
</li>
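The CSS rules above only affect rendering. For scraping, the same idea is to keep only the text nodes that are direct children of the outer blockquote. The sketch below shows that filter using only Python's standard library; the class name DirectTextExtractor, the VOID_TAGS set, and the shortened sample markup are assumptions made for illustration. In Scrapy itself, an XPath such as .//article/blockquote/text() should express the same filter, though that is untested against the real page.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class DirectTextExtractor(HTMLParser):
    """Collect text that is a direct child of blockquote.messageText."""

    def __init__(self):
        super().__init__()
        self.depth = None  # None = outside the target blockquote, 0 = directly inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth is None:
            if tag == "blockquote" and "messageText" in classes:
                self.depth = 0
        else:
            self.depth += 1  # entering a nested element such as div.bbCodeBlock

    def handle_endtag(self, tag):
        if self.depth is None or tag in VOID_TAGS:
            return
        if self.depth == 0:
            self.depth = None  # this closes the target blockquote itself
        else:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found directly inside the target blockquote.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


sample = """
<article>
<blockquote class="messageText SelectQuoteContainer">
<div class="bbCodeBlock bbCodeQuote"><aside>
<div class="attribution type">user said:</div>
<blockquote class="quoteContainer"><div class="quote">text to ignore</div></blockquote>
</aside></div>
text to change color
<br>
<div class="bbCodeBlock bbCodeQuote"><aside>
<div class="quote">text to ignore</div>
</aside></div>
text to change color
<div class="messageTextEndMarker"> </div>
</blockquote>
</article>
"""

parser = DirectTextExtractor()
parser.feed(sample)
print(parser.chunks)
```

Tracking the nesting depth is what separates the two "text to change color" lines from the quoted text: anything at depth 1 or deeper belongs to a nested element and is skipped.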
I am trying to scrape reviews from the Chrome Web Store and I am having trouble distinguishing between a comment and the replies to that comment.
Below is an example of such HTML, where the user "John Smith" has a comment and a reply.
I am currently using pyppeteer to scrape the content.
I tried querySelectorAll with .ba-bc-Xb-K and .ba-bc-Xb and several other approaches, but could not reliably tell the two apart.
<div class="ba-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a/default-user=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">Lucy</span><span class="ba-Eb-Nf">Jun 26, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb"></div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span><br>
<div class="ba-Eb-ba" dir="auto">I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
</div>
Generally, I'd avoid relying on these class names because they change too often.
As far as I can see, comments on this page can be only one level deep. The parent comment's container always has a <textarea> (the reply box) and replies don't, so you can tell parent comments from replies like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your snippet from the question
out = {}
for div in soup.select("div:has(>.comment-thread-displayname)"):
    # this is a reply:
    if not div.parent.find("textarea"):
        continue
    replies = []
    for s in div.find_next_siblings():
        if (reply := s.find(class_="comment-thread-displayname")):
            name, date, text = reply.parent.get_text(
                strip=True, separator="\n"
            ).split("\n", maxsplit=2)
            replies.append((name, date, text))
    name, date, text = div.get_text(strip=True, separator="\n").split(
        "\n", maxsplit=2
    )
    out[(name, date, text)] = replies
print(out)
This prints a dictionary whose keys are 3-item tuples (name, date, text) for each parent comment and whose values are lists of 3-item tuples for its replies:
{
    (
        "Lucy",
        "Jun 26, 2022",
        "We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!",
    ): [],
    (
        "John Smith",
        "May 24, 2022",
        "Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.",
    ): [
        (
            "John Smith",
            "May 24, 2022",
            "I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.",
        )
    ],
}
Good day, guys. I have a task to collect the name and email of each person listed on this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I use Selenium and Python to scrape it, but I have a problem getting the profile URL of each person. A sample person card looks like this:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
When you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, which I can then access. How can I do this with Selenium, or are there other solutions that could help?
Thanks!
If you look closely, all the profile URLs have the form
https://www.espeakers.com/s/nsas/profile/id
where id is a five-digit number such as 27397. Each tile's id attribute holds that number with a "sid" prefix (for example sid12026), so you just need to strip the prefix and append the number to the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# each tile id looks like "sid12026"; [3:] drops the "sid" prefix
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names, and profile_urls is a list containing the corresponding profile URLs.
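The id-to-URL step can be checked without a browser. This is a minimal sketch, assuming every tile id has the "sid" prefix seen in the sample card; the helper name profile_url is hypothetical and not part of Selenium.

```python
BASE_URL = "https://www.espeakers.com/s/nsas/profile/"


def profile_url(tile_id: str) -> str:
    """Turn a tile id such as 'sid12026' into a profile URL."""
    prefix = "sid"  # assumption: all tile ids start with this prefix
    if not tile_id.startswith(prefix):
        raise ValueError(f"unexpected tile id: {tile_id!r}")
    return BASE_URL + tile_id[len(prefix):]


print(profile_url("sid12026"))
# prints https://www.espeakers.com/s/nsas/profile/12026
```

Checking the prefix explicitly, rather than slicing blindly with [3:], makes the scraper fail loudly if the site ever changes its id format.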
I have an HTML file with multiple tags (there are multiple nested divs as well). I want to add a new tag with a class at a specific position near the end of the HTML. I tried append, insert, and insert_after/insert_before, but none of them worked as I expected.
My HTML input is:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
</div>
</div>
I want to add a new <div> tag with class="record" at the end, just before the closing tag of <div id="records">.
The output should look like this:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
</div>
</div>
In my case, the number of <div class="record"> elements is not fixed; it may vary.
I would like a solution or suggestion for this problem using BeautifulSoup in Python.
You can use insert_after after the last item in soup.find_all('div', class_='record'):
from bs4 import BeautifulSoup
html = '<div id="records"> <div class="record"> <div class="header"> <div class="title"> Something here to display </div> </div> <div class="disclaimer"> <p>Here i want to print content</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display again once </div> </div> <div class="disclaimer"> <p>Here i want to print content again once</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display second time </div> </div> <div class="disclaimer"> <p>Here i want to print content second time</p> </div> </div> </div>'
soup = BeautifulSoup(html, 'html.parser')
extra_html = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>'''
soup.find_all('div', class_='record')[-1].insert_after(BeautifulSoup(extra_html, 'html.parser')) # [-1] selects the last item
Output of print(soup.prettify()):
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content again once
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content second time
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content 3rd time
</p>
</div>
</div>
</div>
With .append(), you need to select the parent element that should contain the new record, here <div id="records"> (appending to <div id="page"> would place the record after the closing tag of #records):
from bs4 import BeautifulSoup
newRecord = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
'''
# sourceHTML holds the HTML input from the question
soup = BeautifulSoup(sourceHTML, 'html.parser')
records = soup.select_one('#records')
records.append(BeautifulSoup(newRecord, 'html.parser'))
print(soup.prettify())
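For comparison, when the markup is well-formed XML (as the sample in the question is), the same append-to-the-container idea can be sketched with only the standard library. This uses xml.etree.ElementTree instead of BeautifulSoup, and the shortened markup and the names source_html and new_record are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the HTML from the question.
source_html = """
<div id="page">
  <div id="records">
    <div class="record"><div class="title">first</div></div>
  </div>
</div>
""".strip()

new_record = """
<div class="record"><div class="title">second</div></div>
""".strip()

page = ET.fromstring(source_html)
# Select the container itself, not its parent, so the new record lands inside it.
records = page.find(".//div[@id='records']")
records.append(ET.fromstring(new_record))

print(ET.tostring(page, encoding="unicode"))
```

Because the new element is appended to the container, it always ends up after the last existing record, however many records there are.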
What I'm trying to do is get the data from the child element in the below code:
<div class="ListContainer">
<ul class="uiList">
...
</ul>
<div class="expandedList">
<ul>
<div id="1012450"><img /></div>
<div id="1012451"><img /></div>
<div id="1012452"><img /></div>
<div id="1012453"><img /></div>
</ul>
</div>
<div class="expandedList">
<ul>
<div id="1012454"><img /></div>
<div id="1012455"><img /></div>
<div id="1012456"><img /></div>
<div id="1012457"><img /></div>
</ul>
</div>
....
....
</div>
I want to get the id of each div inside the elements with class expandedList. I tried using XPath, but it does not capture all the expandedList elements.
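from selenium.webdriver.common.by import By
div_ids = []
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which newer Selenium 4 releases removed
for div_element in driver.find_elements(By.CSS_SELECTOR, 'div.expandedList div[id]'):
    div_ids.append(div_element.get_attribute('id'))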
Here is a fragment of xml. I need to use selenium to find the quote id value 1616968600, but I'm new to xpath and I could use some help.
<div class="row">
.....
</div>
<div class="row">
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Business_Partner_Id">Business partner name: </div>
<div class="col-md-2 ng-binding">Avnet Hall-Mark</div>
<div class="col-md-2 ng-scope" style="font-weight: bold" translate="Quote_Id">Quote ID: </div>
<div class="col-md-3 ng-binding">1616968600</div>
</div>
Locate the div having Quote ID text and get the next sibling:
//div[contains(., "Quote ID")]/following-sibling::div
Usage:
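from selenium.webdriver.common.by import By
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium 4 releases removed
quote_id_elm = driver.find_element(By.XPATH, '//div[contains(., "Quote ID")]/following-sibling::div')
print(quote_id_elm.text)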
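The label-then-next-sibling logic can also be checked offline. This is a minimal sketch using only the standard library, assuming the row is well-formed XML; the helper value_after_label and the simplified attributes are made up for illustration and are not part of Selenium.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the second row from the question.
fragment = """
<div class="row">
  <div class="col-md-2" translate="Business_Partner_Id">Business partner name: </div>
  <div class="col-md-2">Avnet Hall-Mark</div>
  <div class="col-md-2" translate="Quote_Id">Quote ID: </div>
  <div class="col-md-3">1616968600</div>
</div>
""".strip()


def value_after_label(row, label):
    """Return the text of the element that directly follows the label element."""
    children = list(row)
    # Pair each child with its next sibling and look for the label text.
    for label_div, value_div in zip(children, children[1:]):
        if label in (label_div.text or ""):
            return (value_div.text or "").strip()
    return None


row = ET.fromstring(fragment)
print(value_after_label(row, "Quote ID"))
# prints 1616968600
```

The XPath in the answer expresses the same pairing declaratively: contains(., "Quote ID") finds the label, and following-sibling::div moves to the value next to it.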