Scrape table with div class by sibling, if data found

I would like to scrape an HTML table whose rows are built from <div class="..."> elements rather than <table> markup. I think the approach needs to be:
if driver.find_element_by_xpath finds an element whose class contains footable-row-detail-name,
get the value from its following-sibling, which has class="footable-row-detail-value".
This is just one table. The site I'm scraping has a lot of tables, and some of them don't have all the fields (hence the "if found").
I would like to use Python 3 for this.
I hope I explained it well. Here is the HTML for one table:
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cognome:
</div>
<div class="footable-row-detail-value">
ABBONDANZIERI Katia
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Via:
</div>
<div class="footable-row-detail-value">
Place du Cirque, 2
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
NPA:
</div>
<div class="footable-row-detail-value">
1204
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Luogo:
</div>
<div class="footable-row-detail-value">
Genève
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Tel / Cellulare:
</div>
<div class="footable-row-detail-value">
022 328 23 44
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cellulare:
</div>
<div class="footable-row-detail-value">
079 601 92 75
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
<div class="thZone">
<div class="zCat">
METHODES DE MASSAGE
</div>
<div class="zThr">
Linfodrenaggio manuale
</div>
<div class="zThr">
Massaggio classico
</div>
<div class="zCat">
METHODES PRESCRIPTIVES
</div>
<div class="zThr">
Omeopatia
</div>
<div class="zCat">
METHODES REFLEXES
</div>
<div class="zThr">
Riflessoterapia generale
</div>
</div>
</div>
</div>
Any help is appreciated.

This runs for me. I am using Jupyter and running it line by line. You might get errors when the elements haven't loaded yet, so adjust the waits if that happens.
from io import StringIO
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")

# find_element(By...) works in Selenium 3 and 4; the find_element_by_* helpers were removed in Selenium 4.3
# Choose the canton (GE) in the search form
cantone = driver.find_element(By.XPATH, """//*[@id="ctl00_MainContent_ddl_cantons_Input"]""")
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)
# Accept the disclaimer and submit the search
confermo = driver.find_element(By.XPATH, """//*[@id="MainContent__chkDisclaimer"]""")
confermo.click()
ricera = driver.find_element(By.XPATH, """//*[@id="MainContent_btn_submit"]""")
ricera.click()

# Wait until the result rows exist, then expand each of them
toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")
print(toggle)
while not toggle:
    time.sleep(.2)
    toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")
for r in toggle:
    time.sleep(.2)
    r.click()

# Wait until the expanded detail cells exist
data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")
while not data:
    time.sleep(.2)
    data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")

list_df = []
for r in data:
    # Rewrite the div-based layout as table markup so pandas can parse it
    datum = r.get_attribute('innerHTML')\
        .replace("""<div class="footable-row-detail-inner">""", "<table>")\
        .replace("""<div class="footable-row-detail-row">""", "<tr>")\
        .replace("""<div class="footable-row-detail-name">""", "<td>")\
        .replace("""<div class="footable-row-detail-value">""", "</td><td>")
    # Each parsed row is a [name, value] pair, so the rows convert straight to a dict
    list_df.append(dict(pd.read_html(StringIO(datum))[0].values.tolist()))

df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)
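The two "while not ..." loops above wait forever if the elements never appear. A minimal sketch of a polling helper with a timeout follows; poll_until and its parameters are hypothetical names introduced here, not part of Selenium, and the demo uses a stand-in iterator instead of a live driver.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.2):
    """Call fetch() until it returns a non-empty result or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} s")
        time.sleep(interval)


# Stand-in for a find_elements call: empty twice, then two "elements".
attempts = iter([[], [], ["row-1", "row-2"]])
print(poll_until(lambda: next(attempts), timeout=2.0, interval=0.01))
```

With a live driver, fetch would be a lambda wrapping whichever find_elements call the script already uses. Selenium also ships its own explicit waits (WebDriverWait), which may be a better fit than a hand-written loop.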

One solution in Python 3 is the html.parser module from the standard library!
There is a simple example to get you started :)
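As a rough sketch of what an html.parser approach could look like for the footable markup in the question: the DetailRowParser class below is a hypothetical helper written for this illustration, and it assumes every name div is followed by its value div, as in the sample HTML. Missing fields simply never show up in the result, which covers the "if found" requirement.

```python
from html.parser import HTMLParser


class DetailRowParser(HTMLParser):
    """Collect footable name/value pairs into a dict (hypothetical helper)."""

    NAME = "footable-row-detail-name"
    VALUE = "footable-row-detail-value"

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._field = None  # "name" or "value" while inside such a div
        self._depth = 0     # how many <div>s are open inside the current field
        self._name = None   # last name seen, waiting for its value
        self._chunks = []   # text pieces collected for the current field

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._field:
            self._depth += 1  # nested div inside a name/value div
        elif self.NAME in classes:
            self._field, self._depth, self._chunks = "name", 1, []
        elif self.VALUE in classes:
            self._field, self._depth, self._chunks = "value", 1, []

    def handle_data(self, data):
        if self._field and data.strip():
            self._chunks.append(data.strip())

    def handle_endtag(self, tag):
        if tag != "div" or not self._field:
            return
        self._depth -= 1
        if self._depth:
            return  # still inside the name/value div
        text = ", ".join(self._chunks)
        if self._field == "name":
            self._name = text.rstrip(":")
        elif self._name is not None:
            self.rows[self._name] = text
            self._name = None
        self._field = None


sample = """
<div class="footable-row-detail-inner">
  <div class="footable-row-detail-row">
    <div class="footable-row-detail-name">Cognome:</div>
    <div class="footable-row-detail-value">ABBONDANZIERI Katia</div>
  </div>
  <div class="footable-row-detail-row">
    <div class="footable-row-detail-name">NPA:</div>
    <div class="footable-row-detail-value">1204</div>
  </div>
</div>
"""

parser = DetailRowParser()
parser.feed(sample)
print(parser.rows)
print(parser.rows.get("Cellulare"))  # None: this table has no such field
```

Because the result is a plain dict, a repeated label (the sample repeats "Discipline(s) thérapeutique(s)") keeps only the last value; collecting a list of pairs instead would avoid that.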

Related

python scrap chrome web-store comment

I am trying to scrape reviews from the Chrome Web Store and am having trouble distinguishing between a comment and the replies to that comment.
Below is an example of such HTML, where the user "John Smith" has a comment and a reply.
I am currently using pyppeteer to scrape the content.
I tried querySelectorAll for .ba-bc-Xb-K and .ba-bc-Xb and several other approaches, but was not able to tell the two apart reliably.
<div class="ba-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a/default-user=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">Lucy</span><span class="ba-Eb-Nf">Jun 26, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BGDdh8EqnrWQb-fggox4SOmWi01kdMh4CdLCQD9oHM2uKG-GiDamTukgoJw7LwDNaVtssNY9zUfkPqZTbmL6bYR7YM8Tfa86zq-joAbx8qi5xjbhVjHguGAQoDUMi0YYV_pkFaVKt6ISOsZBGJKlLvhS3uCBg8VrwTO04skZFgbPvYGgPjeQgCKwOz4LyvBPf6dlvKz">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb"></div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span>
<div class="ba-Eb-N">
<div class="rsw-stars" aria-label="1 star">
<div class="rsw-starred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
<div class="rsw-unstarred" aria-hidden="true"></div>
</div>
</div>
<br>
<div class="ba-Eb-ba" dir="auto">Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BE0070MjUM89cQCwjN0anL45obXJS3lggtKPsNh1lW8nApB3slGfCkLIRHtWYvTrteJ5Hsx15_Lq2GFBMLLbrKFghCR9XqAfnbN5yIZquqVmHLhEkzLpjGKotj-iX8wKux-rJoLU_8vz3wUKa76z0Ttw8QF2EKBKeT-vhT2WYDm8qPVpdpmgnYnObbYr_aDQlz4P5FD">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a tabindex="0" class="ba-bc-zb-y z-b-ob-y" role="button">Reply</a><a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
<div class="yb-ba-Eb-k">
<div class="Fg-b-ob-k Pa">
<textarea class="Fg-b-ob-Gc" rows="5" maxlength="4096" aria-label="Write your reply" placeholder="Write your reply"></textarea>
<div class="Od"></div>
<div class="Fg-b-ob-Jb-k"><input type="button" value="Cancel" class="g-c g-c-aSvl1d Aa Fg-b-ob-Fb-c"> <input type="button" value="Post" class="g-c g-c-wb Aa Fg-b-ob-qd-c"></div>
</div>
</div>
<div class="Od"></div>
<div class="Fg-b-ob-fb">
<div>
<div class="ba-bc-Xb">
<div class="ba-ji-A ba-ua-zl-Xb"><img src="//lh3.googleusercontent.com/a-/AFdZucpu4S27XT0-ymC2sQo4ML3v0EkQWHfQeW-YO5jyPg=s40-c-k" class="Lg-ee-A-O-xb" alt=""></div>
<div class="ba-bc-Xb-K">
<div class="ba-pa">
<span class="comment-thread-displayname" dir="auto">John Smith</span><span class="ba-Eb-Nf">May 24, 2022</span><br>
<div class="ba-Eb-ba" dir="auto">I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.</div>
</div>
<div class="ba-bc-Xb-cd">
<div class="bd-Ob Aa">
<div class="bd-Ob-L dd">Was this review helpful?</div>
<label class="voting-editor-button XzMRXd-sn"><input class="XzMRXd-sn-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">Yes</label><label class="voting-editor-button XzMRXd-eb"><input class="XzMRXd-eb-lc XzMRXd-lc" type="radio" name="vote_AIe9_BFmRFRFwJGRfUDOW8jG3rXnLzUlJu5dFPOnRhcZ3Qpf7k7js81NA_AsDgEfcDAZt0H9yZfs73z4D-hSlo1bxU2QLKaAXG2SMo-85mMfMl_-V6KnhrLHz2FEyGejziQP8UkVi-SsuqBw_lc0GmW9TtC5naBzAp94w9FygzBqeDyguPYXJMc">No</label>
</div>
<div class="ba-bc-zb-Pe">
<a class="ba-bc-zb-y ba-Eb-xe-ba Pa" role="button" tabindex="0">Delete</a>
<div class="ba-bc-zb-y Da-ub"><a tabindex="0" class="Aa Da-ub-y" role="button">Mark as spam or abuse</a></div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="Fg-b-mb-Fk Pa"><a role="button" tabindex="0" class="mb-Fk-c">Load more replies</a></div>
</div>
</div>
</div>
</div>

Generally, I'd avoid relying on these class names because they change too often.
From the snippet, comments on this page appear to be only one level deep. A parent comment always has a <textarea> (the reply box) and replies never do, so you can tell parent from reply like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # <-- html_doc is your snippet from the question
out = {}
for div in soup.select("div:has(>.comment-thread-displayname)"):
    # this is a reply:
    if not div.parent.find("textarea"):
        continue
    replies = []
    for s in div.find_next_siblings():
        if (reply := s.find(class_="comment-thread-displayname")):
            name, date, text = reply.parent.get_text(
                strip=True, separator="\n"
            ).split("\n", maxsplit=2)
            replies.append((name, date, text))
    name, date, text = div.get_text(strip=True, separator="\n").split(
        "\n", maxsplit=2
    )
    out[(name, date, text)] = replies
print(out)
This prints a dictionary whose keys are 3-item tuples (name, date, text) for each parent comment and whose values are lists of 3-item tuples for its replies:
{
    (
        "Lucy",
        "Jun 26, 2022",
        "We use this because its easy to hover and text over phone numbers of clients. IT VERY GLITCHY AND CRASHES OFTEN. 90% our of business runs on SMS texting, so I really wish I didn't use this for my company. If any has a better option please let me know!!!! I've been using this for a year and its getting better!!!!!!! AVOID IF YOU CAN!!!! Customer Service is ALSO TERRIBLE!",
    ): [],
    (
        "John Smith",
        "May 24, 2022",
        "Desktop app is interesting and the chrome browser buddy is even better. I wish I was not forced by my company to use the company.",
    ): [
        (
            "John Smith",
            "May 24, 2022",
            "I'm happy to chat with the engineering and UX team to tell you exactly how to fix it.",
        )
    ],
}

Python, Selenium: I need to collect URLs but there are no <a> tags in the element

Good day, guys. I have a task to collect the name and email of each person listed on this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I am using Selenium and Python to scrape it, but I have a problem getting the URL of each person's page. A sample person card looks like this:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
And when you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, where I can access it. How can I do this with Selenium, or is there another solution that would help?
Thanks!

If you look closely, all the profile URLs have the form
https://www.espeakers.com/s/nsas/profile/id
where id is a 5-digit number such as 27397. The same number appears in the id attribute of each .speaker-tile div (for example sid12026), so you just need to strip the sid prefix and append the rest to the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# [3:] drops the leading "sid" from ids such as "sid12026"
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names, and profile_urls is a list containing the corresponding profile URLs.
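The id-to-URL step can be tried without a browser. This is only a sketch: profile_url is a hypothetical helper, and the single id/name pair is the one from the sample card in the question.

```python
BASE_URL = "https://www.espeakers.com/s/nsas/profile/"


def profile_url(tile_id, prefix="sid"):
    """Build a profile URL from a tile id such as 'sid12026'."""
    if not tile_id.startswith(prefix):
        raise ValueError(f"unexpected tile id: {tile_id!r}")
    return BASE_URL + tile_id[len(prefix):]


# Values as they would come back from the .speaker-tile and .speaker-name lookups.
tile_ids = ["sid12026"]
names = ["Alex Aanderud"]

profiles = dict(zip(names, map(profile_url, tile_ids)))
print(profiles)
```

Checking the prefix instead of slicing blindly with [3:] makes the script fail loudly if the site ever changes its id format.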

beautifulsoup add <div> with class at end of html

I have an HTML file with multiple tags (there are multiple divs nested inside the div as well). I want to add a new tag with a class at a specific position near the end of the HTML. I tried append, insert, and insert_after/insert_before, but none of them worked as I expected.
My HTML input is:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
</div>
</div>
I want to add a new <div> tag with class="record" at the end, just before the closing tag of <div id="records">.
The output should look like this:
<div id="page">
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content again once</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content second time</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
</div>
</div>
In my case, the number of <div class="record"> elements is not fixed; it may vary every time.
I would like a solution or suggestion for this problem using BeautifulSoup in Python.

You can call insert_after on the last item of soup.find_all('div', class_='record'):
from bs4 import BeautifulSoup
html = '<div id="records"> <div class="record"> <div class="header"> <div class="title"> Something here to display </div> </div> <div class="disclaimer"> <p>Here i want to print content</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display again once </div> </div> <div class="disclaimer"> <p>Here i want to print content again once</p> </div> </div> <div class="record"> <div class="header"> <div class="title"> Something here to display second time </div> </div> <div class="disclaimer"> <p>Here i want to print content second time</p> </div> </div> </div>'
soup = BeautifulSoup(html, 'html.parser')
extra_html = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>'''
soup.find_all('div', class_='record')[-1].insert_after(BeautifulSoup(extra_html, 'html.parser')) # [-1] selects the last item
Output of print(soup.prettify()):
<div id="records">
<div class="record">
<div class="header">
<div class="title">
Something here to display
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display again once
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content again once
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display second time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content second time
</p>
</div>
</div>
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>
Here i want to print content 3rd time
</p>
</div>
</div>
</div>

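Using .append(), you need to select the parent element that the new record should go into. Appending to <div id="page"> would put the record after the closing tag of <div id="records">, so select <div id="records"> instead:
from bs4 import BeautifulSoup
newRecord = '''
<div class="record">
<div class="header">
<div class="title">
Something here to display 3rd time
</div>
</div>
<div class="disclaimer">
<p>Here i want to print content 3rd time</p>
</div>
</div>
'''
soup = BeautifulSoup(sourceHTML, 'html.parser')  # sourceHTML is the input HTML from the question
records = soup.select_one('#records')
records.append(BeautifulSoup(newRecord, 'html.parser'))  # becomes the last child of #records
print(soup.prettify())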
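If the title and disclaimer text vary from record to record, building the tags with soup.new_tag may be easier than parsing an HTML string each time. A minimal sketch, assuming the target container is <div id="records"> as in the question; append_record is a hypothetical helper name and the one-record input is shortened for illustration.

```python
from bs4 import BeautifulSoup


def append_record(soup, title, disclaimer):
    """Append a new <div class="record"> as the last child of #records."""
    record = soup.new_tag("div", attrs={"class": "record"})
    header = soup.new_tag("div", attrs={"class": "header"})
    title_div = soup.new_tag("div", attrs={"class": "title"})
    title_div.string = title
    header.append(title_div)

    disclaimer_div = soup.new_tag("div", attrs={"class": "disclaimer"})
    p = soup.new_tag("p")
    p.string = disclaimer
    disclaimer_div.append(p)

    record.append(header)
    record.append(disclaimer_div)
    soup.select_one("#records").append(record)
    return record


soup = BeautifulSoup(
    '<div id="page"><div id="records"><div class="record"></div></div></div>',
    "html.parser",
)
append_record(soup, "Something here to display 3rd time",
              "Here i want to print content 3rd time")

# Direct children of #records: the existing record plus the new one.
children = soup.select_one("#records").find_all("div", recursive=False)
print(len(children))
print(children[-1].p.string)
```

The helper works for any number of existing records, since append always places the new tag after the current last child.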

Web scraping problem : Data does not show when printed

So I tried to scrape this website: https://top-1000-sekolah.ltmpt.ac.id/site/page?id=2001
If you inspect the page, there are divs with the ids tab-1, tab-2, tab-3 and tab-4. I tried to scrape each id, but only the data from tab-1 was grabbed. What did I do wrong?
pk = driver.find_element_by_xpath("(//div[@id='tab-1'])")
pbm = driver.find_element_by_id('tab-2')
pu = driver.find_element_by_id('tab-3')
ppu = driver.find_element_by_id('tab-4')
The output I expect from tab-2 is :
Kemampuan Kuantitatif
2
Urut Nasional
1
Urut Provinsi
Rerata
640,253
Nilai Tertinggi
721,15
Nilai Terendah
511,14
Standar Deviasi
44,1
but currently the tab-2 output is blank ('')

Try doing this:
pbm = driver.find_element_by_id('tab-2')
print(pbm.text)
If that doesn't work, I suspect it is because the div with the id tab-2 has many child elements. You will need to select those child elements directly to get the data you need. Use the XPath method you used for tab-1.
<div class="row">
<div class="col-lg-12 details order-2 order-lg-1">
<h3 align="center">
Kemampuan Memahami Bacaan dan Menulis
</h3>
<hr>
<div class="row">
<div class="col-lg-6 col-md-6">
<div class="count-box">
<i class="icofont-award"></i>
<span data-toggle="counter-up">5</span>
<p>Urut Nasional</p>
</div>
</div>
<div class="col-lg-6 col-md-6">
<div class="count-box">
<i class="icofont-award"></i>
<span data-toggle="counter-up">1</span>
<p>Urut Provinsi</p>
</div>
</div>
</div>
<hr>
<div class="row">
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Rerata</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>589,104</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Nilai Tertinggi</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>709,61</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Nilai Terendah</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>371,88</b></h3>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="card bg-light mb-3" style="max-width: 18rem;">
<div class="card-header" align="center">Standar Deviasi</div>
<div class="card-body">
<h3 class="card-title" align="center"><b>65,96</b></h3>
</div>
</div>
</div>
</div>
</div>
</div>
For example, to get the heading Kemampuan Kuantitatif:
name = driver.find_element_by_xpath('//*[@id="tab-2"]/div/div/h3')
print(name.text)
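One possible explanation, which this thread does not confirm, is that the inactive tabs are hidden by CSS: Selenium's .text only returns text that is rendered as visible, so a hidden div comes back as an empty string. A minimal sketch under that assumption falls back to the DOM textContent property. element_text and FakeHiddenTab are hypothetical names, and the fake class stands in for a real WebElement so the snippet runs without a browser.

```python
def element_text(element):
    """Return the element's visible text, or its textContent if nothing is visible."""
    visible = element.text.strip()
    if visible:
        return visible
    raw = element.get_attribute("textContent") or ""
    # textContent keeps the source whitespace, so tidy it line by line.
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


class FakeHiddenTab:
    """Stand-in for a WebElement whose tab is not displayed (hypothetical)."""

    text = ""

    def get_attribute(self, name):
        if name == "textContent":
            return "\n  Kemampuan Kuantitatif\n   2\n  Urut Nasional\n"
        return None


print(element_text(FakeHiddenTab()))
```

With a live driver, the element found for tab-2 would be passed in place of the fake object. Clicking the tab first so that it becomes visible is another option.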

beautifulsoup: finding specific class name in nested div

I am trying to collect reviews from the Agoda site for analysis using BeautifulSoup.
I inspected the page and saw that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this div is nested inside 10+ other divs.
I have tried
for div in soup.findAll('div', attrs={"class":"comment-detail"}):
    print(div)
but it returns nothing.
Is there a way to match exactly class="comment-detail" data-selenium="reviews-comments", or do you have any other suggestion?
Thank you.
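A minimal sketch of matching on both attributes at once, run against a shortened copy of the snippet above. find_all searches at any nesting depth, so the 10+ wrapper divs should not matter. If the same call returns nothing on the real page, one possible reason (an assumption, not verified here) is that the reviews are inserted by JavaScript and are not in the HTML that BeautifulSoup receives.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
  <div name="review-title" class="title" data-selenium="comments-title">HAD 1 HOUR SLEEP</div>
  <div class="review-comment-section">
    <div class="comment-detail" data-selenium="reviews-comments">
      <span>Great location</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Match on both the class and the data-selenium attribute, at any depth.
matches = soup.find_all(
    "div", attrs={"class": "comment-detail", "data-selenium": "reviews-comments"}
)
comments = [div.get_text(strip=True) for div in matches]
print(comments)

# The same query written as a CSS selector.
same = soup.select('div.comment-detail[data-selenium="reviews-comments"]')
print(len(same))
```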
