I am trying to get reviews from the Agoda site for analysis, using BeautifulSoup.
I inspected the page and found that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this element is nested inside more than 10 other elements.
I tried:
for div in soup.findAll('div', attrs={"class": "comment-detail"}):
    print(div)
but it returns nothing.
Is there a way to match exactly on class="comment-detail" data-selenium="reviews-comments", or do you have any other suggestion?
Thank you.
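A minimal sketch of matching on both attributes at once, assuming the review markup is actually present in the HTML handed to BeautifulSoup. The `html` string below is a trimmed copy of the snippet from the question, and the built-in "html.parser" backend is used only to keep the example self-contained. If the same call returns nothing on the real page, one possible cause (an assumption, not verified here) is that the reviews are loaded by JavaScript after the initial response.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to be in the fetched HTML)
html = """
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">HAD 1 HOUR SLEEP</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match on the class and the data-selenium attribute together
comments = soup.find_all(
    "div",
    attrs={"class": "comment-detail", "data-selenium": "reviews-comments"},
)
texts = [div.get_text(strip=True) for div in comments]

# The same match written as a CSS selector
same = soup.select('div.comment-detail[data-selenium="reviews-comments"]')

print(texts)
```

Nesting depth does not matter here: find_all and select search all descendants, so the 10+ wrapper elements do not need to be spelled out.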
Related
Good day. I need to collect the name and email of each person listed on this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I am using Selenium and Python to scrape it, but I can't work out how to get the URL of each person's page. A sample person card looks like this:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
When you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to the page with that person's info, which I can then access. How can I do this with Selenium, or are there other solutions that could help?
Thanks!
If you look closely, all the profile URLs have the form
https://www.espeakers.com/s/nsas/profile/id
where id is a five-digit number such as 27397. The same number appears in each tile's id attribute after the "sid" prefix (for example sid12026), so you just need to extract it and append it to the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# [3:] drops the leading "sid" from ids such as "sid12026"
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names; profile_urls is a list containing the corresponding profile URLs.
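The id-to-URL step can be tried without a browser. This is a small sketch, assuming every tile id has the "sid" prefix seen in the sample card; `profile_url` is a hypothetical helper name, and the one-element lists stand in for what Selenium would return.

```python
BASE_URL = "https://www.espeakers.com/s/nsas/profile/"


def profile_url(tile_id, base=BASE_URL):
    """Turn a tile id such as 'sid12026' into a profile URL (hypothetical helper)."""
    prefix = "sid"
    if not tile_id.startswith(prefix):
        # Assumption: all tiles use the "sid" prefix; fail loudly if one does not
        raise ValueError(f"unexpected tile id: {tile_id!r}")
    return base + tile_id[len(prefix):]


# Stand-ins for the values collected with Selenium
tile_ids = ["sid12026"]
names = ["Alex Aanderud"]

profile_urls = [profile_url(tile_id) for tile_id in tile_ids]
# Pair each name with its profile URL
speakers = dict(zip(names, profile_urls))
print(speakers)
```

Checking the prefix explicitly, rather than slicing blindly, makes a change in the site's id scheme show up as an error instead of a silently wrong URL.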
I have a web page that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I do this?
Thank you.
You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup = BeautifulSoup(t, "lxml")
aTags = soup.select('a')
data = []
for aTag in aTags:
    # each <a> card contains exactly one <time> element
    timeTag = aTag.select_one('time')
    data.append([aTag.get('href'), timeTag['datetime']])
print(data)
Instead of t, you can pass in the page source from Selenium (driver.page_source).
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]
You can use the find_elements_by_xpath() and get_attribute() functions in Python, as follows:
# Note: these helpers were removed in Selenium 4.3; there, use driver.find_elements(By.XPATH, ...) instead
# for the hrefs ("cardlisting" matches every card, not only cardlisting0)
urls = [a.get_attribute('href') for a in driver.find_elements_by_xpath('//a[contains(@class, "cardlisting")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements_by_xpath('//a//time')]
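The snippet above yields two separate lists, while the question asks for pairs. A short sketch of combining them, assuming both lists come back in document order and have the same length; the literal values below are stand-ins taken from the sample HTML, not real Selenium output.

```python
# Stand-ins for the lists returned by the two XPath queries
urls = ["abc/def/gh.com", "ijk/lmn/op.com"]
dates = ["2020-05-31", "2020-04-30"]

# If a card had no <time> element the lists would drift out of step,
# so check the lengths before pairing them up.
if len(urls) != len(dates):
    raise ValueError("urls and dates have different lengths")

pairs = [[url, date] for url, date in zip(urls, dates)]
print(pairs)
```

If some cards can lack a date, looping over each card and looking up its own time element (as in the BeautifulSoup answer) is the safer design, because the pairing then never depends on list positions.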
I would like to scrape an HTML table that is built from <div class="..."> elements. To scrape it, I think I'll need logic like this:
if driver.find_element_by_xpath finds an element whose class contains footable-row-detail-name
get the value from its following sibling (class="footable-row-detail-value")
This is just one table. The site I'm scraping has many tables, and some of them don't have all the data (hence the "if found").
I would like to use Python 3 for this.
I hope I explained it well. Here is the HTML for one table:
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cognome:
</div>
<div class="footable-row-detail-value">
ABBONDANZIERI Katia
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Via:
</div>
<div class="footable-row-detail-value">
Place du Cirque, 2
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
NPA:
</div>
<div class="footable-row-detail-value">
1204
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Luogo:
</div>
<div class="footable-row-detail-value">
Genève
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Tel / Cellulare:
</div>
<div class="footable-row-detail-value">
022 328 23 44
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cellulare:
</div>
<div class="footable-row-detail-value">
079 601 92 75
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
<div class="thZone">
<div class="zCat">
METHODES DE MASSAGE
</div>
<div class="zThr">
Linfodrenaggio manuale
</div>
<div class="zThr">
Massaggio classico
</div>
<div class="zCat">
METHODES PRESCRIPTIVES
</div>
<div class="zThr">
Omeopatia
</div>
<div class="zCat">
METHODES REFLEXES
</div>
<div class="zThr">
Riflessoterapia generale
</div>
</div>
</div>
</div>
Any help is appreciated.
This runs for me. I am using Jupyter and running it line by line. You may get errors if the elements haven't loaded yet, so adjust the waits if that happens.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# Note: the find_element(s)_by_* helpers were removed in Selenium 4.3;
# on newer versions use driver.find_element(By.XPATH, ...) and so on.
driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")
# Choose the canton, accept the disclaimer, and submit the search
cantone = driver.find_element_by_xpath("""//*[@id="ctl00_MainContent_ddl_cantons_Input"]""")
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)
confermo = driver.find_element_by_xpath("""//*[@id="MainContent__chkDisclaimer"]""")
confermo.click()
ricera = driver.find_element_by_xpath("""//*[@id="MainContent_btn_submit"]""")
ricera.click()
# Wait until the result rows exist, then expand each one
toggle = driver.find_elements_by_class_name("""footable-toggle""")
print(toggle)
while not toggle:
    time.sleep(.2)
    toggle = driver.find_elements_by_class_name("""footable-toggle""")
for r in toggle:
    time.sleep(.2)
    r.click()
# Wait until the expanded detail cells exist
data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
while not data:
    time.sleep(.2)
    data = driver.find_elements_by_class_name("""footable-row-detail-cell""")
list_df = []
for r in data:
    # Rewrite the div layout as table markup so pandas can parse it
    datum = r.get_attribute('innerHTML')\
        .replace("""<div class="footable-row-detail-inner">""", "<table>")\
        .replace("""<div class="footable-row-detail-row">""", "<tr>")\
        .replace("""<div class="footable-row-detail-name">""", "<td>")\
        .replace("""<div class="footable-row-detail-value">""", "</td><td>")
    # Each parsed row is a [name, value] pair; dict() turns them into one record
    list_df.append(dict(pd.read_html(datum)[0].values.tolist()))
df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)
One solution in Python 3 is the standard-library html.parser module!
There is a simple example in its documentation to get you started :)
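As a rough illustration of the html.parser route, here is a sketch that collects the name/value pairs from the footable-row-detail-* divs into a dict. `DetailParser` is a hypothetical class name, the `sample` string is a shortened copy of the table from the question, and rows without a value div are simply absent from the result (which covers the "if found" case).

```python
from html.parser import HTMLParser


class DetailParser(HTMLParser):
    """Collect name/value pairs from footable-row-detail-* divs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._field = None  # "name" or "value" while inside such a div
        self._depth = 0     # depth of nested divs inside the current field
        self._name = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._field is not None:
            self._depth += 1  # a nested div, e.g. the zCat/zThr blocks
        elif "footable-row-detail-name" in classes:
            self._field, self._depth, self._name = "name", 1, []
        elif "footable-row-detail-value" in classes:
            self._field, self._depth, self._value = "value", 1, []

    def handle_endtag(self, tag):
        if tag != "div" or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            if self._field == "value":
                key = " ".join(self._name).rstrip(":")
                self.rows[key] = " ".join(self._value)
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._field is not None:
            target = self._name if self._field == "name" else self._value
            target.append(text)


# Shortened copy of the table from the question
sample = """
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">Cognome:</div>
<div class="footable-row-detail-value">ABBONDANZIERI Katia</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">NPA:</div>
<div class="footable-row-detail-value">1204</div>
</div>
</div>
"""

parser = DetailParser()
parser.feed(sample)
details = parser.rows
print(details)
```

Because the result is a dict keyed by label, a label that occurs twice in one table (as "Discipline(s) thérapeutique(s)" does in the question) keeps only its last value; collecting a list of pairs instead would avoid that.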
I am trying to select "Users Interact, Digital Purchases" from the HTML below with BeautifulSoup, but I have not managed to. Please help.
<div class="details-wrapper apps-secondary-color">
<div class="details-section metadata">
<div class="details-section-heading">
<div class="details-section-contents">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="title"> Interactive Elements </div>
<div class="content">Users Interact, Digital Purchases</div>
</div>
<div class="meta-info">
<div class="meta-info">
<div class="meta-info contains-text-link">
<div class="meta-info">
<div class="meta-info meta-info-wide">
<div class="details-sharing-section">
</div>
<div class="details-section-divider"></div>
</div>
</div>
</div>
You can rely on the class attribute:
soup.find("div", class_="content")
Or, with a CSS selector:
soup.select_one("div.content")
If the content class does not uniquely identify the element and you know the preceding "Interactive Elements" label:
import re
# string= is the current name of the older text= argument
label = soup.find("div", class_="title", string=re.compile("Interactive Elements"))
print(label.find_next_sibling("div", class_="content"))
You can achieve this in various ways:
1. document.querySelector('.content').innerHTML; (JavaScript, in the browser)
2. $('.content').text(); or $('.content').html(); (jQuery)
3. soup.find("div", class_="content") (BeautifulSoup)
4. soup.select_one("div.content") (BeautifulSoup)
I have this HTML:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
How do I get everything inside the div with the id div1?
soup.find('div',{'id':"div1"}) returns:
<div id='div1'>
<div id='d'> </div>
<p></p>
</div>
I need to get only:
<div id='d'> </div>
<p></p>
See the documentation, specifically .find() and .contents.
You want the content between the start and end of the tag including all child tags.
soup.find('div', id="div1").contents
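A small runnable sketch of the difference, using the HTML from the question and the built-in "html.parser" backend (chosen here only to keep the example self-contained). .contents gives the children as a list of nodes; decode_contents() is shown as one way to get the same inner markup as a single string, in case a string is what is needed.

```python
from bs4 import BeautifulSoup

html = "<div id='div1'><div id='d'> </div><p></p></div>"
soup = BeautifulSoup(html, "html.parser")

outer = soup.find("div", id="div1")

# .contents: the direct children as a list of Tag / NavigableString objects
children = outer.contents
child_names = [child.name for child in children]

# decode_contents(): the inner markup rendered as one string
inner_html = outer.decode_contents()

print(child_names)
print(inner_html)
```

Note that with the multi-line HTML from the question, .contents also includes the newline strings between the tags, so filter on `child.name` if only tags are wanted.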