How to pick up specific data with Python and Selenium

I'm Hiro from Japan.
I have just started learning Python with Selenium on my own.
I would be grateful if somebody could help me solve the following problem.
The number of "data" items enclosed in "p" tags changes every time, so I don't know in advance how many will be shown.
Data4 always appears, but its position changes.
All of these items also share the same class.
For example, in sample A Data4 is the 4th item, but in sample B it is the 2nd.
Please tell me how to pick up "Data4".
Thank you in advance for your help.
(sample A)
----------------------------------------
<div class="elist" id="test">
<ul class="ijkl">
<li class="elRow">
<div class="elRowTitle">
<p>Code1</p>
</div>
<div class="elRowData">
<p>Data1</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code2</p>
</div>
<div class="elRowData">
<p>Data2</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code3</p>
</div>
<div class="elRowData">
<p>Data3</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code4</p>
</div>
<div class="elRowData">
<p>Data4</p>
</div>
</li>
</ul>
</div>
----------------------------------------
(sample B)
----------------------------------------
<div class="elist" id="test">
<ul class="ijkl">
<li class="elRow">
<div class="elRowTitle">
<p>Code1</p>
</div>
<div class="elRowData">
<p>Data1</p>
</div>
</li>
<li class="elRow">
<div class="elRowTitle">
<p>Code4</p>
</div>
<div class="elRowData">
<p>Data4</p>
</div>
</li>
</ul>
</div>
----------------------------------------
I wrote the following script using the "find_elements_by_xxxxx" methods for both sample A and B, but it did not work.
from selenium import webdriver
browser = webdriver.Chrome('chromedriver.exe')
elem_ItemCodes = driver.find_elements_by_tag_name('div')
elem_ItemCodes2 = elem_ItemCodes.find_elements_by_class_name('elRowData')
elem_ItemCode = elem_ItemCodes2[3].text
print(elem_ItemCode)

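You can use the parent element as a reference.
The equivalent XPath is:
(//div[@id="test"]//div[@class="elRowData"])[4]/p
Your approach is correct, but make a few changes:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # assumes chromedriver can be located automatically
# load the page with driver.get(...) before searching
container = driver.find_element(By.ID, 'test')  # a single element, not a list
rows = container.find_elements(By.CLASS_NAME, 'elRowData')
elem_ItemCode = rows[3].find_element(By.TAG_NAME, 'p').text  # 4th row: Python lists are 0-based
print(elem_ItemCode)
The text is in the p tag. Note that an XPath index starts from 1, while a Python list index starts from 0. A fixed index only works while Data4 stays in the same position, as in sample A.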
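Because Data4 is 4th in sample A but 2nd in sample B, a fixed index is fragile. A minimal sketch of an alternative, assuming the title Code4 always sits in the same li as Data4 (as it does in both samples): look the row up by its title text rather than by position. The helper names row_value_xpath and data_for_code are hypothetical, and driver is assumed to be a Selenium WebDriver that has already loaded the page.

```python
def row_value_xpath(code, list_id="test"):
    """Build an XPath for the <p> inside the elRowData of the row titled `code`.

    Assumes `code` contains no double quotes.
    """
    return (
        f'//div[@id="{list_id}"]//li'
        f'[div[@class="elRowTitle"]/p[normalize-space()="{code}"]]'
        '/div[@class="elRowData"]/p'
    )


def data_for_code(driver, code):
    """Return the data text that belongs to the given title, wherever the row is."""
    # "xpath" is the string value of Selenium's By.XPATH constant
    return driver.find_element("xpath", row_value_xpath(code)).text
```

With a live session, print(data_for_code(driver, "Code4")) should print the text of the p element in the Code4 row for either sample, since the lookup does not depend on how many rows precede it.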

Related

Python, Selenium: I need to collect URLs but there are no a tags in the element

Good day, guys. I have a task to collect the name and email of each person from this site:
https://www.espeakers.com/s/nsas/search?available_on=&awards&budget=0%2C10&bureau_id=304&distance=1000&fee=false&items_per_page=3701&language=en&location=&norecord=false&nt=0&page=0&presenter_type=&q=%5B%5D&require&review=false&sort=speakername&video=false&virtual=false
I use Selenium and Python to scrape it, but I have a problem getting the profile URL for each person. The sample structure of a person card is:
<div class="col-xs-12 col-sm-6 col-md-4 col-lg-3">
<div class="speaker-tile" id="sid12026">
<div class="speaker-thumb" style='background-image: url("https://streamer.espeakers.com/assets/6/12026/159445.jpg"); background-size: contain;'>
<div class="row">
<div class="col-xs-8 text-left">
</div>
<div class="col-xs-4 text-right speaker-top-actions">
<i class="fa fa-ellipsis-h fa-fw">
</i>
</div>
</div>
</div>
<div class="speaker-details">
<div class="speaker-name">
Alex Aanderud
</div>
<div class="row" style="margin-top: 15px;">
<div class="col-xs-12 col-sm-12">
<div class="speaker-location">
<i class="fa fa-map-marker mp-tertiary-background">
</i>
AZ
<span>
,
</span>
US
</div>
</div>
<div class="col-sm-6 col-xs-12">
<div class="speaker-awards">
</div>
</div>
</div>
<div class="speaker-oneline text-left">
<p>
</p>
<div>
Certified Trainer of Advanced Integrative Psychology and Certified John Maxwell Speaker, Trainer, Coach, will transform your organization and improve your results.
</div>
</div>
<div class="speaker-assets">
<div class="row">
</div>
</div>
<div class="speaker-actions">
<div class="row">
<div class="text-center col-xs-12">
<div class="btn btn-flat mp-primary btn-block">
<span class="hidden-xs hidden-sm">
View Profile
</span>
<span class="visible-xs visible-sm">
Profile
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
And when you click on
<span class="hidden-xs hidden-sm">
View Profile
</span>
it takes you to a page with the person's info, where I can access it. How can I use Selenium to do this, or are there other solutions that can help me?
Thanks!

If you look closely, all the profile URLs are of the form
https://www.espeakers.com/s/nsas/profile/id
where id is a 5-digit number such as 27397. So you just need to extract the id and concatenate it with the base URL to obtain the profile URL.
from selenium.webdriver.common.by import By
url = 'https://www.espeakers.com/s/nsas/profile/'
# each tile id looks like 'sid12026'; [3:] drops the 'sid' prefix
profile_urls = [url + el.get_attribute('id')[3:] for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-tile')]
names = [el.text for el in driver.find_elements(By.CSS_SELECTOR, '.speaker-name')]
names is a list containing all the names, and profile_urls is a list containing the corresponding profile URLs.
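The two lists stay aligned only if every tile contains exactly one name. A small sketch that reads both values from the same tile, so each name stays paired with its own URL. profile_url and speaker_records are hypothetical helper names, the id format sid followed by digits is taken from the sample card, and driver is assumed to be a WebDriver already on the search page.

```python
BASE_URL = "https://www.espeakers.com/s/nsas/profile/"


def profile_url(tile_id):
    """Turn a tile id such as 'sid12026' into the profile URL."""
    # str.removeprefix needs Python 3.9 or newer
    return BASE_URL + tile_id.removeprefix("sid")


def speaker_records(driver):
    """Return (name, profile_url) tuples, one per .speaker-tile on the page."""
    records = []
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant
    for tile in driver.find_elements("css selector", ".speaker-tile"):
        name = tile.find_element("css selector", ".speaker-name").text.strip()
        records.append((name, profile_url(tile.get_attribute("id"))))
    return records
```

Each returned URL can then be opened with driver.get(...) to read the details on the profile page.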

selenium scrape multiple attributes within a block at the same time

I have a webpage that follows this pattern:
<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>
...
And I want to scrape the href and datetime attributes in pairs: [abc/def/gh.com, 2020-05-31], [ijk/lmn/op.com, 2020-04-30]
How can I achieve this?
Thank you.

You can try the following:
from bs4 import BeautifulSoup
t='''<a class="card cardlisting0" href="abc/def/gh.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-05-31">3 hours ago</time>
</div>
</div>
</a>
<a class="card cardlisting1" href="ijk/lmn/op.com">
<div class="contentWrapper">
<div class="card-content">
<time datetime="2020-04-30">20200430</time>
</div>
</div>
</a>'''
soup=BeautifulSoup(t,"lxml")
aTags=soup.select('a')
data=[]
for aTag in aTags:
    timeTag=aTag.select_one('time')
    data.append([aTag.get('href'),timeTag['datetime']])
print(data)
Instead of t you can use the page source from Selenium (driver.page_source).
Output:
[['abc/def/gh.com', '2020-05-31'], ['ijk/lmn/op.com', '2020-04-30']]

You can use find_elements() with an XPath locator together with get_attribute(), as follows:
from selenium.webdriver.common.by import By
# for the hrefs (matches every card, not only cardlisting0)
urls = [a.get_attribute('href') for a in driver.find_elements(By.XPATH, '//a[contains(@class, "card")]')]
# for the datetimes
dates = [time_element.get_attribute('datetime') for time_element in driver.find_elements(By.XPATH, '//a[contains(@class, "card")]//time')]
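Two separate lists only line up if every card contains exactly one time element. A sketch that reads both attributes from the same card, so each pair is guaranteed to belong together. card_pairs is a hypothetical helper name, the selector a.card is an assumption based on the sample markup, and driver is assumed to be a WebDriver already on the page.

```python
def card_pairs(driver):
    """Return [href, datetime] for every a.card, reading both from the same card."""
    pairs = []
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR constant
    for card in driver.find_elements("css selector", "a.card"):
        time_el = card.find_element("css selector", "time")
        pairs.append([card.get_attribute("href"), time_el.get_attribute("datetime")])
    return pairs
```

Searching from card rather than from driver restricts the lookup to that card's descendants, which is what keeps the two values paired.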

Scrape table with div class by sibling, if data found

I would like to scrape an HTML table whose elements are in <div class="..."> format. To scrape it, I think I'll need something like:
if driver.find_element_by_xpath finds an element whose class contains footable-row-detail-name
get the value from its following-sibling (class="footable-row-detail-value")
This is just one table. The site I'm scraping has a lot of tables, and some tables don't have all the data (that's why "if found").
I would like to use Python 3 for that.
I hope I explained it well. The HTML code for one table:
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
197. Omeopatia, 202. Linfodrenaggio manuale, 205. Massaggio classico, 664. Riflessoterapia generale
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cognome:
</div>
<div class="footable-row-detail-value">
ABBONDANZIERI Katia
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Via:
</div>
<div class="footable-row-detail-value">
Place du Cirque, 2
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
NPA:
</div>
<div class="footable-row-detail-value">
1204
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Luogo:
</div>
<div class="footable-row-detail-value">
Genève
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Tel / Cellulare:
</div>
<div class="footable-row-detail-value">
022 328 23 44
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Cellulare:
</div>
<div class="footable-row-detail-value">
079 601 92 75
</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">
Discipline(s) thérapeutique(s):
</div>
<div class="footable-row-detail-value">
<div class="thZone">
<div class="zCat">
METHODES DE MASSAGE
</div>
<div class="zThr">
Linfodrenaggio manuale
</div>
<div class="zThr">
Massaggio classico
</div>
<div class="zCat">
METHODES PRESCRIPTIVES
</div>
<div class="zThr">
Omeopatia
</div>
<div class="zCat">
METHODES REFLEXES
</div>
<div class="zThr">
Riflessoterapia generale
</div>
</div>
</div>
</div>
Any help is appreciated.

This runs for me. I am using Jupyter and running it line by line. You might hit errors when elements haven't loaded yet, so adjust the waits if that happens.
from io import StringIO
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get("http://asca.ch/Partners.aspx?lang=it")
# choose the canton, accept the disclaimer and submit the search
cantone = driver.find_element(By.XPATH, '//*[@id="ctl00_MainContent_ddl_cantons_Input"]')
cantone.click()
cantone.send_keys('GE')
cantone.send_keys(Keys.ENTER)
confermo = driver.find_element(By.XPATH, '//*[@id="MainContent__chkDisclaimer"]')
confermo.click()
ricera = driver.find_element(By.XPATH, '//*[@id="MainContent_btn_submit"]')
ricera.click()
# wait until the row toggles exist, then expand every row
toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")
print(toggle)
while not toggle:
    time.sleep(.2)
    toggle = driver.find_elements(By.CLASS_NAME, "footable-toggle")
for r in toggle:
    time.sleep(.2)
    r.click()
# wait until the expanded detail cells exist
data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")
while not data:
    time.sleep(.2)
    data = driver.find_elements(By.CLASS_NAME, "footable-row-detail-cell")
list_df = []
for r in data:
    # turn the div-based layout into table markup that pandas can parse
    datum = r.get_attribute('innerHTML')\
        .replace("""<div class="footable-row-detail-inner">""","<table>")\
        .replace("""<div class="footable-row-detail-row">""","<tr>")\
        .replace("""<div class="footable-row-detail-name">""","<td>")\
        .replace("""<div class="footable-row-detail-value">""","</td><td>")
    # each parsed table has two columns (name, value); store one dict per record
    list_df.append(dict(pd.read_html(StringIO(datum))[0].values.tolist()))
df = pd.DataFrame(list_df)
df.to_csv('data.csv')
print(df)

One solution in Python 3 is the standard-library html.parser module!
There is a simple example to get you started :)
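As a rough sketch of what that could look like for the table in the question: a small html.parser subclass that pairs each footable-row-detail-name with the footable-row-detail-value that follows it. The class name DetailRowParser is hypothetical, and the sample markup is a shortened copy of the question's HTML. With Selenium, the string fed to the parser would come from driver.page_source.

```python
from html.parser import HTMLParser


class DetailRowParser(HTMLParser):
    """Collect (name, value) pairs from footable-row-detail-row blocks."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (name, value) pairs
        self._field = None  # "name" or "value" while inside such a div
        self._depth = 0     # <div> nesting depth inside the current field
        self._name = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._field is not None:
            self._depth += 1  # nested div inside a name/value cell
        elif "footable-row-detail-name" in classes:
            self._field, self._depth, self._name = "name", 1, []
        elif "footable-row-detail-value" in classes:
            self._field, self._depth, self._value = "value", 1, []

    def handle_endtag(self, tag):
        if tag != "div" or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            if self._field == "value":
                name = " ".join(self._name).rstrip(":")
                self.rows.append((name, ", ".join(self._value)))
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._field == "name":
            self._name.append(text)
        elif text and self._field == "value":
            self._value.append(text)


sample = """
<div class="footable-row-detail-inner">
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">Cognome:</div>
<div class="footable-row-detail-value">ABBONDANZIERI Katia</div>
</div>
<div class="footable-row-detail-row">
<div class="footable-row-detail-name">Discipline(s) thérapeutique(s):</div>
<div class="footable-row-detail-value"><div class="thZone">
<div class="zThr">Omeopatia</div><div class="zThr">Massaggio classico</div>
</div></div>
</div>
</div>
"""

parser = DetailRowParser()
parser.feed(sample)
record = dict(parser.rows)
print(record.get("Cognome"))
print(record.get("Luogo"))  # None: this row is absent from the sample
```

Using dict.get covers the "if found" case from the question: a row that is missing from a table simply yields None instead of raising an error.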

beautifulsoup: finding specific class name in nested div

I am trying to get reviews from the Agoda site for analysis using BeautifulSoup.
I inspected the page and saw that the reviews are in:
<div class="container-agoda">
<div class="a">
<div class="b">
<div class="c">
<div class="d">
<div class="col-xs-9 review-comment" data-selenium="comments-detail">
<div name="review-title" class="title" data-selenium="comments-title">
HAD 1 HOUR SLEEP
</div>
<div class="review-comment-section">
<div class="comment-detail" data-selenium="reviews-comments">
<span>Great location</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
but this class is nested inside 10+ other divs.
I have tried
for div in soup.findAll('div', attrs={"class":"comment-detail"}):
    print(div)
but it returns nothing.
Is there a way to match exactly on class="comment-detail" data-selenium="reviews-comments", or do you have any other suggestion?
Thank you.

Select an item from expandable class using selenium python

For the following html:
<ul id="dataset-menu" class="treeview">
<li id="cat_01" class="expandable"></li>
<li id="cat_02" class="collapsable">
<div class="hitarea collapsable-hitarea"></div>
<span class="folder" title=""></span>
<ul style="display: block;">
<li></li>
<li>
<span class="collection">
<div class="cell">
<input id="coll_5555" class="dataset_checkbox" type="checkbox" name="dataset_checkbox" value="5555"></input>
</div>
<div class="cell"></div>
</span>
</li>
<li class="last"></li>
</ul>
</li>
<li id="cat_03" class="expandable"></li>
I have to select the item where the following occurs:
<input id="coll_5555" class="dataset_checkbox" type="checkbox" name="dataset_checkbox" value="5555"></input>
Any idea please?

As I understand it, you first have to click on the li with id cat_02 and then click on the checkbox.
from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, "#cat_02 div.collapsable-hitarea").click()
driver.find_element(By.ID, "coll_5555").click()
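In the HTML shown, cat_02 already has the class collapsable (its inner ul is displayed), so clicking its hit area unconditionally could collapse it again. A hedged sketch that expands the category only when it is still marked expandable and ticks the checkbox only when it is not yet selected. select_dataset is a hypothetical helper name, the selector div.hitarea for the toggle of a collapsed category is an assumption (the question only shows the expanded state), and driver is assumed to be a WebDriver already on the page.

```python
def select_dataset(driver, category_id, checkbox_id):
    """Expand the category if needed, tick the checkbox, return its final state."""
    # "id" and "css selector" are the string values of Selenium's By.ID
    # and By.CSS_SELECTOR constants
    category = driver.find_element("id", category_id)
    classes = (category.get_attribute("class") or "").split()
    if "expandable" in classes:
        # still collapsed: open it so the checkbox becomes clickable
        category.find_element("css selector", "div.hitarea").click()
    checkbox = driver.find_element("id", checkbox_id)
    if not checkbox.is_selected():
        checkbox.click()
    return checkbox.is_selected()
```

For the markup in the question the call would be select_dataset(driver, "cat_02", "coll_5555"). If the tree expands with an animation, a short wait before the checkbox click may be needed.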
