Python 3 - Selenium - scraping data from nested divs

I'm reasonably new to Python and I'm trying to check which div class appears first on a page. I've done this with table rows, but I can't wrap my head around how to do it with divs.
What I'm trying to determine is whether the latest update is an email sent (<div class="EMAIL SENT">) or a note added (<div class="Notes">). The most recent item appears first from the top, but other actions may have taken place since then, for example <div class="Updated">.
I haven't managed to write any code that does this or even gets close, but in my head I imagine it working like this:
for sub_div in browser.find_elements_by_css_selector('.cb > div'):
    classname = sub_div.get_attribute('class')  # check name of sub_div class
    if classname == "EMAIL SENT":
        class_info = sub_div.find_element_by_class_name('plus_header_Additional_info').text
        print(class_info)  # output: EMAIL SENT :Email sent on 20-03-2016 00:22:09 by [REDACTED]
        trigger_1()
    if classname == "Notes":
        trigger_2()
    # move on to next div class in list
Below is the page code I'm trying to work with. I'd be really appreciative of any advice or assistance anyone can provide.
<div class="cb" style="margin:5px 0 0 0;">
<div class="Updated">
<div class="plus_header_Additional_info">Updated :Incident Updated on 20-03-2016 00:22:52 by User = [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_0">
<div>
Assigned to STRIKE1,
by User = [REDACTED].
</div>
<br>
</div>
</div>
<div class="Updated">
<div class="plus_header_Additional_info">Updated :PEND CLIENT STRIKE - 1 added on 20-03-2016 00:22:36 by [REDACTED].
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_1">
<div>
</div>
<br>
</div>
</div>
<div class="EMAIL SENT">
<div class="plus_header_Additional_info">EMAIL SENT :Email sent on 20-03-2016 00:22:09 by [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_2">
<div>
To :- [NAME]@[DOMAIN].CO.UK Subject: Ticket - [IN-000999999] Description : Dear User,
[REDACTED]
</div>
<br>
</div>
</div>
<div class="Updated">
<div class="plus_header_Additional_info">Updated :Incident Updated on 12-03-2016 10:56:15 by User = [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_3">
<div>
Status:- PROGRESSING changed to PEND CLIENT,
Assigned to SOFTWARE DEPLOYED,
by User = [REDACTED].
</div>
<br>
</div>
</div>
<div class="Notes">
<div class="plus_header_Additional_info">Notes :Notes Added on 12-03-2016 10:55:53 by [REDACTED].
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_4">
<div>
<textarea id="notes4" name="notes1" cols="" class="emailForm_input1" style="width: 97%; overflow: hidden; word-wrap: break-word; resize: horizontal; height: 237px;" readonly="readonly">Hello,
[REDACTED]
</textarea>
</div>
<br>
</div>
</div>
</div>

Use an or condition in an XPath expression:
.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[0]
If Notes comes first you will get Notes and vice versa.
If we change your HTML snippet a bit, adding the text "in email" to <div class="EMAIL SENT"> and changing a later tag's class to <div class="Notes"> with the text "in notes", we can see how it works using lxml:
In [13]: from lxml.etree import fromstring, HTMLParser
In [14]: xml = fromstring(html, HTMLParser())
In [15]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")
Out[15]: [<Element div at 0x7f96598d4ea8>, <Element div at 0x7f96598d4ef0>]
In [16]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[0].text
Out[16]: 'in email\n '
In [17]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[1].text
Out[17]: 'in notes\n
So with Selenium you just want to find the element by XPath.
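With Selenium, the same expression (written with @class attribute tests) can be passed to find_elements. The following is a minimal sketch, not the answerer's code: first_update_class is a hypothetical helper name, and the locator string "xpath" is the value of Selenium's By.XPATH constant, written out so the snippet needs no import of its own.

```python
# XPath from the answer: any div whose class is exactly one of the two.
XPATH = "//div[@class='Notes' or @class='EMAIL SENT']"


def first_update_class(driver):
    """Return the class of the first matching div in page order, or None.

    `driver` is assumed to be a Selenium WebDriver (or anything that
    exposes find_elements(by, value)).
    """
    # "xpath" is the string behind selenium.webdriver.common.by.By.XPATH.
    matches = driver.find_elements("xpath", XPATH)
    if not matches:
        return None
    return matches[0].get_attribute("class")
```

With a live browser this would be called as first_update_class(browser). A result of "EMAIL SENT" would mean the email is more recent than any note, so trigger_1() from the question would run; "Notes" would lead to trigger_2().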

Related

Web Scraping of nested div elements with repeating class names

<div class="information_row" id="dashboard">
Statewise
<div class="info_title1">Cases Across India</div>
<div class="active-case">
<div class="block-active-cases">
<span class="icount">3,86,351</span>
<div class="increase_block">
<div class="color-green down-arrow">
2,157 <i></i>
</div>
</div>
</div>
<div class="info_label">Active Cases
<span class="per_block">(1.21%)</span>
</div>
</div>
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged
<div class="per_block">
(97.45%)
</div>
</div>
<span class="icount">3,12,20,981</span>
<div class="increase_block">
<div class="color-green up-arrow">
40,013 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock death_case">
<div class="iblock_text">
<div class="info_label">Deaths
<div class="per_block">
(1.34%)
</div>
</div>
<span class="icount">4,29,179</span>
<div class="increase_block">
<div class="color-red up-arrow">
497 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases
<div class="per_block"></div>
</div>
<span class="icount">3,20,36,511</span>
<div class="increase_block">
<div class="color-red up-arrow">
38,353 <i></i>
</div>
</div>
</div>
</div></div>
I am working on a web scraping project using Python and BeautifulSoup. As a beginner, I am unable to parse the data I need (numerical statistics on COVID) because the class names that contain the numerical data, such as icount, per_block, and increase_block, are repeated and not unique. What I want is to parse these numbers and store them in separate variables, like below:
Total_cases = 3,20,36,511
Total_cases_in_last_24_hrs = 38,353, and likewise for all the other categories (discharged, deaths, active cases).
Here is my code:
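import requests
from bs4 import BeautifulSoup

URL = 'https://www.mygov.in/covid-19/'
# Assumption: the original post did not show how headers was defined.
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(URL, headers=headers)
clean_data = BeautifulSoup(page.text, 'html.parser')
# All spans that hold a count, in page order.
span = clean_data.find_all('span', class_='icount')
# Narrow to the "Total Cases" block first, then take the count inside it.
total_cases = clean_data.find('div', class_='t_case').find('span', class_='icount').get_text()
print(total_cases)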
I have been working on this for a long time but could not find a solution. Please help.
The reference HTML above is taken from the website.
Thank you.

One possible solution is to select all text from class="t_case" and split the text:
import requests
from bs4 import BeautifulSoup
url = "https://www.mygov.in/covid-19/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
    soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)
Prints:
3,20,36,511
38,353
Or:
t_case = soup.select_one(".t_case")
total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")
print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))
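The values come back as strings with Indian-style digit grouping (for example 3,20,36,511). If they need to be stored as numbers, as the question asks, a small conversion step could follow. This is a sketch with a hypothetical helper name, to_int; it assumes the text holds only digits, commas, and surrounding whitespace.

```python
def to_int(text):
    """Convert a comma-grouped count such as '3,20,36,511' to an int."""
    return int(text.strip().replace(",", ""))


# Sample strings copied from the HTML in the question.
total_cases = to_int("3,20,36,511")
total_cases_in_last_24_hrs = to_int(" 38,353 ")
print(total_cases, total_cases_in_last_24_hrs)  # 32036511 38353
```

The same helper could be applied to total_cases.get_text(strip=True) and new_cases.get_text(strip=True) from the second variant above.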

Selenium python how to upload file when there is no input type file?

I am using Selenium with Python to automate a site. The problem I face is that I have to upload a file, but there is no input of type file on which I could use the send_keys() method.
The file upload element:
<div id="data-assets-interior-file-upload" data-upload-properties="{"formId":"form-main-1","path":"data[assets][interior]","modalUpload":"Uploading...","warnExtensionsStrings":{"pdf":"<div class=\"a-box a-alert-inline a-alert-inline-warning\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"><\/i><div class=\"a-alert-content\">\n Most PDF files do not produce great results in an automated conversion process. We recommend using a Word, Mobi, ePUB or HTML file if you have one. <a href=\"\/en_US\/help\/topic\/A2GF0UFHIYG9VQ?ref_=_fg\" target=\"_blank\" rel=\"noopener noreferrer\">Learn more.<\/a>\n <\/div><\/div><\/div>\n <div id=\"file-warn-actions\" class=\"a-form-actions a-spacing-none a-spacing-top-large\">\n <div class=\"a-row a-spacing-none\">\n <div class=\"a-column a-span6\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-continue\" data-potter-file-warn-extension-continue=\"{}\">\n <span id=\"file-warn-extension-continue\" class=\"a-button a-button-base button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-continue-announce\" class=\"a-button-text\" type=\"button\">\n Continue with PDF\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <div class=\"a-column a-span6 a-span-last\">\n <span class=\"a-declarative\" data-action=\"potter-file-warn-extension-cancel\" data-potter-file-warn-extension-cancel=\"{}\">\n <span id=\"file-warn-extension-cancel\" class=\"a-button a-button-primary button-fill\"><span class=\"a-button-inner\"><button id=\"file-warn-extension-cancel-announce\" class=\"a-button-text\" type=\"button\">\n I have another format\n <\/button><\/span><\/span>\n <\/span>\n <\/div>\n <\/div>\n <\/div>","pdf-header":"Do you have another 
format?"},"acceptedExtensions":"doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf","multipart":null,"persistSuccess":true,"warnExtensions":["pdf"],"key":"save","url":"\/en_US\/title-setup\/kindle\/A3U1YUNVSBYTMZ\/content\/action\/save","workflowId":"assets.interior","assetType":"DIGITAL_BOOK_BLOCK"}" class="a-section jele-file-field">
<div class="a-section a-spacing-none file-upload-options-section">
<p class="a-spacing-small">
</p>
<div class="a-row file-upload-extra-info-message-section">
<div class="a-column a-span12">
<div class="a-box a-alert a-alert-info"><div class="a-box-inner a-alert-container"><i class="a-icon a-icon-alert"></i><div class="a-alert-content">Use Kindle Create to transform your manuscript to an eBook with professional book themes, images, and Table of Contents. Click here to download for free.</div></div></div>
</div>
</div>
<br/>
<div class="a-row file-upload-browse-section">
<div class="a-column a-span12">
<span class="a-declarative" data-action="browse-clicked" data-browse-clicked="{"signInRequired":false,"id":"data-assets-interior-file-upload"}">
<span id="data-assets-interior-file-upload-browse-button" class="a-button a-button-primary file-upload-browse-button"><span class="a-button-inner"><button id="data-assets-interior-file-upload-browse-button-announce" class="a-button-text" type="button">
Upload Book
</button></span></span>
</span>
<span class="a-declarative" data-action="file-selected" data-file-selected="{}" id="data-assets-interior-uploader">
<span class="fileuploader a-hidden"></span>
</span>
<p class="a-spacing-top-small a-size-mini a-color-tertiary a-text-italic">
</p>
</div>
</div>
<input type="hidden" name="" value="doc,docx,zip,htm,html,mobi,azw,epub,rtf,txt,pdf,kpf" id="data-assets-interior-file-upload-accepted-extensions" class="accepted-extensions"/>
</div> </div>
Can anyone tell me how to handle this scenario? If you are going to recommend another library for it, please post relevant Python examples as well. Thank you.

Way to select certain div among other divs that contains certain text in Selenium python

I'm using Selenium to write online letters to my friends in the army. The website offers no API whatsoever.
Quite a few of my friends are in the army, and I want to choose who to send the letter to.
Let's say my friend's name is Howard from now on.
The selection page is structured as follows.
Each friend has his own card-styled div; all of them share the same class (cafe-card-box) and have no id or name.
All the divs sit in a slider that is horribly coded. For some reason, divs are duplicated several times invisibly. There are 2-3 divs for Howard alone.
The order of the divs is not the same across users.
The soldier's name is in cafe-card-box (class) -> flex (class) -> profile-wrap (class) -> id (class) -> span (tag only). All the divs are identical except for the content of the <span>.
Some elements with class="id" are blank. The span tag holds not only the name but also how long the soldier has been in the army, like this:
Jacob (Been in for 2 weeks)
Initial approach
Initially, I wrote this code:
cafes = self.driver.find_elements_by_class_name("cafe-card-box")
for cafe in cafes:
    cf_name = cafe.find_element_by_class_name("id").text[0:3]  # Almost every Korean name is 3 characters.
    if cf_name == soldier_name:
        print("found.")
        # "위문편지" is the link text of the letter button ("letter of comfort").
        cafe.find_element_by_link_text("위문편지").click()
        break
    else:
        print("It's not the one. Moving to the next ID class.")
This worked as expected, provided that the name is somewhere in the divs. The problem is that the program needs to work even when the name is wrong. I later tried this code:
while n <= len(cafes):
    n = n + 1
    try:
        for cafe in cafes:
            cf_name = cafe.find_element_by_class_name("id").text[0:3]
            if cf_name == soldier_name:
                print("Found!")
                cafe.find_element_by_link_text("위문편지").click()
                ps(3)
                break
    except:
        print("Can't find anyone.")
        self.driver.quit()
        quit()
This didn't work at all. In retrospect, the first code, which actually worked, doesn't look right either. I now want to loop through each card div, check whether the name matches, switch to that div if it does, and click the button in that specific div.
Is this possible? If so, how? I feel like I've tried everything.
Side Question
Is there a better way to extract the name from the <span>?
cafe.find_element_by_class_name("id").text[0:3]
This doesn't seem very professional. All the names are separated by one blank space.
Edit
Adding HTML code.
<div class="group">
<div class="section-title bd_gradation">
<strong class="title">내 카페 <em>(2)</em></strong>
</div>
<div class="swiper-container cafe-slide-wrap swiper-container-horizontal" id="divSlide1">
<div class="swiper-wrapper" style="transition-duration: 0ms; transform: translate3d(-1140px, 0px, 0px);"><div class="swiper-slide swiper-slide-duplicate swiper-slide-duplicate-active swiper-slide-prev" data-swiper-slide-index="0">
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20121590200','4737','0000140002');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20191122/nb3705@naver.com/"
,savedFileNm : "20191122092608029_ge1"
,extNm : "jpg"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20191122092608029_ge1.jpg" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4737','20121590200');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.12 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20020191700','4727','0000140001');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20200227/1234/"
,savedFileNm : "20200227104858343_ge1"
,extNm : "png"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20200227104858343_ge1.png" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4727','20020191700');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.11 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
</div>
<div class="swiper-slide swiper-slide-active swiper-slide-duplicate-next swiper-slide-duplicate-prev" data-swiper-slide-index="0">
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20121590200','4737','0000140002');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20191122/nb3705@naver.com/"
,savedFileNm : "20191122092608029_ge1"
,extNm : "jpg"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20191122092608029_ge1.jpg" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4737','20121590200');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.12 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20020191700','4727','0000140001');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20200227/1234/"
,savedFileNm : "20200227104858343_ge1"
,extNm : "png"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20200227104858343_ge1.png" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4727','20020191700');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.11 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
</div>

You can find all div elements containing certain text:
from selenium import webdriver
driver = webdriver.Chrome()
# Some code...
divList = [div for div in driver.find_elements_by_tag_name('div') if 'The text to find' in div.get_attribute('innerText')]
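Building on the text filter above, the following sketch looks for one card by name and returns None instead of raising when nobody matches, which covers the wrong-name case from the question. The helper names extract_name and find_card are hypothetical; the class names come from the posted HTML, and the locator strings "class name" and "css selector" are the values of Selenium's By.CLASS_NAME and By.CSS_SELECTOR. It assumes, as the question states, that the name is separated from the rest of the label by a single space, and that the hidden duplicate cards report is_displayed() as False, which may not hold for slides that are merely off-screen.

```python
def extract_name(label):
    """Return the part of 'Jacob (Been in for 2 weeks)' before the first space."""
    return label.strip().split(" ", 1)[0]


def find_card(driver, soldier_name):
    """Return the first visible cafe-card-box whose name matches, else None."""
    for card in driver.find_elements("class name", "cafe-card-box"):
        # The slider keeps hidden duplicates of each card; skip those.
        if not card.is_displayed():
            continue
        for span in card.find_elements("css selector", ".id span"):
            label = span.get_attribute("innerText") or ""
            if extract_name(label) == soldier_name:
                return card
    return None
```

A caller could then do card = find_card(self.driver, soldier_name) and, when the result is not None, click the letter link inside that card as in the original loop. A None result is the place to report that no one was found.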

I don't know if this is useful for you, but you could also use XPath, like this:
from selenium import webdriver
driver = webdriver.Chrome()
# Some code...
elementList = driver.find_elements_by_xpath('//div[contains(@class,"profile-wrap")]/div[@class="id"]/span[contains(text(),"NAME")]')
Please be aware of the following:
XPath is often slower than the other find_ methods, so prefer those when they are enough or when performance matters to the project.
XPath 1.0 has no built-in case-insensitive search, so you have to use translate() (see https://stackoverflow.com/a/8474109/3228768), and that may not suit the characters you need to find.
XPath can easily select the parent (and, repeated, higher ancestors) by appending /.. to the query.
NOTE:
As you can see, I used two different conditions to target a div by class, one with contains() and one without. The difference is that the second form matches only if the class name is the entire value of the "class" attribute.
You can extract the text from the elements returned by the XPath query to obtain a reliable text extraction.
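The case-insensitivity point can be made concrete with XPath 1.0's translate() function, which maps uppercase letters to lowercase before contains() runs. The sketch below only builds the expression string; ci_name_xpath is a hypothetical helper, it folds ASCII letters only, and it assumes the name contains no double-quote character.

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase


def ci_name_xpath(name):
    """Build an XPath that matches the name span regardless of ASCII case."""
    return (
        '//div[contains(@class,"profile-wrap")]/div[@class="id"]'
        f'/span[contains(translate(text(),"{UPPER}","{LOWER}"),"{name.lower()}")]'
    )


print(ci_name_xpath("Howard"))
```

The returned string can be passed to driver.find_elements_by_xpath(...) in the same way as the hand-written expression above.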

BeautifulSoup, find the only tag in the HTML that has no attribute

I know, from the title this question looks the same as a thousand others, but I have searched all the related and similar questions. What I'm asking is, given this HTML (just an example):
<html>
<body>
<div class="div-share noprint">
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
</div>
</div>
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="some img" class="playblk" height="25" src="othersource" title="some othertitle" width="25"/></span>
</a>
</div>
<div class="div-share">
<h1>"The Divine Wings Of Tragedy" lyrics</h1></div>,
<div class="pther">
<h2><b>Symphony X Lyrics</b></h2>
</div>
<div class="ringtone">
<span id="cf_text_top"></span>
</div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/>
<br/> On the edge of paradise
<br/> Tears of woe fall, cold as ice
<br/> Hear my cry
<br/>
</div>
</body>
</html>
I want to find the only tag that has no attributes. Not an empty attribute, as I saw in other questions, nor some specific attribute, nor attrs = None: the tag has nothing at all. But if I use findAll, I get all the other tags in the HTML as well, and the same happens with attrs = False, attrs = None, and so on.
So is there a way to do this?
Thanks a lot!

You can pass a lambda function to the find_all method that checks the tag name and that there are no attrs within the element:
soup.find_all(lambda tag: tag.name == 'div' and not tag.attrs)

Unable to get whole row from BeautifulSoup

I've been practicing my scraping and everything was going fine, but however hard I try I can't seem to get the specific data I'm looking for.
The structure looks like this:
</div>
<div class="col-xs-12 col-sm-12 col-md-7 list-field-wrap">
<div class="pull-left">
<div class="row">
<div class=" list-field type-field" style="width: 45px"><div class="visible-xs-block visible-sm-block list-label">BIB</div>17584</div>
<div class=" list-field type-age_class" style="width: 65px"><div class="visible-xs-block visible-sm-block list-label">Division</div>20-24</div>
</div>
</div>
What I want to do is get the 17584 that sits next to the div with class = "visible-xs-block visible-sm-block list-label".
Unfortunately, every time I try to select it, it only returns
<div class="visible-xs-block visible-sm-block list-label">BIB</div>
This is the code I've been using to select it:
bib = soup.find('div', class_="visible-xs-block visible-sm-block list-label")
print(bib)
Edit: I was able to figure it out; the structure starts earlier.

17584 is not part of the tag with class visible-xs-block visible-sm-block list-label:
<div class=" list-field type-field" style="width: 45px">
<div class="visible-xs-block visible-sm-block list-label">
BIB
</div>
17584
</div>
Try to select list-field type-field instead.
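To get only 17584 without the BIB label, one option is to keep just the text nodes that sit directly inside the outer div. The sketch below relies on the fact that BeautifulSoup's text nodes (NavigableString) are str subclasses while nested tags are not; own_text is a hypothetical helper name, not part of bs4.

```python
def own_text(tag):
    """Join the text nodes that are direct children of tag, ignoring nested tags."""
    # tag is assumed to be a bs4 Tag (anything with a .children iterable works).
    return "".join(child for child in tag.children if isinstance(child, str)).strip()
```

With the soup from the question this could be called as own_text(soup.find('div', class_='type-field')), which for the posted snippet should give '17584'.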
