Unable to get whole row from BeautifulSoup


I've been practicing web scraping and everything was going fine, but however hard I try I can't get one specific piece of data.
The structure looks like this:
</div>
<div class="col-xs-12 col-sm-12 col-md-7 list-field-wrap">
<div class="pull-left">
<div class="row">
<div class=" list-field type-field" style="width: 45px"><div class="visible-xs-block visible-sm-block list-label">BIB</div>17584</div>
<div class=" list-field type-age_class" style="width: 65px"><div class="visible-xs-block visible-sm-block list-label">Division</div>20-24</div>
</div>
</div>
What I want is the 17584 that follows the div with class="visible-xs-block visible-sm-block list-label".
Unfortunately, every time I try to select it, it only returns
<div class="visible-xs-block visible-sm-block list-label">BIB</div>
This is the code I've been using to select it:
bib = soup.find('div', class_="visible-xs-block visible-sm-block list-label")
print(bib)
Update: I was able to figure it out. The selection has to start earlier in the structure.

17584 is not part of the tag with class visible-xs-block visible-sm-block list-label:
<div class=" list-field type-field" style="width: 45px">
<div class="visible-xs-block visible-sm-block list-label">
BIB
</div>
17584
</div>
Try to select list-field type-field instead.
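A minimal sketch of that suggestion, assuming the markup is exactly as posted in the question (the `html` string below is copied from it, and the variable names are only illustrative). It selects the parent cell and then reads the text node that follows the label div:

```python
from bs4 import BeautifulSoup

# Markup copied from the question, reduced to the relevant row.
html = """
<div class="row">
<div class=" list-field type-field" style="width: 45px"><div class="visible-xs-block visible-sm-block list-label">BIB</div>17584</div>
<div class=" list-field type-age_class" style="width: 65px"><div class="visible-xs-block visible-sm-block list-label">Division</div>20-24</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the parent cell: it owns both the label div and the bare text node.
cell = soup.select_one("div.list-field.type-field")

# The label is the nested div; the value is the text node right after it.
label = cell.find("div", class_="list-label")
bib = label.next_sibling.strip()

print(label.get_text(strip=True), bib)
```

Reading `next_sibling` keeps the label and the value separate; calling `cell.get_text()` instead would return both joined together.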

Related

Web Scraping of nested div elements with repeating class names

<div class="information_row" id="dashboard">
Statewise
<div class="info_title1">Cases Across India</div>
<div class="active-case">
<div class="block-active-cases">
<span class="icount">3,86,351</span>
<div class="increase_block">
<div class="color-green down-arrow">
2,157 <i></i>
</div>
</div>
</div>
<div class="info_label">Active Cases
<span class="per_block">(1.21%)</span>
</div>
</div>
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged
<div class="per_block">
(97.45%)
</div>
</div>
<span class="icount">3,12,20,981</span>
<div class="increase_block">
<div class="color-green up-arrow">
40,013 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock death_case">
<div class="iblock_text">
<div class="info_label">Deaths
<div class="per_block">
(1.34%)
</div>
</div>
<span class="icount">4,29,179</span>
<div class="increase_block">
<div class="color-red up-arrow">
497 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases
<div class="per_block"></div>
</div>
<span class="icount">3,20,36,511</span>
<div class="increase_block">
<div class="color-red up-arrow">
38,353 <i></i>
</div>
</div>
</div>
</div></div>
I am working on a web scraping project using Python and BeautifulSoup. As a beginner, I am unable to parse the data I need (numerical statistics on COVID) because the class names that contain the numbers, such as icount, per_block, and increase_block, are repeated rather than unique. What I want is to parse and store only these numbers in separate variables, like this:
Total_cases = 3,20,36,511
Total_cases_in_last_24_hrs = 38,353
and likewise for all the other categories (discharged, deaths, active cases).
Here is my code:
URL = 'https://www.mygov.in/covid-19/'
page = requests.get(URL,headers=headers)
clean_data=BeautifulSoup(page.text,'html.parser')
span=clean_data.findAll('span',class_='icount')
#print(clean_data)
total_cases = clean_data.find("div",class_="iblock t_case",attrs={'spanclass':'icount'}).get_text()
print(total_cases)
I have been working on this for a long time but could not find a solution. Please help.
The HTML above is the reference markup from the website.
Thank you.

One possible solution is to select all text from class="t_case" and split the text:
import requests
from bs4 import BeautifulSoup
url = "https://www.mygov.in/covid-19/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)
Prints:
3,20,36,511
38,353
Or:
t_case = soup.select_one(".t_case")
total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")
print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))
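The second approach extends to the other categories the question asks about. The sketch below is only illustrative: it assumes the block classes shown in the question's HTML (`discharge`, `t_case`), uses a trimmed copy of that markup instead of fetching the live page, and the `stats` dictionary and `to_int` helper are hypothetical names.

```python
from bs4 import BeautifulSoup

# Trimmed copy of two blocks from the question's HTML.
html = """
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged <div class="per_block">(97.45%)</div></div>
<span class="icount">3,12,20,981</span>
<div class="increase_block"><div class="color-green up-arrow">40,013 <i></i></div></div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases <div class="per_block"></div></div>
<span class="icount">3,20,36,511</span>
<div class="increase_block"><div class="color-red up-arrow">38,353 <i></i></div></div>
</div>
</div>
"""


def to_int(text):
    """Convert a string such as '3,20,36,511' to an integer."""
    return int(text.replace(",", ""))


soup = BeautifulSoup(html, "html.parser")

stats = {}
for css_class in ("discharge", "t_case"):
    # Scope every lookup to one category block so repeated class names are unambiguous.
    block = soup.select_one(f".{css_class}")
    total = block.select_one(".icount").get_text(strip=True)
    change = block.select_one(".color-red, .color-green").get_text(strip=True)
    stats[css_class] = (to_int(total), to_int(change))

print(stats)
```

Scoping each lookup to its own block is what makes the repeated `icount` class workable: the class is not unique on the page, but it is unique inside each category block.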

Way to select certain div among other divs that contains certain text in Selenium python

I'm using Selenium to write online letters to my friends in the army. The website offers no API whatsoever.
Quite a few of my friends are in the army, and I want to choose whom to send the letter to.
Let's say my friend's name is Howard from now on.
The selection page works like this:
Each friend has their own card-styled div; all of them share the same class (cafe-card-box), with no id or name.
All the divs are in a slider, which is horribly coded. For some reason, the divs are duplicated several times invisibly; there are 2-3 divs for Howard alone.
The order of the divs is not the same across users.
The soldiers' names are in cafe-card-box (class) -> flex (class) -> profile-wrap (class) -> id (class) -> span (tag only). All the divs are the same except for the content of the <span>.
Some randomly blank texts share class="id". The span tag contains not only the name but also how long the soldier has been in the army, like this:
Jacob (Been in for 2 weeks)
Initial approach
Initially, I wrote this code:
cafes = self.driver.find_elements_by_class_name("cafe-card-box")
for cafe in cafes:
    cf_name = cafe.find_element_by_class_name("id").text[0:3]  # Almost every Korean name is 3 characters.
    if cf_name == soldier_name:
        print("found.")
        cafe.find_element_by_link_text("위문편지").click()
        break
    else:
        print("It's not the one. Moving to the next ID class.")
This worked as expected, provided that the name is somewhere in the divs. The problem is that the program needs to work even when the name is wrong. I later tried this code:
while n<=len(cafes):
    n = n + 1
    try:
        for cafe in cafes:
            cf_name = cafe.find_element_by_class_name("id").text[0:3]
            if cf_name == soldier_name:
                print("Found!")
                cafe.find_element_by_link_text("위문편지").click()
                ps(3)
                break
    except:
        print("Can't find anyone.")
        self.driver.quit()
        quit()
This didn't work at all. And in retrospect, the first version, which did work, doesn't look very sound either. I now want to loop through each card div, check whether the name matches, switch the frame to it if it does, and click the button in that specific div.
Is this possible? If so, how? I feel like I've tried everything.
Side Question
Is there a better way to extract the name from the <span>?
cafe.find_element_by_class_name("id").text[0:3]
This doesn't seem very professional. All the names are separated from the rest of the text by one blank space.
Edit
Adding HTML code.
<div class="group">
<div class="section-title bd_gradation">
<strong class="title">내 카페 <em>(2)</em></strong>
</div>
<div class="swiper-container cafe-slide-wrap swiper-container-horizontal" id="divSlide1">
<div class="swiper-wrapper" style="transition-duration: 0ms; transform: translate3d(-1140px, 0px, 0px);"><div class="swiper-slide swiper-slide-duplicate swiper-slide-duplicate-active swiper-slide-prev" data-swiper-slide-index="0">
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20121590200','4737','0000140002');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20191122/nb3705#naver.com/"
,savedFileNm : "20191122092608029_ge1"
,extNm : "jpg"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20191122092608029_ge1.jpg" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4737','20121590200');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.12 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20020191700','4727','0000140001');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20200227/1234/"
,savedFileNm : "20200227104858343_ge1"
,extNm : "png"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20200227104858343_ge1.png" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4727','20020191700');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.11 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
</div>
<div class="swiper-slide swiper-slide-active swiper-slide-duplicate-next swiper-slide-duplicate-prev" data-swiper-slide-index="0">
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20121590200','4737','0000140002');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20191122/nb3705#naver.com/"
,savedFileNm : "20191122092608029_ge1"
,extNm : "jpg"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20191122092608029_ge1.jpg" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4737','20121590200');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.12 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
<!-- cafe-card-box -->
<div class="cafe-card-box">
<div class="flex">
<div class="photo-wrap" onclick="javascript:fn_selectListPost(1,'20020191700','4727','0000140001');" style="cursor: pointer;">
<script>
var filedata = {
fileTypeCd : "0000210002"
,thumb : thumbSizeMgr.unitMark
,filePath : "/images/upload/20200227/1234/"
,savedFileNm : "20200227104858343_ge1"
,extNm : "png"
};
document.write('<img src="'+fn_getFileSrcUrl(filedata)+'" alt="">');
</script><img src="./카페 메인_files/20200227104858343_ge1.png" alt="">
</div>
<div class="profile-wrap" onclick="javascript:fn_compMain('4727','20020191700');" style="cursor: pointer;">
<div class="id"><!-- 최대 2줄 -->
<span>{NAME CENSORED} (입영 2주차)</span>
</div>
<div class="cafe-sh-txt"><!-- 최대 2줄 -->
{PRIVATE INFO CENSORED}
</div>
<div class="cafe-sh-date"><!-- 최대 2줄 -->
<span>입영일 <em> 2020.07.06 </em></span>
<span>수료일 <em> 2020.08.11 </em></span>
</div>
</div>
</div>
<div class="btn-wrap">
위문편지
카페바로가기
</div>
</div>
<!-- //cafe-card-box -->
</div>

You can find all div elements containing certain text:
from selenium import webdriver
driver = webdriver.Chrome()
# Some code...
divList = [div for div in driver.find_elements_by_tag_name('div') if 'The text to find' in div.get_attribute('innerText')]

I don't know if this will be useful for you, but you could also use XPath, like this:
from selenium import webdriver
driver = webdriver.Chrome()
# Some code...
elementList = driver.find_elements_by_xpath('//div[contains(@class,"profile-wrap")]/div[@class="id"]/span[contains(text(),"NAME")]')
Please be aware of these facts:
XPath is the slowest of the "find_" methods; use it only if you have no other choice or if the project does not care much about performance.
XPath cannot perform a case-insensitive search, so you have to do a translation (see https://stackoverflow.com/a/8474109/3228768), which may not be suitable for the characters you need to find.
XPath can easily select ancestors by appending /.. to the query.
NOTE:
As you can see, I used two different conditions to target a div by class: one with contains() and one without. The difference is that the second form matches only if the class name is the sole value of the target's "class" attribute.
You can extract the text from the elements returned by the XPath query to obtain a reliable text extraction.
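For the side question about pulling the name out of the span text, one option is to split on the first blank space rather than slicing a fixed three characters. This is a sketch in plain Python. The helper names `extract_name` and `first_matching_index` are hypothetical, and the list of labels stands in for the `.text` values read from each card's `id` element.

```python
def extract_name(label):
    """Return the text before the first blank space, or '' for a blank label."""
    parts = label.split(maxsplit=1)
    return parts[0] if parts else ""


def first_matching_index(labels, soldier_name):
    """Return the index of the first card whose name matches, or None if there is none."""
    for index, label in enumerate(labels):
        if extract_name(label) == soldier_name:
            return index
    return None


# Example labels: a blank entry and a duplicated card, as described in the question.
labels = ["", "Jacob (Been in for 2 weeks)", "Howard (Been in for 5 weeks)", "Howard (Been in for 5 weeks)"]

print(extract_name(labels[1]))
print(first_matching_index(labels, "Howard"))
print(first_matching_index(labels, "Nobody"))
```

Returning `None` when nothing matches lets the calling code handle a wrong name explicitly instead of relying on an exception, and taking the first match sidesteps the duplicated slider cards.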

Any way to only extract specific div from beautiful soup

I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the BeautifulSoup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally, I want only the value of data-rarity, so just the 102 from this element as shown in the site's inspector:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>

Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    # Each r is already the card div, so read the attribute from it directly.
    print(r["data-rarity"])
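A self-contained sketch of attribute access on the matched tags, using a shortened copy of the question's markup. The second card, which has no `data-rarity` attribute, is a made-up example added to show the difference between `tag["attr"]` and `tag.get("attr")`.

```python
from bs4 import BeautifulSoup

# First card is shortened from the question; the second card is hypothetical.
html = """
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<span class="profileCards__level">lvl.9</span>
</div>
<div class="profileCards__card">
<span class="profileCards__level">lvl.1</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all("div", class_="profileCards__card")

# tag.get() returns None for a missing attribute, whereas tag["..."] raises KeyError.
rarities = [card.get("data-rarity") for card in cards]
print(rarities)
```

Attribute values come back as strings, so convert with `int()` if a number is needed.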

Python 3 - Selenium - scraping data from nested divs

I'm reasonably new to Python and I'm trying to check which div class appears first on a page. I've done this with table rows, but I can't wrap my head around how to do it with divs.
What I'm trying to determine is whether the latest update is an email sent (<div class="EMAIL SENT">) or a note added (<div class="Notes">). The most recent item appears first from the top, but other actions may have taken place since then, for example <div class="Updated">.
I haven't managed to write any code to do this or even get close, but in my head I imagine it working like this:
for sub_div_classes in browser.find_element_by_class_name('cb'):
classname = ~check name of sub_div_class
if classname = "EMAIL SENT":
class_info = browser.find_element_by_class_name('plus_header_Additional_info').text
print(class_info) ¬output: EMAIL SENT :Email sent on 20-03-2016 00:22:09 by [REDACTED]
trigger_1()
if classname = "Notes":
trigger_2()
~move on to next div class in list
Below is the page code I'm trying to work with. I'd be really appreciative of any advice or assistance anyone can provide.
<div class="cb" style="margin:5px 0 0 0;">
<div class="Updated">
<div class="plus_header_Additional_info">Updated :Incident Updated on 20-03-2016 00:22:52 by User = [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_0">
<div>
Assigned to STRIKE1,
by User = [REDACTED].
</div>
<br>
</div>
</div>
<div class="Updated">
<div class="plus_header_Additional_info">Updated :PEND CLIENT STRIKE - 1 added on 20-03-2016 00:22:36 by [REDACTED].
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_1">
<div>
</div>
<br>
</div>
</div>
<div class="EMAIL SENT">
<div class="plus_header_Additional_info">EMAIL SENT :Email sent on 20-03-2016 00:22:09 by [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_2">
<div>
To :- [NAME]@[DOMAIN].CO.UK Subject: Ticket - [IN-000999999] Description : Dear User,
[REDACTED]
</div>
<br>
</div>
</div>
<div class="Updated">
<div class="plus_header_Additional_info">Updated :Incident Updated on 12-03-2016 10:56:15 by User = [REDACTED]
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_3">
<div>
Status:- PROGRESSING changed to PEND CLIENT,
Assigned to SOFTWARE DEPLOYED,
by User = [REDACTED].
</div>
<br>
</div>
</div>
<div class="Notes">
<div class="plus_header_Additional_info">Notes :Notes Added on 12-03-2016 10:55:53 by [REDACTED].
<img src="images/minus.png" style="float:right;">
</div>
<div class="plus_content" style="display: block;" id="contentDivImg2_4">
<div>
<textarea id="notes4" name="notes1" cols="" class="emailForm_input1" style="width: 97%; overflow: hidden; word-wrap: break-word; resize: horizontal; height: 237px;" readonly="readonly">Hello,
[REDACTED]
</textarea>
</div>
<br>
</div>
</div>
</div>

Use an "or" in an XPath expression:
.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[0]
If Notes comes first you will get Notes, and vice versa.
If we change your HTML snippet a bit, adding the text "in email" to <div class="EMAIL SENT"> and changing a later tag to <div class="Notes"> with the text "in notes", we can see with lxml how it works:
In [13]: from lxml.etree import fromstring, HTMLParser
In [14]: xml = fromstring(html, HTMLParser())
In [15]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")
Out[15]: [<Element div at 0x7f96598d4ea8>, <Element div at 0x7f96598d4ef0>]
In [16]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[0].text
Out[16]: 'in email\n '
In [17]: xml.xpath("//div[@class='Notes' or @class='EMAIL SENT']")[1].text
Out[17]: 'in notes\n '
So with Selenium you just need to find the element by XPath.
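The same "first of either kind" idea can also be sketched with BeautifulSoup and a CSS selector list instead of XPath. This swaps in a different technique from the answer above, and the `html` string is a shortened version of the question's markup. Note that CSS treats class="EMAIL SENT" as two separate classes, EMAIL and SENT.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup; headers trimmed for brevity.
html = """
<div class="cb">
<div class="Updated"><div class="plus_header_Additional_info">Updated :Incident Updated</div></div>
<div class="EMAIL SENT"><div class="plus_header_Additional_info">EMAIL SENT :Email sent on 20-03-2016 00:22:09</div></div>
<div class="Notes"><div class="plus_header_Additional_info">Notes :Notes Added on 12-03-2016 10:55:53</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first element in document order that matches
# either selector in the comma-separated list.
first = soup.select_one("div.cb > div.Notes, div.cb > div.EMAIL.SENT")

kind = " ".join(first["class"])
header = first.select_one(".plus_header_Additional_info").get_text(strip=True)

if kind == "EMAIL SENT":
    print("latest is an email:", header)
elif kind == "Notes":
    print("latest is a note:", header)
```

With Selenium, the HTML string could come from the browser's `page_source` attribute, assuming the divs are present in the rendered page.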

Need help scraping items from a list with Scrapy using ancestor

I am trying to scrape details such as Contact, Location, Phone and Rate. The HTML is below. The list is dynamic, so sometimes only a few of the items, such as Contact and Location, appear on the page, while at other times all of them appear. I am thinking I can use the icon tag to get the required text, but I am unable to find any documentation on this. Any help would be highly appreciated.
Thanks in advance.
<div class="detail-all-label">
<i class="abc-Contact"></i>
<div class="detail-all-text"><b>Contact</b>: Ram Bahadur</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Location"></i>
<div class="detail-all-text"><b>Location</b>: Kathmandu</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Website"></i>
<div class="detail-all-text"><b>Website</b>: itworkremotely</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Phone"></i>
<div class="detail-all-text"><b>Phone</b>: 3283550121</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Rate"></i>
<div class="detail-all-text"><b>Rate</b>: €700 - 10000</div>
</div>

You can get all of the detail values that have a preceding b element inside the div with class="detail-all-text":
for detail in response.xpath("//div[@class='detail-all-text']/b"):
    # The label is the text inside <b>; the value is the text node that follows it.
    name = detail.xpath("text()").extract()[0]
    value = detail.xpath("following-sibling::text()").extract()[0]
    print(name, value)
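Because the list is dynamic, it can help to collect the pairs into a dictionary and look fields up by name. The sketch below shows the same label-plus-following-text pairing, but with BeautifulSoup rather than Scrapy selectors, so that it runs without a Scrapy response. The `html` string is a shortened copy of the question's markup and `details` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup: only two of the possible items are present.
html = """
<div class="detail-all-label">
<i class="abc-Contact"></i>
<div class="detail-all-text"><b>Contact</b>: Ram Bahadur</div>
</div>
<div class="detail-all-label">
<i class="abc-font abc-Location"></i>
<div class="detail-all-text"><b>Location</b>: Kathmandu</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

details = {}
for b in soup.select("div.detail-all-text > b"):
    name = b.get_text(strip=True)
    # The value is the text node after <b>, e.g. ": Ram Bahadur".
    value = str(b.next_sibling or "").strip().lstrip(":").strip()
    details[name] = value

print(details)

# Items missing from the page simply have no key; .get() returns None for them.
phone = details.get("Phone")
print(phone)
```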
