Any way to only extract specific div from beautiful soup - python

I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally I want only the value of the data-rarity attribute, so just the 102 part, from this section of the site's inspect-element output.
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>

Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    print(r["data-rarity"])
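If you need the attribute for every card on the page at once, a list comprehension works as well (a small sketch, assuming soup has already been built from the page HTML):
# collect the data-rarity value of every profile card
rarities = [card.get("data-rarity") for card in soup.find_all("div", class_="profileCards__card")]
print(rarities)  # e.g. ['102', ...]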

Related

Extract URLs from a class using Scrapy

I am trying to use Scrapy to get a list of URLs from this website. I have the class of the div and I want all the a tags inside it.
Here is the link to the website; I am trying to get the URL for each profile:
https://www.letsmakeaplan.org/find-a-cfp-professional?limit=10&pg=1&sort=random&distance=5
This is the code I'm using to try to pull the URLs from the page above:
sel = Selector(text=driver.page_source)
books1 = sel.xpath("//div[@class='faceted-search-results-container-listing']/a/@herf").extract()
This comes back empty.
This is the HTML from the website:
<div class="faceted-search-results-container-listing" style="">
<a href="/find-a-cfp-professional/certified-professional-profile/a9a0ca36-3c70-4ea4-a853-7f704fe4cc98" class="find-cfp-item js-card-link">
<div class="find-cfp-item-top">
<div class="h5 find-cfp-item-name">C. H. Simmons, CFP®</div>
<div class="find-cfp-item-read-more"><span>view details</span></div>
</div>
<div class="find-cfp-item-bottom">
<div class="find-cfp-item-column" data-column="1">
<img src="https://login.cfp.net/eweb/photos/91475.jpg" data-default-img="/-/media/feature/cfp/lmapprofile/default-profile-avatar.jpeg" data-default-img-backup="/images/default-profile-avatar.jpeg" alt="C. Simmons Headshot" class="find-cfp-item-headshot" onerror="handleImg(this, event);">
<div class="find-cfp-item-text">
Simmons and Starzl Wealth Management<br>
110 Bay St<br>
Gadsden, AL 35901-5229<br>
</div>
</div>
<div class="find-cfp-item-column" data-column="2">
<div class="h6 find-cfp-item-column-heading">Planning Services Offered</div>
<div class="find-cfp-item-text" data-line-clamp="4">
Investment Planning, Retirement Planning
</div>
</div>
<div class="find-cfp-item-column" data-column="3">
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Client Focus</div>
<div class="find-cfp-item-text" data-line-clamp="1">
None Provided
</div>
</div>
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Minimum Investable Assets</div>
<div class="find-cfp-item-text" data-line-clamp="1">
$500,000
</div>
</div>
</div>
</div>
</a>
It looks like the search results come from an AJAX call to an API that returns JSON and are rendered dynamically.
You can get all of the information from the raw JSON data if you scrape the API URL instead...
scrapy.Request(url='https://www.letsmakeaplan.org/api/feature/lmapprofilesearch/search?limit=10&pg=1&sort=random&distance=5')
def parse(self, response):
    data = response.json()
    results = data["results"]
    links = [i["item_url"] for i in results]
    yield {'links': links}
output:
'/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421'
'/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c'
'/find-a-cfp-professional/certified-professional-profile/2d634f05-331e-4699-b1a8-96e7a20aa0bf'
'/find-a-cfp-professional/certified-professional-profile/d9074216-7321-469f-b42f-2988d84d4a2b'
'/find-a-cfp-professional/certified-professional-profile/7f55e98c-df27-4922-b3a4-07c341a87f65'
'/find-a-cfp-professional/certified-professional-profile/1b0377a2-4545-45af-9ac4-18a8af2ffecd'
'/find-a-cfp-professional/certified-professional-profile/66b78e79-608b-4079-86c2-d9ae84c3a762'
'/find-a-cfp-professional/certified-professional-profile/e884f42b-8239-475a-b55f-5bb6f1130a36'
'/find-a-cfp-professional/certified-professional-profile/b00abd44-5969-4f02-a052-e6ef34b60e9b'
'/find-a-cfp-professional/certified-professional-profile/10ae9e9f-f11e-4f79-91c4-05f24e0c7a0e'
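Put together as a minimal, self-contained spider it might look something like this (a sketch; the class and spider name are just illustrative):
import scrapy

class ProfilesSpider(scrapy.Spider):
    name = "profiles"  # illustrative name
    start_urls = ["https://www.letsmakeaplan.org/api/feature/lmapprofilesearch/search?limit=10&pg=1&sort=random&distance=5"]

    def parse(self, response):
        data = response.json()
        # each result carries its profile path under "item_url"
        links = [i["item_url"] for i in data["results"]]
        yield {"links": links}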

Beautifulsoup find_All command not working

I've got some HTML where a bit of it looks like this:
<div class="large-12 columns">
<div class="box">
<div class="header">
<h2>
Line-Ups
</h2>
</div>
<div class="large-6 columns aufstellung-box" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft">
<div>
<a class="vereinprofil_tooltip" href="/fc-portsmouth/startseite/verein/1020/saison_id/2006" id="1020">
<img alt="Portsmouth FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/1020_1564722280.png?lm=1564722280" title=" "/>
...........
<div class="large-6 columns" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft aufstellung-bordertop-small">
<div>
<a class="vereinprofil_tooltip" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
<img alt="Arsenal FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/11_1400911988.png?lm=1400911994" title=" "/>
</a>
</div>
<div>
<nobr>
<a class="sb-vereinslink" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
Arsenal FC
The key here is that <div class="large-6 shows up twice, and that is what I'm trying to split on.
The code I'm using is simply boxes = soup.find_All("div",{'class',re.compile(r'^large-6 columns')}), but that returns absolutely nothing.
I've used BeautifulSoup successfully plenty of times before and I'm sure it's something stupid that I'm missing, but I've been banging my head against a wall for the last 2 hours and can't seem to figure it out. Any help would be much appreciated.
My understanding is that Python is case sensitive, so I think you need soup.find_all rather than find_All. The code below ran with a working URL:
url = "https://####.###"
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
test = soup.find_all("a")
print(test)
When I changed the all to All, it broke with the following error:
test = soup.find_All("a")
TypeError: 'NoneType' object is not callable
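Applied to the snippet in the question, the corrected call could look something like this (a sketch, assuming html holds the Line-Ups markup; matching on the single class large-6 sidesteps the multi-valued class attribute):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html: the Line-Ups markup above
boxes = soup.find_all("div", class_="large-6")  # matches both large-6 columns divs
print(len(boxes))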

Web Scraping with BeautifulSoup -- Python

I need to scrape the code below to retrieve the portions that say "SCRAPE THIS" and "SCRAPE THIS AS WELL". I have been playing around with it for a few hours with no luck. Does anyone know how this can be done?
<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>
Try this code:
from bs4 import BeautifulSoup
text = """<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>"""
x = BeautifulSoup(text, 'lxml')
print(x.find('h4').get_text())
print(x.find('li').get_text())
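If the snippet contained several list items, find_all would collect them all; a small variation on the same idea:
items = [li.get_text(strip=True) for li in x.find_all('li')]
print(items)  # ['SCRAPE THIS AS WELL']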

Remove html after some point in Beautiful Soup

I have a problem. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> #a lot of tags after that moment
I want to get the values from every <span itemprop="address"> (there are a lot of them earlier in the document) up to the point where Top-10 today appears.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, however, you have some "addresses" before "Top-10 today" and some after, but you are only interested in those coming before it, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: we use find_next() on every "address" to check whether the "Top-10 today" heading exists after it.
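A quick usage sketch, assuming html_doc holds the page HTML from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
for span in soup.find_all(search_addresses):
    print(span.get_text(strip=True))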

Web scraping with Beautiful soup multiple tags

I'm trying to get multiple addresses from a web page that has an A-to-Z list of links.
First I get the A-to-Z links with:
URL = "http://www.example.com"
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]
This works great and in links above I have a list of links to each individual web page with multiple addresses on each separate page.
For getting the addresses I need I used:
for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = bfs(htmlss, "lxml")
        titl = soup.find('div', 'views-field-title').a.contents
        add = soup.find('div', 'views-field-address').span.contents
        zipp = soup.find('div', 'views-field-city-state-zip').span.contents
    except AttributeError:
        continue
The above code will take each link and get the first address on the page with all the A's and the first address on the page with all the B's and so on.
My problem is that on some of the pages there are multiple addresses on each page and the above code only retrieves the first address on that page i.e. First A address first B address and so on.
I've tried using soup.findAll, but it doesn't work with a.contents or span.contents.
Basically, I need to find the address lines in the HTML pages with non-unique tags. If I use soup.findAll I get all the content for, say, ('div', 'views-field-title'), which gives me a lot of content I don't need.
Example of some html:
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-earl-fry">
Chuck Corica Golf Complex, Earl Fry
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content"></span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
<div class="views-field-value"></div>
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-jack-clark">
Chuck Corica Golf Complex, Jack Clark
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content">
1 Clubhouse Memorial Rd
<br></br>
</span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
This is just a sample of similar html I need to find data for. Thanks
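One way to handle this (a sketch rather than a definitive answer, assuming the markup above): instead of calling find on the whole page, loop over each views-field-nothing wrapper and call find inside it, so every address block on a page gets visited rather than only the first one:
from urllib.request import urlopen
from bs4 import BeautifulSoup

for item in links[0:5]:  # links: the list built in the first snippet
    soup = BeautifulSoup(urlopen(item).read(), "lxml")
    # each address sits in its own views-field-nothing wrapper
    for block in soup.find_all("div", "views-field-nothing"):
        titl = block.find("div", "views-field-title")
        add = block.find("div", "views-field-address")
        zipp = block.find("div", "views-field-city-state-zip")
        if titl and titl.a:
            print(titl.a.get_text(strip=True))
        if add and add.span:
            print(add.span.get_text(strip=True))
        if zipp and zipp.span:
            print(zipp.span.get_text(strip=True))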
