I am trying to use scrapy to get a list of URLs from this website. I have the class of the div and I want all a tags in it.
Here is the link to the page; I am trying to get the URL for each profile.
https://www.letsmakeaplan.org/find-a-cfp-professional?limit=10&pg=1&sort=random&distance=5
This is the code I am using to try to pull the URLs from the page above:
sel = Selector(text=driver.page_source)
books1 = sel.xpath("//div[@class='faceted-search-results-container-listing']/a/@href").extract()
This comes back empty.
This is the HTML from the website:
<div class="faceted-search-results-container-listing" style="">
<a href="/find-a-cfp-professional/certified-professional-profile/a9a0ca36-3c70-4ea4-a853-7f704fe4cc98" class="find-cfp-item js-card-link">
<div class="find-cfp-item-top">
<div class="h5 find-cfp-item-name">C. H. Simmons, CFP®</div>
<div class="find-cfp-item-read-more"><span>view details</span></div>
</div>
<div class="find-cfp-item-bottom">
<div class="find-cfp-item-column" data-column="1">
<img src="https://login.cfp.net/eweb/photos/91475.jpg" data-default-img="/-/media/feature/cfp/lmapprofile/default-profile-avatar.jpeg" data-default-img-backup="/images/default-profile-avatar.jpeg" alt="C. Simmons Headshot" class="find-cfp-item-headshot" onerror="handleImg(this, event);">
<div class="find-cfp-item-text">
Simmons and Starzl Wealth Management<br>
110 Bay St<br>
Gadsden, AL 35901-5229<br>
</div>
</div>
<div class="find-cfp-item-column" data-column="2">
<div class="h6 find-cfp-item-column-heading">Planning Services Offered</div>
<div class="find-cfp-item-text" data-line-clamp="4">
Investment Planning, Retirement Planning
</div>
</div>
<div class="find-cfp-item-column" data-column="3">
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Client Focus</div>
<div class="find-cfp-item-text" data-line-clamp="1">
None Provided
</div>
</div>
<div class="find-cfp-item-column-inner">
<div class="h6 find-cfp-item-column-heading">Minimum Investable Assets</div>
<div class="find-cfp-item-text" data-line-clamp="1">
$500,000
</div>
</div>
</div>
</div>
</a>
It looks like the search results come from an AJAX call to an API that returns JSON, and are rendered dynamically.
You can get all of the information from the raw JSON data if you scrape the API URL instead:
import scrapy

class ProfileSpider(scrapy.Spider):
    name = "profiles"
    start_urls = ["https://www.letsmakeaplan.org/api/feature/lmapprofilesearch/search?limit=10&pg=1&sort=random&distance=5"]

    def parse(self, response):
        data = response.json()
        results = data["results"]
        links = [i["item_url"] for i in results]
        yield {'links': links}
output:
'/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421'
'/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c'
'/find-a-cfp-professional/certified-professional-profile/2d634f05-331e-4699-b1a8-96e7a20aa0bf'
'/find-a-cfp-professional/certified-professional-profile/d9074216-7321-469f-b42f-2988d84d4a2b'
'/find-a-cfp-professional/certified-professional-profile/7f55e98c-df27-4922-b3a4-07c341a87f65'
'/find-a-cfp-professional/certified-professional-profile/1b0377a2-4545-45af-9ac4-18a8af2ffecd'
'/find-a-cfp-professional/certified-professional-profile/66b78e79-608b-4079-86c2-d9ae84c3a762'
'/find-a-cfp-professional/certified-professional-profile/e884f42b-8239-475a-b55f-5bb6f1130a36'
'/find-a-cfp-professional/certified-professional-profile/b00abd44-5969-4f02-a052-e6ef34b60e9b'
'/find-a-cfp-professional/certified-professional-profile/10ae9e9f-f11e-4f79-91c4-05f24e0c7a0e'
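The item_url values above are site-relative. Inside a spider you would normally pass them through response.urljoin; offline, urllib.parse.urljoin does the same thing. A small sketch, assuming the API keeps the results/item_url shape shown above:

```python
from urllib.parse import urljoin

# Simulated slice of the API response: same shape as the real JSON,
# with two of the item_url values from the output above.
data = {
    "results": [
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/b1a27bac-77f0-4796-ab7f-7e15c19d8421"},
        {"item_url": "/find-a-cfp-professional/certified-professional-profile/e493f31f-88c7-4fdd-9863-9712ba85c95c"},
    ]
}

base = "https://www.letsmakeaplan.org"
# Join each relative path against the site root to get absolute profile URLs.
links = [urljoin(base, i["item_url"]) for i in data["results"]]
print(links[0])
```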
<div class="information_row" id="dashboard">
Statewise
<div class="info_title1">Cases Across India</div>
<div class="active-case">
<div class="block-active-cases">
<span class="icount">3,86,351</span>
<div class="increase_block">
<div class="color-green down-arrow">
2,157 <i></i>
</div>
</div>
</div>
<div class="info_label">Active Cases
<span class="per_block">(1.21%)</span>
</div>
</div>
<div class="iblock discharge">
<div class="iblock_text">
<div class="info_label"> Discharged
<div class="per_block">
(97.45%)
</div>
</div>
<span class="icount">3,12,20,981</span>
<div class="increase_block">
<div class="color-green up-arrow">
40,013 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock death_case">
<div class="iblock_text">
<div class="info_label">Deaths
<div class="per_block">
(1.34%)
</div>
</div>
<span class="icount">4,29,179</span>
<div class="increase_block">
<div class="color-red up-arrow">
497 <i></i>
</div>
</div>
</div>
</div>
<div class="iblock t_case">
<div class="iblock_text">
<div class="info_label">Total Cases
<div class="per_block"></div>
</div>
<span class="icount">3,20,36,511</span>
<div class="increase_block">
<div class="color-red up-arrow">
38,353 <i></i>
</div>
</div>
</div>
</div></div>
I am working on a web scraping project using Python and BeautifulSoup. As a beginner I am unable to parse the data I need (numerical statistics on COVID), because the class names that contain the numbers, like icount, per_block, and increase_block, are repeated rather than unique. What I want is to parse and store only these numerical values in separate variables, like so:
Total_cases = 3,20,36,511
Total_cases_in_last_24_hrs = 38,353
and likewise for all the other categories (discharged, deaths, active cases).
Here is my code:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.mygov.in/covid-19/'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; headers was undefined in the original snippet
page = requests.get(URL, headers=headers)
clean_data = BeautifulSoup(page.text, 'html.parser')
span = clean_data.findAll('span', class_='icount')
# print(clean_data)
total_cases = clean_data.find("div", class_="iblock t_case", attrs={'spanclass': 'icount'}).get_text()
print(total_cases)
I have been working on it for a long time but could not find a solution. Please help.
The reference HTML above is taken from the linked website.
Thank you.
One possible solution is to select all text from class="t_case" and split the text:
import requests
from bs4 import BeautifulSoup
url = "https://www.mygov.in/covid-19/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
_, total_cases, new_cases = (
    soup.select_one(".t_case").get_text(strip=True, separator="|").split("|")
)
print(total_cases)
print(new_cases)
Prints:
3,20,36,511
38,353
Or:
t_case = soup.select_one(".t_case")
total_cases = t_case.select_one(".icount")
new_cases = t_case.select_one(".color-red, .color-green")
print(total_cases.get_text(strip=True))
print(new_cases.get_text(strip=True))
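The same pair of selectors extends to the other categories the question asks about (active cases, discharged, deaths). A runnable sketch against a trimmed copy of the page HTML from the question:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, inlined so the technique
# can be shown without a network call.
html = """
<div class="information_row" id="dashboard">
  <div class="active-case"><div class="block-active-cases">
    <span class="icount">3,86,351</span>
    <div class="increase_block"><div class="color-green down-arrow">2,157</div></div>
  </div></div>
  <div class="iblock discharge"><span class="icount">3,12,20,981</span>
    <div class="increase_block"><div class="color-green up-arrow">40,013</div></div></div>
  <div class="iblock death_case"><span class="icount">4,29,179</span>
    <div class="increase_block"><div class="color-red up-arrow">497</div></div></div>
  <div class="iblock t_case"><span class="icount">3,20,36,511</span>
    <div class="increase_block"><div class="color-red up-arrow">38,353</div></div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

stats = {}
for name, selector in [
    ("active", ".active-case"),
    ("discharged", ".discharge"),
    ("deaths", ".death_case"),
    ("total", ".t_case"),
]:
    # Scope both lookups to the category's own block, so the repeated
    # icount / color-* class names never collide across categories.
    block = soup.select_one(selector)
    stats[name] = (
        block.select_one(".icount").get_text(strip=True),
        block.select_one(".color-red, .color-green").get_text(strip=True),
    )

print(stats["total"])  # ('3,20,36,511', '38,353')
```

Scoping each lookup to its parent block is what makes the non-unique class names workable.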
I am looking at scraping the below information using both Selenium and bs4, and was wondering: if I find the div tag below, is it possible to scrape the data inside the quotation marks? For example: data-room-type-code="SUK"
<div
class="sl-flexbox room-price-item hidden-top-border"
data-room-name="Superior Shard Room"
data-bed-type="K"
data-bed-name="King"
data-pay-type-tag-filter="No Prepayment"
data-cancel-tag-filter=""
data-breakfast-tag-filter=""
data-room-type-code="SUK"
data-rate-code="ZBAR"
data-price="430"
>
<div class="room-price-basic-info">
<div class="room-price-title title-regular">Flexible Rate / CustomStay</div>
<ul class="abstract text-regular">
<li>No Prepayment</li>
</ul>
<div
class="show-detail text-btn js-show-detail"
data-index="0-productRates-0"
>
OFFER DETAILS
</div>
</div>
<div class="room-price-book-info">
<div class="number text-medium">GBP 430</div>
</div>
<div class="boot-btn text-medium js-booking-room" data-type="PRICE">
Book Now
</div>
</div>
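Yes — once you have the tag, each attribute value is a dict-style lookup away in BeautifulSoup (with Selenium, the equivalent is element.get_attribute("data-room-type-code")). A minimal offline sketch using the div from the question, trimmed to its attributes:

```python
from bs4 import BeautifulSoup

# The opening tag from the question; attribute access does not
# depend on the element's children, so they are omitted here.
html = """
<div class="sl-flexbox room-price-item hidden-top-border"
     data-room-name="Superior Shard Room"
     data-bed-type="K"
     data-room-type-code="SUK"
     data-rate-code="ZBAR"
     data-price="430">
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="room-price-item")

print(div["data-room-type-code"])  # SUK
print(div["data-price"])           # 430
# div.attrs is a plain dict of every attribute, if you want all of them at once.
```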
I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally I want only the value after data-rarity, so just the 102 part, from this markup in the site's inspect element:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Use the data-rarity attribute of each matched tag directly (each r is already the card div, so there is no need to call find again):
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    print(r["data-rarity"])
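As a self-contained check, the same attribute lookup against the card markup from the question:

```python
from bs4 import BeautifulSoup

# The card markup from the question, inlined so this runs offline.
html = """
<div class="profileCards__cards">
  <div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
    <span class="profileCards__level">lvl.9</span>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each matched tag exposes its attributes by key, so data-rarity is a direct lookup.
rarities = [card["data-rarity"] for card in soup.find_all("div", class_="profileCards__card")]
print(rarities)  # ['102']
```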
I need to scrape the code below, to retrieve the portions that say "SCRAPE THIS" and "SCRAPE THIS AS WELL". I have been playing around with it for a few hours with no luck! Does anyone know how this can be done?
<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>
Try this code:
from bs4 import BeautifulSoup
text = """<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>"""
x = BeautifulSoup(text, 'lxml')
print(x.find('h4').get_text())
print(x.find('li').get_text())
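If the page holds several such blocks, the same find calls can be scoped per block so each heading stays paired with its own list item. A sketch with two copies of the question's block (the second block's heading and item are made-up placeholders):

```python
from bs4 import BeautifulSoup

# Two mod-inline blocks: the first is the snippet from the question,
# the second is a hypothetical sibling to show the multi-block case.
text = """
<div class="mod-inline mod-body-A-F"><h4>SCRAPE THIS</h4>
  <div class="mod-body"><ul class="list"><li>SCRAPE THIS AS WELL</li></ul></div></div>
<div class="mod-inline"><h4>ANOTHER HEADING</h4>
  <div class="mod-body"><ul class="list"><li>ANOTHER ITEM</li></ul></div></div>
"""
soup = BeautifulSoup(text, "html.parser")

# find within each block, not on the whole soup, so pairs stay aligned.
pairs = [
    (block.find("h4").get_text(), block.find("li").get_text())
    for block in soup.find_all("div", class_="mod-inline")
]
print(pairs)
```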
I'm trying to get multiple addresses from a web page with an A to Z of links.
First I get A to Z links with:
URL = "http://www.example.com"
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]
This works great and in links above I have a list of links to each individual web page with multiple addresses on each separate page.
For getting the addresses I need I used:
for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = bfs(htmlss, "lxml")
        titl = soup.find('div', 'views-field-title').a.contents
        add = soup.find('div', 'views-field-address').span.contents
        zipp = soup.find('div', 'views-field-city-state-zip').span.contents
    except AttributeError:
        continue
The above code will take each link and get the first address on the page with all the A's and the first address on the page with all the B's and so on.
My problem is that on some of the pages there are multiple addresses on each page and the above code only retrieves the first address on that page i.e. First A address first B address and so on.
I've tried using soup.findAll but it doesn't work with a.contents or span.contents.
Basically I need to find the address lines in the html pages with non-unique tags. If I use soup.findAll I get all the content for say (div, views-field-title) which gives me a lot of content I don't need.
Example of some html:
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-earl-fry">
Chuck Corica Golf Complex, Earl Fry
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content"></span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
<div class="views-field-value"></div>
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-jack-clark">
Chuck Corica Golf Complex, Jack Clark
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content">
1 Clubhouse Memorial Rd
<br></br>
</span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
This is just a sample of similar html I need to find data for. Thanks
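One way to get every address is to walk the page block by block: find each views-field-nothing wrapper first (one course per wrapper in the sample), then look up the title, address, and zip inside that one block, so the repeated class names never collide. A sketch against a trimmed copy of the sample HTML above:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample HTML from the question, so the sketch runs offline.
html = """
<div class="views-field-nothing">
  <span class="field-content">
    <div class="views-field-title"><span class="field-content">
      <a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-earl-fry">Chuck Corica Golf Complex, Earl Fry</a>
    </span></div>
    <div class="views-field-address"><span class="field-content"></span></div>
    <div class="views-field-city-state-zip"><span class="field-content">Alameda, California 94502-6502</span></div>
  </span>
</div>
<div class="views-field-nothing">
  <span class="field-content">
    <div class="views-field-title"><span class="field-content">
      <a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-jack-clark">Chuck Corica Golf Complex, Jack Clark</a>
    </span></div>
    <div class="views-field-address"><span class="field-content">1 Clubhouse Memorial Rd</span></div>
    <div class="views-field-city-state-zip"><span class="field-content">Alameda, California 94502-6502</span></div>
  </span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

courses = []
for block in soup.find_all("div", class_="views-field-nothing"):
    # Scope every lookup to this block so each title pairs with its own address.
    title = block.find("div", class_="views-field-title").a.get_text(strip=True)
    address = block.find("div", class_="views-field-address").span.get_text(strip=True)
    zipp = block.find("div", class_="views-field-city-state-zip").span.get_text(strip=True)
    courses.append((title, address, zipp))

for c in courses:
    print(c)
```

get_text(strip=True) also sidesteps the empty-address case: where .span.contents would be an empty list, it simply returns an empty string.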