After running this code
section = soup.find_all('section', class_='b-branches')
I get
<div class="b-branches__item"><i class="icon fa"><b>Firm</b> </i>RJT Roadlines</div>
Now I want to extract only RJT Roadlines, not Firm.
So I tried
for i in section:
    firm = i.find('div', class_='b-branches__item')
    print(firm)
This returns both Firm and RJT Roadlines.
So, how can I extract only the div tag's own text?
You can use tag.contents[1] to get your expected output.
Example:
from bs4 import BeautifulSoup
html = """
<div class="b-branches__item"><i class="icon fa"><b>Firm</b> </i>RJT Roadlines</div>
"""
soup=BeautifulSoup(html,'html.parser')
tag = soup.find('div', class_='b-branches__item')
print(tag.contents[1])
Output:
RJT Roadlines
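Note that tag.contents[1] relies on the text being the second child of the div. If the position may vary, a minimal sketch (assuming the same markup as above) is to keep only the strings that are direct children of the div; the variable name own_text is just for illustration:

```python
from bs4 import BeautifulSoup, NavigableString

html = '<div class="b-branches__item"><i class="icon fa"><b>Firm</b> </i>RJT Roadlines</div>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div', class_='b-branches__item')

# Join only the strings that sit directly inside the div,
# skipping nested tags such as <i> and <b>.
own_text = ''.join(
    child for child in tag.children if isinstance(child, NavigableString)
).strip()
print(own_text)
```

This should print only the div's own text, regardless of where the nested tags appear.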
I want to delete all divs without classes (but not the content that is in the div).
My input
<h1>Test</h1>
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
The output I want
<h1>Test</h1>
<div class="test">
<p>abc</p>
</div>
My try 1
Based on "Deleting a div with a particular class":
from bs4 import BeautifulSoup
soup = BeautifulSoup('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>', 'html.parser')
for div in soup.find_all("div", {'class':''}):
    div.decompose()
print(soup)
# <h1>Test</h1>
My try 2
from htmllaundry import sanitize
myinput = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
myoutput = sanitize(myinput)
print(myoutput)
# <p>Test</p><p>abc</p> instead of <h1>Test</h1><div class="test"><p>abc</p></div>
My try 3
Based on "Clean up HTML in python"
from lxml.html.clean import Cleaner
def sanitize(dirty_html):
    cleaner = Cleaner(remove_tags=('font', 'div'))
    return cleaner.clean_html(dirty_html)
myhtml = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
print(sanitize(myhtml))
# <div><h1>Test</h1><p>abc</p></div>
My try 4
from html_sanitizer import Sanitizer
sanitizer = Sanitizer() # default configuration
output = sanitizer.sanitize('<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>')
print(output)
# <h1>Test</h1><p>abc</p>
Problem: A div element is used to wrap the HTML fragment for the parser, therefore div tags are not allowed. (Source: Manual)
If you want to exclude div without class, preserving its content:
from bs4 import BeautifulSoup
markup = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
soup = BeautifulSoup(markup,"html.parser")
for tag in soup.find_all():
    # A div with no class attribute is the kind we want to skip
    empty = tag.name == 'div' and not tag.has_attr('class')
    if not empty:
        print(tag)
Output:
<h1>Test</h1>
<div class="test"><p>abc</p></div>
<p>abc</p>
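The loop above only prints the tags that are kept; it does not change the tree. If you want the modified document itself, one option is Tag.unwrap(), which removes a tag but leaves its children in place. A minimal sketch, assuming the same markup (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

markup = '<h1>Test</h1><div><div><div class="test"><p>abc</p></div></div></div>'
soup = BeautifulSoup(markup, 'html.parser')

# Collect the class-less divs first, then unwrap each one.
# unwrap() drops the tag itself and keeps its contents in the tree.
classless_divs = [d for d in soup.find_all('div') if not d.has_attr('class')]
for div in classless_divs:
    div.unwrap()

result = str(soup)
print(result)
# <h1>Test</h1><div class="test"><p>abc</p></div>
```

Unlike decompose(), which deletes the tag together with everything inside it, unwrap() keeps the inner content.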
Please check this out.
from bs4 import BeautifulSoup
data="""
<div>
<div>
<div class="test">
<p>abc</p>
</div>
</div>
</div>
"""
soup = BeautifulSoup(data, features="html5lib")
for div in soup.find_all("div", class_=True):
    print(div)
I'm trying to find longitudes and latitudes in this html code:
<div class="map-outer-wrap">
<div class="map-wrap" data-zoom="15" style="height:500px;" data-latitude="37.4418834" data-longitude="-122.14301949999998" data-style="color">
<div data-latitude="37.4418834" data-longitude="-122.14301949999998"></div>
</div>
View on Google Map
</div>
(The full page is here: https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/)
Not exactly knowing where to start to actually get to data-latitude and data-longitude, I tried to narrow down my search to get to the closest div (map-wrap), but even this returns an empty list.
parser = LinkParser()
data, links = parser.getLinks("https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/")
lnglat = BeautifulSoup(data, "lxml").findAll("div", {"class": "map-wrap"}).text
What's the proper way to retrieve the values of data-latitude and data-longitude in this page?
You can access a tag's attributes like key-value pairs.
Example:
s = """
<div class="map-outer-wrap">
<div class="map-wrap" data-zoom="15" style="height:500px;" data-latitude="37.4418834" data-longitude="-122.14301949999998" data-style="color">
<div data-latitude="37.4418834" data-longitude="-122.14301949999998"></div>
</div>
View on Google Map
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
print( soup.find("div", class_="map-wrap")["data-latitude"] )
print( soup.find("div", class_="map-wrap")["data-longitude"] )
Output:
37.4418834
-122.14301949999998
Try this:
from bs4 import BeautifulSoup
import requests
s = requests.get("https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/")
soup = BeautifulSoup(s.content, "lxml")
print(soup.find("div", class_="map-wrap")["data-latitude"])
print(soup.find("div", class_="map-wrap")["data-longitude"])
Output:
37.4418834
-122.14301949999998
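If the div or one of its attributes might be missing, indexing with [...] raises an error. A slightly more defensive sketch, assuming the same markup as in the question, looks the tag up once, uses .get(), and converts the values to floats (the names wrap, lat and lng are illustrative):

```python
from bs4 import BeautifulSoup

s = """
<div class="map-outer-wrap">
<div class="map-wrap" data-zoom="15" style="height:500px;" data-latitude="37.4418834" data-longitude="-122.14301949999998" data-style="color">
<div data-latitude="37.4418834" data-longitude="-122.14301949999998"></div>
</div>
View on Google Map
</div>"""

soup = BeautifulSoup(s, "html.parser")
wrap = soup.find("div", class_="map-wrap")

lat = lng = None
if wrap is not None:
    # .get() returns None instead of raising KeyError for a missing attribute
    raw_lat = wrap.get("data-latitude")
    raw_lng = wrap.get("data-longitude")
    if raw_lat is not None and raw_lng is not None:
        lat, lng = float(raw_lat), float(raw_lng)

print(lat, lng)
```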
I'd like to change some HTML tag names.
Please let me know how to do this with Python code:
As-is
<div class="title-title-1">hello</div>
<div class="text-body">i like you</div>
<div class="p">hehe</div>
To be
<title-title-1>hello</title-title-1>
<text-body>i like you</text-body>
<p>hehe</p>
Can somebody help me?
Using BeautifulSoup
Demo:
from bs4 import BeautifulSoup
s = """<div class="title-title-1">hello</div>
<div class="text-body">i like you</div>
<div class="p">hehe</div>"""
soup = BeautifulSoup(s, "html.parser")
title = soup.find("div", class_="title-title-1").text
body = soup.find("div", class_="text-body").text
p = soup.find("div", class_="p").text
toFrame = """<title-title-1>{0}</title-title-1>
<text-body>{1}</text-body>
<p>{2}</p>""".format(title, body, p)
print(toFrame)
Output:
<title-title-1>hello</title-title-1>
<text-body>i like you</text-body>
<p>hehe</p>
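The demo above rebuilds the markup as a string. If you would rather rename the tags in the parsed tree, a minimal sketch is to assign to tag.name and drop the class attribute. This assumes every div has a class and that its first class value is the new tag name you want:

```python
from bs4 import BeautifulSoup

s = """<div class="title-title-1">hello</div>
<div class="text-body">i like you</div>
<div class="p">hehe</div>"""

soup = BeautifulSoup(s, "html.parser")

for div in soup.find_all("div", class_=True):
    # class is a multi-valued attribute, so it is a list; take the first value
    div.name = div["class"][0]
    # remove the class attribute so only the renamed tag remains
    del div["class"]

result = str(soup)
print(result)
```

This works for any number of divs without hard-coding each class name.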
I have written a Python script to scrape data from http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings
It is a list of 100 players, and I scraped this data successfully. The problem is that when I run the script, it scrapes the same data three times instead of just once.
<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
<div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
<div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
<div class="cb-col cb-col-33">
<div class="cb-col cb-col-50">
<span class=" cb-ico" style="position:absolute;"></span> –
</div>
<div class="cb-col cb-col-50">
<img src="http://i.cricketcb.com/i/stats/fw/50x50/img/faceImages/2250.jpg" class="img-responsive cb-rank-plyr-img">
</div>
</div>
<div class="cb-col cb-col-67 cb-rank-plyr">
<a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith" title="Steven Smith's Profile">Steven Smith</a>
<div class="cb-font-12 text-gray">AUSTRALIA</div>
</div>
</div>
<div class="cb-col cb-col-17 cb-rank-tbl">906</div>
<div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>
And here is the Python script I wrote to scrape each player's data.
import sys,requests,csv,io
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
but instead of scraping the data once, it scrapes the same data 3 times.
Where can I make changes to get data just one time?
Select the table and look for the divs in that:
maindiv = soup.select("#batsmen-tests div.text-center")
for div in maindiv:
    print(div.text)
Both your original output and the output above put all the text from each div on one line, which is not very useful. If you just want the player names:
anchors = soup.select("#batsmen-tests div.cb-rank-plyr a")
for a in anchors:
    print(a.text)
A quick and easy way to get the data in a nice csv format is to just get text from each child:
maindiv = soup.select("#batsmen-tests div.text-center")
for d in maindiv[1:]:
    row_data = u",".join(s.strip() for s in filter(None, (t.find(text=True, recursive=False) for t in d.find_all())))
    if row_data:
        print(row_data)
Now you get output like:
# rank, up/down, name, country, rating, best rank
1,–,Steven Smith,AUSTRALIA,906,1
2,–,Joe Root,ENGLAND,878,1
3,–,Kane Williamson,NEW ZEALAND,876,1
4,–,Hashim Amla,SOUTH AFRICA,847,1
5,–,Younis Khan,PAKISTAN,845,1
6,–,Adam Voges,AUSTRALIA,802,5
7,–,AB de Villiers,SOUTH AFRICA,802,1
8,–,Ajinkya Rahane,INDIA,785,8
9,2,David Warner,AUSTRALIA,772,3
10,–,Alastair Cook,ENGLAND,770,2
11,1,Misbah-ul-Haq,PAKISTAN,764,6
As opposed to:
PositionPlayerRatingBest Rank
Player
1 –Steven SmithAUSTRALIA9061
2 –Joe RootENGLAND8781
3 –Kane WilliamsonNEW ZEALAND8761
4 –Hashim AmlaSOUTH AFRICA8471
5 –Younis KhanPAKISTAN8451
6 –Adam VogesAUSTRALIA8025
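If you want properly quoted CSV rather than a hand-joined string, here is a rough sketch using the standard csv module together with stripped_strings. This swaps in a different technique from the one-liner above. The HTML is a trimmed copy of the row from the question, and writing to an in-memory buffer is only for illustration:

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of one ranking row from the question (image and styling removed)
html = """<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
<div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
<div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
<div class="cb-col cb-col-33">
<div class="cb-col cb-col-50"><span class=" cb-ico"></span> –</div>
</div>
<div class="cb-col cb-col-67 cb-rank-plyr">
<a href="/profiles/2250/steven-smith">Steven Smith</a>
<div class="cb-font-12 text-gray">AUSTRALIA</div>
</div>
</div>
<div class="cb-col cb-col-17 cb-rank-tbl">906</div>
<div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("div", class_="text-center")

# stripped_strings yields every non-empty text fragment with whitespace removed
fields = list(row.stripped_strings)

buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
print(buffer.getvalue(), end="")
```

The csv writer takes care of quoting if a field ever contains a comma.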
You get the output three times because the page has three categories; select the one you want and search only within it.
The simplest way to do this with your code is to add just one line:
import sys,requests,csv,io
from bs4 import BeautifulSoup
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
specific_div = soup.find_all("div", {"id": "batsmen-tests"})
maindiv = specific_div[0].find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
This gives similar results for Test batsmen only; for the other categories, just change the "id" in the specific_div line.
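To see the effect of scoping the search without hitting the live site, here is a small offline sketch. The HTML is a made-up, heavily simplified stand-in for the page: only the id batsmen-tests comes from the answers above, and batsmen-odis is a hypothetical id for a second category.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the rankings page
html = """
<div id="batsmen-tests">
  <div class="text-center"><a>Steven Smith</a></div>
  <div class="text-center"><a>Joe Root</a></div>
</div>
<div id="batsmen-odis">
  <div class="text-center"><a>AB de Villiers</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Unscoped search: matches rows in every category
all_rows = soup.find_all("div", class_="text-center")

# Scoped search: only rows inside the container with id="batsmen-tests"
test_rows = soup.select("#batsmen-tests div.text-center")

names = [row.get_text(strip=True) for row in test_rows]
print(len(all_rows), len(test_rows))  # 3 2
print(names)  # ['Steven Smith', 'Joe Root']
```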