How can I extract the desired data with Beautiful Soup? I'm trying to collect the text "RoSharon1977" from the following HTML:
<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>
You have to find the div by its id, then get the next ul element, and so on, drilling down until you reach the a element, and then get its text:
from bs4 import BeautifulSoup
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', attrs={'id': 'twitter-view'}).find_next('ul').find_next('li').find_next('a').text)
Or, depending on how the whole webpage looks, you could simply do:
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('a').text)
And if there are multiple a elements:
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print(a.text)
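The drill-down described above can also be written as a single CSS selector. This is only a minimal sketch against the snippet exactly as posted, where the text sits directly inside the li; the variable names `item` and `username` are illustrative. If the real page wraps the name in an a element, `get_text` still returns the same text because it includes descendants.

```python
from bs4 import BeautifulSoup

html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")

# CSS selector: the first <li> anywhere under the element with id="twitter-view".
item = soup.select_one("#twitter-view li")

# select_one() returns None when nothing matches, so guard before reading text.
# get_text(strip=True) returns the element's text (descendants included)
# with the surrounding whitespace removed.
username = item.get_text(strip=True) if item is not None else None
print(username)
```

The `None` check matters on real pages: if the id is missing, calling `.get_text()` on the result would raise an AttributeError.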
I'm new to BeautifulSoup. I found all the cards, about 12, but when I try to loop through each card and print the link href, I keep getting this error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
# print(cards)
print(len(cards))
for link in cards.find_all('a'):
    print(link.get('href'))
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
will return a collection of all the divs found, so you'll need to loop over them before finding the child a elements.
That said, you should probably use findChildren for finding the a elements.
Example demo with a minimal piece of HTML:
from bs4 import BeautifulSoup
html = """
<div class='up-card-section'>
<div class='foo'>
<a href='example.com'>FooBar</a>
</div>
</div>
<div class='up-card-section'>
<div class='foo'>
<a href='example2.com'>FooBar</a>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Loop over each card first, then look for the links inside that card.
for card in soup.findAll('div', attrs={'class': 'up-card-section'}):
    for link in card.findChildren('a', recursive=True):
        print(link.get('href'))
Output:
example.com
example2.com
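As an alternative to the nested loop, a descendant CSS selector can collect the same links in one pass. The following is a sketch on the same minimal HTML, not the asker's real page; the list name `hrefs` is illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class='up-card-section'>
  <div class='foo'><a href='example.com'>FooBar</a></div>
</div>
<div class='up-card-section'>
  <div class='foo'><a href='example2.com'>FooBar</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.up-card-section a[href]" matches every <a> that has an href attribute
# and sits anywhere inside a <div class="up-card-section">.
hrefs = [a.get("href") for a in soup.select("div.up-card-section a[href]")]

for href in hrefs:
    print(href)
```

Like find_all(), select() returns a list of elements, so it has to be iterated rather than treated as a single tag.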
I am working on a web scraper project and can't get BeautifulSoup to give me just the text inside the div. Below is my code. Any suggestions on how to get Python to print just the "5x5", without the <div> and </div> tags and without the whitespace?
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print(unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use text to retrieve the text, then strip() to remove the whitespace.
Try unit.text.strip().
Change your print statement from print(unit) to print(unit.text).
Use a faster CSS class selector:
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
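The suggestions above can be combined with `get_text(strip=True)`, which returns the text and trims the whitespace in one call. This is a small sketch on the posted snippet; it uses the built-in html.parser instead of lxml only to avoid an extra dependency, and the name `size` is illustrative.

```python
from bs4 import BeautifulSoup

source = '''
<div class="unit-size">
5x5 </div>
'''

soup = BeautifulSoup(source, "html.parser")
unit = soup.find("div", class_="unit-size")

# find() returns None when no element matches, so guard before reading text.
size = unit.get_text(strip=True) if unit is not None else None
print(size)
```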
The code I am using to scrape the content
class Scraper(object):
    # contains methods to scrape data from curse
    def scrape(url):
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return request.urlopen(req).read()
    def lookup(page, tag, class_name):
        parsed = BeautifulSoup(page, "html.parser")
        return parsed.find_all(tag, class_=class_name)
This returns a list with entries similar to this:
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
I'm attempting to extract the text in between the href tags, in this instance
World Quest Tracker
How could I accomplish this?
Try this.
from bs4 import BeautifulSoup
html='''
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
'''
soup = BeautifulSoup(html, "lxml")
for item in soup.select(".title"):
    print(item.text)
Result:
World Quest Tracker
html_doc = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').text)
This will print:
World Quest Tracker
I'm attempting to extract the text in between the href tags
If you actually want the text in the href attribute, and not the text content wrapped by the <a></a> anchor (your wording is a bit unclear), use get('href'):
from bs4 import BeautifulSoup
html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, 'lxml')
soup.find('a').get('href')
'/addons/wow/world-quest-tracker'
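If both the link text and the href are wanted, they can be collected together. The sketch below assumes the anchor wraps the title and carries the href value shown above; the `li.title a` selector and the name `results` are illustrative choices rather than part of the original answers.

```python
from bs4 import BeautifulSoup

# Assumed markup: the title text wrapped in an <a> carrying the href shown above.
html = (
    '<li class="title"><h4>'
    '<a href="/addons/wow/world-quest-tracker">World Quest Tracker</a>'
    '</h4></li>'
)

soup = BeautifulSoup(html, "html.parser")

results = []
for a in soup.select("li.title a"):
    # get_text() is the text between <a> and </a>; get("href") is the attribute.
    results.append((a.get_text(strip=True), a.get("href")))

for text, href in results:
    print(text, href)
```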
I have the code below. It successfully gets the content I need, but it also includes the tag I am searching for. How do I exclude this?
Additionally, the content is using DIV tags rather than P tags. How do I change all <div>...</div> tags to <p>...</p>?
Example Output
<div class="article__body"><div>One</div><div>Two</div><div>Three</div></div>
Desired Output
<p>One</p><p>Two</p><p>Three</p>
CODE:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_page = urlopen("http://example.com/news/1234")
soup = BeautifulSoup(html_page, "html.parser")
print(soup.find("h2", {"class": "article__title"}))
print("=================================")
print(soup.find("div", {"class": "article__body"}))
print("=================================")
print(soup.find("div", {"class": "article__image"}))
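One possible approach, sketched here on the example output from the question rather than on the live page: rename the child div elements by assigning to their `name`, then render only the children of the wrapper with `decode_contents()`. The variable names are illustrative, and this assumes bs4 (not the older BeautifulSoup 3 package).

```python
from bs4 import BeautifulSoup

html = '<div class="article__body"><div>One</div><div>Two</div><div>Three</div></div>'

soup = BeautifulSoup(html, "html.parser")
body = soup.find("div", {"class": "article__body"})

# Rename only the direct child <div> elements to <p>;
# recursive=False leaves any deeper nested divs untouched.
for child in body.find_all("div", recursive=False):
    child.name = "p"

# decode_contents() renders the element's children as a string,
# leaving out the wrapping <div class="article__body"> tag itself.
inner_html = body.decode_contents()
print(inner_html)
```

On this input the printed string is `<p>One</p><p>Two</p><p>Three</p>`, which matches the desired output in the question.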
I need to extract the name of the artists from an HTML page. Here's a snippet of the page:
</td>
<td class="playbuttonCell">
<a class="playbutton preview-track" href="/music/example" data-analytics-redirect="false" >
<img class="transparent_png play_icon" width="13" height="13" alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" style="" />
</a>
</td>
<td class="subjectCell" title="example, played 3 times">
<div>
<a href="/music/example-artist" >Example artist name</a>
I've tried this, but it isn't doing the job.
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen('http://www.last.fm/user/Jehl/charts?rangetype=overall&subtype=artists').read()
soup = BeautifulSoup(html)
print soup('a')
for link in soup('a'):
    print html
Where am I screwing up?
You can try this:
In [1]: from bs4 import BeautifulSoup
In [2]: s = # Your string here...
In [3]: soup = BeautifulSoup(s)
In [4]: for anchor in soup.find_all('a'):
   ...:     print anchor.text
   ...:
   ...:
here lies the text i need
Here, the find_all method returns a list that contains all matching anchor tags, after which we can print the text property to get the value between the tags.
for link in soup.select('td.subjectCell a'):
    print(link.text)
It selects (just like CSS) the a elements inside td elements that have the subjectCell class.
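Here is that selector as a self-contained sketch on the snippet from the question. The surrounding table and tr tags are added as an assumption so the fragment is well-formed, and `artists` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Snippet from the question; the <table>/<tr> wrapper is an assumption.
html = '''
<table><tr>
<td class="playbuttonCell">
  <a class="playbutton preview-track" href="/music/example">
    <img alt="Play" src="http://cdn.last.fm/flatness/preview/play_indicator.png" />
  </a>
</td>
<td class="subjectCell" title="example, played 3 times">
  <div>
    <a href="/music/example-artist">Example artist name</a>
  </div>
</td>
</tr></table>
'''

soup = BeautifulSoup(html, "html.parser")

# Only anchors inside <td class="subjectCell"> are matched, so the
# play-button link in the other cell is skipped.
artists = [a.get_text(strip=True) for a in soup.select("td.subjectCell a")]
print(artists)
```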
spans = soup.find_all("div", {"class": "overlay tran3s"})
for span in spans:
    links = span.find_all('a')
    for link in links:
        print(link.text)
soup.findAll and link.attrs can be used to read the href attributes easily.
Working code:
soup = BeautifulSoup(html, 'html.parser')
for link in soup.findAll('a'):
    print(link.attrs['href'])
Output:
/music/example
/music/example-artist
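One caveat with `link.attrs['href']`: it raises a KeyError for any anchor that has no href attribute. A hedged variation is to pass `href=True` to find_all so that only anchors with an href are returned. The HTML below is made up for illustration, including the anchor without an href.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the middle anchor deliberately has no href.
html = '''
<a class="playbutton preview-track" href="/music/example">Play</a>
<a name="top">No href here</a>
<a href="/music/example-artist">Example artist name</a>
'''

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute,
# so indexing with link["href"] cannot raise KeyError.
hrefs = [link["href"] for link in soup.find_all("a", href=True)]

for href in hrefs:
    print(href)
```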
Regular expressions are your friend here. As an alternative to RocketDonkey's answer, which uses BeautifulSoup properly, you can parse through soup('a') with a regular expression like
>([a-zA-Z]*|[0-9]|(\w\s*)*)</a>
You can use the re.findall method to grab the text between the anchor tags directly.
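For illustration, here is a minimal re.findall sketch. Note that it uses a simpler, hypothetical pattern rather than the one quoted above, and that regular expressions are fragile on real-world HTML (nested tags, attributes containing ">", and so on), so treat it as a quick hack rather than a replacement for a parser.

```python
import re

# Illustrative HTML based on the anchors in the question.
html = '''
<a class="playbutton preview-track" href="/music/example"><img alt="Play" /></a>
<a href="/music/example-artist">Example artist name</a>
'''

# Hypothetical pattern (not the one from the answer): an opening <a ...> tag,
# then one or more characters that are not "<", then the closing </a>.
# Anchors whose content starts with another tag (like the <img> above) are skipped.
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>([^<]+)</a>")

names = [text.strip() for text in ANCHOR_TEXT.findall(html)]
print(names)
```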