I am trying to build a function in a Python web scraper that moves to the next page in a list of results. I am having trouble locating the element in Beautiful Soup, as the link is found at the end of many other tags and doesn't have any attributes such as class or ID.
Here is a snippet of the html:
<a href="http://www.url?&=page=2">
Next
</a>
I have been reading the bs4 documentation trying to understand how I can extract the URL, but I am coming up stumped. I am thinking that it could be done by either:
1. finding the last .a['href'] in the parent element, as it is always the last one; or
2. finding the href based on the fact that its text is always 'Next'.
I don't know how to write something that would solve either 1. or 2.
Am I on the right track? Does anyone have any suggestions on how to achieve my goal? Thanks
To find the <a> tag that contains the text Next, you can do:
from bs4 import BeautifulSoup
txt = '''
<a href="http://www.url?&=page=2">
Next
</a>'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select_one('a:contains("Next")')['href'])
Prints:
http://www.url?&=page=2
Or:
print(soup.find('a', text=lambda t: t.strip() == 'Next')['href'])
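For the asker's goal of moving through result pages, either lookup can drive a simple pagination loop. A minimal sketch, with hypothetical in-memory pages standing in for real HTTP responses:

```python
from bs4 import BeautifulSoup

# Hypothetical in-memory pages standing in for real HTTP responses.
pages = {
    'page1': '<p>results 1</p><a href="page2">Next</a>',
    'page2': '<p>results 2</p><a href="page3">Next</a>',
    'page3': '<p>results 3</p>',  # last page: no Next link
}

url = 'page1'
visited = []
while url:
    soup = BeautifulSoup(pages[url], 'html.parser')
    visited.append(url)  # scrape this page's results here
    nxt = soup.find('a', string=lambda t: t and t.strip() == 'Next')
    url = nxt['href'] if nxt else None  # stop when there is no Next link

print(visited)  # ['page1', 'page2', 'page3']
```

In a real scraper you would fetch each url over HTTP instead of looking it up in a dict, but the loop structure is the same.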
To get the last <a> tag inside some element, you can index the ResultSet with [-1]:
from bs4 import BeautifulSoup
txt = '''
<div id="block">
<a href="http://www.url?&=page=1">Some other link</a>
<a href="http://www.url?&=page=2">Next</a>
</div>
'''
soup = BeautifulSoup(txt, 'html.parser')
print(soup.select('div#block > a')[-1]['href'])
I am currently working on a project for my Python class and I'm pretty much stuck.
My program is getting my online anime list from my profile via web scraping.
It's supposed to fetch the title names and thumbnails and just give me the links.
My problem, basically, is that I can't extract an image link that sits inside a table row tag.
Here is a screenshot of the HTML code
Basically there is a <tr ..... data-title='<img src="url.jpg">'> tag which contains a picture link.
It was no problem extracting the titles, but it's different this time since the link is inside an attribute of the tag.
def Icon_Crawler(self):
    page_soup = soup(self.html_stream, "html.parser")
    elements = page_soup.findAll('tbody')
    for element in elements:
        try:
            store_rows = element.findAll("tr", attrs={"data-title"})
        except AttributeError:
            pass
    print(store_rows)
This is what I have so far.
First, I would use attrs={"data-title": True} instead of attrs={"data-title"}, which doesn't work for me.
Second, I would use a for-loop to work with every element separately, and then item['data-title'] or item.attrs['data-title'] (or another method) to get the attribute.
EDIT: because this attribute holds HTML with an <img> tag, I would use BeautifulSoup again to get the src from that tag:
from bs4 import BeautifulSoup as BS

text = '''
<tr data-title='<img src="url1.jpg" alt="1">' >
<tr data-title='<img src="url2.jpg" alt="2">' >
'''

soup = BS(text, 'html.parser')
all_items = soup.find_all('tr', {"data-title": True})
for item in all_items:
    print('item:', item['data-title'])
    #print('item:', item.get('data-title'))
    #print('item:', item.attrs['data-title'])
    #print('item:', item.attrs.get('data-title'))
    link = item['data-title']
    s = BS(link, 'html.parser')
    print('src:', s.find('img')['src'])
Result:
item: <img src="url1.jpg" alt="1">
src: url1.jpg
item: <img src="url2.jpg" alt="2">
src: url2.jpg
I'm using Python 3.7. I want to locate all the elements in my HTML page that have an attribute, "data-permalink", regardless of what its value is, even if the value is empty. However, I'm confused about how to do this. I'm using the bs4 package and tried the following
soup = BeautifulSoup(html)
soup.findAll("data-permalink")
[]
soup.findAll("a")
[<a href=" ... </a>]
soup.findAll("a.data-permalink")
[]
The attribute is normally only found in anchor tags on my page, hence my unsuccessful "a.data-permalink" attempt. I would like to return the elements that contain the attribute.
Your selector is invalid:
soup.findAll("a.data-permalink")
That syntax belongs to the .select() method, but even there it would be invalid, because it means: select <a> elements with the class data-permalink, not the attribute.
To match every element, use * with select():
.select('*[data-permalink]')
or True with findAll():
.findAll(True, attrs={'data-permalink': True})
example
from bs4 import BeautifulSoup
html = '''<a data-permalink="a">link</a>
<b>bold</b>
<i data-permalink="i">italic</i>'''
soup = BeautifulSoup(html, 'html.parser')
permalink = soup.select('*[data-permalink]')
# or
# permalink = soup.findAll(True, attrs={'data-permalink' : True})
print(permalink)
Result (the <b> element is skipped):
[<a data-permalink="a">link</a>, <i data-permalink="i">italic</i>]
I'm using Beautiful Soup 4 to parse a news site for links contained in the body text. I was able to find all the paragraphs that contained the links but the paragraph.get('href') returned type none for each link. I'm using Python 3.5.1. Any help is really appreciated.
from bs4 import BeautifulSoup
import urllib.request
import re
soup = BeautifulSoup("http://www.cnn.com/2016/11/18/opinions/how-do-you-deal-with-donald-trump-dantonio/index.html", "html.parser")
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    print(paragraph.get('href'))
Do you really want this?
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):
        print(a.get('href'))
Note that paragraph.get('href') tries to find the attribute href on the <div> tag you found. As there's no such attribute, it returns None. Most probably you actually have to find all <a> tags which are descendants of your <div> (this can be done with paragraph("a"), which is a shortcut for paragraph.find_all("a")) and then, for every <a> element, look at its href attribute.
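Note also that the question's code passes the URL string itself to BeautifulSoup, so only that literal string gets parsed; the page has to be fetched first, e.g. with urllib.request.urlopen(url).read(). A sketch of the nested lookup, with an inline stand-in for the fetched body:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fetched page body; in the real script you would
# first do: html = urllib.request.urlopen(url).read()
html = '''
<div class="zn-body__paragraph">text with a <a href="/a1">link</a></div>
<div class="zn-body__paragraph"><a href="/a2">another</a></div>
'''
soup = BeautifulSoup(html, "html.parser")

links = []
for paragraph in soup.find_all("div", class_="zn-body__paragraph"):
    for a in paragraph("a"):  # shortcut for paragraph.find_all("a")
        links.append(a.get('href'))

print(links)  # ['/a1', '/a2']
```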
I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div> that has no class.
The first text extract I want is within an <a> tag which is within an <h4> tag. This I managed to extract it.
The second text extract immediately follows the closing h4 tag </h4> and is followed by a <br> tag.
The third text extract immediately follows the <br> tag after the second text extract and is also followed by a <br> tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
    print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
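If you want the three extracts together per block, the h4 text and the stripped sibling strings can be collected in one pass; a sketch using the question's snippet inlined for a self-contained example:

```python
from bs4 import BeautifulSoup

# The question's snippet, inlined for a self-contained example.
html = '''<div>
<h4 class="actorboxLink">
Decheterie de Bagnols
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>'''

soup = BeautifulSoup(html, "html.parser")
for h4 in soup.find_all("h4", class_="actorboxLink"):
    name = h4.get_text(strip=True)
    # text-node siblings after </h4>, whitespace stripped, blanks dropped
    rest = [t.strip() for t in h4.find_next_siblings(text=True) if t.strip()]
    print([name] + rest)
```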
If you don't need each of the 3 elements you are looking for in separate variables, you could just use the get_text() function on the <div> to get them all in one string. If there are other div tags but they all have classes, you can find all the <div> tags with class_=False. If you can't isolate the <div> you are interested in, then this solution won't work for you.
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())
BTW this is python 3 & bs4
I am attempting to use BeautifulSoup to parse through a DOM tree and extract the names of authors. Below is a snippet of HTML to show the structure of the code I'm going to scrape.
<html>
<body>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
<!--There are many other div tags with this structure-->
</body>
</html>
My point of confusion is that when I do soup.find, it finds the first occurrence of the div tag that I'm searching for. After that, I search for all 'a' link tags. At this stage, how do I extract the authors' names from each of the link tags and print them out? Is there a way to do it using BeautifulSoup, or do I need to use regex? How do I continue iterating over every other div tag and extract the authors' names?
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
try:
    authordiv = soup.find('div', attrs={'class': 'list-authors'})
    links = authordiv.findAll('a')
    for link in links:
        print ''.join(link[0].contents)
    # Iterate through entire page and print authors
except IOError:
    print 'IO error'
just use findAll for the divs like you do for the links
for authordiv in soup.findAll('div', attrs={'class': 'list-authors'}):
Since link is already taken from an iterable, you don't need to subindex link -- you can just do link.contents[0].
print link.contents[0] with your new example with two separate <div class="list-authors"> yields:
Dacheng Lin
Ronald A. Remillard
Jeroen Homan
A.G. Kosovichev
So I'm not sure I understand the comment about searching other divs. If they are different classes, you will either need to do a separate soup.find and soup.findAll, or just modify your first soup.find.
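Putting both fixes together (iterate every div with findAll, and index with link.contents[0]), here is a sketch in modern bs4 syntax; the markup below wraps each author in an <a> tag as on the page being scraped, with placeholder hrefs:

```python
from bs4 import BeautifulSoup

# The question's snippet; the hrefs are placeholders.
html = '''
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Dacheng Lin</a>,
<a href="#">Ronald A. Remillard</a>,
<a href="#">Jeroen Homan</a>
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">A.G. Kosovichev</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
authors = []
for authordiv in soup.find_all('div', attrs={'class': 'list-authors'}):
    for link in authordiv.find_all('a'):
        authors.append(link.contents[0])  # the link's text node

print(authors)
```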