Use get_text() for only one HTML class - Python, BeautifulSoup

I am trying to get just the text inside one HTML class. I tried to follow the BeautifulSoup documentation, but I always get either the same error message or the whole tag instead of its text.
My code.py
from bs4 import BeautifulSoup
import requests

url = "https://www.auchandirect.pl/auchan-warszawa/pl/pepsi-cola-max-niskokaloryczny-napoj-gazowany-o-smaku-cola/p-98502176"
r = requests.get(url, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}, timeout=15)
soup = BeautifulSoup(r.text, 'lxml')
products_links = soup.findAll("a", {'class': 'current-page'})
print(products_links)
From the results I only need 'Max niskokaloryczny napój gazowany o smaku cola'.
My results are:
<a class="current-page" href="/auchan-warszawa/pl/pepsi-cola-max-niskokaloryczny-napoj-gazowany-o-smaku-cola/p-98502176"><span>Max niskokaloryczny napój gazowany o smaku cola</span></a>
Or, if I apply print(products_links.get_text()) as shown in the documentation, PyCharm returns:
ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How can I extract the text correctly from "current-page"?
Why doesn't the function return the text inside the tags?
What's the difference between selecting a class with findAll("a", class_="current-page") and findAll("a", {'class': 'current-page'})? They give the same results.
Any help will be appreciated.

findAll returns a list of all the elements that match your criteria: if there are multiple matching tags, you get every one of them back in a list.
There is no difference between findAll("a", class_="current-page") and passing a dict of attributes like {'class': 'current-page'}; they are two equivalent ways of filtering by attribute. (findAll itself is the old BeautifulSoup 3 camelCase name, which bs4 keeps as an alias for find_all.)
You can extract a text from the returned object by selecting the element and getting the text attribute shown below:
products_links = soup.findAll("a", {'class': 'current-page'})
print(products_links[0].get_text())
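A minimal, self-contained sketch of the find() approach, using an inline copy of the markup from the question so it runs without hitting the live site (whose markup may change):

```python
from bs4 import BeautifulSoup

# Inline copy of the <a> tag shown in the question
html = ('<a class="current-page" href="/p-98502176">'
        '<span>Max niskokaloryczny napój gazowany o smaku cola</span></a>')
soup = BeautifulSoup(html, 'html.parser')

# find() returns a single Tag (or None), so get_text() works directly on it
link = soup.find('a', class_='current-page')
text = link.get_text(strip=True) if link else None
print(text)  # Max niskokaloryczny napój gazowany o smaku cola
```

The None check avoids an AttributeError when the class is missing from the page, which is safer than indexing into a possibly empty ResultSet.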

Related

What's the problem with my code? It prints None when I use the find() method, and an empty list when I use the findAll() method

from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.findAll('div', {'class': 'style-scope ytd-app'})
print(title)
It prints an empty list [], and if I use the find() method it prints None instead.
Why does this happen? Please help me, I am stuck here.
It is difficult to get the title this way because YouTube uses JavaScript and renders its content dynamically. What you can do is print soup first and look for the title in it: you will find it in a meta tag, from which you can extract it. This approach will probably work for any video URL.
from bs4 import BeautifulSoup
import requests
yt_link = "https://www.youtube.com/watch?v=bKDdT_nyP54"
response = requests.get(yt_link)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('meta', attrs={"name": "title"})
print(title.get("content"))
output:
Akon - Smack That (Official Music Video) ft. Eminem
The find() method returns the first matching object, or None if nothing matches; the findAll() method returns a list of matching objects, or an empty list if nothing matches.
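The difference is easy to see with a tiny inline document (a sketch, not tied to the YouTube page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

first = soup.find('p')            # first matching Tag
all_ps = soup.findAll('p')        # ResultSet (list-like) of every match
missing = soup.find('div')        # no match -> None
no_matches = soup.findAll('div')  # no match -> empty ResultSet

print(first.text, len(all_ps), missing, len(no_matches))  # -> one 2 None 0
```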

Cannot remove HTML tags without triggering an error

So I'm trying to run this simple code where I parse some information from a site and return only the information between the tags.
Code below
from bs4 import BeautifulSoup
import requests as reg

url = 'https://pythonprogramming.net/parsememcparseface/'
response = reg.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find('div', class_='body')
header = data.find_all('th')
print(header.text)
I'm trying to return:
Program Name Internet Points Kittens?
However, this returns error message:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Now when I remove the .text I can get
[<th>Program Name</th>, <th>Internet Points</th>, <th>Kittens?</th>]
But obviously I want the tags removed.
Any help please?
Thanks ^_^
As the error message states, find_all returns a list of items, not a single item. The problem is not that there is other stuff in the list; it is that you have a list, and .text is defined on a single item, not on a list. Does this work better (it stays close to your original code)?
headers = data.find_all('th')
for header in headers:
    print(header.text)
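If the goal is the single line "Program Name Internet Points Kittens?", the header texts can also be joined into one string. A sketch with the table inlined so it runs without the network (the real rows come from the parsememcparseface page):

```python
from bs4 import BeautifulSoup

# Inline stand-in for the table headers on the page
html = ('<table><tr><th>Program Name</th>'
        '<th>Internet Points</th><th>Kittens?</th></tr></table>')
data = BeautifulSoup(html, 'html.parser')

# Join the text of every <th> with a single space
row = ' '.join(th.text for th in data.find_all('th'))
print(row)  # Program Name Internet Points Kittens?
```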

python xpath returns empty list - Exalead

I'm fairly new to scraping with Python.
I am trying to obtain the number of search results for a query on Exalead. In this example I would like to get "586,564 results".
This is the code I am running:
import requests
from lxml import html

r = requests.get(URL, headers=headers)
tree = html.fromstring(r.text)
stats = tree.xpath('//*[@id="searchform"]/div/div/small/text()')
This returns an empty list.
I copy-pasted the XPath directly from the element in the browser's inspector.
As an alternative, I have tried using Beautiful soup:
html = r.text
soup = BeautifulSoup(html, 'xml')
stats = soup.find('small', {'class': 'pull-right'}).text
which returns an AttributeError: 'NoneType' object has no attribute 'text'.
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
Does anyone know why this is happening and how this can be resolved?
Thanks a lot!
When I checked the html source I realised I actually cannot find the element I am looking for (the number of results) on the source.
This suggests that the data you're looking for is dynamically generated with javascript. You'll need to be able to see the element you're looking for in the html source.
To confirm this being the cause of your error, you could try something really simple like:
html = r.text
soup = BeautifulSoup(html, 'lxml')
*note the 'lxml' above.
And then manually check 'soup' to see if your desired element is there.
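That check can also be scripted rather than eyeballed. A sketch with a stand-in response body (in the real script, raw_html would be r.text):

```python
from bs4 import BeautifulSoup

# Stand-in for the raw response body from requests
raw_html = ('<html><body><small class="pull-right">'
            '586,564 results</small></body></html>')

# If the element is absent from the raw HTML, it is rendered by
# JavaScript in the browser, and requests alone will never see it.
soup = BeautifulSoup(raw_html, 'html.parser')
stats = soup.find('small', class_='pull-right')
print(stats.text if stats else 'not in the raw HTML -> rendered by JavaScript')
```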
You can get that with the CSS selector combination small.pull-right, which targets the tag name and the class name of the element.
from bs4 import BeautifulSoup
import requests
url = 'https://www.exalead.com/search/web/results/?q=lead+poisoning'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
print(soup.select_one('small.pull-right').text)

parsing webpage using python

I am trying to parse a webpage (forums.macrumors.com) and get a list of all the threads posted.
So I have got this so far:
import urllib2
from BeautifulSoup import BeautifulSoup

address = "http://forums.macrumors.com/forums/os/"
text = urllib2.urlopen(address).read()
soup = BeautifulSoup(text)
Now the webpage source has this code at the start of each thread:
<li id="thread-1880" class="discussionListItem visible sticky WikiPost "
data-author="ABCD">
How do I parse this so I can then get to the thread link within this li tag? Thanks for the help.
So from your code, you have the soup object, which holds the BeautifulSoup object for your HTML. The question is: which part of the tag you're looking for is static? Is the id always the same? The class?
Finding by the id:
my_li = soup.find('li', {'id': 'thread-1880'})
Finding by the class:
my_li = soup.find('li', {'class': 'discussionListItem visible sticky WikiPost'})
Ideally you would figure out the unique class you can check for and use that instead of a list of classes.
if you are expecting an a tag inside of this object, you can do this to check:
if my_li and my_li.a:
    print my_li.a.attrs.get('href')
I always recommend checking, though, because if my_li ends up being None, or there is no a tag inside it, your code will fail.
For more details, check out the BeautifulSoup documentation
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
The idea here would be to use CSS selectors and to get the a elements inside the h3 with class="title" inside the div with class="titleText" inside the li element having the id attribute starting with "thread":
for link in soup.select("div.discussionList li[id^=thread] div.titleText h3.title a[href]"):
    print link["href"]
You can tweak the selector further, but this should give you a good starting point.
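The [id^=thread] part is a CSS attribute "starts with" selector. A self-contained sketch with made-up forum markup (the id values and hrefs here are invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up thread list mimicking the structure described in the question
html = '''
<div class="discussionList">
  <li id="thread-1880"><div class="titleText">
    <h3 class="title"><a href="/threads/os-x.1880/">OS X thread</a></h3>
  </div></li>
  <li id="ad-1"><div class="titleText">
    <h3 class="title"><a href="/ads/1">Not a thread</a></h3>
  </div></li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# [id^=thread] keeps only the <li> elements whose id starts with "thread"
links = [a['href'] for a in soup.select('li[id^=thread] h3.title a[href]')]
print(links)  # ['/threads/os-x.1880/']
```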

Using beautiful soup 4 to scrape URLS within a <p class="postbody"> tag and save them to a text file

I realize this is probably incredibly straightforward, but please bear with me. I'm trying to use BeautifulSoup 4 to scrape a website that has a list of blog posts for the URLs of those posts. The a tag that I want is within a p tag with class "postbody". There are multiple such p tags that include a header and then a link that I want to capture. This is the code I'm working with:
with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
    snippet = soup.find_all('p', class="postbody")
    for link in snippet.find('a'):
        fulllink = link.get('href')
        logfile.write(fulllink + "\n")
The error I'm getting is:
AttributeError: 'ResultSet' object has no attribute 'find'
I understand that means snippet is a set and BeautifulSoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of p tags, then look for the a tag within each one, and save each link on a separate line in a file.
The actual reason for the error is that snippet is a result of find_all() call and is basically a list of results, there is no find() function available on it. Instead, you meant:
snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
Also, note the use of class_ here - class is a reserved keyword and cannot be used as a keyword argument here. See Searching by CSS class for more info.
Alternatively, make use of CSS selectors:
for link in soup.select('p.postbody a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
p.postbody a would match all a tags inside the p tag with class postbody.
In your code,
snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):
Here snippet is a bs4.element.ResultSet object, which is why you get this error. But the elements of this ResultSet are bs4.element.Tag objects, on which you can call the find method.
Change your code like this,
snippet = soup.find_all("p", {"class": "postbody"})
for link in snippet:
    if link.find('a'):
        fulllink = link.a['href']
        logfile.write(fulllink + "\n")
