Bs4 and requests Python - python

I was trying to make a Pokedex (https://replit.com/#Yoplayer1py/Gui-Pokedex) and I wanted to get a Pokemon's description from https://www.pokemon.com/us/pokedex/{__pokename__}, where pokename is the name of the Pokemon. For example: https://www.pokemon.com/us/pokedex/unown
A p tag contains the description, and the p tag's class is: version-xactive.
When I print the description I get nothing, or sometimes I get None.
Here's the code:
import requests
from bs4 import BeautifulSoup
# Assign URL
url = "https://www.pokemon.com/us/pokedex/"+text_id_name.get(1.0, "end-1c")
# Fetch raw HTML content
html_content = requests.get(url).text
# Now that the content is ready, iterate
# through the content using BeautifulSoup:
soup = BeautifulSoup(html_content, "html.parser")
# similarly to get all the occurences of a given tag
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
The text_id_name.get(1.0, "end-1c") call reads the name from a tkinter text input.
Running it shows this:
Exception in Tkinter callback
Traceback (most recent call last):
File "/usr/lib/python3.8/tkinter/__init__.py", line 1883, in __call__
return self.func(*args)
File "main.py", line 57, in load_pokemon
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
AttributeError: 'NoneType' object has no attribute 'text'
Thanks in advance !!

It looks like the description has multiple classes, version-x and active (at least for Unown). That is why soup.find('p', attrs={'class': 'version-xactive'}) does not find the element and returns None, which is why you are getting the error.
Adding a space will fix your problem: print(soup.find('p', attrs={'class': 'version-x active'}).text). Just to note: if there are multiple p elements with the same classes, the find method might not return the element you want.
Adding a null check will also prevent the error from occurring:
description = soup.find('p', attrs={'class': 'version-x active'})
if description:
    print(description.text)
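To see how BeautifulSoup matches a multi-valued class attribute, here is a minimal offline sketch. The HTML string is a made-up stand-in for the real page (an assumption, not the actual pokemon.com markup), so the behaviour can be checked without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: one <p> carries two classes.
html = (
    '<div>'
    '<p class="version-y">Inactive description.</p>'
    '<p class="version-x active">Sample description text.</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# "version-xactive" is a single class name that no tag carries, so find() returns None.
no_space = soup.find("p", attrs={"class": "version-xactive"})

# With the space, the string equals the whole class attribute and the tag is found.
with_space = soup.find("p", attrs={"class": "version-x active"})

# A CSS selector requires both classes but does not depend on their order.
by_selector = soup.select_one("p.version-x.active")

print(no_space)
print(with_space.get_text())
print(by_selector.get_text())
```

The exact-string form only matches when the classes appear in that order in the attribute, so the selector form is probably the more robust choice if the page's markup shifts.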

You should probably separate out your calls so you can do a safety check and a type check.
Replace
print(soup.find('p', attrs={'class': 'version-xactive'}).text)
with
tags = soup.find('p', attrs={'class': 'version-xactive'})
print("Tags:", tags)
if tags is not None:
    print(tags.text)
That should give you more information at least. It might still break on tags.text. If it does, put the printout from print("Tags:", tags) up so we can see what the data looks like.
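The separated check could also be wrapped in a small helper so every lookup goes through the same guard. This is only a sketch: safe_text is a hypothetical name, and the HTML is a made-up sample used to exercise both branches.

```python
from bs4 import BeautifulSoup


def safe_text(tag, default="(not found)"):
    """Return the tag's stripped text, or a default when find() returned None."""
    if tag is None:
        return default
    return tag.get_text(strip=True)


# Made-up sample markup; the class names here are illustrative only.
soup = BeautifulSoup('<p class="intro"> Hello </p>', "html.parser")

present = safe_text(soup.find("p", attrs={"class": "intro"}))
absent = safe_text(soup.find("p", attrs={"class": "missing"}))

print(present)
print(absent)
```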

Related

Cannot remove html tags without triggering error

So I'm trying to run this simple code where I parse some information from a site and return only the information between the tags.
Code below
from bs4 import BeautifulSoup
import requests as reg
import csv
import re
url = ('https://pythonprogramming.net/parsememcparseface/')
response = reg.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find('div', class_='body')
header = data.find_all('th')
print(header.text)
I'm trying to return:
Program Name Internet Points Kittens?
However, this returns error message:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Now when I remove the .text I can get
[<th>Program Name</th>, <th>Internet Points</th>, <th>Kittens?</th>]
But obviously I want the tags removed.
Any help please?
Thanks ^_^
As the error message states, find_all returns a list of items, not a single item. The problem is not that there is other stuff in the list, it is that you have a list, and .text is defined to work not on a list but on a single item. Does this work better (a little closer to your original code):
headers = data.find_all('th')
for header in headers:
    print(header.text)
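If the goal is the single line shown in the question, one option is to collect the text of each item and join the pieces. A minimal sketch, assuming a table shaped like the one described (the HTML below is reconstructed from the three header cells quoted above, not fetched from the site):

```python
from bs4 import BeautifulSoup

# Made-up table built from the three header cells quoted in the question.
html = (
    '<div class="body"><table><tr>'
    '<th>Program Name</th><th>Internet Points</th><th>Kittens?</th>'
    '</tr></table></div>'
)
soup = BeautifulSoup(html, "html.parser")
data = soup.find("div", class_="body")

# find_all() returns a list-like ResultSet, so read the text of each item in turn.
header_texts = [th.get_text() for th in data.find_all("th")]

print(" ".join(header_texts))
```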

Use get_text() for only one HTML class - Python, BeautifulSoup

I am trying to access only the text inside one HTML class. I tried to follow the BeautifulSoup documentation, but I always get either the same error message or all the items in this tag.
My code.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.auchandirect.pl/auchan-warszawa/pl/pepsi-cola-max-niskokaloryczny-napoj-gazowany-o-smaku-cola/p-98502176"
r = requests.get(url, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}, timeout=15)
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
products_links = soup.findAll("a", {'class' : 'current-page'})
print(products_links)
In the results I only need this: 'Max niskokaloryczny napój gazowany o smaku cola'.
My results are:
<a class="current-page" href="/auchan-warszawa/pl/pepsi-cola-max-niskokaloryczny-napoj-gazowany-o-smaku-cola/p-98502176"><span>Max niskokaloryczny napój gazowany o smaku cola</span></a>
Or, if I apply this code according to the documentation (print(products_links.get_text())), PyCharm returns:
ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How can I extract the text correctly from "current-page"?
Why doesn't the function return the text inside the tags?
What is the difference between findAll("a", class_="current-page") and findAll("a", {'class' : 'current-page'}), given that both return the same results?
Any help will be appreciated.
findAll returns a list of all the items that match your search. If several tags match, all of them are returned in that list, which is why you cannot call get_text() on the result directly.
There should not be any difference between findAll("a", class_="current-page") and passing a dict of attributes, {'class' : 'current-page'}. I might be wrong, but I believe both forms exist because some of these methods were inherited from earlier versions.
You can extract the text from the returned object by selecting an element and reading its text attribute, as shown below:
products_links = soup.findAll("a", {'class' : 'current-page'}, text = True)
print(products_links[0].text)
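A quick way to check both points offline is to parse the anchor tag quoted in the question from a string. This is a sketch, not a test against the live site; it uses the built-in html.parser instead of lxml only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# The anchor tag quoted in the question, parsed from a string instead of the live page.
html = (
    '<a class="current-page" href="/auchan-warszawa/pl/pepsi-cola-max-niskokaloryczny-'
    'napoj-gazowany-o-smaku-cola/p-98502176">'
    '<span>Max niskokaloryczny napój gazowany o smaku cola</span></a>'
)
soup = BeautifulSoup(html, "html.parser")

# Both spellings of the class filter select the same tags.
by_keyword = soup.find_all("a", class_="current-page")
by_dict = soup.find_all("a", {"class": "current-page"})

# find() returns the first match (or None) rather than a list,
# so get_text() can be called on it after a None check.
first = soup.find("a", class_="current-page")
title = first.get_text() if first is not None else None

print(by_keyword == by_dict)
print(title)
```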

Keep getting 'TypeError: 'NoneType' object is not callable' with beautiful soup and python3

I am a beginner and struggling through a course, so this problem is probably really simple, but I am running this (admittedly messy) code (saved as x.py) to extract a link and a name from a website with line formats like:
<li style="margin-top: 21px;">
Prabhjoit
</li>
So I set up this:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
for line in soup:
    if not line.startswith('<li'):
        continue
    stuff = line.split('"')
    link = stuff[3]
    thing = stuff[4].split('<')
    name = thing[0].split('>')
    count = count + 1
    if count == 18:
        break
print(name[1])
print(link)
And it keeps producing the error:
Traceback (most recent call last):
File "x.py", line 15, in <module>
if not line.startswith('<li'):
TypeError: 'NoneType' object is not callable
I have struggled with this for hours, and I would be grateful for any suggestions.
line is not a string, and it has no startswith() method. It is a BeautifulSoup Tag object, because BeautifulSoup has parsed the HTML source text into a rich object model. Don't try to treat it as text!
The error occurs because when you access an attribute that a Tag object doesn't know about, it searches for a child element with that name (so here it executes line.find('startswith')), and since there is no element with that name, None is returned. Calling that None value, as line.startswith('<li') does, then fails with the error you see.
If you wanted to find the 18th <li> element, just ask BeautifulSoup for that specific element:
soup = BeautifulSoup(html, 'html.parser')
li_link_elements = soup.select('li a[href]', limit=18)
if len(li_link_elements) == 18:
    last = li_link_elements[-1]
    print(last.get_text())
    print(last['href'])
This uses a CSS selector to find only the <a> link elements that are nested inside a <li> element and that have an href attribute. The search is limited to just 18 such tags, and the last one is printed, but only if we actually found 18 in the page.
The element text is retrieved with the Tag.get_text() method, which will include text from any nested elements (such as <span> or <strong> or other extra markup), and the href attribute is accessed using standard indexing notation.
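Both behaviours can be reproduced on a small scale. The sketch below uses made-up markup with hypothetical example.com links and a limit of 2 instead of 18; it first shows the attribute lookup falling through to find(), then the selector-based alternative.

```python
from bs4 import BeautifulSoup

# Made-up list; the names and example.com URLs are hypothetical.
html = """
<ul>
  <li style="margin-top: 21px;"><a href="http://example.com/first">First</a></li>
  <li style="margin-top: 21px;"><a href="http://example.com/second">Second</a></li>
  <li style="margin-top: 21px;"><a href="http://example.com/third">Third</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li")

# Unknown attributes on a Tag are treated as a search for a child tag of that
# name; there is no <startswith> element, so the lookup yields None.
lookup = li.startswith

try:
    li.startswith("<li")  # calls None, which is what raises the TypeError
    error = None
except TypeError as exc:
    error = str(exc)

# Ask for the first two links inside <li> elements and take the last of them.
links = soup.select("li a[href]", limit=2)
second = links[-1]

print(lookup)
print(error)
print(second.get_text(), second["href"])
```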

Python 3 'NoneType' object has no attribute 'text'

# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
#specify the url
html = 'https://www.bloomberg.com/quote/SPX:IND'
# query the website and return the html to thevariable 'page'
page = urlopen(html)
# parse the html using beautiful soup and store in variable 'soup'
data = BeautifulSoup(page, 'html.parser')
#take out the <div> of name and get its value
name_box = data.find('h1', attrs={'class': 'companyName_99a4824b'})
name = name_box.text.strip() #strip is used to remove starting and trailing
print (name)
# get the index price
price_box = data.find('div', attrs={'class':'priceText_1853e8a5'})
price = price_box.text
print (price)
I was following a guide on medium.com and ran into some trouble because of my lack of knowledge of Python and scripting, but I think my error is at
name = name_box.text
because text is not defined, and I am unsure whether they would like me to define it using the BeautifulSoup library. Any help would be appreciated. The actual error is below:
RESTART: C:/Users/Parsons PC/AppData/Local/Programs/Python/Python36-32/projects/Scripts/S&P 500 website scraper/main.py
Traceback (most recent call last):
File "C:/Users/Parsons PC/AppData/Local/Programs/Python/Python36-32/projects/Scripts/S&P 500 website scraper/main.py", line 17, in <module>
name = name_box.text.strip() #strip is used to remove starting and trailing
AttributeError: 'NoneType' object has no attribute 'text'
The website https://www.bloomberg.com/quote/SPX:IND does not contain an <h1> tag with the class name companyName_99a4824b, so find returns None. That's why you are receiving the above error.
On the website, the <h1> tag looks like this:
<h1 class="companyName__99a4824b">S&P 500 Index</h1>
So to select it, you have to change the class name to companyName__99a4824b (with two underscores).
name_box = data.find('h1', attrs={'class': 'companyName__99a4824b'})
Final result:
# import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
#specify the url
html = 'https://www.bloomberg.com/quote/SPX:IND'
# query the website and return the html to thevariable 'page'
page = urlopen(html)
# parse the html using beautiful soup and store in variable 'soup'
data = BeautifulSoup(page, 'html.parser')
#take out the <div> of name and get its value
name_box = data.find('h1', attrs={'class': 'companyName__99a4824b'}) #edited companyName_99a4824b -> companyName__99a4824b
name = name_box.text.strip() #strip is used to remove starting and trailing
print (name)
# get the index price
price_box = data.find('div', attrs={'class':'priceText__1853e8a5'}) #edited priceText_1853e8a5 -> priceText__1853e8a5
price = price_box.text
print (price)
It would be better to also handle this exception, in case the class names change in the future.
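One way to cope with changing class names, sketched here under the assumption that only the hashed suffix changes while the readable prefix stays the same, is to match the class by prefix with a regular expression and return None when nothing is found. The helper name text_by_class_prefix is hypothetical, and the HTML is a made-up sample with a placeholder price, not the live page.

```python
import re
from bs4 import BeautifulSoup


def text_by_class_prefix(soup, tag_name, prefix):
    """Return the text of the first tag whose class starts with prefix, else None."""
    tag = soup.find(tag_name, class_=re.compile("^" + re.escape(prefix)))
    if tag is None:
        return None
    return tag.get_text(strip=True)


# Made-up markup using the class names from the answer; the price is a placeholder.
html = (
    '<h1 class="companyName__99a4824b">S&amp;P 500 Index</h1>'
    '<div class="priceText__1853e8a5">0.00</div>'
)
data = BeautifulSoup(html, "html.parser")

name = text_by_class_prefix(data, "h1", "companyName")
price = text_by_class_prefix(data, "div", "priceText")
missing = text_by_class_prefix(data, "div", "volumeText")

print(name, price, missing)
```

Callers still need to check for None, but the script no longer crashes with an AttributeError when an element is absent.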

Python None check seems to fail using BeautifulSoup

I have looked at similar posts, which come close to my case, but my result nonetheless seems unexpected.
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup(<html page of interest>)
if (soup.find_all("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS TEXT I AM LOOKING FOR")) is None):
    print('There was no entry')
else:
    print(soup.find("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS THE TEXT I AM LOOKING FOR")))
I obviously filtered out the actual HTML page, as well as the text in the regular expression. The rest is exactly as written. I get the following error:
Traceback (most recent call last):
File "/Users/appa/src/workspace/web_forms/WebForms/src/root/queryForms.py", line 51, in <module>
LoopThroughDays(form, id, trailer)
File "/Users/appa/src/workspace/web_forms/WebForms/src/root/queryForms.py", line 33, in LoopThroughDays
if (soup.find_all("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS THE TEXT I AM LOOKING FOR")) is None):
TypeError: 'NoneType' object is not callable
I understand that the text will sometimes be missing. But I thought that the way I have set up the if statement was precisely able to capture when it is missing, and therefore a NoneType.
Thanks in advance for any help!
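It looks like it's just a typo. It should be soup.findAll, not soup.find_all: the old BeautifulSoup 3 module imported here only provides findAll, and the find_all spelling was added later in bs4. I tried running it, and it works with the correction. So the full program should be: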
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup(<html page of interest>)
if (soup.findAll("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS TEXT I AM LOOKING FOR")) is None):
    print('There was no entry')
else:
    print(soup.find("td", attrs= {"class": "FilterElement"}, text= re.compile("HERE IS THE TEXT I AM LOOKING FOR")))
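One detail worth checking separately: as far as I know, findAll (find_all in bs4) returns an empty list rather than None when nothing matches, so an "is None" test on its result would never reach the 'There was no entry' branch. A sketch of a truthiness check instead, written against bs4 rather than the older BeautifulSoup 3 module used above, with made-up HTML; string= is the bs4 name for the text= argument, and find_cells is a hypothetical helper.

```python
import re
from bs4 import BeautifulSoup

# Made-up table; the class name comes from the question, the cell text is hypothetical.
html = (
    '<table><tr>'
    '<td class="FilterElement">sample entry</td>'
    '<td class="Other">something else</td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")


def find_cells(pattern):
    """Return all FilterElement cells whose text matches the pattern."""
    return soup.find_all("td", attrs={"class": "FilterElement"}, string=re.compile(pattern))


found = find_cells("sample")
missing = find_cells("no such text")

# An empty result is falsy, so test truthiness rather than comparing with None.
if not missing:
    print("There was no entry")
if found:
    print(found[0].get_text())
```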
