Why is BeautifulSoup(...).find(...) returning None?

I have a problem with my code (I use bs4):

elif 'temperature' in query:
    speak("where?")
    miejsce = takecommand().lower()
    search = f"Temperature in {miejsce}"
    url = f'https://www.google.com/search?q={search}'
    r = requests.get(url)
    data = BeautifulSoup(r.text , "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:
temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'
Could you help me, please?

data.find("div", class_="BNeawe") didn't return anything, so I believe Google changed how it displays weather since you last ran this code successfully.
If you search for 'Weather in {place}' yourself, then right-click the weather widget and choose Inspect Element (browser dependent), you can look for yourself at where the data sits in the page and see which class it is under.
It appears it was previously under the BNeawe class.
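In the meantime, a defensive pattern is to check the result of find for None before touching .text, so the assistant fails gracefully instead of raising. A minimal sketch, reusing the names from your snippet:

element = data.find("div", class_="BNeawe")
if element is not None:  # only read .text when the div was actually found
    temp = element.text
    speak(f"In {miejsce} there is {temp}")
else:
    speak("Sorry, I couldn't read the temperature from the page.")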

elif "temperature" in query or "temperatures" in query:
    search = "Temperature in New York"
    url = f"https://www.google.com/search?q={search}:"
    r = requests.get(url)
    data = BeautifulSoup(r.text, "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"Currently, the temperature in your region is {temp}")
Try this one. You were experiencing your problem in line 5, which is '(r.text , "html.parser")'; try to avoid these comma/space mistakes in the code...

Best practice would be to use an official API (Google's or a weather service's) directly. If you want to scrape, try to avoid selecting your elements by class, because classes are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:

for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/search?q=temperature"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.5'}, cookies={'CONSENT': 'YES+'})
soup = BeautifulSoup(response.text, 'html.parser')
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
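Note the User-Agent and Accept-Language headers and the CONSENT cookie: in my experience, without them Google tends to serve a consent page or differently structured markup, and the selector then finds nothing.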

Related

AttributeError: 'NoneType' object has no attribute 'find_all' Python Web Scraping w/ Beautiful Soup

I have two problems. First of all, I get the error that is listed in the title "AttributeError: 'NoneType' object has no attribute 'find_all'" whenever I activate this line of code. Secondly, I want to access one more statistic on this specific website as well. So, firstly, my code is below. This is meant to gather names from a website, trim off the excess, then take those names, insert them into a URL, and take two statistics. The first statistic that I am taking is on line 22, which is the source of the error. And the second statistic is in HTML and is also going to be listed after my code.
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint

res = requests.get('https://plancke.io/hypixel/guild/name/GBP')
soup = BeautifulSoup(res.text, 'lxml')
memberList = []
skillAverageList = []
for i in soup.select('.playerInfo'):
    memberList.append(i.text)
memberList = [e[37:-38] for e in memberList]
members = [re.sub("[A-Z][^A-Z]+$", "", member.split(" ")[1]) for member in memberList]
print(members)
for i in range(len(memberList) + 1):
    player = memberList[i]
    skyLeaMoe = requests.get('https://sky.lea.moe/stats/' + str(player))
    skillAverageList.append(soup.find("div", {"id": "additional_stats_container"}).find_all("div", class_="additional-stat")[-2].get_text(strip=True))
pprint(skillAverageList)
Below is the second statistic that I would like to scrape from this website as well (in HTML). This snippet comes from one specific player's page, but the code above will hopefully be able to cycle through the entire list (https://sky.lea.moe/stats/Igris/Apple).

<span class="stat-name">Total Slayer XP: </span>
<span class="stat-value">457,530</span>

I am sorry if this is a lot; I have almost no knowledge of HTML, and every attempt to learn it has been a struggle. Thanks in advance to anyone this reaches.
It seems that this site doesn't have a div with the id of "additional_stats_container", and therefore soup.find("div", {"id":"additional_stats_container"}) returns None.
Upon inspecting the HTML of this URL with a browser, I couldn't find such a div.
This script will print all names and their Total Slayer XP:
import requests
from bs4 import BeautifulSoup

url = 'https://plancke.io/hypixel/guild/name/GBP'
stats_url = 'https://sky.lea.moe/stats/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('td a[href*="/hypixel/player/stats/"]'):
    s = BeautifulSoup(requests.get(stats_url + a['href'].split('/')[-1]).content, 'html.parser')
    total_slayer_xp = s.select_one('span:contains("Total Slayer XP") + span')
    print('{:<30} {}'.format(a.text, total_slayer_xp.text if total_slayer_xp else '-'))
Prints:
[MVP+] Igris                   457,530
[VIP] Kutta                    207,665
[VIP] mistercint               56,455
[MVP+] zouce                   1,710,540
viellythedivelon               30
[MVP+] Louis7864               141,670
[VIP] Broadside1138            292,240
[VIP+] Babaloops               40
[VIP+] SparkleDuck9            321,290
[VIP] TooLongOfAUserNa         423,700
...etc.
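As an aside, newer versions of soupsieve (the selector engine behind select_one) deprecate the non-standard :contains() pseudo-class in favour of :-soup-contains(), so on a current install the selector above would be written as:

total_slayer_xp = s.select_one('span:-soup-contains("Total Slayer XP") + span')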

How to pull a specific html attribute into a variable

So the title is probably really badly worded, but I was not sure how else to word it. I asked for help using beautifulsoup4 to scrape data, and someone was kind enough to help me out.
import requests
from bs4 import BeautifulSoup
import re

# NJII
params = {
    'action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[page_id]': 26,
    'data[shortcode_id]': '1524685605316-ae64dc93-e23d-3',
    '_vcnonce': 'b9fb62cf69'  # Need to update this somehow
}
dateList = []
urlList = []
url = 'http://njii.com/wp-admin/admin-ajax.php'
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'html.parser')
for div in soup.find_all('div', class_='vc_gitem-animated-block'):
    if re.search('2018', div.find('a')['href']):
        urlList.append(div.find('a')['href'])
        dateList.append(div.find('a')['href'])
# print(urlList)
count = 0
while count < len(dateList):
    dateList[count] = re.search('[0-9]{4}/[0-9]{2}/[0-9]{2}', dateList[count])
    dateList[count] = dateList[count].group()
    count = count + 1
print(dateList[1])
So this works almost perfectly for what I needed, but then a problem occurred. The website I need to scrape data from for my project updates the _vcnonce variable daily. So my question really comes down to this: is it possible to get that specific HTML string into a variable, so that every time I run the code it updates automatically? Kind of like this:
variable = whatever the vcnonce attribute is
'_vcnonce': variable
or something like that. This is for a project where I need to get information, and I was able to use selenium and beautifulsoup for other websites. But this one is just giving me problems no matter what. So I tried to use selenium as well, but it would not work, and I am just not sure if I need the same parameters even with selenium. Sorry for the long question; not sure what would be the best approach to this.
You need to first obtain the value from the events page. This can then be used to make further requests. It is contained as an attribute inside a div element:
import requests
from bs4 import BeautifulSoup
import re

# First obtain the current nonce from the events page
r = requests.get("http://njii.com/events/")
soup = BeautifulSoup(r.content, 'html.parser')
vcnonce = soup.find('div', attrs={'data-vc-public-nonce': True})['data-vc-public-nonce']

# NJII
params = {
    'action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[page_id]': 26,
    'data[shortcode_id]': '1524685605316-ae64dc93-e23d-3',
    '_vcnonce': vcnonce,
}
dateList = []
urlList = []
url = 'http://njii.com/wp-admin/admin-ajax.php'
r = requests.get(url, params=params)
soup = BeautifulSoup(r.text, 'html.parser')
for div in soup.find_all('div', class_='vc_gitem-animated-block'):
    if re.search('2018', div.find('a')['href']):
        urlList.append(div.find('a')['href'])
        dateList.append(div.find('a')['href'])
dates = [re.search('[0-9]{4}/[0-9]{2}/[0-9]{2}', date).group() for date in dateList]
print(dates)
This would give you output such as:
['2018/11/01', '2018/10/22', '2018/10/09', '2018/10/09', '2018/10/03', '2018/09/27', '2018/09/21', '2018/09/13', '2018/09/12', '2018/08/24', '2018/08/20', '2018/08/02', '2018/07/27', '2018/07/11', '2018/07/06', '2018/06/21', '2018/06/08', '2018/05/24', '2018/05/17', '2018/05/14', '2018/05/04', '2018/04/20', '2018/03/28', '2018/03/26', '2018/03/23', '2018/03/22', '2018/03/15', '2018/03/15', '2018/02/27', '2018/02/19', '2018/01/18']

Using a For Loop with BeautifulSoup to Select Text on Different URLs

I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request each URL, and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different solutions to achieve my desired results; however, I have not been able to fully accomplish this.
Currently, I am able to select the first link of the table on page one, go to the new page and select the data I need, but I can't get the code to continue looping through every link on the first page.
import requests
from bs4 import BeautifulSoup

journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='
# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'), soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me; I just needed to modify the middle section, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("div", {"class": "results"})
table = results.find('table')
for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped after it got the fourth response (title/description) of the 100 results from the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages.
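A sketch of that outer loop, assuming the startPage parameter simply increments by one per page and reusing site_link, journal_site and the per-journal scraping from the question:

for page in range(5):  # startPage=0 .. startPage=4
    r = requests.get(site_link + str(page))
    soup = BeautifulSoup(r.text, 'html.parser')
    results = soup.find("div", {"class": "results"})
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_link = row['href']
        journal_page = journal_site + journal_link
        # ...same title/description scraping as above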
Hope this helps.

Web Scraping with Python (city) as parameter

def findWeather(city):
    import urllib
    connection = urllib.urlopen("http://www.canoe.ca/Weather/World.html")
    rate = connection.read()
    connection.close()
    currentLoc = rate.find(city)
    curr = rate.find("currentDegree")
    temploc = rate.find("</span>", curr)
    tempstart = rate.rfind(">", 0, temploc)
    print "current temp:", rate[tempstart+1:temploc]
The link is provided above. The issue I have is that every time I run the program and use, say, "Brussels" in Belgium as the parameter, i.e. findWeather("Brussels"), it will always print 24c as the temperature, whereas (as I am writing this) it should be 19c. This is the case for many other cities provided by the site. Help on this code would be appreciated.
Thanks!
This one should work:
import requests
from bs4 import BeautifulSoup

url = 'http://www.canoe.ca/Weather/World.html'
response = requests.get(url)
# Get the text of the contents
html_content = response.text
# Convert the html content into a beautiful soup object
soup = BeautifulSoup(html_content, 'lxml')
cities = soup.find_all("span", class_="titleText")
cels = soup.find_all("span", class_="currentDegree")
for x, y in zip(cities, cels):
    print(x.text, y.text)
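If you only want the city that was passed in as the parameter, you could filter inside that same loop. A minimal sketch, assuming the city name appears verbatim inside the titleText span:

def findWeather(city):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    cities = soup.find_all("span", class_="titleText")
    cels = soup.find_all("span", class_="currentDegree")
    for x, y in zip(cities, cels):
        if city.lower() in x.text.lower():
            print("current temp:", y.text.strip())
            break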

Joining URL throwing exception

I have two variables, one containing the absolute URL, and another with the relative path to another section. First I tried just a simple concatenation.
absolute_url = 'www.example.com'
relative_url = '/downloads/images'
url = absolute_url + relative_url
When I print the url variable, I have a well-formed URL. But when I try to use requests or urllib2 to retrieve the data, about half the time it throws an exception: 'NoneType' object has no attribute '__getitem__'
Then I researched and thought that maybe I should use urlparse.urljoin() to do this, but I still get the error.
But what is intriguing to me is that sometimes it works and sometimes it doesn't. Any ideas of what is going on here?
EDIT
Here is the actual code:
url = "http://www.hdwallpapers.in"
html = requests.get(url)
soup = BeautifulSoup(html.text)
categories = ("Nature", "Animals & Birds", "Beach", "Bikes", "Cars","Dreamy & Fantasy", "Others", "Travel & World")
random_category = random.randint(0, len(categories)) - 1
selected_category = categories[random_category]
selected_category_url = soup.find('a', text=selected_category)
category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])
category_page_html = requests.get(category_page_url_join)
You have a list of categories:
categories = ("Nature", "Animals & Birds", "Beach", "Bikes", "Cars","Dreamy & Fantasy", "Others", "Travel & World")
You're then picking one at random and searching for it:
random_category = random.randint(0, len(categories)) - 1
selected_category = categories[random_category]
selected_category_url = soup.find('a', text=selected_category)
This would be more easily written and just as readable as:
selected_category_url = soup.find('a', text=random.choice(categories))
Now your problem is no doubt coming from:
category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])
This means that your selected_category_url ended up None because your soup.find didn't actually find anything. So in effect you're trying to run None['href'] (which of course fails...)
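A quick way to make that failure explicit is to check the result before subscripting it (a sketch):

selected_category_url = soup.find('a', text=random.choice(categories))
if selected_category_url is None:
    print('No link found for that category')
else:
    category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])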
Note that requests won't do any HTML entity escaping, but BeautifulSoup will try where it can, so, e.g.:

from bs4 import BeautifulSoup
soup1 = BeautifulSoup('smith &amp; jones')
soup2 = BeautifulSoup('smith & jones')
soup1, soup2
(<html><body><p>smith &amp; jones</p></body></html>,
 <html><body><p>smith &amp; jones</p></body></html>)

So, since you say "about half of the time", it's because you've got 3 choices you're searching for that won't match... try replacing the & in your categories with &amp; instead.
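That is, under this suggestion the affected entries would look like:

categories = ("Nature", "Animals &amp; Birds", "Beach", "Bikes", "Cars", "Dreamy &amp; Fantasy", "Others", "Travel &amp; World")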
