Joining URL throwing exception - python

I have two variables, one containing the absolute URL and another with the relative path to another section. First I tried just a simple concatenation:
absolute_url = "www.example.com"
relative_url = "/downloads/images"
url = absolute_url + relative_url
When I print the url variable, I have a well-formed URL. But when I try to use requests or urllib2 to retrieve the data, about half the time it throws an exception: 'NoneType' object has no attribute '__getitem__'
Then I researched and thought that maybe I should use urlparse.urljoin() to do this, but I still get the error.
What is intriguing to me is that sometimes it works and sometimes it doesn't. Any ideas what is going on here?
EDIT
Here is the actual code:
url = "http://www.hdwallpapers.in"
html = requests.get(url)
soup = BeautifulSoup(html.text)
categories = ("Nature", "Animals & Birds", "Beach", "Bikes", "Cars","Dreamy & Fantasy", "Others", "Travel & World")
random_category = random.randint(0, len(categories)) - 1
selected_category = categories[random_category]
selected_category_url = soup.find('a', text=selected_category)
category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])
category_page_html = requests.get(category_page_url_join)

You have a list of categories:
categories = ("Nature", "Animals & Birds", "Beach", "Bikes", "Cars","Dreamy & Fantasy", "Others", "Travel & World")
You're then picking one at random and searching for it:
random_category = random.randint(0, len(categories)) - 1
selected_category = categories[random_category]
selected_category_url = soup.find('a', text=selected_category)
This would be more easily written and just as readable as:
selected_category_url = soup.find('a', text=random.choice(categories))
Now your problem is no doubt coming from:
category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])
This means that your selected_category_url ended up None because your soup.find didn't actually find anything. So in effect you're trying to run None['href'] (which of course fails...)
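If you want to fail loudly (or fall back to another category) instead of hitting that cryptic error, a minimal guard might look like this (a sketch using the same variable names as above, assumed to be in scope):
selected_category_url = soup.find('a', text=selected_category)

# soup.find returns None when nothing matches, and None['href'] raises
# the "'NoneType' object has no attribute '__getitem__'" error seen above.
if selected_category_url is None:
    raise ValueError("No link found for category: %r" % selected_category)

category_page_url_join = urlparse.urljoin(url, selected_category_url['href'])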
Note that requests won't do any HTML entity escaping, but BeautifulSoup will try where it can, so, e.g.:
from bs4 import BeautifulSoup
soup1 = BeautifulSoup('smith &amp; jones')
soup2 = BeautifulSoup('smith & jones')
soup1, soup2
(<html><body><p>smith &amp; jones</p></body></html>,
 <html><body><p>smith &amp; jones</p></body></html>)
So, since you say "about half of the time", it's because you've got 3 choices you're searching for that won't match... try replacing the & in your categories with &amp; instead.
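For example, the tuple with the ampersands escaped might look like this (a sketch of the suggested change, untested against the live site):
# Categories with '&' written as '&amp;', per the suggestion above -- verify
# against the actual link texts on the page before relying on this.
categories = ("Nature", "Animals &amp; Birds", "Beach", "Bikes", "Cars",
              "Dreamy &amp; Fantasy", "Others", "Travel &amp; World")

# Quick sanity check: which category names actually match a link on the page?
for name in categories:
    print(name, soup.find('a', text=name) is not None)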

Related

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but I don't know how to code it. Basically I want to only display new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing that with another GET request (a second variable) that maybe runs first thing in the morning. Alternatively, compare both text values instead of the len. If it doesn't match, then it shows the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser
def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code up to listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape
from bs4 import BeautifulSoup
import requests
import json
lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)
with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
As long as they don't tend to recycle the listing IDs, this should be fairly reliable. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
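If you'd rather not delete the file by hand, a small tweak to the final json.dump block (a sketch that keeps only the most recent IDs, assuming the same lFilename and variable names as above) could cap the file size automatically:
max_saved_ids = 500  # or as preferred

with open(lFilename, 'w') as f:
    # keep only the newest IDs so the saved list can't grow without bound
    all_ids = prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ]
    json.dump(all_ids[-max_saved_ids:], f)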
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas
lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')
def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets
soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])
prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.
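For instance, checking the log later could look something like this (a sketch, assuming the listings.csv produced by the code above):
import pandas

log = pandas.read_csv('listings.csv')  # the lFilename written above

# rows from the most recent scrape that were flagged as new
latest = log[log['last_scraped'] == log['last_scraped'].max()]
print(latest[latest['new_listing']][['listing_id', 'title', 'listing_link']])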
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif latestGuitar not in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

Why is BeautifulSoup(...).find(...) returning None?

I have a problem with my code (I use bs4):
elif 'temperature' in query:
    speak("where?")
    miejsce=takecommand().lower()
    search = (f"Temperature in {miejsce}")
    url = (f'https://www.google.com/search?q={search}')
    r = requests.get(url)
    data = BeautifulSoup(r.text , "html.parser")
    temp = data.find("div", class_="BNeawe").text
    speak(f"In {search} there is {temp}")
and the error is:
temp = data.find("div", class_="BNeawe").text
AttributeError: 'NoneType' object has no attribute 'text'
Could you help me, please?
data.find("div", class_="BNeawe") didnt return anything, so i believe google changed how it displays weather since you last ran this code successfully.
If you search for yourself 'Weather in {place}' then right click the weather widget and choose Inspect Element (browser dependent), you can look for yourself at where the data is in the page, and see which class the data is under.
It appears it was previously under the BNeawe class.
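Whichever class you end up targeting, it's worth guarding against find returning None so the failure is readable (a minimal sketch reusing the variables from your code; the class name is a placeholder to replace after inspecting the page):
widget = data.find("div", class_="BNeawe")  # replace "BNeawe" with whatever class you find
if widget is None:
    speak("Sorry, I couldn't find the temperature on the results page.")
else:
    temp = widget.text
    speak(f"In {search} there is {temp}")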
elif "temperature" in query or "temperatures" in query:
search = "Temperature in New York"
url = f"https://www.google.com/search?q={search}:"
r = requests.get(url)
data = BeautifulSoup(r.text, "html.parser")
temp = data.find("div", class_="BNeawe").text
speak(f"Currently, the temperature in your region is {temp}")
Try this one; you were experiencing your problem in line 5, which is '(r.text, "html.parser")'.
Try to avoid these comma-space mistakes in the code...
Best practice would be to use a Google / weather API directly. If you want to scrape, try to avoid selecting your elements by class, because classes are often dynamic.
Instead, focus on an id if possible, or use the HTML structure:
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break
Example
from bs4 import BeautifulSoup
import requests
url = "https://www.google.com/search?q=temperature"
response = requests.get(url, headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Language':'en-US,en;q=0.5'}, cookies={'CONSENT':'YES+'})
soup = BeautifulSoup(response.text)
for p in list(soup.select_one('span:-soup-contains("weather.com")').parents):
    if '°' in p.text:
        print(p.next.get_text(strip=True))
        break

AttributeError: 'NoneType' object has no attribute 'find_all' Python Web Scraping w/ Beautiful Soup

I have two problems. First of all, I get the error that is listed in the title "AttributeError: 'NoneType' object has no attribute 'find_all'" whenever I activate this line of code. Secondly, I want to access one more statistic on this specific website as well. So, firstly, my code is below. This is meant to gather names from a website, trim off the excess, then take those names, insert them into a URL, and take two statistics. The first statistic that I am taking is on line 22, which is the source of the error. And the second statistic is in HTML and is also going to be listed after my code.
import requests
from bs4 import BeautifulSoup
import re
res = requests.get('https://plancke.io/hypixel/guild/name/GBP')
soup = BeautifulSoup(res.text, 'lxml')
memberList = []
skillAverageList = []
for i in soup.select('.playerInfo'):
    memberList.append(i.text)
memberList = [e[37:-38] for e in memberList]
members = [re.sub("[A-Z][^A-Z]+$", "", member.split(" ")[1]) for member in memberList]
print(members)
for i in range(len(memberList) + 1):
    player = memberList[i]
    skyLeaMoe = requests.get('https://sky.lea.moe/stats/' + str(player))
    skillAverageList.append(soup.find("div", {"id":"additional_stats_container"}).find_all("div",class_="additional-stat")[-2].get_text(strip=True))
pprint(skillAverageList)
Below is the second statistic that I would like to scrape from this website as well (in HTML). This specific statistic is attributed to this specific site, but the code above will hopefully be able to cycle through the entire list (https://sky.lea.moe/stats/Igris/Apple).
<span class="stat-name">Total Slayer XP: </span> == $0
<span class ="stat-value">457,530</span>
I am sorry if this is a lot, I have almost no knowledge of HTML and any attempt for me to learn it has been a struggle. Thanks in advance to anyone this reaches.
It seems that this site doesn't have a div with the id of "additional_stats_container", and therefore soup.find("div", {"id":"additional_stats_container"}) returns None.
Upon inspecting the HTML of this URL with a browser, I couldn't find such a div.
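A quick way to confirm that is to print the result of the lookup directly (a sketch against one of the stats pages, assuming requests and BeautifulSoup are imported as in your code):
page = requests.get('https://sky.lea.moe/stats/Igris/Apple')
s = BeautifulSoup(page.text, 'lxml')
# prints None if the div really isn't present in the HTML the server returns
print(s.find("div", {"id": "additional_stats_container"}))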
This script will print all names and their Total Slayer XP:
import requests
from bs4 import BeautifulSoup
url = 'https://plancke.io/hypixel/guild/name/GBP'
stats_url = 'https://sky.lea.moe/stats/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for a in soup.select('td a[href*="/hypixel/player/stats/"]'):
    s = BeautifulSoup(requests.get(stats_url + a['href'].split('/')[-1]).content, 'html.parser')
    total_slayer_xp = s.select_one('span:contains("Total Slayer XP") + span')
    print('{:<30} {}'.format(a.text, total_slayer_xp.text if total_slayer_xp else '-'))
Prints:
[MVP+] Igris 457,530
[VIP] Kutta 207,665
[VIP] mistercint 56,455
[MVP+] zouce 1,710,540
viellythedivelon 30
[MVP+] Louis7864 141,670
[VIP] Broadside1138 292,240
[VIP+] Babaloops 40
[VIP+] SparkleDuck9 321,290
[VIP] TooLongOfAUserNa 423,700
...etc.

BeautifulSoup yields only one result

I have searched for a number of similar problems here, but I am still confused why my code yields only 1 result, instead of at least 15 on each page.
for pagenumber in range (0,2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('article', class_="item")
    for car in soup_table.findAll('a'):
        link = car.get('href')
        sub_url = 'https://www.autowereld.nl' + link
        print(sub_url)
You are using soup.find to find something with tag "article" and class "item". From the documentation, soup.find only finds one instance of a tag, and you are looking for multiple instances of a tag. Something like
for pagenumber in range (0,2):
    url = 'https://www.autowereld.nl/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='
    txt = requests.get(url + str(pagenumber))
    soup = BeautifulSoup(txt.text, 'html.parser')
    soup_articles = soup.findAll('article', class_="item")
    for article in soup_articles:
        for car in article.findAll('a'):
            link = car.get('href')
            sub_url = 'https://www.autowereld.nl' + link
            print(sub_url)
might work. I also recommend using find_all instead of findAll if you're using bs4 since the mixed case versions of these methods are deprecated, but that's up to you.
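Since each card is just an article.item element containing links, an equivalent version using a CSS selector might look like this (a sketch assuming the same page structure as above):
from bs4 import BeautifulSoup
import requests

base = 'https://www.autowereld.nl'
url = base + '/volkswagen/?mdl=volkswagen_golf|volkswagen_golf-alltrack|volkswagen_golf-cabriolet|volkswagen_golf-plus|volkswagen_golf-sportsvan|volkswagen_golf-variant&p='

for pagenumber in range(0, 2):
    soup = BeautifulSoup(requests.get(url + str(pagenumber)).text, 'html.parser')
    # select every link inside every article.item card in one pass
    for car in soup.select('article.item a[href]'):
        print(base + car['href'])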

Is there a function available with beautifulsoup that will delete all the whitespaces

I am pretty new to Python. I am trying to scrape the website https://nl.soccerway.com/, and for this scraping I use BeautifulSoup.
The only problem is that when I scrape the team names, they come out with whitespace surrounding them on the left and right. How can I remove this? I know many people have asked this question before, but I cannot get it to work.
2nd question: how can I extract the href (and its title) out of a TD? See the provided HTML code; the club name is Perugia.
[HTML snippet not preserved: a table cell containing "search google" and "search stackoverflow" links and a link whose text is "Perugia"]
import requests
from bs4 import BeautifulSoup

def main():
    url = 'https://nl.soccerway.com/'
    get_detail_data(get_page(url))

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('response code is:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'lxml')
        return soup

def get_detail_data(soup):
    minutes = ""
    score = ""
    TeamA = ""
    TeamB = ""
    table_data = soup.find('table',class_='table-container')
    try:
        for tr in table_data.find_all('td', class_='minute visible'):
            minutes = (tr.text)
            print(minutes)
    except:
        pass
    try:
        for tr in soup.find_all('td', class_='team team-a'):
            TeamA = tr.text
            print(TeamA)
    except:
        pass

if __name__ == '__main__':
    main()
You can use the get_text(strip=True) method from BeautifulSoup:
tr.get_text(strip=True)
Use the strip() method to remove trailing and leading whitespace. So in your case, it would be:
TeamA = tr.text.strip()
To get the href attribute, use the pattern tag['attribute']. In your case, it would be:
href = tr.a['href']
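Putting both answers together, the team loop might end up looking something like this (a sketch, assuming each td contains a single <a> for the club, as in the posted snippet):
for tr in soup.find_all('td', class_='team team-a'):
    team_name = tr.get_text(strip=True)  # name without the surrounding whitespace
    link = tr.a                          # first <a> inside the cell
    if link is not None:
        print(team_name, link.get('href'), link.get('title'))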
