Using find_all in bs4 - python

When I try to parse more than one element I get an error on the find_all line (when I add all to find):
Error: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element
import requests
from bs4 import BeautifulSoup

heroes_page_list = []
url = f'https://dota2.fandom.com/wiki/Dota_2_Wiki'
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'lxml')
heroes = soup.find_all('div', class_='heroentry').find('a')
for hero in heroes:
    hero_url = heroes.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
# print(heroes_page_list)
with open('heroes_page_list.txt', "w") as file:
    for line in heroes_page_list:
        file.write(f'{line}\n')

You are searching for a tag inside a list of div tags; you need to do it like this:
heroes = soup.find_all('div', class_='heroentry')
a_tags = [hero.find('a') for hero in heroes]
for a_tag in a_tags:
    hero_url = a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
heroes_page_list looks like this:
['https://dota2.fandom.com/wiki/Abaddon',
'https://dota2.fandom.com/wiki/Alchemist',
'https://dota2.fandom.com/wiki/Axe',
'https://dota2.fandom.com/wiki/Beastmaster',
'https://dota2.fandom.com/wiki/Brewmaster',
'https://dota2.fandom.com/wiki/Bristleback',
'https://dota2.fandom.com/wiki/Centaur_Warrunner',
....

The error message tells you exactly what you need to do.
The find() method is only usable on a single element, while find_all() returns a list of elements. You are trying to apply find() to that whole list.
If you want to apply find('a'), you should do something similar to this:
heroes = soup.find_all('div', class_='heroentry')
for hero in heroes:
    hero_a_tag = hero.find('a')
    hero_url = hero_a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
You basically have to apply the find() method to every element present in the list generated by the find_all() method.
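As an alternative, a CSS selector lets you do the div-then-anchor lookup in one pass. A small sketch of the same loop using select(); the heroentry class and URL prefix are taken from the question above:

# 'div.heroentry a' matches every <a> inside a div with class "heroentry"
heroes_page_list = []
for a_tag in soup.select('div.heroentry a'):
    heroes_page_list.append("https://dota2.fandom.com" + a_tag.get('href'))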

Related

How to fetch/scrape all elements from a html "class" which is inside "span"?

I am trying to scrape data from a website, collecting data from all elements under a "class" which is inside a "span", using this piece of code. But I am ending up fetching only one element instead of all of them.
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    #element = soup.findAll("div", {"class": "sold-property-listing__location"})
    place_name = expand_hits[1].find("div", {"class": "sold-property-listing__location"}).findAll("span", {"class": "item-link"})[1].getText()
    print(place_name)
    apartments.append(final_str)
Expected result for print(place_name)
Stockholm
Malmö
Copenhagen
...
..
.
The result which I am getting for print(place_name):
Malmö
Malmö
Malmö
...
..
.
When I try to fetch the contents from expand_hits[1] I get only one element. If I don't specify the index, the scraper throws an error regarding the usage of find(), find_all() and findAll(). As far as I understood, I think I have to access the contents of the elements iteratively.
Any help is much appreciated.
Thanks in advance!
Use the loop variable rather than indexing into the same collection with the same index (expand_hits[1]), and append place_name, not final_str:
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    place_name = hit_property.find("div", {"class": "sold-property-listing__location"}).find("span", {"class": "item-link"}).getText()
    print(place_name)
    apartments.append(place_name)
You then only need find(), and no indexing.
Add a User-Agent header to ensure you get results. Also, note that I have to pick a parent node, because at least one result will not be captured by using the class item-link, e.g. Övägen 6C. I use replace to get rid of the hidden text that is present due to now selecting the parent node.
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.hemnet.se/salda/bostader?location_ids%5B%5D=474035"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')

for result in soup.select('.sold-results__normal-hit'):
    print(re.sub(r'\s{2,}', ' ', result.select_one('.sold-property-listing__location h2 + div').text).replace(result.select_one('.hide-element').text.strip(), ''))
If you only want the location within Malmö, e.g. Limhamns Sjöstad, you need to check how many child span tags there are for each listing:
for result in soup.select('.sold-results__normal-hit'):
    nodes = result.select('.sold-property-listing__location h2 + div span')
    if len(nodes) == 2:
        place = nodes[1].text.strip()
    else:
        place = 'not specified'
    print(place)

Short & Easy - soup.find_all Not Returning Multiple Tag Elements

I need to scrape all 'a' tags with the "result-title" class, and all 'span' tags with either class 'results-price' or 'results-hood'. Then, write the output to a .csv file across multiple columns. The current code does not print anything to the csv file. This may be bad syntax but I really can't see what I am missing. Thanks.
f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class": ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30, 120))
        print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
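The class_ argument also accepts a list, so the tag names and the classes from your code could be combined in a single call. A sketch (class names copied from the question, not verified against the live page):

# A tag matches when its name is in the first list and any of its
# classes is in the class_ list.
matches = soup.find_all(["a", "span"],
                        class_=["result-title", "result-price", "result-hood"])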
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_ = 'result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_ = ['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row) # verify we are getting the results we expect
f.writerow(row)
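Once each field prints correctly on its own, getting one row per listing usually means iterating over a per-listing container instead of searching the whole soup. A hedged sketch, assuming each listing sits in a wrapper element; the li tag and 'result-row' class here are hypothetical, so substitute whatever the real page uses:

# Hypothetical wrapper: <li class="result-row"> ... </li>
for listing in soup.find_all('li', class_='result-row'):
    a = listing.find('a', class_='result-title')
    price = listing.find('span', class_='result-price')
    hood = listing.find('span', class_='result-hood')
    # guard against missing fields before writing the row
    f.writerow([
        a['href'] if a else '',
        a.get_text(strip=True) if a else '',
        price.get_text(strip=True) if price else '',
        hood.get_text(strip=True) if hood else '',
    ])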

Parsing a div with a "class" attribute

Using the BeautifulSoup module in Python, I'm trying to parse the webpage snippet below.
<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>
I'm trying to get the script below to return 2016-05-08T1231Z, which is found in the second div with the timestamp updated class.
with open("index.html", 'rb') as source_file:
    soup = BeautifulSoup(source_file.read())  # Read the source file and get BeautifulSoup to work with it.

div_1 = soup.find("div", {"class": "span-body"}).contents[0]  # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})  # Parse the second div.
print div_2
div_1 returns what I wanted to get (the second div), but div_2 doesn't; instead it only gives me an empty list in return.
How can I fix this problem?
A couple of options, for all of which you should just drop the contents[0]:
div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})
This will return a list with one element in it:
[<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>]
Just use find():
div_1 = soup.find("div", {"class": "span-body"})
div_2 = div_1.find("div", {'class': 'timestamp updated'})
print(div_2)
Result:
<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>
If you don't need the intermediate div_1, why not just go straight to div_2?
div_2 = soup.find("div", {'class': 'timestamp updated'})
Edit from comment: To get the value of the title attribute you can index it like this:
div_2['title']
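Putting it together, a minimal self-contained example using the HTML snippet from the question:

from bs4 import BeautifulSoup

html = '<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>'
soup = BeautifulSoup(html, 'html.parser')

# bs4 can match the full class string "timestamp updated"
div_2 = soup.find("div", {"class": "timestamp updated"})
print(div_2['title'])  # 2016-05-08T1231Z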
To find what you want from div_1, you need to use the find function again; also, you can get rid of the contents[0], as find doesn't return a list.
soup = BeautifulSoup(source_file.read()) # Read the source file and get BeautifulSoup to work with it.
div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1.find("div", {"class": "timestamp updated"}) # Parse the second div.
print div_2

How to count the number of lines of code retrieved using beautiful soup?

Is there any function in beautiful soup to count the number of lines retrieved? Or is there any other way this can be done?
from bs4 import BeautifulSoup
import string

content = open("webpage.html", "r")
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
If you meant to get the number of elements retrieved by find_all(), try using the len() function:
......
redditAll = soup.find_all("a")
print(len(redditAll))
UPDATE:
You can change the logic to select the specific elements in one go, using a CSS selector. This way, getting the number of elements retrieved is as easy as calling len() on the return value:
imgTags = soup.select("div.classname ul.classname a.classname img")
# print number of <img> retrieved:
print(len(imgTags))

for tag in imgTags:
    name = tag['alt']
    print(name)
Or you can keep the logic using multiple for loops, and manually keep track of the number of elements in a variable:
counter = 0
divTag = soup.find_all("div", {"class": "classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
            # update counter:
            counter += 1
print(counter)

How to append findall results to list?

I am trying to parse a website for all links that have the attribute nofollow.
I want to print that list, one link at a time.
However I failed to append the results of findAll() to my list box (my attempt is in the commented-out lines).
What did I do wrong?
import sys
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen(sys.argv[1]).read()
soup = BeautifulSoup(page)
soup.prettify()
box = []
for anchor in soup.findAll('a', href=True, attrs={'rel': 'nofollow'}):
    # box.extend(anchor['href'])
    print anchor['href']
# print box
You are looping over soup.findAll(), so each anchor is a single element, not a list; extend() treats its argument as an iterable, so extending with the href string would add it character by character. Use .append() for individual elements:
box.append(anchor['href'])
You could also use a list comprehension to grab all href attributes:
box = [a['href'] for a in soup.findAll('a', href=True, attrs = {'rel' : 'nofollow'})]
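The question uses Python 2 with BeautifulSoup 3 (urllib2, findAll). For reference, a minimal sketch of the same script on Python 3 with requests and bs4; untested, but every call shown is standard bs4 API:

import sys
import requests
from bs4 import BeautifulSoup

page = requests.get(sys.argv[1]).text
soup = BeautifulSoup(page, 'html.parser')

# find_all replaces findAll; rel is a multi-valued attribute in bs4,
# so rel='nofollow' matches the token anywhere in the attribute.
box = [a['href'] for a in soup.find_all('a', href=True, rel='nofollow')]
for href in box:
    print(href)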
