Parsing a div with a "class" attribute - python

Using the BeautifulSoup module in Python, I'm trying to parse the HTML snippet below.
<div class="span-body"><div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div></div>
I'm trying to get the script below to return 2016-05-08T1231Z, which is found in the second div with the timestamp updated class.
with open("index.html", 'rb') as source_file:
    soup = BeautifulSoup(source_file.read()) # Read the source file and get BeautifulSoup to work with it.

div_1 = soup.find("div", {"class": "span-body"}).contents[0] # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"}) # Parse the second div.
print div_2
div_1 returns what I wanted (the second div), but div_2 doesn't; instead it only gives me an empty list in return.
How can I fix this problem?

A couple of options, for all of which you should just drop contents[0]:
div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1("div", {"class": "timestamp updated"})
This will return a list with one element in it:
[<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>]
Just use find():
div_1 = soup.find("div", {"class": "span-body"})
div_2 = div_1.find("div", {'class': 'timestamp updated'})
print(div_2)
Result:
<div class="timestamp updated" title="2016-05-08T1231Z">May 8, 12:31 PM EDT</div>
If you don't need the intermediate div_1 why not just go straight to div_2?
div_2 = soup.find("div", {'class': 'timestamp updated'})
Edit from comment: To get the value of the title attribute you can index it like this:
div_2['title']
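Putting the pieces together, a minimal end-to-end sketch (assuming the snippet from the question is saved as index.html):
from bs4 import BeautifulSoup

with open("index.html") as source_file:
    soup = BeautifulSoup(source_file.read(), "html.parser")

div_2 = soup.find("div", {"class": "timestamp updated"})
print(div_2["title"])    # 2016-05-08T1231Z
print(div_2.get_text())  # May 8, 12:31 PM EDT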

To find what you want from div_1, you need to use the find() function again. You can also get rid of contents[0], since find() doesn't return a list.
soup = BeautifulSoup(source_file.read()) # Read the source file and get BeautifulSoup to work with it.
div_1 = soup.find("div", {"class": "span-body"}) # Parse the first div.
div_2 = div_1.find("div", {"class": "timestamp updated"}) # Parse the second div.
print(div_2)

Related

How to extract a specific part of html using BeautifulSoup?

I am trying to extract what's within the 'title' attribute from the following html, but so far I haven't managed to.
<div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00">
This is my code:
from bs4 import BeautifulSoup

with open("messages.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results)
And the output is a list with all the <div> elements from the html file.
To access the value inside title, simply call ['title'].
If you use find_all, then this will return a list, so you will need an index (e.g. [0]['title']).
For example:
from bs4 import BeautifulSoup
fp = '<html><div class="pull_right date details" title="22.12.2022 01:49:03 UTC-03:00"></html>'
soup = BeautifulSoup(fp, 'html.parser')
results = soup.find_all('div', attrs={'class':'pull_right date details'})
print(results[0]['title'])
Or:
results = soup.find('div', attrs={'class':'pull_right date details'})
print(results['title'])
Output:
22.12.2022 01:49:03 UTC-03:00
22.12.2022 01:49:03 UTC-03:00
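If the file contains several such divs, a list comprehension collects every title in one pass; a sketch reusing the same class:
results = soup.find_all('div', attrs={'class': 'pull_right date details'})
titles = [div['title'] for div in results]  # one title string per matching div
print(titles)  # ['22.12.2022 01:49:03 UTC-03:00', ...]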

Using find_all in bs4

When I parse for more than one class I get an error on line 12 (when I change find to find_all).
Error: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element
import requests
from bs4 import BeautifulSoup

heroes_page_list = []
url = f'https://dota2.fandom.com/wiki/Dota_2_Wiki'
q = requests.get(url)
result = q.content
soup = BeautifulSoup(result, 'lxml')
heroes = soup.find_all('div', class_='heroentry').find('a')
for hero in heroes:
    hero_url = heroes.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
# print(heroes_page_list)
with open('heroes_page_list.txt', "w") as file:
    for line in heroes_page_list:
        file.write(f'{line}\n')
You are searching for a tag inside a list of div tags; you need to do it like this:
heroes = soup.find_all('div', class_='heroentry')
a_tags = [hero.find('a') for hero in heroes]
for a_tag in a_tags:
    hero_url = a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
heroes_page_list looks like this:
['https://dota2.fandom.com/wiki/Abaddon',
'https://dota2.fandom.com/wiki/Alchemist',
'https://dota2.fandom.com/wiki/Axe',
'https://dota2.fandom.com/wiki/Beastmaster',
'https://dota2.fandom.com/wiki/Brewmaster',
'https://dota2.fandom.com/wiki/Bristleback',
'https://dota2.fandom.com/wiki/Centaur_Warrunner',
....
The error states everything you need to know.
The find() method is only usable on a single element, while find_all() returns a list of elements. You are trying to apply find() to a list of elements.
If you want to apply find('a') you should do something similar to this:
heroes = soup.find_all('div', class_='heroentry')
for hero in heroes:
    hero_a_tag = hero.find('a')
    hero_url = hero_a_tag.get('href')
    heroes_page_list.append("https://dota2.fandom.com" + hero_url)
You basically have to apply the find() method to every element present in the list generated by the find_all() method.
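As an alternative sketch, a single CSS selector via select() flattens the two-step lookup into one call (assuming, as above, that each heroentry div wraps its hero link in an anchor):
# Sketch: select() returns every <a> nested under a div.heroentry in one pass.
anchors = soup.select('div.heroentry a')
heroes_page_list = ["https://dota2.fandom.com" + a.get('href') for a in anchors]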

Locating HTML parts

I am trying to extract each row individually to eventually create a dataframe and export it to a csv. I can't locate the individual parts of the html.
I can find and save the entire content (although I can only seem to save it in a loop, so the pages appear hundreds of times), but I can't find any html parts nested beneath it. My code is as follows, trying to find the first row:
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
content = soup.find('div', {'class': 'view-content'})
for infos in content:
    try:
        data = infos.find('div', {'class': 'type type_18'}).text
    except:
        print("None found")
df = pd.DataFrame(data)
df.columns = df.columns.str.lower().str.replace(': ','')
df[['type','rrr']] = df['rrr'].str.split("|",expand=True)
df.to_csv(r'savehere.csv', index=False, header=True)
This code just prints "None found" because, I assume, it hasn't found anything else to print. I don't know if I am not finding the right html part or what.
Any help would be much appreciated.
What happens?
The main issue here is that content = soup.find('div', {'class': 'view-content'}) is not a ResultSet but a single element, so your loop iterates over that element's children rather than over the rows you expect.
Also, caused by this behavior, you end up swapping from BeautifulSoup's find() method to Python's string find() method, and the two operate in different ways. Without the try/except you would see what is going on: it tries to find a substring:
for x in soup.find('div', {'class': 'view-content'}):
    print(x.find('div'))
Output
...
-1
<div class="views-field views-field-title-1"> <span class="views-label views-label-title-1">RRR: </span> <span class="field-content"><div class="type type_18">Eleemosynary grant</div>2256</span> </div>
...
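The -1 lines in that output come from Python's str.find: iterating the single Tag yields all of its children, and the NavigableString children (stray text and whitespace) only have the string method. A small guard, a sketch using bs4's Tag class, skips them:
from bs4 import Tag

for x in soup.find('div', {'class': 'view-content'}):
    if isinstance(x, Tag):    # skip NavigableString children, which only have str.find()
        print(x.find('div'))  # now always BeautifulSoup's find()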
How to fix?
Select your elements more specifically, in this case the views-row:
sections = soup.find_all('div', {'class': 'views-row'})
While iterating each section you can select the expected value:
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    print(section.select_one('div[class*="type_"]').text)
Example
This scrapes all the information and creates a DataFrame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []
url = '#link here#'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
sections = soup.find_all('div', {'class': 'views-row'})
for section in sections:
    d = {}
    for row in section.select('div.views-field'):
        d[row.span.text] = row.select_one('span:nth-of-type(2)').get_text('|', strip=True)
    data.append(d)
df = pd.DataFrame(data)
# replacing ': ' in headers and setting all to lower case
df.columns = df.columns.str.lower().str.replace(': ','')
...
I think that you wanted to paginate using a for loop with range() and to grab the RRR value. I've handled the next pages, meaning pagination, in the long url.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = #insert url#
data = []
for page in range(1, 7):
    req = requests.get(url.format(page=page))
    soup = BeautifulSoup(req.content, 'lxml')
    for r in soup.select('[class="views-field views-field-title-1"] span:nth-child(2)'):
        rr = list(r.stripped_strings)[-1]
        #print(rr)
        data.append(rr)

df = pd.DataFrame(data, columns=['RRR'])
print(df)
#df.to_csv('data.csv', index=False)

How to fetch/scrape all elements from a html "class" which is inside "span"?

I am trying to scrape data from a website, collecting data from all elements under "class" which is inside "span", using this piece of code. But I end up fetching only one element instead of all of them.
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    #element = soup.findAll("div", {"class": "sold-property-listing__location"})
    place_name = expand_hits[1].find("div", {"class": "sold-property-listing__location"}).findAll("span", {"class": "item-link"})[1].getText()
    print(place_name)
    apartments.append(final_str)
Expected result for print(place_name)
Stockholm
Malmö
Copenhagen
...
..
.
The result which is am getting for print(place_name)
Malmö
Malmö
Malmö
...
..
.
When I try to fetch the contents from expand_hits[1] I get only one element. If I don't specify the index, the scraper throws an error regarding the usage of find(), find_all() and findAll(). As far as I understood, I think I have to access the content of the elements iteratively.
Any help is much appreciated.
Thanks in Advance!
Use the loop variable rather than indexing into the same collection with the same index (expand_hits[1]), and append place_name, not final_str:
expand_hits = soup.findAll("a", {"class": "sold-property-listing"})
apartments = []
for hit_property in expand_hits:
    place_name = hit_property.find("div", {"class": "sold-property-listing__location"}).find("span", {"class": "item-link"}).getText()
    print(place_name)
    apartments.append(place_name)
You then only need find() and no indexing.
Add a User-Agent header to ensure results. Also, I note that I have to pick a parent node, because at least one result (e.g. Övägen 6C) will not be captured by using the class item-link. I use replace to get rid of the hidden text that is present due to now selecting the parent node.
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.hemnet.se/salda/bostader?location_ids%5B%5D=474035"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
for result in soup.select('.sold-results__normal-hit'):
    print(re.sub(r'\s{2,}', ' ', result.select_one('.sold-property-listing__location h2 + div').text).replace(result.select_one('.hide-element').text.strip(), ''))
If you only want the area within Malmö, e.g. Limhamns Sjöstad, you need to check how many child span tags there are for each listing.
for result in soup.select('.sold-results__normal-hit'):
    nodes = result.select('.sold-property-listing__location h2 + div span')
    if len(nodes) == 2:
        place = nodes[1].text.strip()
    else:
        place = 'not specified'
    print(place)

Short & Easy - soup.find_all Not Returning Multiple Tag Elements

I need to scrape all 'a' tags with the "result-title" class, and all 'span' tags with either class 'results-price' or 'results-hood'. Then, write the output to a .csv file across multiple columns. The current code does not print anything to the csv file. This may be bad syntax but I really can't see what I am missing. Thanks.
f = csv.writer(open(r"C:\Users\Sean\Desktop\Portfolio\Python - Web Scraper\RE Competitor Analysis.csv", "wb"))

def scrape_links(start_url):
    for i in range(0, 2500, 120):
        source = urllib.request.urlopen(start_url.format(i)).read()
        soup = BeautifulSoup(source, 'lxml')
        for a in soup.find_all("a", "span", {"class" : ["result-title hdrlnk", "result-price", "result-hood"]}):
            f.writerow([a['href']], span['results-title hdrlnk'].getText(), span['results-price'].getText(), span['results-hood'].getText())
        if i < 2500:
            sleep(randint(30,120))
            print(i)

scrape_links('my_url')
If you want to find multiple tags with one call to find_all, you should pass them in a list. For example:
soup.find_all(["a", "span"])
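Note that this merges every match into one ResultSet, so you typically branch on the tag name afterwards. A sketch:
for el in soup.find_all(["a", "span"]):
    if el.name == "a":
        print(el.get("href"), el.get_text())
    else:  # a span
        print(el.get_text())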
Without access to the page you are scraping, it's too hard to give you a complete solution, but I recommend extracting one variable at a time and printing it to help you debug. For example:
a = soup.find('a', class_='result-title')
a_link = a['href']
a_text = a.text
spans = soup.find_all('span', class_=['results-price', 'result-hood'])
row = [a_link, a_text] + [s.text for s in spans]
print(row)  # verify we are getting the results we expect
f.writerow(row)
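Two further issues in the original: f.writerow() takes a single list, not several positional arguments, and opening the file in "wb" mode breaks csv.writer under Python 3 (use text mode with newline=''). A per-listing sketch; the li.result-row container class is an assumption here, not confirmed by the question:
import csv

with open("output.csv", "w", newline="") as fh:  # text mode + newline="" for csv in Python 3
    writer = csv.writer(fh)
    for post in soup.find_all("li", class_="result-row"):  # hypothetical container class
        a = post.find("a", class_="result-title")
        price = post.find("span", class_="result-price")
        hood = post.find("span", class_="result-hood")
        writer.writerow([
            a["href"] if a else "",
            a.get_text() if a else "",
            price.get_text() if price else "",
            hood.get_text() if hood else "",
        ])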
