How to extract text from tag?

The code below gives me output that includes the HTML tag, but I only need the text.
Calling get_text() raises an AttributeError:
'NoneType' object has no attribute 'get_text'
import requests
from bs4 import BeautifulSoup
url = requests.get("https://in.indeed.com/jobs?q=python%20developer&l=")
soup = BeautifulSoup(url.content,"html.parser")
parsed_file = soup.find(id = "resultsBody")
items = parsed_file.find_all(class_="slider_container")
for item in items:
    job_title = item.find(title='Python Developer').get_text()
    print(job_title)

.get_text() only works if find() actually returned a match for your title selection. To fix this, first check that the result is not None:
for item in items:
    job_title = item.find(title='Python Developer').get_text() if item.find(title='Python Developer') else 'no result'
    print(job_title)
Hint
Your selection could be more focused, so that you can loop over the cards more efficiently and also scrape additional info:
soup.select('#mosaic-provider-jobcards > a')
Example
import requests
from bs4 import BeautifulSoup
url = requests.get("https://in.indeed.com/jobs?q=python%20developer&l=")
soup = BeautifulSoup(url.content,"html.parser")
data = []
for item in soup.select('#mosaic-provider-jobcards > a'):
    if item.find(title='Python Developer'):
        data.append({
            'title':item.h2.get_text(),
            'company':item.a.get_text(),
            '...':'...'
        })
data

Since you only want to print the jobs whose title is Python Developer, first check that a job with such a title exists, that is, that .find() did not return None.
Put this check inside your for loop:
job_title = item.find(title='Python Developer')
# If job_title is not None, print the text
if job_title:
    print(job_title.get_text())
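
The same guard can be wrapped in a small helper. This is only a sketch: the HTML snippet is made-up markup standing in for two job cards (not Indeed's real page), and the name text_or_default is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two cards, only the first carries the wanted title.
html = """
<div class="slider_container"><span title="Python Developer">Python Developer</span></div>
<div class="slider_container"><span title="Java Developer">Java Developer</span></div>
"""

def text_or_default(tag, default="no result"):
    """Return the tag's stripped text, or `default` when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default

soup = BeautifulSoup(html, "html.parser")
titles = [
    text_or_default(item.find(title="Python Developer"))
    for item in soup.find_all(class_="slider_container")
]
print(titles)
```

Calling find() once and passing the result to the helper avoids running the same search twice per card.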

Related

Beautiful Soup only extracting one tag when can see all the others in the html code

Trying to understand how web scraping works:
import requests
from bs4 import BeautifulSoup as soup
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")
items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
print(price.string)
However, when I run this, all that is returned is the final price from the website ($1799.00). Why does it skip all the other h4 tags and just return the last one?
Any help would be much appreciated!
If you need any more information please let me know

What happens?
You call print() after the loop has finished iterating over your results, which is why you only get the last one.
How to fix?
Put the print() inside your loop:
for item in items:
    caption = item.find('div', {'class': 'caption'})
    price = item.find('h4', {'class': 'pull-right price'})
    print(price.string)
Output
$295.99
$299.00
$299.00
$306.99
$321.94
$356.49
$364.46
$372.70
$379.94
$379.95
$391.48
$393.88
$399.00
$399.99
$404.23
$408.98
$409.63
$410.46
$410.66
$416.99
$433.30
$436.29
$436.29
$439.73
$454.62
$454.73
$457.38
$465.95
$468.56
$469.10
$484.23
$485.90
$487.80
$488.64
$488.78
$494.71
$497.17
$498.23
$520.99
$564.98
$577.99
$581.99
$609.99
$679.00
$679.00
$729.00
$739.99
$745.99
$799.00
$809.00
$899.00
$999.00
$1033.99
$1096.02
$1098.42
$1099.00
$1099.00
$1101.83
$1102.66
$1110.14
$1112.91
$1114.55
$1123.87
$1123.87
$1124.20
$1133.82
$1133.91
$1139.54
$1140.62
$1143.40
$1144.20
$1144.40
$1149.00
$1149.00
$1149.73
$1154.04
$1170.10
$1178.19
$1178.99
$1179.00
$1187.88
$1187.98
$1199.00
$1199.00
$1199.73
$1203.41
$1212.16
$1221.58
$1223.99
$1235.49
$1238.37
$1239.20
$1244.99
$1259.00
$1260.13
$1271.06
$1273.11
$1281.99
$1294.74
$1299.00
$1310.39
$1311.99
$1326.83
$1333.00
$1337.28
$1338.37
$1341.22
$1347.78
$1349.23
$1362.24
$1366.32
$1381.13
$1399.00
$1399.00
$1769.00
$1769.00
$1799.00
Example
Instead of just printing the results while iterating, store them in a structured list of dicts and print or save it after the for loop:
import requests
from bs4 import BeautifulSoup as soup
url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"
result = requests.get(url)
doc = soup(result.text, "lxml")
items = doc.find_all('div', {'class': 'col-sm-4 col-lg-4 col-md-4'})
data = []
for item in items:
    data.append({
        'caption' : item.a['title'],
        'price' : item.find('h4', {'class': 'pull-right price'}).string
    })
print(data)
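
The difference between the question's code and the fix comes down to indentation alone. The following plain-Python sketch isolates that effect without any scraping; the three prices are sample values copied from the output above, and the list names are made up for illustration.

```python
prices = ["$295.99", "$299.00", "$1799.00"]  # sample values only

# Statement indented under the loop: runs once per item.
seen_inside = []
for price in prices:
    seen_inside.append(price)

# Statement at the loop's own level: runs once, after the loop has ended,
# when `price` still holds the last item.
seen_after = []
for price in prices:
    pass
seen_after.append(price)

print(seen_inside)
print(seen_after)
```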

Webscraping Issue w/ BeautifulSoup

I am new to Python web scraping, and I am scraping productreview.com for reviews. The following code pulls all the data I need for a single review:
#Scrape TrustPilot for User Reviews (Rating, Comments)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
import json
import requests
import datetime as dt
final_list=[]
url = 'https://www.productreview.com.au/listings/world-nomads'
r = requests.get(url)
soup = bs(r.text, 'lxml')
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
    try:
        name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
        name = name.find('span').text
        location = soup.find('h4').find('small').text
        policy = soup.find('div', class_ ='px-4_1Cw pt-4_9Zz pb-2_1Ex card-body_2iI').find('span').text
        title = soup.find('h3').find('span').text
        content = soup.find('p', class_ = 'mb-0_2CX').text
        rating = soup.find('div', class_ = 'mb-4_2RH align-items-center_3Oi flex-wrap_ATH d-flex_oSG')
        rating = rating.find('div')['title']
        final_list.append([name, location, policy, rating, title, content])
    except AttributeError:
        pass
reviews = pd.DataFrame(final_list, columns = ['Name', 'Location', 'Policy', 'Rating', 'Title', 'Content'])
print(reviews)
But when I edit
for div in soup.find('div', class_ = 'loadingOverlay_24D'):
to
for div in soup.findAll('div', class_ = 'loadingOverlay_24D'):
I don't get all reviews, I just get the same entry looped over and over.
Any help would be much appreciated.
Thanks!

Issue 1: Repeated data inside the loop
Your loop has the following form:
for div in soup.find('div' , ...):
    name = soup.find('h4', ... )
    policy = soup.find('div', ... )
    ...
Notice that you are calling find on the soup object inside the loop. This means that each time you look up the value for name, it searches the whole document from the beginning and returns the first match, in every iteration.
This is why you are getting the same data over and over.
To fix this, call find on the review div of the current iteration, and use find_all so that the loop visits every matching div. That is:
for div in soup.find_all('div' , ...):
    name = div.find('h4', ... )
    policy = div.find('div', ... )
    ...
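
A minimal sketch of the difference, using made-up markup rather than the real page (the class name review is an assumption for illustration; the author names are borrowed from the sample output further down):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two review blocks, each with its own author heading.
html = """
<div class="review"><h4>Aidan</h4></div>
<div class="review"><h4>Bruno M.</h4></div>
"""
soup = BeautifulSoup(html, "html.parser")
reviews = soup.find_all("div", class_="review")

# Searching the whole document returns the first <h4> on every iteration.
from_soup = [soup.find("h4").text for div in reviews]

# Searching inside the current div returns that review's own <h4>.
from_div = [div.find("h4").text for div in reviews]

print(from_soup)
print(from_div)
```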
Issue 2: Missing data and error handling
In your code, any errors inside the loop are ignored. However, there are many errors that are actually happening while parsing and extracting the values. For example:
location = div.find('h4').find('small').text
Not all reviews have location information. Hence, the code will extract h4, then try to find small, but won't find any, returning None. Then you are calling .text on that None object, causing an exception. Hence, this review will not be added to the result data frame.
To fix this, you need to add more error checking. For example:
locationDiv = div.find('h4').find('small')
if locationDiv:
    location = locationDiv.text
else:
    location = ''
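
As an alternative sketch, a CSS selector can cover the nested lookup in one call: select_one returns None when any part of the path is missing, so a single check is enough. The markup and the location value below are invented for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first review has a <small> location.
html = """
<div class="review"><h4>Aidan <small>Sydney</small></h4></div>
<div class="review"><h4>Bruno M.</h4></div>
"""
soup = BeautifulSoup(html, "html.parser")

locations = []
for div in soup.find_all("div", class_="review"):
    small = div.select_one("h4 small")  # None when <h4> or <small> is absent
    locations.append(small.get_text(strip=True) if small else "")

print(locations)
```

This keeps the review in the results with an empty location instead of dropping it through the except branch.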
Issue 3: Identifying and extracting data
The page you're trying to parse has broken HTML, and uses CSS classes that seem random or at least inconsistent. You need to find the correct and unique identifiers for the data that you are extracting such that they strictly match all the entries.
For example, you are extracting the review-container div using CSS class loadingOverlay_24D. This is incorrect. This CSS class seems to be for a "loading" placeholder div or something similar. Actual reviews are enclosed in div blocks that look like this:
<div itemscope="" itemType="http://schema.org/Review" itemProp="review">
....
</div>
Notice that the uniquely identifying property is the itemProp attribute. You can extract those div blocks using:
soup.find_all('div', {'itemprop': 'review'})
Similarly, you have to find the correct identifying properties of the other data you want to extract to ensure you get all your data fully and correctly.
One more thing, when a tag has more than one CSS class, usually only one of them is the identifying property you want to use. For example, for names, you have this:
name = soup.find('h4', class_ = 'my-0_27D align-items-baseline_kxl flex-row_3gP d-inline-flex_1j8 text-muted_2v5')
but in reality, you don't need all these classes. The first class, in this case, is sufficient to identify the name h4 blocks:
name = soup.find('h4', class_ = 'my-0_27D')
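
To illustrate why a single class is the safer filter, here is a small sketch on a shortened, made-up version of that h4 tag. As far as I know, a multi-class string passed to class_ is compared against the exact attribute value, so it stops matching as soon as the class order or set differs.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the name heading.
html = '<h4 class="my-0_27D flex-row_3gP text-muted_2v5"><span>Aidan</span></h4>'
soup = BeautifulSoup(html, "html.parser")

# One class is enough: it matches any tag whose class list contains it.
by_one_class = soup.find("h4", class_="my-0_27D")

# A multi-class string must equal the attribute value exactly,
# so the same classes in a different order do not match.
by_other_order = soup.find("h4", class_="flex-row_3gP my-0_27D")

print(by_one_class.span.text)
print(by_other_order)
```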
Example:
Here's an example that extracts the author names from the review page:
for div in soup.find_all('div', {'itemprop': 'review'}):
    name = div.find('h4', class_ = 'my-0_27D')
    if (name):
        name = name.find('span').text
    else:
        name = '-'
    print(name)
Output:
Aidan
Bruno M.
Ba. I.
Luca Evangelista
Upset
Julian L.
Alison Peck
...

The page serves broken HTML, and html.parser is better at dealing with it.
Change soup = bs(r.text, 'lxml') to soup = bs(r.text, 'html.parser')

I see the text, but cannot .text return it SOUP

Running:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.yellowpages.com/search?search_terms=bestbuy+10956&geo_location_terms=10956').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all("div", {"class": "result"}):
    info_primary = article.find("div", {"class": "info-section info-primary"}).text
    print(info_primary)
Yields some noisy (number) characters when yellowpages has a rating for the store. The ratings are stored in "a" tags if they exist, else there is no "a" tag and it goes straight to "p" tags. I wanted to just grab the text from the "p" tags.
Running:
info_primary = article.find("div", {"class": "info-section info-primary"}).p.text
Gives:
AttributeError: 'NoneType' object has no attribute 'text'
Running:
info_primary = article.find("div", {"class": "info-section info-primary"}).p
Runs and I can see the text nested, but cannot return it.
Upon looking further, the phone number for the store, which I want, is outside of the "p" tag. Maybe correctly accessing the "span" tags via different class descriptions would help?
Ideas? Thanks!
I am new to Python as a forewarning.

Two things: One, you also have to actually find the <p> tag as well in order to get its text.
Two, if there's no p tag and you try to get its text, an AttributeError will be raised: you just have to ignore that and go to the next one that may have a p (you could also check first to see if .find('p') is not None; same effect)
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.yellowpages.com/search?search_terms=bestbuy+10956&geo_location_terms=10956').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all("div", {"class": "result"}):
    try:
        info_primary = article.find("div", {"class": "info-section info-primary"}).find('p').text
    except AttributeError:
        continue # If there's no <p> (raises AttributeError) just continue to next loop iteration
    print(info_primary)
The reason you could see the p tag but not its text is that the text isn't inside the p tag, but inside the span tags.
You could do
try:
    info_primary = article.find("div", {"class": "info-section info-primary"}).p.span.text
except AttributeError:
    continue # If there's no <p> (raises AttributeError) just continue to next loop iteration
But that only yields the first span's text. Instead, to get the text of all the spans, you could do:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.yellowpages.com/search?search_terms=bestbuy+10956&geo_location_terms=10956').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all("div", {"class": "result"}):
    try:
        span_data = article.find("div", {"class": "info-section info-primary"}).p.find_all('span')
        info_primary = ''
        for span in span_data:
            info_primary += ' ' + span.text
    except AttributeError:
        continue # If there's no <p> (raises AttributeError) just continue to next loop iteration
    print(info_primary)
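
A shorter variant, sketched here on made-up markup, is to let get_text() join the pieces itself: called with a separator and strip=True, it collects the text of all nested tags. The address below is invented and the real Yellow Pages markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical <p> holding its text in several <span> tags.
html = '<p class="adr"><span>1 Main St</span><span>Nanuet, NY</span><span>10954</span></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
# Join the text of every descendant with a space, stripping each piece.
joined = p.get_text(" ", strip=True) if p is not None else ""

print(joined)
```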

Best way to loop this situation?

I have a list of divs, and I'm trying to get certain info in each of them. The div classes are the same so I'm not sure how I would go about this.
I have tried for loops but have been getting various errors
Code to get list of divs:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://sneakernews.com/release-dates/'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "lxml")
soup1 = soup.find("div", {'class': 'popular-releases-block'})
soup1 = str(soup1.find("div", {'class': 'row'}))
soup1 = soup1.split('</div>')
print(soup1)
Code I want to loop for each item in the soup1 list:
linkinfo = soup1.find('a')['href']
date = str(soup1.find('span'))
name = soup1.find('a')
non_decimal = re.compile(r'[^\d.]+')
date = non_decimal.sub('', date)
name = str(name)
name = re.sub('</a>', '', name)
link, name = name.split('>')
link = re.sub('<a href="', '', link)
link = re.sub('"', '', link)
name = name.split(' ')
name = str(name[-1])
date = str(date)
link = str(link)
print(link)
print(name)
print(date)

Based on the URL you posted above, I imagine you are interested in something like this:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://sneakernews.com/release-dates/').text
soup = BeautifulSoup(url, 'html.parser')
tags = soup.find_all('div', {'class': 'col lg-2 sm-3 popular-releases-box'})
for tag in tags:
    link = tag.find('a').get('href')
    print(link)
    print(tag.text)
    #Anything else you want to do
If you are using the BeautifulSoup library, then you do not need regex to try to parse through HTML tags. Instead, use the handy methods that accompany BeautifulSoup. If you would like to apply a regex to the text output from the tags you locate via BeautifulSoup to accomplish a more specific task, then that would be reasonable.
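
As a rough sketch of that approach, the link, name, and date can each be read with BeautifulSoup methods and no regex. The markup, URLs, and values below are invented for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for two release boxes.
html = """
<div class="popular-releases-box">
  <a href="https://example.com/shoe-a">Shoe A</a><span>10.05</span>
</div>
<div class="popular-releases-box">
  <a href="https://example.com/shoe-b">Shoe B</a><span>10.12</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

releases = []
for box in soup.find_all("div", class_="popular-releases-box"):
    link = box.find("a")
    date = box.find("span")
    releases.append({
        "link": link["href"],                # attribute access instead of string surgery
        "name": link.get_text(strip=True),   # tag text without the markup
        "date": date.get_text(strip=True),
    })

print(releases)
```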

My understanding is that you want to loop your code for each item within a list.
An example of this:
my_list = ["John", "Fred", "Tom"]
for name in my_list:
    print(name)
This will loop over each name in my_list and print it (each item in the list is referred to here as name). You could do something similar with your code:
for item in soup1:
    # perform some action

BeautifulSoup - Python - Find the key from HTML

I have been practicing with bs4 and Python, and now I am stuck.
My plan is to write an if-else statement, something like:
If(I find a value inside this html)
    Do This method
Else:
    Do something else
I have scraped some HTML I found randomly, which looks like this:
<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>
What I have done so far is:
s = requests.Session()
Url = 'www.myhtml.com' #Just took a random page which I don't feel to insert
r = s.get(Url)
soup = soup(r, "lxml")
findKey = soup.find(('div', {'class': 'Talkinghand'})['data-key'])
print(findKey)
but no luck. It gives me this error:
TypeError: object of type 'Response' has no len()
Once I find or print out the key, I want to write an if-else statement that says:
If(there is a value inside that data-key)
...

To display the data-key attribute from inside the <div> tag, you can do the following:
from bs4 import BeautifulSoup
html = '<div class="Talkinghand" data-backing="ShowingHide" data-key="123456" data-theme="$MemeTheme" style=""></div>'
soup = BeautifulSoup(html, "html.parser")
print(soup.div['data-key'])
This would print:
123456
You would need to pass r.content to your soup call.
Your script had an extra ( and ), so the following would also work:
findKey = soup.find('div', {'class': 'Talkinghand'})['data-key']
print(findKey)
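
For the if-else part of the question, one possible sketch is to read the attribute with .get(), which returns None when the attribute is absent instead of raising a KeyError. The second div and the helper name describe_key are made up for illustration.

```python
from bs4 import BeautifulSoup

# The first div is from the question; the second is a hypothetical one without a key.
html = '<div class="Talkinghand" data-key="123456"></div><div class="Other"></div>'
soup = BeautifulSoup(html, "html.parser")

def describe_key(tag):
    """Branch on whether the tag exists and carries a non-empty data-key."""
    key = tag.get("data-key") if tag is not None else None
    if key:
        return "key found: " + key
    else:
        return "no key"

results = [
    describe_key(soup.find("div", class_=name))
    for name in ("Talkinghand", "Other", "Missing")
]
print(results)
```

The helper handles both failure cases: a div without the attribute and a div that is not in the page at all.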
