I have a web page that is set up like this:
//a bunch of container divs....
<a class="food cat2 isotope-item" href="#" style="position: absolute; left: 45px; top: 0px;">
<div class="background"></div>
<div class="image">
<img src="/assets/score-images/cereal2.png" alt="">
</div>
<div class="score">1148</div>
<div class="name">Cereal with Banana</div>
</a>
<a class="food cat1 isotope-item" href="#" style="position: absolute; left: 215px; top: 0px;">
<div class="background"></div>
<div class="image">
<img src="/assets/score-images/burrito-all.png" alt="">
</div>
<div class="score">2257</div>
<div class="name">Beef & Cheese Burrito</div>
</a>
//hundreds more a tags....
</div>
I'm running this code to extract the name and score from each "a" tag.
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.eatlowcarbon.org/food-scores')
soup = BeautifulSoup(page.content, 'html.parser')
print('HEllO')
foodDict = {}
aTag = soup.findAll('a')
for tag in aTag:
    print('HELLO 2')
    name = tag.find("div", {"class": "name"}).text
    score = tag.find("div", {"class": "score"}).text
    foodDict[name] = score
print('hello')
The first two print statements execute successfully, so the second one tells me that I've at least entered the for loop. However, I get this error:
File "scrapeRecipe.py", line 40, in <module>
name = tag.find("div", {"class": "name"}).text
AttributeError: 'NoneType' object has no attribute 'text'
From this post, I'm assuming that my code doesn't find any div with a class equal to "name", or "score" for that matter. I'm completely new to Python. Does anyone have any advice?
The problem is not with your tag.find('div', ...), but rather with your soup.findAll('a'). You are pulling every a tag on the page, including those that don't contain the child divs you are trying to read. For those tags, tag.find(...) returns None, and None has no .text attribute.
From the look of your markup, you need to filter by class in your findAll as well:
aTag = soup.findAll('a', {'class': 'food'})
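A minimal sketch of that fix, using a trimmed copy of the markup from the question in place of the live page. The "About" link is a made-up example of an a tag without name/score children, and the None check is an extra safeguard that the answer itself does not include.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page; the "About" link is a hypothetical
# example of an <a> tag that has no name/score children.
html = """
<a href="/about">About</a>
<a class="food cat2 isotope-item" href="#">
  <div class="score">1148</div>
  <div class="name">Cereal with Banana</div>
</a>
<a class="food cat1 isotope-item" href="#">
  <div class="score">2257</div>
  <div class="name">Beef &amp; Cheese Burrito</div>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

food_scores = {}
# Only anchors carrying the "food" class are considered.
for tag in soup.find_all("a", class_="food"):
    name = tag.find("div", class_="name")
    score = tag.find("div", class_="score")
    # find() returns None when nothing matches, so check before using .text.
    if name is None or score is None:
        continue
    food_scores[name.text] = score.text

print(food_scores)
```

Skipping incomplete tags keeps the loop running even if a food entry is missing one of its divs.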
This is my first time using BeautifulSoup.
Trying to scrape a value from a website with the following structure:
<div class="overview">
<i class="fa fa-instagram"></i>
<div class="overflow-h">
<small>Value #1 here</small>
<small>131,390,555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
<div class="overview">
<i class="fa fa-facebook"></i>
<div class="overflow-h">
<small>Value #2 here</small>
<small>555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
I want the second <small>131,390,555</small> in the first <div class="overview"></div>
This is the code I am trying to use:
# Get the hashtag popularity and add it to a dictionary
for hashtag in hashtags:
    popularity = []
    url = 'http://url.com/hashtag/' + hashtag
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html5lib')
        overview = soup.findAll('div', attrs={"class": "overview"})
        print(overview)
        for small in overview:
            popularity.append(int(small.findAll('small')[1].text.replace(',', '')))
        if popularity:
            raw[hashtag] = popularity[0]
            # print(popularity[0])
            print(raw)
        time.sleep(2)
    else:
        continue
The code works as long as the second <small>-value is populated in both div-overviews. I really only need the second small-value from the first overview-div.
I have tried to get it like this:
overview = soup.findAll('div', attrs={"class":"overview"})[0]
But I only get this error:
self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'findAll'
Also, is there some way to keep the script from breaking if there is no small value at all? (I'd like the script to replace the empty value with a zero and continue.)
You can use an index, but I suggest using a CSS selector with nth-child():
soup = BeautifulSoup(html, 'html.parser')
# only get first result
small = soup.select_one('.overview small:nth-child(2)')
print(small.text.replace(',',''))
# all results
popularity = []
secondSmall = soup.select('.overview small:nth-child(2)')
for small in secondSmall:
    popularity.append(int(small.text.replace(',','')))
print(popularity)
If you just want the 2nd small tag in the 1st div only, this will work:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_ = 'overview')
small_tag_2 = overview[0].findAll('small')[1]
print(small_tag_2)
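The NavigableString error in the question most likely comes from looping over a single Tag: iterating a Tag walks its direct children, and the whitespace between elements counts as a string child. A small sketch with a cut-down copy of the first overview div (the markup is simplified for illustration):

```python
from bs4 import BeautifulSoup

# Cut-down copy of the first overview div from the question.
html = '<div class="overview"> <small>Value #1 here</small> <small>131,390,555</small> </div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find_all("div", class_="overview")[0]

# Iterating a Tag yields its children, including whitespace strings.
child_types = [type(child).__name__ for child in first]
print(child_types)

# Calling find_all on the Tag itself avoids touching the string children.
value = int(first.find_all("small")[1].text.replace(",", ""))
print(value)
```

So after indexing with [0], call findAll on the resulting Tag directly instead of looping over it.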
If you want the 2nd small tag in every overview div, iterate using the loop:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_ = 'overview')
for div in overview:
    small_tag_2 = div.findAll('small')[1]
    print(small_tag_2)
Note: I used html.parser instead of html5lib. If you know how to work with html5lib, then it's your choice.
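For the last part of the question (falling back to zero when the value is missing), one possible approach is a small helper that checks how many small tags exist before indexing. The helper name second_small_as_int and the sample markup with a missing value are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

# Sample markup: the second overview div deliberately lacks its value.
html = """
<div class="overview">
  <div class="overflow-h">
    <small>Value #1 here</small>
    <small>131,390,555</small>
  </div>
</div>
<div class="overview">
  <div class="overflow-h">
    <small>Value #2 here</small>
  </div>
</div>
"""


def second_small_as_int(div, default=0):
    """Hypothetical helper: second <small> inside div as an int, else default."""
    smalls = div.find_all("small")
    if len(smalls) < 2:
        return default
    digits = smalls[1].get_text(strip=True).replace(",", "")
    # Fall back to the default when the text is empty or not numeric.
    return int(digits) if digits.isdigit() else default


soup = BeautifulSoup(html, "html.parser")
overviews = soup.find_all("div", class_="overview")

popularity = [second_small_as_int(div) for div in overviews]
# Only the first overview div is needed; use 0 if there is none at all.
first_value = popularity[0] if popularity else 0
print(popularity, first_value)
```

Checking the length up front avoids the IndexError that [1] raises when only one small tag is present.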
Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id and data-size-original attributes.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
The problem is that it gets the values of other ids that have the same class "selectSize".
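I think this is what you want. Note that div.size on its own also matches div class='size inactive active', so the inactive divs are excluded explicitly with :not(.inactive). That way you won't get ids and sizes from them:
for sizeBlock in soup.select('div.size:not(.inactive) a.selectSize'):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')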
Already answered here: How to Beautiful Soup (bs4) match just one, and only one, css class
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# for sizeBlock in soup.find_all('a', class_="selectSize"):  # this would also match the anchor, since class_ matches any one of its classes
# the attribute selector below only matches when the class attribute is exactly "selectSize"
for sizeBlock in soup.select("a[class='selectSize']"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
    print(aid, size)
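To make the difference between the matching styles concrete, here is a short comparison on a slightly extended copy of the question's markup. The third anchor (id 99999, class "selectSize otherclass") is invented for illustration, and the :not(.inactive) selector is only one possible way to skip inactive sizes.

```python
from bs4 import BeautifulSoup

# The first two rows come from the question; the third is a made-up
# anchor that carries an extra class.
html = """
<div class="size"><a class="selectSize" id="44526" data-size-original="36.5">36.5</a></div>
<div class="size inactive active"><a class="selectSize" id="44524" data-size-original="40">40</a></div>
<div class="size"><a class="selectSize otherclass" id="99999" data-size-original="41">41</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any anchor that has selectSize among its classes.
any_class = [a["id"] for a in soup.find_all("a", class_="selectSize")]

# The attribute selector requires the class attribute to be exactly "selectSize".
exact = [a["id"] for a in soup.select("a[class='selectSize']")]

# Skip anchors whose parent div is marked inactive.
active_only = [a["id"] for a in soup.select("div.size:not(.inactive) a.selectSize")]

print(any_class, exact, active_only)
```

Which of the three fits depends on whether the unwanted ids differ by their own class or by the class of their container.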
I am trying to extract multiple fields from each of the repeated tags in an HTML file.
....
<div class="title">
<a target="_blank" id="jl_fe575975c912af9e" href="https://www.indeed.com/company/Nestvestor/jobs/Data-Science-Intern-fe575975c912af9e?fccid=8eed076a625928e7&vjs=3" onmousedown="return rclk(this,jobmap[0],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[0],true,0);" rel="noopener nofollow" title="Data Science Intern" class="jobtitle turnstileLink " data-tn-element="jobTitle">
Data Science Intern</a>
</div>
<div class="sjcl">
<div>
<span class="company">
Nestvestor</span>
</div>
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="p_9cfaca3374641aa0" data-jk="9cfaca3374641aa0" data-tn-component="organicJob">
<div class="title">
<a target="_blank" id="jl_9cfaca3374641aa0" href="https://www.indeed.com/rc/clk?jk=9cfaca3374641aa0&fccid=1779658d5b4ae2b0&vjs=3" onmousedown="return rclk(this,jobmap[1],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[1],true,0);" rel="noopener nofollow" title="Product Manager" class="jobtitle turnstileLink " data-tn-element="jobTitle">
Product Manager</a>
</div>
<div class="sjcl">
<div>
<span class="company">
<a data-tn-element="companyName" class="turnstileLink" target="_blank" href="https://www.indeed.com/cmp/Sojern" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&campaignid=serp-linkcompanyname&fromjk=9cfaca3374641aa0&jcid=1779658d5b4ae2b0')" rel="noopener">
Sojern</a></span>
...
from bs4 import BeautifulSoup
soup = BeautifulSoup(open(input("Enter a file to read: ")), "html.parser")
title = soup.find_all('div', class_='title')
for span in title:
    print(span.text)
company = soup.find_all('span', class_='company')
for span in company:
    print(span.text)
So far I have figured out how to get the following result:
Job_Title1
Job_Title2
Job_Title3
And in a different code result:
Company_name1
Company_Name2
Company_Name3
How do I get the results to look like this with one run of the code:
Job_Title1,Company_Name1,
Job_Title2,Company_Name2,
Job_Title3,Company_Name3,
Welcome to Stack Overflow. Just use this:
company = soup.find_all('span', class_='company')
title = soup.find_all('div', class_='title')
# zip pairs the first title with the first company, the second with the second, and so on
for t, c in zip(title, company):
    print("Job_Title :%s Company_Name :%s" % (t.text, c.text))
From what you have, it looks like you need to nest your loops. Without the website it is hard to tell, but I would try something like this.
company = soup.find_all('span', class_='company')
title = soup.find_all('div', class_='title')
# note: nesting prints every company next to every title (all combinations)
for span in title:
    for x in company:
        print(x.text, span.text)
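Another option, sketched here under the assumption that every result sits inside its own div with the jobsearch-SerpJobCard class (the question shows this wrapper for the second result only), is to loop over the cards and look up the title and company inside each one. The markup below is a trimmed stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the saved page; the wrapper div around the
# first result is assumed, not shown in the question.
html = """
<div class="jobsearch-SerpJobCard">
  <div class="title"><a class="jobtitle">
    Data Science Intern</a></div>
  <div class="sjcl"><div><span class="company">
    Nestvestor</span></div></div>
</div>
<div class="jobsearch-SerpJobCard">
  <div class="title"><a class="jobtitle">
    Product Manager</a></div>
  <div class="sjcl"><div><span class="company">
    <a class="turnstileLink">
    Sojern</a></span></div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.select("div.jobsearch-SerpJobCard"):
    title = card.find("div", class_="title")
    company = card.find("span", class_="company")
    # Use an empty string when a card lacks one of the two fields.
    rows.append((
        title.get_text(strip=True) if title else "",
        company.get_text(strip=True) if company else "",
    ))

for job_title, company_name in rows:
    print(f"{job_title},{company_name},")
```

Searching within each card keeps a title and its company together even when one listing has no company, which would shift the pairs produced by zip.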
I've tried replacing each string, but I can't get it to work. I can get all the text inside <span>...</span>, but I can't get only the text that comes after the inner span is closed. How could I do it? I've tried replacing the text afterwards, but I am not able to do it. I am quite new to Python.
I have also tried using for x in soup.find_all('/span', class_ = "textLarge textWhite"), but that doesn't display anything.
Relevant html:
<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>
Relevant python code:
for y in soup.find_all('span', class_ = "textBold"):
    print(y.text) #this gets FIT-FOR-SERVICE:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    print(x.text) #this gets FIT-FOR-SERVICE: 18,740,382 but i only want the number
Expected result: "18,740,382"
I believe you have two options here:
1 - Use regex on the parent span tag to extract only digits.
2 - Use decompose() function to remove the child span tag from the tree, and extract the text afterwards, like this:
from bs4 import BeautifulSoup
h = """<div style="width:100%; display:inline-block; position:relative; text-
align:center; border-top:thin solid #fff; background-image:linear-
gradient(#333,#000);">
<div style="width:100%; max-width:1400px; display:inline-block;
position:relative; text-align:left; padding:20px 15px 20px 15px;">
<a href="/manpower-fit-for-military-service.asp" title="Manpower
Fit for Military Service ranked by country">
<div class="smGraphContainer"><img class="noBorder"
src="/imgs/graph.gif" alt="Small graph icon"></div>
</a>
<span class="textLarge textWhite"><span
class="textBold">FIT-FOR-SERVICE:</span> 18,740,382</span>
</div>
<div class="blockSheen"></div>
</div>"""
soup = BeautifulSoup(h, "lxml")
soup.find('span', class_ = "textLarge textWhite").span.decompose()
res = soup.find('span', class_ = "textLarge textWhite").text.strip()
print(res)
#18,740,382
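Option 1 could look roughly like this. The sketch works on the text that x.text returns in the question and assumes the number is always written as digits grouped with commas; the pattern is an illustration, not the only possible one.

```python
import re

# Text as returned by x.text for the outer span in the question.
text = "FIT-FOR-SERVICE: 18,740,382"

# One digit followed by any run of digits and commas.
match = re.search(r"\d[\d,]*", text)
number_text = match.group(0) if match else None
print(number_text)

# Convert to an int once the commas are removed (None if nothing matched).
number = int(number_text.replace(",", "")) if number_text else None
print(number)
```

Unlike decompose(), this leaves the parsed tree untouched, but it depends on the label itself containing no digits.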
Here is how you could do it:
soup.find('span', {'class':'textLarge textWhite'}).find('span').extract()
output = soup.find('span', {'class':'textLarge textWhite'}).text.strip()
output:
18,740,382
Instead of grabbing the text using x.text you can use x.find_all(text=True, recursive=False) which will give you all the top-level text (in a list of strings) for a node without going into the children. Here's an example using your data:
for x in soup.find_all('span', class_ = "textLarge textWhite"):
    res = x.find_all(text=True, recursive=False)
    # join and strip the strings then print
    print(" ".join(map(str.strip, res)))
    #outputs: '18,740,382'
I am getting an error when I try to get a coupon code using BeautifulSoup.
This is a part of page:
<ul class="nc-nav__promo-modal--global-links"><div class="nc-nav__promo-modal--global-divider"></div>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">HISUMMER.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">MAY20.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
</ul>
This is my Code:
def parse(self, response):
    self.mongo.GetAllDocuments()
    soup = BeautifulSoup(response.text, 'html.parser')
    url, off, coupon, itemtype = "", "", "", ""
    containersC = soup.select(".nc-nav__promo-modal--global-links > li")
    for itemC in containersC:
        coupon = itemC.a.div.span.b.text
I am getting the following error:
AttributeError: 'NoneType' object has no attribute 'b'
Your code is assuming that the structure of the HTML is the same for all instances. If the b (or any other element) is missing, you will get that error. One approach would be to first test for the presence of a b tag before attempting to print it, for example:
from bs4 import BeautifulSoup
html = """ <ul class="nc-nav__promo-modal--global-links"><div class="nc-nav__promo-modal--global-divider"></div>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">HISUMMER.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
<li><div><span style="font-weight: 500; top: 0">35% off our favorite wear-now styles.* Online only. Use code <b style="font-weight: 700; top: 0">MAY20.</b></span></div><button type="button" class="nc-nav__promo-modal--global-details-button" aria-describedby="dialogDetailsBtn-0">Details</button></li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')
for li_tag in soup.select(".nc-nav__promo-modal--global-links > li"):
    b_tag = li_tag.find('b')
    if b_tag:
        print(b_tag.text)
For your HTML, this gives:
HISUMMER.
MAY20.
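If the goal is to collect the bare codes, a variation on the same check might look like this. The markup is shortened, the middle list item without a b tag is invented to show the guard in action, and stripping the trailing period is an assumption about what counts as the code.

```python
from bs4 import BeautifulSoup

# Shortened markup; the middle <li> without a <b> tag is a made-up case.
html = """
<ul class="nc-nav__promo-modal--global-links">
  <li><div><span>Use code <b>HISUMMER.</b></span></div></li>
  <li><div><span>Free shipping on all orders.</span></div></li>
  <li><div><span>Use code <b>MAY20.</b></span></div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

coupons = []
for li_tag in soup.select(".nc-nav__promo-modal--global-links > li"):
    # select_one returns None when the div > span > b chain is absent.
    b_tag = li_tag.select_one("div > span > b")
    if b_tag is None:
        continue
    # Drop surrounding whitespace and the trailing period.
    coupons.append(b_tag.get_text(strip=True).rstrip("."))

print(coupons)
```

A selector lookup replaces the chained attribute access (itemC.a.div.span.b), which fails as soon as any link in the chain is missing.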