I'm experimenting with using BeautifulSoup and Requests for the first time, and am trying to learn by scraping some information from a news site. The aim of the project is to just be able to read news highlights from terminal, so I need to effectively scrape and parse article titles and article body text.
I am still at the stage of getting the titles, but I simply don't get any data back when I try to use the find_all() function. Below is my code:
from bs4 import BeautifulSoup
from time import strftime
import requests
date = strftime("%Y/%m/%d")
url = "http://www.thedailybeast.com/cheat-sheets/" + date + "/cheat-sheet.html"
result = requests.get(url)
c = result.content
soup = BeautifulSoup(c, "lxml")
titles = soup.find_all('h1 class="title multiline"')
print titles
Any thoughts? If anyone also has any advice / tips to improve what I currently have or the approach I'm taking, I'm always looking to get better so please do tell!
Cheers
You are putting everything here in quotes:
titles = soup.find_all('h1 class="title multiline"')
which makes BeautifulSoup search for elements whose tag name is literally h1 class="title multiline".
Instead, use:
titles = soup.find_all("h1", class_="title multiline")
Or, with a CSS selector:
titles = soup.select("h1.title.multiline")
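Either way you get back a list of Tag objects, so you would print the headline text with something like this (a minimal sketch):
for title in titles:
    print title.get_text(strip=True)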
Actually, because of the dynamic nature of the page, to get all of the titles, you have to approach it differently:
import json
results = json.loads(soup.find('div', {'data-pageraillist': True})['data-pageraillist'])
for result in results:
    print result["title"]
Prints:
Hillary Email ‘Born Classified’
North Korean Internet Goes Down
Kid-Porn Cops Go to Gene Simmons’s Home
Baylor Player Convicted of Rape After Coverup
U.S. Calls In Aussie Wildfire Experts
Markets’ 2015 Gains Wiped Out
Black Lives Matters Unveils Platform
Sheriff Won’t Push Jenner Crash Charge
Tear Gas Used on Migrants Near Macedonia
Franzen Considered Adopting Iraqi Orphan
You're very close, but the first argument to find_all is just the tag name; it isn't a generic search string.
Hence, if you want to filter by tag and an attribute such as class, do this:
soup.find_all('h1', {'class' : 'multiline'})
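Note that BeautifulSoup matches a class value even when the element carries several classes. A minimal, self-contained check (the HTML snippet here is made up for illustration):
from bs4 import BeautifulSoup

html = '<h1 class="title multiline">Example headline</h1>'
soup = BeautifulSoup(html, "html.parser")

# 'multiline' matches even though the h1 also has the 'title' class.
for h1 in soup.find_all('h1', {'class': 'multiline'}):
    print h1.get_text()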
I am trying to use BeautifulSoup to scrape basically any website just to learn, and when trying, I never end up getting all instances of each parameter set. Attached is my code, please let me know what I'm doing wrong:
import requests
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(string="$")
print(price)
#### WHY DOES BEAUTIFULSOUP NOT RETURN ALL INSTANCES!?!?!?
Per the URL provided in the question, I could see that the price with the $ symbol is available under the price-current class name.
So I have used find_all() to get all the prices.
Use the code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(attrs={'class': "price-current"})
for p in price:
    print(p.text)
output:
$399.99
$400.33
$403.09
$412.00
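As a side note, the reason find_all(string="$") comes back empty is that string= matches a text node's entire contents, not a substring. To find text nodes that merely contain a dollar sign, you can pass a compiled regex instead (a minimal sketch):
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
doc = BeautifulSoup(requests.get(url).text, "html.parser")

# A regex matches text nodes *containing* "$", not only nodes whose
# whole text is exactly the string "$".
prices = doc.find_all(string=re.compile(r"\$"))
print(prices)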
I'm not sure what you mean by "all instances of each parameter set." One reason that your code block may not be working, however, is that you forgot to import the BeautifulSoup library.
from bs4 import BeautifulSoup
Also, it's not best practice to practice scraping on live sites. I would highly recommend the website toscrape.com. It's a really great resource for newbies; I still use it to this day to hone and expand my scraping skills.
Lastly, BeautifulSoup works best when you have a decent grasp of HTML and CSS, especially the selector syntax. If you don't know those two, you will struggle a little bit no matter how much Python you know. The BeautifulSoup documentation can give you some insight on how to navigate the HTML and CSS if you are not well versed in those.
I am trying to scrape text from paragraphs with different id names. The text looks as follows:
<p id="comFull1" class="comment" style="display:none"><strong>Comment:
</strong><br>I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.<br><a
onclick="toggle('comTrunc1'); toggle('comFull1');return false;"
href="#">Hide Full Comment</a></p>
<p id="comFull2" class="comment" style="display:none"><strong>Comment:
</strong><br>It's worked Very well for me. I'm sleeping I'm
eating I'm going Out in the public. Overall I'm very
satisfied.However I haven't heard anybody mention this but my feet are
very puffy and swollen is this a side effect does anyone know?<br><a
onclick="toggle('comTrunc2'); toggle('comFull2');return false;"
href="#">Hide Full Comment</a></p>
......
I am able to scrape text only from one particular id, not from all the ids at once. Can anyone help with scraping the text from all of the ids? The code looks like this:
>>> from urllib.request import Request, urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
>>> req = Request(url,headers={'User-Agent': 'Mozilla/5.0'})
>>> webpage = urlopen(req).read()
>>> soup = BeautifulSoup(webpage, "html.parser")
>>> required2 = soup.find("p", {"id": "comFull1"}).text
>>> required2
"Comment:I realized how much Abilify has been helping me when I recently
tried to taper off of it. I am on the bipolar spectrum, with mainly
depression and some OCD symptoms. My obsessive, intrusive thoughts came
racing back when I decreased the medication. I also got much more tired and
had insomnia with the decrease. am not happy with side effects of 15 lb
weight gain, increased cholesterol and a flat effect on my emotions. I am
actually wondering if an increase from the 7 mg would help even more...for
now I'm living with the side effects.Hide Full Comment"
Try this. If the ids of the paragraphs are all suffixed 1, 2, 3, etc., as in comFull1, comFull2, comFull3, then the selector below should handle it.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# Fetch the page as in the question (same url and headers as above).
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
content = urlopen(req).read()
soup = BeautifulSoup(content, "html.parser")

# CSS attribute selector: every element whose id starts with "comFull".
for item in soup.select("[id^='comFull']"):
    print(item.text)
The issue you are having, as I understand it, is scraping the text of all the paragraphs in a webpage, i.e. all the <p> tags.
The function you are looking for is -
soup.find_all('p')
A more comprehensive example is shown in the following docs -
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
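For example, run against a cut-down stand-in for the question's markup (a minimal sketch):
from bs4 import BeautifulSoup

# Abbreviated stand-in for the markup in the question.
html = '<p id="comFull1" class="comment">first comment</p><p id="comFull2" class="comment">second comment</p>'
soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all('p'):
    print(p.get_text())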
If you want to use XPath (Scrapy-style response.xpath shown here), you can use:
response.xpath("//p[contains(@id,'comFull')]/text()").extract()
But since you are using BeautifulSoup, you can pass a function or regular expression to the find_all method, as mentioned here:
Matching id's in BeautifulSoup
import re
soup.find_all('p', id=re.compile(r'^comFull'))
from lxml import html
import requests
import time
#Gets prices
page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
tree = html.fromstring(page.content)
price = tree.xpath('//h2[@data-attribute="Hi Guess the Food - What’s the Food Brand in the Picture"]/text()')
print(price)
This only returns []
When looking into page.content, it shows the amazon anti bot stuff. How can I bypass this?
One piece of general advice when you're trying to scrape something from a website: look at the returned content first, in this case page.content, before trying anything else. You're wrongly assuming Amazon will nicely let you fetch their data, when they don't.
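A quick sanity check along those lines (a sketch; what a blocked page actually contains will vary):
import requests

page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
print(page.status_code)
# Look at the start of the body before writing any XPath against it.
print(page.content[:500])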
I think urllib2 is better, and the XPath could be:
price = c.xpath('//div[@class="s-item-container"]//h2')[0]
print price.text
After all, a long string may contain strange characters.
Hi all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC when given a "CitedBy" page like this.
My program works fine for getting the author names and the URL's, however I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.
Here's what I've got so far:
import requests
from bs4 import BeautifulSoup
import re
url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
req = requests.get(url)
plain_text = req.text
soup = BeautifulSoup(plain_text, "lxml") #soup object
titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        title = ("UHOH")  # Problems with some titles
    #print(title)
    titles_list.append(title)
When I run this part of my code, my scraper gives me these results:
Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
UHOH
Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
UHOH
Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants
And so on for the whole page...
Some papers on this page that I get "UHOH" for are:
Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula
The first two I've listed here are problematic, I believe, because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is inside an "em" HTML tag, so I suspect that is why my scraper cannot scrape it.
The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        for other in soup.findAll('a', {'class': 'view'}):
            title = other.string
But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!
Thanks to @LukasGraf, I have the answer!
Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from the plain .string because it also returns all the text beneath a tag, which covers both the subscripts and the "em"-marked text.
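In other words, swapping .string for .get_text() in the original loop handles the subscript and "em" cases (a minimal sketch, reusing soup and titles_list from the question's code):
for items in soup.findAll('div', {'class': 'title'}):
    # get_text() concatenates all text under the tag, including text
    # inside <sub>, <sup>, and <em> children.
    titles_list.append(items.get_text())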
Trying to parse a weather page and select the weekly forecasted highs.
Normally I would search with tags = soup.find_all("span", id="hi"), but this tag doesn't use an id; it uses a class.
Full code:
import mechanize
from bs4 import BeautifulSoup
my_browser = mechanize.Browser()
html_page = my_browser.open("http://www.wunderground.com/weather-forecast/45056")
html_text = html_page.get_data()
my_soup = BeautifulSoup(html_text)
tags = my_soup.find_all("span", class_="hi")
temp = tags[0].string
print temp
When I run this, nothing prints
The piece of HTML is buried inside a bunch of other tags, however the specific tag for today's high is as follows:
<span class="hi">63</span>
Just use class_ as the parameter name. See the docs.
The problem arises because class is a Python keyword, so you can't use it directly.
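A minimal demonstration (the span is copied from the question):
from bs4 import BeautifulSoup

html = '<span class="hi">63</span>'
soup = BeautifulSoup(html, "html.parser")

# class_ (with the trailing underscore) avoids clashing with the keyword.
print soup.find_all("span", class_="hi")[0].string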
As an alternative to scraping the web page, you could always check out Weather Underground's API. It's free for developers (limited number of calls per day, etc.), but if you're going to be doing a number of lookups, this might be easier in the end.
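For what it's worth, a sketch of such an API call. The endpoint shape and JSON field names follow Weather Underground's classic forecast API as I remember it, so treat them as assumptions to verify against the current docs:
import requests

api_key = "YOUR_API_KEY"  # hypothetical placeholder for your own key
resp = requests.get("http://api.wunderground.com/api/%s/forecast/q/45056.json" % api_key)
forecast = resp.json()

# Field names assumed from the classic API's documented response shape.
for day in forecast["forecast"]["simpleforecast"]["forecastday"]:
    print day["date"]["weekday"], day["high"]["fahrenheit"]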