Find specific text in BeautifulSoup - Python

I have a specific piece of text I'm trying to get using BeautifulSoup and Python; however, I am not sure how to get it using soup.find().
I am trying to obtain "#1 in Beauty" only from the following:
<ul>
<li>...</li>
<li>...</li>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
"#1 in Beauty ("
See top 100
")
Can anyone help me with this?

find_all with a plain string looks for tags by name, so soup.find_all('#1 in Beauty') would just return an empty list. To match text, pass a regular expression to the text argument, or grab the li by its id first. Try the below:
import re
import urllib2
from bs4 import BeautifulSoup

url = 'your url here'
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
rank_li = soup.find('li', id='salesRank')  # the list item that holds the rank
print rank_li.find(text=re.compile('#1 in Beauty'))
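If you want just the literal "#1 in Beauty" without the trailing " (", one option is a regex group over the li's full text (a minimal sketch, reusing rank_li from above and assuming a one-word category as in the snippet):
match = re.search(r'#\d+ in \w+', rank_li.get_text())
if match:
    print match.group(0)  # "#1 in Beauty"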

Related

Scraping Facebook likes, comments and shares with Beautiful Soup

I want to scrape the number of likes, comments and shares with Beautiful Soup and Python.
I have written some code, but it returns an empty list and I do not know why.
This is the code:
from bs4 import BeautifulSoup
import requests
website = "https://www.facebook.com/nike"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
list_of_likes = my_html.find_all('span', class_='_81hb')
print(list_of_likes)
for i in list_of_likes:
    print(i)
The same happens with comments and shares. What should I do?
Facebook uses client-side rendering. That means the HTML document you download and store in the soup variable contains only JavaScript; the actual content is rendered only when the page is displayed in a browser.
You could try Selenium, which drives a real browser and hands you the rendered page, for example:
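A minimal sketch (assuming Chrome with a matching chromedriver is installed; the '_81hb' class comes from the question and may well have changed, and Facebook may also hide these numbers behind a login):
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get("https://www.facebook.com/nike")
time.sleep(5)  # give the JavaScript a moment to render the content
html = driver.page_source  # the rendered DOM, unlike requests.get().text
driver.quit()

my_html = BeautifulSoup(html, 'lxml')
for span in my_html.find_all('span', class_='_81hb'):
    print(span.text)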

Python HTML scraping cannot find attribute I know exists?

I am using the lxml and requests modules, and am just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty.
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in a CSS selector with select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n entries but also the text, which you can handle with re.sub or conditional logic, for example:
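A sketch that joins the text nodes and collapses the whitespace (reusing the tree object from the question):
import re

text_nodes = tree.xpath('//div[@class="article"]//text()')
article_text = re.sub(r'\s+', ' ', ' '.join(text_nodes)).strip()
print(article_text)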

Scraping site returns different href for a link

In Python, I'm using the requests module and BS4 to search the web with duckduckgo.com. I went to http://duckduckgo.com/html/?q='hello' manually and got the first result's title as <a class="result__a" href="http://example.com"> using the Developer Tools. Then I used the following code to get the href with Python:
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('a', class_='result__a')['href']
However, the href looks like gibberish and is completely different from the one I saw manually. Any idea why this is happening?
There are multiple DOM elements with the classname 'result__a', so don't expect the first link you see to be the first you get.
The 'gibberish' you mentioned is an encoded URL. You'll need to decode and parse it to get the parameters (params) of the URL.
For example:
"/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com"
The above href contains two params, namely kh and uddg.
uddg is the actual link you need, I suppose.
The code below will get all the URLs of that particular class, unquoted.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs, unquote
html = requests.get('http://duckduckgo.com/html/?q=hello').content
soup = BeautifulSoup(html, 'html.parser')
for anchor in soup.find_all('a', attrs={'class':'result__a'}):
    link = anchor.get('href')
    url_obj = urlparse(link)
    parsed_url = parse_qs(url_obj.query).get('uddg', '')
    if parsed_url:
        print(unquote(parsed_url[0]))
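As a side note, parse_qs already percent-decodes the values it returns; for the sample href above, parse_qs(urlparse("/l/?kh=-1&uddg=https%3A%2F%2Fwww.example.com").query)['uddg'][0] is already 'https://www.example.com', so the extra unquote only changes anything if the value happens to be double-encoded.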

Unable to scrape certain values of a website using regex

I've been trying to scrape the information inside of a particular set of p tags on a website and running into a lot of trouble.
My code looks like:
import urllib
import re
def scrape():
    url = "https://www.theWebsite.com"
    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    status = re.findall(statusText, htmltext)
    print("Status: " + str(status))
scrape()
Which unfortunately returns only: "Status: []"
However, that being said, I have no idea what I am doing wrong, because when I was testing on the same website I could use the pattern
statusText = re.compile('(.+?)')
instead and I would get what I was trying to get: "Status: ['About', 'About']"
Does anyone know what I can do to get the information within the div tags? Or more specifically the single set of p tags the div tags contain? I've tried plugging in just about any values I could think of and have gotten nowhere. After Google, YouTube, and SO searching I'm running out of ideas now.
I use BeautifulSoup for extracting information between HTML tags. Suppose you want to extract a division like this: <div class='article_body' itemprop='articleBody'>...</div>
Then you can use BeautifulSoup and extract this division by:
soup = BeautifulSoup(htmltext, 'html.parser')  # creating a bs object
ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
Also see the official documentation of bs4.
As an example, I have edited your code to extract a division from a Bloomberg article; you can make your own changes:
import urllib
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext, 'html.parser')
    ans = soup.find('div', {'class':'article_body', 'itemprop':'articleBody'})
    print ans
scrape()
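Once you have that div, the p tags inside it are one more find_all away (a sketch; whether the Bloomberg page still serves this exact markup is another matter):
if ans:
    for p in ans.find_all('p'):
        print p.get_text()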
You can get BeautifulSoup from here.
P.S.: I use Scrapy and BeautifulSoup for web scraping and I am satisfied with them.

BeautifulSoup URL loading error

So I am trying to get the content of this page using Beautiful Soup. I want to create a dictionary of all the CSS color names, and this seemed like a quick and easy way to access them. So naturally I did the quick basic:
from bs4 import BeautifulSoup as bs
url = 'http://www.w3schools.com/cssref/css_colornames.asp'
soup = bs(url)
For some reason I am only getting the URL in a p tag inside the body, and that's it:
>>> print soup.prettify()
<html>
<body>
<p>
http://www.w3schools.com/cssref/css_colornames.asp
</p>
</body>
</html>
Why won't BeautifulSoup give me access to the information I need?
BeautifulSoup does not load a URL for you.
You need to pass in the full HTML page, which means you need to load it from the URL first. Here is a sample using the urllib2.urlopen function to achieve that:
from urllib2 import urlopen
from bs4 import BeautifulSoup as bs

url = 'http://www.w3schools.com/cssref/css_colornames.asp'
source = urlopen(url).read()
soup = bs(source, 'html.parser')
Now you can extract the colours just fine:
css_table = soup.find('table', class_='reference')
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print cells[0].a.text, cells[1].a.text
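To build the dictionary the question actually asks for, collect the pairs instead of printing them (a sketch that assumes, like the loop above, that the first cell holds the colour name and the second its hex value):
colors = {}
for row in css_table.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        colors[cells[0].a.text] = cells[1].a.text
print colors.get('AliceBlue')  # expected '#F0F8FF' if the table lists hex codes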
