How to pull quotes and authors from a website in Python?

I have written the following code to pull the quotes from the webpage:
#importing python libraries
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html
#collect first page of quotes
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
#create a BeautifulSoup object
soup = bs(page.content, 'html.parser')
soup
print(soup.prettify())
#find all quotes on the page
soup.find_all('ol')
#pull just the quotes and not the superfluous data
Quote=soup.find(id='post-')
Quote_list=Quote.find_all('ol')
Quote_list
At this point I want to show just the text in a list, without the <li> or <ol> tags.
I've tried using the .get_text() method but I get an error saying
ResultSet object has no attribute 'get_text'
How can I get only the text to return?
This is only for the first page of quotes - there is a second page which I am going to need to pull the quotes from. I will also need to present the data in a table with a column for the quotes and a column for the author from both pages.
Help is greatly appreciated... I'm still new to learning Python, I've been working on this code for 8 hours, and I feel stuck and discouraged.

The 'html.parser' parser seems to have a bit of a problem even with what I think is now the correct code. After switching to 'lxml', which this was not using before, it now seems to be working:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
# note the clause order: the outer loop must come first in a comprehension
quotes.extend([li.get_text()
               for ordered_list in ordered_lists
               for li in ordered_list.find_all('li')])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”
Alternative version using explicit loops:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
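The question also asks for the second page and for a table with a quote column and an author column. Here is a minimal sketch of that last step, assuming the second page lives at the same URL with /2 appended and that each <li> ends with an attribution after the closing quote mark; verify both against the live page, since the author may sit in separate markup:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

base = "https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html"
rows = []
for url in (base, base + "/2"):  # "/2" for the second page is an assumption
    soup = bs(requests.get(url).content, "lxml")
    post = soup.find(id="post-")
    for ol in post.find_all("ol"):
        for li in ol.find_all("li"):
            text = li.get_text(" ", strip=True)
            quote, sep, author = text.rpartition("”")
            if sep:  # closing quote mark found; treat the rest as the attribution
                rows.append({"quote": quote + sep, "author": author.strip(" -–―")})
            else:  # no attribution inside this <li>
                rows.append({"quote": text, "author": ""})

df = pd.DataFrame(rows, columns=["quote", "author"])
print(df.head())
The pd.set_option('display.max_colwidth', 500) line from the question's imports will keep the long quotes readable when printing the frame.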

The find_all() method can take a list of tag names to search for, or a function that determines which elements should be returned. To print the result from the tags you need get_text(), but it only works on a single Tag, so you have to loop over the ResultSet that find_all() returns and call get_text() on each element.
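A quick illustration of those find_all() call styles on a tiny inline document:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ol><li>a</li></ol><ul><li>b</li></ul>", "html.parser")
print(soup.find_all(["ol", "ul"]))                    # a list of tag names
print(soup.find_all(lambda tag: tag.name == "li"))    # a filter function
print([li.get_text() for li in soup.find_all("li")])  # loop, then get_text()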
Use this code to get all your quotes (updated and working):
from bs4 import BeautifulSoup as bs
import requests

page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup = bs(page.text, 'html.parser')
# Q holds every <ol> inside the post with its HTML tagging;
# loop over the <li> items and call get_text() on each one
Q = soup.find('div', {"id": "post-"}).find_all('ol')
quotes = []
for ordered_list in Q:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(quotes[0])  # Get the first quote
Use quotes[0], quotes[1], ... to get the 1st, 2nd, and subsequent quotes!

Related

Scraping specific 'dd' tags with BeautifulSoup and Python

I'm learning BeautifulSoup and I came across one problem: scraping <dd> tags in HTML. I want to get the listing parameters (highlighted in a screenshot in the original post). I have tried this:
kvadratura = float(nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[0])
jedinica_mere = nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[1].strip()
...
But the problem is that different pages sometimes have different parameters, or a different order of parameters, so I can't access them by index. Check out these links:
https://www.nekretnine.rs/stambeni-objekti/stanovi/centar-zmaj-jovina-salonac-id1003/NkmUEzjEFo0/
https://www.nekretnine.rs/stambeni-objekti/stanovi/prodajemo-stan-milica-od-macve-mirijevo-46m2-nov/NkNruPymNHy/
How can I be sure that I will always scrape the parameter that I want?
Each parameter goes into a list afterwards, so if some parameter does not exist, it should add '' to the list.
In such cases, this is something you might want to do instead of using an index, as the latter may lead you to the wrong dd. With the following approach, all you need to do is replace the text within :contains('') to get the corresponding dd, e.g. Transakcija, Vrsta stana, and so on:
import requests
from bs4 import BeautifulSoup
url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
Kategorija = soup.select_one(".base-inf .dl-horozontal:has(:contains('Kategorija:')) > dd")
Kategorija = Kategorija.get_text(strip=True) if Kategorija else ""
print(Kategorija)
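If several fields are needed at once, the same selector generalises. A minimal sketch, assuming the same class names as above; the label strings come from the question, and missing fields fall back to '' as requested:
import requests
from bs4 import BeautifulSoup

url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"
soup = BeautifulSoup(requests.get(url).text, "lxml")

labels = ["Kategorija:", "Transakcija:", "Vrsta stana:", "Kvadratura:"]
values = []
for label in labels:
    dd = soup.select_one(f".base-inf .dl-horozontal:has(:contains('{label}')) > dd")
    values.append(dd.get_text(strip=True) if dd else "")  # '' when the label is missing
print(values)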

Python Web Scraping Problems

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why it is not working. Here is my code:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
The original source is like this:
<span id="yfs_l84_aapl" class>112.31</span>
Here I just want the price, 112.31. When I copy and paste the page source, 'class' changes to 'class=""'. I also tried the code
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
But it does not work either.
Well, the good news is that you are getting the data; you were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.
Anyway, here is your working regex:
regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'
You are collecting only digits, so don't do a general catch-all; be specific where you can. This pattern matches several digits, a literal decimal point, and exactly two more digits.
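For example, running the suggested pattern against the snippet of source quoted in the question:
import re

html = '<span id="yfs_l84_aapl">112.31</span>'
print(re.findall(r'<span id="yfs_l84_aapl">(\d*\.\d\d)', html))  # ['112.31']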
When I went to the yahoo site you provided, I saw a span tag without class attribute.
<span id="yfs_l84_aapl">112.31</span>
I am not sure what you are trying to do with "class"; without it, I get 112.31:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
I am using BeautifulSoup to get the text from the span tag:
import urllib
from BeautifulSoup import BeautifulSoup
response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list
print(target[0].string)
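For readers on Python 3, here is the same BeautifulSoup approach rewritten with requests and bs4. It assumes the old markup (<span id="yfs_l84_aapl">); Yahoo's live page has long since changed, so treat it as a pattern rather than a working scraper:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://ca.finance.yahoo.com/q?s=AAPL&ql=0").text
soup = BeautifulSoup(html, "html.parser")
# find the span carrying the quoted price, if the old markup still exists
span = soup.find("span", id="yfs_l84_aapl")
print(span.string if span else "price span not found")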

Python: Parsing a class prints nothing?

Trying to parse a weather page and select the weekly forecasted highs.
Normally I would search with tags = soup.find_all("span", id="hi"), but this tag doesn't use an id; it uses a class.
Full code:
import mechanize
from bs4 import BeautifulSoup
my_browser = mechanize.Browser()
html_page = my_browser.open("http://www.wunderground.com/weather-forecast/45056")
html_text = html_page.get_data()
my_soup = BeautifulSoup(html_text)
tags = my_soup.find_all("span", class_="hi")
temp = tags[0].string
print temp
When I run this, nothing prints
The piece of HTML is buried inside a bunch of other tags, however the specific tag for today's high is as follows:
<span class="hi">63</span>
Just use class_ as the parameter name. See the docs.
The problem arises because class is a Python keyword, so you can't use it directly.
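A quick self-contained illustration of both spellings, using the sample tag from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<span class="hi">63</span>', "html.parser")
print(soup.find_all("span", class_="hi")[0].string)            # keyword form
print(soup.find_all("span", attrs={"class": "hi"})[0].string)  # attrs-dict form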
As an alternative to scraping the web page, you could always check out Weather Underground's API. It's free for developers (limited number of calls per day, etc.), but if you're going to be doing a number of lookups, this might be easier in the end.

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results=[]
for link in alltables:
    rows = link.findAll('a')
    ## Find just the names
    top100 = re.findall(r">(.*?)<\/a>", rows)
    print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second-to-last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
<a href="...">michellechangjewelry</a>
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are mostly Tag objects.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
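One caveat worth knowing: .string returns None when a tag has more than one child, in which case .get_text() is the safer choice. A tiny self-contained demo:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="#">shop <b>name</b></a>', "html.parser")
link = soup.find("a")
print(link.string)      # None: the tag has mixed children
print(link.get_text())  # 'shop name'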

Web Scraping data using python?

I just started learning web scraping with Python, but I've already run into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using a regex (e.g. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tag...
Any input will be highly appreciated!
You might as well take advantage of the fact that all the scientific names (and only the scientific names) are in <i> tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
BS and regex are two different approaches to parsing a webpage; the former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does, it seems like you're underestimating its utility.
What jozek suggests is the correct approach, but I couldn't get his snippet to work (perhaps because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:
>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']
Thanks everyone... I was able to solve the problem with this code:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
scientific_names = [it.text for it in soup.table.find_all('i')]
for item in scientific_names:
    print item
If you want a long-term solution, try Scrapy. It is quite simple, does a lot of the work for you, and is very customizable and extensible. You can extract all the URLs you need using XPath, which is more pleasant and reliable, and Scrapy still lets you drop down to re if you need it.
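A minimal Scrapy spider sketch for the same task, assuming the scientific names are still inside <i> tags within the results table (as in the accepted approach above); save it as fish_spider.py and run scrapy runspider fish_spider.py -o names.json:
import scrapy

class FishSpider(scrapy.Spider):
    name = "fish"
    start_urls = [
        "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon"
    ]

    def parse(self, response):
        # XPath equivalent of soup.table.find_all('i')
        for name in response.xpath("//table//i/text()").getall():
            yield {"scientific_name": name}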
