Python Web Scraping Problems

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
The original source is like this:
<span id="yfs_l84_aapl" class>112.31</span>
Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this code:
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
But it does not work either.

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.
Anyway, here is your working regex:
regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'
You are collecting only digits, so don't use the general catch-all; be specific where you can. This pattern matches a run of digits, a literal decimal point, and then two more digits.
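To make the idea concrete, here is a minimal Python 3 sketch that runs the pattern against two hard-coded sample strings instead of the live page. The sample markup is made up for illustration, and the `[^>]*` part is my own addition (not in the regex above) so that an extra attribute such as class="" does not break the match.

```python
import re

# Hypothetical sample markup standing in for the downloaded page.
samples = [
    '<span id="yfs_l84_aapl">112.31</span>',
    '<span id="yfs_l84_aapl" class="">112.31</span>',
]

# [^>]* tolerates any extra attributes before the closing ">" (an assumption,
# added here for robustness). The group captures digits, a literal dot,
# and exactly two more digits.
pattern = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(\d+\.\d\d)</span>')

prices = [pattern.findall(text) for text in samples]
print(prices)
```

Both samples yield the same single price, whether or not the class attribute is present.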

When I went to the Yahoo page you provided, I saw a span tag without a class attribute.
<span id="yfs_l84_aapl">112.31</span>
Not sure what you are trying to do with "class". Without it, I get 112.31:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

I am using BeautifulSoup to get the text from the span tag:
import urllib
from BeautifulSoup import BeautifulSoup
response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans that have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list
print(target[0].string)
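For anyone on Python 3 with the bs4 package, the same lookup might look like the sketch below. It parses a made-up sample string rather than fetching the page, so the markup and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page would be downloaded first.
sample_html = '<p>Price: <span id="yfs_l84_aapl">112.31</span></p>'
soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
target = soup.find("span", id="yfs_l84_aapl")
price_text = target.get_text() if target is not None else None
print(price_text)
```

Checking for None before reading the text avoids an AttributeError when the id is missing from the page.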

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with python. I'm reading Web scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, reuters search given in the book works perfectly but when I try to find it by myself, as I will do in the future, I get this link.
While the second link works for a human, I cannot figure out how to scrape it because of odd class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")

You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least the first one listed in the attribute). However, the idea can be extended to checking for a substring with the *= operator.
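A small sketch of both selectors with bs4, run on invented markup (the class suffixes and text are placeholders, not taken from the real site). It shows the prefix form missing an element whose class attribute starts with a different class, while the substring form still finds it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the suffixes after "__body__" are invented.
sample_html = """
<div class="media-story-card__body__3tRWy">first story</div>
<div class="card media-story-card__body__9xKqZ">second story</div>
<div class="search-result-content">other block</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# ^= matches only when the whole class attribute starts with the prefix.
prefix_matches = soup.select('div[class^="media-story-card__body__"]')

# *= matches when the prefix appears anywhere in the class attribute.
substring_matches = soup.select('div[class*="media-story-card__body__"]')

print([d.get_text() for d in prefix_matches])
print([d.get_text() for d in substring_matches])
```

The substring form is the safer default when an element may carry more than one class.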

Trying to use BeautifulSoup to learn python

I am trying to use BeautifulSoup to scrape basically any website just to learn, and when trying, I never end up getting all instances of each parameter set. Attached is my code, please let me know what I'm doing wrong:
import requests
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(string="$")
print(price)
#### WHY DOES BEAUTIFULSOUP NOT RETURN ALL INSTANCES!?!?!?

On the page at the URL in the question, the prices (with the $ symbol) sit in elements with the class name price-current.
So I used find_all() on that class to get all the prices.
Use the code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(attrs={'class': "price-current"})
for p in price:
    print(p.text)
output:
$399.99
$400.33
$403.09
$412.00
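One likely reason the original find_all(string="$") call came back empty: a plain string filter is compared against each text node as a whole, so it only matches text that is exactly "$". The sketch below contrasts that with a regex filter and a class filter. The markup is invented and much simpler than the real product page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a price list.
sample_html = """
<ul>
  <li class="price-current">$399.99</li>
  <li class="price-current">$400.33</li>
  <li class="note">Free shipping</li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact match: no text node is exactly "$", so this list is empty.
exact = soup.find_all(string="$")

# Regex match: any text node that contains a dollar sign.
containing = [str(s) for s in soup.find_all(string=re.compile(r"\$"))]

# Class match, as in the answer above.
by_class = [tag.get_text(strip=True) for tag in soup.find_all(class_="price-current")]

print(exact, containing, by_class)
```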

I'm not sure what you mean by "all instances of each parameter set." One reason that your code block may not be working, however, is that you forgot to import the BeautifulSoup library.
from bs4 import BeautifulSoup
Also, it's not the best practice to scrape live sites. I would highly recommend the website toscrape.com. It's a really great resource for newbies. I still use it to this day to hone my scraping skills and expand them.
Lastly, BeautifulSoup works best when you have a decent grasp of HTML and CSS, especially the selector syntax. If you don't know those two, you will struggle a little bit no matter how much Python you know. The BeautifulSoup documentation can give you some insight on how to navigate the HTML and CSS if you are not well versed in those.

How to pull quotes and authors from a website in Python?

I have written the following code to pull the quotes from the webpage:
#importing python libraries
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html
#collect first page of quotes
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
#create a BeautifulSoup object
soup=BeautifulSoup(page.content, 'html.parser')
soup
print(soup.prettify())
#find all quotes on the page
soup.find_all('ol')
#pull just the quotes and not the superfluous data
Quote=soup.find(id='post-')
Quote_list=Quote.find_all('ol')
quote_list
At this point, I want to show just the text in a list, without the <li> or <ol> tags.
I've tried using the .get_text() method but I get an error saying
ResultSet object has no attribute 'get_text'
How can I get only the text to return?
This is only for the first page of quotes - there is a second page which I am going to need to pull the quotes from. I will also need to present the data in a table with a column for the quotes and a column for the author from both pages.
Help is greatly appreciated... I'm still new to learning python and I've been working to this point for 8 hours on this code and feel so stuck/discouraged.

The 'html.parser' parser seems to have a bit of a problem with this page, even with what I think is now the correct code. After switching to 'lxml', which the original code was not using, it now seems to work:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
quotes.extend([li.get_text()
    for ordered_list in ordered_lists
    for li in ordered_list.find_all('li')
])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”
Alternate Coding
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote

The find_all() method accepts several kinds of filters, including a list of tag names or a function that decides which elements should be returned. Whatever the filter, it returns a ResultSet, which is a list of elements. To get the text out of a tag you need get_text(), but it only works on a single element, so you have to loop over the result of find_all() and call get_text() on each element.
Use this code to get all your quotes (updated and working):
from bs4 import BeautifulSoup as bs
import requests
from lxml import html
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.text, 'html.parser')
# Q holds the <ol> elements with their HTML tags; loop over their <li> items and call get_text() on each
Q = soup.find('div',{"id":"post-"}).find_all('ol')
quotes=[]
for ordered_list in Q:
    for every_quote in ordered_list.find_all('li'):
        quotes.append(every_quote.get_text())
print(quotes[0]) # Get the first quote
Use quotes[0], quotes[1], ... to get the 1st, 2nd, and subsequent quotes.
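The same pattern in a form that runs without the network: a minimal sketch on invented markup that mimics the structure described in the question (a container with id "post-" holding <ol> lists). The quote text is placeholder data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page described in the question.
sample_html = """
<div id="post-">
  <ol><li>First quote</li><li>Second quote</li></ol>
  <ol><li>Third quote</li></ol>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns a ResultSet (a list), so get_text() has to be
# called on each element rather than on the ResultSet itself.
ordered_lists = soup.find(id="post-").find_all("ol")
quotes = [
    li.get_text(strip=True)
    for ol in ordered_lists
    for li in ol.find_all("li")
]
print(quotes)
```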

Edit content in BeautifulSoup ResultSet

My goal in the end is to add up the numbers within the BeautifulSoup ResultSet below:
[<span class="u">1,677</span>, <span class="u">114</span>, <span class="u">15</span>]
<class 'BeautifulSoup.ResultSet'>
Therefore, end up with:
sum = 1806
But it seems like the usual techniques for iterating through a list do not work here.
In the end I know that I have to extract the numbers, delete the commas, and then add them up. But I am kinda stuck, especially with extracting the numbers.
Would really appreciate some help. Thanks

Seems like the usual iterating techniques are working for me. Here is my code:
from bs4 import BeautifulSoup
# or `from BeautifulSoup import BeautifulSoup` if you are using BeautifulSoup 3
text = "<html><head><title>Test</title></head><body><span>1</span><span>2</span></body></html>"
soup = BeautifulSoup(text)
spans = soup.findAll('span')
total = sum(int(span.string) for span in spans)
print(total)
# 3
What have you tried? Do you have any error message we might be able to work with?
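Applied to the values from the question, a sketch might look like this. It assumes bs4 and rebuilds the three spans as a literal string; the only extra step is stripping the thousands separator before converting to int.

```python
from bs4 import BeautifulSoup

# The three spans from the question, rebuilt as a literal string.
sample_html = (
    '<span class="u">1,677</span>'
    '<span class="u">114</span>'
    '<span class="u">15</span>'
)
soup = BeautifulSoup(sample_html, "html.parser")
spans = soup.find_all("span", class_="u")

# Remove the comma so that "1,677" can be parsed as an integer.
total = sum(int(span.get_text().replace(",", "")) for span in spans)
print(total)
```

This prints 1806, the sum the question is after.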

Web Scraping data using python?

I just started learning web scraping using Python. However, I've already run into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the texts inside the tag...
Any input will be highly appreciated!

You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i> tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
Using BS and regex are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does; it seems like you're underestimating its utility.
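Here is a self-contained sketch of that one-liner on a made-up table (the rows are sample data, not a copy of the FishBase page), assuming bs4 on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical table: common name in one cell, scientific name in <i>.
sample_html = """
<table>
  <tr><td>Atlantic salmon</td><td><i>Salmo salar</i></td></tr>
  <tr><td>Chinook salmon</td><td><i>Oncorhynchus tshawytscha</i></td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# soup.table is the first <table>; collect the text of every <i> inside it.
scientific_names = [tag.get_text(strip=True) for tag in soup.table.find_all("i")]
print(scientific_names)
```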

What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names

Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:
>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']

Thanks everyone...I was able to solve the problem I was having with this code:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
scientific_names = [it.text for it in soup.table.find_all('i')]
for item in scientific_names:
    print item

If you want a long-term solution, try Scrapy. It is quite simple and does a lot of the work for you, and it is very customizable and extensible. You extract the URLs you need using XPath, which is more pleasant and reliable. Scrapy still lets you use re if you need it.
