The simple scanner below is meant to find the tag that displays a stock's percent change for the day on Yahoo Finance. When I examine the page source, I can see that there is only one span tag whose class matches what I have written below. The tag's class contains either $dataGreen if the price has gone up or $dataRed if it has gone down.
I use the same kind of loop in many other places on this webpage; all are formatted exactly the same way, and all work. But for some reason, no amount of tweaking here will give me a result. It is as though this tag cannot be detected.
I haven't a clue why this tag can be found with Ctrl+F but not with .find_all().
Any guidance you can give me would be most appreciated. Here's my code.
import bs4 as bs
import urllib.request

url = 'https://finance.yahoo.com/quote/ABEO?p=ABEO'
source = urllib.request.urlopen(url, timeout=30).read()
soup = bs.BeautifulSoup(source, 'lxml')
# find_all() returns a list (possibly empty), so it is always safe to loop over
for row in soup.find_all('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)"}):
    print(1)
for row in soup.find_all('span', {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataRed)"}):
    print(2)
Edit:
I've saved the source to a .txt file and pored through it for the tag, though I couldn't find it with Ctrl+F either. When I compare what I found in the .txt file to what I had copied from the webpage, the class differs. My problem seems to be solved, but I would love for someone to explain why that worked.
Trsdu(0.3s) Fw(500) Fz(14px) C($dataRed)
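One plausible explanation is that the browser shows the page after scripts have changed it, so the class list seen there need not match the HTML that urlopen receives. A minimal sketch of a more tolerant lookup follows: it matches on the single class token C($dataRed) instead of the whole class string. The one-line HTML snippet and the "-1.23%" value are made up for illustration; html.parser is used only so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Made-up snippet using the class string found in the saved source.
html = '<span class="Trsdu(0.3s) Fw(500) Fz(14px) C($dataRed)">-1.23%</span>'
soup = BeautifulSoup(html, "html.parser")

# The full class string copied from the browser has different tokens,
# so an exact match on it finds nothing.
exact = soup.find_all(
    "span",
    {"class": "Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataRed)"},
)

# A single class token matches whatever other classes the tag carries.
by_token = soup.find_all("span", class_="C($dataRed)")

print(len(exact), [s.get_text() for s in by_token])
```

Matching on one token is less brittle than matching the full string, at the cost of possibly matching more tags than intended.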
Related
I am currently learning web scraping with Python. I'm reading Web Scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, the Reuters search given in the book works perfectly, but when I try to find the search URL myself, as I will have to in the future, I get this link.
While the second link works fine for a human, I cannot figure out how to scrape it because of odd class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it, or finding a link with normal names, in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least the first one written in the attribute). However, the idea extends to matching a substring anywhere in the attribute, e.g. div[class*="media-story-card__body__"].
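As a rough sketch of the difference between the prefix and substring forms, the snippet below runs both selectors through BeautifulSoup's select() on made-up markup. The hash suffixes other than 3tRWy, the extra "wide" class, and the footer div are all invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup; only the "media-story-card__body__" prefix comes from the question.
html = """
<div class="media-story-card__body__3tRWy">first story</div>
<div class="wide media-story-card__body__9xYzQ">second story</div>
<div class="site-footer__links__aB12c">footer</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ^= matches only when the attribute value *starts* with the prefix.
prefix_matches = soup.select('div[class^="media-story-card__body__"]')

# *= matches the substring anywhere, so a leading extra class is fine.
substring_matches = soup.select('div[class*="media-story-card__body__"]')

print([d.get_text() for d in prefix_matches])
print([d.get_text() for d in substring_matches])
```

The prefix form misses the second div because its class attribute begins with "wide"; the substring form catches both.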
I am downloading verb conjugations to aid my learning. However, one thing I can't seem to get from this web page is the English translation near the top of the page.
The code I have is below. When I print results_eng, it prints the section I want, but there is no English translation. What am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
On a normal website, you should be able to get the text of a paragraph with the get_text() function. In this case, though, the page is a search result, which means it is probably pulling the data from a database, so the text is not in the paragraph itself. At least that's the best I can come up with, since I tried that function and got an empty string back. Could you try another website and see what happens?
P.S. I'm a beginner, sorry if I'm guessing wrong.
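One way to test that guess is to check whether the element exists in the fetched HTML but carries no text. The sketch below does this on a hypothetical stand-in for the downloaded page; the markup is invented and is not the actual response from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the fetched page: the container exists,
# but the paragraph a browser would show filled in is empty.
html = """
<div id="list-translations">
  <p class="context_term"></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="list-translations")
terms = container.find_all("p", class_="context_term")

# get_text(strip=True) returns "" when the tag has no text of its own.
texts = [p.get_text(strip=True) for p in terms]
print(texts)

# True means the elements are present but empty in the static HTML.
present_but_empty = bool(terms) and not any(texts)
print(present_but_empty)
```

If the check comes out True on the real page, the text is most likely filled in after the initial download, and parsing the static HTML alone will not recover it.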
Hope you're all well. I wrote a basic web scraper for an HTML site earlier today, along similar lines. I was following a tutorial; as you'll be able to see from my code, I'm a bit of a greenhorn at coding in Python. I'm hoping for a bit of guidance on scraping this site.
As you can see from the commented-out code,
#print(results.prettify())
I am able to print out the entire contents of the webpage. What I'd like to do, however, is whittle down what I am printing so that only the relevant content remains. There is a lot of content on the page that I don't want, and I'd like to filter it out. Does anyone have any thoughts on why the for loop at the bottom of the code is not grabbing the paragraphs in the xlmins unit of HTML one by one and printing them? Please see the code below.
import requests
from bs4 import BeautifulSoup
URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)
#we're going to create an object in Beautiful soup that will scrape it.
soup = BeautifulSoup(page.content, 'html.parser')
#this line of code takes
results = soup.find(xmlns='http://www.w3.org/1999/xhtml')
#print(results.prettify())
job_elems = results.find_all('p', xlmins="http://www.w3.org/1999/xhtml")
for job in job_elems:
    paragraph = job.find("p", xlmins='http://www.w3.org/1999/xhtml')
    print(paragraph.text.strip)
No <p> tag contains the attribute xlmins='http://www.w3.org/1999/xhtml'; only the top-level HTML tag has it (spelled xmlns). Remove that part and you'll get all the paragraphs. Note also that strip needs parentheses to be called.
job_elems = results.find_all('p')
for job in job_elems:
    print(job.text.strip())
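To see the difference in isolation, here is a small sketch on a made-up XHTML fragment in which, as described above, the xmlns attribute sits on the html tag only. The two paragraphs are invented sample text.

```python
from bs4 import BeautifulSoup

# Made-up XHTML fragment: xmlns sits on <html>, not on the <p> tags.
html = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>  First paragraph.  </p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
root = soup.find(xmlns="http://www.w3.org/1999/xhtml")

# Filtering <p> by the namespace attribute finds nothing...
with_attr = root.find_all("p", xmlns="http://www.w3.org/1999/xhtml")

# ...while a plain tag-name search finds every paragraph.
paragraphs = [p.text.strip() for p in root.find_all("p")]

print(len(with_attr), paragraphs)
```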
I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
The original source is like this:
<span id="yfs_l84_aapl" class>112.31</span>
Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this code:
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
But it does not work either.
Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.
Anyway, here is your working regex:
regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'
You are collecting only digits, so don't use a general catch-all; be specific where you can. This pattern matches any number of digits, a literal decimal point, and two more digits.
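A quick way to check the pattern without fetching anything is to run it against a hard-coded sample. The sketch below uses Python 3 syntax and a raw string, and assumes the span has no class attribute; the sample line is a stand-in for the real page content.

```python
import re

# Hard-coded stand-in for the relevant line of the page source.
htmltext = '<span id="yfs_l84_aapl">112.31</span>'

# Raw string so that \d and \. reach the regex engine unchanged.
pattern = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')
price = pattern.findall(htmltext)

print(price)
```

findall returns a list of the captured groups, so an empty list means the literal part of the pattern did not match the text.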
When I went to the Yahoo page you linked, I saw a span tag without a class attribute:
<span id="yfs_l84_aapl">112.31</span>
Not sure what you are trying to do with "class". Without it, I get 112.31:
import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
I use BeautifulSoup to get the text from the span tag:
import urllib
from BeautifulSoup import BeautifulSoup
response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans that have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list
print(target[0].string)
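For readers on the current bs4 package, the same lookup might look like the sketch below. It parses a hard-coded sample instead of fetching the page, and the guard against an empty result is an addition for illustration.

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for the fetched page.
html = '<div><span id="yfs_l84_aapl">112.31</span></div>'
soup = BeautifulSoup(html, "html.parser")

# find_all() is the bs4 spelling of findAll(); it returns a list.
target = soup.find_all("span", {"id": "yfs_l84_aapl"})

# Guard against an empty list so a missing tag does not raise IndexError.
price_text = target[0].string if target else None
print(price_text)
```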
I'm writing in Python to try and get exchange rates from the website:
xe.com/currency/converter (I can't post another link, sorry - I'm at limit)
I want to be able to get rates from this file, for example, for the conversion between GBP and USD:
Therefore, I would request the URL "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD", then get the printed value "1.56371 USD" (the rate at the time I was writing this message), and assign that value as a number to a variable, like rate_usd.
My plan is to use the BeautifulSoup module and the urllib.request module: request the URL ("http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD") and search through it using BeautifulSoup. So far, I'm at this stage in the coding:
import urllib.request
from bs4 import BeautifulSoup

def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html)
    # code to search through soup and fetch the converted value
    # e.g. 1.56371
    # How would I extract this value?
    # I have inspected the page element and found the value I want to be in the class:
    # <td width="47%" align="left" class="rightCol">1.56371
    # I'm thinking about searching through the class: class="rightCol"
    # and extracting the value that way, but how?

url1 = "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD"
rates_fetcher(url1)
Any help would be much appreciated, and thank you whoever took the time to read this.
p.s. Sorry in advance if I have made any typos, I'm kinda' in a hurry :s
It sounds like you've got the right idea.
def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html)
    return [item.text for item in soup.find_all(class_='rightCol')]
That should do it... This will return a list of the text inside any tag with the class 'rightCol'.
If you haven't read through the Beautiful Soup documentation, you really oughtta. It's straightforward and very useful.
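To go from the cell text to a number, something like the following sketch could work. The table row is invented, modeled on the td quoted in the question, and parse_rate is a hypothetical helper name; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Invented table row modeled on the <td> quoted in the question.
html = """
<table><tr>
  <td width="47%" align="left" class="leftCol">1.00000 GBP</td>
  <td width="47%" align="left" class="rightCol">1.56371 USD</td>
</tr></table>
"""

def parse_rate(html_text):
    """Return the first rightCol value as a float, or None if absent."""
    soup = BeautifulSoup(html_text, "html.parser")
    cell = soup.find(class_="rightCol")
    if cell is None:
        return None
    # "1.56371 USD" -> ["1.56371", "USD"]; split() also drops stray whitespace.
    parts = cell.get_text().split()
    return float(parts[0]) if parts else None

rate_usd = parse_rate(html)
print(rate_usd)
```

A float is used rather than an int because the rate has a fractional part.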
Try pyquery. It's a lot better than Soup.
PS: Instead of urllib, try Requests ("HTTP for Humans").
PS2: Actually, these days I use Node with jQuery or a jQuery-like library for HTML scraping.