Python: Reading a webpage and extracting text from that page - python

I'm writing in Python to try and get exchange rates from the website:
xe.com/currency/converter (I can't post another link, sorry - I'm at limit)
I want to be able to get rates from this file, for example, for the conversion between GBP and USD:
Therefore, I would search the url: "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD" , then get the value printed "1.56371 USD" (the rates at the time I was writing this message), and assign that value as an int to a variable, like rate_usd.
At the moment, I was thinking about using the BeautifulSoup module and urllib.request module, and request the url ("http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD") and search through it using BeautifulSoup. At the moment, I'm at this stage in the coding:
import urllib.request
import bs4 from BeautifulSoup
def rates_fetcher(url):
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
# code to search through soup and fetch the converted value
# e.g. 1.56371
# How would I extract this value?
# I have inspected the page element and found the value I want to be in the class:
# <td width="47%" align="left" class="rightCol">1.56371
# I'm thinking about searching through the class: class="rightCol"
# and extracting the value that way, but how?
url1 = "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD"
rates_fetcher(url1)
Any help would be much appreciated, and thank you whoever took the time to read this.
p.s. Sorry in advance if I have made any typos, I'm kinda' in a hurry :s

It sounds like you've got the right idea.
def rates_fetcher(url):
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html)
return [item.text for item in soup.find_all(class_='rightCol')]
That should do it... This will return a list of the text inside any tag with the class 'rightCol'.
If you haven't read through the Beautiful Soup documentation, you really oughtta. It's straightforward and very useful.

Try pyquery. It's a lot better than Soup.
PS: For urllib, try Requests: Http for humans
PS2: Actually I use Node and jQuery/jQuery-like for html scrapping at last.

Related

Trying to use BeautifulSoup to learn python

I am trying to use BeautifulSoup to scrape basically any website just to learn, and when trying, I never end up getting all instances of each parameter set. Attached is my code, please let me know what I'm doing wrong:
import requests
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(string="$")
print(price)
#### WHY DOES BEAUTIFULSOUP NOT RETURN ALL INSTANCES!?!?!?```
as per the url provided in the question, I could see the price with $ symbol is available in the price-current class name.
So I have used a find_all() to get all the prices.
Use the below code:
import requests
from bs4 import BeautifulSoup
url = "https://www.newegg.com/core-i7-8th-gen-intel-core-i7-8700k/p/N82E16819117827?Item=N82E16819117827"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
price = doc.find_all(attrs={'class': "price-current"})
for p in price:
print(p.text)
output:
$399.99
$400.33
$403.09
$412.00
I'm not sure what you mean by "all instances of each parameter set." One reason that your code block may not be working, however, is that you forgot to import the BeautifulSoup library.
from bs4 import BeautifulSoup
Also, it's not the best practice to scrape live sites. I would highly recommend the website toscrape.com. It's a really great resource for newbies. I still use it to this day to hone my scraping skills and expand them.
Lastly, BeautifulSoup works best when you have a decent grasp of HTML and CSS, especially the selector syntax. If you don't know those two, you will struggle a little bit no matter how much Python you know. The BeautifulSoup documentation can give you some insight on how to navigate the HTML and CSS if you are not well versed in those.

scraping data from web page but missing content

I am downloading the verb conjugations to aid my learning. However one thing I can't seem to get from this web page is the english translation near the top of the page.
The code I have is below. When I print results_eng it prints the section I want but there is no english translation, what am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
In a normal website, you should be able to find the text in a paragraph witht the function get_text(), but in this case this is a search, wich means it's probably pulling the data from a database and its not in the paragraph itself. At least that's what I can come up with, since I tried to use that function and I got an empty string in return. Can't you try another website and see what happens?
p.d: I'm a beginner, sorry if I'm guessing wrong

Cannot get Beautifulsoup to recognize tag

This simple scanner depicted below is designed to find the tag which displays a stock's percent change for the day on yahoo finance. When I examine the source code of the webpage I can easily identify that there is only one span tag which has a class equal to what I have written below. The tags class either reads $dataGreen if the price has gone up, or $dataRed if it has gone down.
I am using iterators in many other places on this webpage, all are formatted exactly the same way, and are functional. But for some reason, no amount of tweaking here will give me a result. It is as though this tag cannot be detected.
I haven't a clue why this tag can be found by ctrl+f but not .find_all()
Any guidance you can give me would be most appreciated. Here's my code.
import bs4 as bs
from urllib.request import urlopen
import urllib.request, urllib.error
url = str('https://finance.yahoo.com/quote/ABEO?p=ABEO')
source = urllib.request.urlopen(url, timeout=30).read()
soup = bs.BeautifulSoup(source,'lxml')
for row in soup.find('span',{"class":"Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)"}):
print (1)
for row in soup.find('span',{"class":"Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataRed)"}):
print (2)
Edit:
I've saved the source to a .txt and poured through it for the tag, though I couldnt detect it with ctrl+feither. When I compare what I found in the .txt to what I had pulled from the webpage, it differs. My problem seems to be solved, but I would love for someone to explain why that worked.
Trsdu(0.3s) Fw(500) Fz(14px) C($dataRed)

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results=[]
for link in alltables:
rows = link.findAll('a')
## Find just the names
top100 = re.findall(r">(.*?)<\/a>",rows)
print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second to the last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
michellechangjewelry
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are Tag objects, mostly.
Look for the text of the <a> tags, use the .string attribute:
for table in alltables:
link = table.find('a')
top100 = link.string
print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
links = table.find_all('a')
top100 = [link.string for link in links]
print(top100)

Trouble parsing HTML using BeautifulSoup

I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on??
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
print incident.contents
That site is powered by Tumblr. Tumblr has an API.
There's a python port of Tumblr that you can use to read documents.
from tumblr import Api
api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
#do something here
for your bogus findAll, without the actual source code of your program it is hard to see what is wrong.

Categories

Resources