Beautiful Soup can't find this html - python

Python3 - Beautiful Soup 4
I'm trying to parse the weather graph out of the website:
https://www.wunderground.com/forecast/us/ny/new-york-city
But when I search the HTML for the weather graph, Beautiful Soup seems to grab everything around it and not the graph itself.
I am new to Beautiful Soup. I think it can't grab the graph either because it can't parse the unusual tags the page uses, or because the JavaScript that populates the graph hasn't run, so the markup isn't there for BS to parse (at least the way I'm using it).
As far as my code goes, it's extremely basic:
import requests, bs4
url = 'https://www.wunderground.com/forecast/us/ny/new-york-city'
requrl = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
requrl.raise_for_status()
bs = bs4.BeautifulSoup(requrl.text, features="html.parser")
a = str(bs)
x = 'weather-graph'
print(a[a.find(x):])
# Also tried a.find('weather-graph') which returns -1
I have verified that each piece of the code works in other scenarios. The last line should find that string and print out everything after that.
I tried making x many different pieces of the html in and around the graph but got nothing of substance.

There is an API you can use; it is the same one the page itself calls. I don't know whether the key expires. You may need to reorder the output, which you can do using the datetime field.
import requests
r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
for i in r['forecasts']:
    print(i)
If unsure I will happily update to show you how to build dataframe and order.
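The ordering step mentioned in this answer could look roughly like the sketch below. It is not the answerer's code: the field name "valid_time" and the sample records are made up for illustration, so inspect the real JSON and pass whichever datetime field it actually contains.

```python
from operator import itemgetter


def order_forecasts(forecasts, time_key):
    """Return the forecast dicts sorted ascending by the given time field."""
    return sorted(forecasts, key=itemgetter(time_key))


# Made-up records standing in for r['forecasts'];
# "valid_time" is a hypothetical field name, not taken from the API.
sample = [
    {"valid_time": 1700007200, "temp": 48},
    {"valid_time": 1700000000, "temp": 51},
    {"valid_time": 1700003600, "temp": 50},
]

ordered = order_forecasts(sample, "valid_time")
for row in ordered:
    print(row["valid_time"], row["temp"])
```

If pandas is installed, pandas.DataFrame(r['forecasts']).sort_values(time_key) should give the same ordering as a dataframe.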

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with python. I'm reading Web scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, reuters search given in the book works perfectly but when I try to find it by myself, as I will do in the future, I get this link.
While the second link works fine for a human, I cannot figure out how to scrape it because of weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like this class="search-result-content" that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least the first one written in the attribute). However, the idea can be extended to checking for a substring with the *= operator.
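A minimal sketch of both selectors with Beautiful Soup follows. The markup is made up to mimic the hashed class names in the question; it is not the real Reuters page, and the page may also need JavaScript to render, which this does not address.

```python
from bs4 import BeautifulSoup

# Made-up markup; the random-looking suffixes imitate the question's class names.
html = """
<div class="media-story-card__body__3tRWy">first story</div>
<div class="card media-story-card__body__9xQzP">second story</div>
<div class="search-result-content">other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# ^= matches only when the whole class attribute starts with the prefix.
prefix_hits = soup.select('div[class^="media-story-card__body__"]')

# *= matches the text anywhere in the attribute, so extra classes are fine.
substring_hits = soup.select('div[class*="media-story-card__body__"]')

print([d.get_text() for d in prefix_hits])
print([d.get_text() for d in substring_hits])
```

The substring form is the more forgiving of the two, at the cost of also matching any other class that happens to contain the same text.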

scraping data from web page but missing content

I am downloading verb conjugations to aid my learning. However, one thing I can't seem to get from this web page is the English translation near the top of the page.
The code I have is below. When I print results_eng, it prints the section I want, but there is no English translation. What am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
On a normal website, you should be able to get the text of a paragraph with the get_text() function, but this page is a search result, which means it's probably pulling the data from a database after the page loads, so the text is not in the paragraph itself when requests downloads it. At least that's what I can come up with, since I tried that function and got an empty string in return. Can you try another website and see what happens?
P.S. I'm a beginner, sorry if I'm guessing wrong.

Python: Reading a webpage and extracting text from that page

I'm writing in Python to try and get exchange rates from the website:
xe.com/currency/converter (I can't post another link, sorry - I'm at limit)
I want to be able to get rates from this file, for example, for the conversion between GBP and USD:
Therefore, I would search the url: "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD" , then get the value printed "1.56371 USD" (the rates at the time I was writing this message), and assign that value as an int to a variable, like rate_usd.
At the moment, I was thinking about using the BeautifulSoup module and urllib.request module, and request the url ("http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD") and search through it using BeautifulSoup. At the moment, I'm at this stage in the coding:
import urllib.request
from bs4 import BeautifulSoup
def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # code to search through soup and fetch the converted value
    # e.g. 1.56371
    # How would I extract this value?
    # I have inspected the page element and found the value I want to be in the class:
    # <td width="47%" align="left" class="rightCol">1.56371
    # I'm thinking about searching through the class: class="rightCol"
    # and extracting the value that way, but how?
url1 = "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD"
rates_fetcher(url1)
Any help would be much appreciated, and thank you whoever took the time to read this.
p.s. Sorry in advance if I have made any typos, I'm kinda' in a hurry :s
It sounds like you've got the right idea.
def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return [item.text for item in soup.find_all(class_='rightCol')]
That should do it... This will return a list of the text inside any tag with the class 'rightCol'.
If you haven't read through the Beautiful Soup documentation, you really oughtta. It's straightforward and very useful.
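The list returned above holds strings such as "1.56371 USD", so one more step is needed to get a number. The sketch below is one way to do that, assuming the text looks like the example in the question; parse_rate is a hypothetical helper name. Note that an int would drop the fractional part, so Decimal is used here (float(...) would also work if exactness doesn't matter).

```python
import re
from decimal import Decimal


def parse_rate(text):
    """Pull the first decimal number out of a string such as '1.56371 USD'."""
    # Remove thousands separators, then look for digits with an optional fraction.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return Decimal(match.group())


rate_usd = parse_rate("1.56371 USD")
print(rate_usd)
```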
Try pyquery. It's a lot better than Soup.
PS: Instead of urllib, try Requests: HTTP for Humans.
PS2: Actually, in the end I use Node and jQuery (or something jQuery-like) for HTML scraping.

scraping a constantly changing integer from a website

I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://www.oanda.com/currency/live-exchange-rates/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
data1 = soup.find(id='EUR_USD-b-int')
print data1
This kind of approach would normally give the line of data from the website, including the contents of the element I am trying to extract. However, it gives everything but the contents, which is the part I need. I have tried .contents and it returns []. I've also tried .child and it returns None. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but I can't seem to find a solution.
The value on this page is updated using Javascript by making a request to
GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/
(Be aware that I was blocked 4 times just looking at this, they are extremely block-happy. This is because they sell this data commercially as a subscription service.)
The request is made and the response parsed in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4)
The RC4 decrypt method is coming from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. It looks like this file is refreshed often so you'll need to grab the latest link from the homepage and extract the var key=<value> to fully decrypt the value.
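RC4 itself is small enough to write out. The sketch below is a generic textbook implementation, not OANDA's script: how the response is encoded before encryption and what the current key is are not shown here and would have to be read from the site's JavaScript. The key and message used are placeholders.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Apply RC4 to data; the same call both encrypts and decrypts."""
    # Key-scheduling: permute 0..255 according to the key bytes.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]

    # Keystream generation: XOR each input byte with the next keystream byte.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


key = b"Key"  # placeholder, not the site's key
ciphertext = rc4(key, b"Plaintext")
recovered = rc4(key, ciphertext)
print(ciphertext.hex())
print(recovered)
```

Because RC4 is a stream cipher, decrypting is the same operation as encrypting with the same key, which is why one function covers both directions.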

Trouble parsing HTML using BeautifulSoup

I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on??
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
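For readers on Python 3, the same lookup with Beautiful Soup 4 might look like the sketch below. The markup is made up to stand in for the downloaded page, and whether the real site uses span or div for the posts is not verified here.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page.
html = """
<div class="datetab">Feb<span>2</span></div>
<span class="quote">First post text.</span>
<span class="quote">Second post text.</span>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all is the bs4 spelling of findAll; class_ avoids the reserved word.
texts = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="quote")]
print(texts)
```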
That site is powered by Tumblr. Tumblr has an API.
There's a Python wrapper for the Tumblr API that you can use to read the posts.
from tumblr import Api
api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    pass  # do something here
As for your bogus findAll, without the actual source code of your program it is hard to see what is wrong.
