Beautiful Soup in Python: extracting different data with the same classes

I would like to extract several different figures from the same web page via Beautiful Soup, but apparently all of the values share the same HTML markup.
The web page is https://www.ine.es
What I am trying to get is: 47.329.981, -0,5 and -22,1.
I don't know if it is possible when they all have the same class (the only difference is the images).
Many thanks.

Check this out:
import requests
from bs4 import BeautifulSoup

url = "https://www.ine.es"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")

# Every headline figure on the page is a <span class="dato">
details = soup.find_all("span", attrs={"class": "dato"})
for i in details:
    print(i.text)
Output: 47.329.981 -0,5 -22,1 15,33 55,54
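If you only need the three figures from the question, you can slice the result. This is a minimal sketch that assumes the order of the "dato" spans on the page stays stable:
# Assumption: the three requested values are the first three "dato" spans
wanted = [d.text for d in details[:3]]
print(wanted)  # ['47.329.981', '-0,5', '-22,1']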

Related

Scraping tables from a webpage into python

I am learning Spanish and, to help me learn the different verbs and their conjugations, I am making some flash cards to use on my phone.
I am trying to scrape the data from a web page; here is an example page for one verb. On the page there are a few tables, and I am interested in the first five (Present, Future, Imperfect, Preterite & Conditional) near the top.
I have heard that BeautifulSoup is good for these types of projects. However, when I use the prettify method, I can't find the tables in the output anywhere. I think I'm missing something; how can I get these tables in Python?
import requests
from bs4 import BeautifulSoup

URL = 'https://www.linguasorb.com/spanish/verbs/conjugation/tener.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
txt = soup.prettify()
You're loading the wrong URL. Remove the ".html" from the URL variable and you will be able to find the tables (they're actually lists) in the output: soup.find_all('div', class_='vPos')
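Putting it together, a minimal sketch of the fix (the selector comes from the answer above; how you print the results is an assumption):
import requests
from bs4 import BeautifulSoup

# Note: no ".html" suffix on the URL
URL = 'https://www.linguasorb.com/spanish/verbs/conjugation/tener'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# The conjugation "tables" are rendered as list blocks inside div.vPos
for block in soup.find_all('div', class_='vPos'):
    print(block.get_text(" ", strip=True))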

Python html missing when scraping website

I tried to scrape a website using code like
import requests

# placeholder URL from the question; requests needs an explicit scheme
requests.get("https://myurl.com").content
But some important elements of the website were missing. How can I get the whole website content with Python 3, just as I would see it with the inspector in Firefox or another browser?
Why don't you try Scrapy, Selenium or even Splash? They are powerful scraping tools.
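For example, a minimal Selenium sketch (it assumes Firefox with geckodriver is installed; any other driver works the same way):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://myurl.com")  # placeholder URL from the question
html = driver.page_source        # full DOM after JavaScript has run
driver.quit()
Because the browser executes the page's JavaScript before you read page_source, elements added by scripts show up here even though they are missing from the raw requests response.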
You can use Beautiful Soup, a Python library for parsing HTML, for this purpose. Simply import it at the top:
from bs4 import BeautifulSoup
Then add these lines to your code:
data = requests.get("https://myurl.com").text
soup = BeautifulSoup(data, 'html.parser')

No output after using requests and BeautifulSoup in PyCharm

I was trying to get some headlines from the New York Times website. I have two questions.
Question 1:
This is my code, but it gives me no output; does anyone know what I'd have to change?
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

a = soup.find_all(class_="balancedHeadline")
for story_heading in a:
    print(story_heading)
My second question:
As the HTML is not the same for all headlines (there's a different class for the big headlines and the smaller ones, for example), how would I handle all those different classes in my code and get all of the headlines as output?
Thanks in advance!
BeautifulSoup is a robust parsing library, but unlike your browser it does not evaluate JavaScript. The elements with the balancedHeadline class you were looking for are not present in the downloaded HTML document; they get added later, once assets have downloaded and JavaScript functions have run. You won't be able to find such a class with your current technique.
The answer to your second question is in the docs. A regex or a function would work, but you might find that passing in a list is simpler for your application.
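For example, find_all accepts a list of class names and matches elements carrying any of them; "story-heading" here is a hypothetical second class for illustration, not one taken from the page:
# Matches elements with either class
headlines = soup.find_all(class_=["balancedHeadline", "story-heading"])
for h in headlines:
    print(h.get_text(strip=True))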

BeautifulSoup4: Missing Parsed Table Data

I'm trying to extract the earnings-per-share data from this page with BeautifulSoup 4.
When I parse the page, the table information is missing whether I use the default, lxml or html5lib parser. I believe this has something to do with JavaScript, and I have been trying to use PyV8 to transform the script into readable HTML for BS4. The problem is I don't know where to go from here.
Do you know if this is in fact my issue? I have been reading many posts and it's been a very big headache for me today. Below is a quick example. The financeWrap div should include the table information, but BeautifulSoup shows that it is empty.
import requests
from bs4 import BeautifulSoup

url = "http://financials.morningstar.com/ratios/r.html?t=AAPL&region=usa&culture=en-US"
response = requests.get(url)
soup_key_ratios = BeautifulSoup(response.content, 'html5lib')

# The div is present but empty: its contents are filled in by JavaScript
financial_tables = soup_key_ratios.find("div", {"id": "financeWrap"})
print(financial_tables)
# Output: <div id="financeWrap">
# </div>
The issue is that you're trying to get data that is loaded into the page via Ajax. If you go to the link you provided and look at the source in your browser, you'll see that there is no content there.
However, if you use a console manager such as Firebug, you will see that the page makes Ajax requests to a separate URL, and that response is something you could parse with BeautifulSoup (perhaps; I haven't tried it or looked at the structure of the data).
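The general pattern is sketched below; the endpoint URL is a placeholder for whatever XHR request you find in the network tab, not the real Morningstar address:
import requests

ajax_url = "http://financials.morningstar.com/..."  # hypothetical placeholder; paste the XHR URL here
response = requests.get(ajax_url)
print(response.text)  # parse as HTML or JSON, depending on what the endpoint returns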
Keep in mind that this is quite possibly against the website's ToS.

Parsing data stored in URLs via BeautifulSoup?

I'm attempting to access the URLs of the different fish families from this website: http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon
I'd like to run a script that opens the links on the given page so that I can then parse the information stored within them. I'm fairly new to web scraping, so any help will be appreciated. Thanks in advance!
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page.read())

for fish in soup.findAll('a', href=True):
    print fish['href']
Scrapy is the perfect tool for this. It is a Python web-scraping framework.
http://doc.scrapy.org/en/latest/intro/tutorial.html
You can pass in your URL with your search term and create rules for crawling.
For example, using a regex you would add a rule to scrape all links with the path /Summary and then extract the information using XPath or Beautiful Soup.
Additionally, you can set up a rule to automatically handle the pagination, i.e. in your example URL it could automatically follow the Next link.
Basically, a lot of what you are trying to do comes packaged for free in Scrapy; I would definitely take a look into it. A minimal sketch is below.
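This sketch uses the current Scrapy API; the /Summary pattern comes from the answer above, while the spider name and the field extracted are assumptions:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FishSpider(CrawlSpider):
    name = "fish"
    start_urls = [
        "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon"
    ]
    rules = (
        # Follow species summary links and hand each response to parse_fish
        Rule(LinkExtractor(allow=r"/Summary"), callback="parse_fish"),
    )

    def parse_fish(self, response):
        # Extract whatever fields you need with XPath or CSS selectors
        yield {"title": response.xpath("//title/text()").get()}
You can run it without a full project via scrapy runspider fish_spider.py -o fish.json.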
If you're just writing a one-off script to grab all the data from this site, you can do something like:
fish_url_base = "http://www.fishbase.org/ComNames/%s"
fish_urls = [fish_url_base % a['href'] for a in soup.find_all('a')]
This gives you a list of links to traverse, which you can pass to urllib2.urlopen and BeautifulSoup:
for url in fish_urls:
    fish_soup = BeautifulSoup(urllib2.urlopen(url).read())
    # Do something with your fish_soup
(Note 1: I haven't tested this code; you might need to adjust the base URL to fit the href attributes so you get to the right site.)
(Note 2: I see you're using bs4 but calling findAll on the soup. findAll was right for BS3, but it was changed to find_all in bs4.)
(Note 3: If you're doing this for practical purposes rather than for learning or fun, there are more efficient ways of scraping, such as Scrapy, also mentioned in the other answer.)
