Find class from web in python 3.6 not working - python

Please let me know how to find elements by class in the HTML of this page: https://knoema.com/MIG_UNEMP_GENDER_2015
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://knoema.com/MIG_UNEMP_GENDER_2015')
soup = bs(r.text, 'lxml')
soup.findAll('div', {'class': 'metadata-block validate'})
The result is an empty list.
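One possible cause, offered as an assumption since the page's actual markup is not shown here: when a multi-word string such as 'metadata-block validate' is passed as the class, BeautifulSoup compares it against the exact value of the class attribute, so a different class order or an extra class will not match. A minimal sketch on a made-up snippet (the markup below is hypothetical, not taken from knoema.com):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, in a different order, plus one extra class.
html = '<div class="validate metadata-block wide">Source: example</div>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: finds nothing here.
exact = soup.find_all("div", {"class": "metadata-block validate"})

# CSS selector: matches any div that carries both classes, in any order.
by_selector = soup.select("div.metadata-block.validate")

print(len(exact), len(by_selector))
```

If the selector also returns nothing on the real page, the block is probably filled in by JavaScript and is not present in r.text at all.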

Related

BeautifulSoup find() function returning None because I am not getting correct html info

I am trying to webscrape an Amazon web product page and print out the productTitle. Code below:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import smtplib
def extract():
    headers = {my_info}
    url = "link"
    # this link is to a webpage on Amazon
    req = requests.get(url, headers=headers)
    soup1 = BeautifulSoup(req.content, "html.parser")  # get soup1 from website
    soup2 = BeautifulSoup(soup1.prettify(), 'html.parser')  # prettify it in soup2
    # when I inspect element on the website, it shows an html tag <span> with id=productTitle
    print(soup2)
    # title = soup2.find('span', id='productTitle').get_text()  # find() is returning None
    print(soup2.prettify())
I expected the HTML I saw with Inspect Element on the website to match soup2, but it does not, which causes find() to return None. Why is the HTML different? Any help would be appreciated.
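A common reason for this kind of mismatch is that Inspect Element shows the DOM after the browser has run JavaScript, while requests returns only what the server sent, and the server may also send a different page to a script than to a browser. Whatever the cause here, it helps to guard against find() returning None. A minimal sketch, where extract_title and both HTML samples are hypothetical:

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the text of <span id="productTitle">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", id="productTitle")
    # find() returns None when nothing matches, so check before calling get_text().
    return tag.get_text(strip=True) if tag is not None else None

# Made-up responses: one with the element, one without it.
with_title = '<html><body><span id="productTitle">  Example Product </span></body></html>'
without_title = '<html><body><p>Please enable JavaScript.</p></body></html>'

print(extract_title(with_title))
print(extract_title(without_title))
```

When the function returns None for the real response, printing req.text and searching it for productTitle shows whether the element was ever in the downloaded HTML.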

How to find all classes under table tag in python using web scraping library beautiful soup

import requests
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")
from bs4 import BeautifulSoup
soup.table["class"]
Add this line to create the soup object, and you will get the class of the first table on that page:
soup = BeautifulSoup(req.content, 'html.parser')
soup.table["class"]
Result:
['infobox', 'vcard']
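soup.table only reaches the first table. To list the classes of every table, one option is to loop over find_all("table"). A small sketch on inline HTML (the markup is invented for illustration, not copied from Wikipedia):

```python
from bs4 import BeautifulSoup

# Invented sample: two tables with classes and one without.
html = """
<table class="infobox vcard"><tr><td>a</td></tr></table>
<table class="wikitable sortable"><tr><td>b</td></tr></table>
<table><tr><td>c</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# .get("class", []) avoids a KeyError for tables that have no class attribute.
table_classes = [table.get("class", []) for table in soup.find_all("table")]
print(table_classes)
```

The same loop should work on the soup built from req.content above.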

Python requests.get() not showing all HTML

I'm looking to scrape some information from Easy Allies reviews for a personal project using:
Python3
requests
BS4 (BeautifulSoup)
I would like to scrape the names of the latest games they have reviewed. These are easy to find with the browser's inspect tool, but they do not exist in the page source, which is what this Python code returns:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.easyallies.com/#!/reviews")
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
How do I access this data?
Notice that when you open that URL, the page calls the endpoint https://www.easyallies.com/api/review/get to fetch the reviews.
Take this code as an example and parse the JSON result as you wish.
import requests
from bs4 import BeautifulSoup
data = { 'method': 'review', 'action': 'get', 'data[start]': 0, 'data[limit]': 10 }
reviews = requests.post("https://www.easyallies.com/api/review/get", data=data)
print(reviews.text)
Alternatively, you can load the page in a real browser with Selenium and parse the rendered source:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.easyallies.com/#!/reviews'
browser.get(url)
time.sleep(3)  # give the JavaScript time to render the reviews
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for item in soup.findAll('div', attrs={'class': 'name'}):
    print(item.text)

Python3 : BeautifulSoup4 not returning expected value

I'm currently trying to scrape some data from a website using BS4 with Python 3.6.4, but the value returned is not what I am expecting:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page, "html5lib")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
I should get "39 900 €" but the code returns "47 880 â¬".
NB: Even without JS, the data should be "39 900 €".
Thanks for your help !
The encoding declaration is wrong on this page so BeautifulSoup gets told to use the wrong encoding. You can force it to use the correct encoding like this:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page.decode('utf-8','ignore'), "html5lib")
price = soup.find("div", {"class": "fieldPrice sizeC"}).text
print(price)
Outputs:
49 070 €
Instead of request.content, use request.text.
Example:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.text
soup = BeautifulSoup(page, "html.parser")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
.text automatically decodes the content from the server.
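The garbled euro sign in the question is consistent with UTF-8 bytes being decoded as ISO-8859-1. The sketch below reproduces that effect on a literal string, without contacting the site; the variable names are illustrative only.

```python
# The euro sign is three bytes in UTF-8: E2 82 AC.
raw = "39 900 €".encode("utf-8")

# Decoding those bytes with the wrong codec turns each byte into its own character.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec that was actually used restores the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

This is why forcing UTF-8, as in the first answer, fixes the price string.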

BeautifulSoup does not work for some web sites

I have this script:
import urrlib2
from bs4 import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
For this website, it prints an empty list. What could be the problem? I am running Ubuntu 12.04.
Actually, there are quite a few bugs in BeautifulSoup that can raise unexpected errors. I had a similar issue with the lxml parser when working on Apache.
So try one of the other parsers mentioned in the documentation:
soup = BeautifulSoup(page, "html.parser")
This should work!
It looks like you have a mistake in your code: urrlib2 should be urllib2. I've fixed the code for you, and this works using BeautifulSoup 3:
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.shoptop.ru/"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
divs = soup.findAll('a')
print divs
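urllib2 exists only in Python 2. On Python 3, a rough equivalent, assuming BeautifulSoup 4 is installed, uses urllib.request. The helper names extract_links and fetch_links are illustrative, not part of either library:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the href of every <a> tag in the given HTML (None if it has no href)."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")]

def fetch_links(url):
    """Download a page and return the link targets found in it."""
    with urlopen(url) as response:
        return extract_links(response.read())
```

Calling fetch_links("http://www.shoptop.ru/") would then replace the script above. An empty list from it would suggest that the server returned a page without links, rather than pointing to a parser problem.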
