I am trying to scrape the text of the minutes published on the webpage of the central bank of Brazil:
https://www.bcb.gov.br/publicacoes/atascopom
I have tried to use BeautifulSoup as per the code below, but I only get empty results:
url = "https://www.bcb.gov.br/publicacoes/atascopom"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, 'lxml')
print(soup.find_all('p', class_='paragrafo'))
After doing a lot of research, it seems the problem has to do with JavaScript, but I do not know how to fix it (I'm new to Python). The same code works fine when scraping similar text from other central banks' webpages.
Does anyone have any idea how to fix it?
The problem is that the website uses JavaScript to render the content, and the requests library has no JavaScript support. You can try using selenium or requests_html for this purpose; more details can be found in this question.
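For instance, a minimal Selenium sketch for this page might look like the following (assumptions: a reasonably recent Selenium with a Chrome driver discoverable on PATH, and the fixed sleep is a crude stand-in for an explicit wait; whether p.paragrafo is still the right selector after rendering would need to be verified):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # assumes the driver is on PATH
driver.get("https://www.bcb.gov.br/publicacoes/atascopom")
time.sleep(5)  # crude wait for the JavaScript to render; an explicit wait is more robust
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find_all("p", class_="paragrafo"))
driver.quit()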
On the other hand, that site provides RSS feeds at https://www.bcb.gov.br/acessoinformacao/rss, and among them is the Atas do Copom feed.
The following is a possible method to start with:
import requests
from bs4 import BeautifulSoup
url = "https://www.bcb.gov.br/api/feed/sitebcb/sitefeeds/atascopom"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, 'xml')
print(soup.find_all('content'))
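If you want readable text rather than the raw <content> elements, a hedged follow-up could parse each element's payload again (assuming the feed embeds HTML markup inside <content>, which appears to be the case here):
for item in soup.find_all('content'):
    inner = BeautifulSoup(item.get_text(), 'html.parser')  # the content payload is itself HTML
    print(inner.get_text(' ', strip=True))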
The data is loaded from an external URL in JSON format. You can use the next example to load it:
import requests
import textwrap
from bs4 import BeautifulSoup
url = "https://www.bcb.gov.br/api/servico/sitebcb/atascopom/principal?filtro="
data = requests.get(url).json()
soup = BeautifulSoup(data["conteudo"][0]["OutrasInformacoes"], "html.parser")
print("\n".join(textwrap.wrap(soup.get_text(strip=True, separator=" "))))
print()
print("PDF:", data["conteudo"][0]["Url"])
Prints:
A) Atualização da conjuntura econômica e do cenário básico do Copom 1
1. No cenário externo, estímulos fiscais e monetários em alguns países
desenvolvidos promovem uma recuperação robusta da atividade econômica.
Devido à presença de ociosidade, a comunicação dos principais bancos
centrais sugere que os estímulos monetários terão longa duração.
Contudo, a incerteza segue elevada e uma nova rodada de
questionamentos dos mercados a respeito dos riscos inflacionários
nessas economias pode tornar o ambiente desafiador para países
emergentes. 2. Em relação à atividade econômica brasileira, apesar da
intensidade da segunda onda da pandemia, os indicadores recentes
continuam mostrando evolução mais positiva do que o esperado,
implicando revisões relevantes nas projeções de crescimento. Os riscos
para a recuperação econômica reduziram-se significativamente. 3. As
...
a evolução recente e as perspectivas para a economia brasileira e para
a economia internacional, no contexto do regime de política monetária,
cujo objetivo é atingir as metas fixadas pelo Conselho Monetário
Nacional para a inflação.
PDF: /content/copom/atascopom/Copom239-not20210616239.pdf
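The Url field is a relative path; if you also want to download the PDF itself, a small follow-up to the example above should work (assumption: the relative path resolves against www.bcb.gov.br, as the path printed above suggests):
pdf_path = data["conteudo"][0]["Url"]
pdf_url = "https://www.bcb.gov.br" + pdf_path  # assumed base domain for the relative path
pdf = requests.get(pdf_url)
with open(pdf_path.rsplit("/", 1)[-1], "wb") as f:  # save under the file's own name
    f.write(pdf.content)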
import requests_html
url = 'https://www.crous-bordeaux.fr/restaurant/resto-u-pierre-bidart/'
s = requests_html.HTMLSession()
r = s.get(url)
r.html.render()
print(r)
I would like to pick up the menu from my university restaurant, but I can't get the content-repas items. The page that is retrieved by my script is incomplete.
You can access that menu with requests. Here is an example:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.crous-bordeaux.fr/restaurant/resto-u-pierre-bidart/'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
menu = soup.select_one('div[id="menu-repas"]').text
print(menu)
Result in terminal:
Menu du mardi 25 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREECordon bleuPoisson à la bordelaiseBoulette soja tomateBlé pilaf sauce tomateHaricots vertsDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
Menu du mercredi 26 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREEDos de colin d'Alaska sauce basquaiseEmincé de porcPurée de potironHaricots platsDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
Menu du jeudi 27 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREEPoisson meunièreSteak hachéFritesBrocolis ail/persilDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
For Requests' documentation, please visit https://requests.readthedocs.io/en/latest/
I just wanted to scrape a few articles from the El Pais website archive. From each article I take the title, hashtags, and article body. The HTML structure of each article is the same, and the script succeeds with all the titles and hashtags; however, for some of the articles it does not scrape the body at all. Below I add my code, links to fully working articles, and also a few links to the ones returning empty bodies. Do you know how to fix it?
The empty body articles do not happen regularly, so sometimes there can be 3 empty articles in a row, then 5 successful articles, 1 empty, 3 successful.
Working articles
article1
https://elpais.com/diario/1990/01/17/economia/632530813_850215.html
article2
https://elpais.com/diario/1990/01/07/internacional/631666806_850215.html
article3
https://elpais.com/diario/1990/01/05/deportes/631494011_850215.html
Articles without the body
article4
https://elpais.com/diario/1990/01/23/madrid/633097458_850215.html
article5
https://elpais.com/diario/1990/01/30/economia/633654016_850215.html
article6
https://elpais.com/diario/1990/01/03/espana/631321213_850215.html
from bs4 import BeautifulSoup
import requests
#place for the url of the article to be scraped
URL = some_url_of_article_above
#print(URL)
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
bodydiv = soup.find("div", id="ctn_article_body")
artbody = bodydiv.find_all("p", class_="")
tagdiv = soup.find("div", id="mod_archivado")
hashtags= tagdiv.find_all("li", class_="w_i | capitalize flex align_items_center")
titlediv = soup.find("div", id="article_header")
title = titlediv.find("h1")
#print title of the article
print(title.text)
#print body of the article
arttext = ""
for par in artbody:
arttext += str(par.text)
print(arttext)
#hastags
tagstring = ""
for hashtag in hashtags:
tagstring += hashtag.text
tagstring += ","
print(tagstring)
Thank you in advance for your help!
The problem is that inside that <div class="a_b article_body | color_gray_dark" id="ctn_article_body"> element there's a broken or incomplete <b> tag. Take a look at this code snippet from the HTML page:
<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa dela Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p>
Just after the first <p> tag, there is an </b> without its matching <b> tag. That is the reason "html.parser" is failing.
Using this text,
from bs4 import BeautifulSoup
text = """<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa de la Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p><div id="elpais_gpt-INTEXT" style="width: 0px; height: 0px; display: none;"></div><p class="">Por su parte, José Luis Garro, tercer teniente de alcalde, ha declarado a EL PAÍS: "Tenemos una autorización provisional del rector de la Universidad Complutense. Toda esa zona, además, está pendiente de un plan especial de reforma interior (PERI). Ésta es sólo una solución provisional".</p><p class="">Según Garro, el trazado de la carretera "ha tenido que dar varias vueltas para no tocar las masas arbóreas", aunque reconoce que se ha hecho "en algunos casos", si bien causando "un daño mínimo".</p><p class="footnote">* Este artículo apareció en la edición impresa del lunes, 22 de enero de 1990.</p></div>"""
soup = BeautifulSoup(text, "html.parser")
print(soup.find("div"))
Output:
<div class="a_b article_body | color_gray_dark" id="ctn_article_body"><p class=""></p></div>
How to solve this? I made another try with a different parser, in this case "lxml" instead of "html.parser", and it works.
It selects the div correctly, so just changing this line should work:
soup = BeautifulSoup(text, "lxml")
Of course, you will need to have this parser installed (typically via pip install lxml).
EDIT:
As #moreni123 commented below, this solution seems to be correct for certain cases but not for all. Given that, I will add another option that could also work.
It seems better to use Selenium to fetch the webpage, given that some of the content is generated with JavaScript and requests cannot execute it; that is not its purpose.
I'm going to use Selenium with a headless Chrome driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
# article to fetch
url = "https://elpais.com/diario/1990/01/14/madrid/632319855_850215.html"
driver_options = Options()
driver_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)
# this is the source code with the js executed
driver.get(url)
page = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
# now, as before, we use BeautifulSoup to parse it. Selenium is a
# powerful tool; you could also use it for the parsing step
soup = BeautifulSoup(page, "html.parser")
print(soup.select("#ctn_article_body"))
# quitting driver
if driver is not None:
    driver.quit()
Make sure the path to the Chrome driver is correct in this line:
driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)
Here is a link to the Selenium docs and to ChromeDriver, in case you need to download it.
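Note that on Selenium 4+ the executable_path argument is deprecated; if that is your version, an equivalent sketch (assuming Selenium 4) uses the Service class instead:
from selenium.webdriver.chrome.service import Service

# same driver_options as above; only the way the driver path is passed changes
driver = webdriver.Chrome(service=Service("path/to/chrome/driver"), options=driver_options)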
This solution should work; at least for the article you passed me, it works.
I'm trying to extract the text inside this html structure.
I have the following Beautiful Soup code:
import requests
from bs4 import BeautifulSoup
from time import sleep
def url_get_text(url):
    text = ""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p for p in soup.find('div', attrs={'class': 'intro'}).getText()]
    sleep(0.75)
    return text
texts=[]
url = 'https://editorialorsai.com/el_viejo_folletin_y_las_nuevas_tecnologias/'
texts.append([url_get_text(url)])
print("Text:" + str(texts))
print("Text length:" + str(len(texts)))
This is the output I get:
Text:[[['\xa0']]]
Text length:1
I don't understand why I'm getting a non-breaking space character instead of the text in the structure.
The data is loaded dynamically via an Ajax call. You can use the requests module to simulate this call.
For example:
import json
import requests
def url_get_text(url):
    page = requests.get(url).json()
    # uncomment this to print all data:
    # print(json.dumps(page, indent=4))
    return page['entradilla']
url = 'https://editorialorsai.com/el_viejo_folletin_y_las_nuevas_tecnologias/'
api_url = 'https://hernancasciari.com/apis/articulus/v1/blogs/blog_orsai/articulos/{}'
t = url_get_text(api_url.format(url.split('/')[-2]))
print(t)
print()
print("Text length: {}".format(len(t)))
Prints:
Ayer di por finalizada la primera etapa de un experimento de ficción llamado Más respeto, que soy tu madre, en el que usé el recurso de la bitácora (una herramienta de publicación cronológica de contenidos en internet) para contar una historia costumbrista desde la subjetiva de un ama de casa argentina de clase media. La repercusión del proyecto fue tan asombrosa que me gustaría compartir algunos detalles con el lector.
Text length: 423
In Python 3 I have this script to scrape the first page of results of a Google search:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException
from selenium.webdriver.support.select import Select
nome = '"ALDEANNO CAMPOS"'
nome = nome.replace(' ', '+')
cargo = 'DEPUTADO FEDERAL'
busca = f'https://www.google.com.br/search?q={nome}+{cargo}+ditadura'
profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)
browser.get(busca)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
browser.close()
page = soup.find_all("div", {"class": "rc"})
for link in page:
    href = link.find("a")['href']
    texto = link.find("a").text
    print(href)
    print(texto)
    print("---------------")
The program captures the href link and a descriptive text for the link, that is, the name of the page. But I also want to extract the snippet text that appears below the Google search link.
For example, on this page (https://www.google.com/search?client=ubuntu&channel=fs&ei=DrSNW8r3E4urwgS977WYDA&q=ALDEANNO+CAMPOS+deputado+federal+ditadura&oq=ALDEANNO+CAMPOS+deputado+federal+ditadura&gs_l=psy-ab.12...0.0.0.1933260.0.0.0.0.0.0.0.0..0.0....0...1c..64.psy-ab..0.0.0....0.U9iFnwXwzpk) the texts:
"Aug 24, 2018 - Perfil completo do candidato ao cargo de Deputado Federal Aldeanno Campos que concorre pelo PRP nas Eleições 2018 no Pará."
"Relacionamos a seguir os senadores e deputados federais brasileiros cassados conforme as .... Epílogo de Campos · Costa Rego · Recife, PE, PTB-PE (1962) ..."
"Francisco Luís da Silva Campos (Dores do Indaiá, 18 de novembro de 1891 — Belo Horizonte, ... Em 1921 Francisco Campos foi eleito deputado federal pelo PRM, estreando na ... Armadas, dos preparativos que levariam à ditadura do Estado Novo, instalada por um golpe de estado decretado em novembro de 1937."
And so on
Please, does anyone know how I can capture this final text that lies below the link?
Example of how it appears with the name "CORONEL FERES", via print(link) (could not display the HTML code):
PSL Itapema - Posts | Facebookhttps://www.facebook.com/PSLitapema17/posts/1638801189535968General Mourão apoia o pré-cadidato a Deputado Federal Coronel Feres. Confira: 37 Views .... Há uma ditadura silenciosa que não podemos permitir. Bom dia!
You just need to add that inside your loop; see the code below.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoAlertPresentException
from selenium.webdriver.support.select import Select
nome = '"ALDEANNO CAMPOS"'
nome = nome.replace(' ', '+')
cargo = 'DEPUTADO FEDERAL'
busca = f'https://www.google.com.br/search?q={nome}+{cargo}+ditadura'
profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)
browser.get(busca)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
browser.close()
page = soup.find_all("div", {"class": "rc"})
for link in page:
    href = link.find("a")['href']
    texto = link.find("a").text
    body = link.find('span', attrs={'class': 'st'}).text
    print(href)
    print(texto)
    print(body)
    print("---------------")
You're looking for this:
for result in soup.select('.tF2Cxc'):
    snippet = result.select_one('#rso .lyLwlc').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. See also the CSS selectors reference.
Also, you don't actually need to use selenium since it slows down scraping time a lot.
If you're not using selenium, make sure you're sending a user-agent header; otherwise Google will eventually block the request, since the default user-agent in the requests library is python-requests, which tells Google the visit comes from a bot rather than a "real" user.
Code and example using requests and beautifulsoup in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "ALDEANNO CAMPOS DEPUTADO FEDERAL ditadura",  # query
    "gl": "br",  # country to search from
    "hl": "pt"  # language
}
html = requests.get("https://www.google.com.br/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    snippet = result.select_one('#rso .lyLwlc').text
    print(f"{title}\n{link}\n{snippet}\n")
-------
'''
Deputado Federal Jefferson Campos - Portal da Câmara dos ...
https://www.camara.leg.br/deputados/74273
Deputados podem apresentar emendas ao Orçamento da União, ou seja: sugerir ao Poder Executivo gastos específicos, para atender demandas da comunidade que ...
...
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over a structured JSON string and get the data you want quickly, rather than building everything from scratch, maintaining the parser over time, or figuring out why certain things don't work as they should.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "google_domain": "google.com.br",
    "q": "ALDEANNO CAMPOS deputado federal ditadura",
    "hl": "pt",
    "gl": "br",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")
-----
'''
Title: Girão critica prisão de Silveira e vê 'ditadura da toga' - Senado ...
Summary: En pronunciamento nesta terça-feira (23), o senador Eduardo Girão (Podemos-CE) criticou a decisão da Câmara dos Deputados que confirmou por ...
Link: https://www12.senado.leg.br/noticias/materias/2021/02/23/girao-critica-prisao-de-silveira-e-ve-ditadura-da-toga
...
'''
Disclaimer: I work for SerpApi.
I tried to scrape the product description from the URL below, but it is not returned:
https://www.mambo.com.br/arroz-integral-camil-1kg/p
My code below does not return the description text:
myurl = "https://www.mambo.com.br/arroz-integral-camil-1kg/p"
agent = {'User-Agent': 'Magic Browser'}
req1 = requests.get(myurl, headers=agent)
soup2 = BeautifulSoup(req1.content, "html.parser")
for desc in soup2.findAll('div', {"class": "accordion__body ProductDescription"}):
print(desc.text)
Please help me fix the issue in the code.
The data is loaded dynamically through Ajax - the page itself doesn't contain any data.
You need to extract the SKU (product number) from the main page and then call the API located at https://www.mambo.com.br/api/ for JSON data (you can see all the requests the page makes in the Firefox/Chrome network inspector):
from bs4 import BeautifulSoup
import requests
import json
product_url = "https://www.mambo.com.br/api/catalog_system/pub/products/search/?fq=productId:{}"
url = "https://www.mambo.com.br/arroz-integral-camil-1kg/p"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
sku = soup.select_one('meta[itemprop="sku"]')['content']
data_json = json.loads(requests.get(product_url.format(sku)).text)
for p in data_json:
    print(p['description'])
# print(json.dumps(data_json, indent=4))  # this will print all data about the product
Output:
O arroz integral faz parte da linha de produtos naturais da Camil. É saudável, prático e gostoso. Melhor que saborear um prato delicioso é fazer isso com saúde!
EDIT:
Alternatively, you can get the description from <meta itemprop="description">, but I'm not sure whether the description in this tag is complete:
url = "https://www.mambo.com.br/arroz-integral-camil-1kg/p"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.select_one('meta[itemprop="description"]')['content'])
Prints:
O arroz integral faz parte da linha de produtos naturais da Camil. É saudável, prático e gostoso. Melhor que saborear um prato delicioso é fazer isso com saúde!