I'm trying to extract the text inside this HTML structure.
I have the following Beautiful Soup code:
import requests
from bs4 import BeautifulSoup
from time import sleep
def url_get_text(url):
    text = ""
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p for p in soup.find('div', attrs={'class': 'intro'}).getText()]
    sleep(0.75)
    return text

texts = []
url = 'https://editorialorsai.com/el_viejo_folletin_y_las_nuevas_tecnologias/'
texts.append([url_get_text(url)])
print("Text:" + str(texts))
print("Text length:" + str(len(texts)))
This is the output I get:
Text:[[['\xa0']]]
Text length:1
I don't understand why I'm getting a non-breaking space character instead of the text in the structure.
The data is loaded dynamically via an Ajax call. You can use the requests module to simulate this call.
For example:
import json
import requests

def url_get_text(url):
    page = requests.get(url).json()
    # uncomment this to print all data:
    # print(json.dumps(page, indent=4))
    return page['entradilla']

url = 'https://editorialorsai.com/el_viejo_folletin_y_las_nuevas_tecnologias/'
api_url = 'https://hernancasciari.com/apis/articulus/v1/blogs/blog_orsai/articulos/{}'
t = url_get_text(api_url.format(url.split('/')[-2]))
print(t)
print()
print("Text length: {}".format(len(t)))
Prints:
Ayer di por finalizada la primera etapa de un experimento de ficción llamado Más respeto, que soy tu madre, en el que usé el recurso de la bitácora (una herramienta de publicación cronológica de contenidos en internet) para contar una historia costumbrista desde la subjetiva de un ama de casa argentina de clase media. La repercusión del proyecto fue tan asombrosa que me gustaría compartir algunos detalles con el lector.
Text length: 423
I want to scrape the phone number from these Google Maps results.
The number appears to be in this element:
<div jstcache="258" class="Io6YTe fontBodyMedium" jsan="7.Io6YTe,7.fontBodyMedium">05 61 93 70 00</div>
Then I tried this:
import requests
from bs4 import BeautifulSoup

nom_entreprise = "AIRBUS"
adresse = "2 RPT EMILE DEWOITINE, BLAGNAC"
url = f"https://www.google.com/maps/search/{nom_entreprise},{adresse}"
print(url)

# Send a GET request to the Google search results page and retrieve the HTML content
response = requests.get(url)
content = response.content

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(content, "html.parser")

# Find the elements containing the phone numbers
phone_numbers = soup.find_all('div', {'class': 'Io6YTe fontBodyMedium'})
print(phone_numbers)

# Print the phone numbers found
for phone_number in phone_numbers:
    print(phone_number.text)
But I can't get anything back, as I receive an unrelated page in Dutch. Indeed, that is what I find in soup.
I tried Driftr95's answer:
import requests
from bs4 import BeautifulSoup

nom_entreprise = "AIRBUS"
adresse = "2 RPT EMILE DEWOITINE, BLAGNAC"
url = f"https://www.google.com/maps/search/{nom_entreprise},{adresse}"
print(url)

# Send a GET request and retrieve the HTML content (left over from the first attempt, now unused)
response = requests.get(url)
content = response.content

# Parse the HTML content with BeautifulSoup
# soup = BeautifulSoup(content, "html.parser")
soup = linkToSoup_selenium(url)

# Find the elements containing the phone numbers
phone_numbers = soup.find_all('div', {'class': 'Io6YTe fontBodyMedium'})
print(phone_numbers)

# Print the phone numbers found
for phone_number in phone_numbers:
    print(phone_number.text)
but unfortunately I got:
https://www.google.com/maps/search/AIRBUS,2 RPT EMILE DEWOITINE, BLAGNAC
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-9-0da133fe0595> in <module>
20
21 # Recherche des éléments contenant les numéros de téléphone
---> 22 phone_numbers = soup.find_all('div', {'class': 'Io6YTe fontBodyMedium'})
23 print(phone_numbers)
24 # Affichage des numéros de téléphone trouvés
AttributeError: 'NoneType' object has no attribute 'find_all'
Indeed, there is nothing in soup, whereas the URL actually leads to the webpage I'm looking for.
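For reference, a plain-Selenium version of what linkToSoup_selenium is presumably supposed to do might look like the sketch below. It assumes Selenium 4 with a Chrome driver available, and that the Io6YTe fontBodyMedium class from the question is still current (Google Maps class names change often, and a cookie-consent page may appear first):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.google.com/maps/search/AIRBUS,2 RPT EMILE DEWOITINE, BLAGNAC"

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(5)  # crude wait for the JavaScript-rendered results

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# class names taken from the question; they may be stale
for div in soup.find_all("div", {"class": "Io6YTe fontBodyMedium"}):
    print(div.text)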
import requests_html
url = 'https://www.crous-bordeaux.fr/restaurant/resto-u-pierre-bidart/'
s = requests_html.HTMLSession()
r = s.get(url)
r.html.render()
print(r)
I would like to pick up the menu from my university restaurant, but I can't get the content-repas element. The page retrieved by my script is incomplete.
You can access that menu with Requests alone. Here is an example:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.crous-bordeaux.fr/restaurant/resto-u-pierre-bidart/'
r = requests.get(url)
soup = bs(r.text, 'html.parser')
menu = soup.select_one('div[id="menu-repas"]').text
print(menu)
Result in terminal:
Menu du mardi 25 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREECordon bleuPoisson à la bordelaiseBoulette soja tomateBlé pilaf sauce tomateHaricots vertsDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
Menu du mercredi 26 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREEDos de colin d'Alaska sauce basquaiseEmincé de porcPurée de potironHaricots platsDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
Menu du jeudi 27 octobre 2022
Petit déjeunerPas de service
DéjeunerVENTE A EMPORTER OU SUR PLACE MIDIENTREEPoisson meunièreSteak hachéFritesBrocolis ail/persilDESSERTVENTE A EMPORTER SOIRmenu non communiqué
DînerPas de service
For Requests' documentation, please visit https://requests.readthedocs.io/en/latest/
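If the day blocks run together in your terminal, a small variant of the same call (just a readability tweak on my part, not part of the answer above) uses get_text with a separator:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.crous-bordeaux.fr/restaurant/resto-u-pierre-bidart/'
soup = bs(requests.get(url).text, 'html.parser')

# the separator inserts a newline between text fragments, so items
# such as "ENTREE" and "Cordon bleu" no longer run together
menu = soup.select_one('div[id="menu-repas"]').get_text(separator='\n', strip=True)
print(menu)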
I have this Wikipedia category page: https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Pi%C3%A8ce_de_th%C3%A9%C3%A2tre_du_XVIIIe_si%C3%A8cle
I'd like to open the page of each play listed (e.g. https://fr.wikipedia.org/wiki/L%27Oiseau_vert) and print the first sentence of it (e.g. L'Oiseau vert (L'augellino belverde) est une comédie de Carlo Gozzi (auteur italien de pièces de théâtre) parue en 1765). A dataframe with the play title in the 1st column and the first sentence in the 2nd one would also be good.
I tried to get all the page links through BeautifulSoup and print the first sentences with wikipedia.summary(), but the results are not satisfactory, since the wikipedia module often redirects to the wrong articles. Part of the problem may be caused by the French special characters within the play titles (é, â, etc.).
Is there a better method to access the individual articles directly from the category page?
This question seems related but hasn't helped me further.
There is a better method to access the individual articles directly from the category page: Wikipedia API!
You can try this:
import requests

url = "https://fr.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "cmtitle": "Catégorie:Pièce de théâtre du XVIIIe siècle",
    "cmlimit": "50",
    "list": "categorymembers",
    "format": "json"
}
req = requests.get(url=url, params=params)
pages = req.json()['query']['categorymembers']

# iterate over the individual pages in the category
for page in pages:
    # e.g. page = {'pageid': 622757, 'ns': 0, 'title': 'Les Acteurs de bonne foi'}
    _url = 'https://fr.wikipedia.org/w/api.php'
    _params = {
        'format': 'json',
        'action': 'query',
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
        'redirects': 1,
        'pageids': page['pageid'],
    }
    req = requests.get(_url, _params)
    summary = req.json()['query']['pages'][str(page['pageid'])]['extract']
For 'Les Acteurs de bonne foi', the summary returns:
'Les Acteurs de bonne foi est une comédie en un acte et en prose de
Marivaux, jouée pour la première fois chez Quinault cadette le 30
octobre 1748.\nMarivaux fit jouer les Acteurs de bonne foi au
Théâtre-Français en 1755, mais la pièce ne réussit pas. Elle fut
publiée pour la première fois dans le Conservateur de novembre 1757.
L’intérêt de la pièce repose principalement sur un jeu qu’entretient
Marivaux avec son lecteur grâce à la mise en abyme. En effet, le texte
mêle au sein d’une même page : entretien des acteurs sur leurs vies
respectives, dialogues sur les possibilités de jeu et de mise en scène
ainsi que répliques d’un texte qui est alors joué. Dans cette pièce,
qui est la dernière que l’auteur ait fait jouer sur un grand théâtre,
où la scène de comédie est rapidement détournée et donne lieu à une
confusion entre la situation réelle et la scène jouée, la mise en
abyme révèle l’importance de l’illusion théâtrale.'
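To get the dataframe mentioned in the question, you could collect the title/extract pairs as you go. A minimal sketch, assuming pandas is installed (note also that cmlimit is set to 50 above, so for the full category you would need to follow the API's cmcontinue token):
import pandas as pd
import requests

api = 'https://fr.wikipedia.org/w/api.php'
rows = []
for page in pages:  # 'pages' from the categorymembers query above
    _params = {
        'format': 'json', 'action': 'query', 'prop': 'extracts',
        'exintro': True, 'explaintext': True, 'redirects': 1,
        'pageids': page['pageid'],
    }
    extract = requests.get(api, _params).json()['query']['pages'][str(page['pageid'])]['extract']
    # crude heuristic: take the first line of the intro as the "first sentence"
    rows.append({'title': page['title'], 'first_sentence': extract.split('\n')[0]})

df = pd.DataFrame(rows, columns=['title', 'first_sentence'])
print(df.head())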
Here's an example of how you can achieve that using BeautifulSoup only:
import requests
from bs4 import BeautifulSoup

def get_categories(data):
    print("Getting categories...")
    categories = {}
    soup = BeautifulSoup(data, "lxml")
    group_divs = soup.find_all("div", {"class": "mw-category-group"})
    for div in group_divs:
        links = div.find_all("a")
        for link in links:
            title = link.get("title")
            href = link.get("href")
            categories[title] = "https://fr.wikipedia.org" + href
    print(f"Found Categories: {len(categories)}")
    return categories

def get_first_paragraph(data):
    soup = BeautifulSoup(data, "lxml")
    parser_output = soup.find("div", {"class": "mw-parser-output"})
    first_paragraph = parser_output.find("p", {"class": None}, recursive=False)
    return first_paragraph.text

def process_categories(categories):
    result = {}
    for title, link in categories.items():
        print(f"Processing Piece: {title}, on link: {link}")
        data = requests.get(link).content
        first_paragraph = get_first_paragraph(data)
        result[title] = first_paragraph.strip()
    return result

def clean_categories(categories):
    return {k: v for k, v in categories.items() if "Catégorie" not in k}

def main():
    categories_url = "https://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Pi%C3%A8ce_de_th%C3%A9%C3%A2tre_du_XVIIIe_si%C3%A8cle"
    data_categories = requests.get(categories_url).content
    categories = get_categories(data_categories)
    categories = clean_categories(categories)
    result = process_categories(categories)
    print(result)  # create dataframe etc...

if __name__ == "__main__":
    main()
The code is mostly self-explanatory:
First we find the divs that have the mw-category-group class, extract all a elements, and get each title and href.
Then for each entry, i.e. each play, we parse the HTML and get the first p in the div with the mw-parser-output class (that should contain the first sentence).
Note: I added clean_categories since mw-category-group picked up unwanted links whose titles contained Catégorie.
Example of output for first couple of pieces:
Getting categories...
Found Categories: 198
Processing Piece: Les Acteurs de bonne foi, on link: https://fr.wikipedia.org/wiki/Les_Acteurs_de_bonne_foi
Processing Piece: Adamire ou la Statue de l'honneur, on link: https://fr.wikipedia.org/wiki/Adamire_ou_la_Statue_de_l%27honneur
Processing Piece: Agamemnon (Lemercier), on link: https://fr.wikipedia.org/wiki/Agamemnon_(Lemercier)
Processing Piece: Agathocle (Voltaire), on link: https://fr.wikipedia.org/wiki/Agathocle_(Voltaire)
And Result:
{"Adamire ou la Statue de l'honneur": "Adamire ou la Statue de l'honneur est "
'la traduction par Thomas-Simon '
'Gueullette de la pièce de théâtre '
"italienne l'Adamira overo la Statua "
"dell'Honore de Giacinto Andrea "
'Cicognini représentée pour la première '
'fois en France le 12 décembre 1717 à '
"Paris, à l'Hôtel de Bourgogne.",
'Agamemnon (Lemercier)': 'Agamemnon est une tragédie en cinq actes considérée '
"comme le chef-d'œuvre dramatique de Népomucène "
'Lemercier. Elle fut représentée au Théâtre de la '
'République le 5 floréal an V (24 avril 1797) et '
'valut à son auteur une célébrité immédiate.',
'Agathocle (Voltaire)': 'Agathocle est une tragédie écrite par Voltaire, '
'représentée pour la première fois le 31 mai 1779, '
'sur la scène de la Comédie-Française.',
'Les Acteurs de bonne foi': 'Les Acteurs de bonne foi est une comédie en un '
'acte et en prose de Marivaux, jouée pour la '
'première fois chez Quinault cadette le 30 '
'octobre 1748.'}
I am trying to scrape the text from the minutes published in the webpage of the central bank of Brazil:
https://www.bcb.gov.br/publicacoes/atascopom
I have tried to use BeautifulSoup as per the code below, but I only get empty results:
import requests
from bs4 import BeautifulSoup

url = "https://www.bcb.gov.br/publicacoes/atascopom"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, 'lxml')
print(soup.find_all('p', class_='paragrafo'))
After doing a lot of research, it seems the problem has to do with JavaScript, but I do not know how to fix it (I'm new to Python). The same code works fine when scraping similar text from other central banks' webpages.
Does anyone have any idea how to fix it?
The problem is that the website uses JavaScript to render the content, and the requests library does not execute JavaScript. You can try using selenium or requests_html for this purpose. More details can be found in this question.
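For illustration, a minimal Selenium sketch (assuming Selenium 4 with a Chrome driver on your PATH; the paragrafo class comes from your snippet):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.bcb.gov.br/publicacoes/atascopom")

# wait until the JavaScript-rendered paragraphs exist
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "paragrafo"))
)

soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
print(soup.find_all("p", class_="paragrafo"))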
On the other hand, that site provides RSS feeds at https://www.bcb.gov.br/acessoinformacao/rss, and there is an Atas do Copom feed. The following is a possible way to start with it:
import requests
from bs4 import BeautifulSoup
url = "https://www.bcb.gov.br/api/feed/sitebcb/sitefeeds/atascopom"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, 'xml')
print(soup.find_all('content'))
The data is loaded from an external URL in JSON format. You can use the next example to load it:
import requests
import textwrap
from bs4 import BeautifulSoup
url = "https://www.bcb.gov.br/api/servico/sitebcb/atascopom/principal?filtro="
data = requests.get(url).json()
soup = BeautifulSoup(data["conteudo"][0]["OutrasInformacoes"], "html.parser")
print("\n".join(textwrap.wrap(soup.get_text(strip=True, separator=" "))))
print()
print("PDF:", data["conteudo"][0]["Url"])
Prints:
A) Atualização da conjuntura econômica e do cenário básico do Copom 1
1. No cenário externo, estímulos fiscais e monetários em alguns países
desenvolvidos promovem uma recuperação robusta da atividade econômica.
Devido à presença de ociosidade, a comunicação dos principais bancos
centrais sugere que os estímulos monetários terão longa duração.
Contudo, a incerteza segue elevada e uma nova rodada de
questionamentos dos mercados a respeito dos riscos inflacionários
nessas economias pode tornar o ambiente desafiador para países
emergentes. 2. Em relação à atividade econômica brasileira, apesar da
intensidade da segunda onda da pandemia, os indicadores recentes
continuam mostrando evolução mais positiva do que o esperado,
implicando revisões relevantes nas projeções de crescimento. Os riscos
para a recuperação econômica reduziram-se significativamente. 3. As
...
a evolução recente e as perspectivas para a economia brasileira e para
a economia internacional, no contexto do regime de política monetária,
cujo objetivo é atingir as metas fixadas pelo Conselho Monetário
Nacional para a inflação.
PDF: /content/copom/atascopom/Copom239-not20210616239.pdf
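If you need more than the latest minutes, the same JSON appears to hold one entry per publication, so you could iterate over it. A sketch, under the assumption that every element of data["conteudo"] carries the same fields as the first one (only the first is shown above):
import requests
from bs4 import BeautifulSoup

url = "https://www.bcb.gov.br/api/servico/sitebcb/atascopom/principal?filtro="
data = requests.get(url).json()

# assumption: each entry mirrors the fields of the first one
for item in data["conteudo"]:
    text = BeautifulSoup(item["OutrasInformacoes"], "html.parser").get_text(" ", strip=True)
    print(item["Url"], "-", text[:100], "...")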
I just wanted to scrape a few articles from the El País website archive. From each article I take the title, the hashtags, and the article body. The HTML structure of each article is the same, and the script succeeds with all the titles and hashtags; however, for some of the articles it does not scrape the body at all. Below I add my code, links to fully working articles, and also a few links to ones returning empty bodies. Do you know how to fix it?
The empty-body articles do not occur in a regular pattern; sometimes there are 3 empty articles in a row, then 5 successful articles, 1 empty, 3 successful.
Working articles:
article1: https://elpais.com/diario/1990/01/17/economia/632530813_850215.html
article2: https://elpais.com/diario/1990/01/07/internacional/631666806_850215.html
article3: https://elpais.com/diario/1990/01/05/deportes/631494011_850215.html
Articles without the body:
article4: https://elpais.com/diario/1990/01/23/madrid/633097458_850215.html
article5: https://elpais.com/diario/1990/01/30/economia/633654016_850215.html
article6: https://elpais.com/diario/1990/01/03/espana/631321213_850215.html
from bs4 import BeautifulSoup
import requests

# place the URL of the article to be scraped here
URL = some_url_of_article_above
#print(URL)

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
bodydiv = soup.find("div", id="ctn_article_body")
artbody = bodydiv.find_all("p", class_="")
tagdiv = soup.find("div", id="mod_archivado")
hashtags = tagdiv.find_all("li", class_="w_i | capitalize flex align_items_center")
titlediv = soup.find("div", id="article_header")
title = titlediv.find("h1")

# print the title of the article
print(title.text)

# print the body of the article
arttext = ""
for par in artbody:
    arttext += str(par.text)
print(arttext)

# hashtags
tagstring = ""
for hashtag in hashtags:
    tagstring += hashtag.text
    tagstring += ","
print(tagstring)
Thank you in advance for your help!
The problem is that inside that <div class="a_b article_body | color_gray_dark" id="ctn_article_body"> element there's a broken or incomplete <b> tag. Take a look at this code snippet from the HTML page:
<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa dela Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p>
Just after the first <p></p> tags, there is an </b> without a matching opening <b> tag. That's the reason "html.parser" is failing.
Using this text,
from bs4 import BeautifulSoup
text = """<div id="ctn_article_body" class="a_b article_body | color_gray_dark"><p class=""></b>La Asociación Ecologista de Defensa de la Naturaleza (AEDENAT) ha denunciado que las obras de la carretera que cruza la Universidad Complutense desde la carretera de La Coruña hasta Cuatro Caminos se están realizando "sin permisos de ningún tipo" y suponen "la destrucción de zonas de pinar en las cercanías del edificio de Filosofia B".</p><div id="elpais_gpt-INTEXT" style="width: 0px; height: 0px; display: none;"></div><p class="">Por su parte, José Luis Garro, tercer teniente de alcalde, ha declarado a EL PAÍS: "Tenemos una autorización provisional del rector de la Universidad Complutense. Toda esa zona, además, está pendiente de un plan especial de reforma interior (PERI). Ésta es sólo una solución provisional".</p><p class="">Según Garro, el trazado de la carretera "ha tenido que dar varias vueltas para no tocar las masas arbóreas", aunque reconoce que se ha hecho "en algunos casos", si bien causando "un daño mínimo".</p><p class="footnote">* Este artículo apareció en la edición impresa del lunes, 22 de enero de 1990.</p></div>"""
soup = BeautifulSoup(text, "html.parser")
print(soup.find("div"))
Output:
<div class="a_b article_body | color_gray_dark" id="ctn_article_body"><p class=""></p></div>
How to solve this? Well, I made another try with a different parser; in this case I used "lxml" instead of "html.parser", and it works.
It selected the div, so just changing this line should work:
soup = BeautifulSoup(text, "lxml")
Of course you will need to have this parser installed.
EDIT:
As @moreni123 commented below, this solution seems to be correct for certain cases but not for all. Given that, I will add another option that could also work.
It seems that it would be better to use Selenium to fetch the webpage, given that some of the content is generated with JavaScript; requests cannot do that, as that is not its purpose.
I'm going to use Selenium with a headless Chrome driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# article to fetch
url = "https://elpais.com/diario/1990/01/14/madrid/632319855_850215.html"

driver_options = Options()
driver_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)

# this is the source code with the JS executed
driver.get(url)
page = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")

# now, as before, we use BeautifulSoup to parse it (Selenium is a powerful
# tool; you could use it for this part as well)
soup = BeautifulSoup(page, "html.parser")
print(soup.select("#ctn_article_body"))

# quitting the driver
if driver is not None:
    driver.quit()
Make sure that the path to the Chrome driver is correct in this line:
driver = webdriver.Chrome(executable_path="path/to/chrome/driver", options=driver_options)
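Note: on Selenium 4, executable_path is deprecated in favor of a Service object. The equivalent construction would be:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

driver_options = Options()
driver_options.add_argument("--headless")
# same driver path as above, passed through Service instead of executable_path
driver = webdriver.Chrome(service=Service("path/to/chrome/driver"), options=driver_options)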
Here is a link to the Selenium documentation, and to ChromeDriver, in case you need to download it.
This solution should work; at least it works on the article you passed me.