I'm trying to extract data from a website with BeautifulSoup.
I'm currently stuck on this:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien</a>"
I want to get the names of the translators, but the tag's href uses their id.
My code is:
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I tried str.startswith, but it doesn't work.
Can someone help me, please?
Provided your HTML is correct and static (not loaded with JavaScript after the initial page load), this is one way to select those links:
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&amp;SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
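Since the href carries the translator's id, it may be handy to pull that number out as well. Here is a minimal sketch using only the standard library's urllib.parse. The helper name author_id_from_href is hypothetical, and it assumes the href has the same shape as the one in the question:

```python
from urllib.parse import urlsplit, parse_qs

def author_id_from_href(href):
    """Return the all_authors_id query value from an href, or None if absent."""
    query = urlsplit(href).query
    values = parse_qs(query).get("all_authors_id")
    return values[0] if values else None

href = "/searchinternet/advanced?all_authors_id=35534&SearchAction=1"
print(author_id_from_href(href))  # 35534
```

With the soup from the snippet above, it could be called as author_id_from_href(a[0].get('href')).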
EDIT: Who doesn't like a challenge? Based on further comments from the OP, here is a way of obtaining the titles, authors, translators and illustrators from that page, allowing for one or more translators and one or more illustrators per book:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR+LIVRE+HEROS%3A%3AFolio+Junior+-+Un+Livre+dont+Vous+%C3%AAtes+le+H%C3%A9ros+%40+DEFIS+FANTASTIQ%3A%3AS%C3%A9rie+D%C3%A9fis+Fantastiques/(limit)/3?date%5Bfrom%5D=1980-01-01&date%5Bto%5D=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[class="results bg_white"] > table div[class="item"]')
print()
for i in items:
    title = i.select_one('div[class="title"] h3')
    author = i.select_one('div[class="author"] a')
    history = i.select_one('p[class="collective_work_entries"]')
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns = ['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
    Title                              Author               Translator(s)                      Illustrator(s)
0   Le Sépulcre des Ombres             Jonathan Green       Noël Chassériau                    Alan Langford
1   La Légende de Zagor                Ian Livingstone      Pascale Houssin                    Martin McKenna
2   Les Mages de Solani                Keith Martin         Noël Chassériau                    Russ Nicholson
3   Le Siège de Sardath                Keith P. Phillips    Yannick Surcouf                    Pete Knifton
4   Retour à la Montagne de Feu        Ian Livingstone      Yannick Surcouf                    Martin McKenna
5   Les Mondes de l'Aleph              Peter Darvill-Evans  Yannick Surcouf                    Tony Hough
6   Les Mercenaires du Levant          Paul Mason           Mona de Pracontal                  Terry Oakes
7   L'Arpenteur de la Lune             Stephen Hand         Pierre de Laubier                  Martin McKenna, Terry Oakes
8   La Tour de la Destruction          Keith Martin         Mona de Pracontal                  Pete Knifton
9   La Légende des Guerriers Fantômes  Stephen Hand         Alexis Galmot                      Martin McKenna
10  Le Repaire des Morts-Vivants       Dave Morris          Nicolas Grenier                    David Gallagher
11  L'Ancienne Prophétie               Paul Mason           Mona de Pracontal                  Terry Oakes
12  La Vengeance des Démons            Jim Bambra           Mona de Pracontal                  Martin McKenna
13  Le Sceptre Noir                    Keith Martin         Camille Fabien                     David Gallagher
14  La Nuit des Mutants                Peter Darvill-Evans  Anne Collas                        Alan Langford
15  L'Élu des Six Clans                Luke Sharp           Noël Chassériau                    Martin Mac Kenna, Martin McKenna
16  Le Volcan de Zamarra               Luke Sharp           Olivier Meyer                      David Gallagher
17  Les Sombres Cohortes               Ian Livingstone      Noël Chassériau                    Nik William
18  Le Vampire du Château Noir         Keith Martin         Mona de Pracontal                  Martin McKenna
19  Le Voleur d'Âmes                   Keith Martin         Mona de Pracontal                  Russ Nicholson
20  Le Justicier de l'Univers          Martin Allen         Mona de Pracontal                  Tim Sell
21  Les Esclaves de l'Eternité         Paul Mason           Sylvie Bonnet                      Bob Harvey
22  La Créature venue du Chaos         Steve Jackson        Noël Chassériau                    Alan Langford
23  Les Rôdeurs de la Nuit             Graeme Davis         Nicolas Grenier                    John Sibbick
24  L'Empire des Hommes-Lézards        Marc Gascoigne       Jean Lacroix                       David Gallagher
25  Les Gouffres de la Cruauté         Luke Sharp           Sylvie Bonnet                      Russ Nicholson
26  Les Spectres de l'Angoisse         Robin Waterfield     Mona de Pracontal                  Ian Miller
27  Le Chasseur des Étoiles            Luke Sharp           Arnaud Dupin de Beyssat            Cary Mayes, Gary Mayes
28  Les Sceaux de la Destruction       Robin Waterfield     Sylvie Bonnet                      Russ Nicholson
29  La Crypte du Sorcier               Ian Livingstone      Noël Chassériau                    John Sibbick
30  La Forteresse du Cauchemar         Peter Darvill-Evans  Mona de Pracontal                  Dave Carson
31  La Grande Menace des Robots        Steve Jackson        Danielle Plociennik                Gary Mayes
32  L'Épée du Samouraï                 Mark Smith           Pascale Jusforgues                 Alan Langford
33  L'Épreuve des Champions            Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Brian Williams
34  Défis Sanglants sur l'Océan        Andrew Chapman       Jean Walter                        Bob Harvey
35  Les Démons des Profondeurs         Steve Jackson        Noël Chassériau                    Bob Harvey
36  Rendez-vous avec la M.O.R.T.       Steve Jackson        Arnaud Dupin de Beyssat            Declan Considine
37  La Planète Rebelle                 Robin Waterfield     C. Degolf                          Gary Mayes
38  Les Trafiquants de Kelter          Andrew Chapman       Anne Blanchet                      Nik Spender
39  Le Combattant de l'Autoroute       Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Kevin Bulmer
40  Le Mercenaire de l'Espace          Andrew Chapman       Jean Walthers                      Geoffroy Senior
41  Le Temple de la Terreur            Ian Livingstone      Denise May                         Bill Houston
42  Le Manoir de l'Enfer               Steve Jackson
43  Le Marais aux Scorpions            Steve Jackson        Camille Fabien                     Duncan Smith
44  Le Talisman de la Mort             Steve Jackson        Camille Fabien                     Bob Harvey
45  La Sorcière des Neiges             Ian Livingstone      Michel Zénon                       Edward Crosby, Gary Ward
46  La Citadelle du Chaos              Steve Jackson        Marie-Raymond Farré                Russ Nicholson
47  La Galaxie Tragique                Steve Jackson        Camille Fabien                     Peter Jones
48  La Forêt de la Malédiction         Ian Livingstone      Camille Fabien                     Malcolm Barter
49  La Cité des Voleurs                Ian Livingstone      Henri Robillot                     Iain McCaig
50  Le Labyrinthe de la Mort           Ian Livingstone      Patricia Marais                    Iain McCaig
51  L'Île du Roi Lézard                Ian Livingstone      Fabienne Vimereu                   Alan Langford
52  Le Sorcier de la Montagne de Feu   Steve Jackson        Camille Fabien                     Russ Nicholson
Bear in mind that this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' is not found in its text. It's down to the OP to find a solution for that one.
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html
from bs4 import BeautifulSoup
with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')  # parsed document
names = []
for elem in soup.find_all("a"):  # find_all returns a list of the <a> tags
    names.append(elem.text)
I have the following texts in a df column:
La Palma
La Palma Nueva
La Palma, Nueva Concepcion
El Estor
El Estor Nuevo
Nuevo Leon
San Jose
La Paz Colombia
Mexico Distrito Federal
El Estor, Nuevo Lugar
What I need is to add a comma at the end of each row, but only when the row contains exactly two words. I found a partial solution, which adds the comma to every row:
df['Column3'] = df['Column3'].apply(lambda x: str(x)+',')
(solution found on Stack Overflow)
Given:
words
0 La Palma
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor
4 El Estor Nuevo
5 Nuevo Leon
6 San Jose
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
Doing:
df.words = df.words.apply(lambda x: x+',' if len(x.split(' ')) == 2 else x)
print(df)
Outputs:
words
0 La Palma,
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor,
4 El Estor Nuevo
5 Nuevo Leon,
6 San Jose,
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
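If the rule needs to be reused or tweaked, the lambda can be pulled out into a named function. A small sketch follows; the name add_comma_if_two_words is made up here. It uses split() with no argument so that repeated spaces do not change the word count, which is an assumption about how the data should be treated:

```python
def add_comma_if_two_words(text):
    """Append a comma when the text has exactly two whitespace-separated words."""
    # Non-string values are converted with str(), as in the question's snippet.
    text = str(text)
    return text + "," if len(text.split()) == 2 else text

samples = ["La Palma", "La Palma Nueva", "La Palma, Nueva Concepcion", "Nuevo Leon"]
print([add_comma_if_two_words(s) for s in samples])
```

It plugs into pandas the same way as the lambda: df.words = df.words.apply(add_comma_if_two_words).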
I would like to know whether these dates (they are in PT-BR) can be converted to numerical form, in either EN or PT-BR format.
   Início                   Fim
0  15 de março de 1985      14 de fevereiro de 1986
1  14 de fevereiro de 1986  23 de novembro de 1987
2  23 de novembro de 1987   15 de janeiro de 1989
3  16 de janeiro de 1989    14 de março de 1990
4  15 de março de 1990      23 de janeiro de 1992
We can set the LC_TIME locale to pt_PT with locale.setlocale (the locale has to be installed on the system); to_datetime will then work as expected with a format string:
import locale
import pandas as pd
locale.setlocale(locale.LC_TIME, 'pt_PT')
df = pd.DataFrame({
'Início': ['15 de março de 1985', '14 de fevereiro de 1986',
'23 de novembro de 1987', '16 de janeiro de 1989',
'15 de março de 1990'],
'Fim': ['14 de fevereiro de 1986', '23 de novembro de 1987',
'15 de janeiro de 1989', '14 de março de 1990',
'23 de janeiro de 1992']
})
cols = ['Início', 'Fim']
df[cols] = df[cols].apply(pd.to_datetime, format='%d de %B de %Y')
df:
Início Fim
0 1985-03-15 1986-02-14
1 1986-02-14 1987-11-23
2 1987-11-23 1989-01-15
3 1989-01-16 1990-03-14
4 1990-03-15 1992-01-23
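Locale names depend on the platform, and setlocale raises locale.Error when the requested locale is not installed. Here is a hedged sketch that tries a few candidate names in turn. The candidate list and the helper name set_portuguese_time_locale are assumptions, so adjust them to whatever the system actually provides:

```python
import locale

def set_portuguese_time_locale(candidates=("pt_BR.UTF-8", "pt_PT.UTF-8", "pt_BR", "pt_PT")):
    """Try each candidate name for LC_TIME; return the one that worked, or None."""
    for name in candidates:
        try:
            locale.setlocale(locale.LC_TIME, name)
            return name
        except locale.Error:
            continue  # not installed on this system, try the next one
    return None

chosen = set_portuguese_time_locale()
print(chosen if chosen else "No Portuguese locale available; parse the month names manually instead.")
```

If none of the candidates is available, the month names have to be translated by hand rather than through the locale.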
Split the string inside each cell by " de ".
Replace the 2nd element with the corresponding number (I suggest using a dictionary for this).
Join the list into a string. I suggest using str.join, but string concatenation or formatted strings work too.
Let's use an example.
date = "23 de novembro de 1987"
dates = date.split(" de ") # ['23', 'novembro', '1987']
dates[1] = "11" # ['23', '11', '1987']
numerical_date = '/'.join(dates) # "23/11/1987"
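Putting the three steps together, a sketch with a month-name dictionary might look like this. The names MESES and parse_pt_date are made up for illustration, and the function returns a datetime.date so the result can be formatted either way:

```python
from datetime import date

# Portuguese month names mapped to month numbers.
MESES = {
    "janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4,
    "maio": 5, "junho": 6, "julho": 7, "agosto": 8,
    "setembro": 9, "outubro": 10, "novembro": 11, "dezembro": 12,
}

def parse_pt_date(text):
    """Convert a string like '23 de novembro de 1987' to a datetime.date."""
    day, month_name, year = text.strip().split(" de ")
    return date(int(year), MESES[month_name.lower()], int(day))

d = parse_pt_date("23 de novembro de 1987")
print(d.isoformat())            # 1987-11-23
print(d.strftime("%d/%m/%Y"))   # 23/11/1987
```

On a DataFrame column this could be applied with df['Início'].apply(parse_pt_date), with no locale needed.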
I'm not great at BeautifulSoup. A couple of questions in one:
I'm just trying to put these three columns in a pandas DataFrame.
Here's the code that will get you the soup data from the URL (there will be nulls):
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import requests
import re
req_headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.8',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
with requests.Session() as s:
    url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
    r = s.get(url, headers=req_headers)
    soup = BeautifulSoup(r.content, 'lxml')
soup
Here's the HTML I'm trying to parse (there's a bunch of these divs on each page):
<div class="col-md-12 data">
  <div class="col-md-6">
    <a href="/business-directory/company-profiles.S-A_FLUXO_-_COMERCIO_E_ASSESSORIA_INTERNACION_AL.02f1cc56465eb3286f769daad5262d91.html">
    S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL</a>
  </div>
  <div class="col-md-4">
    <div class="show-mobile">Country:</div>
    Recife,
    Pernambuco,
    <br>
    Brazil</div>
  <div class="col-md-2 last">
    <div class="show-mobile">Sales Revenue ($M):</div>
    250.620749M</div>
</div>
Here's what I have so far:
#sales rev
sales_revenue = soup.find_all("div", {"class": "col-md-2 last"})
#location
country = soup.find_all("div", {"class": "col-md-4"})
#thought something like this would work for country, but it doesn't
classToIgnore = ["col-sm-4", "col-xs-4"]
classes = "col-md-4"
for a in soup:
    a = soup.find_all("div", class_= lambda c: classes in c and classToIgnore not in c)
#company name
for div in soup.find_all('div',class_="col-md-6"):
    x = div.find_all("a", href=re.compile("business-directory"))
    print(x)
The result should look like this:
revenue      location                     company
$250620749   Recife, Pernambuco, Brazil   S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL
Issues:
- The sales revenue kind of works, but not well: it grabs a lot of other information.
- The location doesn't work well.
- The company name is hard to grab because it's the text after the href. I can get the hrefs, but I'm not sure how to grab the text after the URL.
Any ideas?
To save the table found on the page, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
# remove unnecessary information:
for t in soup.select('.show-mobile'):
    t.extract()
all_data = []
for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                      soup.select('#companyResults .col-md-4')[1:],
                      soup.select('#companyResults .col-md-2')[1:]):
    all_data.append({
        'Name': c1.get_text(strip=True),
        'Location': ' '.join(c2.get_text(strip=True).split()),
        'Revenue': c3.get_text(strip=True)
    })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
Name Location Revenue
0 S/A FLUXO - COMERCIO E ASSESSORIA INTERNACION AL Recife, Pernambuco,Brazil 250.620749M
1 POINT SHOES EIRELI Franca, Sao Paulo,Brazil
2 Cooperativa Triticola Caçapavana Ltda Caçapava Do Sul, Rio Grande Do Sul,Brazil 142.786551M
3 CRT2 REPRESENTACOES EMPRESARIAIS LTDA Curitiba, Parana,Brazil
4 Mercantil Palmeirense Ltda Sao Paulo, Sao Paulo,Brazil
5 GVD IMPORTACAO E EXPORTACAO LTDA Campo Bom, Rio Grande Do Sul,Brazil
6 COOPERATIVA TRITICOLA DE GETULIO VARGAS LTDA Estacao, Rio Grande Do Sul,Brazil 75.176735M
7 Golden Distribuidora Ltda Vitoria, Espirito Santo,Brazil
8 JTF COMERCIO E REPRESENTACOES LTDA Colider, Mato Grosso,Brazil
9 MARINHO VESTUARIO LTDA Eusebio, Ceara,Brazil
10 COTIA FOODS COMERCIO E REPRESENTACAO LTDA - EM RECUPERACAO JUDICIAL Cotia, Sao Paulo,Brazil
11 FOKUS LOGISTICA LTDA Aparecida De Goiania, Goias,Brazil
12 R. SHIBUYA TENDENCIA MARKETING E REPRESENTACOES LTDA Rio De Janeiro, Rio De Janeiro,Brazil
13 TIM COMERCIO DE EMBALAGENS LTDA Belo Horizonte, Minas Gerais,Brazil
14 PEDRAFORT LTDA Sete Lagoas, Minas Gerais,Brazil
15 FARMA-RAPIDA MEDICAMENTOS E MATERIAIS ESPECIAIS SA Natal, Rio Grande Do Norte,Brazil 48.913861M
16 PORTOFINO REPRESENTACOES LTDA Botuvera, Santa Catarina,Brazil
17 NOROEST REPRESENTACOES COMERCIAIS LTDA Jaru, Rondonia,Brazil
18 LOGIMED DISTRIBUIDORA SOCIEDADE EMPRESARIA LIMITADA Sao Paulo, Sao Paulo,Brazil
19 Filon Confecções - EIRELI São Paulo, Sao Paulo,Brazil
20 LEMES & LIMA COMERCIO E LOGISTICA LTDA Goiania, Goias,Brazil
21 CERAMICA JACARANDA LTDA Ribeirao Das Neves, Minas Gerais,Brazil
22 NORDICAL REPRESENTANTE DE ALIMENTOS LTDA Jaboatao Dos Guararapes, Pernambuco,Brazil
23 QUESALON REPRESENTACAO DE PRODUTOS FARMACEUTICOS LTDA Alhandra, Paraiba,Brazil
24 ATACK REPRESENTACAO COMERCIAL LTDA Vitoria, Espirito Santo,Brazil
25 LESTE BRASILEIRA IMPORTADORA E EXPORTADORA LTDA Cariacica, Espirito Santo,Brazil
26 JUCELITO BORDIGNON & CIA LTDA Sao Sepe, Rio Grande Do Sul,Brazil
27 CASAS DA LAVOURA REPRESENTACOES COMERCIAIS LTDA Goiania, Goias,Brazil
28 UNISOAP COSMETICOS LTDA Praia Grande, Sao Paulo,Brazil
29 MOTIVA MAQUINAS LTDA Salvador, Bahia,Brazil
30 BC COSMETICOS LTDA Sao Paulo, Sao Paulo,Brazil
31 ORGANIZACOES ALMEIDA SOARES LTDA Belo Horizonte, Minas Gerais,Brazil
32 Refinitiv Brasil Servicos Economicos Ltda Sao Paulo, Sao Paulo,Brazil
33 JBC REPRESENTACOES LTDA Conchal, Sao Paulo,Brazil
34 P & P RIO DISTRIBUIDORA DE ALIMENTOS LTDA Rio De Janeiro, Rio De Janeiro,Brazil
35 FORMATTO TELHAS E TELHADOS REPRESENTACAO COMERCIAL EIRELI Jaguaruna, Santa Catarina,Brazil
36 MACLENY - DISTRIBUIDORA DE PRODUTOS DE BELEZA LTDA Sao Paulo, Sao Paulo,Brazil
37 ELG REPRESENTACAO E COMERCIO LTDA Jaragua Do Sul, Santa Catarina,Brazil
38 ELFA PRODUTOS FARMACEUTICOS E HOSPITALARES LTDA Cabedelo, Paraiba,Brazil
39 COMERCIO E EXPORTACAO DE CEREAIS MUNARETTO LTDA Bom Sucesso Do Sul, Parana,Brazil
40 RGE DISTRIBUIDORA DE BEBIDAS LTDA Montes Claros, Minas Gerais,Brazil
41 A.S. REPRESENTACAO DE EMBALAGENS LTDA Sao Paulo, Sao Paulo,Brazil
42 ON LINE TRADING SA Novo Hamburgo, Rio Grande Do Sul,Brazil 21.79624M
43 AMX COMERCIO E SERVICOS DE AUTOMOTORES LTDA Itaborai, Rio De Janeiro,Brazil
44 SOL EMBALAGENS PLASTICAS EIRELI Camacari, Bahia,Brazil
45 MJB COMERCIO DE EQUIPAMENTOS ELETRONICOS E GESTAO DE PESSOAL LTDA Cuiaba, Mato Grosso,Brazil
46 EBANOS REPRESENTACOES LTDA Estancia Velha, Rio Grande Do Sul,Brazil
47 TENXE SERVICOS DE REPRESENTACAO COMERCIAL E TELEATENDIMENTO LTDA Curitiba, Parana,Brazil
48 BRASILVEST REPRESENTACOES LTDA Gaspar, Santa Catarina,Brazil
49 EURO MED INDUSTRIA E COMERCIO LTDA Timbauba, Pernambuco,Brazil
And saves data.csv.
EDIT: To scrape multiple pages, use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
params = {'page': 1}
url = 'https://www.dnb.com/business-directory/company-information.finance-insurance-sector.br.html'
all_data = []
for params['page'] in range(1, 3): # <-- increase number of pages here
    print('Page {}...'.format(params['page']))
    soup = BeautifulSoup(requests.get(url, headers=headers, params=params).content, 'html.parser')
    # remove unnecessary information:
    for t in soup.select('.show-mobile'):
        t.extract()
    for c1, c2, c3 in zip(soup.select('#companyResults .col-md-6')[1:],
                          soup.select('#companyResults .col-md-4')[1:],
                          soup.select('#companyResults .col-md-2')[1:]):
        all_data.append({
            'Name': c1.get_text(strip=True),
            'Location': ' '.join(c2.get_text(strip=True).split()),
            'Revenue': c3.get_text(strip=True)
        })
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
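The Revenue column above holds strings such as 250.620749M (or nothing at all), while the question asked for values like $250620749. Here is one possible conversion, assuming the M suffix always means millions. The helper name revenue_to_dollars is hypothetical:

```python
from decimal import Decimal

def revenue_to_dollars(text):
    """Turn '250.620749M' into '$250620749'; return '' for empty cells."""
    text = text.strip()
    if not text:
        return ""
    if text.endswith("M"):
        # Assumption: 'M' stands for millions of dollars.
        amount = Decimal(text[:-1]) * 1_000_000
    else:
        amount = Decimal(text)
    return "${}".format(int(amount))

print(revenue_to_dollars("250.620749M"))  # $250620749
print(repr(revenue_to_dollars("")))       # ''
```

Decimal is used instead of float so that the digits after the point are not disturbed by rounding. On the DataFrame it could be applied with df['Revenue'] = df['Revenue'].apply(revenue_to_dollars).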
In Python 3 with pandas, I have a list of dictionaries in this format:
a = [{'texto27/2': 'SENADO: PLS 00143/2016, de autoria de Telmário Mota, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Assuntos Sociais. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=2914881'}, {'texto27/3': 'SENADO: PEC 00176/2019, de autoria de Randolfe Rodrigues, fala sobre maternidade e sofreu alterações em sua tramitação. Tramitação: Comissão de Constituição, Justiça e Cidadania. Situação: PRONTA PARA A PAUTA NA COMISSÃO. http://legis.senado.leg.br/sdleg-getter/documento?dm=8027142'}, {'texto6/4': 'SENADO: PL 05643/2019, de autoria de Câmara dos Deputados, fala sobre violência sexual e sofreu alterações em sua tramitação. Tramitação: Comissão de Direitos Humanos e Legislação Participativa. Situação: MATÉRIA COM A RELATORIA. http://legis.senado.leg.br/sdleg-getter/documento?dm=8015569'}]
I tried to transform it into a dataframe with these commands:
import pandas as pd
df_lista_sentencas = pd.DataFrame(a)
df_lista_sentencas.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
texto27/2 1 non-null object
texto27/3 1 non-null object
texto6/4 1 non-null object
dtypes: object(3)
memory usage: 100.0+ bytes
But the generated dataframe has blank lines:
df_lista_sentencas.reset_index()
index texto27/2 texto27/3 texto6/4
0 0 SENADO: PLS 00143/2016, de autoria de Telmário... NaN NaN
1 1 NaN SENADO: PEC 00176/2019, de autoria de Randolfe... NaN
2 2 NaN NaN SENADO: PL 05643/2019, de autoria de Câmara do...
I would like to generate something like this:
texto27/2 texto27/3 texto6/4
SENADO: PLS 00143/2016, de autoria de Telmário... SENADO: PEC 00176/2019, de autoria de Randolfe.. SENADO: PL 05643/2019, de autoria de Câmara do...
Please, does anyone know how I can create a dataframe without blank lines?
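Maybe use bfill: it back-fills each NaN from the rows below, so the first row ends up holding every value, and iloc[[0]] keeps only that row: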
df = df_lista_sentencas.bfill().iloc[[0]]
print(df)
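Another option is to merge the single-key dictionaries into one before building the frame, so there are no NaN rows to fill in the first place. This is a sketch that assumes the keys are unique across the list (a later duplicate would overwrite an earlier one). merge_dicts is a made-up helper, and the sample values are shortened stand-ins for the real texts:

```python
def merge_dicts(dicts):
    """Combine a list of dictionaries into a single dictionary."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged

# Shortened stand-ins for the texts in the question.
a = [{'texto27/2': 'SENADO: PLS 00143/2016 ...'},
     {'texto27/3': 'SENADO: PEC 00176/2019 ...'},
     {'texto6/4': 'SENADO: PL 05643/2019 ...'}]

row = merge_dicts(a)
print(list(row))  # ['texto27/2', 'texto27/3', 'texto6/4']
```

Passing the merged dictionary inside a list, as in pd.DataFrame([row]), then gives a single-row frame with the three columns.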