Scraping issue with id_tag - python

I'm trying to extract data from a website with BeautifulSoup.
I'm currently stuck on this:
"Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>"
I want to get the names of the translators, but the tag uses their id.
my code is
translater = soup.find_all("a", href="/searchinternet/advanced?all_authors_id=")
I tried with str.startswith but it doesn't work.
Can someone help me, please?

Provided your HTML is correct and static (i.e. it doesn't get loaded with JavaScript after the initial page load), this is one way to select that link (or those links):
from bs4 import BeautifulSoup as bs
html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
a = soup.select('a[href^="/searchinternet/advanced?all_authors_id="]')
print(a[0])
print(a[0].get_text(strip=True))
print(a[0].get('href'))
Result in terminal:
<a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a>
Camille Fabien
/searchinternet/advanced?all_authors_id=35534&SearchAction=1
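As an aside, the reason href="/searchinternet/advanced?all_authors_id=" returns nothing is that find_all compares the href attribute for exact equality. If you prefer the str.startswith route from the question, href= also accepts a function; a minimal sketch, reusing the html string above:
from bs4 import BeautifulSoup as bs

html = '''<p>Trad. de l'anglais par <a href="/searchinternet/advanced?all_authors_id=35534&SearchAction=1">Camille Fabien </a></p>'''
soup = bs(html, 'html.parser')
# href= accepts a callable: keep only links whose href starts with the prefix
links = soup.find_all('a', href=lambda h: h and h.startswith('/searchinternet/advanced?all_authors_id='))
print([a.get_text(strip=True) for a in links])  # ['Camille Fabien']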
EDIT: Who doesn't like a challenge?... Based on further comments made by OP, here is a way of obtaining titles, authors, translators and illustrators from that page - considering there can be one or more translators, and one or more illustrators:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
url = 'https://www.gallimard.fr/searchinternet/advanced/(editor_brand_id)/1/(fserie)/FOLIO-JUNIOR+LIVRE+HEROS%3A%3AFolio+Junior+-+Un+Livre+dont+Vous+%C3%AAtes+le+H%C3%A9ros+%40+DEFIS+FANTASTIQ%3A%3AS%C3%A9rie+D%C3%A9fis+Fantastiques/(limit)/3?date%5Bfrom%5D=1980-01-01&date%5Bto%5D=1995-01-01&SearchAction=OK'
big_list = []
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
items = soup.select('div[class="results bg_white"] > table div[class="item"]')
print()
for i in items:
    title = i.select_one('div[class="title"] h3')
    author = i.select_one('div[class="author"] a')
    history = i.select_one('p[class="collective_work_entries"]')
    translators = [[y.get_text() for y in x.find_previous_siblings('a')] for x in history.contents if "Illustrations" in x]
    illustrators = [[y.get_text() for y in x.find_next_siblings('a')] for x in history.contents if "Illustrations" in x]
    big_list.append((title.text.strip(), author.text.strip(), ', '.join([x for y in translators for x in y]), ', '.join([x for y in illustrators for x in y])))
df = pd.DataFrame(big_list, columns=['Title', 'Author', 'Translator(s)', 'Illustrator(s)'])
print(df)
Result in terminal:
    Title                              Author               Translator(s)                      Illustrator(s)
 0  Le Sépulcre des Ombres             Jonathan Green       Noël Chassériau                    Alan Langford
 1  La Légende de Zagor                Ian Livingstone      Pascale Houssin                    Martin McKenna
 2  Les Mages de Solani                Keith Martin         Noël Chassériau                    Russ Nicholson
 3  Le Siège de Sardath                Keith P. Phillips    Yannick Surcouf                    Pete Knifton
 4  Retour à la Montagne de Feu        Ian Livingstone      Yannick Surcouf                    Martin McKenna
 5  Les Mondes de l'Aleph              Peter Darvill-Evans  Yannick Surcouf                    Tony Hough
 6  Les Mercenaires du Levant          Paul Mason           Mona de Pracontal                  Terry Oakes
 7  L'Arpenteur de la Lune             Stephen Hand         Pierre de Laubier                  Martin McKenna, Terry Oakes
 8  La Tour de la Destruction          Keith Martin         Mona de Pracontal                  Pete Knifton
 9  La Légende des Guerriers Fantômes  Stephen Hand         Alexis Galmot                      Martin McKenna
10  Le Repaire des Morts-Vivants       Dave Morris          Nicolas Grenier                    David Gallagher
11  L'Ancienne Prophétie               Paul Mason           Mona de Pracontal                  Terry Oakes
12  La Vengeance des Démons            Jim Bambra           Mona de Pracontal                  Martin McKenna
13  Le Sceptre Noir                    Keith Martin         Camille Fabien                     David Gallagher
14  La Nuit des Mutants                Peter Darvill-Evans  Anne Collas                        Alan Langford
15  L'Élu des Six Clans                Luke Sharp           Noël Chassériau                    Martin Mac Kenna, Martin McKenna
16  Le Volcan de Zamarra               Luke Sharp           Olivier Meyer                      David Gallagher
17  Les Sombres Cohortes               Ian Livingstone      Noël Chassériau                    Nik William
18  Le Vampire du Château Noir         Keith Martin         Mona de Pracontal                  Martin McKenna
19  Le Voleur d'Âmes                   Keith Martin         Mona de Pracontal                  Russ Nicholson
20  Le Justicier de l'Univers          Martin Allen         Mona de Pracontal                  Tim Sell
21  Les Esclaves de l'Eternité         Paul Mason           Sylvie Bonnet                      Bob Harvey
22  La Créature venue du Chaos         Steve Jackson        Noël Chassériau                    Alan Langford
23  Les Rôdeurs de la Nuit             Graeme Davis         Nicolas Grenier                    John Sibbick
24  L'Empire des Hommes-Lézards        Marc Gascoigne       Jean Lacroix                       David Gallagher
25  Les Gouffres de la Cruauté         Luke Sharp           Sylvie Bonnet                      Russ Nicholson
26  Les Spectres de l'Angoisse         Robin Waterfield     Mona de Pracontal                  Ian Miller
27  Le Chasseur des Étoiles            Luke Sharp           Arnaud Dupin de Beyssat            Cary Mayes, Gary Mayes
28  Les Sceaux de la Destruction       Robin Waterfield     Sylvie Bonnet                      Russ Nicholson
29  La Crypte du Sorcier               Ian Livingstone      Noël Chassériau                    John Sibbick
30  La Forteresse du Cauchemar         Peter Darvill-Evans  Mona de Pracontal                  Dave Carson
31  La Grande Menace des Robots        Steve Jackson        Danielle Plociennik                Gary Mayes
32  L'Épée du Samouraï                 Mark Smith           Pascale Jusforgues                 Alan Langford
33  L'Épreuve des Champions            Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Brian Williams
34  Défis Sanglants sur l'Océan        Andrew Chapman       Jean Walter                        Bob Harvey
35  Les Démons des Profondeurs         Steve Jackson        Noël Chassériau                    Bob Harvey
36  Rendez-vous avec la M.O.R.T.       Steve Jackson        Arnaud Dupin de Beyssat            Declan Considine
37  La Planète Rebelle                 Robin Waterfield     C. Degolf                          Gary Mayes
38  Les Trafiquants de Kelter          Andrew Chapman       Anne Blanchet                      Nik Spender
39  Le Combattant de l'Autoroute       Ian Livingstone      Alain Vaulont, Pascale Jusforgues  Kevin Bulmer
40  Le Mercenaire de l'Espace          Andrew Chapman       Jean Walthers                      Geoffroy Senior
41  Le Temple de la Terreur            Ian Livingstone      Denise May                         Bill Houston
42  Le Manoir de l'Enfer               Steve Jackson
43  Le Marais aux Scorpions            Steve Jackson        Camille Fabien                     Duncan Smith
44  Le Talisman de la Mort             Steve Jackson        Camille Fabien                     Bob Harvey
45  La Sorcière des Neiges             Ian Livingstone      Michel Zénon                       Edward Crosby, Gary Ward
46  La Citadelle du Chaos              Steve Jackson        Marie-Raymond Farré                Russ Nicholson
47  La Galaxie Tragique                Steve Jackson        Camille Fabien                     Peter Jones
48  La Forêt de la Malédiction         Ian Livingstone      Camille Fabien                     Malcolm Barter
49  La Cité des Voleurs                Ian Livingstone      Henri Robillot                     Iain McCaig
50  Le Labyrinthe de la Mort           Ian Livingstone      Patricia Marais                    Iain McCaig
51  L'Île du Roi Lézard                Ian Livingstone      Fabienne Vimereu                   Alan Langford
52  Le Sorcier de la Montagne de Feu   Steve Jackson        Camille Fabien                     Russ Nicholson
Bear in mind this method fails for Le Manoir de l'Enfer, because the word 'Illustrations' is not found in that entry's text, so its translator and illustrator come back empty. It's down to the OP to find a solution for that one.
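As a starting point, here is one possible fallback, a sketch only, resting on the unverified assumption that an entry with no "Illustrations" marker lists just translators. It would slot into the loop above, before big_list.append:
    # assumption: entries without an "Illustrations" marker list only translators
    if not translators:
        translators = [[a.get_text() for a in history.find_all('a')]]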
BeautifulSoup documentation can be found at https://beautiful-soup-4.readthedocs.io/en/latest/index.html
Also, Pandas docs can be found here: https://pandas.pydata.org/pandas-docs/stable/index.html

from bs4 import BeautifulSoup

with open("./test.html", "r") as f:
    soup = BeautifulSoup(f, 'html.parser')

names = []
for elem in soup.find_all("a"):  # find_all returns a list of matching tags
    names.append(elem.text)

Related

Add a comma after two words in pandas

I have the following texts in a df column:
La Palma
La Palma Nueva
La Palma, Nueva Concepcion
El Estor
El Estor Nuevo
Nuevo Leon
San Jose
La Paz Colombia
Mexico Distrito Federal
El Estor, Nuevo Lugar
What I need is to add a comma at the end of each row, but only on the condition that the row contains exactly two words. I found a partial solution:
df['Column3'] = df['Column3'].apply(lambda x: str(x)+',')
(solution found on Stack Overflow)
Given:
words
0 La Palma
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor
4 El Estor Nuevo
5 Nuevo Leon
6 San Jose
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
Doing:
df.words = df.words.apply(lambda x: x+',' if len(x.split(' ')) == 2 else x)
print(df)
Outputs:
words
0 La Palma,
1 La Palma Nueva
2 La Palma, Nueva Concepcion
3 El Estor,
4 El Estor Nuevo
5 Nuevo Leon,
6 San Jose,
7 La Paz Colombia
8 Mexico Distrito Federal
9 El Estor, Nuevo Lugar
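For what it's worth, the same condition can also be written with pandas' vectorized string methods; a small sketch on a copy of the data above:
import pandas as pd

df = pd.DataFrame({'words': ['La Palma', 'La Palma Nueva', 'San Jose']})
mask = df['words'].str.split(' ').str.len() == 2  # True where the cell has exactly two words
df.loc[mask, 'words'] += ','
print(df)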

Check if elements from a Dataframe column are in another Pandas dataframe row and append to a new column

I have a DataFrame like this:
Casa           Name
Solo Deportes  Paleta De Padel Adidas Metalbone CTRL
Solo Deportes  Zapatillas Running Under Armour Charged Stamin...
Solo Deportes  Rompeviento Con Capucha Reebok Woven Azu
Solo Deportes  Remera Michael Jordan Chicago Bulls
and Df2:
Palabra       Marca
Acqualine     Acqualine
Addnice       Addnice
Adnnice       Addnice
Under Armour  Under Armour
Jordan        Nike
Adidas        Adidas
Reebok        Reebok
How can I check each row of df['Name'], see if it contains a value from Df2['Palabra'], and in that case take the corresponding value of Df2['Marca'] and put it in a new column? The result should be something like this:
Casa           Name                                               Marca
Solo Deportes  Paleta De Padel Adidas Metalbone CTRL              Adidas
Solo Deportes  Zapatillas Running Under Armour Charged Stamin...  Under Armour
Solo Deportes  Rompeviento Con Capucha Reebok Woven Azu           Reebok
Solo Deportes  Remera Michael Jordan Chicago Bulls                Nike
Data:
df:
{'Casa': ['Solo Deportes', 'Solo Deportes', 'Solo Deportes', 'Solo Deportes'],
'Name': ['Paleta De Padel Adidas Metalbone CTRL',
'Zapatillas Running Under Armour Charged Stamin...',
'Rompeviento Con Capucha Reebok Woven Azu',
'Remera Michael Jordan Chicago Bulls']}
Df2:
{'Palabra': ['Acqualine', 'Addnice', 'Adnnice', 'Under Armour', 'Jordan', 'Adidas', 'Reebok'],
'Marca': ['Acqualine', 'Addnice', 'Addnice', 'Under Armour', 'Nike', 'Adidas', 'Reebok']}
A simple solution is to use iterrows in a generator expression, with next to iterate over Df2 and check whether a matching item appears:
df['Marca'] = df['Name'].apply(lambda name: next((r['Marca'] for _, r in Df2.iterrows() if r['Palabra'] in name), float('nan')))
Output:
Casa Name Marca
0 Solo Deportes Paleta De Padel Adidas Metalbone CTRL Adidas
1 Solo Deportes Zapatillas Running Under Armour Charged Stamin... Under Armour
2 Solo Deportes Rompeviento Con Capucha Reebok Woven Azu Reebok
3 Solo Deportes Remera Michael Jordan Chicago Bulls Nike
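Unrolled into an explicit helper, the one-liner above is equivalent to this (df and Df2 built from the dicts in the Data section):
def find_marca(name):
    # scan Df2 row by row; return the first Marca whose Palabra occurs in the name
    for _, row in Df2.iterrows():
        if row['Palabra'] in name:
            return row['Marca']
    return float('nan')  # keep NaN when no brand matches

df['Marca'] = df['Name'].apply(find_marca)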

Data formatting in Python

I want to know if there is a way to convert these dates (they are in PT-BR) to numerical form, in EN or PT-BR.
Início Fim
0 15 de março de 1985 14 de fevereiro de 1986
1 14 de fevereiro de 1986 23 de novembro de 1987
2 23 de novembro de 1987 15 de janeiro de 1989
3 16 de janeiro de 1989 14 de março de 1990
4 15 de março de 1990 23 de janeiro de 1992
We can setlocale LC_TIME to pt_PT; then to_datetime will work as expected with a format string:
import locale
import pandas as pd
locale.setlocale(locale.LC_TIME, 'pt_PT')
df = pd.DataFrame({
    'Início': ['15 de março de 1985', '14 de fevereiro de 1986',
               '23 de novembro de 1987', '16 de janeiro de 1989',
               '15 de março de 1990'],
    'Fim': ['14 de fevereiro de 1986', '23 de novembro de 1987',
            '15 de janeiro de 1989', '14 de março de 1990',
            '23 de janeiro de 1992']
})
cols = ['Início', 'Fim']
df[cols] = df[cols].apply(pd.to_datetime, format='%d de %B de %Y')
df:
Início Fim
0 1985-03-15 1986-02-14
1 1986-02-14 1987-11-23
2 1987-11-23 1989-01-15
3 1989-01-16 1990-03-14
4 1990-03-15 1992-01-23
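One caveat: locale names vary by platform, and setlocale raises locale.Error when the requested locale isn't installed. A defensive sketch (the alternative spellings 'pt_PT.UTF-8' and 'Portuguese_Portugal' are assumptions about typical Linux/Windows installs):
import locale

# try a few platform-specific spellings of the Portuguese locale
for loc in ('pt_PT', 'pt_PT.UTF-8', 'Portuguese_Portugal'):
    try:
        locale.setlocale(locale.LC_TIME, loc)
        break
    except locale.Error:
        continue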
Split the string inside each cell by " de ".
Replace the 2nd element with the corresponding number (I suggest using a dictionary for this; see the sketch after the example below).
Join the list into a string. I suggest using str.join, but string concatenation or formatted strings work too.
Let's use an example.
date = "23 de novembro de 1987"
dates = date.split(" de ") # ['23', 'novembro', '1987']
dates[1] = "11" # ['23', '11', '1987']
numerical_date = '/'.join(dates) # "23/11/1987"
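Putting the three steps together with the suggested dictionary (standard Portuguese month names):
MONTHS = {'janeiro': '01', 'fevereiro': '02', 'março': '03', 'abril': '04',
          'maio': '05', 'junho': '06', 'julho': '07', 'agosto': '08',
          'setembro': '09', 'outubro': '10', 'novembro': '11', 'dezembro': '12'}

def to_numeric(date):
    day, month, year = date.split(" de ")  # e.g. ['23', 'novembro', '1987']
    return '/'.join([day, MONTHS[month], year])

print(to_numeric("23 de novembro de 1987"))  # 23/11/1987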

How to change platform when scraping a website (Futbin) in Python?

I am currently looking at obtaining price data from Futbin, specifically from this page of player data. I have used bs4 successfully for this, with the following code:
spans = soup.find_all("span", class_="ps4_color font-weight-bold")
This collects all PS4 prices from the players page, but I would like to also obtain the Xbox and PC prices. To do this on the site you have to select the platform manually from the icons in the top right, but from what I can tell this leads to the same URL, just with updated price data. How can I scrape this data in a similar way to the above? I'm sure there must be an easier way than using Selenium or similar packages.
Any help would be greatly appreciated!
To change the page to another platform, set the cookies= parameter in your request:
import requests
from bs4 import BeautifulSoup

url = 'https://www.futbin.com/20/players?page=1'
platforms = ['ps4', 'xone', 'pc']
for platform in platforms:
    print()
    print('Platform: {}'.format(platform))
    print('-' * 80)
    soup = BeautifulSoup(requests.get(url, cookies={'platform': platform}).content, 'html.parser')
    for s in soup.select('span.font-weight-bold'):
        print('{:<40} {}'.format(s.find_previous('a', class_="player_name_players_table").text, s.text))
Prints:
Platform: ps4
--------------------------------------------------------------------------------
Lionel Messi 2.5M
Virgil van Dijk 1.82M
Cristiano Ronaldo 3.2M
Diego Maradona 4.5M
Pelé 6.65M
Kevin De Bruyne 1.95M
Virgil van Dijk 1.75M
Lionel Messi 2.18M
Robert Lewandowski 805K
Cristiano Ronaldo 3.08M
Pelé 3.35M
Kylian Mbappé 2.62M
Kevin De Bruyne 1.21M
Sadio Mané 783K
Kylian Mbappé 2.66M
Neymar Jr 3.83M
Diego Maradona 2.19M
Sadio Mané 625K
Alisson 148K
N'Golo Kanté 1.51M
Robert Lewandowski 269K
Ronaldo 0
Zinedine Zidane 7.15M
Lionel Messi 4.6M
Lionel Messi 1.4M
Alisson 143K
Mohamed Salah 459K
Raphaël Varane 847K
Karim Benzema 310K
Luis Suárez 407K
Platform: xone
--------------------------------------------------------------------------------
Lionel Messi 2.15M
Virgil van Dijk 1.65M
Cristiano Ronaldo 2.53M
Diego Maradona 4.07M
Pelé 0
Kevin De Bruyne 1.73M
Virgil van Dijk 1.6M
Lionel Messi 1.9M
Robert Lewandowski 719K
Cristiano Ronaldo 2.51M
Pelé 3.15M
Kylian Mbappé 2.27M
Kevin De Bruyne 1.02M
Sadio Mané 695K
Kylian Mbappé 2.24M
Neymar Jr 3.27M
Diego Maradona 1.61M
Sadio Mané 585K
Alisson 153K
N'Golo Kanté 1.3M
Robert Lewandowski 247K
Ronaldo 0
Zinedine Zidane 6.78M
Lionel Messi 4.26M
Lionel Messi 1.24M
Alisson 130K
Mohamed Salah 470K
Raphaël Varane 725K
Karim Benzema 272K
Luis Suárez 351K
Platform: pc
--------------------------------------------------------------------------------
Lionel Messi 3.56M
Virgil van Dijk 2.5M
Cristiano Ronaldo 3.75M
Diego Maradona 4.3M
Pelé 0
Kevin De Bruyne 2.52M
Virgil van Dijk 2.4M
Lionel Messi 2.86M
Robert Lewandowski 1.16M
Cristiano Ronaldo 3.75M
Pelé 5.75M
Kylian Mbappé 3.35M
Kevin De Bruyne 1.4M
Sadio Mané 925K
Kylian Mbappé 3.3M
Neymar Jr 4.85M
Diego Maradona 1.98M
Sadio Mané 730K
Alisson 179K
N'Golo Kanté 1.9M
Robert Lewandowski 400K
Ronaldo 0
Zinedine Zidane 0
Lionel Messi 4.77M
Lionel Messi 2.3M
Alisson 160K
Mohamed Salah 520K
Raphaël Varane 940K
Karim Benzema 370K
Luis Suárez 679K

Removing entire row if there are repetitive values in specific columns

I have read a CSV file (that has the names and addresses of customers) and assigned the data to a DataFrame table.
Description of the CSV file (or the DataFrame table):
The DataFrame contains several rows and 7 columns.
Database example:
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
So far I have written this code to generate the aforementioned table:
import pandas as pd
import glob
Excel_file = 'Address.xlsx'
Address_Info = pd.read_excel(Excel_file)
# rename the columns name
Address_Info.columns = ['Client_ID', 'Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']
# extract specific columns into a new dataframe
Bin_Address = Address_Info[['Client_Name','Address_ID','Street_Name','Post_Code','City_Name','Country']].copy()
# clean existing whitespace from the ends of the strings
Bin_Address = Bin_Address.apply(lambda x: x.str.strip(), axis=1)
# add a new column called (Full_Address) that concatenates the address columns into one,
# for example Karlaplan 13,115 20,STOCKHOLM,Stockholms län, Sweden
Bin_Address['Full_Address'] = Bin_Address[Bin_Address.columns[1:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)
Bin_Address['latitude'] = 'None'
Bin_Address['longitude'] = 'None'
# Remove repetitive addresses
#Temp = list( dict.fromkeys(Bin_Address.Full_Address) )
# Remove repetitive values ( I do beleive the modification should be here)
Temp = list( dict.fromkeys(Address_Info.Client_ID) )
I am looking to remove the entire row if there are repeated values in the Client_ID, Client_Name and Full_Address columns. So far the code doesn't show any error, but I haven't got the expected output (I believe the modification should be in the last line of the attached code).
The expected output is
Client_id Client_Name Address1 Address3 Post_Code City_Name Full_Address
C0000001 A 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000002 B 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
C0000003 C 11000051 9 RUE DU BRILL L-3898 FOETZ 9 RUE DU BRILL,L-3898 ,FOETZ
C0000004 D 10000009 37 RUE DE LA GARE L-7535 MERSCH 37 RUE DE LA GARE,L-7535, MERSCH
C0000005 E 10001998 RUE EDWARD STEICHEN L-1855 LUXEMBOURG RUE EDWARD STEICHEN,L-1855,LUXEMBOURG
You can use the built-in pandas method drop_duplicates(). There are also plenty of options available out of the box.
<your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"])
You can also choose, when rows are duplicates, whether to keep the first or the last value.
<your_dataframe>.drop_duplicates(subset=["Client_id", "Client_Name", "Full_Address"], keep="first")  # "first" or "last"
By default it will always keep the first value.
Try:
df = df.drop_duplicates(['Client_id', 'Client_Name', 'Full_Address'])
