I'm trying to get the text from an 'a' HTML element I got with BeautifulSoup.
I am able to print the whole thing, and what I want to find is right there:
-1
Tensei Shitara Slime Datta Ken Manga
-1
But when I want to be more specific and get the text from that it gives me this error:
File "C:\python\manga\manga.py", line 15, in <module>
print(title.text)
AttributeError: 'int' object has no attribute 'text'
Here is the code I'm running:
import requests
from bs4 import BeautifulSoup
URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title:
    title = m_title.find('a')
    print(title.text)
I've searched for my problem but I couldn't find something that helps.
The -1 doesn't actually come from Beautiful Soup. When you iterate over a tag, some of its children are plain NavigableString objects, and calling .find('a') on one of those invokes Python's str.find, which returns -1 when the substring isn't found.
Returning -1 for a missing value isn't a very common idiom in Python, but it is a common one in other languages.
import requests
from bs4 import BeautifulSoup
URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title.children:
    title = m_title.find('a')
    # Text-node children are NavigableStrings, so .find on them is
    # Python's str.find, which returns -1 when nothing is found
    if title != -1:
        print(title.text)
Output
Tensei Shitara Slime Datta Ken Manga
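If you prefer an explicit guard, you can skip the text-node children entirely with an isinstance check. A sketch of the same loop (Tag is importable from bs4):
import requests
from bs4 import BeautifulSoup, Tag

URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title.children:
    # Only Tag children support BeautifulSoup's find(); strings get str.find
    if isinstance(m_title, Tag):
        title = m_title.find('a')
        if title is not None:
            print(title.text)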
Related
I want to web scrape this webpage (carbuzz.com). I want to get the links (href) of all the car brands from "Acura" to "Volvo" (link to picture).
Currently, I only get the first entry (Acura). How do I get the remaining ones? As I just started scraping and coding would highly appreciate your input!
Code:
from bs4 import BeautifulSoup
import requests
import time
#Inputs/URLs to scrape:
URL2 = ('https://carbuzz.com/cars')
(response := requests.get(URL2)).raise_for_status()
soup = BeautifulSoup(response.text, 'lxml')
overview = soup.find()
car_brand = overview.find(class_='bg-make-preview')['href']
car_brand_url ='https://carbuzz.com'+car_brand
print(car_brand_url)
Output:
https://carbuzz.com/cars/acura
[Finished in 1.2s]
You can use find_all to get all of the tags with the class name bg-make-preview.
soup = BeautifulSoup(response.text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
    car_brand_url = 'https://carbuzz.com' + elem['href']
    print(car_brand_url)
This gives us the expected output:
https://carbuzz.com/cars/acura
https://carbuzz.com/cars/alfa-romeo
https://carbuzz.com/cars/aston-martin
https://carbuzz.com/cars/audi
https://carbuzz.com/cars/bentley
https://carbuzz.com/cars/bmw
https://carbuzz.com/cars/bollinger
https://carbuzz.com/cars/bugatti
https://carbuzz.com/cars/buick
https://carbuzz.com/cars/cadillac
https://carbuzz.com/cars/caterham
https://carbuzz.com/cars/chevrolet
https://carbuzz.com/cars/chrysler
https://carbuzz.com/cars/dodge
https://carbuzz.com/cars/ferrari
https://carbuzz.com/cars/fiat
https://carbuzz.com/cars/fisker
https://carbuzz.com/cars/ford
https://carbuzz.com/cars/genesis
https://carbuzz.com/cars/gmc
https://carbuzz.com/cars/hennessey
https://carbuzz.com/cars/honda
https://carbuzz.com/cars/hyundai
https://carbuzz.com/cars/infiniti
https://carbuzz.com/cars/jaguar
https://carbuzz.com/cars/jeep
https://carbuzz.com/cars/karma
https://carbuzz.com/cars/kia
https://carbuzz.com/cars/koenigsegg
https://carbuzz.com/cars/lamborghini
https://carbuzz.com/cars/land-rover
https://carbuzz.com/cars/lexus
https://carbuzz.com/cars/lincoln
https://carbuzz.com/cars/lordstown
https://carbuzz.com/cars/lotus
https://carbuzz.com/cars/lucid
https://carbuzz.com/cars/maserati
https://carbuzz.com/cars/mazda
https://carbuzz.com/cars/mclaren
https://carbuzz.com/cars/mercedes-benz
https://carbuzz.com/cars/mini
https://carbuzz.com/cars/mitsubishi
https://carbuzz.com/cars/nissan
https://carbuzz.com/cars/pagani
https://carbuzz.com/cars/polestar
https://carbuzz.com/cars/porsche
https://carbuzz.com/cars/ram
https://carbuzz.com/cars/rimac
https://carbuzz.com/cars/rivian
https://carbuzz.com/cars/rolls-royce
https://carbuzz.com/cars/spyker
https://carbuzz.com/cars/subaru
https://carbuzz.com/cars/tesla
https://carbuzz.com/cars/toyota
https://carbuzz.com/cars/volkswagen
https://carbuzz.com/cars/volvo
https://carbuzz.com/cars/hummer
https://carbuzz.com/cars/maybach
https://carbuzz.com/cars/mercury
https://carbuzz.com/cars/pontiac
https://carbuzz.com/cars/saab
https://carbuzz.com/cars/saturn
https://carbuzz.com/cars/scion
https://carbuzz.com/cars/smart
https://carbuzz.com/cars/suzuki
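If you would rather not hard-code the 'https://carbuzz.com' prefix, urllib.parse.urljoin from the standard library can resolve each relative href against the page URL. A small sketch using the same selector:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base = 'https://carbuzz.com/cars'
soup = BeautifulSoup(requests.get(base).text, 'lxml')
for elem in soup.find_all(class_='bg-make-preview'):
    # urljoin resolves '/cars/acura'-style hrefs against the base URL
    print(urljoin(base, elem['href']))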
I am trying to create a list of all football teams/links from any one of a number of tables within the base URL: https://fbref.com/en/comps/10/stats/Championship-Stats
I would then use the link from the href to scrape each individual team's data. The href is embedded within the th tag, as shown below:
<th scope="row" class="left " data-stat="squad">
  <a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a>
</th>
The following code gives me a list of the 'a' tags
import requests
from bs4 import BeautifulSoup

page = "https://fbref.com/en/comps/10/Championship-Stats"
pageTree = requests.get(page)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Teams = pageSoup.find_all("th", {"class": "left"})
Output (for each th with class 'left'):
<th class="left" data-stat="squad" scope="row">
  <a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a>
</th>,
I have tried the guidance from a previous Stack question (Extract links after th in beautifulsoup)
However, the following code based on that thread produces errors
AttributeError: 'NoneType' object has no attribute 'find_parent'
def import_TeamList():
    BASE_URL = "https://fbref.com/en/comps/10/Championship-Stats"
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    team_list = []
    team_tr = soup.find('a', {'data-stat': 'squad'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text != 'squad':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return TeamList
Here is a version using CSS selectors, which I find simpler than most other methods.
import requests
from bs4 import BeautifulSoup
url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
links = soup.select('th a')
urls = [link['href'] for link in links]
print(urls)
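Note that 'th a' will pick up every link inside any table header cell on the page. If you only want the squad links, you can presumably narrow the selector using the data-stat attribute shown in the question's HTML (a sketch under that assumption):
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Only anchors inside th cells whose data-stat is "squad"
urls = [a['href'] for a in soup.select('th[data-stat="squad"] a')]
print(urls)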
Is this what you're looking for?
import requests
from bs4 import BeautifulSoup as BS
from lxml import etree
with requests.Session() as session:
    r = session.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
    r.raise_for_status()
    dom = etree.HTML(str(BS(r.text, 'lxml')))
    for a in dom.xpath('//th[@class="left"]/a'):
        print(a.attrib['href'])
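The same extraction works without the lxml round-trip, using BeautifulSoup's own search. A minimal sketch, assuming the same class="left" header cells:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
soup = BeautifulSoup(r.text, 'lxml')

for th in soup.find_all('th', class_='left'):
    a = th.find('a')
    if a is not None:  # some header cells carry no link
        print(a['href'])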
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p', class_="entry-content")
print(len(words))
for word in words:
    print(word.text)
Nothing is being displayed on my console, and the length variable returns 0, which means nothing is being scraped.
If you look at the HTML, there is a div that contains all the p tags, so you can grab that div by its class and then take the p tags from it, and you will get your output:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
main_div = soup.find('div', attrs={"class": "entry-content"})
words = main_div.find_all("p")
for word in words:
    print(word.text)
Output:
testing
The slang used in Dominican Republic.
C – ce
*Caballo – person similar to a tigre but a little more decent
*Cabron – a large male goat,also means displeased
*Cacaito – candy
....
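Equivalently, a single CSS selector can express the same div-then-p drill-down (a small sketch of the same idea):
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

# "div.entry-content p" selects every <p> inside the entry-content div
for word in soup.select('div.entry-content p'):
    print(word.text)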
There are no p tags with the class entry-content. Remove that and it should work:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
page = requests.get(URL)
print("testing")
soup = BeautifulSoup(page.content, 'html.parser')
words = soup.find_all('p')
print(len(words))
for word in words:
    print(word.text)
But if you want the content of the p tags inside the div with class "entry-content", then:
entryContent = soup.find('div', attrs={"class": "entry-content"})
words = entryContent.find_all("p")
You are passing entry-content as an extra filter on the p tags, but no p tag has that class, so there is no need for the extra parameter.
words = soup.find_all('p', class_="entry-content")
Try this instead:
words = soup.find_all('p')
Then it will get all the content with p tags and give you the length:
print(len(words))
I hope it helps.
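One general caution for both versions: find returns None when nothing matches, so it is worth guarding before chaining calls on the result (a minimal sketch):
import requests
from bs4 import BeautifulSoup

URL = 'https://www.colonialzone-dr.com/c-dominicanismos-dictionary'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')

entry_content = soup.find('div', class_='entry-content')
if entry_content is None:
    print("entry-content div not found - has the page layout changed?")
else:
    for word in entry_content.find_all('p'):
        print(word.text)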
I am attempting to pull line and over/under data for games from ESPN. To do this I need to pull a list item underneath a div tag. I can successfully get the over/under data because it's clear to me what the tag is, but the list item for the line doesn't seem to have a clear tag. Essentially I want to pull out "Line: IOWA -3.5" from this specific URL.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')

# Get over/under
game_ou = soup.find('li', class_='ou')
game_ou2 = game_ou.contents[0]
game_ou3 = game_ou2.strip()

# Get line
game_line = soup.find('div', class_='odds-details')
print(game_line)
Add in the parent class (with a descendant combinator and an li type selector); then you can retrieve both li elements in a list and index in, or just use select_one to retrieve the first:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = bs(r.content, 'lxml')
lis = [i.text.strip() for i in soup.select('.odds-details li')]
print(lis[0])
print(lis[1])
print(soup.select_one('.odds-details li').text)
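Betting odds are not guaranteed to be present on every game page, so it can be worth a guard before indexing into the list (a sketch using the same selector):
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = bs(r.content, 'lxml')

lis = [i.text.strip() for i in soup.select('.odds-details li')]
if lis:
    print(lis[0])  # the line, e.g. "Line: IOWA -3.5"
else:
    print('No odds listed for this game')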
Use find('li') after finding the div element.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find("div", class_="odds-details").find('li').text)
print(soup.find("div", class_="odds-details").find('li', class_='ou').text.strip())
Output:
Line: IOWA -3.5
Over/Under: 47
When I use the find_all method on this page, Beautiful Soup doesn't find all of the targets.
This code:
len(mySoup.find_all('div', {'class': 'lo-liste row'}))
Returns 1, yet there are 4.
This is the soup url.
When I looked at the source code of the given link, I found there is only one div with the class name "lo-liste row"; the other three divs have the class name "lo-liste row not-first-ligne", which is why you got only 1 as output.
Try the following code:
len(soup.findAll('div', {'class': ['lo-liste row','not-first-ligne']}))
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php")
soup = BeautifulSoup(page.content, 'html.parser')
print(len(soup.findAll('div', {'class': ['lo-liste row','not-first-ligne']})))
The find_all DOES correctly match all targets.
The first product has class="lo-liste row"
The next 3 products have class="lo-liste row not-first-ligne"
import requests
from bs4 import BeautifulSoup

url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
response = requests.get(url)
mySoup = BeautifulSoup(response.text, 'html.parser')

for product in mySoup.find_all('div', {'class': 'lo-liste row'}):
    print(product.find('a').find_next('span').text.strip())

for product in mySoup.find_all('div', {'class': 'lo-liste row not-first-ligne'}):
    print(product.find('a').find_next('span').text.strip())

# or to combine those 2 for loops into 1:
# for product in mySoup.findAll('div', {'class': ['lo-liste row', 'not-first-ligne']}):
#     print(product.find('a').find_next('span').text.strip())
Output:
SOS Accessoire
Stortle
Groupe-Dragon
Asdiscount
Use select instead. It will match on all 4 for that class.
items = soup.select('.lo-liste.row')
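For example (a minimal sketch around that selector):
import requests
from bs4 import BeautifulSoup

url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# .lo-liste.row matches any element carrying both classes, extras included
items = soup.select('.lo-liste.row')
print(len(items))  # 4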
Use a regular expression with the re module to match the class.
from bs4 import BeautifulSoup
import requests
import re
url = 'https://www.ubaldi.com/offres/jeu-de-3-disques-feutres-electrolux--ma-92ca565jeud-4b9yz--727536.php'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
print(len(soup.find_all('div', class_=re.compile('lo-liste row'))))
Output:
4