I am trying to scrape lists from Wikipedia pages (like this one for example: https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Sk%C3%A1lholt) in a particular format. I am encountering issues getting 'li' and 'a href' to match up.
For example, from the above page, the ninth bullet has text:
1238–1268: Sigvarður Þéttmarsson (Norweger)
with HTML:
<li>1238–1268: <a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson">Sigvarður Þéttmarsson</a> (Norweger)</li>
I want to pull it together as a dictionary:
'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'
[Entire text of both parts of 'li' and 'a' child]: [href of 'a' child]
I know I can use lxml/etree to do this, but I'm not entirely sure how. Some recombination of the below?
from lxml import etree
tree = etree.HTML(html)
bishops = tree.cssselect('li')
text = [li.text for li in bishops]
links = tree.cssselect('li a')
hrefs = [bishop.get('href') for bishop in links]
Update: I have figured this out using BeautifulSoup as follows:
from bs4 import BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
bishops_with_links = {}
bishops = soup.select('li')
for bishop in bishops:
    if bishop.find('a'):
        bishops_with_links[bishop.text] = 'https://de.wikipedia.org' + bishop.a.get('href')
    else:
        bishops_with_links[bishop.text] = ''
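For completeness, here is a sketch of the lxml route the question originally asked about, using itertext() to combine the li's own text with the anchor text (inline HTML stands in for the downloaded page; only the one entry from the question is shown):

```python
from lxml import etree

html = ('<ul><li>1238–1268: '
        '<a href="/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson">'
        'Sigvarður Þéttmarsson</a> (Norweger)</li></ul>')

tree = etree.HTML(html)
bishops_with_links = {}
for li in tree.xpath('//li'):
    # itertext() walks every text node in document order, so the li's
    # leading text, the anchor text, and the trailing text all combine
    full_text = ''.join(li.itertext()).strip()
    a = li.find('.//a')
    bishops_with_links[full_text] = a.get('href') if a is not None else ''
print(bishops_with_links)
```

Note that plain `li.text` in lxml only returns the text *before* the first child element, which is exactly why the original attempt couldn't match the pieces up.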
Related
I'm trying to scrape rotten tomatoes with bs4
My aim is to find all the hrefs from the table, but I cannot do it. Can you help me?
https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/
my code is
from urllib import request
from bs4 import BeautifulSoup as BS
import re
import pandas as pd
url = 'https://www.rottentomatoes.com/top/bestofrt'
html = request.urlopen(url)
bs = BS(html.read(), 'html.parser')
tags = bs.find_all('a', {'class':'articleLink unstyled'})[7:]
links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
########################################### links ############################################################################
webpages = []
for link in reversed(links):
    print(link)
    html = request.urlopen(link)
    bs = BS(html.read(), 'html.parser')
    tags = bs.find_all('a', {'class':'unstyled articleLink'})[43:]
    links = ['https://www.rottentomatoes.com' + tag['href'] for tag in tags]
    webpages.extend(links)
print(webpages)
I put in the limit of 43 to skip the useless links and keep only movies, but it is a short-term solution and does not really help.
I need an exact solution for scraping the table without picking up irrelevant information.
Thanks.
Just grab the main table and then extract all the <a> tags.
For example:
import requests
from bs4 import BeautifulSoup
rotten_tomatoes_url = 'https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/'
action_and_adventure = [
    f"https://www.rottentomatoes.com{link.get('href')}"
    for link in BeautifulSoup(
        requests.get(rotten_tomatoes_url).text,
        "lxml",
    ).find("table", class_="table").find_all("a")
]
print(len(action_and_adventure))
print("\n".join(action_and_adventure[:10]))
Output (all 100 links to movies):
100
https://www.rottentomatoes.com/m/black_panther_2018
https://www.rottentomatoes.com/m/avengers_endgame
https://www.rottentomatoes.com/m/mission_impossible_fallout
https://www.rottentomatoes.com/m/mad_max_fury_road
https://www.rottentomatoes.com/m/spider_man_into_the_spider_verse
https://www.rottentomatoes.com/m/wonder_woman_2017
https://www.rottentomatoes.com/m/logan_2017
https://www.rottentomatoes.com/m/coco_2017
https://www.rottentomatoes.com/m/dunkirk_2017
https://www.rottentomatoes.com/m/star_wars_the_last_jedi
Try this (note that a positional argument can't follow a keyword argument, so the dict has to be passed as attrs=):
tags = bs.find_all('a', attrs={'class': 'unstyled articleLink'})[43:]
I'm trying to get the text from an 'a' HTML element I got with BeautifulSoup.
I am able to print the whole thing, and what I want to find is right there:
-1
Tensei Shitara Slime Datta Ken Manga
-1
But when I want to be more specific and get the text from that it gives me this error:
File "C:\python\manga\manga.py", line 15, in <module>
print(title.text)
AttributeError: 'int' object has no attribute 'text'
Here is the code I'm running:
import requests
from bs4 import BeautifulSoup
URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title:
    title = m_title.find('a')
    print(title.text)
I've searched for my problem but I couldn't find something that helps.
It isn't really BeautifulSoup returning -1 here: iterating over a tag yields NavigableString children as well as tags, and calling .find('a') on a string invokes Python's str.find, which returns -1 when the substring isn't found.
Returning -1 for a missing value isn't a very common convention in Python, but it is a common one in other languages.
import requests
from bs4 import BeautifulSoup
URL = 'https://mangapark.net/manga/tensei-shitara-slime-datta-ken-fuse'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('section', class_='manga')
manga_title = soup.find('div', class_='pb-1 mb-2 line-b-f hd')
for m_title in manga_title.children:
    title = m_title.find('a')
    # On NavigableString children, .find is str.find, which returns -1
    # when the substring isn't found, so filter those results out
    if title != -1:
        print(title.text)
Output
Tensei Shitara Slime Datta Ken Manga
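A type check is more robust than comparing against -1. A minimal sketch, with inline HTML standing in for the mangapark markup (the div class is shortened here for the example):

```python
from bs4 import BeautifulSoup, Tag

html = ('<div class="hd">\n'
        '<a href="/m/slime">Tensei Shitara Slime Datta Ken Manga</a>\n'
        '</div>')
soup = BeautifulSoup(html, 'html.parser')

titles = []
for child in soup.find('div', class_='hd').children:
    # Only Tag children have BeautifulSoup's .find(); string children
    # fall back to str.find, which returns an int such as -1
    if isinstance(child, Tag) and child.name == 'a':
        titles.append(child.get_text())
print(titles)  # ['Tensei Shitara Slime Datta Ken Manga']
```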
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import sys
query_txt = input("Enter the content to crawl: ")
path = r"C:\Temp\chromedriver_240\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.naver.com")
time.sleep(2)
driver.find_element_by_id("query").send_keys(query_txt)
driver.find_element_by_id("search_btn").click()
driver.find_element_by_link_text("블로그 더보기").click()  # the "View more blogs" link
full_html = driver.page_source
soup = BeautifulSoup(full_html, 'html.parser')
content_list = soup.find('ul', id='elThumbnailResultArea')
print(content_list)
content = content_list.find('a','sh_blog_title _sp_each_url _sp_each_title' ).get_text()
print(content)
for i in content_list:
    con = i.find('a', class_='sh_blog_title _sp_each_url _sp_each_title').get_text()
    print(con)
    print('\n')
I typed this code while following an online tutorial, but in the loop it always errors.
con = i.find('a', class_='sh_blog_title _sp_each_url _sp_each_title').get_text()
This line raises the error 'find() takes no keyword arguments'.
The problem is that iterating over content_list yields NavigableString children too, and on a string .find is str.find, which takes no keyword arguments. Instead, use .find_all() on the soup to collect all the <a> tags directly; .find() only returns one tag (if there's any):
import requests
from bs4 import BeautifulSoup
url = 'https://search.naver.com/search.naver?query=tree&where=post&sm=tab_nmr&nso='
full_html = requests.get(url).content
soup = BeautifulSoup(full_html, 'html.parser')
content_list = soup.find_all('a', class_='sh_blog_title _sp_each_url _sp_each_title' )
for i in content_list:
    print(i.text)
    print('\n')
Prints:
[2017/공학설계 입문] Romantic Tree
장충동/Banyan Tree Club & Spa/Club Members Restaurant
2020-06-27 Joshua Tree National Park Camping(조슈아트리...
[결혼준비/D-102] 웨딩밴드 '누니주얼리 - like a tree'
Book Club - Magic Tree House # 1 : Dinosaur Before Dark...
비밀 정원, 조슈아 트리 국립공원(Joshua Tree National Park)
그뤼너씨 TEA TREE 티트리 라인 3종리뷰
Number of Nodes in the Sub-Tree With the Same Label
태국의 100년 넘은 Giant tree
[부산 기장 카페] 오션뷰 뷰맛집카페 : 씨앤트리 sea&tree
Use .find('a', attrs={"class": "<class name>"}) instead. Reference: the BeautifulSoup docs.
These two links will definitely help you.
Understand the Find() function in Beautiful Soup
Find on beautiful soup in loop returns TypeError
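For reference, the error itself comes from iterating a Tag: iteration yields NavigableString children, whose .find is the string method str.find, and str.find rejects keyword arguments. A minimal sketch, with inline HTML standing in for the Naver markup:

```python
from bs4 import BeautifulSoup, Tag

html = ('<ul id="elThumbnailResultArea">'
        '<li><a class="sh_blog_title">tree post</a></li>'
        '</ul>')
soup = BeautifulSoup(html, 'html.parser')
content_list = soup.find('ul', id='elThumbnailResultArea')

titles = []
for child in content_list:
    # Calling child.find('a', class_=...) on a string child would raise
    # TypeError: find() takes no keyword arguments, so guard with a type check
    if isinstance(child, Tag):
        a = child.find('a', class_='sh_blog_title')
        if a:
            titles.append(a.get_text())
print(titles)  # ['tree post']
```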
I am very new to BeautifulSoup.
How can I extract the text of a paragraph from HTML source, split the text wherever there is a <br/>, and store it in a list such that each element is one chunk of the paragraph text (as delimited by <br/>)?
For example, for the following paragraph:
<p>
<strong>Pancakes</strong>
<br/>
A <strong>delicious</strong> type of food
<br/>
</p>
I would like it to be stored into the following array:
['Pancakes', 'A delicious type of food']
What I have tried is:
import bs4 as bs
soup = bs.BeautifulSoup("<p>Pancakes<br/> A delicious type of food<br/></p>", 'html.parser')
p = soup.findAll('p')
p[0] = p[0].getText()
print(p)
but this outputs an array with only one element:
['Pancakes A delicious type of food']
What is a way to code it so that I can get an array that contains the paragraph text split by any <br/> in the paragraph?
try this
from bs4 import BeautifulSoup, NavigableString
html = '<p>Pancakes<br/> A delicious type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.findAll('p')
result = [str(child).strip() for child in p[0].children
if isinstance(child, NavigableString)]
Update for deep recursive
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
p = soup.find('p').find_all(text=True, recursive=True)
Update again for text split only by <br>
from bs4 import BeautifulSoup, NavigableString, Tag
html = "<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>"
soup = BeautifulSoup(html, 'html.parser')
text = ''
for child in soup.find_all('p')[0]:
    if isinstance(child, NavigableString):
        # don't strip each piece here, or spaces between adjacent
        # strings and tags get lost
        text += str(child)
    elif isinstance(child, Tag):
        # treat <br> as a line break; keep other tags' text inline
        text += '\n' if child.name == 'br' else child.get_text()
result = [part.strip() for part in text.split('\n') if part.strip()]
print(result)
I stumbled across this whilst having a similar issue. This was my solution...
A simple way is to replace the line
p[0] = p[0].getText()
with
p[0].getText('#').split('#')
Result is:
['Pancakes', ' A delicious type of food']
Obviously, choose a separator character (or string) that won't appear in the text.
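One caveat worth noting: get_text(sep) inserts the separator between *every* string in the subtree, so nested tags such as <strong> become split points too, not only <br/>. A quick sketch of the pitfall:

```python
from bs4 import BeautifulSoup

html = '<p><strong>Pancakes</strong><br/> A <strong>delicious</strong> type of food<br/></p>'
soup = BeautifulSoup(html, 'html.parser')

# The separator lands between every text node, so the <strong> inside
# the second line splits it into extra pieces
parts = [s.strip() for s in soup.p.get_text('#').split('#') if s.strip()]
print(parts)  # ['Pancakes', 'A', 'delicious', 'type of food']
```

So the separator trick works on flat paragraphs but breaks once inline tags appear; the child-walking approach above handles both.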
I am attempting to pull the line and over/under data for games from ESPN. To do this I need to pull a list item underneath a div tag. I can successfully get the over/under data because its tag is clear to me, but the list item for the line doesn't seem to have a clear tag. Essentially, I want to pull out "Line: IOWA -3.5" from this specific URL.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
#Get over/under
game_ou = soup.find('li',class_='ou')
game_ou2 = game_ou.contents[0]
game_ou3=game_ou2.strip()
#Get Line
game_line = soup.find('div',class_='odds-details')
print(game_line)
Add in the parent class (with a descendant combinator and an li type selector); then you can retrieve both li elements in a list and index in, or just use select_one to retrieve the first:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = bs(r.content, 'lxml')
lis = [i.text.strip() for i in soup.select('.odds-details li')]
print(lis[0])
print(lis[1])
print(soup.select_one('.odds-details li').text)
Use find('li') after finding the div element.
from bs4 import BeautifulSoup
page = requests.get('https://www.espn.com/college-football/game/_/gameId/401012863')
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find("div",class_="odds-details").find('li').text)
print(soup.find("div",class_="odds-details").find('li',class_='ou').text.strip())
Output:
Line: IOWA -3.5
Over/Under: 47