for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 10:  # Only extract the table body, not the heading
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find('div').get('title'))
        D.append(cells[3].find('a', href=True).get_text())
        E.append(cells[4].find('a', href=True).get_text())
        if cells[5].find(text=True) is None or cells[5].find('a', href=True) is None:
            F.append(cells[5].find(text=True))
        else:
            Output = '-'.join([item.get_text() for item in cells[5].find_all('a')])
            F.append(Output)
        if cells[6].find(text=True) is None or cells[6].find('a', href=True) is None:
            G.append(cells[6].find(text=True))
        else:
            G.append(cells[6].find('a', href=True).get_text())
        if cells[7].find(text=True) is None or cells[7].find('a', href=True) is None:
            H.append(cells[7].find(text=True))
        else:
            H.append(cells[7].find('a', href=True).get_text())
        I.append(cells[8].find('span').get_text())
        J.append(cells[9].find(Title=True))
The problem is that in cells 5, 6, and 7 the desired output is sometimes inside an <a href> tag and sometimes directly inside the td tag. The code runs, but list F, for example, ends up looking something like this:
0    T-001
1    TD-U1B
2    BMA-D2-USA
3    BMU-D3-USA
4
Positions 2 and 3 are correct. These are the outputs from:

    else:
        Output = '-'.join([item.get_text() for item in cells[5].find_all('a')])
        F.append(Output)

Positions 0 and 1 are incorrect. These are the outputs from:

    F.append(cells[5].find(text=True))
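For reference, one way to normalize the two cases (a sketch, not part of the original code; the helper name cell_value is made up) is to join the text of every link when any are present and otherwise fall back to the cell's own stripped text:

    def cell_value(cell):
        # If the cell contains anchor tags, join their texts with "-";
        # otherwise use the cell's own stripped text (None if empty).
        links = cell.find_all('a')
        if links:
            return '-'.join(a.get_text(strip=True) for a in links)
        text = cell.get_text(strip=True)
        return text if text else None

    F.append(cell_value(cells[5]))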
Here is a picture (sorry) of the HTML that I am trying to parse:
I am using this line:
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
I expected to get the first child of the class statText, with 53% as the outcome.
But it doesn't work: I get "Loading..." and none of the data that I was trying to use and display.
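(Note: select_one() takes a single CSS selector string; the class_ keyword belongs to find()/find_all(), not to select_one(). The intended selector would be written in one piece, as sketched below, although that alone does not explain the "Loading..." output.)

    # Class and pseudo-class go inside the selector string itself:
    home_stats = soup.select_one('div.statText:nth-child(1)').text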
The full code I have so far:
soup = BeautifulSoup(source, 'lxml')
home_team = soup.find('div', class_='tname-home').a.text
away_team = soup.find('div', class_='tname-away').a.text
home_score = soup.select_one('.current-result .scoreboard:nth-child(1)').text
away_score = soup.select_one('.current-result .scoreboard:nth-child(2)').text
print("The home team is " + home_team, "and they scored " + home_score)
print()
print("The away team is " + away_team, "and they scored " + away_score)
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
print(home_stats)
This currently does print the home and away team and the number of goals they scored, but I can't seem to get any of the statistical content from this site.
My output plan is to have:
[home_team] had 53% ball possession and [away_team] had 47% ball possession
However, I would like to remove the "%" symbols from the parse (but that's not essential). My plan is to use these numbers for more stats later on, so the % symbol gets in the way.
Apologies for the noob question - this is the absolute beginning of my Pythonic journey. I have scoured the internet and StackOverflow and just cannot find this situation; I also possibly don't know exactly what I am looking for.
Thanks kindly for your help! May your answer be the one I pick as "correct" ;)
Assuming that this is the website you are trying to scrape: the statistics are rendered by JavaScript, which is why a plain request sees only "Loading...". Loading the page in a real browser via Selenium first gets around that. Here is the complete code to scrape all the stats:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.scoreboard.com/en/match/SO3Fg7NR/#match-statistics;0')
pg = driver.page_source  # Gets the source code of the page
driver.close()

soup = BeautifulSoup(pg, 'html.parser')  # Creates a soup object
statrows = soup.find_all('div', class_="statTextGroup")  # The div tags with class statTextGroup contain the stats

# Scrape the team names
teams = soup.find_all('a', class_="participant-imglink")
teamslst = []
for x in teams:
    team = x.text.strip()
    if team != "":
        teamslst.append(team)

stats_dict = {}
count = 0
for x in statrows:
    txt = x.text
    final_txt = ""
    stat = ""
    alphabet = False
    percentage = False
    # Extract the numbers from the text; a "-" marks the boundary
    # between the home and away values
    for c in txt:
        if c in '0123456789':
            final_txt += c
        else:
            if alphabet == False:
                final_txt += "-"
                alphabet = True
            if c != "%":
                stat += c
            else:
                percentage = True
    values = final_txt.split('-')
    # Append the values to the dictionary
    for v in values:
        if stat in stats_dict.keys():
            if percentage == True:
                stats_dict[stat].append(v + "%")
            else:
                stats_dict[stat].append(int(v))
        else:
            if percentage == True:
                stats_dict[stat] = [v + "%"]
            else:
                stats_dict[stat] = [int(v)]
    count += 1
    if count == 15:  # only the first 15 stat rows are needed
        break

index = [teamslst[0], teamslst[1]]
# Create a pandas DataFrame out of the dictionary
df = pd.DataFrame(stats_dict, index=index).T
print(df)
Output:
                   Burnley Southampton
Ball Possession        53%         47%
Goal Attempts           10           5
Shots on Goal            2           1
Shots off Goal           4           2
Blocked Shots            4           2
Free Kicks              11          10
Corner Kicks             8           2
Offsides                 2           1
Goalkeeper Saves         0           2
Fouls                    8          10
Yellow Cards             1           0
Total Passes           522         480
Tackles                 15          12
Attacks                142         105
Dangerous Attacks       44          29
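If you'd rather store the possession numbers without the "%" symbol, as the question mentions, a small tweak (a sketch) is to skip the percentage branch and always store plain integers:

    # Inside the loop over values: always store ints, so "%" never
    # enters the data and the percentage flag becomes unnecessary.
    for v in values:
        stats_dict.setdefault(stat, []).append(int(v))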
Hope that this helps!
P.S.: I actually wrote this code for a different question, but I didn't post it because an answer had already been posted! I didn't know it would come in handy now. Anyway, I hope my answer does what you need.
The program is supposed to return values for all 50 movies (title, Metascore, genre, gross, etc.) and append None where a value is not available, so that each list ends up with 50 elements, but it currently gives only 43.
url = requests.get(f'https://www.imdb.com/search/title/?title_type=feature&year=2017-01-01,2017-12-31&start=51&ref_=adv_nxt')
soup = BeautifulSoup(url.text, 'html.parser')

for t, m, g, r, c, i in zip(soup.select('div.lister-list >div.lister-item>div.lister-item-content>h3.lister-item-header>a'),
                            soup.select('div.lister-list >div.lister-item>div.lister-item-content>div.ratings-bar>div.ratings-metascore>span'),
                            soup.select('div.lister-list >div.lister-item>div.lister-item-content>p.text-muted>.genre'),
                            soup.select('div.lister-list >div.lister-item>div.lister-item-content>p.text-muted>.runtime'),
                            soup.select('div.lister-list >div.lister-item>div.lister-item-content>p.text-muted>.certificate'),
                            soup.select('div.lister-list >div.lister-item>div.lister-item-content>div.ratings-bar>div>strong')):
    title.append(t.text)
    metascore.append(m.getText())
    genre.append(g.text.strip())
    run_time.append(r.text)
    m_certificate.append(c.text)
    imdb_rating.append(i.text)
The following loop does append None for values that are not present:
for v in soup.select('div.lister-item-content >p.sort-num_votes-visible'):
    votes.append(v.find('span', attrs={'name': 'nv'}).text)
    vote = v.find_all('span', attrs={'name': 'nv'})
    try:
        gross.append(vote[1].text)
    except IndexError:
        gross.append(None)
Some movies don't have a metascore and some of them don't have a certificate either. Since zip() stops at its shortest iterable, every movie that lacks one of those fields shortens and misaligns the combined result, which is why you get 43 elements instead of 50. You can either use try-except blocks or conditional statements to get rid of that error; I used the latter in the following example. Give it a shot:
import requests
from bs4 import BeautifulSoup

link = 'https://www.imdb.com/search/title/?title_type=feature&year=2017-01-01,2017-12-31&start=51&ref_=adv_nxt'

res = requests.get(link)
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select(".lister-item"):
    name = item.select_one('h3.lister-item-header > a').get_text(strip=True)
    score = item.select_one('span.metascore').get_text(strip=True) if item.select_one('span.metascore') else None
    genre = item.select_one('span.genre').get_text(strip=True) if item.select_one('span.genre') else None
    runtime = item.select_one('span.runtime').get_text(strip=True) if item.select_one('span.runtime') else None
    certificate = item.select_one('span.certificate').get_text(strip=True) if item.select_one('span.certificate') else None
    rating = item.select_one('.rating-star + strong').get_text(strip=True) if item.select_one('.rating-star + strong') else None
    print(name, score, genre, runtime, certificate, rating)
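For comparison, the try-except flavour of the same idea (sketched for the metascore field only) looks like this:

    try:
        score = item.select_one('span.metascore').get_text(strip=True)
    except AttributeError:
        # select_one() returned None because this movie has no metascore
        score = None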
I have written the following function that scrapes multiple pages from a website. I only want to get the first 20 or so pages. How can I limit the number of rows that I fill in my dataframe?
def scrape_page(poi, page_name):
    base_url = "https://www.fake_website.org/"
    report_url = (base_url + poi)
    page = urlopen(report_url)
    experiences = BeautifulSoup(page, "html.parser")
    empty_list = []
    for link in experiences.findAll('a', attrs={'href': re.compile(page_name + ".shtml$")}):
        url = urljoin(base_url, link.get("href"))
        subpage = urlopen(url)
        expages = BeautifulSoup(subpage, "html.parser")
        for report in expages.findAll('a', attrs={'href': re.compile("^/experiences/exp")}):
            url = urljoin(base_url, report.get("href"))
            reporturlopen = urlopen(url)
            reporturl = BeautifulSoup(reporturlopen, "html.parser")
            book_title = reporturl.findAll("div", attrs={'class': 'title'})
            for i in book_title:
                title = i.get_text()
            book_genre = reporturl.findAll("div", attrs={'class': 'genre'})
            for i in book_genre:
                genre = i.get_text()
            book_author = reporturl.findAll("div", attrs={'class': 'author'})
            for i in book_author:
                author = i.get_text()
                author = re.sub("by", "", author)
            empty_list.append({'title': title, 'genre': genre, 'author': author})
    setattr(sys.modules[__name__], '{}_df'.format(poi + "_" + page_name), empty_list)
You can for example add a while loop:
i = 0
while i < 20:
    <insert your code>
    i += 1
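Applied to the function above, a more direct variant (a sketch reusing the question's variable names; MAX_PAGES is made up) is to count the matching links as you iterate and break once the cap is reached:

    MAX_PAGES = 20  # hypothetical cap on the number of subpages to visit

    for page_count, link in enumerate(experiences.findAll('a', attrs={'href': re.compile(page_name + ".shtml$")})):
        if page_count >= MAX_PAGES:
            break  # stop after the first 20 matching links
        url = urljoin(base_url, link.get("href"))
        # ... rest of the inner scraping loop unchanged ...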
I tried to collect the data from this page (http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=639137&tbl=cyber) using Selenium with Python 3.6. What I tried to do is divide the section into two parts and collect the data from each part.
The two parts look like below:
The items in the two parts are made up of 39 tr tags. I select the 0th to 14th tr tags for the first part and the 15th to the last tr tags for the second part, but the first part already prints all the way to the last tr tag, and I don't understand why that happens.
Below is my code:
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse
from urllib.parse import quote
from selenium import webdriver
import re
import time

popup_inspection = "http://www.bobaedream.co.kr/mycar/popup/mycarChart_4.php?zone=C&cno=639137&tbl=cyber"

driver = webdriver.PhantomJS()
driver.set_window_size(500, 300)
driver.get(popup_inspection)
soup_inspection = BeautifulSoup(driver.page_source, "html.parser")

count = 0       # for loop count
count_insp = 0  # leaks and malfunctions (누유 및 오작동)
count_in = 0    # frame (골격)
count_out = 0   # exterior (외관)

insp_tables = soup_inspection.find_all('table', class_=True)

for insp_table in insp_tables[4].find_all('tr'):
    labels = insp_table.find_all('td', class_="center")
    for label in labels[:15]:
        if label.find("input", type="checkbox", checked=True):
            count_out += 1
            print(label.text)
        else:
            print(label.text)
    print("외관 이상 수: ", count_out)  # number of exterior defects
    for label in labels[16:]:
        if label.find("input", type="checkbox", checked=True):
            count_in += 1
            print(label.text)
        else:
            print(label.text)
    print("골격 이상 수: ", count_in)  # number of frame defects
The result I would like to have is like below:
<Upper Part>
1 후드 0 0
2 프론트 휀더(좌) 0 0
......
8 트렁크 리드 1 0
Total : 1 0
<Lower Part>
1 프론트 패널
2 크로스 멤버
....
22 리어 패널 1 0
23 트렁크 플로어 0 0
Total : 1 0
Please help me to work this out.
Thanks.
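One likely cause, offered as an observation rather than a confirmed fix: labels[:15] slices the td cells inside each individual row, and since no single row has 15 cells, the slice covers every cell of every row, so the first loop walks the whole table. Slicing the rows themselves before extracting the cells keeps the two parts separate, roughly like this:

    rows = insp_tables[4].find_all('tr')

    # Slice the rows (tr), not the cells (td) inside each row
    upper_part = rows[:15]   # first 15 rows
    lower_part = rows[15:]   # remaining rows

    for row in upper_part:
        for label in row.find_all('td', class_="center"):
            print(label.text)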
I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.
My plan so far was to:

1. Extract a list of headers using BeautifulSoup.
2. Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- there might be a method for replacing inside BeautifulSoup?
3. Output a nested list of links to the headers in a predefined spot.
It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.
Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?
An example:
<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>
A quickly hacked, ugly piece of code:
soup = BeautifulSoup(html)
toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.findAll(['h2', 'h3']):
    header['id'] = header_id
    if previous_tag == 'h2' and header.name == 'h3':
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        toc.append(current_list)
        current_list = toc
    current_list.append((header_id, header.string))
    header_id += 1
    previous_tag = header.name

if current_list != toc:
    toc.append(current_list)

def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            # item is an (id, title) tuple; wrap it in a 1-tuple so the
            # % operator doesn't try to unpack it as two arguments
            result.append('<li>%s</li>' % (item,))
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)
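For the example HTML above, this hacked version (with the tuple-formatting fix noted in the comment) builds toc as a flat list of (id, title) entries, with nested sub-lists for runs of h3 headers:

    toc = [(1, 'This is a sub-header'),
           [(2, 'This is a sub-sub-header')],
           (3, 'This is a sub-header')]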
Use lxml.html. It can deal with invalid HTML just fine, it is very fast, and it allows you to easily create missing elements and move elements around between trees.
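A minimal sketch of that approach, assuming the content fragment from the question (the anchor naming scheme header-N is made up):

    import lxml.html

    content = """
    <p>This is an introduction</p>
    <h2>This is a sub-header</h2>
    <h3>This is a sub-sub-header</h3>
    <h2>This is a sub-header</h2>
    """

    tree = lxml.html.fromstring(content)
    toc = []
    for i, header in enumerate(tree.xpath('//h2|//h3'), start=1):
        anchor = 'header-%d' % i
        header.set('id', anchor)  # give the header an id the TOC can link to
        toc.append((header.tag, anchor, header.text_content()))

    for tag, anchor, text in toc:
        print('<a href="#%s">%s</a> (%s)' % (anchor, text, tag))
    print(lxml.html.tostring(tree).decode())  # the content with ids inserted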
I have come up with an extended version of the solution proposed by Łukasz.
from bs4 import BeautifulSoup
# slugify as provided by e.g. the python-slugify package
from slugify import slugify

def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            # item is a (slug, title) tuple; link the entry to its anchor
            # (the original format string dropped the title argument)
            result.append('<li><a href="#{}">{}</a></li>'.format(item[0], item[1]))
    result.append("</ul>")
    return "\n".join(result)

soup = BeautifulSoup(article, 'html5lib')
toc = []
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0

for header in soup.findAll(['h2', 'h3', 'h4', 'h5', 'h6']):
    data = [(slugify(header.string), header.string)]
    if header.name == "h2":
        toc.append(data)
        h3_prev = 0
        h4_prev = 0
        h5_prev = 0
        h2_prev = len(toc) - 1
    elif header.name == "h3":
        toc[int(h2_prev)].append(data)
        h3_prev = len(toc[int(h2_prev)]) - 1
    elif header.name == "h4":
        toc[int(h2_prev)][int(h3_prev)].append(data)
        h4_prev = len(toc[int(h2_prev)][int(h3_prev)]) - 1
    elif header.name == "h5":
        toc[int(h2_prev)][int(h3_prev)][int(h4_prev)].append(data)
        h5_prev = len(toc[int(h2_prev)][int(h3_prev)][int(h4_prev)]) - 1
    elif header.name == "h6":
        toc[int(h2_prev)][int(h3_prev)][int(h4_prev)][int(h5_prev)].append(data)

toc_html = list_to_html(toc)
Related: How do I generate a table of contents for HTML text in Python?
But I think you are on the right track and reinventing the wheel will be fun.