Conditional arguments to extract data from HTML - python

I have some HTML from which I'm trying to extract specific information. It has repeating elements, and I have an idea of how to account for them. I'm trying to implement conditional logic that goes as follows:
Extract the player name from the first href tag.
Search for the next tag with the class flaggenrahmen and extract the value of its alt attribute.
If flaggenrahmen repeats, skip it.
Repeat these steps for the next player.
What I have tried:
from collections import defaultdict
from bs4 import BeautifulSoup

player_dict = defaultdict(list)
soup = BeautifulSoup(html)
player_id = soup.select('*[href]')
nation = soup.select('.flaggenrahmen')
for l, k in zip(player_id, nation):
    player_dict[l.get_text(strip=True)].append(k['alt'])
However, I cannot get the 'skip' when flaggenrahmen repeats again, and therefore I get more than one country per player.
Produced output:
defaultdict(list,
            {'': ['England', 'Spain', 'Portugal'],
             'Trent Alexander-Arnold': ['Morocco'],
             'Achraf Hakimi': ['England']})
Expected output:
{'Trent Alexander-Arnold': ['England'],
 'Achraf Hakimi': ['Morocco'],
 'João Cancelo': ['Portugal'],
 'Reece James': ['England']
}
Here's the html data:
html='''<tbody>
<tr class="odd">
<td class="zentriert">1</td><td class=""><table class="inline-table"><tr><td rowspan="2"><img alt="Trent Alexander-Arnold" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/314353-1559826986.jpg?lm=1" title="Trent Alexander-Arnold"/></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/trent-alexander-arnold/profil/spieler/314353" id="314353" title="Trent Alexander-Arnold">Trent Alexander-Arnold</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">23</td><td class="zentriert"><img alt="England" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-liverpool/startseite/verein/31" id="31"><img alt="Liverpool FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/31.png?lm=1456567819" title=" "/></a></td><td class="rechts hauptlink"><b>£67.50m</b><span class="icons_sprite red-arrow-ten" title="£90.00m"> </span></td></tr>
<tr class="even">
<td class="zentriert">2</td><td class=""><table class="inline-table"><tr><td rowspan="2"><img alt="Achraf Hakimi" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/398073-1633679363.jpg?lm=1" title="Achraf Hakimi"/></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/achraf-hakimi/profil/spieler/398073" id="398073" title="Achraf Hakimi">Achraf Hakimi</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">22</td><td class="zentriert"><img alt="Morocco" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/107.png?lm=1520611569" title="Morocco"/><br/><img alt="Spain" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569" title="Spain"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-paris-saint-germain/startseite/verein/583" id="583"><img alt="Paris Saint-Germain" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/583.png?lm=1522312728" title=" "/></a></td><td class="rechts hauptlink"><b>£63.00m</b><span class="icons_sprite green-arrow-ten" title="£54.00m"> </span></td></tr>
<tr class="odd">
<td class="zentriert">3</td><td class=""><table class="inline-table"><tr><td rowspan="2"><img alt="João Cancelo" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/182712-1615221629.jpg?lm=1" title="João Cancelo"/></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/joao-cancelo/profil/spieler/182712" id="182712" title="João Cancelo">João Cancelo</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">27</td><td class="zentriert"><img alt="Portugal" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/136.png?lm=1520611569" title="Portugal"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/manchester-city/startseite/verein/281" id="281"><img alt="Manchester City" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/281.png?lm=1467356331" title=" "/></a></td><td class="rechts hauptlink"><b>£49.50m</b><span class="icons_sprite green-arrow-ten" title="£45.00m"> </span></td></tr>
<tr class="even">
<td class="zentriert">4</td><td class=""><table class="inline-table"><tr><td rowspan="2"><img alt="Reece James" class="bilderrahmen-fixed" src="https://img.a.transfermarkt.technology/portrait/small/472423-1569484519.png?lm=1" title="Reece James"/></td><td class="hauptlink"><a class="spielprofil_tooltip" href="/reece-james/profil/spieler/472423" id="472423" title="Reece James">Reece James</a></td></tr><tr><td>Right-Back</td></tr></table></td><td class="zentriert">21</td><td class="zentriert"><img alt="England" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/189.png?lm=1520611569" title="England"/></td><td class="zentriert"><a class="vereinprofil_tooltip" href="/fc-chelsea/startseite/verein/631" id="631"><img alt="Chelsea FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/631.png?lm=1628160548" title=" "/></a></td><td class="rechts hauptlink"><b>£40.50m</b><span class="icons_sprite green-arrow-ten" title="£36.00m"> </span></td></tr>
<tr class="odd">
<tbody>'''.replace('< ', '<')

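This should do it. Iterate over the table rows and take only the first match of each selector inside a row:
from bs4 import BeautifulSoup

players = {}
soup = BeautifulSoup(html, 'lxml')
for el in soup.tbody.children:
    # Skip whitespace text nodes between the rows
    if el.name != 'tr':
        continue
    # select_one returns the first match in this row, or None
    name = el.select_one('.spielprofil_tooltip')
    country = el.select_one('.flaggenrahmen')
    if name and country:
        players[name.text] = [country['title']]
print(players)
Output:
{'Trent Alexander-Arnold': ['England'], 'Achraf Hakimi': ['Morocco'], 'João Cancelo': ['Portugal'], 'Reece James': ['England']}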
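A possible way to see why the row-wise approach avoids the misalignment: zip pairs the n-th link with the n-th flag across the whole document, so a second flag in one row shifts every later pairing, whereas select_one only returns the first match inside the row it is called on. The sketch below is only an illustration. It uses a trimmed, made-up two-row sample (the names "Player A" and "Player B" are hypothetical) and the built-in html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical sample: the second row carries two flags.
sample = """
<table>
  <tr><td><a class="spielprofil_tooltip" href="/a">Player A</a></td>
      <td><img class="flaggenrahmen" alt="England"/></td></tr>
  <tr><td><a class="spielprofil_tooltip" href="/b">Player B</a></td>
      <td><img class="flaggenrahmen" alt="Morocco"/><br/>
          <img class="flaggenrahmen" alt="Spain"/></td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
first_flags = {}
for row in soup.find_all("tr"):
    name = row.select_one("a.spielprofil_tooltip")
    flag = row.select_one("img.flaggenrahmen")  # first flag in this row only
    if name is not None and flag is not None:
        first_flags[name.get_text(strip=True)] = [flag["alt"]]

print(first_flags)
# {'Player A': ['England'], 'Player B': ['Morocco']}
```

The second flag ("Spain") is never looked at, which is the "skip" behaviour the question asks for.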

Related

Getting content of multiple tags with beautifulsoup in python

I have an HTML template like the one below:
<tr>
<td width="45">
<p style="text-align: center;"><strong>STT</strong></p>
</td>
<td width="204">
<p style="text-align: center;"><strong>Tên bệnh viện</strong></p>
</td>
<td width="364">
<p style="text-align: center;"><strong>Địa chỉ</strong></p>
</td>
</tr>,
<tr>
<td width="45"><strong> </strong>
<p><strong>1</strong></p>
</td>
<td width="204"><strong> </strong>
<h3><span id="list hospital"><strong> ABC HOSPITAL</strong></span></h3>
</td>
<td width="364">
<img alt="abc hospital" class="aligncenter size-full wp-image-5549" height="470" sizes="(max-width: 705px) 100vw, 705px" src="https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t.jpg" srcset="https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t.jpg 705w, https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t-696x464.jpg 696w, https://suckhoe2t.net/wp-content/uploads/2017/11/benh-vien-an-binh-suckhoe2t-630x420.jpg 630w" width="705"/>
<p><iframe allowfullscreen="allowfullscreen" frameborder="0" height="450" src="https://www.google.com/maps/embed?pb=!1m18!221m12!1m3!1d3919.743463626014!2d106.66938211450157!3d10.75424379233658!2m3!1f0!2f0!3f0!3m2!1i1024!2i768!4f13.1!3m3!1m2!1s0x31752efc4039dee3%3A0x9157c2008d49be79!2sAn+Binh+Hospital!5e0!3m2!1sen!2s!4v1557202875759!5m2!1sen!2s" style="border: 0;" width="600"></iframe></p>
<ul>
<li>address: 1345 Golden View , LA</li>
<li>phonenumber: 3923 4260</li>
<li>Email: abc#hospital.com</li>
</ul>
<ul>
<li>Website: xxxxxxxxx</li>
I would like to have an output like this:
ABC HOSPITAL
address: 1345 Golden View , LA
phonenumber: 3923 4260
Email: abc#hospital.com
Because there are many li tags, I don't know how to select exactly the fields I want. Could you please help with this?
My code is below:
from bs4 import BeautifulSoup

res = '''html code above'''
soup = BeautifulSoup(res, 'html.parser')
data = soup.find_all('tr')
for temp in data:
    each = temp.find('h3')
    print(each)
Output I got:
None
<h3><span id="list hospital"><strong> ABC HOSPITAL</strong></span></h3>
This should work:
soup = BeautifulSoup(res, 'html.parser')
data = soup.find_all('tr')
accepted_li = ('address', 'phonenumber', 'email')  # prefixes of the "li" fields you want to keep
for tr in data:
    hospital_span = tr.find('span', {'id': 'list hospital'})  # span holding the hospital name
    if hospital_span is not None:
        print(hospital_span.find('strong').text.strip())
        for li in tr.find_all('li'):  # iterate over every li in this row
            if li.text.lower().startswith(accepted_li):  # True if the text starts with any prefix in the tuple
                print(li.text)
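The filter above relies on the fact that str.startswith accepts a tuple of prefixes. A minimal sketch of just that step, using plain strings that stand in for the li texts from the question (no parsing involved):

```python
# Stand-ins for li.text, copied from the sample in the question.
li_texts = [
    "address: 1345 Golden View , LA",
    "phonenumber: 3923 4260",
    "Email: abc#hospital.com",
    "Website: xxxxxxxxx",
]
accepted = ("address", "phonenumber", "email")

# str.startswith returns True if the string starts with ANY prefix in the tuple.
# Lower-casing first makes "Email" match the "email" prefix.
kept = [text for text in li_texts if text.lower().startswith(accepted)]
print(kept)
```

Only the "Website" entry is dropped, because none of the accepted prefixes matches it.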

How to store the texts of multiple p tags in a single space-delimited variable with BeautifulSoup in Python

How can I store the texts from multiple HTML p tags in a single variable, delimited by spaces, with BeautifulSoup in the following example? I'm brand new to Python. Thank you!
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    value = value.text
    print(value)
In the print call itself you can pass end="," as a parameter to print everything on one line:
from bs4 import BeautifulSoup
html= """<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
"""
soup = BeautifulSoup(html, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
for value in values:
    print(value.text, end=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%,
Or you can collect the texts in a list and print them on one line:
lst=[i.get_text(strip=True) for i in values]
print(*lst,sep=",")
Output:
+1.30%,-1.33%,+1.58%,+1.61%
To store the result in a single variable:
x=",".join(lst)
print(x)
Output:
+1.30%,-1.33%,+1.58%,+1.61%
You can do it like this, using string concatenation:
from bs4 import BeautifulSoup
HTML = '''
<tr>
<td class="up">
<p class="pie_chart_val">+1.30%</p>
</td>
<td class="down">
<p class="pie_chart_val">-1.33%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.58%</p>
</td>
<td class="up">
<p class="pie_chart_val">+1.61%</p>
</td>
</tr>
'''
soup = BeautifulSoup(HTML, 'lxml')
values = soup.find_all('p', class_="pie_chart_val")
ans = ''
for value in values:
    ans += value.text.strip() + ' '
print(ans)
ans is a string containing the space-separated texts of the <p> tags (note the trailing space):
+1.30% -1.33% +1.58% +1.61%
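Since the question asks for a space-delimited string, str.join with a space as the separator is another option, and it places the separator only between items. A minimal sketch, assuming the texts have already been extracted; the list below is a stand-in for [p.get_text(strip=True) for p in values]:

```python
# Stand-in for the texts extracted from the <p class="pie_chart_val"> tags
texts = ["+1.30%", "-1.33%", "+1.58%", "+1.61%"]

# join inserts the separator between items only, so no trailing space is left
joined = " ".join(texts)
print(joined)
# +1.30% -1.33% +1.58% +1.61%
```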

Cannot access cell text inside an HTML table (Selenium, Python)

I have been trying for a few hours now, in vain, to extract the text from a specific cell in the following table:
<tbody class="table-body">
<tr class=" " data-blah="25293454534534513" data-currency="1">
<td class="action-cell no-sort">
</td>
<td class="col1 id">
<a class="alert-ico " data-tooltip=""></a>
<a class="isin-btn " data-tooltip="" id="isin" data-portfolioid="2423424" data-status="0">US3</a>
</td>
<td class="col2 name hide">4%</td>
<td class="col9 colNo.9" title="Bid: 101.23; Mid: 101.28; Ask: 101.33;
Liquidity Score: -*/5*; Merit: -/4;" data-bprice="101.28" data-uprice="101.28">101.28<span class="estim-star">*</span></td>
<td class="col10 price_change" nowrap="" data-sort="0.02"><span class="positive-change">0.02%</span><span class="change-sign positive-change">↑</span></td>
<td class="col11 yield yield-val" title="" data-sort="3.33" data-byield="3.33" data-uyield="3.34%">3.33%</td>
<td class="col12 purchase_price" data-bprice="101.28" data-uprice="101.28" data-sort="101.28"><input type="text" name="purchase_price" class="positive-num-only default" value="101.28"></td>
<td class="col13 margin_bond" data-bond="sec" data-sort="0"><input type="text" name="margin_bond" maxlength="3" class="positive-num-only default" value="0"></td>
</tr>
</tbody>
I'm trying to extract the text from the 'Price Change' column (col 10) using lxml.html, which allows me to extract data from big tables in a matter of seconds. I'm doing it like this:
import lxml.html
import pandas as pd
root = lxml.html.fromstring(self.driver.page_source)
data = []
for row in root.xpath('.//*[@id=\'main\']/div[5]/div[2]/table/tbody/tr'):
    cells = row.xpath('.//td/text()')
This extracts the whole table, with the single exception of column 10 ('Price Change'). I tried the following, and each attempt returned an empty string (""):
row.xpath('.//tr[1]/td[11][@data-sort]/text()')
row.xpath(".//*[@id='main']/div[5]/div[2]/table/tbody/tr[1]/td[11]/span/text()")
row.xpath(".//*[@id='main']/div[5]/div[2]/table/tbody/tr[1]/td[11]/text()")
I don't want to extract the text using a WebElement, only with the lxml.html library.
Thank you!
There are two problems:
The row has 8 td elements in total, not 11, and the td you are interested in is the 5th, not the 11th.
The td you are interested in contains two span elements, and you are not specifying which span you want.
This code works:
html_code = """
<tbody class="table-body">
<tr class=" " data-blah="25293454534534513" data-currency="1">
<td class="action-cell no-sort">
</td>
<td class="col1 id">
<a class="alert-ico " data-tooltip=""></a>
<a class="isin-btn " data-tooltip="" id="isin" data-portfolioid="2423424" data-status="0">US3</a>
</td>
<td class="col2 name hide">4%</td>
<td class="col9 colNo.9" title="Bid: 101.23; Mid: 101.28; Ask: 101.33;
Liquidity Score: -*/5*; Merit: -/4;" data-bprice="101.28" data-uprice="101.28">101.28<span class="estim-star">*</span></td>
<td class="col10 price_change" nowrap="" data-sort="0.02">
<span class="positive-change">0.02%</span>
<span class="change-sign positive-change">↑</span></td>
<td class="col11 yield yield-val" title="" data-sort="3.33" data-byield="3.33" data-uyield="3.34%">3.33%</td>
<td class="col12 purchase_price" data-bprice="101.28" data-uprice="101.28" data-sort="101.28"><input type="text" name="purchase_price" class="positive-num-only default" value="101.28"></td>
<td class="col13 margin_bond" data-bond="sec" data-sort="0"><input type="text" name="margin_bond" maxlength="3" class="positive-num-only default" value="0"></td>
</tr>
</tbody>
"""
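from lxml import html

tree = html.fromstring(html_code)
# Select the price-change cell by its class, then take its first <span>
print("price change is %s" % tree.xpath(".//td[contains(@class,'col10')]/span[1]/text()")[0])
# Or select it by position: it is the 5th <td> in the row
print("price change is %s" % tree.xpath(".//td[5]/span[1]/text()")[0])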
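To illustrate the position-counting point without any third-party dependency, here is a rough sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml. This is a swapped-in technique, not the approach from the question: ElementTree supports only a small XPath subset (no contains()), so the class is matched exactly, and the row is a simplified, well-formed stand-in with the same cell order as the row in the question.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the row in the question (same cell order).
row = ET.fromstring(
    "<tr>"
    "<td class='action-cell'/>"
    "<td class='col1'>US3</td>"
    "<td class='col2'>4%</td>"
    "<td class='col9'>101.28</td>"
    "<td class='col10'><span>0.02%</span><span>up</span></td>"
    "<td class='col11'>3.33%</td>"
    "<td class='col12'/>"
    "<td class='col13'/>"
    "</tr>"
)

cell_count = len(row.findall("td"))
# Positions are 1-based and count actual <td> elements, not the colNN class name.
missing = row.find("td[11]")                      # there is no 11th cell
by_position = row.find("td[5]/span[1]").text      # 5th cell, first span
by_class = row.find("td[@class='col10']/span[1]").text

print(cell_count, missing, by_position, by_class)
```

The lookup for td[11] finds nothing, which is consistent with the empty results reported in the question, while the 5th cell holds the value.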

Python - Beautiful Soup - Remove Tags

I have extracted the below web based data as a list using Beautiful Soup. On the original website it's a table of numbers:
[<td class="right">113</td>, <td class="right">
89 </td>, <td class="right last">
<b>117</b> </td>, <td class="right">113</td>, <td class="right">
85 </td>, <td class="right last">
<b>114</b> </td>, <td class="right">100</td>, <td class="right">
56 </td>, <td class="right last">
<b>84</b> </td>]
What's the most efficient way to create a list of numbers from this data? Ideally I'd like to extract the tags using Beautiful Soup but I can't figure out how to do this from the documentation.
My original Soup code is:
print(soup.find_all('td', 'right'))  # printing this produces the above data
numbers_data = []  # my attempt to extract tags
for e in soup.find_all('td', 'right'):
    numbers_data.append(e.extract())
print(numbers_data)
Both return the same list.
Take the text of each tag and convert it to an integer; int() ignores the surrounding whitespace:
numbers_data = [int(e.text) for e in soup.find_all('td', 'right')]
print(numbers_data)
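One caveat worth sketching: int() tolerates leading and trailing whitespace such as "\n89 ", but it raises ValueError for a cell that is not a plain integer. The snippet below is an assumption-laden illustration, not part of the original answer. The strings stand in for e.text, and "n/a" is a hypothetical non-numeric cell that does not appear in the question's data.

```python
# Stand-ins for e.text; "n/a" is a hypothetical non-numeric cell.
raw_texts = ["113", "\n89 ", "\n117 ", "n/a"]

numbers = []
for raw in raw_texts:
    cleaned = raw.strip()
    if cleaned.isdigit():  # keep only cells made up of digits
        numbers.append(int(cleaned))

print(numbers)
# [113, 89, 117]
```

Note that isdigit() also rejects negative numbers and decimals, so this check would need adjusting for tables that contain them.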

How to find elements that match specific conditions in Beautiful Soup

I am learning and trying both Python (2.7) and Beautiful Soup (3.2.0). I already got some help here with my first problems (Beautiful Soup throws `IndexError`)
This is the Python code so far:
# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup
# URL to scrape; open it with urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)
# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)
# From the source HTML page, search and store all <div class="date">...</div> and its content
datesDiv = soup.findAll('div', { "class" : "date" })
# Loop through the tags and store only the needed information, being the actual date
dates = [tag.contents[0] for tag in datesDiv]
# From the source HTML page, search and store all <span class="time">...</span> and its content
timesSpan = soup.findAll('span', { "class" : "time" })
# Loop through the tags and store only the needed information, being the actual times
times = [tag.contents[0] for tag in timesSpan]
# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tags and store only the needed information, being the home team
# if tag.contents[1] != 'Dutch KNVB Beker' - Do a direct test if output is needed or not
hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
# From the source HTML page, search and store all <td class="away">..</td> and its content
# [1:] at the end means slice off the first one found
awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
# Loop through the tags and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]
# From the source HTML page, search and store all <a class="broadcast" href="...">..</a> and its content
broadcastsA = soup.findAll('a', { "class" : "broadcast" })
# Loop through the tags and store only the needed information, being the broadcast URL, where we can find the streams
broadcasts = [tag['href'] for tag in broadcastsA]
The problem is that the resulting lists do not have matching lengths:
len(dates) #9, should be 6
len(times) #18, should be 12
len(hometeams) #6, is correct
len(awayteams) #6, is correct
len(broadcasts) #9, should be 6
To get the dates list I search with soup.findAll('div', { "class" : "date" }), which obviously gives me all the <div> elements with class="date". But I only need a date when the same <tr> also contains a <td> element with class="away".
See the following part of the HTML that I am scraping:
<tr class="odd">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380054900">Tuesday, September 24</div> <!-- This date is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380054900">22:35</span> - <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
<span class="time" rel="1380058500">23:35</span> <!-- This time is not needed, because within this <tr> there is no <td class="away"> -->
</td>
<td class="home" colspan="3">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>Dutch KNVB Beker<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221554&part=sports">Live</a> <!-- This href is not needed, because within this <tr> there is no <td class="away"> -->
</td>
</tr>
<tr class="even">
<td class="logo">
<img src="/gfx/disciplines/football.gif" alt="football"/>
</td>
<td>
Dutch Cup
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/>
</td>
<td>
<div class="date" rel="1380127500">Wednesday, September 25</div> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380127500">18:45</span> - <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
<span class="time" rel="1380134700">20:45</span> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
<td class="home">
<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>PSV<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3"/>
</td>
<td>vs.</td>
<td class="away">
<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428"/>Stormvogels Telstar<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>
</td>
<td class="broadcast">
<a class="broadcast" href="/broadcast.php?matchid=221555&part=sports">Live</a> <!-- This href we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> -->
</td>
</tr>
How about rethinking the way you scrape the data? You have a table of matches, so just iterate over its rows:
for tr in soup.findAll('tr', {'class': ['odd', 'even']}):
    home_team = tr.find('td', {'class': 'home'}).text
    if home_team == 'Dutch KNVB Beker':
        continue
    away_team = tr.find('td', {'class': 'away'}).text
    date = ' - '.join([span.text for span in tr.findAll('span', {'class': 'time'})])
    broadcast = tr.find('a', {'class': 'broadcast'})['href']
    print home_team, away_team, date, broadcast
This prints 5 rows:
RKC Waalwijk Heracles 20:45 - 22:45 /broadcast.php?matchid=221553&part=sports
PSV Stormvogels Telstar 18:45 - 20:45 /broadcast.php?matchid=221555&part=sports
Ajax FC Volendam 20:45 - 22:45 /broadcast.php?matchid=221556&part=sports
SC Heerenveen FC Twente 18:45 - 20:45 /broadcast.php?matchid=221558&part=sports
Feyenoord FC Dordrecht 20:45 - 22:45 /broadcast.php?matchid=221559&part=sports
Then you can collect the results into a list of dicts.
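Collecting the rows into a list of dicts could look roughly like the sketch below. It rests on several assumptions: it uses Beautiful Soup 4 (bs4) with the built-in html.parser rather than the Beautiful Soup 3 API from the question, the sample is a trimmed, hand-shortened version of the two rows shown above, and the dict keys are made up for illustration. It also skips a row when it has no td with class "away", which is the condition stated in the question, instead of comparing the home team name.

```python
from bs4 import BeautifulSoup

# Trimmed, hand-shortened version of the two rows in the question.
sample = """
<table>
<tr class="odd">
  <td><div class="date">Tuesday, September 24</div>
      <span class="time">22:35</span> - <span class="time">23:35</span></td>
  <td class="home" colspan="3">Dutch KNVB Beker</td>
  <td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221554">Live</a></td>
</tr>
<tr class="even">
  <td><div class="date">Wednesday, September 25</div>
      <span class="time">18:45</span> - <span class="time">20:45</span></td>
  <td class="home">PSV</td>
  <td>vs.</td>
  <td class="away">Stormvogels Telstar</td>
  <td class="broadcast"><a class="broadcast" href="/broadcast.php?matchid=221555">Live</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
matches = []
for tr in soup.find_all("tr", class_=["odd", "even"]):
    away = tr.find("td", class_="away")
    if away is None:  # no away team, so this is not a complete match row
        continue
    matches.append({
        "date": tr.find("div", class_="date").get_text(strip=True),
        "times": [s.get_text(strip=True) for s in tr.find_all("span", class_="time")],
        "home": tr.find("td", class_="home").get_text(strip=True),
        "away": away.get_text(strip=True),
        "broadcast": tr.find("a", class_="broadcast")["href"],
    })

print(matches)
```

Because every field is read from the same tr, the date, times, teams, and broadcast link of a match always stay together, so the length mismatch between separate lists cannot occur.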
