So, here's my code:
link = "https://nookipedia.com/w/api.php?action=query&list=categorymembers&cmtitle=Category:Insect&cmlimit=500&format=json"
async with aiohttp.get(link) as t:
    result = await t.json()
    foundCheck = False
    for list in result["query"]["categorymembers"]:
        print(list["title"])
        if bug.lower() == list["title"].lower():
            print(bug)
            await self.bot.say("{} is a real bug".format(bug.title()))
            bug2 = bug.replace(" ", "_")
            url = "https://nookipedia.com/wiki/{}".format(bug2)
            await self.bot.say(url)
            async with aiohttp.get(url) as response:
                soupObject = BeautifulSoup(await response.text(), "html.parser")
                try:
                    info = soupObject.find(id="Infobox-bug").tr.td.get_text()
                    await self.bot.say("{}".format(info))
                except:
                    await self.bot.say("Can't get the content from {}".format(url))
            foundCheck = True
            return
    if not foundCheck:
        await self.bot.say("That bug does not exist")
        return
    else:
        await self.bot.say("Error")
And here's the HTML I'm trying to parse:
<table id="Infobox-bug" align="right" style="background: #adff2f; margin-left: 10px; margin-bottom: 10px; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px; border: 3px solid #9acd32; width: 25%">
<tr align="center">
<td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr align="center">
<td style="background: #caecc9; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px;" colspan="2"> <img alt="Pill Bug Picture.jpg" src="/w/images/b/bb/Pill_Bug_Picture.jpg" width="199" height="186" />
</td></tr>
<tr>
<th style="background: #86df2d; border-top-left-radius: 10px; -moz-border-radius-topleft: 10px; -webkit-border-top-left-radius: 10px; -khtml-border-top-left-radius: 10px; -icab-border-top-left-radius: 10px; -o-border-top-left-radius: 10px;" align="right"> Scientific name
</th>
<td style="background:#ffffff; border-top-right-radius: 10px; -moz-border-radius-topright: 10px; -webkit-border-top-right-radius: 10px; -khtml-border-top-right-radius: 10px; -icab-border-top-right-radius: 10px; -o-border-top-right-radius: 10px;" align="left"> <i>Armadillidium vulgare</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Family
</th>
<td style="background:#ffffff" align="left"> <i>Armadillidiidae - Terrestrial Custaceans</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of year
</th>
<td style="background:#ffffff" align="left"> All year
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of day
</th>
<td style="background:#ffffff" align="left"> All day
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Location
</th>
<td style="background:#ffffff" align="left"> Under rocks
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Size
</th>
<td style="background:#ffffff" align="left"> 2 mm
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Rarity
</th>
<td style="background:#ffffff" align="left"> Common
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Selling price
</th>
<td style="background:#ffffff" align="left"> 250 Bells
</td></tr>
<tr>
<th style="background: #86df2d; border-bottom-left-radius: 10px; -moz-border-radius-bottomleft: 10px; -webkit-border-bottom-left-radius: 10px; -khtml-border-bottom-left-radius: 10px; -icab-border-bottom-left-radius: 10px; -o-border-bottom-left-radius: 10px;" align="right"> Appearances
</th>
<td style="background:#ffffff; border-bottom-right-radius: 10px; -moz-border-radius-bottomright: 10px; -webkit-border-bottom-right-radius: 10px; -khtml-border-bottom-right-radius: 10px; -icab-border-bottom-right-radius: 10px; -o-border-bottom-right-radius: 10px;" align="left"> <i>Doubutsu no Mori</i>,<br /><i>Animal Crossing</i>,<br /><i>Animal Crossing: Wild World</i>,<br /><i>Animal Crossing: City Folk</i>,<br /><i>Animal Crossing: New Leaf</i>
</td></tr></table>
So, basically, I got "Pill Bug" (the info variable) as its own string, but I'm not sure how to get everything after it (the text in the remaining tr and td elements) without getting "Pill Bug" again. How can I get each piece of text as its own string?
Thank you so much for the help.
BeautifulSoup has many methods for getting tags and their attributes:
soup.find(args)
soup.find_all(args)
soup.select(CSS_selection)
tag.get(param) or tag.get(param, default) or tag[param]
tag.text or tag.get_text()
tag.name
etc.
find() and find_all() accept several kinds of arguments, so read the BeautifulSoup documentation for more.
Example:
html = '''<table id="Infobox-bug" align="right" style="background: #adff2f; margin-left: 10px; margin-bottom: 10px; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px; border: 3px solid #9acd32; width: 25%">
<tr align="center">
<td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr align="center">
<td style="background: #caecc9; border-radius: 10px; -moz-border-radius: 10px; -webkit-border-radius: 10px; -khtml-border-radius: 10px; -icab-border-radius: 10px; -o-border-radius: 10px;" colspan="2"> <img alt="Pill Bug Picture.jpg" src="/w/images/b/bb/Pill_Bug_Picture.jpg" width="199" height="186" />
</td></tr>
<tr>
<th style="background: #86df2d; border-top-left-radius: 10px; -moz-border-radius-topleft: 10px; -webkit-border-top-left-radius: 10px; -khtml-border-top-left-radius: 10px; -icab-border-top-left-radius: 10px; -o-border-top-left-radius: 10px;" align="right"> Scientific name
</th>
<td style="background:#ffffff; border-top-right-radius: 10px; -moz-border-radius-topright: 10px; -webkit-border-top-right-radius: 10px; -khtml-border-top-right-radius: 10px; -icab-border-top-right-radius: 10px; -o-border-top-right-radius: 10px;" align="left"> <i>Armadillidium vulgare</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Family
</th>
<td style="background:#ffffff" align="left"> <i>Armadillidiidae - Terrestrial Custaceans</i>
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of year
</th>
<td style="background:#ffffff" align="left"> All year
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Time of day
</th>
<td style="background:#ffffff" align="left"> All day
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Location
</th>
<td style="background:#ffffff" align="left"> Under rocks
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Size
</th>
<td style="background:#ffffff" align="left"> 2 mm
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Rarity
</th>
<td style="background:#ffffff" align="left"> Common
</td></tr>
<tr>
<th style="background: #86df2d" align="right"> Selling price
</th>
<td style="background:#ffffff" align="left"> 250 Bells
</td></tr>
<tr>
<th style="background: #86df2d; border-bottom-left-radius: 10px; -moz-border-radius-bottomleft: 10px; -webkit-border-bottom-left-radius: 10px; -khtml-border-bottom-left-radius: 10px; -icab-border-bottom-left-radius: 10px; -o-border-bottom-left-radius: 10px;" align="right"> Appearances
</th>
<td style="background:#ffffff; border-bottom-right-radius: 10px; -moz-border-radius-bottomright: 10px; -webkit-border-bottom-right-radius: 10px; -khtml-border-bottom-right-radius: 10px; -icab-border-bottom-right-radius: 10px; -o-border-bottom-right-radius: 10px;" align="left"> <i>Doubutsu no Mori</i>,<br /><i>Animal Crossing</i>,<br /><i>Animal Crossing: Wild World</i>,<br /><i>Animal Crossing: City Folk</i>,<br /><i>Animal Crossing: New Leaf</i>
</td></tr></table>'''
from bs4 import BeautifulSoup
#import requests
#r = requests.get('https://nookipedia.com/wiki/Pill_Bug')
#html = r.content
soup = BeautifulSoup(html, "html.parser")
tds = soup.find(id="Infobox-bug").find_all('td')
print('--- all td text ---')
for x in tds:
    print('>', x.get_text().strip())
    # or
    print('>', x.text.strip())
print('--- one td text ---')
print(tds[0].text.strip())
print('--- one td a href ---')
print(tds[1].find('a').get('href'))
# or
print(tds[1].find('a')['href'])
print('--- all a href (using CSS selector) ---')
for a in soup.select('#Infobox-bug td a'):
    print(a['href'])
print('--- all td and th ---')
for tt in soup.find(id='Infobox-bug').find_all({'td', 'th'}):
    if tt.name == 'th':
        print('[', tt.name, ']', tt.text.strip(), end=" --> ")
    elif tt.name == 'td':
        a = tt.find('a')
        if a:
            a = a['href']
        else:
            a = 'None'
        print('[', tt.name, ']', tt.text.strip(), '(', a, ')')
Result:
--- all td text ---
> Pill Bug
> Pill Bug
>
>
> Armadillidium vulgare
> Armadillidium vulgare
> Armadillidiidae - Terrestrial Custaceans
> Armadillidiidae - Terrestrial Custaceans
> All year
> All year
> All day
> All day
> Under rocks
> Under rocks
> 2 mm
> 2 mm
> Common
> Common
> 250 Bells
> 250 Bells
> Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf
> Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf
--- one td text ---
Pill Bug
--- one td a href ---
/wiki/File:Pill_Bug_Picture.jpg
/wiki/File:Pill_Bug_Picture.jpg
--- all a href (using CSS selector) ---
/wiki/File:Pill_Bug_Picture.jpg
/wiki/Bells
/wiki/Doubutsu_no_Mori_(game)
/wiki/Animal_Crossing_(GCN)
/wiki/Animal_Crossing:_Wild_World
/wiki/Animal_Crossing:_City_Folk
/wiki/Animal_Crossing:_New_Leaf
--- all td and th ---
[ td ] Pill Bug ( None )
[ td ] ( /wiki/File:Pill_Bug_Picture.jpg )
[ th ] Scientific name --> [ td ] Armadillidium vulgare ( None )
[ th ] Family --> [ td ] Armadillidiidae - Terrestrial Custaceans ( None )
[ th ] Time of year --> [ td ] All year ( None )
[ th ] Time of day --> [ td ] All day ( None )
[ th ] Location --> [ td ] Under rocks ( None )
[ th ] Size --> [ td ] 2 mm ( None )
[ th ] Rarity --> [ td ] Common ( None )
[ th ] Selling price --> [ td ] 250 Bells ( /wiki/Bells )
[ th ] Appearances --> [ td ] Doubutsu no Mori,Animal Crossing,Animal Crossing: Wild World,Animal Crossing: City Folk,Animal Crossing: New Leaf ( /wiki/Doubutsu_no_Mori_(game) )
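To get each value as its own string, the th/td pairing shown in the last section can also be collected into a dictionary. This is only a minimal sketch: it assumes every labelled row holds exactly one th and one td, it uses a shortened stand-in for the infobox HTML, and infobox_to_dict is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the infobox HTML from the question.
html = '''<table id="Infobox-bug">
<tr><td colspan="2"> <big><big><b>Pill Bug</b></big></big>
</td></tr>
<tr><th> Scientific name
</th>
<td> <i>Armadillidium vulgare</i>
</td></tr>
<tr><th> Selling price
</th>
<td> 250 Bells
</td></tr></table>'''


def infobox_to_dict(soup):
    """Map each <th> label to the text of the <td> in the same row."""
    table = soup.find(id="Infobox-bug")
    fields = {}
    for row in table.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        # Rows without a <th> (title, picture) carry no label, so skip them.
        if label is None or value is None:
            continue
        fields[label.get_text(strip=True)] = value.get_text(strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
fields = infobox_to_dict(soup)
print(fields["Scientific name"])
print(fields["Selling price"])
```

The title ("Pill Bug") is skipped because its row has no th, so it can still be read separately with tds[0] as before.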
I would like to get a table's HTML code from a website with BeautifulSoup, and I need to add an attribute to the first td item. I have:
try:
    description=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
except:
    description=None
The selected description's code:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I would like to add a colspan attribute to the first <td> and keep changes in the description variable:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="" colspan="4">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>
I tried:
hun=BeautifulSoup(f,'html.parser')
try:
    description2=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    description = soup.td['colspan'] = 4
...but it is not working, the output is "4", instead of the table's html code with attribute added.
I found it, it must be like this:
hun=BeautifulSoup(f,'html.parser')
try:
    description2=hun.select('#description > div.tab-pane-body > div > div > div > table')[0]
    description2+="<style type=text/css>td:first-child { font-weight: bold; width: 5%; } td:nth-child(2) { width: 380px } td:nth-child(3) { font-weight: bold; }</style>"
    soup = BeautifulSoup(description2, 'html.parser')
    soup.td['colspan'] = 4
    description = soup
Just select the first <td> and add attribute colspan:
from bs4 import BeautifulSoup
html_doc = '''\
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td valign="top" width="704" style="">
<p><span>Short description </span></p>
</td>
</tr>
<tr>
<td valign="top" width="123" style="">
<p><span>Additional data</span></p>
</td>
</tr>
</tbody>
</table>'''
soup = BeautifulSoup(html_doc, 'html.parser')
soup.td['colspan'] = 4
print(soup.prettify())
Prints:
<table border="0" cellpadding="0" cellspacing="0" width="704">
<tbody>
<tr>
<td colspan="4" style="" valign="top" width="704">
<p>
<span>
Short description
</span>
</p>
</td>
</tr>
<tr>
<td style="" valign="top" width="123">
<p>
<span>
Additional data
</span>
</p>
</td>
</tr>
</tbody>
</table>
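When the table is selected out of a larger page, as in the question, the attribute can be set on the selected Tag and the result turned back into HTML with str(). The following is a minimal sketch under assumptions: the page and the "#description table" selector are simplified stand-ins, not the real page structure.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the real selector in the question is longer.
page = '''<div id="description">
<table width="704"><tbody>
<tr><td valign="top">Short description</td></tr>
<tr><td valign="top">Additional data</td></tr>
</tbody></table>
</div>'''

soup = BeautifulSoup(page, "html.parser")
table = soup.select_one("#description table")

description = None
if table is not None:
    # Assigning to a Tag like a dict adds or overwrites the attribute in place.
    table.td["colspan"] = "4"
    # str() serialises the modified Tag back to HTML; extra markup such as a
    # <style> block can then be appended as plain text.
    description = str(table) + "<style>td:first-child { font-weight: bold; }</style>"

print(description)
```

Keeping the assignment (table.td["colspan"] = "4") on its own line avoids the original problem, where the chained assignment stored the value 4 in description instead of the table.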
I get a transaction email from my bank every time I make a transaction. It comes as HTML. I want to be able to extract certain information, like the confirmation number, date, amount, etc., from the HTML content.
I tried to use regex extraction and also BeautifulSoup but the results are ugly and unwieldy. For example, the html code doesn't come with any useful attributes so it's not easy to do a find() with attributes filter. See snippet of html code below:
<table style="border: 1px solid black; border-collapse: collapse">
<tbody>
<tr>
<td colspan="2" style="border:1px solid black;padding:3px">
<center>
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
<b>
Transfer Money Details
</b>
</font>
</center>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Confirmation Number
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
1594379907846
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transaction Date and Time
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Friday, Jul 10 2020; 07:18:54 PM (GMT +8)
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transfer From
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
XXXX-XXX-247 (PESO SAVINGS)
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Transfer To
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
XXXX-XXX-545
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Amount
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 1,200.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Service Fee
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 0.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Total Amount
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
PHP 1,200.00
</font>
</td>
</tr>
<tr>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Notes
</font>
</td>
<td style="border: 1px solid black; padding: 3px">
<font color="#000000" face="arial" style="FONT-SIZE:10pt">
Mask filters
</font>
</td>
</tr>
</tbody>
</table>
I want to be able to have a dataframe or a dictionary that looks like this:
{
'Confirmation Number': '1594379907846',
'Transaction Date and Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'Transfer From': 'XXXX-XXX-247 (PESO SAVINGS)'
... and so on
}
The code I have:
def get_content(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    rows = soup.find_all('tr')
    content_ls = []
    trans_details = {}
    for row in rows:
        cells = row.findChildren('td')
        for cell in cells:
            content_ls.append(cell.getText())
    trans_details['Confirmation Number'] = content_ls[2]
    trans_details['Date_Time'] = content_ls[4]
    trans_details['From'] = content_ls[6]
    trans_details['To'] = content_ls[8]
    trans_details['Amount'] = content_ls[10]
    trans_details['Notes'] = content_ls[12]
    return trans_details
produces this dictionary:
{'Amount': 'PHP 1,200.00',
'Confirmation Number': '1594379907846',
'Date_Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'From': 'XXXX-XXX-247 (PESO SAVINGS)',
'Notes': 'PHP 0.00',
'To': 'XXXX-XXX-545'}
Is there a more elegant and pythonic way of doing it?
Ultimately, I'd like to produce a DataFrame, with columns 'Confirmation Number', 'Transaction Date and Time', and so on.
Thanks
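You can use the lxml library, which lets you find elements with XPath.
Here is a function that extracts the information from the HTML you provided. The title row contains only one font element, so rows with fewer than two are skipped.
from lxml import etree
def parse(html):
    # etree.fromstring() expects well-formed markup, which the snippet above is
    root = etree.fromstring(html)
    trs = root.xpath("//tr")
    result = dict()
    for tr in trs:
        fonts = tr.xpath(".//font")
        if len(fonts) < 2:
            continue  # title row: no key/value pair
        key = fonts[0].text.strip()
        value = fonts[1].text.strip()
        result[key] = value
    return result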
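The same row-by-row idea also works with BeautifulSoup alone, which the question already uses. This is a minimal sketch, assuming every detail row has exactly two td cells (label, then value); the HTML is a shortened stand-in for the email and parse_details is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the bank email table.
html = '''<table><tbody>
<tr><td colspan="2"><center><font><b>Transfer Money Details</b></font></center></td></tr>
<tr><td><font>
  Confirmation Number
</font></td><td><font>
  1594379907846
</font></td></tr>
<tr><td><font>Amount</font></td><td><font>PHP 1,200.00</font></td></tr>
</tbody></table>'''


def parse_details(html_content):
    """Return {label: value} for every row that has exactly two cells."""
    soup = BeautifulSoup(html_content, "html.parser")
    details = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # e.g. the single-cell title row
        label, value = (cell.get_text(strip=True) for cell in cells)
        details[label] = value
    return details


details = parse_details(html)
print(details)
```

Because the keys come from the first cell of each row, no hard-coded list positions are needed. If pandas is available, pd.DataFrame([details]) should give a one-row DataFrame with the labels as columns.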
I have this piece of HTML from a table that I want to scrape:
<tr id="vsViewer1_dgMainView_dgMainView_ctl02" class="GridItem odd">
<td class=" ">
<a class="hlPopup" id="lbdgMainView$ctl02" name="lbdgMainView$ctl02" onclick="wrjl_test(this,'lbdgMainView$ctl02','746402:O9oY58XKE+w=:746402:746402')" onmouseover="this.className='HLPopupOver'" onmouseout="this.className='HLPopup'"></a>
<span class="HLPopup" id="lbldgMainView$ctl02" name="lbldgMainView$ctl02" onclick="wrjl_test(this,'lbldgMainView$ctl02','746402:O9oY58XKE+w=:746402:746402')"> Info </span>
</td>
<td align="center" class=" ">746402</td>
<td align="center" class=" ">Wyndham Orlando Resort International Drive</td>
<td align="center" class=" ">Interiano, Ana</td>
<td align="center" class=" ">Yes</td>
<td align="center" class=" ">7.32</td>
<td align="left" class=" ">
<table width="250" class="TextTableSmall" border="0">
<tbody>
<tr>
<td align="center" style="background-color: rgb(128, 128, 128); text-align: center; font-size: 8pt;">Date</td>
<td align="center" style="background-color: rgb(128, 128, 128); text-align: center; font-size: 8pt;">In</td>
<td align="center" style="background-color: rgb(128, 128, 128); text-align: center; font-size: 8pt;">Out</td>
<td align="center" style="background-color: rgb(128, 128, 128); text-align: center; font-size: 8pt;">Hours</td>
<td style="background-color: rgb(128, 128, 128); text-align: center; font-size: 8pt;">Shift</td>
</tr>
<tr>
<td style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">Thu 10/24/19</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">8:00am</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">1:20pm</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">5.33</td>
<td align="center" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">1
<br>FL ORL Wyndham Resort I Drive 18128 - Housekeeping
<br>Room Attendant
</td>
</tr>
<tr>
<td style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">Thu 10/24/19</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">1:39pm</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">3:38pm</td>
<td align="right" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">1.98</td>
<td align="center" style="background-color: rgb(204, 204, 153); text-align: left; font-size: 8pt;">1
<br>FL ORL Wyndham Resort I Drive 18128 - Housekeeping
<br>Room Attendant
</td>
</tr>
</tbody>
</table>
</td>
<td align="right" class=" ">12.25</td>
<td class=" ">9.0000</td>
<td align="center" class=" ">1</td>
<td align="center" class=" ">Housekeeper</td>
<td align="center" class=" ">HOUSEKEEPER</td>
<td align="center" class=" ">SE-FL-Orlando</td>
<td align="center" class=" ">Wyndham Hotel Group</td>
</tr>
I've done this:
from bs4 import BeautifulSoup
import requests
with open('vsShowViewTWO.html') as html_file:
    soup = BeautifulSoup(html_file,'lxml')
tbody = soup.find('tbody',id='thetbody')
table_rows=tbody.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
And the results are:
[' Info ', '746402', 'Resort International', 'Interiano, Ana', 'Yes', '7.32', 'DateInOutHoursShiftThu 10/24/198:00am1:20pm5.331Resort I Drive 18128 - HousekeepingRoom AttendantThu 10/24/191:39pm3:38pm1.981Resort I Drive 18128 - HousekeepingRoom Attendant', 'Date', 'In', 'Out', 'Hours', 'Shift', 'Thu 10/24/19', '8:00am', '1:20pm', '5.33', '1Resort I Drive 18128 - HousekeepingRoom Attendant', 'Thu 10/24/19', '1:39pm', '3:38pm', '1.98', '1 Resort I Drive 18128 - HousekeepingRoom Attendant', '12.25', '9.0000', '1', 'Housekeeper', 'HOUSEKEEPER', 'SE', 'Hotel Group']
But I don't need the whole row, just the name "Interiano, Ana" and the last "HOUSEKEEPER". I've been trying to index the row variable, with no luck.
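One possible approach is to search only the direct children of each row with recursive=False, so the cells of the nested table no longer shift the indexes. This is a minimal sketch, assuming the cell order of the row shown above (name at position 3, the upper-case job title at position 11), using a shortened stand-in for the page and html.parser rather than lxml.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page: one outer row with a nested table,
# wrapped in the tbody id used in the question.
html = '''<table><tbody id="thetbody">
<tr class="GridItem odd">
<td><span> Info </span></td>
<td>746402</td>
<td>Wyndham Orlando Resort International Drive</td>
<td>Interiano, Ana</td>
<td>Yes</td>
<td>7.32</td>
<td><table><tbody>
  <tr><td>Date</td><td>In</td><td>Out</td></tr>
  <tr><td>Thu 10/24/19</td><td>8:00am</td><td>1:20pm</td></tr>
</tbody></table></td>
<td>12.25</td>
<td>9.0000</td>
<td>1</td>
<td>Housekeeper</td>
<td>HOUSEKEEPER</td>
<td>SE-FL-Orlando</td>
<td>Wyndham Hotel Group</td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find("tbody", id="thetbody")

results = []
# recursive=False keeps the search to direct children, so rows and cells
# of the nested table are not mixed into the outer row.
for tr in tbody.find_all("tr", recursive=False):
    cells = tr.find_all("td", recursive=False)
    if len(cells) < 12:
        continue  # not a full data row
    name = cells[3].get_text(strip=True)
    job = cells[11].get_text(strip=True)
    results.append((name, job))

print(results)
```

If the column order on the real page differs, the two index values are the only part that needs adjusting.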
The available days have the class .calendarCellOpen:
table.calendario .calendarCellOpen input {
}
Here is the calendar CSS:
#calwrapper
{
min-height:230px;
margin-top:10px;
}
#calendar
{
float:left;
margin-left: 15px; /*Daniele 10-04-2014*/
}
span.calendario
{
display:block;
margin:0;
}
table.fasce
{
margin-left:20px;
}
table.fasce th
{
background-image: url( '../images/tab_body.png' );
background-repeat: repeat-x;
font-size:12px;
}
table.fasce tr
{
border-bottom: #f5f4e7 thin dotted;
}
table.calendario
{
border-top: 0px !important;
}
table.calendario, table.fasce
{
width: 300px;
background-color: White !important;
font-size: 15px;
border-right: #f5f4e7 1px solid !important;
border-left: #f5f4e7 1px solid !important;
border-bottom: #f5f4e7 1px solid !important;
}
table.calendario td, table.fasce td
{
text-align:center;
}
table.calendario .calTitolo
{
background-image: url( '../images/tab_body.png' );
background-repeat: repeat-x;
margin: 0px !important;
padding: 0px !important;
font-size:12px;
}
table.calendario .calTitolo td
{
padding:0px 5px 0px 5px;
width:14.3%;
}
table.calendario .calDayHeader /* RIGA */
{
background-color:#FCFBF7;
font-size:12px;
}
table.calendario .otherMonthDay
{
color: #C0C0C0;
}
table.calendario .cellaSelezionata /* CELLA */
{
background-color:#EDEBD5 !important;
border-collapse:collapse !important;
font-weight:bold;
}
table.calendario .calendarCellOpen input
{
color:#208020 !important; /*High availability (green)*/
font-weight:bold;
}
table.calendario .calendarCellRed
{
color:Red !important; /*no availability*/
font-weight:bold;
}
table.calendario .calendarCellMed input
{
color:#F09643 !important; /*Medium availability*/
font-weight:bold;
}
.pulsanteCalendario
{
border: 0px;
background-color: Transparent;
cursor: pointer;
padding: 0px 0px 0px 0px;
margin: 0px;
height:20px;
width:100%;
overflow:visible;
text-align:center;
font-size:16px;
}
.pulsanteCalendario:hover
{
text-decoration:underline;
}
#legend
{
margin-bottom:8px;
width:100%;
}
#legend ul
{
list-style-type:none;
}
#legend ul li
{
display:inline;
margin-left:20px;
}
What I want is to click (with Selenium) on an available day. It doesn't matter which day, just any day that is shown as available (green).
Here is the calendar:
elementos = driver.find_elements_by_class_name("calendarCellOpen")
while True:
    if elementos:
        driver.find_element_by_class_name("calendarCellOpen").click()
        driver.find_element_by_id("ctl00_ContentPlaceHolder1_acc_Calendario1_repFasce_ctl01_btnConferma").click() #confirm button
    else:
        driver.find_element_by_xpath("//input[@value='<']").click() #back
        if elementos:
            driver.find_element_by_class_name("calendarCellOpen").click()
            driver.find_element_by_id("ctl00_ContentPlaceHolder1_acc_Calendario1_repFasce_ctl01_btnConferma").click()
        driver.find_element_by_xpath("//input[@value='>']").click() #forward
        if elementos:
            driver.find_element_by_class_name("calendarCellOpen").click()
            driver.find_element_by_id("ctl00_ContentPlaceHolder1_acc_Calendario1_repFasce_ctl01_btnConferma").click()
This is some code I made.
I go back and forward because it is the only way to reload the calendar.
This is the HTML of the calendar:
<div id="calwrapper">
<div id="legend" style="padding-left:15px; margin-bottom:20px">
<table style="width:90%; border-collapse:collapse; border: 0px">
<tr style="line-height:15px">
<td style="background-color:Red; width:80px; margin-right:10px">
</td>
<td style="width: 383px; padding-left:5px">
Tutto occupato # all none available
</td>
<td style="background-color:#F09643; width:80px">
</td>
<td style="width: 450px; padding-left:5px">
Media disponibilità #half available
</td>
<td style="background-color:#058d08; width:80px">
</td>
<td style="width: 383px; padding-left:5px">
Posti disponibili #available
</td>
<td style="background-color:#000000; width:80px">
</td>
<td style="width: 383px; padding-left:5px">
Non disponibile # none available
</td>
</tr>
</table>
</div>
<div id="calendar">
<span id="ctl00_ContentPlaceHolder1_acc_Calendario1_myCalendario1"
class="calendario">
<table class="calendario" summary="Summary" cellspacing="0">
<caption>Calendario eventi</caption>
<tr class="calTitolo">
<th>
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl01"
value="<" title="Clicca qui per andare al mese precedente"
class="pulsanteCalendario" />
</th>
<th colspan="5">
<span>agosto, 2017</span>
</th>
<th>
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl03"
value=">" title="Clicca qui per andare al mese successivo"
class="pulsanteCalendario" />
</th>
</tr>
<tr>
<th class="calDayHeader" scope="col">lun</th>
<th class="calDayHeader"
scope="col">mar</th>
<th class="calDayHeader" scope="col">mer</th>
<th class="calDayHeader" scope="col">gio</th>
<th class="calDayHeader" scope="col">ven</th>
<th class="calDayHeader" scope="col">sab</th>
<th class="calDayHeader" scope="col">dom</th>
</tr>
<tr>
<td title="Giorno non disponibile" class="otherMonthDay">31</td>
<td title="Tutto occupato" class="calendarCellRed">1</td>
<td title="Giorno non disponibile" class="noSelectableDay">2</td>
<td title="Tutto occupato" class="calendarCellRed">3</td>
<td title="Tutto occupato" class="calendarCellRed">4</td>
<td title="Giorno non disponibile" class="noSelectableDay">5</td>
<td title="Giorno non disponibile" class="noSelectableDay">6</td>
</tr>
<tr>
<td title="Tutto occupato" class="calendarCellRed">7</td>
<td class="calendarCellOpen">
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl12"
value="8" title="8 agosto 2017, Posti disponibili"
class="pulsanteCalendario" />
</td>
<td class="calendarCellOpen">
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl12"
value="8" title="8 agosto 2017, Posti disponibili"
class="pulsanteCalendario" />
</td>
<td class="calendarCellOpen">
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl12"
value="8" title="8 agosto 2017, Posti disponibili"
class="pulsanteCalendario" />
</td>
<td class="calendarCellOpen">
<input type="submit"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$myCalendario1$ctl12"
value="8" title="8 agosto 2017, Posti disponibili"
class="pulsanteCalendario" />
</td>
<td title="Giorno non disponibile" class="noSelectableDay">9</td>
<td title="Giorno non disponibile" class="noSelectableDay">10</td>
</tr><tr>
<td title="Giorno non disponibile" class="noSelectableDay">14</td>
<td title="Giorno non disponibile" class="noSelectableDay">15</td>
<td title="Giorno non disponibile" class="noSelectableDay">16</td>
<td title="Giorno non disponibile" class="noSelectableDay">17</td>
<td title="Giorno non disponibile" class="noSelectableDay">18</td>
<td title="Giorno non disponibile" class="noSelectableDay">19</td>
<td title="Giorno non disponibile" class="noSelectableDay">20</td>
</tr><tr>
<td title="Giorno non disponibile" class="noSelectableDay">21</td>
<td title="Giorno non disponibile" class="noSelectableDay">22</td>
<td title="Giorno non disponibile" class="noSelectableDay">23</td>
<td title="Giorno non disponibile" class="noSelectableDay">24</td>
<td title="Giorno non disponibile" class="noSelectableDay">25</td>
<td title="Giorno non disponibile" class="noSelectableDay">26</td>
<td title="Giorno non disponibile" class="noSelectableDay">27</td>
</tr><tr>
<td title="Giorno non disponibile" class="noSelectableDay">28</td>
<td title="Giorno non disponibile" class="noSelectableDay">29</td>
<td title="Giorno non disponibile" class="noSelectableDay">30</td>
<td title="Giorno non disponibile" class="noSelectableDay">31</td>
<td title="Giorno non disponibile" class="otherMonthDay">1</td>
<td title="Giorno non disponibile" class="otherMonthDay">2</td>
<td title="Giorno non disponibile" class="otherMonthDay">3</td>
</tr></table></span>
</div>
<div id="orari" >
<input type="hidden"
name="ctl00$ContentPlaceHolder1$acc_Calendario1$HiddenField1"
id="ctl00_ContentPlaceHolder1_acc_Calendario1_HiddenField1" />
</div>
</div>
This is what I came up with, but I'm not sure it is going to work:
while True:
    for dates in elementos:
        if dates.is_enabled():
            dates.click()
            driver.find_element_by_id("ctl00_ContentPlaceHolder1_acc_Calendario1_repFasce_ctl01_btnConferma").click()
    #if elementos > 0:
    #driver.find_element_by_class_name("calendarCellOpen").click()
    #else:
    driver.find_element_by_xpath("//input[@value='<']").click()
    driver.find_element_by_xpath("//input[@value='>']").click()
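One thing that may help is to look the available days up again on every pass instead of reusing the elementos list, and to click the input button inside the cell rather than the cell itself. This is a minimal sketch, not a tested solution for this site: click_first_open_day is a hypothetical helper, it assumes a Selenium 4 style driver with find_elements(by, value), and it passes the locator strategy as the plain string "css selector" (the value of By.CSS_SELECTOR) so the function has no import of its own.

```python
def click_first_open_day(driver):
    """Click the first day whose cell has the calendarCellOpen class.

    Returns True if a day was clicked, False if none is available.
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    # In the HTML above, available days are <input> buttons inside
    # <td class="calendarCellOpen"> cells.
    buttons = driver.find_elements("css selector", "td.calendarCellOpen input")
    for button in buttons:
        if button.is_displayed() and button.is_enabled():
            button.click()
            return True
    return False
```

A loop could call click_first_open_day(driver), click the confirm button when it returns True, and otherwise press the back and forward buttons before trying again. Since the day buttons are submit inputs, the page probably reloads after a click, which is why the elements are searched for again on each call.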
I am trying to parse HTML with Python using BeautifulSoup, but I can't manage to get what I need.
This is a small module of a personal app I'm building. It logs in to a website with credentials, and once the script is logged in, it needs to parse some information so that it can be processed.
The HTML after logging in is:
<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>
As you can see, the HTML is not very well formatted, but I need to extract the elements and their values, for example: "Daily Earnings" and "150", "Weekly Earnings" and "500", and so on.
I think that the "id" attribute may help, but when I try to parse it, it crashes.
The Python code I'm working with is:
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par
Here archivohtml is the HTML file saved after logging in to the site.
When I run the script, I only get errors.
I've also tried doing this:
def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par
But the result is still the same.
The tag with id="west1" is an <a> tag. You are looking for the <td> tag that comes after this <a> tag:
import BeautifulSoup as bs
content = '''<div class="widget_title clearfix">
<h2>Account Balance</h2>
</div>
<div class="widget_body">
<div class="widget_content">
<table class="simple">
<tr>
<td>Daily Earnings</td>
<td style="text-align: right; width: 125px; color: #119911; font-weight: bold;">
150
</td>
</tr>
<tr>
<td>Weekly Earnings</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #119911; font-weight: bold;">
500 </td>
</tr>
<tr>
<td>Monthly Earnings</td>
<td style="text-align: right; color: #119911; font-weight: bold;">
1500 </td>
</tr>
<tr>
<td>Total expended</td>
<td style="text-align: right; border-bottom: 1px solid #000; color: #880000; font-weight: bold;">
430 </td>
</tr>
<tr>
<td>Account Balance</td>
<td style="text-align: right; border-bottom: 3px double #000; color: #119911; font-weight: bold;">
840 </td>
</tr>
<tr>
<td></td>
<td style="padding: 5px;">
<center>
<form id="request_bill" method="POST" action="index.php?page=dashboard">
<input type="hidden" name="secret_token" value="" />
<input type="hidden" name="request_payout" value="1" />
<input type="submit" class="btn blue large" value="Request Payout" />
</form>
</center>
</td>
</tr>
</table>
</div>
</div>
</div>'''
def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')
    print par.string.strip()
parseo(content)
yields
150
I can't tell from your question if this will be applicable to you, but here's another method:
def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    for line in parsed_html.stripped_strings:
        print line.strip()
which yields:
Account Balance
Daily Earnings
150
Weekly Earnings
500
Monthly Earnings
1500
Total expended
430
Account Balance
840
And if you wanted the data in a list:
data = [line.strip() for line in parsed_html.stripped_strings]
[u'Account Balance', u'Daily Earnings', u'150', u'Weekly Earnings', u'500', u'Monthly Earnings', u'1500', u'Total expended', u'430', u'Account Balance', u'840']
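Building on the stripped_strings idea, the labels and values can be paired into a dictionary. This is a minimal sketch under assumptions: it is written for Python 3 with bs4 (unlike the Python 2 snippets above), it uses a shortened stand-in for the account table, and it assumes every label inside the table is followed by exactly one numeric value.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the account table.
content = '''<div class="widget_title clearfix"><h2>Account Balance</h2></div>
<table class="simple">
<tr><td>Daily Earnings</td>
<td style="text-align: right;">
    150
</td></tr>
<tr><td>Weekly Earnings</td>
<td style="text-align: right;">
    500 </td></tr>
</table>'''

soup = BeautifulSoup(content, "html.parser")
# Restrict the search to the table so the <h2> heading is not picked up.
table = soup.find("table", class_="simple")

# zip(it, it) consumes the same iterator twice per step, pairing each
# label with the string that follows it.
strings = iter(table.stripped_strings)
balances = {label: int(value) for label, value in zip(strings, strings)}

print(balances)
```

If a label cell ever has no value, the pairing would slip by one, so iterating row by row is the safer choice for less regular tables.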