I have a website containing film listings; I've put together a simplified version of its HTML below. Please note that on the real-world page the <ul> tags are not direct children of the film_listing or showtime elements: they sit under several intermediate <div> or <ul> elements.
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>
I have created a Python script to scrape the website; it currently lists each film title with only its first showtime and first attribute. I am trying to list all showtimes. The final aim is to list only the film titles with open captions, together with the showtimes of those open-captioned performances.
Here is the Python script with the nested for loop that doesn't work: it prints all showtimes for all films rather than the showtimes of a specific film, and it is not yet set up to list only captioned films. I suspect the logic is wrong and would appreciate any advice. Thanks!
for i in soup.findAll('li', {'class': 'film_listing'}):
    film_title = i.find('h3', {'class': 'film_title'}).text
    print(film_title)
    for j in soup.findAll('li', {'class': 'showtime'}):
        print(j['showtime.text'])
    # For the time listings, find ones with Open Captioned
    i = filmlisting.find('li', {'class': 'open_cap'})
    print(film_access)
Edit: small correction to the HTML snippet.
There are many ways you could extract the information. One way is to "search backwards": search for each <li> with class="open_cap" and then find the previous start time and film title:
from bs4 import BeautifulSoup
txt = '''
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>'''
soup = BeautifulSoup(txt, 'html.parser')
for open_cap in soup.select('.open_cap'):
    print('Name :', open_cap.find_previous(class_='film_title').text)
    print('Start time :', open_cap.find_previous(class_='start_time').text)
    print('-' * 80)
Prints:
Name : James Bond
Start time : 19:00
--------------------------------------------------------------------------------
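Note that find_previous() walks backwards through everything parsed before the matched tag, regardless of nesting depth, so this approach also works on the real page where the <ul> tags sit under several intermediate <div> or <ul> elements.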
Content of read.html
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15: 00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc"></li>
<li class="open_cap"></li>
</ul>
</li>
</ul>
</li>
Since, as you said, the <ul> tags are not direct children of film_listing or showtime, you can use find() to get the first element with a specified tag name, or find_all() to get a list of all matching elements.
You can try this:
from bs4 import BeautifulSoup as bs

with open("read.html", "r") as text:
    soup = bs(text.read(), 'html.parser')

for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    print("Start time: ", listing.find("p", class_="start_time").text)
Output:
Film name: James Bond
Start time: 15:00
Instead of find(), you can use the find_all() method, which will return all of the <p> tags with class start_time, not just the first.
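For example, here is a minimal sketch (reusing the same read.html) that scopes each search to the current listing, so the nested loop only sees that film's showtimes, and that keeps only the open-captioned performances, which was the final aim:
from bs4 import BeautifulSoup as bs

with open("read.html", "r") as f:
    soup = bs(f.read(), 'html.parser')

for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    # search within this listing only, not in the whole soup
    for showtime in listing.find_all("li", class_="showtime"):
        # keep only open-captioned performances; drop this check to list every showtime
        if showtime.find("li", class_="open_cap"):
            print("Start time: ", showtime.find("p", class_="start_time").text)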
Related
I extracted the links I want with this:
link_soup = soup.find_all('ul', 'pagination')
but now I can't use link_soup[0].find_all('a')['href'], since find_all() returns a list that can't be indexed with 'href'. If I use link_soup[0].find('a')['href'] instead, it only shows the first link, which isn't what I want. How would I go about getting all the links returned in a list?
Snippet Below:
<ul class="pagination">
<li><<</li>
<li><</li>
<li class="hidden-xs">1</li>
<li class="hidden-xs active">2</li>
<li class="hidden-xs">3</li>
<li class="hidden-xs">4</li>
<li class="hidden-xs">5</li>
<li> ></li>
<li> >></li>
</ul>
First you need to find the parent tag using find(), and then all of its children using find_all(). Hope this helps.
from bs4 import BeautifulSoup
html="""<html><ul class="pagination">
<li><<</li>
<li><</li>
<li class="hidden-xs">1</li>
<li class="hidden-xs active">2</li>
<li class="hidden-xs">3</li>
<li class="hidden-xs">4</li>
<li class="hidden-xs">5</li>
<li> ></li>
<li> >></li>
</ul></html>"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.find('ul')
for a in ul.find_all('a'):
    print(a['href'])
Output :
link
link
link
link
link
link
link
link
link
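If you would rather collect them in a list, as asked, instead of printing each one, a list comprehension over the same ul does it:
links = [a['href'] for a in ul.find_all('a')]
print(links)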
I am learning to use Beautiful Soup to parse div containers from HTML, but for some reason, when I pass the class name of the div containers to Beautiful Soup, nothing happens: I get no elements when I try to parse the div. What could I be doing wrong? Here are my HTML and the parser.
<div class="upcoming-date ng-hide no-league" ng-show="nav.upcoming" ng-class="{'no-league': !search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS")}">
<span class="weekday">Monday</span>
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="date ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
<div class="clear"></div>
</div>
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS" itemscope="" itemtype="https://schema.org/SportsEvent">
<div class="leaguename ng-hide" ng-show="search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS") && (1 || (nav.upcoming && 0))">
<span class="name">
<span class="flag-icon flag-icon-swe"></span>
Sweden - Allsvenskan
</span>
</div>
<ul class="meta">
<li class="date">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
</li>
<li class="time">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="false" show-time="true" class="ng-isolate-scope"><span class="ng-binding">20:00</span></timecomponent>
</li>
<li class="game-id"><span class="gameid">GameID:</span> 2087</li>
</ul>
<ul class="teams">
<li>Hammarby</li>
<li>Ostersunds</li>
</ul>
<ul class="bet-selector">
<li class="pick01" id="b499795664">
<a data-id="499795664" ng-click="bets.pick($event, 499795664, 2087, 2.10)" class="betting-button pick-button " title="Hammarby">
<span class="team">Hammarby</span>
<span class="odd">2.10</span>
</a>
</li> <li class="pick0X" id="b499795666">
<a data-id="499795666" ng-click="bets.pick($event, 499795666, 2087, 3.56)" class="betting-button pick-button " title="Draw">
<span class="team">Draw</span>
<span class="odd">3.56</span>
</a>
</li> <li class="pick02" id="b499795668">
<a data-id="499795668" ng-click="bets.pick($event, 499795668, 2087, 3.40)" class="betting-button pick-button " title="Ostersunds">
<span class="team">Ostersunds</span>
<span class="odd">3.40</span>
</a>
</li> </ul>
<ul class="extra-picks">
<li>
<a class="betting-button " href="/games/1390856/markets?league=0&top=0&sid=2087&sportId=1">
<span class="short-desc">+13</span>
<span class="long-desc">View 13 more markets</span>
</a>
</li>
</ul>
<div class="game-stats">
<img class="img-responsive" src="/img/chart-icon.png?v2.2.25.2">
</div>
<div class="clear"></div>
</div>
parser.py
import requests
import urllib2
from bs4 import BeautifulSoup as soup
udata = urllib2.urlopen('https://www.sportpesa.co.ke/?sportId=1')
htmlsource = udata.read()
ssoup = soup(htmlsource,'html.parser')
page_div = ssoup.findAll("div",{"class":"match football FOOTBALL - HIGHLIGHTS"})
print page_div
"match football FOOTBALL - HIGHLIGHTS" is a dynamic class so you are just getting a blank list....
Here is my code in python3
from bs4 import BeautifulSoup as bs4
import requests
request = requests.get('https://www.sportpesa.co.ke/?sportId=1')
soup = bs4(request.text, 'lxml')
print(soup)
After printing soup you will find that this class is not present in the page source. I hope that helps.
So, as suggested in the comment, the best (fastest) way to get data from this site is to use the same endpoint that the JavaScript uses.
If you use Chrome, open the developer tools, go to the Network tab and load the page. You'll see that the site gets its data from a URL that looks very similar to the one displayed in the address bar, namely
https://sportpesa.co.ke/sportgames?sportId=1
This endpoint gives you the data you need. Grabbing it using requests and extracting the divs you want can be done like below:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://sportpesa.co.ke/sportgames?sportId=1")
soup = BeautifulSoup(r.text, "html.parser")
page_divs = soup.select('div.match.football.FOOTBALL.-.HIGHLIGHTS')
print(len(page_divs))
That prints 30, which is the number of divs. By the way, I'm using the bs4 method select() here, which is the recommended way of doing things when you have, as you do here, not just one class but multiple classes ('match', 'football', 'FOOTBALL', '-' and 'HIGHLIGHTS').
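As an aside, find_all() can get you there too if you match on a single class value, because bs4 compares class_ against each of an element's space-separated classes individually. A sketch, assuming 'match' on its own only occurs on these game divs:
# class_ matches any one of the element's space-separated classes
page_divs = soup.find_all("div", class_="match")
print(len(page_divs))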
So, as the title states, I have some HTML from http://chem.sis.nlm.nih.gov/chemidplus/name/acetone that I am parsing, and I want to extract some data, like the Acetone entry under MeSH Heading, as in my similar post How to set up XPath query for HTML parsing?
<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds">
<button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
<h3>Name of Substance</h3>
<div class="yui3-g-r">
<div class="yui3-u-1-4">
<ul>
<li id="ds2">
<div>2-Propanone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds3">
<div>Acetone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds4">
<div>Acetone [NF]</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds5">
<div>Dimethyl ketone</div>
</li>
</ul>
</div>
</div>
<h3>MeSH Heading</h3>
<ul>
<li id="ds6">
<div>Acetone</div>
</li>
</ul>
</div>
</div>
Previously, on other pages, I would do mesh_name = tree.xpath('//*[text()="MeSH Heading"]/..//div')[1].text_content() to extract the data, because those pages had similar structures, but now I see that is not the case: I didn't account for the inconsistency. So, is there a way, after going to the node that I want, of obtaining its subchild in a manner that is consistent across different pages?
Would doing tree.xpath('//*[text()="MeSH Heading"]//preceding-sibling::text()[1]') work?
From what I understand, you need to get the list of items by a heading title.
How about making a reusable function that would work for every heading in the "Names and Synonyms" container:
from lxml.html import parse

tree = parse("http://chem.sis.nlm.nih.gov/chemidplus/name/acetone")

def get_contents_by_title(tree, title):
    return tree.xpath("//h3[. = '%s']/following-sibling::*[1]//div/text()" % title)

print get_contents_by_title(tree, "Name of Substance")
print get_contents_by_title(tree, "MeSH Heading")
Prints:
['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
['Acetone']
I'm currently trying to extract the HTML elements which have text of their own and wrap that text with a special tag.
For example, my HTML looks like this:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
I'm trying to wrap the markers around the text only, so I can parse it further at a later time. I tried to make it look like this:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
Here is my script so far. It iterates just fine, but the placement of the edit placeholders isn't working, and I currently have no idea how to check for this:
def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
The script generates the following:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
How can I check this and fix it? I'm grateful for every possible answer.
There is a built-in replace_with() method that fits the use case nicely:
soup = BeautifulSoup(data)
for node in soup.find_all(text=lambda x: x.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))
print soup.prettify()
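Note that text=lambda x: x.strip() only matches the text nodes that contain something besides whitespace, so the purely structural whitespace between tags is left alone and the markers wrap just the visible text.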
Hi all. I'm having trouble getting at links in nested HTML with Mechanize in Python. Here's my current code (I've tried everything; this is just the latest copy, which doesn't work correctly), and pardon my variable names (thing, stuff):
soup = BeautifulSoup(resultsPage)
if not soup.find(attrs={'class': 'paging'}):
    print "Only one product listed!"
else:
    stuff = soup.find('div', attrs={'class': 'paging'}).ul.li
    for thing in stuff:
        print thing
Here's the HTML I'm looking at:
<div class="paging">
<ul>
<li><
</li>
<li class='on'>
1-10
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl01_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=2">11-20</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl02_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=3">21-30</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl03_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=4">31-40</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl04_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=5">41-50</a>
</li>
<li class=''>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_rptPageNavigators_ctl05_hlPage" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=6">51-60</a>
</li>
<li>
<a id="ctl00_SPWebPartManager1_g_83a79912_01d8_4726_8a95_2953baaad0ec_ctl01_ucProductInfoPageNavigatorGroupTop_lnkNext" href="http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=22&searchtext=jell-o&pageno=7">>></a>
</li>
</ul>
I need to determine whether or not there are <li> tags with hyperlinks in them; if there are, I need to store them for clicking on later. This is the page the code came from, in case you're curious: http://www.kraftrecipes.com/Products/ProductInfoSearchResults.aspx?CatalogType=1&BrandId=22&SearchText=Jell-O&PageNo=1 I'm working on something to scrape food websites for product info, and I need to be able to navigate around the search results.
I have another quick side question. Is it bad to chain together tags and searches like this?
ingredients = soup.find(attrs={'class' : "TitleAndDescription"}).div.find(text=re.compile("Ingredients")).next
I'm just learning Python but this seems kind of kludge-y and I'd like to know what you guys think. Here's a sample of the HTML I'm scraping:
<table>
<tr>
<td>
<div id="contHeader" class="TitleAndDescription">
<h1>JELL-O - GELATIN DESSERT - RASPBERRY</h1>
<div class="textArea">
<strong>Ingredients:</strong> SUGAR, GELATIN, ADIPIC ACID (FOR TARTNESS), CONTAINS LESS THAN 2% OF ARTIFICIAL FLAVOR, DISODIUM PHOSPHATE AND SODIUM CITRATE (CONTROL ACIDITY), FUMARIC ACID (FOR TARTNESS), RED 40.<br/>
<strong>Size:</strong> 6 OZ<br/><strong>Upc:</strong> 4300020052<br/>
<br/>
<!--<br/>-->
<br/>
</div>
</div>
...
</td>
...
</tr>
...
</table>
Sorry for the wall of text. Let me know if you need any more information.
Thanks.
"HTMLParser" module of python could be one of the solution to the problem. Find more details at http://docs.python.org/library/htmlparser.html
If I understood correctly, what you want to get is the list of all li tags that contain any a tag (no matter how deep in the DOM tree). If that's correct, then you can do something like this:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(resultsPage)
list_items = [list_item for list_item in soup.findAll('li')
              if list_item.findAll('a')]
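And if what you ultimately need is the hrefs stored for clicking later, you can pull them straight out of the matched items; a small sketch continuing from the above:
# flatten every anchor's href out of the matched list items
links = [a['href'] for list_item in list_items
         for a in list_item.findAll('a')]
print links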