how to extract span info from div with soup - python

I have a piece of HTML code below:
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
I want to extract "Portugal" from it, note the span class is a dynamic one, it is not always class="country-flag-small flag-113" but indeed changes per the value of country generated for this div block.
To get the player1 and 1357, I am using the following cumbersome code:
player1info = soup.findAll('div', attrs={'class':'user-tagline'})[0].text.split("\n")
player1 = player1info[1]
pscore1 = player1info[1].replace('(','').replace(')', '')
It would be appreciated if someone can share with your better solution here. Thank you in advance
UPDATE:
With the initial HTML div info extracted, now I would like to expand it to extract more for the entire row, here is the row:
<tr board-popover="" fen="r1bk2r1/1p2n3/pN6/1B1qQp2/P2Pp2p/1P6/2P2PPP/R3K1R1 b Q -" flip-board="1" highlight-squares="c4b6">
<td>
<a class="clickable-link td-user" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<span class="time-control">
<i class="icon-rapid">
</i>
</span>
<div class="user-tagline ">
<span class="username " data-avatar="https://betacssjs.chesscomfiles.com/bundles/web/images/noavatar_l.1c5172d5.gif" data-country="Portugal" data-enabled="true" data-flag="113" data-joined="Joined Jun 19, 2016" data-logged="Online 6 hrs ago" data-membership="basic" data-name="Atikinounette" data-popup="hover" data-title="" data-username="Atikinounette">
Atikinounette
</span>
<span class="user-rating">
(1357)
</span>
<span class="country-flag-small flag-113" tip="Portugal">
</span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="https://images.chesscomfiles.com/uploads/v1/user/28196414.83e31ff1.50x50o.3a6f77e4aa44.jpeg" data-country="Indonesia" data-enabled="true" data-flag="70" data-joined="Joined May 15, 2016" data-logged="Online Nov 7, 2017" data-membership="basic" data-name="belemnarmada" data-popup="hover" data-title="" data-username="belemnarmada">
belemnarmada
</span>
<span class="user-rating">
(1387)
</span>
<span class="country-flag-small flag-70" tip="Indonesia">
</span>
</div>
</a>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">
1
</span>
<span class="game-result">
0
</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost">
</i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
30 min
</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
25
</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
Aug 9, 2017
</a>
</td>
<td class="text-center miniboard">
<input class="checkbox" game-checkbox="" game-id="2249663029" game-is-live="true" ng-model="model.gameIds[2249663029].checked" type="checkbox"/>
</td>
</tr>
Needed info are:
player's info (answer provided by #balderman already got that)
game-result (1, 0)
playing time (30 min in this row)
total moves (25)
playing date (Aug 9, 2017)
Thank you so much here.

How about the code below?
The idea that the user attributes are 3 spans under the div. So the code points to those spans and extract the data.
from bs4 import BeautifulSoup
html = '''<html><body> <div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div><body></html>'''
soup = BeautifulSoup(html, 'html.parser')
users = soup.findAll('div', attrs={'class': 'user-tagline'})
for user in users:
user_properties = user.findAll('span')
for idx, prop in enumerate(user):
if idx == 1:
print('user name: {}'.format(prop.text))
elif idx == 3:
print('user rating: {}'.format(prop.text))
elif idx == 5:
print('user country: {}'.format(prop.attrs['tip']))
Output
user name: player1
user rating: (1357)
user country: Portugal
user name: player2
user rating: (1387)
user country: Indonesia

This is a more readable solution:
div1 = soup.select("div.user-tagline")[0]
player1 = div1.select_one("span.user-rating").text
pscore1 = div1.select_one("span.country-flag-small").text
To extract data of all divs, just use a loop. And replace "0" with "i".

If you are interested only in the first div, you can go with this:
res = bsobj.find('div', {'class':'user-tagline'}).findAll('span')
print(res[0].text, res[1].text, res[2]['tip'])

Related

Adding multiple loop outputs to single dictionary

I'm learning how to use python and trying to use beautiful soup to do some web scraping. I want to pull the product name and product number from the saved page I'm referencing in my python code, but have provided a snippet of a section where this script is looking. They're located under a div with the class name and a span with the id product_id
Essentially, my python script does put in all the product names, but once it gets to the product_id loop, it overwrites the initial values from my first loop. Looking to see if anyone can point me in the right direction.
OUTPUT
After first loop
{'name': 'ADA Hi-Lo Power Plinth Table'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs'}
{'name': 'Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)'}
After second loop
{'name': 'Weekender Folding Cot', 'product_ID': '55984'}
{'name': 'Weekender Folding Cot', 'product_ID': '31350'}
{'name': 'Weekender Folding Cot', 'product_ID': '31351'}
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="1496" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
<img alt="ADA Hi-Lo Power Plinth Table" class="img-responsive" src="assets/images/thumbnails/55984_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="ADA-Hi-Lo-Power-Plinth-Table_p_1496.html">
ADA Hi-Lo Power Plinth Table
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
55984
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$2,849.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=1496&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2878" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs_p_2878.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31350
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$729.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2878&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
<div class="revealOnScroll product-item" data-addcart-callback="addcart_callback" data-ajaxcart="1" data-animation="fadeInUp" data-catalogid="2879" data-categoryid="5127" data-timeout="500">
<div class="img">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
<img alt="Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)" class="img-responsive" src="assets/images/thumbnails/31350_thumbnail.jpg"/>
</a>
<button class="quickview" data-toggle="modal">
Quick View
</button>
</div>
<div class="name">
<a href="Adjustable-Headrest-Couch--Chrome-Plated-Steel-Legs-X-Large_p_2879.html">
Adjustable Headrest Couch - Chrome-Plated Steel Legs (X-Large)
</a>
</div>
<div class="product-id">
Item Number:
<strong>
<span id="product_id">
31351
</span>
</strong>
</div>
<div class="status">
</div>
<div class="reviews">
</div>
<div class="price">
<span class="regular-price">
$769.00
</span>
</div>
<div class="action">
<a class="add-to-cart btn btn-default" href="add_cart.asp?quick=1&item_id=2879&cat_id=5127">
<span class="buyitlink-text">
Select Options
</span>
<span class="ajaxcart-loader icon-spin2 animate-spin">
</span>
<span class="ajaxcart-added icon-ok">
</span>
</a>
</div>
</div>
BEGINNING OF PYTHON SCRIPT
import requests
from bs4 import BeautifulSoup
with open('recoveryCouches','r') as html_file:
content= html_file.read()
soup = BeautifulSoup(content,'lxml')
allProductDivs = soup.find('div', class_='product-items product-items-4')
#get names of products on page
nameDiv = soup.find_all('div',class_='name')
prodID = soup.find_all('span', id='product_id')
records=[]
d=dict()
for name in nameDiv:
d['name'] = name.find('a').text
records.append(d)
print(d)
for productId in prodID:
d['product_ID'] = productId.text
records.append(d)
print(d)
Try this:
nameDiv = soup.find_all('div',class_='name')
prodID = soup.find_all('span', id='product_id')
records=[]
for i in range(len(nameDiv)):
records.append({
"name": nameDiv[i].find('a').text.strip(),
"product_ID": prodID[i].text.strip()
})
to write data to csv file:
import csv
with open("file.csv", 'w') as csv_file:
writer = csv.DictWriter(csv_file, fieldnames=records[0].keys())
writer.writeheader()
for record in records:
writer.writerow(record)
If I understand the question correctly, you're trying to get all the names and productIds and store them. The problem you're running into is, in the dictionary, your values are getting overwritten.
One solution to that problem would be to initialize your python dictionary values as lists, like so:
d = {
'name': [],
'product_ID': []
}
Then in each of the loops, you can append the new value to that array. What you currently have will overwrite the previous value.
for name in nameDiv:
d['name'].append(name.find('a').text)
for productId in prodID:
d['product_ID'].append(productId.text)
This will result in a list of all names and product_IDs stored in that dictionary.
If you want to put these lists together in a format like this:
[(name0, productId0), (name1, productId1), ...]
Then you can make use of zip, which will basically combine your lists as long as they are equal length. For example:
zipped_results = list(zip(d['name'], d['product_ID']))

how do I extract with beautiful soup a nested span class value?

I am struggling trying to figure out what is the element I need to tell Beautiful Soup to scrape the tag 'amount' value, which in this code sample is "1,56".
I am pasting below a code excerpt of the webpage I want to scrape:
<td class="line-content">
<span class="html-tag">
<div
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
the-price
</span>
'
<span class="html-attribute-name">
style
</span>
='
<span class="html-attribute-value">
margin-top:20px;
</span>
'>
</span>
</td>
</tr>
<tr>
<td class="line-number" value="447">
</td>
<td class="line-content">
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
currency
</span>
'>
</span>
€
<span class="html-tag">
</span>
</span>
<span class="html-tag">
<span
<span class="html-attribute-name">
class
</span>
='
<span class="html-attribute-value">
amount
</span>
'>
</span>
1,56
<span class="html-tag">
</span>
</span>
</td>
</tr>
would you kindly enlighten me?
I am really grateful for any help.
You can target the amount for example like this (data is your HTML string):
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
span_with_amount = soup.find(lambda tag: tag.name == 'span' and tag.get_text(strip=True) == 'amount')
value = span_with_amount.parent.find_next_sibling(text=True)
print(value.strip())
Prints:
1,56
First we will find <span> with the text "amount" and then we will find the text that is next to the parent of this <span>.

Python BeautifulSoup extracting the text right after a particular tag

I'm trying to extract information from a webpage using beautifulsoup and python. I want to extract the information right below a particular tag. To know if its the right tag I would like to do a comparison of its text and then extract the text in the next immediate tag.
Say for example, if the following is a part of an HTML page-source,
<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>
I want to check if the <p class="title"> has text value as Procurement type then I want to print out Services Similarly, if the <p class="title"> has text value as Reference then I want to print out ANAJSKJD23423-Commission and if <p class="title"> has value as Countries then print out all the countries i.e. Belgium,France,Luxembourg.
I know I can extract all the texts with <p class="data strong"> and append them to a list and later fetch all values using indexing. But the thing is, the order of the occurrence of these <p class="title> is not fixed....at some places countries could be mentioned before procurement-type. I, therefore, want to perform a check on the text values and then extract the next immediate tag's text value. I'm still new to BeautifulSoup so any help is appreciated. Thanks
You can do it many ways.Here you go.
from bs4 import BeautifulSoup
htmldata='''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
soup=BeautifulSoup(htmldata,'html.parser')
items=soup.find_all('p', class_='title')
for item in items:
if ('Procurement type' in item.text) or ('Reference' in item.text):
print(item.findNext('p').text)
You can also use :contains pseudo class with bs4 4.7.1. Although I have passed as a list you can separate out each condition
from bs4 import BeautifulSoup as bs
import re
html = 'yourHTML'
soup = bs(html, 'lxml')
items=[re.sub(r'\n\s+','', item.text.strip()) for item in soup.select('p.title:contains("Procurement type") + p, p.title:contains(Reference) + p, p.title:contains(Countries) + p')]
print(items)
Output:
You can add the argument to check for specific text when you use .find() or .find_all() then use .next_sibling or findNext() to grab the next tags with the content
Ie:
soup.find('p', {'class':'title'}, text = 'Procurement type')
Given:
html = '''<div class="row">
::before
<div class="four columns">
<p class="title">Procurement type</p>
<p class="data strong">Services</p>
</div>
<div class="four columns">
<p class="title">Reference</p>
<p class="data strong">ANAJSKJD23423-Commission</p>
</div>
<div class="four columns">
<p class="title">Funding Agency</p>
<p class="data strong">Health Commission</p>
</div>
::after
</div>
<div class="row">
::before
::after
</div>
<hr>
<div class="row">
::before
<div class="twelve columns">
<p class="title">Countries</p>
<p class="data strong">
<span class>Belgium</span>
", "
<span class>France</span>
", "
<span class>Luxembourg</span>
</p>
<p></p>
</div>
::after
</div>'''
you could do something like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
alpha = soup.find('p', {'class':'title'}, text = 'Procurement type')
for sibling in alpha.next_siblings:
try:
print (sibling.text)
except:
continue
Output:
Services
or
ref = soup.find('p', {'class':'title'}, text = 'Reference')
for sibling in ref.next_siblings:
try:
print (sibling.text)
except:
continue
Output:
ANAJSKJD23423-Commission
or
countries = soup.find('p', {'class':'title'}, text = 'Countries')
names = countries.findNext('p', {'class':'data strong'}).text.replace('", "','').strip().split('\n')
names = [name.strip() for name in names if not name.isspace()]
for country in names:
print (country)
Output:
Belgium
France
Luxembourg

For loop with Beautiful Soup throwing error

I have created this for loop to find td items that start with 'td_threadtitle':
for item in posts:
hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
print(hello)
But I get this error:
hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
TypeError: slice indices must be integers or None or have an __index__ method
When I change the variable hello to this:
hello = item.find("td") , it works perfectly fine. Why does it throw that error when I try to specify the id?
EDIT:
This is how I created posts:
tableWithPosts = soup.find("body").find("div", attrs = {"align": "center"}).find("div", {"class" : "page"}).find("div", attrs = {"style" : "padding:0px 0px 0px 0px"}).find("center").find("form").find("table", {"id": "threadslist"})
posts = tableWithPosts.find("tbody", {"id": "threadbits_forum_75"}
Here is a portion of posts:
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=206824', '_self')" style="cursor:pointer">
thelavenhagen
</span>
</div>
</td>
<td class="alt2" title="Replies: 11, Views: 1,471">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
05:06:46 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=581132" rel="nofollow">
westopher
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660274#post1067660274">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=581132" onclick="who(581132); return false;">
11
</a>
</td>
<td align="center" class="alt2">
1,471
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_558556">
<img alt="" border="" id="thread_statusicon_558556" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_558556" title="1996 E36 M3 Lux Dakar Yellow, 87,800 miles, special order without sunroof. Second owner, owned...">
<div>
<span style="float:right">
<a href="#" onclick="attachments(558556); return false">
<img alt="4 Attachment(s)" border="0" class="inlineimg" src="images/misc/paperclip.gif"/>
</a>
</span>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556" id="thread_title_558556">
1996 E36 M3 - Dakar Lux Slicktop
</a>
<span class="smallfont" style="white-space:nowrap">
(
<img alt="Multi-page thread" border="0" class="inlineimg" src="images/misc/multipage.gif"/>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556">
1
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=2">
2
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=3">
3
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=95931', '_self')" style="cursor:pointer">
yellowbee
</span>
</div>
</td>
<td class="alt2" title="Replies: 23, Views: 5,147">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
04:04:07 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=558556" rel="nofollow">
mbausa
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660244#post1067660244">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=558556" onclick="who(558556); return false;">
23
</a>
</td>
<td align="center" class="alt2">
5,147
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_580693">
<img alt="" border="" id="thread_statusicon_580693" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_580693" title="Selling my wife's car. We have owned her for two years and have put over 20k hassle free miles on...">
<div>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=580693" id="thread_title_580693">
2011 BMW 740Li Alpine White M Package Dakota Brown Interior Weather-tech Mats
</a>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=128641', '_self')" style="cursor:pointer">
911-AL
</span>
</div>
</td>
Remove your for loop, try with this:
hello = posts.find_all("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
hello
It will find all td items that start with 'td_threadtitle'
hello will be a list which contains all td(objects <class 'bs4.element.Tag'> ) start with 'td_threadtitle', you can still access their div.

Regex handling of a phrase that may occur once or twice

I really couldn't think of a decent title to give a overview of what I'm trying to do, but the examples I have should explain it nicely, my company provides a schedule online, but they don't have any APIs or anything to extract it, so I'm using the Python framework Scrapy to scrape the data, and then adding it to my Google Calendar
A girl gave me a Regex line to handle the data because it was kicking my butt for days and she was feeling nice, but I've since realized that it doesn't handle split shifts (most likely because I was not scheduled for any so she didn't see the possibility of one)
My regex is
re.findall("""dow1'>(\w+)<\S+?>(\w+ \d+)</td>\s*<td class.*?tlHours'>(\d+).*?span>\s*(\d+)<span.*?ment'>(.*?)</spa.*?Meal: (.*?)</sp.*?start'>(\S+?)</spa.*?end'>(\S+?)<""", response.body)
Example data:
This is a normal 8 hour day with a meal break, which is handled fine:
<tr>
<td class='dt'>
<span class='dow1'>Sunday</span>Dec 09
</td>
<td class='ScheduledDetails'valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: 2pm - 3pm</span>
</div>
</td>
<td>
</td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 143px; width: 234px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 226px;'>
<span class='start'>10am</span><span class='end'>7pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'><span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/>
<span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
And this is a split shift, two four hour shifts separated by a empty 1 hour slot (they do this to cheat the scoring system, two covered shifts instead of one):
<tr>
<td class='dt'>
<span class='dow1'>Thursday</span>Dec 13
</td>
<td class='ScheduledDetails' valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: None</span>
</div>
</td>
<td> </td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 247px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>2pm</span><span class='end'>6pm</span>
</div><div class='schedbar_r'></div>
</div>
<div class='schedbar' style='left: 377px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>7pm</span> <span class='end'>11pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div><div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'>
<span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/><span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
The important difference is on the regular shift there's one start and one end time, with the split shift there's a start, and end, and start, and end....
I've been pounding my head against this for about five hours now... and making no headway, I suppose I'd have more luck if I understood Regex.. any help at all would be greatly appreciated...
Here is a solution using BeautifulSoup to parse the document and grab the info.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
for schedbar in soup.find_all('div', 'schedbar'):
print "start: " + schedbar.find('div', 'schedbar_m').find('span', 'start').string
print "end: " + schedbar.find('div', 'schedbar_m').find('span', 'end').string
Outputs:
start: 2pm
end: 6pm
start: 7pm
end: 11pm

Categories

Resources