I couldn't think of a decent title that gives an overview of what I'm trying to do, but the examples below should explain it. My company provides a schedule online but offers no API to extract it, so I'm using the Python framework Scrapy to scrape the data and then add it to my Google Calendar.
A girl gave me a regex to handle the data, because it had been kicking my butt for days and she was feeling nice. I've since realized that it doesn't handle split shifts (most likely because I wasn't scheduled for any at the time, so she never saw the possibility of one).
My regex is:
re.findall("""dow1'>(\w+)<\S+?>(\w+ \d+)</td>\s*<td class.*?tlHours'>(\d+).*?span>\s*(\d+)<span.*?ment'>(.*?)</spa.*?Meal: (.*?)</sp.*?start'>(\S+?)</spa.*?end'>(\S+?)<""", response.body)
Example data:
This is a normal 8 hour day with a meal break, which is handled fine:
<tr>
<td class='dt'>
<span class='dow1'>Sunday</span>Dec 09
</td>
<td class='ScheduledDetails'valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: 2pm - 3pm</span>
</div>
</td>
<td>
</td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 143px; width: 234px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 226px;'>
<span class='start'>10am</span><span class='end'>7pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'><span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/>
<span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
And this is a split shift: two four-hour shifts separated by an empty one-hour slot (they do this to cheat the scoring system, which then counts two covered shifts instead of one):
<tr>
<td class='dt'>
<span class='dow1'>Thursday</span>Dec 13
</td>
<td class='ScheduledDetails' valign='top'>
<div style="position:relative;">
<span class='tlHours'>8<span class='spart'> hrs</span> 0<span class='spart'> mins</span></span><span class='department'>Cashier</span><span class='meal'>Meal: None</span>
</div>
</td>
<td> </td>
<td class='Schedunderlay'>
<div class='Sched'>
<div class='schedbar' style='left: 247px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>2pm</span><span class='end'>6pm</span>
</div><div class='schedbar_r'></div>
</div>
<div class='schedbar' style='left: 377px; width: 104px;'>
<div class='schedbar_l'></div>
<div class='schedbar_m' style='width: 96px;'>
<span class='start'>7pm</span> <span class='end'>11pm</span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<div class='schedbar_l'></div><div class='schedbar_m' style='width: 490px;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'></span>
</div>
<div class='schedbar_r'></div>
</div>
<div class='availbar' style='left: 508px; width: 216px; display: none;'>
<div class='schedbar_l_on'></div>
<div class='schedbar_m_on' style='width: 208px;'>
<span class='start'></span>
<div class='OTtext' align='center'>All Day</div>
<span class='end'><img src='/Images/Schedule/arrowRight.gif' alt='' style='margin-left:5px; margin-top:2px;' /></span>
</div>
<div class='schedbar_r_on'></div>
</div>
</div>
</td>
<td> </td>
<td class='rightColDetails'>
<div class='AvailDetails' align='left' style='display: table-cell;'>
<span class='iefix'><b>Avail - All Day</b></span><br/><span style='font-size: 11px;'>Pref - All Day</span>
</div>
</td>
</tr>
The important difference is that the regular shift has one start and one end time, while the split shift has a start and an end, followed by another start and another end.
I've been pounding my head against this for about five hours now and am making no headway. I suppose I'd have more luck if I understood regex. Any help at all would be greatly appreciated.
Here is a solution using BeautifulSoup to parse the document and grab the info.
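from bs4 import BeautifulSoup

# html holds the page source (for example, response.body in Scrapy)
soup = BeautifulSoup(html, 'html.parser')
# Each scheduled block is its own <div class='schedbar'>, so a split shift yields two
for schedbar in soup.find_all('div', 'schedbar'):
    bar = schedbar.find('div', 'schedbar_m')
    print("start: " + bar.find('span', 'start').string)
    print("end: " + bar.find('span', 'end').string)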
Outputs:
start: 2pm
end: 6pm
start: 7pm
end: 11pm
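If you would rather stay with a regex, here is a minimal sketch of the same idea. It assumes the markup stays exactly as posted (single-quoted class attributes, start and end spans next to each other), and the row below is an abbreviated copy of the split-shift example, not the full page. The pattern name SHIFT_RE is made up for illustration.

```python
import re

# Abbreviated copy of the split-shift row from the question.
row = """
<span class='dow1'>Thursday</span>Dec 13
<div class='schedbar' style='left: 247px; width: 104px;'>
<span class='start'>2pm</span><span class='end'>6pm</span>
</div>
<div class='schedbar' style='left: 377px; width: 104px;'>
<span class='start'>7pm</span> <span class='end'>11pm</span>
</div>
<div class='availbar' style='left: 9px; width: 498px; display: none;'>
<span class='start'><img src='/Images/Schedule/arrowLeft.gif' alt='' /></span>
<span class='end'></span>
</div>
"""

# One (start, end) tuple per scheduled block. The availbar spans contain an
# <img> or nothing at all, so [^<]+ (at least one non-tag character) skips them.
SHIFT_RE = re.compile(
    r"<span class='start'>([^<]+)</span>\s*<span class='end'>([^<]+)</span>"
)

shifts = SHIFT_RE.findall(row)
print(shifts)  # [('2pm', '6pm'), ('7pm', '11pm')]
```

Because findall returns every match, a normal day gives a list with one tuple and a split shift gives two. You would still need to run it once per table row to keep the times attached to the right date.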
Problem Introduction
Language version: Python 3.8
Operating System: Windows 10
Other relevant software: Jupyter notebook and requests-html
Context:
I have been following along with this tutorial to scrape Stack Overflow for questions. My goal is to extract the answers (from the URL of the question) and who answered them. However, I am having difficulty determining which classes/IDs to search for in the HTML of a question.
Things I have tried:
I have attempted searching under ('.container') for things like ('.post-layout'), '.mb0', '#answers', and '#answers-headers', with marginal, cluttered success.
An excerpt from the code I am using to parse the listing pages (not the questions); the full code is in the linked GitHub repository:
def parse_tagged_page(html):
    question_summaries = html.find(".question-summary")
    key_names = ['question', 'votes', 'tags']
    classes_needed = ['.question-hyperlink', '.vote', '.tags']
    datas = []
    for q_el in question_summaries:
        question_data = {}
        for i, _class in enumerate(classes_needed):
            sub_el = q_el.find(_class, first=True)
            keyname = key_names[i]
            question_data[keyname] = clean_scraped_data(sub_el.text, keyname=keyname)
        datas.append(question_data)
    return datas
An example of the html code I am looking for is below.
html code on this question:
<div id="answers">
<a name="tab-top"></a>
<div id="answers-header">
<div class="answers-subheader grid ai-center mb8">
<div class="grid--cell fl1">
<h2 class="mb0" data-answercount="13">
13 Answers
<span style="display:none;" itemprop="answerCount">13</span>
</h2>
</div>
<div class="grid--cell">
<div class=" grid s-btn-group js-filter-btn">
<a class="grid--cell s-btn s-btn__muted s-btn__outlined" href="/questions/19254583/how-do-i-host-multiple-node-js-sites-on-the-same-ip-server-with-different-domain?answertab=active#tab-top" data-nav-xhref="" title="Answers with the latest activity first" data-value="active" data-shortcut="A">
Active</a>
<a class="grid--cell s-btn s-btn__muted s-btn__outlined" href="/questions/19254583/how-do-i-host-multiple-node-js-sites-on-the-same-ip-server-with-different-domain?answertab=oldest#tab-top" data-nav-xhref="" title="Answers in the order they were provided" data-value="oldest" data-shortcut="O">
Oldest</a>
<a class="youarehere is-selected grid--cell s-btn s-btn__muted s-btn__outlined" href="/questions/19254583/how-do-i-host-multiple-node-js-sites-on-the-same-ip-server-with-different-domain?answertab=votes#tab-top" data-nav-xhref="" title="Answers with the highest score first" data-value="votes" data-shortcut="V">
Votes</a>
</div>
</div>
</div>
</div>
<a name="19254824"></a>
<div id="answer-19254824" class="answer accepted-answer" data-answerid="19254824" itemprop="acceptedAnswer" itemscope="" itemtype="http://schema.org/Answer">
<div class="post-layout">
<div class="votecell post-layout--left">
<div class="js-voting-container grid fd-column ai-stretch gs4 fc-black-200" data-post-id="19254824">
<button class="js-vote-up-btn grid--cell s-btn s-btn__unset c-pointer" data-controller="s-tooltip" data-s-tooltip-placement="right" aria-pressed="false" aria-label="Up vote" data-selected-classes="fc-theme-primary" aria-describedby="--stacks-s-tooltip-zxmm3912"><svg aria-hidden="true" class="m0 svg-icon iconArrowUpLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 26h32L18 10 2 26z"></path></svg></button><div id="--stacks-s-tooltip-zxmm3912" class="s-popover s-popover__tooltip pe-none" aria-hidden="true" role="tooltip">This answer is useful<div class="s-popover--arrow"></div></div>
<div class="js-vote-count grid--cell fc-black-500 fs-title grid fd-column ai-center" itemprop="upvoteCount" data-value="83">83</div>
<button class="js-vote-down-btn grid--cell s-btn s-btn__unset c-pointer" data-controller="s-tooltip" data-s-tooltip-placement="right" aria-pressed="false" aria-label="Down vote" data-selected-classes="fc-theme-primary" aria-describedby="--stacks-s-tooltip-waz8801n"><svg aria-hidden="true" class="m0 svg-icon iconArrowDownLg" width="36" height="36" viewBox="0 0 36 36"><path d="M2 10h32L18 26 2 10z"></path></svg></button><div id="--stacks-s-tooltip-waz8801n" class="s-popover s-popover__tooltip pe-none" aria-hidden="true" role="tooltip">This answer is not useful<div class="s-popover--arrow"></div></div>
<div class="js-accepted-answer-indicator grid--cell fc-green-500 ta-center py4" data-s-tooltip-placement="right" title="Loading when this answer was accepted…" tabindex="0" role="note" aria-label="Accepted">
<svg aria-hidden="true" class="svg-icon iconCheckmarkLg" width="36" height="36" viewBox="0 0 36 36"><path d="M6 14l8 8L30 6v8L14 30l-8-8v-8z"></path></svg>
</div>
<a class="js-post-issue grid--cell s-btn s-btn__unset c-pointer py6 mx-auto" href="/posts/19254824/timeline" data-shortcut="T" data-controller="s-tooltip" data-s-tooltip-placement="right" aria-label="Timeline" aria-describedby="--stacks-s-tooltip-djt8qt69"><svg aria-hidden="true" class="mln2 mr0 svg-icon iconHistory" width="19" height="18" viewBox="0 0 19 18"><path d="M3 9a8 8 0 113.73 6.77L8.2 14.3A6 6 0 105 9l3.01-.01-4 4-4-4h3L3 9zm7-4h1.01L11 9.36l3.22 2.1-.6.93L10 10V5z"></path></svg></a><div id="--stacks-s-tooltip-djt8qt69" class="s-popover s-popover__tooltip pe-none" aria-hidden="true" role="tooltip">Show activity on this post.<div class="s-popover--arrow"></div></div>
</div>
</div>
<div class="answercell post-layout--right">
<div class="s-prose js-post-body" itemprop="text">
<p>Choose one of:</p>
<ul>
<li>Use some other server (like nginx) as a reverse proxy.</li>
<li>Use node-http-proxy as a reverse proxy.</li>
<li>Use the vhost middleware if each domain can be served from the same Connect/Express codebase and node.js instance.</li>
</ul>
</div>
<div class="mt24">
<div class="grid fw-wrap ai-start jc-end gs8 gsy">
<time itemprop="dateCreated" datetime="2013-10-08T17:53:13"></time>
<div class="grid--cell mr16" style="flex: 1 1 100px;">
<div class="post-menu">
share<div class="s-popover z-dropdown" style="width: unset; max-width: 28em;" id="se-share-sheet-1"><div class="s-popover--arrow"></div><div><span class="js-title fw-bold">Share a link to this answer</span> <span class="js-subtitle">(includes your user id)</span></div><div class="my8"><input type="text" class="js-input s-input wmn3 sm:wmn-initial" readonly=""></div><div class="d-flex jc-space-between mbn4"><button class="js-copy-link-btn s-btn s-btn__link">Copy link</button>CC BY-SA 3.0<div class="js-social-container"></div></div></div>
<span class="lsep">|</span>
edit
<span class="lsep">|</span>
<button id="btnFollowPost-19254824" class="s-btn s-btn__link fc-black-400 h:fc-black-700 pb2 js-follow-post js-follow-answer js-gps-track" role="button" data-gps-track="post.click({ item: 14, priv: -1, post_type: 2 })" data-controller="s-tooltip " data-s-tooltip-placement="bottom" data-s-popover-placement="bottom" aria-controls="" aria-describedby="--stacks-s-tooltip-nb9azr0k">
follow
</button><div id="--stacks-s-tooltip-nb9azr0k" class="s-popover s-popover__tooltip pe-none" aria-hidden="true" role="tooltip">Follow this answer to receive notifications<div class="s-popover--arrow"></div></div>
<span class="lsep">|</span>
</div>
</div>
<div class="post-signature grid--cell fl0">
<div class="user-info user-hover">
<div class="user-action-time">
edited <span title="2017-05-23 11:33:25Z" class="relativetime">May 23 '17 at 11:33</span>
</div>
<div class="user-gravatar32">
<div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/a007be5a61f6aa8f3e85ae2fc18dd66e?s=32&d=identicon&r=PG" alt="" width="32" height="32" class="bar-sm"></div>
</div>
<div class="user-details">
Community<span class="mod-flair " title="moderator">♦</span>
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">1</span><span title="1 silver badge" aria-hidden="true"><span class="badge2"></span><span class="badgecount">1</span></span><span class="v-visible-sr">1 silver badge</span>
</div>
</div>
</div> </div>
<div class="post-signature grid--cell fl0">
<div class="user-info user-hover">
<div class="user-action-time">
answered <span title="2013-10-08 17:53:13Z" class="relativetime">Oct 8 '13 at 17:53</span>
</div>
<div class="user-gravatar32">
<div class="gravatar-wrapper-32"><img src="https://i.stack.imgur.com/eLXTL.jpg?s=32&g=1" alt="" width="32" height="32" class="bar-sm"></div>
</div>
<div class="user-details" itemprop="author" itemscope="" itemtype="http://schema.org/Person">
josh3736<span class="d-none" itemprop="name">josh3736</span>
<div class="-flair">
<span class="reputation-score" title="reputation score 119,818" dir="ltr">120k</span><span title="24 gold badges" aria-hidden="true"><span class="badge1"></span><span class="badgecount">24</span></span><span class="v-visible-sr">24 gold badges</span><span title="198 silver badges" aria-hidden="true"><span class="badge2"></span><span class="badgecount">198</span></span><span class="v-visible-sr">198 silver badges</span><span title="245 bronze badges" aria-hidden="true"><span class="badge3"></span><span class="badgecount">245</span></span><span class="v-visible-sr">245 bronze badges</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="post-layout--right">
<div id="comments-19254824" class="comments js-comments-container bt bc-black-2 mt12 " data-post-id="19254824" data-min-length="15">
<ul class="comments-list js-comments-list" data-remaining-comments-count="0" data-canpost="false" data-cansee="true" data-comments-unavailable="false" data-addlink-disabled="true">
<li id="comment-45028507" class="comment js-comment " data-comment-id="45028507">
<div class="js-comment-actions comment-actions">
<div class="comment-score js-comment-edit-hide">
<span title="number of 'useful comment' votes received" class="cool">3</span>
</div>
</div>
<div class="comment-text js-comment-text-and-form">
<div class="comment-body js-comment-edit-hide">
<span class="comment-copy">that's a very good and brief list of the options I've read elsewhere. Do you happen to know for each of these solutions which processes would need to be restarted when a new domain is added? For 1) none. For 2) only the node-http-proxy. For 3) the entire thread of all sites would need to be restarted. Is this correct?</span>
– Flion
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment45028507_19254824"><span title="2015-02-05 10:48:37Z, License: CC BY-SA 3.0" class="relativetime-clean">Feb 5 '15 at 10:48</span></a></span>
</div>
</div>
</li>
<li id="comment-45045094" class="comment js-comment " data-comment-id="45045094">
<div class="js-comment-actions comment-actions">
<div class="comment-score js-comment-edit-hide">
<span title="number of 'useful comment' votes received" class="cool">1</span>
</div>
</div>
<div class="comment-text js-comment-text-and-form">
<div class="comment-body js-comment-edit-hide">
<span class="comment-copy">#Flion: You could write the node-based proxies in such a way that you could reload the domain configuration without requiring a process restart. It really depends on your app's exact requirements.</span>
– josh3736
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment45045094_19254824"><span title="2015-02-05 17:50:17Z, License: CC BY-SA 3.0" class="relativetime-clean">Feb 5 '15 at 17:50</span></a></span>
</div>
</div>
</li>
<li id="comment-107457123" class="comment js-comment " data-comment-id="107457123">
<div class="js-comment-actions comment-actions">
<div class="comment-score js-comment-edit-hide">
</div>
</div>
<div class="comment-text js-comment-text-and-form">
<div class="comment-body js-comment-edit-hide">
<span class="comment-copy">Not what was asked.</span>
– Patrick Sturm
<span class="comment-date" dir="ltr"><a class="comment-link" href="#comment107457123_19254824"><span title="2020-03-18 07:47:44Z, License: CC BY-SA 4.0" class="relativetime-clean">Mar 18 at 7:47</span></a></span>
</div>
</div>
</li>
</ul>
</div>
<div id="comments-link-19254824" data-rep="50" data-reg="true">
<a class="js-add-link comments-link disabled-link" title="Use comments to ask for more information or suggest improvements. Avoid comments like “+1” or “thanks”." href="#" role="button">add a comment</a>
<span class="js-link-separator dno"> | </span>
<a class="js-show-link comments-link dno" title="expand to show all comments on this post" href="#" onclick="" role="button"></a>
</div>
</div>
</div>
</div>
You should look for the .answercell class.
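As a rough illustration of that suggestion, here is a sketch using BeautifulSoup on a trimmed-down stand-in for the markup in the question (the snippet is hypothetical and much shorter than the real page). It assumes the answer body keeps the js-post-body class and that the answerer's block is the one marked itemprop="author", which is how the posted HTML distinguishes it from the "edited by" block.

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical stand-in for the answer markup shown in the question.
html = """
<div id="answers">
  <div id="answer-19254824" class="answer accepted-answer">
    <div class="answercell post-layout--right">
      <div class="s-prose js-post-body" itemprop="text">
        <p>Choose one of:</p>
      </div>
      <div class="user-details">Community</div>
      <div class="user-details" itemprop="author">
        josh3736<span class="d-none" itemprop="name">josh3736</span>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
answers = []
for cell in soup.select("#answers .answercell"):
    body = cell.select_one(".js-post-body")
    # Only the answerer's block carries itemprop="author"; editors do not.
    author = cell.select_one('.user-details[itemprop="author"] [itemprop="name"]')
    answers.append({
        "author": author.get_text(strip=True) if author else None,
        "text": body.get_text(" ", strip=True) if body else "",
    })

print(answers)
```

Since requests-html's find() also takes CSS selectors, the same selector strings should carry over to the code in the question, though that is untested here.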
I have a piece of HTML code below:
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
I want to extract "Portugal" from it. Note that the span class is dynamic: it is not always class="country-flag-small flag-113", but changes with the country generated for each div block.
To get player1 and 1357, I am using the following cumbersome code:
player1info = soup.findAll('div', attrs={'class':'user-tagline'})[0].text.split("\n")
player1 = player1info[1]
pscore1 = player1info[2].replace('(','').replace(')', '')
I would appreciate it if someone could share a better solution. Thank you in advance.
UPDATE:
With the initial HTML div info extracted, I would now like to extract more from the entire row. Here is the row:
<tr board-popover="" fen="r1bk2r1/1p2n3/pN6/1B1qQp2/P2Pp2p/1P6/2P2PPP/R3K1R1 b Q -" flip-board="1" highlight-squares="c4b6">
<td>
<a class="clickable-link td-user" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<span class="time-control">
<i class="icon-rapid">
</i>
</span>
<div class="user-tagline ">
<span class="username " data-avatar="https://betacssjs.chesscomfiles.com/bundles/web/images/noavatar_l.1c5172d5.gif" data-country="Portugal" data-enabled="true" data-flag="113" data-joined="Joined Jun 19, 2016" data-logged="Online 6 hrs ago" data-membership="basic" data-name="Atikinounette" data-popup="hover" data-title="" data-username="Atikinounette">
Atikinounette
</span>
<span class="user-rating">
(1357)
</span>
<span class="country-flag-small flag-113" tip="Portugal">
</span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="https://images.chesscomfiles.com/uploads/v1/user/28196414.83e31ff1.50x50o.3a6f77e4aa44.jpeg" data-country="Indonesia" data-enabled="true" data-flag="70" data-joined="Joined May 15, 2016" data-logged="Online Nov 7, 2017" data-membership="basic" data-name="belemnarmada" data-popup="hover" data-title="" data-username="belemnarmada">
belemnarmada
</span>
<span class="user-rating">
(1387)
</span>
<span class="country-flag-small flag-70" tip="Indonesia">
</span>
</div>
</a>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">
1
</span>
<span class="game-result">
0
</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost">
</i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
30 min
</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
25
</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
Aug 9, 2017
</a>
</td>
<td class="text-center miniboard">
<input class="checkbox" game-checkbox="" game-id="2249663029" game-is-live="true" ng-model="model.gameIds[2249663029].checked" type="checkbox"/>
</td>
</tr>
The needed info is:
player's info (the answer provided by @balderman already covers that)
game-result (1, 0)
playing time (30 min in this row)
total moves (25)
playing date (Aug 9, 2017)
Thank you so much.
How about the code below?
The idea is that the user attributes are three spans under the div, so the code points to those spans and extracts the data.
from bs4 import BeautifulSoup
html = '''<html><body> <div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div></body></html>'''
soup = BeautifulSoup(html, 'html.parser')
users = soup.findAll('div', attrs={'class': 'user-tagline'})
for user in users:
    # The three spans are, in order: username, rating, country flag
    user_properties = user.findAll('span')
    for idx, prop in enumerate(user_properties):
        if idx == 0:
            print('user name: {}'.format(prop.text))
        elif idx == 1:
            print('user rating: {}'.format(prop.text))
        elif idx == 2:
            print('user country: {}'.format(prop.attrs['tip']))
Output
user name: player1
user rating: (1357)
user country: Portugal
user name: player2
user rating: (1387)
user country: Indonesia
This is a more readable solution:
div1 = soup.select("div.user-tagline")[0]
player1 = div1.select_one("span.username").text
pscore1 = div1.select_one("span.user-rating").text.strip("()")
country1 = div1.select_one("span.country-flag-small")["tip"]
To extract the data from all divs, loop over soup.select("div.user-tagline") instead of indexing with [0].
If you are interested only in the first div, you can go with this:
res = bsobj.find('div', {'class':'user-tagline'}).findAll('span')
print(res[0].text, res[1].text, res[2]['tip'])
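For the rest of the row in the update (results, playing time, moves, date), a sketch along the same lines might look like this. It assumes the class names in the posted row (game-result, moves, archive-date) are stable, and it runs against an abbreviated copy of that row with most attributes trimmed. The names row_html and game are made up for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the row from the update (most attributes trimmed).
row_html = """
<table><tr>
<td><a class="clickable-link td-user" href="#">
  <div class="user-tagline "><span class="username ">Atikinounette</span>
    <span class="user-rating">(1357)</span>
    <span class="country-flag-small flag-113" tip="Portugal"></span></div>
  <div class="user-tagline "><span class="username ">belemnarmada</span>
    <span class="user-rating">(1387)</span>
    <span class="country-flag-small flag-70" tip="Indonesia"></span></div>
</a></td>
<td><a class="clickable-link text-middle" href="#">
  <div class="pull-left"><span class="game-result"> 1 </span>
    <span class="game-result"> 0 </span></div>
</a></td>
<td class="text-center"><a class="clickable-link" href="#"> 30 min </a></td>
<td class="text-right"><a class="clickable-link text-middle moves" href="#"> 25 </a></td>
<td class="text-right miniboard"><a class="clickable-link archive-date" href="#"> Aug 9, 2017 </a></td>
</tr></table>
"""

row = BeautifulSoup(row_html, "html.parser").find("tr")

# One dict per player; the flag class changes per country, so match only
# on the stable part (country-flag-small) and read the tip attribute.
players = [
    {
        "name": tag.select_one("span.username").get_text(strip=True),
        "rating": int(tag.select_one("span.user-rating").get_text(strip=True).strip("()")),
        "country": tag.select_one("span.country-flag-small")["tip"],
    }
    for tag in row.select("div.user-tagline")
]

game = {
    "players": players,
    "results": [s.get_text(strip=True) for s in row.select("span.game-result")],
    "time": row.select_one("td.text-center a.clickable-link").get_text(strip=True),
    "moves": int(row.select_one("a.moves").get_text(strip=True)),
    "date": row.select_one("a.archive-date").get_text(strip=True),
}
print(game)
```

For a full page you would wrap the same lookups in a loop over every tr of the games table.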
I'm currently trying to loop over the following web scraping code.
My problem is that I only get the first footballer from the table (the table HTML is below) rather than all 10 players. My immediate thought is that the loop isn't working, but I'm unsure where I'm going wrong. I'm using BeautifulSoup to gather the data.
TL;DR: only 1 player appears in my CSV file instead of the 10 players available in the HTML.
Python Code
from urllib.request import urlopen as uReq
from urllib.request import Request
from bs4 import BeautifulSoup as soup

my_url = "https://www.fctables.com/teams/stoke-194901/"

#opening up connection , grabbing page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")
topScorers = page_soup.findAll("table",{"class":"table table-striped table-bordered table-hover stage-table table-condensed top_scores"})

filename = "stokeGoals.csv"
f = open(filename, "w")
headers = "player, goal_scored, average_goal"
f.write(headers)

for topScorer in topScorers:
    #top 10 players who scored
    player = topScorer.a["title"]
    #top 10 goalscorers for the team
    goalpp = topScorer.findAll("div", {"class": "progress"})
    #average goal per game
    avg = topScorer.findAll("div", {"class": "label label-primary"})
    avgpp = avg[0].text.strip()
    print("player: " + player)
    print("goal_scored: " + goalpp)
    print("AVG: "+ avgpp)
    f.write(player + "," +goalpp.replace("," , "|")+ "," + avgpp +"\n")
f.close()
HTML Code for the table/website I'm scraping data from
<table class="table table-striped table-bordered table-hover stage-table table-condensed top_scores">
<thead>
<tr>
<th>#</th>
<th class="tl">Player</th>
<th data-toggle="tooltip" title="Goals scores by player / Goals scores by his team">goals</th>
<th data-toggle="tooltip" title="Average goals">
Avg
</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td class="tl psh" data-id="212996">
<img alt="Benik Afobe" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/s4/s4glg58a2350823d58/benik-afobe.png" width="20" /> Afobe
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 100%;">
<span class="goal_p">6</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.4</div>
</td>
</tr>
<tr>
<td>2</td>
<td class="tl psh" data-id="320050">
<img alt="Thomas Ince" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/t5/t5ni157c703a92110b/thomas-ince.jpg" width="20" /> Ince
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 83.333333333333%;">
<span class="goal_p">5</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.6</div>
</td>
</tr>
<tr>
<td>3</td>
<td class="tl psh" data-id="308648">
<img alt="Saido Berahino" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/po/poyhu58a234e0da106/saido-berahino.png" width="20" /> Berahino
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 66.666666666667%;">
<span class="goal_p">4</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.3</div>
</td>
</tr>
<tr>
<td>4</td>
<td class="tl psh" data-id="257340">
<img alt="Joe Allen" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/6w/6w45558a234deae78e/joe-allen.png" width="20" /> Allen
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 50%;">
<span class="goal_p">3</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.4</div>
</td>
</tr>
<tr>
<td>5</td>
<td class="tl psh" data-id="234407">
<img alt="Erik Pieters" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/et/et08558a234dd63b68/erik-pieters.png" width="20" /> Pieters
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 50%;">
<span class="goal_p">3</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.4</div>
</td>
</tr>
<tr>
<td>6</td>
<td class="tl psh" data-id="299368">
<img alt="Peter Crouch" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/qp/qptn558a234df86f1f/peter-crouch.png" width="20" /> Crouch
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 33.333333333333%;">
<span class="goal_p">2</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.3</div>
</td>
</tr>
<tr>
<td>7</td>
<td class="tl psh" data-id="214479">
<img alt="Bojan Krkic" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/pl/pleyv57eaedf0afeac/bojan-krkic.jpg" width="20" /> Krkic
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 33.333333333333%;">
<span class="goal_p">2</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.4</div>
</td>
</tr>
<tr>
<td>8</td>
<td class="tl psh" data-id="253114">
<img alt="James McClean" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/gb/gbjmm58a234f55a560/james-mcclean.png" width="20" /> McClean
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 16.666666666667%;">
<span class="goal_p">1</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.1</div>
</td>
</tr>
<tr>
<td>9</td>
<td class="tl psh" data-id="309022">
<img alt="Sam Clucas" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/g7/g7dig58a234cb144a3/sam-clucas.png" width="20" /> Clucas
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 16.666666666667%;">
<span class="goal_p">1</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.3</div>
</td>
</tr>
<tr>
<td>10</td>
<td class="tl psh" data-id="215724">
<img alt="Bruno Martins Indi" class="img-circle" height="20" src="https://static.fctables.com/upload/images/20x20/hk/hkung58a234de0dfaa/bruno-martins-indi.png" width="20" /> Indi
<div class="slider">
<div class="inner"></div>
</div>
</td>
<td width="30%">
<div class="progress">
<div aria-valuemax="100" aria-valuemin="0" aria-valuenow="55" class="progress-bar progress-bar-primary" role="progressbar" style="width: 16.666666666667%;">
<span class="goal_p">1</span>
</div>
</div>
</td>
<td>
<div class="label label-primary">0.2</div>
</td>
</tr>
</tbody>
The webpage you specified loads its data via XMLHttpRequest.
You can grab the HTML directly from:
https://www.fctables.com/xml/table_participant/?template_id=&season_id=52%2C38%2C88&type_home=overall&type=top_score&lang_id=2&team_id=194901&limit=10
That URL gives you all the information you need without the extra HTML noise, e.g.:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://www.fctables.com/xml/table_participant/?template_id=&season_id=52%2C38%2C88&type_home=overall&type=top_score&lang_id=2&team_id=194901&limit=10"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
# Three parallel lists: the i-th entry of each belongs to the same player
_names = page_soup.findAll("img",{"class":"img-circle"})
_goals = page_soup.findAll("span",{"class":"goal_p"})
_avg = page_soup.findAll("div",{"class":"label label-primary"})
for name, goals, avg in zip(_names, _goals, _avg):
    print(name['alt'], avg.get_text(), goals.get_text())
Benik Afobe 0.4 6
Thomas Ince 0.6 5
Saido Berahino 0.3 4
Joe Allen 0.4 3
Erik Pieters 0.4 3
Peter Crouch 0.3 2
Bojan Krkic 0.4 2
James McClean 0.1 1
Sam Clucas 0.3 1
Bruno Martins Indi 0.2 1
Note:
Adjust the URL parameters as needed; you can change type (top_score here), team_id, limit, etc.
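One way to make those adjustments less error-prone is to build the query string from a dict. This is only a sketch: the parameter names are copied from the URL above, the function name top_scorers_url is made up, and which other values the endpoint accepts is not something I have checked.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.fctables.com/xml/table_participant/"


def top_scorers_url(team_id, limit=10, season_id="52,38,88"):
    """Build the XHR URL; parameter names are taken from the URL above."""
    params = {
        "template_id": "",
        "season_id": season_id,  # the comma is percent-encoded as %2C
        "type_home": "overall",
        "type": "top_score",
        "lang_id": 2,
        "team_id": team_id,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)


print(top_scorers_url(194901))
```

Called with team_id 194901 and the defaults, this reproduces the URL used above, so it can be dropped into my_url directly.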
If you keep the original page URL as it is, you can try the following to fetch the required results:
import requests
from bs4 import BeautifulSoup

url = "https://www.fctables.com/teams/stoke-194901/"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# Each row of the top-scorers table holds one player
for items in soup.select(".top_scores tbody tr"):
    name = items.select_one("td a[href^='/players/']").get("title")
    goal = items.select_one("td .goal_p").text
    avrg = items.select_one("td .label-primary").text
    print(name, goal, avrg)
Output you should get:
Benik Afobe 6 0.4
Thomas Ince 5 0.6
Saido Berahino 4 0.3
Joe Allen 3 0.4
Erik Pieters 3 0.4
Peter Crouch 2 0.3
Bojan Krkic 2 0.4
James McClean 1 0.1
Sam Clucas 1 0.3
Bruno Martins Indi 1 0.2
The HTML below has dynamic attributes for the individual series. For example, one series can be available in multiple units, such as Millions or Thousands.
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
Total Vehicle Sales
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:49px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-meta">
<span class="attributes">Monthly</span>
<br class="clear">
</div>
<div class="series-meta">
<input class="pager-item-checkbox" type="checkbox" name="sids[0]" value="TOTALSA">
<a href="/series/TOTALSA">
Millions of Units,
Seasonally Adjusted Annual Rate
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
<br class="clear">
<input class="pager-item-checkbox" type="checkbox" name="sids[1]" value="TOTALNSA">
<a href="/series/TOTALNSA">
Thousands of Units,
Not Seasonally Adjusted
</a>
<span class="series-meta-dates">
Jan 1976
to
Jul 2017
(4 days ago)
</span>
</div>
</td>
</tr>
<tr><td colspan="2" style="font-size:9px"> </td></tr>
<tr class="series-pager-title">
<td valign="top" colspan="2">
<div class="col-xs-12 col-sm-10">
Light Weight Vehicle Sales: Autos and Light Trucks
</div>
<div class="hidden-xs col-sm-2">
<span style="padding-left:46px;" class="popularity_bar"> </span> <span class="popularity_bar_background"> </span>
</div>
</td>
</tr>
<tr class="series-pager-attr">
<td colspan="2">
<div class="series-meta series-group-single">
<input class="pager-item-checkbox" type="checkbox" name="sids[2]" value="ALTSALES">
<span class="attributes" style="width:350px;">Millions of Units, Monthly, Seasonally Adjusted Annual Rate</span><span class="series-meta-dates">Jan 1976 to Jul 2017 (4 days ago)</span>
<br class="clear">
</div>
<a href="/series/ALTSALES">
</a>
</td>
This gets me somewhat close, but it fails to obtain the second entry for "Total Vehicle Sales": it only picks up the first one, "Millions of Units, Seasonally Adjusted Annual Rate." Aside from that issue, I suspect my current query misclassifies things in general. The code I have so far:
from bs4 import BeautifulSoup
from selenium import webdriver

# Raw string so the backslashes in the Windows path are not treated as escapes
browser = webdriver.Chrome(executable_path=r'F:\Anaconda\chromedriver\chromedriver_win32\chromedriver.exe')
browser.get('https://fred.stlouisfed.org/categories/32993')
soup = BeautifulSoup(browser.page_source, 'lxml')
for l in soup.find_all('tbody'):
    series_data = l.find_all('tr', attrs={'class': 'series-pager-title'})
    attrs_data = l.find_all('tr', attrs={'class': 'series-pager-attr'})
    series_count = len(series_data)
    print(series_count)
    print(len(attrs_data))
    for m in range(0, series_count):
        print(series_data[m].find('a', href=True).text + ' | ' + attrs_data[m].find('a', href=True).text.strip().replace('  ', ' '))
Can someone please help me adjust the query above to produce the desired outcome?
If anyone comes across this with a better solution, I am all ears. In the interim, this seems to do the trick:
import re
import pandas as pd
from bs4 import BeautifulSoup

# `browser` is the Selenium webdriver instance created in the question
browser.get('https://fred.stlouisfed.org/categories/32993')
soup = BeautifulSoup(browser.page_source, 'lxml')
test = soup.tbody
children = [child for child in test if child != '\n']

# Collect rows in plain lists; DataFrame.append was removed in pandas 2.0
series_rows = []
sub_series_rows = []
series_index = 0
for child in children:
    # Title row: starts a new series
    if child.find('a', attrs={'class': 'series-title'}):
        series_index += 1
        series_title = child.text.strip()
        series_link = child.find('a', href=True).attrs['href']
        series_rows.append({'series_index': series_index,
                            'series_title': series_title,
                            'series_href': series_link})
    # Attribute row: one sub-series per link
    if child.find('div', attrs={'class': 'series-meta'}):
        frequency = child.find('span', attrs={'class': 'attributes'}).text.strip()
        for i in child.find_all('a', href=True):
            sub_series_rows.append({'series_index': series_index,
                                    'frequency': frequency,
                                    # Collapse newlines and runs of spaces
                                    'sub_series_units': re.sub(' +', ' ', re.sub('\n', ' ', i.text)),
                                    'sub_series_href': 'https://fred.stlouisfed.org' + i.attrs['href']})

series_data = pd.DataFrame(series_rows, columns=['series_index', 'series_title', 'series_href'])
sub_series_data = pd.DataFrame(sub_series_rows, columns=['series_index', 'frequency', 'sub_series_units', 'sub_series_href'])
print(series_data)
print(sub_series_data)
combine_series_data = pd.merge(series_data, sub_series_data, how='left', on=['series_index'])
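The code in the question returned only the first entry because find returns the first matching element, whereas find_all returns every match. Here is a minimal sketch of that difference on a trimmed copy of the HTML from the question. The snippet is cut down for illustration and parsed with html.parser rather than lxml, and the names row, first_only and variants are just for this example.

```python
from bs4 import BeautifulSoup

# Trimmed version of one "series-pager-attr" row from the question
html = """
<table><tbody>
<tr class="series-pager-attr"><td>
<div class="series-meta series-group-meta"><span class="attributes">Monthly</span></div>
<div class="series-meta">
<a href="/series/TOTALSA">
Millions of Units,
Seasonally Adjusted Annual Rate
</a>
<a href="/series/TOTALNSA">
Thousands of Units,
Not Seasonally Adjusted
</a>
</div>
</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", class_="series-pager-attr")

# find() stops at the first <a>; find_all() returns every <a> in the row
first_only = row.find("a", href=True)
all_links = row.find_all("a", href=True)

# Normalise whitespace by splitting on any run of whitespace and re-joining
variants = [(a["href"], " ".join(a.get_text().split())) for a in all_links]
for href, label in variants:
    print(href, "|", label)
```

Looping over find_all in this way yields both the TOTALSA and the TOTALNSA entries for the same series.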
I have created this for loop to find td items whose id starts with 'td_threadtitle':
for item in posts:
    hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
    print(hello)
But I get this error:
hello = item.find("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
TypeError: slice indices must be integers or None or have an __index__ method
When I change the assignment to hello = item.find("td"), it works perfectly fine. Why does it throw that error when I try to specify the id?
EDIT:
This is how I created posts:
tableWithPosts = soup.find("body").find("div", attrs = {"align": "center"}).find("div", {"class" : "page"}).find("div", attrs = {"style" : "padding:0px 0px 0px 0px"}).find("center").find("form").find("table", {"id": "threadslist"})
posts = tableWithPosts.find("tbody", {"id": "threadbits_forum_75"})
Here is a portion of posts:
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=206824', '_self')" style="cursor:pointer">
thelavenhagen
</span>
</div>
</td>
<td class="alt2" title="Replies: 11, Views: 1,471">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
05:06:46 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=581132" rel="nofollow">
westopher
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660274#post1067660274">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=581132" onclick="who(581132); return false;">
11
</a>
</td>
<td align="center" class="alt2">
1,471
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_558556">
<img alt="" border="" id="thread_statusicon_558556" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_558556" title="1996 E36 M3 Lux Dakar Yellow, 87,800 miles, special order without sunroof. Second owner, owned...">
<div>
<span style="float:right">
<a href="#" onclick="attachments(558556); return false">
<img alt="4 Attachment(s)" border="0" class="inlineimg" src="images/misc/paperclip.gif"/>
</a>
</span>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556" id="thread_title_558556">
1996 E36 M3 - Dakar Lux Slicktop
</a>
<span class="smallfont" style="white-space:nowrap">
(
<img alt="Multi-page thread" border="0" class="inlineimg" src="images/misc/multipage.gif"/>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556">
1
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=2">
2
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=558556&page=3">
3
</a>
)
</span>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=95931', '_self')" style="cursor:pointer">
yellowbee
</span>
</div>
</td>
<td class="alt2" title="Replies: 23, Views: 5,147">
<div class="smallfont" style="text-align:right; white-space:nowrap">
Thu, May-25-2017
<span class="time">
04:04:07 AM
</span>
<br/>
by
<a href="member.php?s=625e629b088a68126ca2d867c056b363&find=lastposter&t=558556" rel="nofollow">
mbausa
</a>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&p=1067660244#post1067660244">
<img alt="Go to last post" border="0" class="inlineimg" src="images/buttons/lastpost.gif"/>
</a>
</div>
</td>
<td align="center" class="alt1">
<a href="misc.php?do=whoposted&t=558556" onclick="who(558556); return false;">
23
</a>
</td>
<td align="center" class="alt2">
5,147
</td>
</tr>
<tr>
<td class="alt1" id="td_threadstatusicon_580693">
<img alt="" border="" id="thread_statusicon_580693" src="images/statusicon/thread_hot.gif"/>
</td>
<td class="alt2">
<img alt="" border="0" src="images/icons/icon1.gif"/>
</td>
<td class="alt1" id="td_threadtitle_580693" title="Selling my wife's car. We have owned her for two years and have put over 20k hassle free miles on...">
<div>
<span style="color: blue">
<b>
<u>
FS:
</u>
</b>
</span>
<a href="showthread.php?s=625e629b088a68126ca2d867c056b363&t=580693" id="thread_title_580693">
2011 BMW 740Li Alpine White M Package Dakota Brown Interior Weather-tech Mats
</a>
</div>
<div class="smallfont">
<span onclick="window.open('member.php?s=625e629b088a68126ca2d867c056b363&u=128641', '_self')" style="cursor:pointer">
911-AL
</span>
</div>
</td>
Remove your for loop and try this instead:
hello = posts.find_all("td", {"id": lambda L: L and L.startswith('td_threadtitle')})
hello
This finds every td whose id starts with 'td_threadtitle'.
hello will be a list containing all of those td elements (objects of <class 'bs4.element.Tag'>), so you can still access the div inside each one.
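As for why the original loop raised the error: iterating over a bs4 Tag yields its children, and those include the whitespace text nodes between the tr elements. Text nodes are NavigableString objects, which subclass str, so (at least in the bs4 versions that produce this traceback) calling item.find(...) on one of them resolves to str.find, whose second argument is a start index rather than an attribute filter. The sketch below reproduces the same kind of TypeError using only a plain str subclass; TextNode is a hypothetical stand-in, not a bs4 class.

```python
class TextNode(str):
    """Hypothetical stand-in for a bs4 text node, which is also a str subclass."""


whitespace_child = TextNode("\n")

# With one argument this is an ordinary substring search. It "works",
# but only in the sense that it returns -1 (substring not found).
print(whitespace_child.find("td"))

error = None
try:
    # Same call shape as in the question: on a str, the second positional
    # argument is interpreted as the start index, so a dict is rejected.
    whitespace_child.find("td", {"id": "td_threadtitle"})
except TypeError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

In real code, either call find_all on posts directly as shown above, or skip any child that is not an instance of bs4.element.Tag before calling find on it.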