I am trying to scrape this website. My objective is to collect the 10 most recent results (win/loss/draw) of any team; I am just using this specific team as an example. The source for an individual row is:
<tr class="odd match no-date-repetition" data-timestamp="1515864600" id="page_team_1_block_team_matches_3_match-2463021" data-competition="8">
<td class="day no-repetition">Sat</td>
<td class="full-date" nowrap="nowrap">13/01/18</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/tottenham-hotspur-football-club/675/" title="Tottenham Hotspur">
Tottenham Hotspur
</a>
</td>
<td class="score-time score">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" class="result-win">
4 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/everton-football-club/674/" title="Everton">
Everton
</a>
</td>
<td class="events-button button first-occur">
View events
</td>
<td class="info-button button">
More info
</td>
</tr>
You can see that the result is stored in the <td class="score-time score"> cell.
My knowledge of Python and web crawling is pretty limited, so my current code is:
import bs4
import requests

# soccerwayURL holds the URL of the team page
res2 = requests.get(soccerwayURL)
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
elems2 = soup2.select('#page_team_1_block_team_matches_3_match-2463021 > td.score-time.score')
print(elems2[0].text.strip())
This prints '4 - 0'. That is good, but the issue arises when I try to access a different row. The 7-digit number (2463021 in the example above) is unique to that row, so to get the score from a different row I would have to find its number and place it in the CSS selector '#page_team_1_block_team_matches_3_match-******* > td.score-time.score', where the asterisks stand for the unique number.
The online course I took only showed how to reference things by CSS selector, so I am unsure how to retrieve the scores without manually copying the CSS selector for each row.
Inside the <td class="score-time score"> cell there is an <a> element with class="result-win". Ideally I would like to pull just that "result-win", because I am not looking for the score of the game, only the outcome: win, loss, or draw.
I hope this is a little clearer. I greatly appreciate your patience with me.
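One way to avoid hard-coding the per-row id is to select every match row by its classes and read the class on the link inside the score cell. The following is only a minimal sketch: the inline HTML stands in for res2.text, the second row and the "result-loss" class are hypothetical (only "result-win" appears in the source above), and it assumes the page lists matches oldest first.

```python
from bs4 import BeautifulSoup

# Stand-in for res2.text. The second row and its "result-loss" class are
# hypothetical; only "result-win" appears in the page source quoted above.
html = """
<table>
<tr class="odd match" id="page_team_1_block_team_matches_3_match-2463021">
  <td class="score-time score"><a href="#" class="result-win">4 - 0</a></td>
</tr>
<tr class="even match" id="page_team_1_block_team_matches_3_match-0000000">
  <td class="score-time score"><a href="#" class="result-loss">0 - 1</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

outcomes = []
# Match every row by class instead of by its unique id.
for link in soup.select("tr.match td.score-time.score a"):
    # Keep only the class that encodes the outcome, e.g. "result-win" -> "win".
    for cls in link.get("class", []):
        if cls.startswith("result-"):
            outcomes.append(cls[len("result-"):])

# Assumption: rows are listed oldest first, so the tail holds the newest matches.
last_ten = outcomes[-10:]
print(last_ten)
```

If the page turns out to list the newest matches first, outcomes[:10] would be the slice to take instead.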
I am trying to scrape the year from the html below (https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/punjab-kings-vs-delhi-capitals-64th-match-1304110/full-scorecard). Due to the way the site is coded I have to first identify the table cell that contains the word "Season" then get the year (2022 in this example).
I thought this would get it but it doesn't. There are no errors, just no results. I've not used the following-sibling approach before so I'd be grateful if someone could point out where I've messed up.
l.add_xpath(
'Season',
"//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()")
html:
<tr class="ds-border-b ds-border-line">
<td class="ds-min-w-max ds-border-r ds-border-line">
<span class="ds-text-tight-s ds-font-medium">Season</span>
</td>
<td class="ds-min-w-max">
<span class="ds-inline-flex ds-items-center ds-leading-none">
<a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022" class="ds-text-ui-typo ds-underline ds-underline-offset-4 ds-decoration-ui-stroke hover:ds-text-ui-typo-primary hover:ds-decoration-ui-stroke-primary ds-block">
<span class="ds-text-tight-s ds-font-medium">2022</span>
</a>
</span>
</td>
</tr>
Try:
//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()
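To see why the first expression returns nothing, it can help to run both against the snippet outside the spider. This sketch uses lxml directly, which is an assumption since the question uses a Scrapy item loader, and wraps the row in a <table> so it parses as a fragment. The word "Season" lives in a <span> inside the first <td>, so the td's own text() nodes are only whitespace, and the <a> is not a direct child of the sibling <td>.

```python
from lxml import html

# The row from the question, wrapped in <table> so it parses cleanly.
snippet = """
<table>
<tr class="ds-border-b ds-border-line">
  <td class="ds-min-w-max ds-border-r ds-border-line">
    <span class="ds-text-tight-s ds-font-medium">Season</span>
  </td>
  <td class="ds-min-w-max">
    <span class="ds-inline-flex ds-items-center ds-leading-none">
      <a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022">
        <span class="ds-text-tight-s ds-font-medium">2022</span>
      </a>
    </span>
  </td>
</tr>
</table>
"""

tree = html.fromstring(snippet.strip())

# Original attempt: the td's direct text is whitespace, so nothing matches.
original = tree.xpath(
    "//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()"
)

# Suggested fix: match the span, step up to its td, then walk down to the year.
suggested = tree.xpath(
    '//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()'
)

print(original, suggested)
```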
I am trying to build a list that matches India's districts to their district codes as of the 2011 population census. Below is a small subset of the outerHTML I copied from a government website. I want to loop over it, extract one string and one int from each HTML block, and store them, ideally on the same row of a pandas DataFrame. The HTML blocks look like this; I show 2 here, and there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around the values that I want to get from the text file. I was wondering how I could loop through this text to extract them. I thought about counting cells each time I encounter a new row and then extracting the 1st and 6th, but I don't know how to code this. I hope someone is willing to help out. If anyone already has this list, that would be great too!
If you're able to get the text of the entire HTML table, you can use dfs = pd.read_html(html_text_string), which returns a list of DataFrames, one per table found. 50% of the time, it works every time!
See the pd.read_html docs.
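If read_html does not cope with the file, looping over the rows by hand is also short. This is a rough sketch using BeautifulSoup rather than pandas. It assumes the ** markers are not in the real file (they were added to the question for emphasis) and that the wanted values are always the third and eighth cells of a row, as in the two rows shown. The rows below are shortened copies of those two.

```python
from bs4 import BeautifulSoup

# Shortened copies of the two rows from the question, without the ** markers.
html_text = """
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">NICOBARS</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">638</td>
<td align="left">
Not Covered
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">NORTH AND MIDDLE ANDAMAN</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">639</td>
<td align="left">
Not Covered
</td>
</tr>
"""

soup = BeautifulSoup(html_text, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Skip anything that does not have the expected number of cells.
    if len(cells) >= 8:
        # Zero-based positions 2 and 7: district name and census code.
        rows.append((cells[2], int(cells[7])))

print(rows)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["district", "code"]), where the column names are placeholders.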
I have encountered a problem in my programming that has me stumped.
I'm trying to access data stored in a wealth of old HTML files that were saved as text. However, in saving, the HTML lost its indentation, tabs, hierarchy, whatever you wish to call it. An example can be found below.
......
<tr class="ro">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_RevenueFromContractWithCustomerExcludingAssessedTax', window );">Net sales</a></td>
<td class="nump">$ 123,897<span></span>
</td>
<td class="nump">$ 122,136<span></span>
</td>
<td class="nump">$ 372,586<span></span>
</td>
<td class="nump">$ 360,611<span></span>
</td>
</tr>
<tr class="re">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_OtherIncome', window );">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
<td class="nump">3,026<span></span>
</td>
<td class="nump">3,465<span></span>
</td>
</tr>
<tr class="rou">
<td class="pl " style="border-bottom: 0px;" valign="top"><a class="a" href="javascript:void(0);" onclick="top.Show.showAR( this, 'defref_us-gaap_Revenues', window );">Total revenues</a></td>
<td class="nump">124,894<span></span>
</td>
<td class="nump">123,179<span></span>
</td>
<td class="nump">375,612<span></span>
</td>
<td class="nump">364,076<span></span>
</td>
</tr>
I would typically employ Beautiful Soup here and parse the data that way, but I've not found a good workflow since technically there is no hierarchy here; I can't tell BS to look within anything other than the document itself, which is huge and might be far too time consuming to search (see next statement).
I also need to find a thorough solution and not a quick-fix because I have hundreds, if not thousands, of these same HTML-to-text files to parse.
So my question here is, if I want to return, in all the files, the first number for "Membership and other Income" (997 in this case), how could I go about doing that?
Two sample files can be found here:
(https://www.sec.gov/Archives/edgar/data/1800/0001104659-18-065076.txt) (https://www.sec.gov/Archives/edgar/data/1084869/0001437749-18-020205.txt)
EDIT - 4/16
Thanks for the replies everyone! I've written some code that returns the tags I'm looking for.
import requests
from bs4 import BeautifulSoup
data = requests.get('https://www.sec.gov/Archives/edgar/data/320193/0000320193-18-000070.txt')
# load the data
soup = BeautifulSoup(data.text, 'html.parser')
# get the data
for tr in soup.find_all('tr', {'class':['rou','ro','re','reu']}):
    db = [td.text.strip() for td in tr.find_all('td')]
    print(db)
The problem is there are a TON of returns and most contain nothing of use. Is there a way to filter based on these tags' grandparent? I've tried the same approach as above using head, title, body, etc. but I can't quite get BS to identify the FILENAME..
<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
**<FILENAME>R2.htm**
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html>
<head>
<title></title>
.....removed for brevity
</head>
<body>
.....removed for brevity
<td class="text"> <span></span>
</td>
.....removed for brevity
</tr>
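Since <DOCUMENT>, <FILENAME> and <TEXT> are not HTML tags, one option is to split the filing with plain regular expressions first and hand only the wanted document to the HTML parser. A minimal sketch, assuming each block is closed by </TEXT> and </DOCUMENT> (the excerpt above only shows the opening tags); the sample text and the R4.htm entry are made up for illustration.

```python
import re

# Shortened stand-in for one downloaded filing; the R4.htm entry is hypothetical.
filing_text = """<DOCUMENT>
<TYPE>XML
<SEQUENCE>14
<FILENAME>R2.htm
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
<html><body><table><tr class="re"><td class="nump">997</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>XML
<SEQUENCE>16
<FILENAME>R4.htm
<TEXT>
<html><body><table><tr class="re"><td class="nump">1</td></tr></table></body></html>
</TEXT>
</DOCUMENT>
"""


def split_documents(text):
    """Map each <FILENAME> value to the contents of its <TEXT> block."""
    docs = {}
    for block in re.findall(r"<DOCUMENT>(.*?)</DOCUMENT>", text, flags=re.DOTALL):
        name = re.search(r"<FILENAME>(\S+)", block)
        body = re.search(r"<TEXT>(.*?)</TEXT>", block, flags=re.DOTALL)
        if name and body:
            docs[name.group(1)] = body.group(1).strip()
    return docs


documents = split_documents(filing_text)
print(sorted(documents))
```

documents["R2.htm"] can then be passed to BeautifulSoup on its own, which keeps the rows from every other embedded document out of the results.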
Just so you are aware, HTML does not care about indentation. If you really wanted to, you could put it all on one line with no spaces in between. An HTML parser only looks at the structure of the tags.
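from bs4 import BeautifulSoup
# html_doc is the text of the file you loaded
soup = BeautifulSoup(html_doc, 'html.parser')
# find_all is a method, so call it with parentheses, then index the result
soup.find_all('<tag you are looking for>')[0]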
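Applied to the question, one possible approach is to locate the <a> whose text is the label, climb to its row, and take the first td with class "nump". This is only a sketch: first_value is a hypothetical helper name, the HTML is a trimmed copy of the row above, and the label comparison ignores case because the question writes "Income" while the file has "income".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Membership and other income" row from the question.
html_doc = """
<tr class="re">
<td class="pl "><a class="a" href="javascript:void(0);">Membership and other income</a></td>
<td class="nump">997<span></span>
</td>
<td class="nump">1,043<span></span>
</td>
</tr>
"""


def first_value(html_text, label):
    """Return the first numeric cell of the row whose link text equals label."""
    soup = BeautifulSoup(html_text, "html.parser")
    wanted = label.strip().lower()
    # Find the <a> whose text matches the label, ignoring case and whitespace.
    link = soup.find("a", string=lambda s: s is not None and s.strip().lower() == wanted)
    if link is None:
        return None
    row = link.find_parent("tr")
    if row is None:
        return None
    cell = row.find("td", class_="nump")
    return cell.get_text(strip=True) if cell else None


print(first_value(html_doc, "Membership and other Income"))
```

The function returns None when the label is absent, so it can be run over many files without raising on the ones that lack the line item.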
I have some Python code that extracts information from a table, but sometimes the XPath changes. Right now it only changes between two different XPaths. One looks like this:
//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span/
and the other alternative is a slight change in the table index, like this:
//*[@id='content-primary']/table[2]/tbody/tr[td[1]/span/span/
This is the code that I am using right now to get the information that I need:
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
So what I want to do is check whether the given XPath matches anything. If it does not, I just try the other XPath alternative.
Hope somebody can help me with this problem. Thank you all.
EDIT1
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tfoot>
<tr>
<td colspan="3">
<dl>
<dt class="clNotify">Röd text</dt>
<dd> = Ändrad matchtid </dd>
<dt><img src="http://svenskfotboll.se/i/u/alert.gif" alt="Röda utropstecknet" /></dt>
<dd> = Peka på utropstecknet så visas en notering </dd>
<dt><img src="http://svenskfotboll.se/i/widget.gif" alt="Widget" /></dt>
<dd>Hämta widget för kommande matcher</dd>
</dl>
</td>
</tr>
</tfoot>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span> //This is the date i am checking with first
</td>
<td>Götene IF - Vårgårda IK </td> // The other information that i need from the table later
<td>Sparbanksvallen Götene konstgräs </td>
</tr>
In my situation I did not need to specify which table to extract the information from. Since the rows I want are identified by a date that only appears in that table, I just used this code and it worked out fine for me:
rows_xpath = XPath("//*[@id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))
Now it is just table, without an index, which means it will go through both tables on the page. It may not be the cleanest solution, but it works for me.
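For the fallback originally asked about, note that an XPath that matches nothing is still valid; it simply returns an empty list, so each candidate can be tried in turn. A minimal sketch with lxml, where find_rows and CANDIDATES are hypothetical names, the page is a cut-down stand-in for the real one, and the date is passed as an XPath variable ($date) instead of being formatted into the string:

```python
from lxml import html

# Cut-down stand-in for the real page: the match table is the second table.
page = """
<div id="content-primary">
  <table><tbody><tr><td>Unrelated table</td></tr></tbody></table>
  <table class="clCommonGrid">
    <tbody class="clGrid">
      <tr class="clTrOdd">
        <td><span class="matchTid"><span>2015-04-17</span></span></td>
        <td>Götene IF - Vårgårda IK</td>
        <td>Sparbanksvallen Götene konstgräs</td>
      </tr>
    </tbody>
  </table>
</div>
"""

# Candidate expressions, tried in order.
CANDIDATES = [
    "//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()=$date]",
    "//*[@id='content-primary']/table[2]/tbody/tr[td[1]/span/span//text()=$date]",
]


def find_rows(tree, date):
    """Return the rows from the first candidate XPath that matches anything."""
    for expr in CANDIDATES:
        rows = tree.xpath(expr, date=date)
        if rows:
            return rows
    return []


tree = html.fromstring(page.strip())
rows = find_rows(tree, "2015-04-17")
print(len(rows), rows[0].get("class") if rows else None)
```

Here table[3] matches nothing, so the function falls through to table[2] and returns its row.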
Hi guys, I have a question about parsing HTML with BeautifulSoup.
My question is how to parse this HTML:
<div class="time_table show_today" id="monday_schedule">
<h3>January 20, 2014</h3>
<table>
<tbody>
<tr>
<th>Time</th>
<th>Program</th>
</tr>
<tr>
<td class="time_part"> 0:00 </td>
<td class="show_content">
<h4>
First Up
</h4>
<p>
Bloomberg Television's award winning morning show takes a look at market openings in Asia and analyzes all the breaking news stories essential for your business day ahead. </p>
</td>
</tr>
<tr>
<td class="time_part"> 2:00 </td>
<td class="show_content">
<h4>
On the Move with Rishaad Salamat
</h4>
<p>
Rishaad Salamat brings you comprehensive coverage of market openings from Asia and live reporting on the stories most impacting business around the globe. </p>
</td>
</tr>
<tr>
<td class="time_part"> 4:00 </td>
<td class="show_content">
<h4>
Asia Edge
</h4>
<p>
Get to the bottom of the days major issues influencing business decisions with Rishaad Salamat. Asia Edge gives viewers a deeper perspective through extended interviews with the region's newsmakers as well as fast-paced panel discussions featuring Bloomberg's market reporters, business experts and influential guests. Stay ahead of the business day with Asia Edge. </p>
</td>
</tr>
My code looks like:
url = 'http://www.bloomberg.com/tv/schedule/europe/'
response = urllib2.urlopen(url)
soup = BeautifulSoup(response)
for line in soup.findAll('div',{'td','h4','p'}):
print line
What am I doing wrong in the code? Some advice would be great.
The problem is that the date heading, e.g. <h3>January 20, 2014</h3>, appears once per day for about a week, but my code only picks up one of them, and I can't get the loop to print the date on every line together with all the other tags.
I'm not sure what you're trying to achieve with {'td','h4','p'} as a second argument. That's a set, and not a dict (as you may be thinking it is).
If you want to obtain the date, a simple soup.find('h3') should be good here:
>>> print soup.find('h3')
<h3>January 20, 2014</h3>
>>> print soup.find('h3').text
January 20, 2014
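To print the date next to every programme, one approach is to loop over each day's div, read its h3 once, and then walk that div's rows. The sketch below is written for Python 3 and a current BeautifulSoup, unlike the Python 2 snippets above. It assumes every day has its own div with class "time_table", as the one shown does, and uses shortened descriptions in place of the real ones.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question; descriptions are stand-ins.
page = """
<div class="time_table show_today" id="monday_schedule">
<h3>January 20, 2014</h3>
<table>
<tbody>
<tr>
<th>Time</th>
<th>Program</th>
</tr>
<tr>
<td class="time_part"> 0:00 </td>
<td class="show_content">
<h4>
First Up
</h4>
<p>Morning show. </p>
</td>
</tr>
<tr>
<td class="time_part"> 2:00 </td>
<td class="show_content">
<h4>
On the Move with Rishaad Salamat
</h4>
<p>Market coverage. </p>
</td>
</tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(page, "html.parser")

schedule = []
# One div per day: read the date once, then attach it to each row of that day.
for day in soup.find_all("div", class_="time_table"):
    date = day.find("h3").get_text(strip=True)
    for row in day.find_all("tr"):
        time_cell = row.find("td", class_="time_part")
        if time_cell is None:
            continue  # header row with <th> cells only
        title = row.find("h4").get_text(strip=True)
        schedule.append((date, time_cell.get_text(strip=True), title))

for entry in schedule:
    print(entry)
```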