I am trying to scrape the year from the HTML below (https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/punjab-kings-vs-delhi-capitals-64th-match-1304110/full-scorecard). Due to the way the site is coded, I have to first identify the table cell that contains the word "Season" and then get the year (2022 in this example).
I thought this would get it but it doesn't. There are no errors, just no results. I've not used the following-sibling approach before so I'd be grateful if someone could point out where I've messed up.
l.add_xpath(
    'Season',
    "//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()")
html:
<tr class="ds-border-b ds-border-line">
<td class="ds-min-w-max ds-border-r ds-border-line">
<span class="ds-text-tight-s ds-font-medium">Season</span>
</td>
<td class="ds-min-w-max">
<span class="ds-inline-flex ds-items-center ds-leading-none">
<a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022" class="ds-text-ui-typo ds-underline ds-underline-offset-4 ds-decoration-ui-stroke hover:ds-text-ui-typo-primary hover:ds-decoration-ui-stroke-primary ds-block">
<span class="ds-text-tight-s ds-font-medium">2022</span>
</a>
</span>
</td>
</tr>
Try:
//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()
Your expression matches nothing because "Season" is not a text node of the <td> itself; it sits inside a child <span>. The year is likewise wrapped in a <span> inside the <a>, so /a/text() would only ever return whitespace.
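Plugged back into the item loader, the fix looks like this (a sketch, assuming l is the ItemLoader from the question):
l.add_xpath(
    'Season',
    '//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()')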
I was thrown ("change this existing program") into Python and lxml and am trying to find my way by doing.
So I am sorry for asking a possibly easy or silly question ... but I am a bit stuck.
The program splits a table into rows by
rows = page.cssselect("table-data.table-top tbody tr")
The various columns are addressed (after: for row in rows) by
dns = row.cssselect(".column-number")
cds = row.cssselect(".column-documents")
However, in the column "column-documents" there are several entries (maybe 0, maybe 5): empty, one icon with a link, or up to five icons with links and different meanings, each defined with its own class. And I need to find out if a specific entry (icon with link) is present there.
It is identified by a specific class: class="document-link submission-link hide-text".
<tr class="row-0 tier1-5">
<td class="column-notext">4.</td>
<td class="column-label">Descriptive title</td>
<td class="column-number">007</td>
<td class="column-dokumente">
<a href="/somelink.pdf" target="_blank" title="title of pdf">
<span class="document-link submission-link hide-text">
<span>Main Document</span>
</span>
</a>
<a href="/somelink.pdf) title 2">
<span class="attachment-link submission-attachment-link hide-text">
<span>(text)</span>
</span>
</a>
<a href="/link.pdf" target="_blank" title="some title">
<span class="document-link beschluss-link hide-text">
<span>text</span>
</span>
</a>
<span class="document-spacer hide-text" />
<a href="html-link" title="some title">
<span class="vorgang-link hide-text">
<span>text</span>
</span>
</a>
</td>
</tr>
I just need to know if this is there or not.
And my silly question is: How do I do it?
Thanks in advance,
Andreas.
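A minimal sketch of one way to do this, assuming row is an lxml row element as in the code above: cssselect returns a list, so its truthiness tells you whether the icon is present.
# True if this row contains the submission icon, False otherwise
has_submission = bool(row.cssselect("span.document-link.submission-link"))
# or walk up to the enclosing <a> to grab its href as well
for span in row.cssselect("span.document-link.submission-link"):
    print(span.getparent().get("href"))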
It's an odd one, and I have been sitting on this for nearly a week now.
Maybe it's obvious and I'm just not seeing things right anymore...
Any leads for alternative solutions are welcome, too.
I have no influence on the website.
I'm new to HTML.
I am trying to get specific links from a website using Scrapy (how many there are changes).
In this case RELATIVELINK1 and RELATIVELINK4; both are labeled "Details".
How many tables there are depends on what you are allowed to see.
Before I start with the problem:
I'm using the scrapy shell to test responses.
I get values from all other parts of the HTML code.
I tried XPath, response.css and Scrapy's LinkExtractor.
I tried ignoring the /p part in the path.
Now, if I try to get a response with XPath:
response.xpath('/html/body').extract() - I get everything, including what is inside <p>
but when I get to
response.xpath('/html/body/.../p').extract() - I only get: ['<p>\n<br>\n</p>']
and then
response.xpath('/html/body/.../p/table').extract() - I get []
same for
response.xpath('/html/body/.../p/br').extract()
Here is the HTML segment I'm having trouble with:
<p>
<BR>
<TABLE BORDER>
<TR>
<TD><b>NAME1</b></TD>
<TD><b>NAME2</b></TD>
<TD><b>NAME3</b></TD>
<TD><b>NAME4</b></TD>
<TD COLSPAN=3><b>Links</b></TD>
</TR>
<TR>
<TD>NUMBER1</font></TD>
<TD>LINK1 </font></TD>
<TD> </font></TD>
<TD>NAME5 </font></TD>
<TD><a href=RELATIVELINK1>Details</a></TD>
<TD><a href=RELATIVELINK2>LABEL1</TD>
<TD><a href=RELATIVELINK3>LABEL2</TD>
</TR>
<TR>
<TD>NUMBER2</font></TD>
<TD>LINK2 </font></TD>
<TD> </font></TD>
<TD>NAME5;</font></TD>
<TD><a href=RELATIVELINK4>Details</a></TD>
<TD><a href=RELATIVELINK5>LABEL1</TD>
<TD><a href=RELATIVELINK6>LABEL2</TD>
</TR>
</TABLE>
<BR>
There is no </P>.
This happens because a <table> is not valid inside a <p>, so the HTML parser closes the paragraph before the table; in the parsed DOM the table is a sibling of the <p>, not a child, which is why /p/table matches nothing. Skip the full path and select the links by their text instead:
for link_href in response.xpath('//a[.="Details"]/@href').extract():
    print(link_href)
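Since LinkExtractor was also tried, an equivalent sketch (assuming the same response object; restrict_xpaths limits extraction to the "Details" anchors):
from scrapy.linkextractors import LinkExtractor

for link in LinkExtractor(restrict_xpaths='//a[.="Details"]').extract_links(response):
    print(link.url)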
I am trying to scrape from this website. My objective is to collect the most recent 10 results (win/loss/draw) of ANY team; I am just using this specific team as an example. The source for an individual row is:
<tr class="odd match no-date-repetition" data-timestamp="1515864600" id="page_team_1_block_team_matches_3_match-2463021" data-competition="8">
<td class="day no-repetition">Sat</td>
<td class="full-date" nowrap="nowrap">13/01/18</td>
<td class="competition">PRL</td>
<td class="team team-a ">
<a href="/teams/england/tottenham-hotspur-football-club/675/" title="Tottenham Hotspur">
Tottenham Hotspur
</a>
</td>
<td class="score-time score">
<a href="/matches/2018/01/13/england/premier-league/tottenham-hotspur-football-club/everton-football-club/2463021/" class="result-win">
4 - 0
</a>
</td>
<td class="team team-b ">
<a href="/teams/england/everton-football-club/674/" title="Everton">
Everton
</a>
</td>
<td class="events-button button first-occur">
View events
</td>
<td class="info-button button">
More info
</td>
</tr>
You can see that the result is stored in the <td class="score-time score"> cell.
My knowledge of Python and web crawling is pretty limited, so my current code is:
import bs4
import requests

res2 = requests.get(soccerwayURL)  # soccerwayURL is the team's match page
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
elems2 = soup2.select('#page_team_1_block_team_matches_3_match-2463021 > td.score-time.score')
print(elems2[0].text.strip())
This prints out '4-0'. That is good, but the issue arises when I try to access a different row. The 7-digit number (2463021 in the example above) is unique to that row. That means that if I want to get the score from a different row, I would have to find that unique 7-digit number and place it in the CSS selector '#page_team_1_block_team_matches_3_match-******* > td.score-time.score' where the asterisks are the unique number.
The online course I took only showed how to reference things by the CSS Selector, so I am unsure how I can go about retrieving the scores without manually taking the CSS Selector for each row.
Within the <td class="score-time score"> element there is another class, class="result-win". Ideally I would like to pull just that "result-win", because I am not looking for the score of the game, only the outcome: win, loss, or draw.
I hope this is a little clearer. I greatly appreciate your patience with me.
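One way around the per-row IDs, as a sketch assuming soup2 from the code above: select every score cell by its classes rather than by the row ID, then read the result-* class off the <a> inside it. Only result-win appears in the snippet; result-loss and result-draw are assumed names for the other outcomes.
for cell in soup2.select('td.score-time.score a'):
    # the class list looks like ['result-win']; keep whichever result-* is present
    outcome = [c for c in cell.get('class', []) if c.startswith('result-')]
    if outcome:
        print(outcome[0])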
From this Deutsche Börse web page, under the table header Issuer, I want to get the string content 'db X-trackers' from the cell next to the one with Name in it.
Using my web browser, I inspected that table area and got the code, which I've pasted into this XML tree just so that I can test my XPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my XPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
    try:
        find_value = driver.find_element_by_xpath(my_xpath).text
    except NoSuchElementException:
        return None
    else:
        value = re.search(r"(.+)", find_value).group()
        return value
I don't believe anything is wrong in the function itself, so either the XPath must be faulty or there is something in the actual web page's source code that throws it off.
When studying the actual source code in Chrome, it looks a bit messier than what I see with the Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my XPath (or function) wrong?
I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element whose text is "Issuer", then get the table element that follows it. Within that table we look for the td element with the text "Name" and then take its following td sibling. Note that your predicates ['Issuer'] and ['Name'] are non-empty string literals, which are always true, so they match every h2 and every td rather than filtering by text.
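Wired into the Selenium code, a sketch (assuming the same driver as in the question). One extra pitfall: find_element_by_xpath can only return elements, not text nodes, so an expression ending in /text() fails under Selenium even when an XPath tester accepts it; stop at the td and read .text instead.
xpath = '//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td'
issuer = driver.find_element_by_xpath(xpath).text.strip()
print(issuer)  # 'db X-trackers'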
Similar to the .renderContents approach in Beautiful Soup [Python] and the extracting of text in a table, I want to search by that value.
Sample HTML:
<table>
<tr>
<td>
This is garbage
</td>
<td>
<td class="thead" style="font-weight:normal">
<!-- status icon and date -->
<a name="post1"><img class="inlineimg" src="img.gif" alt="Old" border="0" title="Old"></a>
19-11-2010, 04:25 PM
<!-- / status icon and date -->
</td>
<td>
This is garbage
</td>
</tr>
</table>
What I tried:
soup.find_all("td", text=re.compile('(AM|PM)'))[0].get_text().strip()
However, the text parameter of find_all does not seem to work for this application: the find_all returns an empty list, so indexing it fails with IndexError: list index out of range.
What do I need to do?
Don't specify the tag name at all and let it find the desired text node. Works for me:
soup.find(text=re.compile('(AM|PM)')).strip()
When you combine a tag name with text, BeautifulSoup compares the pattern against the tag's .string, and .string is None for a td that has child elements (the <a> and the comments), so your td never matches. Matching the bare text node sidesteps that.
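If you need the enclosing cell rather than the bare string, a sketch (assuming the snippet above is stored in html_doc):
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
date_node = soup.find(text=re.compile('(AM|PM)'))
print(date_node.strip())                     # '19-11-2010, 04:25 PM'
print(date_node.find_parent('td')['class'])  # ['thead']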