Finding specific classes in table columns (python, lxml)


I was thrown into Python and lxml ("change this existing program") and am trying to find my way by doing it.
So I am sorry if this is an easy or silly question, but I am a bit stuck.
The program splits a table into rows with
rows = page.cssselect("table-data.table-top tbody tr")
The individual columns are then addressed (inside: for row in rows) with
dns = row.cssselect(".column-number")
cds = row.cssselect(".column-documents")
However, the column "column-documents" can hold several entries (anything from empty, to one icon with a link, up to five icons with links and different meanings, each defined with its own class). I need to find out whether one specific entry (icon with link) is present.
It is marked with a specific class: class="document-link submission-link hide-text".
<tr class="row-0 tier1-5">
<td class="column-notext">4.</td>
<td class="column-label">Descriptive title</td>
<td class="column-number">007</td>
<td class="column-dokumente">
<a href="/somelink.pdf" target="_blank" title="title of pdf">
<span class="document-link submission-link hide-text">
<span>Main Document</span>
</span>
</a>
<a href="/somelink.pdf" title="title 2">
<span class="attachment-link submission-attachment-link hide-text">
<span>(text)</span>
</span>
</a>
<a href="/link.pdf" target="_blank" title="some title">
<span class="document-link beschluss-link hide-text">
<span>text</span>
</span>
</a>
<span class="document-spacer hide-text" />
<a href="html-link" title="some title">
<span class="vorgang-link hide-text">
<span>text</span>
</span>
</a>
</td>
</tr>
I just need to know if this is there or not.
And my silly question is: How do I do it?
Thanks in advance,
Andreas.
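
One way to check for the entry is to look at every span inside the row and test its class list. The sketch below is not taken from the original program: it uses the standard library's xml.etree.ElementTree (whose element API lxml mirrors) so that it runs without extra packages, the helper names has_class and has_submission_link are made up for illustration, and the markup is a trimmed, well-formed copy of the row above. With lxml and cssselect, row.cssselect("span.document-link.submission-link") should return a non-empty list in the same situation.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the row from the question (illustration only)
ROW = """
<tr class="row-0 tier1-5">
  <td class="column-number">007</td>
  <td class="column-dokumente">
    <a href="/somelink.pdf" target="_blank" title="title of pdf">
      <span class="document-link submission-link hide-text"><span>Main Document</span></span>
    </a>
    <a href="/link.pdf" target="_blank" title="some title">
      <span class="document-link beschluss-link hide-text"><span>text</span></span>
    </a>
  </td>
</tr>
"""


def has_class(element, wanted):
    """Return True if the element's class attribute contains every name in wanted."""
    classes = element.get("class", "").split()
    return all(name in classes for name in wanted)


def has_submission_link(row):
    """Return True if any span in the row carries both classes (hypothetical helper)."""
    return any(
        has_class(span, ("document-link", "submission-link"))
        for span in row.iter("span")
    )


row = ET.fromstring(ROW)
found = has_submission_link(row)
print(found)
```

Comparing whole class names instead of searching the raw attribute string avoids false hits on longer names such as submission-attachment-link.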

Related

Using Following-sibling in Xpath with Scrapy

I am trying to scrape the year from the html below (https://www.espncricinfo.com/series/indian-premier-league-2022-1298423/punjab-kings-vs-delhi-capitals-64th-match-1304110/full-scorecard). Due to the way the site is coded I have to first identify the table cell that contains the word "Season" then get the year (2022 in this example).
I thought this would get it but it doesn't. There are no errors, just no results. I've not used the following-sibling approach before so I'd be grateful if someone could point out where I've messed up.
l.add_xpath(
'Season',
"//td[contains(text(),'Season')]/following-sibling::td[1]/a/text()")
html:
<tr class="ds-border-b ds-border-line">
<td class="ds-min-w-max ds-border-r ds-border-line">
<span class="ds-text-tight-s ds-font-medium">Season</span>
</td>
<td class="ds-min-w-max">
<span class="ds-inline-flex ds-items-center ds-leading-none">
<a href="https://www.espncricinfo.com/ci/engine/series/index.html?season2022" class="ds-text-ui-typo ds-underline ds-underline-offset-4 ds-decoration-ui-stroke hover:ds-text-ui-typo-primary hover:ds-decoration-ui-stroke-primary ds-block">
<span class="ds-text-tight-s ds-font-medium">2022</span>
</a>
</span>
</td>
</tr>

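The text "Season" sits in a span nested inside the td, not directly in the td, and the year is likewise wrapped in span elements around and inside the a. Try: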
//span[contains(text(),"Season")]/../following-sibling::td/span/a/span/text()

How to get value from piece of html code with BeautifulSoup?

I just started using python for some web page scraping and BeautifulSoup seems to be recommended everywhere.
I have the content like below:
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">1</span>
<span class="game-result">0</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost"></i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">30 min</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">25</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">Aug 9, 2017</a>
</td>
</tr>
100 <tr></tr> here
</tbody>
</table>
My code stops here. How do I write the Python code to loop over all the <tr></tr> pairs and extract the classes of each <span> in each <td>?
Edit:
I think I didn't explain this clearly. What your code returns are the class names in that HTML, while what I am looking for are the corresponding values. For example, for the class username I want to get the values player1 and player2; for the class country-flag-small flag-70 I want to get tip=Indonesia.

This should do the trick:
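import requests
from bs4 import BeautifulSoup
res = requests.get('someLink')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(res.text, 'html.parser')
classes = []
# class_=True matches every tag that has a class attribute
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)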
I tested this using your html file and got the following results:
['table', 'with-row-highlight', 'table-archive', 'user-tagline', 'username', 'user-rating', 'country-flag-small', 'flag-113', 'user-tagline', 'username', 'user-rating', 'country-flag-small','flag-70', 'clickable-link', 'text-middle', 'pull-left', 'game-result', 'game-result', 'result', 'icon-square-minus', 'loss', 'text-center', 'clickable-link', 'text-right', 'clickable-link', 'text-middle', 'moves', 'text-right', 'miniboard', 'clickable-link', 'archive-date']
Note that you will have to pip3 install requests if you haven't already.
Also, if you want to test this with a file on your computer, you can do this:
from bs4 import BeautifulSoup
# The with block closes the file once it has been read
with open('/path/To/Your/HtmlFile.html', 'r') as file:
    lines = file.read()
soup = BeautifulSoup(lines, 'html.parser')
classes = []
for element in soup.find_all(class_=True):
    classes.extend(element["class"])
print(classes)
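
The edit to the question asks for the values rather than the class names. As a rough sketch of that step, the block below collects the text of every span with the class username and the tip attribute of every span whose class list contains country-flag-small. It is not the answer's code: it uses the standard library's html.parser so that it runs without installing anything, and the class name ValueCollector and the shortened markup are made up for illustration. With BeautifulSoup the same idea would be along the lines of span.get_text() for the text and span.get("tip") for the attribute.

```python
from html.parser import HTMLParser

# Shortened copy of the first table cell from the question (illustration only)
HTML = """
<table class="table with-row-highlight table-archive"><tbody><tr><td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td></tr></tbody></table>
"""


class ValueCollector(HTMLParser):
    """Collect username texts and country tips (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.usernames = []
        self.countries = []
        self._in_username = False

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "username" in classes:
            # The text that follows belongs to a username span
            self._in_username = True
        if "country-flag-small" in classes and attributes.get("tip"):
            self.countries.append(attributes["tip"])

    def handle_data(self, data):
        if self._in_username and data.strip():
            self.usernames.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_username = False


collector = ValueCollector()
collector.feed(HTML)
print(collector.usernames)
print(collector.countries)
```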

Selenium: The variables doesn't change even after import changes

I'm working on a web scraping project using the Selenium library, in which I need to extract some data from tables. As part of the project, I need to iterate over the table rows and extract the author condition for each article, but it only works for the first row. It seems the variable keeps the data of the first row and doesn't change on later iterations.
This is the relevant part of my code:
div_result = driver.find_element_by_class_name("result-body-paper")
papers = div_result.find_elements_by_tag_name("tr")
papers_information = []
for paper in papers:
    data = paper.find_elements_by_tag_name("td")
    result_title = data[1].text
    author = paper.find_element_by_xpath('//span[@data-paper-person="{id}"]'.format(id=person_id))
    try:
        first_author = author.find_element_by_tag_name("i").get_attribute("class")
    except:
        first_author = ""
    author_condition = "Helper"
    if first_author != "":
        # elif chain: "star" is also a substring of "asterisk"
        if "pencil" in first_author:
            author_condition = "First Writer"
        elif "asterisk" in first_author:
            author_condition = "Original Writer"
        elif "star" in first_author:
            author_condition = "Original Worker"
    papers_information.append([author_condition, result_title])
Contrary to what I expect, first_author and author are always the same as for the first row of the table. The other parts work correctly.
Is that a bug or something?
By the way, this is part of the HTML I'm trying to extract data from (just two of the table rows):
<tr class="zarEn selectable">
<td class="result row center" width="35">1</td>
<td class="result title ">Hepatic insulin resistance, metabolic syndrome and cardiovascular disease</td>
<td class="result author zarsmallEn" width="200">
<span data-paper-person="98155">
<a href="...">
<img src="..." class="person-avatar-mini">
<i class="fa fa-fw fa-pencil crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Clinical Biochemistry
</td>
<td class="result source_cs">
2.35
</td>
<td class="result published_year center">2009</td>
<td class="result citation center">217</td>
</tr>
<tr class="zarEn selectable">
<td class="result row center">2</td>
<td class="result title ">Molecular and cellular mechanisms linking inflammation to insulin resistance and β-cell dysfunction</td>
<td class="result author zarsmallEn">
<span data-paper-person="14144442">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-pencil lightgray absolute"></i>
</a>
</span>
<span data-paper-person="14137800">
<img src="...">
</span>
<span data-paper-person="98155">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-asterisk crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Translational Research
</td>
<td class="result source_cs">
4.26
</td>
<td class="result published_year center">2016</td>
<td class="result citation center">71</td>
</tr>
As you can see, the class names of the two <i> elements are different, but first_author gets the first one and never changes!
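
A likely cause is the XPath itself: in Selenium, an expression that starts with // is evaluated from the document root even when it is called on a row element, so every iteration finds the first matching span on the page. Prefixing it with a dot (.//span[...]) restricts the search to the current row. The sketch below shows the row-relative form on a trimmed, well-formed copy of the two rows above. It uses the standard library's xml.etree.ElementTree rather than Selenium so that it runs without a browser; person_id = "98155" and the helper author_condition are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the two rows from the question (illustration only)
TABLE = """
<table>
  <tr>
    <td>1</td>
    <td>Hepatic insulin resistance, metabolic syndrome and cardiovascular disease</td>
    <td>
      <span data-paper-person="98155"><a href="..."><i class="fa fa-fw fa-pencil crimson absolute"></i></a></span>
    </td>
  </tr>
  <tr>
    <td>2</td>
    <td>Molecular and cellular mechanisms linking inflammation to insulin resistance</td>
    <td>
      <span data-paper-person="14144442"><a href="..."><i class="fa fa-fw fa-pencil lightgray absolute"></i></a></span>
      <span data-paper-person="14137800"></span>
      <span data-paper-person="98155"><a href="..."><i class="fa fa-fw fa-asterisk crimson absolute"></i></a></span>
    </td>
  </tr>
</table>
"""

person_id = "98155"  # assumed value, taken from the sample rows


def author_condition(row, person_id):
    """Classify the person's role within one row (hypothetical helper)."""
    # The leading dot keeps the search inside this row only
    icon = row.find(".//span[@data-paper-person='{}']/a/i".format(person_id))
    icon_class = icon.get("class", "") if icon is not None else ""
    # Check "asterisk" before "star", because "star" is a substring of it
    if "pencil" in icon_class:
        return "First Writer"
    if "asterisk" in icon_class:
        return "Original Writer"
    if "star" in icon_class:
        return "Original Worker"
    return "Helper"


papers_information = []
for row in ET.fromstring(TABLE).iter("tr"):
    cells = row.findall("td")
    papers_information.append([author_condition(row, person_id), cells[1].text])

for entry in papers_information:
    print(entry)
```

In the Selenium code from the question, the corresponding change would be to start the expression passed to paper.find_element_by_xpath with .// instead of //.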

Robot Framework - get span element from table

I'm trying to write some test cases to automatically test my website, but I'm having trouble clicking on the checkbox which is located in the left column of every row. The user can click on any cell in the row, and the checkbox becomes checked or unchecked.
But I'm not able to simulate this click on a table cell. First I try to get a cell into a variable and then click on the cell using this variable, like this:
Page Should Contain Element    xpath=//div[contains(@id,'-tableCtrlCnt')]
${item1}    Get Table Cell    xpath=//div[contains(@id,'-tableCtrlCnt')]/table/tbody    1    2
Click Element    ${item1}
But I'm getting an error on the second line of code; I just cannot get the column.
The error/fail is:
Cell in table xpath=//div[contains(@id,'-tableCtrlCnt')]/table/tbody
in row #2 and column #2 could not be found.
And this is what part of my HTML looks like:
<div id="__table1-tableCtrlCnt" class="sapUiTableCtrlCnt" style="height: 160px;">
<table id="__table1-table" role="presentation" data-sap-ui-table-acc-covered="overlay,nodata" class="sapUiTableCtrl sapUiTableCtrlRowScroll sapUiTableCtrlScroll" style="min-width:648px">
<tbody>
<tr id="__table1-rows-row0" data-sap-ui="__table1-rows-row0" class="sapUiTableRowEven sapUiTableTr" data-sap-ui-rowindex="0" role="row" title="Click to select or press SHIFT and click to select a range" style="height: 32px;">
<td role="rowheader" aria-labelledby="__table1-ariarowheaderlabel" headers="__table1-colsel" aria-owns="__table1-rowsel0"></td>
<td id="__table1-rows-row0-col0" tabindex="-1" role="gridcell" headers="__table1_col0" aria-labelledby="__table1-0" style="text-align:left" class="sapUiTableTd sapUiTableTdFirst">
<div class="sapUiTableCell">
<span id="__text37-col0-row0" data-sap-ui="__text37-col0-row0" title="1010" class="sapMText sapMTextMaxWidth sapMTextNoWrap sapUiSelectable" style="text-align:left">1010
</span>
</div>
</td>
<td id="__table1-rows-row0-col1" tabindex="-1" role="gridcell" headers="__table1_col1" aria-labelledby="__table1-1" style="text-align:left" class="sapUiTableTd">
<div class="sapUiTableCell">
<span id="__text38-col1-row0" data-sap-ui="__text38-col1-row0" title="Company Code 1010" class="sapMText sapMTextMaxWidth sapMTextNoWrap sapUiSelectable" style="text-align:left">Company Code 1010
</span>
</div>
</td>
</tr>
...
</tbody>
</table>
</div>
Do you have any idea how to solve this issue with clicking into the table?

Check whether this helps you:
${item1}    Get Table Cell    xpath=//table[contains(@id,'__table1-table')]    1    2
OR
${item1} =    Get Text    //table[contains(@id,'__table1-table')]//tr[1]//td[2]//div/span

python-selenium returning element is not interactable error

I am using the Selenium Python bindings (with Chromedriver). I am getting the following error while trying to select and manipulate an element:
Message: invalid element state: Element is not currently interactable and may not be manipulated
I think the element is successfully selected with the syntax below, but I cannot manipulate it with, for example, clear() or send_keys("some value"). I would like to fill in the text area, but I cannot make it work. If you have experienced similar problems, please share your thoughts. Thank you.
UPDATE: I noticed that the HTML changes to style="display: none" as I type manually, which might be a reason for this error. I modified the code below. Can you please point out a solution?
driver.find_element(by='xpath', value="//table[@class='input table']//input[@id='gwt-debug-url-suggest-box']")
or
driver.find_element(by='xpath', value="//input[@id='gwt-debug-url-suggest-box']")
or
driver.find_element_by_id("gwt-uid-47")
or
driver.find_element(by='xpath', value="//div[contains(@class, 'sppb-b')][normalize-space()='www.example.com/page']")
Here is the html source code:
<div>
<div class="spH-c" id="gwt-uid-64"> Your landing page </div>
<div class="spH-f">
<table class="input-table" width="100%">
<tbody>
<tr>
<td class="spA-e">
<div class="sppb-a" id="gwt-uid-47">
<div class="sppb-b spA-b" aria-hidden="true" style="display: none;">www.example.com/page</div>
<input type="text" class="spC-a sppb-c" id="gwt-debug-url-suggest-box" aria-labelledby="gwt-uid-64 gwt-uid-47" dir="">
</div>
<div class="error" style="display:none" id="gwt-debug-invalid-url-error-message" role="alert"> Please enter a valid URL. </div>
</td>
<td class="spB-b">
<div class="spB-a" aria-hidden="true" style="display: none;"></div>
</td>
</tr>
</tbody>
</table>
</div>
</div>

Have you tried selecting it by id?
element = driver.find_element_by_id("gwt-debug-url-suggest-box")
element.send_keys("Your input")
This way you are selecting the input directly and sending the keys to that element rather than to the driver.
Anyway, a link to the page would help.
