I'm working on a webpage scraping project, using selenium library, in which I need to extract some data out of some tables. As a part of project, I need to iterate the table rows and extract the author of article condition, but it just works for the first row. It seems the variable saves the data of first row and doesn't change, even after each iterating.
This is mentioned part of my code:
div_result = driver.find_element_by_class_name("result-body-paper")
papers = div_result.find_elements_by_tag_name("tr")
papers_information = []
for paper in papers:
data = paper.find_elements_by_tag_name("td")
result_title = data[1].text
author = paper.find_element_by_xpath('//span[#data-paper-person="{id}"]'.format(id=person_id))
try:
first_author = author.find_element_by_tag_name("i").get_attribute("class")
except:
first_author = ""
author_condition = "Helper"
if first_author != "":
if "pencil" in first_author:
author_condition = "First Writer"
if "asterisk" in first_author:
author_condition = "Orginal Writer"
if "star" in first_author:
author_condition == "Orginal Worker"
papers_information.append([author_condition,result_title])
Unlike what I expect, every time first_author and author is the same as it was at the first row of table. However, other parts work correctly and operates properly.
Is that bug or something?
By the way, this is the part of html code I'm trying to extract data from (just consists two of table rows):
<tr class="zarEn selectable">
<td class="result row center" width="35">1</td>
<td class="result title ">Hepatic insulin resistance, metabolic syndrome and cardiovascular disease</td>
<td class="result author zarsmallEn" width="200">
<span data-paper-person="98155">
<a href="...">
<img src="..." class="person-avatar-mini">
<i class="fa fa-fw fa-pencil crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Clinical Biochemistry
</td>
<td class="result source_cs">
2.35
</td>
<td class="result published_year center">2009</td>
<td class="result citation center">217</td>
</tr>
<tr class="zarEn selectable">
<td class="result row center">2</td>
<td class="result title ">Molecular and cellular mechanisms linking inflammation to insulin resistance and β-cell dysfunction</td>
<td class="result author zarsmallEn">
<span data-paper-person="14144442">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-pencil lightgray absolute"></i>
</a>
</span>
<span data-paper-person="14137800">
<img src="...">
</span>
<span data-paper-person="98155">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-asterisk crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Translational Research
</td>
<td class="result source_cs">
4.26
</td>
<td class="result published_year center">2016</td>
<td class="result citation center">71</td>
</tr>
As you can see, class name of two "" is different, but first_author gets the first one and doesn't change anymore!
Related
I am trying to scrape the data in a bunch of rows. I am able to expand an individual row using the following:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="7858101"]'))).click()
The problem is each row has a different id. They have common class name so I have also tried:
WebDriverWait(driver, 60).until(EC.presence_of_elements_located((By.CLASS_NAME, 'course-row normal faculty-BU active'))).click()
I have attached a few rows below Any suggestions on how I can fix this
<tr id="7858101" class="course-row normal faculty-BU active" data-cid="7858101" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="ASY" data-course_section="1" data-days=" " data-class_time="" data-room1="ASYNC" data-room2="" data-location="ASYNC" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-down"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader" style="display: none;"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days"> </td>
<td class="time"> </td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">ASY</td>
<td class="data"><div style="" class="course-details-data">
<div class="description">
<h3>Introduction to Financial Accounting</h3>
<p class="page-intro">Fundamental concepts of financial accounting as related to the balance sheet, income statement and statement of cash flows. Understanding the accounting cycle and routine transactions. Integrates both theoretical and practical application of accounting concepts.</p>
<p><strong>Format:</strong> Lectures, discussion, 3 hours per week.</p>
<p><strong>Restrictions:</strong> open to BAcc majors.</p>
<p><strong>Exclusions:</strong> Completion of this course will replace previous assigned grade and credit obtained in ACTG 1P11, 1P91 and 2P51.</p>
<p><strong>Notes:</strong> Open to Bachelor of Accounting majors. </p>
</div>
<div class="vitals">
<ul>
<li><strong>Duration:</strong> Sep 08, 2021 to Dec 07, 2021</li>
<li>
<strong>Location:</strong> ASYNC </li>
<li><strong>Instructor:</strong> Zhang, Xia (Celine)</li>
<li><strong>Section:</strong> 1</li>
</ul>
</div>
<hr>
</div>
</td>
</tr>
<tr id="3724102" class="course-row normal faculty-BU active" data-cid="3724102" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="LEC" data-course_section="2" data-days=" M R " data-class_time="1100-1230" data-room1="GSB306" data-room2="" data-location="GSB306" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-right"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days">
<table class="coursecal">
<thead>
<tr>
<th class="">S</th>
<th class="active">M</th>
<th class="">T</th>
<th class="">W</th>
<th class="active">T</th>
<th class="">F</th>
<th class="">S</th>
</tr>
</thead>
<tbody>
<tr>
<td class="weekend "></td>
<td class="active"></td>
<td class=""></td>
<td class=""></td>
<td class="active"></td>
<td class=""></td>
<td class="weekend "></td>
</tr>
</tbody>
</table>
</td>
<td class="time">1100-1230</td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">LEC</td>
<td class="data"></td>
</tr>
Are almost there...
You can retrieve a list of all the relevant web elements with the use of driver.find_elements method and then to iterate over each element in the list clicking on it.
Since course-row normal faculty-BU active is actually several class names, not a single class name, you should use XPath or CSS Selector there.
Also it's recommended to use visibility_of_element_located expected condition here, not presence_of_elements_located since the former condition is fulfilled even when the web element is not finally rendered on the page while visibility_of_element_located expected condition waits for more mature state of the web element
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, '//tr[#class = "course-row normal faculty-BU active"]')))
time.sleep(0.4) #short delay added to make ALL the elements loaded
elements = driver.find_element(By.XPATH, '//tr[#class = "course-row normal faculty-BU active"]')
for element in elements:
element.click()
#scrape the data you need here etc
As the id attributes of the <tr> have dynamic value to identify all the <tr>s and click on each of them you need to induce WebDriverWait for the visibility_of_all_elements_located() and you need to construct a dynamic locator strategy as follows:
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "tr.course-row.normal.faculty-BU.active[data-faculty_desc='Goodman School of Business'] a[data-cc][data-cid]")))
for element in elements:
element.click()
Using XPATH:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tr[#class='course-row normal faculty-BU active' and #data-faculty_desc='Goodman School of Business']//a[#data-cc and #data-cid]")))
for element in elements:
element.click()
I am trying to get a list that matches India's districts to its district codes as they were during the 2011 population census. Below I will post a small subset of the outerHTML I copied from a government website. I am trying to loop over it and extract a string and an int from each little html box and store these ideally in a pandas dataframe on the same row. The HTML blocks look like this, I represent 2, there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around ** the values that I want to get from the text file. I was wonder how I could loop through this text to extract this data. I thought about start counting each time after I encounter and than extract the data of the 1st and 6st but I don't know how to code this. Hope anyone is willing to help out. Or maybe anyone who already has this list, would be great!
If you're able to get the text of the entire html table, you can use df = pd.read_html(html_text_string). 50% of the time, it works everytime!
pd.read_html <-- docs
I just started using python for some web page scraping and BeautifulSoup seems to be recommended everywhere.
I have the content like below:
<table class="table with-row-highlight table-archive">
<tbody>
<tr>
<td>
<div class="user-tagline ">
<span class="username " data-avatar="aaaaaaa">player1</span>
<span class="user-rating">(1357)</span>
<span class="country-flag-small flag-113" tip="Portugal"></span>
</div>
<div class="user-tagline ">
<span class="username " data-avatar="bbbbbbb">player2</span>
<span class="user-rating">(1387)</span>
<span class="country-flag-small flag-70" tip="Indonesia"></span>
</div>
</td>
<td>
<a class="clickable-link text-middle" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">
<div class="pull-left">
<span class="game-result">1</span>
<span class="game-result">0</span>
</div>
<div class="result">
<i class="icon-square-minus loss" tip="Lost"></i>
</div>
</a>
</td>
<td class="text-center">
<a class="clickable-link" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">30 min</a>
</td>
<td class="text-right">
<a class="clickable-link text-middle moves" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">25</a>
</td>
<td class="text-right miniboard">
<a class="clickable-link archive-date" href="https://www.chess.com/live/game/2249663029?username=belemnarmada" target="_self">Aug 9, 2017</a>
</td>
</tr>
100 <tr></tr> here
</tbody>
</table>
My code stops here, how do I write the python code to loop all the <tr></tr> pair and extract all the class for each <span> pair in each <td> pair?
edit
I think maybe I didn't explain clearly here, what your code returns are the name of class in that HTML while what I am looking for are the correspondent values, e.g. there is a class username, I want to get its value of player1 and player2; there is a class country-flag-small flag-70 I want to get tip=Indonesia
This should do the trick:
import requests
from bs4 import BeautifulSoup
res = requests.get('someLink')
soup = BeautifulSoup(res.text)
classes = []
for element in soup.find_all(class_=True):
classes.extend(element["class"])
print(classes)
I tested this using your html file and got the following results:
['table', 'with-row-highlight', 'table-archive', 'user-tagline', 'username', 'user-rating', 'country-flag-small', 'flag-113', 'user-tagline', 'username', 'user-rating', 'country-flag-small','flag-70', 'clickable-link', 'text-middle', 'pull-left', 'game-result', 'game-result', 'result', 'icon-square-minus', 'loss', 'text-center', 'clickable-link', 'text-right', 'clickable-link', 'text-middle', 'moves', 'text-right', 'miniboard', 'clickable-link', 'archive-date']
Do note that you will have to pip3 install requests if you haven't already
Also, if you want to test this with a file on your computer, you can do this:
from bs4 import BeautifulSoup
file = open('/path/To/Your/HtmlFile.html', 'r')
lines = file.read()
soup = BeautifulSoup(lines)
classes = []
for element in soup.find_all(class_=True):
classes.extend(element["class"])
print(classes)
I was thrown ("change this existing program") into python and lxml and try to find my way by doing it.
So I am sorry for asking maybe an easy or silly question ... but I am a bit stuck.
The program is cracking a table into the rows by
rows=page.cssselect("table-data.table-top tbody tr")
The various columns are addressed (after: for row in rows) by
dns = row.cssselect(".column-number")
cds = row.cssselect(".column-documents")
However in the column "column-documents" there are several (maybe 0, maybe 5) entries (empty, 1 icon with link, up to 5 icons with links and different meanings, each defined with it's own class). And I need to find out, if a specific entry (icon with link) is given there.
It is described as a specific class "class="document-link submission-link hide-text".
<tr class="row-0 tier1-5">
<td class="column-notext">4.</td>
<td class="column-label">Descriptive title</td>
<td class="column-number">007</td>
<td class="column-dokumente">
<a href="/somelink.pdf" target="_blank" title="title of pdf">
<span class="document-link submission-link hide-text">
<span>Main Document</span>
</span>
</a>
<a href="/somelink.pdf) title 2">
<span class="attachment-link submission-attachment-link hide-text">
<span>(text)</span>
</span>
</a>
<a href="/link.pdf" target="_blank" title="some title">
<span class="document-link beschluss-link hide-text">
<span>text</span>
</span>
</a>
<span class="document-spacer hide-text" />
<a href="html-link" title="some title">
<span class="vorgang-link hide-text">
<span>text</span>
</span>
</a>
</td>
</tr>
I just need to know if this is there or not.
And my silly question is: How do I do it?
Thanks in advance,
Andreas.
From this Deutsche Börse web page, under the table header Issuer I want to get the string content 'db X-trackers' in the cell next to the one with Name in it.
Using my web browser, I inspect that table area and get the code, which I've pasted into this XML tree just so that I can test my xPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my xPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[#class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
try:
find_value = driver.find_element_by_xpath(my_xpath).text
except NoSuchElementException:
return None
else:
value = re.search(r"(.+)", find_value).group()
return value
I don't believe anything is wrong in the function itself, so either the xPath must be faulty or there is something in the actual web page source code that throws it off.
When studying the actual Source code in Chrome, it looks a bit messier than what I see with Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my xPath (or function) wrong?
I would use the following and following-sibling axis:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element, then get the following table element. In the table element we look for the td element with Name text and then get the following td sibling.