I am trying to scrape the data in a bunch of rows. I am able to expand an individual row using the following:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="7858101"]'))).click()
The problem is that each row has a different id. The rows share the same class names, so I have also tried:
WebDriverWait(driver, 60).until(EC.presence_of_elements_located((By.CLASS_NAME, 'course-row normal faculty-BU active'))).click()
I have attached a few rows below. Any suggestions on how I can fix this?
<tr id="7858101" class="course-row normal faculty-BU active" data-cid="7858101" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="ASY" data-course_section="1" data-days=" " data-class_time="" data-room1="ASYNC" data-room2="" data-location="ASYNC" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-down"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader" style="display: none;"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days"> </td>
<td class="time"> </td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">ASY</td>
<td class="data"><div style="" class="course-details-data">
<div class="description">
<h3>Introduction to Financial Accounting</h3>
<p class="page-intro">Fundamental concepts of financial accounting as related to the balance sheet, income statement and statement of cash flows. Understanding the accounting cycle and routine transactions. Integrates both theoretical and practical application of accounting concepts.</p>
<p><strong>Format:</strong> Lectures, discussion, 3 hours per week.</p>
<p><strong>Restrictions:</strong> open to BAcc majors.</p>
<p><strong>Exclusions:</strong> Completion of this course will replace previous assigned grade and credit obtained in ACTG 1P11, 1P91 and 2P51.</p>
<p><strong>Notes:</strong> Open to Bachelor of Accounting majors. </p>
</div>
<div class="vitals">
<ul>
<li><strong>Duration:</strong> Sep 08, 2021 to Dec 07, 2021</li>
<li>
<strong>Location:</strong> ASYNC </li>
<li><strong>Instructor:</strong> Zhang, Xia (Celine)</li>
<li><strong>Section:</strong> 1</li>
</ul>
</div>
<hr>
</div>
</td>
</tr>
<tr id="3724102" class="course-row normal faculty-BU active" data-cid="3724102" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="LEC" data-course_section="2" data-days=" M R " data-class_time="1100-1230" data-room1="GSB306" data-room2="" data-location="GSB306" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-right"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days">
<table class="coursecal">
<thead>
<tr>
<th class="">S</th>
<th class="active">M</th>
<th class="">T</th>
<th class="">W</th>
<th class="active">T</th>
<th class="">F</th>
<th class="">S</th>
</tr>
</thead>
<tbody>
<tr>
<td class="weekend "></td>
<td class="active"></td>
<td class=""></td>
<td class=""></td>
<td class="active"></td>
<td class=""></td>
<td class="weekend "></td>
</tr>
</tbody>
</table>
</td>
<td class="time">1100-1230</td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">LEC</td>
<td class="data"></td>
</tr>
You are almost there...
You can retrieve a list of all the relevant web elements with the driver.find_elements method and then iterate over the list, clicking on each element.
Since course-row normal faculty-BU active is actually several class names, not a single class name, By.CLASS_NAME will not match it; use an XPath or CSS selector instead.
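As a small illustration of the point about multiple class names, here is a sketch of turning the space-separated class attribute into a CSS selector. The helper name classes_to_css is made up for this example and is not part of Selenium.

```python
def classes_to_css(tag, class_attr):
    """Build a CSS selector for `tag` elements carrying every class in `class_attr`.

    Hypothetical helper for illustration only; it is not a Selenium API.
    """
    # "a b c" -> ".a.b.c": each class becomes its own ".name" token
    return tag + "".join("." + name for name in class_attr.split())


selector = classes_to_css("tr", "course-row normal faculty-BU active")
print(selector)
```

The resulting string can be passed to driver.find_elements(By.CSS_SELECTOR, selector). Unlike an exact @class comparison in XPath, it should still match if the page lists the classes in a different order.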
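Also, it is recommended to use the visibility_of_element_located expected condition here rather than a presence-based one such as presence_of_element_located (presence_of_elements_located, as written in the question, is not a Selenium expected condition; the plural form is presence_of_all_elements_located). A presence condition is fulfilled as soon as the element exists in the DOM, even if it is not yet rendered on the page, while visibility_of_element_located waits for a more mature state of the web element.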
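import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

row_xpath = '//tr[@class = "course-row normal faculty-BU active"]'
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, row_xpath)))
time.sleep(0.4)  # short delay so that ALL the rows have time to load
elements = driver.find_elements(By.XPATH, row_xpath)  # find_elements (plural) returns a list
for element in elements:
    element.click()
    # scrape the data you need here etc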
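One more observation: in the rows posted in the question, a lot of the information (course code, instructor, class time, room) already sits in the data-* attributes of each <tr>, so for those fields clicking may not be needed at all. Below is a minimal offline sketch that reads them with the standard library's html.parser. The page_source string is a shortened stand-in for what driver.page_source would return, and the CourseRowParser name is made up for this example. The expanded description text still requires the click.

```python
from html.parser import HTMLParser


class CourseRowParser(HTMLParser):
    """Collect the data-* attributes of every <tr> that has the course-row class."""

    def __init__(self):
        super().__init__()
        self.courses = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "tr" and "course-row" in classes:
            # Keep only data-* attributes and drop the "data-" prefix from the keys
            self.courses.append(
                {key[len("data-"):]: value
                 for key, value in attrs.items()
                 if key.startswith("data-")}
            )


# Shortened stand-in for driver.page_source (assumption for this sketch)
page_source = """
<table>
<tr id="7858101" class="course-row normal faculty-BU active" data-cid="7858101"
    data-cc="ACTG1P01" data-class_type="ASY" data-class_time=""
    data-instructor="Zhang, Xia (Celine)"><td class="type">ASY</td></tr>
<tr id="3724102" class="course-row normal faculty-BU active" data-cid="3724102"
    data-cc="ACTG1P01" data-class_type="LEC" data-class_time="1100-1230"
    data-instructor="Zhang, Xia (Celine)"><td class="type">LEC</td></tr>
</table>
"""

parser = CourseRowParser()
parser.feed(page_source)
parser.close()
courses = parser.courses

for course in courses:
    print(course["cid"], course["cc"], course["class_type"], course["class_time"])
```

With a live driver, the same values should also be readable per row through element.get_attribute("data-cc") and similar calls.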
As the id attributes of the <tr> elements have dynamic values, to identify all the <tr>s and click on each of them you need to induce WebDriverWait for visibility_of_all_elements_located() and construct a locator that does not depend on the id. In the HTML posted, the data-cc and data-cid attributes are on the <tr> itself, so the locators below test them there:
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "tr.course-row.normal.faculty-BU.active[data-faculty_desc='Goodman School of Business'][data-cc][data-cid]")))
for element in elements:
    element.click()
Using XPATH:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tr[@class='course-row normal faculty-BU active' and @data-faculty_desc='Goodman School of Business'][@data-cc and @data-cid]")))
for element in elements:
    element.click()
Related
Consider:
<tr id="pair_12">
<td class="left first">
<span class="ceFlags USD"> </span>
USD
</td>
<td class="" id="last_12_12">1</td>
<td class="pid-2124-last" id="last_12_17">0,8979</td>
<td class="pid-2126-last" id="last_12_3">0,7695</td>
<td class="pid-3-last" id="last_12_2">109,94</td>
<td class="pid-4-last" id="last_12_4">0,9708</td>
<td class="pid-7-last" id="last_12_15">1,3060</td>
<td class="pid-2091-last greenBg" id="last_12_1">1,4481</td>
<td class="pid-18-last greenBg" id="last_12_9">5,8637</td>
</tr>
I want to access, for example, the "5,8637" value, which refreshes every second or so. The website is linked here in case it helps you to help me better.
driver = Chrome(webdriver)
driver.get("https://tr.investing.com/currencies/exchange-rates-table")
eur_usd = driver.find_element_by_id("last_17_12").text
worked for me!
Use:
By id = By.id("ANY_ID");
Then call getText() on the element located with it, for example driver.findElement(id).getText(), in Selenium WebDriver.
I am trying to get a list that matches India's districts to their district codes as they were during the 2011 population census. Below I post a small subset of the outerHTML I copied from a government website. I am trying to loop over it, extract a string and an int from each little HTML block, and store these, ideally, in a pandas DataFrame on the same row. The HTML blocks look like this; I show two here, and there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around the values that I want to get from the text file. I was wondering how I could loop through this text to extract this data. I thought about starting a count each time I encounter a <td align="left"> within a row and then extracting the data of the 1st and 6th one, but I don't know how to code this. I hope someone is willing to help out. Or if anyone already has this list, that would be great!
If you're able to get the text of the entire HTML table, you can use dfs = pd.read_html(html_text_string), which returns a list of DataFrames, one per <table> found; the one you want is usually dfs[0]. 50% of the time, it works every time!
pd.read_html <-- docs
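If pd.read_html does not cooperate (for example because the text file contains only <tr> blocks without the surrounding <table>), here is a minimal sketch of the counting idea from the question, using only the standard library's html.parser instead of pandas. The DistrictRowParser and extract_districts names are made up for this example, and it assumes the district name is always the 3rd cell and the census code the 8th cell of each row, as in the two rows shown.

```python
from html.parser import HTMLParser


class DistrictRowParser(HTMLParser):
    """Collect the stripped text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_districts(html_text):
    """Return (district name, census code) pairs; rows without a numeric code are skipped."""
    parser = DistrictRowParser()
    parser.feed(html_text)
    parser.close()
    # 0-based index 2 is the 3rd cell (name), index 7 is the 8th cell (code)
    return [(row[2], int(row[7]))
            for row in parser.rows
            if len(row) > 7 and row[7].isdigit()]


# Two rows in the shape posted in the question, without the ** markers
sample = """
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">NICOBARS</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">638</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">NORTH AND MIDDLE ANDAMAN</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">639</td>
<td align="left">
Not Covered
</td>
</tr>
"""

districts = extract_districts(sample)
print(districts)
```

The resulting list of tuples can be handed to pandas.DataFrame(districts, columns=["district", "code"]) to get one district per row.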
I'm new to coding in general. To be brief I am using the soup.findAll('table') function and it brings back all the tables on the web page. When I search soup.findAll('table', class_='playerTable rtable') it brings back []. I know that that is the correct class name as I copied it from the HTML. Do you guys know why this might be happening? What am I missing here?
url I'm attempting to scrape from http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/
The reason you guys don't see the same table as me is that you need to be signed in to a paid account to access the information. My question still stands: why might this be happening, when I know there is a table with the class I am searching for? Thanks so much for the help guys!
I don't see any class named "playerTable rtable" for the link you provided. Maybe you can try this and let me know if it was what you needed. Happy to delete/change my answer if it doesn't work out for you:
>>> r = BeautifulSoup(requests.get("http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/").content, "lxml")
>>> r.findAll("table", attrs = {"class":"playerTable"})
[<table class="playerTable">
<tbody>
<tr>
<td class="contract-type">
<div>
<h2>
<span class="contract-type-logo"><img alt="Team contract signed with" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></span>
<span class="contract-type-years">2016-2016 <small>Dead Money</small></span>
</h2>
</div>
</td>
</tr>
</tbody>
</table>, <table class="playerTable">
<tbody>
<tr>
<td style="padding-right:5px;">
<table class="salaryTable rtable current">
<thead>
<tr class="salaryRow">
<th class="header center">Year</th>
<th class="header center"> </th>
<th class="header salaryAmt center "><span>Base Salary</span></th>
<th class="header salaryAmt center"><span title="">Signing Bonus</span></th> <th class="header salaryAmt center"><span>Workout Bonus</span></th> <th class="header salaryAmt center"><span title="">Restruc. Bonus</span></th> <th class="header salaryAmt center"><span>Dead Cap Hit</span></th>
</tr>
</thead>
<tbody>
<tr class="salaryRow">
<td class="salaryYear center">2016</td>
<td class="salaryYear center"><img alt="Player contract details by year" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></td>
<td class="salaryAmt ">-</td>
<td class="salaryAmt ">-</td> <td class="salaryAmt ">-</td> <td class="salaryAmt ">$2,500,000</td> <td class="salaryAmt ">$2,500,000</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>]
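The reason is that no table on this page has both classes playerTable and rtable. soup.findAll('table', class_='playerTable rtable') matches only tables whose class attribute is exactly the string "playerTable rtable", i.e. both classes in that order, hence the empty list. To match tables carrying both classes in any order, a CSS selector such as soup.select('table.playerTable.rtable') can be used.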
EDIT: Finally the main reason for this behaviour was because of the unauthenticated request used to fetch html. Therefore no table containing the specified classes existed.
I'm using Ghost for Python 2.7 and I'm trying to click on a link which is in a table. The problem is that the link has no ID, name... This is the HTML code:
<table id="table_webbookmarkline_2" cellpadding="4" cellspacing="0" border="0" width="100%">
<tr valign="top">
<td>
<a href="/dana/home/launch.cgi?url=.ahuvs%3A%2F%2Fhq0l5458452ERA-w-Xz8G3LKe8JNM%2F.ISDXWXaWXUivecOc" target="_blank" onClick='javascript:openBookmark(
this.href, "yes", "yes");
return false;' ><img src="/dana-cached/imgs/icn18x18WebBookmarkPop.gif" alt="This will open in a new TAB" width="18" height="18" border="0" ></a>
</td>
<td width="100%" align="left">
<a href="/dana/home/launch.cgi?url=.ahuvs%3A%2F%2Fhq0l5458452ERA-w-Xz8G3LKe8JNM%2F.ISDXWXaWXUivecOc" target="_blank" onClick='JavaScript:openBookmark(
this.href, "yes", "yes");
return false;' ><b>**LINK WHERE I WANT TO CLICK**</b> </a><br><span class="cssSmall"></span>
</td>
</tr>
</table>
How can I click on this kind of link?
Seems like Ghost's Session.click() takes a CSS selector. Here only the table has an ID, so a selector that takes the second td that is a descendant of that ID and finds the a element should work:
session.click('#table_webbookmarkline_2 td:nth-child(2) a')
From this Deutsche Börse web page, under the table header Issuer I want to get the string content 'db X-trackers' in the cell next to the one with Name in it.
Using my web browser, I inspect that table area and get the code, which I've pasted into this XML tree just so that I can test my xPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my xPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
try:
find_value = driver.find_element_by_xpath(my_xpath).text
except NoSuchElementException:
return None
else:
value = re.search(r"(.+)", find_value).group()
return value
I don't believe anything is wrong in the function itself, so either the xPath must be faulty or there is something in the actual web page source code that throws it off.
When studying the actual Source code in Chrome, it looks a bit messier than what I see with Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my xPath (or function) wrong?
I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
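First we locate the h2 element, then get the following table element. In the table element we look for the td element with Name text and then get the following td sibling. Note that a predicate such as ['Issuer'] or ['Name'] in your expression is just a non-empty string, which always evaluates to true, so it does not filter anything; compare the text with . = "Issuer" instead.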
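One thing that may matter on the real page: in the page source posted in the question, the cell text is surrounded by line breaks, and . = "Name" is an exact string comparison. If the lookup still comes back empty, normalize-space(.) = "Name" is worth trying in the XPath. The sketch below shows the same idea offline with the standard library's xml.etree.ElementTree rather than Selenium. The cell_after helper is made up for this example, the snippet is a shortened copy of the posted source, and the h2 anchor is skipped for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the page source from the question, whitespace included
snippet = """<div class="box">
  <div class="row"><div class="col-lg-12"><h2>Issuer</h2></div></div>
  <div class="table-responsive">
    <table class="table"><tbody>
      <tr><td >
        Name
      </td><td class="text-right" >
        db X-trackers
      </td></tr>
      <tr><td >
        Homepage
      </td><td class="text-right" >
        <a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
      </td></tr>
    </tbody></table>
  </div>
</div>"""


def cell_after(root, label):
    """Return the stripped text of the <td> that follows the <td> whose text is `label`."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        for left, right in zip(cells, cells[1:]):
            # strip() plays the role that normalize-space() plays in XPath here
            if "".join(left.itertext()).strip() == label:
                return "".join(right.itertext()).strip()
    return None


root = ET.fromstring(snippet)
issuer_name = cell_after(root, "Name")
print(issuer_name)
```

Comparing the raw, unstripped text of the first cell against "Name" would fail on this snippet, which is the kind of mismatch the whitespace in the real source can cause.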