python- is beautifulsoup misreporting my html? - python

I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1.
I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using:
from BeautifulSoup import BeautifulSoup
import urllib2
base_url = "http://www.utahcritseries.com/RawResults.aspx"
data=urllib2.urlopen(base_url)
soup=BeautifulSoup(data)
i = 0
table=soup.find("table",id='ctl00_ContentPlaceHolder1_gridEvents')
#table=soup.table
print "begin table"
for row in table.findAll('tr')[1:10]:
i=i + 1
col = row.findAll('td')
date = col[0].string
event = col[1].a.string
confirmed = col[2].string
print '%s - %s' % (date, event)
print "end table"
print "%s rows processed" % i
On my windows machine,I get the correct result, which is a list of dates and event names. On my mac, I don't. instead, I get
3/2/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
3/23/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
4/2/2002 - Rocky Mtn Raceway Criterium
None - Saltair Time Trial
4/9/2002 - Rocky Mtn Raceway Criterium
None - DMV Criterium
4/16/2002 - Rocky Mtn Raceway Criterium
What I'm noticing is that when I
print row
on my windows machine, the tr data looks exactly the same as the source html. Note the style tag on the second table row. Here's the first two rows:
<tr>
<td>
3/2/2002
</td>
<td>
<a href="Event.aspx?id=226">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=226">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
<tr style="color:#333333;background-color:#EFEFEF;">
<td>
3/16/2002
</td>
<td>
<a href="Event.aspx?id=227">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=227">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
On my mac when I print the first two rows, the style information is removed from the tr tag and it's moved into each td field. I don't understand why this is happening. I'm getting None for every other date value, because BeautifulSoup is putting a font tag around every other date. Here's the mac's output:
<tr>
<td>
3/2/2002
</td>
<td>
<a href="Event.aspx?id=226">
Rocky Mtn Raceway Criterium
</a>
</td>
<td>
Confirmed
</td>
<td>
<a href="Event.aspx?id=226">
Points
</a>
</td>
<td>
<a disabled="disabled">
Results
</a>
</td>
</tr>
<tr bgcolor="#EFEFEF">
<td>
<font color="#333333">
3/16/2002
</font>
</td>
<td>
<font color="#333333">
<a href="Event.aspx?id=227">
Rocky Mtn Raceway Criterium
</a>
</font>
</td>
<td>
<font color="#333333">
Confirmed
</font>
</td>
<td>
<font color="#333333">
<a href="Event.aspx?id=227">
Points
</a>
</font>
</td>
<td>
<font color="#333333">
<a disabled="disabled">
Results
</a>
</font>
</td>
</tr>
My script is displaying the correct result under windows-what do I need to do in order to get my Mac to work correctly?

There are documented problems with version 3.1 of BeautifulSoup.
You might want to double check that is the version you in fact are using, and if so downgrade.

I suspect the problem is in the urlib2 request, not BeautifulSoup:
It might help if you show us the same section of the raw data as returned by this command on both machines:
urllib2.urlopen(base_url)
This page looks like it might help:
http://bytes.com/groups/python/635923-building-browser-like-get-request
The simplest solution is probably just to detect which environment the script is running in and change the parsing logic accordingly.
>>> import os
>>> os.uname()
('Darwin', 'skom.local', '9.6.0', 'Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386', 'i386')
Or get microsoft to use web standards :)
Also, didn't you use mechanize to fetch the pages? If so, the problem may be there.

Related

Selenium/python click to open multiple rows

I am trying to scrape the data in a bunch of rows. I am able to expand an individual row using the following:
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="7858101"]'))).click()
The problem is each row has a different id. They have common class name so I have also tried:
WebDriverWait(driver, 60).until(EC.presence_of_elements_located((By.CLASS_NAME, 'course-row normal faculty-BU active'))).click()
I have attached a few rows below Any suggestions on how I can fix this
<tr id="7858101" class="course-row normal faculty-BU active" data-cid="7858101" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="ASY" data-course_section="1" data-days=" " data-class_time="" data-room1="ASYNC" data-room2="" data-location="ASYNC" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-down"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader" style="display: none;"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days"> </td>
<td class="time"> </td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">ASY</td>
<td class="data"><div style="" class="course-details-data">
<div class="description">
<h3>Introduction to Financial Accounting</h3>
<p class="page-intro">Fundamental concepts of financial accounting as related to the balance sheet, income statement and statement of cash flows. Understanding the accounting cycle and routine transactions. Integrates both theoretical and practical application of accounting concepts.</p>
<p><strong>Format:</strong> Lectures, discussion, 3 hours per week.</p>
<p><strong>Restrictions:</strong> open to BAcc majors.</p>
<p><strong>Exclusions:</strong> Completion of this course will replace previous assigned grade and credit obtained in ACTG 1P11, 1P91 and 2P51.</p>
<p><strong>Notes:</strong> Open to Bachelor of Accounting majors. </p>
</div>
<div class="vitals">
<ul>
<li><strong>Duration:</strong> Sep 08, 2021 to Dec 07, 2021</li>
<li>
<strong>Location:</strong> ASYNC </li>
<li><strong>Instructor:</strong> Zhang, Xia (Celine)</li>
<li><strong>Section:</strong> 1</li>
</ul>
</div>
<hr>
</div>
</td>
</tr>
<tr id="3724102" class="course-row normal faculty-BU active" data-cid="3724102" data-cc="ACTG1P01" data-year="2021" data-session="FW" data-type="UG" data-subtype="UG" data-level="Year1" data-fn2_notes="BB" data-duration="2" data-class_type="LEC" data-course_section="2" data-days=" M R " data-class_time="1100-1230" data-room1="GSB306" data-room2="" data-location="GSB306" data-location_desc="" data-instructor="Zhang, Xia (Celine)" data-msg="0" data-main_flag="1" data-secondary_type="E" data-startdate="1631073600" data-enddate="1638853200" data-faculty_code="BU" data-faculty_desc="Goodman School of Business">
<td class="arrow"><span class="fa fa-angle-right"></span></td>
<td class="course-code">ACTG 1P01 </td>
<td class="title">Introduction to Financial Accounting <div class="details-loader"><span class="fa fa-refresh fa-spin fa-fw"></span></div></td>
<td class="duration">D2</td>
<td class="days">
<table class="coursecal">
<thead>
<tr>
<th class="">S</th>
<th class="active">M</th>
<th class="">T</th>
<th class="">W</th>
<th class="active">T</th>
<th class="">F</th>
<th class="">S</th>
</tr>
</thead>
<tbody>
<tr>
<td class="weekend "></td>
<td class="active"></td>
<td class=""></td>
<td class=""></td>
<td class="active"></td>
<td class=""></td>
<td class="weekend "></td>
</tr>
</tbody>
</table>
</td>
<td class="time">1100-1230</td>
<!-- <td class="start" data-sort-value="1631073600">Sep 08, 2021</td> -->
<!-- <td class="end" data-sort-value="1638853200">Dec 07, 2021</td> -->
<td class="type">LEC</td>
<td class="data"></td>
</tr>
Are almost there...
You can retrieve a list of all the relevant web elements with the use of driver.find_elements method and then to iterate over each element in the list clicking on it.
Since course-row normal faculty-BU active is actually several class names, not a single class name, you should use XPath or CSS Selector there.
Also it's recommended to use visibility_of_element_located expected condition here, not presence_of_elements_located since the former condition is fulfilled even when the web element is not finally rendered on the page while visibility_of_element_located expected condition waits for more mature state of the web element
WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, '//tr[#class = "course-row normal faculty-BU active"]')))
time.sleep(0.4) #short delay added to make ALL the elements loaded
elements = driver.find_element(By.XPATH, '//tr[#class = "course-row normal faculty-BU active"]')
for element in elements:
element.click()
#scrape the data you need here etc
As the id attributes of the <tr> have dynamic value to identify all the <tr>s and click on each of them you need to induce WebDriverWait for the visibility_of_all_elements_located() and you need to construct a dynamic locator strategy as follows:
Using CSS_SELECTOR:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "tr.course-row.normal.faculty-BU.active[data-faculty_desc='Goodman School of Business'] a[data-cc][data-cid]")))
for element in elements:
element.click()
Using XPATH:
elements = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, "//tr[#class='course-row normal faculty-BU active' and #data-faculty_desc='Goodman School of Business']//a[#data-cc and #data-cid]")))
for element in elements:
element.click()

How do I loop over this outerHTML code to get out certain data? (I don't know how to webscrape it so I want to try this)

I am trying to get a list that matches India's districts to its district codes as they were during the 2011 population census. Below I will post a small subset of the outerHTML I copied from a government website. I am trying to loop over it and extract a string and an int from each little html box and store these ideally in a pandas dataframe on the same row. The HTML blocks look like this, I represent 2, there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around ** the values that I want to get from the text file. I was wonder how I could loop through this text to extract this data. I thought about start counting each time after I encounter and than extract the data of the 1st and 6st but I don't know how to code this. Hope anyone is willing to help out. Or maybe anyone who already has this list, would be great!
If you're able to get the text of the entire html table, you can use df = pd.read_html(html_text_string). 50% of the time, it works everytime!
pd.read_html <-- docs

Selenium: The variables doesn't change even after import changes

I'm working on a webpage scraping project, using selenium library, in which I need to extract some data out of some tables. As a part of project, I need to iterate the table rows and extract the author of article condition, but it just works for the first row. It seems the variable saves the data of first row and doesn't change, even after each iterating.
This is mentioned part of my code:
div_result = driver.find_element_by_class_name("result-body-paper")
papers = div_result.find_elements_by_tag_name("tr")
papers_information = []
for paper in papers:
data = paper.find_elements_by_tag_name("td")
result_title = data[1].text
author = paper.find_element_by_xpath('//span[#data-paper-person="{id}"]'.format(id=person_id))
try:
first_author = author.find_element_by_tag_name("i").get_attribute("class")
except:
first_author = ""
author_condition = "Helper"
if first_author != "":
if "pencil" in first_author:
author_condition = "First Writer"
if "asterisk" in first_author:
author_condition = "Orginal Writer"
if "star" in first_author:
author_condition == "Orginal Worker"
papers_information.append([author_condition,result_title])
Unlike what I expect, every time first_author and author is the same as it was at the first row of table. However, other parts work correctly and operates properly.
Is that bug or something?
By the way, this is the part of html code I'm trying to extract data from (just consists two of table rows):
<tr class="zarEn selectable">
<td class="result row center" width="35">1</td>
<td class="result title ">Hepatic insulin resistance, metabolic syndrome and cardiovascular disease</td>
<td class="result author zarsmallEn" width="200">
<span data-paper-person="98155">
<a href="...">
<img src="..." class="person-avatar-mini">
<i class="fa fa-fw fa-pencil crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Clinical Biochemistry
</td>
<td class="result source_cs">
2.35
</td>
<td class="result published_year center">2009</td>
<td class="result citation center">217</td>
</tr>
<tr class="zarEn selectable">
<td class="result row center">2</td>
<td class="result title ">Molecular and cellular mechanisms linking inflammation to insulin resistance and β-cell dysfunction</td>
<td class="result author zarsmallEn">
<span data-paper-person="14144442">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-pencil lightgray absolute"></i>
</a>
</span>
<span data-paper-person="14137800">
<img src="...">
</span>
<span data-paper-person="98155">
<a href="...">
<img src="...">
<i class="fa fa-fw fa-asterisk crimson absolute"></i>
</a>
</span>
</td>
<td class="result source_title ">
Translational Research
</td>
<td class="result source_cs">
4.26
</td>
<td class="result published_year center">2016</td>
<td class="result citation center">71</td>
</tr>
As you can see, class name of two "" is different, but first_author gets the first one and doesn't change anymore!

How to auto extract data from a html file with python?

I'm beginning to learn python (2.7) and would like to extract certain information from a html code stored in a text file. The code below is just a snippet of the whole html code. In the full html text file the code structure is the same for all other firms data as well and these html code "blocks" are positioned underneath each other (if the latter info helps).
The html snippet code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
How it looks like on a webpage:
Right now im using the following script to extract the desired information:
from lxml import html
str = open('html1.txt', 'r').read()
tree = html.fromstring(str)
for variable in tree.xpath('/html/body/div/div'):
company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scape the whole html file, results are printed right after each other on one single line. But I would like to print the data (html code "blocks") under eachother like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2] etc. to get the data under each other like I would like, but doing this manually for all thousands of html "blocks" is just not really feasible.
So my question: how can I automatically extract the data "block by block" from the html code and print the results under each other like illustrated above?
I think what you want is
print(company_name, location, website,'\n')

Is there anyway to check if given XPath is valid in Python?

I have a python code that is extracting some information from a table. But the thing is sometimes the Xpath changes. Right now it only changes between two different XPath's that looks like this:
//*[#id='content-primary']/table[3]/tbody/tr[td[1]/span/span/
and the other alternative is a slight change in the table like this:
//*[#id='content-primary']/table[2]/tbody/tr[td[1]/span/span/
this is the code that i am using right now to get the information that i need:
rows_xpath = XPath("//*[#id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
So what i want to do is a check if the given XPath is valid. If it is not i just try the other XPath alternative.
Hope somebody can help me with this problem. Thank you all.
EDIT1
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tfoot>
<tr>
<td colspan="3">
<dl>
<dt class="clNotify">Röd text</dt>
<dd> = Ändrad matchtid </dd>
<dt><img src="http://svenskfotboll.se/i/u/alert.gif" alt="Röda utropstecknet" /></dt>
<dd> = Peka på utropstecknet så visas en notering </dd>
<dt><img src="http://svenskfotboll.se/i/widget.gif" alt="Widget" /></dt>
<dd>Hämta widget för kommande matcher</dd>
</dl>
</td>
</tr>
</tfoot>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span> //This is the date i am checking with first
</td>
<td>Götene IF - Vårgårda IK </td> // The other information that i need from the table later
<td>Sparbanksvallen Götene konstgräs </td>
</tr>
In my situation i did not need to specify which table to extract the information from. Since the information that i will get is specified with the date that only contains in that table i just used this code and it worked out fine for me:
**rows_xpath = XPath("//*[#id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))**
now it is just table which means it will go through both tables in the website. Its not maybe a clean solution but works for me..

Categories

Resources