How to auto extract data from a html file with python?

I'm beginning to learn Python (2.7) and would like to extract certain information from HTML stored in a text file. The code below is just a snippet of the whole file. In the full file the structure is the same for every other firm's data, and these HTML "blocks" are positioned one after another (in case that helps).
The html snippet code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
Right now I'm using the following script to extract the desired information:
from lxml import html
str = open('html1.txt', 'r').read()
tree = html.fromstring(str)
for variable in tree.xpath('/html/body/div/div'):
    company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
    location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
    website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
    print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scrape the whole HTML file, the results are printed right after each other on one single line. I would like to print the data for each HTML "block" on its own line, like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2] etc. to get the data on separate lines the way I want, but doing this manually for thousands of HTML "blocks" is not feasible.
So my question: how can I automatically extract the data "block by block" from the HTML and print the results one per line, as illustrated above?

I think what you want is
print(company_name, location, website,'\n')
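A newline alone may not be enough here. In lxml, an XPath that starts with / is evaluated from the document root, whichever element it is called on, so every pass of the loop can return the values for all blocks at once. A minimal sketch of querying relative to each block instead, assuming Python 3 and a made-up two-company sample with the same class names as the snippet (the SAMPLE text, the first helper and the extract_rows function are illustrative names, not part of the original script):

```python
from lxml import html

# Made-up sample: two company "blocks" laid out like the question's snippet.
SAMPLE = """
<body><div class="tab_content-wrapper">
  <div class="tab_content_card">
    <div class="card-header"><strong>Liberty Associates LLC</strong></div>
    <div class="card-content"><table><tbody>
      <tr><td colspan="2">NJ 07105 <label>New York</label></td></tr>
      <tr><td>Phone:</td><td>+1 973-344-8300</td></tr>
    </tbody></table></div>
  </div>
  <div class="tab_content_card">
    <div class="card-header"><strong>Company B</strong></div>
    <div class="card-content"><table><tbody>
      <tr><td colspan="2">CA 90012 <label>Los Angeles</label></td></tr>
      <tr><td>Phone:</td><td>+1 213-802-1770</td></tr>
    </tbody></table></div>
  </div>
</div></body>
"""


def first(nodes):
    """Return the first XPath text result, stripped, or '' if nothing matched."""
    return nodes[0].strip() if nodes else ""


def extract_rows(page_source):
    """Return one (name, city, phone) tuple per company block."""
    tree = html.fromstring(page_source)
    rows = []
    # Select every block once, then query *relative* to it (paths start with '.').
    for card in tree.xpath('//div[@class="tab_content_card"]'):
        name = first(card.xpath('.//div[@class="card-header"]/strong/text()'))
        city = first(card.xpath('.//label/text()'))
        phone = first(card.xpath(
            './/td[normalize-space()="Phone:"]/following-sibling::td[1]/text()'))
        rows.append((name, city, phone))
    return rows


for row in extract_rows(SAMPLE):
    print(" | ".join(row))
```

Each block then produces its own line, and no manual [0], [1], [2] indexing across blocks is needed. The relative paths would have to be adapted to the real file, for example to tell the industry label apart from the city label.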

Related

How do I loop over this outerHTML code to get out certain data? (I don't know how to webscrape it so I want to try this)

I am trying to get a list that matches India's districts to their district codes as they were during the 2011 population census. Below is a small subset of the outerHTML I copied from a government website. I am trying to loop over it, extract a string and an int from each HTML block, and ideally store them on the same row of a pandas DataFrame. The HTML blocks look like this; I show two here, and there are around 700 in my txt file:
<tr>
<td width="5%">1</td>
<td>603</td>
<td align="left">**NICOBARS**</td>
<td align="left">NICOBARS </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td>
<td align="left">02</td>
<td align="left">**638**</td>
<td align="left">
Not Covered
</td>
<td width="5%" align="center"><i class="fa fa-eye" aria-hidden="true"></i>
</td>
<td width="5%" align="center"><i class="fa fa-history" aria-hidden="true"></i>
</td>
<td width="5%" align="center">
</td>
<td width="3%" align="center">
<!-- Merging issue revert beck 05/10/2017 -->
<i class="fa fa-map-marker" aria-hidden="true"></i>
</td>
</tr>
<tr>
<td width="5%">2</td>
<td>632</td>
<td align="left">**NORTH AND MIDDLE ANDAMAN**</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td>
<td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td>
<td align="left"></td>
<td align="left">**639**</td>
<td align="left">
Not Covered
I have put ** around the values that I want to get from the text file. I was wondering how I could loop through this text to extract this data. I thought about starting a count each time I encounter a new block and then extracting the 1st and 6th values, but I don't know how to code this. I hope someone is willing to help out. If anyone already has this list, that would be great too!

If you're able to get the text of the entire HTML table, you can use dfs = pd.read_html(html_text_string), which returns a list of DataFrames, one per table it finds. 50% of the time, it works every time!
See the pd.read_html docs for details.
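pd.read_html looks for complete table elements, so a bare run of tr rows like the one in the question may need wrapping in a table tag first. As an alternative that uses only the standard library, here is a rough sketch that collects the cells of each row and picks two of them by position. The RowCollector class, the shortened SAMPLE rows, and the choice of cell indexes 2 and 7 are assumptions based on the snippet shown, with the ** markers left out:

```python
from html.parser import HTMLParser

# Shortened, made-up rows with the same cell order as the question's snippet.
SAMPLE = """
<tr>
<td width="5%">1</td><td>603</td><td align="left">NICOBARS</td>
<td align="left">NICOBARS </td><td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NIC</td><td align="left">02</td><td align="left">638</td>
<td align="left">
Not Covered
</td>
</tr>
<tr>
<td width="5%">2</td><td>632</td><td align="left">NORTH AND MIDDLE ANDAMAN</td>
<td align="left">NORTH AND MIDDLE ANDAMAN </td><td align="left">ANDAMAN AND NICOBAR ISLANDS(State)</td>
<td align="left">NMA</td><td align="left"></td><td align="left">639</td>
<td align="left">
Not Covered
</td>
</tr>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


parser = RowCollector()
parser.feed(SAMPLE)
parser.close()

# Cell 2 holds the district name and cell 7 the code in this layout.
districts = [(cells[2], int(cells[7])) for cells in parser.rows if len(cells) > 7]
print(districts)
```

The resulting list of (name, code) tuples can be handed to pandas.DataFrame if a DataFrame is the end goal. A row that is cut off before its closing tr tag is skipped by this sketch.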

When webscraping with python and Beautifulsoup adding a class to a findAll() search brings back zero results?

I'm new to coding in general. To be brief I am using the soup.findAll('table') function and it brings back all the tables on the web page. When I search soup.findAll('table', class_='playerTable rtable') it brings back []. I know that that is the correct class name as I copied it from the HTML. Do you guys know why this might be happening? What am I missing here?
url I'm attempting to scrape from http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/
The reason you don't see the same table as me is that you need to be signed in to a paid account to access the information. My question still stands: why might this be happening when I know there is a table with the class I am searching for? Thanks so much for the help!

I don't see any class named "playerTable rtable" for the link you provided. Maybe you can try this and let me know if it was what you needed. Happy to delete/change my answer if it doesn't work out for you:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = BeautifulSoup(requests.get("http://www.spotrac.com/nfl/denver-broncos/peyton-manning-5028/").content, "lxml")
>>> r.findAll("table", attrs = {"class":"playerTable"})
[<table class="playerTable">
<tbody>
<tr>
<td class="contract-type">
<div>
<h2>
<span class="contract-type-logo"><img alt="Team contract signed with" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></span>
<span class="contract-type-years">2016-2016 <small>Dead Money</small></span>
</h2>
</div>
</td>
</tr>
</tbody>
</table>, <table class="playerTable">
<tbody>
<tr>
<td style="padding-right:5px;">
<table class="salaryTable rtable current">
<thead>
<tr class="salaryRow">
<th class="header center">Year</th>
<th class="header center"> </th>
<th class="header salaryAmt center "><span>Base Salary</span></th>
<th class="header salaryAmt center"><span title="">Signing Bonus</span></th> <th class="header salaryAmt center"><span>Workout Bonus</span></th> <th class="header salaryAmt center"><span title="">Restruc. Bonus</span></th> <th class="header salaryAmt center"><span>Dead Cap Hit</span></th>
</tr>
</thead>
<tbody>
<tr class="salaryRow">
<td class="salaryYear center">2016</td>
<td class="salaryYear center"><img alt="Player contract details by year" src="http://d1dglpr230r57l.cloudfront.net/images/thumb/broncos.png"/></td>
<td class="salaryAmt ">-</td>
<td class="salaryAmt ">-</td> <td class="salaryAmt ">-</td> <td class="salaryAmt ">$2,500,000</td> <td class="salaryAmt ">$2,500,000</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>]

The reason is that no table on this page has both classes playerTable and rtable. soup.findAll('table', class_='playerTable rtable') only matches table elements whose class attribute is exactly that string, so both classes must be present; hence the empty list.
EDIT: In the end, the main cause of this behaviour was the unauthenticated request used to fetch the HTML. The page returned without logging in contains no table with the specified classes.
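To make the class matching concrete, here is a small sketch on a made-up two-table sample (the SAMPLE markup and the variable names are illustrative, and the built-in "html.parser" backend is used so no extra parser is required). It compares a single class name, a class_ string containing a space, and a CSS selector:

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the class names seen in the answer's output.
SAMPLE = """
<table class="playerTable"><tr><td>contract</td></tr></table>
<table class="salaryTable rtable current"><tr><td>salary</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# A single class name matches any table that carries it among its classes.
by_one_class = soup.find_all("table", class_="rtable")

# A string with a space is compared against the whole class attribute, so
# neither of these matches: no table has exactly that attribute value.
by_mixed_string = soup.find_all("table", class_="playerTable rtable")
by_partial_string = soup.find_all("table", class_="salaryTable rtable")

# A CSS selector asks for both classes, in any order, with others allowed.
by_both_classes = soup.select("table.salaryTable.rtable")

print(len(by_one_class), len(by_mixed_string),
      len(by_partial_string), len(by_both_classes))
```

When a table is expected to carry several classes, soup.select with chained class selectors is usually the less fragile choice, since it does not depend on the order or completeness of the class attribute.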

how to verify that a specific cell in a table contains a certain value in a link with Selenium and Python

I have a table similar to the following
<tbody>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
</tbody>
I would like to check with Python and Selenium that the cell in row 1, column 3 contains a link to /process/100/
I am able to access the text with
cell_content = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]').text
but I would like to access
href="/process/101/"
to check if it contains 101

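The href attribute belongs to the link inside the cell, not to the td itself, so locate the a element and call get_attribute() on it:
cell_href = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]/a').get_attribute('href')
assert "101" in cell_href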

Is there anyway to check if given XPath is valid in Python?

I have some Python code that extracts information from a table, but the XPath sometimes changes. Right now it only alternates between two different XPaths, which look like this:
//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span/
and the other alternative, with a slight change in the table index:
//*[@id='content-primary']/table[2]/tbody/tr[td[1]/span/span/
This is the code that I am using right now to get the information that I need:
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
So what I want to do is check whether the given XPath is valid. If it is not, I just try the other XPath alternative.
I hope somebody can help me with this problem. Thank you all.
EDIT1
<table class="clCommonGrid" cellspacing="0">
<thead>
<tr>
<td colspan="3">Kommande matcher</td>
</tr>
<tr>
<th style="width:1%;">Tid</th>
<th style="width:69%;">Match</th>
<th style="width:30%;">Arena</th>
</tr>
</thead>
<tfoot>
<tr>
<td colspan="3">
<dl>
<dt class="clNotify">Röd text</dt>
<dd> = Ändrad matchtid </dd>
<dt><img src="http://svenskfotboll.se/i/u/alert.gif" alt="Röda utropstecknet" /></dt>
<dd> = Peka på utropstecknet så visas en notering </dd>
<dt><img src="http://svenskfotboll.se/i/widget.gif" alt="Widget" /></dt>
<dd>Hämta widget för kommande matcher</dd>
</dl>
</td>
</tr>
</tfoot>
<tbody class="clGrid">
<tr class="clTrOdd">
<td nowrap="nowrap" class="no-line-through">
<span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span> //This is the date i am checking with first
</td>
<td>Götene IF - Vårgårda IK </td> // The other information that i need from the table later
<td>Sparbanksvallen Götene konstgräs </td>
</tr>

In my situation I did not need to specify which table to extract the information from. Since the rows I want are identified by a date that appears only in that table, I just used this code and it worked fine for me:
rows_xpath = XPath("//*[@id='content-primary']/table/tbody/tr[td[1]/span/span//text()='%s']" % (date))
Now it is just table, which means it will search every table under content-primary on the page. It may not be a clean solution, but it works for me.
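For the original idea of trying one XPath and falling back to the other: in lxml, a well-formed XPath that matches nothing returns an empty list rather than raising, so the check can be a plain emptiness test. A minimal sketch, assuming a made-up page with only two tables (SAMPLE, CANDIDATES and first_matching_rows are illustrative names); the date is passed as an XPath variable instead of being formatted into the string:

```python
from lxml import html

# Made-up page: the match table is the second table under content-primary.
SAMPLE = """
<div id="content-primary">
  <table><tbody><tr><td>some other table</td></tr></tbody></table>
  <table class="clCommonGrid"><tbody class="clGrid">
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2015-04-17<!-- br ok --> 19:15</span></span></td>
      <td>Home IF - Away IK </td>
      <td>Example Arena </td>
    </tr>
  </tbody></table>
</div>
"""

# Candidate expressions, tried in order; $date is filled in at call time.
CANDIDATES = [
    "//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()=$date]",
    "//*[@id='content-primary']/table[2]/tbody/tr[td[1]/span/span//text()=$date]",
]


def first_matching_rows(tree, candidates, date):
    """Return the rows from the first candidate XPath that matches anything."""
    for expression in candidates:
        rows = tree.xpath(expression, date=date)
        if rows:
            return rows
    return []


tree = html.fromstring(SAMPLE)
rows = first_matching_rows(tree, CANDIDATES, "2015-04-17")
for row in rows:
    # string(td[2]) gives the text content of the second cell of the row.
    print(row.xpath("string(td[2])").strip())
```

Adding another layout later only means appending one more expression to the candidate list.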

Passing html tables to a new page in Google App Engine

I tried to convert an App Engine generated output page into a PDF and had some problems.
First: I select the contents with jQuery.
Second: I send this JavaScript variable to a new Python script.
Third: In the new Python script, I use xhtml2pdf to do the conversion.
However, I got stuck on the second step. Below is my approach:
HTML:
<div class="articles">
<h2 class="model_header">PFAM Output</h2>
<form>
<table align="center">
<!--end 04uberoutput_start-->
<table class="out_chemical" width="550" border="1">
<tr>
<th scope="col" colspan="5">
<div align="center">Chemical Inputs</div>
</th>
</tr>
<tr>
<th scope="col" width="250">
<div align="center">Variable</div>
</th>
<th scope="col" width="150">
<div align="center">Unit</div>
</th>
<th scope="col" width="150">
<div align="center">Value</div>
</th>
</tr>
<tr>
<td>
<div align="center">Water Column Half life @20 &#8451</div>
</td>
<td>
<div align="center">days</div>
</td>
<td>
<div align="center">11</div>
</td>
</tr>
</table>
</table>
</form>
</div>
JS
$(document).ready(function () {
var jq_html = $("div.articles").html();
console.log(jq_html);
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
$('.getpdf').append('<tr><td><input type="submit" value="Generate PDF"/></td></tr>');
})
new python script to do the conversion
def post(self):
    form = cgi.FieldStorage()
    extract = form.getvalue('extract')
    print extract
    self.response.out.write(html)
When I tried to check if variable extract is transferred correctly, I got an empty page. It seems like this variable is ignored... The whole framework seems fine if I feed extract with a number. So could anyone help me to identify if my approach is correct? Thanks!

This line of code does not handle escaping HTML correctly. Additionally, it is a text field rather than a hidden field:
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
A better way to do it would be like this:
$('<tr style="display:none"><td><input type="hidden" name="extract"></td></tr>')
.appendTo('.getpdf')
.find('input')
.val(jq_html);
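The same escaping problem appears whenever HTML is assembled by string concatenation, in any language. As an illustration only (this is Python 3, not the page's jQuery code, and the fragment below is a made-up stand-in for jq_html), the sketch shows how the first double quote in the fragment ends the value attribute early, and how escaping keeps the fragment intact:

```python
from html import escape, unescape

# Made-up stand-in for the HTML captured from div.articles.
fragment = '<div align="center">Chemical Inputs</div>'

# Naive concatenation: the quote after align= closes the value attribute,
# so a browser would submit only '<div align=' as the field's value.
naive = '<input type="hidden" name="extract" value="' + fragment + '">'

# Escaping &, <, > and quotes keeps the whole fragment inside the attribute.
escaped_fragment = escape(fragment, quote=True)
safe = '<input type="hidden" name="extract" value="' + escaped_fragment + '">'

print(naive)
print(safe)

# The original markup is recovered when the attribute value is decoded.
print(unescape(escaped_fragment) == fragment)
```

Setting the value through .val(), as in the answer above, avoids the issue on the client side because the string is never parsed as markup.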
