Parse HTML storage format for data - python

I have extracted a HTML storage format markup language from a website. The information is in a tabular format as shown in the website:
But after I extract the information using a curl command I get the information in terms of HTML. Please advise on how to parse this information using Python such that I can gather only the data. Maybe we can insert the data in a list like [[CALX-582 Action-Item], [CALX-736 Action-Item]......]. Are there any Python-APIs that can do that? Or is it advisable to just use REGEX and parse the required data.
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>

As has been mentioned you could use BeautifulSoup for this.
Not sure how you want the data but the code below will create a list of dictionaries with the keys coming from the JIRA column and the values from the Type column.
You could use other methods to put the data into other types of structures.
from bs4 import BeautifulSoup
html = """
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>
"""
soup = BeautifulSoup(html, 'html.parser')
jira = soup.select('td')
data = [{jira[idx].getText(): jira[idx+1].getText()} for idx in range(0, len(jira), 2)]
print(data)

Related

BeautifulSoup how to only return class objects

I have a html document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
So i have used this code but i am getting the first text from the tr that's not a class, and i need to ignore it:
soup.findAll('tr').text
Also, when I try to do just a class, this doesn't seem to be valid python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.
To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
print(tag.text.strip())
Output :
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED

Sublime Text: regreplace - move entire tag above another tag

Not sure if this is possible.
What I want to do is move an entire tag with its content above another tag.
For example:
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
<p class="para">Test</p>
I want to move the p and its content above the table so end result would be:
<p class="para">Test</p>
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
So simply don't know how to move it. I can capture the p by doing this regex:
(?P<test><p class=\"para\">(.*?)(</p>))
I can also capture the entire table:
(<table (.*?)>)(.*?)(</table>))
So not sure if you can move it.
Can anyone help?
Thanks
Regex:(\<table(?:.*\n)+\<\/table>)\n(\<p(?:.*?)\<\/p>)
Replace with: $2\n$1
Demo
Instead of regex, use unpacking and str.join:
s = """
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
<p class="para">Test</p>
"""
*data, target = filter(None, s.split('\n'))
new_html = '{}\n{}'.format(target, '\n'.join(data))
Ouptut:
<p class="para">Test</p>
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>

how to verify that a specific cell in a table contains a certain value in a link with Selenium and Python

I have a table similar to the following
<tbody>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
</tbody>
I would like to check with Python and Selenium that the cell in row 1, column 3 contains a link to /process/100/
I am able to access the text with
cell_content = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]').text
but I would like to access
href="/process/101/"
to check if it contains 101
Use get_attribute() method as below
cell_href = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]').get_attribute('href')
assert "101" in cell_href

How to auto extract data from a html file with python?

I'm beginning to learn python (2.7) and would like to extract certain information from a html code stored in a text file. The code below is just a snippet of the whole html code. In the full html text file the code structure is the same for all other firms data as well and these html code "blocks" are positioned underneath each other (if the latter info helps).
The html snippet code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
How it looks like on a webpage:
Right now im using the following script to extract the desired information:
from lxml import html
str = open('html1.txt', 'r').read()
tree = html.fromstring(str)
for variable in tree.xpath('/html/body/div/div'):
company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scape the whole html file, results are printed right after each other on one single line. But I would like to print the data (html code "blocks") under eachother like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2] etc. to get the data under each other like I would like, but doing this manually for all thousands of html "blocks" is just not really feasible.
So my question: how can I automatically extract the data "block by block" from the html code and print the results under each other like illustrated above?
I think what you want is
print(company_name, location, website,'\n')

Passing html tables to a new page in Google App Engine

I tried to convert app engine generated output page into pdf, and had some problems.
First: I select the contents in jQuery.
Second: Send this javascript variable to a new python script
Third: In the new python script, using xhtml2pdf to the conversion.
However, I got confused in the Second step. Below is my approach:
HTML:
<div class="articles">
<h2 class="model_header">PFAM Output</h2>
<form>
<table align="center">
<!--end 04uberoutput_start-->
<table class="out_chemical" width="550" border="1">
<tr>
<th scope="col" colspan="5">
<div align="center">Chemical Inputs</div>
</th>
</tr>
<tr>
<th scope="col" width="250">
<div align="center">Variable</div>
</th>
<th scope="col" width="150">
<div align="center">Unit</div>
</th>
<th scope="col" width="150">
<div align="center">Value</div>
</th>
</tr>
<tr>
<td>
<div align="center">Water Column Half life #20 &#8451</div>
</td>
<td>
<div align="center">days</div>
</td>
<td>
<div align="center">11</div>
</td>
</tr>
</table>
</table>
</form>
</div>
JS
$(document).ready(function () {
var jq_html = $("div.articles").html();
console.log(jq_html);
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
$('.getpdf').append('<tr><td><input type="submit" value="Generate PDF"/></td></tr>');
})
new python script to do the conversion
def post(self):
form = cgi.FieldStorage()
extract = form.getvalue('extract')
print extract
self.response.out.write(html)
When I tried to check if variable extract is transferred correctly, I got an empty page. It seems like this variable is ignored... The whole framework seems fine if I feed extract with a number. So could anyone help me to identify if my approach is correct? Thanks!
This line of code does not handle escaping HTML correctly. Additionally, it is a text field rather than a hidden field:
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
A better way to do it would be like this:
$('<tr style="display:none"><td><input type="hidden" name="extract"></td></tr>')
.appendTo('.getpdf')
.find('input')
.val(jq_html);

Categories

Resources