I am trying to convert an App Engine generated output page into a PDF and have run into some problems. My approach:
First: select the contents with jQuery.
Second: send that JavaScript variable to a new Python script.
Third: in the new Python script, use xhtml2pdf to do the conversion.
However, I am stuck on the second step. Here is what I have so far:
HTML:
<div class="articles">
<h2 class="model_header">PFAM Output</h2>
<form>
<table align="center">
<!--end 04uberoutput_start-->
<table class="out_chemical" width="550" border="1">
<tr>
<th scope="col" colspan="5">
<div align="center">Chemical Inputs</div>
</th>
</tr>
<tr>
<th scope="col" width="250">
<div align="center">Variable</div>
</th>
<th scope="col" width="150">
<div align="center">Unit</div>
</th>
<th scope="col" width="150">
<div align="center">Value</div>
</th>
</tr>
<tr>
<td>
<div align="center">Water Column Half life #20 ℃</div>
</td>
<td>
<div align="center">days</div>
</td>
<td>
<div align="center">11</div>
</td>
</tr>
</table>
</table>
</form>
</div>
JS:
$(document).ready(function () {
var jq_html = $("div.articles").html();
console.log(jq_html);
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
$('.getpdf').append('<tr><td><input type="submit" value="Generate PDF"/></td></tr>');
})
The new Python script that does the conversion:
def post(self):
    form = cgi.FieldStorage()
    extract = form.getvalue('extract')
    print extract  # debug: check what was posted
    self.response.out.write(extract)
When I tried to check whether the extract variable was transferred correctly, I got an empty page. It seems like the variable is ignored. The whole framework works fine if I feed extract with a plain number instead. Could anyone help me figure out whether my approach is correct? Thanks!
This line of code does not escape the HTML it embeds, and it creates a text field rather than a hidden field:
$('.getpdf').append('<tr style="display:none"><td><input name="extract" value="' + jq_html + '"></input></td></tr>');
A better way to do it would be like this:
$('<tr style="display:none"><td><input type="hidden" name="extract"></td></tr>')
.appendTo('.getpdf')
.find('input')
.val(jq_html);
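On the server side, the posted HTML can then be handed to xhtml2pdf. Below is a minimal sketch of what the conversion handler could look like, assuming the same cgi/webapp handler style and Python 2 runtime as the question; treat it as an outline rather than a drop-in:
import cgi
import StringIO
from xhtml2pdf import pisa

def post(self):
    form = cgi.FieldStorage()
    extract = form.getvalue('extract')
    pdf = StringIO.StringIO()
    # CreatePDF renders the HTML string into the dest buffer and returns a status object.
    status = pisa.CreatePDF(extract, dest=pdf)
    if not status.err:
        self.response.headers['Content-Type'] = 'application/pdf'
        self.response.out.write(pdf.getvalue())
    else:
        self.response.out.write('PDF generation failed')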
I have extracted some HTML markup from a website using a curl command. The information is displayed in a tabular format on the site, but what I get back is raw HTML.
Please advise on how to parse this information using Python so that I can gather only the data, ideally into a list like [[CALX-582, Action-Item], [CALX-736, Action-Item], ...]. Are there any Python APIs that can do that, or is it advisable to just use a regex to parse out the required data?
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>
As has been mentioned, you could use BeautifulSoup for this.
I'm not sure exactly how you want the data, but the code below will create a list of dictionaries, with the keys coming from the JIRA column and the values from the Type column.
You could use other methods to put the data into other types of structures.
from bs4 import BeautifulSoup
html = """
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>
"""
soup = BeautifulSoup(html, 'html.parser')
# Every data cell in document order: JIRA id, type, JIRA id, type, ...
jira = soup.select('td')
# Pair the cells up: each JIRA id becomes a key and the following Type cell its value.
data = [{jira[idx].getText(): jira[idx+1].getText()} for idx in range(0, len(jira), 2)]
print(data)
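If you would rather have the list-of-lists shape mentioned in the question, a small variation on the same idea (still just a sketch against the snippet above) works too:
cells = [td.get_text(strip=True) for td in soup.select('td')]
# Group the flat cell list into [JIRA, Type] pairs.
rows = [cells[i:i + 2] for i in range(0, len(cells), 2)]
print(rows)  # [['CALX-582', 'Action-Item'], ['CALX-736', 'Action-Item'], ...]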
Hi, we are running this code and it is driving me crazy.
We capture a data table in table: this works.
Then we grab all the th elements and their text into sizes: this works too.
Then we want to grab all the underlying rows (tr) and loop over the columns in each row: this does not work! The color_rows object is always empty, yet when I test the same XPath in the browser it does work. Why? How?
My question is: how can I grab the tbody/tr rows?
Expected flow:
Loop over the tr elements.
Access each tr one by one and get its first td.
Get the data of all td cells that contain an input with class form-control.
table = response.xpath('//div[@class="content"]//table[contains(@class,"table")]')
sizes = table.xpath('./thead//th/text()').getall()[1:]  # works!
color_rows = table.xpath('./tbody/tr')  # does not work! object is always empty
for color_row in color_rows:
    color = color_row.xpath('/td[1]/b/text()').get().strip()
    print(color)
    stocks = color_row.xpath('/td/div[input[@class="form-control"]]/div//text()').getall()
    for size, stock in zip(sizes, stocks):
Our HTML data looks like this:
<table class="table">
<thead>
<tr>
<th id="ctl00_cphCEShop_colColore" class="text-left" colspan="2">Colore</th>
<th>S</th>
<th>M</th>
<th>L</th>
</tr>
</thead>
<tbody>
<tr>
<td id="x">
<b>White</b>
<input type="hidden" name="data" value="3230/201">
</td>
<td id="avail">
Avail:
</td>
<td id="1">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">179</div>
</div>
</td>
<td id="2">
<div>
<input name="cell" type="text" class="form-control">
<div class="text-center">360</div>
</div>
</td>
etc etc
Apparently tbody tags are often omitted in the HTML source but added by the browser.
In this case there was no (real) tbody tag in the downloaded source, so the XPath matched nothing.
Hence the trouble with the XPath if you assume, from the browser's DOM, that the tbody tag is there. See:
Why do browsers insert tbody element into table elements?
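A sketch of the selectors with that in mind, reusing the table and sizes variables from the question and matching the data rows whether or not a tbody is present (the class names come from the snippet above, so adjust as needed):
color_rows = table.xpath('.//tr[td]')  # data rows, with or without a <tbody> in the source
for color_row in color_rows:
    # Paths inside the loop are relative to the current row (./ instead of /).
    color = color_row.xpath('./td[1]/b/text()').get(default='').strip()
    stocks = color_row.xpath('./td/div[input[@class="form-control"]]/div/text()').getall()
    for size, stock in zip(sizes, stocks):
        print(color, size, stock)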
I'm beginning to learn Python (2.7) and would like to extract certain information from HTML code stored in a text file. The code below is just a snippet of the whole HTML. In the full file the structure is the same for all the other firms' data as well, and these HTML "blocks" sit one underneath the other (in case that helps).
The HTML snippet:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
Right now I'm using the following script to extract the desired information:
from lxml import html
source = open('html1.txt', 'r').read()
tree = html.fromstring(source)
for variable in tree.xpath('/html/body/div/div'):
    company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
    location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
    website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
    print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scrape the whole HTML file, the results are printed right after each other on one single line. I would like to print the data (one HTML "block" per line) under each other like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2], etc. to get the data under each other the way I would like, but doing that manually for thousands of HTML "blocks" is just not feasible.
So my question: how can I automatically extract the data block by block from the HTML and print the results under each other as illustrated above?
I think what you want is:
print(company_name, location, website, '\n')
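If the goal is one line per company, another way, sketched against the snippet above (so the class names and cell positions are assumptions about the full file), is to iterate over the card divs and use XPath expressions relative to each block:
from lxml import html

tree = html.parse('html1.txt')
for block in tree.xpath('//div[@class="tab_content_card"]'):
    name = block.xpath('.//div[@class="card-header"]/strong/text()')
    city = block.xpath('(.//label[@class="downdrill-sbi"])[last()]/text()')
    phone = block.xpath('.//td[normalize-space()="Phone:"]/following-sibling::td[1]/text()')
    # Take the first hit for each field, or an empty string when nothing matched.
    print(' | '.join(x[0].strip() if x else '' for x in (name, city, phone)))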
From this Deutsche Börse web page, under the table header Issuer, I want to get the string 'db X-trackers' from the cell next to the one with Name in it.
Using my web browser, I inspected that table area and got the code, which I've pasted into this XML tree just so that I can test my XPath.
<root>
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td>Name</td>
<td class="text-right">db X-trackers</td>
</tr>
</tbody>
</table>
</div>
</root>
According to FreeFormatter.com, my XPath below succeeds in retrieving the correct element (Text='db X-trackers'):
my_xpath = "//h2['Issuer']/ancestor::div[@class='row']/following-sibling::div//td['Name']/following-sibling::td[1]/text()"
Note: It goes to <h2>Issuer</h2> first to identify the right place to start working from.
However, when I run this on the actual web page using Selenium WebDriver, None is returned.
def get_sibling(driver, my_xpath):
    try:
        find_value = driver.find_element_by_xpath(my_xpath).text
    except NoSuchElementException:
        return None
    else:
        value = re.search(r"(.+)", find_value).group()
        return value
I don't believe anything is wrong with the function itself, so either the XPath is faulty or there is something in the actual page source that throws it off.
When I study the actual source code in Chrome, it looks a bit messier than what I see with the Inspector, which is what I used to create the little XML tree above.
<div class="box">
<div class="row">
<div class="col-lg-12">
<h2>Issuer</h2>
</div>
</div>
<div class="table-responsive">
<table class="table">
<tbody>
<tr>
<td >
Name
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Product Family
</td>
<td class="text-right" >
db X-trackers
</td>
</tr>
<tr>
<td >
Homepage
</td>
<td class="text-right" >
<a target="_blank" href="http://www.etf.db.com">www.etf.db.com</a>
</td>
</tr>
</tbody>
</table>
</div>
Are there some peculiarities in the source code above, or is my XPath (or function) wrong?
I would use the following and following-sibling axes:
//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td
First we locate the h2 element with the text Issuer, then take the table element that follows it. Inside that table we look for the td element whose text is Name and then take its following td sibling.
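As a side note (not part of the original answer): Selenium's find_element_by_xpath has to point at an element node, so the trailing /text() in the question's expression is one likely reason None came back; with the expression above you call .text on the element it returns. A minimal usage sketch in the style of the question's function:
my_xpath = '//h2[. = "Issuer"]/following::table//td[. = "Name"]/following-sibling::td'
# find_element_by_xpath returns a WebElement; .text gives its visible text.
issuer = driver.find_element_by_xpath(my_xpath).text
print(issuer)  # expected: db X-trackers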
I am trying to use the Requests module to log in to a site and get the HTML of the landing page. I am new to this and I can't find a decent tutorial for it.
Here is the information that I have about the page.
HTML of the login form (URL: http://14.139.251.99:8080/jopacv06/html/checkouts):
<FORM NAME="form" METHOD="POST" ACTION="./memberlogin" onsubmit="this.onsubmit= function(){return false;}">
<table class='loginTbl' border='1' align="center" cellspacing='3' cellpadding='3' width='60%'>
<input type="hidden" name="hdnrequesttype" value="1" />
<thead>
<tr>
<td colspan='3' align="middle" class='loginHead'>Login</td>
</tr>
</thead>
<tbody class='loginBody'>
<tr>
<td class='loginBodyTd1' nowrap="nowrap">Employee ID</td>
<td class='loginBodyTd2'><input type='text' name='txtmemberid' id='txtmemberid' value='' class='loginTextBox' size='30' maxlength='8'/></td>
<td class='loginBodyTd3' rowspan='2'><input type="submit" class="goclearbutton" value=" Go "></td>
</tr><input type='hidden' name='txtmemberpwd' id='txtmemberpwd' value='' />
</tbody>
<tfoot>
<tr>
<td colspan='3' class='loginFoot'>
<font class='loginRed'>New Visitor?</font>
Send your registration request to library !
</td>
</tr>
</tfoot>
</table>
</form>
I came to know that I may need to set a cookie; the cookie name on the landing page is JSESSIONID (in case that's required). And I discovered that once I successfully log in, I would have to use BeautifulSoup to get the details. Please help me combine these pieces together.
You will have to do something like this:
import requests
response = requests.post("http://14.139.251.99:8080/jopacv06/html/checkouts/memberlogin", data={'txtmemberid': '1'})
if response.status_code == 200:
    html_code = response.text
    # Do whatever you want to do further with this HTML now.
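To tie the pieces together (keeping the JSESSIONID cookie and then parsing with BeautifulSoup, as the question asks), here is a hedged sketch: a requests.Session stores the session cookie automatically, the field names come from the form above, YOUR_EMPLOYEE_ID is a placeholder you fill in, and the table loop at the end is only an example of what you might do with the landing page:
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps the JSESSIONID cookie between requests
payload = {'hdnrequesttype': '1', 'txtmemberid': 'YOUR_EMPLOYEE_ID', 'txtmemberpwd': ''}
login = session.post('http://14.139.251.99:8080/jopacv06/html/checkouts/memberlogin', data=payload)

soup = BeautifulSoup(login.text, 'html.parser')
for row in soup.select('table tr'):
    print([cell.get_text(strip=True) for cell in row.find_all('td')])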