Beautiful soup not finding table ID - python

I am trying to extract some data from a webpage that has multiple tables. All the tables have an id="name" attribute. I am using Beautiful Soup 4 with Python 3.4.1. My code looped through the first tables just fine, but on the last one it returns 'None' and I can't figure out why.
The HTML for the table is below and, from what I can see, it is not formatted any differently from the other tables, which have ids such as id=Datagrid1.
<TR>
<TD vAlign=top>
<TABLE id=Datagrid7
style="FONT-SIZE: smaller; FONT-FAMILY: Verdana; WIDTH: 675px; BORDER-COLLAPSE: collapse"
cellSpacing=0 rules=all align=left border=1>
<TBODY>
The python code below returns None, but will work if I change the id to another known id name.
table = soup.find('table', id='DataGrid7')
print(table)

There was a typo in your program: the id should have a lowercase 'g'.
from bs4 import BeautifulSoup
html="""<TR>
<TD vAlign=top>
<TABLE id=Datagrid7
style="FONT-SIZE: smaller; FONT-FAMILY: Verdana; WIDTH: 675px; BORDER-COLLAPSE: collapse"
cellSpacing=0 rules=all align=left border=1>
<TBODY>"""
soup = BeautifulSoup(html)
print(soup.find('table', id='Datagrid7'))
# output: <table align="left" border="1" cellspacing="0" id="Datagrid7" rules="all" style="FONT-SIZE: smaller; FONT-FAMILY: Verdana; WIDTH: 675px; BORDER-COLLAPSE: collapse">
# <tbody></tbody></table>

There's a typo in the code.
The id of the table is Datagrid7, not DataGrid7:
table = soup.find('table', id='Datagrid7')
#                                  ^
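Because id values are matched case-sensitively, one way to spot this kind of typo is to list the table ids the page actually contains and compare them while ignoring case. The sketch below uses only the standard library's html.parser rather than Beautiful Soup; TableIdCollector and find_id_ignoring_case are hypothetical helper names, and the HTML is a shortened copy of the snippet from the question.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but leaves
        # attribute values (such as the id) exactly as written.
        if tag == "table":
            for name, value in attrs:
                if name == "id":
                    self.table_ids.append(value)


def find_id_ignoring_case(table_ids, wanted):
    """Return the ids that equal `wanted` when letter case is ignored."""
    return [tid for tid in table_ids if tid.lower() == wanted.lower()]


# Shortened copy of the markup from the question.
html = """<TR>
<TD vAlign=top>
<TABLE id=Datagrid7
style="FONT-SIZE: smaller" cellSpacing=0 rules=all align=left border=1>
<TBODY>"""

collector = TableIdCollector()
collector.feed(html)
collector.close()

matches = find_id_ignoring_case(collector.table_ids, "DataGrid7")
print(collector.table_ids)  # ['Datagrid7']
print(matches)              # ['Datagrid7'], the spelling to pass to soup.find
```

If the mistyped id shows up in matches with a different capitalisation, that spelling is the one to use in the lookup.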

Related

Loop through multiple html tables looking for specific <td> values

I'm trying to find an account number in a table (it can be in any of several tables) along with the status of the account. I'm using find_element with an XPath, and the odd thing is that it says it cannot find the table. You can see in the HTML that the id exists, yet the code falls through to my except branch and reports that the table was not found. My end result is to find the table that has the header Instance ID with the value 9083495r3498q345 and to get the value under the Status field for that row in the same table. Please keep in mind that the table may not be DataTables_Table_6; it could be DataTables_Table_i.
<table class="data-table clear-both dataTable no-footer" cellspacing="0" id="DataTables_Table_6" role="grid">
<thead>
<tr role="row"><th style="text-align: left; width: 167.104px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Parent Instance ID</div></th><th style="text-align: left; width: 116.917px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Instance ID</div></th><th style="text-align: left; width: 97.1771px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Plan Name</div></th><th style="text-align: left; width: 168.719px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Client Defined Identifier</div></th><th style="text-align: left; width: 39.5729px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Units</div></th><th style="text-align: left; width: 89.8438px;" class="ui-state-default sorting_disabled" rowspan="1" colspan="1"><div class="DataTables_sort_wrapper"><span class="DataTables_sort_icon"></span>Status</div></th></tr>
</thead>
<tbody>
<tr role="row" class="odd">
<td style="text-align: left;"><span style="padding-left:px;\"><a href="#" class="doAccountsPanel" ></a></span></td>
<td style="text-align: left;"><span style="padding-left:px;\">Not Needed</span></td>
<td style="text-align: left;">The Product</td>
<td style="text-align: left;">9083495r3498q345</td>
<td style="text-align: left;">1</td>
<td style="text-align: left;">Suspended</td>
</tr></tbody></table>
try:
    driver_chrom.find_element(By.XPATH,'//*[@id="DataTables_Table_6"]')
    print("Found The Table")
except:
    print("Didn't find the table")
I would have expected my print result to be "Found the Table", but I'm getting the "Didn't find the table".
DataTables are dynamic elements: the data they hold is filled in by JavaScript on an empty table skeleton after the page loads. Therefore, you need to wait for the table to fully load, then look up the information it holds:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
[...]
wait = WebDriverWait(driver, 20)
[...]
desired_info = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="DataTables_Table_6"]')))
print(desired_info.text)
See Selenium documentation here.
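For the second half of the question (the table number is not known in advance and the goal is the Status of one row), a possible sketch follows. It assumes that every candidate table id starts with DataTables_Table_, that Status is the last cell of the row as in the posted HTML, and that the account value contains no double quotes. status_xpath and find_status are hypothetical helper names, not part of Selenium.

```python
def status_xpath(account_id, table_id_prefix="DataTables_Table_"):
    """Build an XPath for the last cell (assumed to be Status) of the row holding account_id."""
    return (
        f'//table[starts-with(@id, "{table_id_prefix}")]'
        f'//tr[td[normalize-space()="{account_id}"]]'
        '/td[last()]'
    )


def find_status(driver, account_id, timeout=20):
    """Wait until the matching row exists, then return the text of its last cell."""
    # Imported here so status_xpath() can be tried without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    cell = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.XPATH, status_xpath(account_id)))
    )
    return cell.text


print(status_xpath("9083495r3498q345"))
```

With a live driver, find_status(driver, "9083495r3498q345") would be expected to return the Status cell for that row once the table has been populated.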

How to use border-radius while converting html to pdf using xhtmltopdf

I am trying to round the corners of my table, but border-radius doesn't seem to work when I convert the HTML below to PDF using the xhtmltopdf PDF generator. Below is the HTML for the content; the file name is sticker_print.html:
<div class="sticker" style="height:196px">
<table class="sticker_box" align="left">
<tr>
<td style="border: 1px solid #222;background-color: #ffffff;">
<h3 style="border-bottom: 1px solid #222222;">Batch Sticker</h3>
<h5 style="padding: 0 0 0 10px;">Batch ID</h5>
<p>MFG Date</p>
<p style="padding-bottom:0px;"><img src="http://www.computalabel.com/Images/C128ff#2x.png" width="195px" height="26px"><span> Bar Code </span></p>
<p style="text-align: left; padding-bottom: 0px;">
<img src="https://www.kaspersky.com/content/en-global/images/repository/isc/2020/9910/a-guide-to-qr-codes-and-how-to-scan-qr-codes-2.png" width="65px" height="65px">
<span style="display: block;margin-top: 0px;">QR Code</span>
</p>
</td>
</tr>
</table>
</div>
PDF CODE
pdf = render_to_pdf('sticker_print.html')
return HttpResponse(pdf, content_type='application/pdf')
Even though I'm not using the same PDF engine as you (and your question is 6 months old), I solved this issue by using corner-radius instead of border-radius on a table cell or div.

extract log file data and input directly into xhtml body

I've currently got a python script where a log file is put through and any defined 'excluded' keywords are stripped in the same file. I am attempting to then, after extracting the required words, input this into a pre-built XHTML file directly into the "body" section.
Is there a way that this can be accomplished?
My code for the writing from the extracted log file to the XHTML file is as follows, but this overwrites the XHTML file currently (which I expect as this is where I am stuck).
I have read up on BeautifulSoup but I don't want to go down that path, I want to strictly keep this all executed within the python file (if possible).
contents = open(r'\path\to\file.log', 'r')
with open("output.html", "w") as writehtml:
    for lines in contents.readlines():
        writehtml.write("<pre>" + lines + "</pre> <br>\n")
The formatting I have for my XHTML page within the section is as follows:
<body>
<tr>
<td bgcolor="#ffffff" style="padding: 40px 30px 40px 30px;">
<table border="1" cellpadding="0" cellspacing="0" width="100%%">
<tr>
<td style="padding: 10px 0 10px 0; font-family: Calibri, sans-serif; font-size: 16px;">
<!-- Body text from file goes here-->
Body Text Replaces Here
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</body>
Thanks.
How is this?
# Read the template pieces and concatenate them around the log lines
contents = open(r'\path\to\file.log', 'r')
# Suppose that the beginning of your template is stored in this file: \path\template\start.txt
start = '''
<body>
<tr>
<td bgcolor="#ffffff" style="padding: 40px 30px 40px 30px;">
<table border="1" cellpadding="0" cellspacing="0" width="100%%">
<tr>
<td style="padding: 10px 0 10px 0; font-family: Calibri, sans-serif; font-size: 16px;">
'''
# start = open(r'\path\template\start.txt', 'r').read()
# Assume that the end of your template is in this file: \path\template\end.txt
end = '''
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</body>
'''
# end = open(r'\path\template\end.txt', 'r').read()
with open("output.html", "a") as writehtml:
    writehtml.write(start)
    for lines in contents.readlines():
        writehtml.write("<pre>" + lines + "</pre> <br>\n")
    writehtml.write(end)
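An alternative that keeps the pre-built XHTML file in one piece is to read it as text and replace the placeholder line with the log content. This is only a sketch using the standard library: it assumes the template contains the literal text "Body Text Replaces Here" exactly once, and fill_template and write_report are hypothetical helper names. The log lines are escaped so that characters such as < or & cannot break the markup.

```python
import html
from pathlib import Path

# Placeholder text taken from the template shown in the question.
PLACEHOLDER = "Body Text Replaces Here"


def fill_template(template_text, log_lines, placeholder=PLACEHOLDER):
    """Return template_text with the placeholder replaced by escaped log lines."""
    # Escape &, <, > and quotes so log content stays inert inside the XHTML.
    body = "".join(
        "<pre>" + html.escape(line.rstrip("\n")) + "</pre>\n" for line in log_lines
    )
    # Replace only the first occurrence of the placeholder.
    return template_text.replace(placeholder, body, 1)


def write_report(template_path, log_path, output_path):
    """Read the template and the log, then write the filled page to output_path."""
    template_text = Path(template_path).read_text(encoding="utf-8")
    with open(log_path, encoding="utf-8") as log_file:
        filled = fill_template(template_text, log_file)
    Path(output_path).write_text(filled, encoding="utf-8")


# Small in-memory demonstration that does not touch the file system.
demo = fill_template("<td>\nBody Text Replaces Here\n</td>", ["a < b\n", "done\n"])
print(demo)
```

Writing the result to a separate output file leaves the template untouched, so the script can be run repeatedly.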

can't get text from SPAN tag

The structure of the website I'm trying to parse looks like this:
<table border="0" cellpadding="3" cellspacing="0" width="100%">
<tr height="25">
<td class="th" style="border:none" width="2%"> </td>
<td class="th">movie</td>
<td class="th"> </td>
<td class="th"> </td>
</tr>
<tr id="place_1">
<td style="color: #555; vertical-align: top; padding: 6px">
<a name="1"></a>1.
</td>
<td style="height: 27px; vertical-align: middle; padding: 6px 30px 6px 0">
<a class="all" href="/326/">MOVIE TITLE IN SPANISH</a>
<br/>
<span class="text-grey">MOVIE TITLE IN ENGLISH</span>
</td>
<td style="width: 85px">
<div style="width: 85px; position: relative">
<a class="continue" href="/326/votes/">
9.191
</a>
<span style="color: #777">
(592 184)
</span>
</div>
</td>
</tr>
...
...
...
The problem is I can't get the text inside the span tag. I've tried .text, as for the a tag, and also .get_text(), but neither worked. My Python code:
for row in table.find_all('tr')[1:]:
    info = row.find_all('td')
    movies.append({
        'spn_title' : info[1].a.text,
        'eng_title' : info[1].span.text,
    })
The errors I get:
AttributeError: 'NoneType' object has no attribute 'get_text'
or
'eng_title' : info[1].span.text
AttributeError: 'NoneType' object has no attribute 'text'
Try the following. Also, check your soup variable, because I can run your code without a problem. I suspect that somewhere later in the HTML a row is missing one of these elements.
If the class names are consistent, you could filter for only the rows that contain elements with both of those classes. This uses bs4 4.7.1.
for row in table.select('tr :has(span.text-grey):has(a.all)'):
    movies.append({
        'spn_title' : row.select_one('.all').text,
        'eng_title' : row.select_one('.text-grey').text
    })
print(movies)
Otherwise, you need a way to handle rows where an element is not present. For example:
for row in table.find_all('tr')[1:]:
    movies.append({
        'spn_title' : row.select_one('.all').text if row.select_one('.all') is not None else 'None',
        'eng_title' : row.select_one('.text-grey').text if row.select_one('.text-grey') is not None else 'None'
    })
print(movies)
I think that you should use innerHTML.
info[1].getElementsByTagName('span')[0].innerHTML
should work.
I have the same issue but I was able to resolve it.
example
<span class="a-offscreen">$10.99</span>
instead of Elem.FindElementByCss("span.a-offscreen").Text
use:
Elem.FindElementByCss("span.a-offscreen").FindElementByXPath("parent::*").Text
The trick is to get the text of the parent.
Btw, I am using VBA so you need to change it to Python Syntax.

pyfpdf write_html In-line CSS style attribute not working in fpdf python

I am trying to create a PDF file using pyfpdf in a Python Django project. With the following code snippet I am trying to generate a PDF from HTML that uses inline CSS, but the CSS styles are not rendered.
from fpdf import FPDF, HTMLMixin
import os
class WriteHtmlPDF(FPDF, HTMLMixin):
    pass
pdf = WriteHtmlPDF()
# First page
pdf.add_page()
html = f"""<h3>xxxxx</h3>
<div style="border:1px solid #000">
<table border="1" cellpadding="5" cellspacing="0">
<tr><th width=20 align="left">xxxxxxxx:</th><td width="100">xxxxxxxx</td></tr>
<tr><th width=20 align="left">xxxxxxxx:</th><td width="100">xxxxxxxxx</td></tr>
<tr><th width=20 align="left">xxxxxxxx:</th><td width="100">xxxxxxxxx</td></tr>
<tr><th width=20 align="left">xxxxxxxx:</th><td width="100">xxxxxxxxx</td></tr>
</table>
</div>
<div style="border: 1px solid; padding: 2px; font-size: 12px;">
<table>
<tr>
<td width="20">xxxxxx: 1</td>
<td width="20">xxxxxx: 0</td>
<td width="20">xxxxxx: 1</td>
</tr>
</table>
</div>"""
The PDF file is generated, but without the CSS styling.
https://pyfpdf.readthedocs.io/en/latest/reference/write_html/index.html#details
Inline CSS is not supported in pyfpdf.
