Pandas read_html() with table containing html elements

I have the following HTML table:
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
which I want to parse into a DataFrame using pd.read_html().
The output is as follows:
X1    X2
Test  Test2
However, I would prefer the following output (preserving HTML elements within a cell):
X1    X2
Test  <span style="..."> Test2 </span>
Is this possible with pd.read_html()?
I couldn't find a solution in the read_html() docs, and the alternative would be manual parsing.

You could modify ._text_getter() if you really wanted to.
Something like:
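from io import StringIO

import lxml.html
import pandas as pd

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""

def custom_text_getter(self, obj):
    # Return the cell's first child node; serialize it if it is an element.
    nodes = obj.xpath("node()")
    if not nodes:
        return ""  # empty cell
    result = nodes[0]
    if isinstance(result, lxml.html.HtmlElement):
        result = lxml.html.tostring(result, encoding="unicode")
    return result

# _LxmlFrameParser and _text_getter are private pandas internals and may change between versions.
pd.io.html._LxmlFrameParser._text_getter = custom_text_getter
# Recent pandas versions expect a file-like object rather than a literal HTML string.
print(
    pd.read_html(StringIO(html))[0]
)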
X1 X2
0 Test <span style="..."> Test2 </span>
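
The question mentions manual parsing as the alternative. A minimal sketch of that route, assuming BeautifulSoup is available and that every data row has as many cells as there are headers, could look like this. It relies only on public APIs (Tag.decode_contents() returns a tag's inner HTML); the variable names are illustrative.

```python
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
<thead>
<th> X1 </th>
<th> X2 </th>
</thead>
<tbody>
<tr>
<td>Test</td>
<td><span style="..."> Test2 </span> </td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the header cells, as plain text.
columns = [th.get_text(strip=True) for th in table.find_all("th")]

# decode_contents() keeps the inner HTML of each cell instead of flattening it.
rows = [
    [td.decode_contents().strip() for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

This avoids patching pandas internals, at the cost of not handling colspan or rowspan the way read_html() does.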

Related

Trying to append a new row to the first row in the table body with BeautifulSoup

I am having trouble appending a new row after the first row (the header row) in the table body (<tbody>).
My code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('page_content.xml'), 'html.parser')
# append a row to the first row in the table body
row = soup.find('tbody').find('tr')
row.append(soup.new_tag('tr', text='New Cell'))
print(row)
the output:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
what the output should be:
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr text="New Cell"></tr>
the full xml file is:
<h1>Rental Agreement/Editor</h1>
<table class="wrapped">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
<tr text="New Cell"></tr></tr>
<tr>
<td>1.0.1-0</td>
<td>ABC-1234</td>
<td colspan="1">
<br/>
</td>
</tr>
</tbody>
</table>
<p class="auto-cursor-target">
<br/>
</p>

You can use .insert_after:
from bs4 import BeautifulSoup
html_doc = """
<table>
<tr>
<th>Version</th>
<th>Jira</th>
<th colspan="1">Date/Time</th>
</tr>
<tr>
<td> something else </td>
</tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
row = soup.select_one("tr:has(th)")
row.insert_after(soup.new_tag("tr", text="New Cell"))
print(soup.prettify())
Prints:
<table>
<tr>
<th>
Version
</th>
<th>
Jira
</th>
<th colspan="1">
Date/Time
</th>
</tr>
<tr text="New Cell">
</tr>
<tr>
<td>
something else
</td>
</tr>
</table>
EDIT: If you want to insert arbitrary HTML code, you can try:
what_to_insert = BeautifulSoup(
'<tr param="xxx">This is new <b>text</b></tr>', "html.parser"
)
row.insert_after(what_to_insert)
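
Note that in the snippets above, text="New Cell" ends up as an attribute of the new tr rather than as its content, as the printed output shows. If the goal is a row that actually contains a cell with that text, a minimal sketch (the shortened table and the variable names are illustrative assumptions) could be:

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
<tr><th>Version</th><th>Jira</th></tr>
<tr><td>1.0.1-0</td><td>ABC-1234</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
header_row = soup.find("tr")  # first row, i.e. the header row

# Build <tr><td>New Cell</td></tr> from separate tags.
new_row = soup.new_tag("tr")
new_cell = soup.new_tag("td")
new_cell.string = "New Cell"
new_row.append(new_cell)

# insert_after() places the new row as a sibling, not as a child.
header_row.insert_after(new_row)

rows = soup.find_all("tr")
print(soup.prettify())
```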

How to get text from nested html table with beautifulsoup?

Within each of the main tables there are two nested tables, the first of which contains the data A_A_A_A that I want to extract into a pandas DataFrame.
<table>
<tr valign="top">
<td> </td>
<td>
<br/>
<center>
<h2>asd</h2>
</center>
<h4>asd</h4>
<table>
<tr>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
<tr>
<td> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" width="100%">
<tr>
<td width="2%"> </td>
<td> A_A_A_A <br/> A_A_A_A 111-222<br/> </td>
<td width="2%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td width="3%"> </td>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td width="4%"> </td>
<td class="unique"> asd <br/> asd </td>
<td width="4%"> </td>
</tr>
</table>
</td>
<td width="3%"> </td>
</tr>
<tr>
<td> </td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="tabcol" width="100%">
.
.
.
</table>
<br/>
<table>
</table>
</td>
</tr>
</table>
I figured that, because so few attributes are available, the only way forward would be to iterate over the td siblings with .next_siblings and, if needed, .next_elements:
data1 = []
for item in soup.find_all('td', attrs={'width': '2%'}):
    data = item.find_next_sibling().text
    data1.append(data)
This returns an empty list []. Now I don't know how to proceed, because I cannot identify any other attributes or classes that would help me get to the middle td that contains the information.

.find_next(name=None, attrs={}, text=None, **kwargs)
Returns the first item that matches the given criteria and appears after this Tag in the document. So in your case:
item = soup.find('td', attrs={'width': '2%'})
data = item.find_next('td').text
Note that I removed the for loop, since the desired data comes after the first td with width '2%'. After running this, data will be:
' A_A_A_A A_A_A_A 111-222 '

I took @Wiktor Stribiżew's answer from here (regex for loop over list in python) and merged it with yours, @Rustam Garayev:
item = soup.find_all('td', attrs={'width': '2%'})
data = [x.find_next('td').text for x in item]
I needed not only the first A_A_A_A but also the ones from all the following tables. The code above gives this output:
['A_A_A_A',
'\xa0',
'A_A_A_A',
'\xa0', ...]
which is good enough for my purpose. I think the '\xa0' comes from calling find_next on the third td sibling, which has no data cell directly after it.
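
If the '\xa0' entries are unwanted, one option is to strip each result and skip the empty ones. The following is a minimal sketch on a shortened, assumed version of the markup (the real page has more nesting, and the &nbsp; cells are a guess based on the '\xa0' in the output):

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td width="2%">&nbsp;</td>
<td> A_A_A_A <br/> A_A_A_A 111-222<br/> </td>
<td width="2%">&nbsp;</td>
</tr></table>
<table><tr><td width="3%">&nbsp;</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

values = []
for cell in soup.find_all("td", attrs={"width": "2%"}):
    following = cell.find_next("td")
    if following is None:
        continue
    # strip=True trims every text fragment (including non-breaking spaces)
    # and drops fragments that end up empty.
    text = following.get_text(" ", strip=True)
    if text:
        values.append(text)

print(values)
```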

Use BeautifulSoup to fetch rows by header

I have an html structure like this one:
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
Those attributes are not always present: sometimes there is only Brand, in other cases Brand and Flavoring.
To scrape this I wrote code like this:
from collections import namedtuple
BlendInfo = namedtuple('BlendInfo', ['brand', 'type', 'contents', 'flavoring'])
stats_rows = soup.find('table', id='stats').find_all('tr')
bi = BlendInfo(brand=stats_rows[1].td.get_text(),
               type=stats_rows[2].td.get_text(),
               contents=stats_rows[3].td.get_text(),
               flavoring=stats_rows[4].td.get_text())
But as expected it fails with an index out of bounds error (or gets really messed up) when the table ordering is different (type before brand) or some of the rows are missing (no contents).
Is there a better approach, something like:
"Give me the data from the row whose header contains the string 'brand'"

It is definitely possible. Check this out:
from bs4 import BeautifulSoup
html_content='''
<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html_content,"lxml")
for item in soup.find_all(class_='info')[0].find_all("th"):
    header = item.text
    rows = item.find_next_sibling().text
    print(header, rows)
Output:
Brand 2 Guys Smoke Shop
Blend Type Aromatic
Contents Black Cavendish, Virginia
Flavoring Other / Misc

This would build a dict for you:
from bs4 import BeautifulSoup

# Header names to keep, compared in lowercase.
valid_headers = ['brand', 'blend type', 'contents', 'flavoring']
t = """<table class="info" id="stats">
<tbody>
<tr>
<th> Brand </th>
<td> 2 Guys Smoke Shop </td>
</tr>
<tr>
<th> Blend Type </th>
<td> Aromatic </td>
</tr>
<tr>
<th> Contents </th>
<td> Black Cavendish, Virginia </td>
</tr>
<tr>
<th> Flavoring </th>
<td> Other / Misc </td>
</tr>
</tbody>
</table>"""
bs = BeautifulSoup(t, 'html.parser')
results = {}
for row in bs.find_all('tr'):
    header = row.find('th')
    value = row.find('td')
    if header is None or value is None:
        continue  # skip rows without a header/value pair
    key = header.get_text(strip=True)
    if key.lower() in valid_headers:
        results[key] = value.get_text(strip=True)
print(results)
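
To get back to the BlendInfo namedtuple from the question, the same header lookup can fill the fields by name and leave missing rows as None. This is a minimal sketch; HEADER_TO_FIELD and parse_blend_info are hypothetical names introduced here for illustration, not part of either answer.

```python
from collections import namedtuple

from bs4 import BeautifulSoup

BlendInfo = namedtuple("BlendInfo", ["brand", "type", "contents", "flavoring"])

# Hypothetical mapping from lowercase header text to namedtuple field.
HEADER_TO_FIELD = {
    "brand": "brand",
    "blend type": "type",
    "contents": "contents",
    "flavoring": "flavoring",
}


def parse_blend_info(html):
    """Build a BlendInfo from the stats table; absent rows stay None."""
    soup = BeautifulSoup(html, "html.parser")
    fields = dict.fromkeys(BlendInfo._fields)
    table = soup.find("table", id="stats")
    if table is None:
        return BlendInfo(**fields)
    for row in table.find_all("tr"):
        header, value = row.find("th"), row.find("td")
        if header is None or value is None:
            continue
        field = HEADER_TO_FIELD.get(header.get_text(strip=True).lower())
        if field is not None:
            fields[field] = value.get_text(strip=True)
    return BlendInfo(**fields)


# Rows in a different order, and without Contents.
sample = """
<table class="info" id="stats">
<tr><th> Blend Type </th><td> Aromatic </td></tr>
<tr><th> Brand </th><td> 2 Guys Smoke Shop </td></tr>
<tr><th> Flavoring </th><td> Other / Misc </td></tr>
</table>
"""
bi = parse_blend_info(sample)
print(bi)
```

Because each row is matched by its header text, neither the row order nor missing rows affect the result.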

Calling Python function in XML report (odoo 10)

I am trying to call my method which is returning values. I want to fetch the values and use them in my report.
@api.one
def check_month(self, record, res):
    fd = datetime.strptime(str(record.from_date), "%Y-%m-%d")
    for rec in record.sales_record_ids:
        res.append(rec.jan_month)

@api.one
def get_sales_rec(self):
    result = []
    target_records = self.env['sales.target'].search([('sales_team', '=', self.sales_team_ids.id)])
    for rec in target_records:
        self.check_month(rec, result)
    return result
I call it like this in the XML:
<tbody>
<tr t-foreach="get_sales_rec()" t-as="data">
<tr>
<td>
<span t-esc="data[0]" />
</td>
</tr>
</tr>
</tbody>

Change your xml code to:
<tbody>
<tr t-foreach="o.get_sales_rec()" t-as="data">
<tr>
<td>
<span t-esc="data[0]" />
</td>
</tr>
</tr>
</tbody>
Here o stands for the report model object, so make sure you have added the Python method to that same object.

How can I append <br> tags after a text element?

I am parsing an HTML file with BeautifulSoup and got stuck with <br> tags.
I want to append a <br> tag after inserting a list element, but it didn't work.
What is the easiest way to do this?
soup = BeautifulSoup(open("test.html"))
mylist = [Item_1, Item_2]
for i in range(len(mylist)):
    # insert the items into the 4. Column
This is the default HTML:
<html>
<body>
<table>
<tr>
<th>
1. Column
</th>
<th>
2. Column
</th>
<th>
3. Column
</th>
<th>
4. Column
</th>
<th>
5. Column
</th>
<th>
6. Column
</th>
<th>
7. Column
</th>
<th>
8. Column
</th>
</tr>
<tr class="a">
<td class="h">
Text in first column
</td>
<td>
<br/>
</td>
<td>
<br/>
</td>
<td>
<!--I want to insert items here-->
</td>
<td>
1
</td>
<td>
37
</td>
<td>
38
</td>
<td>
38
</td>
</tr>
</table>
</body>
</html>
This is the HTML I want to produce:
<html>
<body>
<table>
<tr>
<th>
1. Column
</th>
<th>
2. Column
</th>
<th>
3. Column
</th>
<th>
4. Column
</th>
<th>
5. Column
</th>
<th>
6. Column
</th>
<th>
7. Column
</th>
<th>
8. Column
</th>
</tr>
<tr class="a">
<td class="h">
Text in first column
</td>
<td>
<br/>
</td>
<td>
<br/>
</td>
<td>
Item_1 <br>
Item_2
</td>
<td>
1
</td>
<td>
37
</td>
<td>
38
</td>
<td>
38
</td>
</tr>
</table>
</body>
</html>

To append a tag, first create it with the new_tag() factory function, like so:
soup.td.append(soup.new_tag('br'))
Consider the following program. For every table cell (that is, every td) in the html, it appends a <br/> tag and some text to the cell.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<body>
<table>
<tr>
<td>
data1
</td>
<td>
data2
</td>
</tr>
</table>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
mylist = ['addendum 1', 'addendum 2']
for td, item in zip(soup.find_all('td'), mylist):
    td.append(soup.new_tag('br'))
    td.append(item)
print(soup.prettify())
Result:
<html>
<body>
<table>
<tr>
<td>
data1
<br/>
addendum 1
</td>
<td>
data2
<br/>
addendum 2
</td>
</tr>
</table>
</body>
</html>
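
The question asks for all items to go into the fourth cell, separated by <br> tags. A minimal sketch of that variant, using a shortened table and assuming the target is always the fourth td of the row with class "a", might look like this:

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
<tr class="a">
<td class="h">Text in first column</td>
<td><br/></td>
<td><br/></td>
<td><!--I want to insert items here--></td>
<td>1</td>
</tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
mylist = ["Item_1", "Item_2"]

# Fourth cell of the data row (index 3).
target = soup.find("tr", class_="a").find_all("td")[3]
target.clear()  # drop the placeholder comment

for index, item in enumerate(mylist):
    if index > 0:
        # A fresh <br> tag is needed for every separator.
        target.append(soup.new_tag("br"))
    target.append(item)

print(target)
```

Each separator needs its own new_tag('br') call, because a tag object can only live in one place in the tree.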
