Selecting and Rearranging HTML Elements with Python

How can the following unstructured table element be restructured without using any library?
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
Desired table:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
It is important to maintain the order of the attributes of the HTML elements. I have tried BeautifulSoup, but it changes the order. Please suggest a Pythonic way of solving this problem that doesn't require BeautifulSoup or lxml.

You can use regex via re:
import re
s = """
<table>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
</table>
"""
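# Raw string avoids invalid-escape warnings; +? keeps each match inside one block.
pattern = r'<tfoot>[\w\W]+?</tfoot>|<tbody>[\w\W]+?</tbody>'
# Collect the blocks, then hand them back in reverse order at the same positions.
blocks = iter(re.findall(pattern, s)[::-1])
new_s = re.sub(pattern, lambda m: next(blocks), s)
print(new_s)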
Output:
<table>
<tbody>
<tr><td>January</td><td>$100</td></tr>
</tbody>
<tfoot>
<tr><td>Sum</td><td>$180</td></tr>
</tfoot>
</table>
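If regular expressions feel fragile here, the same move can be sketched with plain string slicing, which leaves every byte of the markup (including attribute order) untouched. The function name move_tfoot_after_tbody is hypothetical, and the sketch assumes a single, non-nested tfoot and tbody in the input.

```python
def move_tfoot_after_tbody(html: str) -> str:
    """Cut the first <tfoot>...</tfoot> block and paste it after the first </tbody>."""
    start = html.find("<tfoot")
    end = html.find("</tfoot>", start)
    if start == -1 or end == -1:
        return html  # no tfoot block to move
    end += len("</tfoot>")
    if html[end:end + 1] == "\n":
        end += 1  # keep the trailing newline together with the block
    block = html[start:end]
    rest = html[:start] + html[end:]

    pos = rest.find("</tbody>")
    if pos == -1:
        return html  # no tbody to anchor on; leave the input unchanged
    pos += len("</tbody>")
    if rest[pos:pos + 1] == "\n":
        pos += 1
    return rest[:pos] + block + rest[pos:]


table = (
    "<table>\n"
    "<tfoot>\n"
    "<tr><td>Sum</td><td>$180</td></tr>\n"
    "</tfoot>\n"
    "<tbody>\n"
    "<tr><td>January</td><td>$100</td></tr>\n"
    "</tbody>\n"
    "</table>\n"
)
fixed = move_tfoot_after_tbody(table)
print(fixed)
```

Because nothing is parsed or re-serialized, the text outside the moved block is returned exactly as it was given.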

Related

Parse HTML storage format for data

I have extracted markup in HTML storage format from a website. On the website, the information is displayed as a table.
After extracting it with a curl command, I get the information as HTML. Please advise on how to parse it with Python so that I keep only the data, for example as a list like [[CALX-582 Action-Item], [CALX-736 Action-Item]......]. Are there any Python APIs that can do that, or is it advisable to just use regex to parse the required data?
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>
As has been mentioned, you could use BeautifulSoup for this.
I'm not sure how you want the data, but the code below creates a list of single-entry dictionaries, each mapping a value from the JIRA column to the corresponding value from the Type column.
You could use other methods to put the data into other structures.
from bs4 import BeautifulSoup
html = """
<pre><br /></pre>
<p class="auto-cursor-target"><br /></p>
<table><colgroup><col /><col /></colgroup>
<tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-736</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-735</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-792</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1563</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1567</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>
<p class="auto-cursor-target"><br /><br /></p>
"""
soup = BeautifulSoup(html, 'html.parser')
jira = soup.select('td')
data = [{jira[idx].getText(): jira[idx+1].getText()} for idx in range(0, len(jira), 2)]
print(data)
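For the list-of-lists shape mentioned in the question, a standard-library alternative can be sketched with html.parser. This swaps BeautifulSoup for HTMLParser; the class name TableRows and the shortened sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row holds only <th> cells, so it stays empty
                self.rows.append(self._row)
            self._row = None


sample = """<table><tbody>
<tr>
<th>JIRA</th>
<th>Type</th></tr>
<tr>
<td>CALX-582</td>
<td><span>Action-Item</span></td></tr>
<tr>
<td>CALX-1861</td>
<td>Bug</td></tr></tbody></table>"""

parser = TableRows()
parser.feed(sample)
parser.close()
rows = parser.rows
print(rows)
```

Text inside nested tags such as the span is picked up because handle_data fires for any text seen while a td is open.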

Sublime Text: regreplace - move entire tag above another tag

Not sure if this is possible.
What I want to do is move an entire tag with its content above another tag.
For example:
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
<p class="para">Test</p>
I want to move the p and its content above the table so the end result would be:
<p class="para">Test</p>
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
I simply don't know how to move it. I can capture the p with this regex:
(?P<test><p class=\"para\">(.*?)(</p>))
I can also capture the entire table:
(<table (.*?)>)(.*?)(</table>)
So I'm not sure whether it can be moved.
Can anyone help?
Thanks
Regex: (\<table(?:.*\n)+\<\/table>)\n(\<p(?:.*?)\<\/p>)
Replace with: $2\n$1
Instead of regex, you can use unpacking and str.join, provided the p element sits on the last non-empty line:
s = """
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
<p class="para">Test</p>
"""
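# Split into non-empty lines; the last one (the <p> line) becomes `target`.
*data, target = filter(None, s.split('\n'))
# Put the <p> line first, followed by the remaining lines in their original order.
new_html = '{}\n{}'.format(target, '\n'.join(data))
print(new_html)
Output: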
<p class="para">Test</p>
<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
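The search-and-replace idea from the first answer can also be sketched in Python with re.sub. The pattern below is an adaptation rather than the exact Sublime expression: it uses re.DOTALL with non-greedy matching and assumes the p element directly follows the closing table tag on the next line.

```python
import re

html = """<table class="table1">
<tr>
<td>A</td>
</tr>
</table>
<p class="para">Test</p>
"""

# Group 1: the whole table block; group 2: the paragraph on the following line.
pattern = re.compile(r'(<table\b.*?</table>)\n(<p\b.*?</p>)', re.DOTALL)

# Swap the two groups so the paragraph comes first.
moved = pattern.sub(r'\2\n\1', html)
print(moved)
```

Unlike the line-splitting approach, this does not depend on the paragraph being the last line of the input.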

Transform BeautifulSoup extract into sqlite table

I have an HTML table structure that looks something like:
<table>
<tbody>
<tr>
<td>
<ul>
</ul>
</td>
<td>
<table>
<tbody>
<tr></tr>
<tr></tr>
<tr></tr>
</tbody>
</table>
<table> -- (table structure I am interested in)
<tbody>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Using Python and BeautifulSoup, I have managed to print output like this:
[b'16 March', b'987654', b'Something happens on this date']
[b'23 March', b'321987', b'Something happens on this date']
[b'26 March', b'123456', b'Something happens on this date']
using the following code (which I have pieced together from various posts on this site):
for mytable in soup.find('body').find_all('table'):
    #print (len(mytable))
    for trs in mytable.find_all('tr'):
        tds = trs.find_all('td', class_='dte id desc'.split())
        if tds: # checks if 'tds' has value. if YES then block is executed
            row = [elem.text.strip().encode('utf-8') for elem in tds]
            print (row)
        else:
            continue # 'row' item is empty, proceed to next loop
Two questions:
1. When the output prints to the screen, the first line contains the whole table (every entry, about 100 in the actual table, on one line), and only from the second line on do I get a single entry per line (as shown above), which is what I want. How can I suppress the full structure on the first line, and why do I get it?
2. I would like to load the results shown above into a sqlite3 table, which I would later ETL into a production MSSQL environment. I have not been able to find a way to do this based on the output I am getting.
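For the sqlite3 part of the question, a minimal sketch with the standard-library sqlite3 module might look like the following. The table name events and the column name descr are assumptions (desc is an SQL keyword, so it is avoided), and the rows are the three sample lines from the question stored as str rather than bytes.

```python
import sqlite3

# Sample rows shaped like the printed output, as plain strings.
rows = [
    ("16 March", "987654", "Something happens on this date"),
    ("23 March", "321987", "Something happens on this date"),
    ("26 March", "123456", "Something happens on this date"),
]

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
conn.execute("CREATE TABLE events (dte TEXT, id TEXT, descr TEXT)")

# Parameterized insert: one (dte, id, descr) tuple per table row.
conn.executemany("INSERT INTO events (dte, id, descr) VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT dte, id, descr FROM events ORDER BY id").fetchall()
conn.close()
print(stored)
```

In the scraping loop, each row list would be appended to rows instead of printed, skipping the .encode('utf-8') call so that text columns receive str values.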

scrapy get text from image title attribute inside a nested table

I am new to scrapy and I am trying to get the text value from the title attribute of an image inside a nested table. Below is a sample of the table:
<html>
<body>
<div id=yw1>
<table id="x">
<thead></thead>
<tbody>
<tr>
<td>
<table id="y">
<thead></thead>
<tbody>
<tr>
<td><img src=".." title="Sample"></td>
<td></td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
I use the following scrapy code to get the text from the title attribute.
def parse(self, response):
    transfers = Selector(response).xpath('//*[@id="yw1"]/table/tbody/tr')
    for transfer in transfers:
        item = TransfermarktItem()
        item['naam'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[1]/img/@title/text()').extract()
        item['positie'] = transfer.xpath('td[1]/table/tbody/tr[1]/td[2]/a/text()').extract()
        item['leeftijd'] = transfer.xpath('td[2]/text()').extract()
        yield item
For some reason the text value of the title attribute is not extracted. What am I doing wrong?
Cheers!
It seems you can just use
item['naam'] = transfer.xpath(
    'td[1]/table/tbody/tr[1]/td[1]/img/@title'
)
This will return a list.
text() is not useful for getting tag attribute values.
extract() I think can also be omitted here.
EDIT:
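Another possibility, if the above still does not work, is the tbody problem described at http://doc.scrapy.org/en/latest/topics/firefox.html: browsers such as Firefox add tbody elements to tables that the raw HTML may not contain. You can try dropping tbody from the path:
td[1]/table//tr[1]/td[1]/img/@title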
If that doesn't help, then based on the data we've got here, I think I'm out of ideas :)
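The point that an attribute is read directly, not through text(), can be illustrated without Scrapy. The sketch below swaps in the standard-library xml.etree.ElementTree, whose limited XPath subset has no attribute step, so the title is fetched with .get(). It assumes a well-formed variant of the sample markup (quoted id, self-closed img).

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the sample table from the question.
doc = """<html><body><div id="yw1">
<table id="x"><tbody><tr>
<td><table id="y"><tbody><tr>
<td><img src=".." title="Sample" /></td><td></td>
</tr></tbody></table></td>
<td></td>
</tr></tbody></table>
</div></body></html>"""

root = ET.fromstring(doc)
titles = []
for row in root.findall(".//div[@id='yw1']/table/tbody/tr"):
    img = row.find("td[1]/table/tbody/tr[1]/td[1]/img")
    if img is not None:
        # The title is an attribute of <img>, not a text node beneath it.
        titles.append(img.get("title"))
print(titles)
```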

BeautifulSoup SoupStrainer doesn't work when element has multiple classes?

I try
necessaryStuffOnly = SoupStrainer("table",{"class": "views-table"})
soup = BeautifulSoup(vegetables,parse_only=necessaryStuffOnly)
without luck on a table like this:
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead>
<tr>
<td>blablaba</td>
</tr>
</thead>
<tbody>
<tr>
<td>more blablabla</td>
</tr>
</tbody>
</table>
</div>
and this does work for the div
SoupStrainer("div",{"class": "view-content"})
Can't a SoupStrainer like this filter on elements with multiple classes?
The comparison that's used is a literal equality check, so the following works:
soup('table', {'class': "views-table sticky-enabled cols-20"})
You can get it to match by passing a function as the filter:
soup('table', {'class': lambda L: 'views-table' in L.split()})
It might be worth checking the version you're using, because I have a feeling this shouldn't be the case anymore... update: yup, here you go https://bugs.launchpad.net/beautifulsoup/+bug/410304
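The difference between the two filters can be seen in plain Python. This is only an illustration of the two comparisons described above, not BeautifulSoup's actual implementation; the helper name has_class is hypothetical, and the None check is an added assumption to cover elements without a class attribute.

```python
class_attr = "views-table sticky-enabled cols-20"  # the table's class attribute value


def has_class(value, wanted):
    """Return True if `wanted` is one of the whitespace-separated class tokens."""
    return value is not None and wanted in value.split()


literal_match = class_attr == "views-table"          # whole-string equality
token_match = has_class(class_attr, "views-table")   # membership among tokens
missing_class = has_class(None, "views-table")       # element without a class
print(literal_match, token_match, missing_class)
```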
