HTML and other formatting

Is there a way (using Python and lxml) to get HTML output like this:
<table class=main>
<tr class=row>
</tr>
</table>
instead of output like this:
<table class=main><tr class=row></tr>
</table>
Only tags named "span" inside div tags may be appended on the same line. So things like:
<div class=paragraph><span class=font48>hello</span></div>
are allowed.
Thanks a lot for any help.

You could insert a line break before every "<" with a regex.
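The regex idea could be sketched roughly as follows. This is only a minimal sketch: it assumes the markup contains no "<" inside text, attribute values, scripts, or comments, and the helper name break_between_tags is made up for illustration. It breaks only where one tag directly follows another, so a span with text, such as <span class=font48>hello</span>, stays on one line with its closing tag but is still separated from the surrounding div.

```python
import re

def break_between_tags(html):
    """Insert a newline wherever one tag directly follows another ("><")."""
    # (?<=>) is a lookbehind: match "<" only when the previous character is ">"
    return re.sub(r"(?<=>)<", "\n<", html)

html = "<table class=main><tr class=row></tr></table>"
formatted = break_between_tags(html)
print(formatted)
```

A regex does not understand HTML structure, so a real parser is usually the safer choice for anything beyond simple, machine-generated markup.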

Another option would be using BeautifulSoup:
from bs4 import BeautifulSoup
html = "<table class=main><tr class=row></tr></table>"
# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
Output:
<table class="main">
 <tr class="row">
 </tr>
</table>

Have you considered the prettify() method from the BeautifulSoup module?
#!/usr/bin/env python
from bs4 import BeautifulSoup as bs
html = '<table class=main><tr class=row></tr>\
</table>'
# html.parser is the parser bundled with Python
print(bs(html, 'html.parser').prettify())
Output:
<table class="main">
 <tr class="row">
 </tr>
</table>
Note: it will add some indentation to the output, as you can see.
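If the indentation that prettify() adds is unwanted, one possible follow-up step is to strip the leading whitespace from each line. The sketch below works on a plain string in the shape prettify() produces, so it runs without BeautifulSoup; the variable names are illustrative only. It assumes the markup has no whitespace-sensitive content such as a pre block.

```python
# Text in the shape that prettify() produces (one space of indent per level)
pretty = '<table class="main">\n <tr class="row">\n </tr>\n</table>\n'

# Remove leading whitespace from every line to flatten the indentation
flat = "\n".join(line.lstrip() for line in pretty.splitlines())
print(flat)
```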

Related

How to extract the first "src" attribute from a HTML tag

Let's say I have the HTML tag below:
target = <tr src="./sound/6/4-1-1.mp3"><td class="code">(4-1)a.</td><td class="sound"><audio controls=""><source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio></td><td class="text"><p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p><p class="en">Subali is going to hit the plum.</p></td></tr>
My ideal output:
<tr src="./sound/6/4-1-1.mp3">
I tried the following code:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(target, 'lxml')
soup.find(src=re.compile(r'\.\w'))
However, my output is:
<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/>
How can I get the ideal output as mentioned above?
Thanks for any help!!

You can first find the tr tag, then use a regex to pull just its opening tag out of the tag's string form, as shown below.
Try this:
from bs4 import BeautifulSoup
import re
html = """
<tr src="./sound/6/4-1-1.mp3">
<td class="code">(4-1)a.</td>
<td class="sound"><audio controls="">
<source src="./sound/6/4-1-1.mp3" type="audio/mpeg"/></audio>
</td>
<td class="text">
<p class="ab">Na mapaspas a Subalis bunuaz busul tu laas.</p>
<p class="en">Subali is going to hit the plum.</p>
</td>
</tr>
"""
soup = BeautifulSoup(html, "lxml")
# [^>]* stops at the first ">", so the match cannot run past the opening tag
re.search(r'<tr\b[^>]*>', str(soup.find("tr"))).group()
Output:
'<tr src="./sound/6/4-1-1.mp3">'
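One detail worth checking is how the pattern behaves when the whole row sits on a single line, as it does in the question. A greedy `<tr.*>` runs to the last ">" on the line, while a negated character class stops at the first one. The sketch below uses only the re module and a shortened copy of the question's markup; it assumes no attribute value contains a ">" character.

```python
import re

# Single-line markup, shortened from the question for illustration
target = '<tr src="./sound/6/4-1-1.mp3"><td class="code">(4-1)a.</td></tr>'

# Greedy: ".*" consumes as much as it can, up to the last ">" on the line
greedy = re.search(r"<tr.*>", target).group()

# Negated class: "[^>]*" cannot cross a ">", so it ends with the opening tag
opening = re.search(r"<tr\b[^>]*>", target).group()

print(greedy)   # the whole row
print(opening)  # only the opening tag
```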

BeautifulSoup how to only return class objects

I have an HTML document that looks similar to this:
<div class='product'>
<table>
<tr>
random stuff here
</tr>
<tr class='line1'>
<td class='row'>
<span>TEXT I NEED</span>
</td>
</tr>
<tr class='line2'>
<td class='row'>
<span>MORE TEXT I NEED</span>
</td>
</tr>
<tr class='line3'>
<td class='row'>
<span>EVEN MORE TEXT I NEED</span>
</td>
</tr>
</table>
</div>
So I used this code, but I am also getting the text from the first tr, which has no class, and I need to ignore it:
soup.findAll('tr').text
Also, when I try to match on just a class, this doesn't seem to be valid Python:
soup.findAll('tr', {'class'})
I would like some help extracting the text.

To get the desired output, use a CSS Selector to exclude the first <tr> tag, and select the rest:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.select('.product tr:not(.product tr:nth-of-type(1))'):
    print(tag.text.strip())
Output:
TEXT I NEED
MORE TEXT I NEED
EVEN MORE TEXT I NEED
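An alternative to excluding the first row by position is to keep only the rows that carry a class attribute at all. A minimal sketch, assuming the bs4 package is installed and using a shortened copy of the question's markup: passing class_=True to find_all() matches any tag that has a class, whatever its value.

```python
from bs4 import BeautifulSoup

html = """
<div class='product'>
<table>
<tr>random stuff here</tr>
<tr class='line1'><td class='row'><span>TEXT I NEED</span></td></tr>
<tr class='line2'><td class='row'><span>MORE TEXT I NEED</span></td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_=True matches any <tr> that has a class attribute, whatever its value
rows = soup.find_all("tr", class_=True)
texts = [row.get_text(strip=True) for row in rows]
print(texts)
```

This version does not depend on the unwanted row being first, which may matter if the table layout changes.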

How to extract the text from the following HTML code?

I am doing web scraping for a DS project using BeautifulSoup, but I am unable to extract the Duration value from the "tbody" tag of the table with class "table".
The HTML code is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text

You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the tags that match the selector. The one you want is the second, so take index 1 and read its text.
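Rather than hard-coding index 1, the column position could be looked up from the header row. The following is a sketch under stated assumptions: bs4 is installed, the table has a single body row, and the markup is a shortened copy of the question's table. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<table class="table">
<thead><tr><th>Start Date</th><th>Duration</th><th>Stipend</th></tr></thead>
<tbody><tr><td><div id="start-date-first">Immediately</div></td><td>1 Month</td><td>1500 /month</td></tr></tbody>
</table>
"""

container = BeautifulSoup(html, "html.parser")

# Column names, in order, taken from the header cells
headers = [th.get_text(strip=True) for th in container.select("thead th")]

# Data cells of the (single) body row, in the same order as the headers
cells = container.select("tbody tr td")

# Pick the cell that sits under the "Duration" header
duration = cells[headers.index("Duration")].get_text(strip=True)
print(duration)
```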

Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw, 'lxml')
print(soup.table)
My code uses the lxml library to parse the data. To install it, run pip install lxml, or substitute another parser in this part of the code:
soup = BeautifulSoup(pageRaw, 'lxml')
This code returns the first table on the page.
Take care

BeautifulSoup: How to extract text encapsulated in multiple div/span/id tags

I need to extract the digits (0.04) in the "td" tag at the end of this html page.
<div class="boxContentInner">
<table class="values non-zebra">
<thead>
<tr>
<th>Apertura</th>
<th>Max</th>
<th>Min</th>
<th>Variazione giornaliera</th>
<th class="last">Variazione %</th>
</tr>
</thead>
<tbody>
<tr>
<td id="open" class="quaternary-header">2708.46</td>
<td id="high" class="quaternary-header">2710.20</td>
<td id="low" class="quaternary-header">2705.66</td>
<td id="change" class="quaternary-header changeUp">0.99</td>
<td id="percentageChange" class="quaternary-header last changeUp">0.04</td>
</tr>
</tbody>
</table>
</div>
I tried this code using BeautifulSoup with Python 2.8:
from bs4 import BeautifulSoup
import requests
page= requests.get('https://www.ig.com/au/indices/markets-indices/us-spx-500').text
soup = BeautifulSoup(page, 'lxml')
percent= soup.find('td',{'id':'percentageChange'})
percent2=percent.text
print percent2
The result is None.
Where is the error?

I had a look at https://www.ig.com/au/indices/markets-indices/us-spx-500 and it seems you are not searching for the right id when doing percent= soup.find('td', {'id':'percentageChange'})
The actual value is located in <span data-field="CPC">VALUE</span>
You can retrieve this information with the following:
percent = soup.find("span", {'data-field': 'CPC'})
print(percent.text.strip())

This worked for me.
percents = soup.find_all("span", {'data-field': 'CPC'})
for percent in percents:
    print(percent.text.strip())
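find() returns None when nothing in the parsed HTML matches, so it can help to check the result before reading its text. Below is a minimal sketch, assuming bs4 is installed; the one-line page string is a hypothetical stand-in for the downloaded HTML, and text_or_none is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet standing in for the downloaded page
page = '<div><span data-field="CPC"> 0.04 </span></div>'
soup = BeautifulSoup(page, "html.parser")

def text_or_none(tag):
    """Return the stripped text of a tag, or None when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else None

# The id from the question is absent from this snippet, so find() gives None
missing = text_or_none(soup.find("td", {"id": "percentageChange"}))

# The span with data-field="CPC" is present
found = text_or_none(soup.find("span", {"data-field": "CPC"}))
print(missing, found)
```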

BeautifulSoup SoupStrainer doesn't work when element has multiple classes?

I tried
necessaryStuffOnly = SoupStrainer("table",{"class": "views-table"})
soup = BeautifulSoup(vegetables,parse_only=necessaryStuffOnly)
without luck on a table like this:
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<thead>
<tr>
<td>blablaba</td>
</tr>
</thead>
<tbody>
<tr>
<td>more blablabla</td>
</tr>
</tbody>
</table>
</div>
whereas this does work for the div:
SoupStrainer("div",{"class": "view-content"})
Can't a SoupStrainer like this filter on an element with multiple classes?

The comparison used is a literal equality check, so the following works:
soup('table', {'class': "views-table sticky-enabled cols-20"})
You can also get it to match by passing a function as the filter:
soup('table', {'class': lambda L: L is not None and 'views-table' in L.split()})
It might be worth checking the version you're using, because I have a feeling this shouldn't be the case anymore. Update: yes, see https://bugs.launchpad.net/beautifulsoup/+bug/410304
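For comparison, the bs4 package treats class as a multi-valued attribute, so searching the parsed tree for one class name matches a tag that carries several. The sketch below assumes bs4 is installed and uses a shortened copy of the question's markup. It searches after parsing with find_all() and select() and does not exercise SoupStrainer itself.

```python
from bs4 import BeautifulSoup

vegetables = """
<div class="view-content">
<table class="views-table sticky-enabled cols-20">
<tbody><tr><td>more blablabla</td></tr></tbody>
</table>
</div>
"""

soup = BeautifulSoup(vegetables, "html.parser")

# A single class name matches even though the table has three classes
by_keyword = soup.find_all("table", class_="views-table")

# The equivalent CSS selector
by_selector = soup.select("table.views-table")

print(len(by_keyword), len(by_selector))
```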
