Below is the HTML code which needs extraction
<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html
?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">
<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8
a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team">
<tr>
How do I get the location.href value?
Tried:
soup.findAll("div", {"onClick": "location.href"})
Returns null
Desired Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
PS: there's plenty of location.href
How about using .select() method for SoupSieve package to run a CSS selector
from bs4 import BeautifulSoup
html = '<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">' \
'<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->' \
'<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team"><tr>'
soup = BeautifulSoup(html, features="lxml")
element = soup.select('div.one_block')[0]
print(element.get('onclick'))
Use split to get just print(element.get('onclick').split("'")[1])
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
Related
This question already has answers here:
How to find tags with only certain attributes - BeautifulSoup
(8 answers)
Closed 1 year ago.
How can I fetch table with other attributes using beautifulSoup?
I have this table :
I have tried this but didn't succeed :
Soup.find_all('table',{'style':"width=100%;align:center;border:0;cellpadding:0;cellspacing:0"})
You can do this:
from bs4 import BeautifulSoup
html_source = '''
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
'''
soup = BeautifulSoup(html_source, 'html.parser')
els = list(soup.find("table",attrs={"width":"100%","border":"0","align":"center"}))
print(' '.join(map(str, els)))
You can provide as many attributes as you want in that dictionary. You can get a bit more information from here.
Example: https://www.napkin.io/n/c174e5c8557d402d
I am doing web scraping for a DS project, and i am using BeautifulSoup for that. But i am unable to extract the Duration from "tbody" tag in "table" class.
Following is the HTML code :
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note : for extracting 'Immediately' text, i use the following code :
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use select() function to find tags by css selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
The return value of select() function is the list of all HTML tags that matches the selector. The one you want to retrieve is second one, so using index 1, then get text of it.
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code i use lxml library to parse the data. If you want to install pip install lxml... or just change into your libray in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code will return the first table ok?
Take care
I'm trying to parse a table from website https://www.kp.ru/best/kazan/abiturient_2018/ivmit/. DevTools by Chrome shows me that table is:
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table " style="">
...
</table>
</div>
But when I do this:
url = r"https://www.kp.ru/best/kazan/abiturient_2018/ivmit/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.find_all('div', {'class':r't431__table-wapper'})
print(tag)
It returns me like <table> is empty:
[<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>]
Is it JavaScript or something? How to fix this?
You can get that info from another tag
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.kp.ru/best/kazan/abiturient_2018/ivmit/'
soup = bs(requests.get(url).content, 'lxml')
print(soup.select_one('.t431__data-part2').text)
Output:
I'm developing a python script to scrape data from a specific site.
I'm using Beautiful Soap as python module.
The interesting data into HTML page are into this structure:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name<a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
into tag tbody there are more tr tag and I would like take to each only first tag a of tag td
I have tried in this way:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
but in this way the script take all href into all td tag. How can take only first?
Thanks
If I understood correctly you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.
This should do the work
html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE<a>
<a href="www.server.com/art/crag">NOT GOOD ONE<a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
for td in soup.select('td'):
a = td.find('a')
if a is not None:
print a.attrs['href']
is there a way (using python and lxml) to get an output of HTML code like this:
<table class=main>
<tr class=row>
</tr>
</table>
instead of one like this one:
<table class=main><tr class=row></tr>
</table>
Only tags named "span" in div-tags can be appended. So things like:
<div class=paragraph><span class=font48>hello</span></div>
are allowed.
Thanks a lot for any help.
you could insert a line break before every "<" with a regex
Another option would be using BeautifulSoup:
from bs4 import BeautifulSoup
html = "<table class=main><tr class=row></tr></table>"
soup = BeautifulSoup(html)
print soup.prettify()
Output:
<table class="main">
<tr class="row">
</tr>
</table>
Have you considered the prettify() method from the module BeautifulSoup ?
#!/usr/bin/env python
from BeautifulSoup import BeautifulSoup as bs
html = '<table class=main><tr class=row></tr>\
</table>'
print bs(html).prettify()
outputs:
<table class="main">
<tr class="row">
</tr>
</table>
Note - it will add some indentation to the output, as you can see.