How to extract onClick url using beautifulsoup


Below is the HTML I need to extract the URL from:
<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">
<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team">
<tr>
How do I get the location.href value?
I tried:
soup.findAll("div", {"onClick": "location.href"})
It returns an empty list.
Desired output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
PS: the page contains many of these location.href handlers.

How about using the .select() method, which runs a CSS selector through the SoupSieve package?
from bs4 import BeautifulSoup
html = '<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">' \
'<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->' \
'<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team"><tr>'
soup = BeautifulSoup(html, features="lxml")
element = soup.select('div.one_block')[0]
print(element.get('onclick'))
To get just the URL, split the attribute value on the single quote:
print(element.get('onclick').split("'")[1])
Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
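Since the page has many such handlers, a regular expression can collect them all in one pass. The sketch below is one possible approach, not the only one. It assumes every handler has the form location.href='...', and the second and third div elements are made-up samples added for illustration. It also hints at why the original attempt found nothing: a plain string filter has to equal the whole attribute value, and the parser stores attribute names in lowercase, so the key to look up is onclick.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: the first div comes from the question, the others are invented.
html = """
<div class="one_block" onClick="location.href='/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020';"></div>
<div class="one_block" onClick="location.href='/games/box.html?&game_type=01&game_id=14&game_date=2020-04-19&pbyear=2020';"></div>
<div class="other"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Capture whatever sits between the single quotes after location.href=
pattern = re.compile(r"location\.href='([^']*)'")

urls = []
# Passing a compiled regex makes find_all() match on a partial attribute value.
for div in soup.find_all("div", onclick=pattern):
    match = pattern.search(div["onclick"])
    urls.append(match.group(1))

print(urls)
```

Divs without an onclick attribute, such as the last one, are skipped by the filter.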

Related

How do I fetch table with attributes using BeautifulSoup? [duplicate]

How can I fetch a table by its attributes using BeautifulSoup?
I have this table:
I tried this, but it didn't succeed:
Soup.find_all('table',{'style':"width=100%;align:center;border:0;cellpadding:0;cellspacing:0"})

You can do this:
from bs4 import BeautifulSoup
html_source = '''
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
'''
soup = BeautifulSoup(html_source, 'html.parser')
# find() returns the first <table> whose attributes all match the dictionary
table = soup.find("table", attrs={"width": "100%", "border": "0", "align": "center"})
print(table)
You can provide as many attributes as you want in that dictionary. Use find_all() instead of find() to get every matching table rather than just the first.
Example: https://www.napkin.io/n/c174e5c8557d402d
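The attempt in the question most likely failed because width, border, align and the rest are separate HTML attributes, not parts of a style attribute. As an alternative to the attrs dictionary, a CSS attribute selector can express the same filter. This is a minimal sketch; the sample markup and the id values are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: tables "a" and "c" carry the wanted attributes, "b" does not.
html_source = '''
<table id="a" width="100%" border="0" align="center" cellpadding="0" cellspacing="0"></table>
<table id="b" width="50%" border="1" align="left"></table>
<table id="c" width="100%" border="0" align="center" cellpadding="0" cellspacing="0"></table>
'''

soup = BeautifulSoup(html_source, 'html.parser')

# Each [name="value"] clause must match, so this is an AND over the attributes.
tables = soup.select('table[width="100%"][border="0"][align="center"]')
matched_ids = [t["id"] for t in tables]
print(matched_ids)
```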

how to extract the text from the following HTML code?

I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration value from the "tbody" tag of the table with class "table".
Here is the HTML:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text

You can use the select() method to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the tags that match the selector. The one you want is the second, so take index 1 and read its text.
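Indexing by position breaks if the columns are ever reordered. One way around that, sketched here under the assumption that the table has a single data row as in the question, is to pair each header with its cell and look the value up by name. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="table-responsive">
<table class="table">
<thead>
<tr><th>Start Date</th><th>Duration</th><th>Stipend</th><th>Posted On</th><th>Apply By</th></tr>
</thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

container = BeautifulSoup(html, "html.parser")

# Header texts and cell texts, in document order.
headers = [th.get_text(strip=True) for th in container.select("thead th")]
cells = [td.get_text(" ", strip=True) for td in container.select("tbody td")]

# Assumes exactly one data row, so headers and cells line up one to one.
row = dict(zip(headers, cells))
print(row["Duration"])
```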

Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
This code uses the lxml parser. Install it with pip install lxml, or switch to a parser you already have by changing this line:
soup = BeautifulSoup(pageRaw , 'lxml')
soup.table returns the first table on the page.

<table> becomes empty, when I'm trying to get it via BeautifulSoup

I'm trying to parse a table from the website https://www.kp.ru/best/kazan/abiturient_2018/ivmit/. Chrome DevTools shows that the table is:
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table " style="">
...
</table>
</div>
But when I run this:
import requests
from bs4 import BeautifulSoup
url = r"https://www.kp.ru/best/kazan/abiturient_2018/ivmit/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.find_all('div', {'class':r't431__table-wapper'})
print(tag)
every <table> comes back empty:
[<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>]
Is the table filled in by JavaScript or something similar? How can I fix this?

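The table rows are probably filled in by JavaScript after the page loads, so they are missing from the HTML that requests downloads. You can get the same information from another tag: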
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.kp.ru/best/kazan/abiturient_2018/ivmit/'
soup = bs(requests.get(url).content, 'lxml')
print(soup.select_one('.t431__data-part2').text)
Output:

Python scrape specific tag without class name

I'm developing a Python script to scrape data from a specific site.
I'm using the Beautiful Soup module.
The interesting data in the HTML page has this structure:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name</a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
The tbody tag contains several tr tags, and from each one I would like to take only the first a tag inside its td tags.
I tried it this way:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
But this way the script collects every href from all the td tags. How can I take only the first one?
Thanks

If I understood correctly, you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.

This should do the job:
from bs4 import BeautifulSoup

html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE</a>
<a href="www.server.com/art/crag">NOT GOOD ONE</a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for td in soup.select('td'):
    # find() returns only the first <a> inside this cell, or None if there is none
    a = td.find('a')
    if a is not None:
        print(a.attrs['href'])
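If the goal is one link per row rather than one per cell, the loop can run over the tr tags instead. The following is a hedged sketch of that variant: the table id tabella_falist is taken from the question, while the sample rows and their href values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two rows; each row has a second link that should be ignored.
html = """
<table id="tabella_falist"><tbody>
<tr>
  <td><a href="www.server.com/art/crag1">First</a></td>
  <td class="nowrap"><a href="www.server.com/other1">Other</a></td>
</tr>
<tr>
  <td><a href="www.server.com/art/crag2">Second</a></td>
  <td class="nowrap"><a href="www.server.com/other2">Other</a></td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for tr in soup.select("#tabella_falist tbody tr"):
    # select_one() returns the first match in document order, or None.
    a = tr.select_one("td a[href]")
    if a is not None:
        links.append(a["href"])

print(links)
```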

HTML and other formatting

Is there a way (using Python and lxml) to get HTML output like this:
<table class=main>
<tr class=row>
</tr>
</table>
instead of output like this:
<table class=main><tr class=row></tr>
</table>
Only "span" tags inside div tags may stay appended on the same line, so things like:
<div class=paragraph><span class=font48>hello</span></div>
are allowed.
Thanks a lot for any help.

You could insert a line break before every "<" with a regex.
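The regex suggestion can be sketched as follows, using only the standard library. It is a rough illustration rather than a robust formatter: it assumes tags sit directly next to each other with no whitespace in between, that attribute values never contain < or >, and it keeps span tags inline as the question asks. The sample string combines the two snippets from the question.

```python
import re

html = ('<table class=main><tr class=row></tr></table>'
        '<div class=paragraph><span class=font48>hello</span></div>')

# Break between two adjacent tags, unless the next tag is <span> or </span>,
# or the tag that just closed is </span>.
pattern = re.compile(r'(?<!</span)>(?=<(?!/?span\b))')
formatted = pattern.sub('>\n', html)

print(formatted)
```

Regular expressions do not understand HTML nesting, so a real parser is the safer choice for anything beyond simple, machine-generated markup.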

Another option is to use BeautifulSoup:
from bs4 import BeautifulSoup
html = "<table class=main><tr class=row></tr></table>"
# Name the parser explicitly to avoid a warning and parser-dependent output
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Output:
<table class="main">
 <tr class="row">
 </tr>
</table>

Have you considered the prettify() method of BeautifulSoup?
#!/usr/bin/env python
from bs4 import BeautifulSoup as bs
html = '<table class=main><tr class=row></tr>\
</table>'
print(bs(html, 'html.parser').prettify())
Output:
<table class="main">
 <tr class="row">
 </tr>
</table>
Note: prettify() adds some indentation to the output, as you can see.
