<table> becomes empty, when I'm trying to get it via BeautifulSoup

<table> becomes empty, when I'm trying to get it via BeautifulSoup - python

I'm trying to parse a table from website https://www.kp.ru/best/kazan/abiturient_2018/ivmit/. DevTools by Chrome shows me that table is:
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table " style="">
...
</table>
</div>
But when I do this:
url = r"https://www.kp.ru/best/kazan/abiturient_2018/ivmit/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.find_all('div', {'class':r't431__table-wapper'})
print(tag)
It returns me like <table> is empty:
[<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>]
Is it JavaScript or something? How to fix this?

You can get that info from another tag
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.kp.ru/best/kazan/abiturient_2018/ivmit/'
soup = bs(requests.get(url).content, 'lxml')
print(soup.select_one('.t431__data-part2').text)
Output:

Related

bs4 - how to use find or find_all to get specific content from an url

I need to get the individual url for each country after the "a href=" under the "div" class of "well span4". For example,I need to get https://www.rulac.org/browse/countries/myanmar and https://www.rulac.org/browse/countries/the-netherlands and every url after "a href=" (as shown in the partial html structure below.
since the "a href=" is not under any class, how do I conduct a search and get all the countries url?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial html structure shown as below 
[<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar">
<div class="map-wrap">
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5¢er=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div class="map-wrap">
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5¢er=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/niger">
<div class="map-wrap">
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5¢er=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
</div>,

You can use soup.select() with a CSS selector to get all <a> elements of class btn that are children of <div>s with classes well and span4. Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
print(href)

how to get attribute data using python beautiful soup

Hi am trying to use python beautiful-soup web crawler to get data from imdb i have followed the documentation online am able to retrieve all the data using this code
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
with the above code am able to retrieve a list of all the data in the div class tagged as image just as show below
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
but am trying to get the value of the attributes data-const as gotten from the result i want to display just the values of the data-const attribute instead of the whole html result Expected Result : tt1486497, tt1485650

Instead use the class name that div is using.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
print(div["data-const"])
Output:
tt1486497
tt1485650

Try something along the lines of:
for dc in movie_containers.select('div.hover-over-image'):
print(dc['data-const'])
output:
tt1486497
tt1485650

I recommend using requests-html. It's more intuitive than just using beautiful soup.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
imageContainers = html.find_all("div.image")
dataConsts = list(map(lambda x: x.find("a", first=True).attrs["data-const"], imageContainers))
This should exactly do what you need, but I couldn't test it
Good luck!

How to extract onClick url using beautifulsoup

Below is the HTML code which needs extraction
<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html
?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">
<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8
a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team">
<tr>
How do I get the location.href value?
Tried:
soup.findAll("div", {"onClick": "location.href"})
Returns null
Desired Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
PS: there's plenty of location.href

How about using .select() method for SoupSieve package to run a CSS selector
from bs4 import BeautifulSoup
html = '<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">' \
'<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->' \
'<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team"><tr>'
soup = BeautifulSoup(html, features="lxml")
element = soup.select('div.one_block')[0]
print(element.get('onclick'))
Use split to get just print(element.get('onclick').split("'")[1])
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020

Parse multiple href within one parent using BeautifulSoup

I have one line in my program, using BeautifulSoup's find():
print(table.find('td','monsters'))
This is the output of the above line:
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
Now I want to parse all five hrefs, so that it would output something like this:
/m154
/m153
/m152
/m155
/m147
I have attempted to convert my print line into a for loop by changing find() to find_all(), and then retrieve the href by using .a['href'] within the foor loop. However, no matter what I try, I would always only get one entry instead of five. Any suggestions for retrieving multiple href? Seeing that find_all() returns an array, would it make sense to make find_all() directly above the parent of a?

Input:
page = """<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "html.parser") # your source page parsed as html
links = soup.find_all('a', href=True) # get all links having href attribute
for i in links:
print(i['href'])
Result:
/m154
/m153
/m152
/m155
/m147

What you want to do is something like the following:
cell = table.find('td', 'monsters')
for a_tag in cell.find_all('a'):
print(a['href'])

Full Code, similar to posts above
import bs4
HTML= """<html>
<table>
<tr>
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
</tr>
</table>
</html>
"""
table = bs4.BeautifulSoup(HTML, 'lxml')
anker = table.find('td', 'monsters').find_all('a')
[print(a['href']) for a in anker]

Python scrape specific tag without class name

I'm developing a python script to scrape data from a specific site.
I'm using Beautiful Soap as python module.
The interesting data into HTML page are into this structure:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name<a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
into tag tbody there are more tr tag and I would like take to each only first tag a of tag td
I have tried in this way:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
but in this way the script take all href into all td tag. How can take only first?
Thanks

If I understood correctly you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.

This should do the work
html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE<a>
<a href="www.server.com/art/crag">NOT GOOD ONE<a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
for td in soup.select('td'):
a = td.find('a')
if a is not None:
print a.attrs['href']

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

<table> becomes empty, when I'm trying to get it via BeautifulSoup - python

You can get that info from another tag import requests from bs4 import BeautifulSoup as bs url = 'https://www.kp.ru/best/kazan/abiturient_2018/ivmit/' soup = bs(requests.get(url).content, 'lxml') print(soup.select_one('.t431__data-part2').text) Output:

Related

bs4 - how to use find or find_all to get specific content from an url

how to get attribute data using python beautiful soup

How to extract onClick url using beautifulsoup

Parse multiple href within one parent using BeautifulSoup

Python scrape specific tag without class name

Categories

Resources