I'm trying to parse a table from the website https://www.kp.ru/best/kazan/abiturient_2018/ivmit/. Chrome DevTools shows the table as:
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table " style="">
...
</table>
</div>
But when I do this:
url = r"https://www.kp.ru/best/kazan/abiturient_2018/ivmit/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.find_all('div', {'class':r't431__table-wapper'})
print(tag)
the <table> comes back empty:
[<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>,
<div class="t431__table-wapper" data-auto-correct-mobile-width="false">
<table class="t431__table" style=""></table></div>]
Is the table filled in by JavaScript or something? How can I fix this?
You can get that info from another tag:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.kp.ru/best/kazan/abiturient_2018/ivmit/'
soup = bs(requests.get(url).content, 'lxml')
print(soup.select_one('.t431__data-part2').text)
Output:
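The idea in the answer above (read the data from a hidden tag instead of the empty <table>) can be sketched offline. The HTML below is a made-up stand-in for the downloaded page, and the layout of the hidden text (one line per row, semicolons between cells) is an assumption that should be checked against what `.t431__data-part2` really contains. The `.t431__data-part1` header tag and the `hidden_rows` helper are hypothetical names used only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source: the <table> is empty because
# a script fills it in the browser, but the raw data sits in hidden tags.
html = """
<div class="t431">
  <div class="t431__data-part1" style="display: none">Program;Places</div>
  <div class="t431__data-part2" style="display: none">Applied mathematics;25
Computer science;40</div>
  <div class="t431__table-wapper"><table class="t431__table"></table></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def hidden_rows(soup, selector, cell_sep=";"):
    """Split the text of the first matching tag into rows of cells.

    Assumes one row per line and `cell_sep` between cells.
    """
    node = soup.select_one(selector)
    if node is None:
        return []
    return [
        [cell.strip() for cell in line.split(cell_sep)]
        for line in node.get_text().splitlines()
        if line.strip()
    ]


header = hidden_rows(soup, ".t431__data-part1")
body = hidden_rows(soup, ".t431__data-part2")
print(header)
print(body)
```

If the real separator differs, only the `cell_sep` argument needs to change.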
I need to get the individual URL for each country, given in the "a href=" under each div with class "well span4". For example, I need to get https://www.rulac.org/browse/countries/myanmar, https://www.rulac.org/browse/countries/the-netherlands, and every other URL after "a href=" (as shown in the partial HTML structure below).
Since the first "a href=" is not under any class, how do I search for and collect all the country URLs?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial html structure shown as below
[<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar">
<div class="map-wrap">
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div class="map-wrap">
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/niger">
<div class="map-wrap">
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
</div>,
You can use soup.select() with a CSS selector to get all <a> elements of class btn that are children of <div>s with classes well and span4. Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
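Since the question points out that the first link in each block has no class, a variant that does not rely on the btn class may also be useful. The sketch below runs on a trimmed, hand-written copy of the structure from the question (not the live page) and selects every direct child <a> that has an href, then drops duplicates while keeping the order.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure shown in the question (two countries only).
html = """
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/myanmar">
    <div class="map-wrap"></div>
  </a>
  <h2>Myanmar</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/the-netherlands">
    <div class="map-wrap"></div>
  </a>
  <h2>Netherlands</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a[href] matches links with or without a class; each country appears
# twice per block, so keep only the first occurrence of every URL.
country_urls = []
for a in soup.select("div.well.span4 > a[href]"):
    if a["href"] not in country_urls:
        country_urls.append(a["href"])

for url in country_urls:
    print(url)
```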
Hi, I am trying to use Python and Beautiful Soup to scrape data from IMDb. Following the online documentation, I am able to retrieve all the data using this code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
With the above code I can retrieve a list of everything inside the div elements with class image, as shown below:
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
But I am trying to get the value of the data-const attribute. I want to display just the data-const values instead of the whole HTML result. Expected result: tt1486497, tt1485650
Instead, search by the class names of the div that carries the attribute.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
    print(div["data-const"])
Output:
tt1486497
tt1485650
Try something along the lines of:
# find_all() returns a list-like ResultSet with no select(), so query the soup itself
for dc in html_soup.select('div.hover-over-image'):
    print(dc['data-const'])
output:
tt1486497
tt1485650
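Another option, shown here only as a sketch on a trimmed copy of the HTML from the question, is to select by the presence of the attribute itself. That way the code does not depend on the exact class names of the inner div.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (images removed).
html = """
<div class="image">
  <a href="/title/tt1486497/" title="Pilot">
    <div class="hover-over-image zero-z-index" data-const="tt1486497">
      <div>S1, Ep1</div>
    </div>
  </a>
</div>
<div class="image">
  <a href="/title/tt1485650/" title="The Night of the Comet">
    <div class="hover-over-image zero-z-index" data-const="tt1485650">
      <div>S1, Ep2</div>
    </div>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# div[data-const] matches only divs that actually carry the attribute,
# so the inner <div>S1, Ep1</div> elements are skipped.
ids = [div["data-const"] for div in soup.select("div.image div[data-const]")]
print(", ".join(ids))
```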
I recommend using requests-html. I find it more intuitive than using Beautiful Soup alone.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
imageContainers = html.find("div.image")
# data-const sits on the div inside the link, not on the <a> itself
dataConsts = [x.find("div[data-const]", first=True).attrs["data-const"] for x in imageContainers]
This should do what you need, but I couldn't test it.
Good luck!
Below is the HTML I need to extract from:
<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">
<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team">
<tr>
How do I get the location.href value?
Tried:
soup.findAll("div", {"onClick": "location.href"})
This returns an empty result.
Desired Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
PS: the page contains plenty of these location.href values.
How about using the .select() method, which runs a CSS selector through the SoupSieve package:
from bs4 import BeautifulSoup
html = '<div class="one_block" style="display:block;" onClick="location.href=\'/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020\';" style="cursor:pointer;">' \
'<!-- \xe5\xb0\x8d\xe6\x88\xb0\xe7\x90\x83\xe9\x9a\x8a\xe5\x8f\x8a\xe5\xa0\xb4\xe5\x9c\xb0 start -->' \
'<table width="100%" border="0" cellspacing="0" cellpadding="0" class="schedule_team"><tr>'
soup = BeautifulSoup(html, features="lxml")
element = soup.select('div.one_block')[0]
print(element.get('onclick'))
To get just the URL, split the attribute value on the single quotes:
print(element.get('onclick').split("'")[1])
Output:
/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020
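Because the question mentions that the page has many such blocks, here is a sketch that collects the URL from every div with an onclick attribute. The second block, the third block, and the regular expression are illustrative assumptions and are not taken from the real page.

```python
import re

from bs4 import BeautifulSoup

# First block follows the question; the other two are made-up examples.
html = """
<div class="one_block" onClick="location.href='/games/box.html?&game_type=01&game_id=13&game_date=2020-04-19&pbyear=2020';"></div>
<div class="one_block" onClick="location.href='/games/box.html?&game_type=01&game_id=14&game_date=2020-04-19&pbyear=2020';"></div>
<div class="one_block" onClick="doSomethingElse();"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Capture whatever sits between the quotes after "location.href=".
href_pattern = re.compile(r"location\.href\s*=\s*'([^']*)'")

urls = []
# The parser lowercases attribute names, so onClick is stored as onclick.
for div in soup.find_all("div", onclick=True):
    match = href_pattern.search(div["onclick"])
    if match:  # skip onclick handlers that do not set location.href
        urls.append(match.group(1))

for url in urls:
    print(url)
```

Compared with split("'")[1], the regular expression skips handlers that do something other than set location.href.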
I have one line in my program, using BeautifulSoup's find():
print(table.find('td','monsters'))
This is the output of the above line:
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
Now I want to parse all five hrefs, so that it would output something like this:
/m154
/m153
/m152
/m155
/m147
I have attempted to convert my print line into a for loop by changing find() to find_all(), and then to retrieve the href with .a['href'] inside the for loop. However, no matter what I try, I always get only one entry instead of five. Any suggestions for retrieving multiple hrefs? Since find_all() returns a list, would it make sense to call find_all() directly on the parent of the a tags?
Input:
page = """<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "html.parser") # your source page parsed as html
links = soup.find_all('a', href=True) # get all links having href attribute
for i in links:
    print(i['href'])
Result:
/m154
/m153
/m152
/m155
/m147
What you want to do is something like the following:
cell = table.find('td', 'monsters')
for a_tag in cell.find_all('a'):
    print(a_tag['href'])
Full code, similar to the answers above:
import bs4
HTML= """<html>
<table>
<tr>
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
</tr>
</table>
</html>
"""
table = bs4.BeautifulSoup(HTML, 'lxml')
anker = table.find('td', 'monsters').find_all('a')
for a in anker:
    print(a['href'])
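The snippets above all look for <a> tags inside the cell, but the HTML quoted in this thread does not show any. The sketch below therefore assumes, purely for illustration, that each div is wrapped in a link whose href matches the expected output. The real nesting on the page may differ, in which case only the selector needs adjusting.

```python
from bs4 import BeautifulSoup

# ASSUMED markup: the <a href="..."> wrappers are a guess based on the
# expected output in the question, not copied from the real page.
html = """
<table>
  <tr>
    <td class="monsters">
      <a href="/m154"><div class="mim mim-154"></div></a>
      <a href="/m153"><div class="mim mim-153"></div></a>
      <a href="/m152"><div class="mim mim-152"></div></a>
      <a href="/m155"><div class="mim mim-155"></div></a>
      <a href="/m147"><div class="mim mim-147"></div></a>
    </td>
  </tr>
</table>
"""

table = BeautifulSoup(html, "html.parser")

# One CSS selector finds every link inside the cell, so there is no need
# to loop over the cell first and then read .a (which returns only one tag).
hrefs = [a["href"] for a in table.select("td.monsters a[href]")]

for href in hrefs:
    print(href)
```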
I'm developing a Python script to scrape data from a specific site.
I'm using Beautiful Soup as the Python module.
The data I'm interested in sits in this structure in the HTML page:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name<a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
The tbody tag contains several tr tags, and from each one I would like to take only the first a tag of its td.
I have tried it this way:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
But this way the script takes every href in every td tag. How can I take only the first one?
Thanks
If I understood correctly, you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.
This should do the job:
html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE<a>
<a href="www.server.com/art/crag">NOT GOOD ONE<a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''
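from bs4 import BeautifulSoup
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
for td in soup.select('td'):
    a = td.find('a')  # first link in this cell, or None
    if a is not None:
        print(a.attrs['href'])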
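If the goal is one link per row rather than one per cell, a small variation is to loop over the tr tags and take the first link in each. This is only a sketch: the table below is a made-up, well-formed example that reuses the tabella_falist id from the question, and the URLs in it are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up example rows; only the id "tabella_falist" comes from the question.
html = """
<table id="tabella_falist">
  <tbody>
    <tr>
      <td>
        <a href="www.server.com/art/crag-one">First</a>
        <a href="www.server.com/art/extra">Extra</a>
      </td>
      <td class="nowrap"><a href="www.server.com/other">Other</a></td>
    </tr>
    <tr>
      <td><a href="www.server.com/art/crag-two">Second</a></td>
      <td class="nowrap"></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for row in soup.select("#tabella_falist tbody tr"):
    # select_one returns the first match in document order, or None.
    first_link = row.select_one("td a[href]")
    if first_link is not None:
        links.append(first_link["href"])

print(links)
```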