I am new to Python and BeautifulSoup. I want to get the title and description of a video.
I am getting the description using this code:
import requests
from bs4 import BeautifulSoup
x='https://www.youtube.com/watch?v=NjG5ZwuY0Rc'
source = requests.get(x).text
soup = BeautifulSoup(source, 'lxml')
for p in soup.find_all('p', id='eow-description'):
    print(p.get_text('\n'))
How can I get title of the video?
To fetch any desired text from an html page:
1. Get the tag name by inspecting the element in the browser (in Chrome, right-click the element and click Inspect), if it is not known already.
2. Get the id of the desired tag.
3. Once you have the details from steps 1 and 2, it is easy to get the text of that tag using get_text:
for title in soup.find_all('span', id="eow-title"):
    print(title.get_text('\n'))
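As a minimal offline sketch of the same idea, the snippet below parses a small hand-written HTML string instead of downloading a page. The markup is hypothetical and only mimics the ids used above; a real page may be structured differently, so `find` is checked for `None` before reading the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page source.
html = """
<h1><span id="eow-title" title="Sample video">
    Sample video
</span></h1>
<p id="eow-description">First line<br/>Second line</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
title_tag = soup.find("span", id="eow-title")
title = title_tag.get_text(strip=True) if title_tag else None

# Join the text fragments of the description with newlines.
description_tag = soup.find("p", id="eow-description")
description = description_tag.get_text("\n", strip=True) if description_tag else None

print(title)
print(description)
```

If `title` comes back as `None` on a real page, the id is probably not present in the HTML that was downloaded.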
I am trying to parse the title of links using BeautifulSoup. I have tried various things but just can't get it to work.
The html is behind a login so here's a screenshot:
And here's my latest attempt which I was sure would work but just returns "None".
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
links = soup.find_all('ul', class_='nav list-group')
print(links)
for link in links:
    title = link.get('title')
    print(title)
Can anyone see what I am doing wrong?
This line of code:
links = soup.find_all('ul', class_='nav list-group')
is not extracting the links; it's extracting the <ul> tags. Instead, you could try extracting the links with something like:
links = soup.find_all('a', class_='odds')
Then you will be able to loop over them and extract your titles:
for link in links:
    print(link['title'])
What happens?
You are selecting the <ul>, not the <a> tags inside it, so you won't get any title value.
How to fix?
Select more specifically, e.g. with this CSS selector, which finds every <a> inside your <ul> that has a title attribute:
links = soup.select('ul.nav.list-group a[title]')
Example
Note: Your question needs some improvement; you should provide the relevant part of driver.page_source as text and not as an image. I reused your code, so this is just a hint.
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.select('ul.nav.list-group a[title]'):
    title = link.get('title')
    print(title)
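Since the real page is behind a login, here is a small self-contained sketch on made-up markup. The HTML, the `odds` class, and the title values are assumptions for illustration only. It shows why reading `title` from the <ul> gives None while the selector on the <a> tags returns the titles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page is behind a login and may differ.
html = """
<ul class="nav list-group">
  <li><a class="odds" href="/a" title="First match">A</a></li>
  <li><a class="odds" href="/b" title="Second match">B</a></li>
  <li><a href="/c">No title here</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# The <ul> itself has no title attribute, so .get() returns None.
ul = soup.find("ul", class_="nav list-group")
ul_title = ul.get("title")

# Only <a> tags inside the <ul> that carry a title attribute are matched.
titles = [a["title"] for a in soup.select("ul.nav.list-group a[title]")]

print(ul_title)
print(titles)
```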
I am trying to scrape the NBA.com play-by-play table, so I want to get the text of each box shown in the example picture,
for example (https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play).
Checking the HTML, I figured out that each line is in an article tag that contains a div tag, which in turn contains two p tags with the information I want. However, when I run the following code, I get 0 articles and only 9 p tags (there should be many more), and even for the tags I do get, the text is not the box text but something else. So I must be doing something terribly wrong, and I am not sure what it is.
This is the code to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def contains_word(t):
    return t and 'keyword' in t
url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles=soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
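The page is rendered with JavaScript, so load it with Selenium and pass the rendered page source to BeautifulSoup. You also need to pip install selenium and get chromedriver.exe.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
# page_source holds the HTML after the browser has run the page's JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")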
This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading.
This is what the webpage looks like after clicking on the heading.
I want to get names of all the places displayed on the webpage but I seem to be having trouble with it as the url doesn't get changed on applying the filter.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen
url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. bs4 (BeautifulSoup) is a Python module that lets you extract specific parts of a webpage. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the full text, such as the elements with a certain tag, you can also use bs4:
soup.find_all('p') # Getting all p tags
soup.find_all('p', class_='Title') # Getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
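A short sketch of that last step, run on invented markup: the `place-name` class and the listing structure are assumptions, so replace them with whatever tag and class you find when inspecting the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; inspect the real page to find the actual tag and class.
html = """
<div class="listing"><a class="place-name" href="/r1">Cafe One</a></div>
<div class="listing"><a class="place-name" href="/r2">Cafe Two</a></div>
<div class="ad"><a href="/ad">Sponsored</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <a> tag whose class is "place-name".
names = [a.get_text(strip=True) for a in soup.find_all("a", class_="place-name")]

print(names)
```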
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
I've been trying to scrape two values from a website using Beautiful Soup in Python, and it's been giving me trouble. Here is the URL of the page I'm scraping:
https://www.stjosephpartners.com/Home/Index
Here are the values I'm trying to scrape:
HTML of Website to be Scraped
I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.stjosephpartners.com/Home/Index').text
soup = BeautifulSoup(source, 'lxml')
gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
print(gold_spot_shell)
the output I got was: <list_iterator object at 0x039FD0A0>
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
The output was: ['\n']
when I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').span
The output was: None
The HTML definitely has at least one span child. I'm not sure how to scrape the values I'm after. Thanks.
BeautifulSoup + requests is not a good way to scrape a dynamic website like this. That span is generated by JavaScript, so when you fetch the HTML with requests, it simply does not exist yet.
You can try using Selenium instead.
You can check whether a website uses JavaScript to render an element by disabling JavaScript on the page and looking for that element again, or just by using "view page source".
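The same check can be sketched in code: parse the raw HTML and see whether the element is there at all. The helper `has_element`, the `price` class, and both HTML strings below are hypothetical; they stand in for what requests would return versus what a browser shows after JavaScript has run.

```python
from bs4 import BeautifulSoup


def has_element(html, name, attrs):
    """Return True if the HTML contains a tag with this name and attributes."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(name, attrs=attrs) is not None


# Hypothetical raw response: the container is present but still empty.
static_html = '<div class="col-lg-10">\n</div>'
# Hypothetical DOM after JavaScript has filled in the value.
rendered_html = '<div class="col-lg-10"><span class="price">1,800.00</span></div>'

in_static = has_element(static_html, "span", {"class": "price"})
in_rendered = has_element(rendered_html, "span", {"class": "price"})

# The empty container only holds a newline text node.
static_children = BeautifulSoup(static_html, "html.parser").find("div").contents

print(in_static, in_rendered, static_children)
```

If the element is missing from the raw response but visible in the browser, it is being added by JavaScript.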
I am scraping an html file, each page has a video on it, and in the html there is the video id. I want to print out the video id.
I know that if I want to print a headline from a div class, I would do this:
with open('yeehaw.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')
article = soup.find('div', class_='article')
headline = article.h2.a.text
print(headline)
However the id for the video is found inside a data-id='qe67234'
I don't know how to access this 'qe67234' and print it out.
Please help, thank you!
Assuming that the data-id attribute is on a div tag:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', 'html.parser')
# Match every div that has a data-id attribute, whatever its value
results = soup.find_all("div", {"data-id": re.compile(r".*")})
print('output:', results[0]['data-id'])
# output: qe67234
Assuming that the data-id is in div
BeautifulSoup.find returns the found HTML element as a Tag object whose attributes can be accessed like dictionary keys. You can therefore navigate it using standard means to get at the text (as you did in your question) as well as the tag's attributes (as shown in the code below):
soup = BeautifulSoup('<div class="_article" data-id="qe67234">', 'html.parser')
soup.find("div", {"class": "_article"})['data-id']
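To illustrate the dictionary-style access a little further, here is a small sketch on the same one-line snippet. The `data-title` attribute is a made-up name used only to show what happens when an attribute is absent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="_article" data-id="qe67234"></div>', "html.parser")

# Passing True as the value matches any tag that has the attribute at all.
div = soup.find("div", attrs={"data-id": True})

# Square brackets raise KeyError for a missing attribute; .get() returns None.
video_id = div["data-id"]
missing = div.get("data-title")  # hypothetical attribute, not in the markup

# .attrs is the underlying dict; class is stored as a list of class names.
attrs = dict(div.attrs)

print(video_id, missing, attrs)
```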
Note that video elements often require JavaScript for playback, and you might not be able to find the necessary element if the page was scraped with a non-JavaScript client (e.g. Python requests).
If this happens, you have to use a tool like Selenium with a browser such as PhantomJS to load the page, run its JavaScript, and then perform your scraping.
EDIT
If the data-id attribute itself is not constant, you should look into the lxml library as a replacement for BeautifulSoup and use XPath expressions to find the element that you need.