Actually, I tried some code but it doesn't work. Could anyone help me fix it?
It says that video_link is not defined.
I think the error is in the line: for link in soup.find_all('a'):
import os
import glob
from bs4 import BeautifulSoup
import urllib
from urllib.parse import quote_plus as qp

DEFAULT_AUDIO_QUALITY = '320K'

search = ' '
# We do not want to accept empty inputs :)
while search == '':
    search = raw_input('Enter your query ')

search = qp(search)
print('Making a Query Request! ')
response = urllib.request.urlopen('https://www.youtube.com/results?search_query=' + search)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    if '/watch?v=' in link.get('href'):
        print(link.get('href'))
        # May change when the YouTube website gets updated in the future.
        video_link = link.get('href')
        break

video_link = 'http://www.youtube.com/' + video_link
command = ('youtube-dl --extract-audio --audio-format mp3 --audio-quality ' +
           DEFAULT_AUDIO_QUALITY + ' ' + video_link)
print('Downloading...')
os.system(command)
But this is giving an error.
To get the correct version of the YouTube HTML page, use a correct User-Agent HTTP header.
For example:
import requests
from bs4 import BeautifulSoup

search = 'tree'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
html = requests.get('https://www.youtube.com/results?search_query=' + search, headers=headers).text
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    if '/watch?v=' in link.get('href'):
        print(link.get('href'))
        # May change when the YouTube website gets updated in the future.
        video_link = link.get('href')
Prints:
/watch?v=ZKAM_Hk4eZ0
/watch?v=ZKAM_Hk4eZ0
/watch?v=wCQfkEkePx8
/watch?v=wCQfkEkePx8
/watch?v=Va0vs1fhhNI
/watch?v=Va0vs1fhhNI
/watch?v=kUDPr5xPYhM
/watch?v=kUDPr5xPYhM
/watch?v=kSjXOebB7eI
/watch?v=kSjXOebB7eI
/watch?v=IiDkVftBgak
/watch?v=IiDkVftBgak
/watch?v=F3hTW9e20d8
/watch?v=F3hTW9e20d8
/watch?v=Iy-dJwHVX84
/watch?v=Iy-dJwHVX84
... etc.
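To tie this back to the original error: video_link is only assigned inside the if block, so when no matching link is found, the later line that builds the URL raises NameError. Below is a minimal guarded sketch combining the User-Agent fix above with the download step; note the '/watch?v=' selector is fragile, and modern YouTube often renders results with JavaScript, so the static HTML may not contain these links at all.

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

DEFAULT_AUDIO_QUALITY = '320K'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

search = ''
while search == '':
    search = input('Enter your query ')  # input(), not raw_input(), on Python 3

html = requests.get('https://www.youtube.com/results?search_query=' + quote_plus(search),
                    headers=headers).text
soup = BeautifulSoup(html, 'html.parser')

video_link = None  # defined up front so the check below cannot raise NameError
for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
    if '/watch?v=' in link['href']:
        video_link = link['href']
        break

if video_link is None:
    print('No video found for that query.')
else:
    video_link = 'https://www.youtube.com' + video_link
    command = ('youtube-dl --extract-audio --audio-format mp3 --audio-quality ' +
               DEFAULT_AUDIO_QUALITY + ' ' + video_link)
    print('Downloading...')
    os.system(command)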
I'm trying to find all the URLs on this page: https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments
More specifically, I want the links that are hyperlinked under each "Subject Code". However, when I run my code, barely any links get extracted.
I would like to know why this is happening, and how I can fix it.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
page = requests.get(url)
soup = BeautifulSoup(page.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
This is my first attempt at web scraping.
There's anti-bot protection; just add a User-Agent to your headers. And don't forget to check your soup when things go wrong.
from bs4 import BeautifulSoup
import requests
url = "https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments"
ua={'User-Agent':'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_2) AppleWebKit/531.2 (KHTML, like Gecko) Chrome/26.0.869.0 Safari/531.2'}
r = requests.get(url, headers=ua)
soup = BeautifulSoup(r.text, features="lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
The message in the soup was:
Sorry for the inconvenience.
We have detected excess or unusual web requests originating from your browser, and are unable to determine whether these requests are automated.
To proceed to the requested page, please complete the captcha below.
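One way to "check your soup" is to test for that message before parsing further; a minimal sketch (the test string is an assumption based on the message above):

page_text = soup.get_text().lower()
if 'complete the captcha' in page_text:
    raise RuntimeError('Got the anti-bot page instead of the schedule; '
                       'check the User-Agent header or slow down your requests.')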
I would use nth-child(1) to restrict to the first column of the table matched by id, then simply extract the .text. If that text contains '*', provide a default string for no course offered; otherwise concatenate the retrieved course identifier onto a base query string:
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-all-departments', headers=headers)
soup = bs(r.content, 'lxml')
no_course = ''
base = 'https://courses.students.ubc.ca/cs/courseschedule?pname=subjarea&tname=subj-department&dept='
course_info = {i.text:(no_course if '*' in i.text else base + i.text) for i in soup.select('#mainTable td:nth-child(1)')}
course_info
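course_info is then a plain dict mapping each subject code to its constructed link (or the empty string where no course is offered), so it can be iterated like any other dict, for example:

# Print each subject code with its link, flagging codes with no offering.
for code, link in course_info.items():
    print(code, '->', link if link else 'no course offered')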
I am new to Python and web scraping. I wrote some code for scraping quotes and the corresponding author names from https://www.brainyquote.com/topics/inspirational-quotes and ended up with no result. Here is the code I used:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\Users\Sandheep\Desktop\chromedriver.exe")
product = []
prices = []
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.findAll("a", href=True, attrs={"class": "clearfix"}):
    quote = a.find("a", href=True, attrs={"title": "view quote"}).text
    author = a.find("a", href=True, attrs={"class": "bq-aut"}).text
    product.append(quote)
    prices.append(author)
print(product)
print(prices)
I am not getting where I need to edit to get the result.
Thanks in advance!
As I understand it, the site has this information in the alt attribute of its images. Also, the quote and author are separated by ' - '.
So you need to iterate over soup.find_all('img'); the function to fetch the results may look like:
def fetch_quotes(soup):
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except ValueError:
            pass
        else:
            yield {'quote': quote, 'author': author}
Then, use it like: print(list(fetch_quotes(soup)))
Also, note that you can often replace selenium with pure requests, e.g.:
import requests
from bs4 import BeautifulSoup
content = requests.get("https://www.brainyquote.com/topics/inspirational-quotes").content
soup = BeautifulSoup(content, "lxml")
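Combining the two, the soup built from requests can be fed straight into the generator above:

print(list(fetch_quotes(soup)))  # same usage as with the selenium-built soup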
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"ChromeDriver path")
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
root_tag=["div", {"class":"m-brick grid-item boxy bqQt r-width"}]
quote_author=["a",{"title":"view author"}]
quote=[]
author=[]
all_data = soup.findAll(root_tag[0], root_tag[1])
for div in all_data:
    try:
        quote.append(div.find_all("a", {"title": "view quote"})[1].text)
        author.append(div.find(quote_author[0], quote_author[1]).text)
    except:
        continue
The output will be:
for i in range(len(author)):
    print(quote[i])
    print(author[i])
    break
Start by doing what's necessary; then do what's possible; and suddenly you are doing the impossible.
Francis of Assisi
Is there any way to get all the text of a website without the source code?
Like opening a website and pressing Ctrl+A to select everything there.
import requests
content = requests.get('any url')
print(content.text)
This prints the source code as text, but I want the visible page text, as described above.
Step 1: Get some HTML from a web page
Step 2: Use the Beautiful Soup package to parse the HTML (learn about Beautiful Soup at https://pypi.org/project/beautifulsoup4/ if you don't have prior knowledge).
Step 3: List the elements that are not required (e.g. header, meta, script).
import requests
from bs4 import BeautifulSoup
url = 'https://www.zzz.com/yyy/'  # give any url
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    # name more elements if not required
]
for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)
For this you have to install beautifulsoup and lxml, but it will work after that.
from bs4 import BeautifulSoup
import requests
source = requests.get('your_url').text
soup = BeautifulSoup(source, 'lxml').text
print(soup)
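Note that taking .text of the whole soup also pulls in the contents of script and style tags; if that is unwanted, dropping them first is a small extension of the same idea (a sketch using BeautifulSoup's decompose() and get_text()):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('your_url').text, 'lxml')
# Remove script and style elements so their contents don't leak into the text.
for tag in soup(['script', 'style']):
    tag.decompose()
print(soup.get_text(separator=' ', strip=True))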
So I'm trying to write a mediocre script to download subtitles from one particular website, as y'all can see. I'm a newbie to BeautifulSoup; so far I have a list of all the "href" values after a search query (GET). So how do I navigate further after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
    dictn = link.get('href')
    print(dictn)
You need to use resp.text instead of resp.content.
Try this to get the search results:
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
    print(base_url_f + media.find('a')['href'])
out: https://www.yifysubtitles.com/movie-imdb/tt2527336
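To navigate further, each printed URL is itself a page you can request and parse the same way; a rough sketch (the structure of the movie page is an assumption; inspect it to find the real subtitle links):

for media in soup.find_all("div", {"class": "media-body"}):
    movie_url = base_url_f + media.find('a')['href']
    movie_soup = BeautifulSoup(requests.get(movie_url).text, 'lxml')
    # From here, look inside movie_soup for the subtitle download anchors;
    # the exact tags and classes are site-specific.
    print(movie_url, movie_soup.title.string if movie_soup.title else '')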
I am writing a program to extract unique web links from www.stevens.edu (it is an assignment), but there is one problem. My program works and extracts links for all sites except www.stevens.edu, for which I get the output 'None'. I am very frustrated with this and need help. I am using this URL for testing: http://www.stevens.edu/
import urllib
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
html = urllib.urlopen(url).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
Please guide me here and let me know why it is not working with www.stevens.edu.
The site checks the User-Agent header and returns different HTML based on it.
You need to set the User-Agent header to get the proper HTML:
import urllib
import urllib2
from bs4 import BeautifulSoup as bs
url = raw_input('enter - ')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'}) # <--
html = urllib2.urlopen(req).read()
soup = bs(html)
tags = soup('a')
for tag in tags:
    print tag.get('href', None)
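On Python 3, where urllib2 was merged into urllib.request and raw_input became input, the same fix would look roughly like this:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

url = input('enter - ')
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # set the User-Agent as above
html = urlopen(req).read()
soup = bs(html, 'html.parser')
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))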