So I built this small script that returns the URL of any video searched on YouTube. But after opening it up again, it turns out the web scraping of YouTube is not working properly: when printing soup, it returns something completely different from what can be seen with inspect element on YouTube. Can someone help me solve this?
Here's my code:
import requests
from lxml import html
import webbrowser
from bs4 import BeautifulSoup
import time
import tkinter
from pytube import YouTube

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36"}

def video_finder():
    word = input("Enter video title: ")
    if ' ' in word:
        new = word.replace(' ', '+')
        print(new)
    else:
        pass

    vid = requests.get('https://www.youtube.com/results?search_query={}'.format(new))
    soup = BeautifulSoup(vid.text, features='lxml')
    all_vids = soup.find_all('div', id_='contents')
    print(all_vids)
    video1st = all_vids[0]
    a_Tag = video1st.find('a', class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link", href=True)
    Video_name = a_Tag.text
    Video_id = a_Tag['href']
    video_link = 'https://www.youtube.com' + Video_id
    print(Video_name)
    print(video_link)
It's not the best, but yeah... thank you.
To get the correct result from the YouTube page, set the User-Agent HTTP header to Googlebot, and use html.parser in BeautifulSoup.
For example:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}

def video_finder():
    word = input("Enter video title: ")
    params = {
        'search_query': word
    }

    vid = requests.get('https://www.youtube.com/results', params=params, headers=headers)
    soup = BeautifulSoup(vid.content, features='html.parser')

    a_Tag = soup.find('a', class_="yt-uix-tile-link yt-ui-ellipsis yt-ui-ellipsis-2 yt-uix-sessionlink spf-link", href=lambda h: h.startswith('/watch?'))

    Video_name = a_Tag.text
    Video_id = a_Tag['href']
    video_link = 'https://www.youtube.com' + Video_id
    print(Video_name)
    print(video_link)

video_finder()
Prints:
Enter video title: sailor moon
Sailor Moon Opening (English) *HD*
https://www.youtube.com/watch?v=5txHGxJRwtQ
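Since the question already imports pytube, here is a minimal follow-up sketch (an addition of mine, assuming a reasonably recent pytube release and a plain /watch?v= link) showing how the found URL could be handed to pytube:

from pytube import YouTube

video_link = 'https://www.youtube.com/watch?v=5txHGxJRwtQ'  # link printed by video_finder()
yt = YouTube(video_link)  # build a pytube object from the watch URL
print(yt.title)  # title as reported by pytube
# yt.streams.get_highest_resolution().download()  # uncomment to download the video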
I want to do this for the top 500 movies on Metacritic, found at https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc
Each genre will be extracted from a detail link like this one (for the first movie): https://www.metacritic.com/movie/citizen-kane-1941/details
I just need some help extracting the genre part from the HTML of the detail link above.
My get_genre function (but I get an AttributeError):
def get_genre(detail_link):
    detail_page = requests.get(detail_link, headers=headers)
    detail_soup = BeautifulSoup(detail_page.content, "html.parser")
    try:
        #time.sleep(1)
        table = detail_soup.find('table', class_='details', summary=movie_name + " Details and Credits")
        #print(table)
        gen_line1 = table.find('tr', class_='genres')
        #print(gen_line1)
        gen_line = gen_line1.find('td', class_='data')
        #print(gen_line)
    except:
        time.sleep(1)
        year = detail_soup.find(class_='release_date')
        year = year.findAll('span')[-1]
        year = year.get_text()
        year = year.split()[-1]
        table = detail_soup.find('table', class_='details', summary=movie_name + " (" + year + ")" + " Details and Credits")
        #print(table)
        gen_line1 = table.find('tr', class_='genres')
        #print(gen_line1)
        gen_line = gen_line1.find('td', class_='data')

    genres = []
    for line in gen_line:
        genre = gen_line.get_text()
        genres.append(genre.strip())
    genres = list(set(genres))
    genres = (str(genres).split())
    return genres
You're too focused on getting the table. Just use the elements you're sure about. Here's an example with select:
from bs4 import BeautifulSoup
import requests
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_0) AppleWebKit/536.1 (KHTML, like Gecko) Chrome/58.0.849.0 Safari/536.1'}
detail_link="https://www.metacritic.com/movie/citizen-kane-1941/details"
detail_page = requests.get(detail_link, headers = headers)
detail_soup = BeautifulSoup(detail_page.content, "html.parser")
genres=detail_soup.select('tr.genres td.data span')
print([genre.text for genre in genres])
>>> ['Drama', 'Mystery']
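Folded back into the question's function, a minimal sketch (keeping the original get_genre signature and assuming the same requests import and headers dict are in scope) could be:

def get_genre(detail_link):
    detail_page = requests.get(detail_link, headers=headers)
    detail_soup = BeautifulSoup(detail_page.content, "html.parser")
    # select the genre spans directly instead of walking the details table
    return [span.text.strip() for span in detail_soup.select('tr.genres td.data span')]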
Trigger warning: I'm a noob
import requests
from bs4 import BeautifulSoup
from termcolor import colored

with open('ign.txt') as f:
    namesList = f.readlines()
    print("Accounts found: ", namesList)  # Opening file and reading it

for x in namesList:
    url = "https://oldschool.runeclan.com/user/" + x  # Adding username from file to URL
    print(url)
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find(class_="xp_tracker_gain_today").get_text()
    title2 = soup.find(class_="xp_tracker_gain altcolor xp_tracker_pos").get_text()
    title3 = soup.find(class_="xp_tracker_next").get_text()  # Finding right information on site
    print(colored("Exp gain today: " + title, 'green'))
    print(colored("Exp gain yesterday: " + title2, 'green'))
    print(colored(title3, 'green'))  # Printing data found
When I modify my URL with url = "https://oldschool.runeclan.com/user/" + x
I get the following error message:
AttributeError: 'NoneType' object has no attribute 'get_text'
which should mean nothing was found.
This is the output from the upper half of the code:
Accounts found: ['mausie\n', 'mr+stevieyh\n', 'Douwe\n', 'Henk\n']
https://oldschool.runeclan.com/user/mausie
So the link I build looks correct.
When I don't modify the URL and use, say, url = "https://oldschool.runeclan.com/user/myusername", it gives no error. However, I want to loop through my file to check more than one username.
Does anyone know how to fix this?
Here is the problem: when you read from the text file, an extra \n (newline) ends up in your URL. That is why requests returns a 404 page-not-found error. A good way to check is to use print(repr(url)) instead of print(url), which will show you the extra '\n'. To fix it, just do url = url.rstrip() and voilà, it works.
import requests
from bs4 import BeautifulSoup
from termcolor import colored

with open('Sample', encoding='utf-8-sig') as f:
    namesList = f.readlines()
    print("Accounts found: ", namesList)  # Opening file and reading it

for x in namesList:
    url = "https://oldschool.runeclan.com/user/" + x  # Adding username from file to URL
    print(repr(url))
    url = url.rstrip()
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find(class_="xp_tracker_gain_today").text
    title2 = soup.find(class_="xp_tracker_gain altcolor xp_tracker_pos").text
    title3 = soup.find(class_="xp_tracker_next").text  # Finding right information on site
    print(colored("Exp gain today: " + title, 'green'))
    print(colored("Exp gain yesterday: " + title2, 'green'))
    print(colored(title3, 'green'))  # Printing data found
Output:
Accounts found: ['mausie\n', 'mr+stevieyh\n']
'https://oldschool.runeclan.com/user/mausie\n'
Exp gain today: 233,508
Exp gain yesterday: 469,011
Last tracked: 1 minute agoNext track available: now
'https://oldschool.runeclan.com/user/mr+stevieyh\n'
Exp gain today: 129,203
Exp gain yesterday: 730,434
Last tracked: 1 minute agoNext track available: now
That extra \n was the problem all along.
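As a small variation (my own suggestion, not part of the answer above), the newline can also be stripped at read time so every later use of the name is already clean:

with open('ign.txt') as f:
    # strip the trailing newline from every name and skip blank lines
    namesList = [line.strip() for line in f if line.strip()]
print("Accounts found: ", namesList)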
I want to search for different company names on this website: https://www.firmenwissen.de/index.html
On this website, I want to use the search box to look up companies. Here is the code I am trying to use:
from bs4 import BeautifulSoup as BS
import requests
import re

companylist = ['ABEX Dachdecker Handwerks-GmbH']
url = 'https://www.firmenwissen.de/index.html'
payloads = {
    'searchform': 'UFT-8',
    'phrase': 'ABEX Dachdecker Handwerks-GmbH',
    "mainSearchField__button": 'submit'
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
html = requests.post(url, data=payloads, headers=headers)
soup = BS(html.content, 'html.parser')
link_list = []
links = soup.findAll('a')
for li in links:
    link_list.append(li.get('href'))
print(link_list)
This code should take me to the next page with the company information, but unfortunately it returns only the home page. How can I do this?
Change the initial url you are searching against. Grab only the appropriate hrefs and add them to a set to ensure no duplicates (or alter the selector to return only one match if possible); add those items to a final set for looping, so you only loop over the required number of links. I have used Session on the assumption you will repeat this for many companies.
Iterate over the set using selenium to navigate to each company url and extract whatever info you need.
This is an outline.
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}
finalLinks = set()

## searches section; gather into set
with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            "mainSearchField__button": 'submit'
        }
        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')
        companyLinks = {baseUrl + item['href'] for item in soup.select("[href*='firmeneintrag/']")}
        # print(soup.select_one('.fp-result').text)
        finalLinks = finalLinks.union(companyLinks)

for item in finalLinks:
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)
d.quit()
Just the first links:
from bs4 import BeautifulSoup as BS
import requests
from selenium import webdriver

d = webdriver.Chrome()
companyList = ['ABEX Dachdecker Handwerks-GmbH', 'SUCHMEISTEREI GmbH', 'aktive Stuttgarter']
url = 'https://www.firmenwissen.de/ergebnis.html'
baseUrl = 'https://www.firmenwissen.de'
headers = {'User-Agent': 'Mozilla/5.0'}
finalLinks = []

## searches section; add to list
with requests.Session() as s:
    for company in companyList:
        payloads = {
            'searchform': 'UFT-8',
            'phrase': company,
            "mainSearchField__button": 'submit'
        }
        html = s.post(url, data=payloads, headers=headers)
        soup = BS(html.content, 'lxml')
        companyLink = baseUrl + soup.select_one("[href*='firmeneintrag/']")['href']
        finalLinks.append(companyLink)

for item in set(finalLinks):
    d.get(item)
    info = d.find_element_by_css_selector('.yp_abstract_narrow')
    address = d.find_element_by_css_selector('.yp_address')
    print(info.text, address.text)
d.quit()
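One caveat with this "first link" variant: if a search returns no hit, select_one returns None and the ['href'] lookup raises a TypeError. A small guard (a hypothetical helper of mine, keeping the answer's selector) could be:

def first_company_link(soup, base_url):
    # return the first matching company link, or None when the search found nothing
    match = soup.select_one("[href*='firmeneintrag/']")
    return base_url + match['href'] if match is not None else None

Inside the loop, companyLink = first_company_link(soup, baseUrl) would then only be appended when it is not None.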
I want to scrape a few pages from the Amazon website (title, url, ASIN), and I've run into a problem: the script only parses 15 products while the page shows 50. I printed all the HTML to the console and saw that the HTML ends after 15 products, without any errors from the script.
Here is the relevant part of my script:
keyword = "men jeans".replace(' ', '+')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
url = "https://www.amazon.com/s/field-keywords={}".format(keyword)
request = requests.session()
req = request.get(url, headers = headers)
sleep(3)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup)
It's because a few of the items are generated dynamically. There might be a better solution than using selenium; however, as a workaround you can try the approach below instead.
from selenium import webdriver
from bs4 import BeautifulSoup

def fetch_item(driver, keyword):
    driver.get(url.format(keyword.replace(" ", "+")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for items in soup.select("[id^='result_']"):
        try:
            name = items.select_one("h2").text
        except AttributeError:
            name = ""
        print(name)

if __name__ == '__main__':
    url = "https://www.amazon.com/s/field-keywords={}"
    driver = webdriver.Chrome()
    try:
        fetch_item(driver, "men jeans")
    finally:
        driver.quit()
Upon running the above script you should get around 56 names as the result.
import requests
from bs4 import BeautifulSoup

for page in range(1, 21):
    keyword = "red car".replace(' ', '+')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
    url = "https://www.amazon.com/s/field-keywords=" + keyword + "?page=" + str(page)
    request = requests.session()
    req = request.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    results = soup.findAll("li", {"class": "s-result-item"})
    for i in results:
        try:
            print(i.find("h2", {"class": "s-access-title"}).text.replace('[SPONSORED]', ''))
            print(i.find("span", {"class": "sx-price-large"}).text.replace("\n", ' '))
            print('*' * 20)
        except:
            pass
Amazon's page range maxes out at 20 here, so this crawls through the pages.
So I have this code that will give me the urls I need in a list format
import requests
from bs4 import BeautifulSoup

offset = 0
links = []
with requests.Session() as session:
    while True:
        r = session.get("http://rayleighev.deviantart.com/gallery/44021661/Reddit?offset=%d" % offset)
        soup = BeautifulSoup(r.content, "html.parser")
        new_links = soup.find_all("a", {'class': "thumb"})

        # no more links - break the loop
        if not new_links:
            break

        # denotes the number of gallery pages gone through at one time (# of pages times 24 equals the number below)
        links.extend(new_links)
        print(len(links))
        offset += 24

        # denotes the number of gallery pages (# of pages times 24 equals the number below)
        if offset == 48:
            break

for link in links:
    print(link.get("href"))
After that, I try to get different text from all of the urls; the text sits in roughly the same place on each page. But whenever I run the second half, below, I keep getting a chunk of html text and some errors, and I'm not sure how to fix it, or whether there is another, preferably simpler, way to get the text from each url.
import urllib.request
import re

for link in links:
    url = print("%s" % link)
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    paragraphs = re.findall(r'</a><br /><br />(.*?)</div>', str(respData))
    if paragraphs != None:
        paragraphs = re.findall(r'<br /><br />(.*?)</span>', str(respData))
        if paragraphs != None:
            paragraphs = re.findall(r'<br /><br />(.*?)</span></div>', str(respData))

    for eachP in paragraphs:
        print(eachP)
    title = re.findall(r'<title>(.*?)</title>', str(respData))
    for eachT in title:
        print(eachT)
Your code:
for link in links:
    url = print("%s" % link)
assigns None to url. Perhaps you mean:
for link in links:
    url = "%s" % link.get("href")
There's also no reason to use urllib to get the site's content; you can use requests as you did before by changing:
req = urllib.request.Request(url, headers = headers)
resp = urllib.request.urlopen(req)
respData = resp.read()
to
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
Now you can get the title and paragraph with just:
title = soup.find('div', {'class': 'dev-title-container'}).h1.text
paragraph = soup.find('div', {'class': 'text block'}).text
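Put together, a minimal sketch of the reworked second half (my own assembly of the pieces above; the 'dev-title-container' and 'text block' class names come from the answer and assume DeviantArt's old page layout) might look like:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

for link in links:  # links gathered by the first half of the script
    url = link.get("href")
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.content, "html.parser")
    title = soup.find('div', {'class': 'dev-title-container'}).h1.text
    paragraph = soup.find('div', {'class': 'text block'}).text
    print(title)
    print(paragraph)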