Getting an issue with Python web scraping - python

I am new to Python and web scraping. I wrote some code to scrape quotes and the corresponding author names from https://www.brainyquote.com/topics/inspirational-quotes and ended up with no result. Here is the code I used for the purpose:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\Users\Sandheep\Desktop\chromedriver.exe")
product = []
prices = []
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.findAll("a", href=True, attrs={"class": "clearfix"}):
    quote = a.find("a", href=True, attrs={"title": "view quote"}).text
    author = a.find("a", href=True, attrs={"class": "bq-aut"}).text
    product.append(quote)
    prices.append(author)
print(product)
print(prices)
I am not sure what I need to edit to get the result.
Thanks in advance!

As I understand it, the site has this information in the alt attribute of its images. Also, the quote and author are separated by ' - '.
So you need to iterate over soup.find_all('img'); the function to fetch results might look like this:
def fetch_quotes(soup):
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except ValueError:
            pass
        else:
            yield {'quote': quote, 'author': author}
Then use it like: print(list(fetch_quotes(soup)))
Also, note that you can often replace Selenium with plain requests, e.g.:
import requests
from bs4 import BeautifulSoup
content = requests.get("https://www.brainyquote.com/topics/inspirational-quotes").content
soup = BeautifulSoup(content, "lxml")
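Putting the two together, a minimal sketch (assuming the quotes are still exposed through the img alt attributes as described above, and with the alt lookup also guarded against images that have no alt attribute) could look like this:
import requests
from bs4 import BeautifulSoup

def fetch_quotes(soup):
    # Each quote image carries "quote - author" in its alt attribute
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except (KeyError, ValueError):
            # skip images without an alt attribute or without the expected "quote - author" format
            pass
        else:
            yield {'quote': quote, 'author': author}

content = requests.get("https://www.brainyquote.com/topics/inspirational-quotes").content
soup = BeautifulSoup(content, "lxml")
print(list(fetch_quotes(soup)))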

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"ChromeDriver path")
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
root_tag = ["div", {"class": "m-brick grid-item boxy bqQt r-width"}]
quote_author = ["a", {"title": "view author"}]
quote = []
author = []
all_data = soup.findAll(root_tag[0], root_tag[1])
for div in all_data:
    try:
        quote.append(div.find_all("a", {"title": "view quote"})[1].text)
        author.append(div.find(quote_author[0], quote_author[1]).text)
    except:
        continue
Printing the first result:
for i in range(len(author)):
    print(quote[i])
    print(author[i])
    break
Start by doing what's necessary; then do what's possible; and suddenly you are doing the impossible.
Francis of Assisi

Related

Python Web Scraping | How to scrape data from multiple urls by choosing page number as a range with Beautiful Soup and selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):
    # a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'
    cd = driver.get(a + str(c))
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})
    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')
Now the problem is that the webdriver successfully visits 7 pages, but doesn't produce any output or an error.
I don't know where I'm going wrong.
Any suggestions, or a reference to an article that addresses this problem, would be very welcome.
You are not selecting the right div tag to fetch the products using BeautifulSoup, leading to no output.
Try the following snippet:
# range of pages
for i in range(1, 20):
    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={i}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    # get the search results
    products = bs.find_all('div', {'data-component-type': "s-search-result"})
    # for each product in the search results, print the product name
    for i in range(0, len(products)):
        for product_name in products[i].find('span', class_="a-size-medium a-color-base a-text-normal"):
            print(product_name)
You can print bs or fetch_data to debug.
Anyway, in my opinion you can use requests or urllib to get the page source instead of Selenium.
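As a hedged sketch of that idea (Amazon may still block or captcha plain HTTP clients, so this is not guaranteed to return results), the Selenium calls could be replaced like this:
import requests
from bs4 import BeautifulSoup as Soup

headers = {'User-Agent': 'Mozilla/5.0'}  # a browser-like User-Agent; Amazon may still refuse the request

for page in range(1, 8):
    url = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}'
    r = requests.get(url, headers=headers)
    bs = Soup(r.text, 'html.parser')
    # same search-result selector as in the snippet above
    for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
        name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if name:
            print(name.get_text(strip=True))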

Scraping a table with BeautifulSoup

I am trying to scrape a table with info on football players from https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1
It works fine when I try to get information manually like this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019'
response = requests.get(url, headers={'User-Agent': 'Custom5'})
data = response.text
soup = BeautifulSoup(data, 'html.parser')
players_table = soup.find("table", attrs={"class": "items"})
Players = soup.find_all("a", {"class": "spielprofil_tooltip"})
Players[5].text
Values = soup.find_all("td", {"class": "rechts hauptlink"})
Values[9].text
Birthdays = soup.find_all("td", {"class": "zentriert"})
Birthdays[1].text
But to actually get the data into a table, I think I need to use a for loop over the tr and td tags. I have looked for solutions but cannot find anything that works with this particular website.
When I try this, for example, the list remains empty:
data = []
for tr in players_table.find_all("tr"):
    # remove any newlines and extra spaces from left and right
    data.append
print(data)
You don't actually append anything to the list.
Change data.append to data.append(tr).
That way you tell your program what to append to the list, assuming players_table.find_all("tr") returns at least one item.
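Going one step further, a minimal sketch that appends the cleaned cell text of each row (reusing the players_table from the question) instead of the whole tag might look like this:
data = []
for tr in players_table.find_all("tr"):
    # collect the stripped text of every cell in the row
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    if row:  # skip header rows and rows without <td> cells
        data.append(row)
print(data)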
The website uses JavaScript, which requests doesn't support, so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait 5 seconds for the page to load
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
players_table = soup.find("table", attrs={"class": "items"})
for tr in players_table.find_all('tr'):
    tds = ' '.join(td.get_text(strip=True) for td in tr.select('td'))
    print(tds)
driver.quit()
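As an alternative sketch, assuming the squad table parses cleanly as a regular HTML table, you could hand the extracted table to pandas.read_html (which needs lxml or html5lib installed) and get a DataFrame directly:
from io import StringIO

import pandas as pd

# players_table is the BeautifulSoup <table class="items"> element extracted above
squad = pd.read_html(StringIO(str(players_table)))[0]
print(squad.head())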

Beautiful Soup does not get full div

BeautifulSoup does something weird and I can't figure out why.
import requests
from bs4 import BeautifulSoup
url = "nsfw"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
cards = soup.find_all("div", {"class": "card-body"})
cards.pop(0)
cards.pop(0)
cards.pop(0) # i really like to pop
texte = []
print(soup)
for i, card in enumerate(cards):
    texte.append(card.text)
    if i == len(cards)-1:
        print(card)
Now what I expect it to do is get the divs and put the text of the divs into the array. And it does work, for the first 8 out of 9 divs. The 9th div is extremely shortened. Result of the print:
<div class="card-body" id="card_Part_9"><p class="storytext"><span class="brk2_firstwords">“Door’s open,” Brendan shouted.</span></p>
<p class="storytext">Jeffrey</p></div>
But on the website itself it doesn't end there. Here is a screenshot: https://i.imgur.com/CmvYzfJ.png
Why does this happen? What can I do to prevent this? I have already tried to change the parser, but that does not change the result. The site does not use Javascript to load content.
Structure when opening with a browser: https://pastebin.com/N2bPYFBD
But when I print(soup) I get:
<p class="storytext">Jeffrey</p></div></div></div></div></div></div></div></body></html> entered the apartment
Seems like html.parser messes up the DOM. The lxml parser works for me (note that lxml must be installed separately, e.g. pip install lxml):
import requests
from bs4 import BeautifulSoup
url = "six-pack-thingy"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
cards = soup.find_all("div", {"class": "card-body"})
texte = [card.text for card in cards[3:]]
Thought I could post my scribble as well:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('six-pack-thingy')
elems = driver.find_elements_by_class_name('card-body')
texte = [t.text for t in elems[3:]]
You will have to get some webdriver to run selenium, though. Are you familiar with that?

Using Python to Scrape Sky Cinema List

I'd like to gather a list of films and their links to all available movies on Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link, especially as there are several pages to click through - possibly based on scroll position (in the URL?).
I've tried using BeautifulSoup and Python but there is no output. The code I have tried only returns the title. I'd like the title and the link to the film; as these are in different areas on the site, I am unsure how this is done.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class": "sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)
spans = page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
    print(span.text)
I'd like the output to show as: title, link.
EDIT:
I have just tried the following but get an error that "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in the network tab. You can get all results with one call. You can set the limit to a number greater than the expected result count:
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the number you can see on the page
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
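If you want to keep the result, a short follow-up writes the DataFrame to a CSV file (the filename here is just an example):
df.to_csv('sky_movies.csv', index=False)  # example output path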
First of all, read the terms and conditions of the site you are going to scrape.
Next, you need Selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]
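Since the desired output is title, link, a small follow-up pairs the two lists (assuming titles and links line up one-to-one on the rendered page):
for title, link in zip(titles, links):
    print(title + ', ' + link)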

BeautifulSoup: fetched all the links on a webpage, how to navigate through them without Selenium?

So I'm trying to write a mediocre script to download subtitles from one particular website, as y'all can see. I'm a newbie to BeautifulSoup; so far I have a list of all the "href"s after a search query (GET). So how do I navigate further after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
    dictn = link.get('href')
    print(dictn)
You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
    print(base_url_f + media.find('a')['href'])
out: https://www.yifysubtitles.com/movie-imdb/tt2527336
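To navigate further, as the question asks, a minimal sketch requests each result page in turn and parses it; the '/subtitles/' filter below is an assumption about the movie-page layout and may need adjusting:
for media in soup.find_all("div", {"class": "media-body"}):
    movie_url = base_url_f + media.find('a')['href']
    movie_soup = BeautifulSoup(requests.get(movie_url).text, 'lxml')
    # assumed pattern: subtitle pages are linked with '/subtitles/' in their href
    for sub in movie_soup.find_all('a', href=True):
        if '/subtitles/' in sub['href']:
            print(base_url_f + sub['href'])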
