Scraping a table with BeautifulSoup - Python

I am trying to scrape a table with info on football players from https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1
It works fine when I try to get information manually like this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019'
response = requests.get(url, headers={'User-Agent': 'Custom5'})
data = response.text
soup = BeautifulSoup(data, 'html.parser')
players_table = soup.find("table", attrs={"class": "items"})
Players = soup.find_all("a", {"class": "spielprofil_tooltip"})
Players[5].text
Values = soup.find_all("td", {"class": "rechts hauptlink"})
Values[9].text
Birthdays = soup.find_all("td", {"class": "zentriert"})
Birthdays[1].text
But to actually get the data into a table, I think I need a for loop over the tr and td tags. I have looked for solutions but cannot find anything that works with this particular website.
When I try this, for example, the list remains empty:
data = []
for tr in players_table.find_all("tr"):
    # remove any newlines and extra spaces from left and right
    data.append
print(data)

You don't actually append anything to the list.
Change data.append to data.append(tr).
That way you tell your program what to append to the list, assuming players_table.find_all("tr") returns at least one item.
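The answer's minimal fix is data.append(tr). Going one step further, a sketch that stores the cleaned-up cell text of each row (matching the comment about stripping newlines and spaces) could look like this:
data = []
for tr in players_table.find_all("tr"):
    # collect the text of each cell with surrounding whitespace removed
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # header/spacer rows contain no <td> cells
        data.append(cells)
print(data)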

The website uses JavaScript, which requests cannot execute, so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the ChromeDriver build that matches your installed Chrome version.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait 5 seconds for the page to load
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
players_table = soup.find("table", attrs={"class": "items"})
for tr in players_table.find_all('tr'):
    tds = ' '.join(td.get_text(strip=True) for td in tr.select('td'))
    print(tds)
driver.quit()
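If you want the rows as structured data rather than printed strings, the same loop can collect the cells into a list of lists and then, say, a pandas DataFrame. A minimal sketch (pandas is an extra dependency, not part of the original answer):
import pandas as pd

rows = []
for tr in players_table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip header/spacer rows without <td> cells
        rows.append(cells)

# Row lengths vary on this page (nested tables, header rows),
# so build the frame from the raw rows and tidy it up afterwards.
df = pd.DataFrame(rows)
print(df.head())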

Related

Beautiful Soup parsing table from React HTML

I'm trying to parse a table with orders from an HTML page.
Here is the HTML:
[screenshot of the page's HTML]
I need to get the data from those table rows.
Here is what I tried:
import requests
from bs4 import BeautifulSoup

# headers was defined earlier in the original post
response = requests.get('https://partner.market.yandex.ru/supplier/23309133/fulfillment/orders', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
q = soup.findAll('tr')
a = soup.find('tr')
print(q)
print(a)
But it gives me None. So, any idea how to get into those table rows?
I tried to iterate over each div in the HTML; once I get closer to the div which contains those tables, it gives me None as well.
Appreciate any help
Alright, I found a solution: using Selenium instead of the requests lib.
I don't have any idea why it doesn't work with requests, since it seems to do the same thing as Selenium (just send a GET request); most likely the page is rendered by JavaScript, which requests cannot execute. But with Selenium it works.
So here is what I do:
driver = webdriver.Chrome(r"C:\Users\Booking\PycharmProjects\britishairways\chromedriver.exe")
driver.get('https://www.britishairways.com/travel/managebooking/public/ru_ru')
time.sleep(15) # make an authorization
res = driver.page_source
print(res)
soup = BeautifulSoup(res, 'lxml')
b = soup.find_all('tr')
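To actually read those rows, loop over b and pull out the cell text; a minimal sketch, assuming the rows contain <td> cells:
for tr in b:
    # join the text of every cell in the row
    print(' | '.join(td.get_text(strip=True) for td in tr.find_all('td')))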

Python Web Scraping | How to scrape data from multiple URLs by choosing page number as a range with Beautiful Soup and Selenium?

from selenium import webdriver
import time
from bs4 import BeautifulSoup as Soup
driver = webdriver.Firefox(executable_path='C://Downloads//webdrivers//geckodriver.exe')
a = 'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page='
for c in range(8):
    #a = f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={c}'
    cd = driver.get(a+str(c))
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    fetch_data = bs.find_all('div', {'class': 's-expand-height.s-include-content-margin.s-latency-cf-section.s-border-bottom'})
    for f_data in fetch_data:
        product_name = f_data.find('span', {'class': 'a-size-medium.a-color-base.a-text-normal'})
        print(product_name + '\n')
The problem here is that the webdriver successfully visits the pages but doesn't produce any output or an error.
I don't know where I'm going wrong.
Any suggestions, or a reference to an article that addresses this problem, will always be welcome.
You are not selecting the right div tag to fetch the products with BeautifulSoup, which is why there is no output.
Try the following snippet:
# range of pages
for page in range(1, 20):
    driver.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}')
    page_source = driver.page_source
    bs = Soup(page_source, 'html.parser')
    # get the search results
    products = bs.find_all('div', {'data-component-type': "s-search-result"})
    # for each product in the search results, print the product name
    for product in products:
        for product_name in product.find('span', class_="a-size-medium a-color-base a-text-normal"):
            print(product_name)
You can print bs or fetch_data to debug.
In any case, in my opinion you can use requests or urllib to get the page source instead of Selenium.
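A rough sketch of that requests-based idea follows. Note the assumptions: Amazon often blocks plain HTTP clients or serves a captcha, so a browser-like User-Agent header is set here and results are not guaranteed.
import requests
from bs4 import BeautifulSoup as Soup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: may still be blocked or captcha'd
for page in range(1, 8):
    r = requests.get(f'https://www.amazon.com/s?k=Mobile&i=amazon-devices&page={page}',
                     headers=headers)
    bs = Soup(r.text, 'html.parser')
    for product in bs.find_all('div', {'data-component-type': 's-search-result'}):
        name = product.find('span', class_='a-size-medium a-color-base a-text-normal')
        if name:
            print(name.get_text(strip=True))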

Getting issue with Python web scraping

I am new to Python and web scraping. I wrote some code for scraping quotes and the corresponding author names from https://www.brainyquote.com/topics/inspirational-quotes and ended up with no result. Here is the code I used:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\Users\Sandheep\Desktop\chromedriver.exe")
product = []
prices = []
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
for a in soup.findAll("a", href=True, attrs={"class": "clearfix"}):
    quote = a.find("a", href=True, attrs={"title": "view quote"}).text
    author = a.find("a", href=True, attrs={"class": "bq-aut"}).text
    product.append(quote)
    prices.append(author)
print(product)
print(prices)
I am not sure what I need to edit to get the result.
Thanks in advance!
As I understand it, the site has this information in the alt attribute of its images, with the quote and author separated by ' - '.
So you need to iterate over soup.find_all('img'); the function to fetch the results may look like this:
def fetch_quotes(soup):
    for img in soup.find_all('img'):
        try:
            quote, author = img['alt'].split(' - ')
        except ValueError:
            pass
        else:
            yield {'quote': quote, 'author': author}
Then, use it like: print(list(fetch_quotes(soup)))
Also note that you can often replace Selenium with pure requests, e.g.:
import requests
from bs4 import BeautifulSoup
content = requests.get("https://www.brainyquote.com/topics/inspirational-quotes").content
soup = BeautifulSoup(content, "lxml")
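The generator from above can then be reused unchanged on this requests-based soup:
for item in fetch_quotes(soup):
    print(item['quote'], '-', item['author'])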
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"ChromeDriver path")
driver.get("https://www.brainyquote.com/topics/inspirational-quotes")
content = driver.page_source
soup = BeautifulSoup(content, "lxml")
root_tag=["div", {"class":"m-brick grid-item boxy bqQt r-width"}]
quote_author=["a",{"title":"view author"}]
quote=[]
author=[]
all_data = soup.findAll(root_tag[0], root_tag[1])
for div in all_data:
    try:
        quote.append(div.find_all("a", {"title": "view quote"})[1].text)
        author.append(div.find(quote_author[0], quote_author[1]).text)
    except:
        continue
The output will be:
for i in range(len(author)):
    print(quote[i])
    print(author[i])
    break
Start by doing what's necessary; then do what's possible; and suddenly you are doing the impossible.
Francis of Assisi

Using Python to Scrape Sky Cinema List

I'd like to gather a list of films and their links to all available movies on Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link. Especially as there are several pages to click through - possibly based on scroll position (in the URL?)
I've tried using BeautifulSoup and Python, but there is no output. The code I have tried would, at best, only return the title, and I'd like both the title and the link to the film. As these are in different areas on the site, I am unsure how this is done.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class":"sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)
spans=page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
    print(span.text)
I'd like the output to show as the title, link.
EDIT:
I have just tried the following, but I get an error that "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in the browser's network tab, and you can get all the results with one call. You can set the limit parameter to a number greater than the expected result count:
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the result count you can see on the page:
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
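If you want a file rather than printed output, the DataFrame can be written out directly (a usage note; the filename is just an example):
df.to_csv('sky_movies.csv', index=False)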
First of all, read the terms and conditions of the site you are going to scrape.
Next, you need Selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]
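To get the requested "title, link" output, the two lists can be zipped together (a sketch; it assumes the title spans and the result-pod anchors come back in the same order):
for title, link in zip(titles, links):
    print(title + ', ' + link)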

Extract data from BSE website

How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3, and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libraries which make scraping easier, like BeautifulSoup, Selenium, requests, lxml, etc., but I don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all the tables in the webpage, which I could then filter further to get the required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another code. Same problem.
Edit 2:
Tried Selenium, but I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table=driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source, and then pass that page source to BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable.
The soup object contains the full page source, including the elements that were added dynamically, so you can parse it to get all the values you mentioned.
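As a sketch of that last step (my own addition, not part of the original answer; openpyxl must be installed for .xlsx output), you could pull the rows out of the table and write them to an Excel file with pandas:
import pandas as pd

rows = []
for tr in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # skip rows without data cells
        rows.append(cells)

# The page's nested tables make the columns uneven, so write the raw
# rows and tidy them up afterwards in pandas or Excel.
pd.DataFrame(rows).to_excel('securitywise_delivery.xlsx', index=False, header=False)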
