Python - Beginner Scraping with Beautiful Soup 4 - onmouseover

I'm a beginner Python (3) user and I'm currently trying to scrape some sports stats for my fantasy football season. Previously I did this in a roundabout way (downloading with HTTrack, converting to Excel and using VBA to combine my data), but now I'm trying to learn Python to improve my coding abilities.
I want to scrape this page but I'm running into some difficulty in selecting only the rows/tables I want. Here is how my code currently stands; it still has a few leftover bits from where I've been playing around with it.
from urllib.request import urlopen  # import the library
from bs4 import BeautifulSoup  # import BS
from bs4 import SoupStrainer  # import SoupStrainer
page = urlopen('http://www.footywire.com/afl/footy/ft_match_statistics?mid=6172')  # access the website
only_tables = SoupStrainer('table')  # parse only table elements
soup = BeautifulSoup(page, 'html.parser', parse_only=only_tables)  # parse the html, restricted by the strainer
# for row in soup('table', {'class': 'tbody'})[0].tbody('tr'):
#     tds = row('td')
#     print(tds[0].string, tds[1].string)
# create variables to keep the data in
team = []
player = []
kicks = []
handballs = []
disposals = []
marks = []
goals = []
tackles = []
hitouts = []
inside50s = []
freesfor = []
freesagainst = []
fantasy = []
supercoach = []
table = soup.find_all('tr')
# print(soup.prettify())
print(table)
Right now I can select every 'tr' on the page; however, I'm having trouble selecting only the rows which have the following attributes:
<tr bgcolor="#ffffff" onmouseout="this.bgColor='#ffffff';" onmouseover="this.bgColor='#cbcdd0';">
"onmouseover" seems to be the only attribute which is common/unique to the table I want to scrape.
Does anyone know how I can alter this line of code, to select this attribute?
table = soup.find_all('tr')
From here I am confident I can place the data into a dataframe which hopefully I can export to CSV.
Any help would be greatly appreciated as I have looked through the BS4 documentation with no luck.

As explained in the BeautifulSoup documentation, you may use this:
table = soup.findAll("tr", {"bgcolor": "#ffffff", "onmouseout": "this.bgColor='#ffffff';", "onmouseover": "this.bgColor='#cbcdd0';"})
You can also use the following approach:
tr_tags = soup.findAll(lambda tag: tag.name == "tr" and tag.get("bgcolor") == "#ffffff" and tag.get("onmouseout") == "this.bgColor='#ffffff';" and tag.get("onmouseover") == "this.bgColor='#cbcdd0';")
The advantage of the lambda approach is that it uses the full power of BS4's matching and lets you combine arbitrary conditions in a single pass.

Check this:
soup.find_all("tr", attrs={"onmouseover" : "this.bgColor='#cbcdd0';"})
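From there, a minimal sketch of getting those rows into a DataFrame and out to CSV, assuming pandas is installed (the column names depend on the page's header row, so none are hard-coded here):
import pandas as pd

rows = soup.find_all("tr", attrs={"onmouseover": "this.bgColor='#cbcdd0';"})
# Each row becomes a list of cell strings; rows without td cells come out empty
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows]
records = [r for r in records if r]  # drop the empty ones

df = pd.DataFrame(records)  # rename the columns once you've checked the table header
df.to_csv("match_stats.csv", index=False)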

Related

BeautifulSoup is not gathering all HTML (Python)

I have made a web-scraping script in Python. The job is to go through many sofascore.com pages to gather information, using BeautifulSoup and Playwright.
However, when the loop runs through all my Sofascore pages, there are two types of situations: the first type lets me gather all the information, and the second type does not. I have inspected both types of pages and they have the same elements. My code is:
import time
from playwright.sync_api import sync_playwright
import pandas as pd
from bs4 import BeautifulSoup

HomeGoal = []
AwayGoal = []
HomeTeam = []
AwayTeam = []

with sync_playwright() as p:
    # headless = False, slow_mo=50
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto(THEPAGES)  # THEPAGES: placeholder for the URL currently being looped over
    time.sleep(1)
    page.is_visible('//div[contains(@class, "sc-18688171-0 sc-7d450bff-4 fXAhuT fBSHnS")]')
    HTML = page.inner_html('//div[contains(@class, "sc-cd4cfbdc-0 hDkGff")]')
    Soup = BeautifulSoup(HTML, 'html.parser')
    NotFirtst = 0
    for I in Soup:
        if len(I.text) > 0 and NotFirtst != 1:
            NotFirtst = NotFirtst + 1
            Home = I.text.rsplit(" - ", 1)[0]
            Away = I.text.rsplit(" - ", 1)[1]
    # This below will gather information about the matches
    HMTL = page.inner_html('//div[contains(@class, "sc-4b793701-0 dTwLyM u-overflow-hidden")]')
    Soup = BeautifulSoup(HMTL, 'html.parser')
    # The information for previous matches
    for I in Soup.find_all(class_="sc-c2090177-0 dLUwVT"):
        print(I.text)
    # Information is gathered
This code works fine with pages such as:
https://www.sofascore.com/ymir-kopavogur-knattspyrnufelag-rangaeinga/EvvsEIO
but it is not working on pages like https://www.sofascore.com/gimnasia-y-esgrima-csyd-liniers/fobsQgCb
Using this code to test whether all information is gathered informs me that it gathered everything on the first page, but not on the second:
HMTL = page.inner_html('//div[contains(@class, "sc-4b793701-0 dTwLyM u-overflow-hidden")]')
Soup = BeautifulSoup(HMTL, 'html.parser')
print(Soup)
As far as I can tell this should work fine, and I cannot find this problem reported anywhere else, given that the elements exist on both pages.
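One thing worth ruling out (an assumption on my part, not something established in the question): page.is_visible() only checks the current state and returns immediately, it does not wait for the element to appear, so on a slower page the div may not be rendered yet when inner_html() runs. A minimal sketch with an explicit wait, reusing the selector from the question:
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    page.goto("https://www.sofascore.com/gimnasia-y-esgrima-csyd-liniers/fobsQgCb")
    try:
        # Block until the node is attached and visible, or give up after 10 seconds
        page.wait_for_selector('//div[contains(@class, "sc-4b793701-0 dTwLyM u-overflow-hidden")]', timeout=10000)
        print(page.inner_html('//div[contains(@class, "sc-4b793701-0 dTwLyM u-overflow-hidden")]'))
    except PlaywrightTimeout:
        print("The element never appeared on this page")
    browser.close()
If the wait times out even with a generous timeout, the difference between the two pages is likely in the markup (the auto-generated sc-* class names can differ per page) rather than in timing.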

Python: Get element next to href

Python code:
import string
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
initial = list(string.ascii_lowercase)
initial_url = [url + i for i in initial]
html_initial = [urllib.request.urlopen(i).read() for i in initial_url]
soup_initial = [BeautifulSoup(i, 'html.parser') for i in html_initial]
tags_initial = [i('a') for i in soup_initial]
print(tags_initial[0][50])
Results example:
<a href="...">Shareef Abdur-Rahim</a>
From the example above, I want to extract the name of the player, which is 'Shareef Abdur-Rahim', but I want to do it for all of the tags_initial lists.
Does anyone have an idea?
Could you modify your post by adding your code so that we can help you better?
Maybe this could help you:
name = soup.findAll(YOUR_SELECTOR)[0].string
UPDATE
import re
import string
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = 'https://www.basketball-reference.com/players/'
# Alphabet
initial = list(string.ascii_lowercase)
datas = []
# URLs
urls = [url + i for i in initial]
for url in urls:
    # Soup object
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Player links
    url_links = soup.findAll("a", href=re.compile("players"))
    for link in url_links:
        # Player name
        datas.append(link.string)
print("datas : ", datas)
Then "datas" contains all the names of the players, but I advise you to do a little post-processing afterwards to remove erroneous entries like "..." and perhaps duplicates.
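For example, a small cleanup pass over the datas list built above might look like this (just a sketch):
# Drop None entries, placeholder strings like "...", and duplicates, preserving order
cleaned = []
for name in datas:
    if name and name != "..." and name not in cleaned:
        cleaned.append(name)
print("cleaned : ", cleaned)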
There are probably better ways but I'd do it like this:
html = '<a href="/teams/LAL/2021.html">Los Angeles Lakers</a>'
index = html.find("a href")
index = html.find(">", index) + 1
index_end = html.find("<", index)
print(html[index:index_end])
If you're using a scraper library it probably has a similar function built-in.

How can I get data from a website using BeautifulSoup and requests?

I am a beginner at web scraping, and I need help with this problem.
The website, allrecipes.com, is a website where you can find recipes based on a search, which in this case is 'pie':
Link to the HTML file:
'view-source:https://www.allrecipes.com/search/results/?wt=pie&sort=re'
(right click -> view page source)
I want to create a program that takes an input, searches it on allrecipes, and returns a list with tuples of the first five recipes, with data such as the time it takes to make, serving yield, ingredients, and more.
This is my program so far:
import requests
from bs4 import BeautifulSoup

def searchdata():
    inp = input('what recipe would you like to search')
    url = 'http://www.allrecipes.com/search/results/?wt=' + str(inp) + '&sort=re'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    links = []
    # fill in code for finding top 3 or five links
    for i in range(3):
        a = requests.get(links[i])
        soupa = BeautifulSoup(a.text, 'html.parser')
        # fill in code to find name, ingredients, time, and serving size with data from soupa
        names = []
        time = []
        servings = []
        ratings = []
        ingredients = []

searchdata()
Yes, I know my code is very messy, but what should I fill in in the two fill-in areas?
Thanks
After searching for the recipe you have to get the link of each result and then request each of those links again, because the information you're looking for is not available on the search page. That would not look clean without OOP, so here's the class I wrote that does what you want.
import requests
from bs4 import BeautifulSoup

class Scraper:
    links = []
    names = []

    def get_url(self, url):
        url = requests.get(url)
        self.soup = BeautifulSoup(url.content, 'html.parser')

    def print_info(self, name):
        self.get_url(f'https://www.allrecipes.com/search/results/?wt={name}&sort=re')
        if self.soup.find('span', class_='subtext').text.strip()[0] == '0':
            print(f'No recipes found for {name}')
            return
        results = self.soup.find('section', id='fixedGridSection')
        articles = results.find_all('article')
        texts = []
        for article in articles:
            txt = article.find('h3', class_='fixed-recipe-card__h3')
            if txt:
                if len(texts) < 5:
                    texts.append(txt)
                else:
                    break
        self.links = [txt.a['href'] for txt in texts]
        self.names = [txt.a.span.text for txt in texts]
        self.get_data()

    def get_data(self):
        for i, link in enumerate(self.links):
            self.get_url(link)
            print('-' * 4 + self.names[i] + '-' * 4)
            info_names = [div.text.strip() for div in self.soup.find_all(
                'div', class_='recipe-meta-item-header')]
            ingredient_spans = self.soup.find_all('span', class_='ingredients-item-name')
            ingredients = [span.text.strip() for span in ingredient_spans]
            for i, div in enumerate(self.soup.find_all('div', class_='recipe-meta-item-body')):
                print(info_names[i].capitalize(), div.text.strip())
            print()
            print('Ingredients'.center(len(ingredients[0]), ' '))
            print('\n'.join(ingredients))
            print()
            print('*' * 50, end='\n\n')

chrome = Scraper()
chrome.print_info(input('What recipe would you like to search: '))

Unusual results from BeautifulSoup4 from a website [duplicate]

I've been working with BeautifulSoup lately. I'm trying to get the data from the https://www.pro-football-reference.com/teams/mia/2000_roster.htm site. Specifically, all I want is the player name and 'gs' (games started).
However, when doing it, it only returns data from the first ('Starters') table. I'm actually not interested in that top table at all; I want the second table, titled 'Roster'.
Here's the code I was using. Like I said, I didn't really want/need anything other than player name and games started, but I was just practicing and learning BeautifulSoup.
import pandas as pd
import requests
import bs4

alpha = requests.get('https://www.pro-football-reference.com/teams/mia/2000_roster.htm')
beta = bs4.BeautifulSoup(alpha.text, 'lxml')
gama = beta.findAll('th', {'data-stat': 'pos'})
position = [th.text for th in gama]
position = position[1:]
position = list(filter(None, position))
gama = beta.findAll('td', {'data-stat': 'player'})
player = [td.text for td in gama]
player = player[1:]
while 'Defensive Starters' in player: player.remove('Defensive Starters')
while 'Special Teams Starters' in player: player.remove('Special Teams Starters')
gama = beta.findAll('td', {'data-stat': 'age'})
age = [td.text for td in gama]
age = list(filter(None, age))
gama = beta.findAll('td', {'data-stat': 'gs'})
gs = [td.text for td in gama]
gs = list(filter(None, gs))
target = pd.DataFrame({
    'player_name': player,
    'position': position,
    'gs': gs,
    'age': age,
})
Anyone see where I'm going wrong? Or maybe an alternative way to go about it?
To get the content of that table you need a browser simulator, because that portion of the page is generated dynamically. Data from the first table is easily accessible without any browser simulator, though. I tried selenium in this case:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text
    gs = items.select("[data-stat='gs']")[0].text
    print(player, gs)
driver.quit()
Partial output:
Player  GS
Trace Armstrong* 0
John Bock 1
Tim Bowens 15
Lorenzo Bromell 0
Autry Denson 0
Mark Dixon 15
Kevin Donnalley 16
If for some reason you encounter an IndexError (some rows do not contain those cells), this version guards each lookup so the error cannot occur:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
page_url = "https://www.pro-football-reference.com/teams/mia/2000_roster.htm"
driver.get(page_url)
soup = BeautifulSoup(driver.page_source, "lxml")
table = soup.select(".table_outer_container")[1]
for items in table.select("tr"):
    player = items.select("[data-stat='player']")[0].text if items.select("[data-stat='player']") else ""
    gs = items.select("[data-stat='gs']")[0].text if items.select("[data-stat='gs']") else ""
    print(player, gs)
driver.quit()
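As a side note, on sports-reference sites the tables after the first one are often shipped inside HTML comments and only unwrapped by JavaScript, so it may be possible to skip the browser entirely by re-parsing the comments. A sketch under that assumption, reusing the data-stat selectors from above:
import requests
from bs4 import BeautifulSoup, Comment

html = requests.get('https://www.pro-football-reference.com/teams/mia/2000_roster.htm').text
soup = BeautifulSoup(html, 'lxml')
# Re-parse every HTML comment that hides a table and pull the same cells
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        inner = BeautifulSoup(comment, 'lxml')
        for row in inner.select('tr'):
            player = row.select_one("[data-stat='player']")
            gs = row.select_one("[data-stat='gs']")
            if player and gs:
                print(player.text, gs.text)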

I need help web-scraping

So I wanted to scrape visualizations from visual.ly; however, right now I do not understand how the "show more" button works. As of now, my code will get the image link, the text next to the image, and the link of the page. I was wondering how the "show more" button functions, because I was going to try to loop through using the number of pages. As of now I do not know how I would loop through each one individually. Any ideas on how I could loop through and get more images than they originally show you?
from BeautifulSoup import BeautifulSoup
import urllib2
import HTMLParser
import urllib, re

counter = 1
columnno = 1
parser = HTMLParser.HTMLParser()
soup = BeautifulSoup(urllib2.urlopen('http://visual.ly/?view=explore&type=static#v2_filter').read())
image = soup.findAll("div", attrs={'class': 'view-mode-wrapper'})
if columnno < 4:
    column = image[0].findAll("div", attrs={'class': 'v2_grid_column'})
    columnno += 1
else:
    column = image[0].findAll("div", attrs={'class': 'v2_grid_column last'})
visualizations = column[0].findAll("div", attrs={'class': '0 v2_grid_item viewmode-item'})
getImage = visualizations[0].find("a")
print counter
print getImage['href']
soup1 = BeautifulSoup(urllib2.urlopen(getImage['href']).read())
theImage = soup1.findAll("div", attrs={'class': 'ig-graphic-wrapper'})
text = soup1.findAll("div", attrs={'class': 'ig-content-right'})
getText = text[0].findAll("div", attrs={'class': 'ig-description right-section first'})
imageLink = theImage[0].find("a")
print imageLink['href']
print getText
for row in image:
    theImage = image[0].find("a")
    actually_download = False
    if actually_download:
        link = imageLink['href']  # 'link' was undefined in the original; using the image href
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)
    counter += 1
You cannot use a urllib-parser combo here because the site uses JavaScript to load more content. In order to do this you will need a full browser emulator (with JavaScript support). I have never used Selenium before, but I have heard that it does this, and it has a Python binding.
However, I have found that it uses a very predictable form
http://visual.ly/?page=<page_number>
for its GET requests. Perhaps an easier way would be to go under
<div class="view-mode-wrapper">...</div>
to parse the data (using the above URL format). After all, the AJAX requests must go to some location.
Then you could do
for i in xrange(<whatever>):
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    # do whatever you want from here
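Putting those two ideas together, a rough sketch in the question's Python 2 setup (the 'view-mode-wrapper' class name is copied from the question and may have changed since):
import urllib2
from BeautifulSoup import BeautifulSoup

for i in xrange(1, 6):  # first five pages; widen the range as needed
    url = 'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    wrappers = soup.findAll("div", attrs={'class': 'view-mode-wrapper'})
    if not wrappers:
        break  # no more pages to load
    for anchor in wrappers[0].findAll("a"):
        print anchor.get('href')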
