BeautifulSoup: how to iterate over a table - Python

I am trying to extract data from a dynamic table with the following structure:
Team 1 - Score - Team 2 - Minute first goal.
It is a table of soccer match results with about 10 matches per table and one table per matchday. This is an example of the website I'm working with: https://www.resultados-futbol.com/premier/grupo1/jornada1
For this I am trying web scraping with BeautifulSoup in Python. Although I've made good progress, I'm running into a problem. I would like to write code that iterates over each row of the table, cell by cell, and appends each value to a list, so that I would have, for example:
List Team 1: Real Madrid, Barcelona
Score list: 1-0, 1-0
List Team 2: Atletico Madrid, Sevilla
First goal minutes list: 17', 64'
Once I have the lists, my intention is to build a complete dataframe with all the extracted data. However, I have a problem with matches that end 0-0: the 'Minute first goal' column contains nothing, so nothing is extracted, the lists end up with different lengths, and filling that value in my dataframe fails with an error. To continue with the previous example, imagine that the second game ended 0-0: the first-goal-minutes list would then contain only one value (17').
In my mind, the solution is a loop that reads the data cell by cell, with a condition on 'Score': if it is 0-0, append a placeholder value such as 'No goals' to the first-goal-minutes list.
This is the code I am using. I paste only the part in which I would like to create the loop:
page = BeautifulSoup(driver.page_source, 'html.parser')  # Selenium is needed first to expand some buttons on the page
table = page.find('div', class_='contentitem').find_all('tr', class_='vevent')

teams1 = []
teams2 = []
scores = []

for cell in table:
    # each row holds the home team, away team and score in separate cells
    team1 = cell.find('td', class_='team1')
    teams1.append(team1.text)
    team2 = cell.find('td', class_='team2')
    teams2.append(team2.text)
    score = cell.find('span', class_='clase')
    scores.append(score.text)
It is not clear to me how to iterate over the table so that the content of each cell ends up in the right list, and it is essential to include the condition "when the score cell is 0-0, add a no-goals entry to the list".
If someone could help me, I would be very grateful. Best regards

You are close to your goal, but you can optimize your script a bit.
Do not use several separate lists, just use one:
data = []
Try to get all the information in one loop: there is a td that contains everything you need, so push a dict onto your list:
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })
Then push your data into a DataFrame:
pd.DataFrame(data)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.resultados-futbol.com/premier/grupo1/jornada1'
driver.get(url)

soup = BeautifulSoup(driver.page_source, 'html.parser')  # Selenium is needed first to expand some buttons on the page

data = []
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score if score != '0-0' else 'No goals'
    })

pd.DataFrame(data)
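The question also asks for the minute of the first goal, which the same loop can carry as a fourth key. A minimal sketch follows; the '.minsegundo1' selector is a hypothetical class name and must be checked against the page's real markup in the browser inspector:
data = []
for row in soup.select('tr.vevent .rstd'):
    teams = row.select_one('.summary').get_text().split(' - ')
    score = row.select_one('.clase').get_text()
    minute_el = row.select_one('.minsegundo1')  # hypothetical selector, verify against the live page
    data.append({
        'team1': teams[0],
        'team2': teams[1],
        'score': score,
        'first_goal': minute_el.get_text() if minute_el is not None and score != '0-0' else 'No goals'
    })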

Related

How to fetch all elements from HTML having an h2 element, with the Python Selenium package?

Website: https://www.covid19india.org/state/MH
I want to fetch all district-wise total cases and daily cases from the h2 or h5 tags, as per the HTML code. I tried By.CLASS_NAME and By.TAG_NAME but I am not getting the elements.
I used the two lines below to get the district data:
state = browser.find_element(By.CLASS_NAME, "State")
districts = state.find_elements(By.CLASS_NAME, "district")
and a for loop to get the cases data:
for district in districts:
    # tag_district = district.find_element(By.XPATH, "h5")
    # tag_district = district.find_element(By.TAG_NAME, "//h5")
    # tag_district = district.find_elements_by_tag_name('h5')
    # district_name = tag_district.text
    # print(tag_district)
Nothing works, which leaves me stuck, since I ultimately want to export the data as a CSV.
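No answer was posted, but here is a minimal sketch of one approach. It assumes each district element wraps an h5 with the district name and an h2 with the total count, which is the structure implied by the question's commented-out attempts; verify the tag layout against the live page:
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://www.covid19india.org/state/MH")

state = browser.find_element(By.CLASS_NAME, "State")
districts = state.find_elements(By.CLASS_NAME, "district")

rows = []
for district in districts:
    # find_element (singular) returns one element; the plural
    # find_elements returns a list, which has no .text attribute
    name = district.find_element(By.TAG_NAME, "h5").text
    total = district.find_element(By.TAG_NAME, "h2").text
    rows.append((name, total))

with open("districts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["district", "total_cases"])
    writer.writerows(rows)

browser.quit()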

How can I scrape the daily price changes from Coinmarketcap with requests_html?

I want to get the daily price changes from coinmarketcap for all coins available on the website.
I have tried to scrape the daily changes into a list, but somehow I'm getting the hourly, daily and weekly changes all in the list. The code I used:
import requests
from requests_html import HTML, HTMLSession

r = HTMLSession().get('https://coinmarketcap.com/all/views/all/')
table = r.html.find('tbody')
delta_list = []
for row in table:
    change = row.find('.percent-change')
    for d in change:
        delta = d.text
        delta_list.append(delta)
print(delta_list)
How can I scrape only the daily changes?
Since requests_html supports xpath...
from requests_html import HTMLSession

r = HTMLSession().get('https://coinmarketcap.com/all/views/all/')
# get the table by id
table = r.html.xpath('//*[@id="currencies-all"]')
# filter the table rows to tr elements with an id
rows = table[0].xpath('//tr[@id]')
# your list of results
delta_list = []
# iterate over the resulting rows
for row in rows:
    # get the cryptocurrency name
    name = row.xpath('//*[@class="no-wrap currency-name"]')[0].text.replace('\n', ' ')
    # get the element which contains the 24h change data
    val_elem = row.xpath('//*[@data-timespan="24h"]')
    # some currencies are too fresh to have a 24h result and show '?';
    # such elements lack the @data-timespan="24h" attribute, so if the
    # result is empty something should be done -- I decided to add 0
    val = val_elem[0].text if val_elem else 0
    # just a debug print
    print(f"Change of {name} in the past 24h is {val}")
    # add the result to your list
    delta_list.append(val)
As a side note, using a list to store the results is not the best choice. The currencies are sorted by market cap, and the order of some currencies may change from day to day. A dict (or OrderedDict) would be a better choice, because that way you can pair each currency with its value...
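For illustration, a small variant of the loop above that stores results keyed by name rather than by position (same assumptions as the code above):
delta_by_name = {}
for row in rows:
    name = row.xpath('//*[@class="no-wrap currency-name"]')[0].text.replace('\n', ' ')
    val_elem = row.xpath('//*[@data-timespan="24h"]')
    # keyed storage survives the table being reordered between runs
    delta_by_name[name] = val_elem[0].text if val_elem else 0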

How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desired Data

Apologies in advance for the long question; I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.
I am trying to identify specific data points from SEC filings on a routine basis, but I want to automate this instead of having to manually search for a company's CIK ID and form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse it because it is in XBRL format, which is a bit out of my wheelhouse. Instead, I came across this script (kindly provided at https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys

# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholder's equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples (I think that's what it's called); it looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently, so that I end up with a list of the desired values for each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, but I am working with Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function that wraps essentially all of the code you have posted and takes your three values as arguments. Rather than hard-coding the three values, you pass them in and return the result.
Then you take the list you created and write a simple for loop around it to call the function you defined with those three values, and do something with each result.
def get_data(value1, value2, value3):
    # your main code here, but using the three arguments
    # in place of the hard-coded cik, type and dateb
    return content

for value1, value2, value3 in companies:
    content = get_data(value1, value2, value3)
    # do something with content
Assuming you have a dataframe sec with correctly named columns for your list of filings, above, you first need to extract from the dataframe the relevant information into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url with the items inserted, and get your data:
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
And go from there.
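Putting the two suggestions together, a hedged sketch of the full loop; parse_filing is a hypothetical name for a function wrapping the posted script's parsing steps:
results = []
for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)
    # parse_filing is hypothetical: it would hold the document-link,
    # XBRL-link and tag-extraction steps from the script above
    results.append(parse_filing(edgar_resp.text))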

Python - Combine two single-column lists into one dual-column list and print

I'm just beginning to dabble with Python, and as many have done I am starting with a web-scraping example to try the language.
I have seen many examples of using zip and map to combine lists, but I am having trouble getting the combined list to print.
Again, I am new so please be gentle.
The code gathers everything from 2 certain tag types (the date and title of a post) and returns them as 2 lists. For this I am using BeautifulSoup and requests.
The site I am practicing on for this test is the blog for a small game called 'Staxel'.
I can get my code to print a full list of one tag using soup.find and print in a for loop, but when I attempt to add a second list to print, the program simply terminates with no error.
Any tips on how to correctly print the two lists?
I am looking for output like
Entry 2019-01-06 New Years
Entry 2018-11-30 Staxel Changelog for 1.3.52
# import libraries
import requests
import ssl
from bs4 import BeautifulSoup
# set the URL string
quote_page = 'https://blog.playstaxel.com'
# query the website and return the html to give us a 'page' variable
page = requests.get(quote_page)
# parse the html using beautiful soup and store in a variable ... 'soup'
soup = BeautifulSoup(page.content, 'lxml')
# find the title and date elements and get their text
title_box = soup.find_all('h1', attrs={'class': 'entry-title'})
date_box = soup.find_all('span', attrs={'class': 'entry-date published'})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
date_list = zip(dates, titles)
for heading in date_list:
    print("Entry {}")
The problem is that your query for dates returns an empty list, so the zipped result is also empty. To extract the date from that page, you want to look for tags of type time, not span, with the class entry-date published, like this:
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
So with the following code:
import requests
from bs4 import BeautifulSoup
quote_page = "https://blog.playstaxel.com"
page = requests.get(quote_page)
soup = BeautifulSoup(page.content, "lxml")
title_box = soup.find_all("h1", attrs={"class": "entry-title"})
date_box = soup.find_all("time", attrs={"class": "entry-date published"})
titles = [title.text.strip() for title in title_box]
dates = [date.text.strip() for date in date_box]
for date, title in zip(dates, titles):
    print(f"{date}: {title}")
The result becomes:
2019-01-10: Magic update – feature preview
2019-01-06: New Years
2018-11-30: Staxel Changelog for 1.3.52
2018-11-13: Staxel Changelog for 1.3.49
2018-10-21: Staxel Changelog for 1.3.48
2018-10-12: Halloween Update & GOG

Python scraping from a website: selecting tr elements based on multiple class attrs

I am scraping from the following page: https://kenpom.com/index.php?y=2018
The page shows a list of every Division I college basketball team. Each row is for one team. I want to assign every team row to a variable called "teams". The problem is that after every 40 teams there are two header rows that I do not want to include. These rows are unique in that they have a class of "thead1" or "thead2". The rows that I want to grab have a class of None or "bold-bottom". So essentially I need to iterate through every tr element in the table and grab any that has a class of None or "bold-bottom". My attempt below does not work: it returns a count of 35 when it should be 353.
import requests
from bs4 import BeautifulSoup

url = 'https://kenpom.com/index.php?y=2018'
r = requests.get(url).text
soup = BeautifulSoup(r, 'lxml')
table = soup.find('table', {'id': 'ratings-table'}).tbody
teams = table.findAll('tr', attrs={'class': (None or 'bold-bottom')})
print(len(teams))
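No answer was posted, but the cause is visible in the code itself: the Python expression (None or 'bold-bottom') evaluates to just 'bold-bottom', so rows with no class attribute are never matched. One sketch of a fix is to pass a callable as class_, which BeautifulSoup calls with each row's class value:
# match rows whose class is either absent (None) or exactly 'bold-bottom',
# which skips the 'thead1'/'thead2' header rows
teams = table.find_all('tr', class_=lambda c: c is None or c == 'bold-bottom')
print(len(teams))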
