Scrape data into a dataframe with BeautifulSoup - python

I'm working on a project to scrape and parse data from the California lottery into a dataframe.
Here's my code so far. It produces no error, but also no output:
import requests
from bs4 import BeautifulSoup as bs4
draw = 'http://www.calottery.com/play/draw-games/superlotto-plus/winning-numbers/?page=1'
page = requests.get(draw)
soup = bs4(page.text)
drawing_list = []
for table_row in soup.select("table.tag_even_numbers tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        draw_date = cells[0].text.strip()
        numbers = cells[1].text.strip()
        mega = cells[2].text.strip()
        drawings = {'dates': draw_date, 'winning_numbers': numbers, 'mega_number': mega}
        drawing_list.append(drawings)
        print("added {0} {1} {2}, to the list".format(draw_date, numbers, mega))
Expected Output: I'd love to scrape the table rows into a dataframe
draw_date | numbers | mega
-----------|----------------|-----
12/06/2017 | 12 24 07 01 02 | 23
12/02/2017 | 33 18 07 42 40 | 7
Thanks for any revision or assistance into the right direction.

This expression "table.tag_even_numbers tr" selects nothing because the table has no 'tag_even_numbers' class, but has a 'tag_even' class and a 'numbers' class.
So if you change this:
soup.select("table.tag_even_numbers tr")
to:
soup.select("table.tag_even.numbers tr")
you should have 20 items in drawing_list.
Also by using .text to select numbers you get all the numbers joined side by side in a string.
If you want a list of numbers you should use .stripped_strings instead, eg:
numbers = list(cells[1].stripped_strings)
Then you can create a dataframe from drawing_list, eg:
import pandas as pd
df = pd.DataFrame(drawing_list)
print(df.head())
                 dates  mega_number       winning_numbers
0   Dec 6, 2017 - 3201           23  [12, 24, 07, 01, 02]
1   Dec 2, 2017 - 3200            7  [33, 18, 07, 42, 40]
2  Nov 29, 2017 - 3199            6  [03, 33, 26, 27, 07]
3  Nov 25, 2017 - 3198           19  [21, 46, 13, 25, 17]
4  Nov 22, 2017 - 3197            3  [32, 40, 27, 42, 08]
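The selector fix and the `.stripped_strings` suggestion above can be checked offline. This is a minimal sketch on a made-up HTML fragment (the markup is an assumption, not the real lottery page): a table carrying the two classes `tag_even` and `numbers`, with each winning number in its own `span`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table with two separate classes.
html = """
<table class="tag_even numbers">
  <tr><th>Date</th><th>Numbers</th><th>Mega</th></tr>
  <tr><td>Dec 6, 2017 - 3201</td><td><span>12</span><span>24</span><span>07</span></td><td>23</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class named "tag_even_numbers" does not exist, so nothing matches.
wrong = soup.select("table.tag_even_numbers tr")
# Chaining the two class names matches the table and returns its rows.
right = soup.select("table.tag_even.numbers tr")

cells = right[1].find_all("td")
joined = cells[1].text.strip()              # all digits run together
separate = list(cells[1].stripped_strings)  # one string per number

print(len(wrong), len(right))
print(joined, separate)
```

In a CSS selector, `.a.b` means "has class a and class b", while `.a_b` means "has the single class a_b", which is why the original selector silently returned an empty list.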

Related

Web Scraping ESPN NFL webpage with Python

I am trying to scrape the ESPN website with Python to extract historical NFL game results (scores only) into a CSV file. I'm unable to find a way to add the dates as displayed in the desired output. Could someone help me find a way to get the desired output from the current output? The website I am scraping and the desired output are below:
NFL Website:
https://www.espn.com/nfl/scoreboard/_/week/17/year/2022/seasontype/2
Current Output:
Week #, Away Team, Away Score, Home Team, Home Score
Week 17, Cowboys, 27, Titans, 13
Week 17, Cardinals, 19, Falcons, 20
Week 17, Bears, 10, Lions, 41
Desired Game Results Output:
Week #, Date, Away Team, Away Score, Home Team, Home Score
Week 17, 12/29/2022, Cowboys, 27, Titans, 13
Week 17, 1/1/2023, Cardinals, 19, Falcons, 20
Week 17, 1/1/2023, Bears, 10, Lions, 41
Code:
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
daterange = 1
url_list = []
while daterange < 19:
    url = "https://www.espn.com/nfl/scoreboard/_/week/"+str(daterange)+"/year/2022/seasontype/2"
    url_list.append(url)
    daterange = daterange + 1
j = 1
away_team = []
home_team = []
away_team_score = []
home_team_score = []
week = []
for url in url_list:
    urlname = requests.get(url)
    bs = bs4.BeautifulSoup(urlname.text,'lxml')
    print(urlname.url)
    i = 0
    while True:
        try:
            name = bs.findAll('div',{'class':'ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db'})[i]
        except Exception:
            break
        name = name.get_text()
        try:
            score = bs.findAll('div',{'class':'ScoreCell__Score h4 clr-gray-01 fw-heavy tar ScoreCell_Score--scoreboard pl2'})[i]
        except Exception:
            break
        score = score.get_text()
        if i%2 == 0:
            away_team.append(name)
            away_team_score.append(score)
        else:
            home_team.append(name)
            home_team_score.append(score)
            week.append("week "+str(j))
        i = i + 1
    j = j + 1
web_scraping = list (zip(week, home_team, home_team_score, away_team, away_team_score))
web_scraping_df = pd.DataFrame(web_scraping, columns = ['week','home_team','home_team_score','away_team','away_team_score'])
web_scraping_df
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
week = 17
url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for board in soup.select('.ScoreboardScoreCell'):
    title = board.find_previous(class_='Card__Header__Title').text
    teams = [t.text for t in board.select('.ScoreCell__TeamName')]
    scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
    all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))
df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
print(df.to_markdown(index=False))
Prints:
| Week | Date                        | Team 1   | Score 1 | Team 2     | Score 2 |
|------|-----------------------------|----------|---------|------------|---------|
| 17   | Thursday, December 29, 2022 | Cowboys  | 27      | Titans     | 13      |
| 17   | Sunday, January 1, 2023     | Cardinals| 19      | Falcons    | 20      |
| 17   | Sunday, January 1, 2023     | Bears    | 10      | Lions      | 41      |
| 17   | Sunday, January 1, 2023     | Broncos  | 24      | Chiefs     | 27      |
| 17   | Sunday, January 1, 2023     | Dolphins | 21      | Patriots   | 23      |
| 17   | Sunday, January 1, 2023     | Colts    | 10      | Giants     | 38      |
| 17   | Sunday, January 1, 2023     | Saints   | 20      | Eagles     | 10      |
| 17   | Sunday, January 1, 2023     | Panthers | 24      | Buccaneers | 30      |
| 17   | Sunday, January 1, 2023     | Browns   | 24      | Commanders | 10      |
| 17   | Sunday, January 1, 2023     | Jaguars  | 31      | Texans     | 3       |
| 17   | Sunday, January 1, 2023     | 49ers    | 37      | Raiders    | 34      |
| 17   | Sunday, January 1, 2023     | Jets     | 6       | Seahawks   | 23      |
| 17   | Sunday, January 1, 2023     | Vikings  | 17      | Packers    | 41      |
| 17   | Sunday, January 1, 2023     | Rams     | 10      | Chargers   | 31      |
| 17   | Sunday, January 1, 2023     | Steelers | 16      | Ravens     | 13      |
| 17   | Monday, January 2, 2023     | Bills    | -       | Bengals    | -       |
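The scraped header gives long-form dates such as "Thursday, December 29, 2022", while the desired output in the question uses "12/29/2022". One way to bridge that gap, sketched here with only the standard library, is to parse the title with `datetime.strptime` and rebuild the short form by hand. The helper name `short_date` is made up for this example, and the format string assumes English day and month names.

```python
from datetime import datetime

def short_date(title):
    """Convert 'Thursday, December 29, 2022' to '12/29/2022' (no zero padding)."""
    parsed = datetime.strptime(title, "%A, %B %d, %Y")
    # Build the string manually: unpadded strftime codes differ between platforms.
    return f"{parsed.month}/{parsed.day}/{parsed.year}"

print(short_date("Thursday, December 29, 2022"))  # 12/29/2022
print(short_date("Sunday, January 1, 2023"))      # 1/1/2023
```

Applied to the dataframe above, this would be something like `df['Date'] = df['Date'].map(short_date)`.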

Using Python and BeautifulSoup to scrape list from an URL

I am new to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store each movie together with its release date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=ul.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and corresponding dates to a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date':last, 'Movie':tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})
df = pd.DataFrame(out)
print(df)
Prints:
             Date                          Movie
0     29 May 2020                       Romantic
1     29 May 2020       Sohreyan Da Pind Aa Gaya
2    05 June 2020                    Laxmmi Bomb
3    05 June 2020                   Roohi Afzana
4    05 June 2020                        Nikamma
..            ...                            ...
95  26 March 2021       Untitled Luv Ranjan Film
96  02 April 2021                             F9
97  02 April 2021                    Bell Bottom
98  02 April 2021  NTR Trivikiram Untitled Movie
99  09 April 2021                 Manje Bistre 3
[100 rows x 2 columns]
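The script above works by walking date headings and list items in document order and remembering the most recent heading. Here is a small offline sketch of that pattern on a made-up fragment (the markup is an assumption that only imitates the calendar layout; it is not the real IMDb page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each h4 date is followed by a ul of titles.
html = """
<div id="main">
  <h4>29 May 2020</h4>
  <ul><li><a>Romantic</a> (2020)</li><li><a>Sohreyan Da Pind Aa Gaya</a> (2020)</li></ul>
  <h4>05 June 2020</h4>
  <ul><li><a>Laxmmi Bomb</a> (2020)</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows, last_date = [], ""
# The grouped selector returns h4 and li tags interleaved in document order.
for tag in soup.select("#main h4, #main li"):
    if tag.name == "h4":
        last_date = tag.get_text(strip=True)  # remember the current date
    else:
        # Drop the trailing "(2020)" part and keep only the title.
        title = tag.get_text(strip=True).rsplit("(", maxsplit=1)[0]
        rows.append((last_date, title))

print(rows)
```

Because every list item is paired with the last heading seen, a date with several titles produces several rows, which is what `zip(date, ul)` in the question cannot do.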
I think you should replace "ul" with "h1" in the line that calls find_all('a'), and define the variable "movielist" before the loop.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
# add code here
movielist = []
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    # replace ul with h1 here
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
print(movielist)
You didn't define a list to receive the results, and the text should be captured from the loop variable 'h1' rather than from the 'ul' result set.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
The dates in the output still don't match the titles, so you need to fix the pairing logic. Several titles can share one release date, so the number of release dates and the number of titles don't line up. I can't help you much more because I don't have the time. Have a good night. The retrieved output looks like this:
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue

How to create a customized multi-index with different sub column headings using pandas in a dataframe

I have a dataset that contains multi-index columns with the first level consisting of a year divided into four quarters. How do I structure the index so as to have 4 sets of months under each quarter?
I found the following piece of code on Stack Overflow:
index = pd.MultiIndex.from_product([['S1', 'S2'], ['Start', 'Stop']])
print(pd.DataFrame([pd.DataFrame(dic).unstack().values], columns=index))
that gave the following output:
           S1                      S2
        Start        Stop       Start        Stop
0  2013-11-12  2013-11-13  2013-11-15  2013-11-17
However, it couldn't solve my requirement of having different sets of months under each quarter of the year.
My data looks like this:
           2015
           Q1              Q2            Q3             Q4
Country    jan Feb March   Apr May Jun   July Aug Sep   Oct Nov Dec
India      45  54  34      34  45  45    43   45  67    45  56  56
Canada     44  34  12      32  35  45    43   41  60    43  55  21
I wish to input the same structure of the dataset into pandas with the specific set of months under each quarter. How should I go about this?
You can also create a MultiIndex in a few other ways. One of these, which is useful if you have a complicated structure, is to construct it from an explicit set of tuples where each tuple is one hierarchical column. Below I first create all of the tuples that you need of the form (year, quarter, month), make a MultiIndex from these, then assign that as the columns of the dataframe.
import pandas as pd
year = 2015
months = [
    ("Jan", "Feb", "Mar"),
    ("Apr", "May", "Jun"),
    ("Jul", "Aug", "Sep"),
    ("Oct", "Nov", "Dec"),
]
tuples = [(year, f"Q{i + 1}", month) for i in range(4) for month in months[i]]
multi_index = pd.MultiIndex.from_tuples(tuples)
data = [
    [45, 54, 34, 34, 45, 45, 43, 45, 67, 45, 56, 56],
    [44, 34, 12, 32, 35, 45, 43, 41, 60, 43, 55, 21],
]
df = pd.DataFrame(data, index=["India", "Canada"], columns=multi_index)
df
#        2015
#          Q1          Q2          Q3          Q4
#         Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# India    45  54  34  34  45  45  43  45  67  45  56  56
# Canada   44  34  12  32  35  45  43  41  60  43  55  21
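Once the columns are a three-level MultiIndex, the quarter level can be used directly for selection and aggregation. The sketch below rebuilds the same frame and adds level names (`year`, `quarter`, `month`), which are an assumption made for this example and are not in the answer above. It then pulls out one quarter with `xs` and sums the months per quarter.

```python
import pandas as pd

months = [
    ("Jan", "Feb", "Mar"),
    ("Apr", "May", "Jun"),
    ("Jul", "Aug", "Sep"),
    ("Oct", "Nov", "Dec"),
]
tuples = [(2015, f"Q{i + 1}", m) for i, quarter in enumerate(months) for m in quarter]
# Level names are illustrative; they make the level= arguments below readable.
columns = pd.MultiIndex.from_tuples(tuples, names=["year", "quarter", "month"])
data = [
    [45, 54, 34, 34, 45, 45, 43, 45, 67, 45, 56, 56],
    [44, 34, 12, 32, 35, 45, 43, 41, 60, 43, 55, 21],
]
df = pd.DataFrame(data, index=["India", "Canada"], columns=columns)

# Cross-section: only the three months that sit under Q1.
q1 = df.xs("Q1", axis=1, level="quarter")

# Quarterly totals: group the transposed frame by the quarter level, then flip back.
totals = df.T.groupby(level="quarter").sum().T

print(q1)
print(totals)
```

Grouping on the transposed frame avoids the `axis=1` argument of `groupby`, which newer pandas versions deprecate.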

Return list from a dataframe of a value whose date matches the date from a list

So I have a pandas dataframe df_testing_set which looks like this sample:
Index  Ycurrent  date           bucket_id
.      245       June 17, 2017  45
.      235       June 17, 2017  46
.      265       June 18, 2017  47
.      235       June 18, 2017  48
.      225       June 19, 2017  49
.      205       June 20, 2017  50
.      215       June 21, 2017  51
.      212       June 22, 2017  52
.      225       June 23, 2017  53
.      257       June 24, 2017  54
.      236       June 25, 2017  55
.      248       June 26, 2017  56
.      245       June 27, 2017  57
.      245       June 27, 2017  58
and I have a list of 8 random dates from another dataframe that looks like this:
0. June 01, 2017
1. June 23, 2017
2. June 13, 2017
3. June 27, 2017
4. June 17, 2017
5. June 04, 2017
6. June 09, 2017
7. June 11, 2017
8. June 15, 2017
Given the data above, how do I (for each date in the date_list), select all the records for that specific date (From my code it looks like there's around 144 rows per date).
With this data I've been trying to get (x,y) where x is the value in bucket_id (Goes from 1-144) and y is the value in the field Ycurrent. The coordinates are then used with matplotlib to plot a line graph.
My graphs don't show when I try to plot them using matplotlib. I tried to plot all the lines on the same graph, since the x-axis remains the same for all dates, but I keep getting
raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only
IIUC, you can filter your original dataframe with isin:
df_testing_set = df_testing_set[df_testing_set['date'].isin(date_list[1])]
where date_list[1] is supposed to be the column related to the date of your second dataframe/list of dates.
If you want the first Index just subselect it:
df_testing_set = df_testing_set[df_testing_set['date'].isin(date_list[1])]['Index']
Hope that helps.
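To go from the `isin` filter to one (x, y) series per date, as the question asks, the filtered frame can be grouped by date. This is a minimal sketch on a few made-up rows shaped like the sample above; `date_list` is assumed here to be a plain Python list of date strings, and `series_by_date` is a name invented for the example.

```python
import pandas as pd

# Hypothetical sample shaped like df_testing_set in the question.
df_testing_set = pd.DataFrame({
    "Ycurrent": [245, 235, 265, 225, 245],
    "date": ["June 17, 2017", "June 17, 2017", "June 18, 2017",
             "June 23, 2017", "June 27, 2017"],
    "bucket_id": [45, 46, 47, 53, 57],
})
date_list = ["June 01, 2017", "June 23, 2017", "June 27, 2017", "June 17, 2017"]

# Keep only the rows whose date appears in date_list.
selected = df_testing_set[df_testing_set["date"].isin(date_list)]

# One (x, y) pair of lists per date: x is bucket_id, y is Ycurrent.
series_by_date = {
    date: (group["bucket_id"].tolist(), group["Ycurrent"].tolist())
    for date, group in selected.groupby("date")
}

for date, (x, y) in series_by_date.items():
    print(date, x, y)
```

Each `(x, y)` pair can then be passed to matplotlib in a loop, for example `plt.plot(x, y, label=date)`, to draw all dates on the same axes.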

Extract Columns from html using Python (Beautifulsoup)

I need to extract the info from this page: http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
    Table_Values = item.find_all('tr')
    N=len(Table_Values)-1
    for n in range(N):
        k = (item.find_all('td', {'class':'first left bold noWrap'})[n].text)
        print(item.find_all('td', {'class':'first left bold noWrap'})[n].text)
Here I have several problems:
The cells in the Price column can be tagged with either of two classes. How can I specify that I want items tagged with class 'redFont' and/or 'greenFont'? Change % can also have class redFont or greenFont. The other columns are tagged differently. How can I extract them?
Is there a way to extract columns from a table?
Ideally I would like to have a dataframe with columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly, but for this particular site we get a 403 error with urllib2, which is the library read_html uses, so we need to fetch the HTML with requests.
Here's a way to convert the html table into a nested list
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas dataframe you just need to also grab the table headings and add them
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
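Every cell scraped this way is still a string, so dates and prices cannot be sorted or computed on yet. Below is a minimal sketch of converting the column types, using two rows copied from the output above in place of a live request; the conversion steps are a suggestion and not part of the original answers.

```python
import pandas as pd

# Two rows as they come out of the nested-list scrape: all strings.
table_rows = [
    ["Jun 08, 2016", "3.3609", "3.4411", "3.4465", "3.3584", "-2.36%"],
    ["Jun 07, 2016", "3.4421", "3.4885", "3.5141", "3.4401", "-1.36%"],
]
table_headers = ["Date", "Price", "Open", "High", "Low", "Change %"]
df = pd.DataFrame(table_rows, columns=table_headers)

# Parse 'Jun 08, 2016' style dates into real timestamps.
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")
# Convert the four price columns to floats.
for col in ["Price", "Open", "High", "Low"]:
    df[col] = df[col].astype(float)
# Strip the trailing percent sign before converting.
df["Change %"] = df["Change %"].str.rstrip("%").astype(float)

print(df.dtypes)
```

After the conversion, operations such as `df.sort_values("Date")` or `df["High"] - df["Low"]` behave numerically rather than as string operations.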
