I am trying to scrape historical NFL game results (scores only) from the ESPN website with Python and write them to a CSV file. I can't find a way to add the dates shown in the desired output. Could someone help me get from the current output to the desired output? The website I am scraping and the desired output are below:
NFL Website:
https://www.espn.com/nfl/scoreboard/_/week/17/year/2022/seasontype/2
Current Output:
Week #, Away Team, Away Score, Home Team, Home Score
Week 17, Cowboys, 27, Titans, 13
Week 17, Cardinals, 19, Falcons, 20
Week 17, Bears, 10, Lions, 41
Desired Game Results Output:
Week #, Date, Away Team, Away Score, Home Team, Home Score
Week 17, 12/29/2022, Cowboys, 27, Titans, 13
Week 17, 1/1/2023, Cardinals, 19, Falcons, 20
Week 17, 1/1/2023, Bears, 10, Lions, 41
Code:
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from urllib.request import urlopen  # needed for urlopen() below

daterange = 1
url_list = []
while daterange < 19:
    url = "https://www.espn.com/nfl/scoreboard/_/week/" + str(daterange) + "/year/2022/seasontype/2"
    url_list.append(url)
    daterange = daterange + 1

j = 1
away_team = []
home_team = []
away_team_score = []
home_team_score = []
week = []
for url in url_list:
    response = urlopen(url)
    urlname = requests.get(url)
    bs = bs4.BeautifulSoup(urlname.text, 'lxml')
    print(response.url)
    i = 0
    while True:
        try:
            name = bs.findAll('div', {'class': 'ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db'})[i]
        except Exception:
            break
        name = name.get_text()
        try:
            score = bs.findAll('div', {'class': 'ScoreCell__Score h4 clr-gray-01 fw-heavy tar ScoreCell_Score--scoreboard pl2'})[i]
        except Exception:
            break
        score = score.get_text()
        if i % 2 == 0:
            away_team.append(name)
            away_team_score.append(score)
        else:
            home_team.append(name)
            home_team_score.append(score)
            week.append("week " + str(j))
        i = i + 1
    j = j + 1

web_scraping = list(zip(week, home_team, home_team_score, away_team, away_team_score))
web_scraping_df = pd.DataFrame(web_scraping, columns=['week', 'home_team', 'home_team_score', 'away_team', 'away_team_score'])
web_scraping_df
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup

week = 17
url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for board in soup.select('.ScoreboardScoreCell'):
    title = board.find_previous(class_='Card__Header__Title').text
    teams = [t.text for t in board.select('.ScoreCell__TeamName')]
    scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
    all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))

df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
print(df.to_markdown(index=False))
Prints:
| Week | Date | Team 1 | Score 1 | Team 2 | Score 2 |
|---|---|---|---|---|---|
| 17 | Thursday, December 29, 2022 | Cowboys | 27 | Titans | 13 |
| 17 | Sunday, January 1, 2023 | Cardinals | 19 | Falcons | 20 |
| 17 | Sunday, January 1, 2023 | Bears | 10 | Lions | 41 |
| 17 | Sunday, January 1, 2023 | Broncos | 24 | Chiefs | 27 |
| 17 | Sunday, January 1, 2023 | Dolphins | 21 | Patriots | 23 |
| 17 | Sunday, January 1, 2023 | Colts | 10 | Giants | 38 |
| 17 | Sunday, January 1, 2023 | Saints | 20 | Eagles | 10 |
| 17 | Sunday, January 1, 2023 | Panthers | 24 | Buccaneers | 30 |
| 17 | Sunday, January 1, 2023 | Browns | 24 | Commanders | 10 |
| 17 | Sunday, January 1, 2023 | Jaguars | 31 | Texans | 3 |
| 17 | Sunday, January 1, 2023 | 49ers | 37 | Raiders | 34 |
| 17 | Sunday, January 1, 2023 | Jets | 6 | Seahawks | 23 |
| 17 | Sunday, January 1, 2023 | Vikings | 17 | Packers | 41 |
| 17 | Sunday, January 1, 2023 | Rams | 10 | Chargers | 31 |
| 17 | Sunday, January 1, 2023 | Steelers | 16 | Ravens | 13 |
| 17 | Monday, January 2, 2023 | Bills | - | Bengals | - |
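Note the Date column here is ESPN's long header text, e.g. "Thursday, December 29, 2022", while the desired output uses 12/29/2022. A small converter sketch, assuming every scoreboard header follows that long format:

from datetime import datetime

def to_mdy(title: str) -> str:
    # Parse the long header form and re-emit it as M/D/YYYY
    # without leading zeros, e.g. "12/29/2022" or "1/1/2023".
    dt = datetime.strptime(title, '%A, %B %d, %Y')
    return f'{dt.month}/{dt.day}/{dt.year}'

Applied inside the loop, e.g. all_data.append((week, to_mdy(title), ...)), it yields the date format from the desired output.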
I'm a newbie seeking help; I've tried the following without success.
import requests  # missing in the original snippet
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []

# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load it you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
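If you want the rounds on disk as well, the DataFrame writes straight to CSV (the filename here is just illustrative):

# Illustrative filename; writes the cleaned rounds table to CSV.
df.to_csv("ee_rounds.csv", index=False)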
I am web-scraping the table found on this website: https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html
Everything was good, but I had a small issue with the "Price" column and was unable to fix it despite trying for the past few hours.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

page = requests.get("https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html")
soup = BeautifulSoup(page.content, "lxml")
gdp = soup.find_all("table", attrs={"class": "table flight-detail hidden-xs"})
print("Number of tables on site: ", len(gdp))

table1 = gdp[0]
# the head will form our column names
body = table1.find_all("tr")
print(len(body))

# Head values (column names) are the first items of the body list
head = body[0]        # 0th item is the header row
body_rows = body[1:]  # All other items become the rest of the rows

# Let's now iterate through the head HTML code and make a list of clean headings
# Declare an empty list to keep the column names
headings = []
for item in head.find_all("th"):  # loop through all th elements
    # convert the th elements to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)

all_rows = []  # will be a list of lists for all rows
for row_num in range(len(body_rows)):  # a row at a time
    row = []  # this will hold entries for one row
    for row_item in body_rows[row_num].find_all("td")[:-1]:  # loop through all row entries
        # row_item.text removes the tags from the entries
        # the following regex is to remove \xa0, \n and commas from row_item.text
        # \xa0 encodes the flag, \n is the newline and commas separate thousands in numbers
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item.text)
        # append aa to row - note one row entry is being appended
        row.append(aa)
    # append one row to all_rows
    all_rows.append(row)
    for row_item in body_rows[row_num].find_all("td")[-1].find("span").text:  # loop through the last row entry, price
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item)
        row.append(aa)
    all_rows.append(row)

# We can now use the data in all_rows and headings to make a table
# all_rows becomes our data and headings the column names
df = pd.DataFrame(data=all_rows, columns=headings)
#df.head()
#print(df)
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%d/%m/%Y")
print(df)
If you could please run the code and tell me how to solve this issue so that I can print everything with print(df).
Previously, I was able to print everything except the price, which showed "\t\t\t\t\t\t\t" instead of the price.
Thank you.
To get the table into a pandas DataFrame, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for tr in soup.select("tr:has(td)"):
    row = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
    data.append(row)

df = pd.DataFrame(data, columns="From To Aircraft Seats Date Price".split())
print(df)
df.to_csv("data.csv", index=False)
Prints:
From To Aircraft Seats Date Price
0 Prague Vaclav Havel Airport Bratislava M R Stefanik Citation XLS+ 9 Thu Jun 03 00:00:00 UTC 2021 €3 300 (RRP €6 130)
1 Billund Odense Learjet 45 / 45XR 8 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €7 100)
2 La Roche/yon Les Ajoncs Nantes Atlantique Embraer Phenom 100 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €4 820)
3 London Biggin Hill Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 980)
4 Prague Vaclav Havel Airport Salzburg (mozart) Gulfstream G200 9 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 800)
5 Palma De Mallorca Edinburgh Cessna C525 Citation CJ2 5 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €18 680)
6 Linz Blue Danube Linz Munich Munchen Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 600)
7 Geneva Cointrin Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €9 240)
8 Vienna Schwechat Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 590)
9 Cannes Mandelieu Geneva Cointrin Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
10 Brussels National Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 790)
11 Split Bari Palese Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
12 Copenhagen Roskilde Aalborg Challenger 604 11 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €16 750)
13 Brussels National Leipzig Halle Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 690)
...
And saves data.csv.
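The question also reformatted Date to dd/mm/YYYY; a sketch of doing the same on this df, assuming every Date cell keeps the "Thu Jun 03 00:00:00 UTC 2021" shape:

# Assumed cell format: "Thu Jun 03 00:00:00 UTC 2021" for every row.
df["Date"] = pd.to_datetime(
    df["Date"], format="%a %b %d %H:%M:%S UTC %Y"
).dt.strftime("%d/%m/%Y")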
I am new to BeautifulSoup, so please excuse any beginner mistakes. I am attempting to scrape a URL and want to store the list of movies under each date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = ul.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and their corresponding dates into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})

df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" on the 10th line, and add a definition of the variable "movielist" beforehand.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

date = soup.find_all("h4")
ul = soup.find_all("ul")

# add code here
movielist = []

for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    # replace ul with h1 here
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))

print(movielist)
Here the missing movielist is defined first, and the text is captured from each individual element h1 of the zip rather than from the whole ResultSet:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
The reason the dates don't match up in the output is that the retrieved 'date' values look like the following, so you need to fix the logic: there are multiple titles under the same release date, so the number of dates and the number of titles don't line up one-to-one. I can't help you much more than that because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
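A sketch of one way to repair that pairing, assuming on this page each release-date h4 is immediately followed by the ul holding its titles:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
for h4 in soup.find_all('h4'):
    # Pair each date heading with the <ul> that immediately follows it,
    # so every title inherits the correct release date.
    ul = h4.find_next_sibling('ul')
    if ul is None:
        continue
    for a in ul.find_all('a'):
        movielist.append((h4.get_text(strip=True), a.get_text(strip=True)))
print(movielist)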
I am scraping the list of US presidents using Beautiful Soup and requests. I want to scrape both dates, the start and the end of each presidency, but for some reason it's showing a "list index out of range" error. I'll provide the link so you can understand better.
website Link : https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = BeautifulSoup(page_html, 'html.parser')
containers = page_soup.find_all('table', class_='wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))

container = containers[0]
date = container.find_all('span', attrs={'class': 'date'})
#print(len(date))
#print(date[0].text)

for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
The find_all function can return an empty list, which leads to the error when you index into it. You can simply collect whatever dates each table has:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
Since your last lines of code already store all date spans from the first "wikitable", you can use a list comprehension:
date = [x.text for x in container.find_all('span' , attrs = {'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
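Those spans alternate start and end of each term; as a sketch (assuming that strict start/end ordering holds across the whole table), they can be zipped into pairs:

# Pair up alternating start/end dates; assumes the spans come in
# (start, end) order for every presidency row.
terms = list(zip(date[0::2], date[1::2]))
print(terms[:2])  # e.g. [('April 30, 1789', 'March 4, 1797'), ...]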
Since it has <table> tags, have you considered pandas' .read_html()? It uses BeautifulSoup under the hood, takes a lot of the work out, and puts the data straight into a dataframe for you. The only work then needed is manipulation, cleanup, or filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output:
print (df.to_string())
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
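One leftover: the Start column keeps Wikipedia footnote markers such as [d] and [i]. A hedged cleanup, assuming the markers are always short bracketed tags:

# Strip bracketed footnote markers like "[d]" or "[i]" from the dates.
df['Start'] = df['Start'].str.replace(r'\[\w+\]', '', regex=True).str.strip()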
Background
I have five years of NO2 measurement data, in csv files-one file for every location and year. I have loaded all the files into pandas dataframes in the same format:
Date Hour Location NO2_Level
0 01/01/2016 00 Street 18
1 01/01/2016 01 Street 39
2 01/01/2016 02 Street 129
3 01/01/2016 03 Street 76
4 01/01/2016 04 Street 40
Goal
For each dataframe count the number of times NO2_Level is greater than 150 and output this.
So I wrote a loop that creates all the dataframes from the right directories and cleans them appropriately.
Problem
Whatever I've tried produces results I know on inspection are incorrect, e.g.:
- the count value for every location in a given year is the same (possible, but unlikely)
- for a year where I know the count should be positive, every location returns 0
What I've tried
I have tried a lot of approaches to getting this value for each dataframe, such as making the column a series:
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
Using pd.count():
count = df[df['NO2_Level'] >= 150].count()
These two approaches have gotten closest to what I want to output.
Example to test on
data = {'Date': ['01/01/2016', '01/02/2016', '01/03/2016', '01/04/2016', '01/05/2016'],
        'Hour': ['00', '01', '02', '03', '04'],
        'Location': ['Street', 'Street', 'Street', 'Street', 'Street'],
        'NO2_Level': [18, 39, 129, 76, 40]}
df = pd.DataFrame(data=data)
NO2_Level = pd.Series(df['NO2_Level'])
count = (NO2_Level > 150).sum()
count
Expected Outputs
So from this I'm trying to get it to output a single line for each dataframe that was made in the format Location, year, count (of condition):
Kirkstall Road,2013,47
Haslewood Close,2013,97
...
Jack Lane Hunslet,2015,158
So the above example would produce
Street, 2016, 1
Actual
Every location produces the same result within a given year, and for some years (2014) the count doesn't seem to work at all, when on inspection there should be positive counts:
Kirkstall Road,2013,47
Haslewood Close,2013,47
Tilbury Terrace,2013,47
Corn Exchange,2013,47
Temple Newsam,2014,0
Queen Street Morley,2014,0
Corn Exchange,2014,0
Tilbury Terrace,2014,0
Haslewood Close,2015,43
Tilbury Terrace,2015,43
Corn Exchange,2015,43
Jack Lane Hunslet,2015,43
Norman Rows,2015,43
Hopefully this helps.
import pandas as pd

ddict = {
    'Date': ['2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-01', '2016-01-02'],
    'Hour': ['00', '01', '02', '03', '04', '02'],
    'Location': ['Street', 'Street', 'Street', 'Street', 'Street', 'Street'],
    'N02_Level': [19, 39, 129, 76, 40, 151],
}
df = pd.DataFrame(ddict)

# Convert dates to datetime
df['Date'] = pd.to_datetime(df['Date'])
# Make a Year column
df['Year'] = df['Date'].apply(lambda x: x.strftime('%Y'))

# Group by location and year, count rows where N02_Level > 150
df1 = df[df['N02_Level'] > 150].groupby(['Location', 'Year']).size().reset_index(name='Count')

# Iterate the results
for i in range(len(df1)):
    loc = df1['Location'][i]
    yr = df1['Year'][i]
    cnt = df1['Count'][i]
    print(f'{loc},{yr},{cnt}')

### To not use f-strings
for i in range(len(df1)):
    print('{loc},{yr},{cnt}'.format(loc=df1['Location'][i], yr=df1['Year'][i], cnt=df1['Count'][i]))
Sample data:
Date Hour Location N02_Level
0 2016-01-01 00 Street 19
1 2016-01-01 01 Street 39
2 2016-01-01 02 Street 129
3 2016-01-01 03 Street 76
4 2016-01-01 04 Street 40
5 2016-01-02 02 Street 151
Output:
Street,2016,1
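One caveat, since the expected output includes zero rows like Temple Newsam,2014,0: filtering before the groupby drops location/year pairs with no exceedances. A sketch of a variant that keeps them:

# Count exceedances without losing zero-count groups: sum a boolean
# column over every (Location, Year) pair present in the data.
df['Exceed'] = df['N02_Level'] > 150
df2 = df.groupby(['Location', 'Year'])['Exceed'].sum().reset_index(name='Count')
print(df2)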
Here is a solution with a randomly generated sample:
import numpy as np
import pandas as pd

def random_dates(start, end, n):
    start_u = start.value // 10 ** 9
    end_u = end.value // 10 ** 9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

location = ['street', 'avenue', 'road', 'town', 'campaign']
df = pd.DataFrame({'Date': random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-12-31'), 20),
                   'Location': np.random.choice(location, 20),
                   'NOE_level': np.random.randint(low=130, high=200, size=20)})

# Keep only the year for Date
df['Date'] = df['Date'].dt.strftime("%Y")
print(df)

df = df.groupby(['Location', 'Date'])['NOE_level'].apply(lambda x: (x > 150).sum()).reset_index(name='count')
print(df)
Example df generated:
Date Location NOE_level
0 2018 town 191
1 2017 campaign 187
2 2017 town 137
3 2016 avenue 148
4 2017 campaign 195
5 2018 town 181
6 2018 road 187
7 2018 town 184
8 2016 town 155
9 2016 street 183
10 2018 road 136
11 2017 road 171
12 2018 street 165
13 2015 avenue 193
14 2016 campaign 170
15 2016 street 132
16 2016 campaign 165
17 2015 road 161
18 2018 road 161
19 2015 road 140
output:
Location Date count
0 avenue 2015 1
1 avenue 2016 0
2 campaign 2016 2
3 campaign 2017 2
4 road 2015 1
5 road 2017 1
6 road 2018 2
7 street 2016 1
8 street 2018 1
9 town 2016 1
10 town 2017 0
11 town 2018 3