Scraping data from Spotify charts - python

I want to scrape the daily top 200 songs from the Spotify Charts website. I am trying to parse the HTML of the page to get each song's artist, name, and stream count. But the following code returns nothing. How can I get this information this way?
for a in soup.find("div", {"class": "Container-c1ixcy-0 krZEp encore-base-set"}):
    for b in a.findAll("main", {"class": "Main-tbtyrr-0 flXzSu"}):
        for c in b.findAll("div", {"class": "Content-sc-1n5ckz4-0 jyvkLv"}):
            for d in c.findAll("div", {"class": "TableContainer__Container-sc-86p3fa-0 fRKUEz"}):
                print(d)
And let's say this is the songs list that I want to scrape:
https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14
And this is the HTML code of the page.

Non-Selenium solution:
import requests
import pandas as pd

url = 'https://charts-spotify-com-service.spotify.com/public/v0/charts'
response = requests.get(url)

chart = []
for entry in response.json()['chartEntryViewResponses'][0]['entries']:
    chart.append({
        "Rank": entry['chartEntryData']['currentRank'],
        "Artist": ', '.join([artist['name'] for artist in entry['trackMetadata']['artists']]),
        "TrackName": entry['trackMetadata']['trackName']
    })

df = pd.DataFrame(chart)
print(df.to_string(index=False))
OUTPUT:
Rank Artist TrackName
1 Bizarrap,Quevedo Quevedo: Bzrp Music Sessions, Vol. 52
2 Harry Styles As It Was
3 Bad Bunny,Chencho Corleone Me Porto Bonito
4 Bad Bunny Tití Me Preguntó
5 Manuel Turizo La Bachata
6 ROSALÍA DESPECHÁ
7 BLACKPINK Pink Venom
8 David Guetta,Bebe Rexha I'm Good (Blue)
9 OneRepublic I Ain't Worried
10 Bad Bunny Efecto
11 Chris Brown Under The Influence
12 Steve Lacy Bad Habit
13 Bad Bunny,Bomba Estéreo Ojitos Lindos
14 Kate Bush Running Up That Hill (A Deal With God) - 2018 Remaster
15 Joji Glimpse of Us
16 Nicki Minaj Super Freaky Girl
17 Bad Bunny Moscow Mule
18 Rosa Linn SNAP
19 Glass Animals Heat Waves
20 KAROL G PROVENZA
21 Charlie Puth,Jung Kook,BTS Left and Right (Feat. Jung Kook of BTS)
22 Harry Styles Late Night Talking
23 The Kid LAROI,Justin Bieber STAY (with Justin Bieber)
24 Tom Odell Another Love
25 Central Cee Doja
26 Stephen Sanchez Until I Found You
27 Bad Bunny Neverita
28 Post Malone,Doja Cat I Like You (A Happier Song) (with Doja Cat)
29 Lizzo About Damn Time
30 Nicky Youre,dazy Sunroof
31 Elton John,Britney Spears Hold Me Closer
32 Luar La L Caile
33 KAROL G,Maldy GATÚBELA
34 The Weeknd Die For You
35 Bad Bunny,Jhay Cortez Tarot
36 James Hype,Miggy Dela Rosa Ferrari
37 Imagine Dragons Bones
38 Elton John,Dua Lipa,PNAU Cold Heart - PNAU Remix
39 The Neighbourhood Sweater Weather
40 Ghost Mary On A Cross
41 Shakira,Rauw Alejandro Te Felicito
42 Justin Bieber Ghost
43 Bad Bunny,Rauw Alejandro Party
44 Drake,21 Savage Jimmy Cooks (feat. 21 Savage)
45 Doja Cat Vegas (From the Original Motion Picture Soundtrack ELVIS)
46 Camila Cabello,Ed Sheeran Bam Bam (feat. Ed Sheeran)
47 Rauw Alejandro,Lyanno,Brray LOKERA
48 Rels B cómo dormiste?
49 The Weeknd Blinding Lights
50 Arctic Monkeys 505
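If you also want to guard the request against transient failures and keep the result, here is a minimal sketch of the same approach (the endpoint and JSON keys are from the code above; the timeout value and CSV filename are just examples):

import requests
import pandas as pd

url = 'https://charts-spotify-com-service.spotify.com/public/v0/charts'
response = requests.get(url, timeout=10)  # example timeout, tune as needed
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing bad JSON

chart = [
    {
        "Rank": entry['chartEntryData']['currentRank'],
        "Artist": ', '.join(a['name'] for a in entry['trackMetadata']['artists']),
        "TrackName": entry['trackMetadata']['trackName'],
    }
    for entry in response.json()['chartEntryViewResponses'][0]['entries']
]

df = pd.DataFrame(chart)
df.to_csv('spotify_chart.csv', index=False)  # example output path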

In the example link you provided, there aren't 200 songs, but only 50. The following is one way to get those songs:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("window-size=1920,1080")
webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14'
browser.get(url)
wait = WebDriverWait(browser, 5)

# accept the cookie banner if it shows up
try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except TimeoutException:
    print('no cookie button')

# remove the sticky header so it cannot intercept clicks while scrolling
header_to_be_removed = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'header[data-testid="charts-header"]')))
browser.execute_script("""
var element = arguments[0];
element.parentNode.removeChild(element);
""", header_to_be_removed)

# keep clicking 'show more' until the button no longer appears
while True:
    try:
        show_more_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@data-testid="load-more-entries"]//button')))
        show_more_button.location_once_scrolled_into_view
        t.sleep(5)
        show_more_button.click()
        print('clicked to show more')
        t.sleep(3)
    except TimeoutException:
        print('all done')
        break

songs = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-testid="charts-entry-item"]')))
print('we have', len(songs), 'songs')

song_list = []
for song in songs:
    song.location_once_scrolled_into_view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]')
    artist = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]')
    song_list.append((artist.text, title.text))

df = pd.DataFrame(song_list, columns=['Artist', 'Title'])
print(df)
This will print out in terminal:
no cookie button
clicked to show more
clicked to show more
clicked to show more
clicked to show more
all done
we have 50 songs
    Artist             Title
0   Bizarrap,          Quevedo: Bzrp Music Sessions, Vol. 52
1   Harry Styles       As It Was
2   Bad Bunny,         Me Porto Bonito
3   Bad Bunny          Tití Me Preguntó
4   Manuel Turizo      La Bachata
5   ROSALÍA            DESPECHÁ
6   BLACKPINK          Pink Venom
7   David Guetta,      I'm Good (Blue)
8   OneRepublic        I Ain't Worried
9   Bad Bunny          Efecto
10  Chris Brown        Under The Influence
11  Steve Lacy         Bad Habit
12  Bad Bunny,         Ojitos Lindos
13  Kate Bush          Running Up That Hill (A Deal With God) - 2018 Remaster
14  Joji               Glimpse of Us
15  Nicki Minaj        Super Freaky Girl
16  Bad Bunny          Moscow Mule
17  Rosa Linn          SNAP
18  Glass Animals      Heat Waves
19  KAROL G            PROVENZA
20  Charlie Puth,      Left and Right (Feat. Jung Kook of BTS)
21  Harry Styles       Late Night Talking
22  The Kid LAROI,     STAY (with Justin Bieber)
23  Tom Odell          Another Love
24  Central Cee        Doja
25  Stephen Sanchez    Until I Found You
26  Bad Bunny          Neverita
27  Post Malone,       I Like You (A Happier Song) (with Doja Cat)
28  Lizzo              About Damn Time
29  Nicky Youre,       Sunroof
30  Elton John,        Hold Me Closer
31  Luar La L          Caile
32  KAROL G,           GATÚBELA
33  The Weeknd         Die For You
34  Bad Bunny,         Tarot
35  James Hype,        Ferrari
36  Imagine Dragons    Bones
37  Elton John,        Cold Heart - PNAU Remix
38  The Neighbourhood  Sweater Weather
39  Ghost              Mary On A Cross
40  Shakira,           Te Felicito
41  Justin Bieber      Ghost
42  Bad Bunny,         Party
43  Drake,             Jimmy Cooks (feat. 21 Savage)
44  Doja Cat           Vegas (From the Original Motion Picture Soundtrack ELVIS)
45  Camila Cabello,    Bam Bam (feat. Ed Sheeran)
46  Rauw Alejandro,    LOKERA
47  Rels B             cómo dormiste?
48  The Weeknd         Blinding Lights
49  Arctic Monkeys     505
Of course, you can also get other info, like the chart ranking, all artists when there is more than one, etc.
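For example, since the entries arrive in chart order, the rank can be taken from each song's position in the list; a small sketch that slots into the loop above (same selectors as before, nothing new assumed):

song_list = []
for rank, song in enumerate(songs, start=1):  # position in the list == chart rank
    song.location_once_scrolled_into_view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]')
    artist = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]')
    song_list.append((rank, artist.text, title.text))

df = pd.DataFrame(song_list, columns=['Rank', 'Artist', 'Title'])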
The Selenium Chrome/chromedriver setup above is for Linux; just observe the imports and the code after the browser is defined, and adapt them to your own setup.
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html
For selenium docs, visit: https://www.selenium.dev/documentation/

Related

Cannot get response.get() to load full webpage

When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the webpage in a browser that the missing resort names need to be scrolled down to before they will load. How do I make my response.get() obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
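If you would rather have a dataframe than printed lines, the same JSON can be handed to pandas; a minimal sketch, assuming only the "data" list and "title" key shown above:

import requests
import pandas as pd

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

# one row per resort; "title" is the only field confirmed above
df = pd.DataFrame(data["data"])
print(df["title"])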

Problem concatenating URL and scraping data

I am trying to append a URL in python to scrape details from the target URL.
I have the below code but it seems to be scraping the data from url1 rather than URL.
I have scraped the team names from the NFL website without any issue. The issue is with the spotrac URL, where I am appending the team name that I scraped from the NFL website.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://www.nfl.com/teams/'
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text)

for team in team_name:
    team = team.replace(" ", "-").lower()
    url1 = 'https://www.spotrac.com/nfl/rankings/'
    URL = url1 + str(team)
    print(URL)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(URL, data=data).content, 'html.parser')
    spotrac_df = pd.DataFrame(columns=['Name', 'Salary'])
    for h3 in bs_soup.select('h3'):
        spotrac_df = spotrac_df.append(pd.DataFrame({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text)}, index=[0]), ignore_index=False)
I'm almost certain the problem is coming from the URL not appending properly. The scraping is taking the salaries etc from url1 rather than URL.
My console output (using Spyder IDE) is as below for print(URL)
The URL is appending correctly, but you have a leading white space in your team names. I also made a few other changes and noted them in the code.
Lastly (and I used to do this too), creating an empty dataframe and then appending to it on each iteration isn't the best method. I've been told it's better to construct your rows as lists/dictionaries and, when done, call on pandas once to construct the dataframe, so I changed that as well.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.nfl.com/teams/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

team_name = []
team_name_list = soup.find_all('h4', class_='d3-o-media-object__roofline nfl-c-custom-promo__headline')
for team in team_name_list:
    if team.find('p'):
        team_name.append(team.text.strip())  # <- remove leading/trailing white space

url1 = 'https://www.spotrac.com/nfl/rankings/'  # <- since this is fixed, put it before the loop
spotrac_rows = []
for team in team_name:
    team = '-'.join(team.split()).lower()  # <- changed to split in case there are 2 spaces between city and team
    url = url1 + str(team)
    print(url)
    data = {
        'ajax': 'true',
        'mobile': 'false'
    }
    bs_soup = BeautifulSoup(requests.post(url, data=data).content, 'html.parser')
    for h3 in bs_soup.select('h3'):
        spotrac_rows.append({'Name': str(h3.text), 'Salary': str(h3.find_next(class_="rank-value").text.strip())})  # <- remove white space from the salary

spotrac_df = pd.DataFrame(spotrac_rows)
Output:
print(spotrac_df)
Name Salary
0 Chandler Jones $21,333,333
1 Patrick Peterson $13,184,588
2 D.J. Humphries $12,800,000
3 DeAndre Hopkins $12,500,000
4 Larry Fitzgerald $11,750,000
5 Jordan Hicks $10,500,000
6 Justin Pugh $10,500,000
7 Kenyan Drake $8,483,000
8 Kyler Murray $8,080,601
9 Robert Alford $7,500,000
10 J.R. Sweezy $6,500,000
11 Corey Peters $4,437,500
12 Haason Reddick $4,288,444
13 Jordan Phillips $4,000,000
14 Isaiah Simmons $3,757,101
15 Maxx Williams $3,400,000
16 Zane Gonzalez $3,259,000
17 Devon Kennard $2,500,000
18 Budda Baker $2,173,184
19 De'Vondre Campbell $2,000,000
20 Andy Lee $2,000,000
21 Byron Murphy $1,815,795
22 Christian Kirk $1,607,691
23 Aaron Brewer $1,168,750
24 Max Garcia $1,143,125
25 Andy Isabella $1,052,244
26 Mason Cole $977,629
27 Zach Allen $975,855
28 Chris Banjo $887,500
29 Jonathan Bullard $887,500
... ...
2530 Khari Blasingame $675,000
2531 Kenneth Durden $675,000
2532 Cody Hollister $675,000
2533 Joey Ivie $675,000
2534 Greg Joseph $675,000
2535 Kareem Orr $675,000
2536 David Quessenberry $675,000
2537 Derick Roberson $675,000
2538 Shaun Wilson $675,000
2539 Cole McDonald $635,421
2540 Chris Jackson $629,570
2541 Kobe Smith $614,333
2542 Aaron Brewer $613,333
2543 Cale Garrett $613,333
2544 Tommy Hudson $613,333
2545 Kristian Wilkerson $613,333
2546 Khaylan Kearse-Thomas $612,500
2547 Nick Westbrook $612,333
2548 Kyle Williams $611,833
2549 Mason Kinsey $611,666
2550 Tucker McCann $611,666
2551 Cameron Scarlett $611,666
2552 Teair Tart $611,666
2553 Brandon Kemp $611,333
2554 Wyatt Ray $610,000
2555 Josh Smith $610,000
2556 Logan Woodside $610,000
2557 Rashard Davis $610,000
2558 Avery Gennesy $610,000
2559 Parker Hesse $610,000
[2560 rows x 2 columns]
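As a follow-up, the Salary column is a string like "$21,333,333"; here is a small sketch to make it numeric for sorting or aggregation (column names as in the answer above):

# strip '$' and ',' and convert; errors='coerce' turns malformed values into NaN
spotrac_df['Salary'] = pd.to_numeric(
    spotrac_df['Salary'].str.replace(r'[$,]', '', regex=True),
    errors='coerce'
)
print(spotrac_df.sort_values('Salary', ascending=False).head())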

How to count paragraphs from each article from dataframe?

I want to count the paragraphs in each article in a dataframe. However, my result comes back with zeros in the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n', '\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count

count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object
Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()

count_paragraphs(df)
This is my answer, and it works!
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count

count_paragraphs(df)
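To see that both answers agree, here is a tiny self-contained check on made-up text, using the two-consecutive-newlines convention from the answers above:

import pandas as pd

df = pd.DataFrame({'text': ['one\n\ntwo\n\nthree', 'no breaks here']})

# vectorized count of double newlines per row
print(df.text.str.count(r'\n\n').tolist())  # [2, 0]

# equivalent plain-Python count, as in the second answer
print([df.text[i].count('\n\n') for i in range(len(df))])  # [2, 0]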

Selenium loading, but not printing all HTML

I am attempting to use Python and Selenium to web-scrape dynamically loaded data from a website. The problem is, only about half of the data is being reported as present, when in reality it all should be there. Even after using pauses before printing out all the page content, or simple find element by class searches, there seems to be no solution. The URL of the site is https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909. As you can see, there are 13 main sections, however I am only able to retrieve data from the first four games. To best show the problem I'll attach the code for printing the inner-HTML for the entire page to show the discrepancies between the loaded and non-loaded data.
from selenium import webdriver
import requests
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
print(driver.execute_script("return document.documentElement.innerText;"))
EDIT:
The problem is not the wait time, for I am running it line by line and fully waiting for it to load. It appears the problem boils down to Selenium not grabbing all the JS-loaded text on the page, as seen by the console output in the answer below.
@sudonym's analysis was in the right direction. You need to induce WebDriverWait for the desired elements to be visible before you attempt to extract them through the execute_script() method, as follows:
Code Block:
# -*- coding: UTF-8 -*-
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
url = "https://www.sportsbookreview.com/betting-odds/nfl-football/consensus/?date=20180909"
driver = webdriver.Chrome()
driver.get(url)
WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2[contains(.,'USA - National Football League')]//following::section//span[3]")))
print(driver.execute_script("return document.documentElement.innerText;"))
Console Output:
SPORTSBOOK REVIEW
Home
Best Sportsbooks
Rating Guide
Blacklist
Bonuses
BETTING ODDS
FREE PICKS
Sports Picks
NFL
College Football
NBA
NCAAB
MLB
NHL
More Sports
How to Bet
Tools
FORUM
Home
Players Talk
Sportsbooks & Industry
Newbie Forum
Handicapper Think Tank
David Malinsky's Point Blank
Service Plays
Bitcoin Sports Betting
NBA Betting
NFL Betting
NCAAF Betting
MLB Betting
NHL Betting
CONTESTS
EARN BETPOINTS
What Are Betpoints?
SBR Sportsbook
SBR Casino
SBR Racebook
SBR Poker
SBR Store
Today
NFL
NBA
NHL
MLB
College Football
NCAA Basketball
Soccer
Soccer Odds
Major League Soccer
UEFA Champions League
UEFA Nations League
UEFA Europa League
English Premier League
World Cup 2022
Tennis
Tennis Odds
ATP
WTA
UFC
Boxing
More Sports
CFL
WNBA
AFL
Betting Odds/NFL Odds/Consensus
TODAY
|
YESTERDAY
|
DATE
?
Login
?
Settings
?
Bet Tracker
?
Bet Card
?
Favorites
NFL Consensus for Sep 09, 2018
USA - National Football League
Sunday Sep 09, 2018
01:00 PM
/
Pittsburgh vs Cleveland
453
Pittsburgh
454
Cleveland
Current Line
-3½+105
+3½-115
Wagers Placed
10040
54.07%
8530
45.93%
Amount Wagered
$381,520.00
56.10%
$298,550.00
43.90%
Average Bet Size
$38.00
$35.00
SBR Contest Best Bets
22
9
01:00 PM
/
San Francisco vs Minnesota
455
San Francisco
456
Minnesota
Current Line
+6-102
-6-108
Wagers Placed
6250
41.25%
8900
58.75%
Amount Wagered
$175,000.00
29.50%
$418,300.00
70.50%
Average Bet Size
$28.00
$47.00
SBR Contest Best Bets
5
19
01:00 PM
/
Cincinnati vs Indianapolis
457
Cincinnati
458
Indianapolis
Current Line
-1-104
+1-106
Wagers Placed
11640
66.36%
5900
33.64%
Amount Wagered
$1,338,600.00
85.65%
$224,200.00
14.35%
Average Bet Size
$115.00
$38.00
SBR Contest Best Bets
23
12
01:00 PM
/
Buffalo vs Baltimore
459
Buffalo
460
Baltimore
Current Line
+7½-103
-7½-107
Wagers Placed
5220
33.83%
10210
66.17%
Amount Wagered
$78,300.00
16.79%
$387,980.00
83.21%
Average Bet Size
$15.00
$38.00
SBR Contest Best Bets
5
17
01:00 PM
/
Jacksonville vs N.Y. Giants
461
Jacksonville
462
N.Y. Giants
01:00 PM
/
Tampa Bay vs New Orleans
463
Tampa Bay
464
New Orleans
01:00 PM
/
Houston vs New England
465
Houston
466
New England
01:00 PM
/
Tennessee vs Miami
467
Tennessee
468
Miami
04:05 PM
/
Kansas City vs L.A. Chargers
469
Kansas City
470
L.A. Chargers
04:25 PM
/
Seattle vs Denver
471
Seattle
472
Denver
04:25 PM
/
Dallas vs Carolina
473
Dallas
474
Carolina
04:25 PM
/
Washington vs Arizona
475
Washington
476
Arizona
08:20 PM
/
Chicago vs Green Bay
477
Chicago
478
Green Bay
Media
Site Map
Terms of use
Contact Us
Privacy Policy
DMCA
18+. Gamble Responsibly.
© Sportsbook Review. All Rights Reserved.
This solution is only worth considering if there are lots of WebDriverWait calls and reduced runtime is of interest - otherwise, go for DebanjanB's approach.
You need to wait some time to let your HTML load completely. Also, you can set a timeout for script execution. To add an unconditional wait to driver.get(URL) in Selenium, use driver.set_page_load_timeout(n) with n = time in seconds, and loop:
import logging
import random
from time import sleep

logger = logging.getLogger(__name__)  # assumes logging is configured elsewhere

driver.set_page_load_timeout(n)  # set timeout of n seconds for page load
loading_finished = 0  # set flag to 0
while loading_finished == 0:  # repeat while flag == 0
    try:
        sleep(random.uniform(0.1, 0.5))  # wait some time
        website = driver.get(URL)  # try to load for n seconds
        loading_finished = 1  # set flag to 1 and exit while loop
        logger.info("website loaded")  # indicate load success
    except:
        logger.warning("timeout - retry")  # indicate load fail
else:  # runs when the loop exits without break
    driver.set_script_timeout(n)  # set timeout of n seconds for script
    script_finished = 0  # set flag to 0
    while script_finished == 0:  # second loop
        try:
            print(driver.execute_script("return document.documentElement.innerText;"))
            script_finished = 1  # set flag to 1
            logger.info("script done")  # indicate script done
        except:
            logger.warning("script timeout")
    else:
        logger.info("if you're still missing html here, increase timeout")

Filtering in Pandas dataframe

I'm grouping rotten tomatoes scores by director with the following:
director_counts = bigbadpanda.groupby(["Director"]).size().order(ascending = False)
print director_counts --->
Director
Woody Allen 44
Alfred Hitchcock 38
Clint Eastwood 32
Martin Scorsese 29
Steven Spielberg 29
Sidney Lumet 25
...
Question:
What's the best way for me to filter by directors with more than 2 movies?
And for filtering by the average number of movies per director, would this work? bigbadpanda.groupby(["Director"]).size().mean()
Data I created based on your info:
Director,Movies
Woody Allen,44
Alfred Hitchcock,38
Clint Eastwood,32
Someone,2
Someone else,1
Simply do this:
df = pd.read_csv('data.txt')
print(df[df.Movies > 2])
Output:
Director Movies
0 Woody Allen 44
1 Alfred Hitchcock 38
2 Clint Eastwood 32
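To filter directly on the grouped counts from the question (bigbadpanda is the dataframe there), a boolean mask on the size Series works; note that .order() is old pandas, the modern equivalent is .sort_values():

# grouped counts, as in the question
director_counts = bigbadpanda.groupby("Director").size().sort_values(ascending=False)

# directors with more than 2 movies
print(director_counts[director_counts > 2])

# or: directors above the average number of movies per director
print(director_counts[director_counts > director_counts.mean()])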
