Scraping data from Spotify charts - Python
I want to scrape the daily top 200 songs from the Spotify charts website. I am trying to parse the page's HTML to get each song's artist, name, and stream count, but the following code returns nothing. How can I get this information this way?
for a in soup.find("div", {"class": "Container-c1ixcy-0 krZEp encore-base-set"}):
    for b in a.findAll("main", {"class": "Main-tbtyrr-0 flXzSu"}):
        for c in b.findAll("div", {"class": "Content-sc-1n5ckz4-0 jyvkLv"}):
            for d in c.findAll("div", {"class": "TableContainer__Container-sc-86p3fa-0 fRKUEz"}):
                print(d)
And let's say this is the song list that I want to scrape:
https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14
And this is the HTML of the page.
Non-Selenium solution (the page is rendered client-side with generated class names, which is why your BeautifulSoup selectors match nothing; the chart data can instead be fetched directly as JSON):
import requests
import pandas as pd

# JSON endpoint that serves the chart data
url = 'https://charts-spotify-com-service.spotify.com/public/v0/charts'
response = requests.get(url)

chart = []
for entry in response.json()['chartEntryViewResponses'][0]['entries']:
    chart.append({
        "Rank": entry['chartEntryData']['currentRank'],
        "Artist": ', '.join([artist['name'] for artist in entry['trackMetadata']['artists']]),
        "TrackName": entry['trackMetadata']['trackName']
    })

df = pd.DataFrame(chart)
print(df.to_string(index=False))
OUTPUT:
Rank Artist TrackName
1 Bizarrap,Quevedo Quevedo: Bzrp Music Sessions, Vol. 52
2 Harry Styles As It Was
3 Bad Bunny,Chencho Corleone Me Porto Bonito
4 Bad Bunny Tití Me Preguntó
5 Manuel Turizo La Bachata
6 ROSALÍA DESPECHÁ
7 BLACKPINK Pink Venom
8 David Guetta,Bebe Rexha I'm Good (Blue)
9 OneRepublic I Ain't Worried
10 Bad Bunny Efecto
11 Chris Brown Under The Influence
12 Steve Lacy Bad Habit
13 Bad Bunny,Bomba Estéreo Ojitos Lindos
14 Kate Bush Running Up That Hill (A Deal With God) - 2018 Remaster
15 Joji Glimpse of Us
16 Nicki Minaj Super Freaky Girl
17 Bad Bunny Moscow Mule
18 Rosa Linn SNAP
19 Glass Animals Heat Waves
20 KAROL G PROVENZA
21 Charlie Puth,Jung Kook,BTS Left and Right (Feat. Jung Kook of BTS)
22 Harry Styles Late Night Talking
23 The Kid LAROI,Justin Bieber STAY (with Justin Bieber)
24 Tom Odell Another Love
25 Central Cee Doja
26 Stephen Sanchez Until I Found You
27 Bad Bunny Neverita
28 Post Malone,Doja Cat I Like You (A Happier Song) (with Doja Cat)
29 Lizzo About Damn Time
30 Nicky Youre,dazy Sunroof
31 Elton John,Britney Spears Hold Me Closer
32 Luar La L Caile
33 KAROL G,Maldy GATÚBELA
34 The Weeknd Die For You
35 Bad Bunny,Jhay Cortez Tarot
36 James Hype,Miggy Dela Rosa Ferrari
37 Imagine Dragons Bones
38 Elton John,Dua Lipa,PNAU Cold Heart - PNAU Remix
39 The Neighbourhood Sweater Weather
40 Ghost Mary On A Cross
41 Shakira,Rauw Alejandro Te Felicito
42 Justin Bieber Ghost
43 Bad Bunny,Rauw Alejandro Party
44 Drake,21 Savage Jimmy Cooks (feat. 21 Savage)
45 Doja Cat Vegas (From the Original Motion Picture Soundtrack ELVIS)
46 Camila Cabello,Ed Sheeran Bam Bam (feat. Ed Sheeran)
47 Rauw Alejandro,Lyanno,Brray LOKERA
48 Rels B cómo dormiste?
49 The Weeknd Blinding Lights
50 Arctic Monkeys 505
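Two caveats about this approach. First, this public endpoint returns the current global chart, not the Turkish daily chart for a specific date; to find the regional endpoint the charts page actually calls, watch the Network tab of your browser's dev tools while loading the chart URL (the authenticated endpoints it uses may require a Bearer token copied from those requests). Second, the question also asks for stream counts. Below is a minimal sketch of how you might pull them, assuming the payload exposes a key such as chartEntryData['rankingMetric']['value']; that field name is an unverified guess, so inspect the actual JSON first:

import requests
import pandas as pd

url = 'https://charts-spotify-com-service.spotify.com/public/v0/charts'
response = requests.get(url)
response.raise_for_status()  # fail loudly on HTTP errors

chart = []
for entry in response.json()['chartEntryViewResponses'][0]['entries']:
    # 'rankingMetric' is a guessed field name; verify it against the real response
    metric = entry['chartEntryData'].get('rankingMetric', {})
    chart.append({
        "Rank": entry['chartEntryData']['currentRank'],
        "Artist": ', '.join(a['name'] for a in entry['trackMetadata']['artists']),
        "TrackName": entry['trackMetadata']['trackName'],
        "Streams": metric.get('value'),  # None when the field is absent
    })

print(pd.DataFrame(chart).to_string(index=False))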
In the example link you provided, there aren't 200 songs, but only 50. The following is one way to get those songs:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("window-size=1920,1080")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://charts.spotify.com/charts/view/regional-tr-daily/2022-09-14'
browser.get(url)
wait = WebDriverWait(browser, 5)

try:
    wait.until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print('accepted cookies')
except TimeoutException:
    print('no cookie button')

# the sticky page header would intercept clicks, so remove it from the DOM
header_to_be_removed = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'header[data-testid="charts-header"]')))
browser.execute_script("""
var element = arguments[0];
element.parentNode.removeChild(element);
""", header_to_be_removed)

# keep clicking 'show more' until the button no longer appears
while True:
    try:
        show_more_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@data-testid="load-more-entries"]//button')))
        show_more_button.location_once_scrolled_into_view  # scroll the button into view
        t.sleep(5)
        show_more_button.click()
        print('clicked to show more')
        t.sleep(3)
    except TimeoutException:
        print('all done')
        break

songs = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'li[data-testid="charts-entry-item"]')))
print('we have', len(songs), 'songs')

song_list = []
for song in songs:
    song.location_once_scrolled_into_view  # rows are lazy-loaded, so scroll each into view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]')
    artist = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]')
    song_list.append((artist.text, title.text))

df = pd.DataFrame(song_list, columns=['Artist', 'Title'])  # columns in the same order as the tuples
print(df)
This will print out in the terminal:
no cookie button
clicked to show more
clicked to show more
clicked to show more
clicked to show more
all done
we have 50 songs
    Artist             Title
 0  Bizarrap,          Quevedo: Bzrp Music Sessions, Vol. 52
 1  Harry Styles       As It Was
 2  Bad Bunny,         Me Porto Bonito
 3  Bad Bunny          Tití Me Preguntó
 4  Manuel Turizo      La Bachata
 5  ROSALÍA            DESPECHÁ
 6  BLACKPINK          Pink Venom
 7  David Guetta,      I'm Good (Blue)
 8  OneRepublic        I Ain't Worried
 9  Bad Bunny          Efecto
10  Chris Brown        Under The Influence
11  Steve Lacy         Bad Habit
12  Bad Bunny,         Ojitos Lindos
13  Kate Bush          Running Up That Hill (A Deal With God) - 2018 Remaster
14  Joji               Glimpse of Us
15  Nicki Minaj        Super Freaky Girl
16  Bad Bunny          Moscow Mule
17  Rosa Linn          SNAP
18  Glass Animals      Heat Waves
19  KAROL G            PROVENZA
20  Charlie Puth,      Left and Right (Feat. Jung Kook of BTS)
21  Harry Styles       Late Night Talking
22  The Kid LAROI,     STAY (with Justin Bieber)
23  Tom Odell          Another Love
24  Central Cee        Doja
25  Stephen Sanchez    Until I Found You
26  Bad Bunny          Neverita
27  Post Malone,       I Like You (A Happier Song) (with Doja Cat)
28  Lizzo              About Damn Time
29  Nicky Youre,       Sunroof
30  Elton John,        Hold Me Closer
31  Luar La L          Caile
32  KAROL G,           GATÚBELA
33  The Weeknd         Die For You
34  Bad Bunny,         Tarot
35  James Hype,        Ferrari
36  Imagine Dragons    Bones
37  Elton John,        Cold Heart - PNAU Remix
38  The Neighbourhood  Sweater Weather
39  Ghost              Mary On A Cross
40  Shakira,           Te Felicito
41  Justin Bieber      Ghost
42  Bad Bunny,         Party
43  Drake,             Jimmy Cooks (feat. 21 Savage)
44  Doja Cat           Vegas (From the Original Motion Picture Soundtrack ELVIS)
45  Camila Cabello,    Bam Bam (feat. Ed Sheeran)
46  Rauw Alejandro,    LOKERA
47  Rels B             cómo dormiste?
48  The Weeknd         Blinding Lights
49  Arctic Monkeys     505
Of course, you can also get other info, like the chart ranking, or all artists when there is more than one; see the sketch below.
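For instance, the trailing commas in the Artist column above suggest that .text returned only the visible portion of the artists span (Selenium's .text skips CSS-hidden text), while the textContent attribute returns the full DOM text. Here is a sketch of a drop-in replacement for the extraction loop in the script above (it reuses the songs, t, By, and song_list names from there); the rank selector is a hypothetical placeholder you should verify against the live page:

song_list = []
for song in songs:
    song.location_once_scrolled_into_view
    t.sleep(1)
    title = song.find_element(By.CSS_SELECTOR, 'p[class^="Type__TypeElement-"]')
    # textContent includes text that .text hides, so all artists come through
    artists = song.find_element(By.CSS_SELECTOR, 'span[data-testid="artists-names"]').get_attribute('textContent')
    # hypothetical selector for the rank cell; inspect the page to confirm it
    rank = song.find_element(By.CSS_SELECTOR, 'span[data-testid="rank"]').text
    song_list.append((rank, artists, title.text))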
The Selenium Chrome/chromedriver setup above is for Linux; to adapt it to your own system, keep the imports and the code after the browser is defined, and change only the driver setup.
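As an aside, if you are on Selenium 4.6 or newer, the bundled Selenium Manager resolves a matching chromedriver automatically, so the explicit Service path can be dropped. A minimal sketch, assuming a recent Selenium and a local Chrome install:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("window-size=1920,1080")

# Selenium Manager (bundled with Selenium 4.6+) downloads a matching chromedriver
browser = webdriver.Chrome(options=chrome_options)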
Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html
For selenium docs, visit: https://www.selenium.dev/documentation/