Loop scrapes the same page 20 times instead of iterating through range - python

I'm trying to scrape IMDB for a list of the top 1000 movies and get some details about them. However, when I run it, instead of getting the first 50 movies and going to the next page for the next 50, it repeats the loop and makes the same 50 entries 20 times in my database.
# Dataframe template
data = pd.DataFrame(columns=['ID','Title','Genre','Summary'])

#Get page data function
def getPageContent(start=1):
    start = 1
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

#Run for top 1000
for start in range(1,1001,50):
    getPageContent(start)
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = movie.find('span', 'genre').contents[0]
        genres = [g.strip() for g in genres.split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        i = data.shape[0]
        data.loc[i] = [id,title,genres,summary]
#Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID] #remove . from ID
data.head(51)
0 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...
1 2. The Dark Knight [Action, Crime, Drama] [\nWhen the menace known as the Joker wreaks h...
2 3. Inception [Action, Adventure, Sci-Fi] [\nA thief who steals corporate secrets throug...
3 4. Fight Club [Drama] [\nAn insomniac office worker and a devil-may-...
...
46 47. The Usual Suspects [Crime, Drama, Mystery] [\nA sole survivor tells of the twisty events ...
47 48. The Truman Show [Comedy, Drama] [\nAn insurance salesman discovers his whole l...
48 49. Avengers: Infinity War [Action, Adventure, Sci-Fi] [\nThe Avengers and their allies must be willi...
49 50. Iron Man [Action, Adventure, Sci-Fi] [\nAfter being held captive in an Afghan cave,...
50 1. The Shawshank Redemption [Drama] [\nTwo imprisoned men bond over a number of ye...

Delete the 'start = 1' line inside the 'getPageContent' function: it overwrites the 'start' argument with 1 on every call, so every request fetches the first page.
#Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start='+str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs
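Note also that the loop never captures the function's return value, so 'bs' is not defined where 'movies = bs.findAll(...)' runs. An untested sketch of the corrected loop, reusing the code from the question:
#Run for top 1000; each result page holds 50 movies
for start in range(1, 1001, 50):
    bs = getPageContent(start)  # capture the returned soup
    movies = bs.findAll("div", "lister-item-content")
    for movie in movies:
        id = movie.find("span", "lister-item-index").contents[0]
        title = movie.find('a').contents[0]
        genres = [g.strip() for g in movie.find('span', 'genre').contents[0].split(',')]
        summary = movie.find("p", "text-muted").find_next_sibling("p").contents
        data.loc[data.shape[0]] = [id, title, genres, summary]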

I was not able to test this code. See inline comments for what I see as the main issue.
# Dataframe template
data = pd.DataFrame(columns=['ID', 'Title', 'Genre', 'Summary'])

# Get page data function
def getPageContent(start=1):
    url = 'https://www.imdb.com/search/title/?title_type=feature&year=1950-01-01,2019-12-31&sort=num_votes,desc&start=' + str(start)
    r = requests.get(url)
    bs = bsp(r.text, "lxml")
    return bs

# Run for top 1000
# for start in range(1, 1001, 50):  # 50 is a step value so this gets every 50th movie
# Try 2 loops
for group in range(0, 1001, 50):
    for item in range(group, group + 50):
        bs = getPageContent(item)  # capture the returned soup
        movies = bs.findAll("div", "lister-item-content")
        for movie in movies:
            id = movie.find("span", "lister-item-index").contents[0]
            title = movie.find('a').contents[0]
            genres = movie.find('span', 'genre').contents[0]
            genres = [g.strip() for g in genres.split(',')]
            summary = movie.find("p", "text-muted").find_next_sibling("p").contents
            i = data.shape[0]
            data.loc[i] = [id, title, genres, summary]

# Clean data
# data.ID = [float(re.sub('.','',str(i))) for i in data.ID]  # remove . from ID
data.head(51)

How to scrape a table but 'not a table' from a page, using python?

Humble greetings and welcome to anyone willing to spend time here. I shall introduce myself as a very green student of data science and also Python. This thread is meant to gain insight from more fortunate minds capable of deeper understanding within the realm of Python.
As we can see, the value for each row can be found easily on page inspection, but it seems they all use the same class name. So far I'm afraid I couldn't even find the right keyword to search for any working method on Google.
This is the code I've tried. It doesn't work, and it's embarrassing, but I have to show it anyway. I've tried fiddling by adding .content, .text, find, and find_all, but I understand that my failure lies at a deeper, more fundamental level.
from bs4 import BeautifulSoup
import requests
from csv import writer
import pandas as pd

url = 'https://m4.mobilelegends.com/stats'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
lists = soup.find('div', class_="m4-team-stats-scroll")

with open('m4stats_team.csv', 'w', encoding='utf8', newline='') as f:
    thewriter = writer(f)
    header = ['Team', 'Win Rate', 'Average KDA', 'Average Kills', 'average Deaths', 'Average Assists', 'Average Game Time', 'Average Lord Kills', 'Average Tortoise Kills', 'Average Towers Destroy', 'First Blood Rate', 'Hero Pool']
    thewriter.writerow(header)
    for list in lists:
        team = list.find_all('p', class_="h3 pl-5 whitespace-nowrap hidden xl:block")
        awr = list.find_all('p', class_="h4")
        akda = list.find('p', class_="h4").text
        akill = list.find('p', class_="h4").text
        adeath = list.find('p', class_="h4").text
        aassist = list.find('p', class_="h4").text
        atime = list.find('p', class_="h4").text
        aalord = list.find('p', class_="h4").text
        atortoise = list.find('p', class_="h4").text
        atower = list.find('p', class_="h4").text
        firstblood = list.find('p', class_="h4").text
        hrpool = list.find('p', class_="h4").text
        info = [team, awr, akda, akill, adeath, aassist, atime, aalord, atortoise, atower, firstblood, hrpool]
        thewriter.writerow(info)

pd.read_csv('m4stats_team.csv').head()
What am I expecting:
Any kind of insight, whether it's a clue, a keyword, or a code snippet; I appreciate and am most thankful for any kind of guidance. I am not asking for the complete scraped CSV, as I could have done that manually. At this point I want to be able to do basic web scraping myself.
You can iterate over rows in the table and its items.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://m4.mobilelegends.com/stats')
page.raise_for_status()
page = BeautifulSoup(page.content)

table = page.find("div", class_="m4-team-stats-scroll")
with open("table.csv", "w") as file:
    for row in table.find_all("div", class_="m4-team-stats"):
        items = row.find_all("div", class_="col-span-1")
        # write into file in csv format, use map to extract text from items
        file.write(",".join(map(lambda item: item.text, items)) + "\n")
Display output:
import pandas as pd
df = pd.read_csv("table.csv")
print(df)
# Outputs:
"""
Team ↓Win Rate ... ↓First Blood Rate ↓Hero pool
0 echo 72.0% ... 48.0% 37
1 rrq 60.9% ... 60.9% 37
2 tv 60.0% ... 60.0% 29
3 fcon 55.0% ... 85.0% 32
4 inc 53.3% ... 26.7% 31
5 onic 52.9% ... 47.1% 39
6 blck 52.2% ... 47.8% 31
7 rrq-br 46.2% ... 30.8% 32
8 thq 45.5% ... 63.6% 27
9 s11 42.9% ... 28.6% 26
10 tdk 37.5% ... 62.5% 24
11 ot 28.6% ... 28.6% 21
12 mvg 20.0% ... 20.0% 15
13 rsg-sg 20.0% ... 60.0% 17
14 burn 0.0% ... 20.0% 21
15 mdh 0.0% ... 40.0% 18
[16 rows x 12 columns]
"""

BeautifulSoup trying to get text from wrapped divs but empty or "none" is being returned

Here is a picture (sorry) of the HTML that I am trying to parse:
I am using this line:
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
Thinking that I'd get the 1st child of the class statText and the outcome would be 53%.
But it's not. I get "Loading..." and none of the data that I was trying to use and display.
The full code I have so far:
soup = BeautifulSoup(source, 'lxml')
home_team = soup.find('div', class_='tname-home').a.text
away_team = soup.find('div', class_='tname-away').a.text
home_score = soup.select_one('.current-result .scoreboard:nth-child(1)').text
away_score = soup.select_one('.current-result .scoreboard:nth-child(2)').text
print("The home team is " + home_team, "and they scored " + home_score)
print()
print("The away team is " + away_team, "and they scored " + away_score)
home_stats = soup.select_one('div', class_='statText:nth-child(1)').text
print(home_stats)
This currently does print the home and away teams and the number of goals they scored. But I can't seem to get any of the statistical content from this site.
My output plan is to have:
[home_team] had 53% ball possession and [away_team] had 47% ball possession
However, I would like to remove the "%" symbols from the parse (but that's not essential). My plan is to use these numbers for more stats later on, so the % symbol gets in the way.
Apologies for the noob question - this is the absolute beginning of my Pythonic journey. I have scoured the internet and StackOverflow and just can not find this situation - I also possibly don't know exactly what I am looking for either.
Thanks kindly for your help! May your answer be the one I pick as "correct" ;)
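A note on the selector in the question before the answers: select_one() takes a single CSS selector string, while the class_ keyword belongs to find()/find_all(), so 'div' alone is what actually gets matched here. The intended call would look more like the sketch below, though no static selector will help if the page fills those values in with JavaScript, which the "Loading..." text suggests:
# select_one takes one CSS selector string, not find-style keywords
home_stats = soup.select_one('div.statText:nth-child(1)').text
# note: :nth-child counts position among all siblings,
# not among elements of the statText class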
Assuming that this is the website you are trying to scrape, here is the complete code to scrape all the stats:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.scoreboard.com/en/match/SO3Fg7NR/#match-statistics;0')
pg = driver.page_source  # Gets the source code of the page
driver.close()

soup = BeautifulSoup(pg, 'html.parser')  # Creates a soup object
statrows = soup.find_all('div', class_="statTextGroup")  # Finds all the div tags with class statTextGroup -- these div tags contain the stats

# Scrapes the team names
teams = soup.find_all('a', class_="participant-imglink")
teamslst = []
for x in teams:
    team = x.text.strip()
    if team != "":
        teamslst.append(team)

stats_dict = {}
count = 0
for x in statrows:
    txt = x.text
    final_txt = ""
    stat = ""
    alphabet = False
    percentage = False
    # Extracts the numbers from the text
    for c in txt:
        if c in '0123456789':
            final_txt += c
        else:
            if alphabet == False:
                final_txt += "-"
                alphabet = True
            if c != "%":
                stat += c
            else:
                percentage = True
    values = final_txt.split('-')
    # Appends the values to the dictionary
    for x in values:
        if stat in stats_dict.keys():
            if percentage == True:
                stats_dict[stat].append(x + "%")
            else:
                stats_dict[stat].append(int(x))
        else:
            if percentage == True:
                stats_dict[stat] = [x + "%"]
            else:
                stats_dict[stat] = [int(x)]
    count += 1
    if count == 15:
        break

index = [teamslst[0], teamslst[1]]
# Creates a pandas DataFrame out of the dictionary
df = pd.DataFrame(stats_dict, index=index).T
print(df)
Output:
Burnley Southampton
Ball Possession 53% 47%
Goal Attempts 10 5
Shots on Goal 2 1
Shots off Goal 4 2
Blocked Shots 4 2
Free Kicks 11 10
Corner Kicks 8 2
Offsides 2 1
Goalkeeper Saves 0 2
Fouls 8 10
Yellow Cards 1 0
Total Passes 522 480
Tackles 15 12
Attacks 142 105
Dangerous Attacks 44 29
Hope that this helps!
P.S.: I actually wrote this code for a different question, but I didn't post it because an answer had already been posted! I didn't know it would come in handy now. Anyway, I hope my answer does what you need.
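On the question's side note about stripping the '%' symbols so the numbers can be used for arithmetic later, a one-line conversion is enough, e.g.:
raw = "53%"                        # as scraped
possession = int(raw.rstrip('%'))  # 53, ready for further stats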

For loop for web scraping in python

I have a small project web-scraping Google search with a list of keywords. I have built a nested for loop for scraping the search results. The problem is that the for loop over the keywords in the list does not work as I intended: it should scrape the data from each search result, but I only get the result for the last keyword, and nothing for the first two.
Here is the code:
browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
df = pd.DataFrame(columns=['ceo', 'value'])
baseUrl = 'https://www.google.com/search?q='
html = browser.page_source
soup = BeautifulSoup(html)
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
values = []
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    df = pd.DataFrame()
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        values = [ceo, value]
        s = pd.Series(values)
        df = df.append(s, ignore_index=True)
print(df)
The output:
0 1
0 Warren Buffet Born: October 28, 1955 (age 64 years), Seattle...
The output that I am expecting is as this:
0 1
0 Bill Gates Born:..........
1 Elon Musk Born:...........
2 Warren Buffett Born: August 30, 1930 (age 89 years), Omaha, N...
Any suggestions or comments are welcome here.
Declare df = pd.DataFrame() outside the for loop.
Currently you define it inside the loop, so a new data frame is initialized for each keyword in your list and the old one is replaced. That's why you only get the result for the last keyword. (The page source also has to be re-read after each browser.get, or soup never changes between iterations; see the comment in the code below.)
Try this:
browser = webdriver.Chrome(r"C:\...\chromedriver.exe")
baseUrl = 'https://www.google.com/search?q='
ceo_list = ["Bill Gates", "Elon Musk", "Warren Buffet"]
df = pd.DataFrame()  # created once, outside the loop
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    # re-read and re-parse the page source after each navigation,
    # otherwise soup still holds whatever was loaded before the loop
    html = browser.page_source
    soup = BeautifulSoup(html)
    r = soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk')
    for i in r:
        value = i.select_one('div.Z1hOCe').text
        ceo = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        s = pd.Series([ceo, value])
        df = df.append(s, ignore_index=True)
print(df)
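As a further note, DataFrame.append was deprecated in later pandas releases, and growing a frame row by row is slow; a sketch of the same loop that collects plain tuples and builds the frame once at the end:
rows = []
for ceo in ceo_list:
    browser.get(baseUrl + ceo)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    for i in soup.select('div.g.rhsvw.kno-kp.mnr-c.g-blk'):
        value = i.select_one('div.Z1hOCe').text
        name = i.select_one('.kno-ecr-pt.PZPZlf.gsmt.i8lZMc').text
        rows.append((name, value))
# one DataFrame construction instead of repeated appends
df = pd.DataFrame(rows, columns=['ceo', 'value'])
print(df)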

Beautifulsoup + Python HTML UL targeting, creating a list and appending to variables

I'm trying to scrape Autotrader's website to get an Excel file of the stats and names.
I'm stuck trying to loop through an HTML 'ul' element that has no classes or IDs, organize that info into a Python list, and then append the individual li elements to different fields in my table.
As you can see, I'm able to target the title and price elements, but the 'ul' is really tricky... well... for someone at my skill level.
The specific code I'm struggling with:
for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_='listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_='price-column')
    for container in ad_containers:
        name = container.find('a', class_="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
And the error message I get is the following:
Full code here:
from requests import get
from bs4 import BeautifulSoup
import pandas
# from time import sleep, time
# import random

# Create table fields
names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []

for i in range(1, 2):
    # Make a get request
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    # Pause the loop
    # sleep(random.randint(4, 7))
    # Create containers
    html_soup = BeautifulSoup(response.text, 'html.parser')
    ad_containers = html_soup.find_all('h2', class_='listing-title title-wrap')
    price_containers = html_soup.find_all('section', class_='price-column')
    for container in ad_containers:
        name = container.find('a', class_="js-click-handler listing-fpa-link").text
        names.append(name)
        # Trying to loop through the key specs list and assign each 'li' to a different field in the table
        lis = []
        list_container = container.find('ul', class_='listing-key-specs')
        for li in list_container.find('li'):
            lis.append(li)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])
        lis = []  # Clearing the list to get ready for the next set of data
    for price_container in price_containers:
        price = price_container.find('div', class_='vehicle-price').text
        prices.append(price)

test_df = pandas.DataFrame({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type})
print(test_df.info())
# test_df.to_csv('Autotrader_test.csv')
I followed the advice from David in the other answer's comment area.
Code:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

pd.set_option('display.width', 1000)
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

names = []
prices = []
year = []
body_type = []
milage = []
engine = []
hp = []
transmission = []
petrol_type = []

for i in range(1, 2):
    response = get('https://www.autotrader.co.uk/car-search?sort=sponsored&seller-type=private&page=' + str(i))
    html_soup = BeautifulSoup(response.text, 'html.parser')
    outer = html_soup.find_all('article', class_='search-listing')
    for inner in outer:
        lis = []
        names.append(inner.find_all('a', class_="js-click-handler listing-fpa-link")[1].text)
        prices.append(inner.find('div', class_='vehicle-price').text)
        for li in inner.find_all('ul', class_='listing-key-specs'):
            for i in li.find_all('li')[-7:]:
                lis.append(i.text)
        year.append(lis[0])
        body_type.append(lis[1])
        milage.append(lis[2])
        engine.append(lis[3])
        hp.append(lis[4])
        transmission.append(lis[5])
        petrol_type.append(lis[6])

test_df = pd.DataFrame.from_dict({'Title': names, 'Price': prices, 'Year': year, 'Body Type': body_type, 'Mileage': milage, 'Engine Size': engine, 'HP': hp, 'Transmission': transmission, 'Petrol Type': petrol_type}, orient='index')
print(test_df.transpose())
Output:
Title Price Year Body Type Mileage Engine Size HP Transmission Petrol Type
0 Citroen C3 1.4 HDi Exclusive 5dr £500 2002 (52 reg) Hatchback 123,065 miles 1.4L 70bhp Manual Diesel
1 Volvo V40 1.6 XS 5dr £585 1999 (V reg) Estate 125,000 miles 1.6L 109bhp Manual Petrol
2 Toyota Yaris 1.3 VVT-i 16v GLS 3dr £700 2000 (W reg) Hatchback 94,000 miles 1.3L 85bhp Automatic Petrol
3 MG Zt-T 2.5 190 + 5dr £750 2002 (52 reg) Estate 95,000 miles 2.5L 188bhp Manual Petrol
4 Volkswagen Golf 1.9 SDI E 5dr £795 2001 (51 reg) Hatchback 153,000 miles 1.9L 68bhp Manual Diesel
5 Volkswagen Polo 1.9 SDI Twist 5dr £820 2005 (05 reg) Hatchback 106,116 miles 1.9L 64bhp Manual Diesel
6 Volkswagen Polo 1.4 S 3dr (a/c) £850 2002 (02 reg) Hatchback 125,640 miles 1.4L 75bhp Manual Petrol
7 KIA Picanto 1.1 LX 5dr £990 2005 (05 reg) Hatchback 109,000 miles 1.1L 64bhp Manual Petrol
8 Vauxhall Corsa 1.2 i 16v SXi 3dr £995 2004 (54 reg) Hatchback 81,114 miles 1.2L 74bhp Manual Petrol
9 Volkswagen Beetle 1.6 3dr £995 2003 (53 reg) Hatchback 128,000 miles 1.6L 102bhp Manual Petrol
The ul is not a child of the h2; it's a sibling.
So you will need to make a separate selection, because it's not part of the ad_containers.
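A sketch of that separate selection, assuming (untested against the live page) that the ul follows the h2 as a later sibling:
for container in ad_containers:
    names.append(container.find('a', class_="js-click-handler listing-fpa-link").text)
    # the spec list sits next to the h2, not inside it
    spec_list = container.find_next_sibling('ul', class_='listing-key-specs')
    if spec_list is not None:
        lis = [li.text for li in spec_list.find_all('li')]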

html scraping using python topboxoffice list from imdb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want to print the top box office list from the above site (every movie's rank, title, weekend gross, total gross, and weeks, in that order).
Example output:
Rank:1
title: godzilla
weekend:$93.2M
Gross:$93.2M
Weeks: 1
Rank: 2
title: Neighbours
This is just a simple way to extract those entities with BeautifulSoup:
from bs4 import BeautifulSoup
import urllib2

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')

rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
    for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print data.get_text()
P.S. Arrange the output according to your needs.
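That snippet is Python 2 (urllib2 and print statements). A rough Python 3 equivalent, assuming the same page markup (untested):
from bs4 import BeautifulSoup
import requests

url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
page = BeautifulSoup(requests.get(url).text, 'html.parser')
rows = page.find_all("tr", {'class': ['odd', 'even']})
for tr in rows:
    for td in tr.find_all("td", {'class': ['titleColumn', 'weeksColumn', 'ratingColumn']}):
        print(td.get_text())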
There is no need to scrape anything. See the answer I gave here: How to scrape data from imdb business page?
The Python script below will give you 1) the list of Top Box Office movies from IMDb, and 2) the list of cast members for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.iteritems():
        print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
