How to traverse IMDb movie reviews? - python

I want to download some movie reviews from IMDb so that I can use them for my LDA model (for school).
But the default reviews page shows only 25 reviews (e.g. https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv).
If I want more, I have to press the "Load More" button at the bottom of the page, which loads 25 more reviews.
I don't know how to automate this in Python: I can't go to https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv/2 or add a ?page=2 parameter.
How can I automate traversing the pages of the IMDb review site using Python?
Also, is this deliberately made so difficult?

When I click Load More, DevTools in Chrome/Firefox (tab: Network, filter: XHR) shows a request like
https://www.imdb.com/title/tt0111161/reviews/_ajax?ref_=undefined&paginationKey=g4xolermtiqhejcxxxgs753i36t52q343mpt34pjada6qpye4w6qtalmfyy7wfxcwfzuwsyh
which carries paginationKey=g4x...
The HTML contains something similar in <div ... data-key="g4x...">, so
I can use this data-key to build the URL for the next page.
Example code:
First I get the HTML from the normal URL and extract the review titles. Then I read the data-key and build the URL for the next batch of reviews. I repeat this in a for-loop to get 3 pages, but you could use a while True loop and keep going as long as there is still a data-key (see the sketch after the example output below).
import requests
from bs4 import BeautifulSoup

s = requests.Session()
#s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'

# get first/full page
url = 'https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv'
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

items = soup.find_all('a', {'class': 'title'})
for number, title in enumerate(items, 1):
    print(number, '>', title.text.strip())

# get next page(s)
for _ in range(3):
    div = soup.find('div', {'data-key': True})
    print('---', div['data-key'], '---')

    url = 'https://www.imdb.com/title/tt0111161/reviews/_ajax'
    payload = {
        'ref_': 'tt_ql_urv',
        'paginationKey': div['data-key']
    }
    #headers = {'X-Requested-With': 'XMLHttpRequest'}
    r = s.get(url, params=payload)  #, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    items = soup.find_all('a', {'class': 'title'})
    for number, title in enumerate(items, 1):
        print(number, '>', title.text.strip())
Result:
1 > Enthralling, fantastic, intriguing, truly remarkable!
2 > "I Had To Go To Prison To Learn To Be A Crook"
3 > Masterpiece
4 > All-time prison film classic
5 > Freeman gives it depth
6 > impressive
7 > Simply a great story that is moving and uplifting
8 > An incredible movie. One that lives with you.
9 > "I'm a convicted murderer who provides sound financial planning".
10 > IMDb and the Greatest Film of All Time
11 > never give up hope
12 > The Shawshank Redemption
13 > Brutal Anti-Bible Bigotry Prevails Again
14 > Time and Pressure.
15 > A classic
16 > An extraordinary and unforgettable film about a bank veep who is convicted of murders and sentenced to the toughest prison
17 > A genre picture, but a satisfying one...
18 > Why it is ranked so highly.
19 > Exceptional
20 > Shawshank Redemption- Prison Film is Redeemed by Quality ****
21 > A Classic Film On Hope And Redemption
22 > Compelling masterpiece
23 > Relentless Storytelling
24 > Some birds aren't meant to be caged.
25 > Good , But It Is Overrated By Some
--- g4xolermtiqhejcxxxgs753i36t52q343mpt34pjada6qpye4w6qtalmfyy7wfxcwfzuwsyh ---
1 > Stephen King's prison tale with a happy ending...
2 > Breaking Big Rocks Into Little Rocks
3 > Over Rated
4 > Terrific stuff!
5 > Highly Overrated But Still Good
6 > Superb
7 > Beautiful movie
8 > Tedious, overlong, with "hope" being the second word spoken in just about every sentence... who cares?
9 > Excellent Stephen King adaptation; flawless Robbins & Freeman
10 > Good for the spirit
11 > Entertaining Prison Movie Isn't Nearly as Good as Its Rabid Fan Base Would Lead You to Believe
12 > Observations...
13 > Why can't they make films like this anymore?
14 > Shawshank Redemption Comes Out Clean
15 > Hope Springs Eternal:Rita Hayworth And The Shawshank Redemption.
16 > Redeeming.
17 > You don't understand! I'm not supposed to be here!
18 > A Story Of Hope & Resilence
19 > Salvation lies within....
20 > Pretty good movie...one of those that you do not really need to watch from beginning to end.
21 > A film of Eloquence
22 > A great film of a helping hand leading to end-around justice
23 > about freedom
24 > Reputation notwithstanding, this is powerful stuff
25 > The best film ever made!
--- g4uorermtiqhejcxxxgs753i36t52q343eod34plapeoqp27z6b2lhdevccn5wyrz2vmgufh ---
1 > A Sort of Secular Redemption
2 > In virus times, we need this hope.
3 > The placement isn't an exaggeration
4 > A true story of friendship and hard times
5 > Escape from Shawshank
6 > Great Story Telling
7 > moving and emotionally impactful(if you liked "The Green Mile" you will like this movie)
8 > Super Good - Highly Recommended
9 > I can see why this is rated Number 1 on IMDb.
# ...
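A minimal sketch of the while True variant mentioned above, reusing the URL and selectors from the answer; it stops once no data-key is left on the page. Untested, and IMDb's markup may have changed since this was written:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get('https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv')
soup = BeautifulSoup(r.text, 'html.parser')

while True:
    # print the titles in the current batch of reviews
    for number, title in enumerate(soup.find_all('a', {'class': 'title'}), 1):
        print(number, '>', title.text.strip())
    # no data-key left means we are on the last page
    div = soup.find('div', {'data-key': True})
    if not div:
        break
    r = s.get('https://www.imdb.com/title/tt0111161/reviews/_ajax',
              params={'ref_': 'tt_ql_urv', 'paginationKey': div['data-key']})
    soup = BeautifulSoup(r.text, 'html.parser')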

Related

Having two divs under the same class, take the content of the first and second separately (web scraping with BeautifulSoup)

I have the following HTML inside the content_list variable:
<h3 class="sds-heading--7 title">Problems with battery capacity long-term</h3>
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
<div class="review-type"><strong>Owns this car</strong></div>
</div>
<div class="review-section">
<p class="review-body">We have owned our Leaf since May 2011. We have loved the car but are now getting quite concerned. My husband drives the car, on average, 20-40 miles/day to and from work and running errands, mostly 100% on city roads. We live in San Diego, so no issue with winter weather and we live 7 miles from the ocean so seldom have daytime temperatures above 85. Originally, we would get 65-70 miles per 80-90% charge. Last fall we noticed that there was considerably less remaining charge left after a day of driving. He began to track daily miles, remaining "bars", as well as started charging it 100%. For 9 months we have only been getting 40-45 miles on a full charge with only 1-2 "bars" remaining at the end of the day. Sometimes it will be blinking and "talking" to us to get to a charging place ASAP. We just had it into the dealership. Though on a full charge, the car gauge shows 12 bars, the dealership states that the batteries have lost 2 bars via the computer diagnostics (which we are told is a different reading from the car gauge itself) and, that they say, is average and excepted for the car at this age. Everything else (software, diagnostics, etc.) shows 100%, so the dealership thinks that the car is functioning as it should. They are unable to explain why we can only go 40-45 miles on a charge, but keep saying that the car tests out fine. If the distance one is able to drive on a full charge decreases any further, it will begin to render the car useless. As someone else recommended, in retrospect, the best way to go is to lease the Leaf so that battery life is not an issue.</p>
</div>
First I used this code to get the collection of reviews:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assuming UserAgent comes from fake_useragent

ua = UserAgent()
header = {'User-Agent': str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
Now I would like to take the value of date of the review and the name of the reviewer which in this case would be
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
The problem is I can't separate those two divs
My code:
data = []
for e in content_list:
    data.append({
        'review_date': e.find_all("div", {"class": "review-byline"})[0].text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
The result of my code
{'overall_rating': '4.7',
'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
'review_date': '\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n',
'review_title': 'Great Electric Car!'}
For the first one you can address the first <div> directly:
'review_date': e.find("div", {"class": "review-byline"}).div.text,
For the second one use e.g. a CSS selector:
'reviewer_name': e.select_one("div.review-byline div:nth-of-type(2)").text,
Example
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})

data = []
for e in content_list:
    data.append({
        'review_date': e.find("div", {"class": "review-byline"}).div.text,
        'reviewer_name': e.select_one("div.review-byline div:nth-of-type(2)").text,
        'overall_rating': e.select_one('span.sds-rating__count').text,
        'review_title': e.h3.text,
        'review_content': e.select_one('p').text,
    })
data
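Since the URL already carries a ?page= query parameter, the same loop extends naturally over several pages. A hedged sketch (the plain User-Agent string and the page range are placeholder assumptions):
import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0'}  # placeholder; any realistic UA string
data = []
for page in range(1, 4):  # pages 1..3, adjust as needed
    url = f'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page={page}'
    response = requests.get(url, headers=header)
    html_soup = BeautifulSoup(response.text, 'lxml')
    # same per-review extraction as in the example above
    for e in html_soup.find_all('div', attrs={'class': 'consumer-review-container'}):
        data.append({
            'review_date': e.find('div', {'class': 'review-byline'}).div.text,
            'reviewer_name': e.select_one('div.review-byline div:nth-of-type(2)').text,
        })
print(len(data))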

Ranker.com python beautifulsoup scraper not scraping the entire website

So I am working on a BeautifulSoup scraper that should scrape 100 names from the ranker.com list page. The code is as follows:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ranker.com/crowdranked-list/best-anime-series-all-time')
soup = BeautifulSoup(r.text, 'html.parser')
for p in soup.find_all('a', class_='gridItem_name__3zasT gridItem_nameLink__3jE6V'):
    print(p.text)
This works and gives the output as
Attack on Titan
My Hero Academia
Naruto: Shippuden
Hunter x Hunter (2011)
One-Punch Man
Fullmetal Alchemist: Brotherhood
One Piece
Naruto
Tokyo Ghoul
Assassination Classroom
The Seven Deadly Sins
Parasyte: The Maxim
Code Geass
Haikyuu!!
Your Lie in April
Noragami
Akame ga Kill!
Dragon Ball
No Game No Life
Fullmetal Alchemist
Dragon Ball Z
Cowboy Bebop
Steins;Gate
Mob Psycho 100
Fairy Tail
I wanted the program to fetch 100 items from the list, but it only gives 25. Can someone please help me with this?
The additional items come from an API call whose offset and limit params determine the next batch of 25 results to return. You can simply remove both and get a maximum of 200 results, or keep limit and set it to 100. You can ignore everything else in the API call apart from the endpoint.
import requests
r = requests.get('https://api.ranker.com/lists/538997/items?limit=100')
data = r.json()['listItems']
ranked_titles = {i['rank']:i['name'] for i in data}
print(ranked_titles)
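If you prefer batches, a sketch that pages explicitly with the offset and limit parameters described above (same endpoint and response fields as in the answer):
import requests

ranked_titles = {}
for offset in range(0, 100, 25):  # four batches of 25
    r = requests.get('https://api.ranker.com/lists/538997/items',
                     params={'offset': offset, 'limit': 25})
    for i in r.json()['listItems']:
        ranked_titles[i['rank']] = i['name']
print(len(ranked_titles), 'items fetched')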

Facebook Market scraping using class id in python

(Screenshot: the class ID to scrape)
I wanted to scrape data from Facebook Marketplace using Python and the script below; however, no data shows up when I run it. The class ID is shown in the picture above.
elements = driver.find_elements_by_class_name('l9j0dhe7 f9o22wc5 ad2k81qe')
for ele in elements:
    print(ele.text)
    print(ele.get_attribute('title'))
Okay, there are a few things you should take a look at: find_element_by_class_name takes only one class name, so you are better off using find_element_by_css_selector (see the selector sketch after the output below).
A solution for getting the information, based on what you posted:
Get all results:
elements = driver.find_elements_by_class_name('kbiprv82')
Loop results and print:
for ele in elements:
    title = ele.find_element_by_css_selector('span.a8c37x1j.ni8dbmo4.stjgntxs.l9j0dhe7').text
    price = ele.find_element_by_css_selector('div.hlyrhctz > span').text
    print(title, price)
Output
IMac 21,5 Ende 2012 380 €
Imac 27 Mitte 2011 (Wie Neu Zustand ) 550 €
iMac 14,2 27" 550 €
Hope that helps, let us know.
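As referenced above, several space-separated class names can be chained with dots into a single CSS selector; current Selenium 4 also replaces the find_element_by_* helpers with find_element(By..., ...). A sketch under those assumptions; the auto-generated class names come from the question and rotate regularly, and Facebook normally requires a logged-in session, so treat this as structural only:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.facebook.com/marketplace/')

# chain all three classes into one CSS selector
elements = driver.find_elements(By.CSS_SELECTOR, '.l9j0dhe7.f9o22wc5.ad2k81qe')
for ele in elements:
    print(ele.text)
    print(ele.get_attribute('title'))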

Access aria label and reviews of yelp with BeautifulSoup [closed]

I am trying to access the reviews and the star rating of each reviewer and append the values to a list. However, it doesn't return the output. Can anyone tell me what's wrong with my code?
l = []
for i in range(0, len(allrev)):
    try:
        l["stars"] = allrev[i].allrev.find("div", {"class": "lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-4__373c0__2YrSK border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK"}).get('aria-label')
    except:
        l["stars"] = None
    try:
        l["review"] = allrev[i].find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
    except:
        l["review"] = None
    u.append(l)
    l = {}
print({"data": u})
To get all the reviews you can try the following:
import requests
from bs4 import BeautifulSoup
URL = "https://www.yelp.com/biz/sushi-yasaka-new-york"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for star, review in zip(
    soup.select(
        ".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
    ),
    soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
):
    print(star.get("aria-label"))
    print(review.text)
    print("-" * 50)
Output:
5 star rating
I've been craving sushi for weeks now and Sushi Yasaka hit the spot for me. Their lunch prices are unbeatable. Their lunch specials seem to extend through weekends which is also amazing.I got the Miyabi lunch as take out and ate in along the benches near the MTA. It came with 4 nigiri, 7 sashimi and you get to pick the other roll (6 pieces). It also came with a side (choose salad or soup, add $1 for both). It was an incredible deal for only $20. I was so full and happy! The fish tasted very fresh with wonderful flavor. I ordered right as they opened and there were at least 10 people waiting outside when I picked up my food so I imagine there is high turnover, keeping the seafood fresh. This will be a regular splurge lunch spot for sure.
--------------------------------------------------
5 star rating
If you're looking for great sushi on Manhattan's upper west side, head over to Sushi Yakasa ! Best sushi lunch specials, especially for sashimi. I ordered the Miyabi - it included a fresh oyster ! The oyster was delicious, served raw on the half shell. The sashimi was delicious too. The portion size was very good for the area, which tends to be a pricey neighborhood. The restaurant is located on a busy street (west 72nd) & it was packed when I dropped by around lunchtimeStill, they handled my order with ease & had it ready quickly. Streamlined service & highly professional. It's a popular sushi place for a reason. Every piece of sashimi was perfect. The salmon avocado roll was delicious too. Very high quality for the price. Highly recommend! Update - I've ordered from Sushi Yasaka a few times since the pandemic & it's just as good as it was before. Fresh, and they always get my order correct. I like their takeout system - you can order over the phone (no app required) & they text you when it's ready. Home delivery is also available & very reliable. One of my favorite restaurants- I'm so glad they're still in business !
--------------------------------------------------
...
...
Edit to only get the first 100 reviews:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.com/biz/sushi-yasaka-new-york?start={}"

offset = 0
review_count = 0

with open("output.csv", "a", encoding="utf-8") as f:
    csv_writer = csv.writer(f, delimiter="\t")
    csv_writer.writerow(["rating", "review"])

    while True:
        resp = requests.get(url.format(offset))
        soup = BeautifulSoup(resp.content, "html.parser")

        for rating, review in zip(
            soup.select(
                ".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
            ),
            soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
        ):
            print(f"review # {review_count}. link: {resp.url}")
            csv_writer.writerow([rating.get("aria-label"), review.text])
            review_count += 1

            if review_count > 100:
                raise Exception("Exceeded 100 reviews.")

        offset += 20

BeautifulSoup can not find <h3> tags

I want to collect the most upvoted text on a Reddit page with the BeautifulSoup library, but whenever I try to run the code it cannot find the h3 tags and returns an empty list. How can I fix it?
import requests as re
from bs4 import BeautifulSoup
r=re.get("https://www.reddit.com/r/funfacts/")
soup=BeautifulSoup(r.content,"html.parser")
gelen=soup.find_all("h3")
print(gelen)
If you add .json to the Reddit URL, you get all the data in JSON format (a sketch that picks out just the most-upvoted post follows the example output below).
For example:
import json
import requests
url = 'https://www.reddit.com/r/funfacts.json' # <-- add .json here!
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
data = requests.get(url, headers=headers).json()
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
print('{:<8} {:<8} {}'.format('UPS', 'DOWNS', 'TITLE'))
for c in data['data']['children']:
    print('{:<8} {:<8} {}'.format(c['data']['ups'], c['data']['downs'], c['data']['title']))
Prints:
UPS DOWNS TITLE
38 0 People! Please remember to begin all posts with "Fun Fact:".
276 0 Fun Fact: Marathons are a tradition that originated from the ancient greeks where someone ran 26 miles and died.
165 0 Fun Fact: Andean condor in flight flaps its wings for just 1 % of its flight time.
347 0 fun fact i found on youtube..
0 0 Fun fact: planes only crash once
4 0 Fun fact dump
24 0 FUN FACT. DID YOU KNOW? An airplane mechanic invented Slinky while he was playing with engine parts and realized the possible secondary use for the springs. Seems like most invention occurred by just playing around sometimes.
119 0 Fun Fact: Walking the perimeter of this city = Walking from Philadelphia to Denver
0 0 Here's a funny fun fact!
3 0 Fun Fact: The Pygmy Marmoset is the World's Smallest Monkey
7 0 Fun Fact: Actor Christopher Lee was related to a king, a Confederate general, and the creator of James Bond, and was a soldier in WWII
111 0 Fun Fact: This is probably the most famous image on youtube. In just three days; over 4000 users had this image as their profile picture. The image appears to be a blood-covered creature thing.
5 0 Fun Fact: Walt Disney originally wanted to call his most famous creation Mortimer, but his wife convinced him to change the name to Mickey.
3 0 Fun fact: Sweden has the most islands of any country in the world, sitting on at least 220,000 islands. For comparison, The Phillipines has 7000 and Indonesia 17000 islands.
19 0 Fun Fact: Almost all humans have Neanderthal DNA.
197 0 Fun Fact: "The Anarchist Cookbook" is Riddled With Errors, and Publishers Won't Pull It, Even Though The Author Has Tried To Get It Off Shelves For Years (cross post from /r/Trivia)
0 0 Fun Fact: 23 Most Interesting Facts That You Have Never Heard Before
3 0 Fun Fact: Emily from Hannah Montanna and Cole from the sixth sense are siblings.
120 0 Fun Fact did you know that NBA player Lamarcus Aldridge is the cousin of NBA sideline reporter David Aldridge?
6 0 fun fact
12 0 Fun Fact: If you travel 69.169 miles in a direction, then you have traveled 1 degree of the entire earth's circumference.
113 0 Fun Fact: Rarest And Strangest Sharks Species Hidden In The Ocean
372 0 Putting scallions in water will grow more scallions
2 0 Fun Fact about the Ocean! Top 10
474 0 Wtf fun fact
4 0 Fun fact: Watch Dogs NPCs almost do not exist.
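And since the question asked for the most upvoted text specifically, a small follow-up sketch that picks the top post out of the same JSON payload:
import requests

url = 'https://www.reddit.com/r/funfacts.json'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
data = requests.get(url, headers=headers).json()

# the child with the highest 'ups' count is the most-upvoted post
top = max(data['data']['children'], key=lambda c: c['data']['ups'])
print(top['data']['ups'], top['data']['title'])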
