urllib2 / requests does not return the iframe of a webpage - Python

I'm trying to scrape some book data from www.amazon.in
http://www.amazon.in/Life-What-Make-Preeti-Shenoy/dp/9380349300/ref=sr_1_6?s=books&ie=UTF8&qid=1424652069&sr=1-6
I need the summary of that book, which is located in an iframe; the problem is that when I open that URL with requests, the response does not contain the iframe.
For example, when I do
bookPage = requests.get(bookURL).text
bookSoup = BeautifulSoup(bookPage, "lxml")
There is no iframe in bookPage, but the page as rendered in the browser contains one.
I've also tried it with urllib2 but it does not seem to work.
What's wrong?

You can get the book summary from the noscript tag located in the div element with id="bookDescription_feature_div":
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> response = requests.get('http://www.amazon.in/Life-What-Make-Preeti-Shenoy/dp/9380349300/ref=sr_1_6?s=books&ie=UTF8&qid=1424652069&sr=1-6',
... headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'})
>>>
>>> soup = BeautifulSoup(response.content, 'lxml')
>>> print soup.select('div#bookDescription_feature_div noscript')[0].get_text(strip=True)
Ankita Sharma has the world in her palms. She is young, smart and heads turn at every corner she walks by. Born into a conservative middle class household - this defines the chronicle of her life. Set in a time when Doordarshan was the prime source of entertainment and writing love letters was the general fad, every youngster dreams of the thrills of college life. And so, her admission into an MBA institute in Mumbai follows. Ankita's story begins here, from her life as a college student. Life seems all sunshine and flowers until a drastic turn leaves her staring at a disturbing path, only because of her own misdoing. Jump to six months later. The sun glistens on a sombre building. Magnetized in view, the words - “Mental Institute”. Who is the face staring out of the window?What if destiny twisted your journey? What if it dragged you to a place that houses your worst fears? Would you stand and fight or would you run? Set in the late eighties, across two cities, Life is What You Make It is a compelling account of growing up, determination, faith and how an unconquerable spirit can overcome the punches destiny throws at you. At its core, it is a love story that makes us question our identity and the concept of sanity.
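Since you also tried urllib2, the same approach works there too, as long as you send a browser-like User-Agent header. A minimal Python 2 sketch, assuming the same markup as above:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.amazon.in/Life-What-Make-Preeti-Shenoy/dp/9380349300/ref=sr_1_6?s=books&ie=UTF8&qid=1424652069&sr=1-6'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
# urllib2.Request accepts a headers dict, so the request looks like a browser's
html = urllib2.urlopen(urllib2.Request(url, headers=headers)).read()
soup = BeautifulSoup(html, 'lxml')
print soup.select('div#bookDescription_feature_div noscript')[0].get_text(strip=True)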

Related

How to scrape specified div tag from HTML code using Pandas?

I hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light, unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but I'm hoping I can handle that part).
It seems like I should be able to grab the div story tag, right? I am struggling with how to do that - I've tried various bits of code I've found online and tried to tweak them, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code thus far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print(story)
I'm really trying to learn this, not just copy and paste. In my mind, right now this code says (in English):
Import packages needed
Get URL HTML Text Data ---- (when I printed the complete code it worked fine)
Narrow down the HTML code to only include div tags labeled as "story" -- this obviously is the hiccup
Struggling to understand; going to keep playing with this, but figured I'd turn here for some advice - any thoughts are greatly appreciated. I'm just getting a blank result right now.
The page is rendered by JavaScript, which requests cannot execute, so the information (which is downloaded by the original request) remains in its raw, unrendered state inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! 
BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
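As background, script#__NEXT_DATA__ is where sites built with Next.js (nba.com among them) embed their initial page state as JSON. If you are unsure where a piece of data lives inside that blob, a quick way to explore it, building on the code above, is to pretty-print a slice and read the structure:
# Dump the first part of the pretty-printed JSON to see the available keys
print(json.dumps(json_obj['props']['pageProps'], indent=4)[:2000])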

BeautifulSoup web scraping for a webpage where information is obtained after clicking a button

So, I am trying to get the "Amenities and More" portion of the Yelp page for a few restaurants. The issue is that I can only get the amenities that are displayed first on the restaurant's Yelp page. There is an "n more" button that, when clicked, reveals more amenities. Using BeautifulSoup and Selenium with the webpage URL, and using BeautifulSoup with requests, give exactly the same results, and I am stuck on how to expand the whole amenities list before grabbing it in my code. Two pictures below show what happens before and after the click of the button.
Before clicking '5 More Attributes': the first pic shows 4 div elements, each containing a span, that I can get to using any of the above methods.
After clicking '5 More Attributes': the second pic shows 9 div elements, each containing a span, that I am trying to get to.
Here is the code using Selenium/BeautifulSoup:
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
URL = 'https://www.yelp.com/biz/ziggis-coffee-longmont'
driver = webdriver.Chrome(r"C:\Users\Fariha\AppData\Local\Programs\chromedriver_win32\chromedriver.exe")
driver.get(URL)
yelp_page_source_page1 = driver.page_source
soup = BeautifulSoup(yelp_page_source_page1,'html.parser')
spans = soup.find_all('span')
Result: there are 990 elements in spans; only a small portion is relevant to my question.
An alternative approach would be to extract the data directly from the site's JSON API. This can be done without the overhead of Selenium, as follows:
from bs4 import BeautifulSoup
import requests
import json
session = requests.Session()
r = session.get('https://www.yelp.com/biz/ziggis-coffee-longmont')
#r = session.get('https://www.yelp.com/biz/menchies-frozen-yogurt-lafayette')
soup = BeautifulSoup(r.content, 'lxml')
# Locate the business ID to use (from JSON inside one of the script entries)
for script in soup.find_all('script', attrs={"type" : "application/json"}):
    json_text = script.text.strip('<!->')
    if "businessId" in json_text:
        gaConfig = json.loads(json_text)
        try:
            biz_id = gaConfig["legacyProps"]["bizDetailsProps"]["bizDetailsMetaProps"]["businessId"]
            break
        except KeyError:
            pass
# Build a suitable JSON request for the required information
json_post = [
    {
        "operationName": "GetBusinessAttributes",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "35e0950cee1029aa00eef5180adb55af33a0217c64f379d778083eb4d1c805e7"
        }
    },
    {
        "operationName": "GetBizPageProperties",
        "variables": {
            "BizEncId": biz_id
        },
        "extensions": {
            "documentId": "f06d155f02e55e7aadb01d6469e34d4bad301f14b6e0eba92a31e635694ebc21"
        }
    },
]
r = session.post('https://www.yelp.com/gql/batch', json=json_post)
j = r.json()
business = j[0]['data']['business']
print(business['name'], '\n')
for property in j[1]['data']['business']['organizedProperties'][0]['properties']:
    print(f'{"Yes" if property["isActive"] else "No":5} {property["displayText"]}')
This would give you the following entries:
Ziggi's Coffee
Yes Offers Delivery
Yes Offers Takeout
Yes Accepts Credit Cards
Yes Private Lot Parking
Yes Bike Parking
Yes Drive-Thru
No No Outdoor Seating
No No Wi-Fi
Reviews could be obtained as follows:
r_reviews = session.get(f'https://www.yelp.com/biz/{biz_id}/review_feed', params={"start" : "0", "sort_by" : "relevance_desc", "q" : ""})
reviews = r_reviews.json()
for review in reviews["reviews"]:
    print(review["user"]["markupDisplayName"])
    print(review["comment"]["text"])
    print("----------")
Giving something like:
Jennifer C.
I am a huge local fan of Ziggi's. I find every experience with them them to be good.  I love that they have an app so you can order ahead if you happen to be closer to Main St vs Hover.  The app is easy to use and lets me customize everything. <br><br>I love that they have two drive thru locations and they are easy to navigate to and from. Their staff is always so nice too. <br><br>Their rewards program has been great for me since I go often enough to get the free drinks. <br><br>And the drinks!!  From coffee to lattes to italian sodas!! The frozen chocolate peanut butter drink!! The Colorado Sunrise and Limesicle!! Their citrus green tea! I mean really?? Its all so good. <br><br>Another favorite of mine is their large kids drink menu. Its so nice to take my son for a treat there. <br><br>I am so glad they are part of Longmont and  definitely have indefinite plans to remain a permanent customer.
----------
Judd O.
My wife was sold a defective gift-card as a gift for a colleague's wedding, it didn't work when the new bride attempted to use it at an Estes Park Ziggi's.  Funnily enough, the recipient's mother had said the same thing happened with a GC she'd bought from this same location.  We brought it back here and were told that, without the receipt that was the size of a thimble, we were boned.<br><br>That being said, we did get our second card's free drink and it'll be our last.<br><br>We're just lucky we only spent 25 bucks.<br><br>Edit: Also, the manager's name is Pristine, there's some grand irony in that.
----------
How was this solved?
Your best friend here is your browser's network dev tools. With these you can watch the requests the page makes to obtain its information. The normal flow is: the initial HTML page is downloaded, its JavaScript runs, and that JavaScript requests more data to further fill the page.
The trick is to first locate where the data you want is (often returned as JSON), then determine what you need to recreate the parameters needed to make the request for it.
To further understand this code, use print(). Print everything, it will show you how each part builds on the next part. It is how the script was written, one bit at a time.
Approaches using Selenium allow the JavaScript to run, but most of the time this is not needed, since the JavaScript is just making requests and formatting the data for display.
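For example, before writing any extraction logic for the Yelp page, you could dump the top-level keys of every embedded JSON blob to see what is available. A small exploration sketch building on the code above (key names will vary by page):
for script in soup.find_all('script', attrs={"type": "application/json"}):
    try:
        blob = json.loads(script.text.strip('<!->'))
        print(list(blob.keys()))  # inspect what each embedded JSON object contains
    except ValueError:
        pass  # skip anything that is not valid JSON after stripping comment markers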

Problems with code to pull data from website

I have this website and I would like to pull, via Python, all the company names, such as West Wood Events or Mitchell Event Planning.
But I am stuck on soup.find, since it returns [].
When I inspect the page, let's say on this element:
<div class="LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd">Mitchell Event Planning<wbr></div>
based on that, I would write:
week = soup.find(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
print(week)
And I get 0.
Am I missing something? I'm pretty new to this.
This string is not a single class but many classes separated by spaces.
In some modules you would have to use the original string with all its spaces, but it seems that in BeautifulSoup you have to use the classes separated by a single space.
The code works for me if I use a single space between LinesEllipsis and vendor-name--55315.
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
Or if I use a CSS selector, with a dot before every class in the string:
week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
Minimal working code:
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.theknot.com/marketplace/wedding-planners-acworth-ga?page=2'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
#week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
for item in week:
    print(item.text)
Result:
The Charming Details
Enraptured Events
pearl and sky events - planning, design and florals
Unique Occasions ByTNicole, Inc
Platinum Eventions
RED COMPANY ATLANTA
Pop + Fizz: Event Planning and Design
Patricia Elizabeth, certified wedding planners/producer
Rienza Events
Pollyanna Richter Weddings
Calabash Events, Inc.
Weddings by Carmona LLC
Emily Jordan Events
Perfectly Taylored Events
Lindsey Wise Designs
Elegant Weddings and Affairs
Party PLANit
Wedded Bliss
Above the Fray
Willow Jaymes Events
Coco Red Events
Liz D. Events, LLC
Leslie Cox Events
YSE LLC
Marmaros Productions
PerfectionsID, LLC
All Things Love
West Wood Events
Everlasting Elegance
Prestigious Occasions
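One caveat: the numeric suffixes in these class names (vendor-name--55315 and the like) are build-generated hashes and may change whenever the site is redeployed. A more defensive selector, sketched here on the assumption that the vendor-name prefix itself stays stable, matches a substring of the class attribute instead:
# Match any div whose class attribute contains "vendor-name", ignoring the hash suffix
week = soup.select('div[class*="vendor-name"]')
for item in week:
    print(item.text)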

Missing data when using BeautifulSoup to find Xbox Game Pass games

This page contains a list of all of the games available on Xbox Game Pass. I'd like to use BeautifulSoup to retrieve a list of the game names.
This is what I'm doing:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.xbox.com/en-US/xbox-game-pass/games?=pcgames')
if page.status_code != 200:
    print("Unable to load game pass games page")
    exit()
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h3', class_='gameDevLink') # returns []
s = soup.prettify()
with open('dump.html', 'w', encoding='utf-8') as f:
f.write(s)
If I inspect a game on the page I see something like this:
Inspecting a game's html
Each game is enclosed in an 'a' tag with its class set to gameDevLink.
My problem is that soup.find_all('a', class_='gameDevLink') returns no hits.
If I save the html generated by BeautifulSoup to disk and search for gameDevLink there are, again, no hits.
I don't understand why I can see the information in my browser but BeautifulSoup doesn't seem to see it.
The info about the games is loaded from another URL via JavaScript. You can use this script to simulate it:
import json
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36'}
game_ids_url = 'https://catalog.gamepass.com/sigls/v2?id=fdd9e2a7-0fee-49f6-ad69-4354098401ff&language=en-us&market=US'
game_info_url = 'https://displaycatalog.mp.microsoft.com/v7.0/products?bigIds={}&market=US&languages=en-us&MS-CV=XXX'
game_ids = requests.get(game_ids_url, headers=headers).json()
s = ','.join(i['id'] for i in game_ids if 'id' in i)
data = requests.get(game_info_url.format(s), headers=headers).json()
# uncomment this to print all data:
#print(json.dumps(data, indent=4))
# print some data to screen:
for p in data['Products']:
    print(p['LocalizedProperties'][0]['ProductTitle'])
    print(p['LocalizedProperties'][0]['ShortDescription'])
    print('-' * 80)
Prints:
A Plague Tale: Innocence - Windows 10
Follow the grim tale of young Amicia and her little brother Hugo, in a heartrending journey through the darkest hours of history. Hunted by Inquisition soldiers and surrounded by unstoppable swarms of rats, Amicia and Hugo will come to know and trust each other. As they struggle to survive against overwhelming odds, they will fight to find purpose in this brutal, unforgiving world.
--------------------------------------------------------------------------------
Age of Empires Definitive Edition
Age of Empires, the pivotal real-time strategy game that launched a 20-year legacy returns with modernized gameplay, all-new 4K visuals, 8-person multiplayer battles and a host of other new features. Welcome back to history.
--------------------------------------------------------------------------------
Age of Empires II: Definitive Edition
Age of Empires II: Definitive Edition celebrates the 20th anniversary of one of the most popular strategy games ever with stunning 4K Ultra HD graphics, a new and fully remastered soundtrack, and brand-new content, “The Last Khans” with 3 new campaigns and 4 new civilizations.
Choose your path to greatness with this definitive remaster to one of the most beloved strategy games of all time.
--------------------------------------------------------------------------------
Age of Empires III: Definitive Edition
Age of Empires III: Definitive Edition completes the celebration of one of the most beloved real-time strategy franchises with remastered graphics and music, all previously released expansions and brand-new content to enjoy for the very first time.
--------------------------------------------------------------------------------
Age of Wonders: Planetfall
Emerge from the cosmic dark age of a fallen galactic empire to build a new future for your people. Age of Wonders: Planetfall is the new strategy game from Triumph Studios, creators of the critically acclaimed Age of Wonders series, bringing all the exciting tactical turn-based combat and in-depth empire building of its predecessors to space in an all-new, sci-fi setting.
--------------------------------------------------------------------------------
...and so on (Total 204 games)
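If all you need is the list of game names the question asks for, a minimal follow-up to the script above collects just the titles from the same response:
# Collect only the product titles from the data already downloaded
titles = [p['LocalizedProperties'][0]['ProductTitle'] for p in data['Products']]
print(len(titles))  # 204 at the time of writing
print(titles[:5])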

Is there a smoother way to append a text file's contents to the URL address bar?

Using Python 3.6.4 on MacBook Pro.
Excuse my ignorance, I'm a novice to Python. And this was my best shot...
import urllib
import urllib.request
import urllib.parse
text = "Global ambitions with regional insight is a common mantra among local firms but not many have the financial backing and pedigree of this leading financial institution. While many brokerages have shut their doors or downsized, this trading firm has continued to grow and there are no plans to slow down now. Already with a portfolio of substantial institutional and private clients, it is important to offer a diverse range of products to meet demand.
Skills."
site_url = "http://www.wdylike.appspot.com/?"
file_contents_in_url = urllib.parse.urlencode({'q' : text})
check = site_url + file_contents_in_url
#print(check)
url = urllib.request.urlopen(check)
url_bytes = url.read()
#print(url_bytes)
site_url_contents = url_bytes.decode("utf8")
#print(site_url_contents)
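One smoother approach is to read the text from a file and let urllib.parse.urlencode handle the quoting. A sketch, assuming the text lives in a local file (message.txt is a hypothetical name):
import urllib.parse
import urllib.request
# Hypothetical input file holding the text to send
with open("message.txt", encoding="utf-8") as f:
    text = f.read()
site_url = "http://www.wdylike.appspot.com/?"
check = site_url + urllib.parse.urlencode({"q": text})
with urllib.request.urlopen(check) as url:
    site_url_contents = url.read().decode("utf8")
print(site_url_contents)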
