I have this website and I would like to pull all the company names via Python, such as West Wood Events or Mitchell Event Planning.
But I am stuck on soup.find since it returns [].
When I inspect the page, let's say I see this:
<div class="LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd">Mitchell Event Planning<wbr></div>
For that I would write:
week = soup.find(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
print(week)
And I get nothing.
Am I missing something? I'm pretty new to this.
This string is not a single class but many classes separated by spaces.
In some modules you would have to use the original string with all its spaces, but it seems that in BeautifulSoup you have to use the classes separated by a single space.
The code works for me if I use a single space between LinesEllipsis and vendor-name--55315.
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
Or if I use a CSS selector with a dot before every class in the string:
week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
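As a side note (my own suggestion, not tested against the live page): because class is a multi-valued attribute in BeautifulSoup, matching on a single class also works, and since the hashed suffixes like --55315 are build artifacts that may change, matching on a stable substring is more robust:
# match on just one of the classes - BeautifulSoup checks each class separately
week = soup.find_all(class_='LinesEllipsis')
# or match any tag whose class attribute contains a stable prefix
week = soup.select('[class*="vendor-name"]')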
Minimal working code
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.theknot.com/marketplace/wedding-planners-acworth-ga?page=2'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
#week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
for item in week:
    print(item.text)
Result:
The Charming Details
Enraptured Events
pearl and sky events - planning, design and florals
Unique Occasions ByTNicole, Inc
Platinum Eventions
RED COMPANY ATLANTA
Pop + Fizz: Event Planning and Design
Patricia Elizabeth, certified wedding planners/producer
Rienza Events
Pollyanna Richter Weddings
Calabash Events, Inc.
Weddings by Carmona LLC
Emily Jordan Events
Perfectly Taylored Events
Lindsey Wise Designs
Elegant Weddings and Affairs
Party PLANit
Wedded Bliss
Above the Fray
Willow Jaymes Events
Coco Red Events
Liz D. Events, LLC
Leslie Cox Events
YSE LLC
Marmaros Productions
PerfectionsID, LLC
All Things Love
West Wood Events
Everlasting Elegance
Prestigious Occasions
Hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but hoping I can handle that)
It seems like I should be able to take the div story tag, right? I am struggling with how to do that - I've tried various code I've found online and tried to tweak it, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code thus far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print(story)
I'm really trying to learn this, not just copy and paste. In my mind, right now this says (in English):
Import packages needed
Get URL HTML Text Data ---- (when I printed the complete code it worked fine)
Narrow down the HTML code to only include div tags labeled as "story" -- this obviously is the hiccup
Struggling to understand; going to keep playing with this, but figured I'd turn here for some advice - any thoughts are greatly appreciated. Just getting a blank result right now.
The page is rendered by JavaScript, which requests cannot execute, so the info (which is pulled down by the original request) remains in its initial state, inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! 
BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
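Since you mentioned eventually wanting to handle multiple links, the same approach drops straight into a loop, and because your original snippet already imports pandas you could collect each story into a DataFrame. A minimal sketch (the URL list holds just the one game from the question - add more as needed):
import json
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993']  # add more game URLs here

rows = []
for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    json_obj = json.loads(soup.select_one('script#__NEXT_DATA__').text)
    story = json_obj['props']['pageProps']['story']
    # each story's content is a list of paragraphs, so join it into one string
    rows.append({'title': story['header']['headline'],
                 'date': story['date'],
                 'content': ' '.join(story['content'])})

df = pd.DataFrame(rows)
print(df)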
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
I am new to web scraping. I am trying to scrape the "When purchased online" text on Target, but I did not find it in the HTML.
Does anyone know how to locate the element in the HTML? Any help is appreciated. Thanks!
URLs:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, so you can simply convert it to a dictionary/list and use keys/indexes to get the value.
But you have to manually find the correct keys in the JSON data.
Or you may write a script to search the JSON (using for-loops and recursion), for example:
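Here is a small recursive helper that prints the path to every occurrence of a key - an illustrative sketch of the for-loops-and-recursion idea, with find_key being my own hypothetical name:
def find_key(data, key, path=''):
    # walk dicts and lists recursively, printing the path to every match
    if isinstance(data, dict):
        for k, v in data.items():
            if k == key:
                print(f'{path}.{k} = {v!r}')
            find_key(v, key, f'{path}.{k}')
    elif isinstance(data, list):
        for i, v in enumerate(data):
            find_key(v, key, f'{path}[{i}]')

# usage with the `data` dictionary from the code below:
# find_key(data, 'current_retail')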
Minimal working code. I found keys manually.
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA' # JSON
response = requests.get(url)
data = response.json()
product = data['data']['product']
print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')
for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)
print('------------')
print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)
print('------------')
for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
EDIT:
Information "When purchased online" (or "at Cedar Rapids South") are in different url.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
But probably in some situations it uses other information in the product data to decide between "When purchased online" and "at Cedar Rapids South" - and this can be hardcoded in the JavaScript. For example, the product which displays "When purchased online" has formatted_price "$13.99", but the product which displays "at Cedar Rapids South" has formatted_price "See price in cart".
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC' # JSON
response = requests.get(url)
data = response.json()
for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---
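If you want to reproduce that label in Python, you could branch on formatted_current_price - but note this is purely my guess at what the page's JavaScript does, not confirmed behavior:
for product in data['data']['search']['products']:
    formatted = product['price']['formatted_current_price']
    # assumption: a "$..." price means the site shows "When purchased online"
    label = 'When purchased online' if formatted.startswith('$') else 'at Cedar Rapids South'
    print(product['item']['product_description']['title'], '->', label)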
I am writing Python code using the BeautifulSoup library to pull the titles and authors of all the opinion pieces from a news website. While the for loop is working as intended for the titles, the find call within it, meant to pull the author's name for each title, repeatedly returns the author of the first piece.
Any ideas where I am going wrong?
The Code
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml')
opinion = soup.find('div', class_='css-717c4s')
for story in opinion.find_all('article'):
    title = story.h2.text
    print(title)
    author = opinion.find('div', class_='css-1xdt15l')
    print(author.text)
The Output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Judy Batalion
Old Pol, New Tricks
Judy Batalion
Do We Have to Pay Businesses to Obey the Law?
Judy Batalion
I Don’t Want My Role Models Erased
Judy Batalion
Progressive Christians Arise! Hallelujah!
Judy Batalion
What the 2020s Need: Sex and Romance at the Movies
Judy Batalion
Biden Has Disappeared
Judy Batalion
What Republicans Could Learn From My Grandmother
Judy Batalion
Your Home’s Value Is Based on Racism
Judy Batalion
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Judy Batalion
You should do:
author = story.find('div', class_='css-1xdt15l')
What's wrong is that you are doing
author = opinion.find('div', class_='css-1xdt15l')
which fetches the first author out of all the authors in the opinion section, and since you run this statement in every iteration of the loop, no matter how many times you run it, you will only get the first author.
Replacing it with
author = story.find('div', class_='css-1xdt15l')
fetches the first author within each story, and since each story has a single author, it works fine.
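Putting it together, the corrected loop reads:
for story in opinion.find_all('article'):
    print(story.h2.text)
    author = story.find('div', class_='css-1xdt15l')
    print(author.text)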
It's because you only ever target the first matching tag, hence the single author.
Here's a fix for your code using zip():
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml').find('div', class_='css-717c4s')
authors = soup.find_all('div', class_='css-1xdt15l')
stories = soup.find_all('article')
for author, story in zip(authors, stories):
    print(author.text)
    print(story.h2.text)
Output:
Judy Batalion
The Nazi-Fighting Women of the Jewish Resistance
Gracy Olmstead
My Great-Grandfather Knew How to Fix America’s Food System
Maureen Dowd
Old Pol, New Tricks
Nikolas Bowie
Do We Have to Pay Businesses to Obey the Law?
Elizabeth Becker
I Don’t Want My Role Models Erased
Nicholas Kristof
Progressive Christians Arise! Hallelujah!
Ross Douthat
What the 2020s Need: Sex and Romance at the Movies
Frank Bruni
Biden Has Disappeared
Cecilia Gentili
What Republicans Could Learn From My Grandmother
Dorothy A. Brown
Your Home’s Value Is Based on Racism
Linsey Marr, Juliet Morrison and Caitlin Rivers
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
A very small mistake: you're supposed to do the search on each "article"/story object in the loop, not on your initial "opinion" object, i.e.
author = story.find('div', class_='css-1xdt15l')
This produces the desired output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Gracy Olmstead
Old Pol, New Tricks
Maureen Dowd
Do We Have to Pay Businesses to Obey the Law?
Nikolas Bowie
I Don’t Want My Role Models Erased
Elizabeth Becker
Progressive Christians Arise! Hallelujah!
Nicholas Kristof
What the 2020s Need: Sex and Romance at the Movies
Ross Douthat
Biden Has Disappeared
Frank Bruni
What Republicans Could Learn From My Grandmother
Cecilia Gentili
Your Home’s Value Is Based on Racism
Dorothy A. Brown
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Linsey Marr, Juliet Morrison and Caitlin Rivers
This page contains a list of all of the games available on Xbox Game Pass. I'd like to use BeautifulSoup to retrieve a list of the game names.
This is what I'm doing:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.xbox.com/en-US/xbox-game-pass/games?=pcgames')
if page.status_code != 200:
    print("Unable to load game pass games page")
    exit()
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h3', class_='gameDevLink') # returns []
s = soup.prettify()
with open('dump.html', 'w', encoding='utf-8') as f:
    f.write(s)
If I inspect a game on the page I see something like this:
[screenshot: a game's HTML in the browser inspector]
Each game is enclosed in an 'a' tag with its class set to gameDevLink.
My problem is that soup.find_all('a', class_='gameDevLink') returns no hits.
If I save the html generated by BeautifulSoup to disk and search for gameDevLink there are, again, no hits.
I don't understand why I can see the information in my browser but BeautifulSoup doesn't seem to see it.
The info about the games is loaded from another URL via JavaScript. You can use this script to simulate it:
import json
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36'}
game_ids_url = 'https://catalog.gamepass.com/sigls/v2?id=fdd9e2a7-0fee-49f6-ad69-4354098401ff&language=en-us&market=US'
game_info_url = 'https://displaycatalog.mp.microsoft.com/v7.0/products?bigIds={}&market=US&languages=en-us&MS-CV=XXX'
game_ids = requests.get(game_ids_url).json()
s = ','.join(i['id'] for i in game_ids if 'id' in i)
data = requests.get(game_info_url.format(s)).json()
# uncomment this to print all data:
#print(json.dumps(data, indent=4))
# print some data to screen:
for p in data['Products']:
    print(p['LocalizedProperties'][0]['ProductTitle'])
    print(p['LocalizedProperties'][0]['ShortDescription'])
    print('-' * 80)
Prints:
A Plague Tale: Innocence - Windows 10
Follow the grim tale of young Amicia and her little brother Hugo, in a heartrending journey through the darkest hours of history. Hunted by Inquisition soldiers and surrounded by unstoppable swarms of rats, Amicia and Hugo will come to know and trust each other. As they struggle to survive against overwhelming odds, they will fight to find purpose in this brutal, unforgiving world.
--------------------------------------------------------------------------------
Age of Empires Definitive Edition
Age of Empires, the pivotal real-time strategy game that launched a 20-year legacy returns with modernized gameplay, all-new 4K visuals, 8-person multiplayer battles and a host of other new features. Welcome back to history.
--------------------------------------------------------------------------------
Age of Empires II: Definitive Edition
Age of Empires II: Definitive Edition celebrates the 20th anniversary of one of the most popular strategy games ever with stunning 4K Ultra HD graphics, a new and fully remastered soundtrack, and brand-new content, “The Last Khans” with 3 new campaigns and 4 new civilizations.
Choose your path to greatness with this definitive remaster to one of the most beloved strategy games of all time.
--------------------------------------------------------------------------------
Age of Empires III: Definitive Edition
Age of Empires III: Definitive Edition completes the celebration of one of the most beloved real-time strategy franchises with remastered graphics and music, all previously released expansions and brand-new content to enjoy for the very first time.
--------------------------------------------------------------------------------
Age of Wonders: Planetfall
Emerge from the cosmic dark age of a fallen galactic empire to build a new future for your people. Age of Wonders: Planetfall is the new strategy game from Triumph Studios, creators of the critically acclaimed Age of Wonders series, bringing all the exciting tactical turn-based combat and in-depth empire building of its predecessors to space in an all-new, sci-fi setting.
--------------------------------------------------------------------------------
...and so on (204 games in total)
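If you only need the list of game names, as the question asks, you can pull just the titles out of the same data dictionary:
# collect only the game names from the catalog response
game_names = [p['LocalizedProperties'][0]['ProductTitle'] for p in data['Products']]
print(len(game_names), 'games')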
I've tested my regex with Pythex and it works as it's supposed to:
The HTML:
Something Very Important (SVI) 2013 Sercret Information, Big Company
Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was
developed as part of the SUP 2012 Something Task force was held in
conjunction with *SEM 2013, the second joint conference on study of
banana hand grenades and gorilla tactics (Association of Ape Warfare
Studies) interest groups BUDDY HOLLY and LION KING. It is comprised of
one hairy object containing 750 gross stories told in the voice of
Morgan Freeman and his trusty sidekick Michelle Bachman.
My regex:
,[\s\w()-]+,
When used with Pythex it selects the area I'm looking for, which is between the 2 commas in the paragraph:
Something Very Important (SVI) 2013 Sercret Information , Big
Company Name (LBCN) Catalog Number BCN2013R18 and BSSN
3-55564-789-Y, was developed as part of the SUP 2012 Something Task
force was held in conjunction with <a href="http://justaURL.com">*SEM
2013</a>, the second joint
conference on study of banana hand grenades and gorilla tactics
(Association of Ape Warfare Studies) interest groups BUDDY HOLLY and
LION KING. It is comprised of one hairy object containing 750 gross
stories told in the voice of Morgan Freeman and his trusty sidekick
Michelle Bachman.
However when I use BeautifulSoup's text regex:
print HTML.body.p.find_all(text=re.compile('\,[\s\w()-]+\,'))
I'm returned this instead of the area between the commas:
[u'Something Very Important (SVI) 2013 Sercret Information, Big Company Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was developed as part of the SUP 2012 Something Task force was held in conjunction with ']
I've also tried escaping the commas, but no luck. BeautifulSoup just wants to return the whole <p> text instead of the part my regex specified. Also, I noticed that it returns the paragraph only up until the link in the middle. Is this a problem with how I'm using BeautifulSoup, or is this a regex problem?
BeautifulSoup uses the regular expression to search for matching text nodes, and the <a> tag splits your paragraph into several separate text nodes - that's why the returned string stops at the link. That whole first text node matches your search.
You still then have to extract the part you want; BeautifulSoup does not do this for you. You could just reuse your regex here:
expression = re.compile(r',[\s\w()-]+,')
for textnode in HTML.body.p.find_all(text=expression):
    print expression.search(textnode).group(0)
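And if you only want the text between the commas, without the commas themselves, a capturing group does it - a small sketch along the same lines:
expression = re.compile(r',([\s\w()-]+),')
for textnode in HTML.body.p.find_all(text=expression):
    # group(1) is just the captured part between the commas
    print expression.search(textnode).group(1).strip()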