Scraping issue with BeautifulSoup only when using a for loop - Python

I am writing Python code using the BeautifulSoup library to pull the titles and authors of all the opinion pieces from a news website. While the for loop works as intended for the titles, the find call inside it that is meant to pull each title's author keeps returning the author of the first piece.
Any ideas where I am going wrong?
The Code
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml')

opinion = soup.find('div', class_='css-717c4s')

for story in opinion.find_all('article'):
    title = story.h2.text
    print(title)

    author = opinion.find('div', class_='css-1xdt15l')
    print(author.text)
The Output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Judy Batalion
Old Pol, New Tricks
Judy Batalion
Do We Have to Pay Businesses to Obey the Law?
Judy Batalion
I Don’t Want My Role Models Erased
Judy Batalion
Progressive Christians Arise! Hallelujah!
Judy Batalion
What the 2020s Need: Sex and Romance at the Movies
Judy Batalion
Biden Has Disappeared
Judy Batalion
What Republicans Could Learn From My Grandmother
Judy Batalion
Your Home’s Value Is Based on Racism
Judy Batalion
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Judy Batalion

You should do
author = story.find('div', class_='css-1xdt15l')
What's wrong is that you are doing
author = opinion.find('div', class_='css-1xdt15l')
which fetches the first author in the entire opinion section. Since you run this statement against the same opinion object in every iteration of the loop, you get that first author every time.
Replacing it with
author = story.find('div', class_='css-1xdt15l')
fetches the first author within each story, and since each story has a single author, it works fine.
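Putting it together, a minimal corrected version of the loop (assuming the css-717c4s and css-1xdt15l class names above still match the live page; the None check is an extra safeguard, not part of the original code):
for story in opinion.find_all('article'):
    print(story.h2.text)
    # search within this story only, not the whole opinion section
    author = story.find('div', class_='css-1xdt15l')
    if author is not None:  # skip stories without a byline div
        print(author.text)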

It's because you target the whole opinion section with find(), which returns only the first matching tag, hence the single author.
Here's a fix for your code using zip() (note that zip() pairs authors and stories by position, so it relies on every story having exactly one author div):
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.nytimes.com/international/').text
soup = BeautifulSoup(source, 'lxml').find('div', class_='css-717c4s')

authors = soup.find_all('div', class_='css-1xdt15l')
stories = soup.find_all('article')

for author, story in zip(authors, stories):
    print(author.text)
    print(story.h2.text)
Output:
Judy Batalion
The Nazi-Fighting Women of the Jewish Resistance
Gracy Olmstead
My Great-Grandfather Knew How to Fix America’s Food System
Maureen Dowd
Old Pol, New Tricks
Nikolas Bowie
Do We Have to Pay Businesses to Obey the Law?
Elizabeth Becker
I Don’t Want My Role Models Erased
Nicholas Kristof
Progressive Christians Arise! Hallelujah!
Ross Douthat
What the 2020s Need: Sex and Romance at the Movies
Frank Bruni
Biden Has Disappeared
Cecilia Gentili
What Republicans Could Learn From My Grandmother
Dorothy A. Brown
Your Home’s Value Is Based on Racism
Linsey Marr, Juliet Morrison and Caitlin Rivers
Once I Am Fully Vaccinated, What Is Safe for Me to Do?

A very small mistake.
You're supposed to search each "article"/story object inside the loop, not your initial "opinion" object, i.e.
author = story.find('div', class_='css-1xdt15l')
This produces the desired output:
The Nazi-Fighting Women of the Jewish Resistance
Judy Batalion
My Great-Grandfather Knew How to Fix America’s Food System
Gracy Olmstead
Old Pol, New Tricks
Maureen Dowd
Do We Have to Pay Businesses to Obey the Law?
Nikolas Bowie
I Don’t Want My Role Models Erased
Elizabeth Becker
Progressive Christians Arise! Hallelujah!
Nicholas Kristof
What the 2020s Need: Sex and Romance at the Movies
Ross Douthat
Biden Has Disappeared
Frank Bruni
What Republicans Could Learn From My Grandmother
Cecilia Gentili
Your Home’s Value Is Based on Racism
Dorothy A. Brown
Once I Am Fully Vaccinated, What Is Safe for Me to Do?
Linsey Marr, Juliet Morrison and Caitlin Rivers

Related

How to scrape specified div tag from HTML code using Pandas?

Hate asking this because I've seen a lot of similar issues, but nothing that really guides me to the light, unfortunately.
I'm trying to scrape the game story from this link (ultimately I want to build in the ability to do multiple links, but I'm hoping I can handle that part).
It seems like I should be able to grab the div story tag, right? I am struggling with how to do that - I've tried various snippets I found online and tweaked them, but nothing really applies.
I found this amazing source which really taught me a lot about how to do this.
However, I'm still struggling - this is my code thus far:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# print(soup.prettify())
story = soup.find_all('div', {'id': 'story'})
print(story)
I'm really trying to learn this, not just copy and paste. In my mind right now this says (in English):
Import the packages needed.
Get the URL's HTML text data (when I printed the complete output it worked fine).
Narrow down the HTML to include only the div tag with id "story" - this is obviously the hiccup.
I'm struggling to understand, and I'm going to keep playing with this, but I figured I'd turn here for some advice - any thoughts are greatly appreciated. I'm just getting a blank result right now.
The page is rendered by JavaScript, which requests cannot execute, so the info (which is pulled down by the original request) remains in its raw state, inside a script tag.
This is one way to get that story with requests:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])
Result printed in terminal:
Title: Durant, Nets rout 76ers in Simmons' return to Philadelphia
Date: 2022-03-11T00:30:27
Content: ['PHILADELPHIA (AP) The 76ers fans came to boo Ben Simmons. They left booing their own team.', "Kevin Durant scored 18 of his 25 points in Brooklyn's dominating first half in the Nets' 129-100 blowout victory over the 76ers on Thursday night in Simmons' much-hyped return to Philadelphia.", 'Seth Curry added 24 points, and Kyrie Irving had 22 for the Nets. They entered in eighth place in the East, but looked like a legitimate conference contender while badly outplaying the third-place 76ers.', 'Joel Embiid had 27 points and 12 rebounds for the 76ers, and James Harden finished with just 11 points. It was the first loss for Philadelphia in six games with Harden in the lineup.', "The game was dubbed as ''Boo Ben'' night, but the raucous fans instead turned their displeasure on the home team when the 76ers went to the locker room trailing 72-51 and again when Brooklyn built a stunning 32-point lead in the third quarter.", "''I think all of us look at Ben as our brother,'' Durant said. ''We knew this was a hostile environment. It's hard to chant at Ben Simmons when you're losing by that much.''", 'Simmons, wearing a designer hockey jersey and flashy jewelry, watched from the bench, likely taking delight in the vitriol deflected away from him. The three-time All-Star is continuing to recover from a back injury that has sidelined him since being swapped for Harden in a blockbuster deal at the trade deadline.', "''We definitely felt like Ben was on our heart,'' Irving said. ''If you come at Ben, you come at us.''", "While Simmons hasn't taken the floor yet, Harden had been a boon for the 76ers unlike his time in Brooklyn, where the so-called Big 3 of Harden, Durant and Irving managed to play just 16 games together following Harden's trade to Brooklyn last January that was billed as a potentially championship move. Harden exchanged fist bumps with Nets staff members just before tip before a shockingly poor performance from the 10-time All-Star and former MVP.", 'Harden missed 14 of 17 field-goal attempts.', "''We just didn't have the pop that we needed,'' Harden said.", 'The only shot Simmons took was a dunk during pregame warmups that drew derisive cheers from the Philly fans.', "The boos started early, as Simmons was met with catcalls while boarding the team bus to shootaround from the Nets' downtown hotel. Simmons did oblige one fan for an autograph, with another being heard on a video widely circulated on social media yelling, ''Why the grievance? Why spit in the face of Sixers fans? We did nothing but support you for five years, Ben. You know that.''", "The heckling continued when Simmons was at the arena. He entered the court 55 minutes prior to tip, wearing a sleeveless Nets warmup shirt and sweats and spent 20 minutes passing for Patty Mills' warmup shots. He didn't embrace any of his former teammates, though he did walk the length of the court to hug a 76ers official and then exchanged fist pumps with coach Doc Rivers at halftime.", "''Looked good to me, looked happy to be here,'' Nets coach Steve Nash said. ''I think he was happy to get it out of the way.''", "A large security presence closely watched the crowd and cell phones captured every Simmons move. By the end of the game, though, many 76ers fans had left and the remaining Nets fans were chanting: ''BEN SIM-MONS! 
BEN SIM-MONS!'' in a remarkable turnaround from the start of the evening.", 'WELCOME BACK', 'Former 76ers Curry and Andre Drummond, who also were part of the Simmons for Harden trade, were cheered during introductions, Curry made 10 of 14 shots, including 4 of 8 from 3-point range. Drummond had seven points and seven rebounds.', 'MOVE OVER, REGGIE', "Harden passed Reggie Miller for third on the NBA's 3-point list when he made his 2,561st trey with 6:47 left in the first quarter.", "TRAINER'S ROOM", 'Nets: LaMarcus Aldridge (hip) missed his second straight contest.', "76ers: Danny Green sat out after injuring his left middle finger in the first half of the 76ers' 121-106 win over the Bulls on Monday.", 'TIP-INS', 'Nets: Improved to 21-15 on the road, where Irving is only allowed to play due to his vaccination status. ... Durant also had 14 rebounds and seven assists.', "76ers: Paul Millsap returned after missing Monday's game against Chicago due to personal reasons but didn't play. . Former Sixers and Hall of Famers Allen Iverson and Julus ''Dr. J'' Erving were in attendance. Erving rang the ceremonial Liberty Bell before the contest.", 'UP NEXT', 'Nets: Host New York on Sunday.', '76ers: At Orlando on Sunday.', '---', 'More AP NBA: https://apnews.com/hub/NBA and https://twitter.com/AP-Sports']
Requests documentation: https://requests.readthedocs.io/en/latest/
Also, for BeautifulSoup: https://beautiful-soup-4.readthedocs.io/en/latest/index.html
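If NBA.com reshuffles the __NEXT_DATA__ payload, the hard-coded key path above will raise a KeyError. A slightly more defensive lookup (a sketch under the same layout assumption) chains dict.get() with empty-dict defaults:
# each .get() falls back to {} / 'n/a' so a missing key doesn't raise a KeyError
story = json_obj.get('props', {}).get('pageProps', {}).get('story', {})
print('Title:', story.get('header', {}).get('headline', 'n/a'))
print('Date:', story.get('date', 'n/a'))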

How to remove all the 'ORG' entities collected from Spacy

I'm working on an NLP project using spaCy. I have identified different entities using spaCy's NER, and now I want to remove the ORG entities (those identified as organisations) from the original input string.
doc = "I'm here with the three of Nikkei Asia's stalwart editors, three Brits in Tokyo. First off, we have Michael Peel, who is executive editor, a journalist from our affiliate, The Financial Times . He is now in Tokyo but has previously reported from the likes of Brussels, Bangkok, Abu Dhabi and Lagos. Welcome, Michael.MICHAEL PEEL, EXECUTIVE EDITOR: Welcome Waj. Thank you very much.KHAN: All right. And we have Stephen Foley, our business editor who, like Michael, is on secondment from the FT, where he was deputy U.S. News Editor. Prior to the FT, he was a reporter at The Independent and like Michael, he's a fresh-off-the-boat arrival in Tokyo and has left some pretty big shoes to fill in the New York bureau, where we miss him. Welcome, Stephen.STEPHEN FOLEY, BUSINESS EDITOR: Thanks for having me, Waj.KHAN: Alright, and last but certainly not least, my brother in arms when it comes to cricket commentary across the high seas is Andy Sharp, or deputy editor who joined Nikkei Asia nearly four years ago, after a long stint at Bloomberg in Tokyo and other esteemed Japanese publications. Welcome, Andy.ANDREW SHARP"
text = NER(doc)
org_stopwords = [ent.text for ent in text.ents if ent.label_ == 'ORG']
Output of org_stopwords:
['The Financial Times ', 'Abu Dhabi and Lagos', 'Bloomberg ']
This is my code so far: I've identified and collected into a list everything tagged as ORG by spaCy, but now I don't know how to remove those from the string. One problem with simply splitting the string and removing the org_stopwords is that the org_stopwords are n-grams. Please help with a coded example of how to tackle this issue.
Use regex instead of replace. Note that each phrase should be escaped with re.escape(), since entries like 'U.S. News Editor' contain regex metacharacters:
import re

org_stopwords = ['The Financial Times',
                 'Abu Dhabi ',
                 'U.S. News Editor',
                 'Independent',
                 'ANDREW']

# escape each phrase so characters like '.' are treated literally
regex = re.compile('|'.join(re.escape(word) for word in org_stopwords))
new_doc = regex.sub('', doc)
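Alternatively, since each spaCy entity carries character offsets, you can cut the ORG spans out of the original string directly and avoid regex escaping issues altogether (a sketch, assuming NER is the loaded spaCy pipeline from the question):
text = NER(doc)
# collect the (start, end) character offsets of every ORG entity
spans = [(ent.start_char, ent.end_char) for ent in text.ents if ent.label_ == 'ORG']
# remove the spans back to front so earlier offsets stay valid
new_doc = doc
for start, end in reversed(spans):
    new_doc = new_doc[:start] + new_doc[end:]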

How to locate the element in the API?

I am new to web scraping. I am trying to scrape the "When purchased online" label on Target, but I could not find it in the HTML.
Does anyone know how to locate the element in the HTML? Any help is appreciated. Thanks!
Product Url:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, so you can simply convert it to a dictionary/list and use keys/indexes to get the value.
But you have to find the correct keys in the JSON data manually, or write a small script to search the JSON (using for loops and recursion), as sketched below.
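For example, a small recursive helper (a hypothetical sketch, not part of the original answer) can print the key path to every string value containing a search term, which makes finding the right keys much faster:
def find_paths(obj, needle, path='data'):
    # walk nested dicts/lists and print the path to any matching string value
    if isinstance(obj, dict):
        for key, value in obj.items():
            find_paths(value, needle, f"{path}[{key!r}]")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            find_paths(value, needle, f"{path}[{i}]")
    elif isinstance(obj, str) and needle.lower() in obj.lower():
        print(path, '->', obj[:60])

find_paths(data, 'Genexa')  # 'data' is response.json() from the code below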
Minimal working code (I found the keys manually):
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA' # JSON
response = requests.get(url)
data = response.json()
product = data['data']['product']
print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')
for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)

print('------------')

print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)

print('------------')

for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
EDIT:
Information "When purchased online" (or "at Cedar Rapids South") are in different url.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
But in some situations the page probably uses other information in the product data to show "When purchased online" instead of "at Cedar Rapids South", and that logic may be hardcoded in the JavaScript. For example, the product which displays "When purchased online" has formatted_price "$13.99", but the product which displays "at Cedar Rapids South" has formatted_price "See price in cart".
import requests
url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC' # JSON
response = requests.get(url)
data = response.json()
for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---

Problems with code to pull data from website

I have this website and I would like to pull via Python all the company names, such as West Wood Events or Mitchell Event Planning.
But I am stuck on soup.find, since it returns [].
When I inspect the page, let's say on this:
<div class="LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd">Mitchell Event Planning<wbr></div>
for that I would write:
week = soup.find(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
print(week)
And I get nothing.
Am I missing something? I'm pretty new to this.
This string is not a single class but many classes separated by spaces.
In some modules you would have to use the original string with all its spaces, but in BeautifulSoup the classes have to be separated by single spaces.
The code works for me if I use a single space between LinesEllipsis and vendor-name--55315:
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
Or if I use a CSS selector with a dot before every class in the string:
week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
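Note that the hashed suffixes in these class names (--55315 and so on) tend to change whenever the site is rebuilt, so matching one distinctive class with a regex can be more robust (a sketch under that assumption; BeautifulSoup applies the pattern to each class of a tag individually):
import re

# match any tag that has a class starting with 'vendor-name'
week = soup.find_all(class_=re.compile(r'^vendor-name'))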
Minimal working code:
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.theknot.com/marketplace/wedding-planners-acworth-ga?page=2'
r = requests.get(url)
soup = BS(r.text, 'html.parser')
#week = soup.select('.LinesEllipsis.vendor-name--55315.primaryBold--a3d1e.body1--24afd')
week = soup.find_all(class_='LinesEllipsis vendor-name--55315 primaryBold--a3d1e body1--24afd')
for item in week:
    print(item.text)
Result:
The Charming Details
Enraptured Events
pearl and sky events - planning, design and florals
Unique Occasions ByTNicole, Inc
Platinum Eventions
RED COMPANY ATLANTA
Pop + Fizz: Event Planning and Design
Patricia Elizabeth, certified wedding planners/producer
Rienza Events
Pollyanna Richter Weddings
Calabash Events, Inc.
Weddings by Carmona LLC
Emily Jordan Events
Perfectly Taylored Events
Lindsey Wise Designs
Elegant Weddings and Affairs
Party PLANit
Wedded Bliss
Above the Fray
Willow Jaymes Events
Coco Red Events
Liz D. Events, LLC
Leslie Cox Events
YSE LLC
Marmaros Productions
PerfectionsID, LLC
All Things Love
West Wood Events
Everlasting Elegance
Prestigious Occasions

Given an html paragraph and a link, is there a way to retrieve the text before and the text after the link inside the paragraph in Python?

I am using urllib3 to get the html of some pages.
I want to retrieve the text from the paragraph where the link is, with the text before and after the link stored separately.
For example:
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
r = http.request('get', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            print(link_text)
            print(link_para)
Paragraph
<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/conten
t/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>
Link
<a href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/conten
t/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a>
Text to be retrieved (2 parts)
The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy ... was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An October 2000 article in
didn’t mention anything about little Michael’s medical
condition but said that his family was ... turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
I can't simply get_text() and then use split(), as the link text might be repeated.
I thought I might add a counter to see how many times the link text is repeated, use split(), then use a loop to get the parts I want.
I would appreciate a better, less messy method, though.
You can iterate over the parent tag's contents and check whether the current item is our a tag. If it is, we have found one part and can start building the next:
data = '''<p>The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An <nobr>October 2000</nobr> article in <a
href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"
onmouseout="window.status='';return true" onmouseover="window.status='The
Local Albany Weekly';return true" target="_blank"><i>The Local Albany
Weekly</i></a> didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the <a
href="http://www.ecunet.org/whatisecupage.html"
onmouseout="window.status='';return true"
onmouseover="window.status='Ecunet';return true" target="_blank">Ecunet</a>
mailing list indicated that Michael had just turned <nobr>3 years</nobr> old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:</p>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
link_url = 'http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general'
a = soup.find('a', href=link_url)

s, parts = '', []
for t in a.parent.contents:
    if t == a:
        parts += [s]
        s = ''
        continue
    s += str(t)
parts += [s]

for part in parts:
    print(BeautifulSoup(part, 'lxml').body.text.strip())
    print('*' * 80)
Prints:
The message quoted above about Michael Novenche, a two-year-old boy
undergoing chemotherapy to treat a brain tumor, was real, but keeping up with
all the changes in his condition proved a challenge. The message quoted above
stated that Michael had a large tumor in his brain, was operated upon to
remove part of the tumor, and needed prayers to help him through chemotherapy
to a full recovery. An October 2000 article in
********************************************************************************
didn’t mention anything about little Michael’s medical
condition but said that his family was “in need of funds to help pay for the
transportation to the hospital and other costs not covered by their
insurance.” A June 2000 message posted to the Ecunet
mailing list indicated that Michael had just turned 3 years old,
mentioned that his tumor appeared to be shrinking, and provided a mailing
address for him:
********************************************************************************
You can easily do this with bs4 4.7.1. Use :has and an attribute = value selector to get the parent p tag, then split its html on the a tag's html, and re-parse the pieces with BeautifulSoup to extract the p text. This gets around the potentially-repeated-phrase problem; it only breaks if the entire html of the a tag can appear repeated within the block, which seems highly unlikely.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.snopes.com/fact-check/michael-novenche/')
soup = bs(r.content, 'lxml')
link_sel = '[href="http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"]'
para_html = soup.select_one(f'p:has(>{link_sel})').encode_contents()
link_html = soup.select_one(link_sel).encode_contents()
data = para_html.split(link_html)
items = [bs(i, 'lxml').select_one('p').text for i in data]
print(items)
I found a solution based on @Andrej Kesely's solution.
It deals with two problems:
That there is no text before/after the link
That the link isn't a direct child of the paragraph
Here it is (as a function):
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def get_info(page, link):
    r = http.request('get', page)
    body = r.data
    soup = BeautifulSoup(body, 'lxml')
    a = soup.find('a', href=link)
    s, parts = '', []
    if a.parent.name == "p":
        for t in a.parent.contents:
            if t == a:
                parts += [s]
                s = ''
                continue
            s += str(t)
        parts += [s]
    else:
        prnt = a.find_parents("p")[0]
        for t in prnt.contents:
            if t == a or (str(a) in str(t)):
                parts += [s]
                s = ''
                continue
            s += str(t)
        parts += [s]
    try:
        text_before_link = BeautifulSoup(parts[0], 'lxml').body.text.strip()
    except AttributeError:
        text_before_link = ""
    try:
        text_after_link = BeautifulSoup(parts[1], 'lxml').body.text.strip()
    except AttributeError:
        text_after_link = ""
    return text_before_link, text_after_link
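A quick usage example with the page and link from the question (the exact output depends on the live page):
before, after = get_info(
    "https://www.snopes.com/fact-check/michael-novenche/",
    "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general",
)
print(before)
print('-' * 40)
print(after)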
This assumes that there is no paragraph inside another paragraph.
If anyone has any ideas about scenarios where this fails, please feel free to mention it.
Can you clarify what you mean by:
I cant simply get_text() then use split as the link text might be
repeated
When I run:
import urllib3
from bs4 import BeautifulSoup
import certifi

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')

for a in soup.findAll('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            print(link_para.get_text())
I get:
The message quoted above about Michael Novenche, a two-year-old boy undergoing chemotherapy to treat a brain tumor, was real, but keeping up with all the changes in his condition proved a challenge. The message quoted above stated that Michael had a large tumor in his brain, was operated upon to remove part of the tumor, and needed prayers to help him through chemotherapy to a full recovery. An October 2000 article in The Local Albany Weekly didn’t mention anything about little Michael’s medical condition but said that his family was “in need of funds to help pay for the transportation to the hospital and other costs not covered by their insurance.” A June 2000 message posted to the Ecunet mailing list indicated that Michael had just turned 3 years old, mentioned that his tumor appeared to be shrinking, and provided a mailing address for him:
The text is split by 'The Local Albany Weekly', which is the link's text, so why not get the link text and split on that?
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', "https://www.snopes.com/fact-check/michael-novenche/")
body = r.data
soup = BeautifulSoup(body, 'lxml')

for a in soup.findAll('a'):
    if a.has_attr('href'):
        if (a['href'] == "http://web.archive.org/web/20040330161553/http://newyork.local.ie/content/31666.shtml/albany/news/newsletters/general"):
            link_text = a
            link_para = a.find_parent("p")
            the_link = link_para.find('a')
            # change the text of the <i> tag to something unique
            the_link.string.replace_with('ooqieri')
            name_link = link_text.findAll('i')[0].get_text()
            full_text = link_para.get_text().split(name_link)
            print(full_text)
which gives:
['The message quoted above about Michael Novenche, a two-year-old boy undergoing chemotherapy to treat a brain tumor, was real, but keeping up with all the changes in his condition proved a challenge. The message quoted above stated that Michael had a large tumor in his brain, was operated upon to remove part of the tumor, and needed prayers to help him through chemotherapy to a full recovery. An October 2000 article in ', ' didn’t mention anything about little Michael’s medical condition but said that his family was “in need of funds to help pay for the transportation to the hospital and other costs not covered by their insurance.” A June 2000 message posted to the Ecunet mailing list indicated that Michael had just turned 3 years old, mentioned that his tumor appeared to be shrinking, and provided a mailing address for him:']
