How to get sub-content from a Wikipedia page using BeautifulSoup - Python

I am trying to scrape sub-content from Wikipedia pages based on an internal link using Python. The problem is that my code scrapes all content from the page. How can I scrape just the paragraph the internal link points to? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link = "#الأسباب"
total = base_link + sub_link
r = requests.get(total)
soup = bs(r.text, 'html.parser')
results = soup.find('p')  # only returns the first <p> of the whole page
print(results)

That's because it's not a sublink you are trying to scrape; it's an anchor (a URL fragment, which the server never sees).
Request the entire page, then find the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id = "الأسباب"  # the part after '#', i.e. the id of the section heading

r = requests.get(base_link)
page = soup(r.text, 'html.parser')

# The heading contains a <span> whose id matches the anchor;
# the section's paragraphs are siblings that follow that heading.
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
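If you want every paragraph of the section rather than just the first one, here is a minimal sketch, assuming the section runs until the next h2/h3 heading:

# Collect every <p> between this heading and the next one.
section_paragraphs = []
for sibling in span.parent.find_next_siblings():
    if sibling.name in ('h2', 'h3'):  # the next section starts here, so stop
        break
    if sibling.name == 'p':
        section_paragraphs.append(sibling.get_text())
print('\n'.join(section_paragraphs))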

Related

How should I scrape href links from this website?

I'm trying to get every product's individual URL from this link: https://www.goodricketea.com/product/darjeeling-tea. How should I do that with BeautifulSoup? Is there anyone who can help me?
To get product links from this site, you can for example do:
import requests
from bs4 import BeautifulSoup
url = "https://www.goodricketea.com/product/darjeeling-tea"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for a in soup.select("a:has(>h2)"):
    print("https://www.goodricketea.com" + a["href"])
Prints:
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-organic-250gm
https://www.goodricketea.com/product/darjeeling-tea/roasted-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-whole-leaf-100gm
https://www.goodricketea.com/product/darjeeling-tea/thurbo-darjeeling-tea-fannings-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-premium-muscatel-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-50-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/castleton-vintage-darjeeling-tea-bags-100-tea-bags
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-250gm
https://www.goodricketea.com/product/darjeeling-tea/badamtam-exclusive-organic-darjeeling-tea-100gm
https://www.goodricketea.com/product/darjeeling-tea/seasons-3-in-1-darjeeling-leaf-tea-150gm-first-flush-second-flush-pre-winter-flush
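If you prefer not to hard-code the domain, urllib.parse.urljoin from the standard library resolves the relative hrefs against the page URL:

from urllib.parse import urljoin

for a in soup.select("a:has(>h2)"):
    print(urljoin(url, a["href"]))  # also handles hrefs that are already absolute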

Web scraping href links in Python with BeautifulSoup

I'm trying to write a web scraper to obtain information from LinkedIn job posts, including the job description, date, role, and link of each post. While I have made good progress obtaining information about the posts, I'm currently stuck on how to get the 'href' link of each one. I have made many attempts, including driver.find_element_by_class_name and the select_one method; neither obtains the 'canonical' link, both returning a None value. Could you please shed some light?
This is the part of my code that tries to get the href link:
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('link'):
    print(link.get('href'))
This is the link I'm trying to get: https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click
Picture of the code where the href link is stored
I think you were trying to access the href attribute incorrectly; to access an attribute, use object["attribute_name"].
This works for me, searching for just the <link> elements where rel="canonical":
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
for link in soup.find_all('link', rel='canonical'):
    print(link['href'])
The <link> has an attribute of rel="canonical". You can use an [attribute=value] CSS selector: [rel="canonical"] to get the value.
To use a CSS selector, use the .select_one() method instead of find().
import requests
from bs4 import BeautifulSoup
url = "https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D&position=1&pageNum=0&trk=public_jobs_job-result-card_result-card_full-click"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
print(soup.select_one('[rel="canonical"]')['href'])
Output:
https://www.linkedin.com/jobs/view/manager-risk-management-at-american-express-2545560153?refId=tOl7rHbYeo8JTdcUjN3Jdg%3D%3D&trackingId=Jhu1wPbsTyRZg4cRRN%2BnYg%3D%3D

Trying to Scrape a Span

I've been trying to scrape two values from a website using Beautiful Soup in Python, and it's been giving me trouble. Here is the URL of the page I'm scraping:
https://www.stjosephpartners.com/Home/Index
Here are the values I'm trying to scrape:
HTML of Website to be Scraped
I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.stjosephpartners.com/Home/Index').text
soup = BeautifulSoup(source, 'lxml')
gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
print(gold_spot_shell)
the output I got was: <list_iterator object at 0x039FD0A0>
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').contents
The output was: ['\n']
when I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').span
The output was: none
The HTML definitely has at least one span child. I'm not sure how to scrape the values I'm after. Thanks.
BeautifulSoup + requests is not a good method for scraping a dynamic website like this one. That span is generated by JavaScript, so when you fetch the HTML with requests, it simply does not exist.
You can try using Selenium instead.
You can check whether the website uses JavaScript to render an element by disabling JavaScript on the page and looking for the element again, or just by checking "view page source".
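A minimal Selenium sketch; the CSS selector below is an assumption based on the col-lg-10 div from the question, so adjust it to the real markup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # requires Chrome and a matching chromedriver
try:
    driver.get('https://www.stjosephpartners.com/Home/Index')
    # Wait until JavaScript has rendered the span inside the col-lg-10 div.
    span = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'div.col-lg-10 span'))
    )
    print(span.text)
finally:
    driver.quit()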

How to get specific URLs from a website in a class tag with Beautiful Soup? (Python)

I'm trying to get the URLs of the main articles from a news outlet using Beautiful Soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get links to all the articles, you can use the following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
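To pair each title with its link (combining the question's title scrape with the answer above), here is a small sketch, assuming the hrefs on this page are site-relative paths:

# Each story-content div is expected to hold both the h3 title and the <a> link.
for div in soup.findAll("div", {"class": "story-content"}):
    title = div.find("h3", {"class": "story-title"})
    if title and div.a:
        print(title.get_text().strip(), "->", "https://www.reuters.com" + div.a.get("href"))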

Web Scraping through links with Beautiful Soup

I'm trying to scrape the blog https://blog.feedspot.com/ai_rss_feeds/ and crawl through all the links in it to look for artificial-intelligence-related information in each of the crawled links.
The blog follows a pattern: it has multiple RSS feeds, and each feed has an attribute called "Site" in the UI. I need to get all the links in the "Site" attribute, for example aitrends.com, sciencedaily.com, etc. In the code, the main div has a class called "rss-block", which has another nested class called "data"; each "data" block contains several <a> tags, and the value in their href attributes gives the links to be crawled. We need to look for AI-related articles in each of the links found by scraping the above-mentioned structure.
I've tried various variations of the following code but nothing seemed to help much.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://blog.feedspot.com/ai_rss_feeds/')
soup = BeautifulSoup(page.text, 'html.parser')
class_name='data'
dataSoup = soup.find(class_=class_name)
print(dataSoup)
artist_name_list_items = dataSoup.find('a', href=True)
print(artist_name_list_items)
I'm struggling to even get the links on that page, let alone crawling through each of them to scrape the AI-related articles.
If you could help me finish both the parts of the problem, that'd be a great learning for me. Please refer to the source of https://blog.feedspot.com/ai_rss_feeds/ for the HTML Structure. Thanks in advance!
The first twenty results are stored in the html as you see them on the page. The others are pulled from a script tag, and you can regex them out to create the full list of 67. Then loop that list and issue requests to those pages for further info. I offer a choice of two different selectors for the initial list population (the second, commented out, uses :contains, available with bs4 4.7.1+).
from bs4 import BeautifulSoup as bs
import requests, re

p = re.compile(r'feed_domain":"(.*?)",')

with requests.Session() as s:
    r = s.get('https://blog.feedspot.com/ai_rss_feeds/')
    soup = bs(r.content, 'lxml')
    results = [i['href'] for i in soup.select('.data [rel="noopener nofollow"]:last-child')]
    ## or use with bs4 4.7.1+
    #results = [i['href'] for i in soup.select('strong:contains(Site) + a')]
    results += [re.sub(r'\n\s+', '', i.replace('\\', '')) for i in p.findall(r.text)]

    for link in results:
        # do something e.g.
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # extract info from indiv page
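For the second part of the question, deciding whether a crawled page is actually AI-related, here is a minimal keyword-based sketch; the keyword list is an assumption, so swap in whatever terms matter to you:

# Hypothetical filter: flag pages whose visible text mentions any AI term.
KEYWORDS = ('artificial intelligence', 'machine learning', 'deep learning')

def looks_ai_related(html_text):
    text = bs(html_text, 'lxml').get_text(' ').lower()
    return any(kw in text for kw in KEYWORDS)

# Inside the loop above: if looks_ai_related(r.text): ...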
To get all the sublinks for each block, you can use soup.find_all:
from bs4 import BeautifulSoup as soup
import requests
d = soup(requests.get('https://blog.feedspot.com/ai_rss_feeds/').text, 'html.parser')
results = [[i['href'] for i in c.find('div', {'class':'data'}).find_all('a')] for c in d.find_all('div', {'class':'rss-block'})]
Output:
[['http://aitrends.com/feed', 'https://www.feedspot.com/?followfeedid=4611684', 'http://aitrends.com/'], ['https://www.sciencedaily.com/rss/computers_math/artificial_intelligence.xml', 'https://www.feedspot.com/?followfeedid=4611682', 'https://www.sciencedaily.com/news/computers_math/artificial_intelligence/'], ['http://machinelearningmastery.com/blog/feed', 'https://www.feedspot.com/?followfeedid=4575009', 'http://machinelearningmastery.com/blog/'], ['http://news.mit.edu/rss/topic/artificial-intelligence2', 'https://www.feedspot.com/?followfeedid=4611685', 'http://news.mit.edu/topic/artificial-intelligence2'], ['https://www.reddit.com/r/artificial/.rss', 'https://www.feedspot.com/?followfeedid=4434110', 'https://www.reddit.com/r/artificial/'], ['https://chatbotsmagazine.com/feed', 'https://www.feedspot.com/?followfeedid=4470814', 'https://chatbotsmagazine.com/'], ['https://chatbotslife.com/feed', 'https://www.feedspot.com/?followfeedid=4504512', 'https://chatbotslife.com/'], ['https://aws.amazon.com/blogs/ai/feed', 'https://www.feedspot.com/?followfeedid=4611538', 'https://aws.amazon.com/blogs/ai/'], ['https://developer.ibm.com/patterns/category/artificial-intelligence/feed', 'https://www.feedspot.com/?followfeedid=4954414', 'https://developer.ibm.com/patterns/category/artificial-intelligence/'], ['https://lexfridman.com/category/ai/feed', 'https://www.feedspot.com/?followfeedid=4968322', 'https://lexfridman.com/ai/'], ['https://medium.com/feed/#Francesco_AI', 'https://www.feedspot.com/?followfeedid=4756982', 'https://medium.com/#Francesco_AI'], ['https://blog.netcoresmartech.com/rss.xml', 'https://www.feedspot.com/?followfeedid=4998378', 'https://blog.netcoresmartech.com/'], ['https://www.aitimejournal.com/feed', 'https://www.feedspot.com/?followfeedid=4979214', 'https://www.aitimejournal.com/'], ['https://blogs.nvidia.com/feed', 'https://www.feedspot.com/?followfeedid=4611576', 'https://blogs.nvidia.com/'], ['http://feeds.feedburner.com/AIInTheNews', 'https://www.feedspot.com/?followfeedid=623918', 'http://aitopics.org/whats-new'], ['https://blogs.technet.microsoft.com/machinelearning/feed', 'https://www.feedspot.com/?followfeedid=4431827', 'https://blogs.technet.microsoft.com/machinelearning/'], ['https://machinelearnings.co/feed', 'https://www.feedspot.com/?followfeedid=4611235', 'https://machinelearnings.co/'], ['https://www.artificial-intelligence.blog/news?format=RSS', 'https://www.feedspot.com/?followfeedid=4611100', 'https://www.artificial-intelligence.blog/news/'], ['https://news.google.com/news?cf=all&hl=en&pz=1&ned=us&q=artificial+intelligence&output=rss', 'https://www.feedspot.com/?followfeedid=4611157', 'https://news.google.com/news/section?q=artificial%20intelligence&tbm=nws&*'], ['https://www.youtube.com/feeds/videos.xml?channel_id=UCEqgmyWChwvt6MFGGlmUQCQ', 'https://www.feedspot.com/?followfeedid=4611505', 'https://www.youtube.com/channel/UCEqgmyWChwvt6MFGGlmUQCQ/videos']]
