How to retrieve a website link in Python using BeautifulSoup - python

I want to collect the link /hmarchhak/102217 from a site (https://www.vanglaini.org/) and print it as https://www.vanglaini.org/hmarchhak/102217. Please help.
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary=article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
    print()
This is my code.

Unless I am missing something, headline and summary appear to be the same text. With bs4 4.7.1+ you can use the :has pseudo-class to ensure each article contains an element with an href attribute. This also seems to strip out article elements that are not part of the main body, which I suspect is actually your aim.
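import re
import requests
from bs4 import BeautifulSoup as bs
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
# keep only <article> elements that contain something with an href attribute
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    # collapse line breaks inside the summary into single spaces
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    # href is an attribute, so read it with ['href'] and prepend the site root
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)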
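The step from a relative href to a full URL can also be handled with urljoin from the standard library, which copes with links that are already absolute. Below is a minimal sketch on a made-up HTML snippet; the markup and the absolute_links helper are assumptions for illustration, not the real structure of vanglaini.org.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.vanglaini.org"

# Hypothetical markup, only loosely modelled on the question.
SAMPLE_HTML = """
<article><a href="/hmarchhak/102217"><h5>Headline one</h5></a><p>Summary one</p></article>
<article><p>No link in this one</p></article>
<article><a href="https://example.com/full"><h5>Headline two</h5></a></article>
"""


def absolute_links(html, base):
    """Return an absolute URL for the first link of each <article> that has one."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for article in soup.find_all("article"):
        anchor = article.find("a", href=True)  # None when the article has no link
        if anchor is not None:
            links.append(urljoin(base, anchor["href"]))
    return links


print(absolute_links(SAMPLE_HTML, BASE))
```

Looking the link up with find("a", href=True) and checking for None avoids the AttributeError that article.a['href'] would raise on an article without a link.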

Related

Webscrape a table with BeautifulSoup

I'm trying to get the tables (and then the tr and td contents) with requests and BeautifulSoup from this link: https://www.basketball-reference.com/teams/PHI/2022/lineups/ , but I get no results.
I tried with:
import requests
from bs4 import BeautifulSoup
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table')
However the result of tables is [].
It looks like the tables are placed in the comments, so you have to adjust the response text:
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
page = page.text.replace("<!--","").replace("-->","")
soup = BeautifulSoup(page, 'html.parser')
tables = soup.find_all('table')
In addition, as @chitown88 also mentioned, you can use BeautifulSoup's Comment class to find all comments in the HTML. Be aware that the comments are plain strings, so you have to parse them with BeautifulSoup again:
soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)
Example
import requests
from bs4 import BeautifulSoup
from bs4 import Comment
url = "https://www.basketball-reference.com/teams/PHI/2022/lineups/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soupTables = BeautifulSoup(''.join(soup.find_all(string=lambda text: isinstance(text, Comment) and '<table' in text)), 'html.parser')
soupTables.find_all('table')
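To see the comment approach work without a network request, here is a small sketch on an invented page with one visible table and one table wrapped in a comment. The markup, the ids, and the tables_in_comments helper are all assumptions for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: one visible table and one table hidden inside a comment.
SAMPLE_HTML = """
<div><table id="visible"><tr><td>1</td></tr></table></div>
<div><!--
<table id="hidden"><tr><td>2</td></tr></table>
--></div>
"""


def tables_in_comments(html):
    """Parse every HTML comment that contains a <table> and return those tables."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(
        string=lambda text: isinstance(text, Comment) and "<table" in text
    )
    # Each comment is a plain string, so it has to be parsed again.
    hidden = BeautifulSoup("".join(comments), "html.parser")
    return hidden.find_all("table")


table_ids = [table["id"] for table in tables_in_comments(SAMPLE_HTML)]
print(table_ids)
```

Unlike the replace("<!--", "") approach, this returns only the tables that were inside comments, so the visible table would have to be collected separately.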

How to get some links and not all the links with BeautifulSoup

I would like to get the links from this website: https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html
But not all the links, only these: links
Unfortunately, my script gives me ALL the links.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html'
links = []
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
Any ideas how to do that?
All of the links you want are contained in a div with the class name listeEntreprises, so you can do:
links = [a['href'] for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]
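If the div is ever missing, soup.find returns None and the one-liner above raises an AttributeError. A guarded version might look like the sketch below; the sample markup and the links_in_container helper are assumptions, and only the class name listeEntreprises comes from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class name listeEntreprises comes from the answer.
SAMPLE_HTML = """
<nav><a href="/home">Home</a></nav>
<div class="listeEntreprises">
  <a href="/company-a.html">Company A</a>
  <a href="/company-b.html">Company B</a>
  <a>No href here</a>
</div>
<footer><a href="/contact">Contact</a></footer>
"""


def links_in_container(html, class_name):
    """Return the hrefs found inside the first <div> with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:  # layout changed or the class is missing
        return []
    return [a["href"] for a in container.find_all("a", href=True)]


company_links = links_in_container(SAMPLE_HTML, "listeEntreprises")
print(company_links)
```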

Retrieve HTML tag content using BeautifulSoup

I'm trying to get the plain text of a website article using Python. I've heard about the BeautifulSoup library, but how do I retrieve a specific tag in an HTML page?
This is what I have done:
import requests
from bs4 import BeautifulSoup
base_url = 'http://www.nytimes.com'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
Take a look at this:
import bs4 as bs
import requests as rq
html = rq.get('https://site.com') # placeholder URL; requests needs the scheme
s = bs.BeautifulSoup(html.text, features="html.parser")
div = s.find('div', {'class': 'yourclass'}) # or search by id instead
print(div.text) # print the text inside the tag
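For the plain text of an article, it can help to collect the paragraphs one by one so that paragraph boundaries survive. The sketch below assumes an invented container class, article-body, and invented markup; a real site will use different names.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup; the class name "article-body" is made up.
SAMPLE_HTML = """
<div class="article-body">
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
  <p>Second
     paragraph.</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
body = soup.find("div", class_="article-body")

# get_text() flattens nested tags; split/join normalises runs of whitespace.
paragraphs = [" ".join(p.get_text().split()) for p in body.find_all("p")]
plain_text = "\n".join(paragraphs)
print(plain_text)
```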

BS4 returns [] instead of the wanted HTML tag

I want to parse the given website and scrape the table. To me the code looks right. I'm new to Python and web parsing.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'lxml-xml')
cases = doc.find_all('div', {"class": "cell"})
print(cases)
Doing this returns:
[]
Change your parser and the class and there you have it.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://delhifightscorona.in/').text, 'html.parser').find('div', {"class": "grid-x grid-padding-x small-up-2"})
print(soup.find("h3").getText())
Output:
423,831
You can choose to print only the cases or the total stats with the date.
import requests
from bs4 import BeautifulSoup
response = requests.get('https://delhifightscorona.in/')
doc = BeautifulSoup(response.text, 'html.parser')
stats = doc.find('div', {"class": "cell medium-5"})
print(stats.text) #Print the whole block with dates and the figures
cases = stats.find('h3')
print(cases.text) #Print the cases only
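The class argument behaves differently depending on whether it holds one class name or several. The sketch below illustrates this on invented markup that reuses the class names from the answer; the second div, with the classes in reverse order, is an assumption added to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names mirror those in the answer above.
SAMPLE_HTML = """
<div class="cell medium-5"><h3>423,831</h3><p>Total cases</p></div>
<div class="medium-5 cell"><h3>100</h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single class name matches any element that carries that class.
by_single_class = soup.find_all("div", class_="cell")

# A string with two class names only matches that exact attribute value.
by_exact_string = soup.find_all("div", class_="cell medium-5")

# A CSS selector matches both classes regardless of their order.
by_css = soup.select("div.cell.medium-5")

print(len(by_single_class), len(by_exact_string), len(by_css))
```

When the order of classes on a page is not guaranteed, the CSS selector form is usually the safer choice.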

How to get specific urls from a website in a class tag with beautiful soup? (Python)

I'm trying to get the urls of the main articles from a news outlet using beautiful soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
    print(i.get_text().strip())
    print()
Any help is greatly appreciated!
To get the links to all articles, you can use the following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
    print(i.a.get('href'))
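If the titles are already selected, as in the question, another option is to walk up from each title to its enclosing link with find_parent. This is only a sketch: the markup below is invented and assumes each h3 sits inside its a tag, which may not match the real Reuters page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assuming each title sits inside its link.
SAMPLE_HTML = """
<div class="story-content">
  <a href="/article/one"><h3 class="story-title"> Title one </h3></a>
</div>
<div class="story-content">
  <a href="/article/two"><h3 class="story-title"> Title two </h3></a>
</div>
<h3 class="story-title">Title without a link</h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

stories = []
for title in soup.find_all("h3", class_="story-title"):
    link = title.find_parent("a", href=True)  # None if there is no enclosing link
    if link is not None:
        stories.append((title.get_text(strip=True), link["href"]))

print(stories)
```

This keeps each title paired with its URL, which printing the two lists separately does not guarantee.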
