I would like to have the links on this website : https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html
But not all the links, only those : links
Unfortunately my script here give me ALL the links.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html'
links = []
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
Any ideas how to do that ?
All of the links you want are contained in a div with class name listeEntreprises so you can do
links = [a['href'] for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]
Related
I'm trying to get the urls of the main articles from a news outlet using beautiful soup. Since I do not want to get ALL of the links on the entire page, I specified the class. My code only manages to display the titles of the news articles, not the links. This is the website: https://www.reuters.com/news/us
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("h3", {"class": "story-title"})
for i in links:
print(i.get_text().strip())
print()
Any help is greatly apreciated!
To get link to all articles you can use following code:
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.reuters.com/news/us').text
soup = BeautifulSoup(req, 'html.parser')
links = soup.findAll("div", {"class": "story-content"})
for i in links:
print(i.a.get('href'))
I want to collect the link : /hmarchhak/102217 from a site (https://www.vanglaini.org/) and print it as https://www.vanglaini.org/hmarchhak/102217. Please help
Img
import requests
import pandas as pd
from bs4 import BeautifulSoup
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
headline = article.a.text
summary=article.p.text
link = article.a.href
print(headline)
print(summary)
print(link)
print()
This is my code.
Unless I am missing something headline and summary appear to be the same text. You can use :has with bs4 4.7.1+ to ensure your article has a child href; and this seems to strip out article tag elements that are not part of main body which I suspect is actually your aim
from bs4 import BeautifulSoup as bs
import requests
base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')
for article in soup.select('article:has([href])'):
headline = article.h5.text.strip()
summary = re.sub(r'\n+|\r+',' ',article.p.text.strip())
link = f"{base}{article.a['href']})"
print(headline)
print(summary)
print(link)
I am trying something new with pulling out all the href's in the a tags. It isn't pulling out the hrefs though and cant figure out why.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for href in soup.findAll('a'):
h = href.attrs['href']
print(h)
You should check if the key exists, since it may also not exist an href between <a> tags.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
for a in soup.findAll('a'):
if 'href' in a.attrs:
print(a.attrs['href'])
So I'm trying to write a mediocre script to download subtitles from one particular website as y'all can see. I'm a newbie to beautifulsoup, so far I have a list of all the "href" after a search query(GET). So how do I navigate further, after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
dictn = link.get('href')
print(dictn)
You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
print(base_url_f + media.find('a')['href'])
out: https://www.yifysubtitles.com/movie-imdb/tt2527336
url = "https://twitter.com/realDonaldTrump?
ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a')
for link in soup.find_all('a'):
print(link.text, link.get('href'))
I have trouble retrieving the 'href' tag from the html. The code works in retrieving all the other 'href' except of the one i wanted which is "/realDonaldTrump/status/868985285207629825". I would like to retrieve the 'data-original-title' tag as well. Any help or suggestion?
import requests
from bs4 import BeautifulSoup
url = "https://twitter.com/realDonaldTrump?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a', {'class':'tweet-timestamp js-permalink js-nav js-tooltip'})
for link in links:
try:
print(link['href'])
if link['data-original-title']:
print(link['data-original-title'])
except:
pass