Print only one link of the web with python3 web scraping - python

I´m trying to print the link o text of each headline of the following website http://www.infobolsa.es/news but when I run the code I keep getting the same output, the correct headline text but every link is the same. Here is the part of the link code, thank you:
from urllib.request import urlopen
html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
print(link)
bodyDictWeb2[titleM] = link.get('href')
break
for k,v in bodyDictWeb2.items():
print(k,":",v)

I have solved it, here is the code that works:
from urllib.request import urlopen
html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
print(link.text , link.get('href'))
break

Related

How to have some link sand not all the links with BeautifulSoup

I would like to have the links on this website : https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html
But not all the links, only those : links
Unfortunately my script here give me ALL the links.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html'
links = []
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
Any ideas how to do that ?
All of the links you want are contained in a div with class name listeEntreprises so you can do
links = [a['href'] for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]

BeautifulSoup : Fetched all the links on a webpage how to navigate through them without selenium?

So I'm trying to write a mediocre script to download subtitles from one particular website as y'all can see. I'm a newbie to beautifulsoup, so far I have a list of all the "href" after a search query(GET). So how do I navigate further, after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
dictn = link.get('href')
print(dictn)
You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
print(base_url_f + media.find('a')['href'])
out: https://www.yifysubtitles.com/movie-imdb/tt2527336

Unable to crawl Reddit's NBA page

I'm new to web crawling and want to learn how to use beautifulsoup to integrate it on a mini project. I was following thenewboston tutorial on beautifulsoup on his youtube channel then got stuck trying to crawl off of Reddit. I want to crawl titles and links on each of the NBA news on Reddit/r/nba but didn't have any success. Only thing that return in the terminal was "Process finished with exit code 0". I have a feeling it was to do with my selections? Any guidance and help would be greatly appreciated.
This is the original code, didn't work:
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'https://reddit.com/r/nba' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.find_all('a', {'class': 'title'}):
href = link.get('href')
print(href)
page += 1
spider(1)
I tried doing this way but that didn't solve the problem:
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'https://www.reddit.com/r/nba/' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll('a', {'class': 'title'}):
href = "https://www.reddit.com/" + link.get('href')
title = link.string
print(href)
print(title)
page += 1
spider(1)
Get titles and links on main page:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html = urlopen("https://www.reddit.com/r/nba/")
soup = BeautifulSoup(html, 'lxml')
for link in soup.find('div', {'class':'content'}).find_all('a', {'class':'title may-blank outbound'}):
print(link.attrs['href'], link.get_text())

Python - Get data from <a> tab using BeautifulSoup

url = "https://twitter.com/realDonaldTrump?
ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a')
for link in soup.find_all('a'):
print(link.text, link.get('href'))
I have trouble retrieving the 'href' tag from the html. The code works in retrieving all the other 'href' except of the one i wanted which is "/realDonaldTrump/status/868985285207629825". I would like to retrieve the 'data-original-title' tag as well. Any help or suggestion?
import requests
from bs4 import BeautifulSoup
url = "https://twitter.com/realDonaldTrump?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a', {'class':'tweet-timestamp js-permalink js-nav js-tooltip'})
for link in links:
try:
print(link['href'])
if link['data-original-title']:
print(link['data-original-title'])
except:
pass

Beautiful Soup Tutorial Error: jeriwieringa.com

I run this code from the tutorial here (http://jeriwieringa.com/blog/2012/11/04/beautiful-soup-tutorial-part-1/):
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("43rd-congress.htm"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
names = link.contents[0]
fullLink = link.get('href')
print names
print fullLink
And I get this error:
File "soupexample.py", line 11, in <module>
fullLink = link.get('href')
link is not defined
Why would I need to define link in links for this loop? What's the logic? Thanks for your help.
I guess mistake comes from here (somehow there is no indent in the example and there certainly should be):
for link in links:
names = link.contents[0]
fullLink = link.get('href')
print names
print fullLink

Categories

Resources