Python - Get data from <a> tab using BeautifulSoup

Python - Get data from <a> tab using BeautifulSoup - python

url = "https://twitter.com/realDonaldTrump?
ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a')
for link in soup.find_all('a'):
print(link.text, link.get('href'))
I have trouble retrieving the 'href' tag from the html. The code works in retrieving all the other 'href' except of the one i wanted which is "/realDonaldTrump/status/868985285207629825". I would like to retrieve the 'data-original-title' tag as well. Any help or suggestion?

import requests
from bs4 import BeautifulSoup
url = "https://twitter.com/realDonaldTrump?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all('a', {'class':'tweet-timestamp js-permalink js-nav js-tooltip'})
for link in links:
try:
print(link['href'])
if link['data-original-title']:
print(link['data-original-title'])
except:
pass

Related

How to have some link sand not all the links with BeautifulSoup

I would like to have the links on this website : https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html
But not all the links, only those : links
Unfortunately my script here give me ALL the links.
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.bilansgratuits.fr/secteurs/finance-assurance,k.html'
links = []
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)
Any ideas how to do that ?

All of the links you want are contained in a div with class name listeEntreprises so you can do
links = [a['href'] for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]

Print only one link of the web with python3 web scraping

I´m trying to print the link o text of each headline of the following website http://www.infobolsa.es/news but when I run the code I keep getting the same output, the correct headline text but every link is the same. Here is the part of the link code, thank you:
from urllib.request import urlopen
html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
print(link)
bodyDictWeb2[titleM] = link.get('href')
break
for k,v in bodyDictWeb2.items():
print(k,":",v)

I have solved it, here is the code that works:
from urllib.request import urlopen
html_page = urlopen("http://www.infobolsa.es/news")
soup = BeautifulSoup(html_page, 'lxml')
links = list()
for titleM in bodyDictWeb2:
for link in soup.findAll('a', attrs={'href': re.compile("^/news/detail")}):
print(link.text , link.get('href'))
break

Pull out href's with beautifulsoup attrs

I am trying something new with pulling out all the href's in the a tags. It isn't pulling out the hrefs though and cant figure out why.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for href in soup.findAll('a'):
h = href.attrs['href']
print(h)

You should check if the key exists, since it may also not exist an href between <a> tags.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
print(page.text)
soup = BeautifulSoup(page.text, 'html.parser')
for a in soup.findAll('a'):
if 'href' in a.attrs:
print(a.attrs['href'])

Python3 : BeautifulSoup4 not returning expected value

I'm currently trying to scrap some data over a website using BS4 under python 3.6.4 but the value returned is not what I am expecting:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page, "html5lib")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
I should get "39 900 €" but the code return "47Â 880Â â¬".
NB: Even without JS, the data should be "39 900 €".
Thanks for your help !

The encoding declaration is wrong on this page so BeautifulSoup gets told to use the wrong encoding. You can force it to use the correct encoding like this:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page.decode('utf-8','ignore'), "html5lib")
price = soup.find("div", {"class": "fieldPrice sizeC"}).text
print(price)
Outputs:
49 070 €

Instead of page.content use page.text
Ex:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.text
soup = BeautifulSoup(page, "html.parser")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
.text automatically decode content from the server

BeautifulSoup : Fetched all the links on a webpage how to navigate through them without selenium?

So I'm trying to write a mediocre script to download subtitles from one particular website as y'all can see. I'm a newbie to beautifulsoup, so far I have a list of all the "href" after a search query(GET). So how do I navigate further, after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
dictn = link.get('href')
print(dictn)

You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
print(base_url_f + media.find('a')['href'])
out: https://www.yifysubtitles.com/movie-imdb/tt2527336

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Get data from <a> tab using BeautifulSoup - python

Related

How to have some link sand not all the links with BeautifulSoup

Print only one link of the web with python3 web scraping

Pull out href's with beautifulsoup attrs

Python3 : BeautifulSoup4 not returning expected value

BeautifulSoup : Fetched all the links on a webpage how to navigate through them without selenium?

Categories

Resources