Why is this code not working in VS Code? (Python)

I used this code in VS Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name= "gem-c-document-list__item-link"
titres = soup.find_all("a", class_=class_name)
titres_textes=[]
for titre in titres:
titres_textes.append(titre.string)
titres_textes
But when I try to run it with Ctrl+Alt+N, nothing happens. Why?
Python version: > 3.10
VS Code extensions: Python, Django, Magic-python, Code Runner, Python for VSCode (all installed and OK)
pip: latest versions currently installed

Use print() and maintain proper code readability. In a script (unlike the interactive interpreter or a notebook), a bare titres_textes on the last line is evaluated and discarded, so nothing is displayed:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name= "gem-c-document-list__item-link"
titres = soup.find_all("a", class_=class_name)
titres_textes=[]
for titre in titres:
titres_textes.append(titre.string)
print(titres_textes)
Try running your code from the VS Code terminal. Change to the file's directory first, then type the command:
python filename.py

I agree with BrutusForcus: it is just that the HTML page has changed. You can change the value of class_name to something else and remove the .string after titre to make it work, such as this:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name = "gem-c-document-list__item-metadata"
titres = soup.find_all("ul", class_=class_name)
titres_textes = []
for titre in titres:
    titres_textes.append(titre)
print(titres_textes)
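If you want just the visible text instead of the raw tags, you could call get_text() on each match. A small sketch building on the code above (the class name is copied from the answer and may have changed again since):
import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/search/news-and-communications"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# get_text() flattens a tag and its children to their visible text
for item in soup.find_all("ul", class_="gem-c-document-list__item-metadata"):
    print(item.get_text(" ", strip=True))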


Get source from image from webpage

So I want to get the image source from this website:
https://www.pixiv.net/en/artworks/77619496
But every time I try to scrape it with bs4 I keep failing; I've tried other posts too but couldn't get it to work. It keeps returning None:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("img")
print(x)
If you look at the Network section of Chrome's debug console, or the console of the browser you are using, you should see that there are no img elements at the beginning; the page generates the img elements by executing JavaScript. However, I inspected the page and there is a meta element which has the image data in it, and you can parse it with JSON as shown:
import requests, json
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("meta", {"id": "meta-preload-data"}).get("content")
usefulData = json.loads(x)
print(usefulData)
The sample output is here.
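If you only need the image URLs, you could then walk the parsed structure. A minimal sketch; the "illust" and "urls" keys are an assumption from inspecting the page at the time of writing, not a stable API:
# Hypothetical key names observed on the page; Pixiv may change them at any time
for illust in usefulData.get("illust", {}).values():
    urls = illust.get("urls", {})
    print(urls.get("original"))
Alternatively, you can render the page with Selenium and scrape the generated img elements: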
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.pixiv.net/en/artworks/77564597'
browser.get(url)
time.sleep(3)  # give the JavaScript time to render the images
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for item in soup.findAll('div', attrs={'class': 'sc-fzXfPI fRnFme'}):
    for img in item.findAll('img', attrs={'class': 'sc-fzXfPJ lclRkv'}):
        print(img.get('src'))
Output:
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/23/08/01/08/76410568_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/07/24/03/41/47/75881545_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/05/30/04/24/27/74969583_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg

'NoneType' object is not callable in Beautiful Soup 4

I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page and then, with those links, repeat the process until I have an entire website parsed.
import bs4 as bs
import urllib.request as url
links_unclean = []
links_clean = []

soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')

for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))

for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)

print(links_clean)

while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))
        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)
        links_clean = list(dict.fromkeys(links_clean))
    input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
Can you please help?
Be careful when importing modules as something. In this case, the url module imported on line 2 gets shadowed when you reuse url as the loop variable in your for loops.
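A minimal sketch of the shadowing, using a tiny inline document for illustration:
from bs4 import BeautifulSoup
import urllib.request as url

soup = BeautifulSoup('<a href="https://example.com">x</a>', 'html.parser')
for url in soup.find_all('a'):  # rebinds the name url to a bs4 Tag
    pass

# url is now the last Tag; Tag.urlopen falls back to tag.find('urlopen'),
# which returns None, so calling it raises the error you saw:
# TypeError: 'NoneType' object is not callable
url.urlopen('https://example.com')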
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen
content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture, as not all links are captured, because of errors on the page with HTML tags. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the Requests-HTML package):
from requests_html import HTML, HTMLSession
session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')
for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
I would consider using an attribute = value CSS selector with the ^ operator to specify that the href attribute values begin with https; you will then only have valid protocols. Also, use a set comprehension to ensure no duplicates, and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
final = []
with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)
tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)

Python3 : BeautifulSoup4 not returning expected value

I'm currently trying to scrape some data from a website using BS4 under Python 3.6.4, but the value returned is not what I am expecting:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page, "html5lib")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
I should get "39 900 €" but the code returns "47 880 â¬".
NB: even without JS, the data should be "39 900 €".
Thanks for your help!
The encoding declaration is wrong on this page so BeautifulSoup gets told to use the wrong encoding. You can force it to use the correct encoding like this:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page.decode('utf-8','ignore'), "html5lib")
price = soup.find("div", {"class": "fieldPrice sizeC"}).text
print(price)
Outputs:
49 070 €
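Alternatively, you could let requests sniff the real encoding before decoding. A sketch (apparent_encoding runs charset detection, which can be slow on large pages):
import requests
from bs4 import BeautifulSoup

link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
request.encoding = request.apparent_encoding  # override the wrongly declared charset
soup = BeautifulSoup(request.text, "html5lib")
print(soup.find("div", {"class": "fieldPrice sizeC"}).text)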
Instead of page.content, use page.text.
Ex:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.text
soup = BeautifulSoup(page, "html.parser")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
.text automatically decodes the content from the server.
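For illustration, the difference between the two attributes (a minimal sketch using example.com):
import requests

r = requests.get("https://example.com")
print(type(r.content))  # <class 'bytes'> raw, undecoded bytes
print(type(r.text))     # <class 'str'>   decoded using r.encoding
print(r.encoding)       # taken from the Content-Type header when present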

Scraping stopping at first line

I need to scrape a website to obtain some information, like films' titles and their relative links. My code runs correctly, but it stops at the first line of the website. This is my code. Thank you in advance for your help, and sorry if this is not a smart question, but I'm a novice.
import requests
from bs4 import BeautifulSoup
URL= 'http://www.simplyscripts.com/genre/horror-scripts.html'
def scarica_pagina(URL):
    page = requests.get(URL)
    html = page.text
    soup = BeautifulSoup(html, 'lxml')
    films = soup.find_all("div", {"id": "movie_wide"})
    for film in films:
        link = film.find('p').find("a").attrs['href']
        title = film.find('p').find("a").text.strip('>')
        print(link)
        print(title)
Try the below way. I've slightly modified your script to serve the purpose and make it look better: your film.find('p') only ever looks at the first p element inside the single div with id "movie_wide", which is why you got just one result; iterating over all the p tags inside that div picks up every film. Let me know if you encounter any further issues:
import requests
from bs4 import BeautifulSoup
URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'
def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    for film in soup.find(id="movie_wide").find_all("p"):
        link = film.find("a")['href']
        title = film.find("a").text
        print(link, title)

if __name__ == '__main__':
    scarica_pagina(URL)

BeautifulSoup: fetched all the links on a webpage, how do I navigate through them without Selenium?

So I'm trying to write a mediocre script to download subtitles from one particular website, as y'all can see. I'm a newbie to BeautifulSoup; so far I have a list of all the "href" values after a search query (GET). So how do I navigate further after getting all the links?
Here's the code:
import requests
from bs4 import BeautifulSoup
usearch = input("Movie Name? : ")
url = "https://www.yifysubtitles.com/search?q="+usearch
print(url)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all('a'):
    dictn = link.get('href')
    print(dictn)
You need to use resp.text instead of resp.content
Try this to get the search results.
import requests
from bs4 import BeautifulSoup
base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"
resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
for media in soup.find_all("div", {"class": "media-body"}):
    print(base_url_f + media.find('a')['href'])
Output:
https://www.yifysubtitles.com/movie-imdb/tt2527336
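To actually navigate further, you could feed each absolute URL back into requests and parse the next page the same way. A minimal sketch building on the answer above (the media-body class is copied from it and may change):
import requests
from bs4 import BeautifulSoup

base_url_f = "https://www.yifysubtitles.com"
search_url = base_url_f + "/search?q=last+jedi"

resp = requests.get(search_url)
soup = BeautifulSoup(resp.text, 'lxml')
links = [base_url_f + media.find('a')['href']
         for media in soup.find_all("div", {"class": "media-body"})]

# visit each result page and parse it in turn
for link in links:
    page = BeautifulSoup(requests.get(link).text, 'lxml')
    print(page.title.text if page.title else link)  # e.g. the movie page title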
