I tried to scrape an image from a Reddit post, but when I run this code snippet it sometimes shows me an HTML snippet and sometimes prints None (no error occurs). Can anybody tell me why? Here is the code.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.reddit.com/r/programmingmemes/').text
soup = BeautifulSoup(source, 'lxml')
img = soup.find('div', class_='_3Oa0THmZ3f5iZXAQ0hBJ0k')
print(img)
Check the HTTP status code of the request:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.reddit.com/r/programmingmemes/')
if source.status_code == 200:
    soup = BeautifulSoup(source.text, 'lxml')
    img = soup.find('div', class_='_3Oa0THmZ3f5iZXAQ0hBJ0k')
    print(img)
else:
    print(f"Error (code {source.status_code})")
Also check whether that class name stays the same over time (it may be randomized); if Reddit regenerates its class names, find() will return None even on a successful request.
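As a more robust alternative, Reddit exposes a JSON listing for each subreddit that does not depend on generated class names. This is only a minimal sketch, assuming the listing format has not changed; the field names used below (children, post_hint, url) come from Reddit's public JSON and may not cover every post type:

import requests

# Reddit tends to reject requests without a descriptive User-Agent.
headers = {'User-Agent': 'simple-image-scraper/0.1'}
resp = requests.get('https://www.reddit.com/r/programmingmemes/.json', headers=headers)

if resp.status_code == 200:
    listing = resp.json()
    for child in listing['data']['children']:
        post = child['data']
        # Only direct image posts carry a usable image URL here (assumption).
        if post.get('post_hint') == 'image':
            print(post['url'])
else:
    print(f"Error (code {resp.status_code})")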
I am following a web scraping tutorial and I am getting an error.
My code is as follows:
import requests

URL = "http://books.toscrape.com/"  # Replace this with the website's URL
getURL = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
print(getURL.status_code)

from bs4 import BeautifulSoup
soup = BeautifulSoup(getURL.text, 'html.parser')

images = soup.find_all('img')
print(images)

imageSources = []
for image in images:
    imageSources.append(image.get("src"))
print(imageSources)

for image in imageSources:
    webs = requests.get(image)
    open("images/" + image.split("/")[-1], "wb").write(webs.content)
Unfortunately, I am getting an error in the line webs=requests.get(image), which is as follows:
MissingSchema: Invalid URL 'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg': No schema supplied. Perhaps you meant http://media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg?
I am totally new to scraping and I don't know what this means. Any suggestion is appreciated.
You need to supply a proper URL in this line:
webs=requests.get(image)
That is because media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg is a relative path, not a complete URL, hence the MissingSchema error.
For example:
full_image_url = f"http://books.toscrape.com/{image}"
This gives you:
http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
Full code:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("http://books.toscrape.com/").text, 'html.parser')
images = soup.find_all('img')

imageSources = []
for image in images:
    imageSources.append(image.get("src"))

for image in imageSources:
    full_image_url = f"http://books.toscrape.com/{image}"
    webs = requests.get(full_image_url)
    open(image.split("/")[-1], "wb").write(webs.content)
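If some src attributes are already absolute URLs, prefixing them blindly would break them. A safer sketch, continuing from the snippet above, uses urljoin from the standard library, which resolves relative paths against the base and leaves absolute URLs untouched:

from urllib.parse import urljoin

base = "http://books.toscrape.com/"
for image in imageSources:
    # urljoin handles both relative paths and full URLs correctly.
    full_image_url = urljoin(base, image)
    webs = requests.get(full_image_url)
    with open(full_image_url.split("/")[-1], "wb") as f:
        f.write(webs.content)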
I used this code in VS Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name= "gem-c-document-list__item-link"
titres = soup.find_all("a", class_=class_name)
titres_textes=[]
for titre in titres:
    titres_textes.append(titre.string)
titres_textes
But when I try to run it with Ctrl+Alt+N, nothing happens. Why?
Python version: above 3.10
VS Code extensions: Python, Django, Magic Python, Code Runner, Python for VS Code (all installed and working)
pip: latest versions currently installed
Use print() to display the result (a bare expression like titres_textes only shows its value in an interactive session, not when the file is run as a script) and keep the code readable:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name= "gem-c-document-list__item-link"
titres = soup.find_all("a", class_=class_name)
titres_textes=[]
for titre in titres:
    titres_textes.append(titre.string)
print(titres_textes)
Try running your code from the VS Code terminal: change into the file's directory first, then type the command:
python filename.py
I agree with BrutusForcus; it is just because the HTML of the page has changed. You can change the value of class_name to a class that still exists on the page, and remove the .string after titre, to make it work.
For example:
import requests
from bs4 import BeautifulSoup
url = "https://www.gov.uk/search/news-and-communications"
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, "html.parser")
class_name = "gem-c-document-list__item-metadata"
titres = soup.find_all("ul", class_=class_name)
titres_textes = []
for titre in titres:
    titres_textes.append(titre)
print(titres_textes)
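If you want the readable text rather than the whole tags, note that .string returns None whenever a tag contains more than one child element; get_text() is a small sketch of an alternative (still assuming the class name above matches the current page):

titres_textes = []
for titre in titres:
    # get_text() concatenates all nested text nodes, unlike .string
    titres_textes.append(titre.get_text(" ", strip=True))
print(titres_textes)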
I ran the code fine, then I tweaked it, saved and closed it, and when I tried to run it again I got a syntax error. My stupid self didn't back up the original code, and now nothing I change seems to fix it. I checked the source code of the website and that hasn't changed; it's erroring before it even reaches the website. Any suggestions on what I overlooked?
import requests
import time
import bs4
import sys

sys.stdout = open("links2.txt", "a")

for x in range(0, 100000):
    try:
        URL = f'https://wesbite.com/{x}'
        page = requests.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(page.content, 'html.parser')
        website = "https://v.website.com/"
        for links in soup.find('div', id='view').find_all('a'):
            parts = links['href'].split("/")
            new_link = parts[1].replace(parts[1], website) + '/'.join(parts[2:]) + ".mp4"
            print(new_link)
    except:
        continue
It's reporting a syntax error on the line that reads: URL = f'https://wesbite.com/{x}'
Here is your working code now:
import requests
import time
from bs4 import BeautifulSoup
import sys

sys.stdout = open("links2.txt", "a")

for x in range(0, 100000):
    try:
        URL = f'https://wesbite.com/{x}'
        page = requests.get(URL)
        time.sleep(1)
        soup = BeautifulSoup(page.content, 'html.parser')
        website = "https://v.website.com/"
        for links in soup.find('div', id='view').find_all('a'):
            parts = links['href'].split("/")
            new_link = parts[1].replace(parts[1], website) + '/'.join(parts[2:]) + ".mp4"
            print(new_link)
    except:
        continue
It was:
import bs4
Now:
from bs4 import BeautifulSoup
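Either import style works as long as the call matches it; a minimal illustration of the two spellings:

import bs4
soup = bs4.BeautifulSoup("<p>hi</p>", "html.parser")   # qualified name

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>hi</p>", "html.parser")        # bare name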
So I want to get the image source from this website:
https://www.pixiv.net/en/artworks/77619496
But every time I try to scrape it with bs4 I keep failing; I've tried other posts too but couldn't get them to work.
It keeps returning None.
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("img")
print(x)
If you look at the Network tab of Chrome's developer tools (or the equivalent in whichever browser you are using), you should see that there are no img elements in the initial response; the page generates the img elements by executing JavaScript. However, if you inspect the page, there is a meta element that carries the image data, and you can parse its content as JSON, as shown below:
import requests, json
from bs4 import BeautifulSoup
url = 'https://www.pixiv.net/en/artworks/77564597'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
x = soup.find("meta", {"id": "meta-preload-data"}).get("content")
usefulData = json.loads(x)
print(usefulData)
Running this prints the preloaded JSON, which contains the image data for the post.
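The exact layout of that JSON is not documented, so rather than hard-coding key names, a defensive sketch (continuing from the snippet above, and assuming image URLs end in common extensions) can simply walk the parsed structure and collect anything that looks like an image URL:

def find_image_urls(node, found=None):
    # Recursively walk dicts and lists, collecting string values that look like image URLs.
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            find_image_urls(value, found)
    elif isinstance(node, list):
        for value in node:
            find_image_urls(value, found)
    elif isinstance(node, str) and node.startswith("http") and node.endswith((".jpg", ".png", ".gif")):
        found.append(node)
    return found

print(find_image_urls(usefulData))

Alternatively, you can render the page in a real browser with Selenium and then parse the resulting HTML: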
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.pixiv.net/en/artworks/77564597'
sada = browser.get(url)
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for item in soup.find_all('div', attrs={'class': 'sc-fzXfPI fRnFme'}):
    for img in item.find_all('img', attrs={'class': 'sc-fzXfPJ lclRkv'}):
        print(img.get('src'))
Output:
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/23/08/01/08/76410568_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/07/24/03/41/47/75881545_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/05/30/04/24/27/74969583_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/custom-thumb/img/2019/11/28/00/02/59/78026183_p0_custom1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/10/31/04/15/04/77564597_p0_square1200.jpg
https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/08/30/07/23/45/76528190_p0_square1200.jpg
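A small follow-up on the Selenium approach: if you do not want a visible browser window to open, Firefox can be started headless. This is a sketch assuming a reasonably recent Selenium release:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")           # run Firefox without a visible window
browser = webdriver.Firefox(options=options)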
I need to scrape a website to obtain some information, like each film's title and the corresponding link. My code runs without errors, but it stops at the first line of the website. This is my code; thank you in advance for your help, and sorry if this is not a smart question, but I'm a novice.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(URL):
    page = requests.get(URL)
    html = page.text
    soup = BeautifulSoup(html, 'lxml')
    films = soup.find_all("div", {"id": "movie_wide"})
    for film in films:
        link = film.find('p').find("a").attrs['href']
        title = film.find('p').find("a").text.strip('>')
        print(link)
        print(title)
Try the below way. I've slightly modified your script to serve the purpose and make it look better. Let me know if you encounter any further issues:
import requests
from bs4 import BeautifulSoup

URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    # Select the single container div once, then loop over every <p> inside it,
    # so all entries are processed instead of only the first one.
    for film in soup.find(id="movie_wide").find_all("p"):
        link = film.find("a")['href']
        title = film.find("a").text
        print(link, title)

if __name__ == '__main__':
    scarica_pagina(URL)