I need to scrape a website to obtain some information, such as film titles and their corresponding links. My code runs correctly, but it stops at the first entry on the page. This is my code; thank you in advance for your help, and sorry if this is not a smart question, but I'm a novice.
import requests
from bs4 import BeautifulSoup
URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(URL):
    page = requests.get(URL)
    html = page.text
    soup = BeautifulSoup(html, 'lxml')
    films = soup.find_all("div", {"id": "movie_wide"})
    for film in films:
        link = film.find('p').find("a").attrs['href']
        title = film.find('p').find("a").text.strip('>')
        print(link)
        print(title)
Try the way below. Your loop only ran once because id values are unique in HTML, so find_all("div", {"id": "movie_wide"}) matches a single div, and you then read only the first <p> inside it. I've slightly modified your script to walk every <p> in that container and make it look better. Let me know if you encounter any further issues:
import requests
from bs4 import BeautifulSoup
URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    # the container div is unique, so fetch it once and iterate its <p> entries
    for film in soup.find(id="movie_wide").find_all("p"):
        link = film.find("a")['href']
        title = film.find("a").text
        print(link, title)

if __name__ == '__main__':
    scarica_pagina(URL)
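To see the diagnosis for yourself, a quick sketch against the same page, checking how many elements each call actually returns:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.simplyscripts.com/genre/horror-scripts.html')
soup = BeautifulSoup(page.text, 'lxml')

films = soup.find_all("div", {"id": "movie_wide"})
print(len(films))                    # 1: ids are unique, so the old loop ran once
print(len(films[0].find_all("p")))   # many: the per-film <p> entries to iterate over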
I'd like to scrape each news headline, the link of the news item, and the picture that goes with it. I tried the web scraping below, but it only targets the headlines, and it does not work:
import requests
import pandas as pd
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')

headlines = soup.find_all('h2', {'class': 'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)
Please recommend a fix.
This is because the site blocks bots: if you inspect the response, it comes back as a 403. Add headers={'User-Agent': 'Mozilla/5.0'} to the request.
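A minimal check that shows the block (using the same URL as above):

import requests

res = requests.get("https://news.mongabay.com/list/environment", verify=False)
print(res.status_code)  # 403: the default python-requests User-Agent is refused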
Try the code below:

import requests
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
# a browser-like User-Agent avoids the 403
res = requests.get(nbc_business, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')

headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
    print(headlines[i].text)
First things first: never post code as an image.
The <h2> in your HTML has no text of its own. What it does have is an <a> element, so:

for hl in headlines:
    link = hl.findChild()   # the first child element: the <a>
    text = link.text
    url = link.attrs['href']
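To also capture the picture the question asks about, a possible sketch: climb from each headline to its surrounding post block and look for an img there. The find_parent() step and the img lookup are assumptions about this page's markup, so adjust them to the HTML you actually see:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.mongabay.com/list/environment",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')

for hl in soup.find_all('h2', class_='post-title-news'):
    link = hl.find('a')
    if link is None:
        continue
    # assumption: the headline's parent block also holds the thumbnail
    container = hl.find_parent()
    img = container.find('img') if container is not None else None
    print(link.text.strip(), link.get('href'), img.get('src') if img else None)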
I am trying to scrape a book from a website, and while parsing it with Beautiful Soup I noticed that there were some errors. For example, this sentence:
"You have more… direct control over your skaa here. How many woul "Oh, a half dozen or so,"
The "more…" and " woul" are both errors that occurred somewhere in the script.
Is there any way to automatically clean mistakes like this up?
Example code of what I have is below.
import requests
from bs4 import BeautifulSoup
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())
trin = soup.tr.get_text()
final = str(trin)
print(final)
You need to unescape (convert) the HTML entities, as detailed here. To apply that in your situation and retain the text, you can use stripped_strings:
import requests
from bs4 import BeautifulSoup
import html

url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')

for r in soup.select_one('table tr').stripped_strings:
    s = html.unescape(r)
    print(s)
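For instance, the "more…" artifact is likely what an unconverted &hellip; entity looks like, and html.unescape restores the intended character:

import html

print(html.unescape('You have more&hellip; direct control'))
# -> You have more… direct control

The " woul" case is different: if the word is truncated in the site's own HTML, no unescaping can recover it; that has to be repaired in the source text.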
I'm using BeautifulSoup in Python to scrape a website.
While addrs and a_earths are crawled fine, points = soup.select('.addr_point') at the end can't be crawled, and I don't know the cause (the dashed red box in the image of the webpage below).
Following is the code block I'm using:
import urllib.request
from bs4 import BeautifulSoup
import re

url = 'http://www.dooinauction.com/auction/ca_list.php'
req = urllib.request.Request(url)
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')

tots = soup.select('div.title_left font')  # total
tot = int(re.findall(r'\d+', tots[0].text)[0])
print(f'total : {tot}건')

url = f'http://www.dooinauction.com/auction/ca_list.php?total_record={tot}&search_fm_off=1&search_fm_off=1&start=0'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

addrs = soup.select('.addr')                # crawling OK
a_earths = soup.select('.list_class.bold')  # crawling OK
points = soup.select('.addr_point')         # crawling NO
[Image of webpage]
I browsed your website, and it seems that I can't see the addr_point section in the HTML the server returns either. I think that may be the reason.
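A quick way to confirm, as a sketch: fetch the raw HTML and check whether the class name occurs in it at all. If it doesn't, the element is most likely injected by JavaScript after the page loads, so urllib/requests alone will never see it:

import urllib.request

url = 'http://www.dooinauction.com/auction/ca_list.php'
raw = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')
print('addr_point' in raw)  # False would mean the server-side HTML never contains it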
I ran similar code on another website and it works, but on opensubtitles.org I'm having a problem: it is not able to pick up the href values (the links I need) and the titles, and I don't know why.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'
def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt = 0
    for film in soup.find(id="search_results").find_all("td"):
        cnt = cnt + 1
        link = film.find("a")["href"]
        title = film.find("a").text
        #genres = film.find("i").text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)
All you need is to follow the DOM correctly:
1. First, choose the table with id 'search_results'.
2. Find all td tags whose class name is 'sb_star_odd' or 'sb_star_even'.
3. find_all('a')[0]['href'] is the link you want.
4. find_all('a')[0].text is the title you want.
import re
import requests
from bs4 import BeautifulSoup

URL = 'https://www.opensubtitles.org/it/search/sublanguageid-eng/searchonlymovies-on/genre-horror/movielanguage-english/moviecountry-usa/subformat-srt/hd-on/offset-4040'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    cnt = 0
    # keep only the result rows: their td class starts with 'sb_star'
    for film in soup.find(id="search_results").find_all('td', class_=re.compile('^sb_star')):
        cnt = cnt + 1
        link = film.find_all('a')[0]['href']
        title = film.find_all('a')[0].text
        print(link)

if __name__ == '__main__':
    scarica_pagina(URL)
You were using find instead of find_all, and that is what caused your problem.
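The difference in one self-contained snippet: find returns only the first matching element, while find_all returns every match.

from bs4 import BeautifulSoup

html = '<table><tr><td><a href="/a">A</a></td><td><a href="/b">B</a></td></tr></table>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find('td').a['href'])                      # /a  (first match only)
print([td.a['href'] for td in soup.find_all('td')])   # ['/a', '/b'] (every match)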
I am new to Python. I am building a crawler for the company I work for, and while crawling its website I found internal links that are not in the format I am used to. How can I get the entire link instead of only the path? If I was not clear, please run the code I made below:
import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    for link in soup.find_all('a'):
        print (link.get('href'))
    print soup

print get_first_page('http://www.fashionroom.com.br')
print web_page_string
Thanks everyone for the answers. I tried to put an if in the script. If anyone sees a potential problem with something I will find in the future, please let me know:
import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        if (link.get('href'))[0:4] == 'http':
            print (link.get('href'))
        else:
            print seed + '/' + (link.get('href'))
    print final_page_string

print get_first_page('http://www.fashionroom.com.br')
print web_page_string
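One potential problem with seed + '/' + href: it can double slashes and mis-resolve links such as '../page' or '//cdn.example.com/x'. The standard library's urljoin handles these cases; a minimal sketch in the same Python 2 style as the script above (in Python 3 the import would be from urllib.parse import urljoin):

import urllib2
import urlparse
from bs4 import BeautifulSoup

def get_links(seed):
    soup = BeautifulSoup(urllib2.urlopen(seed).read())
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:
            # urljoin resolves relative paths against the seed URL
            # and leaves absolute URLs untouched
            print urlparse.urljoin(seed, href)

get_links('http://www.fashionroom.com.br')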