BeautifulSoup: Why is dlink.find('a')['href'] not working? - python

I'm trying to create a script to download subtitles from one specific website. Please read the comments in the code.
Here's the code:
import requests
from bs4 import BeautifulSoup

count = 0
usearch = input("Movie Name? : ")
search_url = "https://www.yifysubtitles.com/search?q=" + usearch
base_url = "https://www.yifysubtitles.com"
print(search_url)
resp = requests.get(search_url)
soup = BeautifulSoup(resp.content, 'lxml')
for link in soup.find_all("div", {"class": "media-body"}):  # Get the exact class: 'media-body'
    imdb = link.find('a')['href']  # Find the link in that class, which is the exact link we want
    movie_url = base_url + imdb  # Merge the result with the base string to build the movie page URL
    print("Movie URL : {}".format(movie_url))  # Print the URL just to check.. :p
    next_page = requests.get(movie_url)  # Soup number 2 begins here, after navigating to the movie page
    soup2 = BeautifulSoup(next_page.content, 'lxml')
    # print(soup2.prettify())
    for links in soup2.find_all("tr", {"class": "high-rating"}):  # Subtitle rows with class 'high-rating'
        for flags in links.find("td", {"class": "flag-cell"}):  # Look at the flags of the high-rated subtitles
            if flags.text == "English":  # If the flag is English, get the download link
                print("After if : {}".format(links))
                for dlink in links.find("td", {"class": "download-cell"}):  # Navigate to the "download-cell" class where the download href lives
                    half_dlink = dlink.find('a')['href']  # STUCK HERE!!! HERE'S THE PROBLEM!!! SOS!!! HELP!!!
                    download = base_url + half_dlink
                    print(download)
I'm getting the following error:
File "C:/Users/PycharmProjects/WhatsApp_API/SubtitleDownloader.py", line 24, in <module>
for x in dlink.find("a"):
TypeError: 'NoneType' object is not iterable

Just change this line:
for dlink in links.find("td",{"class": "download-cell"}):
to this:
for dlink in links.find_all("td",{"class": "download-cell"}):
because you were looping over a single element rather than a list.
Note: The only difference is that find_all() returns a list containing the single result, while find() just returns the result directly.
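For example (a quick sketch with made-up markup):
from bs4 import BeautifulSoup

cell = BeautifulSoup("<div class='download-cell'><a href='/subtitles/example'>download</a></div>", "html.parser")
print(cell.find("a"))      # a single Tag (or None if nothing matches)
print(cell.find_all("a"))  # a ResultSet (list-like), here containing that one Tag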
Hope this helps! :)

Have a look at the documentation of find_all() and find().
find_all():
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
find():
The find_all() method scans the entire document looking for results, but sometimes you only want to find one result. If you know a document only has one <body> tag, it's a waste of time to scan the entire document looking for more. Rather than passing in limit=1 every time you call find_all, you can use the find() method.
So you don't need to loop over the result of find() to get the tags. You need to make the following changes in your code (the unnecessary for loops are removed):
...
# Previous code is the same
soup2 = BeautifulSoup(next_page.content, 'lxml')
for links in soup2.find_all("tr", {"class": "high-rating"}):
    if links.find("td", {"class": "flag-cell"}).text == "English":
        print("After if : {}".format(links))
        half_dlink = links.find('td', {'class': 'download-cell'}).a['href']
        download = base_url + half_dlink
        print(download)
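Note that if a row is ever missing one of those cells, find() returns None and the chained lookup will raise again. A defensive sketch of the same loop (optional, with explicit None checks) could look like:
for row in soup2.find_all("tr", {"class": "high-rating"}):
    flag_cell = row.find("td", {"class": "flag-cell"})
    dl_cell = row.find("td", {"class": "download-cell"})
    # Skip rows that lack either cell instead of raising on None
    if flag_cell is None or dl_cell is None or dl_cell.a is None:
        continue
    if flag_cell.text.strip() == "English":
        print(base_url + dl_cell.a["href"])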

Related

Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an index error.
CODE:
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
>>>
The reason for the error is that some of the <a> tags don't contain a <b> tag.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs

MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"

with open("output.txt", "a", encoding="utf-8") as f:
    for page in range(203):
        if page == 0:
            req = requests.get(MAIN_URL.format(page))
        else:
            req = requests.get(URL.format(page))
        soup = bs(req.text, "html.parser")
        print(f"page # {page}, Getting: {req.url}")
        book_name = (
            tag.get_text(strip=True)
            for tag in soup.select(
                "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
            )
        )
        f.seek(0)
        f.write("\n".join(book_name) + "\n")
I suggest changing your parser to html5lib (pip install html5lib); I just think it handles messy markup better. Second, it's better not to call .find() directly on your soup object, because tags and class names can be duplicated elsewhere in the document, so you might end up matching a part of the page where your data isn't even present. It's better to inspect the page first, find the block of markup that actually contains the data you want, and work from there; it makes the scraping easier and avoids errors.
That's what I did here: I inspected the elements and found the block that holds all the names you are trying to get, a div with the class mb-40 content-box. Luckily that class is unique, and no other element has the same tag and class, so we can .find() it directly.
The value of trs is then simply the <tr> tags inside that block. (Those <tr> tags sit inside a <table> tag, but they are the only <tr> tags on the page, so there is no risk of picking up rows from some other table with the same class value.) Those <tr> tags contain the names you want. You may ask why there is a [1:]: it starts at index 1 so the table header from the website is not included.
Then just loop through those <tr> tags and get the text. As for your error: the IndexError happens because you are accessing an item of a .find_all() result list that is out of bounds. That can happen when no matching data is found at all, and it is more likely when you call .find() directly on your soup variable, because sometimes there are tags with the same class values but with different content inside them. You expect to scrape one particular part of the website but actually scrape a different part, which is why you get no data and wonder what went wrong.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')
div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr", class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError you're getting means that, in this case, the <a> tag doesn't contain the <b> tag with the information you are looking for.
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# This is where your titles will be saved. Change as needed.
PATH = '/tmp/title_file.txt'

page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')

# Here your titles will be stored before writing to file
all_titles = []

for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles on a new line
    f.write('\n'.join(all_titles))

What is the return type of the find_all method in Beautiful Soup?

from bs4 import BeautifulSoup, SoupStrainer
from urllib.request import urlopen
import pandas as pd
import numpy as np
import re
import csv
import ssl
import json
from googlesearch import search
from queue import Queue

links = []
menu = []
filtered_menu = []

def contains(substring, string):
    if substring.lower() in string.lower():
        return True
    else:
        return False

for website in search("mr puffs", tld="com", num=1, stop=1, country="canada", pause=4):
    links.append(website)

soup = BeautifulSoup(urlopen(links.pop(0)), features="html.parser")
menu = soup.find_all('a', href=True)
for string in menu:
    if contains("contact", string):
        filtered_menu.append(string)
print(filtered_menu)
I am creating a web scraper that will extract contact information from sites. However, in order to do that, I need to get to the contact page of the website. Using the googlesearch library, the code searches for a keyword and puts all the results (up to a certain limit) in a list. For simplicity, in this code, we are just taking the first link. From this link, I create a Beautiful Soup object and extract all the other links on the website (because the contact information is usually not found on the homepage). I put these links in a list called menu.
Now, I want to filter menu down to only the links that have "contact" in them. Example: "www.smallBusiness.com/our-services" would be deleted from the new list, while "www.smallBusiness.com/contact" or "www.smallBusiness.com/contact-us" would stay in the list.
I defined a method that checks if a substring is in a string. However, I get the following exception:
TypeError: 'NoneType' object is not callable.
I've tried using a regex with re.search, but it complains that it expected a string or bytes-like object.
I think that's because the return type of find_all is not a string. It's probably something else that I can't find in the docs. If so, how do I convert it to a string?
As requested in the answer below, here's what printing the menu list gives (screenshot omitted). From there, I just want to extract the highlighted links.
The return type of BeautifulSoup.find_all() is bs4.element.ResultSet (which is actually a list).
Individual items of find_all(), in your case the variable you call "string", are of type bs4.element.Tag.
As your contains function expects type str, your for loop should look something like:
for string in menu:
    if contains("contact", str(string)):
        filtered_menu.append(string)
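Since what you ultimately care about is the href, a slightly more robust sketch (assuming menu was built with href=True as in your code) filters on the attribute instead of the stringified tag:
filtered_menu = []
for tag in menu:  # each item is a bs4.element.Tag
    href = tag.get("href", "")  # href=True in find_all guarantees the attribute exists
    if "contact" in href.lower():
        filtered_menu.append(href)
print(filtered_menu)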
This is how I did it to search Google for user-inputted search terms and then scrape all the URLs from the SERPs. The program goes on to visit each of those links directly and scrape text from them.
Maybe you could modify this for your purposes?
# First-stage scrape of Google
headers = {"user-agent": USER_AGENT}
URL = f"https://google.com/search?q={squery}"
URL2 = f"https://google.com/search?q={squery2}"
URL3 = f"https://google.com/search?q={squery3}"
URL4 = f"https://google.com/search?q={squery4}"
resp = requests.get(URL, headers=headers)
resp2 = requests.get(URL2, headers=headers)
resp3 = requests.get(URL3, headers=headers)
resp4 = requests.get(URL4, headers=headers)
results = []
s2results = []
s3results = []
s4results = []

def scrapeURL(a, b, c):
    if a.status_code == 200:
        print("Searching Google for information about: ", c)
        soup = BeautifulSoup(a.content, "html.parser")
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    link
                }
                b.append(item)
            else:
                print("Couldn't scrape URLS from first phase")
    else:
        print("Could not perform search. Status code: ", a.status_code)

# Create list of urls and format to enable second scrape
scrapeURL(resp, results, query)
scrapeURL(resp2, s2results, query2)
scrapeURL(resp3, s3results, query3)
scrapeURL(resp4, s4results, query4)

# Create list of urls and format to enable second scrape
def formaturls(res, resstorage):
    a = 0
    listurls = str(res)
    listurls = listurls.replace("[", "")
    listurls = listurls.replace("{", "")
    listurls = listurls.replace("'}", "")
    listurls = listurls.replace("]", "")
    listurls = listurls.lower()
    re = listurls.split(",")
    for items in re:
        s = str(re[a])
        resstorage.append(s)
        a = a + 1

qresults = []
q2results = []
q3results = []
q4results = []
formaturls(results, qresults)
formaturls(s2results, q2results)
formaturls(s3results, q3results)
formaturls(s4results, q4results)

Python not progressing a list of links

So, as I need more detailed data, I have to dig a bit deeper into the HTML code of a website. I wrote a script that returns a list of specific links to detail pages, but I can't get Python to visit each link in this list; it always stops at the first one. What am I doing wrong?
from BeautifulSoup import BeautifulSoup
import urllib2
import re  # needed for re.compile below
from lxml import html
import requests

# Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")

# Inform BeautifulSoup
soup = BeautifulSoup(html_page)

# Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    # print found links
    print link.get('href')
    # complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    # print complete links
    print complete_links

#
# EVERYTHING WORKS FINE TO THIS POINT
#

page = requests.get(complete_links)
tree = html.fromstring(page.text)

# Details
name = tree.xpath('//dl[@class="services"]')
for i in name:
    print i.text_content()
Also: what tutorial can you recommend for learning how to write my output to a file, clean it up, give things proper variable names, etc.?
I think that you want a list of links in complete_links instead of a single link. As @Pynchia and @lemonhead said, you're overwriting complete_links on every iteration of the first for loop.
You need two changes:
Append the links to a list and use it to loop over and scrape each link
# [...] Same code here

link_list = []
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    link_list.append(complete_links)  # append new link to the list
Scrape each accumulated link in another loop
for link in link_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)

    # Details
    name = tree.xpath('//dl[@class="services"]')
    for i in name:
        print i.text_content()
PS: I recommend the Scrapy framework for tasks like this.
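If you go that route, a rough sketch of a spider for the same two-level crawl could look like this (the start URL, the link pattern, and the dl class "services" come from the question; everything else is illustrative):
import scrapy

class ServicesSpider(scrapy.Spider):
    name = "services"
    start_urls = ["http://www.sitetoscrape.ch/somesite.aspx"]

    def parse(self, response):
        # Follow every detail link found on the listing page
        for href in response.css("a::attr(href)").getall():
            if "/d/part/of/thelink/ineed.aspx" in href:
                yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Collect the text of each dl with class "services" on the detail page
        for dl in response.css("dl.services"):
            yield {"services": " ".join(dl.css("::text").getall()).strip()}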

How to use BeautifulSoup to scrape links in a html

I need to download a few links from an HTML page, but I don't need all of them, only the ones in a certain section of the page.
For example, in http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

link_set = set()
for link in soup.find_all("a", class="bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see the bl-bigger class at all when I view the source in Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be inside a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again, looking at the source, it seems all the links are within a list (li tags). Now all you have to do is find all the li tags and then the anchor tags within them. One more thing you can notice: all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in the question, it will fetch 5 links.
PS: link_set contains only URIs and not full addresses, so I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
You need to pass the class as a dict instead of a keyword argument:
soup.find("tagName", { "class" : "cssClass" })
or use .select method which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll get working code much faster with an IPython interactive shell.
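For example, a quick sketch of both forms side by side (the bl-bigger class comes from the question, so it may well match nothing, as the other answer points out):
import requests
from bs4 import BeautifulSoup

url = "http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

via_dict = soup.find_all("a", {"class": "bl-bigger"})  # dict instead of the class keyword
via_css = soup.select("a.bl-bigger")                   # equivalent CSS selector
print(len(via_dict), len(via_css))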

Using beautiful soup 4 to scrape URLS within a <p class="postbody"> tag and save them to a text file

I realize this is probably incredibly straightforward, but please bear with me. I'm trying to use BeautifulSoup 4 to scrape a website that has a list of blog posts for the URLs of those posts. The <a> tag that I want is within a <p> tag. There are multiple <p class="postbody"> tags that include a header and then a link that I want to capture. This is the code I'm working with:
with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
    snippet = soup.find_all('p', class="postbody")
    for link in snippet.find('a'):
        fulllink = link.get('href')
        logfile.write(fulllink + "\n")
The error I'm getting is:
AttributeError: 'ResultSet' object has no attribute 'find'
I understand that means snippet is a result set and BeautifulSoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of <p> tags, look for the <a> tag within each one, and then save each link on a separate line to a file.
The actual reason for the error is that snippet is the result of a find_all() call, which is basically a list of results, and there is no find() function available on it. Instead, you meant:
snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
Also, note the use of class_ here - class is a reserved keyword and cannot be used as a keyword argument here. See Searching by CSS class for more info.
Alternatively, make use of CSS selectors:
for link in soup.select('p.postbody a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")
p.postbody a would match all a tags inside the p tag with class postbody.
In your code,
snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):
Here snippet is a bs4.element.ResultSet object, which is why you are getting this error. But the elements of this ResultSet are of type bs4.element.Tag, to which you can apply the find method.
Change your code like this,
snippet = soup.find_all("p", {"class": "postbody"})
for link in snippet:
    if link.find('a'):
        fulllink = link.a['href']
        logfile.write(fulllink + "\n")
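For completeness, a self-contained sketch that combines either fix with the file writing (the listing URL here is hypothetical; only the postbody class and the TPNurls.txt filename come from the question):
import io
import requests
from bs4 import BeautifulSoup

url = "http://example.com/blog"  # hypothetical listing page
soup = BeautifulSoup(requests.get(url).text, "html.parser")

with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
    # p.postbody a matches every anchor inside a p tag with class "postbody"
    for link in soup.select('p.postbody a'):
        href = link.get('href')
        if href:  # skip anchors without an href
            logfile.write(href + "\n")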
