How to refine and limit BeautifulSoup results - python

So I'm stuck here. I'm a doctor, so my programming background and skills are close to nonexistent, and that's most likely the problem. I'm trying to learn some basics of Python, and for me the best way to learn is by doing stuff.
The project:
scrape the cover images from several books
Some of the links used:
http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html
http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html
http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html
http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html
http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html
That website structure is messed up.
The links are located inside a div with class "post-title entry-title", which in turn has two or more "separator" class divs that can have content or be empty.
What I can tell so far is that 95% of the time what I want is the last two links in the first two "separator" class divs. And for this stage that's good enough.
My code so far is as follows:
#intro
import requests
from bs4 import BeautifulSoup

# url is one of the post links listed above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print(m)

#the find-all-links loop
for div in separador:
    imagens = div.find_all('a')
    for link in imagens:
        print(link['href'], '\n')
What I can do right now:
I can print the right URLs, and I can then use wget to download and rename the files. However, I only want the last two links from the results, and that is the only thing missing from my google-fu. I think the problem is in the way BeautifulSoup returns results (a ResultSet) and my lack of knowledge of things such as lists. If the first "separator" has one link and the second has two links, I get a list with two items (and the second item is two links), hence not sliceable.
Example output
2-O Estranho Mundo de Kilsona.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
But I wanted it to be
2-O Estranho Mundo de Kilsona.jpg
http://4.bp.blogspot.com/-D0cUIP8PkEU/UPfbByjSuII/AAAAAAAAB0E/LP6kbIEJ_eI/s1600/Argonauta002.jpg
http://3.bp.blogspot.com/-tAyl2wdRT1g/UPfbGczmv2I/AAAAAAAAB0M/mP71TRQIg3c/s1600/2+-+O+Estranho+Mundo+de+Kilsona.jpg
Can anyone shed some light on this?

The issue is that imagens = div.find_all('a') is called within a loop, so collecting the results for every div gives a list of lists. We therefore need a way to flatten them into a single list, which I do below with the lines merged_list = [] and [merged_list.extend(sublist) for sublist in imagens].
From there I create a new list with just the links and then dedupe it using set (a set is a useful data structure when you don't want duplicated data). I then turn it back into a list, and the rest is your code.
import requests
from bs4 import BeautifulSoup

link1 = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
link2 = "http://coleccaoargonauta.blogspot.com/2011/09/n-2-o-estranho-mundo-de-kilsona.html"
link3 = "http://coleccaoargonauta.blogspot.com/2011/09/n-3-ultima-cidade-da-terra.html"
link4 = "http://coleccaoargonauta.blogspot.com/2011/09/n-4-nave-sideral.html"
link5 = "http://coleccaoargonauta.blogspot.com/2011/09/n-5-o-universo-vivo.html"

#intro
r = requests.get(link2)
soup = BeautifulSoup(r.content, 'lxml')

#select the first two 'separator' divs
separador = soup.select("div.separator")[:2]

#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print(m)

#collect the links from each div, then flatten and dedupe
imagens = [div.find_all('a') for div in separador]
merged_list = []
[merged_list.extend(sublist) for sublist in imagens]
link_list = [link['href'] for link in merged_list]
deduped_list = list(set(link_list))

for link in deduped_list:
    print(link, '\n')
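One caveat: set does not preserve order, so the deduped links may come out in a different order than they appear on the page. If the order matters (the question asks for the last two links specifically), a dict can do the same deduplication while keeping page order; a small sketch, assuming the same link_list as above:
# dict.fromkeys keeps the first occurrence of each link, in page order
deduped_in_order = list(dict.fromkeys(link_list))
for link in deduped_in_order:
    print(link, '\n')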

You can use CSS selectors to extract the images directly from the divs with class separator (link to docs).
I also use a list comprehension instead of a for loop.
Below is a working example for a URL from your list.
import requests
from bs4 import BeautifulSoup
#intro
url = "http://coleccaoargonauta.blogspot.com/2011/09/1-perdidos-na-estratosfera.html"
r=requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
#we need a title for each page - for debugging and later used to rename images
titulo = soup.find_all("h3", {"class": "post-title entry-title"})[0]
m = titulo.string
print (m)
#find all links
links = [link['href'] for link in soup.select('.separator a')]
print(links)
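Since links is a plain Python list, taking only the last two entries (which is what the question asks for) is just a slice; a small sketch, assuming links was built as above:
# negative slice: keep only the last two links found on the page
last_two = links[-2:]
for link in last_two:
    print(link)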

Related

Awkward problem with iterating over a list and extracting only the last link from each page [BS4]

I am trying to scrape a website with 12 pages of X links each - I just want to extract all the links and store them for later use.
But there is an awkward problem with extracting the links from the pages. To be precise, my output contains only the last link from each of the pages.
I know that this description may sound confusing, so let me show you the code and output:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time

#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues = []
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)

#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags = (a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is, my output looks like [redacted].
Instead of ~80 links, I'm getting this! I wondered what happened, and it looks like my script gets only the last listed link from each generated URL (from the list named "issues" in the code)?! How do I fix it? I have no idea what the problem could be.
Were you perhaps missing an indentation when appending to paper_urls?
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags = (a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
The whole code, after moving the print outside the loop, would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time

#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues = []
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)

#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags = (a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request

print(paper_urls)
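As a side note, once the indentation is fixed, the same collection can be written more compactly as a nested list comprehension; a sketch, assuming the issues list and the CSS selector from above (it omits the optional delay between requests):
# one request per issue page, collecting every matching href
paper_urls = [
    a['href']
    for url in issues
    for a in BeautifulSoup(requests.get(url).text, "html.parser").select('.obj_article_summary .title a')
]
print(paper_urls)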

Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code but it's giving me an index error.
CODE:
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')

row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
>>>
The reason you're getting the error is that it can't find a <b> tag.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs

MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"

with open("output.txt", "a", encoding="utf-8") as f:
    for page in range(203):
        if page == 0:
            req = requests.get(MAIN_URL.format(page))
        else:
            req = requests.get(URL.format(page))
        soup = bs(req.text, "html.parser")
        print(f"page # {page}, Getting: {req.url}")
        book_name = (
            tag.get_text(strip=True)
            for tag in soup.select(
                "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
            )
        )
        f.seek(0)
        f.write("\n".join(book_name) + "\n")
I suggest changing your parser to html5lib (pip install html5lib); I just think it's better. Second, it's better not to call .find() directly on your soup object, because tags and classes can be duplicated elsewhere in the page and you might end up matching an HTML block that doesn't contain your data at all. It's better to inspect the elements first and work out which block of code the tags you want actually sit in; scraping is easier that way and you avoid more errors.
What I did here is inspect the elements and find the block of code that holds the data: it is a div whose class is mb-40 content-box, and that is where all the names you are trying to get live. Luckily the class is unique, with no other elements sharing the same tag and class, so we can .find() it directly.
The value of trs is then simply the <tr> tags inside that block. (Note that those <tr> tags sit inside a <table> tag, but they are the only <tr> tags on the page, so there is no risk of picking up rows from another table with the same class value.) Those <tr> tags contain the names you want. The [1:] slice starts at index 1 so the table header from the website is not included.
Then just loop through those <tr> tags and get the text. As for why your error happens: an IndexError means you tried to access an item of a .find_all() result list that is out of bounds. That happens when no matching data was found, which in turn can happen when you .find() directly on the soup: there may be tags with the same class values but different content inside, so you end up scraping a different part of the website than you expected, get no data, and wonder why.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1
req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')

div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr", class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError you're getting means that, in this case, the <b> tag you found doesn't contain the information you are looking for.
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# This is where your titles will be saved. Change as needed
PATH = '/tmp/title_file.txt'

page = 1
req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')

row = soup.find('div', attrs={'class', 'row'})
books = row.find_all('a')

# Here your titles will be stored before writing to file
all_titles = []

for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles on a new line
    f.write('\n'.join(all_titles))

How to get nested href in python?

GOAL
(I need to repeat this search hundreds of times):
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
(i.e. https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1)
2. Select the first record in the second column "CDS Region in Nucleotide" of the table
(i.e. " NC_011415.1 1997353-1998831 (-)", https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
3. Select "FASTA" under the name of this sequence
4. Get the fasta sequence
(i.e. ">NC_011415.1:c1998831-1997353 Escherichia coli SE11, complete sequence
ATGACTTTATGGATTAACGGTGACTGGATAACGGGCCAGGGCGCATCGCGTGTGAAGCGTAATCCGGTAT
CGGGCGAG.....").
CODE
1. Search (e.g. "WP_000177210.1") in "https://www.ncbi.nlm.nih.gov/ipg/"
import requests
from bs4 import BeautifulSoup

url = "https://www.ncbi.nlm.nih.gov/ipg/"
r = requests.get(url, params = "WP_000177210.1")
if r.status_code == requests.codes.ok:
    soup = BeautifulSoup(r.text, "lxml")
2. Select the first record in the second column "CDS Region in Nucleotide" of the table (In this case " NC_011415.1 1997353-1998831 (-)") (i.e. https://www.ncbi.nlm.nih.gov/nuccore/NC_011415.1?from=1997353&to=1998831&strand=2)
# try 1 (wrong)
## I tried this first, but it seemed like it only accessed the first level of the href?!
for a in soup.find_all('a', href=True):
    if a['href'][:8] == "/nuccore":
        print("Found the URL:", a['href'])
# try 2 (not sure how to access nested href)
## According to the labels I saw in the Developer Tools, I think I need to get the href in the following nested structure. However, it didn't work.
soup.select("html div #maincontent div div div #ph-ipg div table tbody tr td a")
I am stuck in this step right now....
PS
It's my first time dealing with HTML, and it's also my first time asking a question here, so I might not have phrased the problem very well. If there's anything wrong, please let me know.
Without using NCBI's REST API, you can let Selenium render the page and then parse it with BeautifulSoup:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
# Opens a Firefox web browser for scraping purposes
browser = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe') # Put your own path here
# Allows you to load a page completely (with all of the JS)
browser.get('https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1')
# Delay turning the page into a soup in order to collect the newly fetched data
time.sleep(3)
# Creates the soup
soup = BeautifulSoup(browser.page_source, "html")
# Gets all the links by filtering out ones with just '/nuccore' and keeping ones that include '/nuccore'
links = [a['href'] for a in soup.find_all('a', href=True) if '/nuccore' in a['href'] and not a['href'] == '/nuccore']
Note:
You'll need the package selenium
You'll need to install GeckoDriver
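If you'd rather not have a visible browser window pop up, Selenium can also run Firefox headless. A minimal sketch of the same fetch, assuming the same Selenium 3-style executable_path setup as above (the geckodriver path is a placeholder):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('-headless')  # run Firefox without opening a window
browser = webdriver.Firefox(executable_path=r'your\path\geckodriver.exe', options=options)
browser.get('https://www.ncbi.nlm.nih.gov/ipg/?term=WP_000177210.1')
time.sleep(3)  # give the JavaScript time to populate the results table

soup = BeautifulSoup(browser.page_source, "html")
links = [a['href'] for a in soup.find_all('a', href=True)
         if '/nuccore' in a['href'] and a['href'] != '/nuccore']
print(links)
browser.quit()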

Python not progressing a list of links

So, as I need more detailed data I have to dig a bit deeper into the HTML code of a website. I wrote a script that returns me a list of specific links to detail pages, but I can't get Python to search each link of this list for me; it always stops at the first one. What am I doing wrong?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
from lxml import html
import requests

#Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")
#Inform BeautifulSoup
soup = BeautifulSoup(html_page)
#Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    #print found links
    print link.get('href')
    #complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    #print complete links
    print complete_links
#
#EVERYTHING WORKS FINE TO THIS POINT
#
page = requests.get(complete_links)
tree = html.fromstring(page.text)
#Details
name = tree.xpath('//dl[@class="services"]')
for i in name:
    print i.text_content()
Also: what tutorial can you recommend for learning how to put my output in a file and clean it up, give variable names, etc.?
I think that you want a list of links in complete_links instead of a single link. As @Pynchia and @lemonhead said, you're overwriting complete_links on every iteration of the first for loop.
You need two changes:
Append the links to a list and use it to loop over and scrape each link
# [...] Same code here

link_list = []
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    link_list.append(complete_links) # append new link to the list
Scrape each accumulated link in another loop
for link in link_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)
    #Details
    name = tree.xpath('//dl[@class="services"]')
    for i in name:
        print i.text_content()
PS: I recommend the Scrapy framework for tasks like that.
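As a rough illustration of that suggestion, here is a minimal Scrapy spider sketch; the start URL and selectors are just the placeholders from the question, so treat it as a pattern rather than working code for the real site:
import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'
    start_urls = ['http://www.sitetoscrape.ch/somesite.aspx']

    def parse(self, response):
        # follow every detail link found on the listing page
        for href in response.css('a::attr(href)').getall():
            if '/d/part/of/thelink/ineed.aspx' in href:
                yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # collect the text of the dl.services block on each detail page
        yield {'services': response.xpath('//dl[@class="services"]//text()').getall()}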

How to use BeautifulSoup to scrape links in a html

I need to download a few links in an HTML page. But I don't need all of them, only a few of them in a certain section of the webpage.
For example, in http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning, I need the links in the debaters section. I plan to use BeautifulSoup and I looked at the HTML of one of the links:
Data Collection Is Out of Control
Here's my code:
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)

for link in link_set:
    print link
This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work?
Thanks
I don't see a bl-bigger class at all when I view the source from Chrome. Maybe that's why your code is not working?
Let's start by looking at the source. The whole Debaters section seems to be put within a div with class nytint-discussion-content. So, using BeautifulSoup, let's get that whole div first.
debaters_div = soup.find('div', class_="nytint-discussion-content")
Again learning from the source, it seems all the links are within a list, in li tags. Now all you have to do is find all the li tags and then the anchor tags within them. One more thing you can notice is that all the li tags have the class nytint-bylines-1.
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# Data Collection Is Out of Control
So, your whole code can be:
link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data)
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)
Now link_set will contain all the links you want. From the link given in the question, it will fetch 5 links.
PS: link_set contains only the URIs, not full addresses, so I would prepend http://www.nytimes.com before adding those links to link_set. Just change the last line to:
link_set.add('http://www.nytimes.com' + html_link)
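Alternatively, if you'd rather not hard-code the prefix, urllib.parse.urljoin can resolve the relative link against the page URL; a small sketch, assuming the same url and html_link variables (Python 3 - in Python 2 the same function lives in the urlparse module):
from urllib.parse import urljoin

# resolves '/roomfordebate/...' against the page's own URL
link_set.add(urljoin(url, html_link))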
You need to call the method with a dictionary instead of a keyword argument:
soup.find("tagName", { "class" : "cssClass" })
or use the .select method, which executes CSS queries:
soup.select('a.bl-bigger')
Examples are in the docs; just search for the '.select' string. Also, instead of writing the entire script up front, you'll get to working code more quickly with the IPython interactive shell.
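For completeness, find_all also accepts a class_ keyword (note the trailing underscore), which sidesteps the fact that class is a reserved word in Python; a minimal sketch of the question's loop rewritten that way:
# class_ (with underscore) is BeautifulSoup's keyword filter for the class attribute
for link in soup.find_all("a", class_="bl-bigger"):
    href = link.get('href')
    if href and '/roomfordebate/' in href:
        link_set.add(href)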
