Removing duplicate links from scraper I'm making - python

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links, is where I know the fix will go, but I can't think of a way to remove duplicate entries there. Can someone help me with that, please?

Use a set to remove duplicates. You call add() to add an item and if the item is already present then it won't be added again.
Try this:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls) # urls contains unique set of URLs
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
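For example, the loop above with the broader pattern might look like this:
# ^https?:// matches both http:// and https:// links
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    urls.add(link.get('href'))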
You can also use set comprehension syntax to rewrite the assignment and for statements like this.
urls = {
link.get("href")
for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}

Instead of printing each link directly, you need to collect them somewhere so you can compare.
Try this:
find_all gives you a list with all results; make it a set.
data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))
for elem in data:
    print(elem)
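One caveat: a set does not preserve the order in which the links appeared on the page. If order matters, a small sketch of an alternative is to deduplicate with dict.fromkeys, which keeps insertion order:
# dict.fromkeys drops duplicates while keeping the first-seen order of the links
data = list(dict.fromkeys(
    link.get('href')
    for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")})
))
for elem in data:
    print(elem)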

Related

How do I get all the links from multiple web pages in python?

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
#import re
req = Request("https://www.indiegogo.com/individuals/23489031")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)
This code works if I use only one URL, but it does not work with multiple URLs. How do I do the same if I want to do it with multiple URLs?
I haven't used bs4 ever, but you may be able to just create a list containing all the URLs you want to check. Then you can use a loop to iterate and work over each URL separately. Like:
urls = ["https://","https://","http://"] #But with actual links
for link in urls:
    #Work with each link separately here
    pass
Here is a small piece of code that I had to write at some point while scraping.
You can adapt it to what you want to achieve; I hope it helps you.
import requests
from bs4 import BeautifulSoup as bs
url_list=['https://www.example1.com' , 'https://www.example2.com' ]
def getlinks(url):
    r = requests.get(url)
    tags_list = bs(r.text, 'html.parser').find_all('a')
    # Collect each href (empty string when the <a> tag has no href), then make relative links absolute
    hrefs = [a.attrs['href'] if 'href' in a.attrs else '' for a in tags_list]
    links = [link if link.split('/')[0] == 'https:'
             else f'{url.split("//")[0]}//{url.split("//")[1]}{link}'
             for link in hrefs]
    return links
You can loop through url_list and call getlinks(url) for each entry.
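For example, a minimal sketch of that loop, collecting everything into one list:
all_links = []
for url in url_list:
    all_links.extend(getlinks(url))  # gather the links from each page
print(all_links)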

removing elements from href python '#'

I am looking to remove certain href elements using the following code. I am able to return the results when I run it, but it will not remove '#' and '#contents' from the list of URLs in Python.
from bs4 import BeautifulSoup
import requests
url = 'https://www.census.gov/programs-surveys/popest.html'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
    elif a.text:
        links_with_text.decompose(a['#content','#'])
print(links_with_text)
You can use string#startswith to blacklist any links starting with a "#", or whitelist anything starting with "http" or "https". Since there are hrefs like "/" in your data, I'd use the second option.
import requests
from bs4 import BeautifulSoup
url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text and a['href'].startswith('http'):
        links_with_text.append(a['href'])
print(links_with_text)
Note that lists have no decompose method (and that branch of the program is unreachable anyway: elif a.text can never be true when if a.text has already failed).
If you only want http/https links, use the built-in CSS filtering via an href attribute selector with the starts-with operator. 'lxml' is also a faster parser, if installed.
import requests
from bs4 import BeautifulSoup
url = 'https://www.census.gov/programs-surveys/popest.html'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
links = [i['href'] for i in soup.select('[href^=http]')]

i want to extract href for the links in this particular website

Can you please help me figure this out?
I'm trying to scrape this website https://industrydirectory.mjbizdaily.com/accounting/
I'm trying to scrape all the links, such as
https://industrydirectory.mjbizdaily.com/420-businesses/
but I can't figure it out.
from bs4 import BeautifulSoup
import requests
url = 'https://industrydirectory.mjbizdaily.com/accounting/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
test = soup.find_all('ul', class_='business-results')
print(test)
You can use the CSS selector #main a to get all URLs:
urls = [url["href"] for url in soup.select("#main a")]
To get a list of dictionaries with the link text as the key and the URL as the value:
urls = []
for url in soup.select("#main a"):
    print(url.text, url["href"])
    urls.append({url.text: url["href"]})
This is what you are looking for:
for each in test:
    li = each.findAll('li')
    for a in li:
        print(a.find('a').attrs['href'])

'NoneType' object is not callable in Beautiful Soup 4

I'm new-ish to Python and started experimenting with Beautiful Soup 4. I tried writing code that would get all the links on one page, then with those links repeat the process until I have an entire website parsed.
import bs4 as bs
import urllib.request as url
links_unclean = []
links_clean = []
soup = bs.BeautifulSoup(url.urlopen('https://pythonprogramming.net/parsememcparseface/').read(), 'html.parser')
for url in soup.find_all('a'):
    print(url.get('href'))
    links_unclean.append(url.get('href'))
for link in links_unclean:
    if (link[:8] == 'https://'):
        links_clean.append(link)
print(links_clean)
while True:
    for link in links_clean:
        soup = bs.BeautifulSoup(url.urlopen(link).read(), 'html.parser')
        for url in soup.find_all('a'):
            print(url.get('href'))
            links_unclean.append(url.get('href'))
        for link in links_unclean:
            if (link[:8] == 'https://'):
                links_clean.append(link)
        links_clean = list(dict.fromkeys(links_clean))
    input()
But I'm now getting this error:
'NoneType' object is not callable
line 20, in
soup = bs.BeautifulSoup(url.urlopen(link).read(),
'html.parser')
Can you please help?
Be careful when importing modules under an alias. In this case, the name url from your import on line 2 gets overridden in your for loops when you iterate (for url in soup.find_all('a')), so when the code later calls url.urlopen(link), url is a tag from the page rather than the urllib.request module, which is what triggers the 'NoneType' object is not callable error.
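A minimal sketch of a direct fix, assuming you want to keep the rest of your approach: rename either the import alias or the loop variable so the two names no longer collide (the names below are just examples):
import bs4 as bs
import urllib.request as request  # alias renamed so a loop variable cannot shadow it

links_unclean = []
page = request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
soup = bs.BeautifulSoup(page, 'html.parser')
for a_tag in soup.find_all('a'):  # loop variable renamed from "url"
    links_unclean.append(a_tag.get('href'))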
Here is a shorter solution that will also give back only URLs containing https as part of the href attribute:
from bs4 import BeautifulSoup
from urllib.request import urlopen
content = urlopen('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('body')
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        if 'https' in link['href']:
            print(link['href'])
However, this paints an incomplete picture, as not all links are captured because of errors in the page's HTML tags. May I also recommend the following alternative, which is very simple and works flawlessly in your scenario (note: you will need the package Requests-HTML):
from requests_html import HTML, HTMLSession
session = HTMLSession()
r = session.get('https://pythonprogramming.net/parsememcparseface/')
for link in r.html.absolute_links:
    print(link)
This will output all URLs, including both those that reference other URLs on the same domain and those that are external websites.
I would consider using an attribute = value CSS selector with the ^ operator to specify that the href attributes begin with https. You will then only have valid protocols. Also, use set comprehensions to ensure no duplicates and a Session to re-use the connection.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
final = []
with requests.Session() as s:
    r = s.get('https://pythonprogramming.net/parsememcparseface/')
    soup = bs(r.content, 'lxml')
    httpsLinks = {item['href'] for item in soup.select('[href^=https]')}
    for link in httpsLinks:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        newHttpsLinks = [item['href'] for item in soup.select('[href^=https]')]
        final.append(newHttpsLinks)
tidyList = list({item for sublist in final for item in sublist})
df = pd.DataFrame(tidyList)
print(df)

Python and BeautifulSoup Opening pages

I am wondering how I would open another page in my list with BeautifulSoup? I have followed this tutorial, but it does not tell us how to open another page on the list. Also, how would I open an "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
print link.get("href")
for link in soup.find_all("a"):
print link.text
for link in soup.find_all("a"):
print link.text, link.get("href")
g_data = soup.find_all("div", {"class":"listing__left-column"})
for item in g_data:
print item.contents
for item in g_data:
print item.contents[0].text
print link.get('href')
for item in g_data:
print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you, you might find some websites will lock you out after one or two attempts, as they are able to detect that you are trying to access a site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still as you expect.
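For example, a quick way to do that check (using the same placeholder URL as below):
r = requests.get("http://www.mywebsite.com/search/")
print r.status_code   # 200 means OK; 403 often means the site blocked the request
print r.content       # inspect the raw HTML to confirm it is what you expect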
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60 # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
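For reference, the core loop in Python 3 syntax looks like this:
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print('href: ', a_tag['href'])
    request_href = requests.get(base_url + a_tag['href'])
    print(request_href.content)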
I had the same problem and would like to share my findings. I did try the accepted answer, but for some reason it did not work for me; after some research, however, I found something interesting.
You might need to find the attributes of the href link itself.
You will need the exact class which contains the href link; in your case, I am thinking it is "class":"listing__left-column". Assign the matching tags to a variable, say all, for example:
import requests
from bs4 import BeautifulSoup

r = requests.get("")  # fill in the URL you are scraping, as in the question
soup = BeautifulSoup(r.content, "html.parser")
all = soup.find_all("div", {"class": "listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
    print("")
I did this and I was able to get to another link which was embedded in the home page.
