How to get the full link using BeautifulSoup - python

The function get("href") is not returning the full link. The HTML file contains the full link, but link.get("href") returns only:
"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"
import urllib.request
from bs4 import BeautifulSoup

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
response = urllib.request.urlopen(sub_site)
data = response.read()
soup = BeautifulSoup(data, 'lxml')
for link in soup.find_all('a'):
    url = link.get("href")
    print(url)

Use select, and it seems to print fine:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])
Use
print([item['href'] for item in soup.select('[href]')])
for all links.
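Note that hrefs scraped this way are often relative or protocol-relative. A small sketch, using the standard library's urljoin to resolve them against the page URL (the hrefs below are illustrative, not taken from the site):

```python
from urllib.parse import urljoin

page_url = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"

# urljoin resolves relative and protocol-relative hrefs against the page URL.
print(urljoin(page_url, "navhome.php?lightbox"))         # relative path
print(urljoin(page_url, "//www.fotoregistro.com.br/x"))  # protocol-relative
```

The same call also leaves already-absolute URLs untouched, so it can be applied uniformly to every href.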

Let me focus on the specific part of your problem in the html:
<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='
//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' />
</a>
You can get it by doing:
for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href")
    break
You will find that url is:
'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'
You can see two important patterns at the beginning of the string:
// which is a protocol-relative prefix that keeps the current protocol;
\r which is the ASCII carriage return (CR) character.
When you print it, you simply lose this part:
//www.fotoregistro.com.br/\r
If you need the raw string, you can use repr in your for loop:
print(repr(url))
and you get:
//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
If you need the path, you can replace the initial part:
base = 'www.fotoregistro.com.br/'
for link in soup.find_all('a', {'class':'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r', base)
    print(url)
and you get:
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
...
Without specifying the class:
for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
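Putting the pieces together, here is a minimal sketch that strips the stray carriage return and resolves the protocol-relative prefix; the href below is shortened for illustration:

```python
from urllib.parse import urljoin

# Shortened version of the scraped href, with the embedded \r.
raw = '//www.fotoregistro.com.br/\rnavhome.php?lightbox&cpmdsc=MOZAO'

# Remove control characters such as \r and \n embedded in the markup,
# then resolve the protocol-relative URL against an https base.
cleaned = ''.join(ch for ch in raw if ch not in '\r\n\t')
full = urljoin('https://www.fotoregistro.com.br/', cleaned)
print(full)
```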

Related

Removing duplicate links from scraper I'm making

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links, I know it'll go in there, but I can't think of a way to remove duplicate entries. Can someone help me with that, please?
Use a set to remove duplicates. You call add() to add an item and if the item is already present then it won't be added again.
Try this:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re
url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)
soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls)  # urls contains a unique set of URLs
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
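A quick sketch of that pattern against some made-up sample hrefs:

```python
import re

pattern = re.compile(r"^https?://")  # matches both http:// and https://

samples = ["https://example.com", "http://example.com", "/relative/path"]
matches = [s for s in samples if pattern.match(s)]
print(matches)  # the relative path is filtered out
```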
You can also use set comprehension syntax to rewrite the assignment and for statements like this.
urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
Instead of printing each link, you need to collect them somewhere so you can compare them.
Try this:
find_all returns a list with all results; make it a set:
data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))
for elem in data:
    print(elem)
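One caveat: a set discards the original order of the links. If order matters, dict.fromkeys (insertion-ordered in Python 3.7+) deduplicates while keeping first-seen order; the hrefs below are made up for illustration:

```python
hrefs = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/1",  # duplicate
]

# dict keys are unique and preserve insertion order, so this removes
# duplicates while keeping the first occurrence of each URL.
unique = list(dict.fromkeys(hrefs))
print(unique)
```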

Python - How do I find a link on webpage that has no class?

I am a beginner python programmer and I am trying to make a webcrawler as practice.
Currently I am facing a problem that I cannot find the right solution for. The problem is that I am trying to get a link location/address from a page that has no class, so I have no idea how to filter that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside of the href attribute of the "Historical prices" link. Here is my python code:
import requests
from bs4 import BeautifulSoup
def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find_all('li', 'fjfe-nav-sub')
    href = str(link.get('href'))
    find_spreadsheet(href)

def find_spreadsheet(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find('a', {'class' : 'nowrap'})
    href = str(link.get('href'))
    download_spreadsheet(href)

def download_spreadsheet(url):
    response = requests.get(url)
    text = response.text
    lines = text.split("\\n")
    filename = r'google.csv'
    file = open(filename, 'w')
    for line in lines:
        file.write(line + "\n")
    file.close()
find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function "find_spreadsheet(url)", I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By only looking for the link with a specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
Does the following code snippet help you? I think it can solve your problem:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])
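One caveat with exact text matching: if the anchor text carries extra whitespace (leading spaces, embedded newlines), the comparison fails. A stdlib-only sketch of normalizing whitespace before comparing (the sample text is illustrative):

```python
def normalize(text):
    # Collapse runs of whitespace and trim, so '  Historical\n prices '
    # compares equal to 'Historical prices'.
    return " ".join(text.split())

print(normalize("  Historical\n prices "))
```

The same normalization can be applied to a.text before comparing it with "Historical prices".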

Same python function giving different output

I am making a scraping script in Python. I first collect the links of the movies from which I have to scrape the song lists.
Here is the movie.txt list containing movie links:
https://www.lyricsbogie.com/category/movies/a-flat-2010
https://www.lyricsbogie.com/category/movies/a-night-in-calcutta-1970
https://www.lyricsbogie.com/category/movies/a-scandall-2016
https://www.lyricsbogie.com/category/movies/a-strange-love-story-2011
https://www.lyricsbogie.com/category/movies/a-sublime-love-story-barsaat-2005
https://www.lyricsbogie.com/category/movies/a-wednesday-2008
https://www.lyricsbogie.com/category/movies/aa-ab-laut-chalen-1999
https://www.lyricsbogie.com/category/movies/aa-dekhen-zara-2009
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1973
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1994
https://www.lyricsbogie.com/category/movies/aabra-ka-daabra-2004
https://www.lyricsbogie.com/category/movies/aabroo-1943
https://www.lyricsbogie.com/category/movies/aabroo-1956
https://www.lyricsbogie.com/category/movies/aabroo-1968
https://www.lyricsbogie.com/category/movies/aabshar-1953
Here is my first python function:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies1():
    url = 'https://www.lyricsbogie.com/category/movies/a-flat-2010'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text, "html.parser")
    for link in soup.find_all('h3', class_='entry-title'):
        href = link.a.get('href')
        href = href + "\n"
        print(href)
output of the above function:
https://www.lyricsbogie.com/movies/a-flat-2010/pyar-itna-na-kar.html
https://www.lyricsbogie.com/movies/a-flat-2010/chal-halke-halke.html
https://www.lyricsbogie.com/movies/a-flat-2010/meetha-sa-ishq.html
https://www.lyricsbogie.com/movies/a-flat-2010/dil-kashi.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
It successfully fetches the song URLs of the specified link.
But now, when I try to automate the process by passing a file movie.txt and reading the URLs one by one, the output does not match that of the function above, where I add the URLs myself one by one. This function also does not get the song URLs.
Here is my function that does not work correctly:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies():
    file = open("movie.txt", "r")
    for url in file:
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = bs(plain_text, "html.parser")
        for link in soup.find_all('h3', class_='entry-title'):
            href = link.a.get('href')
            href = href + "\n"
            print(href)
output of the above function
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
and so on...
Comparing the two functions' outputs, you can clearly see that function 2 never fetches the song URLs and keeps repeating the same output.
Can anyone help me understand why this is happening?
To understand what is happening, you can print the representation of the url read from the file in the for loop:
for url in file:
    print(repr(url))
    ...
Printing this representation (and not just the string) makes it easier to see special characters. In this case, the output gave
'https://www.lyricsbogie.com/category/movies/a-flat-2010\n'. As you see, there is a line break in the url, so the fetched url is not correct.
Use, for instance, the rstrip() method to remove the newline character, replacing url with url.rstrip().
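A minimal sketch of the difference, using an in-memory line rather than a real file:

```python
line = "https://www.lyricsbogie.com/category/movies/a-flat-2010\n"

print(repr(line))    # the trailing newline is visible in the repr
url = line.rstrip()  # removes the trailing '\n' (and other trailing whitespace)
print(repr(url))
```

With the newline removed, requests.get(url) fetches the intended page instead of a malformed URL.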
I suspect that your file is not being read line by line; to be sure, can you test this code:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies(url):
    print("##Getting songs from %s" % url)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text, "html.parser")
    for link in soup.find_all('h3', class_='entry-title'):
        href = link.a.get('href')
        href = href + "\n"
        print(href)

def get_urls_from_file(filename):
    with open(filename, 'r') as f:
        return [url for url in f.readlines()]

urls = get_urls_from_file("movie.txt")
for url in urls:
    get_songs_links_for_movies(url)

Python and BeautifulSoup Opening pages

I am wondering how I would open another page in my list with BeautifulSoup. I have followed this tutorial, but it does not explain how to open another page on the list. Also, how would I open an "a href" that is nested inside of a class?
Here is my code:
# coding: utf-8
import requests
from bs4 import BeautifulSoup
r = requests.get("")
soup = BeautifulSoup(r.content)
soup.find_all("a")
for link in soup.find_all("a"):
    print link.get("href")
for link in soup.find_all("a"):
    print link.text
for link in soup.find_all("a"):
    print link.text, link.get("href")
g_data = soup.find_all("div", {"class":"listing__left-column"})
for item in g_data:
    print item.contents
for item in g_data:
    print item.contents[0].text
    print link.get('href')
for item in g_data:
    print item.contents[0]
I am trying to collect the href's from the titles of each business, and then open them and scrape that data.
I am still not sure where you are getting the HTML from, but if you are trying to extract all of the href tags, then the following approach should work based on the image you have posted:
import requests
from bs4 import BeautifulSoup
r = requests.get("<add your URL here>")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']
By adding href=True to the find_all(), it ensures that only a elements that contain an href attribute are returned therefore removing the need to test for it as an attribute.
Just to warn you: you might find some websites will lock you out after one or two attempts, as they can detect that you are accessing the site via a script rather than as a human. If you feel you are not getting the correct responses, I would recommend printing the HTML you are getting back to ensure it is still what you expect.
If you then want to get the HTML for each of the links, the following could be used:
import requests
from bs4 import BeautifulSoup
# Configure this to be your first request URL
r = requests.get("http://www.mywebsite.com/search/")
soup = BeautifulSoup(r.content)
for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print 'href: ', a_tag['href']

# Configure this to the root of the above website, e.g. 'http://www.mywebsite.com'
base_url = "http://www.mywebsite.com"

for a_tag in soup.find_all('a', class_='listing-name', href=True):
    print '-' * 60  # Add a line of dashes
    print 'href: ', a_tag['href']
    request_href = requests.get(base_url + a_tag['href'])
    print request_href.content
Tested using Python 2.x, for Python 3.x please add parentheses to the print statements.
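For Python 3, besides adding parentheses to print, the standard library's urljoin is safer than concatenating base_url with the href, since it handles site-absolute, relative, and already-absolute hrefs alike. A sketch with illustrative URLs (mywebsite.com is the placeholder from the answer above, not a real target):

```python
from urllib.parse import urljoin

base_url = "http://www.mywebsite.com/search/"

# urljoin handles every href shape correctly, unlike plain concatenation.
print(urljoin(base_url, "/listing/42"))         # site-absolute path
print(urljoin(base_url, "listing/42"))          # relative path
print(urljoin(base_url, "http://other.com/x"))  # already absolute
```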
I had the same problem and would like to share my findings. I tried the answer above, but for some reason it did not work for me; after some research, however, I found something interesting.
You might need to look at the attributes of the href link itself.
You will need the exact class which contains the href link (in your case, I am thinking "class":"listing__left-column") and assign the result to a variable, say all, for example:
from bs4 import BeautifulSoup
all = soup.find_all("div", {"class":"listing__left-column"})
for item in all:
    for link in item.find_all("a"):
        if 'href' in link.attrs:
            a = link.attrs['href']
            print(a)
    print("")
I did this and was able to get into another link which was embedded in the home page.

Retrieve the first href from a div tag

I need to retrieve the href containing /questions/20702626/javac1-8-class-not-found, but the output I get for the code below is //stackoverflow.com:
from bs4 import BeautifulSoup
import urllib2
url = "http://stackoverflow.com/search?q=incorrect+operator"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
for tag in soup.find_all('div'):
    if tag.get("class") == ['summary']:
        for tag in soup.find_all('div'):
            if tag.get("class") == ['result-link']:
                for link in soup.find_all('a'):
                    print link.get('href')
                    break
Instead of making nested loops, write a CSS selector:
for link in soup.select('div.summary div.result-link a'):
    print link.get('href')
Which is not only more readable, but also solves your problem. It prints:
/questions/11977228/incorrect-answer-in-operator-overloading
/questions/8347592/sizeof-operator-returns-incorrect-size
/questions/23984762/c-incorrect-signature-for-assignment-operator
...
/questions/24896659/incorrect-count-when-using-comparison-operator
/questions/7035598/patter-checking-check-of-incorrect-number-of-operators-and-brackets
Additional note: you might want to look into using the StackExchange API instead of the current web-scraping/HTML-parsing approach.
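As a rough sketch of that idea: the StackExchange API exposes a search endpoint at api.stackexchange.com that returns question links as JSON, avoiding HTML parsing entirely. Only the URL construction is shown here; the actual request is left commented out (and the exact parameter set should be checked against the API documentation):

```python
import urllib.parse

# Build a search query URL for the StackExchange API (v2.3).
params = urllib.parse.urlencode({
    "order": "desc",
    "sort": "relevance",
    "intitle": "incorrect operator",
    "site": "stackoverflow",
})
api_url = "https://api.stackexchange.com/2.3/search?" + params
print(api_url)

# resp = requests.get(api_url)                          # fetch JSON results
# links = [q["link"] for q in resp.json()["items"]]     # question URLs
```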
