I am writing a scraping script in Python. I first collect the links of the movies from which I have to scrape the song lists.
Here is movie.txt, the list containing the movie links:
https://www.lyricsbogie.com/category/movies/a-flat-2010
https://www.lyricsbogie.com/category/movies/a-night-in-calcutta-1970
https://www.lyricsbogie.com/category/movies/a-scandall-2016
https://www.lyricsbogie.com/category/movies/a-strange-love-story-2011
https://www.lyricsbogie.com/category/movies/a-sublime-love-story-barsaat-2005
https://www.lyricsbogie.com/category/movies/a-wednesday-2008
https://www.lyricsbogie.com/category/movies/aa-ab-laut-chalen-1999
https://www.lyricsbogie.com/category/movies/aa-dekhen-zara-2009
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1973
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1994
https://www.lyricsbogie.com/category/movies/aabra-ka-daabra-2004
https://www.lyricsbogie.com/category/movies/aabroo-1943
https://www.lyricsbogie.com/category/movies/aabroo-1956
https://www.lyricsbogie.com/category/movies/aabroo-1968
https://www.lyricsbogie.com/category/movies/aabshar-1953
Here is my first Python function:
import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies1():
    url = 'https://www.lyricsbogie.com/category/movies/a-flat-2010'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text, "html.parser")
    for link in soup.find_all('h3', class_='entry-title'):
        href = link.a.get('href')
        href = href + "\n"
        print(href)
Output of the above function:
https://www.lyricsbogie.com/movies/a-flat-2010/pyar-itna-na-kar.html
https://www.lyricsbogie.com/movies/a-flat-2010/chal-halke-halke.html
https://www.lyricsbogie.com/movies/a-flat-2010/meetha-sa-ishq.html
https://www.lyricsbogie.com/movies/a-flat-2010/dil-kashi.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
It successfully fetches the song URLs for the specified link.
But when I try to automate the process by passing a file, movie.txt, and reading the URLs one by one, the output does not match that of the function above, where I added each URL myself. This version also does not fetch the song URLs.
Here is the function that does not work correctly:
import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies():
    file = open("movie.txt", "r")
    for url in file:
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = bs(plain_text, "html.parser")
        for link in soup.find_all('h3', class_='entry-title'):
            href = link.a.get('href')
            href = href + "\n"
            print(href)
Output of the above function:
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
and so on..........
Comparing the output of the two functions, you can clearly see that none of the song URLs fetched by function 1 appear, and that function 2 repeats the same output again and again.
Can anyone help me understand why this is happening?
To understand what is happening, you can print the representation of the url read from the file in the for loop:
for url in file:
    print(repr(url))
    ...
Printing this representation (and not just the string) makes it easier to see special characters. In this case, the output gave
'https://www.lyricsbogie.com/category/movies/a-flat-2010\n'. As you can see, there is a line break in the URL, so the fetched URL is not correct.
Use, for instance, the rstrip() method to remove the newline character, replacing url with url.rstrip().
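Applied to your second function, a minimal fix could look like this (the same logic, with the URL stripped before the request):

import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies():
    with open("movie.txt", "r") as file:
        for url in file:
            url = url.rstrip()  # remove the trailing newline read from the file
            if not url:         # skip blank lines, just in case
                continue
            source_code = requests.get(url)
            soup = bs(source_code.text, "html.parser")
            for link in soup.find_all('h3', class_='entry-title'):
                print(link.a.get('href'))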
I suspect that your file is not being read line by line. To be sure, can you test this code:
import requests
from bs4 import BeautifulSoup as bs

def get_songs_links_for_movies(url):
    print("##Getting songs from %s" % url)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text, "html.parser")
    for link in soup.find_all('h3', class_='entry-title'):
        href = link.a.get('href')
        href = href + "\n"
        print(href)

def get_urls_from_file(filename):
    with open(filename, 'r') as f:
        return [url for url in f.readlines()]

urls = get_urls_from_file("movie.txt")
for url in urls:
    get_songs_links_for_movies(url)
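Note that the URLs returned by get_urls_from_file above still carry their trailing newlines; if the file does turn out to be read line by line, the same rstrip() fix applies here, e.g. return [url.rstrip() for url in f] — a small tweak on top of the test code above.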
Related
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links, I can't think of a way to remove duplicate entries. Can someone help me with that, please?
Use a set to remove duplicates. You call add() to add an item and if the item is already present then it won't be added again.
Try this:
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls)  # urls contains a unique set of URLs
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
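For example, the loop above with the broader pattern:

for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    urls.add(link.get('href'))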
You can also use set comprehension syntax to rewrite the assignment and for statements like this.
urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
Instead of printing each link, you need to collect them somewhere so you can compare them.
Try this: find_all returns a list with all results; make it a set.

data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))
for elem in data:
    print(elem)
I have a Python script that imports a list of URLs from a CSV named list.csv, scrapes each one, and outputs any anchor text and href links found on that URL:
(For reference, the URLs in the CSV are all in column A.)
from requests_html import HTMLSession
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import pandas
import csv

contents = []
with open('list.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            print(url, link.text, '-', link.get('href'))
The output looks something like this, where https://www.example.com/csv-url-one/ and https://www.example.com/csv-url-two/ are the URLs in column A of the CSV:
['https://www.example.com/csv-url-one/'] Creative - https://www.example.com/creative/
['https://www.example.com/csv-url-one/'] Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/'] PPC - https://www.example.com/ppc/
['https://www.example.com/csv-url-two/'] SEO - https://www.example.com/seo/
The issue is that I want the output to look more like this, i.e. without repeatedly printing the CSV URL before each result, and with a break after each URL from the CSV:
['https://www.example.com/csv-url-one/']
Creative - https://www.example.com/creative/
Web Design - https://www.example.com/web-design/
['https://www.example.com/csv-url-two/']
PPC - https://www.example.com/ppc/
SEO - https://www.example.com/seo/
Is this possible?
Thanks
Does the following solve your problem?
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    print('\n', '********', ', '.join(url), '********', '\n')
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            print(link.text, '-', link.get('href'))
It is possible.
Simply add \n to the end of the print call; \n is the line-break special character.
for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            print(url, '\n', link.text, '-', link.get('href'), '\n')
To add a separation between URLs, add a \n before printing each URL.
If you want to print a URL only when it has valid links, i.e. len(link.text) > 0, use the for loop to save the valid links to a list, and only print the URL and its links if that list is not empty.
Try this:

for url in contents:
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "lxml")
    valid_links = []
    for link in soup.find_all('a'):
        if len(link.text) > 0:
            valid_links.append(link)
    if len(valid_links):
        print('\n', url)
        for item in valid_links:
            print(item.text, '-', item.get('href'))
I'm doing Python scraping, trying to get all the links from the href attributes and then access them one by one to scrape data from those links. I'm a newbie and can't figure out how to continue from this. The code is as follows:
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import csv

url = 'https://menupages.com/restaurants/ny-new-york'
url1 = 'https://menupages.com'
response = requests.get(url)

f = csv.writer(open('Restuarants_details.csv', 'w'))

soup = BeautifulSoup(response.text, "html.parser")
menu_sections = []
for url2 in soup.find_all('h3', class_='restaurant__title'):
    completeurl = url1 + url2.a.get('href')
    print(completeurl)
    #print(url)
If you want to scrape all the links obtained from the first page, and then scrape all the links obtained from these links, etc, you need a recursive function.
Here is some initial code to get you started:
def scrape(url):
    print("now looking at " + url)
    # scrape URL
    # do something with the data
    if (STOP_CONDITION):  # update this!
        return
    # scrape new URLs:
    for new_url in soup.find_all(...):
        scrape(new_url)

if __name__ == "__main__":
    initial_url = "https://menupages.com/restaurants/ny-new-york"
    scrape(initial_url)
The problem with this recursive function is that it will not stop until there are no links on the pages, which probably won't happen anytime soon. You will need to add a stop condition.
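For illustration, here is a minimal runnable sketch using a depth limit and a visited set as the stop condition, reusing the restaurant__title selector from the question (the CSV writing is left out):

import requests
from bs4 import BeautifulSoup

BASE = "https://menupages.com"

def scrape(url, depth=0, max_depth=2, seen=None):
    if seen is None:
        seen = set()
    if depth > max_depth or url in seen:  # stop condition: depth limit, and skip pages already visited
        return
    seen.add(url)
    print("now looking at " + url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # do something with the data here
    for h3 in soup.find_all('h3', class_='restaurant__title'):
        if h3.a:  # guard against headings without a link
            scrape(BASE + h3.a.get('href'), depth + 1, max_depth, seen)

if __name__ == "__main__":
    scrape("https://menupages.com/restaurants/ny-new-york")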
The function get("href") is not returning the full link. The HTML file contains the full link, but link.get("href") returns:
"navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO"
import urllib.request
from bs4 import BeautifulSoup

sub_site = "https://www.fotoregistro.com.br/navhome.php?vitrine-produto-slim"
response = urllib.request.urlopen(sub_site)
data = response.read()
soup = BeautifulSoup(data, 'lxml')

for link in soup.find_all('a'):
    url = link.get("href")
    print(url)
Use select, and it seems to print fine:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.fotoregistro.com.br/fotolivros/180-slim?cpmdsc=MOZAO')
soup = bs(r.content, 'lxml')
print([item['href'] for item in soup.select('.warp_lightbox')])
Use
print([item['href'] for item in soup.select('[href]')])
for all links.
Let me focus on the specific part of your problem in the html:
<a class='warp_lightbox' title='Comprar' href='//www.fotoregistro.com.br/
navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'><img src='//sh.digipix.com.br/subhomes/_lojas_consumer/paginas/fotolivro/img/180slim/vitrine/classic_01_tb.jpg' alt='slim' /></a>
You can get it by doing:
for link in soup.find_all('a', {'class': 'warp_lightbox'}):
    url = link.get("href")
    break
you find out that url is:
'//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO'
You can see two important patterns at the beginning of the string:
// which is a way to keep the current protocol, see this;
\r which is ASCII Carriage Return (CR).
When you print it, you simply lose this part:
//www.fotoregistro.com.br/\r
If you need the raw string, you can use repr in your for loop:
print(repr(url))
and you get:
//www.fotoregistro.com.br/\rnavhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
If you need the path, you can replace the initial part:
base = 'www.fotoregistro.com.br/'

for link in soup.find_all('a', {'class': 'warp_lightbox'}):
    url = link.get("href").replace('//www.fotoregistro.com.br/\r', base)
    print(url)
and you get:
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
www.fotoregistro.com.br/navhome.php?lightbox&dpxshig=/iprop_prod=180-slim/tipo=fotolivro/width=950/height=615/control=true/tema=tema_02/preview=true/nome_tema=Q2wmYWFjdXRlO3NzaWNvIFByZXRv&cpmdsc=MOZAO
.
.
.
Without specifying the class:
for link in soup.find_all('a'):
    url = link.get("href")
    print(repr(url))
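As an aside (not part of the original answer), instead of hard-coding the replacement you could strip the carriage return and resolve the protocol-relative // prefix with urllib.parse.urljoin, using the sub_site URL from the question as the base:

from urllib.parse import urljoin

for link in soup.find_all('a', {'class': 'warp_lightbox'}):
    href = link.get("href").replace('\r', '')  # drop the stray carriage return
    full_url = urljoin(sub_site, href)         # resolves //host/path against the base URL's scheme
    print(full_url)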
I am working on a bot that extracts the URLs from a specific page. I have extracted all the links and put them in a list, but now I can't seem to pull out the real URLs (ones that lead to other sites, starting with http or https), append them to another list, or delete the ones that don't start with http. Thanks in advance.
import urllib2
import requests
from bs4 import BeautifulSoup

def main():
    # get all the links from bing about cancer
    site = "http://www.bing.com/search?q=cancer&qs=n&form=QBLH&pq=cancer&sc=8-4&sp=-1&sk=&cvid=E56491F36028416EB41694212B7C33F2"
    urls = []
    true_links = []
    r = requests.get(site)
    html_content = r.content
    soup = BeautifulSoup(html_content, 'html.parser')
    links = soup.find_all("a")
    for link in links:
        link = link.get("href")
        urls.append(str(link))
        #urls.append(link.get("href"))
    #print map(str, urls)
    #REMOVE GARBAGE LINKS
    print len(urls)
    print urls

main()
You can use urlparse.urljoin:
link = urlparse.urljoin(site, link.get("href"))
This will create absolute URLs out of relative ones.
You should also be using html_content = r.text instead of html_content = r.content. r.text takes care of using the proper encoding.
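Putting both suggestions together, a minimal sketch in the question's Python 2 style (links, site, and true_links are the variables from the question), keeping only absolute http(s) links:

import urlparse

for link in links:
    href = link.get("href")
    if href:
        absolute = urlparse.urljoin(site, href)
        if absolute.startswith("http"):  # keeps both http:// and https:// URLs
            true_links.append(absolute)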