Web scrape links with Python, then turn them into a string [duplicate] - python

This question already has answers here:
Print string to text file
(8 answers)
Closed 3 months ago.
I'm having trouble turning web-scraped links into strings in Python so that I can save them to either a txt or csv file. I would prefer a txt file. This is what I have at the moment:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
type(link)
print(link, file=open('example.txt','w'))
I've tried all sorts of things with no luck. I'm pretty much at a loss.

print(link, file=open('example.txt','w'))
This writes the link variable, but after the loop that variable holds only the last link. Opening the file in 'w' mode also truncates it on every call.
To write them all, use:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/"
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
with open("example.txt", "w") as file:
    # href=True skips <a> tags without an href, for which get('href') returns None
    for link in soup.find_all('a', href=True):
        file.write(link.get('href') + '\n')
This uses a context manager to open the file once, then writes each href followed by a newline.
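The file-writing step can also be separated from the scraping step. Below is a minimal sketch, assuming the hrefs have already been collected into a list (for instance from link.get('href')). The helper name write_links and the sample values are made up for illustration, and a temporary directory is used so the sketch leaves no files behind.

```python
import os
import tempfile
from typing import Iterable, Optional


def write_links(hrefs: Iterable[Optional[str]], path: str) -> int:
    """Write one link per line to `path` and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for href in hrefs:
            if not href:  # skip None (tag without href) and empty strings
                continue
            handle.write(href + "\n")
            count += 1
    return count


# Illustrative values only; real hrefs would come from the parsed page
sample = ["https://example.com/a", None, "/relative/path", ""]

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "example.txt")
    written = write_links(sample, target)
    with open(target, encoding="utf-8") as handle:
        content = handle.read()

print(written, "links written")
print(content)
```

Skipping falsy values avoids the TypeError that concatenating None with '\n' would raise for an anchor tag that has no href.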

Related

Web scraping with beautiful soup [duplicate]

This question already has answers here:
Getting all Links from a page Beautiful Soup
(3 answers)
Closed 2 months ago.
I am trying to scrape links from this page (https://www.setlist.fm/search?query=nightwish).
While this code does retrieve the links I want, it also returns a lot of other links I don't want.
Example of what I want:
setlist/nightwish/2022/quarterback-immobilien-arena-leipzig-germany-2bbca8f2.html
setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html
setlist/nightwish/2022/arena-gliwice-gliwice-poland-3bc9dc7.html
Can I use Beautiful Soup to get just these links, or do I need to use a regex?
import requests
import bs4
url = 'https://www.setlist.fm/search?query=nightwish'
reqs = requests.get(url)
soup = bs4.BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.select('a'):
    urls.append(link)
    print(link.get('href'))
Please check to see if the code snippet below is useful.
import requests
from bs4 import BeautifulSoup
url = 'https://www.setlist.fm/search?query=nightwish'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
for g_data in soup.find_all('a', {'class': 'link-primary'}, href=True):
    print(g_data['href'])
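If filtering by class turns out not to be specific enough, the collected hrefs can also be narrowed by their path prefix, with no regex needed. This is a minimal sketch, assuming the wanted links all start with setlist/ as in the examples in the question. The helper name keep_setlist_links and the "/signin" entry are invented for illustration.

```python
from typing import Iterable, List, Optional


def keep_setlist_links(hrefs: Iterable[Optional[str]], prefix: str = "setlist/") -> List[str]:
    """Return only the hrefs that start with `prefix`, ignoring a leading slash."""
    kept = []
    for href in hrefs:
        if href and href.lstrip("/").startswith(prefix):
            kept.append(href)
    return kept


# Two hrefs from the question plus two values that should be dropped
sample = [
    "setlist/nightwish/2022/arena-gliwice-gliwice-poland-3bc9dc7.html",
    "/signin",
    None,
    "setlist/nightwish/2022/brose-arena-bamberg-germany-3bf4963.html",
]

for href in keep_setlist_links(sample):
    print(href)
```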

Python 3 - Extract content between <td></td> [duplicate]

This question already has an answer here:
How to get inner text value of an HTML tag with BeautifulSoup bs4?
(1 answer)
Closed 5 years ago.
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string, so that \f is not read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text = re.compile('#'))
for line in emails:
    print(line)
The script above works perfectly in Python 2.7 with BeautifulSoup for extracting the content between several <td> tags in an HTML file. When I run the same script in Python 3.6.4, however, I get the following results:
<td>xxx#xxx.com</td>
<td>xxx#xxx.com</td>
I want the content between the tags, without the <td> markup itself.
Why is this happening in Python 3?
I found the answer...
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')  # raw string, so that \f is not read as a form-feed escape
soup = BeautifulSoup(data, 'html.parser')  # added html.parser
emails = soup.find_all('td', text = re.compile('#'))
for td in emails:
    print(td.get_text())
Look closely at the last two lines: the loop now calls get_text() on each tag instead of printing the tag itself :)
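The difference between printing a tag and printing its text can be checked on a small inline snippet. This is a minimal sketch, assuming Beautiful Soup 4 is installed. The HTML string is invented for illustration, and string= is used because recent Beautiful Soup releases use that name for the filter the question passes as text=.

```python
import re

from bs4 import BeautifulSoup

# Invented sample markup standing in for the HTML file from the question
html = "<table><tr><td>xxx#xxx.com</td><td>no match</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the <td> cells whose text contains "#"
cells = soup.find_all("td", string=re.compile("#"))

for td in cells:
    print(repr(str(td)))        # the whole tag, markup included
    print(repr(td.get_text()))  # only the text inside the tag

texts = [td.get_text() for td in cells]
```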

Same python function giving different output

I am writing a scraping script in Python. I first collect the links to the movie pages from which I need to scrape the song lists.
Here is the movie.txt file containing the movie links:
https://www.lyricsbogie.com/category/movies/a-flat-2010
https://www.lyricsbogie.com/category/movies/a-night-in-calcutta-1970
https://www.lyricsbogie.com/category/movies/a-scandall-2016
https://www.lyricsbogie.com/category/movies/a-strange-love-story-2011
https://www.lyricsbogie.com/category/movies/a-sublime-love-story-barsaat-2005
https://www.lyricsbogie.com/category/movies/a-wednesday-2008
https://www.lyricsbogie.com/category/movies/aa-ab-laut-chalen-1999
https://www.lyricsbogie.com/category/movies/aa-dekhen-zara-2009
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1973
https://www.lyricsbogie.com/category/movies/aa-gale-lag-jaa-1994
https://www.lyricsbogie.com/category/movies/aabra-ka-daabra-2004
https://www.lyricsbogie.com/category/movies/aabroo-1943
https://www.lyricsbogie.com/category/movies/aabroo-1956
https://www.lyricsbogie.com/category/movies/aabroo-1968
https://www.lyricsbogie.com/category/movies/aabshar-1953
Here is my first python function:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies1():
    url='https://www.lyricsbogie.com/category/movies/a-flat-2010'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text,"html.parser")
    for link in soup.find_all('h3',class_='entry-title'):
        href = link.a.get('href')
        href = href+"\n"
        print(href)
Output of the above function:
https://www.lyricsbogie.com/movies/a-flat-2010/pyar-itna-na-kar.html
https://www.lyricsbogie.com/movies/a-flat-2010/chal-halke-halke.html
https://www.lyricsbogie.com/movies/a-flat-2010/meetha-sa-ishq.html
https://www.lyricsbogie.com/movies/a-flat-2010/dil-kashi.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
It successfully fetches the song URLs for the specified link.
But when I try to automate the process by reading the URLs one by one from movie.txt, the output does not match that of the function above, where I supply the URL by hand. The new function also fails to get the song URLs for each movie.
Here is the function that does not work correctly:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies():
    file = open("movie.txt","r")
    for url in file:
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = bs(plain_text,"html.parser")
        for link in soup.find_all('h3',class_='entry-title'):
            href = link.a.get('href')
            href = href+"\n"
            print(href)
Output of the above function:
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
https://www.lyricsbogie.com/movies/ae-dil-hai-mushkil-2016/ae-dil-hai-mushkil-title.html
https://www.lyricsbogie.com/movies/m-s-dhoni-the-untold-story-2016/kaun-tujhe.html
https://www.lyricsbogie.com/movies/raaz-reboot-2016/raaz-aankhein-teri.html
https://www.lyricsbogie.com/albums/akira-2016/baadal-2.html
https://www.lyricsbogie.com/movies/baar-baar-dekho-2016/sau-aasmaan.html
https://www.lyricsbogie.com/albums/gajanan-2016/gajanan-title.html
https://www.lyricsbogie.com/movies/days-of-tafree-2016/jeeley-yeh-lamhe.html
https://www.lyricsbogie.com/tv-shows/coke-studio-pakistan-season-9-2016/ala-baali.html
https://www.lyricsbogie.com/albums/piya-2016/piya-title.html
https://www.lyricsbogie.com/albums/sach-te-supna-2016/sach-te-supna-title.html
and so on..........
Comparing the output of the two functions, you can clearly see that the second function does not fetch the movie's own song URLs that the first function returns (the a-flat-2010 links), and that it repeats the same output again and again.
Can anyone help me understand why this is happening?
To understand what is happening, you can print the representation of the url read from the file in the for loop:
for url in file:
    print(repr(url))
    ...
Printing this representation (and not just the string) makes it easier to see special characters. In this case, the output is
'https://www.lyricsbogie.com/category/movies/a-flat-2010\n'. As you can see, there is a line break at the end of the URL, so the URL that gets fetched is not the correct one.
To remove the newline character, use for instance the rstrip() method, replacing url with url.rstrip().
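The effect of the trailing newline can be reproduced without any network access. This is a minimal sketch in which io.StringIO stands in for the opened movie.txt file. The helper name read_urls is made up for illustration, and it uses strip() rather than rstrip() so that leading whitespace and blank lines are handled as well.

```python
import io


def read_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()
        if url:
            yield url


# io.StringIO stands in for open("movie.txt") so the sketch needs no file
fake_file = io.StringIO(
    "https://www.lyricsbogie.com/category/movies/a-flat-2010\n"
    "https://www.lyricsbogie.com/category/movies/a-wednesday-2008\n"
    "\n"
)

raw = list(fake_file)   # iterating a file keeps the "\n" at the end of each line
fake_file.seek(0)       # rewind so the same lines can be read again
clean = list(read_urls(fake_file))

print(repr(raw[0]))
print(repr(clean[0]))
```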
I suspect the file is not being read one line at a time. To check, can you test this code:
import requests
from bs4 import BeautifulSoup as bs
def get_songs_links_for_movies(url):
    print("##Getting songs from %s" % url)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = bs(plain_text,"html.parser")
    for link in soup.find_all('h3',class_='entry-title'):
        href = link.a.get('href')
        href = href+"\n"
        print(href)
def get_urls_from_file(filename):
    with open(filename, 'r') as f:
        # strip the trailing newline from each line and skip blank lines
        return [url.strip() for url in f if url.strip()]
urls = get_urls_from_file("movie.txt")
for url in urls:
    get_songs_links_for_movies(url)

How to scrape href with Python 3.5 and BeautifulSoup [duplicate]

This question already has answers here:
retrieve links from web page using python and BeautifulSoup [closed]
(16 answers)
Closed 6 years ago.
I want to scrape the href of every project from the website https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1 with Python 3.5 and BeautifulSoup.
That's my code
#Loading Libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")
#Scraping "Link" (href)
project_ref = soup.findAll('h6', {'class': 'project-title'})
project_href = [project.findChildren('a')[0].href for project in project_ref if project.findChildren('a')]
print(project_href)
I get [None, None, .... None, None] back.
I need a list with all the href values from the elements with that class.
Any ideas?
Try something like this:
import urllib.request
from bs4 import BeautifulSoup
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=magic&seed=2449064&page=1"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")  # name the parser explicitly
project_href = [i['href'] for i in soup.find_all('a', href=True)]
print(project_href)
This will return every href value on the page. Looking at your link, a lot of the href attributes are just #. You can avoid these with a simple regex that matches proper links, or just skip the # entries:
project_href = [i['href'] for i in soup.find_all('a', href=True) if i['href'] != "#"]
This will still give you some junk links like /discover?ref=nav, so if you want to narrow it down, use a regex that matches only the links you need.
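As a rough illustration of the regex approach, the sketch below keeps only hrefs whose path starts with /projects/. That pattern is an assumption about how the project links look and is not confirmed by the question. The helper name keep_project_links and the sample hrefs (apart from # and /discover?ref=nav) are invented.

```python
import re

# Hypothetical pattern: relative or absolute links whose path starts with /projects/
PROJECT_LINK = re.compile(r"^(?:https?://www\.kickstarter\.com)?/projects/")


def keep_project_links(hrefs):
    """Return the hrefs that match PROJECT_LINK, preserving their order."""
    return [href for href in hrefs if PROJECT_LINK.search(href)]


sample = [
    "#",
    "/discover?ref=nav",
    "/projects/some-creator/some-project",
    "https://www.kickstarter.com/projects/other-creator/other-project?ref=category",
]

print(keep_project_links(sample))
```

The pattern would need adjusting to whatever the real project links turn out to look like.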
EDIT:
To solve the problem you mentioned in the comments:
soup = BeautifulSoup(thepage, "html.parser")
for i in soup.find_all('div', attrs={'class' : 'project-card-content'}):
    print(i.a['href'])
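For context on why the code in the question printed [None, None, ...]: in Beautiful Soup, attribute-style access such as tag.href looks for a child tag named href rather than the HTML attribute, and gives None when no such tag exists. HTML attributes are read with tag['href'] or tag.get('href'). Here is a minimal sketch on an inline snippet, assuming Beautiful Soup 4 is installed. The HTML is invented for illustration and is not taken from the Kickstarter page.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the structure searched for in the question
html = """
<h6 class="project-title"><a href="/projects/demo/one">One</a></h6>
<h6 class="project-title"><a href="/projects/demo/two">Two</a></h6>
<h6 class="project-title">No link here</h6>
"""
soup = BeautifulSoup(html, "html.parser")
titles = soup.find_all("h6", {"class": "project-title"})

# tag.href searches for a child <href> tag, which does not exist, so it is None
as_attribute_access = [t.a.href for t in titles if t.a]

# tag["href"] reads the href attribute of the <a> tag
as_item_access = [t.a["href"] for t in titles if t.a]

print(as_attribute_access)
print(as_item_access)
```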

How to go through a list of urls to retrieve page data - Python

In a .py file, I have a variable that's storing a list of urls. How do I properly build a loop to retrieve the code from each url, so that I can extract specific data items from each page?
This is what I've tried so far:
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gymsfinal.csv")
csvfilelist = csvfile.read()
print(csvfilelist)
#Get data from each url
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup
pages = get_page_data()
print(pages)
Because you are not using the csv module, you are reading gymsfinal.csv as a plain text file. Read through the documentation on reading and writing CSV files: CSV File Reading and Writing.
Also, your current code only gives you the first page's soup, because get_page_data() returns as soon as it has created the first soup. With your current code, you can yield from the function instead, like this:
def get_page_data():
    for page_data in csvfilelist.splitlines():
        r = requests.get(page_data.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup
pages = get_page_data()
# iterate over the generator
for page in pages:
    print(page)
Also, close the file you opened once you have finished reading it.
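The three suggestions (use the csv module, yield one page at a time, close the file) can be combined. This is a minimal sketch, assuming the URLs sit in the first column of the CSV file. The names iter_urls, iter_pages and fake_fetch are made up for illustration. fake_fetch stands in for requests.get(url).text and io.StringIO stands in for the opened gymsfinal.csv, so the sketch runs without a network connection or a file.

```python
import csv
import io


def iter_urls(csv_lines):
    """Yield the first column of every non-empty CSV row."""
    for row in csv.reader(csv_lines):
        if row and row[0].strip():
            yield row[0].strip()


def iter_pages(urls, fetch):
    """Yield (url, fetch(url)) lazily, one URL at a time."""
    for url in urls:
        yield url, fetch(url)


def fake_fetch(url):
    # Placeholder for requests.get(url).text so the sketch needs no network
    return "<html>" + url + "</html>"


# io.StringIO stands in for: with open("gymsfinal.csv", newline="") as csvfile:
# The with block closes the file as soon as the loop over the rows is finished.
with io.StringIO("https://example.com/a\nhttps://example.com/b\n") as csvfile:
    pages = list(iter_pages(iter_urls(csvfile), fake_fetch))

for url, body in pages:
    print(url, len(body))
```

Passing the fetch function in as a parameter is a design choice made for this sketch. It keeps the loop testable, and the real call can be swapped in later.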
