Is there any way, using Python, to get all the links in a web site, not only in a single web page? I tried this code, but it only gives me the links within one page:
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('http://www.example.com/')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
Recursively visit the links you have gathered and scrape those pages too:
import urllib2
import re

stack = ['http://www.example.com/']
results = []

while len(stack) > 0:
    url = stack.pop()
    # connect to a URL
    website = urllib2.urlopen(url)
    # read html code
    html = website.read()
    # use re.findall to get all the links
    # you should not only gather absolute http/ftp links but also relative links;
    # you could use Beautiful Soup for that (if you want <a> links) - see the sketch below
    links = [match[0] for match in re.findall('"((http|ftp)s?://.*?)"', html)]
    results.extend(links)
    for link in links:
        if link_is_valid(link):  # this function has to be written
            stack.append(link)   # list.append, not push
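A minimal sketch of that Beautiful Soup variant, assuming Python 3 with requests and beautifulsoup4 installed (urljoin resolves relative hrefs, and the same-domain check is just one illustrative way to keep the crawl on a single site):

import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

start_url = 'http://www.example.com/'
stack = [start_url]
seen = set(stack)
results = []

while stack:
    url = stack.pop()
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        # urljoin resolves relative links against the current page
        link = urljoin(url, a['href'])
        results.append(link)
        # only follow links on the same domain, and only once
        if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
            seen.add(link)
            stack.append(link)

print(results)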
I have been trying to scrape a website such as the one below. In the footer there are a bunch of links to their social media pages, and the LinkedIn URL is the one I care about. Is there a way to fish out only that link, maybe using regex or any other libraries available in Python?
This is what I have tried so far -
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
But I'm fetching all the URLs instead of the one I'm looking for.
Note: I'd appreciate a dynamic code which I can use for other sites as well.
Thanks in advance for your suggestions/help.
One approach could be to use CSS selectors and look for the string linkedin.com/company/ in the values of the href attributes:
soup.select_one('a[href*="linkedin.com/company/"]')['href']
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text,"html.parser")
# single (first) link
link = e['href'] if(e := soup.select_one('a[href*="linkedin.com/company/"]')) else None
# multiple links
links = [link['href'] for link in soup.select('a[href*="linkedin.com/company/"]')]
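Since you asked for something you can reuse on other sites, here is a small sketch wrapping the same selector logic (the function name and the substrings passed in are just illustrative assumptions):

import requests
from bs4 import BeautifulSoup

def find_social_link(url, substring="linkedin.com/company/"):
    """Return the first <a> href containing `substring`, or None if not found."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    e = soup.select_one(f'a[href*="{substring}"]')
    return e['href'] if e else None

# usage
print(find_social_link("https://www.southcoast.org/"))
print(find_social_link("https://www.southcoast.org/", "facebook.com/"))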
I'm doing some web scraping in Python and trying to get all the links from the href attributes, then access them one by one to scrape data from those pages. I'm a newbie and can't figure out how to continue from this. The code is as follows:
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import csv
url = 'https://menupages.com/restaurants/ny-new-york'
url1 = 'https://menupages.com'
response = requests.get(url)
f = csv.writer(open('Restuarants_details.csv', 'w'))
soup = BeautifulSoup(response.text, "html.parser")
menu_sections=[]
for url2 in soup.find_all('h3', class_='restaurant__title'):
    completeurl = url1 + url2.a.get('href')
    print(completeurl)
    #print(url)
If you want to scrape all the links found on the first page, and then scrape all the links found on those pages, and so on, you need a recursive function.
Here is some initial code to get you started:
def scrape(url):
    print("now looking at " + url)

    # scrape URL
    # do something with the data

    if STOP_CONDITION:  # update this!
        return

    # scrape new URLs:
    for new_url in soup.find_all(...):  # fill in your own link extraction here
        scrape(new_url)


if __name__ == "__main__":
    initial_url = "https://menupages.com/restaurants/ny-new-york"
    scrape(initial_url)
The problem with this recursive function is that it will not stop until there are no links on the pages, which probably won't happen anytime soon. You will need to add a stop condition.
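For instance, here is a minimal sketch of such a stop condition, assuming Python 3 with requests/beautifulsoup4 and the same restaurant__title markup from your question (the depth limit and visited set are illustrative additions, not part of the original answer):

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 2    # stop condition: don't recurse deeper than this
visited = set()  # stop condition: don't visit the same URL twice

def scrape(url, depth=0):
    if depth > MAX_DEPTH or url in visited:
        return
    visited.add(url)
    print("now looking at " + url)

    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    # do something with the data here (e.g. write rows to your CSV)

    # follow the restaurant links, as in your original code
    base = 'https://menupages.com'
    for h3 in soup.find_all('h3', class_='restaurant__title'):
        if h3.a and h3.a.get('href'):
            scrape(base + h3.a['href'], depth + 1)

if __name__ == "__main__":
    scrape("https://menupages.com/restaurants/ny-new-york")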
I am trying to extract urls for listings from a city page in AirBnb, using python 3 libraries. I am familiar with how to scrape simpler websites with Beautifulsoup and requests libraries.
url: 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
If I inspect the element of a link on the page (in Chrome), I get this element in the HTML:
xpath: //*[@id="listing-9770909"]/div[2]/a
selector: #listing-9770909 > div._v72lrv > a
My attempts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'listing'})
Attempt 2:
import requests
from lxml import html
page = requests.get(url)
root = html.fromstring(page.content)
tree = root.getroottree()
result = root.xpath('//div[@id="listing-9770909"]/div[2]/a')
for r in result:
    print(r)
Neither of these returns anything. What I need to be able to extract is the url for the page link. Any ideas?
To extract the links, first you have to make sure that the URLs actually exist in the page source. To check, search for one of the listing IDs in the page source (Ctrl+U in Google Chrome or Mozilla Firefox). If the URLs are in the page source, you can scrape them directly with XPath on the response text of the listing page. Here, the Airbnb listing page above does not have the links in its page source, so the page is probably fetching them with additional requests (usually JSON requests). You can find those requests, send requests to those URLs yourself, and get the required data.
Please comment if you have any doubt regarding this.
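As an illustration of that pattern (the URL below is only a placeholder, not Airbnb's actual endpoint; you would replace it with whatever request you find in the browser's Network tab, and the "listings"/"url" keys are likewise just assumptions):

import requests

# Placeholder URL: replace it with the JSON/XHR request you find in the
# browser's developer tools (Network tab) while the listings are loading.
json_url = "https://www.example.com/api/listings?location=Denver"

response = requests.get(json_url, headers={"User-Agent": "Mozilla/5.0"})
data = response.json()

# The structure of `data` depends entirely on the site, so inspect it first;
# the keys below are only illustrative.
for item in data.get("listings", []):
    print(item.get("url"))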
I am trying to write a Python script that lists all the links in a webpage that contain some substring. The problem I am running into is that the webpage has multiple "pages" so that everything doesn't clutter the screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click the page links before scraping the html to collect all the content into the html. Many/most websites with pagination don't collect all the text in the html when you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)
# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    driver.find_element_by_xpath('//a[@href="' + path_string + '"]').click()
# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('a')
# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')
# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
I don't know if you mind getting duplicates in your results. As the code currently stands you will get duplicate entries in your links list, but you could add the links to a set or similar instead to easily remedy that.
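For example, a minimal sketch of that set-based deduplication (same pages as above, just collecting the matching hrefs into a set):

import requests
from bs4 import BeautifulSoup

original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"

# gather the <a> tags from the original page plus pages 1-29, as above
links = []
for suffix in [''] + ['#page-' + str(i) for i in range(1, 30)]:
    response = requests.get(original_url + suffix)
    soup = BeautifulSoup(response.content, "html5lib")
    links += soup.find_all('a')

# a set keeps each matching href only once, so repeated pages add no duplicates
unique_links = {tag.get('href') for tag in links
                if tag.get('href') is not None and 'GetSource' in tag.get('href')}

for link in sorted(unique_links):
    print(link)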
I am using Python and Beautiful Soup to obtain the URLs of the available software from the Civic Commons - Social Media link. I want the links for all of the Social Media software (spread across 20 pages). I am able to get the URLs of the software listed on the first page.
Below is the Python code that I wrote for obtaining these values.
from bs4 import BeautifulSoup
import re
import urllib2
base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)
list_of_links = list(set(list_of_links))
for link_item in list_of_links:
    print link_item
    print ("\n")
#Newly added code to get all Next Page links from a url
next_page_links = []
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
    string_temp_link = base_url + link_tag.get('href')
    next_page_links.append(string_temp_link)
for next_page in next_page_links:
    print next_page
I used the /apps/ regex to get the list of software.
But I wanted to know if there is a better approach to crawl through the next pages. I am able to match the next page links by using the regex "*page=", but this gives a repeated list of pages.
How can I do this in a better way?
Looking at the page, there are 5 pages, the last of which is "...?page=4", so we know there is the first page, then page=1 through page=4...
<li class="pager-last last">
  <a href="/software-functions/social-media?page=4" title="Go to last page">last »</a>
</li>
So you could retrieve that by the class (or by title), then parse the href...
from urlparse import urlparse, parse_qs

# here `url` is the href of that "last" pager link, e.g. ".../social-media?page=4"
last_page = int(parse_qs(urlparse(url).query)['page'][0])
for pageno in xrange(1, last_page + 1):
    pass  # do something useful here, like building a url string with pageno
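Putting the pieces together, here is a minimal sketch in Python 3 (requests and urllib.parse instead of urllib2 and urlparse; it assumes the pager-last markup looks like the snippet above and that the site structure hasn't changed):

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"

soup = BeautifulSoup(requests.get(url).text, "html.parser")

# find the "last" pager link and read its page number from the query string
last_href = soup.find('li', class_='pager-last').find('a')['href']
last_page = int(parse_qs(urlparse(last_href).query)['page'][0])

all_links = set()
for pageno in range(0, last_page + 1):
    page_url = url if pageno == 0 else url + '?page=' + str(pageno)
    page_soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    for link_tag in page_soup.find_all('a', href=re.compile('^/apps/.*')):
        all_links.add(base_url + link_tag.get('href'))

for link in sorted(all_links):
    print(link)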