Crawling a specific link on a page with multiple links? - python

I'm trying to collect a specific link so I can visit it later in my script, but there are many links on the page I'm crawling and they are all plain a href tags with nothing obvious to tell them apart.
How can I select one specifically? The site is bbb.org and my code is below.
For example, if I search for lamps on BBB, I want to collect the links embedded in the business names so I can visit their profiles later.
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup

def bbb_spider(max_pages):
    bus_cat = raw_input('Enter a business category: ')
    pages = 1
    while pages <= max_pages:
        url = 'http://www.bbb.org/search/?type=category&input=' + str(bus_cat) + '&page=' + str(pages)
        sauce_code = requests.get(url)
        plain_text = sauce_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        pages += 1

You need the links located inside the h4 elements inside the search results table. There are different ways to get to them, but I would use a CSS selector:
soup.select("table.search-results-table tr h4 a")

I have created something similar to this; take a look at my example crawler:
https://github.com/shiva1791/Python_webcrawler
The code takes the URL it needs to parse from link.csv. All the logic for parsing every link on the page lives in webcrawler.py.

Related

When using a loop to web scrape multiple pages I get all the links, but when I use a list comprehension I only get some of the links

I am using requests and BeautifulSoup to scrape a website.
I am trying to learn how to scrape with different methods for different purposes, and I am using a press release website to do that.
I am trying to scrape each article from each link on each page.
So it is a multi-page scrape where I first collect the links to all the articles on each page, and then loop through those links and scrape the content of each one.
I am having trouble with the first part, where I scrape all the links and save them to a variable so I can use it for the next step of scraping the content from each link.
I was able to get each link with this code:
import requests
from bs4 import BeautifulSoup
import re

URL = 'https://www...page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    for link in soup.find_all('a',
                              attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') + '\n'
        print(soup_link)
The output is all the links from each of the pages in the specified range (1 to 32). Exactly what I want!
However, I want to save the output to a variable so I can use it in my next function to scrape the content of each link, as well as save the links to a .txt file.
When I change the above code to save the output to a variable, I only get a limited number of seemingly random links, not all the links I was able to scrape with the code above.
URL = 'https://www....page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', attrs={'href':
             re.compile("^https://...")})]
The output is a few random links. Not the full list I get from the first code.
What am I doing wrong?
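One likely explanation, sketched below: links is reassigned on every pass through the page loop, so when the loop finishes it only holds the matches from the last page fetched. Collecting into a list created before the loop keeps everything (the placeholder URLs are kept from the question, and the links.txt filename is made up):

import requests
from bs4 import BeautifulSoup
import re

URL = 'https://www....page='  # placeholder URL, as in the question

all_links = []
for page in range(1, 32):
    soup = BeautifulSoup(requests.get(URL + str(page)).text, 'html.parser')
    # extend the list instead of reassigning it, so earlier pages are kept
    all_links.extend(link['href'] for link in
                     soup.find_all('a', attrs={'href': re.compile("^https://...")}))

# all_links now holds the hrefs from every page; it can also be written out
with open('links.txt', 'w') as f:  # hypothetical output filename
    f.write('\n'.join(all_links))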

Web Crawling Google Scholar - Extracting part of HTML URL to be able crawl the next/previous page

I have been tasked with creating a search engine. I understand that I need to create an adaptable URL. I have found the source code I need in the onclick attribute of the button, but it changes from page to page, so my for loop needs to read it each time the page changes in order to build the new URL. I have marked the parts of the URL I need to change in square brackets in the example below.
I have provided a picture with the highlighted source code I require and part of my unfinished code.
Any help with this would be greatly appreciated.
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=c7lwAPTu__8J&astart=20
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=[NEW AUTHOR/USER CODE]&astart=[NEW PAGE NUMBER]
def main_page(max_pages):
    page = 0
    newpage = soup.find_all('button', {'onclick': ''})
    while page <= max_pages:
        url = 'https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=' + str(newpage) + '&astart=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'href': '/citations?hl=en&user='}):
            href = link.get('href')
            print(href)
        page += 10

main_page(1)
[Screenshot: highlighted source code required]
You can use a little regular expression and urllib.
from bs4 import BeautifulSoup
import re
from urllib import parse

data = '''
<button onclick="window.location='/citations?view_op\x3dview_org\x26hl\x3den\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'">click me</button>
'''

PATTERN = re.compile(r"^window.location='(.+)'$")

soup = BeautifulSoup(data, 'html.parser')
for button in soup.find_all('button'):
    location = PATTERN.match(button.attrs['onclick']).group(1)
    parseresult = parse.urlparse(location)
    d = parse.parse_qs(parseresult.query)
    print(d['after_author'][0])
    print(d['astart'][0])
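Continuing from the block above, the next-page URL could then be rebuilt from the extracted values, roughly like this (a sketch only; the base URL is copied from the question and d comes from the loop above):

from urllib import parse

base = 'https://scholar.google.co.uk/citations'
query = {
    'view_op': 'view_org',
    'hl': 'en',
    'org': '9117984065169182779',
    'after_author': d['after_author'][0],  # value parsed from the onclick attribute
    'astart': d['astart'][0],
}
next_url = base + '?' + parse.urlencode(query)
print(next_url)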

List links in web page with python

I am trying to write a Python script that lists all the links in a webpage that contain some substring. The problem I am running into is that the webpage has multiple "pages" so that it doesn't clutter the whole screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup

url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")

links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click the page links before scraping, so that all of the content accumulates in the HTML you collect. Many paginated sites don't keep all of the content in the HTML as you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)

# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    driver.find_element_by_xpath('//a[@href="' + path_string + '"]').click()

# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup

original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"

# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')

# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Note that the code as it currently stands will give you duplicates in your link list; adding the links to a set instead is an easy way to remedy that.
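A minimal sketch of that deduplication idea, reusing the links list built above and keeping first-seen order:

# keep only unique GetSource hrefs, preserving the order they were first seen
seen = set()
unique_links = []
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link and link not in seen:
        seen.add(link)
        unique_links.append(link)

for link in unique_links:
    print(link)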

Python not progressing a list of links

So, as I need more detailed data, I have to dig a bit deeper into the HTML code of a website. I wrote a script that returns a list of specific links to detail pages, but I can't get Python to search each link in this list for me; it always stops at the first one. What am I doing wrong?
from BeautifulSoup import BeautifulSoup
import urllib2
import re
from lxml import html
import requests

#Open site
html_page = urllib2.urlopen("http://www.sitetoscrape.ch/somesite.aspx")

#Inform BeautifulSoup
soup = BeautifulSoup(html_page)

#Search for the specific links
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    #print found links
    print link.get('href')
    #complete links
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    #print complete links
    print complete_links

#
#EVERYTHING WORKS FINE TO THIS POINT
#
page = requests.get(complete_links)
tree = html.fromstring(page.text)
#Details
name = tree.xpath('//dl[@class="services"]')
for i in name:
    print i.text_content()
Also, what tutorial can you recommend for learning how to write my output to a file, clean it up, give things variable names, and so on?
I think you want a list of links in complete_links instead of a single link. As @Pynchia and @lemonhead said, you're overwriting complete_links on every iteration of the first for loop.
You need two changes:
Append the links to a list, then use that list to loop over and scrape each link:
# [...] Same code here
links_list = []
for link in soup.findAll('a', href=re.compile('/d/part/of/thelink/ineed.aspx')):
    print link.get('href')
    complete_links = 'http://www.sitetoscrape.ch' + link.get('href')
    print complete_links
    links_list.append(complete_links)  # append new link to the list
Scrape each accumulated link in another loop:
for link in links_list:
    page = requests.get(link)
    tree = html.fromstring(page.text)
    #Details
    name = tree.xpath('//dl[@class="services"]')
    for i in name:
        print i.text_content()
PS: I recommend the Scrapy framework for tasks like that.
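For what it's worth, a rough sketch of what that recommendation could look like (this is not the answerer's code: the spider name is made up, and the start URL and link pattern are the placeholders from the question):

import scrapy

class DetailSpider(scrapy.Spider):
    name = 'details'  # hypothetical spider name
    start_urls = ['http://www.sitetoscrape.ch/somesite.aspx']

    def parse(self, response):
        # follow every detail-page link matching the placeholder pattern from the question
        for href in response.css('a::attr(href)').re(r'.*/d/part/of/thelink/ineed\.aspx.*'):
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # collect the text of each <dl class="services"> block on the detail page
        for dl in response.css('dl.services'):
            yield {'services': ' '.join(dl.css('::text').getall())}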

Python Web Scraping - Navigating to Next_Page link and obtaining data

I am using Python and Beautiful Soup to obtain the URLs of the available software from the Civic Commons - Social Media page. I want the links for all the Social Media software (spread across 20 pages). I am able to get the URLs of the software listed on the first page.
Below is the Python code that I wrote for obtaining these values.
from bs4 import BeautifulSoup
import re
import urllib2

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"

page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

list_of_links = []

for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))

for link_item in list_of_links:
    print link_item
    print ("\n")

#Newly added code to get all Next Page links from a url
next_page_links = []
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
    string_temp_link = base_url + link_tag.get('href')
    next_page_links.append(string_temp_link)

for next_page in next_page_links:
    print next_page
I used the /apps/ regex to get the list of software.
But I wanted to know whether there is a better approach to crawl through the next pages. I am able to match the next-page links by using the regex "*page=", but this gives a repeated list of pages.
How can I do this in a better way?
Looking at the page, there are 5 pages, the last of which is "...?page=4", so we know there is the first page, then page=1 through page=4...
<li class="pager-last last">
last »
</li>
So you could retrieve that by the class (or by title), then parse the href...
from urlparse import urlparse, parse_qs

for pageno in xrange(1, int(parse_qs(urlparse(url).query)['page'][0]) + 1):
    pass  # do something useful here like building a url string with pageno
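For instance, a sketch of that "do something useful" step (the href for the last page is hypothetical here, standing in for the value pulled from the pager-last link above, and the query layout is assumed to match the question's URLs):

from urlparse import urlparse, parse_qs

base_url = "http://civiccommons.org"
# hypothetical: the href pulled from the <a> inside the "pager-last" <li>
last_page_href = "/software-functions/social-media?page=4"

last_page = int(parse_qs(urlparse(last_page_href).query)['page'][0])
page_urls = [base_url + "/software-functions/social-media?page=" + str(pageno)
             for pageno in xrange(1, last_page + 1)]

for page_url in page_urls:
    # each of these can then be fetched and scraped the same way as the first page
    print page_url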
