List links in web page with python - python

I am trying to write a Python script that lists all the links in a webpage that contain some substring. The problem I am running into is that the webpage splits its content into multiple "pages" so that it doesn't clutter the screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Any suggestions on how I might get this to work? Thanks in advance.

Edit/Update: With Selenium, you can click each of the page links first, so that all the content accumulates in the HTML, and then scrape it. Many/most websites with pagination don't keep all the text in the HTML as you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)
# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    # XPath tests an attribute with @href, and the value must be quoted
    driver.find_element(By.XPATH, '//a[@href="' + path_string + '"]').click()
# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('a')
# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')
# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
As the code currently stands, the link list will contain duplicates. The "#page-N" part of the URL is a fragment, which is never sent to the server, so every request fetches the same document. If the duplicates are a problem, you could collect the links in a set instead of a list to remedy that easily.
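A minimal sketch of removing the duplicates, assuming the hrefs have already been pulled out of the tags as plain strings. The helper name unique_in_order and the sample hrefs are made up for illustration; a plain set() would also work if the order does not matter.

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for href in hrefs:
        if href not in seen:
            seen.add(href)
            unique.append(href)
    return unique

# Hypothetical hrefs standing in for the scraped results.
sample = ["/jam/GetSource/1", "/jam/GetSource/2", "/jam/GetSource/1"]
deduped = unique_in_order(sample)
print(deduped)
```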

Related

When using a loop trying to web scrape multiple pages I get all the links but when I do a list comprehension I only get some of the links

I am using requests and BeautifulSoup to scrape a website.
I am trying to learn how to scrape with different methods for different purposes and I am using a press release website to do that.
I am trying to scrape each article from each link on each page.
This is a multi-page scrape: first I scrape the links to all the articles from each page, and then I loop through the links and scrape the content of each one.
I am having trouble with the first part, where I scrape all the links and save them to a variable so I can use it in the next step of scraping the content from each link.
I was able to get each link with this code:
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://www...page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    for link in soup.find_all('a',
                              attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') +'\n'
        print(soup_link)
The output is all the links from each of the pages in the specified range (pages 1 to 31, since range(1,32) stops before 32). Exactly what I want!
However, I want to save the output to a variable so I can use it in my next function to scrape the content of each link, and also save the links to a .txt file.
When I change the above code to save the output to a variable, I only get a small number of seemingly random links, not all the links I was able to scrape with the code above.
URL = 'https://www....page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', attrs={'href':
        re.compile("^https://...")})]
The output is a few random links. Not the full list I get from the first code.
What am I doing wrong?
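One likely cause, judging only from the code shown: the list comprehension assigns a brand-new list to links on every pass through the loop, so after the loop only the last page's links remain. Below is a minimal sketch of accumulating across pages instead. To keep it runnable without a network connection, a regular expression stands in for BeautifulSoup's find_all, and the two HTML strings are made-up stand-ins for the downloaded pages.

```python
import re

# Stand-in for find_all('a', attrs={'href': ...}): pull absolute hrefs out of raw HTML.
HREF_RE = re.compile(r'href="(https://[^"]+)"')

# Hypothetical HTML for two result pages (normally requests.get(URL + str(page)).text).
pages = [
    '<a href="https://example.com/a">A</a><a href="/local">skip</a>',
    '<a href="https://example.com/b">B</a>',
]

all_links = []                       # created once, before the loop
for html_document in pages:
    page_links = HREF_RE.findall(html_document)
    all_links.extend(page_links)     # add this page's links instead of replacing the list

print(all_links)

# One link per line, ready to be written to a .txt file.
links_text = "\n".join(all_links)
```

With the original code, the same idea is to create links = [] before the for loop and call links.extend(...) with the list comprehension inside it.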

Web Crawling Google Scholar - Extracting part of HTML URL to be able crawl the next/previous page

I have been tasked with creating a search engine. I understand that I need to build an adaptable URL, and I have found the part of the source that I need in the onclick attribute of the button. However, it changes from page to page, so my loop needs to read it each time the page changes in order to build the new URL. I have provided an example of the URL, with the parts I need to change in square brackets.
I have provided a picture with the highlighted source code I require and part of my unfinished code.
Any help with this would be greatly appreciated.
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=c7lwAPTu__8J&astart=20
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=[NEW AUTHOR/USER CODE]&astart=[NEW PAGE NUMBER]
def main_page(max_pages):
    page = 0
    newpage = soup.find_all('button', {'onclick': ''})
    while page <= max_pages:
        url = 'https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=' + str(newpage) + '&astart=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'href': '/citations?hl=en&user='}):
            href = link.get('href')
            print(href)
        page += 10
main_page(1)
You can use a little regular expression and urllib.
from bs4 import BeautifulSoup
import re
from urllib import parse
data = '''
<button onclick="window.location='/citations?view_op\x3dview_org\x26hl\x3den\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'">click me</button>
'''
PATTERN = re.compile(r"^window.location='(.+)'$")
soup = BeautifulSoup(data, 'html.parser')
for button in soup.find_all('button'):
    location = PATTERN.match(button.attrs['onclick']).group(1)
    parseresult = parse.urlparse(location)
    d = parse.parse_qs(parseresult.query)
    print(d['after_author'][0])
    print(d['astart'][0])
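In the snippet above, Python itself turns the \x3d and \x26 sequences into = and & because they sit in an ordinary string literal. If the attribute value arrives with the backslashes still in it, they would need decoding first. Here is a rough sketch of that step plus building the next URL, using only the standard library. The helper name decode_js_hex is made up, and the assumption that the raw attribute contains literal \xNN escapes is mine, not something confirmed by the question.

```python
import re
from urllib.parse import urljoin, urlparse, parse_qs

def decode_js_hex(text):
    """Replace literal \\xNN escapes (as used inside JavaScript strings) with the character."""
    return re.sub(r"\\x([0-9a-fA-F]{2})", lambda m: chr(int(m.group(1), 16)), text)

# Raw string: the backslashes are kept exactly as they would appear in the HTML source.
raw_onclick = (
    r"window.location='/citations?view_op\x3dview_org\x26hl\x3den"
    r"\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'"
)

decoded = decode_js_hex(raw_onclick)
location = re.match(r"^window\.location='(.+)'$", decoded).group(1)

# Absolute URL for the next request, plus the two values the question asks about.
next_url = urljoin("https://scholar.google.co.uk/", location)
query = parse_qs(urlparse(next_url).query)
print(next_url)
print(query["after_author"][0], query["astart"][0])
```

Requesting next_url and repeating the extraction on each new page would give the loop the question describes.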

Cannot reach to the href in a html element -Beautifulsoup

I have been trying to load a random Wikipedia page and get the URL of that page. Even though I can fetch every link on the page, for some reason I cannot reach this piece of HTML and fetch its href.
An example of a randomized wikipedia page.
<a accesskey="v" href="https://en.wikipedia.org/wiki/T%C5%99eb%C3%ADvlice?action=edit" class="oo-ui-element-hidden"></a>
All Wikipedia pages have this element, and I need its href so that I can manipulate it to get the current URL.
The code I have written so far:
from bs4 import BeautifulSoup
import requests
links = []
for x in range(0, 1):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    print(soup.find(id="firstHeading"))
    for link in soup.findAll('a'):
        links.append(link.get('href'))
print(links)
Getting the current URL directly would also help, but I couldn't find a solution for that online.
Also, I'm using Linux, if that helps.
Look at the attributes
You should narrow your search by using an attribute of this <a>:
soup.find_all('a', accesskey='e')
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 1):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    print(soup.find(id="firstHeading"))
    for link in soup.find_all('a', accesskey='e'):
        links.append(link.get('href'))
print(links)
Output
<h1 class="firstHeading" id="firstHeading" lang="en">James Stack (golfer)</h1>
['/w/index.php?title=James_Stack_(golfer)&action=edit']
Just in case
You do not need the second loop: if you only want to handle that single <a>, use find() instead of find_all().
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 5):
    source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
    soup = BeautifulSoup(source, "lxml")
    links.append(soup.find('a', accesskey='e').get('href'))
links
Output
['/w/index.php?title=Rick_Moffat&action=edit',
'/w/index.php?title=Mount_Burrows&action=edit',
'/w/index.php?title=The_Rock_Peter_and_the_Wolf&action=edit',
'/w/index.php?title=Yamato,_Yamanashi&action=edit',
'/w/index.php?title=Craig_Henderson&action=edit']
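To get from one of these edit hrefs to the article URL, the title parameter can be read back out with urllib. This is only a sketch: the helper name article_url_from_edit_href is made up, and the /wiki/<title> layout is assumed from the URLs shown in the question.

```python
from urllib.parse import urlparse, parse_qs, quote

def article_url_from_edit_href(href, base="https://en.wikipedia.org"):
    """Build the article URL from an edit link such as /w/index.php?title=X&action=edit."""
    title = parse_qs(urlparse(href).query)["title"][0]
    # Keep parentheses and commas readable; anything else unsafe is percent-encoded.
    return base + "/wiki/" + quote(title, safe="/:(),")

edit_href = "/w/index.php?title=James_Stack_(golfer)&action=edit"
article_url = article_url_from_edit_href(edit_href)
print(article_url)

# Alternative when only the final address is needed: requests follows the redirect
# from Special:Random, and the response object exposes the final URL as response.url.
```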

using python to pull href tags

I am trying to pull the href links for the products on this webpage. The code pulls all of the hrefs except those of the products that are listed on the page.
from bs4 import BeautifulSoup
import requests
url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The products are loaded dynamically through a REST API. The URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector to see whether any part of the web page is being loaded dynamically (or use Selenium).
Try to verify whether the product hrefs are in the response you receive. I suggest this because if the product section is generated dynamically (by AJAX, for example), a simple GET on the main page will not return it.
Print the response and verify whether the products appear in the HTML.
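A rough sketch of pulling URLs out of a JSON response like the one that endpoint returns. The payload below is invented, and the field names "results" and "clickUri" are assumptions that would need checking against the real response in the network inspector.

```python
import json

# Hypothetical response body; the real structure and field names may differ.
sample_body = """
{"results": [
  {"title": "Product A", "clickUri": "https://example.com/products/a"},
  {"title": "Product B", "clickUri": "https://example.com/products/b"}
]}
"""

def extract_urls(body, list_key="results", url_key="clickUri"):
    """Return the value stored under url_key for each entry in body[list_key]."""
    payload = json.loads(body)
    return [item[url_key] for item in payload.get(list_key, []) if url_key in item]

product_urls = extract_urls(sample_body)
print(product_urls)
```

With requests, the body would come from response.text (or response.json() to skip the json.loads step).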
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '100'):
    resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])

How to scrape a URL that has a number of pages

I am scraping a webpage whose content is split across a number of pages. How can I scrape all of those pages to get the information I want? Suppose I am scraping the URL http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446
and this listing has two pages. How can I scrape every page and get the total product list?
What I did till now:
I scrape a URL and pick out particular links from it with a regex.
I then want to follow each link. Each one leads to a number of further pages that contain information such as the product name, and I want to get the product names from all of those pages.
My Code:
from bs4 import BeautifulSoup
import urllib.request
import re
import json
response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")
productlink = soup.find_all("a", href=re.compile(r"ExpProduct\.aspx\?corpid=[0-9]+.categoryno=[0-9]+"))
productlink = ([link["href"] for link in productlink])
print (productlink)
After this I am stuck. I am using Python 3.5.1 and BeautifulSoup.
If you want to scrape the page for pictures, I'd advise CSS selectors.
Get the list of items, then search for the next page. When you stop getting a next page, you know you're done.
from bs4 import BeautifulSoup
import urllib.request
def get_next_page(soup):
    # return the pager's 'Next' link, or None when there is no further page
    pages = soup.select('div[id="AspNetPager1"] a[href]')
    for page in pages:
        if page.text == 'Next':
            return page
response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")
url = 'http://i.cantonfair.org.cn/en/'
products = []
next_page = get_next_page(soup)
while next_page is not None:
    products += soup.select('div[class="photolist"] li')
    response = urllib.request.urlopen(url + next_page['href'])
    soup = BeautifulSoup(response, "html.parser")
    next_page = get_next_page(soup)
# the last page has no 'Next' link, so collect its products after the loop
products += soup.select('div[class="photolist"] li')
product_names = set()
for product in products:
    product_names.add(product.text)
print(product_names)
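The same "follow Next until it disappears" loop can be sketched without any network access, which also makes it easy to add a guard against a pager that links back to a page already visited. Everything here is illustrative: crawl_pages, get_page, and the in-memory FAKE_SITE are hypothetical stand-ins for the real download-and-parse step.

```python
def crawl_pages(start_url, get_page):
    """Follow 'next' links from start_url.

    get_page(url) must return (items_on_page, next_url_or_None).
    """
    seen = set()
    items = []
    url = start_url
    while url is not None and url not in seen:
        seen.add(url)                  # never fetch the same page twice
        page_items, url = get_page(url)
        items.extend(page_items)
    return items

# In-memory stand-in for a two-page product listing.
FAKE_SITE = {
    "page1": (["lamp", "desk"], "page2"),
    "page2": (["chair", "lamp"], None),
}

names = crawl_pages("page1", FAKE_SITE.__getitem__)
unique_names = set(names)
print(names)
```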
As far as I understand, you would like to crawl a few pages and scrape them as well.
I would suggest taking a look at Scrapy.
It can both crawl and scrape webpages. The documentation contains a tutorial and is pretty good, in my opinion.
