How to scrape a URL that has a number of pages - Python

I am scraping a webpage that has a number of pages in it. How can I scrape those pages to get the information I want? Suppose I am scraping the URL http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446
and this page has two pages; how can I scrape both pages and get the total product list?
What I have done so far:
I scrape a URL, extract a particular URL from it with a regex, and try to go to that URL. That link leads to a number of other pages containing information such as the product name, and I want to get the product name from all of those pages.
My Code:
from bs4 import BeautifulSoup
import urllib.request
import re
import json
response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")
productlink = soup.find_all("a", href=re.compile(r"ExpProduct\.aspx\?corpid=[0-9]+.categoryno=[0-9]+"))
productlink = ([link["href"] for link in productlink])
print (productlink)
After this I am stuck. I am using Python 3.5.1 and BeautifulSoup.

If you want to scrape the page for pictures, I'd advise using CSS selectors.
Get the list of items, then search for the next page; when you stop getting a next page, you know you're done.
def get_next_page(soup):
    # look for the "Next" link inside the pager div
    pages = soup.select('div[id="AspNetPager1"] a[href]')
    for page in pages:
        if page.text == 'Next':
            return page

response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")

url = 'http://i.cantonfair.org.cn/en/'
products = []
next_page = get_next_page(soup)
while next_page is not None:
    # collect the products on the current page, then move to the next one
    products += soup.select('div[class="photolist"] li')
    response = urllib.request.urlopen(url + next_page['href'])
    soup = BeautifulSoup(response, "html.parser")
    next_page = get_next_page(soup)
# collect the products on the last page
products += soup.select('div[class="photolist"] li')

product_names = set()
for product in products:
    product_names.add(product.text)
print(product_names)

As far as I understand, what you would like to do is crawl a couple of pages and scrape them as well.
I would suggest you take a look at Scrapy.
You can crawl web pages and scrape them; the documentation contains a tutorial and is pretty good in my opinion.
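For a rough idea, a minimal spider might look something like this (the CSS/XPath selectors and the "Next" pager link are assumptions based on the question's page, so adjust them to the actual markup):

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        "http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446",
    ]

    def parse(self, response):
        # yield one item per product entry (selector is an assumption)
        for name in response.css("div.photolist li::text").getall():
            if name.strip():
                yield {"product_name": name.strip()}
        # follow the pager's "Next" link, if present (selector is an assumption)
        next_href = response.xpath('//div[@id="AspNetPager1"]//a[text()="Next"]/@href').get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

You can run it with scrapy runspider, and it will keep following the pager until there is no "Next" link left.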

Related

When using a loop to web scrape multiple pages I get all the links, but when I do a list comprehension I only get some of the links

I am using requests and BeautifulSoup to scrape a website.
I am trying to learn how to scrape with different methods for different purposes, and I am using a press release website to do that.
I am trying to scrape each article from each link from each page.
So it is a multi-page scrape where I first scrape the links to all the articles from each page and then loop through the links and scrape the content of each one.
I am having trouble with the first part, where I scrape all the links and save them to a variable so I can then use it for the next step of scraping content from each link.
I was able to get each link with this code:
import requests
from bs4 import BeautifulSoup
import re

URL = 'https://www...page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    for link in soup.find_all('a', attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') + '\n'
        print(soup_link)
The output is all the links from each of the pages in the specified range (1 to 32). Exactly what I want!
However, I want to save the output to a variable so I can use it in my next function to scrape the content of each link, as well as to save the links to a .txt file.
When I change the above code so it saves the output to a variable, I only get a limited number of seemingly random links, not all the links I was able to scrape with the code above.
URL = 'https://www....page='

for page in range(1, 32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', attrs={'href': re.compile("^https://...")})]
The output is a few seemingly random links, not the full list I get from the first code.
What am I doing wrong?
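One likely cause, for what it's worth: the list comprehension rebinds links on every pass through the loop, so after the loop it only holds whatever matched on the last page fetched. A sketch of accumulating into a single list instead (reusing the placeholder URL from the question):

import re
import requests
from bs4 import BeautifulSoup

URL = 'https://www....page='   # placeholder URL from the question
all_links = []                 # accumulate links from every page here

for page in range(1, 32):
    req = requests.get(URL + str(page))
    soup = BeautifulSoup(req.text, 'html.parser')
    # extend the shared list instead of rebinding it each iteration
    all_links.extend(link['href'] for link in
                     soup.find_all('a', attrs={'href': re.compile("^https://...")}))

print(len(all_links))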

Scrape next page content with BeautifulSoup

So I'm trying to scrape this news website. I can scrape news articles from each topic there, but sometimes an article spans more than one page, like this. The next page has the same HTML structure as the first page. Is there any way to automatically scrape the rest of the article on the next pages when there is more than one page?
This is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

detik = requests.get('https://www.detik.com/terpopuler')
beautify = BeautifulSoup(detik.content, 'html5lib')

news = beautify.find_all('article', {'class': 'list-content__item'})
arti = []
for each in news:
    try:
        title = each.find('h3', {'class': 'media__title'}).text
        lnk = each.a.get('href')

        r = requests.get(lnk)
        soup = BeautifulSoup(r.text, 'html5lib')
        content = soup.find('div', {'class': 'detail__body-text itp_bodycontent'}).text.strip()

        print(title)
        print(lnk)

        arti.append({
            'Headline': title,
            'Content': content,
            'Link': lnk
        })
    except:
        continue

df = pd.DataFrame(arti)
df.to_csv('detik.csv', index=False)
This is the next-page button image. "Selanjutnya" means next, and "Halaman" means page.
I'd really appreciate it if you're willing to help.
The way you would approach this is to first write a separate function to extract the info from an article page, then check whether there is any pagination on the article page by looking for the class "detail__anchor-numb"; if there is, loop through the pages
and extract the data from each article:
pages = soup.select('.detail__anchor-numb')
if len(pages):
    page_links = [i.attrs.get('href') for i in pages]
    # the first numbered page is the one already scraped, so start from the second link
    for next_article_url in page_links[1:]:
        # the scrape_article function will handle requesting a url and getting data from the article
        scrape_article(next_article_url)
I hope that answers your question.
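For reference, a rough sketch of what that scrape_article helper could look like (the body selector mirrors the question's code; the title selector is an assumption, so adjust it to the real page):

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # request one article page and pull out its headline and body text
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html5lib')
    title_tag = soup.find('h1', {'class': 'detail__title'})  # assumed class name
    body_tag = soup.find('div', {'class': 'detail__body-text itp_bodycontent'})
    return {
        'Headline': title_tag.text.strip() if title_tag else None,
        'Content': body_tag.text.strip() if body_tag else None,
        'Link': url,
    }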

Trying to scrape the links on the page www.zath.co.uk using Python

I am trying to scrape the website www.zath.co.uk and extract the links to all of the articles using Python 3. Looking at the raw HTML file, I identified one of the sections I am interested in, displayed below using BeautifulSoup.
<article class="post-32595 post type-post status-publish format-standard has-post-thumbnail category-games entry" itemscope="" itemtype="https://schema.org/CreativeWork">
<header class="entry-header">
<h2 class="entry-title" itemprop="headline">
<a class="entry-title-link" href="https://www.zath.co.uk/family-games-day-night-event-giffgaff/" rel="bookmark">
A Family Games Night (& Day) With giffgaff
</a>
I then wrote this code to execute this. I started by setting up a list of URLs from the website to scrape.
urlList = ["https://www.zath.co.uk/", "https://www.zath.co.uk/page/2/", ..., "https://www.zath.co.uk/page/35/"]
Then (after importing the necessary libraries) I defined a function to get all Zath articles.
def getAllZathPosts(url, links):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    soup = BeautifulSoup(response)
    for a in soup.findAll('a'):
        url = a['href']
        c = a['class']
        if c == "entry-title-link":
            print(url)
            links.append(url)
    return
Then call the function.
links = []
zathPosts = {}
for url in urlList:
    zathPosts = getAllZathPosts(url, links)
The code runs with no errors, but the links list remains empty and no URLs are printed, as if the class never equals "entry-title-link". I have tried adding an else case:
        else:
            print(url + " not article")
and all the links from the pages printed as expected. Any suggestions?
You can simply iterate over the pages using range and extract the article tags:
import requests
from bs4 import BeautifulSoup

for page_no in range(1, 36):  # pages 1 through 35
    page = requests.get("https://www.zath.co.uk/page/{}/".format(page_no))
    parser = BeautifulSoup(page.content, 'html.parser')
    for article in parser.findAll('article'):
        print(article.h2.a['href'])
You can do something like the below code:
import requests
from bs4 import BeautifulSoup

def getAllZathPosts(url, links):
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    results = soup.select("a.entry-title-link")
    # for i in results:
    #     print(i.text)
    #     links.append(url)
    if len(results) > 0:
        links.append(url)

links = []
urlList = ["https://www.zath.co.uk/", "https://www.zath.co.uk/page/2/", "https://www.zath.co.uk/page/35/"]
for url in urlList:
    getAllZathPosts(url, links)
print(set(links))
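As a side note, the reason the original getAllZathPosts never matches is most likely that BeautifulSoup returns a tag's class attribute as a list of class names, so a['class'] is never equal to the string "entry-title-link". A sketch of the original function with that fixed, using the class_ keyword (which matches a single class name within that list):

import urllib.request
from bs4 import BeautifulSoup

def getAllZathPosts(url, links):
    response = urllib.request.urlopen(url)
    soup = BeautifulSoup(response, "html.parser")
    # class_ matches tags whose class list contains "entry-title-link"
    for a in soup.find_all('a', class_="entry-title-link"):
        href = a.get('href')
        if href:
            print(href)
            links.append(href)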

Using Python to pull href tags

Trying to pull the href links for the products on this webpage. The code pulls all of the hrefs except those for the products that are listed on the page.
from bs4 import BeautifulSoup
import requests

url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The products are loaded dynamically through a REST API; the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check your network inspector to see whether part of the web page is loaded dynamically (or use Selenium).
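A rough sketch of querying that endpoint directly is below; note that the response structure and the "results"/"clickUri" field names are assumptions about the Coveo search API (and the endpoint may require a POST or an auth token), so check the actual request and JSON in your network inspector first:

import requests

# endpoint taken from the answer above
api_url = (
    "https://international.neb.com/coveo/rest/v2/"
    "?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D"
    "%3Flang%3Den%26ver%3D1&siteName=nebinternational"
)

resp = requests.get(api_url)
resp.raise_for_status()
data = resp.json()

# "results" and "clickUri" are assumed field names; inspect the real JSON to confirm
for result in data.get("results", []):
    print(result.get("clickUri"))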
Try to verify whether the product hrefs are in the received response. I suggest this because if the products section is generated dynamically by AJAX, for example, a simple GET on the main page will not bring them back.
Print the response and verify whether the products are present in the HTML.
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request

for numb in ('1', '100'):
    resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
    soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])

List links in a web page with Python

I am trying to write a Python script that lists all the links in a webpage that contain some substring. The problem I am running into is that the webpage has multiple "pages" so that it doesn't clutter the whole screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup

url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")

links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click the page links before scraping the HTML, so all of the content accumulates in the HTML you collect. Many (most?) websites with pagination don't keep all of the text in the HTML as you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)

# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    driver.find_element_by_xpath('//a[@href="' + path_string + '"]').click()

# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")

links = soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Original Answer: For the link you provided above, you could simply loop through the possible URLs and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup

original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"

# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')

# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
I don't know if you mind getting duplicates in your results. You will get duplicate entries in your links list as the code currently stands, but you could add the links to a set or something similar to easily remedy that.
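For instance, a small sketch of that last loop with a set to drop the duplicates:

# collect matching hrefs into a set so duplicates collapse automatically
unique_links = set()
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        unique_links.add(link)

for link in sorted(unique_links):
    print(link)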
