Scrape next page content beautifulsoup - python

So I'm trying to scrape this news website. I can scrape the news articles from each topic there, but sometimes an article spans more than one page, like this one. The next page has the same HTML structure as the first page. Is there any way to automatically scrape the rest of the article on the following pages when there is more than one?
This is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

detik = requests.get('https://www.detik.com/terpopuler')
beautify = BeautifulSoup(detik.content, 'html5lib')
news = beautify.find_all('article', {'class': 'list-content__item'})
arti = []
for each in news:
    try:
        title = each.find('h3', {'class': 'media__title'}).text
        lnk = each.a.get('href')
        r = requests.get(lnk)
        soup = BeautifulSoup(r.text, 'html5lib')
        content = soup.find('div', {'class': 'detail__body-text itp_bodycontent'}).text.strip()
        print(title)
        print(lnk)
        arti.append({
            'Headline': title,
            'Content': content,
            'Link': lnk
        })
    except Exception:
        # skip articles whose page doesn't have the expected structure
        continue
df = pd.DataFrame(arti)
df.to_csv('detik.csv', index=False)
This is the next-page button image: "Selanjutnya" means next, and "Halaman" means page.
I'd really appreciate it if you're willing to help.

The way to approach this is to first write a separate function that extracts the info from an article page, then check whether the article page has any pagination by looking for the class "detail__anchor-numb", and loop through those pages, extracting the data from each one:
pages = soup.select('.detail__anchor-numb')
if len(pages):
    page_links = [i.attrs.get('href') for i in pages]
    for next_article_url in page_links:
        # scrape_article handles requesting a url and extracting the article data
        scrape_article(next_article_url)
I hope that answers your question
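For completeness, here is a minimal sketch of what the scrape_article helper mentioned above could look like, reusing the detail__body-text selector from the question; the function name and its return value are assumptions, not part of the original answer:

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # fetch one article page and return its body text,
    # using the detail__body-text selector from the question
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html5lib')
    body = soup.find('div', {'class': 'detail__body-text itp_bodycontent'})
    return body.text.strip() if body else ''

You could call it once for the first page and then once per link found under .detail__anchor-numb, concatenating the returned strings into a single Content value before appending to arti.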

Related

How to scrape headline news, link and image?

I'd like to scrape the news headlines, the link to each news story, and its picture.
I tried web scraping as shown below, but it only handles the headlines, and it doesn't work.
import requests
import pandas as pd
from bs4 import BeautifulSoup
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)
Please advise me on how to fix it.
This is because the site blocks bots; if you print res.content you will see a 403 error.
Add headers={'User-Agent':'Mozilla/5.0'} to the request.
Try the code below,
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
    print(headlines[i].text)
First things first: never post code as an image.
The <h2> in your HTML has no text. What it does have is an <a> element, so:
for hl in headlines:
    link = hl.findChild()
    text = link.text
    url = link.attrs['href']
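Putting both answers together, a sketch that collects the headline text and link for every item and saves them with pandas; the DataFrame step is just an illustration, and the picture would still need its own selector, which this page's HTML isn't shown for:

import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get("https://news.mongabay.com/list/environment",
                   headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')

rows = []
for hl in soup.find_all('h2', class_='post-title-news'):
    link = hl.find('a')  # the headline text and URL live inside the <a>
    if link is not None:
        rows.append({'headline': link.text.strip(), 'url': link.get('href', '')})

pd.DataFrame(rows).to_csv('mongabay_headlines.csv', index=False)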

When using a loop trying to web scrape multiple pages I get all the links but when I do a list comprehension I only get some of the links

I am using requests and BeautifulSoup to scrape a website.
I am trying to learn how to scrape with different methods for different purposes, and I am using a press release website to do that.
I am trying to scrape each article from each link on each page.
So it is a multi-page scrape where I first collect the links to all the articles from each page, and then loop through the links and scrape the content of each one.
I am having trouble with the first part, where I scrape all the links and save them to a variable so I can use it in the next step of scraping the content from each link.
I was able to get each link with this code:
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://www...page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    for link in soup.find_all('a',
                              attrs={'href': re.compile("^https://www...")}):
        # print(link.get('href'))
        soup_link = link.get('href') + '\n'
        print(soup_link)
The output is all the links from each of the pages in the specified range (1 to 32). Exactly what I want!
However, I want to save the output to a variable so I can use it in my next function to scrape the content of each link, and also save the links to a .txt file.
When I change the code above to save the output to a variable, I only get a limited number of seemingly random links, not all the links the first version printed.
URL = 'https://www....page='
for page in range(1,32):
    req = requests.get(URL + str(page))
    html_document = req.text
    soup = BeautifulSoup(html_document, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', attrs={'href':
             re.compile("^https://...")})]
The output is only a few links, not the full list the first version printed.
What am I doing wrong?
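Note that the list comprehension reassigns links on every pass through the page loop, so only the last page's links survive. A minimal sketch of one way to accumulate them across all pages instead; the accumulator name all_links and the output file name are my own additions:

import requests
import re
from bs4 import BeautifulSoup

URL = 'https://www...page='  # truncated in the question; left as-is
all_links = []
for page in range(1, 32):
    req = requests.get(URL + str(page))
    soup = BeautifulSoup(req.text, 'html.parser')
    # extend the accumulator instead of overwriting it on each iteration
    all_links.extend(link['href'] for link in
                     soup.find_all('a', attrs={'href': re.compile("^https://...")}))

with open('links.txt', 'w') as f:
    f.write('\n'.join(all_links))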

scrape book body text from project gutenberg de

I am new to Python and I am looking for a way to extract, with Beautiful Soup, open-source books that are available on gutenberg-de, such as this one.
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial; it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
"https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract the text then? I am not sure how, though.
Ideally I would like to store the books in a tabular way and be able to save them as CSV, preserving the metadata: author, title, year, and chapter. Any ideas?
What happens?
First of all, you get a list of pages because you are not requesting the right URL. Change it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
I recommend that, if you are looping over all the URLs, you store the content in a list of dicts and push it to CSV, pandas, or ...
Example
import requests
from bs4 import BeautifulSoup
data = []
# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')
data.append({
    'title': soup.title,
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})
data
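To cover the "loop all the URLs and push to CSV" part, here is a minimal sketch that collects the chapter links from the index page used in the question and writes one row per chapter with pandas. The li a selector for the chapter list and the output file name are assumptions you would need to check against the actual index page; author and year would need their own selectors from the index page's metadata block.

import requests
import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup

index_url = 'https://www.projekt-gutenberg.org/keller/heinrich/'
index = BeautifulSoup(requests.get(index_url).content, 'html.parser')

# assumption: the chapter pages are linked from <li><a href="..."> entries on the index page
chapter_links = [urljoin(index_url, a['href']) for a in index.select('li a[href]')]

data = []
for link in chapter_links:
    soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    data.append({
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })

pd.DataFrame(data).to_csv('heinrich.csv', index=False)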

How to get sub-content from wikipedia page using BeautifulSoup

I am trying to scrape sub-content from a Wikipedia page based on an internal link using Python. The problem is that my code scrapes all the content from the page. How can I scrape just the paragraph the internal link points to? Thanks in advance.
import requests
from bs4 import BeautifulSoup as bs

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
sub_link = "#الأسباب"
total = base_link + sub_link
r = requests.get(total)
soup = bs(r.text, 'html.parser')
results = soup.find('p')
print(results)
That is because what you are trying to scrape is not a sub-link; it is an anchor (a fragment within the same page).
Request the entire page and then find the element with the given id.
Something like this:
from bs4 import BeautifulSoup as soup
import requests
base_link='https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id="الأسباب"
r=requests.get(base_link)
page = soup(r.text, 'html.parser')
span = page.find('span', {'id': anchor_id})
results = span.parent.find_next_siblings('p')
print(results[0].text)
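One caveat not covered in the answer above: find_next_siblings('p') returns every <p> that follows the heading, not only the paragraphs inside that section. If you want just the section's paragraphs, here is a sketch that stops at the next heading, assuming the section titles sit in <h2>/<h3> elements that are siblings of the paragraphs, which you should verify against the page source:

from bs4 import BeautifulSoup
import requests

base_link = 'https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%AA%D9%87%D8%A7%D8%A8_%D8%A7%D9%84%D9%82%D8%B5%D8%A8%D8%A7%D8%AA'
anchor_id = "الأسباب"

page = BeautifulSoup(requests.get(base_link).text, 'html.parser')
heading = page.find('span', {'id': anchor_id}).parent

section_paragraphs = []
# walk forward through the siblings and stop at the next section heading
for sibling in heading.find_next_siblings():
    if sibling.name in ('h2', 'h3'):
        break
    if sibling.name == 'p':
        section_paragraphs.append(sibling.get_text())

print('\n'.join(section_paragraphs))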

How to scrape a url having no of pages

I am scraping a webpage that has a number of pages in it. How can I scrape those pages to get the information I want? Suppose I am scraping the URL http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446
and this page has two pages in it; how can I scrape both of them and get the total product list?
What I have done till now:
I am scraping a URL, from which I extract a particular URL with a regex,
and then I try to go to that URL. From that link there are a number of other pages that contain information such as the product name, and I want to get the product names from all of those pages.
My Code:
from bs4 import BeautifulSoup
import urllib.request
import re
import json
response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")
productlink = soup.find_all("a", href=re.compile(r"ExpProduct\.aspx\?corpid=[0-9]+.categoryno=[0-9]+"))
productlink = [link["href"] for link in productlink]
print (productlink)
After this I am stuck. I am using python 3.5.1 and Beautifulsoup
If you want to scrape the page for pictures, I'd advise using CSS selectors.
Get the list of items on the current page, then look for the next page; when you stop getting a next page, you know you're done.
def get_next_page(soup):
    pages = soup.select('div[id="AspNetPager1"] a[href]')
    for page in pages:
        if page.text == 'Next':
            return page

response = urllib.request.urlopen("http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446")
soup = BeautifulSoup(response, "html.parser")
url = 'http://i.cantonfair.org.cn/en/'
products = []
next_page = get_next_page(soup)
while next_page is not None:
    products += soup.select('div[class="photolist"] li')
    response = urllib.request.urlopen(url + next_page['href'])
    soup = BeautifulSoup(response, "html.parser")
    next_page = get_next_page(soup)
products += soup.select('div[class="photolist"] li')
product_names = set()
for product in products:
    product_names.add(product.text)
print(product_names)
As far as I understand, what you would like to do is crawl a couple of pages and scrape them as well.
I would suggest you take a look at Scrapy.
You can crawl and scrape webpages with it; the documentation contains a tutorial and is pretty good, in my opinion.
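As a rough illustration of that suggestion, a minimal Scrapy spider for this kind of paginated listing might look like the sketch below. The .photolist and AspNetPager1 selectors are taken from the other answer; treat them, and the "Next" link text, as assumptions to verify against the live page.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        "http://i.cantonfair.org.cn/en/ExpProduct.aspx?corpid=0776011226&categoryno=446"
    ]

    def parse(self, response):
        # yield one item per product in the photo list
        for item in response.css("div.photolist li"):
            yield {"product_name": item.css("::text").get(default="").strip()}
        # follow the pager's "Next" link, if any (selector and link text are assumptions)
        next_href = response.xpath(
            '//div[@id="AspNetPager1"]//a[normalize-space(text())="Next"]/@href'
        ).get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

You can run a standalone spider like this with scrapy runspider spider.py -o products.json (the file names are just examples).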
