My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.
However, there are multiple pages available at the site above that I would like to scrape as well.
For example, with the URL above, when I click the link to "page 2" the overall URL does NOT change. I looked at the page source and saw JavaScript code to advance to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests
response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()
acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
    stack = []
    for td in tr.findAll('td'):
        stack.append(td.text.replace('\n', '').replace('\t', '').strip())
    acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that come and go when you trigger the page-change action, i.e. when you click the link to view another page. The way to check this is to use Chrome's inspection tool (press F12) or to install the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, only one request will appear, and it is a POST request; all the other elements quickly follow and fill the page. That POST request is what we are looking for.
Click on that POST request. It should bring up a sub-window with tabs. Click on the Headers tab. This panel lists the request headers, essentially the identifying information that the other side (the site, in this case) needs from you to handle the request (someone else can explain this much better than I can).
Whenever the URL has variables like page numbers, location markers, or categories, more often than not the site uses query strings. Long story short, a query string is a set of key-value parameters appended to the URL that tells the site what information to pull for you. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find them.
The query string parameters match the variables in our URL. A little below that, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, and so on; basically, pretty much anything where you have to submit information. What most people don't notice is that POST requests also have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.
What the above paragraph basically means is that you can often (but not always) append the form data to your URL as query parameters and the server will treat it the same way it treats the POST request. To know the exact string you have to append, click on view source beneath the Form Data section.
Test whether it works by adding it to the URL.
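For example, a quick sanity check with requests is sketched below; pageNum is the field name taken from the Form Data panel, and the assumption is that the server accepts it as a query parameter as well:

import requests

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = requests.get(base_url + '&pageNum=2')
print(r.status_code)         # should be 200 if the parameter is accepted
print('goToPage' in r.text)  # the pagination links should still be present on page 2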
Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the ones you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page; otherwise, set to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because of how Python's range works.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "wb") as acct:
    for url_ in url_list:
        print "Processing {}...".format(url_)
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text)
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.findAll('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')
We use a regular expression to isolate the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
EDIT:
Out of sheer boredom, I went ahead and created a scraper for the entire class directory. I also updated both the code above and the code below so that they do not error out when only a single page is available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list
with open("results.txt","wb") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)
        soup = bsoup(r.text)
        # Use regex to isolate only the links of the page numbers, the ones you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1
        # Add 1 because of how Python's range works.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
        # Iterate over every page of this subject's course list.
        for url_ in url_list:
            print "Processing {}...".format(url_)
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text)
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.findAll('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')
Related
I am trying to scrape article titles and links from Vogue with a site-search keyword. I can't get the top 100 results because everything past the first batch is hidden behind the "Show More" button. I've gotten around this before by using the changing URL, but Vogue's URL does not change to include the page number, result number, etc.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.vogue.com/search?q=HARRY+STYLES&sort=score+desc'
r = requests.get(url)
soup = bs(r.content, 'html')
links = soup.find_all('a', {'class':"summary-item-tracking__hed-link summary-item__hed-link"})
titles = soup.find_all('h2', {'class':"summary-item__hed"})
res = []
for i in range(len(titles)):
    entry = {'Title': titles[i].text.strip(), 'Link': 'https://www.vogue.com'+links[i]['href'].strip()}
    res.append(entry)
Any tips on how to scrape the data past the "Show More" button?
You have to examine the Network tab in the developer tools. Then you have to determine how the website requests the data: watch the requests that fire when you click "Show More" and look at the request URL and the JSON response.
The website is using a page parameter in that request URL.
Each page returns 8 titles, so you have to loop over the pages to get 100 titles.
Code:
import cloudscraper, json, html

counter = 1
done = False
for i in range(1, 14):
    url = f'https://www.vogue.com/search?q=HARRY%20STYLES&page={i}&sort=score%20desc&format=json'
    scraper = cloudscraper.create_scraper(browser={'browser': 'firefox', 'platform': 'windows', 'mobile': False}, delay=10)
    byte_data = scraper.get(url).content
    json_data = json.loads(byte_data)
    for j in range(0, 8):  # each results page contains 8 items
        title_url = 'https://www.vogue.com' + html.unescape(json_data['search']['items'][j]['url'])
        t = html.unescape(json_data['search']['items'][j]['source']['hed'])
        print(counter, " - " + t + ' - ' + title_url)
        if counter == 100:
            done = True  # stop the outer loop as well, not just this page
            break
        counter = counter + 1
    if done:
        break
You can inspect the requests the website makes using your browser's web developer tools to find out whether it is making a specific request for the data you are interested in.
In this case, the website is loading more info by making GET requests to an URL like this:
https://www.vogue.com/search?q=HARRY STYLES&page=<page_number>&sort=score desc&format=json
Where <page_number> is > 1 as page 1 is what you see by default when you visit the website.
Assuming you request a limited number of pages, and since the data format is JSON, you will have to parse it into a dict (or another data structure) to extract the data you want, specifically the "search.items" key of the JSON object, since it contains an array with the article data for the requested page.
Then, the "Title" would be search.items[i].source.hed and you could assemble the link with search.items[i].url.
As a tip, I think it is good practice to first see how the website works manually and then attempt to automate the process.
If you want to request data to that URL, make sure to include some delay between requests so you don't get kicked out or blocked.
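For illustration, here is a minimal sketch of that approach with plain requests; the JSON keys follow the structure described above, and if the site's bot protection blocks plain requests, a tool such as cloudscraper (as in the other answer) would be needed instead:

import time
import requests

results = []
for page in range(1, 4):  # request the first three pages as an example
    params = {'q': 'HARRY STYLES', 'page': page, 'sort': 'score desc', 'format': 'json'}
    data = requests.get('https://www.vogue.com/search', params=params).json()
    for item in data['search']['items']:
        results.append({'Title': item['source']['hed'],
                        'Link': 'https://www.vogue.com' + item['url']})
    time.sleep(1)  # small delay between requests, as suggested above

for entry in results:
    print(entry['Title'], '-', entry['Link'])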
I have a query: I have been scraping the website "https://www.zaubacorp.com/company-list" but am not able to scrape the email ID from the link given in the table, although I need to scrape the Name, Email and Directors from that link. Can anyone please help me resolve this issue, as I am a newbie to web scraping in Python with Beautiful Soup and requests?
Thank You
Dieksha
#Scraping the website
#Import a library to query a website
import requests
#Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
link = requests.get("https://www.zaubacorp.com/company-list").text
#Import BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(link,'lxml')
soup.table.find_all('a')
all_links = soup.table.find_all('a')
for link in all_links:
    print(link.get("href"))
Well let's break down the website and see what we can do.
First off, I can see that this website is paginated. This means we could be dealing with anything from something as simple as the site using part of the GET query string to determine which page we are requesting, to an AJAX call that fills the table with new data when you click next. From clicking on the next page and subsequent pages, we are in luck: the website uses a GET query parameter.
Our URL for requesting the webpage to scrape is going to be
https://www.zaubacorp.com/company-list/p-<page_num>-company.html
We are going to write a bit of code that will fill that page num with values ranging from 1 to the last page you want to scrape. In this case, we do not need to do anything special to determine the last page of the table since we can skip to the end and find that it will be page 13,333. This means that we will be making 13,333 page requests to this website to fully collect all of its data.
As for gathering the data from the website we will need to find the table that holds the information and then iteratively select the elements to pull out the information.
In this case we can actually "cheat" a little, since there appears to be only a single tbody on the page. We want to iterate over all the tr elements and pull out the text. I'm going to go ahead and write the sample.
import requests
import bs4
def get_url(page_num):
    # Pages follow the pattern p-<page_num>-company.html described above.
    page_num = str(page_num)
    return "https://www.zaubacorp.com/company-list/p-" + page_num + "-company.html"

def scrape_row(tr):
    return [td.text for td in tr.find_all("td")]

def scrape_table(table):
    table_data = []
    for tr in table.find_all("tr"):
        table_data.append(scrape_row(tr))
    return table_data

def scrape_page(page_num):
    req = requests.get(get_url(page_num))
    soup = bs4.BeautifulSoup(req.content, "lxml")
    data = scrape_table(soup)
    for line in data:
        print(line)

for i in range(1, 3):
    scrape_page(i)
This code will scrape the first two pages of the website and by just changing the for loop range you can get all 13,333 pages. From here you should be able to just modify the printout logic to save to a CSV.
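As a sketch of that last step, reusing get_url and scrape_table from above (the companies.csv file name and the use of the csv module are my own choices, not part of the original answer):

import csv

with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(1, 3):
        req = requests.get(get_url(i))
        soup = bs4.BeautifulSoup(req.content, "lxml")
        for row in scrape_table(soup):
            if row:  # header rows use th cells and come back empty, so skip them
                writer.writerow(row)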
I'm attempting to make a table where I collect all the works of each composer from this page and arrange them by adding "score" e.g. 1 point for 300th place, 290 points for 10th place, etc. using a Python script.
However, BeautifulSoup does not seem to find the li elements. What am I doing wrong? A screenshot of the page HTML: https://gyazo.com/73ff53fb332755300d9b7450011a7130
I have already tried using soup.li, soup.findAll("li") and soup.find_all("li"), but all return "none" or similar. Printing soup.body does return the body though, so I think I do have an HTML document.
from bs4 import BeautifulSoup as bsoup
import requests
link = "https://halloffame.classicfm.com/2019/"
response = requests.get(link)
soup = bsoup(response.text, "html.parser")
print(soup.li)
I was hoping this would give me at least one li item, but instead it returns None.
I don't see all rankings from 300 to 1. Sometimes the page shows only down to 148, other times to 146, and the lowest I have seen is 143. I don't know if this is a design flaw or a bug. The page is updated by JavaScript, which is why you are getting None: that content hasn't been rendered when requests fetches the page.
requests only returns content that doesn't rely on JavaScript to render, i.e. you don't get everything you see in a browser, which, with JavaScript enabled, loads additional content as the various scripts on the page run. This is a feature of modern responsive/dynamic webpages, where an entire page no longer has to be reloaded when, for example, selections are made on it.
Often you can use the dev tools (F12), via the Network tab, to inspect the web traffic the page uses to update its content. With the Network tab open, refresh the entire page and then filter on XHR.
In this case, the info is actually pulled from a script tag that already holds it. You can open the Elements tab (Chrome), press Ctrl+F, and search for a composer's name; you will find that the one match occurs inside a script tag. I use a regex to find that script tag by matching on the JavaScript var songs = [];, which is followed by the object containing the composer info, captured in the regex group that follows.
You can grab the positions and titles from that script tag:
import requests
from bs4 import BeautifulSoup as bs
import re
soup = bs(requests.get('https://halloffame.classicfm.com/2019/').content, 'lxml')
r = re.compile(r'var songs = \[\];(.*)', re.DOTALL)
data = soup.find('script', text=r).text
script = r.findall(data)[0].strip()
rp = re.compile(r'position:\s+(\d+)')
rankings = rp.findall(script)
rt = re.compile(r'title:\s+"(.*)"')
titles = rt.findall(script)
print(len(titles))
print(len(rankings))
If you can locate the rest of these rankings, you can then zip your lists whilst reversing the rankings list:
results = list(zip(titles, rankings[::-1]))
Either way, you can use the len of the titles to generate a list of numbers in reverse that will give the rankings:
rankings = list(range(len(titles), 0, -1))
results = list(zip(titles, rankings[::-1]))
I am quite new to Python and am working on a scraping-based project where I am supposed to extract all the contents from links containing a particular search term and place them in a CSV file. As a first step, I wrote this code to extract all the links from a website based on a search term entered. I only get a blank screen as output and I am unable to find my mistake.
import urllib
import mechanize
from bs4 import BeautifulSoup
import datetime
def searchAP(searchterm):
    newlinks = []
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    text = ""
    start = 0
    while "There were no matches for your search" not in text:
        url = "http://www.marketing-interactive.com/" + "?s=" + searchterm
        text = urllib.urlopen(url).read()
        soup = BeautifulSoup(text, "lxml")
        results = soup.findAll('a')
        for r in results:
            if "rel=bookmark" in r['href']:
                newlinks.append("http://www.marketing-interactive.com" + str(r["href"]))
        start += 10
    return newlinks
print searchAP("digital marketing")
You made four mistakes:
You are defining start but you never use it. (Nor can you, as far as I can see on http://www.marketing-interactive.com/?s=something; there is no URL-based pagination.) So you are endlessly looping over the first set of results.
"There were no matches for your search" is not the no-results string returned by that site. So it would go on forever anyway.
You are appending the link, which already includes http://www.marketing-interactive.com, to http://www.marketing-interactive.com. So you would end up with http://www.marketing-interactive.comhttp://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/
Concerning the rel=bookmark selection: arifs solution is the proper way to go. But if you really want to do it this way, you'd need to do something like this:
for r in results:
    if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
        newlinks.append(r["href"])
This first checks whether rel exists and then checks whether its first value is "bookmark", since r['href'] simply does not contain the rel attribute; that's not how BeautifulSoup structures things.
To scrape this specific site you can do two things:
You could do something with Selenium or something else that supports Javascript and press that "Load more" button. But this is quite a hassle.
You can use this loophole: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=1&postType=search&searchValue=digital+marketing
This is the url that feeds the list. It has pagination, so you can easily loop over all results.
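A rough sketch of looping over that loop_handler.php endpoint is below. It assumes the returned HTML fragments contain the same rel="bookmark" links as the main page, and stopping when a page comes back empty is my own choice:

import requests
from bs4 import BeautifulSoup

base = ("http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/"
        "loop_handler.php?pageNumber={}&postType=search&searchValue=digital+marketing")

links = []
page = 1
while True:
    soup = BeautifulSoup(requests.get(base.format(page)).text, "lxml")
    found = [a["href"] for a in soup.findAll("a", {"rel": "bookmark"})]
    if not found:  # an empty page means we have gone past the last page of results
        break
    links.extend(found)
    page += 1

print(links)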
The following script extracts all the links from the web page based on a given search key, but it does not explore beyond the first page. The code can, however, easily be modified to get results from multiple pages by manipulating the page number in the URL (as described by Rutger de Knijf in the other answer).
from pprint import pprint
import requests
from BeautifulSoup import BeautifulSoup
def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content)
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
Usage:
pprint(get_url_for_search_key('digital marketing'))
Output:
[u'http://www.marketing-interactive.com/astro-launches-digital-marketing-arm-blaze-digital/',
u'http://www.marketing-interactive.com/singapore-polytechnic-on-the-hunt-for-digital-marketing-agency/',
u'http://www.marketing-interactive.com/how-to-get-your-bosses-on-board-your-digital-marketing-plan/',
u'http://www.marketing-interactive.com/digital-marketing-institute-launches-brand-refresh/',
u'http://www.marketing-interactive.com/entropia-highlights-the-7-original-sins-of-digital-marketing/',
u'http://www.marketing-interactive.com/features/futurist-right-mindset-digital-marketing/',
u'http://www.marketing-interactive.com/lenovo-brings-board-new-digital-marketing-head/',
u'http://www.marketing-interactive.com/video/discussing-digital-marketing-indonesia-video/',
u'http://www.marketing-interactive.com/ubs-melvin-kwek-joins-credit-suisse-as-apac-digital-marketing-lead/',
u'http://www.marketing-interactive.com/linkedins-top-10-digital-marketing-predictions-2017/']
Hope this is what you wanted as the first step for your project.
I'm trying to create a program that will go through a bunch of tumblr photos and extract the username of the person who uploaded them.
http://www.tumblr.com/tagged/food
If you look here, you can see multiple pictures of food with multiple different uploaders. If you scroll down you will begin to see even more pictures with even more uploaders. If you right click in your browser to view the source, and search "username", however, it will only yield 10 results. Every time, no matter how far down you scroll.
Is there any way to counter this and instead have it display the entire source for all images, or for X number of images, or for however far you scrolled?
Here is my code to show what I'm doing:
#Imports
import requests
from bs4 import BeautifulSoup
import re
#Start of code
r = requests.get('http://www.tumblr.com/tagged/skateboard')
page = r.content
soup = BeautifulSoup(page)
soup.prettify()
arrayDiv = []
for anchor in soup.findAll("div", { "class" : "post_info" }):
    anchor = str(anchor)
    tempString = anchor.replace('</a>:', '')
    tempString = tempString.replace('<div class="post_info">', '')
    tempString = tempString.replace('</div>', '')
    tempString = tempString.split('>')
    newString = tempString[1]
    newString = newString.strip()
    arrayDiv.append(newString)
print arrayDiv
I solved a similar problem using BeautifulSoup. What I did was loop through the paged pages: check with BeautifulSoup whether there is a "continue" element. Here (on the Tumblr page), for example, it is an element with the id "next_page_link".
If there is one, loop the photo-scraping code while changing the URL fetched by requests. You would need to encapsulate all the code in a function, of course.
Good luck.
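A minimal sketch of that idea follows. The id next_page_link comes from the answer above; the scrape_usernames helper and the assumption that the next link's href is site-relative are mine, and Tumblr's markup has likely changed since, so treat this purely as the shape of the loop:

import requests
from bs4 import BeautifulSoup

def scrape_usernames(soup):
    # Hypothetical helper: pull the uploader names out of the post_info divs,
    # roughly what the question's string manipulation does.
    return [div.a.text.strip() for div in soup.findAll("div", {"class": "post_info"}) if div.a]

url = "http://www.tumblr.com/tagged/food"
usernames = []
while url:
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    usernames.extend(scrape_usernames(soup))
    next_link = soup.find(id="next_page_link")  # the "continue" element
    url = "http://www.tumblr.com" + next_link["href"] if next_link else None

print(usernames)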