how to scrape multiple pages in python with bs4 - python

I have a query as I have been scraping a website "https://www.zaubacorp.com/company-list" as not able to scrape the email id from the given link in the table. Although the need to scrape Name, Email and Directors from the link in the given table. Can anyone please, resolve my issue as I am a newbie to web scraping using python with beautiful soup and requests.
Thank You
Dieksha
#Scraping the website
#Import a liabry to query a website
import requests
#Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
link = requests.get("https://www.zaubacorp.com/company-list").text
#Import BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(link,'lxml')
soup.table.find_all('a')
all_links = soup.table.find_all('a')
for link in all_links:
print(link.get("href"))

Well let's break down the website and see what we can do.
First off, I can see that this website is paginated. This means that we have to deal with something as simple as the website using part of the GET query string to determine what page we are requesting to some AJAX call that is filling the table with new data when you click next. From clicking on the next page and subsequent pages, we are in some luck that the website uses the GET query parameter.
Our URL for requesting the webpage to scrape is going to be
https://www.zaubacorp.com/company-list/p-<page_num>-company.html
We are going to write a bit of code that will fill that page num with values ranging from 1 to the last page you want to scrape. In this case, we do not need to do anything special to determine the last page of the table since we can skip to the end and find that it will be page 13,333. This means that we will be making 13,333 page requests to this website to fully collect all of its data.
As for gathering the data from the website we will need to find the table that holds the information and then iteratively select the elements to pull out the information.
In this case we can actually "cheat" a little since there appears to be only a single tbody on the page. We want to iterate over all the and pull out the text. I'm going to go ahead and write the sample.
import requests
import bs4
def get_url(page_num):
page_num = str(page_num)
return "https://www.zaubacorp.com/company-list/p-1" + page_num + "-company.html"
def scrape_row(tr):
return [td.text for td in tr.find_all("td")]
def scrape_table(table):
table_data = []
for tr in table.find_all("tr"):
table_data.append(scrape_row(tr))
return table_data
def scrape_page(page_num):
req = requests.get(get_url(page_num))
soup = bs4.BeautifulSoup(req.content, "lxml")
data = scrape_table(soup)
for line in data:
print(line)
for i in range(1, 3):
scrape_page(i)
This code will scrape the first two pages of the website and by just changing the for loop range you can get all 13,333 pages. From here you should be able to just modify the printout logic to save to a CSV.

Related

How to get data past the "Show More" button that DOESN'T change the URL?

I am trying to scrape article titles and links from Vogue with a site search keyword. I can't get the top 100 results because the "Show More" button obscures them. I've gotten around this before by using the changing URL, but Vogue's URL does not change to include the page number, result number, etc.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.vogue.com/search?q=HARRY+STYLES&sort=score+desc'
r = requests.get(url)
soup = bs(r.content, 'html')
links = soup.find_all('a', {'class':"summary-item-tracking__hed-link summary-item__hed-link"})
titles = soup.find_all('h2', {'class':"summary-item__hed"})
res = []
for i in range(len(titles)):
entry = {'Title': titles[i].text.strip(), 'Link': 'https://www.vogue.com'+links[i]['href'].strip()}
res.append(entry)
Any tips on how to scrape the data past the "Show More" button?
You have to examine the Network from developer tools. Then you have to determine how to website requests the data. You can see the request and the response in the screenshot.
The website is using page parameter as you see.
Each page has 8 titles. So you have to use the loop to get 100 titles.
Code:
import cloudscraper,json,html
counter=1
for i in range(1,14):
url = f'https://www.vogue.com/search?q=HARRY%20STYLES&page={i}&sort=score%20desc&format=json'
scraper = cloudscraper.create_scraper(browser={'browser': 'firefox','platform': 'windows','mobile': False},delay=10)
byte_data = scraper.get(url).content
json_data = json.loads(byte_data)
for j in range(0,8):
title_url = 'https://www.vogue.com' + (html.unescape(json_data['search']['items'][j]['url']))
t = html.unescape(json_data['search']['items'][j]['source']['hed'])
print(counter," - " + t + ' - ' + title_url)
if (counter == 100):
break
counter = counter + 1
Output:
You can inspect the requests on the website using your browser's web developer tools to find out if its making a specific request for data of your interest.
In this case, the website is loading more info by making GET requests to an URL like this:
https://www.vogue.com/search?q=HARRY STYLES&page=<page_number>&sort=score desc&format=json
Where <page_number> is > 1 as page 1 is what you see by default when you visit the website.
Assuming you can/will request a limited amount of pages and as the data format is JSON, you will have to transform it to a dict() or other data structure to extract the data you want. Specifically targeting the "search.items" key of the JSON object since it contains an array of data of the articles for the requested page.
Then, the "Title" would be search.items[i].source.hed and you could assemble the link with search.items[i].url.
As a tip, I think is a good practice to try to see how the website works manually and then attempt to automate the process.
If you want to request data to that URL, make sure to include some delay between requests so you don't get kicked out or blocked.

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping in this following website: ivmp servers. I have trouble with scraping the number of players in the server, can someone help me? I will send the code of what I've done so far
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)
Looking at the page you provided, i can assume that the table you want to extract information from is the one with server names and ip adresses.
There are actually 4 "table" element on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > inspect on Chrome
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using :
for td in players.select("td"):
print(td)
Or you can select the one you are interested in :
players.select("td.hostname")
for example.
Hope this helps.
Looking at the structure of the page, there are a few table cells (td) with the class "players", it looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
# Exclude the cells which are for sorting
if cell.get_text() != 'Players':
summary.append(cell.get_text())
print(summary)

Scraping a table for links, getting data from each link, and doing the same for many pages (pagination)

I have a list of 5000 best movies, spanning 50 pages. The website is
http://5000best.com/movies/
I want to extract the names of the 5000 movies, then click on each movie name link. Each link will redirect me to the IMDb page. Then, I want the director's name to be extracted.
This will give me a table with 5000 rows, with the columns being the name of the movie and the director.
This data will be exported to csv or to xlsx.
I have the following for extracting text so far:
import requests
start_url = 'http://5000best.com/movies/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text)
Ok, here is the main logic for pagination. Hope you get along from there. To capture all pages just loop until the next page doesn't exist.
import requests
import bs4
i = 1
while 1:
url = f'http://5000best.com/movies/{i}'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text)
# looking at the HTML we can find the main table
table = soup.find('table', id="ttable")
# analyse the HTML and process the table here
# if the table is empty, we are beyond the last page
if len(table.find_all('tr')) == 0:
break
i += 1
I think the issue is getting the pagination link
This is how the link works
http://5000best.com/?m.c&xml=1&ta=13&p=1&s=&sortby=0&y0=&y1=&ise=&h=01000000000000000
There are 2 parameters that change with each page the p and h (Although the links seem to work irrespective of the h parameter)
so the link for page 2 will look like this:
http://5000best.com/?m.c&xml=1&ta=13&p=2&s=&sortby=0&y0=&y1=&ise=&h=02000000000000000
and 50 be like:
http://5000best.com/?m.c&xml=1&ta=13&p=50&s=&sortby=0&y0=&y1=&ise=&h=05000000000000000
Hope you can handle the rest

Unable to navigate Amazon pagination with Python and BS4

I've been trying to create a simple web scraper program to scrape the book titles of a 100 bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scraps the first page fine but then posts the same results for the following iterations.
I'm not sure if it's something to do with how Amazon creates its urls or not. When I manually enter the "#2" (and beyond) at the end of the url in the browser it navigates fine.
(Once the scrape is working I plan on dumping the data in csv files. But for now, print to the terminal will do.)
import requests
from bs4 import BeautifulSoup
for i in range(5):
url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for book in soup.find_all('div', class_='zg_itemWrapper'):
title = book.find('div', class_='p13n-sc-truncate')
name = book.find('a', class_='a-link-child')
price = book.find('span', class_='p13n-sc-price')
print(title)
print(name)
print(price)
print("END")
This is a common problem that you have to face, some sites load the data asynchronous(with ajax) those are XMLHttpRequest that you can see in the tab networking of your DOM inspector. Usually the websites load the data from a different endpoint with POST method to solve that you can use urllib or requests library.
In this case the request is through a GET method and you can scrape it from this url with no need of extend your code https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter

Scrape multiple pages with BeautifulSoup and Python

My code successfully scrapes the tr align=center tags from [ http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY ] and writes the td elements to a text file.
However, there are multiple pages available at the site above in which I would like to be able to scrape.
For example, with the url above, when I click the link to "page 2" the overall url does NOT change. I looked at the page source and saw a javascript code to advance to the next page.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4
import requests
response = requests.get('http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY')
soup = bs4.BeautifulSoup(response.text)
soup.prettify()
acct = open("/Users/it/Desktop/accounting.txt", "w")
for tr in soup.find_all('tr', align='center'):
stack = []
for td in tr.findAll('td'):
stack.append(td.text.replace('\n', '').replace('\t', '').strip())
acct.write(", ".join(stack) + '\n')
The trick here is to check the requests that are coming in and out of the page-change action when you click on the link to view the other pages. The way to check this is to use Chrome's inspection tool (via pressing F12) or installing the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.
Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that will appear, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.
Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, pretty much the identification stuff that the other side (the site, for example) needs from you to be able to connect (someone else can explain this muuuch better than I do).
Whenever the URL has variables like page numbers, location markers, or categories, more often that not, the site uses query-strings. Long story made short, it's similar to an SQL query (actually, it is an SQL query, sometimes) that allows the site to pull the information you need. If this is the case, you can check the request headers for query string parameters. Scroll down a bit and you should find it.
As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.
POST requests are more commonly known as form requests because these are the kind of requests made when you submit forms, log in to websites, etc. Basically, pretty much anything where you have to submit information. What most people don't see is that POST requests have a URL that they follow. A good example of this is when you log-in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or somesuch.
What the above paragraph basically means is that you can (but not always) append the form data to your URL and it will carry out the POST request for you on execution. To know the exact string you have to append, click on view source.
Test if it works by adding it to the URL.
Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.
Modified code is below:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)
soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try: # Make sure there are more than one page, otherwise, set to 1.
num_pages = int(page_count_links[-1].get_text())
except IndexError:
num_pages = 1
# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
# Open the text file. Use with to save self from grief.
with open("results.txt","wb") as acct:
for url_ in url_list:
print "Processing {}...".format(url_)
r_new = rq.get(url_)
soup_new = bsoup(r_new.text)
for tr in soup_new.find_all('tr', align='center'):
stack = []
for td in tr.findAll('td'):
stack.append(td.text.replace('\n', '').replace('\t', '').strip())
acct.write(", ".join(stack) + '\n')
We use regular expressions to get the proper links. Then using list comprehension, we built a list of URL strings. Finally, we iterate over them.
Results:
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]
Hope that helps.
EDIT:
Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I update both the above and below codes to not error out when there is only a single page available.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text)
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print classes_url_list
with open("results.txt","wb") as acct:
for class_url in classes_url_list:
base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
r = rq.get(base_url)
soup = bsoup(r.text)
# Use regex to isolate only the links of the page numbers, the one you click on.
page_count_links = soup.find_all("a",href=re.compile(r".*javascript:goToPage.*"))
try:
num_pages = int(page_count_links[-1].get_text())
except IndexError:
num_pages = 1
# Add 1 because Python range.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]
# Open the text file. Use with to save self from grief.
for url_ in url_list:
print "Processing {}...".format(url_)
r_new = rq.get(url_)
soup_new = bsoup(r_new.text)
for tr in soup_new.find_all('tr', align='center'):
stack = []
for td in tr.findAll('td'):
stack.append(td.text.replace('\n', '').replace('\t', '').strip())
acct.write(", ".join(stack) + '\n')

Categories

Resources