I'm having a problem with my code: it works perfectly for one page, but when I try to parse all 28 pages it only parses the first one and skips the other 27.
The idea is to parse the data from the URL mentioned below, which has 28 pages in total, and I wrote a for loop so that BeautifulSoup would parse every page. However, it only parses the first page and ignores the others.
I would like to get your recommendations on how to make it work.
Code:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')
    titles = [i.text for i in soup.select('.results-i-title')]
    #print(titles)
    companies = [i.text for i in soup.select('.results-i-company')]
    #print(companies)
    summaries = [i.text for i in soup.select('.results-i-summary')]

df = pd.DataFrame(list(zip(titles, companies, summaries)), columns=['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig', index=False)
You are overwriting titles, companies and summaries on every iteration of the loop. Initialize them once before the loop and simply change titles = ... to titles += ... (and likewise for the other two):
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

titles = []
companies = []
summaries = []

for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    r = requests.get(url)
    soup = bs(r.content, 'html.parser')
    titles += [i.text for i in soup.select('.results-i-title')]
    companies += [i.text for i in soup.select('.results-i-company')]
    summaries += [i.text for i in soup.select('.results-i-summary')]

df = pd.DataFrame(list(zip(titles, companies, summaries)), columns=['Title', 'Company', 'Summary'])
df.to_csv(r'Data.csv', sep=',', encoding='utf-8-sig', index=False)
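An equivalent pattern, if you prefer to keep each page's results together, is to build one small DataFrame per page and concatenate them at the end. A sketch of the same scrape (it assumes every page yields equally long title/company/summary lists, which the zip approach above does not require):

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

frames = []
for t in range(28):
    url = "https://boss.az/vacancies?action=index&controller=vacancies&only_path=true&page={}&type=vacancies".format(t)
    soup = bs(requests.get(url).content, 'html.parser')
    # one DataFrame per page; assumes the three selectors return lists of equal length
    frames.append(pd.DataFrame({
        'Title': [i.text for i in soup.select('.results-i-title')],
        'Company': [i.text for i in soup.select('.results-i-company')],
        'Summary': [i.text for i in soup.select('.results-i-summary')],
    }))

df = pd.concat(frames, ignore_index=True)
df.to_csv('Data.csv', sep=',', encoding='utf-8-sig', index=False)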
I create a dataframe by scraping the data with Beautiful Soup. However, there are 2 problems:
Why does the for loop run 2 times?
How do I remove the brackets in the data frame?
import urllib.request as req
from bs4 import BeautifulSoup
import bs4
import requests
import pandas as pd

url = "https://finance.yahoo.com/quote/BF-B/profile?p=BF-B"
root = requests.get(url)
soup = BeautifulSoup(root.text, 'html.parser')

records = []
for result in soup:
    name = soup.find_all('h1', attrs={'D(ib) Fz(18px)'})
    website = soup.find_all('a')[44]
    sector = soup.find_all('span')[35]
    industry = soup.find_all('span')[37]
    records.append((name, website, sector, industry))

df = pd.DataFrame(records, columns=['name', 'website', 'sector', 'industry'])
df.head()
And the result looks like this (screenshot of the DataFrame output):
To get information about the company, you don't have to loop over the soup; just extract the necessary information directly. To get rid of the [..] brackets, use the .text property:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/BF-B/profile?p=BF-B'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
all_data.append({
    'Name': soup.h1.text,
    'Website': soup.select_one('.asset-profile-container a[href^="http"]')['href'],
    'Sector': soup.select_one('span:contains("Sector(s)") + span').text,
    'Industry': soup.select_one('span:contains("Industry") + span').text
})

df = pd.DataFrame(all_data)
print(df)
Prints:
Name Website Sector Industry
0 Brown-Forman Corporation (BF-B) http://www.brown-forman.com Consumer Defensive Beverages—Wineries & Distilleries
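A small caveat on the selectors above: in newer versions of soupsieve (the CSS engine BeautifulSoup uses for select/select_one), :contains() is deprecated in favor of :-soup-contains(), so on a recent install you may see a deprecation warning. A minimal sketch of the same lookup with the newer spelling (assuming soupsieve 2.1 or later is installed):

import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/BF-B/profile?p=BF-B'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# :-soup-contains() is the non-deprecated spelling (assumes soupsieve >= 2.1)
sector = soup.select_one('span:-soup-contains("Sector(s)") + span')
print(sector.text if sector else 'selector did not match')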
Okay, so I need to scrape the following webpage: https://www.programmableweb.com/category/all/apis?deadpool=1
It's a list of APIs. There are approx 22,000 APIs to scrape.
I need to:
1) Get the URL of each API in the table (pages 1-889), and also to scrape the following info:
API name
Description
Category
Submitted
2) I then need to scrape a bunch of information from each URL.
3) Export the data to a CSV
The thing is, I'm a bit lost on how to think about this project. From what I can see, there are no AJAX calls being made to populate the table, which means I'm going to have to parse the HTML directly (right?)
In my head, the logic would be something like this:
Use the requests & BS4 libraries to scrape the table
Then, somehow grab the HREF from every row
Access that HREF, scrape the data, move onto the next one
Rinse and repeat for all table rows.
Am I on the right track, is this possible with requests & BS4?
Here are some screenshots of what I've been trying to explain.
Thank you SOO much for any help. This is hurting my head haha
Here we go using requests, BeautifulSoup and pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')

name = []
desc = []
cat = []
sub = []

for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for item1 in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        name.append(item1.text)
    for item2 in soup.findAll('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
        desc.append(item2.text)
    for item3 in soup.findAll('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
        cat.append(item3.text)
    for item4 in soup.findAll('td', attrs={'class': 'views-field views-field-created'}):
        sub.append(item4.text)

result = []
for item in zip(name, desc, cat, sub):
    result.append(item)

df = pd.DataFrame(
    result, columns=['API Name', 'Description', 'Category', 'Submitted'])
df.to_csv('output.csv')
print('Task Completed, Result saved to output.csv file.')
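One thing to keep in mind with this approach: zip stops at the shortest of the four lists, so if a page is missing a cell (say, a category), rows are silently dropped or shifted. If you would rather keep ragged rows visible, itertools.zip_longest pads the gaps instead. A tiny illustration with made-up values, not tied to the site's markup:

from itertools import zip_longest

names = ['API A', 'API B', 'API C']
cats = ['Mapping', 'Payments']  # one category missing

print(list(zip(names, cats)))                        # third row silently dropped
print(list(zip_longest(names, cats, fillvalue='')))  # all three rows kept, gap padded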
Now For href parsing:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.programmableweb.com/category/all/apis?deadpool=0&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')

links = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        for href in link.findAll('a'):
            result = 'https://www.programmableweb.com' + href.get('href')
            links.append(result)

spans = []
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    span = [span.text for span in soup.select('div.field span')]
    spans.append(span)

data = []
for item in spans:
    data.append(item)

df = pd.DataFrame(data)
df.to_csv('data.csv')
print('Task Completed, Result saved to data.csv file.')
In case you want those 2 csv files merged together, here's the code:
import pandas as pd
a = pd.read_csv("output.csv")
b = pd.read_csv("data.csv")
merged = a.merge(b)
merged.to_csv("final.csv", index=False)
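Since both files were written with the default RangeIndex, a.merge(b) joins them on that shared unnamed index column. If the two files are known to have the same number of rows in the same order, a positional join works too (a sketch of that alternative):

import pandas as pd

# positional alternative: assumes output.csv and data.csv have the same row count and order
a = pd.read_csv("output.csv", index_col=0)
b = pd.read_csv("data.csv", index_col=0)

merged = pd.concat([a, b], axis=1)
merged.to_csv("final.csv", index=False)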
You should read more about scraping if you are going to pursue it.
from bs4 import BeautifulSoup
import csv, os, requests
from urllib import parse

def SaveAsCsv(list_of_rows):
    try:
        with open('data.csv', mode='a', newline='', encoding='utf-8') as outfile:
            csv.writer(outfile).writerow(list_of_rows)
    except PermissionError:
        print("Please make sure data.csv is closed\n")

if os.path.isfile('data.csv') and os.access('data.csv', os.R_OK):
    print("File data.csv Already exists \n")
else:
    SaveAsCsv(['api_name', 'api_link', 'api_desc', 'api_cat'])

BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'
for i in range(1, 890):
    print('## Getting Page {} out of 889'.format(i))
    url = BaseUrl.format(i)
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    table_rows = soup.select('div.view-content > table[class="views-table cols-4 table"] > tbody tr')
    for row in table_rows:
        tds = row.select('td')
        api_name = tds[0].text.strip()
        api_link = parse.urljoin(url, tds[0].find('a').get('href'))
        api_desc = tds[1].text.strip()
        api_cat = tds[2].text.strip() if len(tds) >= 3 else ''
        SaveAsCsv([api_name, api_link, api_desc, api_cat])
I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable object. I've gone through tutorials for BS4, but I get so confused by the complexity of the html in this case that I can't figure out what syntax and which tags to use.
Here's a screenshot of the table and html I'm trying to get:
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format into a dataframe with pandas.
In this case I'd probably just use pandas to retrieve all tables, then index in for the appropriate one:
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about the ordering changing in the future, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the table, or use the bs4 :has and :contains functionality (bs4 4.7.1+) to identify the right table and then pass it to read_html, or continue handling it with bs4:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
print(table)
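For completeness, a sketch of the first option mentioned above — looping over the tables that read_html returns and keeping the one that contains a marker string (here 'Average net price', taken from the target table):

import pandas as pd

tables = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')

# keep the first table whose cells mention the marker string; None if nothing matches
net_price = next(
    (t for t in tables
     if t.astype(str).apply(lambda col: col.str.contains('Average net price', na=False)).any().any()),
    None,
)
print(net_price)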
OK, maybe this can help you; I added pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
table = div.findAll("table", {"class": "tabular"})[1]

l = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    if td:
        row = [i.text for i in td]
        l.append(row)

df = pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME", "2015-2016", "2016-2017", "2017-2018"])
print(df)
Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table, so I did an overkill version of this as well that combines the column names and values together:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

parent_table = soup.find('div', attrs={'id': 'netprc'})
desired_table = parent_table.find('table')

header_row = desired_table.find_all('th')
headers = []
for header in header_row:
    header_text = header.get_text()
    headers.append(header_text)

money_values = []
data_row = desired_table.find_all('td')
for rows in data_row:
    row_text = rows.get_text()
    money_values.append(row_text)

for yrs, money in zip(headers, money_values):
    print(yrs, money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950
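Since the question mentions loading the parsed data into a DataFrame afterwards, those year/value pairs drop straight into pandas. A small follow-on sketch using the values from the sample output above (in practice you would zip the headers and money_values lists collected by the script instead of literals):

import pandas as pd

# values copied from the sample output above; regenerate them from the script as needed
years = ['2015-2016', '2016-2017', '2017-2018']
prices = ['$13,340', '$15,873', '$16,950']

df = pd.DataFrame(list(zip(years, prices)), columns=['Year', 'Average net price'])
print(df)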
I would like to pull data from a table on a website. The table exists across 165 webpages, and I would like to scrape all of it. I have only been able to get the first page.
I have tried pandas, BeautifulSoup, and requests.
offset = 0
teacher_list = []

while offset <= 4500:
    calls_df, = pd.read_html(
        "https://projects.newsday.com/databases/long-island/teacher-administrator-salaries-2017-2018/?offset=0" + str(offset),
        header=0, parse_dates=["Start date"])
    offset = offset + 1500
    print(calls_df)

    # calls_df = "https:" + calls_df
    collection_page = requests.get(calls_df)
    page_html = collection_page.text
    soup = BeautifulSoup(page_html, "html.parser")
    print(page_html)
    print(soup.prettify())
    print(teacher_list)
    offset = offset + 1500

print(teacher_list, calls_df.to_csv("calls.csv", index=False))
You could use a step parameter to increment your url
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

df = pd.DataFrame()

with requests.Session() as s:
    for i in range(0, 246001, 1500):
        url = 'https://projects.newsday.com/databases/long-island/teacher-administrator-salaries-2017-2018/?offset={}'.format(i)
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        dfCurrent = pd.read_html(str(soup.select_one('html')))[0]
        dfCurrent.dropna(how='all', inplace=True)
        df = pd.concat([df, dfCurrent])

df = df.reset_index(drop=True)
df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8-sig')
I'm trying to scrape a table into a dataframe. My attempt only returns the table name and not the data within the rows for each region.
This is what I have so far:
from bs4 import BeautifulSoup as bs4
import requests

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

table_regions = soup.find('table', {'class': "t4"})
regions = table_regions.find_all('tr')
for row in regions:
    print(row)
The ideal outcome I'd like to get:
region | price
---------------|-------
new england | 2.59
new york city | 2.52
Thanks for any assistance.
If you check your html response (soup) you will see that the table tag you get in the line table_regions = soup.find('table', {'class': "t4"}) is closed before the rows that contain the information you need (the ones that contain the td's with the class names up, dn, d1 and s1).
So how about using the raw td tags like this:
from bs4 import BeautifulSoup as bs4
import requests
import pandas as pd

url = 'https://www.eia.gov/todayinenergy/prices.php'
r = requests.get(url)
soup = bs4(r.text, "html.parser")

a = soup.find_all('tr')
rows = []
subel = []
for tr in a[42:50]:
    b = tr.find_all('td')
    for td in b:
        subel.append(td.string)
    rows.append(subel)
    subel = []

df = pd.DataFrame(rows, columns=['Region', 'Price_1', 'Percent_change_1', 'Price_2', 'Percent_change_2', 'Spark Spread'])
Notice that I use just the a[42:50] slice of the results because a contains all the tr's of the page. You can use the rest too if you need to.
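If the hard-coded a[42:50] slice feels fragile, pandas.read_html on the same page is also worth a look. A sketch that just lists what read_html finds (which entry holds the regional gasoline table would need to be checked by inspecting the output, and the page's malformed markup may affect what it recovers):

import pandas as pd

# read_html returns one DataFrame per <table> it can parse on the page
tables = pd.read_html('https://www.eia.gov/todayinenergy/prices.php')
for i, t in enumerate(tables):
    print(i, t.shape)  # inspect shapes/headers to locate the regional gasoline table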