Using Beautiful Soup on multiple URLs - python

I have searched through a lot of similar questions, but I'm unable to resolve the issue with the code below.
I am trying to scrape the same information from two separate URLs.
There is no issue when I scrape one URL (first code). When I then loop over multiple URLs with a for loop (second code), it throws this error:
ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Should the line where the error is raised (marked below) not be inside the for loop? (I have tried moving it, without success.)
Could someone please explain why this is not working (my guess is that the structure is wrong in some way, but I've been unable to adjust it correctly), or whether this is in fact not the best method at all?
First code:
import csv
import pandas as pd
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import numpy as np
import re
url = "https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard"
url_contents = ureq(url) #opening the URL
soup = soup(url_contents,"html.parser") # parse the HTML
batsmen = soup.find_all("table", { "class":["table batsman"]})
bowlers = soup.find_all("table", { "class":["table bowler"]})
for batsman in batsmen[0]:
    with open('testcsv3.csv', 'a',newline='') as csvfile:
        f = csv.writer(csvfile)
        print (batsmen)
        for x in batsman:
            rows = batsman.find_all('tr')[:-2] #find all tr tag(rows)
            for tr in rows:
                data=[]
                cols = tr.find_all('td') #find all td tags(columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)
for bowler in bowlers[1]:
    with open('testcsv3.csv', 'a',newline='') as csvfile:
        f = csv.writer(csvfile)
        print (bowlers)
        for x in bowler:
            rows = bowler.find_all('tr') #find all tr tag(rows)
            for tr in rows:
                data=[]
                cols = tr.find_all('td') #find all td tags(columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)
Second code:
import csv # to do operations on CSV
import pandas as pd # file operations
from bs4 import BeautifulSoup as soup #Scraping tool
from urllib.request import urlopen as ureq # For requesting data from link
import numpy as np
import re
urls = ["https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard",
        "https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-kolkata-knight-riders-21st-match-733971/full-scorecard"]
for url in urls:
    url_contents = ureq(url) #opening the URL
    soup = soup(url_contents,"html.parser") # parse the HTML
    batsmen = soup.find_all("table", { "class":["table batsman"]}) # error raised here
    bowlers = soup.find_all("table", { "class":["table bowler"]})
for batsman in batsmen[0]:
    with open('testcsv3.csv', 'a',newline='') as csvfile:
        f = csv.writer(csvfile)
        print (batsmen)
        for x in batsman:
            rows = batsman.find_all('tr')[:-2] #find all tr tag(rows)
            for tr in rows:
                data=[]
                cols = tr.find_all('td') #find all td tags(columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)
for bowler in bowlers[1]:
    with open('testcsv3.csv', 'a',newline='') as csvfile:
        f = csv.writer(csvfile)
        print (bowlers)
        for x in bowler:
            rows = bowler.find_all('tr') #find all tr tag(rows)
            for tr in rows:
                data=[]
                cols = tr.find_all('td') #find all td tags(columns)
                for td in cols:
                    data.append(td.text.strip())
                f.writerow(data)
                print(data)

The problem is that you use the same name, soup, both for the class (imported as soup) and for the result (soup = ...), and you do this inside a loop.
from bs4 import BeautifulSoup as soup
for url in urls:
    soup = soup(...)
In the first iteration everything works, but the name soup is then rebound from the class to the parsed result. In the next iteration, soup(...) calls that result instead of the class, and this causes the error.
In the first code you run soup = soup(...) only once, so there is no problem.
If you use different names, e.g. BeautifulSoup instead of soup, it will work:
from bs4 import BeautifulSoup
for url in urls:
    soup = BeautifulSoup(...)
BTW:
In the second code the indentation is wrong. You should run for batsman in ... and for bowler in ... inside for url in urls:, but you run them outside (after the for url in urls: loop has finished), so you get results only for the last URL.
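The rebinding described above can be reproduced without any network access. This is only a minimal sketch: FakeSoup is a hypothetical stand-in for BeautifulSoup (it is not part of bs4), and it assumes that calling an already-parsed object returns a list-like result, which would be consistent with the ResultSet named in the error message.

```python
class FakeSoup:
    """Hypothetical stand-in for BeautifulSoup, used only for illustration."""

    def __init__(self, markup):
        self.markup = markup

    def __call__(self, name):
        # Assumption: calling a parsed object returns a list-like result.
        return []

    def find_all(self, name):
        return [name]


pages = ["<p>one</p>", "<p>two</p>"]  # hypothetical page contents

soup = FakeSoup  # mirrors: from bs4 import BeautifulSoup as soup
kinds = []
for page in pages:
    soup = soup(page)  # 1st pass: builds an instance; 2nd pass: calls that instance
    kinds.append(type(soup).__name__)

try:
    soup.find_all("table")  # soup is now a plain list
except AttributeError as exc:
    error = str(exc)

# Keeping the class name and the result name distinct avoids the rebinding.
parsed = [FakeSoup(page) for page in pages]
```

After the loop, kinds records one parsed object followed by one list, which is why the failure only shows up from the second URL onward.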

You can use the requests library and try this:
import requests as req
from bs4 import BeautifulSoup
urls = ["https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-royal-challengers-bangalore-42nd-match-734013/full-scorecard",
"https://www.espncricinfo.com/series/pepsi-indian-premier-league-2014-695871/chennai-super-kings-vs-kolkata-knight-riders-21st-match-733971/full-scorecard"]
for url in urls:
    otp = req.get(url)
    if otp.ok:
        soup = BeautifulSoup(otp.text,'lxml')
        batsmen = soup.find_all('table', {'class': 'table batsman'})
        bowlers = soup.find_all('table', {'class': 'table bowler'})
        for bat in batsmen:
            print(bat.find_all('td')) # here you can use find/find_all method
        for bowl in bowlers:
            print(bowl.find_all('td')) # here you can use find/find_all method

Related

How can I save scraped data from soup object into CSV?

I only want to save the scraped data into a CSV file.
This is the scraped data and code:
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"
from bs4 import BeautifulSoup
import requests
data = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")
table = soup.find('table')
for row in table.find_all('tr'):
    cols = row.find_all('td')
    programing_language = cols[1].getText()
    salary = cols[3].getText()
    print("{}--->{}".format(programing_language,salary))
Here is the solution.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"
html = requests.get(url).text  # page source; kept separate from the list of rows
soup = BeautifulSoup(html,"html5lib")
table = soup.find('table')
data=[]  # collected [language, salary] pairs
for row in table.find_all('tr'):
    cols = row.find_all('td')
    programing_language = cols[1].getText()
    salary = cols[3].getText()
    data.append([programing_language,salary])
    #print("{}--->{}".format(programing_language,salary))
cols=['programing_language','salary']
df = pd.DataFrame(data,columns=cols)
df.to_csv("data.csv", index=False)
For a lightweight solution you can just use csv. Ignore headers row by using tr:nth-child(n+2). This nth-child range selector selects from the second tr. Then within a loop over the subsequent rows, select for the second and fourth columns as follows:
from bs4 import BeautifulSoup as bs
import requests, csv
response = requests.get('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html',
                        headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(response.content, 'lxml')
with open("programming.csv", "w", encoding="utf-8-sig", newline='') as f:
    w = csv.writer(f, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["Language", "Average Annual Salary"])
    for item in soup.select('tr:nth-child(n+2)'):
        w.writerow([item.select_one('td:nth-child(2)').text,
                    item.select_one('td:nth-child(4)').text])

Python Web Scraping - How to scrape this type of site?

Okay, so I need to scrape the following webpage: https://www.programmableweb.com/category/all/apis?deadpool=1
It's a list of APIs. There are approx 22,000 APIs to scrape.
I need to:
1) Get the URL of each API in the table (pages 1-889), and also to scrape the following info:
API name
Description
Category
Submitted
2) I then need to scrape a bunch of information from each URL.
3) Export the data to a CSV
The thing is, I'm a bit lost on how to approach this project. From what I can see, there are no AJAX calls being made to populate the table, which means I'm going to have to parse the HTML directly (right?)
In my head, the logic would be something like this:
Use the requests & BS4 libraries to scrape the table
Then, somehow grab the HREF from every row
Access that HREF, scrape the data, move onto the next one
Rinse and repeat for all table rows.
Am I on the right track, is this possible with requests & BS4?
Here are some screenshots of what I've been trying to explain.
Thank you SOO much for any help. This is hurting my head haha
Here we go using requests, BeautifulSoup and pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')
name = []
desc = []
cat = []
sub = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for item1 in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        name.append(item1.text)
    for item2 in soup.findAll('td', attrs={'class': 'views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8'}):
        desc.append(item2.text)
    for item3 in soup.findAll('td', attrs={'class': 'views-field views-field-field-article-primary-category'}):
        cat.append(item3.text)
    for item4 in soup.findAll('td', attrs={'class': 'views-field views-field-created'}):
        sub.append(item4.text)
result = []
for item in zip(name, desc, cat, sub):
    result.append(item)
df = pd.DataFrame(
    result, columns=['API Name', 'Description', 'Category', 'Submitted'])
df.to_csv('output.csv')
print('Task Completed, Result saved to output.csv file.')
Now For href parsing:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.programmableweb.com/category/all/apis?deadpool=0&page='
num = int(input('How Many Page to Parse?> '))
print('please wait....')
links = []
for i in range(0, num):
    r = requests.get(f"{url}{i}")
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.findAll('td', attrs={'class': 'views-field views-field-title col-md-3'}):
        for href in link.findAll('a'):
            result = 'https://www.programmableweb.com'+href.get('href')
            links.append(result)
spans = []
for link in links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    span = [span.text for span in soup.select('div.field span')]
    spans.append(span)
data = []
for item in spans:
    data.append(item)
df = pd.DataFrame(data)
df.to_csv('data.csv')
print('Task Completed, Result saved to data.csv file.')
If you want those two CSV files combined, here's the code:
import pandas as pd
a = pd.read_csv("output.csv")
b = pd.read_csv("data.csv")
merged = a.merge(b)
merged.to_csv("final.csv", index=False)
You should read more about scraping if you are going to pursue it.
from bs4 import BeautifulSoup
import csv , os , requests
from urllib import parse
def SaveAsCsv(list_of_rows):
    try:
        with open('data.csv', mode='a', newline='', encoding='utf-8') as outfile:
            csv.writer(outfile).writerow(list_of_rows)
    except PermissionError:
        print("Please make sure data.csv is closed\n")
if os.path.isfile('data.csv') and os.access('data.csv', os.R_OK):
    print("File data.csv Already exists \n")
else:
    SaveAsCsv([ 'api_name','api_link','api_desc','api_cat'])
BaseUrl = 'https://www.programmableweb.com/category/all/apis?deadpool=1&page={}'
for i in range(1, 890):
    print('## Getting Page {} out of 889'.format(i))
    url = BaseUrl.format(i)
    res = requests.get(url)
    soup = BeautifulSoup(res.text,'html.parser')
    table_rows = soup.select('div.view-content > table[class="views-table cols-4 table"] > tbody tr')
    for row in table_rows:
        tds = row.select('td')
        api_name = tds[0].text.strip()
        api_link = parse.urljoin(url, tds[0].find('a').get('href'))
        api_desc = tds[1].text.strip()
        api_cat = tds[2].text.strip() if len(tds) >= 3 else ''
        SaveAsCsv([api_name,api_link,api_desc,api_cat])
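The answer above resolves each row's href against the page URL with urljoin rather than concatenating strings. A small sketch of how that behaves, using only the standard library; the href values here are hypothetical examples, not links taken from the site.

```python
from urllib.parse import urljoin

# Page URL in the same shape as the answer's BaseUrl with page=1.
page_url = "https://www.programmableweb.com/category/all/apis?deadpool=1&page=1"

# Hypothetical href values of the kind a table row might contain.
hrefs = ["/api/example-api", "https://example.com/docs"]

# A root-relative href is resolved against the site root;
# an absolute URL is returned unchanged.
links = [urljoin(page_url, href) for href in hrefs]
print(links)
```

Compared with prefixing the domain by hand, this also copes with hrefs that are already absolute.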

How to specify table for BeautifulSoup to find?

I'm trying to grab the table on this page https://nces.ed.gov/collegenavigator/?id=139755 under the Net Price expandable object. I've gone through tutorials for BS4, but I get so confused by the complexity of the html in this case that I can't figure out what syntax and which tags to use.
Here's a screenshot of the table and html I'm trying to get:
This is what I have so far. How do I add other tags to narrow down the results to just that one table?
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
soup = soup.find(id="divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02")
print(soup.prettify())
Once I can parse that data, I will format into a dataframe with pandas.
In this case I'd probably just use pandas to retrieve all the tables and then index into the list for the one you want:
import pandas as pd
table = pd.read_html('https://nces.ed.gov/collegenavigator/?id=139755')[10]
print(table)
If you are worried about the table order changing in the future, you could loop over the tables returned by read_html and test for the presence of a unique string to identify the table. Alternatively, use the bs4 :has and :contains selectors (bs4 4.7.1+) to identify the right table, then pass it to read_html or continue handling it with bs4:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
r = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = bs(r.content, 'lxml')
table = pd.read_html(str(soup.select_one('table:has(td:contains("Average net price"))')))
print(table)
OK, maybe this can help you. I added pandas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://nces.ed.gov/collegenavigator/?id=139755')
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find("div", {"id": "divctl00_cphCollegeNavBody_ucInstitutionMain_ctl02"})
table = div.findAll("table", {"class": "tabular"})[1]
l = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    if td:
        row = [i.text for i in td]
        l.append(row)
df=pd.DataFrame(l, columns=["AVERAGE NET PRICE BY INCOME","2015-2016","2016-2017","2017-2018"])
print(df)
Here is a basic script to scrape that first table in that accordion:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
print(desired_table.prettify())
I assume you only want the values within the table so I did an overkill version of this as well that will combine the column names and values together:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://nces.ed.gov/collegenavigator/?id=139755#netprc"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
parent_table = soup.find('div', attrs={'id':'netprc'})
desired_table = parent_table.find('table')
header_row = desired_table.find_all('th')
headers = []
for header in header_row:
    header_text = header.get_text()
    headers.append(header_text)
money_values = []
data_row =desired_table.find_all('td')
for rows in data_row:
    row_text = rows.get_text()
    money_values.append(row_text)
for yrs,money in zip(headers,money_values):
    print(yrs,money)
This will print out the following:
Average net price
2015-2016 $13,340
2016-2017 $15,873
2017-2018 $16,950

Scraping data from a table and storing it in csv file

I want to scrape the data from this website and store it in a CSV file in this manner.
But when I try to scrape the data, it is not stored in the right format: all the data ends up in the first column. I have no idea how to approach this problem.
Link : https://pce.ac.in/students/bachelors-students/
Code:
import csv # file operations
from bs4 import BeautifulSoup as soup # lib for pulling data from html/xmlsites
from urllib.request import urlopen as uReq # lib for sending and rec info over http
Url = 'https://pce.ac.in/students/bachelors-students/'
pageHtml = uReq(Url)
soup = soup(pageHtml,"html.parser") #parse the html
table = soup.find_all("table", { "class" : "tablepress tablepress-id-10 tablepress-responsive-phone" })
f = csv.writer(open('BEPillaiDepart.csv', 'w'))
f.writerow(['Choice Code', 'Course Name', 'Year of Establishment','Sanctioned Strength']) # headers
for x in table:
    data=""
    table_body = x.find('tbody') #find tbody tag
    rows = table_body.find_all('tr') #find all tr tag
    for tr in rows:
        cols = tr.find_all('td') #find all td tags
        for td in cols:
            data=data+ "\n"+ td.text.strip()
        f.writerow([data])
        #print(data)
Create the data variable anew for each tr row and collect the cells in a list. You can try it like this:
import csv # file operations
from bs4 import BeautifulSoup as soup # lib for pulling data from html/xmlsites
from urllib.request import urlopen as uReq # lib for sending and rec info over http
Url = 'https://pce.ac.in/students/bachelors-students/'
pageHtml = uReq(Url)
soup = soup(pageHtml,"html.parser") #parse the html
table = soup.find_all("table", { "class" : "tablepress tablepress-id-10 tablepress-responsive-phone" })
with open('BEPillaiDepart.csv', 'w',newline='') as csvfile:
    f = csv.writer(csvfile)
    f.writerow(['Choice Code', 'Course Name', 'Year of Establishment','Sanctioned Strength']) # headers
    for x in table:
        table_body = x.find('tbody') #find tbody tag
        rows = table_body.find_all('tr') #find all tr tag
        for tr in rows:
            data=[]
            cols = tr.find_all('td') #find all td tags
            for td in cols:
                data.append(td.text.strip())
            f.writerow(data)
            print(data)
If you look up what CSV stands for, you will find it means comma-separated values; however, I don't see any commas in the text you write to the file.
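The difference between the two versions comes down to what is passed to writerow: one string becomes one cell, while a list becomes one cell per element. A minimal sketch with the standard csv module; the cell values are made up for illustration, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical cell values for a single table row.
cells = ["101", "Computer Engineering", "1999", "120"]

buf = io.StringIO()  # stands in for the opened CSV file
writer = csv.writer(buf)

# Question's approach: cells joined into one string, so one quoted cell.
writer.writerow(["\n".join(cells)])

# Answer's approach: a list, so the writer emits one column per element.
writer.writerow(cells)

# Read the data back to count the columns in each row.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print([len(row) for row in rows])
```

The writer inserts the commas itself, so the code never needs to build a comma-separated string by hand.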

python webscraping and write data into csv

I'm trying to save all the data (i.e. all pages) in a single CSV file, but this code only saves the data from the final page. For example, here url[] contains 2 URLs, and the final CSV contains only the data from the 2nd URL.
I'm clearly doing something wrong in the loop, but I don't know what.
Also, this page contains 100 data points, but this code only writes the first 44 rows.
Please help with this issue.
from bs4 import BeautifulSoup
import requests
import csv
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content)
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    gen_list=[]
    for row in g_data:
        try:
            name = row.text
        except:
            name=''
        try:
            link = "http://sfbay.craigslist.org"+row.get("href")
        except:
            link=''
        gen=[name,link]
        gen_list.append(gen)
with open ('filename2.csv','wb') as file:
    writer=csv.writer(file)
    for row in gen_list:
        writer.writerow(row)
gen_list is being initialized again on every pass of the loop that runs over the URLs:
gen_list=[]
Move this line outside the for loop:
...
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
gen_list=[]
for ur in url:
    ...
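A sketch of the corrected flow, with the list created once before the loop and the file written once after it. The pages dictionary is a hypothetical stand-in for the two scraped result pages, and an in-memory buffer replaces the real file so the example runs offline. Note that in Python 3 the csv module expects a text-mode file, e.g. open('filename2.csv', 'w', newline=''), not 'wb'.

```python
import csv
import io

# Hypothetical stand-in for the listings scraped from each results page.
pages = {
    "page-1": [("Job A", "/a.html"), ("Job B", "/b.html")],
    "page-2": [("Job C", "/c.html")],
}

gen_list = []  # created once, before the loop, so rows accumulate across pages
for page_name, listings in pages.items():
    for name, href in listings:
        gen_list.append([name, "http://sfbay.craigslist.org" + href])

buf = io.StringIO()  # stands in for open('filename2.csv', 'w', newline='')
writer = csv.writer(buf)
writer.writerows(gen_list)  # write everything once, after all pages are collected

row_count = len(buf.getvalue().splitlines())
print(row_count)
```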
I found your post later; you may want to try this method:
import requests
from bs4 import BeautifulSoup
import csv
final_data = []
url = "https://sfbay.craigslist.org/search/sss"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all(class_="result-row")
for details in get_details:
    getclass = details.find_all(class_="hdrlnk")
    for link in getclass:
        link1 = link.get("href")
        sublist = []
        sublist.append(link1)
        final_data.append(sublist)
print(final_data)
filename = "sfbay.csv"
with open("./"+filename, "w") as csvfile:
    csvfile = csv.writer(csvfile, delimiter = ",")
    csvfile.writerow("")
    for i in range(0, len(final_data)):
        csvfile.writerow(final_data[i])
