Okay, I have built a program to scrape Yahoo Finance. I want the historical prices of a certain stock, written out to a spreadsheet. It does everything the way it's supposed to, except that it gives me ALL of the data on the whole page! I need just the data in the table. Thanks.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
import requests
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
playerdatasaved=""
soup = make_soup("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices")
for record in soup.findAll('tr'):
    playerdata = ""
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text
    if len(playerdata) != 0:
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
header="Open,Close,High,Low"
file = open(os.path.expanduser("Uso.csv"),"wb")
file.write(bytes(header, encoding="ascii",errors='ignore'))
file.write(bytes(playerdatasaved, encoding="ascii",errors='ignore'))
print(playerdatasaved)
To get the table of data:
soup = make_soup("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices")
table = [[cell.text for cell in row.findAll('td')] for row in soup.findAll('tr')]
To write out the table of data to a file:
import csv
with open("output.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(table)
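Putting the two pieces together for the original goal, a minimal end-to-end sketch (keeping the header from the question, which you may want to adjust to the table's actual columns) could look like this:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices"), "html.parser")

# Keep only rows that actually contain <td> cells (header rows use <th> and come out empty).
table = [[cell.text for cell in row.findAll('td')] for row in soup.findAll('tr')]
table = [row for row in table if row]

with open("Uso.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Open", "Close", "High", "Low"])  # header taken from the original post; adjust to match the table
    writer.writerows(table)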
I'm trying to scrape a web page using Python and BeautifulSoup. When I write:
table = soup.find('table')
it returns None, and when I try to get the row content, it always returns an empty list.
I also tried Selenium, with the same result: an empty list.
import requests
from bs4 import BeautifulSoup
import csv
url = "https://www.iea.org/data-and-statistics/data-tables?country=CANADA&energy=Balances&year=2010"
response = requests.get(url)
print(response.status_code)   # prints 200
soup = BeautifulSoup(response.text,"html.parser")
tr = soup.findAll('tr', attrs={'class': 'm-data-table__row '})
print(tr)        # prints []
print(len(tr))   # prints 0
csvFile = open("C:/Users/User/Desktop/test27.csv",'wt',newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for cell in tr:
        td = cell.find_all('td')
        row = [i.text.replace('\n', '') for i in td]
        writer.writerow(row)
finally:
    csvFile.close()
Any help?
If you analyse the website, you'll see the data is loaded via an AJAX call. The following script makes that AJAX call directly and saves the required JSON to a file:
import requests, json

res = requests.get("https://api.iea.org/stats/?year=2010&countries=CANADA&series=BALANCES")
data = res.json()
with open("data.json", "w") as f:
    json.dump(data, f)
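Since the original goal was a CSV file, you could also flatten that JSON into rows. This is just a sketch, and it assumes the API returns a list of flat record dicts (inspect the response to confirm the actual field names):
import csv
import requests

res = requests.get("https://api.iea.org/stats/?year=2010&countries=CANADA&series=BALANCES")
records = res.json()  # assumed: a list of dicts, one per row

if records:
    with open("test27.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)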
I am very new to Python and BeautifulSoup, and I am trying to use them to scrape multiple URLs in a loop. For each website's home page, the loop should locate the banner slides, count how many banners that site has, and write the count into a file next to the corresponding URL. I have a list of URLs saved in a CSV file; for each of those URLs I want to pull the number of banners and put that number next to the URL in a separate column.
This is the code I have so far, and all it does is write the URLs back into a CSV file and give me the number of banners for only the last URL.
from bs4 import BeautifulSoup
import requests
with open("urls.csv", "r") as f:
csv_raw_cont=f.read()
split_csv=csv_raw_cont.split('\n')
split_csv.remove('')
separator=';'
filename = "DDC_number_of_banners.csv"
f = open(filename, "w")
headers = "url, Number_of_Banners\n"
f.write(headers)
for each in split_csv:
    url_row_index = 0
    url = each.split(separator)[url_row_index]
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link',
                                                 'html-slide slide has-link']})
    Number_of_banners = len(banner_info)
    f.write(csv_raw_cont + "," + str(Number_of_banners) + "," + "\n")
f.close()
Making use of Python's CSV library would make this a bit simpler:
from bs4 import BeautifulSoup
import requests
import csv
with open("urls.csv", "r") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
csv_output = csv.writer(f_output)
csv_output.writerow(['url', 'Number_of_banners'])
for url in f_urls:
url = url.strip()
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
banner_info = soup.findAll('div',{'class':['slide', 'slide has-link', 'html-slide slide has-link']})
csv_output.writerow([url, len(banner_info)])
To include information such as each banner's data-label:
from bs4 import BeautifulSoup
import requests
import csv
with open("urls.csv", "r") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
csv_output = csv.writer(f_output)
csv_output.writerow(['url', 'Number_of_banners', 'data_labels'])
for url in f_urls:
url = url.strip()
html = requests.get(url).content
soup = BeautifulSoup(html, "html.parser")
banner_info = soup.findAll('div',{'class':['slide', 'slide has-link', 'html-slide slide has-link']})
data_labels = [banner.get('data-label') for banner in banner_info]
csv_output.writerow([url, len(banner_info)] + data_labels)
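If urls.csv actually has multiple ';'-separated columns, as in the original script, you could swap the plain line iteration for csv.reader. This is a sketch that assumes the URL sits in the first column:
import csv
import requests
from bs4 import BeautifulSoup

with open("urls.csv", "r", newline="") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
    csv_input = csv.reader(f_urls, delimiter=';')
    csv_output = csv.writer(f_output)
    csv_output.writerow(['url', 'Number_of_banners'])

    for row in csv_input:
        if not row:
            continue  # skip blank lines
        url = row[0].strip()  # assumed: the URL is the first ';'-separated column
        html = requests.get(url).content
        soup = BeautifulSoup(html, "html.parser")
        banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link', 'html-slide slide has-link']})
        csv_output.writerow([url, len(banner_info)])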
I've written some code using Python to scrape some titles and prices from a webpage and write the results to a CSV file. The script runs fine, but because I'm appending data to the CSV file inside the loop, the headers are written on every iteration: if the loop runs 4 times, the headers are written 4 times. How can I fix it so that the headers are written only once? Thanks.
This is the script:
import csv
import requests
from bs4 import BeautifulSoup
diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']
for link in diction_page:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)

    with open('item.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Title", "Price"])
        writer.writerow([title, price])
As an option you can try this:
import csv
import requests
from bs4 import BeautifulSoup
diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']
for i, link in enumerate(diction_page):
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)

    with open('item.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        if i == 0:
            writer.writerow(["Title", "Price"])
        writer.writerow([title, price])
Don't write the headers in the for loop:
import csv
import requests
from bs4 import BeautifulSoup
diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']
outfile = open('item.csv','w',newline='')
writer = csv.writer(outfile)
writer.writerow(["Title","Price"])
for link in diction_page:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)
    writer.writerow([title, price])
outfile.close()
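An equivalent variant of the same idea (just a sketch) uses a with block so the file is closed automatically even if a request fails:
import csv
import requests
from bs4 import BeautifulSoup

diction_page = ['http://www.bloomberg.com/quote/SPX:IND', 'http://www.bloomberg.com/quote/CCMP:IND']

# Open the file once, write the header once, then write one row per scraped page.
with open('item.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Title", "Price"])
    for link in diction_page:
        soup = BeautifulSoup(requests.get(link).text, 'lxml')
        title = soup.select_one('.name').text.strip()
        price = soup.select_one('.price').text
        writer.writerow([title, price])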
I am trying to scrape a table from ESPN and send the data to a pandas DataFrame in order to export it to Excel. I have completed most of the scraping, but am getting stuck on how to send each 'td' tag to a unique DataFrame cell within my for loop. (Code is below.) Any thoughts? Thanks!
import requests
import urllib.request
from bs4 import BeautifulSoup
import re
import os
import csv
import pandas as pd
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("http://www.espn.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false")

regex = re.compile("^[e-o]")
for record in soup.findAll('tr', {"class": regex}):
    for data in record.findAll('td'):
        print(data)
I was actually scraping sports websites recently while working on a daily fantasy sports algorithm for a class. This is the script I wrote up; perhaps this approach can work for you: build a dictionary, then convert it to a DataFrame.
import requests
import pandas as pd
from bs4 import BeautifulSoup

# The {0} and {1} placeholders were filled in elsewhere in the original script, e.g. via url.format(year, mode)
url = "http://www.footballdb.com/stats/stats.html?lg=NFL&yr={0}&type=reg&mode={1}&limit=all"
result = requests.get(url)
c = result.content

# Set as Beautiful Soup Object
soup = BeautifulSoup(c, "html.parser")

# Go to the section of interest
tables = soup.find("table", {'class': 'statistics'})

data = {}
headers = {}

for i, header in enumerate(tables.findAll('th')):
    data[i] = {}
    headers[i] = str(header.get_text())

table = tables.find('tbody')

for r, row in enumerate(table.select('tr')):
    for i, cell in enumerate(row.select('td')):
        try:
            data[i][r] = str(cell.get_text())
        except:
            # strip_non_ascii is a helper defined elsewhere in the original script
            stat = strip_non_ascii(cell.get_text())
            data[i][r] = stat

for i, name in enumerate(tables.select('tbody .left .hidden-xs a')):
    data[0][i] = str(name.get_text())

df = pd.DataFrame(data=data)
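Since the original goal was exporting to Excel, a possible follow-up (just a sketch; it assumes the headers dict lines up with the integer column indices of data) is to attach the scraped header names and write the DataFrame out with pandas:
# Rename the integer column indices to the scraped header text, then export.
# Writing .xlsx requires an Excel engine such as openpyxl to be installed.
df = df.rename(columns=headers)
df.to_excel("stats.xlsx", index=False)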
I'm trying to save all the data (i.e. all pages) in a single CSV file, but this code only saves the final page's data. For example, here url[] contains 2 URLs, and the final CSV only contains the 2nd URL's data.
I'm clearly doing something wrong in the loop, but I don't know what.
Also, this page contains 100 data points, but this code only writes the first 44 rows.
Please help with this issue.
from bs4 import BeautifulSoup
import requests
import csv
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content)
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    gen_list = []
    for row in g_data:
        try:
            name = row.text
        except:
            name = ''
        try:
            link = "http://sfbay.craigslist.org" + row.get("href")
        except:
            link = ''
        gen = [name, link]
        gen_list.append(gen)

with open('filename2.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in gen_list:
        writer.writerow(row)
The gen_list is being re-initialized inside your loop that runs over the URLs:
gen_list=[]
Move this line outside the for loop.
...
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]
gen_list=[]
for ur in url:
...
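With that one change, the overall flow of the original script looks roughly like this (a sketch only; the try/except blocks are dropped for brevity, and on Python 3 the file should be opened with 'w', newline=''):
gen_list = []
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content, "html.parser")
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    for row in g_data:
        gen_list.append([row.text, "http://sfbay.craigslist.org" + row.get("href")])

# Write everything collected from all pages once, after the loop.
with open('filename2.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in gen_list:
        writer.writerow(row)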
I found your post later; you might want to try this method:
import requests
from bs4 import BeautifulSoup
import csv
final_data = []
url = "https://sfbay.craigslist.org/search/sss"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all(class_="result-row")
for details in get_details:
    getclass = details.find_all(class_="hdrlnk")
    for link in getclass:
        link1 = link.get("href")
        sublist = []
        sublist.append(link1)
        final_data.append(sublist)
print(final_data)
filename = "sfbay.csv"
with open("./"+filename, "w") as csvfile:
csvfile = csv.writer(csvfile, delimiter = ",")
csvfile.writerow("")
for i in range(0, len(final_data)):
csvfile.writerow(final_data[i])
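Since the original script also collected the listing title next to each link, the same loop can store both in one row (a sketch built on the answer above):
final_data = []
for details in get_details:
    for link in details.find_all(class_="hdrlnk"):
        # title text plus the link in a single CSV row
        final_data.append([link.text, link.get("href")])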