I am very new to Python and I am trying to learn on my own by doing some simple web scraping to get football stats.
I have been successful in getting the data for a single page at a time, but I have not been able to figure out how to add a loop into my code to scrape multiple pages at once (or multiple positions/years/conferences for that matter).
I have searched a fair amount on this and other websites but I can't seem to get it right.
Here's my code:
import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=1&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)

table = soup.find('table', attrs={'class': 'data-table1'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

#for line in list_of_rows: print ', '.join(line)

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
writer.writerows(list_of_rows)
outfile.close()
Here's my attempt at adding a variable into the URL and building a loop:
import csv
import requests
from BeautifulSoup import BeautifulSoup

pagelist = ["1", "2", "3"]

x = 0
while (x < 500):
    url = "http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p="+str(x)).read(),'html'+"&d-447263-s=RUSHING_ATTEMPTS_PER_GAME_AVG&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=RUSHING&conference=null&qualified=false"
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&nbsp;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Att", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Long", "1st", "1st%", "20+", "40+", "FUM"])
    writer.writerows(list_of_rows)
    x = x + 0

outfile.close()
Thanks much in advance.
Here's my revised code; it seems to delete each page's data as it writes the next page to the CSV file.
import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?tabSeq=0&season=2014&seasonType=REG&experience=&Submit=Go&archive=false&d-447263-p=%s&conference=null&statisticCategory=PASSING&qualified=false'

for p in ['1', '2', '3']:
    url = url_template % p
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&nbsp;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

    outfile = open("./2014Passing.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
    writer.writerows(list_of_rows)
    outfile.close()
Assuming that you just want to change the page number, you could do something like this and use string formatting:
url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'

for page in [1, 2, 3]:
    url = url_template % page
    response = requests.get(url)

    # Rest of the processing code can go here

    outfile = open("./2014.csv", "ab")
    writer = csv.writer(outfile)
    writer.writerow(...)
    writer.writerows(list_of_rows)
    outfile.close()
Note that you should open the file in append mode ("ab") instead of write mode ("wb"), as the latter overwrites existing contents, as you've experienced. Using append mode, the new contents are written at the end of the file.
This is outside the scope of the question, and more of a friendly code improvement suggestion, but the script would become easier to think about if you split it up into smaller functions that each do one thing, e.g. fetching the data from the site, writing it to CSV, and so on.
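For example, here is a rough sketch of that split, reusing the page-number template and the BeautifulSoup 3 calls from above; the function names are just placeholders:

import csv
import requests
from BeautifulSoup import BeautifulSoup

def fetch_rows(url):
    # Download one stats page and return its table rows as lists of cell text.
    soup = BeautifulSoup(requests.get(url).content)
    table = soup.find('table', attrs={'class': 'data-table1'})
    return [[cell.text for cell in row.findAll('td')] for row in table.findAll('tr')]

def write_csv(path, header, rows):
    # Open the file once, write the header once, then write every collected row.
    outfile = open(path, "wb")
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)
    outfile.close()

all_rows = []
for page in [1, 2, 3]:
    all_rows.extend(fetch_rows(url_template % page))  # url_template as defined above
write_csv("./2014.csv", ["Rk", "Player", "Team"], all_rows)  # pass the full header list here

Collecting all pages into one list and opening the output file a single time also sidesteps the overwrite/append question entirely.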
I have a BeautifulSoup script which scrapes the pages linked from this page: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html
My goal is to save a CSV file whose file name is the webpage title. The title is the crypto address of the page the data was gathered from.
For example, this web page: https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX
Would be saved as "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv"
To save the webpage title as the csv name, I am using a piece of code which gathers the title from the webpage, and assigns it to a variable called filename.
This is my code which creates the filename:
ad2 = (soup.title.string)
ad2 = ad2.replace('Dogecoin', '')
ad2 = ad2.replace('Address', '')
ad2 = ad2.replace('-', '')
filename = ad2.replace(' ', '')
When the CSV is written using that filename, the data inside it does not match the filename.
For example, when the script runs and saves the csv name as "DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX.csv", the data in the CSV is not the correct data for the https://bitinfocharts.com/dogecoin/address/DKGpr71bR3h8RaQJNjVSboo3Xwa11wX1aX web page.
What I think is happening is the script is reading the wrong web page title and thus the CSV is created using the incorrect filename.
import csv
import requests
from bs4 import BeautifulSoup as bs
from datetime import datetime

headers = []
datarows = []

# define 1-1-2020 as a datetime object
after_date = datetime(2020, 1, 1)

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')

    # select all tr elements (minus the first one, which is the header)
    table_elements = soup.select('tr')[1:]
    address_links = []
    for element in table_elements:
        children = element.contents  # get children of table element
        url = children[1].a['href']
        last_out_str = children[8].text
        # check to make sure the date field isn't empty
        if last_out_str != "":
            # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
            last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
            # if check to see if the date is after 2020/1/1
            if last_out > after_date:
                address_links.append(url)

    for url in address_links:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")

        # Get the profit
        sections = soup.find_all(class_='table-striped')
        for section in sections:
            oldprofit = section.find_all('td')[11].text
            removetext = oldprofit.replace('USD', '')
            removetext = removetext.replace(' ', '')
            removetext = removetext.replace(',', '')
            profit = float(removetext)

        # Compare profit to goal
        goal = float(50000)
        if profit < goal:
            continue

        if table:
            ad2 = (soup.title.string)
            ad2 = ad2.replace('Dogecoin', '')
            ad2 = ad2.replace('Address', '')
            ad2 = ad2.replace('-', '')
            filename = ad2.replace(' ', '')
            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

                fcsv = csv.writer(open(f'{filename}.csv', 'w', newline=''))
                fcsv.writerow(headers)
                fcsv.writerows(datarows)
Any help is greatly appreciated. Thank you.
You're reopening the file every time through the for loop, which empties the file and loses what you wrote on the previous iterations.
You should open the file once before the loop so you can write everything.
Also, you should initialize datarows to an empty list when processing each file. Otherwise you're combining the rows of all the pages you're scraping.
if table:
    ad2 = (soup.title.string)
    ad2 = ad2.replace('Dogecoin', '')
    ad2 = ad2.replace('Address', '')
    ad2 = ad2.replace('-', '')
    filename = ad2.replace(' ', '')
    with open(f'{filename}.csv', 'w', newline='') as f:
        fcsv = csv.writer(f)
        datarows = []
        for row in table.find_all('tr'):
            heads = row.find_all('th')
            if heads:
                headers = [th.text for th in heads]
            else:
                datarows.append([td.text for td in row.find_all('td')])
        fcsv.writerow(headers)
        fcsv.writerows(datarows)
I have been working on web scraping the infobox information on Wikipedia. This is the code I have been using:
import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union', 'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union', 'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help', 'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union', 'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

for url in URL:
    headers = []
    rows = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', class_='infobox')
    credit_union_name = soup.find('h1', id="firstHeading")
    header_tags = table.find_all('th')
    headers = [header.text.strip() for header in header_tags]
    data_rows = table.find_all('tr')
    for row in data_rows:
        value = row.find_all('td')
        beautified_value = [dp.text.strip() for dp in value]
        if len(beautified_value) == 0:
            continue
        rows.append(beautified_value)
    rows.append("")
    rows.append([credit_union_name.text.strip()])
    rows.append([url])
    with open(r'credit_unions.csv', 'a+', newline="") as output:
        writer = csv.writer(output)
        writer.writerow(headers)
        writer.writerow(rows)
However, when I check the CSV file, the information is not presented in tabular form. The scraped elements are stored as nested lists instead of a single list per URL. I need the scraped information for each URL to be stored in a single list and written to the CSV file in tabular form under the headings. I need help with this.
The infoboxes have different structures and labels. So I think the best way to solve this is to use dicts and a DictWriter.
import requests
import csv
from bs4 import BeautifulSoup

URL = ['https://en.wikipedia.org/wiki/Workers_Credit_Union',
       'https://en.wikipedia.org/wiki/San_Diego_County_Credit_Union',
       'https://en.wikipedia.org/wiki/USA_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/Commonwealth_Credit_Union',
       'https://en.wikipedia.org/wiki/Center_for_Community_Self-Help',
       'https://en.wikipedia.org/wiki/ESL_Federal_Credit_Union',
       'https://en.wikipedia.org/wiki/State_Employees_Credit_Union',
       'https://en.wikipedia.org/wiki/United_Heritage_Credit_Union']

csv_headers = set()
csv_rows = []

for url in URL:
    csv_row = {}
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    credit_union_name = soup.find('h1', id="firstHeading")
    table = soup.find('table', class_='infobox')
    data_rows = table.find_all('tr')
    for data_row in data_rows:
        label = data_row.find('th')
        value = data_row.find('td')
        if label is None or value is None:
            continue
        beautified_label = label.text.strip()
        beautified_value = value.text.strip()
        csv_row[beautified_label] = beautified_value
        csv_headers.add(beautified_label)
    csv_row["name"] = credit_union_name.text.strip()
    csv_row["url"] = url
    csv_rows.append(csv_row)

with open(r'credit_unions.csv', 'a+', newline="") as output:
    headers = ["name", "url"]
    headers += sorted(csv_headers)
    writer = csv.DictWriter(output, fieldnames=headers)
    writer.writeheader()
    writer.writerows(csv_rows)
I want to scrape the exchange rate information from this website and then load it into a database: https://www.mnb.hu/arfolyamok
I wrote this code, but something is wrong with it. How can I fix it, and what do I have to change?
I am working with Python 2.7.13 on Windows 7.
The code is here:
import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'https://www.mnb.hu/arfolyamok'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)

table = soup.find('tbody', attrs={'class': 'stripe'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

print list_of_rows

outfile = open("./inmates.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Pénznem", "Devizanév", "Egység", "Forintban kifejezett érték"])
writer.writerows(list_of_rows)
Add # coding=utf-8 to the top of your code. This will help solve the SyntaxError you are receiving. Also make sure your indentation is correct!
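For illustration, the top of the script would then look like this (everything else stays the same); the declaration is what lets Python 2 accept non-ASCII literals such as "Pénznem" in the source file:

# coding=utf-8
import csv
import requests
from BeautifulSoup import BeautifulSoup
# ... rest of the script unchanged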
So I have working code that pulls data from 30 websites on a domain.
with open("c:\source\list.csv") as f:
    for row in csv.reader(f):
        for url in row:
            r = requests.get(url)
            soup = BeautifulSoup(r.content, 'lxml')
            tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"})
            for rows in tables.find_all('tr', {'releasetype': 'Current_Releases'})[0::1]:
                item = []
                for val in rows.find_all('td'):
                    item.append(val.text.strip())
                with open('c:\source\output_file.csv', 'w', newline='') as f:
                    writer = csv.writer(f)
                    writer.writerow({url})
                    writer.writerows(item)
When I open the CSV file, I see each character taken from the 'Item' variable is stored in its own cell. I can't seem to find out what the heck is doing this and how to fix it.
Any thoughts?
I fixed this by changing
writer.writerows(item)
to
writer.writerow(item)
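The reason: writerow() writes its argument as a single row, while writerows() expects a sequence of rows. Since item is a list of strings, writerows(item) treats each string as its own row, and iterating over a string yields its characters, so every character lands in a separate cell. A minimal sketch with made-up values:

import csv
import sys

item = ["HP ProLiant", "Current"]  # one scraped row (made-up values)
writer = csv.writer(sys.stdout)

writer.writerow(item)   # one row:  HP ProLiant,Current
writer.writerows(item)  # each string becomes a row of single characters:
                        # H,P, ,P,r,o,L,i,a,n,t
                        # C,u,r,r,e,n,t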
I want Python 3.6 to write the output of the following code into a CSV. It would be very nice to have it like this: one row for every article (it's a news website) and four columns with "Title", "URL", "Category" [#Politik, etc.], "PublishedAt".
from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})
for a in div.find_all('a', title=True):
    print(a.text, a.find_next_sibling('span').text)
    print(a.get('href'))
For writing to a csv I already have this...
with open('%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f'), 'w', newline='',
          encoding='utf-8') as file:
    w = csv.writer(file, delimiter="|")
    w.writerow([...])
...and need to know what to do next. Thanks in advance!
You can collect all the desired extracted fields into a list of dictionaries and use the csv.DictWriter to write to the CSV file:
import csv
import datetime
from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

articles = []
for a in soup.select(".schlagzeilen-content.schlagzeilen-overview a[title]"):
    category, published_at = a.find_next_sibling(class_="headline-date").get_text().split(",")
    articles.append({
        "Title": a.get_text(),
        "URL": a.get('href'),
        "Category": category.strip(" ()"),
        "PublishedAt": published_at.strip(" ()")
    })

filename = '%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f')
with open(filename, 'w', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Category", "PublishedAt"])
    writer.writeheader()
    writer.writerows(articles)
Note how we locate the category and the "published at" value - we go to the next sibling element and split its text on the comma, stripping out the extra parentheses.
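For example, if the sibling's text were " (Politik, 14:35)" (a made-up value for illustration), the split and strip would behave like this:

text = " (Politik, 14:35)"            # made-up example of the sibling text
category, published_at = text.split(",")
print(category.strip(" ()"))          # Politik
print(published_at.strip(" ()"))      # 14:35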