I am currently working on a school project where I am scraping results from a cycling website. I managed to build the scraper to loop through all the urls containing the results. I would like to add the event title to the first column of every table but am facing some difficulties.
Here is my code:
# list of needed packages
import requests
from bs4 import BeautifulSoup
import time
import csv
# create a list of urls to scrape
urls = ['https://cqranking.com/men/asp/gen/race.asp?raceid=36151', 'https://cqranking.com/men/asp/gen/race.asp?raceid=36151']

# Generate a csv file named cycling_results.csv with the wanted headers
with open('cycling_results.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    writer.writerow(['Start', 'Rank', '', '', '', 'Name', '', 'Team', '', 'Time', '', 'Points'])

    # loop through all urls in the list
    for url in urls:
        time.sleep(2)
        response = requests.get(url)
        data = response.content
        soup = BeautifulSoup(data, 'html.parser')

        # Find the title of the racing event
        titles = soup.find('title')
        for title in titles:
            writer.writerow(title)

        tables = soup.find_all('table')
        for table in tables:
            rows = table.find_all('tr')
            for row in rows:
                csv_row = []
                columns = row.find_all('td')
                for column in columns:
                    csv_row.append(column.get_text())
                writer.writerow(csv_row)
In the next phase I will add code to remove empty rows.
Thank you
Regards
Kevin
This code:

    titles = soup.find('title')
    for title in titles:
        writer.writerow(title)

should be:

    title = soup.find('title')
    writer.writerow([title.text])

find() returns just a single element, not a list of elements, so there is nothing to iterate over. Write the element's text (or whichever piece of it you need) to the CSV, not the full element.
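Building on that, here is a minimal sketch of how the loop body could look so the event title ends up in the first column of every data row, with empty rows skipped (which also covers the planned next phase). It assumes the page's <title> tag holds the event name and that it slots into the existing with open(...) block from the question:

    for url in urls:
        time.sleep(2)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # grab the event title once per page
        title_tag = soup.find('title')
        event_title = title_tag.get_text(strip=True) if title_tag else ''

        for table in soup.find_all('table'):
            for row in table.find_all('tr'):
                csv_row = [column.get_text() for column in row.find_all('td')]
                # skip rows that contain no data at all
                if not any(cell.strip() for cell in csv_row):
                    continue
                # prepend the event title as the first column
                writer.writerow([event_title] + csv_row)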
I'm fairly new to Beautiful Soup/Python/web scraping. I have been able to scrape data from a site, but I am only able to export the very first row to a csv file (I want to export all of the scraped data into the file).
I am stumped on how to make this code export ALL scraped data into multiple individual rows:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()
        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        pres_data = {'Name': [name],
                     'Date': [date],
                     'Link': [links]
                     }

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
Any ideas here? I believe I need to loop through the parsed data, grab each set, and push it into the output.
Am I going about this the right way?
Thanks for any insight
The way it is currently set up, you are not adding each link as a new entry; the dictionary is overwritten on every iteration, so only the last link survives. If you initialize a list before the loops and append a dictionary to it on each iteration of the inner for loop, you will keep every row and not just the last one.
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

r = requests.get("https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses")
data = r.content  # Content of response
soup = BeautifulSoup(data, "html.parser")

pres_data = []
for span in soup.find_all("span", {"class": "article"}):
    for link in span.select("a"):
        name_and_date = link.text.split('(')
        name = name_and_date[0].strip()
        date = name_and_date[1].replace(')', '').strip()
        base_url = "https://www.infoplease.com"
        links = link['href']
        links = urljoin(base_url, links)

        this_data = {'Name': name,
                     'Date': date,
                     'Link': links
                     }
        pres_data.append(this_data)

df = pd.DataFrame(pres_data, columns=['Name', 'Date', 'Link'])
df.to_csv(r'C:\Users\ThinkPad\Documents\data_file.csv', index=False, header=True)
print(df)
You don't need to use Pandas here since you aren't applying any data operations to the frame. For a short task like this, try to stick to the built-in libraries.
import requests
from bs4 import BeautifulSoup
import csv


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    target = [([x.a['href']] + x.a.text[:-1].split(' ('))
              for x in soup.select('span.article')]
    with open('data.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Url', 'Name', 'Date'])
        writer.writerows(target)


main('https://www.infoplease.com/primary-sources/government/presidential-speeches/state-union-addresses')
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen(
    "https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm")
bsObj = BeautifulSoup(html, "lxml")

table = bsObj.find('table', id="cont")
rows = table.findAll("tr")

links = [a['href'] for a in table.find_all('a', href=True) if a.text]
new_links = []
for link in links:
    new_links.append(("https://www.accessdata.fda.gov/scripts/drugshortages/" + link).replace(" ", "%20"))

href_rows = []
for link in new_links:
    link = link.replace("®", "%C2%AE")
    html = urlopen(link)
    bsObj_href = BeautifulSoup(html, "lxml")
    #bsObj_href = BeautifulSoup(html.decode('utf-8', 'ignore'))
    div_href = bsObj_href.find("div", {"id": "accordion"})
    href_rows.append(div_href.findAll("tr"))

csvFile = open("drug_shortage.csv", 'wt', newline='')
writer = csv.writer(csvFile)
try:
    for row in rows:
        csvRow = []
        for cell in row.findAll(['td', 'th']):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
finally:
    csvFile.close()
Hello, so I created two sets of rows like that. If you go to this website, https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm, the main table has the drug name and status columns, and when you click a drug name you find four more columns. I would like to combine them (based on drug name) in order, so each output row would be: drug name, status, Presentation, Availability and Estimated Shortage Duration, Related Information, Shortage Reason (per FDASIA).
But the current code only generates the first part (drug name, status). I tried

    for row in rows, rows_href:

but then I get AttributeError: ResultSet object has no attribute 'findAll'. I get the same error for

    for row in rows_href:

Any suggestions on how to generate the output I want?
Your code is too chaotic. You get all the rows, then all the links, and then you try to get all the other information, but this way you can't control which values get joined into a row. The biggest problem appears when some row has no data on its subpage: all the data below it shifts up by one row.
You should get all rows from the table on the main page and then use a for loop to work with every row separately, getting the other elements only for that single row - read the link only for this row, get the data from its subpage only for this row, etc. - and put all the data for this row into the results as a sublist [name, status, link, presentation, availability, related, reason]. Only after that do you move on and work with the data for the next row.
BTW: because a subpage may have many rows, I create many rows in data with the same name and status but different other values:
[name, status, values from first row on subpage]
[name, status, values from second row on subpage]
[name, status, values from each further row on subpage]
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

html = urlopen("https://www.accessdata.fda.gov/scripts/drugshortages/default.cfm")
bsObj = BeautifulSoup(html, "lxml")

# list for all rows with all values
data = []

# get table on main page
table = bsObj.find('table', {'id': 'cont'})

# work with every row separately
for row in table.find_all("tr")[1:]:  # use `[1:]` to skip header

    # get columns only in this row
    cols = row.find_all('td')

    # get name and url from first column
    link = cols[0].find('a', href=True)
    name = link.text.strip()
    url = link['href']
    url = "https://www.accessdata.fda.gov/scripts/drugshortages/" + url
    url = url.replace(" ", "%20").replace("®", "%C2%AE")
    print('name:', name)
    print('url:', url)

    # get status from second column
    status = cols[1].text.strip()
    print('status:', status)

    # subpage
    html = urlopen(url)
    bsObj_href = BeautifulSoup(html, "lxml")

    subtable = bsObj_href.find("table")

    if not subtable:
        data.append([name, status, url, '', '', '', ''])
        print('---')
    else:
        for subrows in subtable.find_all('tr')[1:]:  # use `[1:]` to skip header
            #print(subrows)
            subcols = subrows.find_all('td')
            presentation = subcols[0].text.strip()
            availability = subcols[1].text.strip()
            related = subcols[2].text.strip()
            reason = subcols[3].text.strip()
            data.append([name, status, url, presentation, availability, related, reason])
            print(presentation, availability, related, reason)
            print('---')

    print('----------')

with open("drug_shortage.csv", 'wt', newline='') as csvfile:
    writer = csv.writer(csvfile)

    # write header - one row - using `writerow` without `s` at the end
    #writer.writerow(['Name', 'Status', 'Link', 'Presentation', 'Availability', 'Related', 'Reason'])

    # write data - many rows - using `writerows` with `s` at the end
    writer.writerows(data)

# no need to close the file because `with` does it
I have a CSV file with 45k+ rows, each containing a different path on the same domain (the pages are structurally identical to each other), and every single one is a working link. I managed to use BeautifulSoup to scrape the title and content of each one and, through the print function, I was able to validate the scraper. However, when I try to export the gathered information to a new CSV file, I only get the last URL's street name and description, not all of them as I expected.
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        site = requests.get(row['addresses']).text
        soup = BeautifulSoup(site, 'lxml')
        StreetName = soup.find('div', class_='hist-title').text
        Description = soup.find('div', class_='hist-content').text

with open('OutputList.csv', 'w', newline='') as output:
    Header = ['StreetName', 'Description']
    writer = csv.DictWriter(output, fieldnames=Header)
    writer.writeheader()
    writer.writerow({'StreetName': StreetName, 'Description': Description})
How can I make the output CSV contain, on each row, the street name and description for the corresponding URL row of the input CSV file?
You need to open both files on the same level and then read and write on each iteration. Something like this:
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as a, open('OutputList.csv', 'w') as b:
    reader = csv.reader(a)
    writer = csv.writer(b, quoting=csv.QUOTE_ALL)
    writer.writerow(['StreetName', 'Description'])

    # Assuming url is the first field in the CSV
    for url, *_ in reader:
        r = requests.get(url)
        if r.ok:
            soup = BeautifulSoup(r.text, 'lxml')
            street_name = soup.find('div', class_='hist-title').text.strip()
            description = soup.find('div', class_='hist-content').text.strip()
            writer.writerow([street_name, description])
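If the input file has the addresses column from the question but it is not necessarily the first field, a DictReader variant keeps the lookup by column name; a sketch under that assumption:

    from bs4 import BeautifulSoup
    import requests
    import csv

    with open('URLs.csv') as a, open('OutputList.csv', 'w', newline='') as b:
        reader = csv.DictReader(a)
        writer = csv.writer(b, quoting=csv.QUOTE_ALL)
        writer.writerow(['StreetName', 'Description'])

        # one request and one output row per input row
        for row in reader:
            r = requests.get(row['addresses'])
            if r.ok:
                soup = BeautifulSoup(r.text, 'lxml')
                street_name = soup.find('div', class_='hist-title').text.strip()
                description = soup.find('div', class_='hist-content').text.strip()
                writer.writerow([street_name, description])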
I hope it helps.
I am trying to scrape from the first page to page 14 of this website: https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All
Here is my code:
import requests as r
from bs4 import BeautifulSoup as soup
import pandas

#make a list of all web pages' urls
webpages = []
for i in range(15):
    root_url = 'https://cross-currents.berkeley.edu/archives?author=&title=&type=All&issue=All&region=All&page=' + str(i)
    webpages.append(root_url)

print(webpages)

#start looping through all pages
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    #find targeted info and put them into a list to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

    #export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    df.index.name = 'ArticleID'
    df.to_csv('example31.csv', encoding="utf-8")
The output csv file only contains targeted info of the last page. When I print "webpages", it shows that all the pages' urls have been properly put into the list. What am I doing wrong? Thank you in advance!
You are simply overwriting the same output CSV file for all the pages. You can call .to_csv() in "append" mode to have the new data added to the end of the existing file:
df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)
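One caveat with append mode: since header=False is passed on every call, the file ends up with no header row at all, and re-running the script keeps appending to the old file. A possible workaround (my assumption, not part of the original answer) is to write the header only when the file does not exist yet:

    import os

    # hypothetical guard: header only on the very first write of this run
    write_header = not os.path.exists('example31.csv')
    df.to_csv('example31.csv', mode='a', encoding="utf-8", header=write_header)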
Or, even better would be to collect the titles into a list of titles and then dump into a CSV once:
#start looping through all pages
titles = []
for item in webpages:
headers = {'User-Agent': 'Mozilla/5.0'}
data = r.get(item, headers=headers)
page_soup = soup(data.text, 'html.parser')
#find targeted info and put them into a list to be exported to a csv file via pandas
title_list = [title.text for title in page_soup.find_all('div', {'class':'field field-name-node-title'})]
titles += [el.replace('\n', '') for el in title_list]
# export to csv file via pandas
dataset = [{'Title': title} for title in titles]
df = pandas.DataFrame(dataset)
df.index.name = 'ArticleID'
df.to_csv('example31.csv', encoding="utf-8")
Another way, in addition to what alexce posted, would be to keep appending the DataFrame built inside the loop to a new DataFrame and then write that one to the CSV.
Declare finalDf as a dataframe outside the loops:
finalDf = pandas.DataFrame()
Later do this:
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    page_soup = soup(data.text, 'html.parser')

    #find targeted info and put them into lists to be exported to a csv file via pandas
    title_list = [title.text for title in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
    title = [el.replace('\n', '') for el in title_list]

    #export to csv file via pandas
    dataset = {'Title': title}
    df = pandas.DataFrame(dataset)
    finalDf = finalDf.append(df)
    #df.index.name = 'ArticleID'
    #df.to_csv('example31.csv', mode='a', encoding="utf-8", header=False)

finalDf = finalDf.reset_index(drop=True)
finalDf.index.name = 'ArticleID'
finalDf.to_csv('example31.csv', encoding="utf-8")
Notice the lines that use finalDf.
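A side note, not part of the original answer: DataFrame.append was deprecated and later removed in newer pandas releases, so on a current install the same idea is usually expressed by collecting the per-page frames in a list and concatenating once at the end. A sketch, reusing the webpages list and imports from the question:

    # collect one small frame per page, then concatenate once
    frames = []
    for item in webpages:
        headers = {'User-Agent': 'Mozilla/5.0'}
        data = r.get(item, headers=headers)
        page_soup = soup(data.text, 'html.parser')
        titles = [el.text.replace('\n', '') for el in page_soup.find_all('div', {'class': 'field field-name-node-title'})]
        frames.append(pandas.DataFrame({'Title': titles}))

    finalDf = pandas.concat(frames, ignore_index=True)
    finalDf.index.name = 'ArticleID'
    finalDf.to_csv('example31.csv', encoding="utf-8")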
import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
from lxml import html

base_url = 'http://www.pro-football-reference.com'  # base url for concatenation
data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm")  # website for scraping
soup = BeautifulSoup(data.content)

list_of_cells = []
for link in soup.find_all('a'):
    if link.has_attr('href'):
        if link.get_text() == 'boxscore':
            url = base_url + link['href']
            for x in url:
                response = requests.get('x')
                html = response.content
                soup = BeautifulSoup(html)
                table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                for row in table.findAll('tr'):
                    for cell in row.findAll('td'):
                        text = cell.text.replace(' ', '')
                        list_of_cells.append(text)
print list_of_cells
I am using this code to get all the boxscore urls from http://www.pro-football-reference.com/years/2014/games.htm. After I get these boxscore urls, I would like to loop through them to scrape the quarter-by-quarter data for each team, but my syntax always seems to be off no matter how I format the code.
If it is possible I would like to scrape more than just the scoring data by also getting the Game Info, officials, and Expected points per game.
If you modify your loop slightly to:
for link in soup.find_all('a'):
    if not link.has_attr('href'):
        continue

    if link.get_text() != 'boxscore':
        continue

    url = base_url + link['href']
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    # Scores
    table = soup.find('table', attrs={'id': 'scoring'})
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            text = cell.text.replace(' ', '')
            list_of_cells.append(text)
    print list_of_cells
That returns each of the cells for each row in the scoring table for each page linked to with the 'boxscore' text.
The issues I found with the existing code were:
- You were attempting to loop through each character in the href returned for the 'boxscore' link.
- You were always requesting the literal string 'x'.
- Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than by class. Ids at least should be unique within the page (though there is no guarantee).
I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table', ...)), but move the code that parses that data, e.g....
for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        text = cell.text.replace(' ', '')
        list_of_cells.append(text)
print list_of_cells
...into a separate function that returns said data (one for each type of data you are extracting), just to keep the code slightly more manageable. The more the code indents to handle if tests and for loops the more difficult it tends to be to follow the flow. For example:
score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)
other_table = soup.find('table', attrs={'id': 'other'})
other_data = parse_other_table(other_table)
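For illustration, a minimal sketch of what one of those helpers might look like; parse_score_table is only the hypothetical name used in the snippet above, not code from the original answer:

    def parse_score_table(table):
        # Return the text of every cell in the given table, one list per row.
        rows = []
        if table is None:
            return rows
        for row in table.findAll('tr'):
            cells = [cell.get_text(strip=True) for cell in row.findAll('td')]
            if cells:
                rows.append(cells)
        return rows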