Writing data scraped from an HTML table to a CSV file - Python
I'm trying to figure out what the next step should be to convert my web scrape to CSV.
I've tried putting every column into an individual list, but I feel like that is not the right approach.
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/years/2018/passing.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# tb should point at the passing-stats table; the id used here is assumed, adjust if needed
tb = soup.find('table', id='passing')

for row in tb.find_all('tr'):
    i = row.get_text()
    print(i)
This should work:

import csv  # quite crucial

final_table = []
for row in tb.find_all('tr'):
    # collect the text of each cell so every table column becomes its own CSV column
    cells = [cell.get_text() for cell in row.find_all(['th', 'td'])]
    final_table.append(cells)

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_table)
Use the csv module. We'll grab the headers with soup.find("tr").find_all("th"), then loop over the body rows and write each one to the CSV file. The first cell of each body row is a <th>, so we handle it separately and prepend it to the <td> data. Note that the staggered header rows repeated every 30 lines are skipped.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/years/2018/passing.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("output.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([x.get_text() for x in soup.find("tr").find_all("th")])
    for row in soup.find_all("tr"):
        data = [x.get_text() for x in row.find_all("td")]
        if data:  # header rows, including the repeated in-body ones, have no <td> and are skipped
            writer.writerow([row.find("th").get_text()] + data)
Output (just the top few rows):
Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
Check this thread if you see extra newlines in the CSV result on Windows.
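The short version of that fix: open the file with newline='' so the csv module controls line endings itself. A minimal sketch of just the open call:

import csv

# newline='' stops the csv module's \r\n endings being translated again by text mode,
# which is what produces the blank rows on Windows
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)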
Related
csv.writer not writing entire output to CSV file
I am attempting to scrape the artists' Spotify streaming rankings from Kworb.net into a CSV file and I've nearly succeeded, except I'm running into a weird issue. The code below successfully scrapes all 10,000 of the listed artists into the console:

import requests
from bs4 import BeautifulSoup
import csv

URL = "https://kworb.net/spotify/artists.html"
result = requests.get(URL)
src = result.content
soup = BeautifulSoup(src, 'html.parser')

table = soup.find('table', id="spotifyartistindex")

header_tags = table.find_all('th')
headers = [header.text.strip() for header in header_tags]

rows = []
data_rows = table.find_all('tr')

for row in data_rows:
    value = row.find_all('td')
    beautified_value = [dp.text.strip() for dp in value]
    print(beautified_value)
    if len(beautified_value) == 0:
        continue
    rows.append(beautified_value)

The issue arises when I use the following code to save the output to a CSV file:

with open('artist_rankings.csv', 'w', newline="") as output:
    writer = csv.writer(output)
    writer.writerow(headers)
    writer.writerows(rows)

For whatever reason, only 738 of the artists are saved to the file. Does anyone know what could be causing this? Thanks so much for any help!
As an alternative approach, you might want to make your life easier next time and use pandas. Here's how:

import requests
import pandas as pd

source = requests.get("https://kworb.net/spotify/artists.html")
df = pd.concat(pd.read_html(source.text, flavor="bs4"))
df.to_csv("artists.csv", index=False)

This outputs a .csv file with 10,000 artists.
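If the page ever gains extra tables, read_html can also be restricted to just the rankings table by matching its id (the spotifyartistindex id comes from the question's own code); a small sketch with the attrs filter as the only change:

import requests
import pandas as pd

source = requests.get("https://kworb.net/spotify/artists.html")
# attrs limits parsing to tables whose HTML attributes match, here the artist index table
tables = pd.read_html(source.text, flavor="bs4", attrs={"id": "spotifyartistindex"})
tables[0].to_csv("artists.csv", index=False)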
BeautifulSoup Scraping Formatting
This is my first time using BeautifulSoup and I am attempting to scrape store location data from a local convenience store. However, I'm running into some issues trying to remove empty lines when the data is written to a CSV file; I've tried .replace('\n','') and .strip(), but neither worked. I'm also having problems splitting data that is scraped and contained in the same sibling method. I've added the script below:

from bs4 import BeautifulSoup
from requests import get
import urllib.request
import sched, time
import csv

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
#print (soup.prettify())

#open a file for writing
location_data = open('data/soupdata.csv', 'w', newline='')

#create the csv writer object
csvwriter = csv.writer(location_data)

cheers = soup.find('div', id="store_container")

count = 0

#Loop for Header tags
for paragraph in cheers.find_all('b'):
    header1 = paragraph.text.replace(':', '')
    header2 = paragraph.find_next('b').text.replace(':', '')
    header3 = paragraph.find_next_siblings('b')[1].text.replace(':', '')
    if count == 0:
        csvwriter.writerow([header1, header2, header3])
        count += 1
    break

for paragraph in cheers.find_all('br'):
    brnext = paragraph.next_sibling.strip()
    brnext1 = paragraph.next_sibling
    test1 = brnext1.next_sibling.next_sibling
    print(test1)
    csvwriter.writerow([brnext, test1])

location_data.close()

(Screenshots of the generated output and of what the output should look like were attached.) How can I achieve this? Thanks in advance.
To make it slightly more organized, you can try the following. I've used .select() instead of .find_all().

import csv
from bs4 import BeautifulSoup
import requests

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

with open("output.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Address", "Telephone", "Store hours"])
    for items in soup.select("#store_container .store_col"):
        addr = items.select_one("b").next_sibling.next_sibling
        tel = items.select_one("b:nth-of-type(2)").next_sibling
        store = items.select_one("b:nth-of-type(3)").next_sibling
        writer.writerow([addr, tel, store])
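If the written values still carry stray newlines or surrounding whitespace (the empty-line problem from the question), they can be stripped just before writing; a small tweak to the final writerow, assuming each extracted sibling is a plain text node:

writer.writerow([addr.strip(), tel.strip(), store.strip()])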
You just need to change the way you extract the address, telephone number and store hours:

import csv
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# print (soup.prettify())

# open a file for writing
location_data = open('data/soupdata.csv', 'w', newline='')

# create the csv writer object
csvwriter = csv.writer(location_data)

cheers = soup.find('div', id="store_container")

count = 0

# Loop for Header tags
for paragraph in cheers.find_all('b'):
    header1 = paragraph.text.replace(':', '')
    header2 = paragraph.find_next('b').text.replace(':', '')
    header3 = paragraph.find_next_siblings('b')[1].text.replace(':', '')
    if count == 0:
        csvwriter.writerow([header1, header2, header3])
        count += 1
    break

for paragraph in cheers.find_all('div'):
    label = paragraph.find_all('b')
    if len(label) == 3:
        print(label)
        address = label[0].next_sibling.next_sibling
        tel = label[1].next_sibling
        hours = label[2].next_sibling
        csvwriter.writerow([address, tel, hours])

location_data.close()
How do I create a CSV file with webscraped content from several URLs?
I want to create a CSV file from webscraped content. The content is from FinViz.com. I want to scrape the table from this website 20 times for 20 different stocks and put all the content into a CSV file. Within my code, I generate a list of stocks from a scrape of Twitter content. The list of stocks that is generated is the same list that I want to get information on from the FinViz.com tables. Here is my code:

import csv
import urllib.request
from bs4 import BeautifulSoup

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # scrape single page and add data to list
    # write datalist
    with open('today.csv', 'a') as file:
        writer = csv.writer(file)

        # write header row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))

        # write body row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))

The trouble that I am running into is that my CSV file only has the webscraped data from the last item in the list. Instead I want the entire list in a sequence of rows. Here is what my CSV file looks like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-1.75,7.94%,79.06M,-22.52%,296.48M,-,-1.74,-4.61%,72.41M,-23.16%,-85.70M,-,-0.36,62.00%,3.21%,1.63%,15.10M,19.63,-197.00%,18.05%,2.57,66.67%,-0.65,-,-8.10%,-127.70%,12.17,-6.25%,0.93,4.03,-,146.70%,2.05 - 5.86,3.59%,-,-,-,385.80%,-36.01%,-,-,1.30,-,76.50%,82.93%,0.41,100,1.30,-59.60%,-,36.98,16.13% 9.32%,Yes,-,90.00%,-,0.82,3.63,Yes,-,Nov 08,-,902.43K,3.75,2.30,-22.08%,-10.43%,11.96%,"742,414",3.31%
It would be better to open your output file first, rather than keep opening and closing it for each URL that you fetch. Exception handling is needed to catch cases where the URL does not exist. Also, you should open the file with newline='' to avoid extra empty lines being written to it:

import csv
import urllib.request
from bs4 import BeautifulSoup

write_header = True

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

with open('today.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    for url in url_list:
        try:
            fpage = urllib.request.urlopen(url)
            fsoup = BeautifulSoup(fpage, 'html.parser')

            # write header row (once)
            if write_header:
                writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
                write_header = False

            # write body row
            writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))
        except urllib.error.HTTPError:
            print("{} - not found".format(url))

So today.csv would start like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-10.85,4.60%,2.36M,11.00%,8.09M,-,-,-62.38%,1.95M,-16.14%,-14.90M,-,-,2.30%,10.00%,-44.42%,0.00M,-,21.80%,-5.24%,3.10,-38.16%,1.46,2.35,-,-155.10%,65.00,-50.47%,-,-,-,-238.40%,2.91 - 11.20,-38.29%,-,-,54.50%,-,-69.37%,1.63,-,2.20,-,-,17.87%,0.36,15,2.20,-,-,39.83,11.38% 10.28%,No,0.00,68.70%,-,1.48,3.30,Yes,0.00,Feb 28 AMC,-,62.76K,3.43,1.00,-5.21%,-25.44%,-37.33%,"93,166",3.94%
-,-,-0.26,1.50%,268.98M,3.72%,2.25B,38.05,0.22,-0.64%,263.68M,-9.12%,-55.50M,-,0.05,-,9.96%,-12.26%,1.06B,2.12,-328.10%,25.95%,2.32,17.72%,12.61,0.66,650.00%,-0.90%,12.64,-38.73%,0.03,264.87,-,-1.90%,6.69 - 15.27,-0.48%,-,-,-28.70%,0.00%,-45.17%,2.20,-,0.70,16.40%,67.80%,25.11%,0.41,477,0.80,71.90%,5.30%,52.71,4.83% 5.00%,Yes,0.80,7.80%,-5.20%,0.96,7.78,Yes,0.80,Feb 27 AMC,-,11.31M,8.37,2.20,0.99%,-1.63%,-4.72%,"10,843,026",7.58%

If you only want your file to contain data from one run of the script, you do not need 'a' to append; just use 'w' instead.
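If you also want to know which stock each row came from, a variant of the same loop can prepend the ticker symbol to every body row. This is only a sketch: the tickers and the output file name here are placeholders, and in practice the ticker list would come from the Twitter scrape above.

import csv
import urllib.request
import urllib.error
from bs4 import BeautifulSoup

tweets = ["AAPL", "TSLA"]  # placeholder tickers for the example
url_base = "https://finviz.com/quote.ashx?t="

with open('today_tagged.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for tckr in tweets:
        try:
            fsoup = BeautifulSoup(urllib.request.urlopen(url_base + tckr), 'html.parser')
            # prepend the ticker so every row in the output stays identifiable
            writer.writerow([tckr] + [e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2'})])
        except urllib.error.HTTPError:
            print("{} - not found".format(url_base + tckr))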
Write data into a CSV file
I am crawling data from Wikipedia and it works so far. I can display it on the terminal, but I can't write it the way I need it into a CSV file :-/ The code is pretty long, but I paste it here anyway and hope that somebody can help me.

import csv
import requests
from bs4 import BeautifulSoup


def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text  # Read source code and make unicode
    soup = BeautifulSoup(code, "lxml")  # create BS object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url)  # goes into the individual sites


def get_single_item_data(item_url):
    page = requests.get(item_url).text  # Read source code & format with .text to unicode
    soup = BeautifulSoup(page, "lxml")  # create BS object

    def getInfoBoxBasisDaten(s):
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                       'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                       'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                       'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                      'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                      'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                      'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                      'Oberbürgermeister', 'Oberbürgermeisterin']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
        writer.writeheader()

        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.nextSibling
            while True:
                if not current.name:
                    current = current.nextSibling
                    continue
                if wanted in current.text:
                    items = current.findAll('td')
                    print(BeautifulSoup.get_text(items[0]))
                    print(BeautifulSoup.get_text(items[1]))
                    writer.writerow({i: BeautifulSoup.get_text(items[1])})
                if '<th ' in str(current):
                    break
                current = current.nextSibling


print(spider())

The output is incorrect in two ways: the cells are not in their right places, and only one city is written; all the others are missing. (Screenshots of the actual output and of the expected output, with all the other cities in it, were attached.)
'... only one city is written ...': You call get_single_item_data for each city. Inside this function you open the output file with the same name, in the statement with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:, which overwrites the output file each time the function is called, so only the last city survives.

Each variable is written to a new row: In the statement writer.writerow({i: BeautifulSoup.get_text(items[1])}) you write the value of a single variable to its own row. What you need to do instead is build a dictionary of values before you start looking for the page values. As you collect the values from the page, you put them into the dictionary under the matching field name. Then, after you have found all of the values available, you call writer.writerow once with the complete dictionary.
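A minimal sketch of that pattern, with the field names abbreviated and find_value_for standing in for the infobox-parsing logic, which is not shown here:

import csv

fieldnames = ['Bundesland', 'Einwohner:', 'Fläche:']  # abbreviated for the example

def find_value_for(field):
    # placeholder for the infobox lookup; returns the text found for one label
    return '...'

# in the real script, open the file once outside get_single_item_data
with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';', extrasaction='ignore')
    writer.writeheader()

    row = {}                      # collect all values for one city first
    for field in fieldnames:
        row[field] = find_value_for(field)

    writer.writerow(row)          # then write the city as a single row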
Loop in a Python script with XPath. Why do I only get results from the last URL?
Why do I only get the results from the last URL? The idea is that I get a list of results from both URLs. Also, when writing to the CSV I get an empty row each time. How do I remove this row?

import csv
import requests
from lxml import html
import urllib

TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090", "_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'

for item in TV_category:
    url = url_pattern.format(item)
    page = requests.get(url)
    tree = html.fromstring(page.content)

outfile = open("./tv_test1.csv", "wb")
writer = csv.writer(outfile)

rows = tree.xpath('//*[@id="category"]/ul[2]/li')
for row in rows:
    price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
    product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
    writer.writerow([product_ref, price])
As I explained in the question's comments, you need to put the second for loop inside (at the end of) the first one. Otherwise, only the last URL's results will be saved/written to the CSV file. You don't need to open the file in each loop (a with statement will close it automagically). It is, as well, important to highlight that if you open a file with write flags it will be overwritten, and if that happens inside a loop it will be overwritten on every iteration. I'd refactor your code as follows:

import csv
import requests
from lxml import html
import urllib

TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090", "_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'

with open("./tv_test1.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for item in TV_category:
        url = url_pattern.format(item)
        page = requests.get(url)
        tree = html.fromstring(page.content)

        rows = tree.xpath('//*[@id="category"]/ul[2]/li')
        for row in rows:
            price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
            product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
            writer.writerow([product_ref, price])
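One more note: if you are running this on Python 3, the csv module expects a text-mode file opened with newline='' rather than binary mode, and opening the file that way is also the usual fix for stray blank rows on Windows. A minimal sketch of just the changed open call:

# Python 3: text mode plus newline='' so the csv module controls line endings itself
with open("./tv_test1.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)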