Writing data scraped from an HTML table to a CSV file - Python
I'm trying to figure out what the next step should be to convert my web scrape to CSV.
I've tried putting every column into an individual list, but I feel like that is not the right approach.
from bs4 import BeautifulSoup
import requests

url = 'https://www.pro-football-reference.com/years/2018/passing.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# tb should point at the passing-stats table; the id used here is assumed, adjust if needed
tb = soup.find('table', id='passing')

for row in tb.find_all('tr'):
    i = row.get_text()
    print(i)
This should work:

import csv  # quite crucial

final_table = []
for row in tb.find_all('tr'):
    # collect the text of each cell so every table column becomes its own CSV column
    cells = [cell.get_text() for cell in row.find_all(['th', 'td'])]
    final_table.append(cells)

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_table)
Use the csv module. We'll grab the headers with soup.find("tr").find_all("th"), then loop over the body rows and write each one to the CSV file. The first cell of each body row is a <th>, so we handle it separately and prepend it to the <td> data. Note that the staggered header rows repeated every 30 lines are skipped.
import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/years/2018/passing.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

with open("output.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow([x.get_text() for x in soup.find("tr").find_all("th")])
    for row in soup.find_all("tr"):
        data = [x.get_text() for x in row.find_all("td")]
        if data:  # header rows, including the repeated in-body ones, have no <td> and are skipped
            writer.writerow([row.find("th").get_text()] + data)
Output (just the top few rows):
Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
Check this thread if you see extra newlines in the CSV result on Windows.
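The short version of that fix: open the file with newline='' so the csv module controls line endings itself. A minimal sketch of just the open call:

import csv

# newline='' stops the csv module's \r\n endings being translated again by text mode,
# which is what produces the blank rows on Windows
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)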
Related
csv.writer not writing entire output to CSV file
I am attempting to scrape the artists' Spotify streaming rankings from Kworb.net into a CSV file and I've nearly succeeded, except I'm running into a weird issue. The code below successfully scrapes all 10,000 of the listed artists into the console:

import requests
from bs4 import BeautifulSoup
import csv

URL = "https://kworb.net/spotify/artists.html"
result = requests.get(URL)
src = result.content
soup = BeautifulSoup(src, 'html.parser')

table = soup.find('table', id="spotifyartistindex")

header_tags = table.find_all('th')
headers = [header.text.strip() for header in header_tags]

rows = []
data_rows = table.find_all('tr')

for row in data_rows:
    value = row.find_all('td')
    beautified_value = [dp.text.strip() for dp in value]
    print(beautified_value)
    if len(beautified_value) == 0:
        continue
    rows.append(beautified_value)

The issue arises when I use the following code to save the output to a CSV file:

with open('artist_rankings.csv', 'w', newline="") as output:
    writer = csv.writer(output)
    writer.writerow(headers)
    writer.writerows(rows)

For whatever reason, only 738 of the artists are saved to the file. Does anyone know what could be causing this? Thanks so much for any help!
As an alternative approach, you might want to make your life easier next time and use pandas. Here's how:

import requests
import pandas as pd

source = requests.get("https://kworb.net/spotify/artists.html")
df = pd.concat(pd.read_html(source.text, flavor="bs4"))
df.to_csv("artists.csv", index=False)

This outputs a .csv file with 10,000 artists.
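If the page ever gains extra tables, read_html can also be restricted to just the rankings table by matching its id (the spotifyartistindex id comes from the question's own code); a small sketch with the attrs filter as the only change:

import requests
import pandas as pd

source = requests.get("https://kworb.net/spotify/artists.html")
# attrs limits parsing to tables whose HTML attributes match, here the artist index table
tables = pd.read_html(source.text, flavor="bs4", attrs={"id": "spotifyartistindex"})
tables[0].to_csv("artists.csv", index=False)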
BeautifulSoup Scraping Formatting
This is my first time using BeautifulSoup and I am attempting to scrape store location data from a local convenience store. However, I'm running into some issues trying to remove empty lines when the data is written to a CSV file; I've tried .replace('\n','') and .strip(), but neither worked. I'm also having problems splitting data that is scraped and contained in the same sibling method. I've added the script below:

from bs4 import BeautifulSoup
from requests import get
import urllib.request
import sched, time
import csv

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
#print (soup.prettify())

#open a file for writing
location_data = open('data/soupdata.csv', 'w', newline='')

#create the csv writer object
csvwriter = csv.writer(location_data)

cheers = soup.find('div', id="store_container")

count = 0

#Loop for Header tags
for paragraph in cheers.find_all('b'):
    header1 = paragraph.text.replace(':', '')
    header2 = paragraph.find_next('b').text.replace(':', '')
    header3 = paragraph.find_next_siblings('b')[1].text.replace(':', '')
    if count == 0:
        csvwriter.writerow([header1, header2, header3])
        count += 1
    break

for paragraph in cheers.find_all('br'):
    brnext = paragraph.next_sibling.strip()
    brnext1 = paragraph.next_sibling
    test1 = brnext1.next_sibling.next_sibling
    print(test1)
    csvwriter.writerow([brnext, test1])

location_data.close()

(Screenshots of the generated output and of what the output should look like were attached.) How can I achieve this? Thanks in advance.
To make it slightly more organized, you can try the following. I've used .select() instead of .find_all().

import csv
from bs4 import BeautifulSoup
import requests

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

with open("output.csv", "w", newline="") as infile:
    writer = csv.writer(infile)
    writer.writerow(["Address", "Telephone", "Store hours"])
    for items in soup.select("#store_container .store_col"):
        addr = items.select_one("b").next_sibling.next_sibling
        tel = items.select_one("b:nth-of-type(2)").next_sibling
        store = items.select_one("b:nth-of-type(3)").next_sibling
        writer.writerow([addr, tel, store])
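If the written values still carry stray newlines or surrounding whitespace (the empty-line problem from the question), they can be stripped just before writing; a small tweak to the final writerow, assuming each extracted sibling is a plain text node:

writer.writerow([addr.strip(), tel.strip(), store.strip()])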
You just need to change the way you extract the address, telephone number and store hours:

import csv
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.cheers.com.sg/web/store_location.jsp'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# print (soup.prettify())

# open a file for writing
location_data = open('data/soupdata.csv', 'w', newline='')

# create the csv writer object
csvwriter = csv.writer(location_data)

cheers = soup.find('div', id="store_container")

count = 0

# Loop for Header tags
for paragraph in cheers.find_all('b'):
    header1 = paragraph.text.replace(':', '')
    header2 = paragraph.find_next('b').text.replace(':', '')
    header3 = paragraph.find_next_siblings('b')[1].text.replace(':', '')
    if count == 0:
        csvwriter.writerow([header1, header2, header3])
        count += 1
    break

for paragraph in cheers.find_all('div'):
    label = paragraph.find_all('b')
    if len(label) == 3:
        print(label)
        address = label[0].next_sibling.next_sibling
        tel = label[1].next_sibling
        hours = label[2].next_sibling
        csvwriter.writerow([address, tel, hours])

location_data.close()
How do I create a CSV file with webscraped content from several URLs?
I want to create a CSV file from webscraped content. The content is from FinViz.com. I want to scrape the table from this website 20 times for 20 different stocks and put all the content into a CSV file. Within my code, I generate a list of stocks from a scrape of Twitter content. The list of stocks that is generated is the same list that I want to get information on from the FinViz.com tables. Here is my code:

import csv
import urllib.request
from bs4 import BeautifulSoup

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

for url in url_list:
    fpage = urllib.request.urlopen(url)
    fsoup = BeautifulSoup(fpage, 'html.parser')

    # scrape single page and add data to list
    # write datalist
    with open('today.csv', 'a') as file:
        writer = csv.writer(file)

        # write header row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))

        # write body row
        writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))

The trouble that I am running into is that my CSV file only has the webscraped data from the last item in the list. Instead I want the entire list in a sequence of rows. Here is what my CSV file looks like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-1.75,7.94%,79.06M,-22.52%,296.48M,-,-1.74,-4.61%,72.41M,-23.16%,-85.70M,-,-0.36,62.00%,3.21%,1.63%,15.10M,19.63,-197.00%,18.05%,2.57,66.67%,-0.65,-,-8.10%,-127.70%,12.17,-6.25%,0.93,4.03,-,146.70%,2.05 - 5.86,3.59%,-,-,-,385.80%,-36.01%,-,-,1.30,-,76.50%,82.93%,0.41,100,1.30,-59.60%,-,36.98,16.13% 9.32%,Yes,-,90.00%,-,0.82,3.63,Yes,-,Nov 08,-,902.43K,3.75,2.30,-22.08%,-10.43%,11.96%,"742,414",3.31%
It would be better to open your output file first, rather than keep opening and closing it for each URL that you fetch. Exception handling is needed to catch cases where the URL does not exist. Also, you should open the file with newline='' to avoid extra empty lines being written to it:

import csv
import urllib.request
from bs4 import BeautifulSoup

write_header = True

twiturl = "https://twitter.com/ACInvestorBlog"
twitpage = urllib.request.urlopen(twiturl)
soup = BeautifulSoup(twitpage, "html.parser")
print(soup.title.text)

tweets = [i.text for i in soup.select('a.twitter-cashtag.pretty-link.js-nav b')]
print(tweets)

url_base = "https://finviz.com/quote.ashx?t="
url_list = [url_base + tckr for tckr in tweets]

with open('today.csv', 'w', newline='') as file:
    writer = csv.writer(file)

    for url in url_list:
        try:
            fpage = urllib.request.urlopen(url)
            fsoup = BeautifulSoup(fpage, 'html.parser')

            # write header row (once)
            if write_header:
                writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2-cp'})))
                write_header = False

            # write body row
            writer.writerow(map(lambda e: e.text, fsoup.find_all('td', {'class': 'snapshot-td2'})))
        except urllib.error.HTTPError:
            print("{} - not found".format(url))

So today.csv would start like:

Index,P/E,EPS (ttm),Insider Own,Shs Outstand,Perf Week,Market Cap,Forward P/E,EPS next Y,Insider Trans,Shs Float,Perf Month,Income,PEG,EPS next Q,Inst Own,Short Float,Perf Quarter,Sales,P/S,EPS this Y,Inst Trans,Short Ratio,Perf Half Y,Book/sh,P/B,EPS next Y,ROA,Target Price,Perf Year,Cash/sh,P/C,EPS next 5Y,ROE,52W Range,Perf YTD,Dividend,P/FCF,EPS past 5Y,ROI,52W High,Beta,Dividend %,Quick Ratio,Sales past 5Y,Gross Margin,52W Low,ATR,Employees,Current Ratio,Sales Q/Q,Oper. Margin,RSI (14),Volatility,Optionable,Debt/Eq,EPS Q/Q,Profit Margin,Rel Volume,Prev Close,Shortable,LT Debt/Eq,Earnings,Payout,Avg Volume,Price,Recom,SMA20,SMA50,SMA200,Volume,Change
-,-,-10.85,4.60%,2.36M,11.00%,8.09M,-,-,-62.38%,1.95M,-16.14%,-14.90M,-,-,2.30%,10.00%,-44.42%,0.00M,-,21.80%,-5.24%,3.10,-38.16%,1.46,2.35,-,-155.10%,65.00,-50.47%,-,-,-,-238.40%,2.91 - 11.20,-38.29%,-,-,54.50%,-,-69.37%,1.63,-,2.20,-,-,17.87%,0.36,15,2.20,-,-,39.83,11.38% 10.28%,No,0.00,68.70%,-,1.48,3.30,Yes,0.00,Feb 28 AMC,-,62.76K,3.43,1.00,-5.21%,-25.44%,-37.33%,"93,166",3.94%
-,-,-0.26,1.50%,268.98M,3.72%,2.25B,38.05,0.22,-0.64%,263.68M,-9.12%,-55.50M,-,0.05,-,9.96%,-12.26%,1.06B,2.12,-328.10%,25.95%,2.32,17.72%,12.61,0.66,650.00%,-0.90%,12.64,-38.73%,0.03,264.87,-,-1.90%,6.69 - 15.27,-0.48%,-,-,-28.70%,0.00%,-45.17%,2.20,-,0.70,16.40%,67.80%,25.11%,0.41,477,0.80,71.90%,5.30%,52.71,4.83% 5.00%,Yes,0.80,7.80%,-5.20%,0.96,7.78,Yes,0.80,Feb 27 AMC,-,11.31M,8.37,2.20,0.99%,-1.63%,-4.72%,"10,843,026",7.58%

If you only want your file to contain data from one run of the script, you do not need 'a' to append; just use 'w' instead.
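If you also want to know which stock each row came from, a variant of the same loop can prepend the ticker symbol to every body row. This is only a sketch: the tickers and the output file name here are placeholders, and in practice the ticker list would come from the Twitter scrape above.

import csv
import urllib.request
import urllib.error
from bs4 import BeautifulSoup

tweets = ["AAPL", "TSLA"]  # placeholder tickers for the example
url_base = "https://finviz.com/quote.ashx?t="

with open('today_tagged.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for tckr in tweets:
        try:
            fsoup = BeautifulSoup(urllib.request.urlopen(url_base + tckr), 'html.parser')
            # prepend the ticker so every row in the output stays identifiable
            writer.writerow([tckr] + [e.text for e in fsoup.find_all('td', {'class': 'snapshot-td2'})])
        except urllib.error.HTTPError:
            print("{} - not found".format(url_base + tckr))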
Write data into a CSV file
I am crawling data from Wikipedia and it works so far. I can display it on the terminal, but I can't write it the way I need it into a CSV file :-/ The code is pretty long, but I paste it here anyway and hope that somebody can help me.

import csv
import requests
from bs4 import BeautifulSoup


def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text  # Read source code and make unicode
    soup = BeautifulSoup(code, "lxml")  # create BS object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url)  # goes into the individual sites


def get_single_item_data(item_url):
    page = requests.get(item_url).text  # Read source code & format with .text to unicode
    soup = BeautifulSoup(page, "lxml")  # create BS object

    def getInfoBoxBasisDaten(s):
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                       'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                       'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                       'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
                      'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
                      'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
                      'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                      'Oberbürgermeister', 'Oberbürgermeisterin']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
        writer.writeheader()

        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.nextSibling
            while True:
                if not current.name:
                    current = current.nextSibling
                    continue
                if wanted in current.text:
                    items = current.findAll('td')
                    print(BeautifulSoup.get_text(items[0]))
                    print(BeautifulSoup.get_text(items[1]))
                    writer.writerow({i: BeautifulSoup.get_text(items[1])})
                if '<th ' in str(current):
                    break
                current = current.nextSibling


print(spider())

The output is incorrect in two ways: the cells are not in their right places, and only one city is written; all the others are missing. (Screenshots of the actual output and of the expected output, with all the other cities in it, were attached.)
'... only one city is written ...': You call get_single_item_data for each city. Inside this function you open the output file with the same name, in the statement with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:, which overwrites the output file each time the function is called, so only the last city survives.

Each variable is written to a new row: In the statement writer.writerow({i: BeautifulSoup.get_text(items[1])}) you write the value of a single variable to its own row. What you need to do instead is build a dictionary of values before you start looking for the page values. As you collect the values from the page, you put them into the dictionary under the matching field name. Then, after you have found all of the values available, you call writer.writerow once with the complete dictionary.
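A minimal sketch of that pattern, with the field names abbreviated and find_value_for standing in for the infobox-parsing logic, which is not shown here:

import csv

fieldnames = ['Bundesland', 'Einwohner:', 'Fläche:']  # abbreviated for the example

def find_value_for(field):
    # placeholder for the infobox lookup; returns the text found for one label
    return '...'

# in the real script, open the file once outside get_single_item_data
with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';', extrasaction='ignore')
    writer.writeheader()

    row = {}                      # collect all values for one city first
    for field in fieldnames:
        row[field] = find_value_for(field)

    writer.writerow(row)          # then write the city as a single row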
Loop in a Python script with XPath. Why do I only get results from the last URL?
Why do I only get the results from the last URL? The idea is that I get a list of results from both URLs. Also, when writing to the CSV I get an empty row each time. How do I remove this row?

import csv
import requests
from lxml import html
import urllib

TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090", "_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'

for item in TV_category:
    url = url_pattern.format(item)
    page = requests.get(url)
    tree = html.fromstring(page.content)

outfile = open("./tv_test1.csv", "wb")
writer = csv.writer(outfile)

rows = tree.xpath('//*[@id="category"]/ul[2]/li')
for row in rows:
    price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
    product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
    writer.writerow([product_ref, price])
As I explained in the question's comments, you need to put the second for loop inside (at the end of) the first one. Otherwise, only the last URL's results will be saved/written to the CSV file. You don't need to open the file in each loop (a with statement will close it automagically). It is, as well, important to highlight that if you open a file with write flags it will be overwritten, and if that happens inside a loop it will be overwritten on every iteration. I'd refactor your code as follows:

import csv
import requests
from lxml import html
import urllib

TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090", "_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'

with open("./tv_test1.csv", "wb") as outfile:
    writer = csv.writer(outfile)
    for item in TV_category:
        url = url_pattern.format(item)
        page = requests.get(url)
        tree = html.fromstring(page.content)

        rows = tree.xpath('//*[@id="category"]/ul[2]/li')
        for row in rows:
            price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
            product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
            writer.writerow([product_ref, price])
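One more note: if you are running this on Python 3, the csv module expects a text-mode file opened with newline='' rather than binary mode, and opening the file that way is also the usual fix for stray blank rows on Windows. A minimal sketch of just the changed open call:

# Python 3: text mode plus newline='' so the csv module controls line endings itself
with open("./tv_test1.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)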