I am attempting to scrape the artists' Spotify streaming rankings from Kworb.net into a CSV file, and I've nearly succeeded, except I'm running into a weird issue.
The code below successfully scrapes all 10,000 of the listed artists and prints them to the console:
import requests
from bs4 import BeautifulSoup
import csv
URL = "https://kworb.net/spotify/artists.html"
result = requests.get(URL)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
table = soup.find('table', id="spotifyartistindex")
header_tags = table.find_all('th')
headers = [header.text.strip() for header in header_tags]
rows = []
data_rows = table.find_all('tr')
for row in data_rows:
    value = row.find_all('td')
    beautified_value = [dp.text.strip() for dp in value]
    print(beautified_value)
    if len(beautified_value) == 0:
        continue
    rows.append(beautified_value)
The issue arises when I use the following code to save the output to a CSV file:
with open('artist_rankings.csv', 'w', newline="") as output:
    writer = csv.writer(output)
    writer.writerow(headers)
    writer.writerows(rows)
For whatever reason, only 738 of the artists are saved to the file. Does anyone know what could be causing this?
Thanks so much for any help!
As an alternative approach, you might want to make your life easier next time and use pandas.
Here's how:
import requests
import pandas as pd
source = requests.get("https://kworb.net/spotify/artists.html")
df = pd.concat(pd.read_html(source.text, flavor="bs4"))
df.to_csv("artists.csv", index=False)
This outputs a .csv file with 10,000 artists.
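Note that pd.read_html returns a list of DataFrames, one per table it finds on the page, which is why the result is wrapped in pd.concat; the flavor="bs4" option also needs beautifulsoup4 installed.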
Stupid question. I have made my first scraper/crawler. It gives me exactly what I want, but when I write it to a CSV file, the text appears with \n'] brackets. If I try to remove them in any way, it breaks my output in the CSV file.
Although the website is in Hebrew, that shouldn't be a problem. Just look at the CSV that you get.
Thanks in advance
import csv
import requests
from bs4 import BeautifulSoup as bs
import io
url='https://www.maariv.co.il/news/politics'
source = requests.get(url).text
soup = bs(source, 'html.parser')
file = io.open('maariv7.csv', 'w', encoding="utf-16")
csv_writer = csv.writer(file, delimiter='|')
csv_writer.writerow(['Headline', 'Summary', 'Text', 'name'])
file.close()
def single_page_scraper(url):
    source = requests.get(url).text
    soup = bs(source, 'html.parser')
    file = io.open('maariv7.csv', 'a', encoding="utf-16")
    csv_writer = csv.writer(file, delimiter='|')
    for article in soup.find_all(class_='article-title'):
        headline = article.h1.text
        print(headline, '\n')
    for article in soup.find_all(class_='article-description'):
        summary = article.h2.text
        print(summary, '\n')
    text = []
    name = []
    for par in soup.find_all(class_='article-body'):
        text.append(par.get_text())
    print(text)
    politics = io.open('politicians.txt', 'r', encoding="utf-8")
    my_list = politics.read().splitlines()
    my_file = str(text)
    for i in my_list:
        if i in my_file:
            name.append(i)
    name_list = ", ".join(name)
    print(name_list, '\n\n\n\n')
    csv_writer.writerow([headline, summary, my_file, name_list])
    file.close()

for articles in soup.find_all(class_='three-articles-in-row'):
    link = articles.a['href']
    single_page_scraper(link)
Check out Yibo Yang's answer at the bottom.
Basically, the newline argument belongs on the call that opens the file, not on csv.writer. Try switching this line:
file = io.open('maariv7.csv', 'a', encoding="utf-16")
to this:
file = io.open('maariv7.csv', 'a', encoding="utf-16", newline='')
And see if it makes a difference.
They are actually putting newlines in their text, so you should strip them right where you append the text: instead of text.append(par.get_text()), use text.append(par.get_text().strip()).
So, inside of single_page_scraper I use:
for par in soup.find(class_='article-body'):
    if isinstance(par, NavigableString):
        t = par.strip()
    else:
        t = par.text.strip()
    if t != '':
        text.append(t)
edit: you would have to add from bs4 import NavigableString at the top.
I'm trying to figure out what the next step is to convert my web scrape to a CSV.
I've tried putting every column into individual lists, but I feel like that's not the solution.
from bs4 import BeautifulSoup
import requests
url = 'https://www.pro-football-reference.com/years/2018/passing.htm'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find('table', id='passing')  # grab the stats table first
for row in tb.find_all('tr'):
    i = row.get_text()
    print(i)
This should work:
import csv  # quite crucial
final_table = []
for row in tb.find_all('tr'):
    next_line = row.get_text()
    final_table.append([next_line])
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(final_table)
Use the csv module. We'll grab the headers with soup.find("tr").find_all("th"), then loop over the body rows and write them to the file. The first cell of each row is a <th>, so we need to handle it separately and prepend it to the <td> data. Note that the staggered header rows repeated every 30 lines are omitted.
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.pro-football-reference.com/years/2018/passing.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
with open("output.csv", "w") as f:
writer = csv.writer(f)
writer.writerow([x.get_text() for x in soup.find("tr").find_all("th")])
for row in soup.find_all("tr"):
data = [row.find("th").get_text()] + [x.get_text() for x in row.find_all("td")]
if data:
writer.writerow(data)
Output (just the top few rows):
Rk,Player,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,Yds,TD,TD%,Int,Int%,Lng,Y/A,AY/A,Y/C,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
1,Ben Roethlisberger,PIT,36,QB,16,16,9-6-1,452,675,67.0,5129,34,5.0,16,2.4,97,7.6,7.5,11.3,320.6,96.5,71.0,24,166,7.10,7.04,3.4,2,3
2,Andrew Luck*,IND,29,QB,16,16,10-6-0,430,639,67.3,4593,39,6.1,15,2.3,68,7.2,7.4,10.7,287.1,98.7,69.4,18,134,6.79,6.95,2.7,3,3
3,Matt Ryan,ATL,33,QB,16,16,7-9-0,422,608,69.4,4924,35,5.8,7,1.2,75,8.1,8.7,11.7,307.8,108.1,68.5,42,296,7.12,7.71,6.5,1,1
4,Kirk Cousins,MIN,30,QB,16,16,8-7-1,425,606,70.1,4298,30,5.0,10,1.7,75,7.1,7.3,10.1,268.6,99.7,58.2,40,262,6.25,6.48,6.2,1,0
Check this thread if you see extra newlines in the CSV result on Windows.
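The short version, in case the link goes stale: open the file with newline="" so the csv module manages line endings itself, e.g.:
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
Otherwise, on Windows, every record is followed by a blank line.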
I have a bunch of web text that I'd like to scrape and export to a CSV file. The problem is that the text is split over multiple lines on the website, and that's how BeautifulSoup reads it. When I export to CSV, all the text goes into one cell, but the cell has multiple lines of text. When I try to read the CSV into another program, it interprets the multiple lines in a way that yields a nonsensical dataset. The question is: how do I put all the text onto a single line after I pull it with BeautifulSoup but before I export to CSV?
Here's a simple working example demonstrating the problem of multiple lines (in fact, the first few lines in the resulting csv are blank, so at first glance it may look empty):
import csv
import requests
from bs4 import BeautifulSoup
def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").text
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
UPDATE: there have been some good suggestions, but it's still not working. The following code still produces a csv file with line breaks in a cell:
import csv
import requests
from bs4 import BeautifulSoup
with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, 'lxml')
    find_article = soup.find('article')
    find_2para = find_article.p.find_next_sibling("p")
    find_largetxt = find_article.p.find_next_sibling("p").nextSibling
    writer.writerow([find_2para, find_largetxt])
Here's another attempt based on a different suggestion. This one also ends up producing a line break in the csv file:
import csv
import requests
from bs4 import BeautifulSoup
def main():
    r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
    soup = BeautifulSoup(r.text, "html.parser")
    with open('Temp.csv', 'w', encoding='utf8', newline='') as f:
        writer = csv.writer(f, delimiter=",")
        abstract = soup.find("article").get_text(separator=" ", strip=True)
        writer.writerow([abstract])

if __name__ == '__main__':
    main()
Change your abstract = ... line into:
abstract = soup.find("article").get_text(separator=" ", strip=True)
It'll join the text of each element using the separator parameter (in this case a single space) and strip the whitespace around each piece.
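For instance, here's a minimal illustration with made-up HTML:
from bs4 import BeautifulSoup

html = "<article><p>First paragraph</p>\n<p>Second paragraph</p></article>"
demo = BeautifulSoup(html, "html.parser")
print(demo.find("article").get_text())  # 'First paragraph\nSecond paragraph'
print(demo.find("article").get_text(separator=" ", strip=True))  # 'First paragraph Second paragraph'
One caveat: strip=True only trims whitespace at the edges of each text node, so newlines inside a single block of text survive, which may be why the update above still sees line breaks.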
The solution that ended up working for me is pretty simple:
abstract=soup.find("article").text.replace("\t", "").replace("\r", "").replace("\n", "")
That gets rid of all line breaks.
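If you'd rather not chain replace calls, splitting on whitespace and rejoining does the same job in one pass (an equivalent alternative, not the code above):
abstract = " ".join(soup.find("article").text.split())
str.split() with no arguments splits on any run of whitespace, so tabs, carriage returns and newlines all collapse into single spaces.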
r = requests.get("https://www.econometricsociety.org/publications/econometrica/2017/03/01/search-yield")
soup = BeautifulSoup(r.text, 'lxml')  # I prefer using the lxml parser
find_article = soup.find('article')
# Next line: how to find the title, in this case: Econometrica: Mar 2017, Volume 85, Issue 2
find_title = find_article.h3
# find "Search for Yield"
find_yield = find_article.h1
# first paragraph, e.g.: DOI: 10.3982/ECTA14057 p. 351-378
find_1para = find_article.p
# second <p>, e.g.: David Martinez‐Miera, Rafael Repullo
find_2para = find_article.p.find_next_sibling("p")
# find the large text area, e.g. 'We present a model of the relationship bet...'
find_largetxt = find_article.p.find_next_sibling("p").nextSibling
I used a variety of methods to get to the text you want, purely for the purpose of education (you can call .text on each of these to get the text without tags, or you can use Zroq's method).
But you can write each one of these into the file by doing, for example:
writer.writerow([find_title.text])
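Or, to put all the extracted pieces on one CSV row, something like this should work (a sketch reusing the variables above; find_largetxt may be a bare NavigableString, so convert it with str()):
writer.writerow([find_title.text, find_yield.text, find_1para.text, find_2para.text, str(find_largetxt)])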
Why do I only get the results from the last URL?
The idea is that I get a list of results from both URLs.
Also, when writing to the CSV I get an empty row each time. How do I remove this row?
import csv
import requests
from lxml import html
import urllib
TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090","_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'
for item in TV_category:
    url = url_pattern.format(item)
    page = requests.get(url)
    tree = html.fromstring(page.content)
    outfile = open("./tv_test1.csv", "wb")
    writer = csv.writer(outfile)
    rows = tree.xpath('//*[@id="category"]/ul[2]/li')

for row in rows:
    price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
    product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
    writer.writerow([product_ref, price])
As I explained in the question's comments, you need to put the second for loop inside (at the end of) the first one. Otherwise, only the last URL's rows will be saved/written to the CSV file.
You don't need to open the file in each loop iteration (a with statement will close it automagically). It is, as well, important to highlight that opening a file with the write flag truncates it, so if that happens inside a loop the file is overwritten on every iteration.
I'd refactor your code as follows:
import csv
import requests
from lxml import html
import urllib
TV_category = ["_108-tot-127-cm-43-tot-50-,98952,501090","_128-tot-150-cm-51-tot-59-,98952,501091"]
url_pattern = 'http://www.mediamarkt.be/mcs/productlist/{}.html?langId=-17'
with open("./tv_test1.csv", "wb") as outfile:
writer = csv.writer(outfile)
for item in TV_category:
url = url_pattern.format(item)
page = requests.get(url)
tree = html.fromstring(page.content)
rows = tree.xpath('//*[#id="category"]/ul[2]/li')
for row in rows:
price = row.xpath('normalize-space(div/aside[2]/div[1]/div[1]/div/text())')
product_ref = row.xpath('normalize-space(div/div/h2/a/text())')
writer.writerow([product_ref,price])
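Note, too, that opening the file in text mode with newline="" (rather than "wb") is the Python 3 way, and it should also take care of the empty row you were seeing after each record.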