Trouble dealing with headers in a CSV file - Python

I've written some Python code to scrape titles and prices from a webpage and write the results to a CSV file. The scraping works fine, but because I append to the CSV file inside the loop, the headers are written on every iteration: if the loop runs 4 times, the headers appear 4 times. How can I fix it so the headers are written only once? Thanks.
This is the script:
import csv
import requests
from bs4 import BeautifulSoup

diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']

for link in diction_page:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)

    with open('item.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["Title", "Price"])
        writer.writerow([title, price])

As an option you can try this:
import csv
import requests
from bs4 import BeautifulSoup

diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']

for i, link in enumerate(diction_page):
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)

    with open('item.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        if i == 0:
            writer.writerow(["Title", "Price"])
        writer.writerow([title, price])

Don't write the headers in the for loop:
import csv
import requests
from bs4 import BeautifulSoup

diction_page = ['http://www.bloomberg.com/quote/SPX:IND','http://www.bloomberg.com/quote/CCMP:IND']

outfile = open('item.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Title", "Price"])

for link in diction_page:
    res = requests.get(link).text
    soup = BeautifulSoup(res, 'lxml')
    title = soup.select_one('.name').text.strip()
    price = soup.select_one('.price').text
    print(title, price)
    writer.writerow([title, price])

outfile.close()
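
Note that both fixes above write the header once per run: if the script is executed again in append mode against the same file, the header will still be duplicated across runs. One way to guard against that is to check whether the file is empty before writing the header. A minimal sketch using only the standard library (the filename and columns come from the question; the emptiness check is my assumption about how repeated runs should behave):

import csv
import os

title, price = "S&P 500", "2,400.00"  # stand-ins for the values scraped above

# Write the header only when item.csv is missing or empty,
# so repeated appends never duplicate it.
write_header = not os.path.exists('item.csv') or os.path.getsize('item.csv') == 0

with open('item.csv', 'a', newline='') as outfile:
    writer = csv.writer(outfile)
    if write_header:
        writer.writerow(["Title", "Price"])
    writer.writerow([title, price])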

Related

Scrape information from multiple URLs listed in a CSV using BeautifulSoup and then export these results to a new CSV file

I have a CSV file with 45k+ rows, each containing a different URL path on the same domain; the pages are structurally identical to each other. I managed to use BeautifulSoup to scrape the title and content of each one, and I validated the scraper with the print function. However, when I try to export the gathered information to a new CSV file, I only get the last URL's street name and description, not all of them as I expected.
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        site = requests.get(row['addresses']).text
        soup = BeautifulSoup(site, 'lxml')
        StreetName = soup.find('div', class_='hist-title').text
        Description = soup.find('div', class_='hist-content').text

with open('OutputList.csv', 'w', newline='') as output:
    Header = ['StreetName', 'Description']
    writer = csv.DictWriter(output, fieldnames=Header)
    writer.writeheader()
    writer.writerow({'StreetName': StreetName, 'Description': Description})
How can I get the output CSV to have, on each row, the street name and description for the corresponding URL row in the input CSV file?
You need to open both files at the same level and then read and write on each iteration. Something like this:
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as a, open('OutputList.csv', 'w', newline='') as b:
    reader = csv.reader(a)
    writer = csv.writer(b, quoting=csv.QUOTE_ALL)

    next(reader, None)  # skip the 'addresses' header row of the input file
    writer.writerow(['StreetName', 'Description'])

    # Assuming the URL is the first field in the CSV
    for url, *_ in reader:
        r = requests.get(url)
        if r.ok:
            soup = BeautifulSoup(r.text, 'lxml')
            street_name = soup.find('div', class_='hist-title').text.strip()
            description = soup.find('div', class_='hist-content').text.strip()
            writer.writerow([street_name, description])
I hope it helps.
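
If you would rather keep the DictReader/DictWriter style from the question, the same restructuring could look like this; a sketch assuming the input file's URL column is named addresses, as in the question:

from bs4 import BeautifulSoup
import requests
import csv

# Open both files once, write the header once, then write one row per URL.
with open('URLs.csv') as csvfile, open('OutputList.csv', 'w', newline='') as output:
    reader = csv.DictReader(csvfile)  # expects an 'addresses' column, as in the question
    writer = csv.DictWriter(output, fieldnames=['StreetName', 'Description'])
    writer.writeheader()
    for row in reader:
        soup = BeautifulSoup(requests.get(row['addresses']).text, 'lxml')
        writer.writerow({
            'StreetName': soup.find('div', class_='hist-title').text.strip(),
            'Description': soup.find('div', class_='hist-content').text.strip(),
        })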

Exporting Python Scraping Results to CSV

The code below yields a value for the "resultStats" ID, which I would like to save in a CSV file. Is there a smart way to have the "desired_google_queries" (i.e. the search terms) in column A and the "resultStats" values in column B of the CSV?
I saw that there are a number of threads on this topic, but none of the solutions I read through worked for this specific situation.
from bs4 import BeautifulSoup
import urllib.request
import csv

desired_google_queries = ['Elon Musk', 'Tesla', 'Microsoft']

for query in desired_google_queries:
    url = 'http://google.com/search?q=' + query
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    response = urllib.request.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    resultStats = soup.find(id="resultStats").string
    print(resultStats)
I took the liberty of rewriting this to use the Requests library instead of urllib, but this shows how to do the CSV writing, which is what I think you were most interested in:
from bs4 import BeautifulSoup
import requests
import csv

desired_google_queries = ['Elon Musk', 'Tesla', 'Microsoft']

result_stats = dict()
for query in desired_google_queries:
    url = 'http://google.com/search?q=' + query
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    result_stats[query] = soup.find(id="resultStats").string

with open('searchstats.csv', 'w', newline='') as fout:
    cw = csv.writer(fout)
    for q in desired_google_queries:
        cw.writerow([q, result_stats[q]])
Instead of writing it line by line, you can write it all in one go by storing the results in a pandas DataFrame first. See the code below:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

data_dict = {'desired_google_queries': [],
             'resultStats': []}

desired_google_queries = ['Elon Musk', 'Tesla', 'Microsoft']

for query in desired_google_queries:
    url = 'http://google.com/search?q=' + query
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    response = urllib.request.urlopen(req)
    html = response.read()
    soup = BeautifulSoup(html, 'html.parser')
    resultStats = soup.find(id="resultStats").string

    data_dict['desired_google_queries'].append(query)
    data_dict['resultStats'].append(resultStats)

df = pd.DataFrame(data=data_dict)
df.to_csv(path_or_buf='path/where/you/want/to/save/thisfile.csv', index=None)
Unfortunately, the original answer has been deleted; please find the code below for everyone else interested in this situation. Thanks to the user who posted the solution in the first place:
with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['query', 'resultStats'])
    for query in desired_google_queries:
        ...
        spamwriter.writerow([query, resultStats])
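
For completeness, here is a sketch of how that snippet fits together with the scraping loop: the elided part is just the urllib/BeautifulSoup code from the question, and the space delimiter and | quote character are kept from the restored answer:

from bs4 import BeautifulSoup
import urllib.request
import csv

desired_google_queries = ['Elon Musk', 'Tesla', 'Microsoft']

with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['query', 'resultStats'])
    for query in desired_google_queries:
        # Scraping code taken from the question
        url = 'http://google.com/search?q=' + query
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        response = urllib.request.urlopen(req)
        soup = BeautifulSoup(response.read(), 'html.parser')
        resultStats = soup.find(id="resultStats").string
        spamwriter.writerow([query, resultStats])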

How do I take a list of URLs stored in a CSV file, import them into Python to scrape, and export the newly gathered data back to CSV?

I am very new to Python and BeautifulSoup, and I am trying to use them to scrape multiple URLs in a loop. For each website, the loop should locate the banner slider on the home page, count how many banners it has, and place that number into a spreadsheet next to the corresponding URL. I have a list of URLs saved in a CSV file; basically, I want to take each of those URLs, run the loop, pull the number of banners, and put that number next to the URL in a separate column.
This is the code I have so far; all it does is write the URLs back into a CSV file and give me the number of banners for only the last URL.
from bs4 import BeautifulSoup
import requests

with open("urls.csv", "r") as f:
    csv_raw_cont = f.read()

split_csv = csv_raw_cont.split('\n')
split_csv.remove('')

separator = ';'

filename = "DDC_number_of_banners.csv"
f = open(filename, "w")
headers = "url, Number_of_Banners\n"
f.write(headers)

for each in split_csv:
    url_row_index = 0
    url = each.split(separator)[url_row_index]
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link',
                                                 'html-slide slide has-link']})
    Number_of_banners = len(banner_info)
    f.write(csv_raw_cont + "," + str(Number_of_banners) + "," + "\n")

f.close()
Making use of Python's CSV library would make this a bit simpler:
from bs4 import BeautifulSoup
import requests
import csv

with open("urls.csv", "r") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['url', 'Number_of_banners'])

    for url in f_urls:
        url = url.strip()
        html = requests.get(url).content
        soup = BeautifulSoup(html, "html.parser")
        banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link', 'html-slide slide has-link']})
        csv_output.writerow([url, len(banner_info)])
To include information such as each banner's data-label:
from bs4 import BeautifulSoup
import requests
import csv

with open("urls.csv", "r") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['url', 'Number_of_banners', 'data_labels'])

    for url in f_urls:
        url = url.strip()
        html = requests.get(url).content
        soup = BeautifulSoup(html, "html.parser")
        banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link', 'html-slide slide has-link']})
        data_labels = [banner.get('data-label') for banner in banner_info]
        csv_output.writerow([url, len(banner_info)] + data_labels)
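
If urls.csv really is semicolon-separated with the URL in the first column (as the separator=';' in the question suggests), you can also let the csv module do the splitting instead of iterating over raw lines. A sketch under that assumption:

from bs4 import BeautifulSoup
import requests
import csv

with open("urls.csv", "r", newline="") as f_urls, open("DDC_number_of_banners.csv", "w", newline="") as f_output:
    reader = csv.reader(f_urls, delimiter=';')  # the question splits rows on ';'
    csv_output = csv.writer(f_output)
    csv_output.writerow(['url', 'Number_of_banners'])

    for row in reader:
        if not row:          # skip blank lines instead of split_csv.remove('')
            continue
        url = row[0]         # URL assumed to be in the first column
        html = requests.get(url).content
        soup = BeautifulSoup(html, "html.parser")
        banner_info = soup.findAll('div', {'class': ['slide', 'slide has-link', 'html-slide slide has-link']})
        csv_output.writerow([url, len(banner_info)])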

Python web scraping and writing data into CSV

I'm trying to save all the data (i.e. all pages) in a single CSV file, but this code only saves the final page's data. E.g. here url[] contains 2 URLs, and the final CSV only contains the 2nd URL's data.
I'm clearly doing something wrong in the loop, but I don't know what.
Also, this page contains 100 data points, but this code only writes the first 44 rows.
Please help with this issue.
from bs4 import BeautifulSoup
import requests
import csv

url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]

for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content)
    g_data = soup.find_all("a", {"class": "hdrlnk"})
    gen_list = []
    for row in g_data:
        try:
            name = row.text
        except:
            name = ''
        try:
            link = "http://sfbay.craigslist.org" + row.get("href")
        except:
            link = ''
        gen = [name, link]
        gen_list.append(gen)

with open('filename2.csv', 'wb') as file:
    writer = csv.writer(file)
    for row in gen_list:
        writer.writerow(row)
The gen_list is being initialized again on each pass of the loop that runs over the URLs:

gen_list = []

Move this line outside the for loop:

...
url = ["http://sfbay.craigslist.org/search/sfc/npo","http://sfbay.craigslist.org/search/sfc/npo?s=100"]

gen_list = []
for ur in url:
    ...
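
For clarity, here is a sketch of the corrected script with gen_list moved out of the loop. It also opens the file in text mode with newline='', which is what csv.writer expects on Python 3 (the original 'wb' idiom is Python 2 only):

from bs4 import BeautifulSoup
import requests
import csv

url = ["http://sfbay.craigslist.org/search/sfc/npo",
       "http://sfbay.craigslist.org/search/sfc/npo?s=100"]

gen_list = []  # accumulate rows across all pages
for ur in url:
    r = requests.get(ur)
    soup = BeautifulSoup(r.content, "html.parser")
    for row in soup.find_all("a", {"class": "hdrlnk"}):
        name = row.text
        link = "http://sfbay.craigslist.org" + row.get("href")
        gen_list.append([name, link])

# 'w' with newline='' is the Python 3 equivalent of the old 'wb' idiom
with open('filename2.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(gen_list)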
I found your post later; you might want to try this method:
import requests
from bs4 import BeautifulSoup
import csv

final_data = []
url = "https://sfbay.craigslist.org/search/sss"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all(class_="result-row")

for details in get_details:
    getclass = details.find_all(class_="hdrlnk")
    for link in getclass:
        link1 = link.get("href")
        sublist = []
        sublist.append(link1)
        final_data.append(sublist)
print(final_data)

filename = "sfbay.csv"
with open("./" + filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    for row in final_data:
        writer.writerow(row)
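
If you want to apply this method to the two result pages from the original question, one option is to wrap the request in a loop and write once at the end. A sketch combining the question's URLs with this answer's selectors (adding the link text as a name column is my own addition):

import requests
from bs4 import BeautifulSoup
import csv

urls = ["http://sfbay.craigslist.org/search/sfc/npo",
        "http://sfbay.craigslist.org/search/sfc/npo?s=100"]

final_data = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for details in soup.find_all(class_="result-row"):
        for link in details.find_all(class_="hdrlnk"):
            final_data.append([link.text, link.get("href")])

with open("sfbay.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(final_data)  # one call writes every accumulated row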

How can I scrape this table more efficiently?

Okay, I have built a program to scrape Yahoo Finance. I want the historical prices of a certain stock, written to a spreadsheet. It does everything the way it's supposed to, but it gives me ALL of the data on the whole page! I need just the data in the table. Thanks.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
import requests

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices")
for record in soup.findAll('tr'):
    playerdata = ""
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text
    if len(playerdata) != 0:
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Open,Close,High,Low"
file = open(os.path.expanduser("Uso.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors='ignore'))
file.write(bytes(playerdatasaved, encoding="ascii", errors='ignore'))
print(playerdatasaved)
To get the table of data:
soup = make_soup("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices")
table = [[cell.text for cell in row.findAll('td')] for row in soup.findAll('tr')]
To write out the table of data to a file:
import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(table)
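
If pandas is available, pandas.read_html can often replace both steps for a page like this. A sketch, assuming the page still serves its price history as a plain HTML table (read_html also needs a parser such as lxml installed):

import pandas as pd

# pandas.read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html("https://finance.yahoo.com/q/hp?s=USO+Historical+Prices")
prices = tables[0]                     # assumption: the first table is the price history
prices.to_csv("Uso.csv", index=False)  # the header row comes from the table itself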
