I managed to write and run a program using BeautifulSoup. The idea is to read multiple URLs from a CSV file, capture details from each page's HTML source, and save the output as CSV.
The program runs fine, but the output CSV keeps getting overwritten, so only the first row's values survive.
The input file has three URLs to parse.
I want the output to be stored in three different rows.
Below is my code
import csv
import requests
import pandas
from bs4 import BeautifulSoup

with open("input.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        url = row[0]
        print(url)
        r = requests.get(url)
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        all = soup.find_all("div", {"class": "biz-country-us"})
        for br in soup.find_all("br"):
            br.replace_with("\n")
        l = []
        for item in all:
            d = {}
            name = item.find("h1", {"class": "biz-page-title embossed-text-white shortenough"})
            d["name"] = name.text.replace(" ", "").replace("\n", "")
            claim = item.find("div", {"class": "u-nowrap claim-status_teaser js-claim-status-hover"})
            d["claim"] = claim.text.replace(" ", "").replace("\n", "")
            reviews = item.find("span", {"class": "review-count rating-qualifier"})
            d["reviews"] = reviews.text.replace(" ", "").replace("\n", "")
            l.append(d)
        df = pandas.DataFrame(l)
        df.to_csv("output.csv")
Please let me know if any part of my explanation is unclear.
Open the output file in append mode, as suggested in this post, with the modification that you write the header only the first time:

from os.path import isfile

if not isfile("output.csv"):
    df.to_csv("output.csv", header=True)
else:
    with open("output.csv", "a") as f:
        df.to_csv(f, header=False)
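Alternatively, a minimal sketch of the simpler fix, under the assumption that the parsing code stays as in the question: build one list across all three URLs and write the CSV once, after the loop.

import csv
import pandas
import requests
from bs4 import BeautifulSoup

l = []  # moved outside the loop: collects one dict per record across all URLs
with open("input.csv", "r") as f:
    for row in csv.reader(f):
        soup = BeautifulSoup(requests.get(row[0]).content, "html.parser")
        for item in soup.find_all("div", {"class": "biz-country-us"}):
            d = {}
            # same name/claim/reviews extraction as in the question, e.g.:
            name = item.find("h1", {"class": "biz-page-title embossed-text-white shortenough"})
            d["name"] = name.text.strip()
            l.append(d)

# moved outside the loop: one DataFrame, one write, one row per record
pandas.DataFrame(l).to_csv("output.csv", index=False)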
TypeError: a bytes-like object is required, not 'str'
Getting the above error while executing the Python code below to save the HTML table data in a CSV file. I don't know how to get rid of it. Please help me.
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.mapsofindia.com/districts-india/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'tableizer-table'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append(cell.text)
    list_of_rows.append(list_of_cells)

outfile = open('./immates.csv', 'wb')
writer = csv.writer(outfile)
writer.writerow(["SNo", "States", "Dist", "Population"])
writer.writerows(list_of_rows)
The error is raised on the last line.
You are using Python 2 methodology instead of Python 3.
Change:
outfile=open('./immates.csv','wb')
To:
outfile=open('./immates.csv','w')
and you will get a file with the following output:
SNo,States,Dist,Population
1,Andhra Pradesh,13,49378776
2,Arunachal Pradesh,16,1382611
3,Assam,27,31169272
4,Bihar,38,103804637
5,Chhattisgarh,19,25540196
6,Goa,2,1457723
7,Gujarat,26,60383628
.....
In Python 3 csv takes the input in text mode, whereas in Python 2 it took it in binary mode.
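The csv docs also recommend passing newline='' when opening the file in Python 3, so the writer controls row terminators itself. A minimal sketch of the fixed write step:

import csv

# Python 3: text mode, with newline='' so the csv module handles row
# terminators itself (avoids blank lines between rows on Windows)
with open('./immates.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["SNo", "States", "Dist", "Population"])
    writer.writerows(list_of_rows)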
Edited to Add
Here is the code I ran:
import csv
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.mapsofindia.com/districts-india/'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'tableizer-table'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append(cell.text)
    list_of_rows.append(list_of_cells)

outfile = open('./immates.csv', 'w')
writer = csv.writer(outfile)
writer.writerow(['SNo', 'States', 'Dist', 'Population'])
writer.writerows(list_of_rows)
outfile.close()
I had the same issue with Python 3.
My code was writing into io.BytesIO(); replacing it with io.StringIO() solved it.
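A minimal sketch of that difference (the names here are illustrative):

import csv
import io

buf = io.StringIO()            # text buffer: what csv expects in Python 3
writer = csv.writer(buf)
writer.writerow(["a", "b"])    # with io.BytesIO() this raises the same TypeError
print(buf.getvalue())          # -> 'a,b\r\n'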
Just change 'wb' to 'w':

outfile = open('./immates.csv', 'wb')

to

outfile = open('./immates.csv', 'w')

You are opening the CSV file in binary mode; the mode should be 'w'.
import csv

# open the csv file in write mode with utf-8 encoding
with open('output.csv', 'w', encoding='utf-8', newline='') as w:
    fieldnames = ["SNo", "States", "Dist", "Population"]
    writer = csv.DictWriter(w, fieldnames=fieldnames)
    writer.writeheader()  # write the header row
    # write a list of dicts keyed by the fieldnames above
    # (use writerow(d) to write one row at a time)
    writer.writerows(list_of_dicts)
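Since the question builds list_of_rows as lists of cell values, a hypothetical one-line conversion to the dicts that DictWriter expects could be:

# Convert the question's list-of-lists rows into dicts keyed by fieldnames
list_of_dicts = [dict(zip(fieldnames, row)) for row in list_of_rows]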
import re
from bs4 import BeautifulSoup  # assumes soup is an existing BeautifulSoup object

file = open('parsed_data.txt', 'w')
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    soup_link = str(link)
    print(soup_link)
    file.write(soup_link + '\n')  # one link per line

file.flush()
file.close()
In my case, I used BeautifulSoup to write a .txt file with Python 3.x, and it had the same issue. As @tsduteba said, change the 'wb' in the first line to 'w'.
I'm trying to write some data extracted from HTML elements to a CSV file. When I write the data to an Excel file, the text appears exactly the way it does on the site. However, things go wrong when I write the same data to a CSV file: I see unintelligible text instead of the text I'm after.
The HTML elements containing the data:
<div class="col-xs-12">
<h1 class="text-default text-darker no-margin font-180 font-bold">
شركة الوطنية </h1>
<h2 class="text-default font-100 no-margin vertical-offset-5">
</h2>
</div>
Desired output:
شركة الوطنية
When I try it like this:
from openpyxl import Workbook
from bs4 import BeautifulSoup
wb = Workbook()
wb.remove(wb['Sheet'])
ws = wb.create_sheet("experimental")
ws.append(['name'])
soup = BeautifulSoup(htmlcontent,"lxml")
name = soup.select_one("h1").get_text(strip=True)
ws.append([name])
wb.save("document.xlsx")
It produces an Excel file in which the text looks as expected.
However, when I try it like this:
import csv
from bs4 import BeautifulSoup

with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    soup = BeautifulSoup(htmlcontent, "lxml")
    name = soup.select_one("h1").get_text(strip=True)
    writer.writerow([name])
It produces a CSV file where the text looks garbled when opened in Excel.
How can I write the exact text to a CSV file?
To add to what @alex_bits said, I would change the encoding to UTF-16, like below:
import csv
from bs4 import BeautifulSoup

with open("demo.csv", "w", newline="", encoding="utf-16") as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    soup = BeautifulSoup(htmlcontent, "lxml")
    name = soup.select_one("h1").get_text(strip=True)
    writer.writerow([name])
As you might have suspected, the issue here is with your encoding and Excel's understanding of it. Instead of utf-8 you should use utf-8-sig, which prepends a byte order mark (BOM) that Excel uses to detect UTF-8.
import csv

text = "شركة الوطنية"

with open('test.csv', 'w', encoding='utf-8-sig') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([text])
Output: Excel now displays the Arabic text correctly.
I've written a script in Python which fetches the titles of different posts from a webpage and writes them to a CSV file. As the site updates its content very frequently, I'd like to prepend the new results to that CSV file, which already contains a list of old titles.
I've tried with:
import csv
import time
import requests
from bs4 import BeautifulSoup

url = "https://stackoverflow.com/questions/tagged/python"

def get_information(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')
    for title in soup.select(".summary .question-hyperlink"):
        yield title.text

if __name__ == '__main__':
    while True:
        with open("output.csv", "a", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(['posts'])
            for items in get_information(url):
                writer.writerow([items])
                print(items)
        time.sleep(300)
When run more than once, the script above appends the new results after the old ones.
Old data are like:
A
F
G
T
New data are W,Q,U.
The csv file should look like below when I rerun the script:
W
Q
U
A
F
G
T
How can I prepend the new results to an existing CSV file that already contains old data?
Inserting data anywhere in a file except at the end requires rewriting the whole thing. To do this without reading its entire contents into memory first, you could create a temporary csv file with the new data in it, append the data from the existing file to that, delete the old file and rename the new one.
Here's an example of what I mean (using a dummy get_information() function to simplify testing):
import csv
import os
from tempfile import NamedTemporaryFile

url = 'https://stackoverflow.com/questions/tagged/python'
csv_filepath = 'updated.csv'

# For testing, create an existing file.
if not os.path.exists(csv_filepath):
    with open(csv_filepath, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows([item] for item in 'AFGT')

# Dummy for testing.
def get_information(url):
    for item in 'WQU':
        yield item

if __name__ == '__main__':
    folder = os.path.abspath(os.path.dirname(csv_filepath))  # Get dir of existing file.
    with NamedTemporaryFile(mode='w', newline='', suffix='.csv',
                            dir=folder, delete=False) as newf:
        temp_filename = newf.name  # Save filename.
        # Put new data into the temporary file.
        writer = csv.writer(newf)
        for item in get_information(url):
            writer.writerow([item])
            print([item])
        # Append contents of existing file to new one.
        with open(csv_filepath, 'r', newline='') as oldf:
            reader = csv.reader(oldf)
            for row in reader:
                writer.writerow(row)
                print(row)

    os.remove(csv_filepath)  # Delete old file.
    os.rename(temp_filename, csv_filepath)  # Rename temporary file.
Since you intend to change the position of every element of the table, you need to read the table into memory and rewrite the entire file, starting with the new elements.
You may find it easier to (1) write the new elements to a new file, (2) open the old file and append its contents to the new file, and (3) move the new file to the original (old) file name, as sketched below.
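A minimal sketch of those three steps (new_rows is a hypothetical list of freshly scraped rows; the filenames are illustrative):

import csv
import os

new_rows = [['W'], ['Q'], ['U']]  # hypothetical: the freshly scraped rows

# (1) Write the new data to a new file.
with open('updated.csv.tmp', 'w', newline='') as newf:
    writer = csv.writer(newf)
    writer.writerows(new_rows)
    # (2) Append the old file's contents to the new file.
    with open('updated.csv', 'r', newline='') as oldf:
        writer.writerows(csv.reader(oldf))

# (3) Move the new file onto the original file name.
os.replace('updated.csv.tmp', 'updated.csv')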
I am new to scraping with Python. After going through a lot of useful resources I was able to scrape the content of a page. However, I am having trouble saving this data to a .csv file.
Python:
import mechanize
import time
import requests
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox(executable_path=r'C:\Users\geckodriver.exe')
driver.get("myUrl.jsp")

username = driver.find_element_by_name('USER')
password = driver.find_element_by_name('PASSWORD')
username.send_keys("U")
password.send_keys("P")

main_frame = driver.find_element_by_xpath('//*[@id="Frame"]')
src = driver.switch_to_frame(main_frame)
table = driver.find_element_by_xpath("/html/body/div/div[2]/div[5]/form/div[7]/div[3]/table")
rows = table.find_elements(By.TAG_NAME, "tr")

for tr in rows:
    outfile = open("C:/Users/Scripts/myfile.csv", "w")
    with outfile:
        writers = csv.writer(outfile)
        writers.writerows(tr.text)
Problem:
Only one of the rows gets written to the CSV file. However, when I print tr.text to the console, all the required rows show up. How can I get the text of all the tr elements written to the file?
Currently your code opens the file, writes one row, closes it, then on the next row opens it again and overwrites the previous line. Please consider the following code snippet:
# We use 'with' to open the file and auto-close it when done;
# the syntax is best modified as follows.
with open('C:/Users/Scripts/myfile.csv', 'w', newline='') as outfile:
    writers = csv.writer(outfile)
    # We only need to open the file once, so we open it first,
    # then loop through each row to write everything into the open file.
    for tr in rows:
        # one row per <tr>; passing the bare string to writerows()
        # would write every character as its own row
        writers.writerow([tr.text])