I'm trying to extract data from some HTML elements and write it to a CSV file. When I write the data to an Excel file, the text appears exactly as it does on the site. However, things go wrong when I write the same data to a CSV file: I see unintelligible text instead of the text I'm after.
The HTML elements containing the data:
<div class="col-xs-12">
    <h1 class="text-default text-darker no-margin font-180 font-bold">
        شركة الوطنية </h1>
    <h2 class="text-default font-100 no-margin vertical-offset-5">
    </h2>
</div>
Desired output:
شركة الوطنية
When I try the following:
from openpyxl import Workbook
from bs4 import BeautifulSoup

wb = Workbook()
wb.remove(wb['Sheet'])
ws = wb.create_sheet("experimental")
ws.append(['name'])

soup = BeautifulSoup(htmlcontent, "lxml")  # htmlcontent holds the HTML shown above
name = soup.select_one("h1").get_text(strip=True)
ws.append([name])

wb.save("document.xlsx")
It produces an Excel file in which the text appears exactly as it does on the site.
However, when I try:
import csv
from bs4 import BeautifulSoup
with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    soup = BeautifulSoup(htmlcontent, "lxml")  # htmlcontent holds the HTML shown above
    name = soup.select_one("h1").get_text(strip=True)
    writer.writerow([name])
It produces a CSV file where the text looks garbled when opened in Excel.
How can I write the exact text to a CSV file?
To add to what @alex_bits said, I would change the encoding to UTF-16, like below:
import csv
from bs4 import BeautifulSoup
with open("demo.csv", "w", newline="", encoding="utf-16") as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    soup = BeautifulSoup(htmlcontent, "lxml")
    name = soup.select_one("h1").get_text(strip=True)
    writer.writerow([name])
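For what it's worth, a quick hedged check (not part of the original answer) of why Excel copes with this: Python's utf-16 codec writes a byte order mark at the start of the output, which Excel uses to detect the encoding.

# Python's utf-16 codec prepends a BOM; on little-endian platforms
# the first two bytes are b'\xff\xfe'. Excel uses this marker to
# detect the encoding. (A minimal check, not from the original answer.)
print("name".encode('utf-16')[:2])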
As you might have suspected, the issue here is with your encoding and Excel's understanding of it. Instead of utf-8 you should use utf-8-sig:
import csv
text = "شركة الوطنية"
with open('test.csv', 'w', encoding='utf-8-sig') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([text])
Output: the Arabic text now displays correctly when the file is opened in Excel.
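The reason this works, shown as a minimal sketch: utf-8-sig prepends a UTF-8 byte order mark that Excel uses to recognize the file as UTF-8, while plain utf-8 writes no such marker, so Excel falls back to a legacy encoding.

text = "شركة الوطنية"
# utf-8-sig starts with the UTF-8 BOM; plain utf-8 starts with the
# first bytes of the text itself, so Excel guesses a legacy encoding.
print(text.encode('utf-8-sig')[:3])  # b'\xef\xbb\xbf'
print(text.encode('utf-8')[:3])      # first bytes of the Arabic text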
Related
TypeError: a bytes-like object is required, not 'str'
Getting the above error while executing the Python code below to save HTML table data in a CSV file. I don't know how to get rid of it. Please help me.
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://www.mapsofindia.com/districts-india/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'tableizer-table'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append(cell.text)
    list_of_rows.append(list_of_cells)

outfile = open('./immates.csv', 'wb')
writer = csv.writer(outfile)
writer.writerow(["SNo", "States", "Dist", "Population"])
writer.writerows(list_of_rows)
The error occurs on the last line above.
You are using Python 2 methodology instead of Python 3.
Change:
outfile=open('./immates.csv','wb')
To:
outfile=open('./immates.csv','w')
and you will get a file with the following output:
SNo,States,Dist,Population
1,Andhra Pradesh,13,49378776
2,Arunachal Pradesh,16,1382611
3,Assam,27,31169272
4,Bihar,38,103804637
5,Chhattisgarh,19,25540196
6,Goa,2,1457723
7,Gujarat,26,60383628
.....
In Python 3 csv takes the input in text mode, whereas in Python 2 it took it in binary mode.
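A minimal sketch of that difference (the filename is made up): in Python 3, csv.writer passes str to the file object, so the file must be opened in text mode, ideally with newline='' so the csv module controls line endings itself.

import csv

# Text mode works in Python 3; newline='' stops the csv module's
# '\r\n' line endings from being doubled on Windows.
with open('demo.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['a', 'b'])

# Binary mode fails: the writer hands the file a str, and a file
# opened with 'wb' only accepts bytes, hence
# TypeError: a bytes-like object is required, not 'str'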
Edited to Add
Here is the code I ran:
import csv
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.mapsofindia.com/districts-india/'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'tableizer-table'})

list_of_rows = []
for row in table.findAll('tr')[1:]:
    list_of_cells = []
    for cell in row.findAll('td'):
        list_of_cells.append(cell.text)
    list_of_rows.append(list_of_cells)

outfile = open('./immates.csv', 'w')
writer = csv.writer(outfile)
writer.writerow(['SNo', 'States', 'Dist', 'Population'])
writer.writerows(list_of_rows)
I had the same issue with Python 3. My code was writing into io.BytesIO(); replacing it with io.StringIO() solved it.
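A minimal in-memory sketch of that fix: csv.writer produces str in Python 3, so the buffer must be a text stream.

import csv
import io

buf = io.StringIO()          # text buffer; io.BytesIO() would raise TypeError
writer = csv.writer(buf)
writer.writerow(['SNo', 'States'])
print(repr(buf.getvalue()))  # 'SNo,States\r\n'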
Just change 'wb' to 'w':
outfile=open('./immates.csv','wb')
to
outfile=open('./immates.csv','w')
You are opening the CSV file in binary mode; the mode should be 'w'.
import csv

# open the csv file in write mode with utf-8 encoding
with open('output.csv', 'w', encoding='utf-8', newline='') as w:
    fieldnames = ["SNo", "States", "Dist", "Population"]
    writer = csv.DictWriter(w, fieldnames=fieldnames)
    # write a list of dicts (use writerow(dict) to write one row at a time)
    writer.writerows(list_of_dicts)
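A hedged usage sketch with made-up rows; note that DictWriter only writes the header row if you call writeheader() explicitly.

import csv

# sample rows (an assumption, matching the fieldnames above)
list_of_dicts = [
    {"SNo": 1, "States": "Andhra Pradesh", "Dist": 13, "Population": 49378776},
    {"SNo": 2, "States": "Arunachal Pradesh", "Dist": 16, "Population": 1382611},
]

with open('output.csv', 'w', encoding='utf-8', newline='') as w:
    writer = csv.DictWriter(w, fieldnames=["SNo", "States", "Dist", "Population"])
    writer.writeheader()            # the header row is not written automatically
    writer.writerows(list_of_dicts)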
import re

# assumes `soup` is an existing BeautifulSoup object
file = open('parsed_data.txt', 'w')
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    soup_link = str(link)
    print(soup_link)
    file.write(soup_link)
file.flush()
file.close()
In my case, I used BeautifulSoup to write a .txt file with Python 3.x and hit the same issue. Just as @tsduteba said, change the 'wb' in the first line to 'w'.
My Python script fetches data from the website below into a text file: 'http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448'
Now my aim is to filter out: 'Headers', 'Details', 'FromDateTime', 'UptoDateTime' and 'Updated'.
I have tried BS with text-specific search, but I'm not there yet. The code below shows my attempt. Any help would be much appreciated :) Sorry if I missed something obvious.
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd
import operator
from numpy import *

# Collect and parse first page
page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup)

for script in soup(["Header", "Details", "Updated", "UpToDateTime", "FromDateTime"]):
    script.extract()

# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

f1 = open("data.txt", "r")
resultFile = open("out.csv", "wb")
wr = csv.writer(resultFile, quotechar=',')
I expect a CSV with the columns "Header", "Details", "Updated", "UpToDateTime" and "FromDateTime".
You are going about this the wrong way. You don't need BeautifulSoup for this task: your API returns data as JSON, and BeautifulSoup is best for HTML. For your purpose you can use the pandas and json libraries.
Pandas can read directly from a web resource as well, but since you only want the ResponseData part of the JSON, you need both libraries.
Here is a snippet you can use:
import pandas as pd
import requests
import json

page = requests.get('http://api.sl.se/api2/deviations.json?key=c7606e4606f642a380f7fdd75d683448')
data = json.loads(page.text)
df = pd.DataFrame(data["ResponseData"])
df.to_csv("file path")  # replace "file path" with your output path
Change the file path and you will get the whole data set inside the CSV.
If you want to remove a column or do any other manipulation of the data, you can do that with the pandas DataFrame as well, as sketched below. It is a very powerful library; you can learn more about it online.
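For example, a hedged sketch of such manipulation (the column name is an assumption; inspect df.columns for the real ones):

# drop a column if it exists, then write without the numeric row index
df = df.drop(columns=["Scope"], errors="ignore")  # "Scope" is a guess
df.to_csv("deviations.csv", index=False)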
I was able to write and execute the program using BeautifulSoup. My idea is to capture details from the HTML source by parsing multiple URLs from a CSV file and saving the output as CSV.
The program executes well, but the CSV keeps overwriting the values in the first row.
The input file has three URLs to parse.
I want the output stored in 3 different rows.
Below is my code
import csv
import requests
import pandas
from bs4 import BeautifulSoup

with open("input.csv", "r") as f:
    reader = csv.reader(f)
    for row in reader:
        url = row[0]
        print(url)
        r = requests.get(url)
        c = r.content
        soup = BeautifulSoup(c, "html.parser")
        all = soup.find_all("div", {"class": "biz-country-us"})
        for br in soup.find_all("br"):
            br.replace_with("\n")
        l = []
        for item in all:
            d = {}
            name = item.find("h1", {"class": "biz-page-title embossed-text-white shortenough"})
            d["name"] = name.text.replace(" ", "").replace("\n", "")
            claim = item.find("div", {"class": "u-nowrap claim-status_teaser js-claim-status-hover"})
            d["claim"] = claim.text.replace(" ", "").replace("\n", "")
            reviews = item.find("span", {"class": "review-count rating-qualifier"})
            d["reviews"] = reviews.text.replace(" ", "").replace("\n", "")
            l.append(d)
        df = pandas.DataFrame(l)
        df.to_csv("output.csv")
Please let me know if I'm not being clear about anything.
Open the output file in append mode, as suggested in this post, with the modification that you add the header the first time:
from os.path import isfile

if not isfile("output.csv"):
    df.to_csv("output.csv", header=True)
else:
    with open("output.csv", "a") as f:
        df.to_csv(f, header=False)
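Alternatively, a sketch that sidesteps the append logic entirely (variable names follow the question's code): accumulate the dicts from every URL in one list inside the loop and call to_csv once after it.

import csv
import pandas

all_rows = []  # collects the dicts from every URL

with open("input.csv", "r") as f:
    for row in csv.reader(f):
        url = row[0]
        # ... scrape the page and build the list l as in the question ...
        # all_rows.extend(l)
        pass

# a single write at the end, so earlier rows are never overwritten
pandas.DataFrame(all_rows).to_csv("output.csv", index=False)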
I want to write some Japanese characters to a CSV, but I can't get them to display properly. The CSV ends up with characters like å¦æ ¡ and I'm not sure what I'm doing wrong.
import requests
import csv

r = requests.get('http://jisho.org/api/v1/search/words?keyword=%23common')
data = r.json()

with open('common_words.csv', 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['word', 'reading'])
    for entry in data["data"]:
        word = entry["japanese"][0]["word"]
        reading = entry["japanese"][0]["reading"]
        writer.writerow({'word': word, 'reading': reading})
The code is correct; reading the CSV file with Excel was the problem. Opening the file with a text editor displayed the characters properly.
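If opening the file in Excel matters, a hedged variant of the same code using utf-8-sig (as in the Arabic question above) makes Excel detect the encoding; the sample row below is an assumption.

import csv

rows = [{'word': '学校', 'reading': 'がっこう'}]  # sample row (assumption)

# utf-8-sig writes a BOM so Excel recognizes UTF-8; without it,
# Excel decodes the UTF-8 bytes of 学校 as a legacy encoding and
# shows mojibake like å¦æ ¡
with open('common_words.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['word', 'reading'])
    writer.writeheader()
    writer.writerows(rows)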