Printing a 2D array with Russian characters to a CSV with Python

I have a two-dimensional array which has Russian/Arabic/Chinese characters in it. I am not sure how to print all of it to a CSV file. I would like to print it to the CSV in UTF-8. The code which I am using is listed below.
import csv

matchedArray = [[]]
matchedArray.append(x)  # x is the row data built earlier (not shown in the question)
ar = matchedArray
f1 = open('C:/Users/sagars/Desktop/encoding.csv', 'w')
writer = csv.writer(f1)
writer.writerow(['Flat Content', 'Post Dates', 'AuthorIDs', 'ThreadIDs',
                 'Matched Keyword', 'Count of keyword'])  # if needed
for values in ar:
    writer.writerow(values)
However, I am getting the following error:
Traceback (most recent call last):
File "C:/Users/sagars/PycharmProjects/encoding/encoding.py", line 18, in <module>
writer.writerow(values)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-13: ordinal not in range(128)
Process finished with exit code 1
How can I print this 2D array to the CSV file?
Thanks!

You can use str.encode to encode each string with the right encoding before writing it to the CSV file.
for values in ar:
    # values is a list of unicode strings
    writer.writerow([val.encode("utf-8") for val in values])
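Put together, a minimal Python 2 sketch (the sample rows and file name here are illustrative, not from the question):

# -*- coding: utf-8 -*-
import csv

# sample 2D data containing non-ASCII text, standing in for matchedArray
ar = [[u'привет', u'2016-01-01', u'42'],
      [u'мир', u'2016-01-02', u'43']]

with open('encoding.csv', 'wb') as f1:  # binary mode for Python 2's csv module
    writer = csv.writer(f1)
    writer.writerow(['Flat Content', 'Post Dates', 'AuthorIDs'])
    for values in ar:
        # encode each unicode value to UTF-8 bytes before writing
        writer.writerow([val.encode('utf-8') for val in values])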

Related

"01"-string representing bytes to unicode conversion in python 2

If I have a byte, e.g. 11001010 or 01001010, how can I convert it back to Unicode if it is a valid code point?
I can take inputs and do a regex check on the input, but that would be a crude way of doing it, and it would be limited to UTF-8 only. If I want to extend it in the future, how can I optimise the solution?
The input is a string of 0s and 1s:
11001010 (this is invalid)
or 01001010 (this is valid)
or 11010010 11001110 (this is invalid)
If there is no other text, split the string on whitespace, convert each group of bits to an integer, and feed the result to a bytearray() object to decode:
as_binary = bytearray(int(b, 2) for b in inputtext.split())
as_unicode = as_binary.decode('utf8')
By putting the integer values into a bytearray() we avoid having to concatenate individual characters and get a convenient .decode() method as a bonus.
Note that this does expect the input to contain valid UTF-8. You could add an error handler to replace bad bytes rather than raise an exception, e.g. as_binary.decode('utf8', 'replace').
Wrapped up as a function that takes a codec and error handler:
def to_text(inputtext, encoding='utf8', errors='strict'):
    as_binary = bytearray(int(b, 2) for b in inputtext.split())
    return as_binary.decode(encoding, errors)
Most of your samples are not actually valid UTF-8, so the demo sets errors to 'replace':
>>> to_text('11001010', errors='replace')
u'\ufffd'
>>> to_text('01001010', errors='replace')
u'J'
>>> to_text('11010010 11001110', errors='replace')
u'\ufffd\ufffd'
Leave errors to the default if you want to detect invalid data; just catch the UnicodeDecodeError exception thrown:
>>> to_text('11010010 11001110')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in to_text
File "/Users/mjpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd2 in position 0: invalid continuation byte
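For example, a small validity check built on top of to_text (the helper name is_valid_utf8 is mine, not from the original answer):

def is_valid_utf8(inputtext):
    # rely on the default errors='strict' to raise on invalid bytes
    try:
        to_text(inputtext)
        return True
    except UnicodeDecodeError:
        return False

>>> is_valid_utf8('01001010')
True
>>> is_valid_utf8('11010010 11001110')
False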

UnicodeEncodeError in Python CSV manipulation script

I have a script that was working earlier but now stops due to UnicodeEncodeError.
I am using Python 3.4.3.
The full error message is the following:
Traceback (most recent call last):
File "R:/A/APIDevelopment/ScivalPubsExternal/Combine/ScivalPubsExt.py", line 58, in <module>
outputFD.writerow(row)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8a' in position 413: character maps to <undefined>
How can I address this error?
The Python script is the following below:
import pdb
import csv, sys, os
import glob
import os
import codecs

os.chdir('R:/A/APIDevelopment/ScivalPubsExternal/Combine')
joinedFileOut = 'ScivalUpdate'
csvSourceDir = "R:/A/APIDevelopment/ScivalPubsExternal/Combine/AustralianUniversities"

# create dictionary from Codes file (institution names and codes)
codes = csv.reader(open('Codes.csv'))
# rows of the file are stored as lists/arrays
InstitutionCodesDict = {}
InstitutionYearsDict = {}
for row in codes:
    # keys: institution names, values: institution codes
    InstitutionCodesDict[row[0]] = row[1]
    # define year dictionary with an empty values field
    InstitutionYearsDict[row[0]] = []

# create a file descriptor for the output file; 'wt' means text mode (same as 'w')
with open(joinedFileOut, 'wt') as csvWriteFD:
    # write the file (it is still empty here)
    outputFD = csv.writer(csvWriteFD, delimiter=',')
    # 'with' closes the file at the end, or earlier if an exception occurs
    # open each Scival file, create a file descriptor (encoding needed),
    # then read it and print the name of the file
    if not glob.glob(csvSourceDir + "/*.csv"):
        print("CSV source files not found")
        sys.exit()
    for scivalFile in glob.glob(csvSourceDir + "/*.csv"):
        #with open(scivalFile, "rt", encoding="utf8") as csvInFD:
        with open(scivalFile, "rt", encoding="ISO-8859-1") as csvInFD:
            fileFD = csv.reader(csvInFD)
            print(scivalFile)
            # condition for the loop below
            printon = False
            # read all rows in the file; each row becomes a list
            for row in fileFD:
                if len(row) > 1:
                    # the printon branch is skipped while looping through the
                    # rows above the data, because printon is not yet True
                    if printon:
                        # insert institution code and name into each data row,
                        # after the header row
                        row.insert(0, InstitutionCode)
                        row.insert(0, Institution)
                        if row[10].strip() == "-":
                            row[10] = " "
                        else:
                            p = row[10].zfill(8)
                            q = p[0:4] + '-' + p[4:]
                            row[10] = q
                        # write the output file
                        outputFD.writerow(row)
                    else:
                        if "Publications at" in row[1]:
                            # get institution name from cell B1
                            Institution = row[1].replace('Publications at the ', "").replace('Publications at ', "")
                            print(Institution)
                            # look up the institution code in the dictionary
                            InstitutionCode = InstitutionCodesDict[Institution]
                        # printon gets set to True after the header row
                        if "Title" in row[0]:
                            printon = True
                        if "Publication years" in row[0]:
                            # record the year so we can see later which years were pulled
                            year = row[1]
                            # add year to institution in dictionary
                            if year not in InstitutionYearsDict[Institution]:
                                InstitutionYearsDict[Institution].append(year)

# Write a report showing the institution name followed by the years for
# which we have that institution's data.
with open("Instyears.txt", "w") as instReportFD:
    for inst in InstitutionYearsDict:
        instReportFD.write(inst)
        for yr in InstitutionYearsDict[inst]:
            instReportFD.write(" " + yr)
        instReportFD.write("\n")
Make sure to use the correct encoding of your source and destination files. You open files in three locations:
codes = csv.reader(open('Codes.csv'))
: : :
with open(joinedFileOut,'wt') as csvWriteFD:
    outputFD=csv.writer(csvWriteFD,delimiter=',')
: : :
with open(scivalFile,"rt", encoding="ISO-8859-1") as csvInFD:
fileFD = csv.reader(csvInFD)
This should look something like:
# Use the correct encoding. If you made this file on
# Windows it is likely Windows-1252 (also known as cp1252):
with open('Codes.csv', encoding='cp1252') as f:
    codes = csv.reader(f)
: : :
# The output encoding can be anything you want. UTF-8
# supports all Unicode characters. Windows apps tend to like
# the files to start with a UTF-8 BOM if the file is UTF-8,
# so 'utf-8-sig' is an option. newline='' is what the csv docs
# recommend, and avoids blank lines in the output on Windows.
with open(joinedFileOut, 'w', encoding='utf-8-sig', newline='') as csvWriteFD:
    outputFD = csv.writer(csvWriteFD)
: : :
# This file is probably the cause of your problem and is not ISO-8859-1.
# Maybe UTF-8 instead? 'utf-8-sig' will safely handle and remove a UTF-8 BOM
# if present.
with open(scivalFile, 'r', encoding='utf-8-sig') as csvInFD:
    fileFD = csv.reader(csvInFD)
The error is caused by an attempt to write a string containing a U+008A character using the default cp1252 encoding of your system. It is trivial to fix: just declare a latin1 (or iso-8859-1) encoding for your output file, because latin1 writes each code point below 256 as the corresponding byte, without conversion:
with open(joinedFileOut,'wt', encoding='latin1') as csvWriteFD:
But this will only hide the real problem: where does this 0x8a character come from? My advice is to intercept the exception and dump the line where it occurs:
try:
    outputFD.writerow(row)
except UnicodeEncodeError:
    # print the row, the name of the file being processed and the line number
    print(row, scivalFile, fileFD.line_num)
It is probably caused by one of the input files not being ISO-8859-1 encoded but more probably UTF-8 encoded...
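One way to act on that hunch is to try UTF-8 first and fall back to ISO-8859-1 for each input file; a sketch (mine, not from the original answer):

import csv

def read_rows(path):
    # ISO-8859-1 accepts every possible byte value, so the fallback never
    # fails, though it may mis-map characters if the file uses yet another
    # encoding.
    for encoding in ('utf-8-sig', 'ISO-8859-1'):
        try:
            with open(path, 'rt', encoding=encoding, newline='') as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue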

Error writing 3000+ pdf files in one txt file with python 3

I am trying to extract text from 3000+ PDFs into one txt file (and I had to remove the header from each page):
for x in range(len(files)-len(files)+15):
    pdfFileObj = open(files[x], 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        content = pageObj.extractText()
        removeIndex = content.find('information.') + len('information.')
        newContent = content[removeIndex:]
        file.write(newContent)
file.close()
However, I get the following error:
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 5217: character maps to <undefined>
I was not able to check the encoding of each PDF, so I just used replace(). Below is the working code:
for x in range(len(files)):
    pdfFileObj = open(os.path.join(filepath, files[x]), 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # restores the reader missing from the pasted snippet
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        content = pageObj.extractText()
        removeIndex = content.find('information.') + len('information.')
        newContent = content[removeIndex:]
        newContent = newContent.replace('\n', ' ')
        newContent = newContent.replace('\ufb02', 'FL')
        file.write(str(newContent.encode('utf-8')))
file.close()
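An alternative worth noting: str(newContent.encode('utf-8')) writes literal b'...' text in Python 3, and the replace() calls become unnecessary if the output file itself is opened as UTF-8. A sketch under that assumption (the output name combined.txt is illustrative):

with open('combined.txt', 'w', encoding='utf-8') as out:
    for name in files:
        pdfFileObj = open(os.path.join(filepath, name), 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        for pageNum in range(1, pdfReader.numPages):
            content = pdfReader.getPage(pageNum).extractText()
            removeIndex = content.find('information.') + len('information.')
            # ligatures such as '\ufb02' (fl) can now be written as-is
            out.write(content[removeIndex:].replace('\n', ' '))
        pdfFileObj.close()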

I am trying to scrape a wikipedia world capitals page for the country and capital pairs

The program works for a short time and then hits an error and I have no idea what it means or how to fix it.
Here is the code:
from bs4 import BeautifulSoup
import urllib

BASE_URL = "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"

capitals_countries = []
html = urllib.urlopen(BASE_URL).read()
soup = BeautifulSoup(html, "html.parser")
country_table = soup.find('table', {"class": "wikitable sortable"})
for row in country_table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) == 3:
        capitals_countries.append((cols[0].text.strip(), cols[1].text.strip()))

for capital, country in capitals_countries:
    print('{:35} {}'.format(capital, country))
Here is the error:
Traceback (most recent call last):
File "/Users/Kyle/Documents/scraper.py", line 19, in <module>
print('{:35} {}'.format(capital, country))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)
I have a rather basic understanding of HTML and scraping in general. I would appreciate any clarity anyone can provide about what is going on here.
You already have a unicode string; trying capital.decode('utf-8') is going to give you:
In [13]: s = u'\xe1'
In [14]: print s
á
In [15]: s.decode("utf-8")
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-61efbeae9c77> in <module>()
----> 1 s.decode("utf-8")
/usr/lib/python2.7/encodings/utf_8.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_8_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
The reason you see it in your own code is that str.format does the same thing: when you call format on a byte format string with a unicode argument, Python tries to encode the unicode string to ASCII, which fails because it contains non-ASCII characters:
In [16]: print "{}".format(s)
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-16-1119d22adcca> in <module>()
----> 1 print "{}".format(s)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
All you need is to make the str.format string a unicode string with a leading u; do not decode anything:
In [17]: print u"{}".format(s)
á
So in your own code you need a leading u on your format string, nothing else.
for capital, country in capitals_countries:
    print(u'{:35} {}'.format(capital, country))
You can verify that you have a unicode string by adding a print type(capital), which would output <type 'unicode'>.
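For instance, a quick Python 2 check (illustrative):

>>> type(capital)
<type 'unicode'>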
When I looked at the list, I saw that the code broke at Bogotá.
I think it's breaking due to the special character á.
When I change the print statement to
print(u'{:35} {}'.format(capital, country))
it works perfectly fine.
This should fix the issue:
print('{:35} {}'.format(capital.decode('utf-8'), country.decode('utf-8')))
Or as suggested by #karlson in the comment, we can also use unicode strings like:
print(u'{:35} {}'.format(capital, country))
Now the u'{:35} {}' part is unicode, so you don't need to decode it any more.

python encoding for huge volumes of data

I have a huge amount of JSON data that I need to transfer to Excel (10,000 or so rows and 20-ish columns). I'm using csv. My code:
x = json.load(urllib2.urlopen('#####'))
f = csv.writer(codecs.open("fsbmigrate3.csv", "wb+", encoding='utf-8'))
y = #my headers
f.writerow(y)
for row in x:
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
is what comes up.
I have tried encoding the JSON data:
dict((k.encode('utf-8'), v.encode('utf-8')) for (k,v) in x)
but there is too much data to handle. Any ideas on how to pull this off? (Apologies for the lack of SO convention; it's my first post.)
The full traceback is:
Traceback (most recent call last):
File "C:\Users\bryand\Desktop\bbsports_stuff\gba2.py", line 22, in <module>
f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
[Finished in 6.2s]
Since you didn't specify, here's a Python 3 solution; the Python 2 solution is much more painful. I've included some short sample data with non-ASCII characters:
#!python3
import json
import csv

json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}, {"a": "one", "c": "three", "b": "two"}]'
data = json.loads(json_data)

with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)
The utf-8-sig codec makes sure a byte order mark character (BOM) is written at the start of the output file, since Excel will assume the local ANSI encoding otherwise.
Since you have json data with key/value pairs, using DictWriter allows the headers to be specified; otherwise, the header order isn't predictable.
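To connect this back to the original fetch, a sketch in Python 3 (the '#####' placeholder URL is kept from the question, and I assume the endpoint returns a list of flat JSON objects):

import json
import csv
from urllib.request import urlopen

raw = urlopen('#####').read()           # '#####' is the question's placeholder URL
data = json.loads(raw.decode('utf-8'))  # decode the response bytes explicitly

with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)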
