Python encoding for huge volumes of data - python

I have a huge amount of JSON data that I need to transfer to Excel (10,000 or so rows and 20-ish columns). I'm using csv. My code:

x = json.load(urllib2.urlopen('#####'))
f = csv.writer(codecs.open("fsbmigrate3.csv", "wb+", encoding='utf-8'))
y = #my headers
f.writerow(y)
for row in x:
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)

is what comes up.
I have tried encoding the JSON data:

dict((k.encode('utf-8'), v.encode('utf-8')) for (k, v) in x)

but there is too much data to handle.
Any ideas on how to pull this off? (Apologies for the lack of SO convention; it's my first post.)
The full traceback is:

Traceback (most recent call last):
  File "C:\Users\bryand\Desktop\bbsports_stuff\gba2.py", line 22, in <module>
    f.writerow(row.values())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd6' in position 0: ordinal not in range(128)
[Finished in 6.2s]

Since you didn't specify a Python version, here's a Python 3 solution; the Python 2 solution is much more painful. I've included some short sample data with non-ASCII characters:
#!python3
import json
import csv

json_data = '[{"a": "\\u9a6c\\u514b", "c": "somethingelse", "b": "something"}, {"a": "one", "c": "three", "b": "two"}]'
data = json.loads(json_data)

with open('fsbmigrate3.csv', 'w', encoding='utf-8-sig', newline='') as f:
    w = csv.DictWriter(f, fieldnames=sorted(data[0].keys()))
    w.writeheader()
    w.writerows(data)
The utf-8-sig codec makes sure a byte order mark (BOM) is written at the start of the output file; otherwise Excel will assume the local ANSI encoding.
Since you have JSON data with key/value pairs, using DictWriter allows the column order to be specified; otherwise, the header order isn't predictable.
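To verify the BOM behaviour, here is a minimal sketch (the file path and sample value are made up; U+00D6 is the character from the asker's traceback):

```python
import csv
import os
import tempfile

# Sample row containing a non-ASCII value (U+00D6, from the traceback).
rows = [{"name": "\u00d6zil", "team": "Arsenal"}]

path = os.path.join(tempfile.gettempdir(), "bom_demo.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "team"])
    w.writeheader()
    w.writerows(rows)

# The file now starts with the UTF-8 byte order mark that Excel looks for.
with open(path, "rb") as f:
    assert f.read(3) == b"\xef\xbb\xbf"
```

Opening the resulting file in Excel should then show the Ö correctly instead of mojibake.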

Related

UnicodeError Replacing Not Working - Python

I am trying to replace non-Unicode characters with an _, but this program, despite running with no errors, does not solve the issue and I cannot determine why.
import csv
import unicodedata
import pandas as pd

df = pd.read_csv('/Users/pabbott/Desktop/Unicode.csv', sep=',',
                 index_col=False, converters={'ClinetEMail': str, 'ClientZip': str,
                 'LocationZip': str, 'LicenseeName': str, 'LocationState': str,
                 'AppointmentType': str, 'ClientCity': str, 'ClientState': str})
data = df
for row in data:
    for val in row:
        try:
            val.encode("utf-8")
        except UnicodeDecodeError:
            replace(val, "_")
data.to_csv('UnicodeExport.csv', sep=',', index=False,
            quoting=csv.QUOTE_NONNUMERIC)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 4: invalid start byte

The above message (thrown from pd.read_csv) shows that the file is not saved as utf-8. You need to
either save the file as utf-8,
or read the file using the proper encoding.
For instance (the latter variant), add encoding='windows-1252' to the df = pd.read_csv(… call as follows:
df = pd.read_csv('/Users/pabbott/Desktop/Unicode.csv', sep=',', encoding='windows-1252',
                 index_col=False, converters={'ClinetEMail': str, 'ClientZip': str,
                 'LocationZip': str, 'LicenseeName': str, 'LocationState': str,
                 'AppointmentType': str, 'ClientCity': str, 'ClientState': str})
Then, you can omit the whole try: val.encode("utf-8") block inside the for row in data: / for val in row: loops.
Read pandas.read_csv:

    encoding : str, default None
        Encoding to use for UTF when reading/writing (ex. 'utf-8'). List of
        Python standard encodings.
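If you don't know the file's encoding up front, one approach is to try a few likely candidates in order; the helper below is a sketch (the candidate list is an assumption about where the data came from), and the encoding it returns can be passed straight to pd.read_csv:

```python
def read_with_fallback(path, encodings=("utf-8", "windows-1252", "latin-1")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")
```

Note that latin-1 never fails (every byte is valid), so it acts as a last-resort catch-all.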

UnicodeDecodeError: 'utf8' codec can't decode byte error

I have a csv file which has 4 columns, namely tweet_id, label, topic, text. In one of the rows, the "text" column has the value:
I'm wit chu!! “#ShayDiddy: Officially boycotting #ups!!! Calling #apple to curse them out next for using them wasting my time!â€
I am using this code for importing the data:
def createTrainingCorpus(corpusFile):
    import csv
    corpus = []
    with open(corpusFile, 'rb') as csvfile:
        lineReader = csv.reader(csvfile, delimiter=',')
        r = 1
        for row in lineReader:
            if r < 257:
                corpus.append({"tweet_id": row[2], "label": row[1], "topic": row[0], "text": row[4]})
                r = r + 1
    return corpus

corpusFile = "/Users/name/Desktop/corpus.csv"
TrainingData = createTrainingCorpus(corpusFile)
This row doesn't get added to the list TrainingData and I receive an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

The TrainingData list has all the elements as expected until the loop reaches the row with the "text" mentioned above. I googled the error but couldn't find a solution that worked for me. Please help.
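For what it's worth, on Python 3 this class of error goes away if the file is opened in text mode with an explicit encoding, so the csv module hands back already-decoded strings. A sketch with inline sample data standing in for the real corpus:

```python
import csv
import io

# Sample line containing the curly quotes that trip up the ASCII codec.
sample = 'topic1,label1,123,extra,I\u2019m wit chu \u201c#ShayDiddy\u201d\n'

# io.StringIO stands in for open(corpusFile, encoding='utf-8', newline='').
reader = csv.reader(io.StringIO(sample), delimiter=',')
rows = list(reader)

# The fifth column holds the non-ASCII text, already decoded to str.
assert "\u2019" in rows[0][4]
```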

UnicodeEncodeError: 'ascii' codec can't encode character error using writerow and map

In Python 2.7 and Ubuntu 14.04 I am trying to write to a csv file:
csv_w.writerow( map( lambda x: flatdata.get( x, "" ), columns ))
this gives me the notorious
UnicodeEncodeError: 'ascii' codec can't encode character u'\u265b' in position 19: ordinal not in range(128)
error.
The usual advice on here is to use unicode(x).encode("utf-8").
I have tried this, and also just .encode("utf-8"), for both parameters in the get:

csv_w.writerow( map( lambda x: flatdata.get( unicode(x).encode("utf-8"), unicode("").encode("utf-8") ), columns ))

but I still get the same error.
Any help getting rid of the error is much appreciated. (I imagine the unicode("").encode("utf-8") is clumsy, but I'm still a newb.)
EDIT:
My full program is:
#!/usr/bin/env python
import json
import csv
import fileinput
import sys
import glob
import os

def flattenjson(b, delim):
    val = {}
    for i in b.keys():
        if isinstance(b[i], dict):
            get = flattenjson(b[i], delim)
            for j in get.keys():
                val[i + delim + j] = get[j]
        else:
            val[i] = b[i]
    return val

def createcolumnheadings(cols):
    # create column headings
    print('a', cols)
    columns = cols.keys()
    columns = list(set(columns))
    print('b', columns)
    return columns

doOnce = True
out_file = open('Excel.csv', 'wb')
csv_w = csv.writer(out_file, delimiter="\t")
print sys.argv, os.getcwd()
os.chdir(sys.argv[1])

for line in fileinput.input(glob.glob("*.txt")):
    print('filename:', fileinput.filename(), 'line #:', fileinput.filelineno(), 'line:', line)
    data = json.loads(line)
    flatdata = flattenjson(data, "__")
    if doOnce:
        columns = createcolumnheadings(flatdata)
        print('c', columns)
        csv_w.writerow(columns)
        doOnce = False
    csv_w.writerow(map(lambda x: flatdata.get(unicode(x).encode("utf-8"), unicode("").encode("utf-8")), columns))
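As a sanity check on the flattening logic, here is the same function run against a small nested dict (Python 3 syntax; the sample data is made up):

```python
def flattenjson(b, delim):
    # Recursively flatten nested dicts, joining key paths with delim.
    val = {}
    for i in b.keys():
        if isinstance(b[i], dict):
            got = flattenjson(b[i], delim)
            for j in got.keys():
                val[i + delim + j] = got[j]
        else:
            val[i] = b[i]
    return val

nested = {"user": {"name": "a", "id": 1}, "text": "hello"}
flat = flattenjson(nested, "__")

# Keys from the inner dict are prefixed with their parent key.
assert flat == {"user__name": "a", "user__id": 1, "text": "hello"}
```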
Redacted single tweet that throws the error UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 14: ordinal not in range(128): is available here.
SOLUTION: as per Alistair's advice I installed unicodecsv.
The steps were:
Download the zip from here
Install it: sudo pip install /path/to/zipfile/python-unicodecsv-master.zip
import unicodecsv as csv
csv_w = csv.writer(f, encoding='utf-8')
csv_w.writerow(flatdata.get(x, u'') for x in columns)
Without seeing your data, it would seem that it contains Unicode data types (see How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte" for a brief explanation of Unicode vs. str types).
Your solution of encoding it is error prone: any str with non-ASCII bytes in it will throw an error when you unicode() it (see the previous link for an explanation).
You should get all your data into Unicode types before writing to CSV. As Python 2.7's csv module is broken for Unicode, you will need to use the drop-in replacement: https://github.com/jdunck/python-unicodecsv.
You may also wish to break your map out into a separate statement to avoid confusion. Make sure to provide the full stack trace and examples of your code.
csv_w.writerow( map( lambda x: flatdata.get( unicode(x).encode("utf-8"), unicode("").encode("utf-8") ), columns ))

You've encoded the parameters passed to flatdata.get(), i.e. the dict keys. But the Unicode characters aren't in the keys, they're in the values. You should encode the value returned by get():

csv_w.writerow([flatdata.get(x, u'').encode('utf-8') for x in columns])
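The distinction can be seen with a toy dict (illustrative data, written in Python 3 syntax where str is already Unicode; U+265B is the character from the error message):

```python
flatdata = {"piece": "\u265bqueen"}   # non-ASCII character lives in the value
columns = ["piece", "missing"]

# Encoding the *values* returned by get() yields UTF-8 bytes ready for a
# byte-oriented csv writer; encoding the keys would change nothing, since
# the keys here are plain ASCII.
row = [flatdata.get(x, "").encode("utf-8") for x in columns]

assert row[0] == b"\xe2\x99\x9bqueen"   # UTF-8 encoding of U+265B
assert row[1] == b""                    # missing key falls back to empty bytes
```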

Printing 2d array with Russian characters to a csv with python

I have a two-dimensional array which has Russian/Arabic/Chinese characters in it. I am not sure how to print all of it to a CSV file; I would like it written to the CSV in UTF-8. The code I am using is listed below.
matchedArray = [[]]
matchedArray.append(x)
ar = matchedArray
f1 = open('C:/Users/sagars/Desktop/encoding.csv', 'w')
writer = csv.writer(f1)
writer.writerow(['Flat Content', 'Post Dates', 'AuthorIDs', 'ThreadIDs',
                 'Matched Keyword', 'Count of keyword'])  # if needed
for values in ar:
    writer.writerow(values)
However, I am getting the following error:
Traceback (most recent call last):
  File "C:/Users/sagars/PycharmProjects/encoding/encoding.py", line 18, in <module>
    writer.writerow(values)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-13: ordinal not in range(128)
[Process finished with exit code 1]
How can I print this 2D array to the CSV file?
Thanks!
You can use str.encode to encode each string with the right encoding before writing to the csv file:

for values in ar:
    # values is a list of unicode strings
    writer.writerow([val.encode("utf-8") for val in values])
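On Python 3 the per-value encoding is unnecessary: open the file in text mode with an explicit encoding and write the strings directly. A sketch (file path and rows are made up; utf-8-sig is chosen so Excel detects the encoding):

```python
import csv
import os
import tempfile

rows = [["\u041f\u0440\u0438\u0432\u0435\u0442", "2016-01-01", "42"],  # Russian
        ["\u4f60\u597d", "2016-01-02", "43"]]                          # Chinese

path = os.path.join(tempfile.gettempdir(), "multilang.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Flat Content", "Post Dates", "AuthorIDs"])
    writer.writerows(rows)
```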

Getting error in python with characters from word document

I have this text which is entered in a text box:

‘f’fdsfs’`124539763~!##$%^’’;’””::’

I am converting it to JSON and then it comes out as:

"\\u2018f\\u2019fdsfs\\u2019`124539763~!##$%^\\u2019\\u2019;\\u2019\\u201d\\u201d::\\u2019e"

Now when I write the csv file I get this error:

'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)
csv.writer(data)

I tried data.encode('utf-8') and data.decode('unicode-escape') but they didn't work.
The Python 2 csv module does not support Unicode; use https://github.com/jdunck/python-unicodecsv instead.
(Note that \u2018 can be encoded in UTF-8, like every Unicode character; the question is which encoding the program reading the csv expects.)

x = "\\u2018f\\u2019fdsfs..."
j = json.loads('"' + x + '"')
print j.encode('cp1252')  # prints: ‘f’fdsfs...

Note that here it is being encoded as cp1252.
>>> import unicodecsv as csv  # https://github.com/jdunck/python-unicodecsv
>>> x = "\\u2018f\\u2019fdsfs..."; j = json.loads('"' + x + '"')
>>> with open("some_file.csv", "wb") as f:
...     w = csv.writer(f, encoding="cp1252")
...     w.writerow([j, "normal"])
...
>>>
here is the csv file : https://www.dropbox.com/s/m4gta1o9vg8tfap/some_file.csv
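The same round trip can be checked on Python 3 (the data elided with "..." above is replaced by a short stand-in here):

```python
import json

x = "\\u2018f\\u2019fdsfs"        # escaped form, as it appears after JSON dumping
j = json.loads('"' + x + '"')     # back to the real curly-quote characters
assert j == "\u2018f\u2019fdsfs"

# cp1252 maps the curly quotes to the single bytes 0x91 and 0x92, which is
# why Windows tools display them correctly when the file is cp1252-encoded.
encoded = j.encode("cp1252")
assert encoded == b"\x91f\x92fdsfs"
```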
