PYTHON: Problems with parsing csv file and finding maximum value [duplicate] - python

This question already has an answer here:
PYTHON: Simplest way to open a csv file and find the maximum number in a column and the name assosciated with it?
I need to read in the file https://drive.google.com/open?id=0B29hT1HI-pwxMjBPQWFYaWoyalE; however, I have tried 3-4 different approaches and repeatedly get the error "line contains NULL byte". I read on other threads that this points to a problem with the CSV file itself, but this is the file that my professor will be loading and grading me on, and I can't modify it, so I'm looking for a way around this error.
As I mentioned, I've tried several different methods to open the file. Here are my best two:
import csv

def largestState():
    INPUT = "statepopulations.csv"
    COLUMN = 5  # 6th column
    with open(INPUT, "rU") as csvFile:
        theFile = csv.reader(csvFile)
        header = next(theFile, None)  # skip header row
        pop = [float(row[COLUMN]) for row in theFile]
        max_pop = max(pop)
        print max_pop

largestState()
This results in the NULL byte error. Please ignore the additional max_pop lines; the next step after reading the file in is to find the maximum value in column F.
import csv

def test():
    with open('state-populations.csv', 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            print row

test()
This results in the NULL Byte Error.
If anyone could offer a simple solution to this problem I'd greatly appreciate it.
File as .txt: https://drive.google.com/open?id=0B29hT1HI-pwxZzhlMGZGVVAzX28

First of all, the "csv" file you have provided via the Google Drive link is NOT a CSV file. It's a gzip'ed XML file.
[~/Downloads] file state-populations.csv
state-populations.csv: gzip compressed data, from Unix
[~/Downloads] gzip -d state-populations.csv
gzip: state-populations.csv: unknown suffix -- ignored
[~/Downloads] mv state-populations.csv state-populations.csv.gz
[~/Downloads] gzip -d state-populations.csv.gz
[~/Downloads] ls state-populations.csv
state-populations.csv
[~/Downloads] file state-populations.csv
state-populations.csv: XML 1.0 document text, ASCII text, with very long lines
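If you'd rather check and decompress it from Python instead of the shell, the standard gzip module can do the same job; a minimal sketch, assuming the downloaded file is still named state-populations.csv:
import gzip
import shutil

# gzip data always starts with the two magic bytes 0x1f 0x8b
with open('state-populations.csv', 'rb') as f:
    print(f.read(2) == b'\x1f\x8b')   # True -> it really is gzip data, not CSV

# decompress it to a new file
with gzip.open('state-populations.csv', 'rb') as f_in:
    with open('state-populations.xml', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)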
You can use an XML module such as xml.etree.ElementTree to parse it:
[~/Downloads] python
Python 2.7.10 (default, Jul 30 2016, 18:31:42)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('state-populations.csv')
>>> root = tree.getroot()
>>> root
<Element '{http://www.gnumeric.org/v10.dtd}Workbook' at 0x10ded51d0>
>>> root.tag
'{http://www.gnumeric.org/v10.dtd}Workbook'
>>> for child in root:
... print child.tag, child.attrib
...
{http://www.gnumeric.org/v10.dtd}Version {'Epoch': '1', 'Full': '1.12.9', 'Major': '12', 'Minor': '9'}
{http://www.gnumeric.org/v10.dtd}Attributes {}
{urn:oasis:names:tc:opendocument:xmlns:office:1.0}document-meta {'{urn:oasis:names:tc:opendocument:xmlns:office:1.0}version': '1.2'}
{http://www.gnumeric.org/v10.dtd}Calculation {'ManualRecalc': '0', 'MaxIterations': '100', 'EnableIteration': '1', 'IterationTolerance': '0.001', 'FloatRadix': '2', 'FloatDigits': '53'}
{http://www.gnumeric.org/v10.dtd}SheetNameIndex {}
{http://www.gnumeric.org/v10.dtd}Geometry {'Width': '864', 'Height': '322'}
{http://www.gnumeric.org/v10.dtd}Sheets {}
{http://www.gnumeric.org/v10.dtd}UIData {'SelectedTab': '0'}
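From there you can walk the whole tree to find where the actual values live; a rough sketch (Gnumeric files normally keep the data in Cell elements under the namespace shown above, but check the output rather than trusting the tag names):
>>> NS = '{http://www.gnumeric.org/v10.dtd}'
>>> for elem in root.iter():
...     if elem.text and elem.text.strip():
...         print elem.tag.replace(NS, ''), elem.attrib, elem.text.strip()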

The new .txt file looks good, and your function largestState() gives the correct output. Just use return instead of print at the end.
import csv

def largestState():
    INPUT = "state-populations.txt"
    COLUMN = 5  # 6th column
    with open(INPUT, "rU") as csvFile:
        theFile = csv.reader(csvFile)
        header = next(theFile, None)  # skip header row
        pop = [float(row[COLUMN]) for row in theFile]
        max_pop = max(pop)
        return max_pop

largestState()
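If you also need the name that goes with that maximum (as the linked duplicate asks for), you can take the max over whole rows with a key function instead; a small sketch, assuming the state name is in the first column and the population in column F:
import csv

def largestState(filename="state-populations.txt", name_col=0, pop_col=5):
    with open(filename, "rU") as csvFile:
        theFile = csv.reader(csvFile)
        next(theFile, None)  # skip the header row
        # pick the row with the largest value in the population column
        best = max(theFile, key=lambda row: float(row[pop_col]))
        return best[name_col], float(best[pop_col])

print largestState()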

Related

Writing to CSV in Python- Rows inserted out of place

I have a function that writes a set of information in rows to a CSV file in Python. The function is supposed to append the file with the new row, however I am finding that sometimes it misbehaves and places the new row in a separate space of the CSV (please see picture as an example).
Whenever I reformat the data manually I delete all of the empty cells again, so you know.
Hoping someone can help, thanks!
def Logger():
    fileName = myDict[Sub]
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        if file.tell() == 0:
            writer.writerow(["Date", "Asset", "Fear", "Anger", "Anticipation", "Trust", "Surprise", "Sadness", "Disgust", "Joy",
                             "Positivity", "Negativity"])
        writer.writerow([date, Sub, fear, anger, anticip, trust, surprise, sadness, disgust, joy, positivity, negativity])
At first I thought it was a simple matter of there not being a trailing newline, and the new row being appended on the same line, right after the last row, but I can see what looks like a row's worth of empty columns between them.
This whole appending thing looks tricky. If you don't have to use Python, and can use a command-line tool instead, I recommend GoCSV.
Here's a sample file based on your screenshot I mocked up:
base.csv
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32,,,,,,,
I'm calling it base because it's the file that will be growing, and you can see it's got a problem on the last line: all those extra commas (I don't know how they got there 🤷🏻‍♂️).
The first step will be to clean it, and trim those pesky extra commas:
% gocsv clean base.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Here's another set of data to append, sample2.csv:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
GoCSV's stack command will do this job:
% gocsv stack base.csv sample2.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
This can be scripted and simplified like this:
% gocsv clean base.csv > base
% gocsv clean sample.csv > sample
% gocsv stack base sample > base.csv
% rm base sample
Try this instead...
import csv

def Logger(col_one, col_two):
    fileName = 'data.csv'
    with open(fileName, 'a+') as file:
        writer = csv.writer(file)
        file.seek(0)
        if file.read().strip() == '':
            writer.writerow(["Date", "Asset"])
        writer.writerow([col_one, col_two])
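A quick usage sketch (the values are made up, just to show the call); on Windows you generally also want newline='' in the open() call so the csv module does not insert blank lines between rows:
Logger("Nov 8", 6040)   # first call on an empty data.csv writes the header plus one row
Logger("Nov 9", 7726)   # later calls append one row each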

How to import a text file that both contains values and text in python?

I am aware that a lot of questions are already asked on this topic, but none of them worked for my specific case.
I want to import a text file in python, and want to be able to access each value separately in python. My text file looks like this (it's separated by tabs):
For example, the data '1086: CampNou' is written in one cell. I am mainly interested in getting access to the values presented here. Does anybody have a clue how to do this?
1086: CampNou 2084: Hospi 2090: Sants 2094: BCN-S 2096: BCN-N 2101: UNI 2105: B23 Total
1086: CampNou 0 15,6508 12,5812 30,3729 50,2963 0 56,0408 164,942
2084: Hospi 15,7804 0 19,3732 37,1791 54,1852 27,4028 59,9297 213,85
2090: Sants 12,8067 22,1304 0 30,6268 56,7759 29,9935 62,5204 214,854
2096: BCN-N 51,135 54,8545 57,3742 46,0102 0 45,6746 56,8001 311,849
2101: UNI 0 28,9589 31,4786 37,5029 31,6773 0 50,2681 179,886
2105: B23 51,1242 38,5838 57,3634 75,1552 56,7478 40,2728 0 319,247
Total 130,846 160,178 178,171 256,847 249,683 143,344 285,559 1404,63
You can use pandas to open and manipulate your data.
import pandas as pd

# the file is tab-separated, so tell read_csv about the delimiter
df = pd.read_csv("mytext.txt", sep="\t")
This should read your file properly.
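From there you can pull individual cells out of the DataFrame; a small sketch (and since these numbers use a comma as the decimal mark, you may also want to pass decimal=',' to read_csv so they come in as floats rather than strings):
print(df.columns.tolist())    # header labels, e.g. '1086: CampNou', '2084: Hospi', ...
print(df.iloc[0, 0])          # first data row, first column
row = df.set_index(df.columns[0]).loc['2084: Hospi']  # a whole row by its label
print(row['2090: Sants'])     # one specific cell from that row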
def read_file(filename):
    """Returns content of file"""
    file = open(filename, 'r')
    content = file.read()
    file.close()
    return content

content = read_file("the_file.txt")  # or whatever your text file is called
items = content.split('\t')  # split on tabs, since the file is tab-separated
Then your values will be in the list items: ['', '1086: CampNou', '2084: Hospi', '2090: Sants', ...]
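If you want the values row by row instead of one flat list, the csv module can split on the tab delimiter directly; a short sketch (same placeholder filename as above):
import csv

with open("the_file.txt") as f:
    rows = list(csv.reader(f, delimiter='\t'))

# rows[0] is the header row; rows[1][0] is '1086: CampNou',
# rows[1][1] is the first value in that row, and so on.
print(rows[1][0], rows[1][1])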

How to solve problem decoding from wrong json format

I need help opening and reading the file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written using json into txt file.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it, check for repeated IDs, delete them, and save the file again.
But json.loads raises ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But still getting that error, just in different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{',', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file is 9.5 MB, so it'll take you a while to open and debug it manually.
Using the head and tail tools (normally found in any GNU/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the simplest fix is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file structure and actual JSON format is the missing commas between the objects, and that they are not enclosed within [ and ]. The same can therefore be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read complete file
    a = f.read()

# Convert into single line string
b = ''.join(a.splitlines())

# Add , after each object
b = b.replace("}", "},")

# Add opening and closing brackets and drop the last comma added in the previous step
b = '[' + b[:-1] + ']'

x = json.loads(b)
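Both snippets above assume }{ never occurs inside a value. An alternative that avoids string surgery entirely is json.JSONDecoder.raw_decode, which parses one object at a time and reports where it stopped; a sketch, using the same file name as in the question:
import json

def iter_json_objects(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # skip any whitespace between objects
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

with open('111111111.txt', encoding='utf8') as f:
    merged = {}
    for chunk in iter_json_objects(f.read()):
        for key, value in chunk.items():
            merged.setdefault(key, value)  # keep the first occurrence of each id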

UnicodeEncodeError in Python CSV manipulation script

I have a script that was working earlier but now stops due to UnicodeEncodeError.
I am using Python 3.4.3.
The full error message is the following:
Traceback (most recent call last):
File "R:/A/APIDevelopment/ScivalPubsExternal/Combine/ScivalPubsExt.py", line 58, in <module>
outputFD.writerow(row)
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8a' in position 413: character maps to <undefined>
How can I address this error?
The Python script is below:
import pdb
import csv, sys, os
import glob
import os
import codecs

os.chdir('R:/A/APIDevelopment/ScivalPubsExternal/Combine')
joinedFileOut='ScivalUpdate'
csvSourceDir="R:/A/APIDevelopment/ScivalPubsExternal/Combine/AustralianUniversities"

# create dictionary from Codes file (Institution names and codes)
codes = csv.reader(open('Codes.csv'))
# rows of the file are stored as lists/arrays
InstitutionCodesDict = {}
InstitutionYearsDict = {}
for row in codes:
    # keys: instnames, values: instcodes
    InstitutionCodesDict[row[0]] = row[1]
    # define year dictionary with empty values field
    InstitutionYearsDict[row[0]] = []

# create a file descriptor for the output file, wt means text mode (rt or r is the same for reading)
with open(joinedFileOut,'wt') as csvWriteFD:
    # write the file (it is still empty here)
    outputFD=csv.writer(csvWriteFD,delimiter=',')
    # with closes the file at the end, or earlier if an exception occurs
    # open each scival file, create file descriptor (encoding needed) and then read it and print the name of the file
    if not glob.glob(csvSourceDir+"/*.csv"):
        print("CSV source files not found")
        sys.exit()
    for scivalFile in glob.glob(csvSourceDir+"/*.csv"):
        #with open(scivalFile,"rt", encoding="utf8") as csvInFD:
        with open(scivalFile,"rt", encoding="ISO-8859-1") as csvInFD:
            fileFD = csv.reader(csvInFD)
            print(scivalFile)
            # create condition for loop
            printon=False
            # reads all rows in file and creates lists/arrays of each row
            for row in fileFD:
                if len(row)>1:
                    # the next printon part is skipped when looping through the rows above the data because it is not yet set to True
                    if printon:
                        # inserts instcode and inst sequentially to each row where there is data and after the header row
                        row.insert(0, InstitutionCode)
                        row.insert(0, Institution)
                        if row[10].strip() == "-":
                            row[10] = " "
                        else:
                            p = row[10].zfill(8)
                            q = p[0:4] + '-' + p[4:]
                            row[10] = q
                        # writes output file
                        outputFD.writerow(row)
                    else:
                        if "Publications at" in row[1]:
                            # get institution name from cell B1
                            Institution=row[1].replace('Publications at the ', "").replace('Publications at ',"")
                            print(Institution)
                            # lookup institution code from dictionary
                            InstitutionCode=InstitutionCodesDict[Institution]
                        # printon gets set to True after the header row
                        if "Title" in row[0]: printon=True
                        if "Publication years" in row[0]:
                            # get the year to print it later to see which years were pulled
                            year=row[1]
                            # add year to institution in dictionary
                            if not year in InstitutionYearsDict[Institution]:
                                InstitutionYearsDict[Institution].append(year)

# Write a report showing the institution name followed by the years for
# which we have that institution's data.
with open("Instyears.txt","w") as instReportFD:
    for inst in InstitutionYearsDict:
        instReportFD.write(inst)
        for yr in InstitutionYearsDict[inst]:
            instReportFD.write(" "+yr)
        instReportFD.write("\n")
Make sure to use the correct encoding of your source and destination files. You open files in three locations:
codes = csv.reader(open('Codes.csv'))
: : :
with open(joinedFileOut,'wt') as csvWriteFD:
    outputFD=csv.writer(csvWriteFD,delimiter=',')
: : :
with open(scivalFile,"rt", encoding="ISO-8859-1") as csvInFD:
    fileFD = csv.reader(csvInFD)
This should look something like:
# Use the correct encoding. If you made this file on
# Windows it is likely Windows-1252 (also known as cp1252):
with open('Codes.csv', encoding='cp1252') as f:
    codes = csv.reader(f)
: : :
# The output encoding can be anything you want. UTF-8
# supports all Unicode characters. Windows apps tend to like
# the files to start with a UTF-8 BOM if the file is UTF-8,
# so 'utf-8-sig' is an option.
with open(joinedFileOut, 'w', encoding='utf-8-sig') as csvWriteFD:
    outputFD = csv.writer(csvWriteFD)
: : :
# This file is probably the cause of your problem and is not ISO-8859-1.
# Maybe UTF-8 instead? 'utf-8-sig' will safely handle and remove a UTF-8 BOM
# if present.
with open(scivalFile, 'r', encoding='utf-8-sig') as csvInFD:
    fileFD = csv.reader(csvInFD)
The error is caused by an attempt to write a string containing a U+008A character using the default cp1252 encoding of your system. It is trivial to fix, just declare a latin1 encoding (or iso-8859-1) for your output file (because it just outputs the original byte without conversion):
with open(joinedFileOut,'wt', encoding='latin1') as csvWriteFD:
But this will only hide the real problem: where does this 0x8a character come from? My advice is to intercept the exception and dump the line where it occurs:
try:
    outputFD.writerow(row)
except UnicodeEncodeError:
    # print the row, the name of the file being processed and the line number
    print(scivalFile, fileFD.line_num, row)
It is probably caused by one of the input files not being iso-8859-1 encoded but rather utf-8 encoded...

Python UTF-16 CSV reader

I have a UTF-16 CSV file which I have to read. Python csv module does not seem to support UTF-16.
I am using Python 2.7.2. The CSV files I need to parse are huge, running into several GBs of data.
Answers to John Machin's questions below:
print repr(open('test.csv', 'rb').read(100))
Output with test.csv having just abc as content
'\xff\xfea\x00b\x00c\x00'
I think the csv file was created on a Windows machine in the USA. I am using Mac OS X Lion.
If I use the code provided by phihag with a test.csv containing one record, the print repr(open('test.csv', 'rb').read(1000)) output is below:
'\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
Code by phihag
import codecs
import csv
with open('test.csv','rb') as f:
    sr = codecs.StreamRecoder(f, codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
                              codecs.getreader('utf-16'), codecs.getwriter('utf-16'))
    for row in csv.reader(sr):
        print row
Output of the above code
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85']
['', '', 'I']
expected output is
['1', '2', 'G', 'S', 'H f\xc3\xbcr e \xc2\x96 m \xc2\x85','','I']
At the moment, the csv module does not support UTF-16.
In Python 3.x, csv expects a text-mode file and you can simply use the encoding parameter of open to force another encoding:
# Python 3.x only
import csv
with open('utf16.csv', 'r', encoding='utf16') as csvf:
    for line in csv.reader(csvf):
        print(line)  # do something with the line
In Python 2.x, you can recode the input:
# Python 2.x only
import codecs
import csv
class Recoder(object):
    def __init__(self, stream, decoder, encoder, eol='\r\n'):
        self._stream = stream
        self._decoder = decoder if isinstance(decoder, codecs.IncrementalDecoder) else codecs.getincrementaldecoder(decoder)()
        self._encoder = encoder if isinstance(encoder, codecs.IncrementalEncoder) else codecs.getincrementalencoder(encoder)()
        self._buf = ''
        self._eol = eol
        self._reachedEof = False

    def read(self, size=None):
        r = self._stream.read(size)
        raw = self._decoder.decode(r, size is None)
        return self._encoder.encode(raw)

    def __iter__(self):
        return self

    def __next__(self):
        if self._reachedEof:
            raise StopIteration()
        while True:
            line, eol, rest = self._buf.partition(self._eol)
            if eol == self._eol:
                self._buf = rest
                return self._encoder.encode(line + eol)
            raw = self._stream.read(1024)
            if raw == '':
                self._decoder.decode(b'', True)
                self._reachedEof = True
                return self._encoder.encode(self._buf)
            self._buf += self._decoder.decode(raw)

    next = __next__

    def close(self):
        return self._stream.close()

with open('test.csv','rb') as f:
    sr = Recoder(f, 'utf-16', 'utf-8')
    for row in csv.reader(sr):
        print(row)
open and codecs.open require the file to start with a BOM. If it doesn't (or you're on Python 2.x), you can still convert it in memory, like this:
try:
    from io import BytesIO
except ImportError:  # Python < 2.6
    from StringIO import StringIO as BytesIO
import csv

with open('utf16.csv', 'rb') as binf:
    c = binf.read().decode('utf-16').encode('utf-8')
    for line in csv.reader(BytesIO(c)):
        print(line)  # do something with the line
The Python 2.x csv module documentation example shows how to handle other encodings.
I would strongly suggest that you recode your file(s) to UTF-8. Under the very likely condition that you don't have any Unicode characters outside the BMP, you can take advantage of the fact that UTF-16 is a fixed-length encoding to read fixed-length blocks from your input file without worrying about straddling block boundaries.
Step 1: Determine what encoding you actually have. Examine the first few bytes of your file:
print repr(open('thefile.csv', 'rb').read(100))
Four possible ways of encoding u'abc'
\xfe\xff\x00a\x00b\x00c -> utf_16
\xff\xfea\x00b\x00c\x00 -> utf_16
\x00a\x00b\x00c -> utf_16_be
a\x00b\x00c\x00 -> utf_16_le
If you have any trouble with this step, edit your question to include the results of the above print repr()
Step 2: Here's a Python 2.X recode-UTF-16*-to-UTF-8 script:
import sys

infname, outfname, enc = sys.argv[1:4]
fi = open(infname, 'rb')
fo = open(outfname, 'wb')
BUFSIZ = 64 * 1024 * 1024
first = True
while 1:
    buf = fi.read(BUFSIZ)
    if not buf: break
    if first and enc == 'utf_16':
        bom = buf[:2]
        buf = buf[2:]
        enc = {'\xfe\xff': 'utf_16_be', '\xff\xfe': 'utf_16_le'}[bom]
        # KeyError means file doesn't start with a valid BOM
    first = False
    fo.write(buf.decode(enc).encode('utf8'))
fi.close()
fo.close()
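A usage note, assuming you save the script above as recode.py (the file names here are placeholders): pass the input file, the output file, and the encoding you determined in step 1 on the command line, for example:
python recode.py thefile.csv thefile_utf8.csv utf_16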
Other matters:
You say that your files are too big to read the whole file, recode and rewrite, yet you can open it in vi. Please explain.
The <85> being treated as end of record is a bit of a worry. Looks like 0x85 is being recognised as NEL (C1 control code, NEWLINE). There is a strong possibility that the data was originally encoded in some legacy single-byte encoding where 0x85 has a meaning but has been transcoded to UTF-16 under the false assumption that the original encoding was ISO-8859-1 aka latin1. Where did the file originate? An IBM mainframe? Windows/Unix/classic Mac? What country, locale, language? You obviously think that the <85> is not meant to be a newline; what do you think that it means?
Please feel free to send a copy of a cut-down file (that includes some of the <85> stuff) to sjmachin at lexicon dot net
Update based on 1-line sample data provided.
This confirms my suspicions. Read this. Here's a quote from it:
... the C1 control characters ... are rarely used directly, except on
specific platforms such as OpenVMS. When they turn up in documents,
Web pages, e-mail messages, etc., which are ostensibly in an
ISO-8859-n encoding, their code positions generally refer instead to
the characters at that position in a proprietary, system-specific
encoding such as Windows-1252 or the Apple Macintosh ("MacRoman")
character set that use the codes provided for representation of the C1
set with a single 8-bit byte to instead provide additional graphic
characters
This code:
s1 = '\xff\xfe1\x00,\x002\x00,\x00G\x00,\x00S\x00,\x00H\x00 \x00f\x00\xfc\x00r\x00 \x00e\x00 \x00\x96\x00 \x00m\x00 \x00\x85\x00,\x00,\x00I\x00\r\x00\n\x00'
s2 = s1.decode('utf16')
print 's2 repr:', repr(s2)
from unicodedata import name
from collections import Counter
non_ascii = Counter(c for c in s2 if c >= u'\x80')
print 'non_ascii:', non_ascii
for c in non_ascii:
    print "from: U+%04X %s" % (ord(c), name(c, "<no name>"))
    c2 = c.encode('latin1').decode('cp1252')
    print "to:   U+%04X %s" % (ord(c2), name(c2, "<no name>"))
s3 = u''.join(
    c.encode('latin1').decode('1252') if u'\x80' <= c < u'\xA0' else c
    for c in s2
)
print 's3 repr:', repr(s3)
print 's3:', s3
produces the following (Python 2.7.2 IDLE, Windows 7):
s2 repr: u'1,2,G,S,H f\xfcr e \x96 m \x85,,I\r\n'
non_ascii: Counter({u'\x85': 1, u'\xfc': 1, u'\x96': 1})
from: U+0085 <no name>
to: U+2026 HORIZONTAL ELLIPSIS
from: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
to: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
from: U+0096 <no name>
to: U+2013 EN DASH
s3 repr: u'1,2,G,S,H f\xfcr e \u2013 m \u2026,,I\r\n'
s3: 1,2,G,S,H für e – m …,,I
Which do you think is a more reasonable interpretation of \x96:
SPA i.e. Start of Protected Area (Used by block-oriented terminals.)
or
EN DASH
?
Looks like a thorough analysis of a much larger data sample is warranted. Happy to help.
Just open your file with codecs.open like in
import codecs, csv
stream = codecs.open(<yourfile.csv>, encoding="utf-16")
reader = csv.reader(stream)
And work through your program with unicode strings, as you should do anyway if you are processing text.
