Python CSV reader isn't reading CSV data the way I expect - python

I'm trying to read some CSV data into an array. I can probably explain what I'm trying to do better in Python than in English:
>>> line = ImportFile.objects.all().reverse()[0].file.split("\n")[0]
>>> line
'"007147","John Smith","100 Farley Ln","","Berlin NH 03570","Berlin","NH",2450000,"John",24643203,3454,"E","",2345071,1201,"N",15465,"I",.00,20102456,945610,20247320,1245712,"0T",.00100000,"",.00,.00,780,"D","000",.00,0\r'
>>> s = cStringIO.StringIO()
>>> s
<cStringIO.StringO object at 0x9ab1960>
>>> s.write(line)
>>> r = csv.reader(s)
>>> r
<_csv.reader object at 0x9aa217c>
>>> [line for line in r]
[]
As you can see, the CSV data starts in memory, not in a file. I would expect my reader to have some of that data but it doesn't. What am I doing wrong?

You are using StringIO in the wrong way. Try
s = cStringIO.StringIO(line)
r = csv.reader(s)
next(r)
# "['007147', 'John Smith', '100 Farley Ln', '', 'Berlin NH 03570', 'Berlin', 'NH', '2450000', 'John', '24643203', '3454', 'E', '', '2345071', '1201', 'N', '15465', 'I', '.00', '20102456', '945610', '20247320', '1245712', '0T', '.00100000', '', '.00', '.00', '780', 'D', '000', '.00', '0']"
and the result should be what you expect.
Edit: To explain in more detail: After writing to the StringIO instance, the file pointer will point past the end of the contents. This is where you would expect new contents to be written by subsequent write() calls. But this also means that read() calls will not return anything. You would need to call s.reset() or s.seek(0) to reset the position to the beginning, or initialise the StringIO with the desired contents.
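The position behaviour described above can be reproduced in a few lines. This is only a sketch, assuming Python 3 and io.StringIO instead of the question's Python 2 cStringIO; the shortened sample line and the names rows_before and rows_after are made up for illustration.

```python
import csv
import io

# Shortened version of the line from the question (assumption: any CSV line works).
line = '"007147","John Smith","100 Farley Ln","",2450000\r'

s = io.StringIO(newline="")
s.write(line)

# The position is now past the written text, so the reader finds nothing.
rows_before = list(csv.reader(s))

# Rewind to the start, then read again.
s.seek(0)
rows_after = list(csv.reader(s))

print(rows_before)
print(rows_after)
```

Passing the text to the constructor, as in io.StringIO(line), leaves the position at 0, so no seek is needed in that case.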

Add s.seek(0) after s.write(line). After the write, the current position in the file-like object s is just past the written line, so the reader has nothing left to read.

Related

python write german umlaute into a file

I know this question has been asked a million times, but I am still stuck with it. I am using Python 2 and cannot change to Python 3.
The problem is this:
>>> w = u"ümlaut"
>>> w
u'\xfcmlaut'
>>> print w
ümlaut
>>> dic = {'key': w}
>>> dic
{'key': u'\xfcmlaut'}
>>> f = io.open('testtt.sql', mode='a', encoding='UTF8')
>>> f.write(u'%s' % dic)
then file has:
{'key': u'\xfcmlaut'}
I need {'key': 'ümlaut'} or {'key': u'ümlaut'}
What am I missing, I am still noob in encoding decoding things :/
I'm not sure why you want this format particularly, since it won't be valid to read into any other application, but never mind.
The problem is that to write a dictionary to a file, it needs to be converted to a string - and to do that, Python calls repr on all its elements.
If you create the output manually as a string, all is well:
d = "{'key': '%s'}" % w
with io.open('testtt.sql', mode='a', encoding='UTF8') as f:
    f.write(d)
The easiest solution is to switch to Python 3, but since you can't do that, how about converting the dictionary to JSON before saving it to the file? With ensure_ascii=False, json.dumps keeps non-ASCII characters instead of escaping them, so there is no need to change the default encoding with sys.setdefaultencoding.
import io
import json
w = u"ümlaut"
dic = {'key': w}
with io.open('testtt.sql', mode='a', encoding='utf8') as f:
    # io.open in text mode expects unicode, which json.dumps returns here
    f.write(unicode(json.dumps(dic, ensure_ascii=False)))
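For comparison, here is a rough sketch of the same JSON idea under Python 3, which the asker cannot use but which shows what ensure_ascii changes. The temporary directory and the names escaped, readable and written are assumptions for illustration only.

```python
import io
import json
import os
import tempfile

dic = {"key": "ümlaut"}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(dic)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(dic, ensure_ascii=False)

# Write to a throwaway file (hypothetical location) and read it back as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "testtt.sql")
with io.open(path, mode="a", encoding="utf8") as f:
    f.write(readable)
with io.open(path, encoding="utf8") as f:
    written = f.read()
```

Note that JSON uses double quotes, so the file holds {"key": "ümlaut"} rather than the single-quoted form from the question.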

Python csv reader incomplete file line iteration

Here is my problem. I need to parse a comma separated file and I've got my code working how I would like, however while testing it and attempting to break things I've come across a problem.
Here is the example code:
import csv
compareList=["testfield1","testfield2","testfield3","testfield4"]
z=open("testFile",'r')
# output file; the name "testFileOut" is an assumption, the original post never opens it
testFile=open("testFileOut",'w')
x=csv.reader(z,quotechar='\'')
testDic={}
iter=0
for lineList in x:
    try:
        for item in compareList:
            testDic[item]=lineList[iter]
            iter+=1
        iter=0
    except IndexError:
        # line is too short: replace it with empty values
        iter=0
        lineList=[]
        for item in compareList:
            lineList.append("")
            testDic[item]=lineList[iter]
            iter+=1
        iter=0
    for item in compareList:
        testFile.write(testDic[item])
        if compareList.index(item)!=len(compareList)-1:
            testFile.write(",")
    testFile.write('\n')
testFile.close()
z.close()
So what this is supposed to do is check and make sure that each line of the csv file matches the length of a list. If the length of the line does not match the length of the list, then the line is converted to null values(commas) that equal the length of compareList.
Here is an example of what is in the file:
,,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The code works just fine if the line is missing an item. So the output of a file containing:
,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
will look like this:
,,,,
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The problem I have induced is when the line looks something like this:
,"sometext",343434
,,"moretext",343434
,,"St,'",uff",4543343
,,"morestuff",3434354
The output of this file will be:
,,,,
,,"moretext",343434
,,,,
So it will apply the change as expected and null out lines 1 and 3, but it just stops processing at that line. I've been pulling my hair out trying to figure out what is going on here, with no luck.
As always I greatly appreciate any help you are willing to give.
Just print each line returned by csv.reader to understand what is the problem:
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z,quotechar='\'')
>>> for lineList in x:
...     print lineList
...
['', '"sometext"', '343434']
['', '', '"moretext"', '343434']
['', '', '"St', '",uff",4543343\n,,"morestuff",3434354\n']
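For csv.reader, the last 2 lines of the file are just one row: with quotechar set to ', the ' in "St,'" opens a quoted field that is never closed, so everything up to the end of the file is read into that field.
Now, just remove quotechar='\'':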
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z)
>>> for lineList in x:
...     print lineList
...
['', 'sometext', '343434']
['', '', 'moretext', '343434']
['', '', "St,'", 'uff"', '4543343']
['', '', 'morestuff', '3434354']
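The same comparison can be run without a file on disk. This is a minimal sketch, assuming Python 3 and an in-memory io.StringIO holding the four sample lines from the question; the variable names are invented for illustration.

```python
import csv
import io

# The four lines from the question, including the line with the stray quote.
text = (
    ',"sometext",343434\n'
    ',,"moretext",343434\n'
    ',,"St,\'",uff",4543343\n'
    ',,"morestuff",3434354\n'
)

# With ' as the quote character, the ' on line 3 opens a field that never closes.
rows_single_quote = list(csv.reader(io.StringIO(text, newline=""), quotechar="'"))

# With the default quote character ("), every line becomes its own row.
rows_default = list(csv.reader(io.StringIO(text, newline="")))

print(len(rows_single_quote), len(rows_default))
```

Counting the rows is a quick way to notice that the reader has merged lines: three rows come back with the single-quote setting, four with the default.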

Parsing through CSV file to store as dictionary with nested array values. Best approach?

I am trying to take this csv file and parse and store it in a form of a dictionary (sorry if I use the terms incorrectly I am currently learning). The first element is my key and the rest will be values in a form of nested arrays.
targets_value,11.4,10.5,10,10.8,8.3,10.1,10.7,13.1
targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
one,"9.6,6.3,7.9,11.4,5.5",N,"8.4,8.1,8.1,8.4,5.9,5.9",5.4,5.1,"8.1,8.3",N,N
two,"7.0,11.4,7.0","4.8,5.3,7.0,8.1,9.0,6.1,4.6,5.0,4.6","6.3,5.9,5.9",N,"4.3,4.8",N,N,N
three,"6.0,9.7,11.4,6.8",N,"11.8,6.3,5.9,5.9,9.5","5.4,8.4","5.1,5.1,4.3,4.8,5.1",N,N,11.8
four,"9.7,11.4,11.4,11.4",4.6,"6.2,7.9,5.9,5.9,6.3","5.6,5.5","4.8,4.8,8.3,5.1,4.3",N,7.9,N
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
six,"5.7,11.4,9.7,5.5,9.7,9.7","4.4,7.0,7.7,7.5,6.9,4.9,4.6,4.9,4.6","7.9,5.9,5.9,5.9,5.9,6.3",6.7,"5.1,4.8",N,7.9,N
seven,"6.3,11.4","5.2,4.7","6.3,6.0",N,"8.3,4.3,4.8,4.3,5.1","9.8,9.5",N,8.4
eight,"11.4,11.4,5.9","4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9","6.3,6.3,5.9,5.9,6.6,6.6","5.3,5.2,7.0","8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1","9.2,7.4","9.4,9.3,7.9",N
nine,"9.7,9.7,11.4,9.7","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",N,"4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
ten,"9.7,9.7,9.7,11.4,7.9","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",5.7,"4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
YPL250C_Icy2,"11.4,6.1,11.4",N,"6.3,6.0,6.6,7.0,10.0,6.5,9.5,7.0,10.0",7.1,"4.3,4.3",9.2,"10.7,9.5",N
,,,,,,,,
,,,,,,,,
The issue is that in each line, some cells are quoted because they hold multiple values, while others hold a single unquoted entry. Cells with no value were filled with an N. So each line is a mixture of quoted and unquoted cells, and of numbers and non-numbers.
I want the output to look something like this:
{'eight': ['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0', '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', '9.2,7.4', '9.4,9.3,7.9', 'N'],
'ten': ['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N'],
'nine': ['9.7,9.7,11.4,9.7', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', 'N', '4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N']
}
I wrote a script to clean it and store it, but was not sure if my script was "too long for no reason". Any tips?
import re
motif_dict = {}
with open(filename, "r") as file:
    data = file.readlines()
    for line in data:
        if ',,,,,,,,' in line:
            continue
        else:
            quoted_holder = re.findall(r'"(\d.*?\d)"' , line)
            #reverses the order of the elements contained in the array
            quoted_holder = quoted_holder[::-1]
            new_line = re.sub(r'"\d.*?\d"', 'h', line).split(',')
            for position,element in enumerate(new_line):
                if element == 'h':
                    new_line[position] = quoted_holder.pop()
            motif_dict[new_line[0]] = new_line[1:]
There's a csv module which makes working with csv files much easier. In your case, your code becomes
import csv
with open("motif.csv","r",newline="") as fp:
    reader = csv.reader(fp)
    data = {row[0]: row[1:] for row in reader if row and row[0]}
where the if row and row[0] lets us skip rows which are empty or have an empty first element. This produces (newlines added)
>>> data["eight"]
['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9',
'6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0',
'8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1',
'9.2,7.4', '9.4,9.3,7.9', 'N']
>>> data["ten"]
['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5',
'6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8',
'8.0', '8.6', 'N']
In practice, for processing, I think you'd want to replace 'N' with None or some other object as a missing marker, and make every value a list of floats (even if it's only got one element), but that's up to you.
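One possible way to do that conversion is sketched below. It assumes Python 3, a hypothetical helper named parse_cell, and a two-line in-memory sample in place of the real file. The header rows (targets_value, targets) would need separate handling, since the gene names in the targets row are not numbers.

```python
import csv
import io


def parse_cell(cell):
    """Return None for the 'N' marker, otherwise a list of floats."""
    if cell == "N":
        return None
    return [float(value) for value in cell.split(",")]


# Small sample in the same shape as the data rows of the question's file.
sample = io.StringIO(
    'five,7.9,N,"8.1,8.4"\n'
    ',,,\n',
    newline="",
)

data = {
    row[0]: [parse_cell(cell) for cell in row[1:]]
    for row in csv.reader(sample)
    if row and row[0]  # skip empty rows and rows without a key
}

print(data)
```

Every non-missing cell becomes a list, even when it holds a single number, so later code does not have to distinguish the two cases.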

Download CSV directly into Python CSV parser

I'm trying to download CSV content from morningstar and then parse its contents. If I inject the HTTP content directly into Python's CSV parser, the result is not formatted correctly. Yet, if I save the HTTP content to a file (/tmp/tmp.csv), and then import the file in the python's csv parser the result is correct. In other words, why does:
import csv
import httplib2
def finDownload(code,report):
    h = httplib2.Http('.cache')
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?t=' + code + '&region=AUS&culture=en_us&reportType='+ report + '&period=12&dataType=A&order=asc&columnYear=5&rounding=1&view=raw&productCode=usa&denominatorView=raw&number=1'
    headers, data = h.request(url)
    return data
balancesheet = csv.reader(finDownload('FGE','is'))
for row in balancesheet:
    print row
return:
['F']
['o']
['r']
['g']
['e']
[' ']
['G']
['r']
['o']
['u']
(etc...)
instead of:
['Forge Group Limited (FGE) Income Statement']
?
The problem results from the fact that iteration over a file is done line-by-line whereas iteration over a string is done character-by-character.
You want StringIO/cStringIO (Python 2) or io.StringIO (Python 3, thanks to John Machin for pointing me to it) so a string can be treated as a file-like object:
Python 2:
mystring = 'a,"b\nb",c\n1,2,3'
import cStringIO
csvio = cStringIO.StringIO(mystring)
mycsv = csv.reader(csvio)
Python 3:
mystring = 'a,"b\nb",c\n1,2,3'
import io
csvio = io.StringIO(mystring, newline="")
mycsv = csv.reader(csvio)
Both will correctly preserve newlines inside quoted fields:
>>> for row in mycsv: print(row)
...
['a', 'b\nb', 'c']
['1', '2', '3']
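The difference between the two kinds of iteration can be seen side by side. This is a small sketch, assuming Python 3 and a made-up two-line string in place of the downloaded data; by_char and by_line are illustrative names.

```python
import csv
import io

# Stand-in for the downloaded text (assumption: already decoded to str).
data = 'Forge Group,2012\n"Revenue, total",100\n'

# csv.reader iterates over its argument; a plain string yields one character at a time.
by_char = list(csv.reader(data))

# Wrapped in StringIO, the same text is iterated line by line.
by_line = list(csv.reader(io.StringIO(data, newline="")))

print(by_char[:5])
print(by_line)
```

Under Python 3 an HTTP library typically returns bytes, so the response body would need to be decoded before being wrapped in io.StringIO.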

Sorting CSV in Python

I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.
How would one go about sorting a relatively large CSV file (tens of thousands of lines) on multiple fields, in order?
Python code samples would be appreciated.
Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:
import csv
import operator
def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
    # read the whole file into memory
    with open(csvfilename, 'rb') as f:
        readit = csv.reader(f)
        thedata = list(readit)
    # sort on the given columns, in order
    thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
    # overwrite the file with the sorted rows
    with open(csvfilename, 'wb') as f:
        writeit = csv.writer(f)
        writeit.writerows(thedata)
Here's Alex's answer, reworked to support column data types:
import csv
import operator
def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types: data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
    data = []
    with open(csv_filename, 'rb') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'wb') as f:
        csv.writer(f).writerows(data)
Edit:
I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.
Here's my implementation, though John Machin's is nicer:
def convert(types, values):
    return [t(v) for t, v in zip(types, values)]
Usage:
import datetime
def date(s):
    return datetime.datetime.strptime(s, '%m/%d/%y')
>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
Here's the convert() that's missing from Robert's fix of Alex's answer:
>>> def convert(convert_funcs, seq):
...     return [
...         item if func is None else func(item)
...         for func, item in zip(convert_funcs, seq)
...         ]
...
>>> convert(
...     (None, float, lambda x: x.strip().lower()),
...     [" text ", "123.45", " TEXT "]
...     )
[' text ', 123.45, 'text']
>>>
I've changed the name of the first arg to highlight that the per-column function can do whatever you need, not merely type coercion. None is used to indicate no conversion.
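Putting the pieces together, a sketch of the convert-then-sort idea might look like the following. It assumes Python 3 and in-memory io.StringIO objects instead of real files; the function name sort_rows and the sample rows are invented for illustration.

```python
import csv
import io
import operator


def sort_rows(rows, convert_funcs, sort_key_columns):
    """Convert each column (None means leave as is), then sort on the given columns."""
    converted = [
        [item if func is None else func(item)
         for func, item in zip(convert_funcs, row)]
        for row in rows
    ]
    converted.sort(key=operator.itemgetter(*sort_key_columns))
    return converted


# Sample input: sort by column 0 (text), then column 1 (as a number).
source = io.StringIO("b,10,x\na,9,y\na,10,z\n", newline="")
rows = sort_rows(csv.reader(source), (None, int, None), (0, 1))

# Write the sorted rows back out in CSV form.
out = io.StringIO(newline="")
csv.writer(out).writerows(rows)

print(rows)
```

The int conversion matters here: compared as strings, '10' sorts before '9', so the two rows starting with a would come out in the opposite order.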
You bring up 3 issues: file size, csv data, and sorting on multiple fields.
Here is a solution for the third part. You can handle csv data in a more sophisticated way.
>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
...     field_order = [2,1]
...     return [e[i] for i in field_order]
...
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
Edited to use a list comprehension; a generator does not work as I had expected it to.
