Here is my problem. I need to parse a comma-separated file, and my code works the way I would like. However, while testing it and trying to break things, I've come across a problem.
Here is the example code:
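import csv

compareList = ["testfield1", "testfield2", "testfield3", "testfield4"]
z = open("testFile", 'r')
# Output file; the name is an assumption, the original snippet did not show how testFile was opened
testFile = open("testFileOut", 'w')
x = csv.reader(z, quotechar='\'')
testDic = {}
iter = 0
for lineList in x:
    try:
        # Copy one field per expected column; a short line raises IndexError
        for item in compareList:
            testDic[item] = lineList[iter]
            iter += 1
        iter = 0
    except IndexError:
        # Line is too short: replace it with empty fields
        iter = 0
        lineList = []
        for item in compareList:
            lineList.append("")
            testDic[item] = lineList[iter]
            iter += 1
        iter = 0
    for item in compareList:
        testFile.write(testDic[item])
        if compareList.index(item) != len(compareList) - 1:
            testFile.write(",")
    testFile.write('\n')
testFile.close()
z.close()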
So what this is supposed to do is check that each line of the csv file matches the length of a list. If the length of the line does not match the length of the list, the line is converted to null values (commas) matching the length of compareList.
Here is an example of what is in the file:
,,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The code works just fine if a line is missing an item. So the output of a file containing:
,"sometext",343434
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
will look like this:
,,,,
,,"moretext",343434
,,"stuff",4543343
,,"morestuff",3434354
The problem I have induced is when the line looks something like this:
,"sometext",343434
,,"moretext",343434
,,"St,'",uff",4543343
,,"morestuff",3434354
The output of this file will be:
,,,,
,,"moretext",343434
,,,,
So it will apply the change as expected and null out lines 1 and 3, but it just stops processing at that line. I've been pulling my hair out trying to figure out what is going on here, with no luck.
As always I greatly appreciate any help you are willing to give.
Just print each row returned by csv.reader to see what the problem is:
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z,quotechar='\'')
>>> for lineList in x:
...     print lineList
...
['', '"sometext"', '343434']
['', '', '"moretext"', '343434']
['', '', '"St', '",uff",4543343\n,,"morestuff",3434354\n']
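The last 2 lines of the file are just one row for csv.reader. With quotechar='\'', the single quote in "St,'" opens a quoted field that is never closed, so everything up to the end of the file is read as part of that field.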
Now, just remove quotechar='\''
>>> import csv
>>> z=open("testFile",'r')
>>> x=csv.reader(z)
>>> for lineList in x:
...     print lineList
...
['', 'sometext', '343434']
['', '', 'moretext', '343434']
['', '', "St,'", 'uff"', '4543343']
['', '', 'morestuff', '3434354']
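For anyone on Python 3, the comparison above can be reproduced without a file on disk. This is only a sketch: the helper names read_rows and blank_bad_rows are made up for illustration, and the sample text is the file from the question held in a string. It also shows the length check from the question done with len() instead of catching IndexError, which blanks rows that are too long as well as rows that are too short.

```python
import csv
import io

# The problem file from the question, kept in memory for the demonstration
SAMPLE = (
    ',"sometext",343434\n'
    ',,"moretext",343434\n'
    ',,"St,\'",uff",4543343\n'
    ',,"morestuff",3434354\n'
)

def read_rows(text, **reader_kwargs):
    """Parse CSV text and return all rows as a list (hypothetical helper)."""
    return list(csv.reader(io.StringIO(text), **reader_kwargs))

def blank_bad_rows(rows, width):
    """Replace every row whose length differs from width with empty fields."""
    return [row if len(row) == width else [""] * width for row in rows]

# With quotechar="'" the third line opens a quoted field that never closes,
# so the reader folds the rest of the input into one row.
rows_single = read_rows(SAMPLE, quotechar="'")

# With the default quotechar ('"') every physical line becomes its own row.
rows_default = read_rows(SAMPLE)

cleaned = blank_bad_rows(rows_default, 4)
for row in cleaned:
    print(",".join(row))
```

Note that blank_bad_rows writes three commas for four empty fields; adjust the join if a trailing comma is wanted.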
I am trying to parse this csv file and store it as a dictionary (sorry if I use the terms incorrectly, I am still learning). The first element of each line is my key, and the rest are the values, stored as nested arrays.
targets_value,11.4,10.5,10,10.8,8.3,10.1,10.7,13.1
targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
one,"9.6,6.3,7.9,11.4,5.5",N,"8.4,8.1,8.1,8.4,5.9,5.9",5.4,5.1,"8.1,8.3",N,N
two,"7.0,11.4,7.0","4.8,5.3,7.0,8.1,9.0,6.1,4.6,5.0,4.6","6.3,5.9,5.9",N,"4.3,4.8",N,N,N
three,"6.0,9.7,11.4,6.8",N,"11.8,6.3,5.9,5.9,9.5","5.4,8.4","5.1,5.1,4.3,4.8,5.1",N,N,11.8
four,"9.7,11.4,11.4,11.4",4.6,"6.2,7.9,5.9,5.9,6.3","5.6,5.5","4.8,4.8,8.3,5.1,4.3",N,7.9,N
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
six,"5.7,11.4,9.7,5.5,9.7,9.7","4.4,7.0,7.7,7.5,6.9,4.9,4.6,4.9,4.6","7.9,5.9,5.9,5.9,5.9,6.3",6.7,"5.1,4.8",N,7.9,N
seven,"6.3,11.4","5.2,4.7","6.3,6.0",N,"8.3,4.3,4.8,4.3,5.1","9.8,9.5",N,8.4
eight,"11.4,11.4,5.9","4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9","6.3,6.3,5.9,5.9,6.6,6.6","5.3,5.2,7.0","8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1","9.2,7.4","9.4,9.3,7.9",N
nine,"9.7,9.7,11.4,9.7","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",N,"4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
ten,"9.7,9.7,9.7,11.4,7.9","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",5.7,"4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
YPL250C_Icy2,"11.4,6.1,11.4",N,"6.3,6.0,6.6,7.0,10.0,6.5,9.5,7.0,10.0",7.1,"4.3,4.3",9.2,"10.7,9.5",N
,,,,,,,,
,,,,,,,,
The issue is that in each line some columns are quoted because they hold multiple values per cell, while others hold a single unquoted entry. Cells that had no value were filled with an N. So each line is a mixture of quoted and unquoted cells, and of numbers and non-numbers.
I want the output to look something like this:
{'eight': ['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0', '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', '9.2,7.4', '9.4,9.3,7.9', 'N'],
'ten': ['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N'],
'nine': ['9.7,9.7,11.4,9.7', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', 'N', '4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N']
}
I wrote a script to clean it and store it, but was not sure if my script was "too long for no reason". Any tips?
import re

motif_dict = {}
with open(filename, "r") as file:
    data = file.readlines()
    for line in data:
        if ',,,,,,,,' in line:
            continue
        else:
            # strip the trailing newline so it does not end up in the last cell
            line = line.rstrip('\n')
            quoted_holder = re.findall(r'"(\d.*?\d)"', line)
            # reverse the list so that pop() returns the quoted cells in their original order
            quoted_holder = quoted_holder[::-1]
            new_line = re.sub(r'"\d.*?\d"', 'h', line).split(',')
            for position, element in enumerate(new_line):
                if element == 'h':
                    new_line[position] = quoted_holder.pop()
            motif_dict[new_line[0]] = new_line[1:]
There's a csv module which makes working with csv files much easier. In your case, your code becomes
import csv

with open("motif.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    data = {row[0]: row[1:] for row in reader if row and row[0]}
where the if row and row[0] lets us skip rows which are empty or have an empty first element. This produces (newlines added)
>>> data["eight"]
['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9',
'6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0',
'8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1',
'9.2,7.4', '9.4,9.3,7.9', 'N']
>>> data["ten"]
['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5',
'6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8',
'8.0', '8.6', 'N']
In practice, for processing, I think you'd want to replace 'N' with None or some other object as a missing marker, and make every value a list of floats (even if it's only got one element), but that's up to you.
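A rough sketch of that last suggestion, assuming 'N' (or an empty cell) means "missing" and every other cell is a comma-separated run of numbers. The names parse_cell and load_motifs and the text_rows parameter are invented for this example; text_rows is there because the targets row in the sample holds names rather than numbers.

```python
import csv
import io

def parse_cell(cell):
    """Return None for a missing cell, otherwise a list of floats."""
    if cell in ("N", ""):
        return None
    return [float(part) for part in cell.split(",")]

def load_motifs(fp, text_rows=("targets",)):
    """Build {first column: parsed remaining columns} from an open csv file."""
    data = {}
    for row in csv.reader(fp):
        # Skip blank rows and rows with an empty first cell
        if not row or not row[0]:
            continue
        key, cells = row[0], row[1:]
        if key in text_rows:
            # Rows listed in text_rows keep their cells as plain strings
            data[key] = cells
        else:
            data[key] = [parse_cell(cell) for cell in cells]
    return data

# Small excerpt in the same shape as the file from the question
sample = 'targets,Cbf1,Sfp1\none,"9.6,6.3",N\nfive,7.9,N\n,,\n'
motifs = load_motifs(io.StringIO(sample))
print(motifs["one"])
```

With a real file, pass the object returned by open("motif.csv", "r", newline="") instead of the StringIO.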
I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.
How would one go about sorting a relatively large CSV file (tens of thousands of lines) on multiple fields, in order?
Python code samples would be appreciated.
Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:
import csv
import operator

def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
    # 'rb'/'wb' are the Python 2 modes for csv files;
    # on Python 3, open in text mode with newline='' instead
    with open(csvfilename, 'rb') as f:
        readit = csv.reader(f)
        thedata = list(readit)
    thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
    with open(csvfilename, 'wb') as f:
        writeit = csv.writer(f)
        writeit.writerows(thedata)
Here's Alex's answer, reworked to support column data types:
import csv
import operator

def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types: data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
    data = []
    with open(csv_filename, 'rb') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'wb') as f:
        csv.writer(f).writerows(data)
Edit:
I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.
Here's my implementation, though John Machin's is nicer:
def convert(types, values):
    return [t(v) for t, v in zip(types, values)]
Usage:
import datetime

def date(s):
    return datetime.datetime.strptime(s, '%m/%d/%y')

>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
Here's the convert() that's missing from Robert's fix of Alex's answer:
>>> def convert(convert_funcs, seq):
...     return [
...         item if func is None else func(item)
...         for func, item in zip(convert_funcs, seq)
...     ]
...
>>> convert(
...     (None, float, lambda x: x.strip().lower()),
...     [" text ", "123.45", " TEXT "]
... )
[' text ', 123.45, 'text']
>>>
I've changed the name of the first arg to highlight that the per-column functions can do whatever you need, not merely type coercion. None is used to indicate no conversion.
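Putting the pieces from these answers together for Python 3 might look roughly like this. It is a sketch rather than a drop-in replacement: files are opened in text mode with newline="" instead of 'rb'/'wb', the function returns the sorted rows purely for convenience, and the temporary file and its contents are made up for the demonstration.

```python
import csv
import operator
import os
import tempfile

def convert(convert_funcs, seq):
    """Apply one function per column; None leaves the value unchanged."""
    return [
        item if func is None else func(item)
        for func, item in zip(convert_funcs, seq)
    ]

def sort_csv(csv_filename, convert_funcs, sort_key_columns):
    """Sort a csv file in place on the given columns and return the rows."""
    with open(csv_filename, "r", newline="") as f:
        data = [convert(convert_funcs, row) for row in csv.reader(f)]
    # itemgetter(0, 1) builds a (col0, col1) tuple, so rows are ordered
    # by the first key column and ties are broken by the next one
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, "w", newline="") as f:
        csv.writer(f).writerows(data)
    return data

# Demonstration on a throwaway file (contents invented for this example)
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    f.write("b,10,x\na,2,y\na,10,z\n")

sorted_rows = sort_csv(demo_path, (None, int, None), (0, 1))
os.remove(demo_path)
print(sorted_rows)
```

Converting the second column with int is what makes 2 sort before 10; compared as strings, '10' would come first.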
You bring up 3 issues:
1. file size
2. csv data
3. sorting on multiple fields
Here is a solution for the third part. You can handle csv data in a more sophisticated way.
>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
...     field_order = [2,1]
...     return [e[i] for i in field_order]
...
...
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
Edited to use a list comprehension; a generator did not work as I had expected it to.
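The same idea can be combined with the csv module instead of splitting on commas by hand. In this sketch the key function returns a tuple, which compares element by element just like the list above; make_key is a made-up helper name, and the data string is the one from the example.

```python
import csv
import io

def make_key(field_order):
    """Return a sort key that compares rows on the given columns, in order."""
    def key(row):
        # A tuple (not a generator) so that keys can be compared with each other
        return tuple(row[i] for i in field_order)
    return key

data = "a,b,c\nb,b,a\nb,c,a\n"
rows = list(csv.reader(io.StringIO(data)))

# Sort on the third column first, then on the second
result = sorted(rows, key=make_key([2, 1]))
print(result)
```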