Sorting CSV in Python
I assumed sorting a CSV file on multiple text/numeric fields using Python would be a problem that was already solved. But I can't find any example code anywhere, except for specific code focusing on sorting date fields.
How would one go about sorting a relatively large CSV file (tens of thousand lines) on multiple fields, in order?
Python code samples would be appreciated.
Python's sort works in-memory only; however, tens of thousands of lines should fit in memory easily on a modern machine. So:
import csv
import operator

def sortcsvbymanyfields(csvfilename, themanyfieldscolumnnumbers):
    with open(csvfilename, 'rb') as f:
        readit = csv.reader(f)
        thedata = list(readit)
    thedata.sort(key=operator.itemgetter(*themanyfieldscolumnnumbers))
    with open(csvfilename, 'wb') as f:
        writeit = csv.writer(f)
        writeit.writerows(thedata)
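Note the 'rb'/'wb' modes are the Python 2 csv idiom; on Python 3 you open in text mode with newline='' instead. A runnable sketch of the same approach (the helper name and sample file are mine, not from the answer):

```python
import csv
import operator
import os
import tempfile

def sort_csv_py3(csv_filename, key_columns):
    # Same in-memory sort-and-rewrite, but with Python 3 file handling:
    # text mode with newline='' instead of Python 2's 'rb'/'wb'.
    with open(csv_filename, newline='') as f:
        rows = list(csv.reader(f))
    rows.sort(key=operator.itemgetter(*key_columns))
    with open(csv_filename, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w', newline='') as f:
    f.write('b,2\na,3\na,1\n')
sort_csv_py3(path, (0, 1))
print(open(path).read())  # rows now ordered a,1 / a,3 / b,2
```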
Here's Alex's answer, reworked to support column data types:
import csv
import operator

def sort_csv(csv_filename, types, sort_key_columns):
    """sort (and rewrite) a csv file.
    types: data types (conversion functions) for each column in the file
    sort_key_columns: column numbers of columns to sort by"""
    data = []
    with open(csv_filename, 'rb') as f:
        for row in csv.reader(f):
            data.append(convert(types, row))
    data.sort(key=operator.itemgetter(*sort_key_columns))
    with open(csv_filename, 'wb') as f:
        csv.writer(f).writerows(data)
Edit:
I did a stupid. I was playing with various things in IDLE and wrote a convert function a couple of days ago. I forgot I'd written it, and I haven't closed IDLE in a good long while - so when I wrote the above, I thought convert was a built-in function. Sadly no.
Here's my implementation, though John Machin's is nicer:
def convert(types, values):
    return [t(v) for t, v in zip(types, values)]
Usage:
import datetime

def date(s):
    # with "import datetime", strptime lives on datetime.datetime
    return datetime.datetime.strptime(s, '%m/%d/%y')

>>> convert((int, date, str), ('1', '2/15/09', 'z'))
[1, datetime.datetime(2009, 2, 15, 0, 0), 'z']
Here's the convert() that's missing from Robert's fix of Alex's answer:
>>> def convert(convert_funcs, seq):
... return [
... item if func is None else func(item)
... for func, item in zip(convert_funcs, seq)
... ]
...
>>> convert(
... (None, float, lambda x: x.strip().lower()),
... [" text ", "123.45", " TEXT "]
... )
[' text ', 123.45, 'text']
>>>
I've changed the name of the first arg to highlight that the per-column functions can do whatever you need, not merely type coercion. None is used to indicate "no conversion".
You bring up 3 issues:
file size
csv data
sorting on multiple fields
Here is a solution for the third part. You can handle csv data in a more sophisticated way.
>>> data = 'a,b,c\nb,b,a\nb,c,a\n'
>>> lines = [e.split(',') for e in data.strip().split('\n')]
>>> lines
[['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
>>> def f(e):
... field_order = [2,1]
... return [e[i] for i in field_order]
...
>>> sorted(lines, key=f)
[['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
Edited to use a list comprehension; a generator did not work as I had expected it to.
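The same two-field key can also be built with operator.itemgetter instead of a hand-written key function (a small sketch, not part of the original answer):

```python
import operator

lines = [['a', 'b', 'c'], ['b', 'b', 'a'], ['b', 'c', 'a']]
# itemgetter(2, 1) returns the tuple (row[2], row[1]) as the sort key,
# matching field_order = [2, 1] above.
result = sorted(lines, key=operator.itemgetter(2, 1))
print(result)  # [['b', 'b', 'a'], ['b', 'c', 'a'], ['a', 'b', 'c']]
```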
Related
.split() returning empty results
I am trying to split a list that I have converted with str(), but I don't seem to be returning any results? My code is as follows:

import csv

def csv_read(file_obj):
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        unique_id.append(line["LUSERFLD4"])
        total_amt.append(line["LAMOUNT1"])
        luserfld10.append(line["LUSERFLD10"])
        break
    bal_fwd, junk, sdc, junk2, est_read = str(luserfld10).split(' ')

if __name__ == "__main__":
    with open("UT_0004A493.csv") as f_obj:
        csv_read(f_obj)

print(luserfld10)
print(bal_fwd)
print(sdc)
print(est_read)

print(luserfld10) returns ['N | N | Y'] which is correct. (Due to system limitations when creating the csv file, this field holds three separate values.)

All variables have been defined and I'm not getting any errors, but my last three print commands are returning empty lists. I've tried dedenting the .split() line, but then I can unpack only one value. How do I get them to each return N or Y? Why isn't it working as it is? I'm sure it's obvious, but this is my first week of coding and I haven't been able to find the answer anywhere here. Any help (with explanations please) would be appreciated :)

Edit: all defined variables are as follows:

luserfld10 = []
bal_fwd = []
sdc = []
est_read = []

etc.

File contents I'm not certain how to show? I hope this is okay?

LACCNBR,LAMOUNT1,LUSERFLD4,LUSERFLD5,LUSERFLD6,LUSERFLD8,LUSERFLD9,LUSERFLD10
1290,-12847.28,VAAA0022179,84889.363,Off Peak - nil,5524.11,,N | N | N
2540255724,12847.28,VAAA0022179,84889.363,Off Peak - nil,5524.11,,N | N | N
If luserfld10 is ['N | N | Y'] then:

luserfld10[0].replace('|', '').split()

Result: ['N', 'N', 'Y']
Even if you fix the .split stuff in

bal_fwd, junk, sdc, junk2, est_read = str(luserfld10).split(' ')

it won't do what you want, because it's assigning the results of the split to local names bal_fwd, sdc, etc. that only exist inside the csv_read function, not to the names you defined outside the function in the global scope. You could use global statements to tell Python to assign those values to the global names, but it's generally best to avoid using the global statement unless you really need it. Also, merely using a global statement won't put the string data into your bal_fwd list. Instead, it will bind the global name to your string data and discard the list. If you want to put the string into the list you need to .append it, like you did with unique_id. You don't need global for that, since you aren't performing an assignment, you're just modifying the existing list object.

Here's a repaired version of your code, tested with the data sample you posted.

import csv

unique_id = []
total_amt = []
luserfld10 = []
bal_fwd = []
sdc = []
est_read = []

def csv_read(file_obj):
    for line in csv.DictReader(file_obj, delimiter=','):
        unique_id.append(line["LUSERFLD4"])
        total_amt.append(line["LAMOUNT1"])
        fld10 = line["LUSERFLD10"]
        luserfld10.append(fld10)
        t = fld10.split(' | ')
        bal_fwd.append(t[0])
        sdc.append(t[1])
        est_read.append(t[2])

if __name__ == "__main__":
    with open("UT_0004A493.csv") as f_obj:
        csv_read(f_obj)
    print('id', unique_id)
    print('amt', total_amt)
    print('fld10', luserfld10)
    print('bal', bal_fwd)
    print('sdc', sdc)
    print('est_read', est_read)

output

id ['VAAA0022179', 'VAAA0022179']
amt ['-12847.28', '12847.28']
fld10 ['N | N | N', 'N | N | N']
bal ['N', 'N']
sdc ['N', 'N']
est_read ['N', 'N']

I should mention that using t = fld10.split(' | ') is a bit fragile: if the separator isn't exactly ' | ' then the split will fail.
So if there's a possibility that there might not be exactly one space either side of the pipe (|), then you should use a variation of Jinje's suggestion:

t = fld10.replace('|', ' ').split()

This replaces all pipe chars with spaces, and then splits on runs of white space, so it's guaranteed to split the subfields correctly, assuming there's at least one space or pipe between each subfield (Jinje's original suggestion will fail if both spaces are missing on either side of the pipe).

Breaking your data up into separate lists may not be a great strategy: you have to be careful to keep the lists synchronised, so it's tricky to sort them or to add or remove items. And it's tedious to manipulate all the data as a unit when you have it spread out over half a dozen named lists. One option is to put your data into a dictionary of lists:

import csv
from pprint import pprint

def csv_read(file_obj):
    data = {
        'unique_id': [],
        'total_amt': [],
        'bal_fwd': [],
        'sdc': [],
        'est_read': [],
    }
    for line in csv.DictReader(file_obj, delimiter=','):
        data['unique_id'].append(line["LUSERFLD4"])
        data['total_amt'].append(line["LAMOUNT1"])
        fld10 = line["LUSERFLD10"]
        t = fld10.split(' | ')
        data['bal_fwd'].append(t[0])
        data['sdc'].append(t[1])
        data['est_read'].append(t[2])
    return data

if __name__ == "__main__":
    with open("UT_0004A493.csv") as f_obj:
        data = csv_read(f_obj)
    pprint(data)

output

{'bal_fwd': ['N', 'N'],
 'est_read': ['N', 'N'],
 'sdc': ['N', 'N'],
 'total_amt': ['-12847.28', '12847.28'],
 'unique_id': ['VAAA0022179', 'VAAA0022179']}

Note that csv_read doesn't directly modify any global variables. It creates a dictionary of lists and passes it back to the code that calls it. This makes the code more modular; trying to debug large programs that use globals can become a nightmare, because you have to keep track of every part of the program that modifies those globals.

Alternatively, you can put the data into a list of dictionaries, one per row.
def csv_read(file_obj):
    data = []
    for line in csv.DictReader(file_obj, delimiter=','):
        luserfld10 = line["LUSERFLD10"]
        bal_fwd, sdc, est_read = luserfld10.split(' | ')
        # Put the desired data into a new dictionary
        row = {
            'unique_id': line["LUSERFLD4"],
            'total_amt': line["LAMOUNT1"],
            'bal_fwd': bal_fwd,
            'sdc': sdc,
            'est_read': est_read,
        }
        data.append(row)
    return data

if __name__ == "__main__":
    with open("UT_0004A493.csv") as f_obj:
        data = csv_read(f_obj)
    pprint(data)

output

[{'bal_fwd': 'N',
  'est_read': 'N',
  'sdc': 'N',
  'total_amt': '-12847.28',
  'unique_id': 'VAAA0022179'},
 {'bal_fwd': 'N',
  'est_read': 'N',
  'sdc': 'N',
  'total_amt': '12847.28',
  'unique_id': 'VAAA0022179'}]
Parsing through a CSV file to store as a dictionary with nested array values. Best approach?
I am trying to take this csv file and parse and store it in the form of a dictionary (sorry if I use the terms incorrectly, I am currently learning). The first element is my key and the rest will be values in the form of nested arrays.

targets_value,11.4,10.5,10,10.8,8.3,10.1,10.7,13.1
targets,Cbf1,Sfp1,Ino2,Opi1,Cst6,Stp1,Met31,Ino4
one,"9.6,6.3,7.9,11.4,5.5",N,"8.4,8.1,8.1,8.4,5.9,5.9",5.4,5.1,"8.1,8.3",N,N
two,"7.0,11.4,7.0","4.8,5.3,7.0,8.1,9.0,6.1,4.6,5.0,4.6","6.3,5.9,5.9",N,"4.3,4.8",N,N,N
three,"6.0,9.7,11.4,6.8",N,"11.8,6.3,5.9,5.9,9.5","5.4,8.4","5.1,5.1,4.3,4.8,5.1",N,N,11.8
four,"9.7,11.4,11.4,11.4",4.6,"6.2,7.9,5.9,5.9,6.3","5.6,5.5","4.8,4.8,8.3,5.1,4.3",N,7.9,N
five,7.9,N,"8.1,8.4",N,"4.3,8.3,4.3,4.3",N,N,N
six,"5.7,11.4,9.7,5.5,9.7,9.7","4.4,7.0,7.7,7.5,6.9,4.9,4.6,4.9,4.6","7.9,5.9,5.9,5.9,5.9,6.3",6.7,"5.1,4.8",N,7.9,N
seven,"6.3,11.4","5.2,4.7","6.3,6.0",N,"8.3,4.3,4.8,4.3,5.1","9.8,9.5",N,8.4
eight,"11.4,11.4,5.9","4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9","6.3,6.3,5.9,5.9,6.6,6.6","5.3,5.2,7.0","8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1","9.2,7.4","9.4,9.3,7.9",N
nine,"9.7,9.7,11.4,9.7","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",N,"4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
ten,"9.7,9.7,9.7,11.4,7.9","5.2,4.6,5.5,6.5,4.5,4.6,5.5","6.3,5.9,5.9,9.5,6.5",5.7,"4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8",8.0,8.6,N
YPL250C_Icy2,"11.4,6.1,11.4",N,"6.3,6.0,6.6,7.0,10.0,6.5,9.5,7.0,10.0",7.1,"4.3,4.3",9.2,"10.7,9.5",N
,,,,,,,,
,,,,,,,,

The issue was that in each line, some columns are in quotes because of multiple values per cell, and some have only a single entry and no quotes. Cells that had no value were filled in with an N. So there was a mixture of quotes and non-quotes, and numbers and non-numbers.
Wanted the output to look something like this:

{'eight': ['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0', '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', '9.2,7.4', '9.4,9.3,7.9', 'N'],
 'ten': ['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N'],
 'nine': ['9.7,9.7,11.4,9.7', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', 'N', '4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N']}

I wrote a script to clean it and store it, but was not sure if my script was "too long for no reason". Any tips?

import re

motif_dict = {}
with open(filename, "r") as file:
    data = file.readlines()
    for line in data:
        if ',,,,,,,,' in line:
            continue
        else:
            quoted_holder = re.findall(r'"(\d.*?\d)"', line)
            # reverses the order of the elements contained in the array
            quoted_holder = quoted_holder[::-1]
            new_line = re.sub(r'"\d.*?\d"', 'h', line).split(',')
            for position, element in enumerate(new_line):
                if element == 'h':
                    new_line[position] = quoted_holder.pop()
            motif_dict[new_line[0]] = new_line[1:]
There's a csv module which makes working with csv files much easier. In your case, your code becomes:

import csv

with open("motif.csv", "r", newline="") as fp:
    reader = csv.reader(fp)
    data = {row[0]: row[1:] for row in reader if row and row[0]}

where the if row and row[0] lets us skip rows which are empty or have an empty first element. This produces (newlines added):

>>> data["eight"]
['11.4,11.4,5.9', '4.4,6.3,6.0,5.6,7.6,7.1,5.1,5.3,5.1,4.9', '6.3,6.3,5.9,5.9,6.6,6.6', '5.3,5.2,7.0', '8.3,4.3,4.3,4.8,4.3,4.3,8.3,4.8,8.3,5.1', '9.2,7.4', '9.4,9.3,7.9', 'N']
>>> data["ten"]
['9.7,9.7,9.7,11.4,7.9', '5.2,4.6,5.5,6.5,4.5,4.6,5.5', '6.3,5.9,5.9,9.5,6.5', '5.7', '4.3,4.3,4.3,5.1,8.3,8.3,4.3,4.3,4.3,4.8', '8.0', '8.6', 'N']

In practice, for processing, I think you'd want to replace 'N' with None or some other object as a missing marker, and make every value a list of floats (even if it's only got one element), but that's up to you.
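For instance, that post-processing step might look like this (a sketch of the suggestion, not part of the original answer; it assumes 'N' is the only missing-value marker in the data):

```python
def clean(values):
    # 'N' marks a missing cell; every other cell is a comma-separated
    # run of floats (sometimes just one).
    return [None if v == 'N' else [float(x) for x in v.split(',')]
            for v in values]

print(clean(['11.4,11.4,5.9', 'N', '5.7']))
# [[11.4, 11.4, 5.9], None, [5.7]]
```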
Python - Iterating through a list of list with a specifically formatted output; file output
Sorry to ask such a trivial question, but I can't find the answer anywhere and it's my first day using Python (need it for work). I think my problem is trying to use Python like C. Anyway, here is what I have:

for i in data:
    for j in i:
        print("{}\t".format(j))

Which gives me data in the form of

elem[0][0]
elem[1][0]
elem[2][0]
...
elem[0][1]
elem[1][1]
...

i.e. all at once. What I really want to do is access each element directly, so I can output the list of lists data to a file where the elements are separated by tabs, not commas. Here's my bastardised Python code for outputting the array to a file:

k = 0
with open("Output.txt", "w") as text_file:
    for j in data:
        print("{}".format(data[k]), file=text_file)
        k += 1

So basically, I have a list of lists which I want to save to a file in tab-delimited/separated format, but currently it comes out comma-separated. My approach would involve reiterating through the lists again, element by element, and saving the output by forcing in the tabs.

Here are data excerpts (though changed to meaningless values):

data

['a', 'a', 306518, ' 1111111', 'a', '-', .... ]
['a', 'a', 306518, ' 1111111', 'a', '-', .... ]
....

text_file

a	a	306518	 1111111	a	-....
a	a	306518	 1111111	a	-....
....
for i in data:
    print("\t".join(map(str, i)))

(The str conversion is needed because your rows contain non-string values like 306518.)
If data is something like [[1,2,3],[2,3,4]]:

for j in data:
    text_file.write('%s\n' % '\t'.join(str(x) for x in j))
I think this should work:

with open(somefile, 'w') as your_file:
    for values in data:
        print("\t".join(str(v) for v in values), file=your_file)
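Since the csv module can write as well as read, another option (a sketch; the file name is an assumption) is to let csv.writer insert the tabs, which also stringifies non-string values for you:

```python
import csv

data = [['a', 'a', 306518], ['b', 'b', 306519]]
# csv.writer converts non-string values with str() and inserts the
# delimiter between fields, so no manual join is needed.
with open('Output.txt', 'w', newline='') as f:
    csv.writer(f, delimiter='\t').writerows(data)
```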
How to split a string based on comma as separator with comma within double quotes remaining as it is in python
I want to separate a string based on commas, but when the string is within double quotes the commas should be kept as they are. For that I wrote the following code. However, the code given below does not seem to work. Can someone please help me figure out what the error is?

>>> from csv import reader
>>> l='k,<livesIn> "Dayton,_Ohio"'
>>> l1=[]
>>> l1.append(l)
>>> for line1 in reader(l1):
...     print line1

The output which I am getting is:

['k', '<livesIn> "Dayton', '_Ohio"']

Whereas I want the output as:

['k', '<livesIn> "Dayton,_Ohio"']

i.e. I don't want "Dayton,_Ohio" to get separated.
So here is a way.

>>> from csv import reader
>>> l = 'k,<livesIn> "Dayton,_Ohio"'
>>> l1 = []
>>> l1.append(l)
>>> for line in reader(l1):
...     print list((line[0], ','.join(line[1:])))
...
['k', '<livesIn> "Dayton,_Ohio"']
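If the real input always has exactly one plain field before the quoted part, plain str.split with a maxsplit gives the same result without the csv module (a sketch, assuming that input shape):

```python
l = 'k,<livesIn> "Dayton,_Ohio"'
# maxsplit=1 stops splitting after the first comma, leaving the rest intact
print(l.split(',', 1))  # ['k', '<livesIn> "Dayton,_Ohio"']
```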
Download CSV directly into Python CSV parser
I'm trying to download CSV content from morningstar and then parse its contents. If I inject the HTTP content directly into Python's CSV parser, the result is not formatted correctly. Yet, if I save the HTTP content to a file (/tmp/tmp.csv) and then import the file into Python's csv parser, the result is correct. In other words, why does:

def finDownload(code, report):
    h = httplib2.Http('.cache')
    url = ('http://financials.morningstar.com/ajax/ReportProcess4CSV.html?t=' + code +
           '&region=AUS&culture=en_us&reportType=' + report +
           '&period=12&dataType=A&order=asc&columnYear=5&rounding=1&view=raw' +
           '&productCode=usa&denominatorView=raw&number=1')
    headers, data = h.request(url)
    return data

balancesheet = csv.reader(finDownload('FGE','is'))
for row in balancesheet:
    print row

return:

['F']
['o']
['r']
['g']
['e']
[' ']
['G']
['r']
['o']
['u']
(etc...)

instead of:

['Forge Group Limited (FGE) Income Statement']

?
The problem results from the fact that iteration over a file is done line-by-line, whereas iteration over a string is done character-by-character. You want StringIO/cStringIO (Python 2) or io.StringIO (Python 3, thanks to John Machin for pointing me to it) so a string can be treated as a file-like object:

Python 2:

mystring = 'a,"b\nb",c\n1,2,3'
import cStringIO
csvio = cStringIO.StringIO(mystring)
mycsv = csv.reader(csvio)

Python 3:

mystring = 'a,"b\nb",c\n1,2,3'
import io
csvio = io.StringIO(mystring, newline="")
mycsv = csv.reader(csvio)

Both will correctly preserve newlines inside quoted fields:

>>> for row in mycsv: print(row)
...
['a', 'b\nb', 'c']
['1', '2', '3']
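Putting the pieces together for the download case: wrap the decoded HTTP body in io.StringIO before handing it to csv.reader. A sketch with inline sample data standing in for the network call (httplib2 returns bytes, so a .decode() would be needed on real responses):

```python
import csv
import io

def parse_csv_text(text):
    # io.StringIO makes the string iterate line-by-line like a file,
    # so csv.reader sees rows instead of single characters.
    return list(csv.reader(io.StringIO(text, newline='')))

sample = 'Forge Group Limited (FGE) Income Statement\na,1\nb,2\n'
rows = parse_csv_text(sample)
print(rows[0])  # ['Forge Group Limited (FGE) Income Statement']
```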