Error in forming dictionary from a csv file in python

I have a csv file whose structure is like this:
Year-Sem,Course,Studentid,Score
201001,CS301,100,363
201001,CS301,101,283
201001,CS301,102,332
201001,CS301,103,254
201002,CS302,101,466
201002,CS302,102,500
Here each year is divided into two semesters - 01 (for fall) and 02 (for spring) - and the data covers the years 2008 through 2014 (for a total of 14 semesters). What I want to do is form a dictionary where course and studentid together become the key and their respective scores, ordered by year-sem, become the values. So the output should look something like this for each student:
[(studentid,course):(year-sem1 score,year-sem2 score,...)]
I first tried to make a dictionary of [(studentid,course):(score)] using this code, but I get IndexError: list index out of range:
with open('file1.csv', mode='rU') as infile:
    reader = csv.reader(infile, dialect=csv.excel_tab)
    with open('file2.csv', mode='w') as outfile:
        writer = csv.writer(outfile)
        mydict = {(rows[2], rows[1]): rows[3] for rows in reader}
        writer.writerows(mydict)
When I was not using dialect=csv.excel_tab and rU, I was getting the error _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?.
How can I resolve this error and form a dictionary with the structure [(studentid,course):(year-sem1 score,year-sem2 score,...)] mentioned above?

The dialect you've chosen seems to be wrong. csv.excel_tab uses the tab character as the delimiter. For your comma-separated data, the default dialect should work.
You got the error message about newlines earlier because the U was missing from the rU mode.
import csv

with open("test.csv", "rU") as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)
This example seems to work for me (Python 3).
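Putting that together with the dictionary from the question, a minimal sketch (assuming the comma-separated layout shown above, with the header row present) would be:
import csv

with open("file1.csv", "rU") as infile:
    reader = csv.reader(infile)  # default dialect: comma-delimited
    next(reader)  # skip the Year-Sem,Course,Studentid,Score header
    # same comprehension as in the question, now with correctly split rows
    mydict = {(rows[2], rows[1]): rows[3] for rows in reader}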

If you have repeating keys you need to store the values in some container, and if you want the data ordered you will need to use an OrderedDict:
import csv
from collections import OrderedDict

with open("in.csv") as infile, open('file2.csv', mode='w') as outfile:
    d = OrderedDict()
    reader, writer = csv.reader(infile), csv.writer(outfile)
    header = next(reader)  # skip header
    # choose whatever column names you want
    writer.writerow(["id-crse", "score"])
    # unpack the values from each row
    for yr, cre, stid, scr in reader:
        # use id and course as keys and append scores
        d.setdefault("{} {}".format(stid, cre), []).append(scr)
    # iterate over the dict keys and values and write each new row
    for k, v in d.items():
        writer.writerow([k] + v)
Which will give you something like:
id-crse,score
100 CS301,363
101 CS301,283
102 CS301,332
103 CS301,254
101 CS302,466
102 CS302,500
In your own code you would only store the last value for each key. You also only write the keys with writer.writerows(mydict), since iterating over a dict yields just its keys, not its keys and values. If the data is not all in chronological order, you will have to sort the reader object with operator.itemgetter, keying on the Year-Sem column:
import operator

for yr, cre, stid, scr in sorted(reader, key=operator.itemgetter(0)):
    ...
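To end up with the exact structure from the question - (studentid, course) tuples mapping to score lists in year-sem order - a minimal sketch along the same lines (assuming the in.csv layout above) could be:
import csv
import operator
from collections import OrderedDict

d = OrderedDict()
with open("in.csv") as infile:
    reader = csv.reader(infile)
    next(reader)  # skip header
    # sort on the Year-Sem column so scores are appended chronologically
    for yr, cre, stid, scr in sorted(reader, key=operator.itemgetter(0)):
        d.setdefault((stid, cre), []).append(scr)
# d now maps (studentid, course) -> [score, score, ...] in year-sem order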

Related

Format csv data and write each row to a json

I'm trying to write each row of a csv to a json (this will then be posted and looped back through, so overwriting the json file is not a big deal here). I have code which seems to do this well enough, but I also need some of the data to be floats/integers rather than strings.
I have a method which works for this in other places, but cannot manage to get the two to agree with each other.
Could anyone point me in the right direction to be able to format the csv data before sending it out as a json? Below is the code for when headers are left in, though I also have a tweaked version which just has raw data in the csv and uses fieldnames for the headers instead.
import csv
import json

input_file = 'Test3.csv'
output_file_template = 'Test.json'

with open(input_file, 'r', encoding='utf8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter=',')
    rows = list(reader)

for i in range(len(rows)):
    out = json.dumps(rows[i:i + 1])
    with open(output_file_template.format(i), 'w') as f:
        f.write(out)
Data is in a format like this:
OrderType OrderStatus OrderDateTime SettlementDate MarketId OrderRoute
Sale Executed 18/11/2016 23/11/2016 1 None
Sale Executed 18/11/2016 23/11/2016 1 None
Sale Executed 18/11/2016 23/11/2016 1 None
Indexing with row[4] is what produces the key error.
In your loop, if the float/int data is consistently in the same spot, you can simply cast the values:
for i, row in enumerate(rows):
    row[0] = int(row[0])    # this column stores ints
    row[1] = float(row[1])  # this column stores floats
    out = json.dumps([row])
    with open(output_file_template.format(i), 'w') as f:
        f.write(out)
I don't know if columns 0 and 1 hold ints and floats, but you can change that as necessary.
Update:
It appears row is an OrderedDict (csv.DictReader yields mappings rather than lists), so you'll just need to use the key instead of an index:
row['MarketId'] = int(row['MarketId'])
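A minimal end-to-end sketch of that fix, assuming MarketId is the only column that needs casting and the same layout as the sample data:
import csv
import json

with open('Test3.csv', 'r', encoding='utf8') as csvfile:
    rows = list(csv.DictReader(csvfile))

for i, row in enumerate(rows):
    # DictReader rows are mappings, so cast by column name, not position
    row['MarketId'] = int(row['MarketId'])
    out = json.dumps([row])
    with open('Test.json', 'w') as f:  # overwritten each pass, as in the question
        f.write(out)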

python3 csv with duplicate keys + python defaultdict

I have a csv file which has a lot of serial numbers and material numbers, for example as shown below (I need only the first two columns, i.e. serial and chassis; the rest is not required).
serial chassis type date
ZX34215 Test XX YY
ZX34215 final-001 XX YY
AB30000 Used XX YY
ZX34215 final-002 XX YY
I have the below snippet, which gets all the serial and material numbers into a dictionary, but duplicate keys are eliminated and only the latest value for each serial key is captured.
Working code
import sys
import csv

with open('file1.csv', mode='r') as infile:
    reader = csv.reader(infile)
    mydict1 = {rows[0]: rows[1] for rows in reader}
print(mydict1)
I need to capture the duplicate keys with their respective values as well, but this failed. I used a Python defaultdict, and it looks like I missed something here.
Not working
import csv
from collections import defaultdict

with open('file1.csv', mode='r') as infile:
    data = defaultdict(dict)
    reader = csv.reader(infile)
    list_res = list(reader)
    for row in reader:
        result = data[row[0]].append(row[1])
print(result)
Can someone show me how to capture the duplicate keys and their values in the dictionary?
You need to pass list to your defaultdict, not dict:
data=defaultdict(list)
Also, you don't need to convert the reader object to a list in order to iterate over it; in fact, list(reader) exhausts the reader, so your for loop never runs. You also shouldn't assign the result of append to a variable on each iteration, since append mutates the list in place and returns None:
import csv
from collections import defaultdict

data = defaultdict(list)
with open('file1.csv') as infile:
    reader = csv.reader(infile)
    for row in reader:
        try:
            data[row[0]].append(row[1])
        except IndexError:
            pass
print(data)
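With the sample data above (assuming it is comma-separated), data would come out roughly as follows; note that the header row is swallowed as an ordinary entry, so you may want a next(reader) before the loop:
defaultdict(<class 'list'>, {'serial': ['chassis'],
                             'ZX34215': ['Test', 'final-001', 'final-002'],
                             'AB30000': ['Used']})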

How to slice a single CSV file into several smaller ones grouped by a field and deleting columns in the final csv's?

Even though this might sound like a repeated question, I have not found a solution. Well, I have a large .csv file that looks like:
prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....
The aim is to slice this .csv file into multiple smaller .csv files according to the last two columns ('ident' and 'country').
I have used code from an answer to a previous post, which is the following:
import csv
import itertools as it
import operator as op

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc', 'ident', 'country'))

for groupkey, groupdata in it.groupby(sorted_csv_contents,
                                      key=op.itemgetter('prot_desc', 'ident', 'country')):
    with open(outfile_path5 + 'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
        dict_writer.writerows(groupdata)
However, I need my output .csv's to contain just the column 'pep_seq'. A desired output would be:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
What can I do?
Your code was almost correct; it just needed the fieldnames to be set correctly and extrasaction='ignore' to be set. This tells the DictWriter to only write the fields you specify:
import itertools
import operator
import csv

outfile_path4 = 'input.csv'
outfile_path5 = r'my_output_folder\output.csv'

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

group = ['prot_desc', 'ident', 'country']

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=operator.itemgetter(*group))

for groupkey, groupdata in itertools.groupby(sorted_csv_contents,
                                             key=operator.itemgetter(*group)):
    with open(outfile_path5 + 'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=['pep_seq'], extrasaction='ignore')
        dict_writer.writeheader()
        dict_writer.writerows(groupdata)
This will give you an output csv file containing:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
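Note that groupkey is a tuple, so each output filename will embed the whole tuple, something like slice_('21 kDa seed protein [Theobroma cacao]', 'F40', 'EB').csv. If you want cleaner names, you could join the key into a string first, e.g. 'slice_{}_{}_{}.csv'.format(*groupkey).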
The following would output a csv file per country containing only the field you need.
You could always add another step to group by the second field as well; see the sketch after the code below.
import csv

# use a dict so you can store the list of pep_seqs found for each country
# the country value will be the dict key
csv_rows_by_country = {}

with open('in.csv', 'rb') as csv_in:
    csv_reader = csv.reader(csv_in)
    for row in csv_reader:
        if row[7] in csv_rows_by_country:
            # add this pep_seq to the list we already found for this country
            csv_rows_by_country[row[7]].append(row[4])
        else:
            # start a new list for this country - we haven't seen it before
            csv_rows_by_country[row[7]] = [row[4], ]

for country in csv_rows_by_country:
    # create a csv output file for each country and write the pep_seqs into it.
    with open('out_%s.csv' % (country, ), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        for pep_seq in csv_rows_by_country[country]:
            csv_writer.writerow([pep_seq, ])
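To group on both 'ident' and 'country' in one pass, an (ident, country) tuple can serve as the dictionary key. A minimal sketch, assuming the same column positions as above (ident in row[6], country in row[7]) and skipping the header row:
import csv

pep_seqs_by_group = {}

with open('in.csv', 'rb') as csv_in:
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        # key on the (ident, country) pair instead of country alone
        pep_seqs_by_group.setdefault((row[6], row[7]), []).append(row[4])

for (ident, country), pep_seqs in pep_seqs_by_group.items():
    # one output file per (ident, country) combination
    with open('out_%s_%s.csv' % (ident, country), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        csv_writer.writerow(['pep_seq'])
        for pep_seq in pep_seqs:
            csv_writer.writerow([pep_seq])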

More pythonic way of iteratively assigning csv rows to dictionary values?

I have a CSV file with columns holding specific values that I read into specific places in a dictionary, and rows that are separate instances of data, each equal to one full dictionary. I read the data in and then use it to compute certain values, process some of the inputs, etc., for each row before moving on to the next row. My question is: if I have a header that specifies the names of the columns (Key1 versus Key 3A, etc.), can I use that information to avoid the somewhat drawn-out code I am currently using (below)?
import csv

with open(input_file, 'rU') as controlFile:
    reader = csv.reader(controlFile)
    next(reader, None)  # skip the headers
    for row in reader:
        # Grabbing all the necessary inputs
        inputDict = {}
        inputDict["key1"] = row[0]
        inputDict["key2"] = row[1]
        inputDict["key3"] = {}
        inputDict["key3"].update({"A": row[2]})
        inputDict["key3"].update({"B": row[3]})
        inputDict["key3"].update({"C": row[4]})
        inputDict["key3"].update({"D": row[5]})
        inputDict["key3"].update({"E": row[6]})
        inputDict["Key4"] = {}
        inputDict["Key4"].update({"F": row[7]})
        inputDict["Key4"].update({"G": float(row[8])})
        inputDict["Key4"].update({"H": row[9]})
If you use a DictReader, you can improve your code a bit. From the csv docs:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames.
So, if we utilize that:
import csv
import string

results = []
# map sub-key letters (A-E, F-H) to their column positions
mappings = [
    [(string.ascii_uppercase[i - 2], i) for i in range(2, 7)],
    [(string.ascii_uppercase[i - 2], i) for i in range(7, 10)]]

with open(input_file, 'rU') as control_file:
    reader = csv.DictReader(control_file)
    for row in reader:
        row_data = {}
        row_data['key1'] = row['key1']
        row_data['key2'] = row['key2']
        # DictReader rows are keyed by header name, so translate each
        # column position back to its name via reader.fieldnames
        row_data['key3'] = {k: row[reader.fieldnames[v]] for k, v in mappings[0]}
        row_data['key4'] = {k: row[reader.fieldnames[v]] for k, v in mappings[1]}
        results.append(row_data)
Yes, you can:
import csv

with open(infile, 'rU') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        print(row)
Take a look at this piece of code, which zips the header row with each data row:
fields = next(csv_data)
parsed_data = []
for row in csv_data:
    parsed_data.append(dict(zip(fields, row)))
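For instance, with a hypothetical in-memory two-line csv, the zip pairs each header name with its column value:
import csv
import io

# hypothetical inline data standing in for a real file
csv_data = csv.reader(io.StringIO("key1,key2,A\nv1,v2,v3\n"))
fields = next(csv_data)
parsed_data = [dict(zip(fields, row)) for row in csv_data]
# parsed_data == [{'key1': 'v1', 'key2': 'v2', 'A': 'v3'}]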

Write dictionary of lists (varying length) to csv in Python

I am currently struggling with dictionaries of lists.
Given a dictionary like this:
GO_list = {'Seq_A': ['GO:1234', 'GO:2345', 'GO:3456'],
           'Seq_B': ['GO:7777', 'GO:8888']}
Now I wanted to write this dictionary to a csv file as follows:
EDIT: I have added the whole function to give more information.
def map_GI2GO(gilist, mapped, gi_to_go):
    with open(gilist) as infile:
        read_gi = csv.reader(infile)
        GI_list = {rows[0]: rows[1] for rows in read_gi}  # read GI list into dictionary
        GO_list = defaultdict(list)  # set up GO list as empty dictionary of lists

    with open(gi_to_go) as mapping:
        read_go = csv.reader(mapping, delimiter=',')
        for k, v in GI_list.items():  # iterate over GI list and mapping file
            for row in read_go:
                if len(set(row[0]).intersection(v)) > 0:
                    GO_list[k].append(row[1])  # write found GOs into dictionary
                    break

    with open(mapped, 'wb') as outfile:  # save mapped SeqIDs plus GOs
        looked_up_go = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for key, val in GO_list.iteritems():
            looked_up_go.writerow([key] + val)
However, this gives me the following output:
Seq_A,GO:1234;GO:2345;GO:3456
Seq_B,GO:7777;GO:8888
I would prefer to have the list entries in separate columns, separated by a defined delimiter. I am having a hard time getting rid of the ;, which is apparently separating the list entries. Any ideas are welcome.
If I were you, I would try itertools.izip_longest to match up columns of varying length:
from csv import writer
from itertools import izip_longest

GO_list = {'Seq_A': ['GO:1234', 'GO:2345', 'GO:3456'],
           'Seq_B': ['GO:7777', 'GO:8888']}

with open("test.csv", "wb") as csvfile:
    wr = writer(csvfile)
    wr.writerow(GO_list.keys())  # writes title row
    for each in izip_longest(*GO_list.values()):
        wr.writerow(each)
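For the sample GO_list this transposes the lists into columns, one per key, padding the shorter list with blanks (izip_longest fills with None, which csv.writer writes as an empty field). The output would look something like this, though key order is not guaranteed for a plain dict:
Seq_A,Seq_B
GO:1234,GO:7777
GO:2345,GO:8888
GO:3456,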
