Referencing keys in a Python dictionary for CSV reader

New to Python and trying to build a simple CSV reader to create new trades off an existing instrument. Ideally, I'd like to build a dictionary to simplify the parameters required to set up a new trade (instead of using row[1], row[2], row[3], etc., I'd like to use my headers that read Value Date, Trade Date, Price, Quantity, etc.).
I've created dictionary keys below, but am having trouble linking them to my script to create the new trade. What should I put to substitute the rows? Any advice appreciated! Thanks...
Code below:
import acm
import csv

# Opening CSV file
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter=',')
    next(reader, None)
    for row in reader:
        # Match column header with column number
        d = {
            row["Trade Time"],
            row["Value Day"],
            row["Acquire Day"],
            row["Instrument"],
            row["Price"],
            row["Quantity"],
            row["Counterparty"],
            row["Acquirer"],
            row["Trader"],
            row["Currency"],
            row["Portfolio"],
            row["Status"]
        }
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = "8/11/2016 12:00:00 AM"
        NewTrade.ValueDay = "8/13/2016"
        NewTrade.AcquireDay = "8/13/2016"
        NewTrade.Instrument = acm.FInstrument[row["Instrument"]]
        NewTrade.Price = row[4]
        NewTrade.Quantity = row[5]
        NewTrade.Counterparty = acm.FParty[row[6]]
        NewTrade.Acquirer = acm.FParty[row[7]]
        NewTrade.Trader = acm.FUser[row[8]]
        NewTrade.Currency = acm.FCurrency[row[9]]
        NewTrade.Portfolio = acm.FPhysicalPortfolio[row[10]]
        NewTrade.Premium = (int(row[4])*int(row[5]))
        NewTrade.Status = row[11]
        print NewTrade
        NewTrade.Commit()

The csv module already provides this functionality with the csv.DictReader object.
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for row in reader:
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = row['Trade Time']
        NewTrade.ValueDay = row['Value Day']
        NewTrade.AcquireDay = row['Acquire Day']
        NewTrade.Instrument = acm.FInstrument[row['Instrument']]
        NewTrade.Price = row['Price']
        NewTrade.Quantity = row['Quantity']
        # etc
From the documentation:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames. If the row read has more
fields than the fieldnames sequence, the remaining data is added as a
sequence keyed by the value of restkey. If the row read has fewer
fields than the fieldnames sequence, the remaining keys take the value
of the optional restval parameter.
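To illustrate both behaviours described in that quote, here is a minimal sketch in Python 3 style; the file names and column names are placeholders, not from the question:
import csv

# Header row in the file supplies the keys automatically:
with open('trades.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['Price'], row['Quantity'])

# No header row, so supply the keys explicitly via fieldnames:
with open('trades_no_header.csv', newline='') as f:
    reader = csv.DictReader(f, fieldnames=['Instrument', 'Price', 'Quantity'])
    for row in reader:
        print(row['Instrument'])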

Related

How can I combine like rows in a CSV to fill missing data?

I have a CSV file that might look like:
id, value01, value02,
01, , 01b,
01, 01a, ,
02, , 02b,
02, 02a, 02b,
...
As you can see, I have duplicate rows (identified as duplicates by the id; there can be more than two), where each duplicate is missing values that the other duplicates contain.
I think someone who managed this CSV output wrote to the CSV twice rather than combining results and outputting once, so now I need to find a clean way to merge them.
So far, my work is:
import csv

def combine_dups():
    data = []
    with open("file.csv", newline='', mode='r') as csvFile:
        csvData = csv.DictReader(csvFile)
        for row in csvData:
            for lookahead in csvData:
                # Check if lookahead and current row have matching ids
                if row["id"] == lookahead["id"]:
                    # Loop through columns of row and lookahead
                    for col in row:
                        # If current row's column is blank, take value from lookahead
                        if row[col] == '' or row[col] is None:
                            row[col] = lookahead[col]
            data.append(row)  # Add new filled out, completed row
    # Manage data to no longer contain excess duplicates
    # Code here??
    return data
This code isn't correct, as it has two problems:
It loops through csvData for each row, rather than looping through all the data after the current row. This is easily solved using a for loop with an index, but I left that out for simplicity.
The row is filled in with the missing data, but this operation is done multiple times for the other rows with identical ids. How can I avoid this?
Edit:
For clarity, the NEW csv should look like:
id, value01, value02,
01, 01a, 01b,
02, 02a, 02b,
...
Using the csv module.
Ex:
import csv

data = {}
with open(filename) as csvInFile:
    csvData = csv.DictReader(csvInFile)
    fieldnames = csvData.fieldnames
    for row in csvData:
        # Combine data and update.
        data.setdefault(row['id'], dict()).update(
            {k: v for k, v in row.items()
             if v.strip() and not data.get(row['id']).get(k)})

with open(filename, "w", newline='') as csvOutFile:
    csvOutData = csv.DictWriter(csvOutFile, fieldnames=fieldnames)
    csvOutData.writeheader()
    csvOutData.writerows(data.values())  # Write data.
Output:
id, value01, value02
01, 01a, 01b
02, 02a, 02b
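If the one-line setdefault/update is too dense, the same merge logic can be unrolled into an explicit loop; a sketch with identical behaviour:
import csv

data = {}
with open("file.csv", newline='') as csvInFile:
    csvData = csv.DictReader(csvInFile)
    fieldnames = csvData.fieldnames
    for row in csvData:
        merged = data.setdefault(row['id'], {})  # one accumulating dict per id
        for k, v in row.items():
            # keep the first non-blank value seen for each column
            if v.strip() and not merged.get(k):
                merged[k] = v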
This could be solved by using a dictionary object to store the values; the CSV is then traversed just once. You can refer to the code below:
import csv

c_dict = {}
with open("file.csv", newline='', mode='r') as csvFile:
    csvData = csv.DictReader(csvFile)
    for row in csvData:
        if row['id'] not in c_dict.keys():
            c_dict[row['id']] = {}
        if (row['value01'] == '' or row['value01'] is None) or 'value01' in c_dict[row['id']].keys():
            pass
        else:
            c_dict[row['id']]['value01'] = row['value01']
        if (row['value02'] == '' or row['value02'] is None) or 'value02' in c_dict[row['id']].keys():
            pass
        else:
            c_dict[row['id']]['value02'] = row['value02']

keys = ['id', 'value01', 'value02']
with open('output.csv', 'w', newline='') as output_file:
    dict_writer = csv.writer(output_file)
    dict_writer.writerow(keys)
    for k, v in c_dict.items():
        dict_writer.writerow([k, v['value01'], v['value02']])
Note: If you have more than these 2 columns, you may want to move the if-else clauses into a loop over the column names, as in the sketch below.
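For example, a sketch that generalizes the above to any list of value columns (same blank-means-missing convention; the write-out step is unchanged):
import csv

c_dict = {}
value_cols = ['value01', 'value02']  # extend with any further column names

with open("file.csv", newline='') as csvFile:
    csvData = csv.DictReader(csvFile)
    for row in csvData:
        merged = c_dict.setdefault(row['id'], {})
        for col in value_cols:
            # take the first non-empty value seen for each column
            if row[col] not in ('', None) and col not in merged:
                merged[col] = row[col]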

parse a csv into python dictionaries by structured range

I've got a csv file that has ranges of information dumped into comma-separated rows. Each line is a separate row, and each recordset is 'bookended' by rows that say StartRecord (with a unique id) and EndRecord (matching id). The data between the record bookends varies, though there is some repeatability.
StartRecord, ID
name, value, other
name, value,value,value
EndRecord, ID
StartRecord, ID
name, value, other
something, another, another, something new, different
EndRecord, ID
StartRecord, ID
various, name, value, name, value, other
EndRecord, ID
I'm reading in the records using csv reader and iterating the rows. How can I capture the data between a StartRecord and its EndRecord into separate dictionary objects (in this example, I'd have 3 records)?
The continue statement should be helpful here. Try this.
import csv

records = {}  # avoid naming this `dict`, which shadows the built-in
current_id = ''
with open('filename.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        print(row)
        if row[0] == 'StartRecord':
            current_id = row[1]
            records[row[1]] = []
            continue
        if row[0] == 'EndRecord':
            continue
        records[current_id].append(row)
print(records)
Edit: I realised that if your csv file has spaces after the commas like in your example, you should pass skipinitialspace=True to csv.reader rather than changing the delimiter (a delimiter must be a single character).
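For instance (a sketch; only the reader construction changes):
import csv

with open('filename.csv', newline='') as csvfile:
    # skipinitialspace=True makes "StartRecord, 42" parse the same as "StartRecord,42"
    reader = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
    for row in reader:
        print(row)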

Code swap. How would I swap the value of one CSV file column to another?

I have two CSV files. The first file (state_abbreviations.csv) has only state abbreviations and their full state names side by side (like the image below); the second file (test.csv) has the state abbreviations with additional info.
I want to replace each state abbreviation in test.csv with its associated full state name from the first file.
My approach was to read each file and build a dict from the first file (state_abbreviations.csv), then read the second file (test.csv) and, where an abbreviation matches the first file, replace it with the full name.
Any help is appreciated.
import csv

state_initials = ("state_abbr")
state_names = ("state_name")
state_file = open("state_abbreviations.csv", "r")
state_reader = csv.reader(state_file)

headers = None
final_state_initial = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_initial.append((row[0]))
print final_state_initial

headers = None
final_state_abbre = []
for row in state_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in state_initials:
                headers.append(i)
    else:
        final_state_abbre.append((row[1]))
print final_state_abbre

final_state_initial
final_state_abbre
state_dictionary = dict(zip(final_state_initial, final_state_abbre))
print state_dictionary
You almost got it; the approach, that is. Building a dict out of the abbreviations is the easiest way to do this:
with open("state_abbreviations.csv", "r") as f:
# you can use csv.DictReader() instead but lets strive for performance
reader = csv.reader(f)
next(reader) # skip the header
# assuming the first column holds the abbreviation, second the full state name
state_map = {state[0]: state[1] for state in reader}
Now you have state_map containing a map of all your state abbreviations, for example: state_map["FL"] contains Florida.
To replace the values in your test.csv, though, you'll either have to load the whole file into memory, parse it, do the replacement and save it, or create a temporary file, stream-write the changes to it, then overwrite the original file with the temporary file. Assuming that test.csv is not too big to fit into your memory, the first approach is much simpler:
with open("test.csv", "r+U") as f: # open the file in read-write mode
# again, you can use csv.DictReader() for convenience, but this is significantly faster
reader = csv.reader(f)
header = next(reader) # get the header
rows = [] # hold our rows
if "state" in header: # proceed only if `state` column is found in the header
state_index = header.index("state") # find the state column index
for row in reader: # read the CSV row by row
current_state = row[state_index] # get the abbreviated state value
# replace the abbreviation if it exists in our state_map
row[state_index] = state_map.get(current_state, current_state)
rows.append(row) # append the processed row to our `rows` list
# now lets overwrite the file with updated data
f.seek(0) # seek to the file begining
f.truncate() # truncate the rest of the content
writer = csv.writer(f) # create a CSV writer
writer.writerow(header) # write back the header
writer.writerows(rows) # write our modified rows
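For completeness, here is a minimal sketch of the temporary-file variant mentioned above, using the standard tempfile and os modules. It assumes the same state_map and a column named state, and is an illustration rather than the answer's original code:
import csv
import os
import tempfile

with open("test.csv", newline='') as src, \
        tempfile.NamedTemporaryFile("w", newline='', delete=False,
                                    dir=".", suffix=".csv") as tmp:
    reader = csv.reader(src)
    writer = csv.writer(tmp)
    header = next(reader)
    writer.writerow(header)
    state_index = header.index("state")
    for row in reader:
        # stream each row straight to the temp file after replacement
        row[state_index] = state_map.get(row[state_index], row[state_index])
        writer.writerow(row)

os.replace(tmp.name, "test.csv")  # swap the updated file into place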
It seems like you are trying to go through the file twice? This is absolutely not necessary: the first time you go through you are already reading all the lines, so you can then create your dictionary items directly.
In addition, a comprehension can be very useful when creating lists or dictionaries. In this case it might be a bit less readable, though. The alternative would be to create an empty dictionary, start a "real" for-loop, and add all the key:value pairs manually (i.e. with state_dict[row[abbr]] = row[name]).
Finally, I used the with statement when opening the file to ensure it is safely closed when we're done with it. This is good practice when opening files.
import csv

with open("state_abbreviations.csv") as state_file:
    state_reader = csv.DictReader(state_file)
    state_dict = {row['state_abbr']: row['state_name'] for row in state_reader}
print(state_dict)
Edit: note that, like the code you showed, this only creates the dictionary that maps abbreviations to state names. Actually replacing them in the second file would be the next step.
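A sketch of that next step, assuming test.csv has a column literally named state (adjust to your real header) and writing to a new file rather than in place:
import csv

# Build the abbreviation -> full name map as above
with open("state_abbreviations.csv") as state_file:
    state_dict = {row['state_abbr']: row['state_name']
                  for row in csv.DictReader(state_file)}

# Rewrite test.csv with full state names into a new file
with open("test.csv", newline='') as src, \
        open("test_full_names.csv", "w", newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # fall back to the original value if the abbreviation is unknown
        row['state'] = state_dict.get(row['state'], row['state'])
        writer.writerow(row)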
Step 1: Ask Python to remember the mapping from abbreviations to full names; we use a dictionary for that:
import csv

with open('state_abbreviations.csv', 'r') as f:
    csvreader = csv.reader(f)
    next(csvreader)  # skip the header
    abbrs = {r[0]: r[1] for r in csvreader}  # renamed from `abs`, which shadows a built-in
Step 2: Replace the abbreviations with full names and write to an output; I used "test_output.csv":
with open('test.csv', 'r') as reading:
    csvreader = csv.reader(reading)
    next(csvreader)  # skip the header
    header = ['name', 'gender', 'birthdate', 'address', 'city', 'state']
    with open('test_output.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for a in csvreader:
            # writerow takes a single sequence, not separate arguments
            writer.writerow([a[0], a[1], a[2], a[3], a[4], abbrs[a[5]]])

How to slice a single CSV file into several smaller ones grouped by a field and deleting columns in the final csv's?

Even though this might sound like a repeated question, I have not found a solution. Well, I have a large .csv file that looks like:
prot_hit_num,prot_acc,prot_desc,pep_res_before,pep_seq,pep_res_after,ident,country
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPV,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],A,ANSPVL,D,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],L,SSISGAGGGGLA,L,F40,EB
1,gi|21909,21 kDa seed protein [Theobroma cacao],D,NYDNSAGKW,W,F40,EB
....
The aim is to slice this .csv file into multiple smaller .csv files according to the last two columns ('ident' and 'country').
I have used a code from an answer in a previous post and is the following:
csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=op.itemgetter('prot_desc', 'ident', 'country'))

for groupkey, groupdata in it.groupby(sorted_csv_contents,
                                      key=op.itemgetter('prot_desc', 'ident', 'country')):
    with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=fieldnames)
        dict_writer.writerows(groupdata)
However, I need that my output .csv's just contain the column 'pep_seq', a desired output like:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
What can I do?
Your code was almost correct; it just needed the fieldnames to be set correctly and extrasaction='ignore' to be set. This tells the DictWriter to only write the fields you specify:
import itertools
import operator
import csv

outfile_path4 = 'input.csv'
outfile_path5 = r'my_output_folder\output.csv'

csv_contents = []
with open(outfile_path4, 'rb') as fin:
    dict_reader = csv.DictReader(fin)  # default delimiter is comma
    fieldnames = dict_reader.fieldnames  # save for writing
    for line in dict_reader:  # read in all of your data
        csv_contents.append(line)  # gather data into a list (of dicts)

group = ['prot_desc', 'ident', 'country']

# input to itertools.groupby must be sorted by the grouping value
sorted_csv_contents = sorted(csv_contents, key=operator.itemgetter(*group))

for groupkey, groupdata in itertools.groupby(sorted_csv_contents, key=operator.itemgetter(*group)):
    with open(outfile_path5+'slice_{:s}.csv'.format(groupkey), 'wb') as fou:
        dict_writer = csv.DictWriter(fou, fieldnames=['pep_seq'], extrasaction='ignore')
        dict_writer.writeheader()
        dict_writer.writerows(groupdata)
This will give you an output csv file containing:
pep_seq
ANSPV
ANSPVL
SSISGAGGGGLA
NYDNSAGKW
The following would output a csv file per country containing only the field you need.
You could always add another step to group by the second field you need, I think; see the sketch after the code.
import csv

# use a dict so you can store the list of pep_seqs found for each country
# the country value will be the dict key
csv_rows_by_country = {}
with open('in.csv', 'rb') as csv_in:
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        if row[7] in csv_rows_by_country:
            # add this pep_seq to the list we already found for this country
            csv_rows_by_country[row[7]].append(row[4])
        else:
            # start a new list for this country - we haven't seen it before
            csv_rows_by_country[row[7]] = [row[4], ]

for country in csv_rows_by_country:
    # create a csv output file for each country and write the pep_seqs into it.
    with open('out_%s.csv' % (country, ), 'wb') as csv_out:
        csv_writer = csv.writer(csv_out)
        for pep_seq in csv_rows_by_country[country]:
            csv_writer.writerow([pep_seq, ])
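To group on both fields at once, the dictionary key can simply be a tuple. A sketch along the same lines (Python 3 file modes; header skipped; column positions as in the sample data):
import csv

rows_by_group = {}
with open('in.csv', newline='') as csv_in:
    csv_reader = csv.reader(csv_in)
    next(csv_reader)  # skip the header row
    for row in csv_reader:
        # key on (ident, country) together; columns 6 and 7 in the sample data
        rows_by_group.setdefault((row[6], row[7]), []).append(row[4])

for (ident, country), pep_seqs in rows_by_group.items():
    with open('out_%s_%s.csv' % (ident, country), 'w', newline='') as csv_out:
        csv_writer = csv.writer(csv_out)
        csv_writer.writerow(['pep_seq'])
        for pep_seq in pep_seqs:
            csv_writer.writerow([pep_seq])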

More pythonic way of iteratively assigning csv rows to dictionary values?

I have a CSV file whose columns hold specific values that I read into specific places in a dictionary, and whose rows are separate instances of data that each fill one full dictionary. I read the data in and then use it to compute certain values, process some of the inputs, etc., for each row before moving on to the next row. My question is: if I have a header that specifies the names of the columns (Key1 versus Key 3A, etc.), can I use that information to avoid the somewhat drawn-out code I am currently using (below)?
with open(input_file, 'rU') as controlFile:
    reader = csv.reader(controlFile)
    next(reader, None)  # skip the headers
    for row in reader:
        # Grabbing all the necessary inputs
        inputDict = {}
        inputDict["key1"] = row[0]
        inputDict["key2"] = row[1]
        inputDict["key3"] = {}
        inputDict["key3"].update({"A": row[2]})
        inputDict["key3"].update({"B": row[3]})
        inputDict["key3"].update({"C": row[4]})
        inputDict["key3"].update({"D": row[5]})
        inputDict["key3"].update({"E": row[6]})
        inputDict["Key4"] = {}
        inputDict["Key4"].update({"F": row[7]})
        inputDict["Key4"].update({"G": float(row[8])})
        inputDict["Key4"].update({"H": row[9]})
If you use a DictReader, you can improve your code a bit:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames.
So, if we utilize that:
import csv
import string

results = []
# letters A-E map to columns 2-6, letters F-H map to columns 7-9
mappings = [
    [(string.ascii_uppercase[i - 2], i) for i in range(2, 7)],
    [(string.ascii_uppercase[i - 2], i) for i in range(7, 10)]]

with open(input_file, 'rU') as control_file:
    reader = csv.DictReader(control_file)
    for row in reader:
        row_data = {}
        row_data['key1'] = row['key1']
        row_data['key2'] = row['key2']
        # DictReader rows are keyed by fieldname, so translate each
        # column index back to its header name before the lookup
        row_data['key3'] = {k: row[reader.fieldnames[v]] for k, v in mappings[0]}
        row_data['key4'] = {k: row[reader.fieldnames[v]] for k, v in mappings[1]}
        results.append(row_data)
Yes you can.
import csv

with open(infile, 'rU') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        print(row)
Take a look at this piece of code.
fields = csv_data.next()
for row in csv_data:
    parsed_data.append(dict(zip(fields, row)))
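Put in context, that zip trick is a compact stand-in for DictReader. A minimal sketch, assuming a hypothetical input.csv with a header row (Python 3, so next(csv_data) rather than csv_data.next()):
import csv

parsed_data = []
with open('input.csv', newline='') as f:
    csv_data = csv.reader(f)
    fields = next(csv_data)  # first row holds the column names
    for row in csv_data:
        # pair each header with its cell value to build one dict per row
        parsed_data.append(dict(zip(fields, row)))

print(parsed_data[:2])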
