parse a csv into python dictionaries by structured range - python

I've got a csv file that has ranges of information dumped into rows, comma separated. Each line is a separate row, and each recordset is 'bookended' by rows that say StartRecord (with a unique id) and EndRecord (matching id). The data between the record bookends varies, though there is some repeatability.
StartRecord, ID
name, value, other
name, value,value,value
EndRecord, ID
StartRecord, ID
name, value, other
something, another, another, something new, different
EndRecord, ID
StartRecord, ID
various, name, value, name, value, other
EndRecord, ID
I'm reading in the records using csv.reader and iterating over the rows. How can I capture the data between each StartRecord and EndRecord into separate dictionary objects (in this example, I'd have 3 records)?

The continue statement should be helpful here. Try this:
import csv

records = {}  # maps record id -> list of rows
current_id = ''
with open('filename.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        print(row)
        if row[0] == 'StartRecord':
            current_id = row[1]
            records[current_id] = []
            continue
        if row[0] == 'EndRecord':
            continue
        records[current_id].append(row)
print(records)
Edit: I realised that you may need to pass skipinitialspace=True to csv.reader if your csv file has spaces after the commas like in your example.
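For reference, assuming the three StartRecord lines carry distinct ids (say '1', '2' and '3'), the resulting dictionary would look roughly like:
records = {
    '1': [['name', 'value', 'other'],
          ['name', 'value', 'value', 'value']],
    '2': [['name', 'value', 'other'],
          ['something', 'another', 'another', 'something new', 'different']],
    '3': [['various', 'name', 'value', 'name', 'value', 'other']],
}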

How can I combine like-rows in a CSV to fill missing data?

I have a CSV file that might look like:
id, value01, value02,
01, , 01b,
01, 01a, ,
02, , 02b,
02, 02a, 02b,
...
As you can see, I have duplicate rows (determined as duplicates by the id; there can be more than two), where one duplicate has missing values and the other duplicates contain the values that are missing.
I think someone who managed this CSV output wrote to the CSV twice rather than combining results and writing once, so now I need to find a clean way to fix it.
So far, my work is:
import csv

def combine_dups():
    data = []
    with open("file.csv", newline='', mode='r') as csvFile:
        csvData = csv.DictReader(csvFile)
        for row in csvData:
            for lookahead in csvData:
                # Check if lookahead and current row have matching ids
                if row["id"] == lookahead["id"]:
                    # Loop through columns of row and lookahead
                    for col in row:
                        # If current row's column is blank, take value from lookahead
                        if row[col] == '' or row[col] is None:
                            row[col] = lookahead[col]
            data.append(row)  # Add new filled out, completed row
    # Manage data to no longer contain excess duplicates
    # Code here??
    return data
This code isn't correct, as it has two problems:
It loops through csvData from the start for each row, rather than looping through only the data after the current row. This is easily solved using a for loop with an index, but I left that out for simplicity.
The row is filled in with the missing data, but this operation is repeated for every other row with an identical id. How can I avoid this?
Edit:
For clarity, the NEW csv should look like:
id, value01, value02,
01, 01a, 01b,
02, 02a, 02b,
...
Using the csv module.
Ex:
import csv

data = {}
with open(filename) as csvInFile:
    csvData = csv.DictReader(csvInFile)
    fieldnames = csvData.fieldnames
    for row in csvData:
        # Combine data and update.
        data.setdefault(row['id'], dict()).update(
            {k: v for k, v in row.items()
             if v.strip() and not data.get(row['id']).get(k)})

with open(filename, "w", newline='') as csvOutFile:
    csvOutData = csv.DictWriter(csvOutFile, fieldnames=fieldnames)
    csvOutData.writeheader()
    csvOutData.writerows(data.values())  # Write data.
Output:
id, value01, value02
01, 01a, 01b
02, 02a, 02b
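If the dense update line is hard to read, it is equivalent to this expanded form inside the for loop:
rec = data.setdefault(row['id'], {})
for k, v in row.items():
    # only fill fields that are non-blank and not already set
    if v.strip() and not rec.get(k):
        rec[k] = v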
This could be solved by using a dictionary object to store the values. The csv would then be traversed just once, row by row.
You can refer to the code below:
import csv

c_dict = {}
with open("file.csv", newline='', mode='r') as csvFile:
    csvData = csv.DictReader(csvFile)
    for row in csvData:
        if row['id'] not in c_dict:
            c_dict[row['id']] = {}
        if (row['value01'] == '' or row['value01'] is None) or 'value01' in c_dict[row['id']]:
            pass
        else:
            c_dict[row['id']]['value01'] = row['value01']
        if (row['value02'] == '' or row['value02'] is None) or 'value02' in c_dict[row['id']]:
            pass
        else:
            c_dict[row['id']]['value02'] = row['value02']

keys = ['id', 'value01', 'value02']
with open('output.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    writer.writerow(keys)
    for k, v in c_dict.items():
        writer.writerow([k, v.get('value01', ''), v.get('value02', '')])
Note: If you have more than these 2 columns, you may want to move the if/else logic into a loop over the column names, as sketched below.
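A minimal sketch of that generalisation, assuming the same file layout as above:
import csv

c_dict = {}
value_columns = ['value01', 'value02']  # extend with any further column names
with open("file.csv", newline='') as csvFile:
    for row in csv.DictReader(csvFile):
        rec = c_dict.setdefault(row['id'], {})
        for col in value_columns:
            # take the first non-blank value seen for each column
            if row[col] and row[col].strip() and col not in rec:
                rec[col] = row[col]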

Remove columns + keep certain rows in multiple large .csv files using python

Hello, I'm really new here, as well as to the world of Python.
I have some (~1000) .csv files, each containing ~1,800,000 rows of information. The files are in the following form:
5302730,131841,-0.29999999999999999,NULL,2013-12-31 22:00:46.773
5303072,188420,28.199999999999999,NULL,2013-12-31 22:27:46.863
5350066,131841,0.29999999999999999,NULL,2014-01-01 00:37:21.023
5385220,-268368577,4.5,NULL,2014-01-01 03:12:14.163
5305752,-268368587,5.1900000000000004,NULL,2014-01-01 03:11:55.207
So, I would like, for all of the files:
(1) to remove the 4th (NULL) column
(2) to keep in every file only certain rows (depending on the value of the first column, i.e. 5302730: keep only the rows containing that value)
I don't know if this is even possible, so any answer is appreciated!
Thanks in advance.
Have a look at the csv module.
You can use the csv.reader function to generate an iterator of lines, with each line's cells as a list.
for line in csv.reader(open("filename.csv")):
    # Remove 4th column (index 3), remember python starts counting at 0
    line = line[:3] + line[4:]
    if line[0] == "thevalueforthefirstcolumn":
        dosomethingwith(line)
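To actually persist the filtered rows, a minimal continuation of that idea (the output filename is an assumption; 5302730 is the example id from the question):
import csv

with open("filename.csv", newline='') as fin, \
        open("filtered.csv", "w", newline='') as fout:
    writer = csv.writer(fout)
    for line in csv.reader(fin):
        line = line[:3] + line[4:]  # drop the 4th (NULL) column
        if line[0] == "5302730":    # keep only rows for this id
            writer.writerow(line)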
If you wish to do this sort of operation with CSV files more than once and want to use different parameters regarding column to skip, column to use as key and what to filter on, you can use something like this:
import csv

def read_csv(filename, column_to_skip=None, key_column=0, key_filter=None):
    data_from_csv = []
    with open(filename) as csvfile:
        csv_reader = csv.reader(csvfile)
        for row in csv_reader:
            # Skip data in specific column
            if column_to_skip is not None:
                del row[column_to_skip]
            # Filter out rows where the key doesn't match
            if key_filter is not None:
                key = row[key_column]
                if key_filter != key:
                    continue
            data_from_csv.append(row)
    return data_from_csv

def write_csv(filename, data_to_write):
    with open(filename, 'w', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        for row in data_to_write:
            csv_writer.writerow(row)

data = read_csv('data.csv', column_to_skip=3, key_filter='5302730')
write_csv('data2.csv', data)
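Since you have ~1000 files, you can drive those two helpers with glob; the directory pattern and the output naming here are assumptions:
import glob

for path in glob.glob('*.csv'):
    rows = read_csv(path, column_to_skip=3, key_filter='5302730')
    write_csv(path.replace('.csv', '_filtered.csv'), rows)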

Referencing keys in a Python dictionary for CSV reader

New to python and trying to build a simple CSV reader to create new trades off an existing instrument. Ideally, I'd like to build a dictionary to simplify the parameters required to set up a new trade (instead of using row[1], row[2], row[3], etc., I'd like to use my headers that read Value Date, Trade Date, Price, Quantity, etc.).
I've created dictionary keys below, but am having trouble linking them to my script to create the new trade. What should I put to substitute the rows? Any advice appreciated! Thanks...
Code below:
import acm
import csv

# Opening CSV file
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f, delimiter=',')
    next(reader, None)
    for row in reader:
        # Match column header with column number
        d = {
            row["Trade Time"],
            row["Value Day"],
            row["Acquire Day"],
            row["Instrument"],
            row["Price"],
            row["Quantity"],
            row["Counterparty"],
            row["Acquirer"],
            row["Trader"],
            row["Currency"],
            row["Portfolio"],
            row["Status"]
        }
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = "8/11/2016 12:00:00 AM"
        NewTrade.ValueDay = "8/13/2016"
        NewTrade.AcquireDay = "8/13/2016"
        NewTrade.Instrument = acm.FInstrument[row["Instrument"]]
        NewTrade.Price = row[4]
        NewTrade.Quantity = row[5]
        NewTrade.Counterparty = acm.FParty[row[6]]
        NewTrade.Acquirer = acm.FParty[row[7]]
        NewTrade.Trader = acm.FUser[row[8]]
        NewTrade.Currency = acm.FCurrency[row[9]]
        NewTrade.Portfolio = acm.FPhysicalPortfolio[row[10]]
        NewTrade.Premium = (int(row[4])*int(row[5]))
        NewTrade.Status = row[11]
        print NewTrade
        NewTrade.Commit()
The csv module already provides this functionality with the csv.DictReader object.
with open('C:\Users\Yina.Huang\Desktop\export\TradeBooking.csv', 'rb') as f:
    reader = csv.DictReader(f)
    for row in reader:
        NewTrade = acm.FTrade()
        NewTrade.TradeTime = row['Trade Time']
        NewTrade.ValueDay = row['Value Day']
        NewTrade.AcquireDay = row['Acquire Day']
        NewTrade.Instrument = acm.FInstrument[row['Instrument']]
        NewTrade.Price = row['Price']
        NewTrade.Quantity = row['Quantity']
        # etc
From the documentation:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames. If the row read has more
fields than the fieldnames sequence, the remaining data is added as a
sequence keyed by the value of restkey. If the row read has fewer
fields than the fieldnames sequence, the remaining keys take the value
of the optional restval parameter.
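Following the same pattern, the remaining assignments from your script become the following (this simply mirrors the acm calls from the question, so treat it as a sketch):
NewTrade.Counterparty = acm.FParty[row['Counterparty']]
NewTrade.Acquirer = acm.FParty[row['Acquirer']]
NewTrade.Trader = acm.FUser[row['Trader']]
NewTrade.Currency = acm.FCurrency[row['Currency']]
NewTrade.Portfolio = acm.FPhysicalPortfolio[row['Portfolio']]
NewTrade.Premium = int(row['Price']) * int(row['Quantity'])
NewTrade.Status = row['Status']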

More pythonic way of iteratively assigning csv rows to dictionary values?

I have a CSV file with columns holding specific values that I read into specific places in a dictionary, and rows that are separate instances of data, each one filling a full dictionary. I read the data in and then use it to compute certain values, process some of the inputs, etc., for each row before moving on to the next row. My question is: if I have a header that specifies the names of the columns (Key1 versus Key 3A, etc.), can I use that information to avoid the somewhat drawn-out code I am currently using (below)?
with open(input_file, 'rU') as controlFile:
    reader = csv.reader(controlFile)
    next(reader, None)  # skip the headers
    for row in reader:
        # Grabbing all the necessary inputs
        inputDict = {}
        inputDict["key1"] = row[0]
        inputDict["key2"] = row[1]
        inputDict["key3"] = {}
        inputDict["key3"].update({"A": row[2]})
        inputDict["key3"].update({"B": row[3]})
        inputDict["key3"].update({"C": row[4]})
        inputDict["key3"].update({"D": row[5]})
        inputDict["key3"].update({"E": row[6]})
        inputDict["Key4"] = {}
        inputDict["Key4"].update({"F": row[7]})
        inputDict["Key4"].update({"G": float(row[8])})
        inputDict["Key4"].update({"H": row[9]})
If you use a DictReader, you can improve your code a bit:
Create an object which operates like a regular reader but maps the
information read into a dict whose keys are given by the optional
fieldnames parameter. The fieldnames parameter is a sequence whose
elements are associated with the fields of the input data in order.
These elements become the keys of the resulting dictionary. If the
fieldnames parameter is omitted, the values in the first row of the
csvfile will be used as the fieldnames.
So, if we utilize that:
import csv
import string

results = []
mappings = [
    [(string.ascii_uppercase[i - 2], i) for i in range(2, 7)],
    [(string.ascii_uppercase[i - 2], i) for i in range(7, 10)]]
with open(input_file, 'rU') as control_file:
    reader = csv.DictReader(control_file)
    fields = reader.fieldnames
    for row in reader:
        row_data = {}
        row_data['key1'] = row['key1']
        row_data['key2'] = row['key2']
        # DictReader rows are keyed by header name, so translate the
        # column indices in mappings to field names before lookup
        row_data['key3'] = {k: row[fields[v]] for k, v in mappings[0]}
        row_data['key4'] = {k: row[fields[v]] for k, v in mappings[1]}
        results.append(row_data)
yes you can.
import csv

with open(input_file, 'rU') as infile:
    reader = csv.DictReader(infile)
    for row in reader:
        print(row)
Take a look at this piece of code (csv_data is a csv.reader):
fields = next(csv_data)  # header row
parsed_data = []
for row in csv_data:
    parsed_data.append(dict(zip(fields, row)))
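A self-contained version of the same pattern, assuming the data lives in a file named input.csv:
import csv

with open('input.csv', newline='') as f:
    csv_data = csv.reader(f)
    fields = next(csv_data)  # the first row holds the column names
    parsed_data = [dict(zip(fields, row)) for row in csv_data]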

Python: General CSV file parsing and manipulation

The purpose of my Python script is to compare the data present in multiple CSV files, looking for discrepancies. The data are ordered, but the ordering differs between files. The files contain about 70K lines, weighing around 15MB. Nothing fancy or hardcore here. Here's part of the code:
import csv

def getCSV(fpath):
    allRows = []
    with open(fpath, "rb") as f:
        csvfile = csv.reader(f)
        for row in csvfile:
            allRows.append(row)
    allCols = map(list, zip(*allRows))
Am I properly reading from my CSV files? I'm using csv.reader, but would I benefit from using csv.DictReader?
How can I create a list containing whole rows which have a certain value in a precise column?
Are you sure you want to be keeping all rows around? This creates a list with matching values only. fname could also come from glob.glob() or os.listdir() or whatever other data source you choose. Just to note: you mention the 20th column, but row[20] will be the 21st column...
import csv

matching20 = []
for fname in ('file1.csv', 'file2.csv', 'file3.csv'):
    with open(fname) as fin:
        csvin = csv.reader(fin)
        next(csvin)  # <--- if you want to skip header row
        for row in csvin:
            if row[20] == 'value':
                matching20.append(row)  # or do something with it here
You only want csv.DictReader if you have a header row and want to access your columns by name.
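For example, a minimal DictReader sketch (the 'colname' header is hypothetical):
import csv

with open('file1.csv', newline='') as fin:
    for row in csv.DictReader(fin):
        if row['colname'] == 'value':  # access the column by header name
            print(row)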
This should work; you don't need to make another list to have access to the columns.
import csv

def getCSV(fpath):
    with open(fpath) as ifile:
        csvfile = csv.reader(ifile)
        rows = list(csvfile)
    value_20 = [x for x in rows if x[20] == 'value']
    return value_20
If I understand the question correctly, you want to include a row if value is in the row, but you don't know which column value is in, correct?
If your rows are lists, then this should work:
testlist = [row for row in allRows if 'value' in row]
post-edit:
If, as you say, you want a list of rows where value is in a specified column (specified by an integer pos), then:
pos = 20
testlist = [row for row in allRows if row[pos] == 'value']
(I haven't tested this, but let me know if that works.)
