Adding data from a CSV into separate dictionaries in Python

I'm currently importing some data from a CSV and want to add it to separate dictionaries depending on which row I'm iterating through. I've added my dictionaries to a list, which I hoped would let me index the right dictionary from the list based on the row I'm currently iterating over.
I've hit a stumbling block, though: when I assign the row to the list index, I believe it rebinds the list slot to the row rather than changing the dictionary itself (this is my guess from the print statements at the bottom of my code; I'm not 100% sure). I would like the dictionary itself to hold that row's data, not just the slot in the list. My code is below; any help will be much appreciated, and thanks in advance.
import csv
one_information = {}
two_information = {}
three_information = {}
four_information = {}
five_information = {}
six_information = {}
seven_information = {}
lst = [one_information, two_information, three_information, four_information, five_information, six_information, seven_information]
with open('testing.csv', mode='r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for index, row in enumerate(csv_reader):
        if row['Room Name'] == 'test':
            lst[index] = row
        else:
            pass

print(one_information)
print(lst[0])

You are right: lst[index] = row rebinds the list element at that index to the row object, so the original dictionary (e.g. one_information) is left untouched. What you probably mean is to update your dictionary in place:
lst[index].update(row)
# or, on Python 3.9+:
# lst[index] |= row
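For context, here is a minimal sketch of the corrected loop (assuming the same lst and CSV layout as in the question); update() copies the row's keys into the existing dictionary object, so one_information and lst[0] keep referring to the same dict:
import csv

with open('testing.csv', mode='r', encoding='utf-8-sig') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for index, row in enumerate(csv_reader):
        if row['Room Name'] == 'test':
            # update the existing dict in place instead of rebinding the list slot
            lst[index].update(row)

print(one_information)  # now shows the matching row's data
print(lst[0])           # the same object, so the same contents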


Turning a CSV file with a header into a python dictionary

Let's say I have the following example CSV file:
a,b
100,200
400,500
How would I make it into a dictionary like the one below?
{a:[100,400],b:[200,500]}
I am having trouble figuring out how to do it manually before I use a package, so that I understand what's going on. Can anyone help?
Some code I tried:
with open("fake.csv") as f:
index= 0
dictionary = {}
for line in f:
words = line.strip()
words = words.split(",")
if index >= 1:
for x in range(len(headers_list)):
dictionary[headers_list[i]] = words[i]
# only returns the last element which makes sense
else:
headers_list = words
index += 1
At the very least, you should be using the built-in csv package for reading CSV files, so you don't have to bother with parsing yourself. That said, the first approach below still applies to your .strip and .split technique:
Initialize a dictionary with the column names as keys and empty lists as values
Read a line from the csv reader
Zip the line's contents with the column names you got in step 1
For each key:value pair in the zip, update the dictionary by appending
with open("test.csv", "r") as file:
reader = csv.reader(file)
column_names = next(reader) # Reads the first line, which contains the header
data = {col: [] for col in column_names}
for row in reader:
for key, value in zip(column_names, row):
data[key].append(value)
Your issue was that you were using the assignment operator = to overwrite the contents of your dictionary on every iteration. This is why you either want to pre-initialize the dictionary like above, or use a membership check first to test if the key exists in the dictionary, adding it if not:
key = headers_list[i]
if key not in dictionary:
    dictionary[key] = []
dictionary[key].append(words[i])
An even cleaner shortcut is to take advantage of dict.get:
key = headers_list[i]
dictionary[key] = dictionary.get(key, []) + [words[i]]
Another approach would be to take advantage of the csv package by reading each row of the csv file as a dictionary itself:
with open("test.csv", "r") as file:
reader = csv.DictReader(file)
data = {}
for row_dict in reader:
for key, value in row_dict.items():
data[key] = data.get(key, []) + [value]
Another standard library package you could use to clean this up further is collections, with defaultdict(list), where you can directly append to the dictionary at a given key without worrying about initializing with an empty list if the key wasn't already there.
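For illustration, here is a minimal sketch of the DictReader loop above rewritten with collections.defaultdict (same test.csv and keys assumed):
import csv
from collections import defaultdict

with open("test.csv", "r") as file:
    reader = csv.DictReader(file)
    data = defaultdict(list)  # missing keys start out as empty lists
    for row_dict in reader:
        for key, value in row_dict.items():
            data[key].append(value)  # no membership check or .get needed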
To do that, just keep the column names and the data separate, then iterate over the columns and add the value at the corresponding index from each data row. I'm not sure if this works with empty values.
However, I am quite sure that going through pandas would be far easier; it's a widely used library for working with data in external files.
import csv

datas = []
with open('fake.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            cols = row
            line_count += 1
        else:
            datas.append(row)
            line_count += 1

dict = {}
for index, col in enumerate(cols):  # iterate through the columns with their indices
    dict[col] = []                  # start each column key with an empty list
    for data in datas:              # append the value at this column's index from each row
        dict[col].append(data[index])
print(dict)
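As a point of comparison, here is a hedged pandas sketch of the same column-wise read (it assumes pandas is installed; dtype=str keeps the values as strings, matching the csv-based approaches above):
import pandas as pd

df = pd.read_csv('fake.csv', dtype=str)  # every value stays a string
result = df.to_dict(orient='list')       # {'a': ['100', '400'], 'b': ['200', '500']}
print(result)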

How do I make sure that all of the items that I read from a CSV are parsed into a dictionary?

I have a large CSV file from which I am reading some data and adding that data into a dictionary. My CSV file has approximately 360000 rows, while my dictionary ends up with a length of only 5700. I know my CSV has a lot of duplicates, but I expect about 50000 unique rows. I know that Python dictionaries have no size limit. My code reads all 360000 entries in the file, writes them to another file, and terminates. All this processing finishes in about 2 seconds without any exceptions. How do I know for sure that all of the items in the CSV that I process are actually being added to the dictionary?
The code that I am using is as follows:
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
print len(dict1) # Gives 5700
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])
EDIT
I am not sure people understand what I am trying to do and why it matters, so I'll explain.
My CSV file has 3 columns for each row. Their format is as follows:
URL+unique value | unique value | some name
However, the rows are duplicated in the CSV and I want another CSV which just has rows without any duplicates.
The keys in your dictionary are row[1]. The size of the dictionary will depend on how many different values of this field are in the input. It does not matter if the rest of the row (row[2], row[0]) differs between rows.
Example:
a,foo,1
b,bar,2
c,foo,3
d,baz,4
Will result in a dictionary of length 3 if the second field (index 1) is used as the key. The result will be:
{'foo': ['3', 'c'],
 'bar': ['2', 'b'],
 'baz': ['4', 'd']}
The first line's 'foo' entry is overwritten by the third line. Of course the order can be different, since a dictionary has no order.
EDIT: if you're just checking for uniqueness, there's no need to put this into a dictionary (dictionaries are designed for fast lookup and grouping). Use a set here instead.
out_ = set()
for row in reader:
    # csv.reader yields lists, which aren't hashable, so cast each
    # row to a tuple before adding it to the set
    out_.add(tuple(row))

for row in out_:
    writer.writerow([row[1], row[0], row[2]])
Here's the quickest check I can think of.
set_headers = {row[1] for row in reader}
This is a set containing all the 2nd columns (that is to say, row[1]) of all the rows in your CSV. As you probably know, sets cannot contain duplicates, so this gives you how many UNIQUE values you have in your header column of each row.
Since dict.update REPLACES values, this is exactly what you're going to see with len(dict1); in fact, len(set_headers) == len(dict1). Each time you iterate through a row, dict.update CHANGES THE VALUE stored under the key row[1] to [row[2], row[0]]. That's probably just fine if you don't care about the earlier values, but somehow I don't think that's true.
Instead, do this:
for row in reader:
    dict1.setdefault(row[1], []).append((row[0], row[2]))
This will end up with something like:
dict1 = {"foo": [(row1_col0,row1_col2),(row3_col0,row3_col2)],
"baz": [(row2_col0,row2_col2)]}
from input of:
row1_col0, foo, row1_col2
row2_col0, baz, row2_col2
row3_col0, foo, row3_col2
Now you can do, for instance:
for header in dict1:
    for row in dict1[header]:
        print("{}\t{}".format(row[0], row[1]))
The simplest way to make sure you are getting everything is to add two test lists. Add test1, test2 = [], []; then, right after you update your dictionary, add test1.append(row[1]), followed by if row[1] not in test2: test2.append(row[1]). Then you can print the length of both lists: the length of test2 should match the length of your dictionary, and the length of test1 should match the number of rows in your input CSV.
with open("read.csv", 'rb') as input1:
with open("write.csv", 'wb') as output1:
reader = csv.reader(input1, delimiter="\t")
writer = csv.writer(output1, delimiter="\t")
#Just testing if my program reads the whole CSV file
test1,test2=[],[]
for row in reader:
count += 1
print count # Gives 360000
input1.seek(0)
for row in reader:
#print row[1] + "\t" + row[2]
dict1.update({row[1] : [ row[2], row[0] ]})
test1.append(row[1])
if row[1] not in test2: test2.append(row[1])
print 'dictionary length:', len(dict1) # Gives 5700
print 'test 1 length (total values):',len(test1)
print 'test2 length (unique key values):',len(test2)
for key in dict1:
ext_id = key
list1 = dict1[key]
name = list1[0]
url = list1[1]
writer.writerow([ext_id, name, url])

list of dicts to memory for csv reader

I have a list of dictionaries. For example
l = [{'date': '2014-01-01', 'spent': '$3'}, {'date': '2014-01-02', 'spent': '$5'}]
I want to make this csv-like so that I can save it as a CSV if I want to.
I have other code that gets a list and calls splitlines() so I can use csv methods. For example:
reader = csv.reader(resp.read().splitlines(), delimiter=',')
How can I change my list of dictionaries into a list that looks like a CSV file?
I've been trying to cast the dictionary to a string, but haven't had much luck. It should be something like:
"date", "spent"
"2014-01-01", "$3"
"2014-01-02", "$5"
This will also help me print out the list of dictionaries in a nice way for the user.
update
This is the function I have which made me want to have the list of dicts:
def get_daily_sum(resp):
    rev = ['revenue', 'estRevenue']
    reader = csv.reader(resp.read().splitlines(), delimiter=',')
    first = reader.next()
    for i, val in enumerate(first):
        if val in rev:
            place = i
            break
    else:
        place = None
    if place:
        total = sum(float(r[place]) for r in reader)
    else:
        total = 'Not available'
    return total
so I wanted to total up a column from a list of column names. The problem was that the "revenue" column was not always in the same place.
Is there a better way? I have one object that returns a CSV-like string, and another that gives a list of dicts.
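(As an aside, here is one hedged sketch of the summing function rewritten with csv.DictReader, which looks fields up by header name so the column's position no longer matters; Python 2 syntax to match the rest of the thread:)
import csv

def get_daily_sum(resp):
    rev = ['revenue', 'estRevenue']
    reader = csv.DictReader(resp.read().splitlines())
    # pick whichever revenue column actually appears in the header
    col = next((c for c in rev if c in reader.fieldnames), None)
    if col is None:
        return 'Not available'
    return sum(float(row[col]) for row in reader)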
You would want to use csv.DictWriter to write the file.
import csv

with open('outputfile.csv', 'wb') as fout:
    writer = csv.DictWriter(fout, ['date', 'spent'])
    writer.writeheader()  # writes the "date","spent" header row first
    for dct in lst_of_dict:
        writer.writerow(dct)
A solution using a list comprehension; it should work for any number of keys, but only if all your dicts have the same keys.
l = [[d[key] for key in dicts[0].keys()] for d in dicts]
To attach the key names as a column-title row:
l = [dicts[0].keys()] + l
This will return a list of lists which can be exported to csv:
import csv
myfile = csv.writer(open("data.csv", "wb"))
myfile.writerows(l)

How can I write to an existing csv file from a dictionary to a specific column?

I have a dictionary I created from a csv file and would like to use this dict to update the values in a specific column of a different csv file called sheet2.csv.
Sheet2.csv has many columns with different headers and I need to only update the column PartNumber based on my key value pairs in my dict.
My question is how would I use the keys in dict to search through sheet2.csv and update/write to only the column PartNumber with the appropriate value?
I am new to python so I hope this is not too confusing and any help is appreciated!
This is the code I used to create the dict:
import csv

a = open('sheet1.csv', 'rU')
csvReader = csv.DictReader(a)
dict = {}
for line in csvReader:
    dict[line["ReferenceID"]] = line["PartNumber"]
print(dict)
dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}
To make things even more confusing, I also need to make sure that already existing rows in sheet2 remain unchanged. For example, if there is a row with ReferenceID as R1234 and PartNumber as PN000000, it should stay untouched. So I would need to skip rows which are not in my dict.
Link to sample CSVs:
http://dropbox.com/s/zkagunnm0xgroy5/Sheet1.csv
http://dropbox.com/s/amb7vr48mdc94v6/Sheet2.csv
EDIT: Let me rephrase my question and provide a better example csvfile.
Let's say I have a Dict = {'R150': 'PN000123', 'R331': 'PN000873', 'C774': 'PN000064', 'L7896': 'PN000447', 'R0640': 'PN000878', 'R454': 'PN000333'}.
I need to fill in this csv file: https://www.dropbox.com/s/c95mlitjrvyppef/sheet.csv
Specifically, I need to fill in the PartNumber column using the keys of the dict I created. So I need to iterate through column ReferenceID and compare that value to my keys in dict. If there is a match I need to fill in the corresponding PartNumber cell with that value.... I'm sorry if this is all confusing!
The code below should do the trick. It first builds a dictionary just like your code and then moves on to read Sheet2.csv row by row, possibly updating the part number. The output goes to temp.csv, which you can compare with the initial Sheet2.csv. In case you want to overwrite Sheet2.csv with the contents of temp.csv, simply uncomment the line with shutil.move.
Note that the sample files you provided do not contain any updateable data, so Sheet2.csv and temp.csv will be identical. I tested this with a slightly modified Sheet1.csv where I made sure that it actually contains a reference ID used by Sheet2.csv.
import csv
import shutil

def createReferenceIdToPartNumberMap(csvToReadPath):
    result = {}
    print 'read part numbers to update from', csvToReadPath
    with open(csvToReadPath, 'rb') as csvInFile:
        csvReader = csv.DictReader(csvInFile)
        for row in csvReader:
            result[row['ReferenceID']] = row['PartNumber']
    return result

def updatePartNumbers(csvToUpdatePath, referenceIdToPartNumberMap):
    tempCsvPath = 'temp.csv'
    print 'update part numbers in', csvToUpdatePath
    with open(csvToUpdatePath, 'rb') as csvInFile:
        csvReader = csv.reader(csvInFile)
        # Figure out which columns contain the reference ID and part number.
        titleRow = csvReader.next()
        referenceIdColumn = titleRow.index('ReferenceID')
        partNumberColumn = titleRow.index('PartNumber')
        # Write temporary CSV file with updated part numbers.
        with open(tempCsvPath, 'wb') as tempCsvFile:
            csvWriter = csv.writer(tempCsvFile)
            csvWriter.writerow(titleRow)
            for row in csvReader:
                # Check if there is an updated part number.
                referenceId = row[referenceIdColumn]
                newPartNumber = referenceIdToPartNumberMap.get(referenceId)
                # If so, update the row just read accordingly.
                if newPartNumber is not None:
                    row[partNumberColumn] = newPartNumber
                    print '  update part number for %s to %s' % (referenceId, newPartNumber)
                csvWriter.writerow(row)
    # TODO: Move the temporary CSV file over the initial CSV file.
    # shutil.move(tempCsvPath, csvToUpdatePath)

if __name__ == '__main__':
    referenceIdToPartNumberMap = createReferenceIdToPartNumberMap('Sheet1.csv')
    updatePartNumbers('Sheet2.csv', referenceIdToPartNumberMap)

Need more efficient way to parse out csv file in Python

Here's a sample csv file
id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601
This is the output I'm looking for (a list of serial_nos within a list for each id):
[2, [500,501,502]]
[3, [600, 601]]
I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.
file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()
for row in data:
    each_row = []
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)

for rec in zipped_data:
    if rec[0] not in ids:
        ids.append(rec[0])

for id in ids:
    for rec in zipped_data:
        if rec[0] == id:
            ser_no.append(rec[1])
    tmp.append(id)
    tmp.append(ser_no)
    print tmp
    tmp = []
    ser_no = []
**I've omitted variable initialization for simplicity.
The print tmp inside the loop gives me the output I mentioned above. I know there's a better, more Pythonic, way to do this; it's just too messy! Any suggestions would be great!
import csv
from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'
data = csv.reader(open(file))
fields = data.next()

for row in data:
    records[row[0]].append(row[1])

# sorting by ids since dict keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results
If the list of serial_nos need to be unique just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1])
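(A short sketch of that swap, assuming the same data reader as above:)
from collections import defaultdict

records = defaultdict(set)
for row in data:
    records[row[0]].add(row[1])  # duplicates are silently ignored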
Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.
import collections

result = collections.defaultdict(list)
for row in data:
    result[row[0]].append(row[1])
Here's a version I wrote; it looks like there are plenty of answers for this one already, though.
You might like using csv.DictReader, which gives you easy access to each column by field name (from the header / first line).
#!/usr/bin/python
import csv

myFile = open('sample.csv', 'rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)

myData = {}
for myRow in csvFile:
    myId = myRow['id']
    if not myData.has_key(myId):
        myData[myId] = []
    myData[myId].append(myRow['serial_no'])

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])

myFile.close()
Some observations:
0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...
1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.
2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.
3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.
4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.
5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.
6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).
Applying these ideas, we get:
import csv

filename = 'test.csv'
with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next()  # ignore the field labels
    rows = list(data)  # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # ...for each of the IDs there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]
We can probably do even better than this by using the groupby function from the itertools module.
An example using itertools.groupby follows. This only works if the rows are already grouped by id:
from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'

# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:
    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))
    # loop through the groups, printing a list for each one
    for i, j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]
note the space in front of ' serial_no'. This is because of the space after the comma in the input file
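(As a hedged aside: csv readers also accept a skipinitialspace flag that strips the space after each delimiter, so the field name becomes plain 'serial_no'. A minimal sketch:)
from csv import DictReader

with open('test.csv') as infile:
    reader = DictReader(infile, skipinitialspace=True)
    for row in reader:
        print(row['serial_no'])  # no leading-space key needed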
