I posted below a code that works fine. What it does at the moment is:
it opens two .csv files, 'CMF.csv' and 'D65.csv', and then
performs some math on them.
Here's the simple structure of those files:
'CMF.csv' (wavelength, x, y, z)
400,1.879338E-02,2.589775E-03,8.508254E-02
410,8.277331E-02,1.041303E-02,3.832822E-01
420,2.077647E-01,2.576133E-02,9.933444E-01
...etc
'D65.csv': (wavelength, a, b)
400,82.7549,14.708
410,91.486,17.6753
420,93.4318,20.995
...etc
I have a 3rd file, data.csv, with this structure (serialNumber, wavelength, measurement, name):
0,400,2.21,1
0,410,2.22,1
0,420,2.22,1
...
1,400,2.21,2
1,410,2.22,2
1,420,2.22,2
...etc
What I would like to do is write a few lines of code to perform
math on all the series in the last file (series are defined by their serial number and their name).
For example, I need a loop that will perform, for each name or serial number, and for each wavelength, the operation:
x * a * measurement
I tried to load data.csv in the csv reader like the other files, but I couldn't.
Any ideas?
Thanks
import csv

with open('CMF.csv') as cmf:
    reader = csv.reader(cmf)
    dict_cmf = dict()
    for row in reader:
        dict_cmf[float(row[0])] = row

with open('D65.csv') as d65:
    reader = csv.reader(d65)
    dict_d65 = dict()
    for row in reader:
        dict_d65[float(row[0])] = row

with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        dict_sp[float(row[0])] = row

Y = 0
Y_total = 0
X = 0
X_total = 0
Z = 0
Z_total = 0
i = 0
j = 0

for i in range(400, 700, i + 10):
    X = float(dict_cmf[i][1]) * float(dict_d65[i][1])
    X_total = X_total + X
    Y = float(dict_cmf[i][2]) * float(dict_d65[i][1])
    Y_total = Y_total + Y
    Z = float(dict_cmf[i][3]) * float(dict_d65[i][1])
    Z_total = Z_total + Z

wp_X = 100 * X_total / Y_total
wp_Y = 100 * Y_total / Y_total
wp_Z = 100 * Z_total / Y_total

print Y_total
print "D65_CMF_2006_10_deg white point = "
print wp_X, wp_Y, wp_Z
I get this :
Traceback (most recent call last):
  File "C:\Users\gary\Documents\eclipse\Spectro\1illum_XYZ2006_D65_numpy.py", line 24, in <module>
    dict_sp[row[0]] = row
IndexError: list index out of range
You need pandas. You can read the files into pandas tables, then join them, replacing your code with the following:
import pandas
cmf = pandas.read_csv('CMF.csv', names=['wavelength', 'x', 'y', 'z'])
d65 = pandas.read_csv('D65.csv', names=['wavelength', 'a', 'b'])
data = pandas.read_csv('data.csv', names=['serialNumber', 'wavelength', 'measurement', 'name'])
lookup = pandas.merge(cmf, d65, on='wavelength')
merged = pandas.merge(data, lookup, on='wavelength')
totals = ((lookup[['x', 'y', 'z']].T*lookup['a']).T).sum()
wps = totals/totals['y']
print totals['y']
print "D65_CMF_2006_10_deg white point = "
print wps
Now, that doesn't do the last bit where you want to calculate extra values for each measurement. You can do this by adding a column to merged, like this:
merged['newcol'] = merged.x * merged.a * merged.measurement
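If you then want per-series totals (the "for each name or serial number" part of the question), groupby is one way to sketch it; the small frame below just stands in for the merged table built above, with made-up values:

```python
import pandas as pd

# stand-in for the merged table built above (values are made up for illustration)
merged = pd.DataFrame({
    'serialNumber': [0, 0, 1, 1],
    'wavelength': [400, 410, 400, 410],
    'measurement': [2.21, 2.22, 2.21, 2.22],
    'x': [0.0188, 0.0828, 0.0188, 0.0828],
    'a': [82.75, 91.49, 82.75, 91.49],
})
merged['newcol'] = merged.x * merged.a * merged.measurement
# one total per series, keyed by serial number
totals = merged.groupby('serialNumber')['newcol'].sum()
print(totals)
```

groupby('name') works the same way if the series should be keyed by name instead.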
One or more of the lines in data.csv does not contain what you think it does. Put your statement inside a try...except block to see what the problem is:
with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        try:
            dict_sp[float(row[0])] = row
        except IndexError as e:
            print 'The problematic row is:'
            print row
            raise e
A proper debugger would also be helpful in this kind of situation.
pandas is probably the better way to go, but if you'd like an example in vanilla Python, you can have a look at this example:
import csv
from collections import defaultdict

d = defaultdict(dict)
for fname, cols in [('CMF.csv', ('x', 'y', 'z')), ('D65.csv', ('a', 'b'))]:
    with open(fname) as ifile:
        reader = csv.reader(ifile)
        for row in reader:
            wl, values = int(row[0]), row[1:]
            d[wl].update(zip(cols, map(float, values)))

measurements = defaultdict(dict)
with open('data.csv') as ifile:
    reader = csv.reader(ifile)
    cols = ('measurement', 'name')
    for serial, wl, me, name in reader:
        measurements[int(serial)][int(wl)] = dict(zip(cols, (float(me), str(name))))

for serial in sorted(measurements.keys()):
    for wl in sorted(measurements[serial].keys()):
        me = measurements[serial][wl]['measurement']
        print me * d[wl]['x'] * d[wl]['a']
This stores x, y, z, a and b in a dictionary of dictionaries keyed by wavelength (there is no apparent reason to store these values in separate dicts).
The measurements are stored in a dictionary two levels deep, keyed by serial and wavelength. This way you can iterate over all serials and all corresponding wavelengths, as shown in the latter part of the code.
As for your specific calculations on the data in your example, this can be done quite easily with this structure:
tot_x = sum(v['x'] * v['a'] for v in d.values())
tot_y = sum(v['y'] * v['a'] for v in d.values())
tot_z = sum(v['z'] * v['a'] for v in d.values())
wp_x = 100 * tot_x / tot_y
wp_y = 100 * tot_y / tot_y  # Sure this is correct? It will always be 100
wp_z = 100 * tot_z / tot_y
print wp_x, wp_y, wp_z  # 798.56037811 100.0 3775.04316468
These are the dictionaries given the input file in your question:
>>> from pprint import pprint
>>> pprint(dict(d))
{400: {'a': 82.7549,
'b': 14.708,
'x': 0.01879338,
'y': 0.002589775,
'z': 0.08508254},
410: {'a': 91.486,
'b': 17.6753,
'x': 0.08277331,
'y': 0.01041303,
'z': 0.3832822},
420: {'a': 93.4318,
'b': 20.995,
'x': 0.2077647,
'y': 0.02576133,
'z': 0.9933444}}
>>> pprint(dict(measurements))
{0: {400: {'measurement': 2.21, 'name': '1'},
410: {'measurement': 2.22, 'name': '1'},
420: {'measurement': 2.22, 'name': '1'}},
1: {400: {'measurement': 2.21, 'name': '2'},
410: {'measurement': 2.22, 'name': '2'},
420: {'measurement': 2.22, 'name': '2'}}}
I would like to read csv data containing entries with a label and two corresponding data points. In all, there are three labels: N, M, U.
I would like to create a dict with a key for each label, and all the corresponding data points in a list as that key's value. I tried the code below, but it returns a dict like {"N": [all data points]}: every data point is assigned to the label N, and no new keys are created for M and U.
Does anybody see the problem here?
with open('./data.csv', 'r') as i:
    D = {}
    for line in i:
        datatuple = tuple(line[2:-1].split(","))
        floattuple = (float(datatuple[0])), float(datatuple[1])
        label = line[:1]
        if label in D:
            D[label].append(floattuple)
        else:
            D[label] = [floattuple]
return D
Example data from the csv:
Thanks!
Your problem is the exact reason Python's dict has the .setdefault() method.
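As a minimal illustration of what setdefault does here (toy data, not from your file):

```python
d = {}
for label, point in [('N', (1.0, 2.0)), ('M', (3.0, 4.0)), ('N', (5.0, 6.0))]:
    # the first sighting of a label creates the empty list; later ones reuse it
    d.setdefault(label, []).append(point)
print(d)  # {'N': [(1.0, 2.0), (5.0, 6.0)], 'M': [(3.0, 4.0)]}
```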
First, let's define a generator that produces some random data:
In [28]: def lines():
...: from random import random, randrange
...: for _ in range(12):
...: key = {0:'M', 1:'N', 2:'U'}[randrange(3)]
...: yield ','.join((
...: key,
...: "%+5.3f"%(random()*10-5),
...: "%+5.3f"%(random()*10-5)
...: ))
Then, just as you would read lines from a file, we read lines from the generator and update our dictionary using the setdefault() method: if the key is new, it supplies a default value, here an empty list, which you can immediately use to append the (x, y) point. (I have placed some prints in the code so that you can check its correctness.)
In [29]: d = {}
...:
...: for line in lines():
...: print(line)
...: key, x, y = line.split(',')
...: d.setdefault(key, []).append((float(x), float(y)))
...: print(*((k+': '+', '.join(str(t) for t in d[k])) for k in d), sep='\n')
M,-0.141,+1.755
M,+0.088,+3.354
N,+3.295,-3.847
U,+1.771,-3.268
M,-4.215,-4.499
U,-2.647,+1.218
U,-0.039,-0.357
U,+3.311,-3.312
N,-0.015,+2.039
N,-0.157,+3.319
N,-4.088,-0.914
U,+4.266,+4.863
M: (-0.141, 1.755), (0.088, 3.354), (-4.215, -4.499)
N: (3.295, -3.847), (-0.015, 2.039), (-0.157, 3.319), (-4.088, -0.914)
U: (1.771, -3.268), (-2.647, 1.218), (-0.039, -0.357), (3.311, -3.312), (4.266, 4.863)
This should do the job:
# i = ["N,1,2", "U,3,4", "U,5,6"]
D = {}
with open('./data.csv', 'r') as i:
    for line in i:
        line_list = line.split(",")
        datatuple = tuple(map(float, line_list[1:]))
        label = line_list[0]
        D[label] = D.get(label, list()) + [datatuple]
return D
Using the example data i = ["N,1,2", "U,3,4", "U,5,6"] this results in {'N': [(1.0, 2.0)], 'U': [(3.0, 4.0), (5.0, 6.0)]}.
An arguably better option would be to use pandas read_csv. Depending on the size of your data, this will also be much faster:
import numpy as np
import pandas as pd
np.random.seed(3)
# Create example data (same structure as in the OP) and write to disk
pd.DataFrame({"label": np.random.choice(["M", "N", "U"], 10),
"x": map("{:.3f}".format, np.random.normal(size=10)),
"y": map("{:.3f}".format, np.random.normal(size=10))}
).to_csv("./data.csv", header=False, index=False)
# read data to dataframe, convert to tuple, groupby and convert to dict
D = (pd.read_csv("./data.csv", header=None, names=["label", "x", "y"])
.set_index("label")
.apply(tuple, axis=1)
.groupby("label")
.apply(list)
.to_dict())
# Output:
{'M': [(-0.581, -1.69), (-1.147, -1.73), (-0.611, 0.696), (-1.19, 0.565)],
'N': [(-0.152, -0.349),
(0.872, 0.48),
(-0.016, -0.29600000000000004),
(-2.1590000000000003, -0.86)],
'U': [(0.278, 0.7559999999999999), (1.167, -0.42)]}
The long decimals (0.29600000000000004 etc.) are not csv reading errors but ordinary binary floating-point representation artifacts.
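If those artifacts are unwanted, they can be rounded away after the read; a sketch on the 'U' entry from the output above, assuming 3 decimals is the true precision:

```python
# round every coordinate back to 3 decimals
D = {'U': [(0.278, 0.7559999999999999), (1.167, -0.42)]}
D = {k: [tuple(round(x, 3) for x in t) for t in v] for k, v in D.items()}
print(D)  # {'U': [(0.278, 0.756), (1.167, -0.42)]}
```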
Here is a simplified version of my code:
import json
import random

y = {}
y['red'] = {'name': "red", 'p': 0}
y['blue'] = {'name': "blue", 'p': 0}
y['green'] = {'name': "green", 'p': 0}

with open('y.json', 'w') as f:
    json.dump(y, f)

f = open('y.json')
y = json.load(f)

rr = random.randint(1, 101)
rb = random.randint(1, 101)
rg = random.randint(1, 101)

z = '%s%s%s' % (y['red']['name'], " ", y['red']['p'])
zz = '%s%s%s' % (y['blue']['name'], " ", y['blue']['p'])
zzz = '%s%s%s' % (y['green']['name'], " ", y['green']['p'])
print('\n'.join(sorted([z, zz, zzz], key=lambda x: int(x.split()[-1]), reverse=True)))

with open('y.json', 'w') as f:
    json.dump(y, f)
The random.randint calls currently have no use, but I want to compare these three values so that the highest random number adds 3 to 'p', the second-highest 2, and the lowest 1.
Example:
if rr = 22; rb = 66; rg = 44
then the output should be:
blue 3
green 2
red 1
So that the values of p in my JSON file are:
y['red'] = {'name': "red", 'p': 1}
y['blue'] = {'name': "blue", 'p': 3}
y['green'] = {'name': "green", 'p': 2}
I know that the JSON file gets overwritten every time I run the program, but this is just a simplified version.
I also know that I could use if statements for that, but I'd rather not: I would have to chain a lot of them, which isn't pretty and would take a lot of time whenever I want to add more "colors".
I tried my best to explain my problem, but if there are still any questions, please ask.
Thank you :-)
You can add another item to the dict and then sort the dict keys on that item.
Replace:
rr = random.randint(1, 101)
rb = random.randint(1, 101)
rg = random.randint(1, 101)
With:
for key, value in y.items():
    y[key]['rand'] = random.randint(1, 101)

sorted_keys = sorted(y.keys(), key=lambda x: y[x]['rand'])
for i, key in enumerate(sorted_keys):
    y[key]['p'] = i + 1
From what I understood, you want to sort the values and keep the color tags?
You can build a structure like this:
z = [['red', 22], ['blue', 66], ['green', 44]]
z.sort(key = lambda x: x[-1])
and then change the values of p:
z = [(z[x][0],x+1) for x in range(len(z))]
The result should be:
[('red', 1), ('green', 2), ('blue', 3)]
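Tying this back to the question's JSON structure, a sketch with the example rolls hard-coded in place of the random.randint results:

```python
import json

y = {
    'red':   {'name': 'red',   'p': 0},
    'blue':  {'name': 'blue',  'p': 0},
    'green': {'name': 'green', 'p': 0},
}
rolls = {'red': 22, 'blue': 66, 'green': 44}  # stand-ins for rr, rb, rg
# lowest roll gets 1 point, next gets 2, highest gets 3
for rank, (color, _) in enumerate(sorted(rolls.items(), key=lambda kv: kv[1]), start=1):
    y[color]['p'] += rank
print(json.dumps(y))
```

This scales to any number of colors without chaining if statements.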
I have multiple csv files like this:
csv1:
h1,h2,h3
aa,34,bd9
bb,459,jg0
csv2:
h1,h5,h2
aa,rg,87
aa,gru,90
bb,sf,459
For each value in column 0 with header h1, I'd like to get its corresponding h2 values from all the csv files in a folder. A sample output could be
csv1: (aa,34),(bb,459)
csv2: (aa,87,90),(bb,459)
I'm a little clueless on how to go about doing this.
PS- I don't want to use pandas.
PPS- I'm able to do it by hardcoding the value from column 0, but I don't want to do it that way since there are hundreds of rows.
This is a small piece of code I've tried. It prints the h2 values for 'aa' on separate lines; I want them printed on the same line.
import csv

with open("test1/sample.csv") as csvfile:
    reader = csv.DictReader(csvfile, delimiter=",")
    for row in reader:
        print(row['h1'], row['h2'])
import glob
import csv
import os
from collections import defaultdict

d = defaultdict(list)
path = "path_to_folder"

for fle in glob.glob("*.csv"):
    with open(os.path.join(path, fle)) as f:
        header = next(f).rstrip().split(",")
        # if either does not appear in the header the value will be None
        h1 = next((i for i, x in enumerate(header) if x == "h1"), None)
        h2 = next((i for i, x in enumerate(header) if x == "h2"), None)
        # make sure we have both columns before going further
        if h1 is not None and h2 is not None:
            r = csv.reader(f, delimiter=",")
            # save file name as key, appending each h1 and h2 value
            for row in r:
                d[fle].append([row[h1], row[h2]])
print(d)
defaultdict(<class 'list'>, {'csv1.csv': [['aa', '34'], ['bb', '459']], 'csv2.csv': [['aa', '87'], ['aa', '90'], ['bb', '459']]})
This is a quick draft; it presumes all files are comma-delimited and that the h1 and h2 columns always have values. If so, it will find all pairings, keeping order.
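If the comma assumption ever fails, csv.Sniffer can guess the delimiter from a sample of the file; a sketch with hypothetical semicolon-separated content:

```python
import csv

sample = "h1;h2;h3\naa;34;bd9\nbb;459;jg0\n"  # hypothetical file content
dialect = csv.Sniffer().sniff(sample)
print(dialect.delimiter)
```

You would then pass the detected dialect to csv.reader instead of hard-coding delimiter=",".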
To get a set of unique values we can use a set and set.update:
d = defaultdict(set)  # change to set

for fle in glob.glob("*.csv"):
    with open(os.path.join(path, fle)) as f:
        header = next(f).rstrip().split(",")
        h1 = next((i for i, x in enumerate(header) if x == "h1"), None)
        h2 = next((i for i, x in enumerate(header) if x == "h2"), None)
        if h1 is not None and h2 is not None:
            r = csv.reader(f, delimiter=",")
            for row in r:
                d[fle].update([row[h1], row[h2]])  # set.update
print(d)
defaultdict(<class 'set'>, {'csv1.csv': {'459', '34', 'bb', 'aa'}, 'csv2.csv': {'459', '90', '87', 'bb', 'aa'}})
If you are sure you always have h1 and h2 you can reduce the code to simply:
d = defaultdict(set)
path = "path/"

for fle in glob.glob("*.csv"):
    with open(os.path.join(path, fle)) as f:
        r = csv.reader(f, delimiter=",")
        header = next(r)
        h1 = header.index("h1")
        h2 = header.index("h2")
        for row in r:
            d[fle].update([row[h1], row[h2]])
Lastly, if you want to keep the order in which elements are found, we cannot use a set (sets are unordered), so we need to check whether each element is already in the list:
d = defaultdict(list)  # back to lists, since sets are unordered
for fle in glob.glob("*.csv"):
    with open(os.path.join(path, fle)) as f:
        r = csv.reader(f, delimiter=",")
        header = next(r)
        h1 = header.index("h1")
        h2 = header.index("h2")
        for row in r:
            h_1, h_2 = row[h1], row[h2]
            if h_1 not in d[fle]:
                d[fle].append(h_1)
            if h_2 not in d[fle]:
                d[fle].append(h_2)
print(d)
defaultdict(<class 'list'>, {'csv2.csv': ['aa', '87', '90', 'bb', '459'], 'csv1.csv': ['aa', '34', 'bb', '459']})
I want to find duplicate values in one column of a csv with multiple columns, and replace them with values from another column. So first I put the two columns from the csv into a dictionary. Then I want to find the duplicate values in the dictionary, which has string keys and values. I tried solutions for removing duplicates from a dictionary but got "not hashable" errors or no result. Here is the first part of the code.
import csv
from collections import defaultdict
import itertools as it

mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k = rows[3].strip()
    v = rows[2].strip()
    if k in mydict:
        mydict[k].append(v)
    else:
        mydict[k] = [v]
# mydict = hash(frozenset(mydict))
print mydict

d = {}
while True:
    try:
        d = defaultdict(list)
        for k, v in mydict.iteritems():
            # d[frozenset(mydict.items())]
            d[v].append(k)
    except:
        continue

writer = csv.writer(open(r"OLD.csv", 'wb'))
for key, value in d.items():
    writer.writerow([key, value])
Your question is unclear, so I hope I got it right. Please give an example of the input columns and the desired output columns, and include a printout of the error along with the line that caused it.
For instance, if column1 = [1, 2, 3, 1, 4] and column2 = [a, b, c, d, e], do you want the output to be n_column1 = [a, 2, 3, d, 4] and column2 = [1, b, c, d, e]?
I imagine the exception was in d[v].append(k), since v is clearly a list, and you cannot use a list as a key in a dictionary.
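For illustration, converting the list to a tuple makes it usable as a key (toy data, not your file):

```python
from collections import defaultdict

mydict = {'k1': ['a', 'b'], 'k2': ['a', 'b']}
d = defaultdict(list)
for k, v in mydict.items():
    d[tuple(v)].append(k)  # tuple(v) is hashable, the list v is not
print(dict(d))  # {('a', 'b'): ['k1', 'k2']}
```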
In [1]: x = [1,2,3,1,4]
In [2]: y = ['a','b','c','d','e']
In [5]: from collections import defaultdict
In [6]: d = defaultdict(int)
In [7]: for a in x:
   ...:     d[a] += 1
In [8]: d
Out[8]: defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 1, 4: 1})
In [9]: x2 = []
In [10]: for a, b in zip(x, y):
   ....:     x2.append(a if d[a] == 1 else b)
   ....:
In [11]: x
Out[11]: [1, 2, 3, 1, 4]
In [12]: x2
Out[12]: ['a', 2, 3, 'd', 4]
In that case, I guess if I had to change your code to fit, I'd do something like this:
import csv
from collections import defaultdict
import itertools as it

mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))

histogram = defaultdict(int)
k = []
v = []
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k.append(rows[3].strip())
    v.append(rows[2].strip())
    item = k[-1]
    histogram[item] += 1

output_column = []
for first_item, second_item in zip(k, v):
    output_column.append(first_item if histogram[first_item] == 1 else second_item)

writer = csv.writer(open(r"OLD.csv", 'wb'))
for c1, c2 in zip(output_column, v):
    writer.writerow([c1, c2])
Greetings All.
I have a [(tuple): value] dicts as elements in a list as follows:
lst = [{('unit1', 'test1'): 11, ('unit1','test2'): 12}, {('unit2','test1'): 13, ('unit2','test2'):14 }]
testnames = ['test1','test2']
unitnames = ['unit1','unit2']
How to write to csv file with the following output?
unitnames, test1, test2
unit1, 11, 12
unit2, 13, 14
Thanks.
First, flatten the structure.
import collections

units = collections.defaultdict(lambda: collections.defaultdict(lambda: float('-inf')))
for entry in lst:  # lst is a list of dicts, so walk each dict
    for (u, t), r in entry.iteritems():
        units[u][t] = r
table = [(u, t['test1'], t['test2']) for (u, t) in units.iteritems()]
Then use csv to write out the CSV file.
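That writing step might look like this (a sketch; the table literal stands in for the flattened structure above, and Python 3's newline='' replaces the old 'wb' mode):

```python
import csv

table = [('unit1', 11, 12), ('unit2', 13, 14)]  # as produced by the flattening step
with open('output.csv', 'w', newline='') as outf:
    w = csv.writer(outf)
    w.writerow(['unitnames', 'test1', 'test2'])
    w.writerows(table)
```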
The way you have lst grouped is redundant; all keys are unique, so it might as well be a single dictionary:
data = {
    ('unit1', 'test1'): 11,
    ('unit1', 'test2'): 12,
    ('unit2', 'test1'): 13,
    ('unit2', 'test2'): 14
}
then
import csv

def getUniqueValues(seq):
    "Return sorted list of unique values in sequence"
    values = list(set(seq))
    values.sort()
    return values

def dataArray(data2d, rowIterField=0, rowLabel='', defaultVal=''):
    # get all unique unit and test labels
    rowLabels = getUniqueValues(key[rowIterField] for key in data2d)
    colLabels = getUniqueValues(key[1 - rowIterField] for key in data2d)
    # create key-tuple maker
    if rowIterField == 0:
        key = lambda row, col: (row, col)
    else:
        key = lambda row, col: (col, row)
    # header row
    yield [rowLabel] + colLabels
    for row in rowLabels:
        # data rows
        yield [row] + [data2d.get(key(row, col), defaultVal) for col in colLabels]

def main():
    with open('output.csv', 'wb') as outf:
        outcsv = csv.writer(outf)
        outcsv.writerows(dataArray(data, 0, 'unitnames'))

if __name__ == "__main__":
    main()
and the output can easily be flipped (units across, tests down) by changing dataArray(data, 0, 'unitnames') to dataArray(data, 1, 'testnames').
Following code creates a dict that can be dumped to a csv:
from collections import defaultdict

d = defaultdict(list)
for entry in lst:  # lst is a list of dicts, not a dict itself
    for (unit, test), val in sorted(entry.items()):  # sort so test1 precedes test2
        d[unit].append(val)
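The dump itself could then be a few more lines (a sketch; the dict literal stands in for d as built above, and the column order is assumed):

```python
import csv

d = {'unit1': [11, 12], 'unit2': [13, 14]}  # as built by the loop above
with open('units.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['unitnames', 'test1', 'test2'])
    for unit in sorted(d):
        w.writerow([unit] + d[unit])
```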