I have multiple csv files like this:
csv1:
h1,h2,h3
aa,34,bd9
bb,459,jg0
csv2:
h1,h5,h2
aa,rg,87
aa,gru,90
bb,sf,459
For each value in column 0 with header h1, I'd like to get its corresponding h2 values from all the csv files in a folder. A sample output could be
csv1: (aa,34),(bb,459)
csv2: (aa,87,90),(bb,459)
I'm a little clueless on how to go about doing this.
PS- I don't want to use pandas.
PPS- I'm able to do it by hardcoding the value from column 0, but I don't want to do it that way since there are hundreds of rows.
This is a small piece of code I've tried. It prints the values of h2 for 'aa' on separate lines; I want them printed on the same line.
import csv

with open("test1/sample.csv") as csvfile:
    reader = csv.DictReader(csvfile, delimiter=",")
    for row in reader:
        print(row['h1'], row['h2'])
import glob
import csv
import os
from collections import defaultdict

d = defaultdict(list)
path = "path_to_folder"

for fle in glob.glob(os.path.join(path, "*.csv")):
    with open(fle) as f:
        header = next(f).rstrip().split(",")
        # if either column does not appear in the header the value will be None
        h1 = next((i for i, x in enumerate(header) if x == "h1"), None)
        h2 = next((i for i, x in enumerate(header) if x == "h2"), None)
        # make sure we have both columns before going further
        if h1 is not None and h2 is not None:
            r = csv.reader(f, delimiter=",")
            # use the file name as key, appending each h1/h2 pair
            for row in r:
                d[os.path.basename(fle)].append([row[h1], row[h2]])

print(d)
defaultdict(<class 'list'>, {'csv1.csv': [['aa', '34'], ['bb', '459']], 'csv2.csv': [['aa', '87'], ['aa', '90'], ['bb', '459']]})
This is a quick draft: it presumes all files are comma-delimited and that the h1 and h2 columns always have values. If so, it will find all pairings, keeping their order.
To get a set of unique values we can use a set and set.update:
d = defaultdict(set) # change to set

for fle in glob.glob(os.path.join(path, "*.csv")):
    with open(fle) as f:
        header = next(f).rstrip().split(",")
        h1 = next((i for i, x in enumerate(header) if x == "h1"), None)
        h2 = next((i for i, x in enumerate(header) if x == "h2"), None)
        if h1 is not None and h2 is not None:
            r = csv.reader(f, delimiter=",")
            for row in r:
                d[os.path.basename(fle)].update([row[h1], row[h2]]) # set.update

print(d)
defaultdict(<class 'set'>, {'csv1.csv': {'459', '34', 'bb', 'aa'}, 'csv2.csv': {'459', '90', '87', 'bb', 'aa'}})
If you are sure you always have h1 and h2 you can reduce the code to simply:
d = defaultdict(set)
path = "path/"

for fle in glob.glob(os.path.join(path, "*.csv")):
    with open(fle) as f:
        r = csv.reader(f, delimiter=",")
        header = next(r)
        h1 = header.index("h1")
        h2 = header.index("h2")
        for row in r:
            d[os.path.basename(fle)].update([row[h1], row[h2]])
Lastly, if you want to keep the order in which the elements are found, we cannot use a set because sets are unordered, so we switch back to a defaultdict(list) and check whether each element is already in the list:
d = defaultdict(list)

for fle in glob.glob(os.path.join(path, "*.csv")):
    with open(fle) as f:
        r = csv.reader(f, delimiter=",")
        header = next(r)
        h1 = header.index("h1")
        h2 = header.index("h2")
        key = os.path.basename(fle)
        for row in r:
            h_1, h_2 = row[h1], row[h2]
            if h_1 not in d[key]:
                d[key].append(h_1)
            if h_2 not in d[key]:
                d[key].append(h_2)

print(d)
defaultdict(<class 'list'>, {'csv2.csv': ['aa', '87', '90', 'bb', '459'], 'csv1.csv': ['aa', '34', 'bb', '459']})
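And if you want the grouped output format shown in the question (one entry per h1 value with all of its h2 values, per file), here is a small sketch of one way to do it, reusing the same path variable as above and assuming comma-delimited files:

import glob
import csv
import os
from collections import defaultdict

# group every h2 value under its h1 value, per file
grouped = defaultdict(lambda: defaultdict(list))

for fle in glob.glob(os.path.join(path, "*.csv")):
    with open(fle) as f:
        r = csv.reader(f, delimiter=",")
        header = next(r)
        if "h1" in header and "h2" in header:
            h1, h2 = header.index("h1"), header.index("h2")
            for row in r:
                grouped[os.path.basename(fle)][row[h1]].append(row[h2])

for name, pairs in grouped.items():
    # e.g. csv2: (aa,87,90),(bb,459)
    print(name + ": " + ",".join("({},{})".format(k, ",".join(v)) for k, v in pairs.items()))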
Assuming the following text file (dict.txt) contains
1 2 3
aaa bbb ccc
the dictionary should look like this: {1: 'aaa', 2: 'bbb', 3: 'ccc'}
I did:
d = {}
with open("dict.txt") as f:
    for line in f:
        (key, val) = line.split()
        d[int(key)] = val
print(d)
but it didn't work. I think it is because of the structure of the txt file.
The data you want as keys are on the first line, and the data you want as values are on the second line. So do something like this:
with open(r"dict.txt") as f: data = f.readlines() # Read 'list' of all lines
keys = list(map(int, data[0].split())) # Data from first line
values = data[1].split() # Data from second line
d = dict(zip(keys, values)) # Zip them and make dictionary
print(d) # {1: 'aaa', 2: 'bbb', 3: 'ccc'}
Updated answer based on OP edit:
# Initialize dict
d = {}

# Read in file by newline splits & ignore blank lines
fobj = open("dict.txt", "r")
lines = fobj.read().split("\n")
lines = [l for l in lines if not l.strip() == ""]
fobj.close()

# Get first line (keys)
key_list = lines[0].split()

# Convert keys to integers
key_list = list(map(int, key_list))

# Get second line (values)
val_list = lines[1].split()

# Store in dict going through zipped lists
for k, v in zip(key_list, val_list):
    d[k] = v
First create separate lists for keys and values, with a condition like:
    if (idx % 2) == 0:
        keys = line.split()
        values = lines[idx + 1].split()
then combine both lists:
d = {}

# Get all lines in a list
with open("dict.txt") as f:
    lines = f.readlines()

for idx, line in enumerate(lines):
    if (idx % 2) == 0:
        # Get the key list
        keys = line.split()
        # Get the value list
        values = lines[idx + 1].split()
        # Combine both lists into the dictionary
        d.update({keys[i]: values[i] for i in range(len(keys))})

print(d)
The output must be like this:
[{'id': '1', 'first_name': 'Heidie','gender': 'Female'}, {'id': '2', 'first_name': 'Adaline', 'gender': 'Female'}, {...}
Here is a code snippet that works and produces this result:
import csv

with open('./test.csv', 'r') as file_read:
    reader = csv.DictReader(file_read, skipinitialspace=True)
    listDict = [{k: v for k, v in row.items()} for row in reader]

print(listDict)
However, I can't understand some points about the code above:
List comprehension: listDict = [{k: v for k, v in row.items()} for row in reader]
How does Python interpret this?
How does it build a list where each dictionary always has the header keys (id, first_name, gender) paired with their values?
What would the implementation of this code look like with nested for loops?
I read these answers, but I still do not understand:
python list comprehension double for
convert csv file to list of dictionaries
My csv file:
id,first_name,last_name,email,gender
1,Heidie,Philimore,hphilimore0#msu.edu,Female
2,Adaline,Wapplington,awapplington1#icq.com,Female
3,Erin,Copland,ecopland2#google.co.uk,Female
4,Way,Buckthought,wbuckthought3#usa.gov,Male
5,Adan,McComiskey,amccomiskey4#theatlantic.com,Male
6,Kilian,Creane,kcreane5#hud.gov,Male
7,Mandy,McManamon,mmcmanamon6#omniture.com,Female
8,Cherish,Futcher,cfutcher7#accuweather.com,Female
9,Dave,Tosney,dtosney8#businesswire.com,Male
10,Torr,Kiebes,tkiebes9#dyndns.org,Male
Your list comprehension:
listDict = [{k: v for k, v in row.items()} for row in reader]
is equivalent to:
item_list = []

# go through every row
for row in reader:
    item_dict = {}
    # in every row go through each item
    for k, v in row.items():
        # add each item's k, v to the dict
        item_dict[k] = v
    # append every item_dict to item_list
    item_list.append(item_dict)

print(item_list)
EDIT (some more explanation):
#lets create a list
list_ = [x ** 2 for x in range(0,10)]
print(list_)
this returns:
[0,1,4,9,16,25,36,49,64,81]
You can write this as:
list_ = []
for x in range(0, 10):
    list_.append(x ** 2)
So in that example, yes, you read it 'backwards'.
Now consider this one:
#lets create a list
list_ = [x ** 2 for x in range(0,10) if x % 2 == 0]
print(list_)
this returns:
[0,4,16,36,64]
You can write this as:
list_ = []
for x in range(0, 10):
    if x % 2 == 0:
        list_.append(x ** 2)
So that's not read 100% backwards, but it should be clear what's happening. Hope this helps!
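Since one of the questions you linked is about comprehensions with two for clauses, here is one more small sketch (my own example, not from your code) showing that a double for unrolls the same way, outer loop first:

pairs = [(x, y) for x in range(3) for y in range(2)]

# equivalent nested loops, outer loop first:
pairs_loop = []
for x in range(3):
    for y in range(2):
        pairs_loop.append((x, y))

print(pairs)       # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print(pairs_loop)  # same list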
I'm trying to extract columns from a string of values in Python. The string of values looks as follows:
CN=Unix ADISID,OU=SA,OU=DGO,DC=dom,DC=ab,DC=com,1001
CN=1002--DS,OU=Process,DC=dom,DC=ab,DC=com,1002
CN=1003--Cyb,OU=SA,OU=DGO,DC=dom,DC=ab,DC=com,1003
CN=Doe--Joe,OU=Adm,DC=dom,DC=ab,DC=com,d1004
CN=cruise--bob,OU=SA,OU=DGO,DC=dom,DC=ab,DC=com,d1005
Now I would like to extract columns from this string with column headers like CN, OU1, OU2, DC1, DC2, DC3, ID. The number of OU and DC values differs in every line, so if one is not present in a line I would like to leave that column blank. Also, I'm using the following piece of code to generate the above string.
result = l.search_s(base, ldap.SCOPE_SUBTREE, criteria, attributes)
results = ""
for i in [entry for dn, entry in result if isinstance(entry, dict)]:
    results += str(i.get('distinguishedName')[0] + "," + i.get('sAMAccountName')[0] + "\n").replace("\, ", "--")
print results
Would it be easier if I created the results as a list to begin with?
To get the "fields left blank" behavior, you're going to have to count the max number of each field. I believe that CN is unique, so that should always be 1.
import collections

result = l.search_s(base, ldap.SCOPE_SUBTREE, criteria, attributes)

users = []
for i in [entry for dn, entry in result if isinstance(entry, dict)]:
    dn = i.get('distinguishedName')[0].replace('\, ', '--').split(',')
    info = collections.defaultdict(list)
    info['id'] = i.get('sAMAccountName')[0]
    for part in dn:
        key, value = part.split('=', 1)
        info[key].append(value)
    users.append(info)

max_cn = max(map(lambda u: len(u['CN']), users))
assert max_cn == 1
max_ou = max(map(lambda u: len(u['OU']), users))
max_dc = max(map(lambda u: len(u['DC']), users))

numflds = max_cn + max_ou + max_dc

fields = []
for u in users:
    f = list(u['CN'])
    ou = u['OU'] + [''] * max_ou
    f.extend(ou[:max_ou])
    dc = u['DC'] + [''] * max_dc
    f.extend(dc[:max_dc])
    f.append(u['id'])
    fields.append(f)
For each line:
pairs = [kv.split('=') for kv in line.split(',')]
for pair in pairs:
    if len(pair) == 1:
        pair.insert(0, 'ID')
Now you have something like this:
[['CN', 'Unix ADISID'],
['OU', 'SA'],
['OU', 'DGO'],
['DC', 'dom'],
['DC', 'ab'],
['DC', 'com'],
['ID', '1001']]
Then:
from collections import defaultdict

mapping = defaultdict(list)
for k, v in pairs:
    mapping[k].append(v)
Which gives you:
{'CN': ['Unix ADISID'],
'DC': ['dom', 'ab', 'com'],
'ID': ['1001'],
'OU': ['SA', 'DGO']}
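To get from that mapping to the fixed columns asked for (CN, OU1, OU2, DC1, DC2, DC3, ID), one option is to pad each group of values to a fixed width. This is only a sketch: the widths (two OUs, three DCs) and the helper names parse_line / to_row are assumptions based on the sample data, not part of the answer above.

from collections import defaultdict

def parse_line(line):
    # same parsing as above: split on commas, label the trailing id as 'ID'
    mapping = defaultdict(list)
    for kv in line.split(','):
        pair = kv.split('=')
        if len(pair) == 1:
            pair.insert(0, 'ID')
        mapping[pair[0]].append(pair[1])
    return mapping

def to_row(mapping, n_ou=2, n_dc=3):
    # pad each group of values to a fixed width, leaving blanks where missing
    def padded(values, n):
        return (list(values) + [''] * n)[:n]
    return (padded(mapping['CN'], 1) + padded(mapping['OU'], n_ou)
            + padded(mapping['DC'], n_dc) + padded(mapping['ID'], 1))

print(['CN', 'OU1', 'OU2', 'DC1', 'DC2', 'DC3', 'ID'])
print(to_row(parse_line("CN=Doe--Joe,OU=Adm,DC=dom,DC=ab,DC=com,d1004")))
# ['Doe--Joe', 'Adm', '', 'dom', 'ab', 'com', 'd1004']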
I have 2 files. The first has only 2 columns:
A 2
B 5
C 6
And the second has the letters as its first column:
A cat
B dog
C house
I want to replace the letters in the second file with the numbers that correspond to them in the first file, so I would get:
2 cat
5 dog
6 house
I created a dict from the first file and read the second. I tried a few things but none worked; I can't seem to replace the values.
import csv

with open('filea.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    for i in reader:
        print i[0]  # reads only first column
        a_data = (i[0])

dictList = []
with open('file2.txt', 'r') as d:
    for line in d:
        elements = line.rstrip().split("\t")[0:]
        dictList.append(dict(zip(elements[::1], elements[0::1])))

for key, value in dictList.items():
    if value == "A":
        dictList[key] = "cat"
The issue appears to be on your last lines:
for key, value in dictList.items():
    if value == "A":
        dictList[key] = "cat"
This should be:
for key, value in dictList.items():
    if key in a_data:
        dictList[a_data[key]] = dictList[key]
        del dictList[key]
d1 = {'A': 2, 'B': 5, 'C': 6}
d2 = {'A': 'cat', 'B': 'dog', 'C': 'house', 'D': 'car'}
for key, value in d2.items():
if key in d1:
d2[d1[key]] = d2[key]
del d2[key]
>>> d2
{2: 'cat', 5: 'dog', 6: 'house', 'D': 'car'}
Notice that this method allows for items in the second dictionary which don't have a key from the first dictionary.
Wrapped up in a conditional dictionary comprehension format:
>>> {d1[k] if k in d1 else k: d2[k] for k in d2}
{2: 'cat', 5: 'dog', 6: 'house', 'D': 'car'}
I believe this code will get you your desired result:
import csv

with open('filea.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    d1 = {}
    for line in reader:
        if line[1] != "":
            d1[line[0]] = int(line[1])

with open('fileb.txt', 'rU') as f:
    reader = csv.reader(f, delimiter="\t")
    d2 = {}
    for line in reader:
        d2[line[0]] = line[1]

d3 = {d1[k] if k in d1 else k: d2[k] for k in d2}
You could use a dictionary comprehension:
d1 = {'A':2,'B':5,'C':6}
d2 = {'A':'cat','B':'dog','C':'house'}
In [23]: {d1[k]:d2[k] for k in d1.keys()}
Out[23]: {2: 'cat', 5: 'dog', 6: 'house'}
If the two dictionaries are called a and b, you can construct a new dictionary this way:
composed_dict = {a[k]:b[k] for k in a}
This will take all the keys in a, and read the corresponding values from a and b to construct a new dictionary.
Regarding your code:
The variable a_data has no purpose: you read the first file, print its first column, and do nothing else with the data in it.
zip(elements[::1], elements[0::1]) will just construct pairs like [1, 2, 3] -> [(1, 1), (2, 2), (3, 3)], which I think is not what you want.
In the end you have a list of dictionaries, and on the last lines you just put strings into that list; I think that is not intentional.
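A minimal corrected version of your approach might look like the sketch below. It assumes both files are tab-separated, as in your code, and simply prints the rewritten lines of the second file:

import csv

# build a letter -> number dict from the first file
mapping = {}
with open('filea.txt', 'rU') as f:
    for row in csv.reader(f, delimiter="\t"):
        if row:
            mapping[row[0]] = row[1]

# replace the first column of the second file using that dict
with open('file2.txt', 'r') as f:
    for line in f:
        elements = line.rstrip().split("\t")
        if elements:
            elements[0] = mapping.get(elements[0], elements[0])
        print("\t".join(elements))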
import re

d1 = dict()
with open('filea.txt', 'r') as fl:
    for f in fl:
        key, val = re.findall(r'\w+', f)
        d1[key] = val

d2 = dict()
with open('file2.txt', 'r') as fl:
    for f in fl:
        key, val = re.findall(r'\w+', f)
        d2[key] = val

with open('file3.txt', 'w') as f:
    for k, v in d1.items():
        f.write("{a}\t{b}\n".format(a=v, b=d2[k]))
I posted below some code that works fine. What it does at the moment is:
it opens 2 .csv files, 'CMF.csv' and 'D65.csv', and then
performs some math on them.
Here's the simple structure of those files:
'CMF.csv' (wavelength, x, y, z)
400,1.879338E-02,2.589775E-03,8.508254E-02
410,8.277331E-02,1.041303E-02,3.832822E-01
420,2.077647E-01,2.576133E-02,9.933444E-01
...etc
'D65.csv': (wavelength, a, b)
400,82.7549,14.708
410,91.486,17.6753
420,93.4318,20.995
...etc
I have a 3rd file, data.csv, with this structure (serialNumber, wavelength, measurement, name):
0,400,2.21,1
0,410,2.22,1
0,420,2.22,1
...
1,400,2.21,2
1,410,2.22,2
1,420,2.22,2
...etc
What I would like to do is write a few lines of code that perform math on all the series in the last file (series are defined by their serial number and their name).
For example, I need a loop that will perform, for each name or serial number and for each wavelength, the operation:
x * a * measurement
I tried to load data.csv in the csv reader like the other files, but I couldn't get it to work.
any ideas?
Thanks
import csv

with open('CMF.csv') as cmf:
    reader = csv.reader(cmf)
    dict_cmf = dict()
    for row in reader:
        dict_cmf[float(row[0])] = row

with open('D65.csv') as d65:
    reader = csv.reader(d65)
    dict_d65 = dict()
    for row in reader:
        dict_d65[float(row[0])] = row

with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        dict_sp[float(row[0])] = row

Y = 0
Y_total = 0
X = 0
X_total = 0
Z = 0
Z_total = 0
i = 0
j = 0

for i in range(400, 700, i+10):
    X = float(dict_cmf[i][1]) * float(dict_d65[i][1])
    X_total = X_total + X
    Y = float(dict_cmf[i][2]) * float(dict_d65[i][1])
    Y_total = Y_total + Y
    Z = float(dict_cmf[i][3]) * float(dict_d65[i][1])
    Z_total = Z_total + Z

wp_X = 100 * X_total / Y_total
wp_Y = 100 * Y_total / Y_total
wp_Z = 100 * Z_total / Y_total

print Y_total
print "D65_CMF_2006_10_deg white point = "
print wp_X, wp_Y, wp_Z
I get this:
Traceback (most recent call last):
  File "C:\Users\gary\Documents\eclipse\Spectro\1illum_XYZ2006_D65_numpy.py", line 24, in <module>
    dict_sp[row[0]] = row
IndexError: list index out of range
You need pandas. You can read the files into pandas tables, then join them, replacing your code with this:
import pandas
cmf = pandas.read_csv('CMF.csv', names=['wavelength', 'x', 'y', 'z'])
d65 = pandas.read_csv('D65.csv', names=['wavelength', 'a', 'b'])
data = pandas.read_csv('data.csv', names=['serialNumber', 'wavelength', 'measurement', 'name'])
lookup = pandas.merge(cmf, d65, on='wavelength')
merged = pandas.merge(data, lookup, on='wavelength')
totals = ((lookup[['x', 'y', 'z']].T*lookup['a']).T).sum()
wps = totals/totals['y']
print totals['y']
print "D65_CMF_2006_10_deg white point = "
print wps
Now, that doesn't do the last bit where you want to calculate extra values for each measurement. You can do this by adding a column to merged, like this:
merged['newcol'] = merged.x * merged.a * merged.measurement
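If you then need per-series results (the question mentions grouping by serial number and name), one possible follow-up, using the same column names passed to read_csv above, is a groupby aggregation; this is just a sketch of the idea:

# sum the computed value for every (serialNumber, name) series
per_series = merged.groupby(['serialNumber', 'name'])['newcol'].sum()
print(per_series)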
One or more of the lines in data.csv does not contain what you think it does. Try to put your statement inside a try...except block to see what the problem is:
with open('spectral_data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        try:
            dict_sp[float(row[0])] = row
        except IndexError as e:
            print 'The problematic row is:'
            print row
            raise e
A proper debugger would also be helpful in this kind of situation.
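If the culprit turns out to be empty rows (a common cause of this IndexError, for example a blank line at the end of the file), a simple guard is usually enough; a sketch:

with open('data.csv') as sp:
    reader = csv.reader(sp)
    dict_sp = dict()
    for row in reader:
        if not row:  # skip completely empty lines
            continue
        dict_sp[float(row[0])] = row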
pandas is probably the better way to go, but if you'd like to stick with vanilla Python, you can have a look at this example:
import csv
from collections import defaultdict

d = defaultdict(dict)
for fname, cols in [('CMF.csv', ('x', 'y', 'z')), ('D65.csv', ('a', 'b'))]:
    with open(fname) as ifile:
        reader = csv.reader(ifile)
        for row in reader:
            wl, values = int(row[0]), row[1:]
            d[wl].update(zip(cols, map(float, values)))
measurements = defaultdict(dict)
with open('data.csv') as ifile:
    reader = csv.reader(ifile)
    cols = ('measurement', 'name')
    for serial, wl, me, name in reader:
        measurements[int(serial)][int(wl)] = dict(zip(cols, (float(me), str(name))))
for serial in sorted(measurements.keys()):
    for wl in sorted(measurements[serial].keys()):
        me = measurements[serial][wl]['measurement']
        print me * d[wl]['x'] * d[wl]['a']
This will store x, y, z, a and b in a dictionary inside a dictionary, with wavelength as the key (there is no apparent reason to store these values in separate dicts).
The measurements are stored in a two-level-deep dictionary with keys serial and wavelength. This way you can iterate over all serials and all corresponding wavelengths, as shown in the latter part of the code.
As for your specific calculations on the data in your example, this can be done quite easily with this structure:
tot_x = sum(v['x'] * v['a'] for v in d.values())
tot_y = sum(v['y'] * v['a'] for v in d.values())
tot_z = sum(v['z'] * v['a'] for v in d.values())

wp_x = 100 * tot_x / tot_y
wp_y = 100 * tot_y / tot_y # Sure this is correct? It will always be 100
wp_z = 100 * tot_z / tot_y

print wp_x, wp_y, wp_z # 798.56037811 100.0 3775.04316468
These are the dictionaries given the input files in your question:
>>> from pprint import pprint
>>> pprint(dict(d))
{400: {'a': 82.7549,
'b': 14.708,
'x': 0.01879338,
'y': 0.002589775,
'z': 0.08508254},
410: {'a': 91.486,
'b': 17.6753,
'x': 0.08277331,
'y': 0.01041303,
'z': 0.3832822},
420: {'a': 93.4318,
'b': 20.995,
'x': 0.2077647,
'y': 0.02576133,
'z': 0.9933444}}
>>> pprint(dict(measurements))
{0: {400: {'measurement': 2.21, 'name': '1'},
410: {'measurement': 2.22, 'name': '1'},
420: {'measurement': 2.22, 'name': '1'}},
1: {400: {'measurement': 2.21, 'name': '2'},
410: {'measurement': 2.22, 'name': '2'},
420: {'measurement': 2.22, 'name': '2'}}}
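If you also want the per-series totals the question asks about, with the measurement factored in, a possible extension of the loop above is the following sketch, reusing the d and measurements dictionaries built earlier:

# accumulate x/y/z totals per serial, weighting by a and the measurement
for serial in sorted(measurements.keys()):
    tot = {'x': 0.0, 'y': 0.0, 'z': 0.0}
    for wl, entry in measurements[serial].items():
        for c in tot:
            tot[c] += d[wl][c] * d[wl]['a'] * entry['measurement']
    print(serial)
    print(tot)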