Python + CSV: Sum up similar values from CSV columns

INPUT file:
$ cat dummy.csv
OS,A,B,C,D,E
Ubuntu,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Ubuntu,1,1,1,1,0
Windows,0,0,1,1,0
Mac,1,0,1,1,1
Ubuntu,0,1,0,1,1
Ubuntu,0,0,1,1,1
Ubuntu,1,0,1,0,0
Ubuntu,1,1,1,1,0
Mac,0,0,1,1,0
Mac,1,0,1,1,1
Windows,1,1,1,1,0
Ubuntu,0,0,1,1,0
Windows,1,0,1,1,1
Mac,0,1,0,1,1
Windows,0,0,1,1,1
Mac,1,0,1,0,0
Windows,1,1,1,1,0
Mac,0,0,1,1,0
Expected output:
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3
I generated the above output using Excel's Pivot Table.
My code:
import csv
import pprint
from collections import defaultdict

d = defaultdict(dict)

with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] += row['A']
        d[row['OS']]['B'] += row['B']
        d[row['OS']]['C'] += row['C']
        d[row['OS']]['D'] += row['D']
        d[row['OS']]['E'] += row['E']
pprint.pprint(d)
Error:
$ python3 dummy.py
Traceback (most recent call last):
  File "dummy.py", line 10, in <module>
    d[row['OS']]['A'] += row['A']
KeyError: 'A'
My idea was to accumulate the CSV values in a dictionary and print it afterwards. However, I get the above error when I try to add the values.
This seems achievable with the built-in csv module. I thought this would be an easy one :( Any pointers would be of great help.

There are two problems. The nested dictionaries don't initially have any keys set, so d[row['OS']]['A'] raises a KeyError. The other issue is that you need to convert the column values to int before adding them.
You could use Counter as the values in the defaultdict, since missing keys there default to 0:
import csv
from collections import Counter, defaultdict
d = defaultdict(Counter)
with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        nested = d[row.pop('OS')]
        for k, v in row.items():
            nested[k] += int(v)
print(*d.items(), sep='\n')
Output:
('Ubuntu', Counter({'D': 6, 'C': 5, 'B': 4, 'E': 3, 'A': 3}))
('Windows', Counter({'C': 6, 'D': 6, 'E': 3, 'A': 3, 'B': 2}))
('Mac', Counter({'C': 6, 'D': 5, 'A': 4, 'E': 3, 'B': 1}))
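Both points from this answer can be checked in isolation. The sketch below is illustrative only (the variable names are made up for the example): a Counter returns 0 for a key it has not seen, so += works on the very first row, and the values that csv.DictReader yields are strings, so they concatenate unless converted with int().

```python
from collections import Counter

# A Counter treats a missing key as 0, so += needs no initialisation.
totals = Counter()
totals['A'] += int('1')
totals['A'] += int('0')
totals['A'] += int('1')

# csv.DictReader yields strings; without int() the values concatenate.
as_text = '1' + '0' + '1'

print(totals['A'])   # 2
print(as_text)       # 101
```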

Something like this? You can write the DataFrame to a CSV file to get the desired format.
import pandas as pd
# df0 = pd.read_clipboard(sep=',')  # alternative: take the data from the clipboard
df0 = pd.read_csv('dummy.csv')
df = df0.copy()
df = df.groupby(by='OS').sum()
print(df)
Output:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3
df.to_csv('file01')
Contents of file01:
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3

You got that exception because the first time an OS is seen, row['OS'] does not exist in d yet, so d[row['OS']] is a newly created empty dict and 'A' does not exist in it. Try the following to fix that:
import csv
from collections import defaultdict
d = defaultdict(dict)
with open('dummy.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        d[row['OS']]['A'] = d[row['OS']]['A'] + int(row['A']) if (row['OS'] in d and 'A' in d[row['OS']]) else int(row['A'])
        d[row['OS']]['B'] = d[row['OS']]['B'] + int(row['B']) if (row['OS'] in d and 'B' in d[row['OS']]) else int(row['B'])
        d[row['OS']]['C'] = d[row['OS']]['C'] + int(row['C']) if (row['OS'] in d and 'C' in d[row['OS']]) else int(row['C'])
        d[row['OS']]['D'] = d[row['OS']]['D'] + int(row['D']) if (row['OS'] in d and 'D' in d[row['OS']]) else int(row['D'])
        d[row['OS']]['E'] = d[row['OS']]['E'] + int(row['E']) if (row['OS'] in d and 'E' in d[row['OS']]) else int(row['E'])
Output:
>>> import pprint
>>>
>>> pprint.pprint(dict(d))
{'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3}}
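The five near-identical lines in this answer can be folded into a loop over the value columns. What follows is only a sketch of the same idea, using dict.get with a default of 0; the in-memory SAMPLE data and the name sum_by_os are assumptions made so that the example runs without dummy.csv.

```python
import csv
import io

# Hypothetical in-memory sample standing in for dummy.csv.
SAMPLE = """OS,A,B
Ubuntu,0,1
Mac,1,0
Ubuntu,1,1
"""

def sum_by_os(csvfile):
    """Return {os_name: {column: total}} for every non-OS column."""
    totals = {}
    for row in csv.DictReader(csvfile):
        per_os = totals.setdefault(row['OS'], {})
        for column, value in row.items():
            if column == 'OS':
                continue
            # get() supplies 0 the first time a column is seen.
            per_os[column] = per_os.get(column, 0) + int(value)
    return totals

result = sum_by_os(io.StringIO(SAMPLE))
print(result)
```

Passing an open file object instead of io.StringIO should work the same way, since csv.DictReader only needs an iterable of lines.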

This does not answer your question exactly, as it is indeed possible to solve the problem using csv, but it is worth mentioning that pandas is perfect for this sort of thing:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('dummy.csv')
In [3]: df.groupby('OS').sum()
Out[3]:
         A  B  C  D  E
OS
Mac      4  1  6  5  3
Ubuntu   3  4  5  6  3
Windows  3  2  6  6  3

d is a defaultdict(dict), so d[row['OS']] is a valid expression: the first time an OS is seen, it yields a new, empty dict. d[row['OS']]['A'] += ... then has to read the current value of 'A' from that dict, and since no starting value was ever stored there, the lookup raises KeyError.
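A quick check of what defaultdict(dict) hands back for an unseen key may make this concrete. The key names are only for illustration.

```python
from collections import defaultdict

d = defaultdict(dict)

inner = d['Ubuntu']          # missing key: the factory creates an empty dict
print(inner)                 # {}

try:
    inner['A'] += 1          # reads inner['A'] first, which does not exist
    raised = False
except KeyError:
    raised = True

print(raised)                # True
```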

I assume your input file is called input_file.csv.
You can also process the data and get the desired output using groupby from the itertools module and two dicts, as in the example below:
from itertools import groupby
data = list(k.strip("\n").split(",") for k in open("input_file.csv", 'r'))
a, b = {}, {}
for k, v in groupby(data[1:], lambda x : x[0]):
    try:
        a[k] += [i[1:] for i in list(v)]
    except KeyError:
        a[k] = [i[1:] for i in list(v)]
for key in a.keys():
    for j in range(5):
        c = 0
        for i in a[key]:
            c += int(i[j])
        try:
            b[key] += ',' + str(c)
        except KeyError:
            b[key] = str(c)
Output:
print(','.join(data[0]))
for k in b.keys():
    print("{0},{1}".format(k, b[k]))
>>> OS,A,B,C,D,E
>>> Ubuntu,3,4,5,6,3
>>> Windows,3,2,6,6,3
>>> Mac,4,1,6,5,3

This extends niemmi's solution to format the output to be the same as the OP's example:
import csv
from collections import Counter, defaultdict
d = defaultdict(Counter)
with open('dummy.csv') as csv_file:
    reader = csv.DictReader(csv_file)
    field_names = reader.fieldnames
    for row in reader:
        counter = d[row.pop('OS')]
        for key, value in row.items():
            counter[key] += int(value)
print(','.join(field_names))
for os, counter in sorted(d.items()):
    print("%s,%s" % (os, ','.join([str(v) for k, v in sorted(counter.items())])))
Output:
OS,A,B,C,D,E
Mac,4,1,6,5,3
Ubuntu,3,4,5,6,3
Windows,3,2,6,6,3
Update: Fixed the output.
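If the result should go to a file rather than the terminal, csv.writer can produce the same layout and takes care of quoting. This is a sketch under assumptions: the totals are hard-coded from the expected output in the question, io.StringIO stands in for a real output file, and the file name result.csv in the comment is hypothetical.

```python
import csv
import io

field_names = ['OS', 'A', 'B', 'C', 'D', 'E']
# Totals copied from the expected output in the question.
totals = {
    'Ubuntu': {'A': 3, 'B': 4, 'C': 5, 'D': 6, 'E': 3},
    'Mac': {'A': 4, 'B': 1, 'C': 6, 'D': 5, 'E': 3},
    'Windows': {'A': 3, 'B': 2, 'C': 6, 'D': 6, 'E': 3},
}

out = io.StringIO()  # stand-in for open('result.csv', 'w', newline='')
writer = csv.writer(out, lineterminator='\n')
writer.writerow(field_names)
for os_name in sorted(totals):
    # Emit the columns in header order rather than relying on dict order.
    writer.writerow([os_name] + [totals[os_name][c] for c in field_names[1:]])

print(out.getvalue(), end='')
```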

Related

Adding dictionaries together in Python

If I have two dictionaries x={'a':1,'b':2} and y={'a':1,'b':3}
and I want the output z={'a':2,'b':5}, is there a z=dict.add(x,y) function, or should I convert both dictionaries into DataFrames and then add them together with z=x.add(y)?
You could use Counter in this case, for example:
from pprint import pprint
from collections import Counter
x={'a':1,'b':2}
y={'a':1,'b':3}
c = Counter()
c.update(x)
c.update(y)
pprint(dict(c))
Output:
{'a': 2, 'b': 5}
Or using +:
from pprint import pprint
from collections import Counter
x={'a':1,'b':2}
y={'a':1,'b':3}
pprint(dict(Counter(x) + Counter(y)))
collections.Counter is the natural method, but you can also use a dictionary comprehension after calculating the union of your dictionary keys:
x = {'a':1, 'b':2}
y = {'a':1, 'b':3}
dict_tup = (x, y)
keys = set().union(*dict_tup)
z = {k: sum(i.get(k, 0) for i in dict_tup) for k in keys}
print(z)
{'a': 2, 'b': 5}
Code:
from collections import Counter
x = {"a":1, "b":2}
y = {"a":1, "b":3}
c = Counter(x)
c += Counter(y)
z = dict(c)
print(z)
Output:
{'a': 2, 'b': 5}
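One caveat on the Counter-based answers: adding two Counters with + (or +=) keeps only keys whose total is positive, while update() keeps every key. For the positive values in the question both give the same result; the values below are chosen (as an assumption, not taken from the question) to show where they differ.

```python
from collections import Counter

x = {'a': 1, 'b': 2}
y = {'a': -1, 'b': 3}

# Counter addition keeps only positive totals, so 'a' (1 + -1 = 0) vanishes.
added = dict(Counter(x) + Counter(y))

# update() adds in place and keeps zero or negative totals.
c = Counter(x)
c.update(y)
updated = dict(c)

print(added)     # {'b': 5}
print(updated)   # {'a': 0, 'b': 5}
```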

How to convert comma delimited string into python dictionary?

Please help me with the following issue: I have comma-delimited strings and a value related to each string. The number of commas is unpredictable, with at most 6. I have to convert them into a Python dictionary. For example:
aa.bb.cc 6 => mydict['aa']['bb']['cc']=6
aa.bb.dd.ee 8 => mydict['aa']['bb']['dd']['ee']=8
My Python version is 2.7.9.
One way to do it is to use defaultdict:
from collections import defaultdict as defd
dic = defd(dict)
def create_dict(astr):
    keys, val = astr.split(' ')
    keys = keys.split('.')
    val = int(val)
    prevdic = dic
    for j, k in enumerate(keys):
        if j == len(keys)-1:
            prevdic[k] = val
        elif k not in prevdic:
            prevdic[k] = defd(dict)
        prevdic = prevdic[k]
create_dict('aa.bb.cc 6')
create_dict('aa.bb.dd.ee 8')
dic['aa']['bb']['cc'] ## returns 6
dic['aa']['bb']['dd']['ee'] ## returns 8
I'm going to assume you want one dictionary with multiple keys (you do say "a dictionary") :
import re
s1 = 'a.bb.dd.ee 8'
s2 = 'aa.bb.cc 6'
def create_d(s):
    fields = re.split(r'[. ]', s)
    v = int(fields.pop())
    d = { field:v for field in fields }
    return d
mydict = create_d(s1)
print mydict
mydict.update(create_d(s2))
print mydict
Gives:
{'a': 8, 'ee': 8, 'dd': 8, 'bb': 8}
{'a': 8, 'aa': 6, 'bb': 6, 'ee': 8, 'dd': 8, 'cc': 6}
You can try the following simple code:
str1 = 'aa.bb.cc 6'
str2 = 'aa.bb.cc.dd 8'
dictstr = {}
strdelim = str1.split(' ')
dictstr[tuple(strdelim[0].split('.'))] = strdelim[1]
strdelim = str2.split(' ')
dictstr[tuple(strdelim[0].split('.'))] = strdelim[1]
print dictstr
Result
{('aa', 'bb', 'cc'): '6', ('aa', 'bb', 'cc', 'dd'): '8'}
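For the nested form asked for in the question, plain dicts and setdefault are enough. The sketch below is one possible variant, not taken from any of the answers: the name insert_path is made up, and it is written for Python 3 (the star-unpacking line would need slicing on 2.7).

```python
def insert_path(tree, line):
    """Store the value from 'aa.bb.cc 6' at tree['aa']['bb']['cc']."""
    path, value = line.split(' ')
    *parents, leaf = path.split('.')
    node = tree
    for key in parents:
        # Create the intermediate dict on first use, then descend.
        node = node.setdefault(key, {})
    node[leaf] = int(value)

mydict = {}
insert_path(mydict, 'aa.bb.cc 6')
insert_path(mydict, 'aa.bb.dd.ee 8')
print(mydict)
```

Because the intermediate levels are ordinary dicts, a lookup of a path that was never inserted raises KeyError instead of silently creating it.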

Problems with Python Dictionaries and nested Lists

I am trying to create a dictionary that has a nested list inside of it.
The goal would be to have it be:
key : [x,y,z]
I am pulling the information from a csv file and counting the number of times a certain key shows up in each column. However, I am getting the error below:
> d[key][i] = 1
KeyError: 'owner'
Where owner is the title of my column.
if __name__ == '__main__':
    d = {}
    with open ('sample.csv','r') as f:
        reader = csv.reader(f)
        for i in range(0,3):
            for row in reader:
                key = row[0]
                if key in d:
                    d[key][i] +=1
                else:
                    d[key][i] = 1
    for key,value in d.iteritems():
        print key,value
What do I tweak in this loop to have it create a key if it doesn't exist and then add to it if it does?
The problem is that you index into a list (with [i]) where no list exists yet.
So you have to replace
d[key][i] = 1
with
d[key] = [0,0,0]
d[key][i] = 1
This first creates a list with three entries (so you can use [0], [1] and [2] afterwards without error) and then assigns 1 to the correct entry in the list.
You can use defaultdict:
from collections import defaultdict
ncols = 3
d = defaultdict(lambda: [0 for i in range(ncols)])
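To show how that defaultdict would be used for the per-column counts, here is a small sketch. The rows are hypothetical stand-ins for what csv.reader would yield, and the loop counts each value in the column where it appears, which is one reading of the question.

```python
from collections import defaultdict

ncols = 3
d = defaultdict(lambda: [0 for i in range(ncols)])

# Hypothetical rows standing in for what csv.reader would yield.
rows = [
    ['a', 'b', 'c'],
    ['e', 'a', 'c'],
    ['a', 'b', 'e'],
]

for row in rows:
    for i, key in enumerate(row[:ncols]):
        d[key][i] += 1   # first access creates [0, 0, 0] for a new key

print(dict(d))
```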
Use a try/except block to create the list for a new key, then increment as needed. The rows are read into a list first, because a csv.reader can only be iterated once:
import csv
if __name__ == '__main__':
    d = {}
    with open('sample.csv', 'r') as f:
        rows = list(csv.reader(f))
    for i in xrange(0, 3):
        for row in rows:
            key = row[i]
            try:
                d[key][i] += 1
            except KeyError:
                d[key] = [0, 0, 0]
                d[key][i] = 1
    for key, value in d.iteritems():
        print key, value
Using defaultdict and Counter you can come up with a dict that allows you to easily measure how many times a key appeared in a position (in this case 1st, 2nd or 3rd, by the slice)
csv = [
['a','b','c','d'],
['e','f','g', 4 ],
['a','b','c','d']
]
from collections import Counter, defaultdict
d = defaultdict(Counter)
for row in csv:
    for idx, value in enumerate(row[0:3]):
        d[value][idx] += 1
Example usage:
print d
print d['a'][0] #number of times 'a' has been found in the 1st position
print d['b'][2] #number of times 'b' found in the 3rd position
print d['f'][1] #number of times 'f' found in 2nd position
print [d['a'][n] for n in xrange(3)] # to match the format requested in your post
defaultdict(<class 'collections.Counter'>, {'a': Counter({0: 2}), 'c': Counter({2: 2}), 'b': Counter({1: 2}), 'e': Counter({0: 1}), 'g': Counter({2: 1}), 'f': Counter({1: 1})})
2
0
1
[2, 0, 0]
Or put into a function:
def occurrences(key):
    return [d[key][n] for n in xrange(3)]
print occurrences('a') # [2, 0, 0]

Find duplicates of two columns from csv

I want to find duplicate values in one column and replace them with the value from another column of a CSV file that has multiple columns. So first I put two columns from the CSV into a dictionary. Then I want to find the duplicate values in the dictionary, which has string keys and values. I tried solutions for removing duplicates from a dictionary, but either got a "not hashable" error or no result. Here is the first part of the code.
import csv
from collections import defaultdict
import itertools as it
mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k = rows[3].strip()
    v = rows[2].strip()
    if k in mydict:
        mydict[k].append(v)
    else:
        mydict[k] = [v]
#mydict = hash(frozenset(mydict))
print mydict
d = {}
while True:
    try:
        d = defaultdict(list)
        for k,v in mydict.iteritems():
            #d[frozenset(mydict.items())]
            d[v].append(k)
    except:
        continue
    writer = csv.writer(open(r"OLD.csv", 'wb'))
    for key, value in d.items():
        writer.writerow([key, value])
Your question is unclear, so I hope I got it right.
Please give an example of the input columns and the desired output columns.
Please include the error message and let us know which line caused it.
If column1=[1,2,3,1,4] and column2=[a,b,c,d,e], do you want the output to be n_column1=[a,2,3,d,4] and column2=[1,b,c,d,e]?
I imagine the exception was in d[v].append(k), since v is clearly a list. You cannot use a list as a key in a dictionary.
In [1]: x = [1,2,3,1,4]
In [2]: y = ['a','b','c','d','e']
In [5]: from collections import defaultdict
In [6]: d = defaultdict(int)
In [7]: for a in x:
   ...:     d[a] += 1
In [8]: d
Out[8]: defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 1, 4: 1})
In [9]: x2 = []
In [10]: for a,b in zip(x,y):
   ....:     x2.append(a if d[a]==1 else b)
   ....:
In [11]: x
Out[11]: [1, 2, 3, 1, 4]
In [12]: x2
Out[12]: ['a', 2, 3, 'd', 4]
In that case, if I had to change your code to fit, I'd do something like this:
import csv
from collections import defaultdict
import itertools as it
mydict = {}
index = 0
reader = csv.reader(open(r"computing.csv", "rb"))
histogram = defaultdict(int)
k = []
v = []
for i, rows in enumerate(reader):
    if i == 0:
        continue
    if len(rows) == 0:
        continue
    k.append(rows[3].strip())
    v.append(rows[2].strip())
    item = k[-1]
    histogram[item] += 1
output_column = []
for first_item, second_item in zip(k,v):
    output_column.append(first_item if histogram[first_item]==1 else second_item)
writer = csv.writer(open(r"OLD.csv", 'wb'))
for c1, c2 in zip(output_column, v):
    writer.writerow([c1, c2])
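The replacement step can also be written with collections.Counter doing the histogram work. This is a sketch of the same logic on the toy columns from the session above; the function name replace_duplicates is made up for the example, and it follows this answer's reading of the question.

```python
from collections import Counter

def replace_duplicates(first, second):
    """Return `first` with every repeated value swapped for the
    value at the same position in `second`."""
    counts = Counter(first)
    return [a if counts[a] == 1 else b for a, b in zip(first, second)]

x = [1, 2, 3, 1, 4]
y = ['a', 'b', 'c', 'd', 'e']
print(replace_duplicates(x, y))   # ['a', 2, 3, 'd', 4]
```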

Count values by key in dictionary

I am looking for something that will count values in a dict automatically, without using a list of elements.
d = {}
d["x1"] = "1"
{'x1':'1'}
d["x1"] = "2"
{'x1':'2'}
d["x1"] = "3"
{'x1':'3'}
d["x2"] = "1"
{'x1':'3', 'x2':'1'}
etc.
I tried creating a list and then using
for x in list:
    d[x] = list.count(x)
But when I created the list, I received a memory error.
Are you sure you want to use a plain dict for this? It seems a Counter or a defaultdict suits your need better.
>>> import collections
>>> d = collections.Counter()
>>> d['x1'] += 1
>>> d
Counter({'x1': 1})
>>> d['x1'] += 1
>>> d
Counter({'x1': 2})
>>> d['x2'] += 1
>>> d
Counter({'x1': 2, 'x2': 1})
You could also convert a sequence to a counter:
>>> collections.Counter(['x1', 'x1', 'x2'])
Counter({'x1': 2, 'x2': 1})
Use a defaultdict:
>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> d['foo'] += 1
>>> d['foo'] += 1
>>> d['bar'] += 1
>>> for i in d:
...     print i,d[i]
...
foo 2
bar 1
You can use a dict in the following manner:
d['x1'] = d.get('x1', 0) + 1
The second argument in get specifies the object to return if the key supplied in the first argument is not found.
Applying this on your example:
from pprint import pprint
d = {}
d['x1'] = d.get('x1', 0) + 1
d['x1'] = d.get('x1', 0) + 1
d['x1'] = d.get('x1', 0) + 1
d['x2'] = d.get('x2', 0) + 1
pprint(d) # will print {'x1': 3, 'x2': 1}
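If the memory error in the question comes from building the whole list first (an assumption, since the question does not show that code), the keys can be counted as they are produced. A Counter accepts any iterable, including a generator, so no list is needed. The generator below is a hypothetical stand-in for the real data source.

```python
from collections import Counter

def keys_from_source():
    """Hypothetical stand-in for whatever produces the keys,
    e.g. lines read from a file one at a time."""
    for key in ('x1', 'x1', 'x1', 'x2'):
        yield key

# Counter consumes the generator item by item; no list is built.
counts = Counter(keys_from_source())
print(counts['x1'], counts['x2'])   # 3 1
```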
