CSV Grouping w/o Pandas - python

I'd like to group data in a .csv file. My data looks like this:
code,balance
CN,999.99
CN,1.01
LS,177.77
LS,69.42
LA,200.43
WO,100
I would like to group the items by code and sum the balances for each code. The desired output would be:
code,balance
CN,1001
LS,247.19
...
I was originally using Pandas for this task, but I will not be able to install that library on the server.
mydata = pd.read_csv('./tmp/temp.csv')
out = mydata.groupby('code').sum()
Solutions would preferably be compatible with Python 2.6.
I apologize if this is a duplicate; the other posts seem to be grouping differently.
I would also like to avoid doing this in an
if code == x:
    add balance to x_total
kind of way.
MY SOLUTION:
from collections import defaultdict
import csv
def groupit():
    # Collect every balance under its code.
    groups = defaultdict(list)
    with open('tmp.csv') as fd:
        reader = csv.DictReader(fd)
        for row in reader:
            groups[row['code']].append(float(row['balance']))
    # dict() over pairs instead of a dict comprehension, so this also parses on Python 2.6.
    total = dict((key, sum(groups[key])) for key in groups)
    # Write one "code,balance" line per group.
    with open('out.csv', 'w') as outfile:
        outfile.write('code,balance\n')
        for key in total:
            outfile.write('%s,%s\n' % (key, total[key]))

Python > 2.6:
from collections import defaultdict
import csv
groups = defaultdict(list)
with open('text.txt') as fd:
    reader = csv.DictReader(fd)
    for row in reader:
        groups[row['code']].append(float(row['balance']))
totals = {key: sum(groups[key]) for key in groups}
print(totals)
This outputs:
{'CN': 1001.0, 'LS': 247.19, 'LA': 200.43, 'WO': 100.0}
Python = 2.6:
from collections import defaultdict
import csv
groups = defaultdict(list)
with open('text.txt') as fd:
    reader = csv.DictReader(fd)
    for row in reader:
        groups[row['code']].append(float(row['balance']))
totals = dict((key, sum(groups[key])) for key in groups)
print(totals)
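Since only the totals are needed, the intermediate lists can be skipped by accumulating into a defaultdict(float). The following is a minimal sketch of that variation, not the answerer's code: the function name sum_by_code and the in-memory sample are assumptions made for illustration, and the example is written for Python 3 (io.StringIO stands in for the real file).

```python
import csv
import io
from collections import defaultdict

def sum_by_code(csv_file):
    """Return {code: summed balance} for an open CSV file with code,balance columns."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row['code']] += float(row['balance'])
    return dict(totals)

# In-memory stand-in for the question's file (hypothetical sample data).
sample = io.StringIO(
    "code,balance\n"
    "CN,999.99\nCN,1.01\nLS,177.77\nLS,69.42\nLA,200.43\nWO,100\n"
)
totals = sum_by_code(sample)
```

Because the balances are floats, totals may carry tiny rounding errors; if exact cents matter, decimal.Decimal could be used in place of float.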

Here is how I would go about it:
with open("data.csv", 'r') as f:
    data = f.readlines()
result = {}
# Start at 1 to skip the header; run to len(data) so the last row is included.
for val in range(1, len(data)):
    x = data[val].strip().split(",")
    if x[0] not in result:
        result[x[0]] = float(x[1])
    else:
        result[x[0]] = result[x[0]] + float(x[1])
The result dictionary will hold the totals, which can then be saved as CSV.
import csv
with open('mycsvfile.csv', 'wb') as f: # Just use 'w' mode in 3.x
    w = csv.DictWriter(f, result.keys())
    w.writeheader()
    w.writerow(result)
Hope this helps :)
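Note that the DictWriter call above writes the codes as a header row and the totals as a single data row. To get the code,balance layout from the question, one row per code can be written instead. This is a rough sketch for Python 3; write_totals is a hypothetical helper name, and io.StringIO stands in for a file opened with open('out.csv', 'w', newline='').

```python
import csv
import io

def write_totals(totals, out_file):
    """Write a code,balance header followed by one row per code, sorted by code."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['code', 'balance'])
    for code in sorted(totals):
        writer.writerow([code, totals[code]])

# Hypothetical totals, as produced by a grouping step.
buffer = io.StringIO()
write_totals({'LS': 247.19, 'CN': 1001.0}, buffer)
output_text = buffer.getvalue()
print(output_text)
```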

Related

Python CSV import with nested list creation

I am trying to simply import a .csv into Python. I've read numerous documents but for the life of me I can't figure out how to do the following.
The CSV format is as follows
NYC,22,55
BOSTON,39,22
I'm trying to generate the following : {NYC = [22,55], BOSTON = [39,22]} so that I can call i[0] and i[1] in a loop for each variable.
I've tried
import csv
input_file = csv.DictReader(open(r"C:\Python\Sandbox\longlat.csv"))
for row in input_file:
    print(row)
This prints my rows, but I don't know how to nest the two numeric values under the city name and generate the list that I'm hoping to get.
Thanks for your help, and sorry for my rookie question.
If you are not familiar with Python comprehensions, you can use the following code, which uses a for loop:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', 'r') as f:
    reader = csv.reader(f)
    result = {}
    for row in reader:
        result[row[0]] = row[1:]
The previous code works if you want the numbers as strings. If you want them as numbers, use:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', 'r') as f:
    reader = csv.reader(f)
    result = {}
    for row in reader:
        result[row[0]] = [int(e) for e in row[1:]] # float instead of int is also valid
Use dictionary comprehension:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', mode='r') as csvfile:
    csvread = csv.reader(csvfile)
    result = {k: [int(c) for c in cs] for k, *cs in csvread}
This works in python-3.x, and produces on my machine:
>>> result
{'NYC': [22, 55], 'BOSTON': [39, 22]}
It also works for an arbitrary number of columns.
In case you use python-2.7, you can use indexing and slicing over sequence unpacking:
import csv
with open(r'C:\Python\Sandbox\longlat.csv', mode='r') as csvfile:
    csvread = csv.reader(csvfile)
    result = {row[0]: [int(c) for c in row[1:]] for row in csvread}
Each row will have 3 values. You want the first as the key and the rest as the value.
>>> row
['NYC','22','55']
>>> {row[0]: row[1:]}
{'NYC': ['22', '55']}
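You can create the whole dict (here input_file must be a csv.reader rather than the csv.DictReader from the question, because the rows are indexed by position):
lookup = {row[0]: row[1:] for row in input_file}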
You can also use pandas like so:
import pandas as pd
# header=None because the file has no header row
df = pd.read_csv(r'C:\Python\Sandbox\longlat.csv', header=None)
result = {}
for index, row in df.iterrows():
    result[row.iloc[0]] = list(row.iloc[1:])
Here's a hint: try familiarizing yourself with the str.split(x) function.
strVar = "NYC,22,55"
listVar = strVar.split(',') # ["NYC", "22", "55"]
cityVar = listVar[0] # "NYC"
restVar = listVar[1:] # ["22", "55"]
# If you want to convert `restVar` into integers
# (list() is needed on Python 3, where map returns an iterator)
restVar = list(map(int, restVar)) # [22, 55]
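Putting the pieces from the answers together, the two numbers per city can then be unpacked directly in a loop, which is what the question asks for. This is a minimal sketch under the assumption that every row has a name followed by integer columns; rows_to_dict is a hypothetical helper name and io.StringIO stands in for the real file.

```python
import csv
import io

def rows_to_dict(csv_file):
    """Map the first column of each row to a list of the remaining columns as ints."""
    result = {}
    for row in csv.reader(csv_file):
        if not row:  # skip blank lines
            continue
        result[row[0]] = [int(value) for value in row[1:]]
    return result

# In-memory stand-in for longlat.csv from the question.
sample = io.StringIO("NYC,22,55\nBOSTON,39,22\n")
cities = rows_to_dict(sample)

# Each value is a list, so it can be unpacked or indexed with [0] and [1].
for name, (first, second) in cities.items():
    print(name, first, second)
```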

How to realise "sumif" in python3.x within an csv data

I want to add up lines in a CSV file (it's a BOM) if they are identical and belong to the same part, but not if they are of a specific type.
Here is the example to make it more clear:
LevelName,Type,Amount
Part_1,a,1
Part_1,a,1
Part_1,b,1
Part_1,c,1
Part_1,d,1
Part_1,f,1
Part_2,a,1
Part_2,c,1
Part_2,d,1
Part_2,a,1
Part_2,a,1
Part_2,d,1
Part_2,d,1
So I need to sum up all types within a part, but not if the type is 'd'.
The result should look like this:
LevelName,Type,Amount
Part_1,a,2
Part_1,b,1
Part_1,c,1
Part_1,d,1
Part_1,f,1
Part_2,a,3
Part_2,c,1
Part_2,d,1
Part_2,d,1
Part_2,d,1
Unfortunately I cannot use any external library, so pandas is not an option here.
This is how far I got:
import csv
map = {}
with open('infile.csv', 'rt') as f:
    reader = csv.reader(f, delimiter = ',')
    with open('outfile.csv', 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter=';', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(next(reader))
        for row in reader:
            (level, type, count) = row
            if not type=='d':
                pass  # this is where I am stuck
Well, here I just can't get any further...
Thanks for any hint!
OK, sorry about suggesting pandas. Without it, first read the file and accumulate the amounts in a defaultdict inside your loop over the reader:
from collections import defaultdict
grouped = defaultdict(int)
for row in reader:
    (level, type, count) = row
    if not type=='d':
        grouped[(level, type)] += int(count)
Then you can write the contents of that dict to a file.
import csv
import os
cwd = os.getcwd()
master = {}
# Python 3: open in text mode with newline='' ('rb' only works with Python 2's csv module)
file = csv.DictReader(open(cwd+'\\infile.csv', newline=''), delimiter=',')
data = [row for row in file]
for row in data:
    master.setdefault(row['LevelName'], {})
    if row['Type'] != 'd':
        master[row['LevelName']].setdefault(row['Type'], 0)
        master[row['LevelName']][row['Type']] += int(row['Amount'])
print (master)
Not as simple as the solution above, but this shows how to iterate over the data.
Or I suppose you could concatenate the 'LevelName' and the 'Type' so that you have one less line of code. It depends on what you want.
master = {}
for row in data:
    if row['Type'] != 'd':
        master.setdefault(row['LevelName'] + row['Type'], 0)
        master[row['LevelName'] + row['Type']] += int(row['Amount'])
print (master)
EDIT
To write the nested dict back in the original format, something like:
out = open(cwd+'\\outfile.csv', 'w')
out.write('LevelName,Type,Amount\n')
for k,v in master.items(): # use iteritems() on Python 2
    for z in v:
        out.write('%s,%s,%s\n' % (k, z, str(v[z])))
out.close()
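The snippets above skip rows of type 'd' entirely, whereas the desired result in the question keeps each 'd' row as-is. One way to get that, sketched here under the assumption that sorting by part and then type is an acceptable output order (it matches the example in the question), is to pass the excluded rows through untouched. The name sum_except is hypothetical, and io.StringIO stands in for infile.csv.

```python
import csv
import io
from collections import defaultdict

def sum_except(csv_file, excluded_type='d'):
    """Sum Amount per (LevelName, Type); rows of excluded_type pass through unsummed."""
    reader = csv.reader(csv_file)
    header = next(reader)
    totals = defaultdict(int)
    untouched = []
    for level, type_, amount in reader:
        if type_ == excluded_type:
            untouched.append((level, type_, int(amount)))
        else:
            totals[(level, type_)] += int(amount)
    summed = [(level, type_, total) for (level, type_), total in totals.items()]
    # Sorting by part and type reproduces the layout of the desired result.
    return header, sorted(summed + untouched, key=lambda row: (row[0], row[1]))

# Shortened, hypothetical version of the question's BOM.
sample = io.StringIO(
    "LevelName,Type,Amount\n"
    "Part_1,a,1\nPart_1,a,1\nPart_1,b,1\nPart_1,d,1\n"
    "Part_2,d,1\nPart_2,a,1\nPart_2,d,1\n"
)
header, result_rows = sum_except(sample)
for row in result_rows:
    print(','.join(str(field) for field in row))
```

The rows can then be written with csv.writer(fout).writerow(header) followed by writerows(result_rows).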

Python CSV writer

I have a csv that looks like this:
HA-MASTER,CategoryID
38231-S04-A00,14
39790-S10-A03,14
38231-S04-A00,15
39790-S10-A03,15
38231-S04-A00,16
39790-S10-A03,16
38231-S04-A00,17
39790-S10-A03,17
38231-S04-A00,18
39790-S10-A03,18
38231-S04-A00,19
39795-ST7-000,75
57019-SN7-000,75
38251-SV4-911,75
57119-SN7-003,75
57017-SV4-A02,75
39795-ST7-000,76
57019-SN7-000,76
38251-SV4-911,76
57119-SN7-003,76
57017-SV4-A02,76
What I would like to do is reformat this data so that there is only one line for each categoryID for example:
14,38231-S04-A00,39790-S10-A03
76,39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02
I have not found a way to accomplish this programmatically in Excel. I have over 100,000 lines. Is there a way to do something like this using Python's csv reader and writer?
Yes, there is a way:
import csv
def addRowToDict(row):
    global myDict
    key=row[1]
    if key in myDict:
        #append values if entry already exists
        myDict[key].append(row[0])
    else:
        #create entry
        myDict[key]=[row[1],row[0]]
myDict=dict()
inFile='C:/Users/xxx/Desktop/pythons/test.csv'
outFile='C:/Users/xxx/Desktop/pythons/testOut.csv'
with open(inFile, 'r') as f:
    reader = csv.reader(f)
    ignore=True
    for row in reader:
        if ignore:
            #ignore first row
            ignore=False
        else:
            #add entry to dict
            addRowToDict(row)
with open(outFile,'w') as f:
    writer = csv.writer(f)
    #write everything to file
    writer.writerows(myDict.values())
Just edit inFile and outFile.
This is pretty trivial using a dictionary of lists (Python 2.7 solution):
#!/usr/bin/env python
import fileinput
categories={}
for line in fileinput.input():
    # Skip the first line in the file (assuming it is a header).
    if fileinput.isfirstline():
        continue
    # Split the input line into two fields.
    ha_master, cat_id = line.strip().split(',')
    # If the given category id is NOT already in the dictionary
    # add a new empty list
    if not cat_id in categories:
        categories[cat_id]=[]
    # Append a new value to the category.
    categories[cat_id].append(ha_master)
# Iterate over all category IDs and lists. Use ','.join() to
# output a comma-separated string from a Python list.
for k,v in categories.iteritems():
    print '%s,%s' %(k,','.join(v))
I would read in the entire file and create a dictionary where the key is the ID and the value is a list of the other data.
data = {}
with open("test.csv", "r") as f:
    for line in f:
        temp = line.rstrip().split(',')
        if len(temp[0].split('-')) == 3: # => specific format that ignores the header...
            if temp[1] in data:
                data[temp[1]].append(temp[0])
            else:
                data[temp[1]] = [temp[0]]
with open("output.csv", "w+") as f:
    for id, datum in data.items(): # items() works on both Python 2 and 3
        f.write("{},{}\n".format(id, ','.join(datum)))
Use pandas!
import pandas
csv_data = pandas.read_csv('path/to/csv/file')
use_this = csv_data.groupby('CategoryID')['HA-MASTER'].apply(list)
You will get a Series that maps each CategoryID to the list of its HA-MASTER values; now you just have to format it.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
Cheers.
I see many beautiful answers have come up while I was trying it, but I'll post mine as well.
import re
csvIN = open('your csv file','r')
csvOUT = open('out.csv','w')
cat = dict()
for line in csvIN:
    line = line.rstrip()
    # Skip the header and any other line that does not start with a digit.
    if not re.search('^[0-9]+',line): continue
    ham, cid = line.split(',')
    if cat.get(cid,False):
        cat[cid] = cat[cid] + ',' + ham
    else:
        cat[cid] = ham
for i in sorted(cat):
    csvOUT.write(i + ',' + cat[i] + '\n')
csvIN.close()
csvOUT.close()
Pandas approach:
import pandas as pd
df = pd.read_csv('data.csv')
#new = df.groupby('CategoryID')['HA-MASTER'].apply(lambda row: '%s' % ','.join(row))
new = df.groupby('CategoryID')['HA-MASTER'].agg(','.join)
new.to_csv('out.csv')
out.csv:
14,"38231-S04-A00,39790-S10-A03"
15,"38231-S04-A00,39790-S10-A03"
16,"38231-S04-A00,39790-S10-A03"
17,"38231-S04-A00,39790-S10-A03"
18,"38231-S04-A00,39790-S10-A03"
19,38231-S04-A00
75,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
76,"39795-ST7-000,57019-SN7-000,38251-SV4-911,57119-SN7-003,57017-SV4-A02"
This was an interesting question. My solution was to append each new item for a given key to a single string in the value, along with a comma to delimit the columns.
with open('Input01.csv') as input_file:
    file_lines = [item.strip() for item in input_file.readlines()]
set_vals = {} # category ID -> comma-joined string of part numbers
for item in iter([i.split(',') for i in file_lines]):
    if item[1] in set_vals:
        set_vals[item[1]] = set_vals[item[1]] + ',' + item[0]
    else:
        set_vals[item[1]] = item[0]
with open('Results01.csv','w') as output_file:
    for i in sorted(set_vals.keys()):
        output_file.write('{},{}\n'.format(i, set_vals[i]))
MaxU's pandas implementation looks really elegant, but all the values for a category end up in one cell, because each joined string is double-quoted. For example, the line for category '18', "38231-S04-A00,39790-S10-A03", places both values in the second column.
import csv
from collections import defaultdict
inpath = '' # Path to input CSV
outpath = '' # Path to output CSV
output = defaultdict(list) # To hold {category: [serial_numbers]}
for row in csv.DictReader(open(inpath)):
    output[row['CategoryID']].append(row['HA-MASTER'])
with open(outpath, 'w') as f:
    f.write('CategoryID,HA-MASTER\n')
    for category, serial_numbers in output.items():
        # Join the list; formatting it directly would write its repr, brackets included.
        row = '%s,%s\n' % (category, ','.join(serial_numbers))
        f.write(row)
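Several answers above build the output line by hand with ','.join. As an alternative sketch, csv.writer can write the category ID and each part number as separate fields, which also takes care of quoting if a value ever contains a comma. The function name group_by_category and the shortened sample are assumptions; the example targets Python 3 and uses io.StringIO in place of real files.

```python
import csv
import io
from collections import OrderedDict

def group_by_category(csv_file):
    """Return {CategoryID: [HA-MASTER, ...]} in first-seen order."""
    groups = OrderedDict()
    for row in csv.DictReader(csv_file):
        groups.setdefault(row['CategoryID'], []).append(row['HA-MASTER'])
    return groups

# Shortened, in-memory version of the question's data.
sample = io.StringIO(
    "HA-MASTER,CategoryID\n"
    "38231-S04-A00,14\n39790-S10-A03,14\n38231-S04-A00,19\n"
)
out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for category, parts in group_by_category(sample).items():
    # One row per category: the ID first, then every part number in its own column.
    writer.writerow([category] + parts)
grouped_text = out.getvalue()
print(grouped_text)
```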

How do I merge two CSV files based on field and keep same number of attributes on each record?

I am attempting to merge two CSV files based on a specific field in each file.
file1.csv
id,attr1,attr2,attr3
1,True,7,"Purple"
2,False,19.8,"Cucumber"
3,False,-0.5,"A string with a comma, because it has one"
4,True,2,"Nope"
5,True,4.0,"Tuesday"
6,False,1,"Failure"
file2.csv
id,attr4,attr5,attr6
2,"python",500000.12,False
5,"program",3,True
3,"Another string",-5,False
This is the code I am using:
import csv
from collections import OrderedDict
with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    fields2 = next(reader,None) # Skip headers
    dict2 = {row[0]: row[1:] for row in reader}
with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    fields1 = next(reader,None) # Skip headers
    dict1 = OrderedDict((row[0], row[1:]) for row in reader)
result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
I get output like this, which merges appropriately, but does not have the same number of attributes for all rows:
1,True,7,Purple
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure
file2 will not have a record for every id in file1. I'd like the output to have empty fields from file2 in the merged file. For example, id 1 would look like this:
1,True,7,Purple,,,
How can I add the empty fields to records that don't have data in file2 so that all of my records in the merged CSV have the same number of attributes?
If we're not using pandas, I'd refactor to something like
import csv
from collections import OrderedDict
filenames = "file1.csv", "file2.csv"
data = OrderedDict()
fieldnames = []
for filename in filenames:
    with open(filename, "rb") as fp: # python 2
        reader = csv.DictReader(fp)
        fieldnames.extend(reader.fieldnames)
        for row in reader:
            data.setdefault(row["id"], {}).update(row)
fieldnames = list(OrderedDict.fromkeys(fieldnames))
with open("merged.csv", "wb") as fp:
    writer = csv.writer(fp)
    writer.writerow(fieldnames)
    for row in data.itervalues():
        writer.writerow([row.get(field, '') for field in fieldnames])
which gives
id,attr1,attr2,attr3,attr4,attr5,attr6
1,True,7,Purple,,,
2,False,19.8,Cucumber,python,500000.12,False
3,False,-0.5,"A string with a comma, because it has one",Another string,-5,False
4,True,2,Nope,,,
5,True,4.0,Tuesday,program,3,True
6,False,1,Failure,,,
For comparison, the pandas equivalent would be something like
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
merged = df1.merge(df2, on="id", how="outer").fillna("")
merged.to_csv("merged.csv", index=False)
which is much simpler to my eyes, and means you can spend more time dealing with your data and less time reinventing wheels.
You can use pandas to do this:
import pandas
csv1 = pandas.read_csv('file1.csv')
csv2 = pandas.read_csv('file2.csv')
# how='left' keeps every row of file1; the default inner join would drop ids missing from file2
merged = csv1.merge(csv2, on='id', how='left')
merged.to_csv("output.csv", index=False)
I haven't tested this yet, but it should put you on the right track. The code is fairly self-explanatory: first you import the pandas library. Then you read the two CSV files with pandas.read_csv and combine them with the merge method. The on parameter specifies which column is used as the key. Finally, the merged data is written to output.csv.
Use a dict of dicts, then update it, like this:
import csv
from collections import OrderedDict
with open('file2.csv','r') as f2:
    reader = csv.reader(f2)
    lines2 = list(reader)
with open('file1.csv','r') as f1:
    reader = csv.reader(f1)
    lines1 = list(reader)
dict1 = {row[0]: dict(zip(lines1[0][1:], row[1:])) for row in lines1[1:]}
dict2 = {row[0]: dict(zip(lines2[0][1:], row[1:])) for row in lines2[1:]}
#merge
updatedDict = OrderedDict()
# "?" marks a missing value; use "" instead if you want empty fields
mergedAttrs = OrderedDict.fromkeys(lines1[0][1:] + lines2[0][1:], "?")
for id, attrs in dict1.iteritems():
    d = mergedAttrs.copy()
    d.update(attrs)
    updatedDict[id] = d
for id, attrs in dict2.iteritems():
    updatedDict[id].update(attrs)
#out
with open('merged.csv', 'wb') as f:
    w = csv.writer(f)
    for id, rest in sorted(updatedDict.iteritems()):
        w.writerow([id] + rest.values())
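The non-pandas answers above target Python 2 ('rb'/'wb' modes, iteritems). As a rough Python 3 sketch of the same DictReader-based idea, csv.DictWriter can do the padding itself: its restval argument supplies the value for any field missing from a row. The helper name merge_on_id and the shortened samples are assumptions for illustration; io.StringIO stands in for the two files.

```python
import csv
import io
from collections import OrderedDict

def merge_on_id(files, key='id'):
    """Merge rows from several open CSV files on a shared key column."""
    merged = OrderedDict()
    fieldnames = []
    for csv_file in files:
        reader = csv.DictReader(csv_file)
        for name in reader.fieldnames:
            if name not in fieldnames:  # keep each column name once, in order
                fieldnames.append(name)
        for row in reader:
            merged.setdefault(row[key], {}).update(row)
    return fieldnames, list(merged.values())

# Shortened, in-memory versions of file1.csv and file2.csv.
file1 = io.StringIO('id,attr1,attr2,attr3\n1,True,7,"Purple"\n2,False,19.8,"Cucumber"\n')
file2 = io.StringIO('id,attr4,attr5,attr6\n2,"python",500000.12,False\n')

fieldnames, rows = merge_on_id([file1, file2])
out = io.StringIO()
# restval='' fills in the attributes that a record lacks.
writer = csv.DictWriter(out, fieldnames=fieldnames, restval='', lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
merged_text = out.getvalue()
print(merged_text)
```

With real files, open them in text mode with newline='' as the csv module documentation recommends for Python 3.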

Python Joining csv files where key is first column value

I am trying to join two CSV files where the key is the value of the first column.
There is no header.
The files have different numbers of lines and columns.
The order of file a must be preserved.
file a:
john,red,34
andrew,green,18
tonny,black,50
jack,yellow,27
phill,orange,45
kurt,blue,29
mike,pink,61
file b:
tonny,driver,new york
phill,scientist,boston
desired result:
john,red,34
andrew,green,18
tonny,black,50,driver,new york
jack,yellow,27
phill,orange,45,scientist,boston
kurt,blue,29
mike,pink,61
I examined all related threads, and I am sure that some of you are going to mark this question as a duplicate, but I simply have not found a solution yet.
I grabbed a dictionary-based solution, but this approach does not preserve the line order of file 'a'.
import csv
from collections import defaultdict
with open('a.csv') as f:
    r = csv.reader(f, delimiter=',')
    dict1 = {}
    for row in r:
        dict1.update({row[0]: row[1:]})
with open('b.csv') as f:
    r = csv.reader(f, delimiter=',')
    dict2 = {}
    for row in r:
        dict2.update({row[0]: row[1:]})
result = defaultdict(list)
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result[key].append(value)
I would also like to avoid loading these CSV files into a database such as SQLite, or using the pandas module.
Thanks in advance.
Something like
import csv
from collections import OrderedDict
with open('b.csv', 'rb') as f:
    r = csv.reader(f)
    dict2 = {row[0]: row[1:] for row in r}
with open('a.csv', 'rb') as f:
    r = csv.reader(f)
    dict1 = OrderedDict((row[0], row[1:]) for row in r)
result = OrderedDict()
for d in (dict1, dict2):
    for key, value in d.iteritems():
        result.setdefault(key, []).extend(value)
with open('ab_combined.csv', 'wb') as f:
    w = csv.writer(f)
    for key, value in result.iteritems():
        w.writerow([key] + value)
produces
john,red,34
andrew,green,18
tonny,black,50,driver,new york
jack,yellow,27
phill,orange,45,scientist,boston
kurt,blue,29
mike,pink,61
(Note that I didn't bother protecting against the case where dict2 has a key which isn't in dict1-- that's easily added if you like.)
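For reference, here is a small Python 3 sketch of the same ordered-join idea that makes the behaviour for keys found only in file b explicit: they are appended after all rows from file a. The function name join_on_first_column and the sample rows (including the extra "zed" line) are made up for illustration; io.StringIO stands in for a.csv and b.csv.

```python
import csv
import io
from collections import OrderedDict

def join_on_first_column(file_a, file_b):
    """Join two header-less CSV files on their first column, keeping file a's order."""
    joined = OrderedDict((row[0], row[1:]) for row in csv.reader(file_a) if row)
    for row in csv.reader(file_b):
        if not row:  # skip blank lines
            continue
        # Known keys keep their position from file a; unknown keys go to the end.
        joined.setdefault(row[0], []).extend(row[1:])
    return [[key] + values for key, values in joined.items()]

# Hypothetical in-memory samples; "zed" exists only in file b.
file_a = io.StringIO("john,red,34\ntonny,black,50\n")
file_b = io.StringIO("tonny,driver,new york\nzed,pilot,rome\n")
joined_rows = join_on_first_column(file_a, file_b)
for row in joined_rows:
    print(','.join(row))
```

To drop keys that appear only in file b instead, replace the setdefault line with a check such as `if row[0] in joined: joined[row[0]].extend(row[1:])`.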
