I'd like to parse a CSV file and aggregate the values. The CITY column has repeating values (sample):
CITY,AMOUNT
London,20
Tokyo,45
London,55
New York,25
After parsing, the result should be something like:
CITY,AMOUNT
London,75
Tokyo,45
New York,25
I've written the following code to extract the unique city names:
import csv

def main():
    contrib_data = list(csv.DictReader(open('contributions.csv', 'rU')))
    combined = []
    for row in contrib_data:
        if row['CITY'] not in combined:
            combined.append(row['CITY'])
How do I then aggregate values?
Tested in Python 3.2.2:
import csv
from collections import defaultdict

reader = csv.DictReader(open('test.csv', newline=''))
cities = defaultdict(int)
for row in reader:
    cities[row["CITY"]] += int(row["AMOUNT"])

writer = csv.writer(open('out.csv', 'w', newline=''))
writer.writerow(["CITY", "AMOUNT"])
writer.writerows([city, cities[city]] for city in cities)
Result:
CITY,AMOUNT
New York,25
London,75
Tokyo,45
As for your added requirements:
import csv
from collections import defaultdict

def default_factory():
    return [0, None, None, 0]

reader = csv.DictReader(open('test.csv', newline=''))
cities = defaultdict(default_factory)
for row in reader:
    amount = int(row["AMOUNT"])
    cities[row["CITY"]][0] += amount
    max = cities[row["CITY"]][1]
    cities[row["CITY"]][1] = amount if max is None else amount if amount > max else max
    min = cities[row["CITY"]][2]
    cities[row["CITY"]][2] = amount if min is None else amount if amount < min else min
    cities[row["CITY"]][3] += 1

for city in cities:
    cities[city][3] = cities[city][0] / cities[city][3]  # calculate mean

writer = csv.writer(open('out.csv', 'w', newline=''))
writer.writerow(["CITY", "AMOUNT", "max", "min", "mean"])
writer.writerows([city] + cities[city] for city in cities)
This gives you
CITY,AMOUNT,max,min,mean
New York,25,25,25,25.0
London,75,55,20,37.5
Tokyo,45,45,45,45.0
Note that under Python 2, you'll need the additional line from __future__ import division at the top to get correct results.
Using a dict keyed by city, with the AMOUNT as the value, might do the trick. Something like the following,
where you read one line at a time, city is the current city, and amount is the current amount:
main_dict = {}
# --- for loop over the rows goes here ---
if city in main_dict:
    main_dict[city] = main_dict[city] + amount
else:
    main_dict[city] = amount
# --- end for loop ---
At the end of the loop you will have aggregate values in main_dict.
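A minimal runnable sketch of this idea, assuming the CITY/AMOUNT layout from the sample above and the contributions.csv file name from the question's code:

import csv

main_dict = {}
with open('contributions.csv', newline='') as f:
    for row in csv.DictReader(f):
        city = row['CITY']
        amount = int(row['AMOUNT'])
        if city in main_dict:
            main_dict[city] = main_dict[city] + amount
        else:
            main_dict[city] = amount

# main_dict now holds the aggregated totals,
# e.g. {'London': 75, 'Tokyo': 45, 'New York': 25}
print(main_dict)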
import csv
def getDataFromFile(filename, dataList):
    file = open(filename, "r")
    csvReader = csv.reader(file)
    for aList in csvReader:
        dataList.append(aList)
    file.close()
def getTotalByYear(expendDataList):
    total = 0
    for row in expendDataList:
        expenCount = float(row[2])
        total += expenCount
    Rtotal = input("Enter 'every' or a particular year. ")
    if Rtotal == 'every' or Rtotal == 'Every':
        print(total)
As you can see, I get a running total for column 2 when 'every' or 'Every' is typed, but I don't understand how to make the running total for column 2 depend on a certain value in column one.
In this case my CSV file has three columns of data: a year field, an item field, and an expenditure field. How do I get a running total of the expenditure field for a given year?
expendDataList = []
fname = "expenditures.csv"
getDataFromFile(fname, expendDataList)
getTotalByYear(expendDataList)
Producing a running total is a good task for a generator function. This example uses the filter built-in function to filter out unwanted years (a generator expression or list comprehension could be used instead), then iterates over the selected rows to produce the results.
import csv

def running_totals(year):
    with open('year-item-expenditure.csv') as f:
        reader = csv.DictReader(f)
        predicate = None if year.lower() == 'every' else lambda row: row['Year'] == year
        total = 0
        for row in filter(predicate, reader):
            total += float(row['Expenditure'])
            yield total

totals = running_totals('2019')
for total in totals:
    print(total)
Another approach would be to use itertools.accumulate, though you still have to perform all of the filtering, so there's not much benefit to doing this unless you need performance.
import csv
import itertools

def running_totals(year):
    with open('year-item-expenditure.csv') as f:
        reader = csv.DictReader(f)
        predicate = None if year.lower() == 'every' else lambda row: row['Year'] == year
        # Create a generator expression that yields expenditures as floats
        expenditures = (float(row['Expenditure']) for row in filter(predicate, reader))
        for total in itertools.accumulate(expenditures):
            yield total
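Usage is the same as for the first version, e.g. (the year '2019' is just an illustration):

for total in running_totals('2019'):
    print(total)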
I'm a beginner at Python and I am using it for my project.
I want to extract the minimum value from column 4 of a CSV file and I am not sure how to do it.
I can print the whole of column 4, but I am not sure how to print only the minimum value of that column.
CSV File: https://www.emcsg.com/marketdata/priceinformation
I'm downloading the Uniform Singapore Energy Price & Demand Forecast for 9 Sep.
Thank you in advance.
This is my current code:
import csv
import operator

with open('sep.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    header = next(readCSV)
    data = []
    for row in readCSV:
        Date = row[0]
        Time = row[1]
        Demand = row[2]
        RCL = row[3]
        USEP = row[4]
        EHEUR = row[5]
        LCP = row[6]
        Regulations = row[7]
        Primary = row[8]
        Secondary = row[9]
        Contingency = row[10]
        Last_Updated = row[11]
    print header[4]
    print row[4]
I'm not sure how you are reading the values; however, you can add all the values to a list and then take its minimum:
values = []
# <loop to extract values>
values.insert(index, value)
min_value = min(values)
Note: index is the position at which the value gets inserted (a plain values.append(value) works just as well here).
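A concrete sketch of that idea with the csv module, assuming the sep.csv layout from the question (USEP is at column index 4):

import csv

values = []
with open('sep.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)                     # skip the header row
    for row in readCSV:
        values.append(float(row[4]))  # collect the USEP column
min_value = min(values)
print(min_value)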
Your phrasing is a bit ambiguous. At first I thought you meant the minimum of the fourth row, but looking at the data you seem to want the minimum of the fourth column (USEP($/MWh)). For that (assuming that "Realtime_Sep-2017.csv" is the filename), you can do:
import pandas as pd
df = pd.read_csv("Realtime_Sep-2017.csv")
print(min(df["USEP($/MWh)"]))
Other options include df.min()["USEP($/MWh)"], df.min()[4], and min(df.iloc[:,4])
EDIT 2 :
Solution for a column without pandas module:
with open("Realtime_Sep-2017.csv") as file:
lines = file.read().split("\n") #Read lines
num_list = []
for line in lines:
try:
item = line.split(",")[4][1:-1] #Choose 4th column and delete ""
num_list.append(float(item)) #Try to parse
except:
pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value
Output:
81.92
79.83
EDIT :
Here is the solution for a column:
import pandas as pd

df = pd.read_csv("Realtime_Sep-2017.csv")
row_count = df.shape[0]
column_list = []
for i in range(row_count):
    item = df.at[i, df.columns.values[4]]  # column index 4
    column_list.append(float(item))        # parse float and append to the list
print(max(column_list))  # Prints maximum value
print(min(column_list))  # Prints minimum value
BEFORE EDIT :
(solution for a row)
Here is a simple code block:
with open("Realtime_Sep-2017.csv") as file:
lines = file.read().split("\n") #Reading lines
num_list = []
line = lines[3] #Choosing 4th row.
for item in line.split(","):
try:
num_list.append(float(item[1:-1])) #Try to parse
except:
pass #If it can't parse, the string is not a number
print(max(num_list)) #Prints maximum value
print(min(num_list)) #Prints minimum value
I am new to Python and I have prepared a script that should modify the following CSV file as follows:
1) Each row that contains multiple gene entries separated by ///, such as:
C16orf52 /// LOC102725138 1.00551
should be transformed to:
C16orf52 1.00551
LOC102725138 1.00551
2) The same gene may have different ratio values
AASDHPPT 0.860705
AASDHPPT 0.983691
and we want to keep only the pair with the highest ratio value (delete the pair AASDHPPT 0.860705)
Here is the script I wrote but it does not assign the correct ratio values to the genes:
import csv
import pandas as pd

with open('2column.csv', 'rb') as f:
    reader = csv.reader(f)
    a = list(reader)

gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()

newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
        for j in range(i + 1, len(gene)):
            if g == gene[j]:
                if ratio[j] > r:
                    r = ratio[j]
        newratio.append(r)

for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]

if len(newgene) > len(set(newgene)):
    print 'missionfailed'
Thank you very much for any help or suggestion.
Try this:
import csv

with open('2column.csv') as f:
    lines = f.read().splitlines()

new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        if part not in new_lines:
            new_lines[part] = cols[1]
        else:
            if float(cols[1]) > float(new_lines[part]):
                new_lines[part] = cols[1]

with open('clean_2column.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
First of all, if you're importing Pandas, know that it has I/O tools to read CSV files.
So first, let's load the data that way:
import pandas as pd

df = pd.read_csv('2column.csv')
Then, you can extract the indexes where you have your '///' pattern:
l = list(df[df['Gene Symbol'].str.contains('///')].index)
Then, you can create your new rows :
for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        df = df.append(pd.DataFrame([[sub, df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns))
Then, drop the old ones :
df=df.drop(df.index[l])
Then I'll do a little trick to remove your lowest duplicate values: first I'll sort by 'Ratio(ifna vs. ctrl)', then drop all the duplicates except the first one:
df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
If you want to keep your sorting by Gene Symbol and reset indexes to have simpler ones, simply do :
df = df.sort_values('Gene Symbol').reset_index(drop=True)
If you want to re-export your modified data to your csv, do :
df.to_csv('2column.csv')
EDIT : I edited my answer to correct syntax errors, I've tested this solution with your csv and it worked perfectly :)
This should work.
It uses the dictionary suggestion of Peter.
import csv

with open('2column.csv', 'r') as f:
    reader = csv.reader(f)
    original_file = list(reader)

# get rid of the header
original_file = original_file[1:]

# create an empty dictionary
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    gene_ratio = row[1]

    # check if /// is in the string; if so, split the string
    if '///' in gene_name:
        gene_names = gene_name.split('///')
        # loop over all the resulting components
        for gene in gene_names:
            gene = gene.strip()
            # if the component is not in the dictionary, store gene_ratio for it
            if gene not in genes_ratio:
                genes_ratio[gene] = gene_ratio
            # if it is in the dictionary, compare as numbers and keep the larger ratio
            elif float(genes_ratio[gene]) < float(gene_ratio):
                genes_ratio[gene] = gene_ratio
    else:
        if gene_name not in genes_ratio:
            genes_ratio[gene_name] = gene_ratio
        elif float(genes_ratio[gene_name]) < float(gene_ratio):
            genes_ratio[gene_name] = gene_ratio

# loop over the dictionary and print gene names and their ratio values
for key in genes_ratio:
    print key, genes_ratio[key]
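If you also want to write the result back to a file rather than just print it, a possible follow-up (the output file name and the header text are my assumptions, borrowing the column names used in the other answer):

with open('2column_cleaned.csv', 'w') as out:
    out.write('Gene Symbol,Ratio(ifna vs. ctrl)\n')
    for gene_name in genes_ratio:
        out.write('%s,%s\n' % (gene_name, genes_ratio[gene_name]))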
I am trying to count the wins of certain college basketball teams, and I have a CSV file containing that data. When I run this code, no matter what I have tried, it always returns 0.
import csv

f = open("data.csv", 'r')
data = list(csv.reader(f))

def ncaa(team):
    count = 0
    for row in data:
        if row[2] == team:
            count += 1
    return count

airforce_wins = ncaa("Air force")
akron_wins = ncaa("Akron")
print(akron_wins)
This will give you "1".
import csv

f = open("C:\\users/alex/desktop/data.csv", 'r')
data = list(csv.reader(f))

def ncaa(team):
    count = 0
    for row in data:
        if row[1] == team:  # corrected index here
            count += 1
    return count

airforce_wins = ncaa("Air force")
akron_wins = ncaa("Akron")
print(akron_wins)
However, I don't think you are counting the wins correctly. You are counting occurrences of a team's row in the file, but since each team only has one row, you will always get "1" for any team. Perhaps your wins are stored in another column, and that's the value you need to look up (and sum) when you find your team.
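A hedged sketch of that idea: assuming the wins sit in some numeric column, say index 3 (that column position is a guess, not something stated in the question), you would sum that value instead of counting rows:

import csv

def total_wins(team, filename="data.csv", team_col=1, wins_col=3):
    # team_col and wins_col are assumptions about the file layout
    total = 0
    with open(filename, 'r') as f:
        for row in csv.reader(f):
            if row[team_col] == team:
                total += int(row[wins_col])
    return total

print(total_wins("Akron"))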
Try this instead before the function definition:
import csv

with open("data1.csv", 'r') as f:
    data = csv.reader(f, delimiter=',')
I don't think using list(reader_object) is correct.
I would like to adapt the post here (Parse CSV file and aggregate the values) to sum multiple columns instead of just one.
So for these data:
CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57
New York,25,26,27
How can I get this:
CITY,AMOUNT,AMOUNT2,AMOUNTn
London,75,77,79
Tokyo,45,46,47
New York,25,26,27
I will have several thousand columns eventually, and unfortunately I cannot use the pandas package for this task. Here is the code I have; it aggregates all three AMOUNT columns into one, which is not what I am after:
from __future__ import division
import csv
from collections import defaultdict

def default_factory():
    return [0, None, None, 0]

reader = csv.DictReader(open('test_in.txt'))
cities = defaultdict(default_factory)
for row in reader:
    headers = [r for r in row.keys()]
    headers.remove('CITY')
    for i in headers:
        amount = int(row[i])
        cities[row["CITY"]][0] += amount
        max = cities[row["CITY"]][1]
        cities[row["CITY"]][1] = amount if max is None else amount if amount > max else max
        min = cities[row["CITY"]][2]
        cities[row["CITY"]][2] = amount if min is None else amount if amount < min else min
        cities[row["CITY"]][3] += 1

for city in cities:
    cities[city][3] = cities[city][0] / cities[city][3]  # calculate mean

with open('test_out.txt', 'wb') as myfile:
    writer = csv.writer(myfile, delimiter="\t")
    writer.writerow(["CITY", "AMOUNT", "AMOUNT2", "AMOUNTn", "max", "min", "mean"])
    writer.writerows([city] + cities[city] for city in cities)
Thank you for any help
Here is one way using itertools.groupby.
import StringIO
import csv
import itertools

data = """CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57
New York,25,26,27"""

# I use StringIO to create a file-like object for demo purposes
f = StringIO.StringIO(data)
fieldnames = f.readline().strip().split(',')

key = lambda x: x[0]  # the first column will be the grouping key
# rows must be sorted by city before passing to itertools.groupby
rows_sorted = sorted(csv.reader(f), key=key)

outfile = StringIO.StringIO('')
writer = csv.DictWriter(outfile, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()

for city, rows in itertools.groupby(rows_sorted, key=key):
    # remove the city column for aggregation, convert to ints
    rows = [[int(x) for x in row[1:]] for row in rows]
    agg = [sum(column) for column in zip(*rows)]
    writer.writerow(dict(zip(fieldnames, [city] + agg)))

print outfile.getvalue()
# CITY,AMOUNT,AMOUNT2,AMOUNTn
# London,75,77,79
# New York,25,26,27
# Tokyo,45,46,47
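Note that the code above is written for Python 2 (the StringIO module and the print statement). A rough Python 3 sketch of the same groupby aggregation, assuming the data lives in the test_in.txt / test_out.txt files named in the question:

import csv
import itertools

with open('test_in.txt', newline='') as f, open('test_out.txt', 'w', newline='') as out:
    reader = csv.reader(f)
    fieldnames = next(reader)
    key = lambda row: row[0]                  # group by the CITY column
    rows_sorted = sorted(reader, key=key)     # groupby needs sorted input

    writer = csv.writer(out, delimiter='\t')  # tab-delimited, as in the question
    writer.writerow(fieldnames)
    for city, rows in itertools.groupby(rows_sorted, key=key):
        # transpose the numeric columns and sum each one
        sums = [sum(int(x) for x in col) for col in zip(*(row[1:] for row in rows))]
        writer.writerow([city] + sums)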
Here is how I would do it.
import csv
from StringIO import StringIO

data = '''CITY,AMOUNT,AMOUNT2,AMOUNTn
London,20,21,22
Tokyo,45,46,47
London,55,56,57,99
New York,25,26,27'''

file_ = StringIO(data)
reader = csv.reader(file_)
headers = next(reader)

rows = {}

def add(col1, col2):
    l = len(col1)
    for i, n in enumerate(col2):
        if i >= l:
            col1.extend(col2[i:])
            break
        col1[i] += n
    return col1

for row in reader:
    key = row[0]
    nums = map(int, row[1:])
    if key in rows:
        rows[key] = add(rows[key], nums)
    else:
        rows[key] = map(int, nums)
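The answer stops once the rows dict has been built; to actually see the aggregated result, one possible finish (the output format here is my own assumption, not part of the original answer) is:

# print the header followed by each city and its per-column sums
print(','.join(headers))
for city in rows:
    print(','.join([city] + [str(n) for n in rows[city]]))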