Counting data within ranges in csv - python
I have some data which I need to break down into manageable chunks. With the following data I need to count the number of times x occurs in column 11 with column 7 being a 1, and how many times x occurs in column 11 overall. These two counts need to go on the first line of a csv. After that I need to count the same pair of things, but with column 11 falling into the following brackets:
0
">0 but <0.05"
">0.05 but <0.10"
">0.1 but <0.15... all the way up to 1.00"
All of these would ideally be appended to the same new.csv, i.e. not the main data csv.
Some example raw data that fits the above description (please note a lot of the brackets will contain no data, in which case they would need to return 0,0):
01/01/2002,Data,class1,4,11yo+,4,1,George Smith,0,0,x
01/01/2002,Data,class1,4,11yo+,4,2,Ted James,0,0,x
01/01/2002,Data,class1,4,11yo+,4,3,Emma Lilly,0,0,x
01/01/2002,Data,class1,4,11yo+,4,5,George Smith,0,0,x
02/01/2002,Data,class2,4,10yo+,6,4,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,2,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,5,George Smith,1,2,0.5
02/01/2002,Data,class2,4,10yo+,6,3,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,1,Emma Lilly,0,1,0
02/01/2002,Data,class2,4,10yo+,6,6,George Smith,1,2,0.5
03/01/2002,Data,class3,4,10yo+,6,6,Ted James,0,1,0
03/01/2002,Data,class3,4,10yo+,6,3,Tom Phillips,0,3,0
03/01/2002,Data,class3,4,10yo+,6,2,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,4,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,1,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,5,Tom Phillips,0,3,0
04/01/2002,Data,class4,2,10yo+,5,3,Emma Lilly,1,2,0.5
04/01/2002,Data,class4,2,10yo+,5,1,Ted James,0,2,0
04/01/2002,Data,class4,2,10yo+,5,2,George Smith,2,7,0.285714286
04/01/2002,Data,class4,2,10yo+,5,4,Emma Lilly,1,2,0.5
04/01/2002,Data,class4,2,10yo+,5,5,Tom Phillips,0,5,0
05/01/2002,Data,class5,4,11yo+,4,1,George Smith,2,8,0.25
05/01/2002,Data,class5,4,11yo+,4,2,Ted James,1,3,0.333333333
05/01/2002,Data,class5,4,11yo+,4,3,Emma Lilly,1,4,0.25
05/01/2002,Data,class5,4,11yo+,4,5,George Smith,2,8,0.25
06/01/2002,Data,class6,4,10yo+,6,4,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,2,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,5,George Smith,3,10,0.3
06/01/2002,Data,class6,4,10yo+,6,3,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,1,Emma Lilly,1,5,0.2
06/01/2002,Data,class6,4,10yo+,6,6,George Smith,3,10,0.3
07/01/2002,Data,class7,4,10yo+,6,6,Ted James,1,4,0.25
07/01/2002,Data,class7,4,10yo+,6,3,Tom Phillips,0,9,0
07/01/2002,Data,class7,4,10yo+,6,2,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,4,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,1,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,5,Tom Phillips,0,9,0
08/01/2002,Data,class8,2,10yo+,5,3,Emma Lilly,2,6,0.333333333
08/01/2002,Data,class8,2,10yo+,5,1,Ted James,1,5,0.2
08/01/2002,Data,class8,2,10yo+,5,2,George Smith,4,15,0.266666667
08/01/2002,Data,class8,2,10yo+,5,4,Emma Lilly,2,6,0.333333333
08/01/2002,Data,class8,2,10yo+,5,5,Tom Phillips,0,11,0
09/01/2002,Data,class9,4,11yo+,4,1,George Smith,4,16,0.25
09/01/2002,Data,class9,4,11yo+,4,2,Ted James,2,6,0.333333333
09/01/2002,Data,class9,4,11yo+,4,3,Emma Lilly,2,8,0.25
09/01/2002,Data,class9,4,11yo+,4,5,George Smith,4,16,0.25
10/01/2002,Data,class10,4,10yo+,6,4,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,2,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,5,George Smith,5,18,0.277777778
10/01/2002,Data,class10,4,10yo+,6,3,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,1,Emma Lilly,2,9,0.222222222
10/01/2002,Data,class10,4,10yo+,6,6,George Smith,5,18,0.277777778
11/01/2002,Data,class11,4,10yo+,6,6,Ted James,2,7,0.285714286
11/01/2002,Data,class11,4,10yo+,6,3,Tom Phillips,0,15,0
11/01/2002,Data,class11,4,10yo+,6,2,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,4,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,1,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,5,Tom Phillips,0,15,0
12/01/2002,Data,class12,2,10yo+,5,3,Emma Lilly,3,10,0.3
12/01/2002,Data,class12,2,10yo+,5,1,Ted James,2,8,0.25
12/01/2002,Data,class12,2,10yo+,5,2,George Smith,6,23,0.260869565
12/01/2002,Data,class12,2,10yo+,5,4,Emma Lilly,3,10,0.3
12/01/2002,Data,class12,2,10yo+,5,5,Tom Phillips,0,17,0
13/01/2002,Data,class13,4,11yo+,4,1,George Smith,6,24,0.25
13/01/2002,Data,class13,4,11yo+,4,2,Ted James,3,9,0.333333333
13/01/2002,Data,class13,4,11yo+,4,3,Emma Lilly,3,12,0.25
13/01/2002,Data,class13,4,11yo+,4,5,George Smith,6,24,0.25
14/01/2002,Data,class14,4,10yo+,6,4,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,2,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,5,George Smith,7,26,0.269230769
14/01/2002,Data,class14,4,10yo+,6,3,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,1,Emma Lilly,3,13,0.230769231
14/01/2002,Data,class14,4,10yo+,6,6,George Smith,7,26,0.269230769
15/01/2002,Data,class15,4,10yo+,6,6,Ted James,3,10,0.3
If anybody can help me achieve this I will be truly grateful. If this requires more detail please ask.
One last note: the main data csv in question has 800k rows.
EDIT
Currently the output file appears as follows using the code supplied by #user650654:
data1,data2
If at all possible I would like the code changed slightly to output two more things. Hopefully these are not too difficult to add. Proposed changes to the output file (commas separate the fields within each row):
a title cell labeling the row (e.g. "x" or "0:0.05"), the calculated average of the values within each bracket (e.g. "0.02469"), data1, data2
So in reality it would probably look like this:
x,n/a,data1,data2
0:0.05,0.02469,data1,data2
0.05:0.1,0.5469,data1,data2
....
....
Column1 = Row label (the data ranges that are being counted in the original question, i.e. from 0 to 0.05)
Column2 = Calculated average of the values that fell within that particular range.
Note that data1 & data2 are the two values the question initially asked for.
Many thanks AEA
Here is a solution for adding the two new fields:
import csv
import numpy

def count(infile='data.csv', outfile='new.csv'):
    bins = numpy.arange(0, 1.05, 0.05)
    total_x = 0
    col7one_x = 0
    total_zeros = 0
    col7one_zeros = 0
    all_array = []
    col7one_array = []
    with open(infile, 'r') as fobj:
        reader = csv.reader(fobj)
        for line in reader:
            if line[10] == 'x':
                total_x += 1
                if line[6] == '1':
                    col7one_x += 1
            elif line[10] == '0':
                # assumes zero is represented as "0" and not as, say, "0.0"
                total_zeros += 1
                if line[6] == '1':
                    col7one_zeros += 1
            else:
                val = float(line[10])
                all_array.append(val)
                if line[6] == '1':
                    col7one_array.append(val)
    all_array = numpy.array(all_array)
    hist_all, edges = numpy.histogram(all_array, bins=bins)
    hist_col7one, edges = numpy.histogram(col7one_array, bins=bins)
    bin_ranges = ['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]
    digitized = numpy.digitize(all_array, bins)
    bin_means = [all_array[digitized == i].mean() if hist_all[i - 1] else 'n/a'
                 for i in range(1, len(bins))]
    with open(outfile, 'w') as fobj:
        writer = csv.writer(fobj)
        writer.writerow(['x', 'n/a', col7one_x, total_x])
        writer.writerow(['0', 0 if total_zeros else 'n/a', col7one_zeros, total_zeros])
        for row in zip(bin_ranges, bin_means, hist_col7one, hist_all):
            writer.writerow(row)

if __name__ == '__main__':
    count()
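For anyone puzzled by the binning above, here is a quick, hedged illustration of what numpy.histogram and numpy.digitize return for a few made-up values (not taken from the real data):

import numpy

bins = numpy.arange(0, 1.05, 0.05)            # edges 0.0, 0.05, ..., 1.0
vals = numpy.array([0.02, 0.07, 0.25, 0.25])  # made-up sample values
hist, edges = numpy.histogram(vals, bins=bins)
digitized = numpy.digitize(vals, bins)        # 1-based index of the bin each value falls in
print(hist[:6])                               # -> [1 1 0 0 0 2]
print(digitized)                              # -> [1 2 6 6]
print(vals[digitized == 6].mean())            # -> 0.25, the per-bin mean computed above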
This might work:
import numpy as np
import pandas as pd

# names to be used as column labels. If no names are specified then columns
# can be referred to by number, e.g. df[0], df[1] etc.
column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6',
                'col7', 'col8', 'col9', 'col10', 'col11']
df = pd.read_csv('data.csv', header=None, names=column_names)  # header=None means there are no column headings in the csv file
df.loc[df.col11 == 'x', 'col11'] = -0.08  # trick so that 'x' rows will be grouped into a category >-0.1 and <=-0.05. This allows all of col11 to be treated as numbers
bins = np.arange(-0.1, 1.05, 0.05)  # bins to put col11 values in. >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values. The end point is 1.05 so the top bin covers values up to 1.00
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])])  # create labels for the bins
labels[0] = 'x'  # change first bin label to 'x'
labels[1] = '0'  # change second bin label to '0'
df['col11'] = df['col11'].astype(float)  # convert col11 to numbers so we can do math on them
df['bin'] = pd.cut(df['col11'], bins=bins, labels=False)  # make another column 'bin' holding an integer for the bin each number falls into; later we map the integer to the bin label
df.set_index('bin', inplace=True, drop=False, append=False)  # groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x == 1)

dfg = df[['bin', 'col7', 'col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})  # groupby the bin number and apply aggregate functions to the specified columns
dfg.index = labels[dfg.index]  # apply labels to bin numbers
dfg.loc['x', ('col11', 'mean')] = 'N/A'  # mean of 'x' rows is meaningless
print(dfg)
dfg.to_csv('new.csv')
which gave me
                  col7            col11
            count_ones  len        mean
x                    1    7         N/A
0                    2   21           0
0.15:0.2             2    2         0.2
0.2:0.25             9   22   0.2478632
0.25:0.3             0   13   0.2840755
0.3:0.35             0    5   0.3333333
0.45:0.5             0    4         0.5
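As a side note on the pd.cut step: here is a tiny, hedged sketch (with made-up numbers, not the asker's data) of how the -0.08 trick puts 'x' rows into their own bin, with labels=False returning 0-based bin indices:

import numpy as np
import pandas as pd

bins = np.arange(-0.1, 1.05, 0.05)
s = pd.Series([-0.08, -0.01, 0.23])  # an 'x' row (mapped to -0.08), a near-zero value, a normal value
print(pd.cut(s, bins=bins, labels=False).tolist())  # -> [0, 1, 6]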
This solution uses numpy.histogram. See below.
import csv
import numpy

def count(infile='data.csv', outfile='new.csv'):
    total_x = 0
    col7one_x = 0
    total_zeros = 0
    col7one_zeros = 0
    all_array = []
    col7one_array = []
    with open(infile, 'r') as fobj:
        reader = csv.reader(fobj)
        for line in reader:
            if line[10] == 'x':
                total_x += 1
                if line[6] == '1':
                    col7one_x += 1
            elif line[10] == '0':
                # assumes zero is represented as "0" and not as, say, "0.0"
                total_zeros += 1
                if line[6] == '1':
                    col7one_zeros += 1
            else:
                val = float(line[10])
                all_array.append(val)
                if line[6] == '1':
                    col7one_array.append(val)
    bins = numpy.arange(0, 1.05, 0.05)
    hist_all, edges = numpy.histogram(all_array, bins=bins)
    hist_col7one, edges = numpy.histogram(col7one_array, bins=bins)
    with open(outfile, 'w') as fobj:
        writer = csv.writer(fobj)
        writer.writerow([col7one_x, total_x])
        writer.writerow([col7one_zeros, total_zeros])
        for row in zip(hist_col7one, hist_all):
            writer.writerow(row)

if __name__ == '__main__':
    count()
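A note on usage: count() takes the input and output paths as arguments, so it can be pointed at the 800k-row file directly. The names below are placeholders, not the asker's real files:

count(infile='main_data.csv', outfile='new.csv')  # hypothetical paths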
Related
Removing value from a DataFrame column which repeats over 15 times
I'm working on forex data like this:

   0        1         2             3
1  AUD/JPY  20040101  00:01:00.000  80.598  80.598
2  AUD/JPY  20040101  00:02:00.000  80.595  80.595
3  AUD/JPY  20040101  00:03:00.000  80.562  80.562
4  AUD/JPY  20040101  00:04:00.000  80.585  80.585
5  AUD/JPY  20040101  00:05:00.000  80.585  80.585

I want to go through columns 2 and 3 and remove the rows in which the value is repeated for more than 15 times in a row. So far I managed to produce this piece of code:

price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1

price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1

print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)

Unfortunately when I check the output file there are some mistakes: there are some weekends which have not been removed by the program. How should I build my algorithm so it removes the duplicated values correctly? The first 250k rows of my initial dataset are available here: https://ufile.io/omg5h The output of this program for that sample data is available here: https://ufile.io/2gc3d You can see that in the output file the rows 6931+ were not successfully removed.
The problem with your algorithm is that you are not holding specific counter values for the row values, but rather incrementing the counter through the loop. This causes the result to be false, I believe. Also, the comparison r.iloc[2] != price does not make sense, because you are changing the value of price every iteration, so if there are elements between the duplicates, this check does not serve a proper function. I wrote a small piece of code to copy the behavior you asked for:

import pandas as pd

df = pd.DataFrame([[0, 0.5, 2.5], [0, 1, 2], [0, 1.5, 2.5], [0, 2, 3],
                   [0, 2, 3], [0, 3, 4], [0, 4, 5]], columns=['A', 'B', 'C'])
df_new = df
counts = {}
print('Initial DF')
print(df)
print()
for i, r in df.iterrows():
    counter = counts.get(r.iloc[1])
    if counter is None:
        counter = 0
    counts[r.iloc[1]] = counter + 1
    if counts[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]

print('2nd col. deleted DF')
print(df_new)
print()

df_fin = df_new
counts2 = {}
for i, r in df_new.iterrows():
    counter = counts2.get(r.iloc[2])
    if counter is None:
        counter = 0
    counts2[r.iloc[2]] = counter + 1
    if counts2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]

print('3rd col. deleted DF')
print(df_fin)

Here, I hold the counter value for each unique value in the rows of columns 2 and 3. Then, according to the threshold (which is 2 in this case), I remove the rows which exceed the threshold. I first eliminate values according to the 2nd column, then forward this modified array to the next loop, eliminate values according to the 3rd column, and finish the process.
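Since the original question asks about values repeated consecutively ("more than 15 times in a row"), a vectorized pandas alternative may also be worth sketching. This is a hedged sketch under that assumption, not the answerer's method: it treats a "repeat" as a run of consecutive equal values and drops rows belonging to runs longer than 15:

import pandas as pd

def drop_long_runs(df, col, max_run=15):
    """Drop rows whose value in `col` sits in a run of more than `max_run` consecutive equal values."""
    run_id = (df[col] != df[col].shift()).cumsum()       # new id each time the value changes
    run_len = df.groupby(run_id)[col].transform('size')  # length of the run each row belongs to
    return df[run_len <= max_run]

# e.g. filter on column 2, then on column 3 (the integer column labels are an assumption)
# df = drop_long_runs(drop_long_runs(df, 2), 3)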
Calculate while excluding -1's
I have an extremely large file of tab-delimited values with 10000+ values. I am trying to find the average of each row in the data and append these new values to a new file. However, values that weren't found are inputted in the large file as -1. Using the -1 values when calculating my averages will mess up my data. How can I exclude these values? The large file structure looks like this:

"HsaEX0029886"  100  -1   -1     100  100  100  100  100  100    -1     100  -1    100
"HsaEX0029895"  100  100  91.49  100  100  100  100  100  97.87  95.29  100  100   93.33
"HsaEX0029923"  0    0    0      -1   0    0    0    0    0      9.09   0    5.26  0

In my code I'm taking the last 3 elements and finding the average of just those 3 values. If the last 3 elements in the row are 85, 12, and -1, I need to return the average of 85 and 12. Here's my entire code:

with open("PSI_Datatxt.txt", 'rt') as data:
    next(data)
    lis = [line.strip("\n").split("\t") for line in data]  # create a list of lists (each row)
    for row in lis:
        x = float(row[11])
        y = float(row[12])
        z = float(row[13])
        avrg = ((x + y + z) / 3)
        with open("DataEditted", "a+") as newdata:
            if avrg == -1:
                continue  # skipping lines where all 3 values are -1
            else:
                newdata.write(str(avrg) + ' ' + '\n')

Thanks. Comment if any clarification is needed.
data = [float(x) for x in row[1:] if float(x) > -1]
if data:
    avg = sum(data)/len(data)
else:
    avg = 0  # or throw an exception; you had a row of all -1's

The first line is a fairly standard Pythonism... given an array (in this case row), you can iterate through the list and filter out stuff by using the for x in array if condition bit. If you wanted to only look at the last three values, you have two options depending on what you mean by last three:

data = [float(x) for x in row[-3:] if float(x) > -1]

will look at the last 3 and give you 0 to 3 values back depending on whether they're -1.

data = [float(x) for x in row[1:] if float(x) > -1][-3:]

will give you up to 3 of the last "good" values (if you have all or almost all -1 for a given row, it will be less than 3).
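A quick check of the filtering idiom with a made-up row (the gene ID in field 0 is skipped by the [1:] slice):

row = ["HsaEX0029886", "100", "-1", "85", "12"]
data = [float(x) for x in row[1:] if float(x) > -1]
print(sum(data) / len(data))  # -> 65.666..., the -1 is excluded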
Here it is in the same format as your original question. It lets you write an error message if the row is all -1's, or you can ignore it instead and write nothing:

with open("PSI_Datatxt.txt", 'r') as data:
    for row in data:
        # split the line into fields first; slicing the raw string would iterate characters
        vals = [float(val) for val in row.split()[1:] if float(val) != -1]
        with open("DataEditted", "a+") as newdata:
            try:
                newdata.write(str(sum(vals)/len(vals)) + ' ' + '\n')
            except ZeroDivisionError:
                newdata.write("My Error Message Here\n")
This should do it:

import csv

def average(L):
    L = [i for i in map(float, L) if i != -1]
    if not L:
        return None
    return sum(L)/len(L)

with open('path/to/input/file') as infile, open('path/to/output/file', 'w') as fout:
    outfile = csv.writer(fout, delimiter='\t')
    for name, *vals in csv.reader(infile, delimiter='\t'):
        outfile.writerow((name, average(vals)))
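A couple of spot checks for average() with made-up inputs (not from the real file):

print(average(["100", "-1", "85", "12"]))  # -> 65.666... (the -1 is ignored)
print(average(["-1", "-1", "-1"]))         # -> None (every value was missing)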
How can I save a corresponding value in a row after I separate data from another column?
To add to my last question: I want to separate values from one column and make them into separate lists. When I do that, how can I save the corresponding values from another column? Example:

depth: 90, 88, 56, 33, 22, 5
Corresponding irrad values: 0.01, 0.03, 0.02, 0.02, 0.04, 0.005

So in each row there is 90 and 0.01, 88 and 0.03, and so on. I want to separate my depth data based on certain criteria. When I do so, how can I save the corresponding value in the same row that I decide to keep? My last question: How can I sort through a range in a csv file in python? My current code:

import csv
import numpy as np
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('C:\\Users\\AdamStoer\\Documents\\practicedata2.csv') as file:
    reader = csv.DictReader(file, delimiter=',')  # read rows into a dictionary format

    def split_my_list(column_data):
        # here I am splitting my depth data into a dictionary...
        # I split the values from each key into a separate list
        group = 0
        temp = []
        splited_list = {}
        lengh = len(column_data)
        for i in range(lengh):
            if not i == lengh-1:
                if column_data[i] > column_data[i+1]:
                    temp.append(column_data[i])
                    columns['irrad2'].append(row['sci_ocr504i_irrad2'])
                else:
                    temp.append(column_data[i])
                    group += 1
                    splited_list.update({str(group): temp})
                    temp = []
            else:
                if column_data[i] < column_data[-2]:
                    temp.append(column_data[i])
                    group += 1
                    splited_list.update({str(group): temp})
                    break
                else:
                    group += 1
                    splited_list.update({str(group): [[i]]})
                    break
        return splited_list

    for row in reader:  # right here I am sorting the values based on criteria
        r = float(row['roll'])
        p = float(row['pitch'])
        if 0.21 <= p <= 0.31:
            if -0.06 <= r <= 0.06:
                columns['i_depth'].append(row['i_depth'])
                columns['irrad2'].append(row['sci_ocr504i_irrad2'])

depth = columns['i_depth']
split_depth = split_my_list(depth)  # I split my data here
irrad = columns['irrad2']

lst = []
for depthvalues in split_depth.values():  # I make separate lists here and do my main math here
    print(depthvalues)
    depthlst = depthvalues
    depthfirst = depthlst[0]
    depthlast = depthlst[-1]
    print(depthfirst + " " + depthlast)

I want to do a similar thing with the irrad values, but they are not split into lists like depth; they are still in one big list, so they won't correspond with the depth values I need.

irradlst = columns['irrad2']
irradfirst = irradlst[0]
irradlast = irradlst[-1]
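One way to keep the irrad values aligned with the depths, sketched here as a guess at the intent (the splitting criterion is simplified to "depth stopped decreasing", and the columns lists from the code above are assumed): split (depth, irrad) pairs together instead of splitting the depth list alone.

# hedged sketch: group (depth, irrad) pairs into runs of decreasing depth
pairs = list(zip(columns['i_depth'], columns['irrad2']))
groups, current, prev = [], [], None
for depth, irrad in pairs:
    d = float(depth)
    if prev is not None and d > prev:  # depth started increasing: begin a new group
        groups.append(current)
        current = []
    current.append((depth, irrad))
    prev = d
if current:
    groups.append(current)

for grp in groups:
    (dfirst, ifirst), (dlast, ilast) = grp[0], grp[-1]
    print(dfirst, ifirst, dlast, ilast)  # first/last depth with its matching irrad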
Python: count values within defined intervals
I import data from a CSV which looks like this:

3.13
3.51
3.51
4.01
2.13
1.13
1.13
1.13
1.63
1.88

What I would like to do now is to COUNT the values within those intervals: 0-1, 1-2, 2-3, >3. So the result would be:

0-1: 0
1-2: 5
2-3: 1
>3: 4

Apart from this main task I would like to calculate the outcome as a percentage of the total count (e.g. 0-1: 0%, 1-2: 50%, ...). I am quite new to Python so I got stuck in my attempts at solving this. Maybe there is a predefined function for solving this I don't know of? Thanks a lot for your help!

+++ UPDATE: +++

Thanks for all the replies. I have tested a bunch of them, but I am doing something wrong with reading the CSV file, I guess. Referring to the code snippets using a, b, c, d for the different intervals, these variables always stay 0 for me. Here is my actual code:

import csv

a = b = c = 0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        if row in range(0, 1):
            a += 1
        elif row in range(1, 2):
            b += 1
print a, b

I also converted all values in the CSV to integers, without success. In the CSV there is just one single column. Any ideas what I am doing wrong?
Here's how to do it in a very concise way with numpy:

import sys
import csv
import numpy as np

with open('winter.csv') as csvfile:
    field = 0  # (zero-based) field/column number containing the required values
    float_list = [float(row[field]) for row in csv.reader(csvfile)]
    #float_list = [3.13, 3.51, 3.51, 4.01, 2.13, 1.13, 1.13, 1.13, 1.63, 1.88]

hist, bins = np.histogram(float_list, bins=[0, 1, 2, 3, sys.maxint])
bin_counts = zip(bins, bins[1:], hist)  # [(bin_start, bin_end, count), ... ]
for bin_start, bin_end, count in bin_counts[:-1]:
    print '{}-{}: {}'.format(bin_start, bin_end, count)

# different output required for last bin
bin_start, bin_end, count = bin_counts[-1]
print '>{}: {}'.format(bin_start, count)

Which outputs:

0-1: 0
1-2: 5
2-3: 1
>3: 4

Most of the effort is in massaging the data for output. It's also quite flexible, as it is easy to use different intervals by changing the bins argument to np.histogram(), e.g. add another interval by changing bins:

hist, bins = np.histogram(float_list, bins=[0, 1, 2, 3, 4, sys.maxint])

which outputs:

0-1: 0
1-2: 5
2-3: 1
3-4: 3
>4: 1
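The question also asked for each interval as a percent of the total. A minimal, hedged addition that reuses hist and bin_counts from the first snippet above (and, like that snippet, assumes Python 2's list-returning zip):

total = float(hist.sum())
for bin_start, bin_end, count in bin_counts[:-1]:
    print('{}-{}: {:.0f}%'.format(bin_start, bin_end, 100 * count / total))
bin_start, bin_end, count = bin_counts[-1]
print('>{}: {:.0f}%'.format(bin_start, 100 * count / total))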
This should do, provided the data from the CSV is in values:

from collections import defaultdict

# compute a histogram
histogram = defaultdict(lambda: 0)
interval = 1.
max = 3
for v in values:
    bin = int(v / interval)
    bin = max if bin >= max else bin
    histogram[bin] += 1

# output
total = sum(histogram.values())
for k, v in sorted(histogram.items()):
    share = 100. * v / total
    if k >= max:
        print "{}+ : {}, {}%".format(k, v, share)
    else:
        print "{}-{}: {}, {}%".format(k, k+interval, v, share)
import csv

a = b = c = d = 0
with open('cf.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for row in spamreader:
        if 0 < float(row[0]) < 1:
            a += 1
        elif 1 < float(row[0]) < 2:
            b += 1
        elif 2 < float(row[0]) < 3:
            c += 1
        if 3 < float(row[0]):
            d += 1

print "0-1:{} \n 1-2:{} \n 2-3:{} \n >3:{}".format(a, b, c, d)

Output:

0-1: 0
1-2: 5
2-3: 1
>3: 4

Because the rows are lists, we use the [0] index to access our data, and we convert the string to a float with the float() function.
After you get the entries into a list:

zero_to_one = 0
one_to_two = 0
two_to_three = 0
over_three = 0

for i in values:
    if 0 <= i < 1:
        zero_to_one += 1
    elif 1 <= i < 2:
        one_to_two += 1
    # so on and so forth...

And to find the breakdown:

total_values = zero_to_one + one_to_two + two_to_three + over_three
perc_0_to_1 = 100.0 * zero_to_one / total_values
perc_1_to_2 = 100.0 * one_to_two / total_values
perc_2_to_3 = 100.0 * two_to_three / total_values
perc_over_3 = 100.0 * over_three / total_values

+++++ Response to Update +++++++

import csv

a = b = c = 0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        for i in row:
            i = float(i.strip())  # .strip() removes blank spaces before converting to float
            if 0 <= i < 1:
                a += 1
            elif 1 <= i < 2:
                b += 1
            # add more elif statements here as desired.

Hope that works. Side note, I like that a=b=c=0 thing. Didn't realize you could do that after all this time haha.
Help with an if else loop in python
Hi, here is my problem. I have a program that calculates the averages of data in columns. Example:

Bob
1
2
3

the output is Bob 2. Some of the data has 'NA's, so for Joe:

Joe
NA
NA
NA

I want this output to be NA, so I wrote an if/else loop. The problem is that it doesn't execute the second part of the loop and just prints out one NA. Any suggestions? Here is my program:

with open('C://achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    numRowsPerColumn = [0] * len(columns)  # this figures out the number of columns
    for line in f:
        # Skip empty lines since I was getting that error before
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            try:
                # this is the whole strings to math numbers thing
                sums[i] += float(values[i])
                numRowsPerColumn[i] += 1
            except ValueError:
                continue

with open('c://chipdone.txt', 'w') as ouf:
    for i in xrange(len(columns)):
        if numRowsPerColumn[i] == 0:
            print 'NA'
        else:
            print>>ouf, columns[i], sums[i] / numRowsPerColumn[i]  # this is the average calculator

The file looks like so:

Joe Bob Sam
1   2   NA
2   4   NA
3   NA  NA
1   1   NA

and the final output is the names and the averages:

Joe Bob Sam
1.5 1.5 NA

Ok, I tried Roger's suggestion and now I have this error:

Traceback (most recent call last):
  File "C:/avy14.py", line 5, in <module>
    for line in f:
ValueError: I/O operation on closed file

Here is this new code:

with open('C://achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    sums = [0] * len(columns)
    rows = 0
    for line in f:
        line = line.strip()
        if not line:
            continue
        rows += 1
        for col, v in enumerate(line.split()):
            if sums[col] is not None:
                if v == "NA":
                    sums[col] = None
                else:
                    sums[col] += int(v)

with open("c:/chipdone.txt", "w") as out:
    for name, sum in zip(columns, sums):
        print >>out, name,
        if sum is None:
            print >>out, "NA"
        else:
            print >>out, sum / rows
with open("c:/achip.txt", "rU") as f: columns = f.readline().strip().split() sums = [0.0] * len(columns) row_counts = [0] * len(columns) for line in f: line = line.strip() if not line: continue for col, v in enumerate(line.split()): if v != "NA": sums[col] += int(v) row_counts[col] += 1 with open("c:/chipdone.txt", "w") as out: for name, sum, rows in zip(columns, sums, row_counts): print >>out, name, if rows == 0: print >>out, "NA" else: print >>out, sum / rows I'd also use the no-parameter version of split when getting the column names (it allows you to have multiple space separators). Regarding your edit to include input/output sample, I kept your original format and my output would be: Joe 1.75 Bob 2.33333333333 Sam NA This format is 3 rows of (ColumnName, Avg) columns, but you can change the output if you want, of course. :)
Using numpy:

import numpy as np

with open('achip.txt') as f:
    names = f.readline().split()
    arr = np.genfromtxt(f)

print(arr)
# [[  1.   2.  NaN]
#  [  2.   4.  NaN]
#  [  3.  NaN  NaN]
#  [  1.   1.  NaN]]
print(names)
# ['Joe', 'Bob', 'Sam']
print(np.ma.mean(np.ma.masked_invalid(arr), axis=0))
# [1.75 2.33333333333 --]
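On newer numpy (1.8+), np.nanmean is a shorter route to the same column means, assuming the arr parsed above; the all-NaN column comes out as nan (with a RuntimeWarning) rather than the masked --:

print(np.nanmean(arr, axis=0))  # -> [1.75 2.33333333 nan]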
Using your original code, I would add one loop and edit the print statement:

with open(r'C:\achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    numRowsPerColumn = [0] * len(columns)  # this figures out the number of columns
    for line in f:
        # Skip empty lines since I was getting that error before
        if not line.strip():
            continue
        values = line.split(" ")
        ### This removes any '' elements caused by having two spaces, like
        ### in the last line of your example chip file above
        for count, v in enumerate(values):
            if v == '':
                values.pop(count)
        ### (End of Addition)
        for i in xrange(len(values)):
            try:
                # this is the whole strings to math numbers thing
                sums[i] += float(values[i])
                numRowsPerColumn[i] += 1
            except ValueError:
                continue

with open('c://chipdone.txt', 'w') as ouf:
    for i in xrange(len(columns)):
        if numRowsPerColumn[i] == 0:
            print>>ouf, columns[i], 'NA'  # Just add the extra parts
        else:
            print>>ouf, columns[i], sums[i] / numRowsPerColumn[i]

This solution also gives the same result in Roger's format, not your intended format.
The solution below is cleaner and has fewer lines of code:

import pandas as pd

# read the file into a DataFrame using read_csv
df = pd.read_csv('C://achip.txt', sep=r"\s+")

# compute the average of each column
avg = df.mean()

# save the computed averages to the output file
avg.to_csv("c:/chipdone.txt")

The key to the simplicity of this solution is the way the input text file is read into a DataFrame. Pandas read_csv allows you to use regular expressions for specifying the sep/delimiter argument. In this case, we used the "\s+" regex pattern to take care of having one or more spaces between columns. Once the data is in a DataFrame, computing the average and saving it to a file can all be done with straightforward pandas functions.
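For what it's worth, this works because read_csv treats the literal string NA as missing by default, and DataFrame.mean() skips NaN values (skipna=True), so Sam's all-NA column simply comes out as NaN:

print(df.mean())  # Joe 1.75, Bob 2.333..., Sam NaN (NaNs are skipped by default)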