Need help understanding why this value is staying as 1? (Python CSV)

So this block of code is supposed to open the CSV file and get the values from columns 1-3 (not 0). Once it has the values for each row's three columns, it is supposed to add these values up and divide by 3. I thought this code would work, however the addition of the three columns in each row doesn't seem to be working. If anyone could tell me why, and how I can fix this, that would be great, thank you. I'm pretty certain the problem lies at the for index, summedValue in enumerate(sums): line, specifically the summedValue value.
if order == ("average score"):
    askclass = str(input("what class?"))
    if askclass == ('1'):
        with open("Class1.csv") as f:
            columns = f.readline().strip().split(" ")
            sums = [1] * len(columns)
            for line in f:
                # Skip empty lines
                if not line.strip():
                    continue
                values = line.split(" ")
                for i in range(1, len(values)):
                    sums[i] += int(values[i])
            for index, summedValues in enumerate(sums):
                print(columns[index], 1.0 * summedValues / 3)

from statistics import mean
import csv

with open("Class1.csv") as f:
    # create reader object
    r = csv.reader(f)
    # skip headers
    headers = next(r)
    # extract the name from each row and use statistics.mean to average row[1:],
    # mapping the scores to ints
    avgs = ((row[0], mean(map(int, row[1:]))) for row in r)
    # unpack name and average and print
    for name, avg in avgs:
        print(name, avg)
Unless you have written empty lines to your csv file, there won't be any. I'm not sure how the header fits into it, but you can use it if necessary.
You can also unpack with the * syntax in python 3 which I think is a bit nicer:
avgs = ((name, mean(map(int, row))) for name, *row in r)
for name, avg in avgs:
    print(name, avg)
To order just sort by the average using reverse=True to sort from highest to lowest:
from statistics import mean
import csv
from operator import itemgetter

with open("Class1.csv") as f:
    r = csv.reader(f)
    avgs = sorted(((name, mean(map(int, row))) for name, *row in r),
                  key=itemgetter(1), reverse=True)
    for name, avg in avgs:
        print(name, avg)
Passing key=itemgetter(1) means we sort by the second subelement which is the average in each tuple.
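As a tiny illustration (a minimal sketch with a made-up tuple), itemgetter(1) builds a callable that fetches index 1 of whatever it is given:

from operator import itemgetter

get_avg = itemgetter(1)
print(get_avg(("Bob", 2.5)))  # 2.5 -- sorted() compares the tuples by this value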

using
1, 2, 3
4, 2, 3
4, 5, 3
1, 6, 3
1, 6, 6
6, 2, 3
as Class1.csv
and
askclass = str(input("what class?"))
if askclass == ('1'):
    with open("Class1.csv") as f:
        columns = f.readline().strip().split(",")
        sums = [1] * len(columns)
        for line in f:
            # Skip empty lines
            if not line.strip():
                continue
            values = line.split(",")
            for i in range(1, len(values)):
                sums[i] += int(values[i])
        for index, summedValues in enumerate(sums):
            print(columns[index], 1.0 * summedValues / 3)
I obtain the expected result:
what class?1
('1', 0.3333333333333333)
(' 2', 7.333333333333333)
(' 3', 6.333333333333333)
[update] Observations:
sums, defined as sums = [1] * len(columns), has one entry per column, but since your operations ignore the first column, sums[0] will always stay 1; initializing to 1 does not seem necessary, use 0 instead.
For float division it is sufficient to write summedValues / 3.0 instead of 1.0 * (summedValues) / 3.
Maybe this is what you want:
for line in f:
    # Skip empty lines
    if not line.strip():
        continue
    values = line.split(" ")
    for i in range(1, len(values)):
        sums[i] += int(values[i])
for index, summedValues in enumerate(sums):
    print(columns[index], 1.0 * summedValues / 3)
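Putting the observations together, a minimal corrected sketch (assuming the comma-separated sample above, and keeping the divide-by-3 from the question) would be:

with open("Class1.csv") as f:
    columns = f.readline().strip().split(",")
    sums = [0] * len(columns)  # start from 0 so nothing is over-counted
    for line in f:
        if not line.strip():
            continue
        values = line.split(",")
        for i in range(1, len(values)):
            sums[i] += int(values[i])
    for index, summedValues in enumerate(sums):
        print(columns[index], summedValues / 3.0)  # /3.0 gives float division directly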

Related

Splitting a special type of list data and saving it into two separate dataframes using a condition in python

I want to separate list data into two parts based on a condition: if the value is less than "H1000", it goes in the first dataframe (output for list 1), and if it is greater than or equal to "H1000", it goes in the second dataframe (output for list 2). The first column starts each value with H followed by four digits.
Here is my python code:
with open(fn) as f:
    text = f.read().strip()
    print(text)
lines = [[(Path(fn.name), line_no + 1, col_no + 1, cell)
          for col_no, cell in enumerate(re.split('\t', l.strip())) if cell != '']
         for line_no, l in enumerate(re.split(r'[\r\n]+', text))]
print(lines)
if (lines[:][:][3] == "H1000"):
    list1
    list2
I am not able to write a python logic to divide the list data into two parts.
So basically you want to check whether the number after the H is greater than 1000 or not, right? If so, just do it like this:
with open(fn) as f:
    text = f.read().strip()
    print(text)
lines = [[(Path(fn.name), line_no + 1, col_no + 1, cell)
          for col_no, cell in enumerate(re.split('\t', l.strip())) if cell != '']
         for line_no, l in enumerate(re.split(r'[\r\n]+', text))]
print(lines)
value = lines[:][:][3]
if value[1:].isdigit():
    if int(value[1:]) < 1000:
        pass  # goes into list 1
    else:
        pass  # goes into list 2
We simply take the numerical part of the "Hxxxx" value with a slice, convert it into an integer, and compare it with 1000:
with open(fn) as f:
    text = f.read().strip()
lines = text.split('\n')
list1 = []
list2 = []
for i in lines:
    if int(i.split(' ')[0].replace("H", "")) >= 1000:
        list2.append(i)
    else:
        list1.append(i)
print(list1)
print("***************************************")
print(list2)
I'm not sure exactly where the problem lies. Assuming you read the above text file line by line, you can simply make use of str.__le__ to check your condition, e.g.
lines = """
H0002 Version 3
H0003 Date_generated 5-Aug-81
H0004 Reporting_period_end_date 09-Jun-99
H0005 State WAA
H0999 Tene_no/Combined_rept_no E79/38975
H1001 Tene_holder Magnetic Resources NL
""".strip().split("\n")
# Or
# with open(fn) as f: lines = f.readlines()
list_1, list_2 = [], []
for line in lines:
if line[:6] <= "H1000":
list_1.append(line)
else:
list_2.append(line)
print(list_1, list_2, sep="\n")
# ['H0002 Version 3', 'H0003 Date_generated 5-Aug-81', 'H0004 Reporting_period_end_date 09-Jun-99', 'H0005 State WAA', 'H0999 Tene_no/Combined_rept_no E79/38975']
# ['H1001 Tene_holder Magnetic Resources NL']
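This works because the H-codes are fixed-width and zero-padded, so lexicographic string comparison matches their numeric order; a quick sanity check:

print("H0999" <= "H1000")  # True
print("H1001" <= "H1000")  # False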

Converting string to float

I am trying to write a program that tallies the values in a file. For example, I am given a file with numbers like this
2222 (First line)
4444 (Second line)
1111 (Third line)
My program takes the name of an input file (e.g. file.txt) and the column of numbers to tally. So, for example, if file.txt contains the numbers above and I need the sum of column 2, my function should be able to print out 7 (2+4+1).
t1 = open(argv[1], "r")
number = argv[2]
k = 0
while True:
    n = int(number)
    t = t1.readline()
    z = list(t)
    if t == "":
        break
    k += float(z[n])
t1.close()
print k
This code works for the first column when I set it to 0, but it doesn't return a consistent result when I set it to 1, even though both columns should give the same answer.
Any thoughts?
A somewhat uglier implementation that demonstrates the cool-factor of zip:
def sum_col(filename, colnum):
    with open(filename) as inf:
        columns = zip(*[line.strip() for line in inf])
        return sum([int(num) for num in list(columns)[colnum]])
zip(*iterable) flips from row-wise to columnwise, so:
iterable = ['aaa','bbb','ccc','ddd']
zip(*iterable) == ['abcd','abcd','abcd'] # kind of...
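To be exact, here is a quick runnable check; each "column" comes back as a tuple of characters rather than a string:

iterable = ['aaa', 'bbb', 'ccc', 'ddd']
print(list(zip(*iterable)))
# [('a', 'b', 'c', 'd'), ('a', 'b', 'c', 'd'), ('a', 'b', 'c', 'd')]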
zip objects aren't subscriptable, so we need to cast as list before we subscript it (doing [colnum]). Alternatively we could do:
...
for _ in range(colnum):
    next(columns)  # skip the columns we don't need (range(colnum), not colnum - 1)
return sum([int(num) for num in next(columns)])
Or just calculate all the sums and grab the sum that we need:
...
col_sums = [sum(int(num) for num in column) for column in columns]
return col_sums[colnum]

Python: count values within defined intervals

I import data from a CSV which looks like this:
3.13
3.51
3.51
4.01
2.13
1.13
1.13
1.13
1.63
1.88
What I would like to do now is to COUNT the values within those intervals:
0-1, 1-2, 2-3, >3
So the result would be
0-1: 0
1-2: 5
2-3: 1
>3: 4
Apart from this main task I would like to calculate the outcome into percent of total numbers (e.g. 0-1: 0%, 1-2: 50%,...)
I am quite new to Python, so I got stuck in my attempts at solving this. Maybe there is a predefined function for this that I don't know of?
Thanks a lot for your help!!!
+++ UPDATE: +++
Thanks for all the replies.
I have tested a bunch of them, but I guess I'm doing something wrong with reading the CSV file. Referring to the code snippets that use a, b, c, d for the different intervals: these variables always stay '0' for me.
Here is my actual code:
import csv

a = b = c = 0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        if row in range(0, 1):
            a += 1
        elif row in range(1, 2):
            b += 1
print a, b
I also converted all values in the CSV to integers, without success. In the CSV there is just one single column.
Any ideas what I am doing wrong?
Here's how to do it in a very concise way with numpy:
import sys
import csv
import numpy as np

with open('winter.csv') as csvfile:
    field = 0  # (zero-based) field/column number containing the required values
    float_list = [float(row[field]) for row in csv.reader(csvfile)]
    # float_list = [3.13, 3.51, 3.51, 4.01, 2.13, 1.13, 1.13, 1.13, 1.63, 1.88]

hist, bins = np.histogram(float_list, bins=[0, 1, 2, 3, sys.maxint])
bin_counts = zip(bins, bins[1:], hist)  # [(bin_start, bin_end, count), ... ]
for bin_start, bin_end, count in bin_counts[:-1]:
    print '{}-{}: {}'.format(bin_start, bin_end, count)
# different output required for last bin
bin_start, bin_end, count = bin_counts[-1]
print '>{}: {}'.format(bin_start, count)
Which outputs:
0-1: 0
1-2: 5
2-3: 1
>3: 4
Most of the effort is in massaging the data for output.
It's also quite flexible as it is easy to use different intervals by changing the bins argument to np.histogram(), e.g. add another interval by changing bins:
hist, bins = np.histogram(float_list, bins=[0,1,2,3,4,sys.maxint])
outputs:
0-1: 0
1-2: 5
2-3: 1
3-4: 3
>4: 1
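For the percentage part of the question, a small addition (a sketch assuming the hist array from the original four-bin example) normalises the counts against the total:

percentages = 100.0 * hist / hist.sum()
print percentages
# [  0.  50.  10.  40.]  i.e. 0-1: 0%, 1-2: 50%, 2-3: 10%, >3: 40%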
This should do, provided the data from the CSV is in values:
from collections import defaultdict

# compute a histogram
histogram = defaultdict(lambda: 0)
interval = 1.
max = 3
for v in values:
    bin = int(v / interval)
    bin = max if bin >= max else bin
    histogram[bin] += 1

# output
sum = sum(histogram.values())
for k, v in sorted(histogram.items()):
    share = 100. * v / sum
    if k >= max:
        print "{}+ : {}, {}%".format(k, v, share)
    else:
        print "{}-{}: {}, {}%".format(k, k + interval, v, share)
import csv

a = b = c = d = 0
with open('cf.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for row in spamreader:
        if 0 < float(row[0]) < 1:
            a += 1
        elif 1 < float(row[0]) < 2:
            b += 1
        elif 2 < float(row[0]) < 3:
            c += 1
        if 3 < float(row[0]):
            d += 1
print "0-1:{} \n 1-2:{} \n 2-3:{} \n >3:{}".format(a, b, c, d)
Output:
0-1:0
1-2:5
2-3:1
>3:4
Because the rows are lists, we use the [0] index to access the single value in each row, and convert the string to a float with the float() function.
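To see why the float() conversion is needed, note that csv.reader always yields each row as a list of strings, even for a single-column file; a minimal sketch (assuming the same cf.csv):

import csv

with open('cf.csv') as csvfile:
    row = next(csv.reader(csvfile))
print row            # ['3.13'] -- a list containing one string
print float(row[0])  # 3.13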
After you get the entries into a list:
zero_to_one = 0   # note: identifiers can't start with a digit, so 0_to_1 won't parse
one_to_two = 0
two_to_three = 0
ovr_3 = 0
for i in values:          # assuming the entries were read into a list called values
    if 0 <= i < 1:        # range() only contains ints, so compare the float directly
        zero_to_one += 1
    elif 1 <= i < 2:
        one_to_two += 1
So on and so forth...
And to find the breakdown:
total_values = zero_to_one + one_to_two + two_to_three + ovr_3
perc_0_to_1 = zero_to_one / float(total_values) * 100
perc_1_to_2 = one_to_two / float(total_values) * 100
perc_2_to_3 = two_to_three / float(total_values) * 100
perc_ovr_3 = ovr_3 / float(total_values) * 100
+++++ Response to Update +++++++
import csv

a = b = c = 0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        for i in row:
            i = float(i.strip())  # .strip() removes blank spaces before converting to float
            if 0 <= i < 1:        # compare the converted float, not the row
                a += 1
            elif 1 <= i < 2:
                b += 1
            # add more elif statements here as desired.
Hope that works.
Side note, I like that a=b=c=0 thing. Didn't realize you could do that after all this time, haha.

Time complexity of a huge list in python 2.7

I've a list which has approximately 177071007 items.
and I'm trying to perform the following operations:
a) get the first and last occurrence of a unique item in the list.
b) the number of occurrences.
def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')
    pc = list()
    rd = list()
    deltas = list()
    reoccurance = list()
    try:
        for row in ins:
            pc.append(int(row[0]))
            rd.append(int(row[1]))
    except:
        print row
        pass
    unique_pc = set(pc)
    unique_pc = list(unique_pc)
    print "closing file"
    # takes a long time from here!
    for a in range(0, len(unique_pc)):
        index_first_occurance = pc.index(unique_pc[a])
        index_last_occurance = len(pc) - 1 - pc[::-1].index(unique_pc[a])
        delta_rd = rd[index_last_occurance] - rd[index_first_occurance]
        deltas.append(int(delta_rd))
        reoccurance.append(pc.count(unique_pc[a]))
        print unique_pc[a], delta_rd, reoccurance[a]
    print "printing to file"
    map_file = open(op_file_test, 'a')
    for a in range(0, len(unique_pc)):
        print >>map_file, "%d, %d, %d" % (unique_pc[a], deltas[a], reoccurance[a])
    map_file.close()
However, each call to pc.index(), pc[::-1].index() and pc.count() scans the whole list, so every pass through the loop is itself O(n) and the loop as a whole is quadratic.
Would there be a possibility to make the for loop 'run fast'? By that I mean, do you think yielding would make it fast, or is there any other way? Unfortunately, I don't have numpy.
Try the following:
import csv
from collections import defaultdict

# Keep a dictionary of our rd and pc values, with the value as a list of the
# line numbers each occurs on, e.g. {'10': [1, 45, 79]}
pc_elements = defaultdict(list)
rd_elements = defaultdict(list)

with open(file, 'rb') as f:
    line_number = 0
    csvin = csv.reader(f, delimiter='\t')
    for row in csvin:
        try:
            pc_elements[int(row[0])].append(line_number)
            rd_elements[int(row[1])].append(line_number)
            line_number += 1
        except ValueError:
            print("Not a number")
            print(row)
            line_number += 1
            continue

for pc, indexes in pc_elements.iteritems():
    print("pc {0} appears {1} times. First on row {2}, last on row {3}".format(
        pc,
        len(indexes),
        indexes[0],
        indexes[-1]
    ))
This works by creating a dictionary when reading the TSV, with the pc value as the key and a list of occurrences as the value. By the nature of a dict the key must be unique, so we avoid the set, and the list values are only used to keep the rows that key occurs on.
Example:
pc_elements = {10: [4, 10, 18, 101], 8: [3, 12, 13]}
would output:
"pc 10 appears 4 times. First on row 4, last on row 101"
"pc 8 appears 3 times. First on row 3, last on row 13"
As you scan items from your input file, put them into a collections.defaultdict(list) where the key is the item and the value is a list of occurrence indices. It takes linear time to read the file and build up this data structure, constant time to get the first and last occurrence index of an item, and constant time to get the number of occurrences of an item.
Here's how it might work
mydict = collections.defaultdict(list)
for item, index in itemfilereader:  # O(n)
    mydict[item].append(index)

# first occurrence of item, O(1)
mydict[item][0]
# last occurrence of item, O(1)
mydict[item][-1]
# number of occurrences of item, O(1)
len(mydict[item])
Maybe it's worth changing the data structure used. I'd use a dict that uses pc as key and the list of occurrences as value:
lookup = {}
counter = 0
for line in ins:
    values = lookup.setdefault(int(line[0]), [])
    values.append((counter, int(line[1])))
    counter += 1

for key, val in lookup.iteritems():
    first_occurence = val[0][0]
    value_of_first_occurence = val[0][1]
    last_occurence = val[-1][0]
    value_of_last_occurence = val[-1][1]
    number_of_occurences = len(val)
Try replacing lists by dicts; lookup in a dict is much faster than in a long list.
That could be something like this:
def parse_data(file, op_file_test):
    ins = csv.reader(open(file, 'rb'), delimiter='\t')
    # Dict of pc -> [rd first occurence, rd last occurence, occurence indices...]
    occurences = {}
    for i, row in enumerate(ins):  # csv readers have no len(), so enumerate them
        try:
            pc = int(row[0])
            rd = int(row[1])
        except:
            print row
            continue
        if pc not in occurences:
            occurences[pc] = [rd, rd, i]
        else:
            occurences[pc][1] = rd
            occurences[pc].append(i)
    # (Remove the sorted if you don't need them sorted but need them faster)
    for value in sorted(occurences.keys()):
        print "value: %d, delta: %d, occurences: %s" % (
            value, occurences[value][1] - occurences[value][0],
            ", ".join(str(x) for x in occurences[value][2:]))

Help with an if else loop in python

Hi, here is my problem. I have a program that calculates the averages of data in columns.
Example
Bob
1
2
3
the output is
Bob
2
Some of the data has 'na's
So for Joe
Joe
NA
NA
NA
I want this output to be NA
so I wrote an if else loop
The problem is that it doesn't execute the second part of the loop and just prints out one NA. Any suggestions?
Here is my program:
with open('C://achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    numRowsPerColumn = [0] * len(columns)  # this figures out the number of columns
    for line in f:
        # Skip empty lines since I was getting that error before
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            try:  # this is the whole strings to math numbers thing
                sums[i] += float(values[i])
                numRowsPerColumn[i] += 1
            except ValueError:
                continue

with open('c://chipdone.txt', 'w') as ouf:
    for i in xrange(len(columns)):
        if numRowsPerColumn[i] == 0:
            print 'NA'
        else:
            print >>ouf, columns[i], sums[i] / numRowsPerColumn[i]  # this is the average calculator
The file looks like so:
Joe Bob Sam
1 2 NA
2 4 NA
3 NA NA
1 1 NA
and final output is the names and the averages
Joe Bob Sam
1.5 1.5 NA
Ok I tried Roger's suggestion and now I have this error:
Traceback (most recent call last):
File "C:/avy14.py", line 5, in
for line in f:
ValueError: I/O operation on closed file
Here is this new code:
with open('C://achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    sums = [0] * len(columns)
    rows = 0
for line in f:
    line = line.strip()
    if not line:
        continue
    rows += 1
    for col, v in enumerate(line.split()):
        if sums[col] is not None:
            if v == "NA":
                sums[col] = None
            else:
                sums[col] += int(v)

with open("c:/chipdone.txt", "w") as out:
    for name, sum in zip(columns, sums):
        print >>out, name,
        if sum is None:
            print >>out, "NA"
        else:
            print >>out, sum / rows
with open("c:/achip.txt", "rU") as f:
columns = f.readline().strip().split()
sums = [0.0] * len(columns)
row_counts = [0] * len(columns)
for line in f:
line = line.strip()
if not line:
continue
for col, v in enumerate(line.split()):
if v != "NA":
sums[col] += int(v)
row_counts[col] += 1
with open("c:/chipdone.txt", "w") as out:
for name, sum, rows in zip(columns, sums, row_counts):
print >>out, name,
if rows == 0:
print >>out, "NA"
else:
print >>out, sum / rows
I'd also use the no-parameter version of split when getting the column names (it allows you to have multiple space separators).
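For example (a quick sketch with a made-up line), the no-argument form collapses runs of whitespace, while split(" ") produces empty strings:

line = "1 2  NA"       # note the double space
print line.split(" ")  # ['1', '2', '', 'NA']
print line.split()     # ['1', '2', 'NA']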
Regarding your edit to include input/output sample, I kept your original format and my output would be:
Joe 1.75
Bob 2.33333333333
Sam NA
This format is 3 rows of (ColumnName, Avg) columns, but you can change the output if you want, of course. :)
Using numpy:
import numpy as np

with open('achip.txt') as f:
    names = f.readline().split()
    arr = np.genfromtxt(f)

print(arr)
# [[  1.   2.  NaN]
#  [  2.   4.  NaN]
#  [  3.  NaN  NaN]
#  [  1.   1.  NaN]]
print(names)
# ['Joe', 'Bob', 'Sam']
print(np.ma.mean(np.ma.masked_invalid(arr), axis=0))
# [1.75 2.33333333333 --]
Using your original code, I would add one loop and edit the print statement
with open(r'C:\achip.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    numRowsPerColumn = [0] * len(columns)  # this figures out the number of columns
    for line in f:
        # Skip empty lines since I was getting that error before
        if not line.strip():
            continue
        values = line.split(" ")
        ### This removes any '' elements caused by having two spaces like
        ### in the last line of your example chip file above
        for count, v in enumerate(values):
            if v == '':
                values.pop(count)
        ### (End of Addition)
        for i in xrange(len(values)):
            try:  # this is the whole strings to math numbers thing
                sums[i] += float(values[i])
                numRowsPerColumn[i] += 1
            except ValueError:
                continue

with open('c://chipdone.txt', 'w') as ouf:
    for i in xrange(len(columns)):
        if numRowsPerColumn[i] == 0:
            print >>ouf, columns[i], 'NA'  # just add the extra parts
        else:
            print >>ouf, columns[i], sums[i] / numRowsPerColumn[i]
This solution also gives the same result in Roger's format, not your intended format.
The solution below is cleaner and has fewer lines of code ...
import pandas as pd

# read the file into a DataFrame using read_csv
df = pd.read_csv('C://achip.txt', sep="\s+")

# compute the average of each column
avg = df.mean()

# save computed average to output file
avg.to_csv("c:/chipdone.txt")
The key to the simplicity of this solution is the way the input text file is read into a DataFrame: pandas read_csv accepts a regular expression for the sep/delimiter argument, and in this case the "\s+" pattern takes care of having one or more spaces between columns.
Once the data is in a DataFrame, computing the average and saving it to a file are straightforward pandas operations.
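With the sample achip.txt above, the averages would look roughly like this (a sketch; pandas treats the literal string NA as NaN by default and skips it when computing the mean):

print(avg)
# Joe    1.750000
# Bob    2.333333
# Sam         NaN
# dtype: float64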
