Calculate while excluding -1's - python

I have an extremely large tab-delimited file with 10,000+ rows of values. I am trying to find the average of each row in the data and append these new values to a new file. However, values that weren't found are entered in the large file as -1. Using the -1 values when calculating my averages will mess up my data. How can I exclude these values?
The large file structure looks like this:
"HsaEX0029886" 100 -1 -1 100 100 100 100 100 100 -1 100 -1 100
"HsaEX0029895" 100 100 91.49 100 100 100 100 100 97.87 95.29 100 100 93.33
"HsaEX0029923" 0 0 0 -1 0 0 0 0 0 9.09 0 5.26 0
In my code I'm taking the last 3 elements and finding the average of just those 3 values. If the last 3 elements in the row are 85, 12, and -1, I need to return the average of 85 and 12. Here's my entire code:
with open("PSI_Datatxt.txt", 'rt') as data:
next(data)
lis = [line.strip("\n").split("\t") for line in data] # create a list of lists(each row)
for row in lis:
x = float(row[11])
y = float(row[12])
z = float(row[13])
avrg = ((x + y + z) / 3)
with open("DataEditted","a+") as newdata:
if avrg == -1:
continue #skipping lines where all 3 values are -1
else:
newdata.write(str(avrg) + ' ' + '\n')
Thanks. Comment if any clarification is needed.

data = [float(x) for x in row[1:] if float(x) > -1]
if data:
    avg = sum(data)/len(data)
else:
    avg = 0  # or throw an exception; you had a row of all -1's
The first line is a fairly standard Pythonism... given an array (in this case row), you can iterate through the list and filter out stuff by using the for x in array if condition bit.
If you wanted to only look at the last three values, you have two options depending on what you mean by last three:
data = [float(x) for x in row[-3:] if float(x) > -1]
will look at the last 3 columns and give you 0 to 3 values back, depending on how many of them are -1.
data = [float(x) for x in row[1:] if float(x) > -1][-3:]
will give you up to 3 of the last "good" values (if a given row is all or almost all -1, it will be fewer than 3).
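For example, a quick illustrative check (my own, using the first row from the question) of what the two variants return:
row = ["HsaEX0029886", "100", "-1", "-1", "100", "100", "100", "100",
       "100", "100", "-1", "100", "-1", "100"]

last3 = [float(x) for x in row[-3:] if float(x) > -1]           # [100.0, 100.0] -- the -1 is dropped
good_last3 = [float(x) for x in row[1:] if float(x) > -1][-3:]  # [100.0, 100.0, 100.0]
print(sum(last3) / len(last3))                                  # 100.0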

Here it is in the same format as your original question. It lets you write an error message when a row has no usable values (all -1's), or you can ignore that case and write nothing:
with open("PSI_Datatxt.txt", 'r') as data:
    for row in data:
        # split the tab-separated fields and skip the row label in the first column
        vals = [float(val) for val in row.split("\t")[1:] if float(val) != -1]
        with open("DataEditted", "a+") as newdata:
            try:
                newdata.write(str(sum(vals)/len(vals)) + ' ' + '\n')
            except ZeroDivisionError:
                newdata.write("My Error Message Here\n")

This should do it
import csv

def average(L):
    L = [i for i in map(float, L) if i != -1]
    if not L: return None
    return sum(L)/len(L)

with open('path/to/input/file') as infile, open('path/to/output/file', 'w') as fout:
    outfile = csv.writer(fout, delimiter='\t')
    for name, *vals in csv.reader(infile, delimiter='\t'):
        outfile.writerow((name, average(vals)))

Related

Print data between positions within a loop

I have one file.
File1 has 3 columns. Data are tab-separated.
File1:
2 4 Apple
6 7 Samsung
Let's say I run a loop of 10 iterations. If the iteration value falls between column 1 and column 2 of File1, then print the corresponding 3rd column from File1; otherwise print "0".
The columns may or may not be sorted, but the 2nd column is always greater than the 1st. Ranges of values in the two columns do not overlap between lines.
The output Result should look like this.
Result:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0
My program in Python is here:
chr5_1 = [[]]
for line in file:
    line = line.rstrip()
    line = line.split("\t")
    chr5_1.append([line[0], line[1], line[2]])
# Here I store all position information in chr5_1, a list of lists
chr5_1.pop(0)
for i in range(1, 10):
    for listo in chr5_1:
        L1 = " ".join(str(x) for x in listo[:1])
        L2 = " ".join(str(x) for x in listo[1:2])
        L3 = " ".join(str(x) for x in listo[2:3])
        if int(L1) <= i and int(L2) >= i:
            print(L3)
            break
        else:
            print("0")
            break
I am confused about the loop iteration and its break point.
Try this:
chr5_1 = dict()
for line in file:
    line = line.rstrip()
    _from, _to, value = line.split("\t")
    for i in range(int(_from), int(_to) + 1):
        chr5_1[i] = value

for i in range(1, 11):  # positions 1 through 10
    print chr5_1.get(i, "0")
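For reference, here is an illustration of my own of what the loop builds for the example File1: every position covered by a line maps to its value, and get() falls back to "0" for everything else.
chr5_1 = dict()
for _from, _to, value in [(2, 4, "Apple"), (6, 7, "Samsung")]:
    for i in range(_from, _to + 1):
        chr5_1[i] = value

print(chr5_1)              # {2: 'Apple', 3: 'Apple', 4: 'Apple', 6: 'Samsung', 7: 'Samsung'}
print(chr5_1.get(1, "0"))  # 0 -- position 1 is not covered by any range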
I think this is a job for else:
position_information = []
with open('file1', 'rb') as f:
    for line in f:
        position_information.append(line.strip().split('\t'))

for i in range(1, 11):
    for start, through, value in position_information:
        if i >= int(start) and i <= int(through):
            print value
            # No need to continue searching for something to print on this line
            break
    else:
        # We never found anything to print on this line, so print 0 instead
        print 0
This gives the result you're looking for:
0
Apple
Apple
Apple
0
Samsung
Samsung
0
0
0
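As background on the for/else construct used above (a minimal illustration of my own, not part of the original answer): the else block belongs to the inner for loop and runs only when that loop finishes without hitting break.
for start, through, value in [(2, 4, "Apple"), (6, 7, "Samsung")]:
    if start <= 3 <= through:
        print(value)  # a matching range was found, so break skips the else below
        break
else:
    print(0)          # runs only if no break occurred, i.e. no range matched

This prints Apple, because 3 falls inside the 2-4 range.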
Setup:
import io

s = '''2 4 Apple
6 7 Samsung'''

# Python 2.x
f = io.BytesIO(s)
# Python 3.x
#f = io.StringIO(s)
If the lines of the file are not sorted by the first column:
import csv

reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
f = list(reader)
f.sort(key=lambda row: int(row[0]))  # sort numerically by the start position
Read each line; do some math to figure out what to print and how many of them to print; print stuff; iterate
def print_stuff(thing, n):
    while n > 0:
        print(thing)
        n -= 1

limit = 10
prev_end = 0  # last position that has already been printed
for line in f:
    # if iterating over a file, separate the columns
    begin, end, text = line.strip().split()
    # if iterating over the sorted list of lines
    #begin, end, text = line
    begin, end = map(int, (begin, end))
    # don't exceed the limit
    begin = begin if begin < limit else limit
    # how many zeros before this range starts?
    gap = begin - prev_end - 1
    print_stuff('0', gap)
    if begin == limit:
        break
    # don't exceed the limit
    end = end if end < limit else limit
    # how many words?
    span = (end - begin) + 1
    print_stuff(text, span)
    prev_end = end
    if end == limit:
        break

# any more zeros?
gap = limit - prev_end
print_stuff('0', gap)

Python programming changing a specific value in file

I have a problem concerning my code:
with open('Premier_League.txt', 'r+') as f:
    data = [int(line.strip()) for line in f.readlines()]  # [1, 2, 3]
    f.seek(0)
    i = int(input("Add your result! \n"))
    data[i] += 1  # e.g. if i = 1, data now [1, 3, 3]
    for line in data:
        f.write(str(line) + "\n")
    f.truncate()
    print(data)
The code works such that the file "Premier_League.txt", which contains for example:
1
2
3
where i = 1,
gets converted to and saved back to the already existing file (the previous contents get overwritten):
1
3
3
My problem is that I want to choose a specific value in a matrix (not just a single column), for example:
0 0 0 0
0 0 0 0
0 0 0 0
which I would like to change to, for example:
0 1 0 0
0 0 0 0
0 0 0 0
When I run this through my program, this appears:
ValueError: invalid literal for int() with base 10: '1 1 1 1'
So my question is: how do I change a specific value in a file that contains more than a single column of values?
The problem is that you are not handling the increased number of dimensions properly. Try something like this:
with open('Premier_League.txt', 'r+') as f:
    # Note this is now a 2D matrix (nested list)
    data = [[int(value) for value in line.strip().split()] for line in f]
    f.seek(0)
    # We must specify both a row and a column
    i = int(input("Add your result row! \n"))
    j = int(input("Add your result column! \n"))
    data[i][j] += 1  # Assign to the index of the row and column
    # Parse out the data and write back to file
    for line in data:
        f.write(' '.join(map(str, line)) + "\n")
    f.truncate()
    print(data)
You could also use a generator expression to write to the file, for example:
# Parse out the data and write back to file
f.write('\n'.join((' '.join(map(str, line)) for line in data)))
instead of:
# Parse out the data and write back to file
for line in data:
    f.write(' '.join(map(str, line)) + "\n")
First up, you are trying to parse the string '0 0 0 0' as an int, that's the error you are getting. To fix this, do:
data = [[int(ch) for ch in line.strip().split()] for line in f.readlines()]
This will create a 2D array, where the first index corresponds to the row, and the second index corresponds to the column. Then, you would probably want the user to give you two values, instead of a singular i since you are trying to edit in a 2D array.
Edit:
So your following code will look like this:
i = int(input("Add your result row: \n"))
j = int(input("Add your result column: \n"))
data[i][j] += 1
# For data = [[1,2,1], [2,3,2]], and user enters i = 1
# and j = 0, the new data will be [[1,2,1], [3,3,2]]

Python: count values within defined intervals

I import data from a CSV which looks like this:
3.13
3.51
3.51
4.01
2.13
1.13
1.13
1.13
1.63
1.88
What I would like to do now is to COUNT the values within those intervals:
0-1, 1-2, 2-3, >3
So the result would be
0-1: 0
1-2: 5
2-3: 1
>3: 4
Apart from this main task, I would like to express the counts as percentages of the total number of values (e.g. 0-1: 0%, 1-2: 50%, ...).
I am quite new to Python, so I got stuck in my attempts at solving this. Maybe there is a predefined function for this that I don't know of?
Thanks a lot for your help!!!
+++ UPDATE: +++
Thanks for all the replies.
I have tested a bunch of them, but I seem to be doing something wrong with reading the CSV file. Referring to the code snippets that use a, b, c, d for the different intervals: these variables always stay 0 for me.
Here is my actual code:
import csv
a=b=c=0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        if row in range(0,1):
            a += 1
        elif row in range(1,2):
            b += 1
print a, b
I also converted all values in the CSV to integers, without success. In the CSV there is just one single column.
Any ideas what I am doing wrong???
Here's how to do it in a very concise way with numpy:
import sys
import csv
import numpy as np

with open('winter.csv') as csvfile:
    field = 0  # (zero-based) field/column number containing the required values
    float_list = [float(row[field]) for row in csv.reader(csvfile)]
    #float_list = [3.13, 3.51, 3.51, 4.01, 2.13, 1.13, 1.13, 1.13, 1.63, 1.88]

hist, bins = np.histogram(float_list, bins=[0, 1, 2, 3, sys.maxint])
bin_counts = zip(bins, bins[1:], hist)  # [(bin_start, bin_end, count), ... ]
for bin_start, bin_end, count in bin_counts[:-1]:
    print '{}-{}: {}'.format(bin_start, bin_end, count)

# different output required for last bin
bin_start, bin_end, count = bin_counts[-1]
print '>{}: {}'.format(bin_start, count)
Which outputs:
0-1: 0
1-2: 5
2-3: 1
>3: 4
Most of the effort is in massaging the data for output.
It's also quite flexible as it is easy to use different intervals by changing the bins argument to np.histogram(), e.g. add another interval by changing bins:
hist, bins = np.histogram(float_list, bins=[0,1,2,3,4,sys.maxint])
outputs:
0-1: 0
1-2: 5
2-3: 1
3-4: 3
>4: 1
This should do, provided the data from the CSV is in values:
from collections import defaultdict

# compute a histogram
histogram = defaultdict(lambda: 0)
interval = 1.
max = 3
for v in values:
    bin = int(v / interval)
    bin = max if bin >= max else bin
    histogram[bin] += 1

# output
sum = sum(histogram.values())
for k, v in sorted(histogram.items()):
    share = 100. * v / sum
    if k >= max:
        print "{}+ : {}, {}%".format(k, v, share)
    else:
        print "{}-{}: {}, {}%".format(k, k+interval, v, share)
import csv
a=b=c=d=0
with open('cf.csv', 'r') as csvfile:
    spamreader = csv.reader(csvfile)
    for row in spamreader:
        if 0 < float(row[0]) < 1:
            a += 1
        elif 1 < float(row[0]) < 2:
            b += 1
        elif 2 < float(row[0]) < 3:
            c += 1
        if 3 < float(row[0]):
            d += 1
print "0-1:{} \n 1-2:{} \n 2-3:{} \n >3:{}".format(a, b, c, d)
Output:
0-1:0
1-2:5
2-3:1
>3:4
Because each row from csv.reader is a list (e.g. ['3.13']), we use the [0] index to access our data and convert the string to a float with the float() function.
After you get the entries into a list:
count_0_to_1 = 0
count_1_to_2 = 0
count_2_to_3 = 0
count_over_3 = 0
for value in entries:
    if 0 <= value < 1:
        count_0_to_1 += 1
    elif 1 <= value < 2:
        count_1_to_2 += 1
So on and so forth...
And to find the breakdown:
total_values = count_0_to_1 + count_1_to_2 + count_2_to_3 + count_over_3
perc_0_to_1 = (count_0_to_1 / float(total_values)) * 100
perc_1_to_2 = (count_1_to_2 / float(total_values)) * 100
perc_2_to_3 = (count_2_to_3 / float(total_values)) * 100
perc_over_3 = (count_over_3 / float(total_values)) * 100
+++++ Response to Update +++++++
import csv
a=b=c=0
with open('winter.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',')
    for row in spamreader:
        for i in row:
            i = float(i.strip())  # .strip() removes blank spaces before converting it to float
            if 0 <= i < 1:
                a += 1
            elif 1 <= i < 2:
                b += 1
            # add more elif statements here as desired.
Hope that works.
Side note, I like that a=b=c=0 thing. Didn't realize you could do that after all this time haha.

Python: keep top Nth results for csv.reader

I am doing some filtering on a csv file where, for every title, there are many duplicate IDs with different prediction values, so column 2 differs. I would like to keep only the 30 lowest values, but with unique IDs. I came up with this code, but I don't know how to keep the lowest 30 entries.
Can you please help with suggestions on how to obtain 30 entries that are unique by ID?
# title1    id1    100    7.78E-25    # example of the line
with open("test.txt") as fi:
    cmp = {}
    for R in csv.reader(fi, delimiter='\t'):
        for L in ligands:
            newR = R[0], R[1]
            if R[0] == L:
                if (int(R[2]) <= int(1000) and int(R[2]) != int(0) and float(R[3]) < float("1.0e-10")):
                    if newR in cmp:
                        if float(cmp[newR][3]) > float(R[3]):
                            cmp[newR] = R[:-2]
                    else:
                        cmp[newR] = R[:-2]
Maybe try something along this line...
from bisect import insort

nth_lowest = [very_high_value] * 30

for x in my_loop:
    do_stuff()
    ...
    if x < nth_lowest[-1]:
        insort(nth_lowest, x)
        nth_lowest.pop()  # remove the highest element
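As a more concrete, standalone sketch under my own assumptions (that the goal is the 30 rows with the smallest value in column 4, keeping only the best-scoring row per ID, which is roughly what the loop in the question builds), heapq.nsmallest can do the final selection:
import csv
import heapq

best = {}  # id -> (prediction_value, full_row); keeps only the lowest value seen per ID
with open("test.txt") as fi:
    for row in csv.reader(fi, delimiter='\t'):
        title, id_, score, value = row[0], row[1], int(row[2]), float(row[3])
        if score <= 1000 and score != 0 and value < 1.0e-10:
            if id_ not in best or value < best[id_][0]:
                best[id_] = (value, row)

# the 30 entries with the smallest prediction values, one per unique ID
for value, row in heapq.nsmallest(30, best.values(), key=lambda pair: pair[0]):
    print(row)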

Help with an if else loop in python

Hi, here is my problem. I have a program that calculates the averages of data in columns.
Example
Bob
1
2
3
the output is
Bob
2
Some of the data has 'NA's.
So for Joe
Joe
NA
NA
NA
I want this output to be NA
So I wrote an if/else loop.
The problem is that it doesn't execute the second part of the loop and just prints out one NA. Any suggestions?
Here is my program:
with open('C://achip.txt', "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
numRowsPerColumn = [0] * len(columns) # this figures out the number of columns
for line in f:
# Skip empty lines since I was getting that error before
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
try: # this is the whole strings to math numbers things
sums[i] += float(values[i])
numRowsPerColumn[i] += 1
except ValueError:
continue
with open('c://chipdone.txt', 'w') as ouf:
for i in xrange(len(columns)):
if numRowsPerColumn[i] ==0 :
print 'NA'
else:
print>>ouf, columns[i], sums[i] / numRowsPerColumn[i] # this is the average calculator
The file looks like so:
Joe Bob Sam
1 2 NA
2 4 NA
3 NA NA
1 1 NA
and the final output should be the names and the averages:
Joe Bob Sam
1.5 1.5 NA
OK, I tried Roger's suggestion and now I have this error:
Traceback (most recent call last):
  File "C:/avy14.py", line 5, in <module>
    for line in f:
ValueError: I/O operation on closed file
Here is this new code:
with open('C://achip.txt', "rtU") as f:
columns = f.readline().strip().split(" ")
sums = [0] * len(columns)
rows = 0
for line in f:
line = line.strip()
if not line:
continue
rows += 1
for col, v in enumerate(line.split()):
if sums[col] is not None:
if v == "NA":
sums[col] = None
else:
sums[col] += int(v)
with open("c:/chipdone.txt", "w") as out:
for name, sum in zip(columns, sums):
print >>out, name,
if sum is None:
print >>out, "NA"
else:
print >>out, sum / rows
with open("c:/achip.txt", "rU") as f:
columns = f.readline().strip().split()
sums = [0.0] * len(columns)
row_counts = [0] * len(columns)
for line in f:
line = line.strip()
if not line:
continue
for col, v in enumerate(line.split()):
if v != "NA":
sums[col] += int(v)
row_counts[col] += 1
with open("c:/chipdone.txt", "w") as out:
for name, sum, rows in zip(columns, sums, row_counts):
print >>out, name,
if rows == 0:
print >>out, "NA"
else:
print >>out, sum / rows
I'd also use the no-parameter version of split when getting the column names (it allows you to have multiple space separators).
Regarding your edit to include input/output sample, I kept your original format and my output would be:
Joe 1.75
Bob 2.33333333333
Sam NA
This format is 3 rows of (ColumnName, Avg) columns, but you can change the output if you want, of course. :)
Using numpy:
import numpy as np

with open('achip.txt') as f:
    names = f.readline().split()
    arr = np.genfromtxt(f)

print(arr)
# [[  1.   2.  NaN]
#  [  2.   4.  NaN]
#  [  3.  NaN  NaN]
#  [  1.   1.  NaN]]
print(names)
# ['Joe', 'Bob', 'Sam']
print(np.ma.mean(np.ma.masked_invalid(arr), axis=0))
# [1.75 2.33333333333 --]
Using your original code, I would add one loop and edit the print statement
with open(r'C:\achip.txt', "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
numRowsPerColumn = [0] * len(columns) # this figures out the number of columns
for line in f:
# Skip empty lines since I was getting that error before
if not line.strip():
continue
values = line.split(" ")
### This removes any '' elements caused by having two spaces like
### in the last line of your example chip file above
for count, v in enumerate(values):
if v == '':
values.pop(count)
### (End of Addition)
for i in xrange(len(values)):
try: # this is the whole strings to math numbers things
sums[i] += float(values[i])
numRowsPerColumn[i] += 1
except ValueError:
continue
with open('c://chipdone.txt', 'w') as ouf:
for i in xrange(len(columns)):
if numRowsPerColumn[i] ==0 :
print>>ouf, columns[i], 'NA' #Just add the extra parts
else:
print>>ouf, columns[i], sums[i] / numRowsPerColumn[i]
This solution also gives the same result in Roger's format, not your intended format.
The solution below is cleaner and has fewer lines of code...
import pandas as pd

# read the file into a DataFrame using read_csv
df = pd.read_csv('C://achip.txt', sep="\s+")

# compute the average of each column
avg = df.mean()

# save computed averages to the output file
avg.to_csv("c:/chipdone.txt")
The key to the simplicity of this solution is the way the input text file is read into a DataFrame. Pandas read_csv allows you to use regular expressions for the sep/delimiter argument. In this case, we used the "\s+" regex pattern to handle one or more spaces between columns.
Once the data is in a DataFrame, computing the averages and saving them to a file can be done with straightforward pandas functions.
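One caveat worth noting (my own observation, not part of the original answer): read_csv parses the literal NA cells as NaN and mean() skips NaN values, so an all-NA column such as Sam comes out as NaN rather than the string "NA". If you want the literal NA in the output file, you could fill it in before saving, along these lines:
avg = df.mean()
avg = avg.fillna("NA")   # columns with no numeric data at all (e.g. Sam) become "NA"
avg.to_csv("c:/chipdone.txt")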
