Python: keep top Nth results for csv.reader

I am doing some filtering on a CSV file where, for every title, there are many duplicate IDs with different prediction values, so column 2 differs between rows. I would like to keep only the 30 lowest values, but with unique IDs. I came up with this code, but I don't know how to keep only the lowest 30 entries.
Can you please suggest how to obtain the 30 entries that are unique by ID?
# title1    id1    100    7.78E-25    # example of an input line
import csv

with open("test.txt") as fi:
    cmp = {}
    for R in csv.reader(fi, delimiter='\t'):
        for L in ligands:  # `ligands` is defined elsewhere in the script
            newR = R[0], R[1]
            if R[0] == L:
                if (int(R[2]) <= 1000 and int(R[2]) != 0
                        and float(R[3]) < 1.0e-10):
                    if newR in cmp:
                        if float(cmp[newR][3]) > float(R[3]):
                            cmp[newR] = R  # keep the whole row; R[:-2] would drop the value compared above
                    else:
                        cmp[newR] = R

Maybe try something along this line...
from bisect import insort

nth_lowest = [very_high_value] * 30   # e.g. float('inf') as the sentinel
for x in my_loop:
    do_stuff()
    ...
    if x < nth_lowest[-1]:
        insort(nth_lowest, x)
        nth_lowest.pop()  # remove the highest element
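Applied to the question's data, a minimal sketch (assuming the tab-separated layout of the example line, with the ID in column 1 and the prediction value in column 3; the file name is illustrative) could first keep the best row per ID and then take the 30 lowest with heapq:
import csv
import heapq

best = {}  # ID -> row with the lowest prediction value seen so far
with open("test.txt") as fi:
    for row in csv.reader(fi, delimiter='\t'):
        if row[1] not in best or float(row[3]) < float(best[row[1]][3]):
            best[row[1]] = row

# the 30 rows with the lowest prediction values, one per unique ID
for row in heapq.nsmallest(30, best.values(), key=lambda r: float(r[3])):
    print(row)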

Related

File data binding with column names

I have files with hundreds of thousands of rows of data, but they have no column names.
I am going through every file, reading it row by row, and storing the rows in a list; after that I want to assign values to columns. But here I am confused about what to do, because there are around 60 values in every row, plus some extra columns with assigned values that should be added to every row.
Code so far:
import re
import glob

filenames = glob.glob("/home/ashfaque/Desktop/filetocsvsample/inputfiles/*.txt")
columns = []
with open("/home/ashfaque/Downloads/coulmn names.txt", encoding="ISO-8859-1") as f:
    file_data = f.read()
lines = file_data.splitlines()
for l in lines:
    columns.append(l.rstrip())
total = {}
for name in filenames:
    modified_data = []
    with open(name, encoding="ISO-8859-1") as f:
        file_data = f.read()
    lines = file_data.splitlines()
    for l in lines:
        if len(l) >= 1:
            modified_data.append(re.split(': |,', l))
    rows = []
    i = len(modified_data)
    x = 0
    while i > 60:
        r = lines[x:x + 59]
        x = x + 60
        i = i - 60
        rows.append(r)
    z = len(modified_data)
    while z >= 60:
        z = z - 60
    if z > 1:
        last_columns = modified_data[-z:]
        x = []
        for l in last_columns:
            if len(l) > 1:
                del l[0]
                x.append(l)
            elif len(l) == 1:
                x.append(l)
        for row in rows:
            for vl in x:
                row.append(vl)
    for r in rows:
        for i in range(0, len(r)):
            if len(r) >= 60:
                total.setdefault(columns[i], []).append(r[i])
In another script I have both parts separated: the rows with 60 values, and the last 5 to 15 columns that should be added to each row. But again I am confused about how to bind all the data together.
The data should look like this after binding:
outputdata.xlsx
Data input file:
inputdata.txt
What am I missing here? Is there any tool for this?
I believe that your issue can be resolved by taking the input file and turning it into a CSV file which you can then import into whatever program you like.
I wrote a small generator that would read a file a line at a time and return a row after a certain number of lines, in this case 60. In that generator, you can make whatever modifications to the data as you need.
Then with each generated row, I write it directly to the csv. This should keep the memory requirements for this process pretty low.
I didn't understand what you were doing with the regex split, but it would be simple enough to add it to the generator.
import csv

OUTPUT_FILE = "/home/ashfaque/Desktop/File handling/outputfile.csv"
INPUT_FILE = "/home/ashfaque/Desktop/File handling/inputfile.txt"

# This is a generator that will pull only num items into
# memory at a time, before it yields the row.
def get_rows(path, num):
    row = []
    with open(path, "r", encoding="ISO-8859-1") as f:
        for n, l in enumerate(f):
            # apply whatever transformations you need to here.
            row.append(l.rstrip())
            if (n + 1) % num == 0:
                # if rows need padding then do it here.
                yield row
                row = []

with open(OUTPUT_FILE, "w") as output:
    csv_writer = csv.writer(output)
    for r in get_rows(INPUT_FILE, 60):
        csv_writer.writerow(r)
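Since the question is about binding column names to the data, one possible extension (a sketch, not part of the answer above; it assumes the `columns` list read from the names file matches the 60 values per row) is to write those names as a header row first:
import csv

# `columns`, `get_rows`, OUTPUT_FILE and INPUT_FILE are the names defined above.
with open(OUTPUT_FILE, "w", newline="") as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(columns)  # header row built from the names file
    for r in get_rows(INPUT_FILE, 60):
        csv_writer.writerow(r)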

Calculate while excluding -1's

I have an extremely large file of tab-delimited values, 10000+ of them. I am trying to find the average of each row in the data and append these new values to a new file. However, values that weren't found are recorded in the large file as -1. Using the -1 values when calculating my averages would mess up my data. How can I exclude these values?
The large file structure looks like this:
"HsaEX0029886" 100 -1 -1 100 100 100 100 100 100 -1 100 -1 100
"HsaEX0029895" 100 100 91.49 100 100 100 100 100 97.87 95.29 100 100 93.33
"HsaEX0029923" 0 0 0 -1 0 0 0 0 0 9.09 0 5.26 0
In my code I'm taking the last 3 elements and finding the average of just those 3 values. If the last 3 elements in the row are 85, 12, and -1, I need to return the average of 85 and 12. Here's my entire code:
with open("PSI_Datatxt.txt", 'rt') as data:
next(data)
lis = [line.strip("\n").split("\t") for line in data] # create a list of lists(each row)
for row in lis:
x = float(row[11])
y = float(row[12])
z = float(row[13])
avrg = ((x + y + z) / 3)
with open("DataEditted","a+") as newdata:
if avrg == -1:
continue #skipping lines where all 3 values are -1
else:
newdata.write(str(avrg) + ' ' + '\n')
Thanks. Comment if any clarification is needed.
data = [float(x) for x in row[1:] if float(x) > -1]
if data:
    avg = sum(data)/len(data)
else:
    avg = 0  # or throw an exception; you had a row of all -1's
The first line is a fairly standard Python idiom: given a list (in this case row), you can iterate through it and filter out items using the for x in array if condition form.
If you wanted to only look at the last three values, you have two options, depending on what you mean by "last three":
data = [float(x) for x in row[-3:] if float(x) > -1]
will look at the last 3 fields and give you back 0 to 3 values, depending on how many of them are -1.
data = [float(x) for x in row[1:] if float(x) > -1][-3:]
will give you up to 3 of the last "good" values (if all or almost all fields in a given row are -1, it will be fewer than 3).
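For instance, with a hypothetical short row (real rows have many more columns):
row = ["HsaEX0029886", "85", "12", "-1"]  # hypothetical short row

# filter the last three fields, dropping the -1 sentinels
data = [float(x) for x in row[-3:] if float(x) > -1]
print(data)                   # [85.0, 12.0]
print(sum(data) / len(data))  # 48.5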
Here it is in the same format as your original question. It lets you write an error message if the row is all -1's, or you can ignore such rows and write nothing:
with open("PSI_Datatxt.txt", 'r') as data:
for row in data:
vals = [float(val) for val in row[1:] if float(val) != -1]
with open("DataEditted","a+") as newdata:
try:
newdata.write(str(sum(vals)/len(vals)) + ' ' + '\n')
except ZeroDivisionError:
newdata.write("My Error Message Here\n")
This should do it:
import csv

def average(L):
    L = [i for i in map(float, L) if i != -1]
    if not L:
        return None
    return sum(L) / len(L)

with open('path/to/input/file') as infile, open('path/to/output/file', 'w') as fout:
    outfile = csv.writer(fout, delimiter='\t')
    for name, *vals in csv.reader(infile, delimiter='\t'):
        outfile.writerow((name, average(vals)))

Python 2 compiler doesn't read the correct values after the 31st inputted value

Solving the Smoothing the Weather problem on Codeabbey. It prints the correct output for the first 32 values, after which it doesn't read the inputted values correctly. The inputted test values number well over 150.
Here is my code:
from __future__ import division

num = int(raw_input())
inp = raw_input().split(" ")
lists = []
for i in inp:
    if inp.index(i) == 0 or inp.index(i) == len(inp) - 1:
        lists.append(inp[inp.index(i)])
    else:
        a, b, c = 0.0, 0.0, 0.0
        a = float(inp[(inp.index(i)) + 1])
        b = float(inp[inp.index(i)])
        c = float(inp[(inp.index(i)) - 1])
        x = (a + b + c) / 3
        x = ("%.9f" % x).rstrip('0')
        lists.append(x)
for i in lists:
    print i,
The index() call in the following code always returns the first occurrence of i in inp, so if there are duplicate values in inp, the whole logic fails:
if inp.index(i) == 0 or inp.index(i) == len(inp) - 1:
    lists.append(inp[inp.index(i)])
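A quick way to see the failure mode:
inp = ["10", "20", "10"]   # duplicate value "10"
print(inp.index(inp[-1]))  # 0, not 2: index() finds the first occurrence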
The correct approach is to enumerate and use the correct indices:
from __future__ import division

num = int(raw_input())
inp = raw_input().split(" ")
lists = []
# enumerate() yields each item together with its index i, starting at 0
for i, item in enumerate(inp):
    if i == 0 or i == len(inp) - 1:
        lists.append(inp[i])
    else:
        a = float(inp[i + 1])
        b = float(inp[i])
        c = float(inp[i - 1])
        x = (a + b + c) / 3.0
        x = ("%.9f" % x).rstrip('0')
        lists.append(x)
for i in lists:
    print i,
Hope that helps.

Finding average in .txt file python

I need to print out the average height from a .txt file. How do I write it in an easy way? The .txt file has these numbers:
12
14
59
48
45
12
47
65
152
This is what I've got so far:
import math

text = open(r'stuff.txt').read()
data = []
with open(r'stuff.txt') as f:
    for line in f:
        fields = line.split()
        rowdata = map(float, fields)
        data.extend(rowdata)
biggest = min(data)
smallest = max(data)
print(biggest - smallest)
To compute the average of some numbers, you should sum them up and then divide by the number of numbers:
data = []
with open(r'stuff.txt') as f:
    for line in f:
        fields = line.split()
        rowdata = map(float, fields)
        data.extend(rowdata)
print(sum(data)/len(data))
# import math -- you don't need this
# text = open(r'stuff.txt').read()  # not needed
# data = []  # not needed
with open(r'stuff.txt') as f:
    data = [float(line.rstrip()) for line in f]
biggest = min(data)   # note: these names are swapped; min() gives the smallest value
smallest = max(data)
print(biggest - smallest)
print(sum(data)/len(data))
data = [float(ln.rstrip()) for ln in f.readlines()]  # within the 'with' statement
mean_average = float(sum(data))/len(data) if len(data) > 0 else float('nan')
That is the way to calculate the mean average, if that is what you meant. Sadly, the math module does not have a function for this (though the statistics module, available from Python 3.4, provides statistics.mean). FYI, the mean_average line is written to avoid the ZeroDivisionError that would occur if the list had length 0, just in case.
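For reference, a version using statistics.mean, assuming the one-number-per-line file shown above:
import statistics

with open('stuff.txt') as f:
    data = [float(line) for line in f]
print(statistics.mean(data))  # 50.444... for the nine numbers above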
Array average can be computed like this:
print(sum(data) / len(data))
A simple program for finding the average would be the following (if I understand correctly, your file has one value on each line; if so, it should look similar to this, otherwise it has to change accordingly):
f = open('stuff.txt', 'rU')
lines = f.readlines()
f.close()

size = len(lines)
total = 0
for line in lines:
    total = total + float(line.rstrip())
avg = total / float(size)
print avg,
Not the best Python there is, but it's quite straightforward, I think...
A full, almost-loopless solution combining elements of other answers here:
with open('stuff.txt', 'r') as f:
    data = [float(line.rstrip()) for line in f.readlines()]
mean = float(sum(data))/len(data) if len(data) > 0 else float('nan')
and you don't need to prepend, append, enclose or import anything else.

Python Greedy Algorithm

I am writing a greedy algorithm (Python 3.x.x) for a 'jewel heist'. Given a series of jewels and values, the program grabs the most valuable jewel that it can fit in its bag without going over the bag's weight limit. I've got three test cases here, and it works perfectly for two of them.
Each test case is written in the same way: first line is the bag weight limit, all lines following are tuples(weight, value).
Sample Case 1 (works):
10
3 4
2 3
1 1
Sample Case 2 (doesn't work):
575
125 3000
50 100
500 6000
25 30
Code:
def take_input(infile):
    f_open = open(infile, 'r')
    lines = []
    for line in f_open:
        lines.append(line.strip())
    f_open.close()
    return lines

def set_weight(weight):
    bag_weight = weight
    return bag_weight

def jewel_list(lines):
    jewels = []
    for item in lines:
        jewels.append(item.split())
    jewels = sorted(jewels, reverse=True)
    jewel_dict = {}
    for item in jewels:
        jewel_dict[item[1]] = item[0]
    return jewel_dict

def greedy_grab(weight_max, jewels):
    # first, we get a list of values
    values = []
    weights = []
    for keys in jewels:
        weights.append(jewels[keys])
    for item in jewels.keys():
        values.append(item)
    values = sorted(values, reverse=True)
    # then, we start working
    max = int(weight_max)
    running = 0
    i = 0
    grabbed_list = []
    string = ''
    total_haul = 0
    # pick the most valuable item first. Pick as many of them as you can.
    # Then, the next, all the way through.
    while running < max:
        next_add = int(jewels[values[i]])
        if (running + next_add) > max:
            i += 1
        else:
            running += next_add
            grabbed_list.append(values[i])
    for item in grabbed_list:
        total_haul += int(item)
    string = ("The greedy approach would steal $" + str(total_haul)
              + " of jewels.")
    return string

infile = "JT_test2.txt"
lines = take_input(infile)
# set the bag weight with the first line from the input
bag_max = set_weight(lines[0])
# once we set bag weight, we don't need it anymore
lines.pop(0)
# generate a list of jewels in a dictionary by weight, value
value_list = jewel_list(lines)
# run the greedy approach
print(greedy_grab(bag_max, value_list))
Does anyone have any clues why it wouldn't work for case 2? Your help is greatly appreciated.
EDIT: The expected outcome for case 2 is $6130. I seem to get $6090.
Your dictionary keys are strings, not integers, so they are sorted as strings when you try to sort them. You get:
['6000', '3000', '30', '100']
instead of the wanted:
['6000', '3000', '100', '30']
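You can see the difference directly:
vals = ['6000', '3000', '30', '100']
print(sorted(vals, reverse=True))            # ['6000', '3000', '30', '100'] -- string order
print(sorted(map(int, vals), reverse=True))  # [6000, 3000, 100, 30] -- numeric order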
Change this function to look like this, with integer keys:
def jewel_list(lines):
    jewels = []
    for item in lines:
        jewels.append(item.split())
    jewels = sorted(jewels, reverse=True)
    jewel_dict = {}
    for item in jewels:
        jewel_dict[int(item[1])] = item[0]  # changed line
    return jewel_dict
When you change this it will give you:
The greedy approach would steal $6130 of jewels.
In [237]: %paste
import operator

def greedy(infilepath):
    with open(infilepath) as infile:
        capacity = int(infile.readline().strip())
        # list(...) so the rows are also subscriptable under Python 3
        items = [list(map(int, line.strip().split())) for line in infile]
    bag = []
    items.sort(key=operator.itemgetter(0))  # sort by weight, ascending
    while capacity and items:
        if items[-1][0] <= capacity:
            bag.append(items[-1])
            capacity -= items[-1][0]
        items.pop()
    return bag
## -- End pasted text --
In [238]: sum(map(operator.itemgetter(1), greedy("JT_test1.txt")))
Out[238]: 8
In [239]: sum(map(operator.itemgetter(1), greedy("JT_test2.txt")))
Out[239]: 6130
I think that in this piece of code, i has to be incremented on the else side too:
while running < max:
    next_add = int(jewels[values[i]])
    if (running + next_add) > max:
        i += 1
    else:
        running += next_add
        grabbed_list.append(values[i])
        i += 1  # here
This, together with @iblazevic's answer, explains why it behaves this way.
