I have a program that operates on a csv file to create output that looks like this:
724, 2
724, 1
725, 3
725, 3
726, 1
726, 0
I would like to modify the script with some simple math operations such that it would render the output:
724, 1.5
725, 3
726, 0.5
The script I'm currently using is here:
lines = open("1.txt", 'r').read().splitlines()
for l in lines:
    data = l.split('"Overall evaluation:')
    if len(data) == 2:
        print(data[0] + ", " + data[1])
How could I add a simple averaging and slicing operation to that pipeline?
I guess I need to create some temporary variable, but it should be outside the loop that iterates over lines?
Maybe something like this:
lines = open("EasyChairData.csv", 'r').read().splitlines()
for l in lines:
    data = l.split('"Overall evaluation:')
    submission_number_repo = data[0]
    if len(data) == 2:
        print(data[0] + ", " + data[1])
    if submission_number_repo != data[0]:
        submission_number_repo = data[0]
EDIT
The function is just a simple average
You can use a dictionary that maps each key to a (total, count) pair, and then print the averages:
totals = {}  # submission number -> (running total, count); renamed to avoid shadowing the built-in map
lines = open("1.txt", 'r').read().splitlines()
for l in lines:
    data = l.split('"Overall evaluation:')
    if len(data) == 2:
        if data[0] not in totals:
            totals[data[0]] = (0, 0)
        totals[data[0]] = (totals[data[0]][0] + int(data[1]), totals[data[0]][1] + 1)
for x, y in totals.items():
    print(str(x) + ", " + str(y[0] / y[1]))
I would just store a list of values with the key, then take the average once the file has been read.
lines = open("1.txt", 'r').read().splitlines()
results = {}
for l in lines:
    data = l.split('"Overall evaluation:')
    if len(data) == 2:
        if data[0] in results:
            results[data[0]].append(float(data[1]))
        else:
            results[data[0]] = [float(data[1])]
for k, v in results.items():
    print("{}, {}".format(k, sum(v) / len(v)))
A simple way is to keep a state storing the current number, the running sum and the number of items, and to print it only when the current number changes (do not forget to print the last state!). Code could be:
lines = open("1.txt", 'r')  # .read().splitlines() is useless and only forces a full load into memory
state = [None]
for l in lines:
    data = l.split('"Overall evaluation:')
    if len(data) == 2:
        if data[0] != state[0]:
            if state[0] is not None:
                average = state[1] / state[2]
                print(state[0] + ", " + str(average))
            state = [data[0], 0., 0]
        state[1] += float(data[1])
        state[2] += 1
if state[0] is not None:
    average = state[1] / state[2]
    print(state[0] + ", " + str(average))
(Edited to avoid storing of values)
I love defaultdict:
from collections import defaultdict

average = defaultdict(lambda: (0, 0))
with open("1.txt") as infile:  # renamed to avoid shadowing the built-in input
    for line in infile:
        data = line.split('"Overall evaluation:')
        if len(data) != 2:
            continue
        key = data[0].strip()
        val = float(data[1])
        average[key] = (val + average[key][0], average[key][1] + 1)
for k in sorted(average):
    v = average[k]
    print("{},{}".format(k, v[0] / v[1]))
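Since the rows in the sample appear to be grouped by submission number already, itertools.groupby is another streaming option; a minimal sketch reusing the same split logic (the file name is the same placeholder as above):

from itertools import groupby

def parse(lines):
    # yield (submission number, score) pairs from matching lines
    for l in lines:
        data = l.split('"Overall evaluation:')
        if len(data) == 2:
            yield data[0].strip(), float(data[1])

with open("1.txt") as fh:
    # groupby assumes consecutive rows share the key, as in the sample data
    for key, group in groupby(parse(fh), key=lambda kv: kv[0]):
        scores = [v for _, v in group]
        print("{}, {}".format(key, sum(scores) / len(scores)))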
I am trying to solve the "Consensus and Profile" challenge on Rosalind.
The challenge instructions are as follows:
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
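As a toy illustration of what is being asked (made-up strings, not the Rosalind data), the profile counts each base per column and the consensus takes the most frequent base in each column:

from collections import Counter

strands = ["ATG", "ACG", "ATG"]  # hypothetical input
profile = {b: [col.count(b) for col in zip(*strands)] for b in "ACGT"}
consensus = "".join(Counter(col).most_common(1)[0][0] for col in zip(*strands))
print(profile)    # {'A': [3, 0, 0], 'C': [0, 1, 0], 'G': [0, 0, 3], 'T': [0, 2, 0]}
print(consensus)  # ATG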
My code is as follows (I got most of it from another user on this website). My only issue is that some of the DNA strands are broken down into multiple separate lines, so they are being appended to the "allstrings" list as separate strings. I am trying to figure out how to write each consecutive line that does not contain ">" as a single string.
import numpy as np

seq = []
allstrings = []
temp_seq = []
matrix = []
C = []
G = []
T = []
A = []
P = []
consensus = []
position = 1
file = open("C:/Users/knigh/Documents/rosalind_cons (3).txt", "r")
conout = open("C:/Users/knigh/Documents/consensus.txt", "w")
# Right now, this is reading and writing each as an individual line. Thus, it
# is splitting each sequence into multiple small sequences. You need to figure
# out how to read this in FASTA format to prevent this from occurring
desc = file.readlines()
for line in desc:
    allstrings.append(line)
for string in range(1, len(allstrings)):
    if ">" not in allstrings[string]:
        temp_seq.append(allstrings[string])
    else:
        seq.insert(position, temp_seq[0])
        temp_seq = []
        position += 1
# This last insertion into the sequence must be performed after the loop to empty
# out the last remaining string from temp_seq
seq.insert(position, temp_seq[0])
for base in seq:
    matrix.append([pos for pos in base])
M = np.array(matrix).reshape(len(seq), len(seq[0]))
for base in range(len(seq[0])):
    A_count = 0
    C_count = 0
    G_count = 0
    T_count = 0
    for pos in M[:, base]:
        if pos == "A":
            A_count += 1
        elif pos == "C":
            C_count += 1
        elif pos == "G":
            G_count += 1
        elif pos == "T":
            T_count += 1
    A.append(A_count)
    C.append(C_count)
    G.append(G_count)
    T.append(T_count)
profile_matrix = {"A": A, "C": C, "G": G, "T": T}
P.append(A)
P.append(C)
P.append(G)
P.append(T)
profile = np.array(P).reshape(4, len(A))
for pos in range(len(A)):
    if max(profile[:, pos]) == profile[0, pos]:
        consensus.append("A")
    elif max(profile[:, pos]) == profile[1, pos]:
        consensus.append("C")
    elif max(profile[:, pos]) == profile[2, pos]:
        consensus.append("G")
    elif max(profile[:, pos]) == profile[3, pos]:
        consensus.append("T")
conout.write("".join(consensus) + "\n")
for k, v in profile_matrix.items():
    conout.write(k + ": " + " ".join(str(x) for x in v) + "\n")
conout.close()
There are a couple of ways that you can iterate a FASTA file as records. You can use a prebuilt library or write your own.
A widely used library for working with sequence data is biopython. This code snippet will create a list of strings.
from Bio import SeqIO

file = "path/to/your/file.fa"
sequences = []
with open(file, "r") as file_handle:
    for record in SeqIO.parse(file_handle, "fasta"):
        sequences.append(str(record.seq))  # record.seq is a Seq object; str() gives a plain string
Alternatively, you can write your own FASTA parser. Something like this should work:
def read_fasta(fh):
    # Iterate to get the first FASTA header
    for line in fh:
        if line.startswith(">"):
            name = line[1:].strip()
            break
    # This list will hold the sequence lines
    fa_lines = []
    # Now iterate to collect the (possibly multiline) sequence
    for line in fh:
        if line.startswith(">"):
            # When in this block we have reached the next FASTA record:
            # yield the previous record's name and sequence as a tuple
            # that we can unpack
            yield name, "".join(fa_lines)
            # Reset the sequence lines and save the name of the next record
            fa_lines = []
            name = line[1:].strip()
            # skip to the next line
            continue
        fa_lines.append(line.strip())
    yield name, "".join(fa_lines)
You can use this function like so:
file = "path/to/your/file.fa"
sequences = []
with open(file, "r") as file_handle:
    for name, seq in read_fasta(file_handle):
        sequences.append(seq)
I am a chip test engineer, and I have one big text file of about 8KK lines, most of which include '='. I also have a log file of about 300K lines, where each line shows a test failure. I need to change the corresponding 300K lines of the original file.
Currently it takes about 15 hours to finish the job.
I have an existing solution, but it is too slow.
In the code, parse_log processes the log file and works out each modification to be made, and stil_parse does the following:
read the file into a list in memory;
iterate over the list, modifying each line that appears in the log file;
write the list back to disk;
import re

from tqdm import tqdm

class MaskStil:
    def __init__(self):
        self.log_signal_file = ''
        self.pattern = r"^([^:]+)(:)(\d+)(\s+)(\d+)(\s+)(\d+)(\s+)(\d+)(\s)([.LH]+)$"
        self.log_signal = {}
        self.log_lines = []
        self.mask_dict = {}
        self.stil_name_new = ''
        self.stil_name = ''
        self.signal_all = {}
        self.signal_group = []
        self.offset = 0
        self.mask_mode = -1  # mask_mode 0: revert between L/H; mask_mode 1: mask L/H to Z
        self.convert_value = [{"L": "H", "H": "L"}, {"L": "Z", "H": "Z"}]
        for i in range(100):
            self.log_signal[i] = ''

    def digest(self, log_signal, stil_file, signal_group, offset, mask_mode=1):
        self.log_signal_file = log_signal
        self.stil_name = stil_file
        self.stil_name_new = stil_file[:-5] + '_mask.stil'
        self.signal_group = signal_group.replace('=', '+').strip().split('+')
        self.offset = offset
        self.mask_mode = mask_mode
        for i in range(1, len(self.signal_group)):
            self.signal_all[self.signal_group[i]] = (i - 1) // 10 + i  # integer index into status
        print(self.signal_all)
        self.parse_log()
        self.stil_parse()

    def parse_log(self):
        with open(self.log_signal_file) as infile:
            line_num = 0
            blank_line = 0
            for line in infile:
                line_num += 1
                if line_num == 1:
                    blank_line = line.count(' ')
                if "------------------" in line:
                    break
                for i in range(blank_line, len(line)):
                    self.log_signal[i - blank_line] += line[i]
        for (key, value) in self.log_signal.items():
            self.log_signal[key] = value.rstrip()
        print(self.log_signal)
        with open(self.log_signal_file) as log_in:
            self.log_lines = log_in.read().splitlines()
        for line in self.log_lines:
            if re.match(self.pattern, line):
                match = re.match(self.pattern, line)
                cycle = int(match.group(9))
                signals = match.group(11)
                # print cycle,signals
                self.mask_dict[cycle] = {}
                for i in range(len(signals)):
                    if signals[i] != '.':
                        self.mask_dict[cycle][i] = signals[i]

    def stil_parse(self):
        cycle_keys = []
        vector_num = 0
        for i in self.mask_dict.keys():
            cycle_keys.append(i)
        with open(self.stil_name, 'r') as stil_in:
            stil_in_list = stil_in.read().splitlines()
        total_len = len(stil_in_list)
        vector_cycle_dict = {}
        with tqdm(total=total_len, ncols=100, desc=" Stil Scanning in RAM Progress") as pbar:
            for i_iter in range(total_len):
                line = stil_in_list[i_iter]
                pbar.update(1)
                if "=" in line:
                    vector_num += 1
                    if vector_num in cycle_keys:
                        vector_cycle_dict[vector_num] = i_iter
                    status = line[line.find("=") + 1:line.find(";")]
                    # if cycle + self.offset in cycle_keys:
                    if vector_num in cycle_keys:
                        match = 1
                        for (i, j) in self.mask_dict[vector_num].items():
                            mask_point = i
                            mask_signal = self.log_signal[i]
                            mask_value = j
                            test_point = self.signal_all[mask_signal]
                            test_value = status[test_point]
                            if test_value != mask_value:
                                print("data did not match for cycle: ", test_value, " VS ", line, j, vector_num, mask_point, mask_signal, test_point, test_value)
                                match = 0
                                raise NameError
                            else:
                                status = status[:test_point] + self.convert_value[self.mask_mode][test_value] + status[test_point + 1:]
                        if match == 1:
                            replace_line = line[:line.find("=") + 1] + status + line[line.find(";"):]
                            print("data change from :", line)
                            print(" to:", replace_line)
                            stil_in_list[i_iter] = replace_line
                        else:
                            print("No matching for %d with %s" % (vector_num, line))
                            raise NameError
        with tqdm(total=len(stil_in_list), ncols=100, desc=" Masked-stil to disk Progress") as pbar:
            with open(self.stil_name_new, 'w') as stil_out:
                for new_line in range(len(stil_in_list)):
                    pbar.update(1)
                    stil_out.write(stil_in_list[new_line] + "\n")
I was expecting a solution that could finish in about 1 or 2 hours.
As I mentioned in the comments, you can get some speedup by refactoring your code to be multithreaded or multiprocess.
I imagine you're also running into memory swapping issues here. If that's the case, this should help:
with open(self.log_signal_file) as log_in:
    line = log_in.readline()  # First line. Need logic to handle empty logs
    while line:  # Will return false at EOF
        if re.match(self.pattern, line):
            match = re.match(self.pattern, line)
            cycle = int(match.group(9))
            signals = match.group(11)
            # print cycle,signals
            self.mask_dict[cycle] = {}
            for i in range(len(signals)):
                if signals[i] != '.':
                    self.mask_dict[cycle][i] = signals[i]
        line = log_in.readline()
Here we only read in one line at a time, so you don't have to try to hold 8KK lines in memory
*In case anyone else didn't know, KK means million apparently.
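On the multiprocessing suggestion, here is a minimal sketch of fanning per-line work out to a pool of workers (the file names and the transform function are hypothetical, and this assumes each line can be rewritten independently, which the running vector_num counter in the real code does not quite allow without extra bookkeeping):

from multiprocessing import Pool

def transform(line):
    # Hypothetical per-line work; the real version would apply the
    # mask lookup built by parse_log to the lines containing '='.
    return line

if __name__ == "__main__":
    with open("input.stil") as fh:
        lines = fh.readlines()
    with Pool() as pool:  # defaults to one worker per CPU core
        out_lines = pool.map(transform, lines, chunksize=100000)
    with open("input_mask.stil", "w") as out:
        out.writelines(out_lines)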
I managed to optimize the solution, and the time consumed dropped tremendously, to about 1 minute.
The optimization is mainly in the following areas:
instead of repeatedly checking if (vector_num in cycle_keys):, I use an ordered list and always check whether the vector number equals cycle_keys[index_to_mask];
the results of line.find("=") and line.find(";") are stored in the variables line_find_equal and line_find_coma for further usage.
import math
import re

from tqdm import tqdm

class MaskStil:
    def __init__(self):
        self.log_signal_file = ''
        self.pattern = r"^([^:]+)(:)(\d+)(\s+)(\d+)(\s+)(\d+)(\s+)(\d+)(\s)([.LH]+)$"
        self.log_signal = {}
        self.log_lines = []
        self.mask_dict = {}
        self.stil_name_new = ''
        self.stil_name = ''
        self.signal_all = {}
        self.signal_group = []
        self.offset = 0
        self.mask_mode = -1  # mask_mode 0: revert between L/H; mask_mode 1: mask L/H to Z
        self.convert_value = [{"L": "H", "H": "L"}, {"L": "Z", "H": "Z"}]
        for i in range(100):
            self.log_signal[i] = ''

    def digest(self, log_signal, stil_file, signal_group, offset, mask_mode=1):
        self.log_signal_file = log_signal
        self.stil_name = stil_file
        self.stil_name_new = stil_file[:-5] + '_mask.stil'
        self.signal_group = signal_group.replace('=', '+').strip().split('+')
        self.offset = offset
        self.mask_mode = mask_mode
        for i in range(1, len(self.signal_group)):
            self.signal_all[self.signal_group[i]] = int(math.floor((i - 1) / 10) + i)
        print(self.signal_all)
        self.parse_log()
        self.stil_parse()

    def parse_log(self):
        with open(self.log_signal_file) as infile:
            line_num = 0
            blank_line = 0
            for line in infile:
                line_num += 1
                if line_num == 1:
                    blank_line = line.count(' ')
                if "------------------" in line:
                    break
                for i in range(blank_line, len(line)):
                    self.log_signal[i - blank_line] += line[i]
        for (key, value) in self.log_signal.items():
            self.log_signal[key] = value.rstrip()
        print(self.log_signal)
        with open(self.log_signal_file) as log_in:
            self.log_lines = log_in.read().splitlines()
        for line in self.log_lines:
            if re.match(self.pattern, line):
                match = re.match(self.pattern, line)
                cycle = int(match.group(9))
                signals = match.group(11)
                # print cycle,signals
                self.mask_dict[cycle] = {}
                for i in range(len(signals)):
                    if signals[i] != '.':
                        self.mask_dict[cycle][i] = signals[i]

    def stil_parse(self):
        cycle_keys = []
        vector_num = 0
        for i in self.mask_dict.keys():
            cycle_keys.append(i)
        with open(self.stil_name, 'r') as stil_in:
            stil_in_list = stil_in.read().splitlines()
        total_len = len(stil_in_list)
        index_to_mask = 0
        with tqdm(total=total_len, ncols=100, desc=" Stil Scanning in RAM Progress") as pbar:
            for i_iter in range(total_len):
                line = stil_in_list[i_iter]
                pbar.update(1)
                if "=" in line:
                    vector_num += 1
                    if vector_num <= cycle_keys[-1]:
                        if vector_num == cycle_keys[index_to_mask]:
                            line_find_equal = line.find("=")
                            line_find_coma = line.find(";")
                            status = line[line_find_equal + 1:line_find_coma]
                            # if cycle + self.offset in cycle_keys:
                            try:
                                match = 1
                                for (i, j) in self.mask_dict[vector_num].items():
                                    mask_point = i
                                    mask_signal = self.log_signal[i]
                                    mask_value = j
                                    test_point = self.signal_all[mask_signal]
                                    test_value = status[test_point]
                                    if test_value != mask_value:
                                        print("data did not match for cycle: ", test_value, " VS ", line, j, vector_num, mask_point, mask_signal, test_point, test_value)
                                        match = 0
                                        raise NameError
                                    else:
                                        status = status[:test_point] + self.convert_value[self.mask_mode][test_value] + status[test_point + 1:]
                                stil_in_list[i_iter] = line[:line_find_equal + 1] + status + line[line_find_coma:]
                                # print("data change from :", line)
                                # print(" to:", stil_in_list[i_iter])
                                index_to_mask = index_to_mask + 1
                            except Exception as e:
                                print("No matching for %d with %s" % (vector_num, line))
                                raise NameError
        with tqdm(total=len(stil_in_list), ncols=100, desc=" Masked-stil to disk Progress") as pbar:
            with open(self.stil_name_new, 'w') as stil_out:
                for i_iter in range(len(stil_in_list)):
                    pbar.update(1)
                    stil_out.write(stil_in_list[i_iter] + "\n")
I have an assignment where we have to read the file we created that has the test names and scores and print them in columns. Getting the data and displaying it, along with the average, is no problem, but I do not understand how to align the scores to the right in the output column. In the output example the scores line up perfectly to the right of the "SCORES" column. I can format their width using format(scores, '10d') as an example, but that's always relative to how long the name of the test was. Any advice?
def main():
    testAndscores = open('tests.txt', 'r')
    totalTestscoresValue = 0
    numberOfexams = 0
    line = testAndscores.readline()
    print("Reading tests and scores")
    print("============================")
    print("TEST SCORES")
    while line != "":
        examName = line.rstrip('\n')
        testScore = float(testAndscores.readline())
        totalTestscoresValue += testScore
        ## here is where I am having problems
        ## can't seem to find info how to align into
        ## two columns.
        print(format(examName), end="")
        print(format(" "), end="")
        print(repr(testScore).ljust(20))
        line = testAndscores.readline()
        numberOfexams += 1
    averageOftheTestscores = totalTestscoresValue / numberOfexams
    print("Average is", (averageOftheTestscores))
    # close the file
    testAndscores.close()

main()
You just have to store each name and score in a list, then compute the length of the longest name and use it to pad shorter names with spaces.
def main():
    with open('tests.txt', 'r') as f:
        data = []
        totalTestscoresValue = 0
        numberOfexams = 0
        while True:
            exam_name = f.readline().rstrip('\n')
            if exam_name == "":
                break
            line = f.readline().rstrip('\n')
            test_score = float(line)
            totalTestscoresValue += test_score
            data.append({'name': exam_name, 'score': test_score})
            numberOfexams += 1
    averageOftheTestscores = totalTestscoresValue / numberOfexams
    longest_test_name = max([len(d['name']) for d in data])
    print("Reading tests and scores")
    print("============================")
    print("TEST{0} SCORES".format(' ' * (longest_test_name - 4)))
    for d in data:
        print(format(d['name']), end=" ")
        print(format(" " * (longest_test_name - len(d['name']))), end="")
        print(repr(d['score']).ljust(20))
    print("Average is", (averageOftheTestscores))

main()
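As a variation, Python's format specifications can do the padding and alignment in one step; a minimal sketch (the column widths here are arbitrary):

width = max(len(d['name']) for d in data)
print("{:<{w}}  {:>8}".format("TEST", "SCORES", w=width))
for d in data:
    # '<' left-aligns the name in a field of `width` characters;
    # '>' right-aligns the score in a field of 8 characters
    print("{:<{w}}  {:>8.1f}".format(d['name'], d['score'], w=width))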
Hello, I am working on my project. I want to get candidate text blocks by using the algorithm below.
My input is a csv document which contains:
HTML column : the html code in a line
TAG column : the tag of the html code in a line
Words : the text inside the tag in a line
TC : the number of words in a line
LTC : the number of anchor words in a line
TG : the number of tags in a line
P : the number of p and br tags in a line
CTTD : TC + (0.2*LTC) + TG - P
CTTDs : the smoothed CTTD
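For reference, given those columns, CTTD could be computed in one vectorized step (a sketch assuming the column names above):

df['CTTD'] = df['TC'] + 0.2 * df['LTC'] + df['TG'] - df['P']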
This is my algorithm to find candidate text blocks. I load each csv file into a dataframe using pandas, and use the CTTDs, TC and TG columns to find the candidates.
from ListSmoothing import get_filepaths_smoothing
import pandas as pd
import numpy as np

filenames = get_filepaths_smoothing(r"C:\Users\kimhyesung\PycharmProjects\newsextraction\smoothing")
index = 0
for f in filenames:
    file_html = open(str(f), "r")
    df = pd.read_csv(file_html)
    #df = pd.read_csv('smoothing/Smoothing001.csv')
    news = np.array(df['CTTDs'])
    new = np.array(df['TG'])
    minval = np.min(news[np.nonzero(news)])
    maxval = np.max(news[np.nonzero(news)])
    j = 0.2
    thetaCTTD = minval + j * (maxval - minval)
    #maxGap = np.max(new[np.nonzero(new)])
    #minGap = np.min(new[np.nonzero(new)])
    thetaGap = np.min(new[np.nonzero(new)])
    #print(thetaCTTD)
    #print(maxval)
    #print(minval)
    #print(thetaGap)

    def create_candidates(df, thetaCTTD, thetaGAP):
        k = 0
        TB = {}
        TC = 0
        for index in range(0, len(df) - 1):
            start = index
            if df.loc[index, 'CTTDs'] > thetaCTTD:
                start = index
                gap = 0
                TC = df.loc[index, 'TC']
                for index in range(index + 1, len(df) - 1):
                    if df.loc[index, 'TG'] == 0:
                        continue
                    elif df.loc[index, 'CTTDs'] <= thetaCTTD and gap >= thetaGAP:
                        break
                    elif df.loc[index, 'CTTDs'] <= thetaCTTD:
                        gap += 1
                    TC += df.loc[index, 'TC']
                if (TC < 1) or (start == index):
                    continue
                TB.update({
                    k: {
                        'start': start,
                        'end': index - 1
                    }
                })
                k += 1
        return TB

    def get_unique_candidate(tb):
        TB = tb.copy()
        for key, value in tb.items():
            if key == len(tb) - 1:
                break
            if value['end'] == tb[key + 1]['end']:
                del TB[key + 1]
            elif value['start'] < tb[key + 1]['start'] < value['end']:
                TB[key]['end'] = tb[key + 1]['start'] - 1
            else:
                continue
        return TB

    index += 1
    stored_file = "textcandidate/textcandidate" + '{0:03}'.format(index) + ".csv"
    tb = create_candidates(df, thetaCTTD, thetaGap)
    TB = get_unique_candidate(tb)
    df_list = []
    for (k, d) in TB.items():
        candidate_df = df.loc[d['start']:d['end']]
        candidate_df['candidate'] = k
        df_list.append(candidate_df)
    output_df = pd.concat(df_list)
    output_df.to_csv(stored_file)
thetaCTTD is 10.36 and thetaGap is 1.
The output means there are 2 candidate text blocks: the first candidate starts at line number 215 and ends at line number 225 (as in the picture below), and the other candidate starts at line number 500 and ends at line number 501.
My question is: how can I save the output to csv so that not only the line numbers but the whole range of the text block, along with the other columns, appears in the output too?
My expected output is like the screenshot of the candidate text block shown here.
Assuming your output TB is a dictionary of dictionaries:
pd.concat([df.loc[d['start']:d['end']] for (k, d) in TB.items()])
Note that we slice by label, so d['end'] will be included.
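To see the difference (a toy example, not the asker's data):

import pandas as pd

df = pd.DataFrame({'x': range(5)})
print(df.loc[1:3])   # rows 1, 2 and 3: .loc slicing includes the end label
print(df.iloc[1:3])  # rows 1 and 2 only: .iloc slicing excludes the end position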
Edit: add the candidate number in a new column.
It's cleaner to write a loop than to do two concat operations:
df_list = []
for (k, d) in TB.items():
    candidate_df = df.loc[d['start']:d['end']]
    candidate_df['candidate'] = k
    df_list.append(candidate_df)
output_df = pd.concat(df_list)
It's also faster to concatenate all dataframes at once at the end.
I'm trying to open a file for appending, but I keep hitting the "except" branch of my try/except block, meaning there is some sort of error in the code, but I can't find what exactly is wrong with it. It only happens when I try to open a new file like so:
results = open("results.txt", "a")
results.append(score3)
Here's my full code:
import statistics

# input
filename = input("Enter a class to grade: ")
try:
    # open file name
    open(filename + ".txt", "r")
    print("Successfully opened ", filename, ".txt", sep='')
    print("**** ANALYZING ****")
    with open(filename + ".txt", 'r') as f:
        counter1 = 0
        counter2 = 0
        right = 0
        answerkey = "B,A,D,D,C,B,D,A,C,C,D,B,A,B,A,C,B,D,A,C,A,A,B,D,D"
        a = []
        # validating files
        for line in f:
            if len(line.split(',')) != 26:
                print("Invalid line of data: does not contain exactly 26 values:")
                print(line)
                counter2 += 1
                counter1 -= 1
            if line.split(",")[0][1:9].isdigit() != True:
                print("Invalid line of data: wrong N#:")
                print(line)
                counter2 += 1
                counter1 -= 1
            if len(line.split(",")[0]) != 9:
                print("Invalid line of data: wrong N#:")
                print(line)
                counter2 += 1
                counter1 -= 1
            counter1 += 1
            # grading students
            score = len([x for x in zip(answerkey.split(","), line.split(",")[1:]) if x[0] != x[1]])
            score1 = 26 - score
            score2 = score1 / 26
            score3 = score2 * 100
            a.append(score3)
            # results file
            results = open("results.txt", "a")
            results.write(score3)
    # in case of no errors
    if counter2 == 0:
        print("No errors found!")
    # calculating
    number = len(a)
    sum1 = sum(a)
    max1 = max(a)
    min1 = min(a)
    range1 = max1 - min1
    av = sum1 / number
    # turn to int
    av1 = int(av)
    max2 = int(max1)
    min2 = int(min1)
    range2 = int(range1)
    # median
    sort1 = sorted(a)
    number2 = number / 2
    number2i = int(number2)
    median = a[number2i]
    median1 = int(median)
    # mode
    from statistics import mode
    mode = mode(sort1)
    imode = int(mode)
    # printing
    print("**** REPORT ****")
    print("Total valid lines of data:", counter1)
    print("Total invalid lines of data:", counter2)
    print("Mean (average) score:", av1)
    print("Highest score:", max2)
    print("Lowest score:", min2)
    print("Range of scores:", range2)
    print("Median Score:", median1)
    print("Mode score(s):", imode)
    results.close()
except:
    print("File cannot be found.")
There is no method called append for writing to a file object. You can use the write or writelines methods to write. Since you opened the file in append mode ('a'), it won't overwrite the old data; it will append the new text to the end of the file.
f = open('ccc.txt', 'a')
f.write('Hellloooo')
f.close()
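Note that write() only accepts strings, so a numeric value such as score3 in the code above would also need converting first, e.g.:

results.write(str(score3) + "\n")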
Hope it helps.