I am back again with another python query. I have been trying to do some calculations with the items present in a list. Here is the code:
import math
def Usage() :
    print "Usage :python beznew.py transcriptionsFile"
if __name__ == "__main__" :
    if len(sys.argv) != 2 :
        Usage()
    else :
        transcriptionFile = sys.argv[1]
        tFile = open(transcriptionFile, "r")
        for line in iter(tFile) :
            list = line.split()
            # changing the unit of time from 100 nano seconds to seconds
            list[0] = list[0] / 100000000
            list[1] = list[1] / 100000000
            # duration of each phoneme
            list = list[1] - list[0]
            # extracting the start time of each phoneme
            newlist = list.pop[0]
            print list
            print newlist
        close.tFile
The input file looks like the following:
000000 1200000 pau
1200000 1600000 dh
1600000 2000000 ih
2000000 3100000 k
3100000 3400000 aa
3400000 3800000 r
I am trying to change the numerical values to seconds, and also to get the difference between the first and second numbers. It would not allow me to divide. I don't understand what I am doing wrong. Thank you.
First, don't use list as a variable name. Every time you do that, a kitten dies.
Second, you should convert the strings you've extracted from your file to a number, preferably a Decimal if you value the precision. Currently you're trying to divide a string.
Third, nanoseconds are billionths of a second, so a unit of 100 nanoseconds is one ten-millionth of a second: divide by 10,000,000, not 100,000,000.
Fourth, it's tFile.close(), not close.tFile.
Fifth, use for line in tFile:. A file object is already an iterator.
Sixth, use with open(transcriptionFile, "r") as tFile: and be done with having to close it.
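The points above can be put together in a minimal sketch. It is written for Python 3 (the code in the question is Python 2), and the names parse_line and UNITS_PER_SECOND are made up for illustration; the two sample lines are taken from the input file shown in the question.

```python
from decimal import Decimal

# One unit in the file is 100 ns, so there are 10**7 units per second.
UNITS_PER_SECOND = Decimal(10000000)

def parse_line(line):
    """Return (start in seconds, duration in seconds, phoneme) for one line."""
    start, end, phoneme = line.split()
    # Convert the strings to numbers before doing any arithmetic.
    start_s = Decimal(start) / UNITS_PER_SECOND
    end_s = Decimal(end) / UNITS_PER_SECOND
    return start_s, end_s - start_s, phoneme

sample = ["000000 1200000 pau", "1200000 1600000 dh"]
for row in map(parse_line, sample):
    print(row)
```

With a real file, the list sample would be replaced by the file object opened in a with block, since iterating over it yields the same kind of lines.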
You can simplify your code as follows:
transcriptionFile = 'calculus.txt'
with open(transcriptionFile, "r") as tFile:
    for line in tFile :
        li = line.split()
        if li:
            new = ((int(li[1]) - int(li[0]))/10000000. , li[2])
            print li,' ',new
The condition if li: is here to skip blank lines.
Important points:
don't use list as the name of a list, because list is the name of a Python built-in type
in Python 2, 10/100 produces 0 because both operands are integers; make one operand a float to obtain the right result, for example 10./100 or 10/100.0
do the subtraction list[1] - list[0] before dividing by 10000000; it is more precise
with open(....) as handle: is the better way to open files
Personally, I would do
transcriptionFile = 'calculus.txt'
with open(transcriptionFile, "r") as tFile:
    gen = (line.split() for line in tFile if line.strip())
    li = [((int(t2)-int(t1))/10000000.,phon) for (t1,t2,phon) in gen]
    print '\n'.join(map(str,li))
Note that I used 10000000. to divide: if 1600000 - 1200000 = 400000 is in a unit which is 100 nanoseconds, then 400000 / 10000000 is 0.04 second
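The arithmetic in that note can be checked with a short sketch. It assumes Python 3, where / is always true division and // is floor division; the names TICKS_PER_SECOND and duration_seconds are hypothetical.

```python
TICKS_PER_SECOND = 10000000  # one tick = 100 ns, so 10**7 ticks per second

def duration_seconds(start_ticks, end_ticks):
    """Subtract in integer ticks first, then convert to seconds."""
    return (int(end_ticks) - int(start_ticks)) / TICKS_PER_SECOND

# The strings come straight from line.split(), as in the question.
print(duration_seconds("1200000", "1600000"))   # 0.04

# Floor division is what Python 2 does with two integers: the result is 0.
print((1600000 - 1200000) // TICKS_PER_SECOND)  # 0
```

In Python 2 the same true-division behaviour needs either a float operand, as in 10000000., or from __future__ import division at the top of the file.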
Edit 1
transcriptionFile = 'calculus.txt'
with open(transcriptionFile, "r") as tFile:
    gen = (line.split() for line in tFile if line.strip())
    firstVals, lapTimes = [],[]
    for (t1,t2,phon) in gen:
        firstVals.append( (int(t1)/10000000.,phon) )
        lapTimes.append( ((int(t2)-int(t1))/10000000.,phon) )
line.split() returns a list of strings. Try list[0] = float(list[0]) / 100000000.
This converts each string to a number which supports division before you do your calculations.
You do not convert the strings to numerical values. In order to perform mathematical operations on your data, you have to convert them to either int or float objects. In Python 2, also make the divisor a float, otherwise dividing two ints truncates the result to 0:
valueA = int(list[0]) / 100000000.0
valueB = int(list[1]) / 100000000.0
Related
I'm new to Python and I'm trying to output the length of a list as a single integer, eg:
l1 = ['a', 'b', 'c']
len(l1) = 3
However, it is printing on cmdline with 1s down the page, eg:
1
1
1
1
1
1
etc
How can I get it to just output the number rather than a list of 1s?
(Here's the code:)
def Q3():
    from datetime import datetime, timedelta
    inputauth = open("auth.log", "r")
    authStrings = inputauth.readlines()
    failedPass = 'Failed password for'
    for line in authStrings:
        time = line[7:15]
        dateHour = line[0:9]
        countAttack1 = []
        if time in line and failedPass in line:
            if dateHour == 'Feb 3 08':
                countAttack1.append(time)
                length1 = len(countAttack1)
                print(length1)
Ideally, I'd like it to output the number in a print so that I could format it, aka:
print("Attack 1: " + length1)
I think you are looping and ifs are inside a loop. If so, just print the length outside loop scope.
Please share the complete code for a better answer
Well, as Syed Abdul Wahab said, the problem is that the list is recreated on every pass through the loop. That is why the print reports 1: it is the actual length of the list at that point.
The other problem, the repeated printing, is similar: you are printing on every pass through the loop.
The solution is then simple: initialize the list before the loop, and report its length after the loop.
def Q3():
    from datetime import datetime, timedelta
    inputauth = open("auth.log", "r")
    authStrings = inputauth.readlines()
    failedPass = 'Failed password for'
    countAttack1 = [] # after this line the countAttack will be empty
    for line in authStrings:
        time = line[7:15]
        dateHour = line[0:9]
        if time in line and failedPass in line:
            if dateHour == 'Feb 3 08':
                countAttack1.append(time)
    length1 = len(countAttack1)
    print("Attack 1: " + str(length1))
I'd also like to take a moment to point you to string formatting. While the documentation is complex, it makes printing much easier. The print above becomes:
print("Attack 1: {0}".format(length1))
Looking further at the code reveals a peculiarity: you check whether time is in the line string. However, just a few lines above you create time from a slice of line, so it will always be inside line. (The only exception is the edge case where line is shorter than expected, and then the date comparison will not match anyway.) So that if statement can be simplified to:
if failedPass in line:
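For illustration, the same count can be written without building a list at all. This is only a sketch for Python 3: the function name count_failed and its parameters are hypothetical, and the log lines are made-up sample data, not taken from the asker's auth.log.

```python
def count_failed(lines, date_hour, marker="Failed password for"):
    """Count the lines that start with date_hour and contain marker."""
    return sum(1 for line in lines if line.startswith(date_hour) and marker in line)

# Made-up sample lines, only to show the call.
sample = [
    "Feb  3 08:01:10 host sshd[1]: Failed password for root from 10.0.0.1",
    "Feb  3 08:02:11 host sshd[2]: Accepted password for alice from 10.0.0.2",
    "Feb  3 09:15:00 host sshd[3]: Failed password for root from 10.0.0.1",
    "Feb  3 08:59:59 host sshd[4]: Failed password for bob from 10.0.0.3",
]

# The count is computed once, after all lines have been seen.
length1 = count_failed(sample, "Feb  3 08")
print("Attack 1: {0}".format(length1))
```

The prefix passed as date_hour has to match the log's own spacing exactly, since startswith compares character by character.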
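Here is a function that prints the length:
def print_length(authStrings, failedPass='Failed password for'):
    countAttack1 = []
    for line in authStrings:
        time = line[7:15]
        dateHour = line[0:9]
        if failedPass in line and dateHour == 'Feb 3 08':
            countAttack1.append(time)
    # compute and print the length once, after the loop
    length1 = len(countAttack1)
    print(length1)
print_length(open("auth.log", "r").readlines())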
>>>Print length of the List.
I have a .txt file with 3 columns. The first column is just numbers. The second column is a number from 0 to 7. The last column is a sentence. I want to keep the rows in different lists so that I can match them by their numbers, and I want to write a function for this. How can I separate them into different lists without mixing them up?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3-----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]
With your current approach, you are reading the entire file as a string, and performing a split on a whitespace (you'd much rather split on newlines instead, because each line is separated by a newline). Furthermore, you're not segregating your data into disparate columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). The None means splitting on any run of whitespace, and the 2 limits it to two splits, so the sentence stays in one piece. Each group will be stored as key-list pairs inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict
o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])
print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
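To get from that grouping to the numbered layout asked for in the question, one possible sketch (Python 3.7+, where dicts keep insertion order) is below. The function name group_messages is hypothetical, and the sample lines are the ones from the question.

```python
from collections import defaultdict

def group_messages(lines):
    """Group (packet number, text) pairs under their integer message id."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        m_id, p_id, text = line.strip().split(None, 2)
        groups[int(m_id)].append((int(p_id), text))
    # Sort the ids, and sort each group's packets by packet number.
    return {m_id: sorted(parts) for m_id, parts in sorted(groups.items())}

sample = [
    "1234 0 my name is",
    "6789 2 I am coming",
    "2346 1 are you new?",
    "1234 2 Who are you?",
    "1234 1 how's going on?",
]

for number, (m_id, parts) in enumerate(group_messages(sample).items(), 1):
    print("----%d----" % number)
    for p_id, text in parts:
        print(m_id, p_id, text)
```

Converting the ids to int before sorting is a design choice: it makes 999 sort before 1234, which plain string sorting would not.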
Maybe you want something like this:
import re
# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        # raw string, so the backslashes reach the regex engine unchanged
        res = re.match(r"^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if not res.group(1) in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))
# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i+1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follows (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (module re in python). This is a good tool when you want to match simple line-based patterns.
Here it relies on spaces as column separators, but it can just as easily be adapted for fixed-width columns.
The results are collected in a dictionary of lists, each list containing tuples (pairs) of position and text.
The program only sorts the items at output time.
It's quite ugly code, but it's quite easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))
# sort the rows so that equal ids end up next to each other
raw = sorted(raw)
title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        # a new id starts: store the finished cluster and begin a new one
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)
# number the groups from 1, as in the expected output
for number, group in enumerate(refined, 1):
    print("-"*10+str(number)+"-"*10)
    for line in group:
        print(*line)
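The sort-then-cluster loop above is the pattern that itertools.groupby from the standard library implements. A minimal sketch, assuming Python 3 and the sample lines from the question; the function name cluster_rows is made up:

```python
from itertools import groupby

def cluster_rows(lines):
    """Sort the split rows, then group neighbouring rows that share the first column."""
    rows = sorted(line.strip().split(None, 2) for line in lines if line.strip())
    # groupby only merges adjacent items, which is why the rows are sorted first.
    return [list(group) for _, group in groupby(rows, key=lambda row: row[0])]

sample = [
    "1234 0 my name is",
    "6789 2 I am coming",
    "2346 1 are you new?",
    "1234 2 Who are you?",
    "1234 1 how's going on?",
]

for number, group in enumerate(cluster_rows(sample), 1):
    print("-" * 10 + str(number) + "-" * 10)
    for row in group:
        print(*row)
```

As in the answer above, the columns stay strings here, so the ordering is alphabetical rather than numeric.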
So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th the number of "G"s, the 5th "C"s, the 6th "T"s, the 7th %GC, and the 8th whether or not the sequence contains "TGA". Then I take all these values and write a table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        continue
    Acount+=line.count("A")
    Ccount+=line.count("C")
    Gcount+=line.count("G")
    Tcount+=line.count("T")
    genomeSize=Acount+Gcount+Ccount+Tcount
    percentGC=(Gcount+Ccount)*100.00/genomeSize
    print "sequence", seq
    print "Length of Sequence",len(line)
    print Acount,Ccount,Gcount,Tcount
    print "Percent of GC","%.2f"%(percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr=str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows DNA sequences to span multiple lines:
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
        data[seq]['seq']=seq
        previous_line_end = "" #TGA might be split across lines
        continue
    data[seq]['Acount']+=line.count("A")
    data[seq]['Ccount']+=line.count("C")
    data[seq]['Gcount']+=line.count("G")
    data[seq]['Tcount']+=line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in sorted(data.keys()):
    # the length is the sum of the four final counts, computed once per sequence
    data[seq]['genomeSize']=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
    data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
    # column order as asked: id, length, A, G, C, T, %GC, has TGA
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
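An alternative sketch, for Python 3, joins the lines of each record into one string first, so a "TGA" split across two lines needs no special handling. The function name fasta_stats and the two tiny records are made up for illustration, and the length is taken as the full sequence length, assuming the sequences contain only A, C, G and T.

```python
def fasta_stats(lines):
    """Return one tuple per record: (id, length, A, G, C, T, %GC, has TGA)."""
    records = []
    parts = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record, if there was one.
            if parts is not None:
                records.append("".join(parts))
            parts = []
        elif line and parts is not None:
            parts.append(line)
    if parts is not None:
        records.append("".join(parts))

    rows = []
    for number, seq in enumerate(records, 1):
        a, g, c, t = (seq.count(base) for base in "AGCT")
        gc = 100.0 * (g + c) / len(seq) if seq else 0.0
        rows.append((number, len(seq), a, g, c, t, round(gc, 2), "TGA" in seq))
    return rows

# Made-up records; the first one has TGA split across two lines.
sample = [">s1", "ATG", "ACC", ">s2", "GGCC"]
for row in fasta_stats(sample):
    print("\t".join(str(value) for value in row))
```

Writing the table is then a matter of sending the same tab-joined strings to a file opened once, outside the loop.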
Currently, I'm using this to calculate the time between two messages and listing the times if they are above 20 seconds.
def time_deltas(infile):
    entries = (line.split() for line in open(INFILE, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)
            yield (float(out_ts),ref_id[1:-1],(float(in_ts)*10000 - float(out_ts)*10000))
            n = (float(in_ts)*10000 - float(out_ts)*10000)
            if n> 20:
                print float(out_ts),ref_id[1:-1], n
INFILE = 'C:/Users/klee/Documents/text.txt'
import csv
with open('output_file1.csv', 'w') as f:
    csv.writer(f).writerows(time_deltas(INFILE))
However, there are two major errors. First of all, python drops zeros when the time is before 10, ie. 0900. And, it drops zeros making the time difference not accurate.
It looks like:
130203.08766
when it should be:
130203.087660
You are yielding floats, so the csv writer turns those floats into strings as it pleases.
If you want your output values to be a certain format, yield a string in that format.
Perhaps something like this?
print "%04.0f" % (900) # prints 0900
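To make the idea of yielding strings concrete, here is a small sketch for Python 3. The helper name format_row is hypothetical, and the choice of 6 and 4 decimal places is an assumption; use whatever precision the log actually has.

```python
def format_row(out_ts, ref_id, delta):
    """Format the numbers up front so the csv writer receives ready-made strings."""
    return ("%.6f" % out_ts, ref_id, "%.4f" % delta)

row = format_row(130203.08766, "ABC123", 25.0)
print(row)

# Zero-padding an integer to four digits works the same way.
print("%04d" % 900)
```

If the leading zero of a timestamp matters, another option is to keep the original string from the log for output and convert to float only for the subtraction.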
I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so. Each file contains millions of rows and each row has 7 columns like:
ccc245 1 4 5 5 3 -12.3
For me the most important is the values from the first and seventh columns; the pairs ccc245 , -12.3
What I need to do is to find between all the in_sXXXX.txt files, the 10 cases with the lowest values of the seventh column value, and I also need to get where each value is located, in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using python and bash for this purpose but at the moment I did not find a practical approach. All what I know to do is:
concatenate all in_ files in IN.TXT
search the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files, using grep -n VALUE in_s*, so I get for each value the name of the file
It works but it is a bit tedious. I wonder about a faster approach only using bash or python or both. Or another better language for this.
Thanks
In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob
def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()
result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp
def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()
def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap
result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
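The same selection can be tried on in-memory data before pointing it at the real files. This is a sketch for Python 3; the function name lowest_rows and the three sample rows are made up, and the input is assumed to be pairs of (file name, iterable of lines). As far as I can tell, current CPython versions of heapq.nsmallest keep at most n items when given a generator, but treat that as an assumption to verify on your version.

```python
import heapq

def lowest_rows(named_lines, n=10):
    """Return the n rows with the smallest 7th column as (file, col 1, col 7)."""
    def rows():
        for fname, lines in named_lines:
            for line in lines:
                items = line.split()
                if len(items) >= 7:  # skip blank or short lines
                    # Put the value first so tuples compare by the 7th column.
                    yield float(items[6]), fname, items[0]
    return [(fname, col1, value) for value, fname, col1 in heapq.nsmallest(n, rows())]

# Made-up rows in the 7-column layout described in the question.
sample = [
    ("in_s1.txt", ["ccc245 1 4 5 5 3 -12.3", "ccc1 1 4 5 5 3 7.0"]),
    ("in_s2.txt", ["ccc9 0 0 0 0 0 -900.5"]),
]
print(lowest_rows(sample, 2))
```

For the real data, each pair could be built lazily, for example by yielding the file name together with the open file object, so that only one file is open at a time.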
I would:
take first 10 items,
sort them and then
for every line read from files insert the element into those top10:
in case its value is lower than highest one from current top10,
(keeping the sorting for performance)
I wouldn't post the complete program here as it looks like homework.
Yes, if it wasn't ten, this would be not optimal
Try something like this in python:
import os
min_values = []
def add_to_min(file_name, one, seven):
    # keep the row while we hold fewer than 10 values
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    # otherwise check whether the 7th column is lower than the largest kept value
    elif seven < max(min_values)[0]:
        # let's remove the biggest value
        min_values.sort()
        min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))
# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))
    f.close()
# print answers, smallest first
for (seven, file_name, one) in sorted(min_values):
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
import os
values = []
# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        # the 7th column goes first so that sort() orders by it
        values.append((float(columns[6]), file_name, columns[0]))
    f.close()
# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
I just re-read your question: with millions of rows, version 2 might run out of RAM.
A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
This might be close to what you're looking for:
for file in in_s*.txt; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out
If your files have millions of lines, you might want to consider "buffering". The script below goes through those lines, each time comparing field 7 with the values in a 10-entry buffer. If a value is smaller than the largest value in the buffer, that largest entry is replaced by the new, lower value.
for file in in_*.txt
do
    awk 'NR<=10{
        c=c+1
        val[c]=$7
        tag[c]=$1
    }
    NR>10{
        # find the largest value currently in the buffer
        m=1
        for(o=2;o<=c;o++){
            if ( val[o] > val[m] ) m=o
        }
        # replace it if the new value is smaller
        if ( $7 < val[m] ){
            val[m]=$7
            tag[m]=$1
        }
    }
    END{
        for(i=1;i<=c;i++){
            print val[i], tag[i] | "sort -n"
        }
    }' "$file"
done