I'm trying to read a line 8 characters or fewer at a time and count how many times lowercase "f" shows up. The count keeps coming out as zero. text1.txt has a lowercase "f" once on line 1 and 4 times on line 2.
with open("text1.txt","r+") as r:
    while True:
        cCount = r.readlines(1)
        charSet = cCount.count("f")
        print charSet
        if not cCount:
            break
        if charSet == 1:
            print("hello")
Where has my Python logic failed?
Try this:
with open("text1.txt","r") as r:
    for line in r:
        print(line.count("f"))
This is the idiomatic way to iterate over a file line by line.
EDIT: To change " fghfghf" to "3ghgh":
with open("text1.txt","r") as r:
    for line in r:
        if line.count("f")==3:
            # strip() drops the leading space and the trailing newline
            print("3"+line.strip().replace("f",""))
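As for why the original loop always printed zero: readlines(1) returns a list of whole lines rather than a string, so .count("f") looks for a list item that is exactly "f" and finds none. A minimal sketch in Python 3, using io.StringIO with made-up contents as a stand-in for text1.txt:

```python
import io

# Stand-in for text1.txt (contents are made up for illustration).
fake_file = io.StringIO("abcf\nfxfxfxf\n")

chunk = fake_file.readlines(1)      # a list holding the first line, not a string
list_count = chunk.count("f")       # counts list items equal to "f"
char_count = chunk[0].count("f")    # counts the character inside the line

print(chunk, list_count, char_count)
```

Counting on the line itself, as in the answer above, avoids the problem.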
I'm new to Python & here is my question
Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon.
From stephen.marquard#uct.ac.za Sat Jan 5 09:14:16 2008
Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.
Link of the file:
http://www.pythonlearn.com/code/mbox-short.txt
This is my code:
name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
for line in handle:
    if not line.startswith ("From "):continue
    #words = line.split()
    col = line.find(':')
    coll = col - 2
    print coll
    #zero = line.find('0')
    #one = line.find('1')
    #b = line[ zero or one : col ]
    #print b
    #hour = words[5:6]
    #print hour
    #for line in hour:
    #    hr = line.split(':')
    #    x = hr[1]
    for x in coll:
        counts[x] = counts.get(x,0) + 1
for key, value in sorted(counts.items()):
    print key, value
My first try used list splitting (the commented-out lines), and it didn't work because it treated the 0 and the 1 as the first and second letters, not as numbers.
My second try used line.find(':'), which partially worked, but with the minutes rather than the hours as required.
First question
Why does line.find(':') automatically take the 2 numbers after it?
Second question
Why does the program now fail with this error?
TypeError: 'int' object is not iterable on line 26
Third question
Why did it treat 0 and 1 as the first and second letters of the line rather than as the numbers 0 and 1?
Finally
If possible, please solve this problem with a little explanation (using the same code, to keep my learning sequence).
Thank you...
First question
Why when I write line.find(:), it takes automatically the 2 numbers
after?
str.find() returns the index of the first occurrence of the character you are looking for, not the text around it. If your string is "From 00:00:00", it returns 7 because the first ':' is at index 7.
Second question
Why when I run the program now, it gives an error TypeError: 'int'
object is not iterable on line 26 ??
As said above, str.find() returns an int, and you cannot iterate over an int.
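Both points can be checked in a few lines. This is only a small illustration in Python 3, using the example string from above:

```python
line = "From 00:00:00"

colon_index = line.find(":")   # position of the first ':', an int
print(colon_index)

try:
    for x in colon_index - 2:  # looping over an int is not allowed
        pass
except TypeError as err:
    error_message = str(err)
    print(error_message)
```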
Third question
Why it considered 0 & 1 as first & second letters of the line not 0 &
1 numbers
I don't really understand what you mean here. As I understand it, you try to find the first index where '0' or '1' occurs and assume that it is the first digit of the hour? What about 8-11pm (hours that start with 2)?
Finally If possible please solve me this problem with a little of
explanation please (with the same codes to keep my learning sequence)
Sure, it will be like this:
f = open("mbox-short.txt")
count = {}
for line in f:
    if not line.startswith("From "): continue
    first_colon_index = line.find(":")
    if first_colon_index == -1: # there is no ':'
        continue
    first_char_hour_index = first_colon_index - 2
    # string slicing
    # [a:b] gets the substring from index a up to, but not including, b
    hour = line[first_char_hour_index:first_char_hour_index+2]
    hour_int = int(hour)
    # if key exist, increase by 1. If not, set to 1
    if hour_int in count:
        count[hour_int] += 1
    else:
        count[hour_int] = 1
# print hour & count, in sorting order
for hour in sorted(count):
    print hour, count[hour]
String slicing can be confusing; you can read more about it in the Python docs.
You also have to be sure that there is no other ":" earlier in the line, or this method will fail because the first ":" will not be the one between the hour and the minute.
To make it more robust, it's better to use a regex. Something like:
import re
f = open("mbox-short.txt")
count = {}
for line in f:
    if not line.startswith("From"): continue
    match = re.search(r'^From.*?([0-9]{2,2}:[0-9]{2,2}:[0-9]{2,2})', line)
    if match:
        time = match.group(1) # hh:mm:ss
        hh = int(time.split(":")[0])
        # if key exist, increase by 1. If not, set to 1
        if hh in count:
            count[hh] += 1
        else:
            count[hh] = 1
# print hour & count, in sorting order
for hour in sorted(count):
    print hour, count[hour]
That's because str.find() returns an index of the found substring, not the string itself. Consequently, when you subtract 2 from it and then try to loop through it it will complain that you're trying to loop through an integer and raise a TypeError.
You can grab the whole time string as:
time_start = line.find(":")
if time_start == -1: # not found
    continue
time_string = line[time_start-2:time_start+6] # slice out the whole time string
You can then split time_string on ":" to get hours, minutes and seconds, e.g. hours, minutes, seconds = time_string.split(":", 2) (just keep in mind that those will be strings, not integers), or if you just want the hour:
hour = int(line[time_start-2:time_start])
You can take it from there - just increase your dict value and when you're done with parsing the file sort everything out.
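The exercise text itself suggests another route: split the line into words, take the time field, and split that on the colon. Below is a minimal Python 3 sketch of that idea. It assumes the time is always the sixth whitespace-separated field of a "From " line; count_hours is a hypothetical helper name, and the sample lines use made-up addresses rather than the real file.

```python
def count_hours(lines):
    """Return {hour: message count} for lines that start with 'From '."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        words = line.split()            # e.g. [..., '09:14:16', '2008']
        hour = words[5].split(":")[0]   # '09:14:16' -> '09'
        counts[hour] = counts.get(hour, 0) + 1
    return counts

# Made-up sample lines in the same layout as the 'From ' line shown above.
sample = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Fri Jan  4 18:10:48 2008",
    "From carol@example.com Fri Jan  4 09:05:31 2008",
]
hour_counts = count_hours(sample)
for hour, total in sorted(hour_counts.items()):
    print(hour, total)
```

Keeping the hour as a two-character string such as '09' still sorts correctly, because every hour has the same width.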
I'm new to machine learning. I ran into a problem when trying to use int() on letters. I use Python 3.5 on Mac OS. This is my code:
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index=0
    for line in fr.readlines():
        line = line.strip()
        listFromLine1 = line.split('\t')
        listFromLine = zeros(3)
        i = 0
        for value in listFromLine1:
            if value.isdigit():
                valueAsInt = int(value)
                listFromLine[i] = valueAsInt
                i += 1
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine1[-1]))
        index += 1
    return returnMat, classLabelVector
This is my txt file:
23 8 1 f
7 8 5 j
5 9 1 j
6 6 6 f
This is the error:
classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f'
Can anybody help me with these problems?
If I understand your desired outcome correctly, you want to return a list with n lists in it. Each list will be along the line of [23. 8. 1.]. Then you want a second list that takes the last number of each list like this: [1, 5, 1, 6].
Assuming this is all correct, the reason you are getting classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f' is that the last field of each line is a letter such as 'f', which is a string that int() cannot convert. I found 3 issues; addressing them should fix the error.
First, I found no '\t' in your text document. I instead used listFromLine1 = line.split(' ') and it split based on spaces. This could simply be from the way it copied over when you posted, though.
Second, when you assign a value for each position in listFromLine you then ignore it and append from listFromLine1 which you did nothing to, so it remains a string.
Third, try using if value.isnumeric(): instead of if value.isdigit():.
Fixing these few problems should get the program working. Also, you open the file and run fr.readlines() twice and never close it. You're making the program do the same work twice for the same information. You should try to rewrite it to open the file only once and use with open() as fr: because the file will then be closed when the block is done.
EDIT: if you want the second list to be the letters instead [f, j, j, f] then keep it as listFromLine1 and use str() instead of int(): classLabelVector.append(str(listFromLine1[-1]))
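Putting these points together, here is a minimal Python 3 sketch that reads the rows once and keeps the labels as strings. It uses plain lists instead of the zeros() array from the question so that it runs on its own; parse_rows is a hypothetical name, and the rows are the four lines of the posted txt file.

```python
def parse_rows(lines):
    """Split each row into three numeric features and one string label."""
    features, labels = [], []
    for line in lines:
        parts = line.split()            # splits on tabs or spaces
        if not parts:
            continue                    # skip blank lines
        features.append([float(v) for v in parts[:3]])
        labels.append(parts[-1])        # keep the label as a string
    return features, labels

rows = ["23 8 1 f", "7 8 5 j", "5 9 1 j", "6 6 6 f"]
features, labels = parse_rows(rows)
print(features)
print(labels)
```

Calling split() with no argument sidesteps the tab-versus-space question raised in the first point.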
I'm wondering, how can I count for example all "s" characters and print their number in a text file that I'm importing? Tried few times to do it by my own but I'm still doing something wrong. If someone could give me some tips I would really appreciate that :)
Open the file; the "r" means it is opened in read-only mode.
filetoread = open("./filename.txt", "r")
With this loop, you iterate over all the lines in the file and count the number of times the character chartosearch appears. Finally, the value is printed.
total = 0
chartosearch = 's'
for line in filetoread:
    total += line.count(chartosearch)
# total is an int, so convert it before concatenating
print("Number of " + chartosearch + ": " + str(total))
I am assuming you want to read a file, find the number of "s" characters and then store the result at the end of the file.
f = open('blah.txt','r+')  # "r+" allows reading and then writing
data_to_read = f.read().strip()
total_s = sum(map(lambda x: x=='s', data_to_read ))
f.write(str(total_s))  # the read left the position at the end, so this appends
f.close()
I did it functionally just to give you another perspective.
Open the file with open("foo.txt"); the default mode is "r", which is what we want because we are only reading. To remove whitespace and newlines, we call .read().split(). Then, using a for loop, we go over each individual character and check whether it is an 'S' or an 's'; each time we find one, we add one to the scount variable (short for S-count).
filetoread = open("foo.txt").read().split()
scount = 0
for k in ''.join(filetoread):
    if k.lower() == 's':
        scount+=1
print ("There are %d 's' characters" %(scount))
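The same counts can also be obtained with str.count() alone. A short sketch in Python 3 on a made-up string, comparing an exact-case count with a case-insensitive one:

```python
# Made-up text standing in for the file contents.
text = "She sells sea shells\nby the Sea Shore\n"

exact = text.count("s")               # lowercase 's' only
any_case = text.lower().count("s")    # 's' and 'S' together

# Same result as the character-by-character loop above.
looped = sum(1 for ch in text if ch.lower() == "s")

print(exact, any_case, looped)
```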
Here's a version with a reasonable time performance (~500MB/s on my machine) for ascii letters:
#!/usr/bin/env python3
import sys
from functools import partial
byte = sys.argv[1].encode('ascii') # s
print(sum(chunk.count(byte)
          for chunk in iter(partial(sys.stdin.buffer.read, 1<<14), b'')))
Example:
$ echo baobab | ./count-byte b
3
It could be easily changed to support arbitrary Unicode codepoints:
#!/usr/bin/env python3
import sys
from functools import partial
char = sys.argv[1]
print(sum(chunk.count(char)
          for chunk in iter(partial(sys.stdin.read, 1<<14), '')))
Example:
$ echo ⛄⛇⛄⛇⛄ | ./count-char ⛄
3
To use it with a file, you could use a redirect:
$ ./count-char ⛄ < input_file
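The scripts above rely on the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. The sketch below shows the idiom in isolation; io.BytesIO stands in for sys.stdin.buffer, and the chunk size of 4 is deliberately tiny so that the chunking is visible.

```python
import io
from functools import partial

stream = io.BytesIO(b"baobab\n")          # stand-in for sys.stdin.buffer
read_chunk = partial(stream.read, 4)      # tiny chunk size, for illustration only

# iter(callable, sentinel) calls read_chunk() until it returns b"" (end of file)
chunks = list(iter(read_chunk, b""))
total = sum(chunk.count(b"b") for chunk in chunks)

print(chunks, total)
```

Because the search target is a single byte (or a single code point in the text version), a match can never straddle a chunk boundary, so the per-chunk counts add up to the true total.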
So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third is the number of "A"s, 4th is "G"s, 5th is "C"s, 6th is "T"s, 7th is %GC, 8th is whether or not it has "TGA" in the sequence. Then I get all these values and write a table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        continue
    Acount+=line.count("A")
    Ccount+=line.count("C")
    Gcount+=line.count("G")
    Tcount+=line.count("T")
    genomeSize=Acount+Gcount+Ccount+Tcount
    percentGC=(Gcount+Ccount)*100.00/genomeSize
    print "sequence", seq
    print "Length of Sequence",len(line)
    print Acount,Ccount,Gcount,Tcount
    print "Percent of GC","%.2f"%(percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr=str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows for DNA sequences that run over multiple lines.
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
        data[seq]['seq']=seq
        previous_line_end = "" #TGA might be split across lines
        continue
    data[seq]['Acount']+=line.count("A")
    data[seq]['Ccount']+=line.count("C")
    data[seq]['Gcount']+=line.count("G")
    data[seq]['Tcount']+=line.count("T")
    # add only this line's bases; adding the running totals would count earlier lines again
    data[seq]['genomeSize']+=line.count("A")+line.count("G")+line.count("C")+line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in sorted(data):
    data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
    # column order follows the question: ID, length, A, G, C, T, %GC, has TGA
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
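Another way to get per-sequence values is to join the wrapped lines of each sequence into one string first, so that every statistic, including the "TGA" check across a line break, is computed on the whole sequence. This is only a sketch in Python 3 (the question uses 2.7); fasta_stats is a hypothetical name, the sample data is made up, and %GC is taken relative to the full sequence length, which assumes the sequences contain only A, C, G and T.

```python
def fasta_stats(lines):
    """Return one row of statistics per sequence in a FASTA-style list of lines."""
    sequences = []                      # one joined string per '>' header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            sequences.append("")        # start a new sequence
        elif line and sequences:
            sequences[-1] += line       # join wrapped lines together

    rows = []
    for number, seq in enumerate(sequences, start=1):
        counts = [seq.count(base) for base in "AGCT"]   # A, G, C, T columns
        gc = seq.count("G") + seq.count("C")
        percent_gc = 100.0 * gc / len(seq) if seq else 0.0
        rows.append([number, len(seq)] + counts + ["%.2f" % percent_gc, "TGA" in seq])
    return rows

# Made-up sample: in seq1 the "TGA" spans the break between "ATG" and "ACC".
sample = [">seq1", "ATG", "ACC", ">seq2", "GGCC"]
rows = fasta_stats(sample)
for row in rows:
    print("\t".join(str(value) for value in row))
```

Each row could then be written to dna_stats.txt with a single write() call per sequence instead of printed.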
I have 15 lines in a log file, and I want to read, for example, the 4th and 10th lines through Python and display them on output, saying that this string was found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, with code, how to achieve this in Python.
Just to elaborate on this example: the first string (String A) is unique and can be found easily in the log file. The next string (String B) comes within 40 lines of the first one, but it also occurs in lots of other places in the log file, so I need to look for it only within the first 40 lines after String A and print that these strings were found.
Also, I can't use Python's with statement, as it gives me errors like "'with' will become a reserved keyword in Python 2.6". I am using Python 2.5.
You can use this:
fp = open("file")
for i, line in enumerate(fp):
    if i == 3:
        print line
    elif i == 9:
        print line
        break
fp.close()
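def bar(start,end,search_term):
    fil = open("foo.txt")  # no "with" statement, so this also runs on Python 2.5
    # slice with [start:end] and strip the newlines before comparing
    lines = [line.strip() for line in fil.readlines()[start:end]]
    fil.close()
    if search_term in lines:
        print search_term + " was found"
>>> bar(4, 10, "dsfsfs")
dsfsfs was found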
#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
    if lookfor==a[element]:
        print a[element],'on',element
#b on 33
#b on 34
This is one simple, easy-to-read way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
After edits by the author:
The easiest thing you can do then is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt','r'):
    # each line ends with a newline, so remove it before comparing
    if looking_for == line.rstrip('\n'):
        print i, line
    i+=1
It's efficient and easy :)
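For the follow-up requirement in the question (find String A, then look for String B only within the next 40 lines), one possible approach is sketched below. find_pair and its window parameter are hypothetical names, and the sample list reuses the first few lines of the log excerpt from the question. The sketch is written for Python 3 but does not need the with statement.

```python
def find_pair(lines, first, second, window=40):
    """Return (index_a, index_b) if `second` occurs within `window` lines after `first`."""
    for i, line in enumerate(lines):
        if first in line:
            # only look at the lines that follow the first match
            for j in range(i + 1, min(i + 1 + window, len(lines))):
                if second in lines[j]:
                    return i, j
            return i, None              # first string found, second one not in range
    return None, None

log = ["abc", "def", "aaa", "aaa", "aasd", "dsfsfs", "dssfsd", "sdfsds"]
result = find_pair(log, "def", "dsfsfs")
print(result)
```

The returned indexes are zero-based; add 1 to each to report line numbers as they appear in the file.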