Convert string (letter) from file text to integer

I'm new to machine learning and ran into a problem when trying to use int() on letters. I'm using Python 3.5 on Mac OS. This is my code:
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index=0
    for line in fr.readlines():
        line = line.strip()
        listFromLine1 = line.split('\t')
        listFromLine = zeros(3)
        i = 0
        for value in listFromLine1:
            if value.isdigit():
                valueAsInt = int(value)
                listFromLine[i] = valueAsInt
                i += 1
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine1[-1]))
        index += 1
    return returnMat, classLabelVector
This is my txt file:
23 8 1 f
7 8 5 j
5 9 1 j
6 6 6 f
This is the error:
classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f'
Can anybody help me with these problems?

If I understand your desired outcome correctly, you want to return a list with n rows in it, each along the lines of [23. 8. 1.], plus a second list that takes the last number of each row, like this: [1, 5, 1, 6].
Assuming this is all correct, the reason you are getting classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f' is that listFromLine1[-1] is the last field of the raw line, which is the letter 'f', a string that int() cannot parse. I found 3 issues; fixing them should make the error go away.
First, I found no '\t' in your text document. I instead used listFromLine1 = line.split(' ') and it split based on spaces. This could simply be from the way it copied over when you posted, though.
Second, when you assign a value for each position in listFromLine you then ignore it and append from listFromLine1 which you did nothing to, so it remains a string.
Third, try using if value.isnumeric(): instead of if value.isdigit():.
Fixing these few problems should get the program working. Also, you open the file and run fr.readlines() twice and never close it, so you're making the program do the same work twice. You should try to rewrite it to open the file only once, using with open() as fr:, because the file is then closed automatically when the block is done.
EDIT: if you want the second list to be the letters instead [f, j, j, f] then keep it as listFromLine1 and use str() instead of int(): classLabelVector.append(str(listFromLine1[-1]))
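Putting these suggestions together, a minimal sketch could look like the following. It is an illustration and not the original poster's final code: it uses plain Python lists instead of NumPy's zeros so that it runs without extra packages, it assumes every row has three integers followed by one letter, and encode_labels is a hypothetical helper added here to show one way of turning the letters into integers.

```python
def file2matrix(filename):
    """Read rows of three integers followed by a letter label."""
    rows = []
    labels = []
    with open(filename) as fr:           # the file is closed automatically
        for line in fr:                  # single pass, no second readlines()
            parts = line.split()         # splits on tabs or spaces
            if not parts:
                continue                 # skip blank lines
            rows.append([int(v) for v in parts[:3]])
            labels.append(parts[-1])     # keep the letter as a string
    return rows, labels


def encode_labels(labels):
    """Hypothetical helper: map each distinct letter to an integer code,
    numbered in order of first appearance."""
    codes = {}
    encoded = [codes.setdefault(label, len(codes)) for label in labels]
    return encoded, codes
```

With the sample file from the question, file2matrix returns the labels ['f', 'j', 'j', 'f'], and encode_labels turns them into [0, 1, 1, 0]. If a NumPy array is needed, the rows list can be passed to numpy.array afterwards.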

Related

generate string with length equal to length of time in file, with 1 label per second, python

I have a file like this:
https://gist.github.com/manbharae/70735d5a7b2bbbb5fdd99af477e224be
What I want to do is generate 1 label per second.
Since the file above is 160 seconds long, there should be 160 labels. In other words, I want to generate a string of length 160.
However, I end up with a string of length 166 instead of 160.
My code:
import os

filename = './test_file.txt'
ann = []
with open(filename, 'r') as f:
    for line in f:
        _, end, label = line.strip().split('\t')
        ann.append((int(float(end)), 'MIT' if label == 'MILAN' else 'not-MIT'))
str = ''
prev_value = 0
for s in ann:
    value = s[0]
    letter = 'M' if s[1] == 'MIT' else 'x'
    str += letter * (value - prev_value)
    print str
    prev_value = value
name_of_file, file_ext = os.path.splitext(os.path.basename(filename))
print "\n\nfile_name processed:", name_of_file
print str
print "length of string", len(str),"\n\n"
My final output:
xxxxxxxMxMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxMMMMMMMMMMMMMMMMMMMMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Its length is 166, which is wrong. The string should be 160 characters long, one character per second, because the file is 160 seconds long.
There is a small bug somewhere, but I am unable to find it.
Please advise what's going wrong here.
Thanks.
One thing I tried was adding an if condition to break out of the loop once a length of 160 is reached, like this:
if ann[len(ann)-1][0] == len(str):
    break
However, it didn't help. AFAIK, something is going wrong in the last iteration, because until then everything is fine.
I looked at: https://stackoverflow.com/a/14927311/4932791
https://stackoverflow.com/a/1424016/4932791

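The lengths don't add up because on two occasions the end value is lower than the previous one, so value - prev_value is negative. Multiplying a string by a negative number gives an empty string, but prev_value is still moved back to the lower value, so the next segment is counted from too early a point and comes out too long: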
(69, 'not-MIT')
(68, 'not-MIT')
(76, 'not-MIT')
(71, 'not-MIT')
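Those two drops (69 to 68 and 76 to 71) account for 1 + 5 = 6 extra characters, which is exactly the difference between 166 and 160.
For future reference: it's better not to name a variable str, because str is already the name of Python's built-in string type, and the assignment shadows it.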
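One way to guard against this is sketched below. It assumes that a row whose end time does not advance past the time already covered should simply be skipped; whether that is the right policy depends on what such rows mean in the data. The function name labels_per_second is made up for illustration, and the code is written for Python 3.

```python
def labels_per_second(ann):
    """Build one label character per second from (end_second, label) pairs.

    Rows whose end time is not later than the time already covered are
    skipped, so the covered time never moves backwards.
    """
    pieces = []
    prev = 0
    for end, label in ann:
        if end <= prev:
            continue                      # out-of-order row: nothing new to cover
        letter = 'M' if label == 'MIT' else 'x'
        pieces.append(letter * (end - prev))
        prev = end
    return ''.join(pieces)


# Small made-up example: the third row goes backwards and is ignored.
example = [(3, 'not-MIT'), (5, 'MIT'), (4, 'not-MIT'), (8, 'not-MIT')]
print(labels_per_second(example))         # xxxMMxxx
```

With this guard the length of the result always equals the largest end time seen, so a 160-second file yields 160 characters.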

Formatting output of CSV data?

I'm fairly new to Python and made something that produces this output:
(The text is in a CSV file, like so:
1,A
2,B
3,C etc)
Number       Letter
1            A
2            B
3            C
26           Z
Unfortunately, I spent a good amount of time making it using a complicated method in which I manually made spaces like this:
Updated code right now:
fx = int(input('Number?\n'))
f=open('nums.txt','r')
lines=f.readlines()
line = lines[fx - 1]
with open('nums.txt','r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        NUM, LTR, SMB = line.rsplit(',', 1)
        print(NUM.ljust(13) + LTR.ljust(13) + SMB)
How do I get it to make 3 columns? Right now it comes up with a
ValueError: not enough values to unpack (expected 3, got 2)
So is there a simpler method of achieving this that doesn't move the strings around like this:
Number       Letter
1            A
2            B
3            C
26            Z #< string moves with spaces.

For simple alignment, you can use ljust or rjust. There is also no need to read the entire file for each line you want to process:
with open('numberletter','r') as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        number, letter = line.rsplit(',', 1)
        print(number.ljust(13) + letter)
For more complex output formatting, look at str.format() and the formatting syntax.
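As a rough illustration of the str.format() route, the sketch below pads the first column to a fixed width so the second column always starts at the same position, however many digits the number has. The function name format_table, the header text, and the width of 13 are assumptions chosen to match the question; each input line is assumed to hold exactly two comma-separated fields.

```python
def format_table(lines, header=("Number", "Letter"), width=13):
    """Return aligned text rows for 'number,letter' input lines."""
    row_fmt = "{0:<{w}}{1}"               # left-align field 0 in a column w wide
    out = [row_fmt.format(header[0], header[1], w=width)]
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        number, letter = line.split(",", 1)
        out.append(row_fmt.format(number, letter, w=width))
    return out


for row in format_table(["1,A\n", "2,B\n", "26,Z\n"]):
    print(row)
```

Because the padding is computed by the format specification, the row for 26 lines up with the rows for single-digit numbers. A third column would need a third field in both the data and the format string; unpacking three names from a two-field line is what raises the ValueError in the question.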

You can use the sys module for that, together with printf-style width specifiers:
import sys
a=[1,"A"]
sys.stdout.write("%-6s %-50s " % (a[0],a[1]))

Find and copy a line using regex in Python

I am new to this forum and to programming and apologize in advance if I violate any of the forum rules. I have researched this extensively, but I couldn't find a solution for my problem.
So I have a very long file that has this general structure:
data="""
20.020001 563410 9
20.520001 577410 20
21.022001 591466 9
21.522001 605466 120
23.196001 652338 2
25.278001 710634 7
25.780001 724690 144
26.280001 738690 9
26.782001 752746 40
27.282001 766746 9
27.784001 780802 140
29.372001 825266 2
31.458001 883674 7
31.958002 897674 8
32.458002 911674 9
32.958002 925674 10
"""
I imported the file using
with open(r"C:\blablabla\text.txt", 'r+') as infile:
    data = infile.read()
Now I am trying to use a regular expression to find all lines that end with 140 through 146, so I did this:
items=re.findall('.......................14[0-6]\n',data,re.MULTILINE)
for x in items:
    print x
This works, but when I now try to copy those lines that contain the regular expression,
for x in items:
    if items in data:
        data.write(items)
I get the following error:
if items in data:
TypeError: 'in <string>' requires string as left operand, not list
I understand what the problem is, but I don't know how to solve it. How can I feed the left operand a string when the outcome of my regex is a list?
Any help is much appreciated!

You should simply handle each line separately:
data = infile.readlines()
for line in data:
    if re.match('.......................14[0-6]\n', line):
        print line[:-1]
The last character of each line is a trailing newline; slicing it off with line[:-1] avoids printing a blank line, since the print statement adds its own newline.

You can read the file line by line:
data=""
with open("file.txt", 'r+') as infile:
    for line in infile:
        if (146 >= int(line.split()[-1]) >= 140) :
            data = data + line
print data

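Your regex can be simplified further:
re.findall('.*?14[0-6]\n', data)
Note that this pattern also matches a line whose last number merely ends in 140 to 146, such as 1140.
To overcome your further problem, join the matches into a single string:
items = re.findall('.*?14[0-6]\n',data)
result = ""
for x in items:
    result+=str(x)
print result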
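The answers above can be combined into a small Python 3 sketch that returns the matching lines as a list of strings, which can then be joined or written to a file one at a time. The pattern here is an assumption about the data: it requires a space or tab right before the final 14x field, so that a value such as 1140 is not matched. The names LINE_RE and lines_ending_140_to_146 are made up for illustration.

```python
import re

# A whole line whose last whitespace-separated field is 140..146.
# [ \t] is used instead of \s so a match can never span two lines.
LINE_RE = re.compile(r"^.*[ \t]14[0-6][ \t]*$", re.MULTILINE)


def lines_ending_140_to_146(text):
    """Return every line of text whose last field is between 140 and 146."""
    return LINE_RE.findall(text)


data = """
20.020001 563410 9
21.522001 605466 120
25.780001 724690 144
27.784001 780802 140
31.958002 897674 1140
"""

matches = lines_ending_140_to_146(data)
print("\n".join(matches))
```

Since findall returns a list, each element is a plain string, so checks such as `x in data` or calls such as `outfile.write(x + "\n")` work per item without the TypeError from the question.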

How do you make tables with previously stored strings?

So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th the number of "G"s, the 5th "C"s, the 6th "T"s, the 7th the %GC, and the 8th whether or not it has "TGA" in the sequence. Then I take all these values and write a table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        continue
    Acount+=line.count("A")
    Ccount+=line.count("C")
    Gcount+=line.count("G")
    Tcount+=line.count("T")
    genomeSize=Acount+Gcount+Ccount+Tcount
    percentGC=(Gcount+Ccount)*100.00/genomeSize
    print "sequence", seq
    print "Length of Sequence",len(line)
    print Acount,Ccount,Gcount,Tcount
    print "Percent of GC","%.2f"%(percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr=str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7

Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again (#2)..."
De-indent those things.
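Beyond the indentation, the counters in the question are never reset, so they accumulate over the whole file. A minimal Python 3 sketch of per-sequence bookkeeping follows; it assumes a FASTA-like layout in which every record starts with a ">" header line, numbers the records 1, 2, 3 and so on, and uses a made-up function name, sequence_stats. The question uses Python 2.7, so this is an illustration of the idea and not a drop-in replacement.

```python
def sequence_stats(lines):
    """Return one tuple per sequence:
    (id, length, A, G, C, T, percent_GC, has_TGA)."""
    sequences = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            sequences.append("")          # start a fresh record
        elif line and sequences:
            sequences[-1] += line         # join lines of a multi-line sequence
    rows = []
    for seq_id, seq in enumerate(sequences, start=1):
        gc = 100.0 * (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
        rows.append((seq_id, len(seq), seq.count("A"), seq.count("G"),
                     seq.count("C"), seq.count("T"), round(gc, 2), "TGA" in seq))
    return rows


sample = [">s1", "ATGA", "CC", ">s2", "GGTT"]
for row in sequence_stats(sample):
    print("\t".join(str(value) for value in row))
```

Joining the lines of a record before counting also means that a "TGA" split across a line break is still found. Writing the table then only needs the output file to be opened once, outside the loop.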

This puts the info in a dictionary as you go and allows for DNA sequences to go over multiple lines:
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
        data[seq]['seq']=seq
        previous_line_end = "" #TGA might be split across lines
        continue
    data[seq]['Acount']+=line.count("A")
    data[seq]['Ccount']+=line.count("C")
    data[seq]['Gcount']+=line.count("G")
    data[seq]['Tcount']+=line.count("T")
    #the counts are already running totals for this sequence, so assign rather than add
    data[seq]['genomeSize']=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in sorted(data.keys()): #write the sequences in order
    data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
    #column order follows the question: ID, length, A, G, C, T, %GC, TGA
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()

matching and displaying specific lines through python

I have 15 lines in a log file and I want to read, for example, the 4th and 10th lines through Python and display them on output, saying that this string was found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, with code, how to achieve this in Python.
To elaborate on this example: the first string (string A) is unique and can be found easily in the log file. The next string (string B) comes within 40 lines of the first one, but it also occurs in lots of other places in the log file, so I need to look for it only within the 40 lines after string A, and print that these strings were found.
Also, I can't use Python's with statement, as it gives me warnings like 'with' will become a reserved keyword in Python 2.6. I am using Python 2.5.

You can use this:
fp = open("file")
for i, line in enumerate(fp):
    if i == 3:
        print line
    elif i == 9:
        print line
        break
fp.close()

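from __future__ import with_statement # needed for 'with' on Python 2.5

def bar(start, end, search_term):
    with open("foo.txt") as fil:
        # slice with start:end and strip the trailing newlines before comparing
        lines = [line.strip() for line in fil.readlines()[start:end]]
    if search_term in lines:
        print search_term + " was found"
>>> bar(4, 10, "dsfsfs")
dsfsfs was found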

#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
    if lookfor==a[element]:
        print a[element],'on',element
#b on 33
#b on 34
is one easy-to-read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
After edits by the author:
The easiest thing you can do then is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt','r'):
    if looking_for == line.strip(): # strip the trailing newline before comparing
        print i, line
    i+=1
It's efficient and easy :)
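For the elaborated requirement (string B counts only when it appears within 40 lines after string A), a rough sketch could look like this. The function name find_b_after_a and its parameters are invented for illustration, and it reports only the first B after each A. It is written for Python 3; the start argument of enumerate needs Python 2.6 or later, so on Python 2.5 the line counter would have to be kept by hand.

```python
def find_b_after_a(lines, marker_a, marker_b, window=40):
    """Return (line_of_A, line_of_B) pairs, using 1-based line numbers,
    where B occurs within `window` lines after A."""
    pairs = []
    a_line = None                         # line number of the latest unmatched A
    for number, line in enumerate(lines, start=1):
        if marker_a in line:
            a_line = number
        elif a_line is not None and number - a_line <= window:
            if marker_b in line:
                pairs.append((a_line, number))
                a_line = None             # stop looking until the next A
    return pairs


log = ["abc", "def", "aaa", "aaa", "aasd"]
print(find_b_after_a(log, "abc", "aaa", window=2))   # [(1, 3)]
```

Any iterable of lines works here, including an open file object, so the whole log never has to be held in memory.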
