Find and copy a line using regex in Python

I am new to this forum and to programming and apologize in advance if I violate any of the forum rules. I have researched this extensively, but I couldn't find a solution for my problem.
So I have a very long file that has this general structure:
data="""
20.020001 563410 9
20.520001 577410 20
21.022001 591466 9
21.522001 605466 120
23.196001 652338 2
25.278001 710634 7
25.780001 724690 144
26.280001 738690 9
26.782001 752746 40
27.282001 766746 9
27.784001 780802 140
29.372001 825266 2
31.458001 883674 7
31.958002 897674 8
32.458002 911674 9
32.958002 925674 10
"""
I imported the file using
with open(r"C:\blablabla\text.txt", 'r+') as infile:  # raw string, so \b and \t are not read as escapes
    data = infile.read()
Now I am trying to use a regular expression to find all lines that end with 140 through 146, so I did this:
items=re.findall('.......................14[0-6]\n',data,re.MULTILINE)
for x in items:
    print x
This works, but when I now try to copy those lines that contain the regular expression,
for x in items:
    if items in data:
        data.write(items)
I get the following error:
if items in data:
TypeError: 'in <string>' requires string as left operand, not list
I understand what the problem is, but I don't know how to solve it. How can I feed the left operand a string when the outcome of my regex is a list?
Any help is much appreciated!

You should simply handle each line separately:
data = infile.readlines()
for line in data:
    if re.match('.......................14[0-6]\n', line):
        print line[:-1]
The last character of the line is a trailing newline, which would be duplicated by the one the print statement includes.

You can read the file line by line:
data=""
with open("file.txt", 'r+') as infile:
    for line in infile:
        if (146 >= int(line.split()[-1]) >= 140) :
            data = data + line
print data

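Your regex can be simplified further:
re.findall('.*?14[0-6]\n', data)
Note that this also matches a line whose last field merely ends in 140 through 146 (for example 2140).
To overcome your further problem, join the matches back into a single string:
items = re.findall('.*?14[0-6]\n',data)
result=""
for x in items:
    result+=str(x)
print result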
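
Putting the answers together, here is a minimal sketch of the whole task: find the lines whose last field is 140 through 146 and copy them to a second file. It is written for Python 3, and the names matching_lines and copy_matching_lines, as well as both file paths, are placeholders that do not come from the question. The pattern is anchored per line with re.MULTILINE and requires the number to start the line or follow a space or tab, so a value such as 2140 is not picked up.

```python
import re

# The last field must be exactly 140..146: it either starts the line or
# follows a space/tab, and only trailing spaces/tabs may come after it.
LINE_RE = re.compile(r"^(?:.*[ \t])?14[0-6][ \t]*$", re.MULTILINE)


def matching_lines(text):
    """Return every line of `text` whose last field is 140 through 146."""
    return LINE_RE.findall(text)


def copy_matching_lines(src_path, dst_path):
    """Copy matching lines from src_path to dst_path; return how many."""
    with open(src_path) as infile:
        found = matching_lines(infile.read())
    with open(dst_path, "w") as outfile:
        for line in found:
            outfile.write(line + "\n")
    return len(found)
```

Because findall returns a list of strings, each item can be written out directly; there is no need to test whether the list is "in" the original data.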

Related

Convert string (letter) from file text to integer

I'm the new one to machine learning. I got some problem when trying to use int for letters. I use Python 3.5 on Mac OS. This is my code:
def file2matrix(filename):
    fr = open(filename)
    numberOfLines = len(fr.readlines())
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    fr = open(filename)
    index=0
    for line in fr.readlines():
        line = line.strip()
        listFromLine1 = line.split('\t')
        listFromLine = zeros(3)
        i = 0
        for value in listFromLine1:
            if value.isdigit():
                valueAsInt = int(value)
                listFromLine[i] = valueAsInt
                i += 1
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine1[-1]))
        index += 1
    return returnMat, classLabelVector
This is my txt file:
23 8 1 f
7 8 5 j
5 9 1 j
6 6 6 f
This is the error:
classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f'
Can anybody help me with these problems?
If I understand your desired outcome correctly, you want to return a list with n lists in it. Each list will be along the line of [23. 8. 1.]. Then you want a second list that takes the last number of each list like this: [1, 5, 1, 6].
Assuming this is all correct, the reason you are getting classLabelVector.append(int(listFromLine1[-1])) ValueError: invalid literal for int() with base 10: 'f' is that the last field of each line is a letter ('f' or 'j'), not a number, so int() cannot convert it. I found 3 issues; fixing them should clear the error.
First, I found no '\t' in your text document. I instead used listFromLine1 = line.split(' ') and it split based on spaces. This could simply be from the way it copied over when you posted, though.
Second, when you assign a value for each position in listFromLine you then ignore it and append from listFromLine1 which you did nothing to, so it remains a string.
Third, try using if value.isnumeric(): instead of if value.isdigit():.
Fixing these few problems should get the program working. Also, you open the file and run fr.readlines() twice and never close it. You're making the program do the same work twice for the same information. You should try to rewrite it to open the file only once and use with open() as fr: because it will close the file when done.
EDIT: if you want the second list to be the letters instead [f, j, j, f] then keep it as listFromLine1 and use str() instead of int(): classLabelVector.append(str(listFromLine1[-1]))
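
As a rough illustration of those suggestions, the sketch below reads the file once, splits on any whitespace, and keeps the trailing letter as a string label. It is only a sketch under stated assumptions: every non-blank line holds three integers followed by a label, parse_rows is a made-up helper name, and plain lists stand in for the numpy zeros array so that the example needs nothing beyond the standard library.

```python
def parse_rows(lines):
    """Split each line into three integer features and a string label."""
    features = []
    labels = []
    for line in lines:
        fields = line.split()       # no argument: splits on tabs or spaces
        if not fields:
            continue                # skip blank lines
        features.append([int(value) for value in fields[:3]])
        labels.append(fields[-1])   # keep the letter as a string
    return features, labels


def file2matrix(filename):
    """Read the file once; the with block closes it automatically."""
    with open(filename) as fr:
        return parse_rows(fr)
```

If a numpy matrix is needed afterwards, the features list can be passed to numpy.array.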

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
    # open all 4 files with a meaningful name
    file=[open(outputfile.txt","w")
    for line in a:
Without regex:
for line in file:
    keep = []
    line = line.strip()
    if line.startswith('FRAME'):
        continue
    first, second, *_ = line.split()
    keep.append(first)
    first, second = second.split('=')
    keep.extend(second.split(','))
    print(' '.join(keep))
My advice? Since I don't write many regexes, I avoid writing big ones all at once. Since you've already done that, I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
    a.readline()
    for line in a.readlines():
        result = r.match(line)
        if result:
            print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change: the J= group uses \d+ rather than \S+.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.
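
A minimal sketch of that pattern in use, assuming the input lines look like the sample in the question. The names FRAME_RE and extract_joints are made up for illustration, and the sketch is written for Python 3.

```python
import re

# digits, whitespace, then J=<digits>,<digits>; the rest of the line is ignored
FRAME_RE = re.compile(r"\s*(\d+)\s+J=(\d+),(\d+)")


def extract_joints(lines):
    """Return 'id start end' strings for every line that matches FRAME_RE."""
    rows = []
    for line in lines:
        result = FRAME_RE.match(line)
        if result:  # the FRAME header has no digits, so it fails to match
            rows.append(" ".join(result.groups()))
    return rows
```

The returned rows can then be joined with newlines and written to the output file in one call.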

How do you make tables with previously stored strings?

So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third is the number of "A"s, 4th is "G"s, 5th is "C", 6th is "T", 7th is %GC, 8th is whether or not it has "TGA" in the sequence. Then I get all these values and write a table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        continue
    Acount+=line.count("A")
    Ccount+=line.count("C")
    Gcount+=line.count("G")
    Tcount+=line.count("T")
    genomeSize=Acount+Gcount+Ccount+Tcount
    percentGC=(Gcount+Ccount)*100.00/genomeSize
    print "sequence", seq
    print "Length of Sequence",len(line)
    print Acount,Ccount,Gcount,Tcount
    print "Percent of GC","%.2f"%(percentGC)
    if "TGA" in line:
        print "Yes"
    else:
        print "No"
    fh2 = open("dna_stats.txt","w")
    for line in alllines:
        splitlines = line.split()
        lenstr=str(len(line))
        seqstr = str(seq)
        fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1,2,3,4,5,etc. to represent all of the sequences. I tried it with the other variables and it just got the total amounts of the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....
Indentation matters. This says "for every line (#1), open a file and then loop over every line again(#2)..."
De-indent those things.
This puts the info in a dictionary as you go and allows DNA sequences to span multiple lines:
from __future__ import division # ensure things like 1/2 is 0.5 rather than 0
from collections import defaultdict
fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")
seq=0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq+=1
        data[seq]=defaultdict(int) #default value will be zero if key is not present hence we can do +=1 without originally initializing to zero
        data[seq]['seq']=seq
        previous_line_end = "" #TGA might be split across lines
        continue
    data[seq]['Acount']+=line.count("A")
    data[seq]['Ccount']+=line.count("C")
    data[seq]['Gcount']+=line.count("G")
    data[seq]['Tcount']+=line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA']= data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:]) #save previous_line_end for next line removing new line character.
for seq in sorted(data):
    #total the counts once per sequence, after all of its lines have been read
    data[seq]['genomeSize']=data[seq]['Acount']+data[seq]['Gcount']+data[seq]['Ccount']+data[seq]['Tcount']
    data[seq]['percentGC']=(data[seq]['Gcount']+data[seq]['Ccount'])*100.00/data[seq]['genomeSize']
    #columns follow the order asked for: ID, length, A, G, C, T, %GC, TGA
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()
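
An alternative way to handle sequences that span several lines is to join each record's lines before counting, which removes the need to track a split TGA. The sketch below takes that route. It is written for Python 3, fasta_stats is a made-up name, and the length it reports is the full joined sequence, including any characters other than A, C, G and T.

```python
def fasta_stats(lines):
    """Return one row per record: (id, length, A, G, C, T, %GC, has_TGA)."""
    records = []                    # each item is a list of sequence chunks
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([])      # a header line starts a new record
        elif line and records:
            records[-1].append(line)
    rows = []
    for number, chunks in enumerate(records, 1):
        seq = "".join(chunks)       # joined first, so TGA cannot be split
        gc = seq.count("G") + seq.count("C")
        percent_gc = 100.0 * gc / len(seq) if seq else 0.0
        rows.append((number, len(seq), seq.count("A"), seq.count("G"),
                     seq.count("C"), seq.count("T"), percent_gc,
                     "TGA" in seq))
    return rows
```

Each row can then be formatted and written to dna_stats.txt, one line per sequence.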

Matching and displaying specific lines through Python

I have 15 lines in a log file and I want to read, for example, the 4th and 10th lines through Python and display them on output, saying that this string was found:
abc
def
aaa
aaa
aasd
dsfsfs
dssfsd
sdfsds
sfdsf
ssddfs
sdsf
f
dsf
s
d
Please suggest, with code, how to achieve this in Python.
To elaborate on this example: the first string (String A) is unique and can be found easily in the log file. The next string, String B, comes within 40 lines of the first one, but it also occurs in lots of other places in the log file. So I need to look for String B only within the 40 lines after String A, and then print that both strings were found.
Also, I can't use Python's with statement, as it gives me errors like 'with' will become a reserved keyword in Python 2.6. I am using Python 2.5.
You can use this:
fp = open("file")
for i, line in enumerate(fp):
    if i == 3:
        print line
    elif i == 9:
        print line
        break
fp.close()
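def bar(start, end, search_term):
    # plain open/close, because the asker's Python 2.5 rejects `with`
    fil = open("foo.txt")
    # slice with start:end and strip newlines so whole lines compare equal
    lines = [l.rstrip("\n") for l in fil.readlines()[start:end]]
    fil.close()
    if search_term in lines:
        print search_term + " has been found"
>>> bar(4, 10, "dsfsfs")
dsfsfs has been found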
#list of random characters
from random import randint
a = list(chr(randint(0,100)) for x in xrange(100))
#look for this
lookfor = 'b'
for element in xrange(100):
    if lookfor==a[element]:
        print a[element],'on',element
#b on 33
#b on 34
is one easy to read and simple way to do it. Can you give part of your log file as an example? There are other ways that may work better :).
after edits by author:
The easiest thing you can do then is:
looking_for = 'findthis'
i = 1
for line in open('filename.txt','r'):
    # strip the trailing newline, otherwise the comparison never succeeds
    if looking_for == line.rstrip('\n'):
        print i, line
    i+=1
it's efficient and easy :)
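
For the elaborated requirement (find the unique first string, then look for the second string only within the next 40 lines), a minimal sketch could look like the following. It is written for Python 3 and takes any iterable of lines, so it does not depend on the with statement. The name find_pair and its window parameter are assumptions made for illustration, and lines are compared as whole lines after stripping the newline.

```python
def find_pair(lines, first, second, window=40):
    """Find `first`, then `second` within the next `window` lines.

    Returns (line_no_first, line_no_second) using 1-based line numbers,
    or None when the pair is not found.
    """
    first_at = None
    for number, line in enumerate(lines, 1):
        text = line.rstrip("\n")
        if first_at is None:
            if text == first:
                first_at = number
        elif number - first_at > window:
            return None             # window used up without finding `second`
        elif text == second:
            return first_at, number
    return None
```

Passing an open file object works as well, since iterating over a file yields its lines one at a time.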

Python regex question

I'm having some problems figuring out a solution to this problem.
I want to read from a file on a per line basis and analyze whether that line has one of two characters (1 or 0). I then need to sum up the value of the line and also find the index value (location) of each of the "1" character instances.
so for example:
1001
would result in:
line 1=(count:2, pos:[0,3])
I tried a lot of variations of something like this:
r=urllib.urlopen(remote-resouce)
list=[]
for line in lines:
    for m in re.finditer(r'1',line):
        list.append((m.start()))
I'm having two issues:
1) I thought that the best solution would be to iterate through each line and then use a regex finditer function. My issue here is that I keep failing to write a for loop that works. Despite my best efforts, I keep returning the results as one long list, rather than a multidimensional array of dictionaries.
Is this approach the right one? If so, how do I write the correct for loop?
If not, what else should I try?
Perhaps do it without regex:
import urllib
url='http://stackoverflow.com/questions/5158168/python-regex-question/5158341'
f=urllib.urlopen(url)
for linenum,line in enumerate(f):
    print(line)
    locations=[pos for pos,char in enumerate(line) if char=='1']
    print('line {n}=(count:{c}, pos:{l})'.format(
        n=linenum,
        c=len(locations),
        l=locations
        ))
Using regexes here is probably a bad idea. You can see if a 1 or 0 is in a line of text with '0' in line or '1' in line, and you can get the count with line.count('1').
Finding all of the locations of 1s does require iterating through the string, I believe.
Unubtu's code works fine. I tested it on a sample file which also has all 0's for a particular line. Here is the complete code -
#! /usr/bin/python

# Write a program to read a text file which has 1's and 0's on each line
# For each line count the number of 1's and their position and print it

import sys

def countones(infile):
    f = open(infile,'r')
    for linenum, line in enumerate(f):
        locations = [pos for pos,char in enumerate(line) if char == '1']
        print('line {n}=(count:{c}, pos:{l})'.format(n=linenum,c=len(locations),l= locations))


def main():
    infile = './countones.txt'
    countones(infile)

# Standard boilerplate to call the main() function to begin the program
if __name__ == '__main__':
    main()
Input file -
1001
110001
111111
00001
010101
00000
Result -
line 0=(count:2, pos:[0, 3])
line 1=(count:3, pos:[0, 1, 5])
line 2=(count:6, pos:[0, 1, 2, 3, 4, 5])
line 3=(count:1, pos:[4])
line 4=(count:3, pos:[1, 3, 5])
line 5=(count:0, pos:[])
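
If the regex approach from the question is still preferred, the one-long-list problem comes from appending every position to a single shared list. A minimal sketch that builds a fresh list for each line and returns one dictionary per line is shown below; ones_per_line is a made-up name, and the code is written for Python 3.

```python
import re


def ones_per_line(lines):
    """Return one dict per line: {'count': n, 'pos': [indexes of '1']}."""
    results = []
    for line in lines:
        # a new list for every line keeps the positions separate
        positions = [m.start() for m in re.finditer(r"1", line)]
        results.append({"count": len(positions), "pos": positions})
    return results
```

The index of each dictionary in the returned list is the zero-based line number, matching the output shown above.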
