I have a BLAST output in default format. I want to parse and extract only the info I need using regex. However, in the line below
Query= contig1
There is a space there between '=' and 'contig1'. So in my output it prints a space in front. How to avoid this? Below is a piece of my code,
import re
output = open('out.txt','w')
with open('in','r') as f:
    for line in f:
        if re.search('Query=\s', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('Query=\s')
            line = line.rstrip('\s/')
            query = line
            print >> output,query
output.close()
Output should look like this,
contig1
You could actually use the returned match to extract the value you want:
for line in f:
    match = re.search(r'Query=\s?(.*)', line)
    if match is not None:
        query = match.groups()[0]
        print >> output,query
Here we search for Query= optionally followed by one whitespace character and capture the rest of the line. match.groups()[0] returns that capture, because the regular expression has only one group.
Depending on the nature of the data, simple string prefix matching may be enough, as in the following example:
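The same idea can be wrapped in a small function. This is a minimal sketch written for Python 3; the name `extract_query` is made up for illustration, and `\s*` is used instead of `\s?` on the assumption that more than one space may follow the `=`.

```python
import re

# Hypothetical helper, not part of the original answer.
QUERY_RE = re.compile(r'^Query=\s*(.*)')

def extract_query(line):
    """Return the text after 'Query=' without surrounding whitespace, or None."""
    match = QUERY_RE.match(line)
    if match is None:
        return None
    return match.group(1).strip()

print(extract_query('Query= contig1\n'))
```

Lines that do not start with `Query=` return `None`, so the caller can skip them.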
output = open('out.txt','w')
with open('in.txt','r') as f:
    for line in f:
        if line.startswith('Query='):
            query = line.replace('Query=', '').strip()
            print >> output,query
output.close()
In this case you don't need the re module at all.
If you are just looking for lines like tag=value, do you need regex?
# partition() does not fail when a line has no '=' or more than one
tag, sep, value = line.partition('=')
if tag == 'Query':
    print value.strip()
a='Query= conguie'
print "".join(a.split('Query='))
#output conguie
Comma in print statement adds space between parameters. Change
print output,query
to
print "%s%s"%(output,query)
I want to get the complete line after the first occurrence of the delimiter character. I am using the regex below, but it takes a very long time to parse.
import re
str = "abc:dad:kdl--sa:dajs: idsa:kd"
mypat = re.compile(r'.+?:(.+)')
result = mypat.search(str)
line = result.group(1)
line = line.replace("\r", "").replace("\n", "")
print (line)
"I want to get complete line after first occurrence of the delimiter char"
first_occurence = line.index(DELIMITER)
line_after = line[first_occurence+1:]
Note: will raise the ValueError if the DELIMITER is not present.
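If a missing delimiter should not raise an exception, `str.partition` is one alternative. The sketch below is written for Python 3; the function name `after_first` and the choice to return `None` when the delimiter is absent are assumptions for illustration.

```python
def after_first(line, delimiter=":"):
    """Return everything after the first delimiter, or None if it is absent."""
    head, sep, tail = line.partition(delimiter)
    if not sep:
        # partition() returns an empty separator when the delimiter is missing
        return None
    return tail.rstrip("\r\n")

print(after_first("abc:dad:kdl--sa:dajs: idsa:kd"))
```

Because `partition` stops at the first delimiter, no regex backtracking is involved.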
I have a file with some lines. Of those, I only want the lines that start with xxx. The lines that start with xxx follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
f = f.readlines()
for line in f:
line=line.rstrip()
for phrase in 'xxx:':
if re.match('^xxx:',line):
c=line
break
This code gives me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in list_of_prefixes:
            # Capture the word right after the opening quote, e.g. pqrs
            m = re.match(phrase + r':\(\d+:"(\w+)', line)
            if m is not None:
                resulting_list.append(m.group(1))
Well you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
    for line in f:
        line=line.rstrip()
        m = re.match(r'^xxx:\(\d*:("[^"]*")',line)
        if m is not None:
            print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line, match on xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets, it will be available as a group (by calling m.group(1)).
PS: next time make sure to include the exact error you are getting
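To see the group in action, here is a small self-contained check in Python 3 that runs the pattern against the two sample lines from the question. The third sample line is made up to show a non-matching case.

```python
import re

# Same pattern as in the answer above, written as a raw string.
pattern = re.compile(r'^xxx:\(\d*:("[^"]*")')

samples = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'xxx:(23:"abc",111,"def",-80)',
    'yyy:(5:"skip",1,"me",0)',  # made-up line that should not match
]

found = []
for line in samples:
    m = pattern.match(line)
    if m is not None:
        # group(1) includes the surrounding double quotes
        found.append(m.group(1))

print(found)
```

Moving the parentheses inside the quotes, as in `"([^"]*)"`, would capture the text without the quotes.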
results = []
with open("log.txt","r") as f:
    f = f.readlines()
    for line in f:
        if line.startswith("xxx"):
            line = line.split(":") # line[2] is what comes after the second :
            result = line[2].split(",")[0][1:-1] # will be pqrs
            results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. The part after the second : begins with the quoted string you want, up to the first comma. Your result is that string with the quotes removed. There is no need for regex; Python string methods will be fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
re module docs
I have a script that contains this line multiple times:
Wait(0.000005);
The objective is to write a search/replace function that converts it into this format:
//Wait(0.000005);
Wait(5);
The first part, commenting out "Wait(0.000005);", is done.
I am having difficulty generating the replacement from the first line.
Could someone please enlighten me? Thank you.
Here's a one-liner to do the substitution at the command line using sed:
sed 's/Wait(0.000005);/ \/\/Wait(0.000005);\'$'\nWait(50);/' filename.py
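import re
# Parentheses and the dot must be escaped to match them literally
regex = re.compile(r"^.*Wait\(0\.000005\);.*$", re.IGNORECASE)
for line in some_file:
    # A compiled pattern's sub() takes only the replacement and the string
    line = regex.sub("Wait(50);", line)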
This might be a possible solution
Regular expressions can come in very handy here:
import re
with open("example.txt") as infile:
    data = infile.readlines()
pattern = re.compile(r"Wait\(([0-9]+\.[0-9]*)\)")
for line in data:
    matches = pattern.match(line)
    if matches:
        value = float(matches.group(1))
        newLine = "//" + matches.group(0) + "\n\n"
        # round() guards against float error before the integer conversion
        newLine += "Wait(%i)" % round(value * 10000000)
        print newLine
    else:
        print line
Example input:
Wait(0.000005);
Wait(0.000010);
Wait(0.000005);
Example output:
//Wait(0.000005)
Wait(50)
//Wait(0.000010)
Wait(100)
//Wait(0.000005)
Wait(50)
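The same conversion can also be done in one pass with `re.sub` and a replacement function. This is a sketch in Python 3 under stated assumptions: the name `convert_waits` is made up, the trailing semicolon is kept, and the scale factor of 10000000 is taken from the answer above (the question itself shows `Wait(5)`, so adjust `factor` as needed).

```python
import re

WAIT_RE = re.compile(r'Wait\(([0-9]+\.[0-9]*)\);')

def convert_waits(text, factor=10000000):
    """Comment out each Wait(<float>); call and add a scaled integer call after it."""
    def replace(match):
        # round() avoids off-by-one results caused by float representation error
        scaled = int(round(float(match.group(1)) * factor))
        return "//" + match.group(0) + "\nWait(%d);" % scaled
    return WAIT_RE.sub(replace, text)

print(convert_waits("Wait(0.000005);\nWait(0.000010);\n"))
```

Text that does not match the pattern passes through unchanged, so the function can be applied to the whole file contents at once.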
I am attempting to pull out multiple (50-100) sequences from a large .txt file, separated by new lines ('\n'). Each sequence is a few lines long but not always the same length, so I can't just print lines x-y. The sequences end with " and the next line always starts with the same word, so maybe that could be used as a keyword.
I am writing using python 3.3
This is what I have so far:
searchfile = open('filename.txt' , 'r')
cache = []
for line in searchfile:
    cache.append(line)
for line in range(len(cache)):
    if "keyword1" in cache[line].lower():
        print(cache[line+5])
This pulls out the starting line (which is 5 lines below the keyword line always) however it only pulls out this line.
How do I print the whole sequence?
Thank you for your help.
EDIT 1:
Current output = ABCDCECECCECECE ...
Desired output = ABCBDBEBSOSO ...
ABCBDBDBDBDD ...
continued until " or new line
Edit 2
Text file looks like this:
Name (keyword):
Date
Address1
Address2
Sex
Response"................................"
Y/N
The sequence between the " and " is what I need
TL;DR - How do I print from line + 5 to end when end = keyword
Not sure if I understand your sequence data but if you're searching for each 'keyword' then the next " char then the following should work:
keyword_pos = []
endseq_pos = []
for line in range(len(cache)):
    if 'keyword1' in cache[line].lower():
        keyword_pos.append(line)
    if '"' in cache[line]:
        endseq_pos.append(line)
for key in keyword_pos:
    for endseq in endseq_pos:
        if endseq > key:
            print(cache[key:endseq])
            break
This simply compiles a list of all the positions of all the keywords and " characters and then matches the two and prints all the lines in between.
Hope that helps.
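An alternative that follows the question more literally (start 5 lines below the keyword, stop at the line that ends with a closing quote) could look like the sketch below. It is written for Python 3; the function name `extract_sequences`, the default keyword, and the sample lines are assumptions based on the layout shown in the question.

```python
def extract_sequences(lines, keyword="keyword1", offset=5):
    """For each keyword line, join the lines from keyword+offset up to the closing quote."""
    sequences = []
    for i, line in enumerate(lines):
        # keyword must be lowercase because the line is lowercased first
        if keyword not in line.lower():
            continue
        block = []
        for text in lines[i + offset:]:
            text = text.rstrip("\n")
            block.append(text)
            # Stop at the line that ends with the closing quote
            if text.rstrip().endswith('"'):
                break
        if block:
            sequences.append("".join(block))
    return sequences

sample = [
    "Name (keyword1):\n", "Date\n", "Address1\n", "Address2\n", "Sex\n",
    'Response"ABCB\n', 'DBEB"\n', "Y/N\n",
]
print(extract_sequences(sample))
```

If a sequence never reaches a closing quote, this version collects everything to the end of the file, so a real script may want an extra stop condition.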
I agree with @Michal Frystacky that regex is the way forward. However, as I now understand the problem, we need two searches: one for the 'keyword', then another 5 lines on to find the 'sequence'.
This should work but may need the regex to be tweaked:
import re
with open('yourfile.txt') as f:
    lines = f.readlines()
for i,line in enumerate(lines):
    #first search for keyword
    key_match = re.search(r'\((keyword)',line)
    if key_match:
        #if successful search 5 lines on for the string between the quotation marks
        seq_match = re.search(r'"([A-Z]*)"',lines[i+5])
        if seq_match:
            print(key_match.group(1) +' '+ seq_match.group(1))
This can be done rather simply with regex
import re
lines = 'Name (keyword):','Date','Address1','Address2','Sex','Response"................................" '
for line in lines:
    # Capture the text between the first pair of double quotes
    match = re.search(r'"(.*?)"', line)
    if match:
        print(match.group(1))
To use this sample on the real dataset, read the lines with lines = f.readlines(). It's important to note that this only catches text between one " and the next "; if the closing " is missing, the data is skipped, but accounting for that isn't too difficult.
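One way to account for a missing closing quote is to capture everything after the first quote up to the next quote or the end of the line. This is a sketch in Python 3; the helper name `quoted_text` is made up for illustration.

```python
import re

# Capture the run of non-quote characters that follows the first double quote.
QUOTED_RE = re.compile(r'"([^"]*)')

def quoted_text(line):
    """Return the text after the first quote, stopping at the next quote if there is one."""
    m = QUOTED_RE.search(line)
    if m is None:
        return None
    # strip() drops a trailing newline when the closing quote is missing
    return m.group(1).strip()

print(quoted_text('Response"ABCD"'))
print(quoted_text('Response"ABCD\n'))
```

Lines without any quote return `None`, so they can be filtered out by the caller.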
Is there a way to detect the new line character after I've read from a file and stored the results into a string? Here is the code:
with open("text.txt") as file:
    content_string = file.read()
    file.close()
re.search("\n", content_string)
The content_string looks like this:
Hello world!
Hello WORLD!!!!!
I want to extract the new line character after the first line "Hello world!". Does this character even exist at that point?
As per Jongware comment, the regex search you perform finds the newline. You just need to use that result.
From the re module documentation
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular
expression pattern produces a match, and return a corresponding
MatchObject instance. Return None if no position in the string matches
the pattern; note that this is different from finding a zero-length
match at some point in the string.
In terms of code, checking that translates into:
import re
with open("text.txt") as file:
    # the with block closes the file, so no explicit close() is needed
    content_string = file.read()
m = re.search("\n", content_string)
if m:
    print "Found a newline"
else:
    print "No newline found"
Your file might also contain "\r" rather than "\n" as its line ending: the two usually look the same when printed, but the regex above would not match. In that case, try replacing this line in the code:
m = re.search("\n", content_string)
with:
m = re.search("[\r\n]", content_string)
which will look for either.
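To use the match rather than only test for it, the match object gives both the position and the exact characters found. The sketch below is written for Python 3, and the function name `first_newline` is an assumption for illustration. Note that in Python 3, text-mode `open()` translates `\r\n` and `\r` to `\n` by default, so the original line ending is only visible if the file is opened with `newline=''` or in binary mode.

```python
import re

def first_newline(text):
    """Return (index, newline) for the first line ending in text, or None."""
    # Try \r\n first so a Windows line ending is reported as one unit
    m = re.search(r"\r\n|\r|\n", text)
    if m is None:
        return None
    return m.start(), m.group(0)

print(first_newline("Hello world!\nHello WORLD!!!!!"))
```

For the sample text the newline sits at index 12, directly after "Hello world!".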
Is there a way to detect the new line character after I've read from a
file and stored the results into a string?
If I understand you correctly, you want to concatenate multiple lines into one string.
Input:
Hello world!
Hello WORLD!!!!!
test.py:
result = []
with open("text.txt", "rb") as inputs:
    for line in inputs:
        result.append(line.strip()) # strip() removes the newline character
print " ".join(result)
output:
Hello world! Hello WORLD!!!!!
How about if I have more lines, and I want to detect the first
newline? For some reason, in my text it won't detect it.
with open("text.txt") as f:
    first_line = f.readline()
    print(first_line)