Python regex is taking more time - python

I want to get the complete line after the first occurrence of the delimiter character. I am using the regex below, but it is taking a very long time to parse:
import re
str = "abc:dad:kdl--sa:dajs: idsa:kd"
mypat = re.compile(r'.+?:(.+)')
result = mypat.search(str)
line = result.group(1)
line = line.replace("\r", "").replace("\n", "")
print (line)

"I want to get complete line after first occurrence of the delimiter char"
first_occurence = line.index(DELIMITER)
line_after = line[first_occurence+1:]
Note: index() will raise a ValueError if the DELIMITER is not present.
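For example, a minimal sketch applying this to the question's string, with the ValueError case handled (DELIMITER here stands in for the ':' used in the example):
s = "abc:dad:kdl--sa:dajs: idsa:kd"
DELIMITER = ":"

try:
    # Everything after the first occurrence of the delimiter
    line_after = s[s.index(DELIMITER) + 1:]
except ValueError:
    # Delimiter not present
    line_after = ""

print(line_after)  # dad:kdl--sa:dajs: idsa:kd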

Related

regex Extract word and end with space in string

I am trying to filter and extract one word from a line.
The pattern is: GR.C.24, GRCACH, GRALLDKD, GR_3AD, etc.
Input will be: the data is GRCACH got from server.
Output: GRCACH
Problem: the pattern starts with GR<can be anything> and should end when whitespace is encountered.
I am able to find the pattern but not able to stop when a space is encountered.
The code is:
import re

fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()

for da in fp_data:
    match = re.search(r"\sGR.*", da)
    print da
    if match:
        print dir(match)
        print match.group()
Output: GRCACH got from server
Expected: GRCACH (or any word starting with GR)
Use:
(?:\s|^)(GR\S*)
(?:\s|^) matches whitespace or the start of the string
(GR\S*) matches GR followed by 0 or more non-whitespace characters and captures the match in group 1
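For instance, a quick check against the sample input from the question (a small sketch):
import re

line = "the data is GRCACH got from server."
match = re.search(r"(?:\s|^)(GR\S*)", line)
if match:
    print(match.group(1))  # GRCACH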
No need to read the entire file into memory (what if the file were very large?). You can iterate over the file line by line.
import re

with open("output", "r") as fp:
    for line in fp:
        matches = re.findall(r"(?:\s|^)(GR\S*)", line)
        print(line, matches)
The readlines() method leaves the trailing newline character "\n" on each line, so I used a list comprehension with rstrip() to remove it, and isspace() to skip empty lines.
import re

fp_data = []
with open("output", "r") as fp:
    fp_data = [line.rstrip() for line in fp if not line.isspace()]

for line in fp_data:
    match = re.search(r"\sGR.*", line)
    print(line)
    if match:
        print(match)
        print(match.group())
Not sure if I understood your question and your edit after my comment about the desired output correctly, but assuming that you want to list all occurrences of words that start with GR, here is a suggestion:
import re

fp_data = []
with open("output", "r") as fp:
    fp_data = fp.readlines()

for da in fp_data:
    print da
    match = re.findall(r'\b(GR\S*)\b', da)
    if match:
        print match
Using word boundaries (\b) has the benefit of matching at the beginning and the end of the line as well.
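A short demonstration of that boundary behaviour (a sketch; the sample strings are adapted from the question):
import re

# \b matches at the start and end of the string as well as between words
print(re.findall(r'\b(GR\S*)\b', 'GRCACH got from server'))  # ['GRCACH']
print(re.findall(r'\b(GR\S*)\b', 'the data is GR_3AD'))      # ['GR_3AD']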

Parsing Successfully Until IndexError?

I have a script that parses out the first upper-case word on each line of this file:
IMPORT fs
IF fs.exists("fs.pyra") THEN
PRINT "fs.pyra Exists!"
END
The script looks like this:
file = open(sys.argv[1], "r")
file = file.read().split("\n")

while '' in file:
    findIt = file.index('')
    file.pop(findIt)

for line in file:
    func = ""
    index = 0
    while line[index] == " ":
        index = index + 1
    while not line[index] == " " or "=" and line[index].isupper():
        func = func + line[index]
        index = index + 1
    print func
All used modules are already imported.
I passed the path of the file being parsed as an argument, and I'm getting this output:
IMPORT
IF
PRINT
Traceback (most recent call last):
File "src/source.py", line 20, in <module>
while not line[index] == " " or "=" and line[index].isupper():
IndexError: string index out of range
Which means it parses successfully until the last line in the list and then fails on it. How do I fix this?
You don't need to increment the index on spaces - line.strip() will remove leading and trailing spaces.
You could split() the line on spaces to get words.
Then you can iterate over those strings and use isupper() to check whole words, rather than individual characters
Alternatively, run the whole file through a pattern matcher for [A-Z]+
Anyways, your error...
while not line[index] == " " or "=" and line[index].isupper():
The "=" in or "=" is a non-empty string and therefore always truthy, so the condition doesn't behave the way you intended and your index runs past the end of the line.
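Putting the suggestions above together, a minimal sketch (my own wording of the approach, assuming the file path is passed in sys.argv[1] as in the question):
import sys

with open(sys.argv[1]) as fh:
    for line in fh:
        words = line.split()  # strips surrounding whitespace and splits on runs of whitespace
        if words and words[0].isupper():
            print(words[0])   # first word of the line, only if it is upper-case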
If the file you're trying to process is compatible with Python's built-in tokenizer, you can use that (it will also handle anything inside quotes), then take the very first NAME token in capitals from each line, e.g.:
import sys
from itertools import groupby
from tokenize import generate_tokens, NAME

with open(sys.argv[1]) as fin:
    # Tokenize and group the tokens by their source line
    grouped = groupby(generate_tokens(fin.readline), lambda L: L[4])
    # Go over the lines
    for k, g in grouped:
        try:
            # Get the first capitalised name
            print next(t[1] for t in g if t[0] == NAME and t[1].isupper())
        except StopIteration:
            # Couldn't find one - so no panic - move on
            pass
This gives you:
IMPORT
IF
PRINT
END

Search and replace a word within a line, and print the result in next line

I have a script that contains this line in multiples:
Wait(0.000005);
The objective is make a search/replace function to convert into this format:
//Wait(0.000005);
Wait(5);
The first part, commenting out the "Wait(0.000005);" line, is done.
I'm having difficulty with producing the replacement from the first line.
Could someone please enlighten me? Thank you.
Here's a one-liner to do the substitution at the command line using sed:
sed 's/Wait(0.000005);/ \/\/Wait(0.000005);\'$'\nWait(50);/' filename.py
import re

regex = re.compile(r"^.*Wait\(0\.000005\);.*$", re.IGNORECASE)
for line in some_file:
    line = regex.sub("Wait(50);", line)
This might be a possible solution
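A variation that also keeps the commented-out original line, using a replacement callback (a sketch under the assumption that the input lines look exactly like the question's example; the factor of 1e7 matches the Wait(50) conversion used in this thread):
import re

pattern = re.compile(r"Wait\((\d+\.\d+)\);")

def expand(match):
    seconds = float(match.group(1))
    # Comment out the original call and append the converted value on the next line
    return "//%s\nWait(%d);" % (match.group(0), int(round(seconds * 1e7)))

print(pattern.sub(expand, "Wait(0.000005);"))
# //Wait(0.000005);
# Wait(50);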
Regular expressions can come in very handy here:
import re

with open("example.txt") as infile:
    data = infile.readlines()

pattern = re.compile(r"Wait\(([0-9]+\.[0-9]*)\)")

for line in data:
    matches = pattern.match(line)
    if matches:
        value = float(matches.group(1))
        newLine = "//" + matches.group(0) + "\n\n"
        newLine += "Wait(%i)" % (value * 10000000)
        print newLine
    else:
        print line
Example input:
Wait(0.000005);
Wait(0.000010);
Wait(0.000005);
Example output:
//Wait(0.000005)
Wait(50)
//Wait(0.000010)
Wait(100)
//Wait(0.000005)
Wait(50)

Avoiding printing space in regex

I have a BLAST output in the default format. I want to parse it and extract only the info I need using regex. However, in the line below
Query= contig1
there is a space between '=' and 'contig1', so my output prints a space in front. How do I avoid this? Below is a piece of my code:
import re

output = open('out.txt','w')
with open('in','r') as f:
    for line in f:
        if re.search('Query=\s', line) != None:
            line = line.strip()
            line = line.rstrip()
            line = line.strip('Query=\s')
            line = line.rstrip('\s/')
            query = line
            print >> output,query
output.close()
Output should look like this,
contig1
You could actually use the returned match to extract the value you want:
for line in f:
    match = re.search(r'Query=\s?(.*)', line)
    if match is not None:
        query = match.groups()[0]
        print >> output,query
What we do here is search for Query= followed (or not) by a space character and capture all of the remaining characters; match.groups()[0] extracts them, because we have only one group in the regular expression.
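For example, on the sample line from the question (a small sketch):
import re

line = "Query= contig1"
match = re.search(r'Query=\s?(.*)', line)
if match is not None:
    print(match.groups()[0])  # contig1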
Also, depending on the nature of the data, you might want to do only simple string prefix matching, as in the following example:
output = open('out.txt','w')
with open('in.txt','r') as f:
    for line in f:
        if line.startswith('Query='):
            query = line.replace('Query=', '').strip()
            print >> output,query
output.close()
In this case you don't need the re module at all.
If you are just looking for lines like tag=value, do you need regex?
tag, value = line.split('=')
if tag == 'Query':
    print value.strip()
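Note that a bare split('=') unpacked into two names will raise a ValueError on lines without '=' or with more than one '='; passing maxsplit=1 and checking for the separator first avoids both cases (a small sketch):
line = "Query= contig1"
if '=' in line:
    tag, value = line.split('=', 1)
    if tag == 'Query':
        print(value.strip())  # contig1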
a = 'Query= conguie'
print "".join(a.split('Query=')).strip()
#output conguie
The comma in a print statement adds a space between the parameters. Change
print output,query
to
print "%s%s"%(output,query)

python chain a list from a tsv file

I have this TSV file containing some paths of links; each link is separated by a ';'.
In the example below we can see that the text in the file is separated into columns,
and I only want to read the last column, which is a path starting with '14th':
6a3701d319fc3754 1297740409 166 14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade NULL
3824310e536af032 1344753412 88 14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade 3
415612e93584d30e 1349298640 138 14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
I want to somehow split the path into a chain like this:
['14th_century', 'Niger', 'Nigeria'....]
How do I read the file and drop the first 3 columns so that I only get the last one?
UPDATE:
I have tried this now:
import re

with open('test.tsv') as f:
    lines = f.readlines()
    for line in lines[22:len(lines)]:
        re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
        e_line = line.split(' ')
        real_line = e_line[0]
        print real_line.split(';')
But the problem is that it's not removing the first 3 columns?
If the separator between the fields is only a single space, and not a series of spaces or a tab, you could do this:
with open('file_name') as f:
    lines = f.readlines()
    for line in lines:
        e_line = line.split(' ')
        real_line = e_line[3]
        print real_line.split(';')
Answer to your updated question.
But the problem is that it not deleting the first 3 columns ?
There are several mistakes.
Your code:
import re

with open('test.tsv') as f:
    lines = f.readlines()
    for line in lines[22:len(lines)]:
        re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
        e_line = line.split(' ')
        real_line = e_line[0]
        print real_line.split(';')
This line does nothing:
re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
because the re.sub function doesn't change your line variable; it returns the replaced string.
So you probably want to do this instead:
line = re.sub(r"^\s+", " ", line, flags = re.MULTILINE)
Also, your regexp ^\s+ only matches a string that starts with whitespace or tabs, because you use ^.
But I think you just want to replace consecutive whitespace or tabs with one space.
So the line above becomes the following (just remove the ^ from the regexp):
line = re.sub(r"\s+", " ", line, flags = re.MULTILINE)
Now each field in line is separated by just one space, so line.split(' ') will work as you want.
Next, e_line[0] returns the first element of e_line, which is the 1st column of the line.
But you want to skip the first 3 columns and get the 4th column. You can do it like this:
e_line = line.split(' ')
real_line = e_line[3]
OK. Now the entire code looks like this:
for line in lines:  # I also changed this line; there is no need to skip the first 22 lines in your example
    line = re.sub(r"\s+", " ", line)
    e_line = line.split(' ')
    real_line = e_line[3]
    print real_line
output:
14th_century;15th_century;16th_century;Pacific_Ocean;Atlantic_Ocean;Accra;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Europe;Africa;Atlantic_slave_trade;African_slave_trade
14th_century;Niger;Nigeria;British_Empire;Slavery;Africa;Atlantic_slave_trade;African_slave_trade
P.S:
This line can be written more pythonically.
before:
for line in lines[22:len(lines)]:
after:
for line in lines[22:]:
And you don't need to use flags = re.MULTILINE, because line is a single line inside the for loop.
You don't need to use regex for this. The csv module can handle tab-separated files too:
import csv
filereader = csv.reader(open('test.tsv', 'rb'), delimiter='\t')
path_list = [row[3].split(';') for row in filereader]
print(path_list)
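In Python 3 the same idea works with a context manager and newline='' (a sketch, assuming the file really is tab-delimited as in the question):
import csv

with open('test.tsv', newline='') as fh:
    reader = csv.reader(fh, delimiter='\t')
    # Keep only rows that actually have a 4th column
    path_list = [row[3].split(';') for row in reader if len(row) > 3]

print(path_list[0][:3])  # ['14th_century', '15th_century', '16th_century']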
