Need to create a function with two params, a filename to open and a pattern.
The pattern will be a search string.
E.g., the function will open sentence.txt, which has something like "The quick brown fox" (can possibly be more than one line).
The pattern will be "brown fox".
So if the pattern is found, as it will be here, the function should return the line number and the index of the character the found string starts on; else, return -1.
Catch is I've never programmed in python before so I don't know the syntax.
Previously coded in C, C#, Java, VB, etc.
EDIT:
.....Id
.....Name
#
My intent was for you to write the HW3 code as an iteration or nested iterations that explicitly index the character string as an array; i.e., the Python index(), also known as the string.index() function, is not allowed for this homework.
#
filename = raw_input('Enter filename: ')
pattern = raw_input('Enter pattern: ')

def findPattern(fname, pat):
    # Reading in one whole chunk
    filetext = open(fname).read()
    if pat in filetext:
        print("Found it -- chunk")
    else:
        print("Nothing -- chunk")

    # Reading in line by line
    for search in open(fname):
        if pat in search:
            print("Found it -- line")
        else:
            print("Nothing -- line")

findPattern(filename, pattern)
You can simulate a simple "grep" with the "in" operator:
def grep(filename, pattern):
    for n, line in enumerate(open(filename)):
        if pattern in line:
            print line, n
To get the index, you can use str.index() or str.find().
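Since the EDIT above rules out index()/string.index() for the homework, here is a minimal sketch of the explicit nested iteration the assignment asks for (the function name find_pattern_manual and the 1-based line numbering are my own choices, not from the original post):

def find_pattern_manual(fname, pat):
    # Scan each line character by character, indexing the string as an array
    # (no str.index()/str.find(), per the homework constraint).
    for line_no, line in enumerate(open(fname), start=1):
        for start in range(len(line) - len(pat) + 1):
            k = 0
            while k < len(pat) and line[start + k] == pat[k]:
                k += 1
            if k == len(pat):
                return line_no, start  # line number and character index of the match
    return -1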
Here's a very simple grep. You could hack it up to use regular expressions pretty trivially, and globbing wouldn't be much more difficult with glob. Also, the code you want is in there, spread between grep and main, so that might be of more interest than a custom grep ;)
def grep(filename, needle):
    with open(filename) as f_in:
        matches = ((i, line.find(needle), line) for i, line in enumerate(f_in))
        # keep only lines where find() succeeded (find returns -1 on failure)
        return [match for match in matches if match[1] != -1]

def main(filename, needle):
    matches = grep(filename, needle)
    if matches:
        print "{0} found on {1} lines in {2}".format(needle, len(matches), filename)
        for line in matches:
            print "{0}:{1}:{2}".format(*line)
        return 1
    else:
        return -1

if __name__ == '__main__':
    import sys
    filename = sys.argv[1]
    needle = sys.argv[2]
    sys.exit(main(filename, needle))
Note that I haven't tested this code, so there might be slight bugs. If it runs at all, it should work fine though.
Also, you should tell your teacher that signalling failure with return codes is a terrible way to do things. If the caller of the function you're going to write needs to know whether any matches were found, it can just check for an empty list.
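For instance, a sketch against the grep() above (the filename is made up):

matches = grep('sentence.txt', 'brown fox')
found = bool(matches)  # an empty list already signals "not found"; no -1 sentinel needed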
I have been stuck at this point for quite a while and hope to get some tips.
The problem can be simplified to finding the largest number of consecutive occurrences of a pattern in a string. For a pattern AATG and a string like ATAATGAATGAATGGAATG, the right result should be 3. I tried to count the occurrences of the pattern by using re.compile(). I have found out from the docs that if I want to find consecutive occurrences of a pattern I probably have to use the special character +. For instance, for a pattern like AATG I have to use re.compile(r'(AATG)+') instead of re.compile(r'AATG'); otherwise, the occurrences will be overcounted. However, in this program the pattern is not a fixed string; I have to treat it as a variable. I have tried many ways to put it into re.compile() without positive results. Could anyone enlighten me on the correct way to format it (which is in the function countSTR below)?
After that, I think finditer(the_string_to_be_analyzed) should return an iterator including all matches found. Then I used match.end() - match.start() to obtain the length of every match and compared them with each other in order to get the longest consecutive occurrence of the pattern. Maybe something goes wrong there?
Code attached. Every input would be appreciated!
from sys import argv, exit
import csv
import re


def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()

    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)

    # dictionary to store STRs match result of DNA-sequence
    STR_counter = {}
    for STR in headings[1:]:
        # enter the result keyed by the STR
        STR_counter[STR] = countSTR(STR, sequence)

    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")


# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeated STRs, e.g. "(AATG)+", as the pattern
    # passed into re.compile(), rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be an iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount / len(STR)


main()
Just found out a correct way to get the desired result.
Posting it here in case any others are also confused.
From what I have understood, to match a variable named var_pattern you can use re.compile(rf'{var_pattern}'). Then, if consecutive occurrences of var_pattern should be searched, you can use re.compile(rf'({var_pattern})+'). There may be other smarter ways to implement that; however, I managed to get it to work as well as before.
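To make that concrete, here is a minimal sketch of the corrected approach (the function name longest_consecutive_run is mine, not from the original post), using the question's own example:

import re

def longest_consecutive_run(pattern, sequence):
    # The f-string interpolates the variable, so the compiled regex matches
    # one or more back-to-back copies of the pattern, e.g. "(AATG)+".
    repeat = re.compile(rf'({pattern})+')
    best = 0
    for match in repeat.finditer(sequence):
        # length of the whole run divided by the unit length = repeat count
        best = max(best, (match.end() - match.start()) // len(pattern))
    return best

print(longest_consecutive_run('AATG', 'ATAATGAATGAATGGAATG'))  # prints 3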
I want to replace multiple patterns in a file with regex.
This is my (working) code so far:
import re

with open('test.txt', "r") as fp:
    text = fp.read()

result = re.sub(r'pattern', 'replacement', str)
result2 = re.sub(r'anotherpattern', 'anotherreplacement2', result)
...
with open('results.txt', 'w') as fp:
    fp.write(result_x)
This works, but it seems inelegant to invent new variable names manually for every new line. How can I do this better? It must be a for loop, I think. But how?
You do not need the previous result once you used it. You can store the new result in the same variable:
text = re.sub(r'pattern1', 'replacement1', text) # str() is a string constructor!
text = re.sub(r'pattern2', 'replacement2', text)
You can also have a list of patterns and replacements and loop through it:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pattern, replacement in to_replace:
    text = re.sub(pattern, replacement, text)
Or in an even more Pythonic way:
to_replace = [('pattern1', 'replacement1'), ('pattern2', 'replacement2')]
for pr in to_replace:
    text = re.sub(*pr, string=text)
I don't know Python too well, but I think if you want to combine the patterns, you could do it in a single pass using a callback. Example:
def repl(m):
    # groups that did not participate in the match are None, not ''
    if m.group(1) is not None:
        return sr1
    if m.group(2) is not None:
        return sr2
    if m.group(3) is not None:
        return sr3
    return m.group(0)

print re.sub('(stuff1)|(stuff2)|(stuff3)', repl, text)
The replacement could also be looped inside the callback: for instance, a variable holding the fixed number of patterns, looped over to test the match object. There must be a replacement array of the same size (and in the same order) as the groups in the regex, as sketched below.
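A minimal sketch of that looped variant (the names patterns, replacements, and the sample text are mine, for illustration):

import re

patterns = ['stuff1', 'stuff2', 'stuff3']
replacements = ['sr1', 'sr2', 'sr3']

# one capturing group per pattern, all joined into a single alternation
combined = re.compile('|'.join('({})'.format(p) for p in patterns))

def repl(m):
    # the group that participated tells us which alternative matched;
    # its position selects the corresponding replacement
    for i in range(1, len(patterns) + 1):
        if m.group(i) is not None:
            return replacements[i - 1]
    return m.group(0)

print(combined.sub(repl, 'stuff1 and stuff3'))  # -> sr1 and sr3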
How much of a performance increase will this give you? Doing it in a single pass means the text is scanned once instead of once per pattern, so the gain grows with the number of patterns.
Note that it is almost always a mistake to re-examine the same text over and over again. Imagine searching the Library of Congress one word at a time, starting from the beginning each time. How long would that take?
I want to write a Python UDF for Pig that reads lines from a file like
#'prefix.csv'
spol.
LLC
Oy
OOD
and matches the names; if it finds any matches, it replaces them with whitespace. Here is my Python code:
def list_files2(name, f):
    fin = open(f, 'r')
    for line in fin:
        final = name
        extra = 'nothing'
        if (name != name.replace(line.strip(), ' ')):
            extra = line.strip()
            final = name.replace(line.strip(), ' ').strip()
            return final, extra, 'insdie if'
    return final, extra, 'inside for'
Running this code in Python,
>print list_files2('LLC nakisa', 'prefix.csv' )
>print list_files2('AG company', 'prefix.csv' )
returns
('nakisa', 'LLC', 'insdie if')
('AG company', 'nothing', 'inside for')
which is exactly what I need. But when I register this code as a UDF in Apache Pig and run it on this sample list:
nakisa company LLC
three Oy
AG Lans
Test OOD
Pig returns a wrong answer on the third line:
((nakisa company,LLC,insdie if))
((three,Oy,insdie if))
((A G L a n s,,insdie if))
((Test,OOD,insdie if))
The question is why the UDF enters the if branch for the third entry, which does not have any match in the prefix.csv file.
I don't know Pig, but the way you are checking for a match is strange and might be the cause of your problem.
If you want to check whether a string is a substring of another, Python provides the find method on strings:
if name.find(line.strip()) != -1:
    # find will return the first index of the substring, or -1 if it was not found
    # ... do some stuff
Additionally, your code might leave the file handle open. A much better approach to file operations is to use the with statement. This ensures that the file handle gets closed in any case (short of an interpreter crash).
with open(filename, "r") as file_:
    # Everything within this block can use the opened file.
Last but not least, Python provides a module called csv with a reader and a writer that handle the parsing of the CSV file format.
Thus, you could try the following code and check whether it returns the correct thing:
import csv

def list_files2(name, filename):
    with open(filename, 'rb') as file_:
        final = name
        extra = "nothing"
        for row in csv.reader(file_):
            if not row:
                continue  # skip blank lines; csv.reader yields an empty list for them
            prefix = row[0].strip()
            if name.find(prefix) != -1:
                extra = prefix
                final = name.replace(prefix, " ")
                return final, extra, "inside if"
        return final, extra, "inside for"
Because your file is named prefix.csv, I assume you want to do prefix substitution. In this case, you could use startswith instead of find for the check, and replace the line final = name.replace(prefix, " ") with final = " " + name[len(prefix):]. This ensures that only a leading prefix is substituted with the space.
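A minimal sketch of that prefix-only variant (the helper name strip_prefix is mine, for illustration):

def strip_prefix(name, prefix):
    # remove the prefix only when the name actually starts with it
    if name.startswith(prefix):
        return name[len(prefix):].strip(), prefix
    return name, 'nothing'

print(strip_prefix('LLC nakisa', 'LLC'))  # -> ('nakisa', 'LLC')
print(strip_prefix('AG company', 'OOD'))  # -> ('AG company', 'nothing')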
I hope this helps.
So right now I'm looking for something in a file. I am getting a value variable, which is a rather long string with newlines and so on. Then I use re.findall(regex, value) to find the regex. The regex is rather simple, something like "abc de.*".
Now, I want to capture not only whatever the regex matches, but also its context (exactly like the -C flag for grep).
So, assuming that I dumped value to a file and ran grep on it, what I'd do is grep -C N 'abc de .*' valueinfile.
How can I achieve the same thing in Python? I need the answer to work with Unicode regex/text.
My approach is to split the text block into a list of lines. Next, iterate through each line and see if there is a match; in case of a match, gather the context lines (the lines that happen before and after the current line) and return them. Here is my code:
import re

def grep(pattern, block, context_lines=0):
    lines = block.splitlines()
    for line_number, line in enumerate(lines):
        if re.match(pattern, line):
            # clamp at 0 so a match near the top doesn't produce a negative slice index
            start = max(line_number - context_lines, 0)
            lines_with_context = lines[start:line_number + context_lines + 1]
            yield '\n'.join(lines_with_context)
# Try it out
text_block = """One
Two
Three
abc defg
four
five
six
abc defoobar
seven
eight
abc de"""

pattern = 'abc de.*'

for line in grep(pattern, text_block, context_lines=2):
    print line
    print '---'
Output:
Two
Three
abc defg
four
five
---
five
six
abc defoobar
seven
eight
---
seven
eight
abc de
---
As recommended by Ignacio Vazquez-Abrams, use a deque to store the last n lines. Once that many lines are present, popleft for each new line added. When your regular expression finds a match, return the previous n lines in the stack, then iterate n more lines and return those also.
This keeps you from having to iterate over any line twice (DRY) and stores only minimal data in memory. You also mentioned the need for Unicode, so handling the file encoding and adding the Unicode flag to regex searches is important. Also, the other answer uses re.match() instead of re.search(), and as such may have unintended consequences (re.match() only matches at the beginning of a line).
Below is an example. It iterates over every line in the file only ONCE, which means context lines that also contain hits don't get looked at again. This may or may not be desirable behavior, but it can easily be tweaked to highlight or otherwise flag lines with additional hits within the context of a previous hit.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import re
from collections import deque


def grep(pattern, input_file, context=0, case_sensitivity=True, file_encoding='utf-8'):
    stack = deque()
    hits = []
    lines_remaining = None
    with codecs.open(input_file, mode='rb', encoding=file_encoding) as f:
        for line in f:
            # append next line to stack
            stack.append(line)
            # keep adding context after a hit is found
            # (without popping off previous lines of context)
            if lines_remaining is not None:
                lines_remaining -= 1
                if lines_remaining > 0:
                    continue  # go to next line in file
                hits.append(stack)
                lines_remaining = None
                stack = deque()
                continue
            # if stack exceeds needed context, pop leftmost line off stack
            # (but keep current line with possible search hit)
            if len(stack) > context + 1:
                stack.popleft()
            # search line for pattern
            if case_sensitivity:
                search_object = re.search(pattern, line, re.UNICODE)
            else:
                search_object = re.search(pattern, line, re.IGNORECASE | re.UNICODE)
            if search_object:
                if context == 0:
                    hits.append(stack)
                    stack = deque()
                else:
                    lines_remaining = context
    # in case there are not enough lines left in the file for full trailing context
    if lines_remaining is not None and stack:
        hits.append(stack)
    # return list of deques containing hits with context
    return hits  # you'll probably want to format the output, this is just an example
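Hypothetical usage (the file name value.txt is made up for the example):

for hit in grep('abc de', 'value.txt', context=2, case_sensitivity=False):
    print(''.join(hit))  # each hit is a deque of the original lines
    print('---')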
I've really been struggling with this one for some time now. I have many text files with a specific format from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (tmp_PA, tmp_K, tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word, and I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
You are splitting the whole line into words. You need to split it into the first word, the second word, and the rest, like line.split(None, 2).
I would probably use regular expressions, and the opposite logic: if the line starts with a number 1 through 5, use it; otherwise pass. Like:
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r')  # no calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statements. Parsing SQL is orders of magnitude slower than executing it.
If I understand your file format correctly, you can try this script:
filename = 'bug.txt'
f = file(filename, 'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        tokens = line.split()
        if tokens == ['K', 'PA', 'DETAILS']:
            foundHeaders = True
        continue
    tokens = line.split(None, 2)
    if len(tokens) != 3:
        break
    try:
        K = int(tokens[0])
        PA = int(tokens[1])
    except ValueError:
        break
    records.append((K, PA, tokens[2]))
f.close()
for r in records:
    print r  # replace this by your DB insertion code
This will start reading the records when it encounters the header line and stop as soon as the format of a line is no longer (K, PA, description).
Hope this helps.
Here is my attempt using re:
import re

stuff = open("source", "r").readlines()

whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print result.group('third')
        print result.group('second')
        print result.group('first')
Using re, I can pull the data out and manipulate it on a whim. If you only need the juicy-info lines, you can actually take out all the other checks, making this a REALLY concise script:
import re

stuff = open("source", "r").readlines()

# create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")

for line in stuff:
    result = juicy_info.search(line)
    if result:
        # do stuff with the data here, using the tags we declared earlier
        print result.group('third')
        print result.group('second')
        print result.group('first')
import re

reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')

with open('XX.txt') as f:
    mat = reg.search(f.read())

for tripl in ((2, 1, 3), (5, 4, 6), (8, 7, 9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters: ' ', '\f', '\n', '\r', '\t', '\v',
and I don't see any reason to use a symbol representing more than what needs to be matched, at the risk of matching stray newlines in places where they shouldn't be.
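A quick demonstration of the difference (the sample strings are mine):

import re

# '\s+' happily crosses the newline; '[ \t]+' stays on one line
print(re.findall(r'a\s+b', 'a\nb'))     # ['a\nb']
print(re.findall(r'a[ \t]+b', 'a\nb'))  # []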
Edit
It may be sufficient to do:
import re

reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$', re.MULTILINE)

with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2, 1, 3))