Python Find Method in Strings Seems To Be Failing

I have a text file like so: http://pastie.org/10309944
This contains numbers corresponding to lists of EDI segments that could possibly be associated with them. My goal is to write a script that takes in one of these codes (the numbers) as input and outputs the corresponding lists. The lists are surrounded by "-" characters to make the parsing easier.
I wrote the following code:
class SegmentsUsedFinder(object):
    '''Finds a transaction code and returns the possible segments used.
    '''
    def __init__(self, transaction_code):
        '''Initializes the segment finder.

        Args:
            transaction_code: The transaction code to find possible segments from.
        '''
        self._transaction_code = transaction_code + " -"

    def find_segment(self):
        '''Finds the possible segments that correspond to the
        transaction code.
        '''
        fileObject = open("transactioncodes.txt", 'r')
        data = ""
        for line in fileObject:
            line = line.rstrip('\n').rstrip()
            data += line
        fileObject.close()
        position = data.find(self._transaction_code) + len(self._transaction_code)
        with open("transactioncodes.txt", 'r') as file:
            file.seek(position)
            segments = ""
            char = ""
            while True:
                char = file.read(1)
                if char == "-":
                    break
                segments += char
            return segments
I then create a finder object like so:
finder = SegmentsUsedFinder("270")
print finder.find_segment()
This code actually works, but when I change the string passed to the SegmentsUsedFinder constructor to 271 or 837, it fails for some reason. I think I'm perhaps misusing the find method, but it works for the first instance. I can also get it to work for 271 if I add 2 to position, and for 837 if I add 4 to position.
Any help would be greatly appreciated, thanks.

The problem is that position is computed on data, a copy of the file with the newlines and trailing whitespace stripped out, so it is smaller than the real file offset by the number of characters stripped before the match; that is where the extra 2 and 4 characters you had to add for 271 and 837 come from. Seeking in the original file with that offset therefore lands in the wrong place. You don't need to compute offsets at all. Here's how your find_segment method should look:
def find_segment(self):
    '''Finds the possible segments that correspond to the
    transaction code.
    '''
    with open("transactioncodes.txt", 'r') as _file:
        for line in _file:
            if line.startswith(self._transaction_code):
                return line[len(self._transaction_code):line.rfind("-")]
    return ""
Of course it can be improved (the file name could be made a private member of the class), but this is a prototype that works (assuming that all the lines follow the pattern ID -LIST-).
Note: I also renamed the variable to _file because the name file shadows the built-in file type.
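As a sketch of that improvement, with the file name passed to the constructor and stored as a private member (the default value here is just illustrative):
import os

class SegmentsUsedFinder(object):
    '''Finds a transaction code and returns the possible segments used.'''

    def __init__(self, transaction_code, file_name="transactioncodes.txt"):
        self._transaction_code = transaction_code + " -"
        self._file_name = file_name  # private member instead of a hard-coded literal

    def find_segment(self):
        '''Finds the possible segments that correspond to the transaction code.'''
        with open(self._file_name, 'r') as _file:
            for line in _file:
                if line.startswith(self._transaction_code):
                    return line[len(self._transaction_code):line.rfind("-")]
        return ""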

Related

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more word characters (including _), optionally followed by a . and then some more word characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running Python version 3.6.0a1, the output is this:
self 226
def 173
return 170
self.data 129
if 102
This makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that the pattern captures self.data, which fits the construct you are interested in.
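If you want the number-literal filtering mentioned above, here is a minimal sketch building on the tokens counter from the snippet (the helper name is just illustrative):
def is_number(token):
    """True for tokens such as 42 or 7.6 that the pattern also matches."""
    try:
        float(token)
        return True
    except ValueError:
        return False

filtered = collections.Counter({t: c for t, c in tokens.items() if not is_number(t)})
for token, count in filtered.most_common(5):
    print(token, count)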

Parsing GenBank to FASTA with yield in Python (x, y)

For now I have tried to define and document my own function to do it, but I am encountering issues with testing the code and I have actually no idea if it is correct. I found some solutions with BioPython, re or other, but I really want to make this work with yield.
# generator for GenBank to FASTA
def parse_GB_to_FASTA(lines):
    # set default label
    curr_label = None
    # set default sequence
    curr_seq = ""
    for line in lines:
        # if the line starts with ACCESSION this should be saved as the beginning of the label
        if line.startswith('ACCESSION'):
            # if the label has already been changed
            if curr_label is not None:
                # output the label and sequence
                yield curr_label, curr_seq
            ''' if the label starts with ACCESSION, immediately replace the current label with
            the next ACCESSION number and continue with the next check'''
            # strip the first column and leave the number
            curr_label = '>' + line.strip()[12:]
        # check for the organism column
        elif line.startswith(' ORGANISM'):
            # add the organism name to the label line
            curr_label = curr_label + " " + line.strip()[12:]
        # check if the region of the sequence starts
        elif line.startswith('ORIGIN'):
            # until the end of the sequence is reached
            while line.startswith('//') is False:
                # get a line without spaces and numbers
                curr_seq += line.upper().strip()[12:].translate(None, '1234567890 ')
    # if no more lines, then give the last label and sequence
    yield curr_label, curr_seq
I often work with very large GenBank files and found (years ago) that the BioPython parsers were too brittle to make it through hundreds of thousands of records (at the time) without crashing on an unusual record.
I wrote a pure Python (2) function to return the next whole record from an open file, reading in 1k chunks and leaving the file pointer ready to get the next record. I tied this in with a simple iterator that uses this function, and a GenBank Record class which has a fasta(self) method to get a FASTA version.
YMMV, but the function that gets the next record is here and should be pluggable into any iterator scheme you want to use. As far as converting to FASTA goes, you can use logic similar to your ACCESSION and ORIGIN grabbing above, or you can get the text of sections (like ORIGIN) using:
sectionTitle = 'ORIGIN'
searchRslt = re.search(r'^(%s.+?)^\S' % sectionTitle,
                       gbrText, re.MULTILINE | re.DOTALL)
sectionText = searchRslt.groups()[0]
Subsections like ORGANISM require a left-side pad of 5 spaces.
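A rough sketch of that padded-subsection case (the pad width and function name are assumptions; gbrText is the text of one record, as above):
import re

def get_subsection(gbrText, subsectionTitle, pad=5):
    # Capture from a line that starts with `pad` spaces plus the title,
    # up to the next line that is indented by `pad` spaces or less.
    pattern = r'^( {%d}%s.+?)^ {0,%d}\S' % (pad, subsectionTitle, pad)
    m = re.search(pattern, gbrText, re.MULTILINE | re.DOTALL)
    return m.groups()[0] if m else None

# e.g. get_subsection(gbrText, 'ORGANISM')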
Here's my solution to the main issue:
def getNextRecordFromOpenFile(fHandle):
    """Look in file for the next GenBank record
    return text of the record
    """
    cSize = 1024
    recFound = False
    recChunks = []
    try:
        fHandle.seek(-1, 1)
    except IOError:
        pass
    sPos = fHandle.tell()
    gbr = None
    while True:
        cPos = fHandle.tell()
        c = fHandle.read(cSize)
        if c == '':
            return None
        if not recFound:
            locusPos = c.find('\nLOCUS')
            if sPos == 0 and c.startswith('LOCUS'):
                locusPos = 0
            elif locusPos == -1:
                continue
            if locusPos > 0:
                locusPos += 1
            c = c[locusPos:]
            recFound = True
        else:
            locusPos = 0
        if (len(recChunks) > 0 and
            ((c.startswith('//\n') and recChunks[-1].endswith('\n'))
             or (c.startswith('\n') and recChunks[-1].endswith('\n//'))
             or (c.startswith('/\n') and recChunks[-1].endswith('\n/'))
             )):
            eorPos = 0
        else:
            eorPos = c.find('\n//\n', locusPos)
        if eorPos == -1:
            recChunks.append(c)
        else:
            recChunks.append(c[:(eorPos + 4)])
            gbrText = ''.join(recChunks)
            fHandle.seek(cPos - locusPos + eorPos)
            return gbrText
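A minimal sketch of the iterator scheme mentioned above (names here are illustrative; it simply calls getNextRecordFromOpenFile until the file is exhausted):
def iterRecords(fHandle):
    """Yield the text of each GenBank record in an open file, one at a time."""
    while True:
        recText = getNextRecordFromOpenFile(fHandle)
        if recText is None:
            break
        yield recText

# usage sketch
# with open('sequences.gb') as fh:
#     for recText in iterRecords(fh):
#         process(recText)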

Creating a dictionary using python and a .txt file

I have downloaded the following dictionary from Project Gutenberg http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB so if you're on a slow connection avoid clicking the link)
In the file the keywords I am looking for are in uppercases for instance HALLUCINATION, then in the dictionary there are some lines dedicated to the pronunciation which are obsolete for me.
What I want to extract is the definition, indicated by "Defn:", and then print those lines. I have come up with this rather ugly 'solution':
def lookup(search):
    find = search.upper() # transforms our search parameter all upper letters
    output = [] # empty dummy list
    infile = open('webster.txt', 'r') # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0): # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print output # uncertain about how to proceed
                        break
Now this of course only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read the lines into a tuple and noticed that there are special newline characters.
So I want to somehow tell Python to keep reading until it runs out of newline characters, I suppose, though that doesn't hold for the last line, which also has to be read.
Could someone please point me to useful functions I could use to solve this problem (a minimal example would be appreciated)?
Example of desired output:
lookup("hallucination")
out: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
lookup("hallucination")
out: The perception of objects which have no reality, or of \r\n
sensations which have no corresponding external cause, arising from \r\n
disorder or the nervous system, as in delirium tremens; delusion.\r\n
Hallucinations are always evidence of cerebral derangement and are\r\n
common phenomena of insanity. W. A. Hammond.
from text:
HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]
Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.
HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]
1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.
2. (Med.)
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]
Here is a function that returns the first definition:
def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    return False

print lookup('hallucination')
Explanation: There are four different cases we have to consider.
We haven't found the word yet. We have to compare the current line to the word we are looking for in uppercases. If they are equal, we found the word.
We have found the word, but haven't found the start of the definition. We therefore have to look for a line that starts with Defn:. If we find it, we add the line to the definition (excluding the six characters for Defn:).
We have already found the start of the definition. In that case, we just add the line to the definition.
We have already found the start of definition and the current line is empty. The definition is complete and we return the definition.
If we found nothing, we return False.
Note: There are certain entries, e.g. CRANE, that have multiple definitions. The above code is not able to handle that. It will just return the first definition. However, it is far from easy to code a perfect solution considering the format of the file.
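A rough sketch of how the same state machine could collect every Defn: block for an entry instead of only the first (untested against the full file, which has many formatting quirks, so treat it as an outline):
def lookup_all(word):
    word_upper = word.upper()
    found_word = False
    in_def = False
    defs = []
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and l.startswith("Defn:"):
                in_def = True
                defs.append(l[6:])
            elif in_def and l != '':
                defs[-1] += ' ' + l
            elif in_def and l == '':
                in_def = False
            elif found_word and l.isupper() and l.isalpha():
                break  # reached the next headword, stop collecting
    return defs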
You can split the text into paragraphs, find the index of the search word, and take the first Defn paragraph after it:
def find_def(f, word):
    import re
    with open(f) as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word)) # find where our search word is
    except ValueError:
        return "Cannot find search term"
    paras = re.split("\s+\r\n", lines[start:], 10) # split into paragraphs using maxsplit=10 as there are no groupings of paras longer than that in the definitions
    for para in paras:
        if para.startswith("Defn:"): # if para starts with Defn: we have what we need
            return para # return the para

print(find_def("in.txt", "HALLUCINATION"))
Using the whole file returns:
In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.
In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.
A slightly shorter version:
def find_def(f, word):
    import re
    with open(f) as f:
        lines = f.read()
    try:
        start = lines.index("{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    defn = lines[start:].index("Defn:")
    return re.split("\s+\r\n", lines[start + defn:], 1)[0]
From here I learned an easy way to deal with memory mapped files and use them as if they were strings. Then you can use something like this to get the first definition for a term.
import mmap

def lookup(search):
    term = search.upper()
    f = open('webster.txt')
    s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    index = s.find('\r\n\r\n' + term + '\r\n')
    if index == -1:
        return None
    definition = s.find('Defn:', index) + len('Defn:') + 1
    endline = s.find('\r\n\r\n', definition)
    return s[definition:endline]

print lookup('hallucination')
print lookup('hallucinate')
Assumptions:
There is at least one definition per term
If there are more than one, only the first is returned

Creating a table which has sentences from a paragraph each on a row with Python

I have an abstract which I've split into sentences in Python. I want to write to 2 tables. One has the following columns: abstract ID (which is the file number that I extracted from my document), sentence ID (automatically generated), and a sentence of the abstract on each row.
I would want a table that looks like this
abstractID SentenceID Sentence
a9001755 0000001 Myxococcus xanthus development is regulated by (1st sentence)
a9001755 0000002 The C signal appears to be the polypeptide product (2nd sentence)
and another table NSFClasses having abstractID and nsfOrg.
How to write sentences (each on a row) to table and assign sentenceId as shown above?
This is my code:
import glob;
import re;
import json
org = "NSF Org";
fileNo = "File";
AbstractString = "Abstract";
abstractFlag = False;
abstractContent = []
path = 'awardsFile/awd_1990_00/*.txt';
files = glob.glob(path);
for name in files:
    fileA = open(name,'r');
    for line in fileA:
        if line.find(fileNo)!= -1:
            file = line[14:]
        if line.find(org) != -1:
            nsfOrg = line[14:].split()
    print file
    print nsfOrg
    fileA = open(name,'r')
    content = fileA.read().split(':')
    abstract = content[len(content)-1]
    abstract = abstract.replace('\n','')
    abstract = abstract.split();
    abstract = ' '.join(abstract)
    sentences = abstract.split('.')
    print sentences
    key = str(len(sentences))
    print "Sentences--- "
As others have pointed out, it's very difficult to follow your code. I think this code will do what you want, based on your expected output and what we can see. I could be way off, though, since we can't see the file you are working with. I'm especially troubled by one part of your code that I can't see enough to refactor, but feels obviously wrong. It's marked below.
import glob

for filename in glob.glob('awardsFile/awd_1990_00/*.txt'):
    fh = open(filename, 'r')
    abstract = fh.read().split(':')[-1]
    fh.seek(0) # reset file pointer

    # See comments below
    for line in fh:
        if line.find('File') != -1:
            absID = line[14:]
            print absID
        if line.find('NSF Org') != -1:
            print line[14:].split()
    # End see comments

    fh.close()
    concat_abstract = ' '.join(abstract.replace('\n', '').split())
    for s_id, sentence in enumerate(concat_abstract.split('.')):
        # Adjust numeric width arguments to prettify table
        print absID.ljust(15),
        print '{:06d}'.format(s_id).ljust(15),
        print sentence
In that section marked, you are searching for the last occurrence of the strings 'File' and 'NSF Org' in the file (whether you mean to or not because the loop will keep overwriting your variables as long as they occur), then doing something with the 15th character onward of that line. Without seeing the file, it is impossible to say how to do it, but I can tell you there is a better way. It probably involves searching through the whole file as one string (or at least the first part of it if this is in its header) rather than looping over it.
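For example, a hedged sketch of that one-string approach (the header layout assumed here, lines roughly like "File : a9001755" and "NSF Org : ...", may not match your real files, so adjust the patterns):
import re

def parse_header(text):
    """Pull the File and NSF Org values out of the header, read as one string."""
    file_match = re.search(r'^File\s*:\s*(\S+)', text, re.MULTILINE)
    org_match = re.search(r'^NSF Org\s*:\s*(.+)$', text, re.MULTILINE)
    abs_id = file_match.group(1) if file_match else None
    nsf_org = org_match.group(1).split() if org_match else None
    return abs_id, nsf_org

# usage sketch
# with open(filename) as fh:
#     abs_id, nsf_org = parse_header(fh.read())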
Also, notice how I condensed your code. You store a lot of things in variables that you aren't using at all, and collect a lot of cruft that spreads the state around. To understand what line N does, I have to keep glancing ahead at line N+5 and back over lines N-34 to N-17 to inspect variables. This creates a lot of action at a distance, which for the reasons cited is best avoided. In the smaller version, you can see how I substituted string literals in places where they are used only once, and called print immediately instead of storing the results for later. The result is usually more concise and easily understood.

How do I perform binary search on a text file to search a keyword in python?

The text file contains two columns: an index number (5 characters wide) and a string of characters (30 characters wide).
It is arranged in lexicographic order. I want to perform a binary search to find the keyword.
Here's an interesting way to do it with Python's built-in bisect module.
import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)
        searchable_file = FileSearcher(file_to_search)
        print "Located # line: ", bisect.bisect(searchable_file, wrapped_query)
Do you need to do a binary search? If not, try converting your flat file into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:
import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()
In a separate script, run queries against it:
import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345
If you need to find a single keyword in a file:
line_with_keyword = next((line for line in open('file') if keyword in line), None)
if line_with_keyword is not None:
    print line_with_keyword # found
To find multiple keywords you could use set() as @kriegar suggested:
def extract_keyword(line):
    return line[5:35] # assuming the keyword starts at position 6 and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation

if keyword in keywords: # O(1) search
    print(keyword)
You could use dict() above instead of set() to preserve index information.
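For instance, a minimal sketch of that dict() variant, mapping each keyword to its index field (same column-position assumption as above):
with open('file') as f:
    keyword_to_index = dict((extract_keyword(line), line[:5]) for line in f)

if keyword in keyword_to_index:        # O(1) search
    print(keyword_to_index[keyword])   # the 5-character index field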
Here's how you could do a binary search on a text file:
import bisect

lines = open('file').readlines() # O(n) list creation
keywords = map(extract_keyword, lines)
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
    print(lines[i]) # found
There is no advantage compared to the set() variant.
Note: all variants except the first one load the whole file into memory. FileSearcher(), suggested by @Mahmoud Abdelkader, doesn't require loading the whole file into memory.
I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)
Installation: pip install binary_file_search
Example file:
1,one
2,two_a
2,two_b
3,three
Usage:
from binary_file_search.BinaryFileSearch import BinaryFileSearch

with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted
    print(bfs.search(2))
Result: [[2, 'two_a'], [2, 'two_b']]
It is quite possible, with a slight loss of efficiency, to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range and reading forward past the line terminator. Here's what I do to look through a CSV file with 2 header lines for a numeric value in the first field. Give it an open file and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line, at offset zero, will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers and are skipped.
Please excuse my lack of polished python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.
Hope this helps
========================================
import os

# See if the input loc is in file
def look1(f, loc):
    # Compute filesize of open file sent to us
    hi = os.fstat(f.fileno()).st_size
    lo = 0
    lookfor = int(loc)
    # print "looking for: ", lookfor
    while hi - lo > 1:
        # Find midpoint and seek to it
        loc = int((hi + lo) / 2)
        # print " hi = ", hi, " lo = ", lo
        # print "seek to: ", loc
        f.seek(loc)
        # Skip to beginning of line
        while f.read(1) != '\n':
            pass
        # Now skip past lines that are headers
        while 1:
            # read line
            line = f.readline()
            # print "read_line: ", line
            # Crude csv parsing: remove quotes, and split on ,
            row = line.replace('"', "")
            row = row.split(',')
            # Make sure 1st field is numeric
            if row[0].isdigit():
                break
        s = int(row[0])
        if lookfor < s:
            # Split into lower half
            hi = loc
            continue
        if lookfor > s:
            # Split into higher half
            lo = loc
            continue
        return row # Found
    # If not found
    return False
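A small usage sketch, assuming a sorted CSV whose first field is numeric (the file name and lookup value here are purely illustrative):
with open('blocks.csv', 'r') as f:
    row = look1(f, 3512069016)   # value to search for in the first column
    if row:
        print "found:", row
    else:
        print "not found"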
Consider using a set instead of a binary search for finding a keyword in your file.
Set:
O(n) to create, O(1) to find, O(1) to insert/delete
If your input file is separated by a space then:
f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()
my_word in keywords
<returns True or False>
Dictionary:
f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in [line.strip().split(" ") for line in f.readlines()] ] )
f.close()
keywords[my_word]
<returns index of my_word>
Binary Search is:
O(n log n) create, O(log n) lookup
Edit: for your case of 5 characters and 30 characters, you can just use string slicing:
f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()
my_word in keywords
or
f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()
keywords[my_word]
