So I have a file with a few thousand entries of the form (FASTA format, if anyone wants to know):
>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...
And I have a Python dictionary from part of my code that contains information like the following for a number of headers:
{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}
This describes the entry by header and a list of [start position, end position, rest of characters in that entry] (so start=1 means the first character of the lines below the corresponding header): [start, end, left]
What I want to do is extract the string for this interval, plus 25 (or a variable number of) characters in front of and behind it if the entry allows for it; otherwise include all characters up to the beginning/end. (For example, when the start position is 8, I can't include 25 chars in front, only 8.)
And that for every entry in my dict.
It probably doesn't sound too hard, but I am struggling to come up with a clever way to do it.
For now my idea was to read lines from my file, check whether they begin with ">", and look up whether they exist in my dict. Then add up the chars per line until they exceed my start position, and from there somehow get the right part of that line to match my startPos - X.
for line in genomeFile:
    line = line.strip()
    if line[0] == ">":
        header = line
        currentCluster = foundClusters.get(header[1:])
        if currentCluster is not None:
            outputFile.write(header + "\n")
    if currentCluster is not None:
        charCount += len(line)
        # *crazy calculations to find the actual part I want to extract*
I am quite the Python beginner, so maybe someone has a better idea of how to solve this?
-- While typing this I got the idea to use file.read(startPos-X-1) after a line matches a header I am looking for, to read characters up to my desired position, and from there use file.read((endPos+X) - (startPos-X)) to extract the part I am looking for. If this works it seems pretty easy to accomplish what I want.
I'll post this anyway, maybe someone has an even better way or maybe my idea won't work.
Thanks for any input.
EDIT:
Turns out you can't mix for line in file with file.read(x), since the former uses buffering, so... back to the batcave. Also, file.read(x) probably counts newlines too, which my data for start and end positions does not.
(Also fixed some stupid errors in my posted code.)
Perhaps you could use a function to generate the slice indices you need.
def biggerFrame(start, end, left, frameSize=25):  # frameSize defaults to 25
    newStart = start - frameSize
    if newStart < 0:
        newStart = 0
    if frameSize > left:
        newEnd = end + left   # only `left` characters remain after `end`, so take them all
    else:
        newEnd = end + frameSize
    return newStart, newEnd
With that function, you can add something like the following to your code.
for indices in currentCluster:
    slice, dice = biggerFrame(indices[0], indices[1], indices[2], 50)  # frameSize is 50 here; you can make it whatever you want
    outputFile.write(line[slice:dice] + '\n')
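Building on that, here is a rough, untested sketch of how the pieces could fit together: collect the whole sequence for each matched header with the newlines stripped, so the positions from your dict line up, then slice once per cluster. Your positions are 1-based, so an extra -1 on the start may still be needed; writeCluster is just an illustrative helper name.

def writeCluster(header, sequence, cluster, outputFile):
    outputFile.write(header + "\n")
    for start, end, left in cluster:
        s, e = biggerFrame(start, end, left, 50)
        outputFile.write(sequence[s:e] + "\n")

header, sequence = None, ""
for line in genomeFile:
    line = line.strip()
    if line.startswith(">"):
        # flush the previous entry before starting a new one
        if header is not None and header[1:] in foundClusters:
            writeCluster(header, sequence, foundClusters[header[1:]], outputFile)
        header, sequence = line, ""
    else:
        sequence += line

# don't forget the last entry in the file
if header is not None and header[1:] in foundClusters:
    writeCluster(header, sequence, foundClusters[header[1:]], outputFile)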
My brother and I are creating a simple text editor that changes entries to pig latin using Python. Code below:
our_word = ("cat")
vowels = ("a","e","i","o","u")
#remember I have to compare variables not strings
way = "way"
for i in range(len(our_word)):
for j in range (len(vowels)):
#checking if there is any vowel present
if our_word[i] == vowels[j]:
# if there were to be any vowels our_word[i] wil now be changed with way
#.replace is our function the dot is what notates this in the python library
our_word = our_word.replace(our_word[i], way)
print(our_word)
Right now we're testing the word 'cat' but the program when run returns the following:
/Users/x/PycharmProjects/pythonProject3/venv/bin/python /Users/x/PycharmProjects/pythonProject3/main.py
cwwayyt
Process finished with exit code 0
We're not sure why there is a double 'w' and a double 'y'. It seems the word 'cat' is edited once to 'cwayt' and then a second time to 'cwwayyt'.
Any suggestions are welcome!
The problem arises from the fact that on the next iteration of for loop after doing the substitution, you are looking at the next position, which is part of the way that you just substituted into place. Instead, you need to skip past this. You would also experience another problem, that it only loops up to the original length, rather than the new increased length. You are probably better in this situation to use a while loop with an index variable that you can manipulate to point to the correct place as needed. For example:
our_word = "cat"
vowels = "aeiou"
way = "way"
i = 0
while i < len(our_word):
if our_word[i] in vowels:
our_word = our_word[:i] + way + our_word[i + 1:]
i += len(way) # <=== if you made a substitution, skip over the bit
# that you just substituted in place
else:
i += 1 # <=== if you didn't make any substitution
# just go to the next position next time
print(our_word)
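With our_word = "cat", this now prints cwayt, as expected.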
Here I've tried to recreate the str.split() method in Python. I've tried and tested this code and it works fine, but I'm looking for vulnerabilities to correct. Do check it out and give feedback, if any.
Edit: Apologies for not being clear, I meant to ask you guys for cases where the code won't work. I'm also trying to think of a more refined way without looking at the source code.
def splitt(string, split_by=' '):
    output = []
    x = 0
    for i in range(string.count(split_by)):
        output.append((string[x:string.index(split_by, x+1)]).strip())
        x = string.index(split_by, x+1)
    output.append((((string[::-1])[:len(string)-x])[::-1]).strip())
    return output
There are in fact a few problems with your code:
by searching from x+1, you may miss an occurrence of split_by at the very start of the string, causing index to fail in the last iteration
you are calling index more often than necessary
strip only makes sense if the separator is whitespace, and even then might remove more than intended, e.g. trailing spaces when splitting lines
instead, add len(split_by) to the offset for the next call to index
no need to reverse the string twice in the last step
This should fix those problems:
def splitt(string, split_by=' '):
    output = []
    x = 0
    for i in range(string.count(split_by)):
        x2 = string.index(split_by, x)
        output.append(string[x:x2])
        x = x2 + len(split_by)
    output.append(string[x:])
    return output
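For a quick sanity check, this version now agrees with the built-in str.split for a simple case:

print(splitt("a,b,,c", ","))   # ['a', 'b', '', 'c']
print("a,b,,c".split(","))     # ['a', 'b', '', 'c'] -- same result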
I have this code that works great and does what I want; however, it does it in linear time, which is way too slow for the size of my data files, so I want to convert it to a logarithmic (binary) search. I tried this code and many others posted here, but still no luck getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput

'''This code runs fine and does what I expect, removing duplicates from the big
file that are in the small file; however, it is a linear function.'''

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

for line in fileinput.input('big.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')
'''This code is my attempt at conversion to a log function.'''

def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')
    return log_search(small, big)
The big file has millions of lines of int data.
The small file has hundreds of lines of int data.
I want to compare the data and remove duplicated data from the big file, but leave that line blank.
Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem in the wrong way. My attempt at converting it to a log search runs without error but does nothing.
I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, will have a complexity of O(b), with b being the number of lines in big.txt.
What you seem to be trying is to reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and then removing/overwriting it.
This would work well if all the lines were in a list with random access, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases?) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read that line. Also, you can not just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
So, yes, theoretically it can be done in O(s*logb), if you do the following:
implement binary search, searching not on the lines, but on the characters of the big file
for each position, backtrack to the last line break, then read the line to get the number
try again in the lower/upper half as usual with binary search
if the number is found, replace with as many spaces as there are digits in the number
repeat with the next number from the small file
On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with an accordant number of - (not just a blank line, that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.
import os

def getlineat(f, pos):
    pos = f.seek(pos)
    while pos > 0 and f.read(1) != "\n":
        pos = f.seek(pos-1)
    return pos+1 if pos > 0 else 0

def bsearch(f, num):
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line: break  # end of file
        val = int(line)
        if val == num:
            return (pos, len(line.strip()))
        elif num < val:
            upper = mid - 1
        elif num > val:
            lower = mid + 1
    return (-1, -1)

def overwrite(filename, to_remove):
    with open(filename, "r+") as f:
        positions = [bsearch(f, n) for n in to_remove]
        for n, (pos, length) in sorted(zip(to_remove, positions)):
            print(n, pos)
            if pos != -1:
                f.seek(pos)
                f.write("-" * length)

import random
to_remove = [random.randint(-500, 1500) for _ in range(10)]
overwrite("test.txt", to_remove)
This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search would have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).
The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.
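An untested sketch of that generalization, reusing getlineat from above (value_function is a made-up parameter name):

def bsearch(f, target, value_function=int):
    # binary search in a file whose lines are sorted by value_function(line)
    lower = 0
    upper = os.stat(f.name).st_size - 1
    while lower <= upper:
        mid = (lower + upper) // 2
        pos = getlineat(f, mid)
        line = f.readline()
        if not line:
            break  # end of file
        val = value_function(line)
        if val == target:
            return (pos, len(line.strip()))
        elif target < val:
            upper = mid - 1
        else:
            lower = mid + 1
    return (-1, -1)

# e.g. for a CSV sorted by an integer key in its first column:
# bsearch(f, 42, value_function=lambda line: int(line.split(",")[0]))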
Say I have a text file I'm operating on. Something like this (hopefully this isn't too unreadable):
import re

data_raw = open('my_data_file.dat').read()
# finditer (rather than findall) yields match objects that have .start() and .end()
matches = re.finditer(my_regex, data_raw, re.MULTILINE)
for match in matches:
    try:
        parse(data_raw, from_=match.start(), to=match.end())
    except Exception:
        print("Error parsing data starting on line {}".format(what_do_i_put_here))
        raise
Notice in the exception handler there's a certain variable named what_do_i_put_here. My question is: how can I assign to that name so that my script will print the line number that contains the start of the 'bad region' I'm trying to work with? I don't mind re-reading the file, I just don't know what I'd do...
Here's something a bit cleaner, and in my opinion easier to understand than your own answer:
def index_to_coordinates(s, index):
    """Returns (line_number, col) of `index` in `s`."""
    if not len(s):
        return 1, 1
    sp = s[:index+1].splitlines(keepends=True)
    return len(sp), len(sp[-1])
It works essentially the same way as your own answer, but by utilizing string slicing, splitlines() calculates all the information you need without any post-processing.
Using keepends=True is necessary to give correct column counts for end-of-line characters.
The only extra problem is the edge case of an empty string, which can easily be handled by a guard-clause.
I tested it in Python 3.8, but it probably works correctly after about version 3.4 (in some older versions len() counts code units instead of code points, and I assume it would break for any string containing characters outside of the BMP)
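A quick illustration of the output (1-based line and column):

text = "first line\nsecond line"
print(index_to_coordinates(text, 0))    # (1, 1) -> the 'f'
print(index_to_coordinates(text, 11))   # (2, 1) -> the 's' of "second"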
I wrote this. It's untested and inefficient but it does help my exception message be a little clearer:
def coords_of_str_index(string, index):
    """Get (line_number, col) of `index` in `string`."""
    lines = string.splitlines(True)
    curr_pos = 0
    for linenum, line in enumerate(lines):
        if curr_pos + len(line) > index:
            return linenum + 1, index - curr_pos
        curr_pos += len(line)
I haven't even tested to see if the column number is vaguely accurate. I failed to abide by YAGNI
Column indexing starts with 0, so you need to subtract 1 from len(sp[-1]) at the very end of your code to get the correct column value. Also, I'd perhaps return None (instead of "1.1" - which is also incorrect, since it should be "1.0"...) if the length of the string is 0 or if the string is too short to fit the index.
Otherwise, it's an excellent and elegant solution, Tim.
def index_to_coordinates(txt: str, index: int) -> str:
    """Returns 'line.column' of index in 'txt'."""
    if not txt or len(txt)-1 < index:
        return None
    sp = txt[:index+1].splitlines(keepends=True)
    return f"{len(sp)}.{len(sp[-1])-1}"
I'm working with huge CSV data files. Each file contains millions of records, and each record has a key. The records are sorted by their key. I don't want to go over the whole file when searching for certain data.
I've seen this solution: Reading Huge File in Python
But it assumes that all lines in the file have the same length, which is not the case for my data.
I thought about adding padding to each line and keeping a fixed line length, but I'd like to know if there is a better way to do it.
I'm working with Python.
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
file_size = 24935502  # size of the file in bytes
for i, search in enumerate(search_keys):  # search_keys contains the list of search keys
    left, right = 0, file_size - 1
    key = None
    while key != search and left <= right:
        mid = (left + right) // 2
        fin.seek(mid)
        # now realign to a record
        if mid:
            fin.readline()
        key, value = map(int, fin.readline().split())
        if search > key:
            left = mid + 1
        else:
            right = mid - 1
    if key != search:
        value = None  # for when the search key is not found
    search.result = value  # store the result of the search
To resolve it, you can also use binary search, but you need to change it a bit:
Get the file size.
Seek to the middle of the file with file.seek.
Search for the first EOL character there; you are then at the start of a new line.
Check this line's key, and if it is not what you want, update the search range and go back to step 2.
Here is a sample code:
fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()
while begin < end:
    fp.seek((end + begin) // 2, 0)
    fp.readline()
    line_key = get_key(fp.readline())
    if key == line_key:
        pass  # found what you want
    elif key > line_key:
        begin = fp.tell()
    else:
        end = fp.tell()
Maybe the code has bugs; verify it yourself. And please check the performance if you really want the fastest way.
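get_key is left for you to fill in. For a comma-separated file with an integer key in the first column it might look something like this (purely illustrative):

def get_key(line):
    # assumes the key is the first comma-separated field and is an integer
    return int(line.split(',')[0])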
The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset for each line, and then for each of your search items jump to the record of interest using os.lseek on the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.
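A rough sketch of that index-then-seek idea, assuming the key is an integer in the first CSV column (build_offset_index and lookup are made-up names, and plain file.seek is used here instead of os.lseek):

def build_offset_index(path):
    # one linear pass: map each key to the byte offset where its line starts
    index = {}
    offset = 0
    with open(path, 'rb') as f:
        for line in f:
            key = int(line.split(b',')[0])
            index[key] = offset
            offset += len(line)
    return index

def lookup(path, index, key):
    # jump straight to the stored offset and read just that one record
    pos = index.get(key)
    if pos is None:
        return None
    with open(path, 'rb') as f:
        f.seek(pos)
        return f.readline().decode().rstrip('\r\n')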