Here I've tried to recreate the str.split() method in Python. I've tried and tested this code and it works fine, but I'm looking for vulnerabilities to correct. Do check it out and give feedback, if any.
Edit:Apologies for not being clear,I meant to ask you guys for exceptions where the code won't work.I'm also trying to think of a more refined way without looking at the source code.
def splitt(string,split_by = ' '):
output = []
x = 0
for i in range(string.count(split_by)):
output.append((string[x:string.index(split_by,x+1)]).strip())
x = string.index(split_by,x+1)
output.append((((string[::-1])[:len(string)-x])[::-1]).strip())
return output
There are in fact a few problems with your code:
by searching from x+1, you may miss an occurance of split_by at the very start of the string, resulting in index to fail in the last iteration
you are calling index more often than necessary
strip only makes sense if the separator is whitespace, and even then might remove more than intended, e.g. trailing spaces when splitting lines
instead, add len(split_by) to the offset for the next call to index
no need to reverse the string twice in the last step
This should fix those problems:
def splitt(string,split_by=' '):
output = []
x = 0
for i in range(string.count(split_by)):
x2 = string.index(split_by, x)
output.append((string[x:x2]))
x = x2 + len(split_by)
output.append(string[x:])
return output
Related
My brother and I are creating a simple text editor that changes entries to pig latin using Python. Code below:
our_word = ("cat")
vowels = ("a","e","i","o","u")
#remember I have to compare variables not strings
way = "way"
for i in range(len(our_word)):
for j in range (len(vowels)):
#checking if there is any vowel present
if our_word[i] == vowels[j]:
# if there were to be any vowels our_word[i] wil now be changed with way
#.replace is our function the dot is what notates this in the python library
our_word = our_word.replace(our_word[i], way)
print(our_word)
Right now we're testing the word 'cat' but the program when run returns the following:
/Users/x/PycharmProjects/pythonProject3/venv/bin/python /Users/x/PycharmProjects/pythonProject3/main.py
cwwayyt
Process finished with exit code 0
We're not sure why there is a double 'w' and a double 'y'. It seems the word 'cat' is edited once to 'cwayt' and then a second time to 'cwwayyt'.
Any suggestions are welcome!
The problem arises from the fact that on the next iteration of for loop after doing the substitution, you are looking at the next position, which is part of the way that you just substituted into place. Instead, you need to skip past this. You would also experience another problem, that it only loops up to the original length, rather than the new increased length. You are probably better in this situation to use a while loop with an index variable that you can manipulate to point to the correct place as needed. For example:
our_word = "cat"
vowels = "aeiou"
way = "way"
i = 0
while i < len(our_word):
if our_word[i] in vowels:
our_word = our_word[:i] + way + our_word[i + 1:]
i += len(way) # <=== if you made a substitution, skip over the bit
# that you just substituted in place
else:
i += 1 # <=== if you didn't make any substitution
# just go to the next position next time
print(our_word)
I'm making a method that takes a string, and it outputs parts of the strings on separate line according to a window.
For example:
I want to output every 3 letters of my string on separate line.
Input : "Advantage"
Output:
Adv
ant
age
Input2: "23141515"
Output:
231
141
515
My code:
def print_method(input):
mywindow = 3
start_index = input[0]
if(start_index == input[len(input)-1]):
exit()
print(input[1:mywindow])
printmethod(input[mywindow:])
However I get a runtime error.... Can someone help?
I think this is what you're trying to get. Here's what I changed:
Renamed input to input_str. input is a keyword in Python, so it's not good to use for a variable name.
Added the missing _ in the recursive call to print_method
Print from 0:mywindow instead of 1:mywindow (which would skip the first character). When you start at 0, you can also just say :mywindow to get the same result.
Change the exit statement (was that sys.exit?) to be a return instead (probably what is wanted) and change the if condition to be to return once an empty string is given as the input. The last string printed might not be of length 3; if you want this, you could use instead if len(input_str) < 3: return
def print_method(input_str):
mywindow = 3
if not input_str: # or you could do if len(input_str) == 0
return
print(input_str[:mywindow])
print_method(input_str[mywindow:])
edit sry missed the title: if that is not a learning example for recursion you shouldn't use recursion cause it is less efficient and slices the list more often.
def chunked_print (string,window=3):
for i in range(0,len(string) // window + 1): print(string[i*window:(i+1)*window])
This will work if the window size doesn't divide the string length, but print an empty line if it does. You can modify that according to your needs
EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am attempting to get the pos and alt strings from file1 to match up with what is in
file2, fairly simple. However, file2 has values in the 17th split element/column to the
last element/column (340th) which contains string such as 1/1:1.2.2:51:12 which
I also want to filter for.
I want to extract the rows from file2 that contain/match the pos and alt from file1.
Thereafter, I want to further filter the matched results that only contain certain
values in the 17th split element/column onwards. But to do so the values would have to
be split by ":" so I can filter for split[0] = "1/1" and split[2] > 50. The problem is
I have no idea how to do this.
I imagine I will have to iterate over these and split but I am not sure how to do this
as the code is presently in a loop and the values I want to filter are in columns not rows.
Any advice would be greatly appreciated, I have sat with this problem since Friday and
have yet to find a solution.
import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")
matched = []
for (x),(y) in itertools.product(file2,file1):
if not x.startswith("#"):
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y in pos_x and alt_y in alt_x:
matched.append(x)
for z in matched:
cells_z = z.split("\t")
if cells_z[16:len(cells_z)]:
Your requirement is not clear, but you might mean this:
for (x),(y) in itertools.product(file2,file1):
if x.startswith("#"):
continue
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y != pos_x: continue
if alt_y != alt_x: continue
extra_match = False
for f in range(17, 341):
y_extra = y[f].split(':')
if y_extra[0] != '1/1': continue
if y_extra[2] <= 50: continue
extra_match = True
break
if not extra_match: continue
xy = x + y
matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.
You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python has a shortcut feature with the comparative 'if'
for example:
if x_evaluation and y_evaluation:
do some stuff
when x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
import csv
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
if x[3] == y[4] and x[0] == ":".join(y[:1]):
yield x
def match_datestamp_and_alt_and_pos(first_file, second_file):
for z in filter_matching_alt_and_pos(first_file, second_file):
for element in z[16:]:
# I am not sure I fully understood your filter needs for the 2nd half. Here, I split all elements from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
# same idea as before, we abort as early as possible to avoid needless indexing and checks
for chunk in element.split(":"):
# WARNING: if you aren't 100% sure the 2nd element is an int, this is very dangerous
# here, I use the continue keyword and the negative-check to help eliminate excess overhead. The execution is very similar as above, but might be easier to read/understand and can help speed things along in some cases
# once again, I do the lighter check before the heavier one
if not int(chunk[2])> 50:
# continue automatically skips to the next iteration on element
continue
if not chunk[:1] == "1/1":
continue
yield z
if __name__ == '__main__':
first_file = "first.txt"
second_file = "second.txt"
# match_datestamp_and_alt_and_pos returns a generator; for loop through it for the lines which matched all 4 cases
match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file)
namedtuples for the first part
from collections import namedtuple
FirstFileElement = namedtuple("FirstFrameElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFrameElement", "pos1 pos2 unused2 unused3 alt")
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
x_element = FirstFileElement(*x)
y_element = SecondFileElement(*y)
if x.alt == y.alt and x.pos == ":".join([y.pos1, y.pos2]):
yield x
So I have a File with some thousand entries of the form (fasta format, if anyone wants to know):
>scaffold1110_len145113_cov91
TAGAAAATTGAATAATTGATAGTTCTTAACGAAAAGTAAAAGTTTAAAGTATACAGAAATTTCAGGCTATTCACTCTTTT
ATAATCCAAAATTAGAAATACCACACCTTGCATAAAGTTTAAGATATTTACAAAAACCTGAAGTGGATAATCCGAAATCG
...
>Next_Header
ATGCTA...
And I have a python-dictionary from part of my code that contains information like the following for a number of headers:
{'scaffold1110_len145113_cov91': [[38039, 38854, 106259], [40035, 40186, 104927]]}
This describes the entry by header and a list of start position, end position and rest of characters in that entry (so start=1 means the first character of the line below that corresponding header). [start, end, left]
What I want to do is extract the string for this interval inclusive 25 (or a variable number) of characters in front and behind of it, if the entry allows for, otherwise include all characters to the begin/end. (like when the start position is 8, I cant include 25 chars in front but only 8.)
And that for every entry in my dict.
Sounds not too hard probably but I am struggling to come up with a clever way to do it.
For now my idea was to read lines from my file, check if they begin with ">" and look up if they exist in my dict. Then add up the chars per line until they exceed my start position and from there somehow manage to get the right part of that line to match my startPos - X.
for line in genomeFile:
line = line.strip()
if(line[0] == ">"):
header = line
currentCluster = foundClusters.get(header[1:])
if(currentCluster is not None):
outputFile.write(header + "\n")
if(currentCluster is not None):
charCount += len(line)
# *crazy calculations to find the actual part i want to extract*
I am quite the python beginner so maybe someone has a better idea how to solve this?
-- While typing this I got the idea to use file.read(startPos-X-1) after a line matches to a header I am looking for to read characters to get to my desired position and from there use file.read((endPos+X - startPos-X)) to extract the part I am looking for. If this works it seems pretty easy to accomplish what I want.
I'll post this anyway, maybe someone has an even better way or maybe my idea wont work.
thanks for any input.
EDIT:
turns out you cant mix for line in file with file.read(x) since the former uses buffering, soooooo back to the batcave. also file.read(x) probably counts newlines too, which my data for start and end position do not.
(also fixed some stupid errors in my posted code)
Perhaps you could use a function to generate your needed splice indices.
def biggerFrame( start, end, left, frameSize=25 ) : #defaults to 25 frameSize
newStart = start - frameSize
if newStart < 0 :
newStart = 0
if frameSize > left :
newEnd = left
else :
newEnd = end + frameSize
return newStart, newEnd
With that function, you can add something like the following to your code.
for indices in currentCluster :
slice, dice = biggerFrame( indices[0], indices[1], indices[2], 50) # frameSize is 50 here; you can make it whatever you want.
outputFile.write(line[slice:dice] + '\n')
Hello I'm facing a problem and I don't how to fix it. All I know is that when I add an else statement to my if statement the python execution always goes to the else statement even there is there a true statement in if and can enter the if statement.
Here is the script, without the else statement:
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
Wspl=line.split()
for line2 in Mlines:
Mspl=line2.replace('\n','').split("\t")
if ((Mspl[0]).lower()==(Wspl[0])):
Wspl.append(Mspl[1])
if(len(Mspl)>=3):
Wspl.append(Mspl[2])
s="\t".join(Wspl)+"\n"
if s not in filtred:
filtred.append(s)
break
for x in filtred:
w.write(x)
f.close()
d.close()
w.close()
with the else statement and I want else for the if ((Mspl[0]).lower()==(Wspl[0])):
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
Wspl=line.split()
for line2 in Mlines:
Mspl=line2.replace('\n','').split("\t")
if ((Mspl[0]).lower()==(Wspl[0])):
Wspl.append(Mspl[1])
if(len(Mspl)>=3):
Wspl.append(Mspl[2])
s="\t".join(Wspl)+"\n"
if s not in filtred:
filtred.append(s)
break
else:
b="\t".join(Wspl)+"\n"
if b not in filtred:
filtred.append(b)
break
for x in filtred:
w.write(x)
f.close()
d.close()
w.close()
first of all, you're not using "re" at all in your code besides importing it (maybe in some later part?) so the title is a bit misleading.
secondly, you are doing a lot of work for what is basically a filtering operation on two files. Remember, simple is better than complex, so for starters, you want to clean your code a bit:
you should use a little more indicative names than 'd' or 'w'. This goes for 'Wsplt', 's' and 'av' as well. Those names don't mean anything and are hard to understand (why is the d.readlines named Wlines when ther's another file named 'w'? It's really confusing).
If you choose to use single letters, it should still make sense (if you iterate over a list named 'results' it makes sense to use 'r'. 'line1' and 'line2' however, are not recommanded for anything)
You don't need parenthesis for conditions
You want to use as little variables as you can as to not get confused. There's too much different variables in your code, it's easy to get lost. You don't even use some of them.
you want to use strip rather than replace, and you want the whole 'cleaning' process to come first and then just have a code the deals with the filtering logic on the two lists. If you split each line according to some logic, and you don't use the original line anywhere in the iteration, then you can do the whole thing in the beggining.
Now, I'm really confused what you're trying to achieve here, and while I don't understand why your doing it that way, I can say that looking at your logic you are repeating yourself a lot. The action of checking against the filtered list should only happend once, and since it happens regardless of whether the 'if' checks out or not, I see absolutely no reason to use an 'else' clause at all.
Cleaning up like I mentioned, and re-building the logic, the script looks something like this:
# PART I - read and analyze the lines
Wappresults = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
Mikrofull = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
Wapp = map(lambda x: x.strip().split(), Wappresults.readlines())
Mikro = map(lambda x: x.strip().split('\t'), Mikrofull.readlines())
Wappresults.close()
Mikrofull.close()
# PART II - filter using some logic
filtred = []
for w in Wapp:
res = w[:] # So as to copy the list instead of point to it
for m in Mikro:
if m[0].lower() == w[0]:
res.append(m[1])
if len(m) >= 3 :
res.append(m[2])
string = '\t'.join(res)+'\n' # this happens regardles of whether the 'if' statement changed 'res' or not
if string not in filtred:
filtred.append(string)
# PART III - write the filtered results into a file
combination = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
for comb in filtred:
combination.write(comb)
combination.close()
I can't promise it will work (because again, like I said, I don't know what you're trying to achive) but this should be a lot easier to work with.