Hello, I'm facing a problem and I don't know how to fix it. All I know is that when I add an else statement to my if statement, execution always goes to the else branch, even when the if condition is true and the if branch should be entered.
Here is the script, without the else statement:
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
    Wspl=line.split()
    for line2 in Mlines:
        Mspl=line2.replace('\n','').split("\t")
        if ((Mspl[0]).lower()==(Wspl[0])):
            Wspl.append(Mspl[1])
            if(len(Mspl)>=3):
                Wspl.append(Mspl[2])
            s="\t".join(Wspl)+"\n"
            if s not in filtred:
                filtred.append(s)
            break
for x in filtred:
    w.write(x)
f.close()
d.close()
w.close()
Here is the script with the else statement; I want the else to belong to the if ((Mspl[0]).lower()==(Wspl[0])): check:
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
    Wspl=line.split()
    for line2 in Mlines:
        Mspl=line2.replace('\n','').split("\t")
        if ((Mspl[0]).lower()==(Wspl[0])):
            Wspl.append(Mspl[1])
            if(len(Mspl)>=3):
                Wspl.append(Mspl[2])
            s="\t".join(Wspl)+"\n"
            if s not in filtred:
                filtred.append(s)
            break
        else:
            b="\t".join(Wspl)+"\n"
            if b not in filtred:
                filtred.append(b)
            break
for x in filtred:
    w.write(x)
f.close()
d.close()
w.close()
First of all, you're not using "re" at all in your code besides importing it (maybe in some later part?), so the title is a bit misleading.
Secondly, you are doing a lot of work for what is basically a filtering operation on two files. Remember, simple is better than complex, so for starters, you want to clean your code a bit:
You should use more indicative names than 'd' or 'w'. This goes for 'Wspl', 's' and 'av' as well. Those names don't mean anything and are hard to understand (why is the result of d.readlines() named Wlines when there's another file handle named 'w'? It's really confusing).
If you choose to use single letters, they should still make sense (if you iterate over a list named 'results' it makes sense to use 'r'; 'line' and 'line2', however, are not recommended for anything).
You don't need parentheses around conditions.
You want to use as few variables as you can, so as not to get confused. There are too many different variables in your code, and it's easy to get lost. You don't even use some of them.
You want to use strip rather than replace, and you want the whole 'cleaning' process to come first, followed by code that deals only with the filtering logic on the two lists. If you split each line according to some logic and don't use the original line anywhere in the iteration, you can do all the splitting at the beginning.
Now, I'm really confused about what you're trying to achieve here, and while I don't understand why you're doing it that way, I can say that, looking at your logic, you are repeating yourself a lot. The check against the filtered list should only happen once, and since it happens regardless of whether the 'if' checks out or not, I see absolutely no reason to use an 'else' clause at all.
Cleaning up like I mentioned, and re-building the logic, the script looks something like this:
# PART I - read and analyze the lines
Wappresults = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
Mikrofull = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
Wapp = map(lambda x: x.strip().split(), Wappresults.readlines())
Mikro = map(lambda x: x.strip().split('\t'), Mikrofull.readlines())
Wappresults.close()
Mikrofull.close()

# PART II - filter using some logic
filtred = []
for w in Wapp:
    res = w[:]  # so as to copy the list instead of pointing to it
    for m in Mikro:
        if m[0].lower() == w[0]:
            res.append(m[1])
            if len(m) >= 3:
                res.append(m[2])
    string = '\t'.join(res) + '\n'  # this happens regardless of whether the 'if' statement changed 'res' or not
    if string not in filtred:
        filtred.append(string)

# PART III - write the filtered results into a file
combination = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
for comb in filtred:
    combination.write(comb)
combination.close()
I can't promise it will work (because again, like I said, I don't know what you're trying to achieve), but this should be a lot easier to work with.
Here I've tried to recreate the str.split() method in Python. I've tried and tested this code and it works fine, but I'm looking for vulnerabilities to correct. Do check it out and give feedback, if any.
Edit: Apologies for not being clear; I meant to ask for the cases where the code won't work. I'm also trying to think of a more refined way without looking at the source code.
def splitt(string, split_by=' '):
    output = []
    x = 0
    for i in range(string.count(split_by)):
        output.append((string[x:string.index(split_by, x+1)]).strip())
        x = string.index(split_by, x+1)
    output.append((((string[::-1])[:len(string)-x])[::-1]).strip())
    return output
There are in fact a few problems with your code:
by searching from x+1, you may miss an occurrence of split_by at the very start of the string, which can make index fail in the last iteration; instead, add len(split_by) to the offset for the next call to index
you are calling index more often than necessary
strip only makes sense if the separator is whitespace, and even then it might remove more than intended, e.g. trailing spaces when splitting lines
there is no need to reverse the string twice in the last step
This should fix those problems:
def splitt(string, split_by=' '):
    output = []
    x = 0
    for i in range(string.count(split_by)):
        x2 = string.index(split_by, x)
        output.append(string[x:x2])
        x = x2 + len(split_by)
    output.append(string[x:])
    return output
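As a quick spot check (not an exhaustive test), the fixed version should now agree with the built-in str.split on simple inputs, including empty fields between consecutive separators:

print splitt("a,b,,c", ",")        # ['a', 'b', '', 'c']
print "a,b,,c".split(",")          # ['a', 'b', '', 'c']
print splitt("one two  three")     # ['one', 'two', '', 'three']
print "one two  three".split(" ")  # ['one', 'two', '', 'three']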
I have some code, shown below, that contains a with block.
alleles = []
pop = 1 + 4
snp = []
# read in input file
with open("input.txt", 'r') as f:
    for line in f:
        alleles.append(line.split()[2] + line.split()[1])  # risk first, then ref. keep risk as second allele
    snp = [{"riskall": line[1], "weight": float(line[4]), "freq": float(line[pop]),
            line[1]+line[1]: (2*float(line[4]), (float(line[pop])*float(line[pop]))),
            line[2]+line[1]: (float(line[4]), (2*(((1-float(line[pop]))*(float(line[pop])))))),
            line[2]+line[2]: (0, ((1-float(line[pop]))*((1-float(line[pop])))))} for line in map(lambda x: x.split(), f)]
Strangely enough, the snp assignment simply does not work, resulting in an empty list. The same occurs if I swap the order of my for loop and the snp assignment: the latter then works fine, but the former does nothing.
Does anyone know what is going on? I'm quite sure the indentation is correct.
EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
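For context, this behaviour is what you would see if the file iterator were exhausted: the for loop reads f to the end, so whatever iterates over f afterwards finds nothing left (and vice versa when the two are swapped). A minimal sketch of that effect, using a hypothetical sample.txt:

with open("sample.txt", "r") as f:       # hypothetical file with a few lines
    first_pass = [line for line in f]    # consumes the whole file
    second_pass = [line for line in f]   # the iterator is already at end-of-file

print len(first_pass)    # the number of lines in the file
print len(second_pass)   # 0 - nothing left to read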
I am attempting to get the pos and alt strings from file1 to match up with what is in file2, fairly simple. However, file2 has values from the 17th split element/column to the last element/column (the 340th) that contain strings such as 1/1:1.2.2:51:12, which I also want to filter for.
I want to extract the rows from file2 that contain/match the pos and alt from file1. Thereafter, I want to further filter the matched results to keep only those containing certain values in the 17th split element/column onwards. But to do so, the values would have to be split by ":" so I can filter for split[0] == "1/1" and split[2] > 50. The problem is I have no idea how to do this.
I imagine I will have to iterate over these and split, but I am not sure how to do that, as the code is presently in a loop and the values I want to filter are in columns, not rows.
Any advice would be greatly appreciated; I have sat with this problem since Friday and have yet to find a solution.
import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")
matched = []
for (x),(y) in itertools.product(file2,file1):
    if not x.startswith("#"):
        cells_y = y.split("\t")
        pos_y = cells[0]
        alt_y = cells[3]
        cells_x = x.split("\t")
        pos_x = cells_x[0]+":"+cells_x[1]
        alt_x = cells_x[4]
        if pos_y in pos_x and alt_y in alt_x:
            matched.append(x)
for z in matched:
    cells_z = z.split("\t")
    if cells_z[16:len(cells_z)]:
Your requirement is not clear, but you might mean this:
for (x),(y) in itertools.product(file2,file1):
    if x.startswith("#"):
        continue
    cells_y = y.split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]
    cells_x = x.split("\t")
    pos_x = cells_x[0]+":"+cells_x[1]
    alt_x = cells_x[4]
    if pos_y != pos_x: continue
    if alt_y != alt_x: continue
    extra_match = False
    for f in range(17, 341):
        extra = cells_x[f].split(':')      # the 1/1:...:51:... style values live in file2's columns
        if extra[0] != '1/1': continue
        if int(extra[2]) <= 50: continue   # compare as a number, not a string
        extra_match = True
        break
    if not extra_match: continue
    xy = x + y
    matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.
You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python short-circuits compound 'if' conditions,
for example:
if x_evaluation and y_evaluation:
    # do some stuff
when x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
import csv

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            if x[3] == y[4] and x[0] == ":".join(y[:2]):
                yield x

def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, I split each element
            # from the 17th onward and look for the two cases you mentioned. This seems like it might be very
            # heavy, but at least we're using generators!
            # same idea as before, we abort as early as possible to avoid needless indexing and checks
            parts = element.split(":")
            # WARNING: if you aren't 100% sure the third part is an int, this is very dangerous
            # here, I use the continue keyword and the negative-check to help eliminate excess overhead;
            # the execution is very similar to the above, but might be easier to read/understand and can
            # help speed things along in some cases
            # once again, I do the lighter check before the heavier one
            if not int(parts[2]) > 50:
                # continue automatically skips to the next element
                continue
            if not parts[0] == "1/1":
                continue
            yield z

if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; loop through it for the lines which matched all 4 cases
    for match in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print match
namedtuples for the first part
from collections import namedtuple

FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            x_element = FirstFileElement(*x)
            y_element = SecondFileElement(*y)
            if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                yield x
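One caveat worth noting with the namedtuple approach: unpacking a row with * requires the row to have exactly as many columns as the declared fields, so if the real files carry extra trailing columns you would want to slice first. A hypothetical example with a made-up row:

from collections import namedtuple

FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")

row = ["1:100", "x", "y", "T", "extra1", "extra2"]  # hypothetical row with trailing extra columns
element = FirstFileElement(*row[:4])                # slice down to the declared fields
print element.pos, element.alt                      # 1:100 T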
I'm working on taking apart a Python script piece by piece to get a better understanding of 1) how Python works and 2) what this particular script does, and whether I can make it better (i.e. usable with slightly more varied inputs).
Okay so, there is a line in my code that looks like this:
thisChrominfo = chrominfo[thisChrom]
Where chrominfo calls a dictionary set that looks like this:
{'chrY': ['chrY', '59373566', '3036303846'], 'chrX': ['chrX', '155270560', '2881033286'], 'chr13': ['chr13', '115169878', '2084766301'], 'chr12': ['chr12', '133851895', '1950914406'], 'chr11': ['chr11', '135006516', '1815907890'], 'chr10': ['chr10', '135534747', '1680373143'], 'chr17': ['chr17', '81195210', '2500171864'], 'chr16': ['chr16', '90354753', '2409817111'], 'chr15': ['chr15', '102531392', '2307285719'], 'chr14': ['chr14', '107349540', '2199936179'], 'chr19': ['chr19', '59128983', '2659444322'], 'chr18': ['chr18', '78077248', '2581367074'], 'chrM': ['chrM', '16571', '3095677412'], 'chr22': ['chr22', '51304566', '2829728720'], 'chr20': ['chr20', '63025520', '2718573305'], 'chr21': ['chr21', '48129895', '2781598825'], 'chr7': ['chr7', '159138663', '1233657027'], 'chr6': ['chr6', '171115067', '1062541960'], 'chr5': ['chr5', '180915260', '881626700'], 'chr4': ['chr4', '191154276', '690472424'], 'chr3': ['chr3', '198022430', '492449994'], 'chr2': ['chr2', '243199373', '249250621'], 'chr1': ['chr1', '249250621', '0'], 'chr9': ['chr9', '141213431', '1539159712'], 'chr8': ['chr8', '146364022', '1392795690']}
and thisChrom holds a single (non-integer) value from a column that includes things like this:
'*' to `chr4` to `chrY`, etc.
thisChrom only holds one value at a time, because it relies on a piece higher up in the file that processes only a single row:
for x in INFILE:
    arow = x.rstrip().split("\t")
    thisChrom = arow[2]
    thisChrompos = arow[3]
So it's pulling one column from one row.
The whole thing falls apart when values like '*' are present in arow, because that's not in the chrominfo dictionary. At first I thought I should just go ahead and add it to the dictionary, but now I'm thinking it would be easier and better to instead add a line at the top that says something like, if arow[2] == '*' then delete it, else continue.
I know it should look something like this:
for x in INFILE:
    arow = x.rstrip().split("\t")
    thisChrom = arow[2]
    thisChrompos = arow[3]
    if arow == '*': arow.remove(*)
    else:
        continue
but I haven't been able to get the syntax quite right. All of my Python is self & stackoverflow taught, so I appreciate your suggestions and guidance. Sorry if that was an over-explanation of something that is very simple for most experts (I am not an expert).
The continue keyword is somewhat non-obvious. What it does is skip the remaining contents of the loop, and start the next iteration. So, what you wrote will skip the rest of the loop only if arow is not equal to '*'.
Instead of
if arow == '*': arow.remove(*)
else:
    continue
# process the row
you might simply want to either use a simple if condition:
if arow != '*':
    # process the row
or use continue in the way you probably intended:
if arow == '*': continue
# process the row
See how it works in the opposite way of what you thought? Also, you don't need an else in this case because of how the continue skips the rest of the loop.
If you're familiar with the break keyword, it may make more sense as a comparison. The break keyword stops the loop entirely - it "breaks" out of it and moves on. The continue keyword is simply a "weaker" version of that - it "breaks" out of the current iteration but not all the way out of the loop.
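Putting that together with the loop from your question, a minimal sketch (assuming arow[2] is the field that can be '*', and that INFILE and chrominfo are set up as in your script):

for x in INFILE:
    arow = x.rstrip().split("\t")
    if arow[2] == '*':                    # skip rows whose chromosome field is '*'
        continue
    thisChrom = arow[2]
    thisChrompos = arow[3]
    thisChrominfo = chrominfo[thisChrom]  # safe now: '*' rows never reach this lookup
    # ... rest of the processing ...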
In the following code I am trying to make a "more" command (like the Unix one) using a Python script, by reading the file into a list, printing 10 lines at a time, and then asking the user whether they want to print the next 10 lines (Print More..).
The problem is that raw_input keeps asking for input again and again when I give 'y' or 'Y', but the while loop never moves on to the next chunk; if I give any other input, the while loop breaks.
My code may not be the best, as I am learning Python.
import sys
import string

lines = open('/Users/abc/testfile.txt').readlines()
chunk = 10
start = 0
while 1:
    block = lines[start:chunk]
    for i in block:
        print i
    if raw_input('Print More..') not in ['y', 'Y']:
        break
    start = start + chunk
The output I am getting for this code is:
--
10 lines from file
Print More..y
Print More..y
Print More..y
Print More..a
You're constructing your slices wrong: The second parameter in a slice gives the stop position, not the chunk size:
chunk = 10
start = 0
stop = chunk
end = len(lines)
while True:
    block = lines[start:stop]  # use stop, not chunk!
    for i in block:
        print i
    if raw_input('Print More..') not in ['y', 'Y'] or stop >= end:
        break
    start += chunk
    stop += chunk
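To make the slice arithmetic concrete, here is a small standalone sketch with made-up lines showing how start and stop advance together:

lines = ['line %d' % n for n in range(1, 26)]  # 25 dummy lines

chunk = 10
start, stop = 0, chunk
while start < len(lines):
    print lines[start:stop]   # lines 1-10, then 11-20, then 21-25
    start += chunk
    stop += chunk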
Instead of explaining why your code doesn't work and how to fix it (because Tim Pietzcker already did an admirable job of that), I'm going to explain how to write code so that issues like this don't come up in the first place.
Trying to write your own explicit loops, checks, and index variables is difficult and error-prone. That's why Python gives you nice tools that almost always make it unnecessary to do so. And that's why you're using Python instead of C.
For example, look at the following version of your program:
count = 10
with open('/Users/abc/testfile.txt', 'r') as testfile:
    for i, line in enumerate(testfile):
        print line
        if (i + 1) % count == 0:
            if raw_input('Print More..') not in ['y', 'Y']:
                break
This is shorter than the original code, and it's also much more efficient (no need to read the whole file in and then build a huge list in advance), but those aren't very good reasons to use it.
One good reason is that it's much more robust. There's very little explicit loop logic here to get wrong. You don't even need to remember how slices work (sure, it's easy to learn that they're [start:stop] rather than [start:length]… but if you program in another language much more frequently than Python, and you're always writing s.sub(start, length), you're going to forget…). It also automatically takes care of ending when you get to the end of the file instead of continuing forever, closing the file for you (even on exceptions, which is painful to get right manually), and other stuff that you haven't written yet.
The other good reason is that it's much easier to read, because, as much as possible, the code tells you what it's doing, rather than the details of how it's doing it.
But it's still not perfect, because there's still one thing you could easily get wrong: that (i + 1) % count == 0 bit. In fact, I got it wrong in my first attempt (I forgot the +1, so it gave me a "More" prompt after lines 0, 10, 20, … instead of 9, 19, 29, …). If you have a grouper function, you can rewrite it even more simply and robustly:
with open('/Users/abc/testfile.txt', 'r') as testfile:
    for group in grouper(testfile, 10):
        for line in group:
            print line
        if raw_input('Print More..') not in ['y', 'Y']:
            break
Or, even better:
with open('/Users/abc/testfile.txt', 'r') as testfile:
    for group in grouper(testfile, 10):
        print '\n'.join(group)
        if raw_input('Print More..') not in ['y', 'Y']:
            break
Unfortunately, there's no such grouper function built into, say, the itertools module, but you can write one very easily:
import itertools

def grouper(iterator, size):
    return itertools.izip(*[iterator]*size)
(If efficiency matters, search around this site—there are a few questions where people do in-depth comparisons of different ways to achieve the same effect. But usually it doesn't matter. For that matter, if you want to understand why this groups things, search this site, because it's been explained at least twice.)
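To illustrate what the grouper yields (keeping in mind that izip stops at the shortest input, so a trailing group smaller than size is silently dropped):

for group in grouper(iter(range(8)), 3):
    print group   # prints (0, 1, 2) and then (3, 4, 5); 6 and 7 are dropped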
As Tim Pietzcker pointed out, there's no need to update chunk here; just use start+10 instead of chunk:
block = lines[start:start+10]
and update start using start += 10.
Another alternative solution using itertools.islice():
with open("data1.txt") as f:
slc=islice(f,5) #replace 5 by 10 in your case
for x in slc:
print x.strip()
while raw_input("wanna see more : ") in("y","Y"):
slc=islice(f,5) #replace 5 by 10 in your case
for x in slc:
print x.strip()
this outputs:
1
2
3
4
5
wanna see more : y
6
7
8
9
10
wanna see more : n