Removing rows dependent on one column Python

Removing rows dependent on one column Python - python

I'm working on taking apart a python script piece by piece to get a better understanding of 1.) how python works & 2.) what this particular script does & if I can make it better (i.e. usable by slightly more varied inputs).
Okay so, there is a line in my code that looks like this:
thisChrominfo = chrominfo[thisChrom]
Where chrominfo calls a dictionary set that looks like this:
{'chrY': ['chrY', '59373566', '3036303846'], 'chrX': ['chrX', '155270560', '2881033286'], 'chr13': ['chr13', '115169878', '2084766301'], 'chr12': ['chr12', '133851895', '1950914406'], 'chr11': ['chr11', '135006516', '1815907890'], 'chr10': ['chr10', '135534747', '1680373143'], 'chr17': ['chr17', '81195210', '2500171864'], 'chr16': ['chr16', '90354753', '2409817111'], 'chr15': ['chr15', '102531392', '2307285719'], 'chr14': ['chr14', '107349540', '2199936179'], 'chr19': ['chr19', '59128983', '2659444322'], 'chr18': ['chr18', '78077248', '2581367074'], 'chrM': ['chrM', '16571', '3095677412'], 'chr22': ['chr22', '51304566', '2829728720'], 'chr20': ['chr20', '63025520', '2718573305'], 'chr21': ['chr21', '48129895', '2781598825'], 'chr7': ['chr7', '159138663', '1233657027'], 'chr6': ['chr6', '171115067', '1062541960'], 'chr5': ['chr5', '180915260', '881626700'], 'chr4': ['chr4', '191154276', '690472424'], 'chr3': ['chr3', '198022430', '492449994'], 'chr2': ['chr2', '243199373', '249250621'], 'chr1': ['chr1', '249250621', '0'], 'chr9': ['chr9', '141213431', '1539159712'], 'chr8': ['chr8', '146364022', '1392795690']}
and thisChrom calls a single column (non-integer) that includes things like this:
'*' to `chr4`to `chrY` etc.
thisChrom only returns one value at a time, because it relies on a piece higher up in the file that specifies only a single row:
for x in INFILE:
arow = x.rstrip().split("\t")
thisChrom = arow[2]
thisChrompos = arow[3]
So it's pulling one column from one row.
The whole thing falls apart when values like '*' are present in arow, because that's not in the chrominfo dictionary. At first I thought I should just go ahead and add it to the dictionary, but now I'm thinking it would be easier and better to instead add a line at the top that says something like, if arow[2] == '*' then delete it, else continue.
I know it should look something like this:
for x in INFILE:
arow = x.rstrip().split("\t")
thisChrom = arow[2]
thisChrompos = arow[3]
if arow == '*': arow.remove(*)
else:
continue
but I haven't been able to get the syntax quite right. All of my Python is self & stackoverflow taught, so I appreciate your suggestions and guidance. Sorry if that was an over-explanation of something that is very simple for most experts (I am not an expert).

The continue keyword is somewhat non-obvious. What it does is skip the remaining contents of the loop, and start the next iteration. So, what you wrote will skip the rest of the loop only if arow is not equal to '*'.
Instead of
if arow == '*': arow.remove(*)
else:
continue
# process the row
you might simply want to either use a simple if condition:
if arow != '*':
# process the row
or use continue in the way you probably intended:
if arow == '*': continue
# process the row
See how it works in the opposite way of what you thought? Also, you don't need an else in this case because of how the continue skips the rest of the loop.
If you're familiar with the break keyword, it may make more sense as a comparison. The break keyword stops the loop entirely - it "breaks" out of it and moves on. The continue keyword is simply a "weaker" version of that - it "breaks" out of the current iteration but not all the way out of the loop.

Related

Python and CSV: find a row and give column value

It consists in creating a function def(,) that searches for the name of the kid in the CSV file and gives his age.
The CSV file is structured as this:
Nicholas,12
Matthew,6
Lorna,12
Michael,8
Sebastian,8
Joseph,10
Ahmed,15
while the code that I tried is this:
def fetchcolvalue(kids_agefile, kidname):
import csv
file = open(kids_agefile, 'r')
ct = 0
for row in csv.reader(file):
while True:
print(row[0])
if row[ct] == kidname:
break
The frustrating thing is that it doesn't give me any error, but an infinite loop: I think that's what I'm doing wrong.
So far, what I learnt from the book is only loops (while and for) and if-elif-else cycles, besides CSV and file basic manipulation operations, so I can't really figure out how can I solve the problem with only those tools.
Please notice that the function would have to work with a generic 2-columns CSV file and not only the kids' one.

the while True in your loop is going to make you loop forever (no variables are changed within the loop). Just remove it:
for row in csv.reader(file):
if row[ct] == kidname:
break
else:
print("{} not found".format(kidname))
the csv file is iterated upon, and as soon as row[ct] equals kidname it breaks.
I would add an else statement so you know if the file has been completely scanned without finding the kid's name (just to expose some little-known usage of else after a for loop: if no break encountered, goes into else branch.)
EDIT: you could do it in one line using any and a generator comprehension:
any(kidname == row[ct] for row in csv.reader(file))
will return True if any first cell matches, probably faster too.

This should work, in your example the for loop sets row to the first row of the file, then starts the while loop. The while loop never updates row so it is infinite. Just remove the while loop:
def fetchcolvalue(kids_agefile, kidname):
import csv
file = open(kids_agefile, 'r')
ct = 0
for row in csv.reader(file):
if row[ct] == kidname:
print(row[1])

How to match fields from two lists and further filter based upon the values in subsequent fields?

EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am attempting to get the pos and alt strings from file1 to match up with what is in
file2, fairly simple. However, file2 has values in the 17th split element/column to the
last element/column (340th) which contains string such as 1/1:1.2.2:51:12 which
I also want to filter for.
I want to extract the rows from file2 that contain/match the pos and alt from file1.
Thereafter, I want to further filter the matched results that only contain certain
values in the 17th split element/column onwards. But to do so the values would have to
be split by ":" so I can filter for split[0] = "1/1" and split[2] > 50. The problem is
I have no idea how to do this.
I imagine I will have to iterate over these and split but I am not sure how to do this
as the code is presently in a loop and the values I want to filter are in columns not rows.
Any advice would be greatly appreciated, I have sat with this problem since Friday and
have yet to find a solution.
import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")
matched = []
for (x),(y) in itertools.product(file2,file1):
if not x.startswith("#"):
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y in pos_x and alt_y in alt_x:
matched.append(x)
for z in matched:
cells_z = z.split("\t")
if cells_z[16:len(cells_z)]:

Your requirement is not clear, but you might mean this:
for (x),(y) in itertools.product(file2,file1):
if x.startswith("#"):
continue
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y != pos_x: continue
if alt_y != alt_x: continue
extra_match = False
for f in range(17, 341):
y_extra = y[f].split(':')
if y_extra[0] != '1/1': continue
if y_extra[2] <= 50: continue
extra_match = True
break
if not extra_match: continue
xy = x + y
matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.

You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python has a shortcut feature with the comparative 'if'
for example:
if x_evaluation and y_evaluation:
do some stuff
when x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
import csv
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
if x[3] == y[4] and x[0] == ":".join(y[:1]):
yield x
def match_datestamp_and_alt_and_pos(first_file, second_file):
for z in filter_matching_alt_and_pos(first_file, second_file):
for element in z[16:]:
# I am not sure I fully understood your filter needs for the 2nd half. Here, I split all elements from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
# same idea as before, we abort as early as possible to avoid needless indexing and checks
for chunk in element.split(":"):
# WARNING: if you aren't 100% sure the 2nd element is an int, this is very dangerous
# here, I use the continue keyword and the negative-check to help eliminate excess overhead. The execution is very similar as above, but might be easier to read/understand and can help speed things along in some cases
# once again, I do the lighter check before the heavier one
if not int(chunk[2])> 50:
# continue automatically skips to the next iteration on element
continue
if not chunk[:1] == "1/1":
continue
yield z
if __name__ == '__main__':
first_file = "first.txt"
second_file = "second.txt"
# match_datestamp_and_alt_and_pos returns a generator; for loop through it for the lines which matched all 4 cases
match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file)
namedtuples for the first part
from collections import namedtuple
FirstFileElement = namedtuple("FirstFrameElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFrameElement", "pos1 pos2 unused2 unused3 alt")
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
x_element = FirstFileElement(*x)
y_element = SecondFileElement(*y)
if x.alt == y.alt and x.pos == ":".join([y.pos1, y.pos2]):
yield x

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically,I need to take an array containing each individual invoice we sent out over the past year & break it down into twelve arrays which contains the invoices for that month using the dateSeperate() function, so that monthly_transactions[0] returns Januaries transactions, monthly_transactions[1] returns Februaries & so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the january transactions. However, once all of the January data is entered, I attempt to append the monthly_transactions array using line 44. However, this just causes the program to break & become unrepsonsive. The code still executes & doesnt return an error, but Python becomes unresponsive & I have to force quite out of it.
I've been writing the the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement. If I do that, monthly_transactions[0] returns an array containing all of the january invoices. the issue arises in my last else statement, which when added, causes Python to freeze.
Can anyone help me shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes I know global arrays aren't good. I'm a marketer trying to learn programming so any input you could give me on how to improve this would be much appreciated
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
global board_info
reader = csv.DictReader(file_obj, delimiter=',')
for line in reader:
item = []
item.append(line["company id"])
item.append(line["user id"])
item.append(line["Amount"])
item.append(line["Transaction Date"])
item.append(line["FIrst Transaction"])
line_items.append(item)
if __name__ == "__main__":
with open("ChurnTest.csv") as f_obj:
csv_dict_reader(f_obj)
#formats the transacation date data to make it more readable
def dateFormat():
for i in range(len(line_items)):
ddmmyyyy =(line_items[i][3])
yyyymmdd = ddmmyyyy[6:] + "-"+ ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
line_items[i][3] = yyyymmdd
#Takes the line_items array and splits it into new array monthly_tranactions, where each value holds one month of data
def dateSeperate():
for i in range(len(line_items)):
#if there are no values in the monthly transactions, add the first line item
if len(monthly_transactions) == 0:
test = []
test.append(line_items[i])
monthly_transactions.append(test)
# check to see if the line items year & month match a value already in the monthly_transaction array.
else:
for j in range(len(monthly_transactions)):
line_year = line_items[i][3][:2]
line_month = line_items[i][3][3:5]
array_year = monthly_transactions[j][0][3][:2]
array_month = monthly_transactions[j][0][3][3:5]
#print(line_year, array_year, line_month, array_month)
#If it does, add that line item to that month
if line_year == array_year and line_month == array_month:
monthly_transactions[j].append(line_items[i])
#Otherwise, create a new sub array for that month
else:
monthly_transactions.append(line_items[i])
dateFormat()
dateSeperate()
print(monthly_transactions)
I would really, really appreciate any thoughts or feedback you guys could give me on this code.

Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason for that is that you didn't tell the program to do it! The appending that you're talking about is done as part of your dateSeparate function, however you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeparate functions, but in order to use them, you need to include them in the main function somehow as calls, i.e. dateFormat() and dateSeparate().
EDIT: You've created the potential for an endless loop in the last else: section, which extends monthly_transactions by 1 if the line/array year/month aren't equal. This is problematic because it's within the loop for j in range(len(monthly_transactions)):. This loop will never get to the end if the length of monthly_transactions is increased by 1 every time through.

Why re is not compiling 'if' when there is 'else'?

Hello I'm facing a problem and I don't how to fix it. All I know is that when I add an else statement to my if statement the python execution always goes to the else statement even there is there a true statement in if and can enter the if statement.
Here is the script, without the else statement:
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
Wspl=line.split()
for line2 in Mlines:
Mspl=line2.replace('\n','').split("\t")
if ((Mspl[0]).lower()==(Wspl[0])):
Wspl.append(Mspl[1])
if(len(Mspl)>=3):
Wspl.append(Mspl[2])
s="\t".join(Wspl)+"\n"
if s not in filtred:
filtred.append(s)
break
for x in filtred:
w.write(x)
f.close()
d.close()
w.close()
with the else statement and I want else for the if ((Mspl[0]).lower()==(Wspl[0])):
import re
f = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
d = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
w = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
s=""
av =0
b=""
filtred=[]
Mlines=f.readlines()
Wlines=d.readlines()
for line in Wlines:
Wspl=line.split()
for line2 in Mlines:
Mspl=line2.replace('\n','').split("\t")
if ((Mspl[0]).lower()==(Wspl[0])):
Wspl.append(Mspl[1])
if(len(Mspl)>=3):
Wspl.append(Mspl[2])
s="\t".join(Wspl)+"\n"
if s not in filtred:
filtred.append(s)
break
else:
b="\t".join(Wspl)+"\n"
if b not in filtred:
filtred.append(b)
break
for x in filtred:
w.write(x)
f.close()
d.close()
w.close()

first of all, you're not using "re" at all in your code besides importing it (maybe in some later part?) so the title is a bit misleading.
secondly, you are doing a lot of work for what is basically a filtering operation on two files. Remember, simple is better than complex, so for starters, you want to clean your code a bit:
you should use a little more indicative names than 'd' or 'w'. This goes for 'Wsplt', 's' and 'av' as well. Those names don't mean anything and are hard to understand (why is the d.readlines named Wlines when ther's another file named 'w'? It's really confusing).
If you choose to use single letters, it should still make sense (if you iterate over a list named 'results' it makes sense to use 'r'. 'line1' and 'line2' however, are not recommanded for anything)
You don't need parenthesis for conditions
You want to use as little variables as you can as to not get confused. There's too much different variables in your code, it's easy to get lost. You don't even use some of them.
you want to use strip rather than replace, and you want the whole 'cleaning' process to come first and then just have a code the deals with the filtering logic on the two lists. If you split each line according to some logic, and you don't use the original line anywhere in the iteration, then you can do the whole thing in the beggining.
Now, I'm really confused what you're trying to achieve here, and while I don't understand why your doing it that way, I can say that looking at your logic you are repeating yourself a lot. The action of checking against the filtered list should only happend once, and since it happens regardless of whether the 'if' checks out or not, I see absolutely no reason to use an 'else' clause at all.
Cleaning up like I mentioned, and re-building the logic, the script looks something like this:
# PART I - read and analyze the lines
Wappresults = open('C:\Users\Ziad\Desktop\Combination\WhatsappResult.txt', 'r')
Mikrofull = open('C:\Users\Ziad\Desktop\Combination\MikrofullCombMaj.txt', 'r')
Wapp = map(lambda x: x.strip().split(), Wappresults.readlines())
Mikro = map(lambda x: x.strip().split('\t'), Mikrofull.readlines())
Wappresults.close()
Mikrofull.close()
# PART II - filter using some logic
filtred = []
for w in Wapp:
res = w[:] # So as to copy the list instead of point to it
for m in Mikro:
if m[0].lower() == w[0]:
res.append(m[1])
if len(m) >= 3 :
res.append(m[2])
string = '\t'.join(res)+'\n' # this happens regardles of whether the 'if' statement changed 'res' or not
if string not in filtred:
filtred.append(string)
# PART III - write the filtered results into a file
combination = open('C:\Users\Ziad\Desktop\Combination\combination.txt','w')
for comb in filtred:
combination.write(comb)
combination.close()
I can't promise it will work (because again, like I said, I don't know what you're trying to achive) but this should be a lot easier to work with.

Serial Key Generation and Validation

I'm toying around with writing creating a serial code generator/validator, but I can't seem to get how to do a proper check.
Here's my generator code:
# Serial generator
# Create sequences from which random.choice can choose
Sequence_A = 'ABCDEF'
Sequence_B = 'UVWQYZ'
Sequence_C = 'NOPQRS'
Sequence_D = 'MARTIN'
import random
# Generate a series of random numbers and Letters to later concatenate into a pass code
First = str(random.randint(1,5))
Second = str(random.choice(Sequence_A))
Third = str(random.randint(6,9))
Fourth = str(random.choice(Sequence_B))
Fifth = str(random.randint(0,2))
Sixth = str(random.choice(Sequence_C))
Seventh = str(random.randint(7,8))
Eighth = str(random.choice(Sequence_D))
Ninth = str(random.randint(3,5))
serial = First+Second+Third+Fourth+Fifth+Sixth+Seventh+Eighth+Ninth
print serial
I'd like to make a universal check so that my validation code will accept any key generated by this.
My intuition was to create checks like this:
serial_check = raw_input("Please enter your serial code: ")
# create a control object for while loop
control = True
# Break up user input into list that can be analyzed individually
serial_list = list(serial_check)
while control:
if serial_list[0] == range(1,5):
pass
elif serial_list[0] != range(1,5):
control = False
if serial_list[1] == random.choice('ABCDEF'):
pass
elif serial_list[1] != random.choice('ABCDEF'):
control = False
# and so on until the final, where, if valid, I would print that the key is valid.
if control == False:
print "Invalid Serial Code"
I'm well aware that the second type of check won't work at all, but it's a place holder because I've got no idea how to check that.
But I thought the method for checking numbers would work, but it doesn't either.

The expression `range(1, 5)' creates a list of numbers from 1 to 4. So in your first test, you're asking whether the first character in your serial number is equal to that list:
"1" == [1, 2, 3, 4]
Probably not...
What you probably want to know is whether a digit is in the range (i.e. from 1 to 5, I assume, not 1 to 4).
Your other hurdle is that the first character of the serial is a string, not an integer, so you would want to take the int() of the first character. But that will raise an exception if it's not a digit. So you must first test to make sure it's a digit:
if serial_list[0].isdigit() and int(serial_list[0]) in range(1, 6):
Don't worry, if it's not a digit, Python won't even try to evaluate the part after and. This is called short-circuiting.
However, I would not recommend doing it this way. Instead, simply check to make sure it is at least "1" and no more than "5", like this:
if "1" <= serial_list <= "5":
You can do the same thing with each of your tests, varying only what you're checking.
Also, you don't need to convert the serial number to a list. serial_check is a string and accessing strings by index is perfectly acceptable.
And finally, there's this pattern going on in your code:
if thing == other:
pass
elif thing != other:
(do something)
First, because the conditions you are testing are logical opposites, you don't need elif thing != other -- you can just say else, which means "whatever wasn't matched by any if condition."
if thing == other:
pass
else:
(do something)
But if you're just going to pass when the condition is met, why not just test the opposite condition to begin with? You clearly know how to write it 'cause you were putting it in the elif. Put it right in the if instead!
if thing != other:
(do something)
Yes, each of your if statements can easily be cut in half. In the example I gave you for checking the character range, probably the easiest way to do it is using not:
if not ("1" <= serial_list <= "5"):

Regarding your python, I'm guessing that when your wrote this:
if serial_list[0] == range(1,5):
You probably meant this:
if 1 <= serial_list[0] <= 5:
And when you wrote this:
if serial_list[1] == random.choice('ABCDEF'):
You probably meant this:
if serial_list[1] in 'ABCDEF':
There are various other problems with your code, but I'm sure you'll improve it as you learn python.
At a higher level, you seem to be trying to build something like a software activation code generator/validator. You should know that just generating a string of pseudo-random characters and later checking that each is in range is an extremely weak form of validation. If you want to prevent forgeries, I would suggest learning about HMAC (if you're validating on a secure server) or public key cryptography (if you're validating on a user's computer) and incorporating that into your design. There are libraries available for python that can handle either approach.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing rows dependent on one column Python - python

Related

Python and CSV: find a row and give column value

How to match fields from two lists and further filter based upon the values in subsequent fields?

appending array breaks program

Why re is not compiling 'if' when there is 'else'?

Serial Key Generation and Validation

Categories

Resources