Python Code line by line meaning - python

I have some Python code and need a line-by-line explanation of what it does.
marksheet = []
for i in range(0,int(input())):
    marksheet.append([raw_input(), float(input())])
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))

I highly recommend that you go through the Python tutorial.
To help you understand this code, I've added comments.
# Initialise an empty list.
marksheet = []
# Loop from zero up to a count read from user input (converted to int).
for i in range(0,int(input())):
    # Append a [name, marks] pair to marksheet: one user input kept as a string, another converted to float.
    marksheet.append([raw_input(), float(input())])
# [marks for name, marks in marksheet] - get all marks from the list
# set([marks for name, marks in marksheet]) - keep only the unique marks
# list(set([marks for name, marks in marksheet])) - convert it back to a list
# Sort the result in descending order with reverse=True and take index 1, which is the second largest mark.
# Note: the code in the question omits reverse=True, so there index 1 is the second lowest mark.
second_highest = sorted(list(set([marks for name, marks in marksheet])),reverse=True)[1]
# [a for a,b in sorted(marksheet) if b == second_highest] is a list comprehension:
# it iterates through the sorted marksheet and collects the name of every student whose mark equals second_highest.
# Joining that list with \n (newline) prints each matching name on its own line.
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))
Hope it helps!

Would do this in a comment but I don't have 50 reputation yet:
It can look as if sorted is unnecessary when computing second_highest, but a set has no guaranteed order, so it is not a good habit to rely on that; keep the sorted. Calling sorted on an already sorted list doesn't use a lot of resources anyway.
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
Also, if the list contains something like [1,3,2,5,3,2,1], this gives 2 as the result and not 1, since a set removes all duplicates.
If you want to keep duplicates, use:
second_highest = sorted([marks for name, marks in marksheet])[1]
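To make the difference concrete, here is a minimal sketch (Python 3 syntax, using the example list above rather than real user input; the variable names are only for illustration):

```python
marks = [1, 3, 2, 5, 3, 2, 1]

# With set(): duplicates are removed before sorting.
unique_sorted = sorted(set(marks))
second_unique = unique_sorted[1]

# Without set(): duplicates stay in the sorted list.
all_sorted = sorted(marks)
second_with_duplicates = all_sorted[1]

print(second_unique, second_with_duplicates)
```

Both lists are sorted in ascending order, so index 1 is the second smallest entry in each case.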

Related

Comparing the elements of a list with themselves

I have lists of items:
['MRS_103_005_010_BG_001_v001',
'MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v001',
'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v001',
'MRS_103_005_020_BG_001_v002',
'MRS_103_005_020_BG_001_v003']
I need to identify the latest version of each item and store it in a new list, but I am having trouble with my logic.
Based on how this has been built, I believe I need to first compare the indices to each other. If I find a match, I then check which number is greater.
I figured I first needed to check whether the folder names matched between the current index and the next index. I did this by making two variables, 0 and 1, to represent the indices so I could do a staggered incremental comparison of the list against itself. If the two entries matched, I then needed to check the vXXX number on the end. Whichever one was highest would be appended to the new list.
I suspect that the problem lies in one copy of the list getting to an empty index before the other one does but I'm unsure of how to compensate for that.
Again, I am not a programmer by trade. Any help would be appreciated! Thank you.
# Preparing variables for filtering the folders
versions = foundVerList
verAmountTotal = len(foundVerList)
verIndex = 0
verNextIndex = 1
highestVerCount = 1
filteredVersions = []
# Filtering, this will find the latest version of each folder and store to a list
while verIndex < verAmountTotal:
    try:
        nextVer = (versions[verIndex])
        nextVerCompare = (versions[verNextIndex])
    except IndexError:
        verNextIndex -= 1
    if nextVer[0:24] == nextVerCompare[0:24]:
        if nextVer[-3:] < nextVerCompare[-3:]:
            filteredVersions.append(nextVerCompare)
        else:
            filteredVersions.append(nextVer)
    verIndex += 1
    verNextIndex += 1
My expected output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v003']
The actual output is:
print filteredVersions
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v002',
'MRS_103_005_010_FG_001_v003']
['MRS_103_005_020_BG_001_v002', 'MRS_103_005_020_BG_001_v003']
During the while loop I am using os.listdir on each folder referenced via verIndex. I believe the problem is that a list is being generated for every folder that is searched, but I want all the searches to be combined into a single list, which will THEN go through the groupby and sorted actions.
Seems like a case for itertools.groupby:
from itertools import groupby
grouped = groupby(data, key=lambda version: version.rsplit('_', 1)[0])
result = [sorted(group, reverse=True)[0] for key, group in grouped]
print(result)
Output:
['MRS_103_005_010_BG_001_v002',
'MRS_103_005_010_FG_001_v003',
'MRS_103_005_020_BG_001_v003']
This groups the entries by everything before the last underscore, which I understand to be the "item code".
Then, it sorts each group in reverse order. The elements of each group differ only by the version, so the entry with the highest version number will be first.
Lastly, it extracts the first entry from each group, and puts it back into a result list.
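One caveat: groupby only merges adjacent entries with equal keys, so the input has to be sorted (or at least grouped) by the same key first. The list in the question already is. A self-contained sketch with deliberately shuffled data (the helper name item_code is made up for illustration) could look like this:

```python
from itertools import groupby

# Sample entries from the question, shuffled to show why sorting matters.
data = [
    'MRS_103_005_020_BG_001_v002',
    'MRS_103_005_010_BG_001_v001',
    'MRS_103_005_010_FG_001_v003',
    'MRS_103_005_010_BG_001_v002',
    'MRS_103_005_020_BG_001_v003',
    'MRS_103_005_010_FG_001_v001',
]

def item_code(version):
    """Everything before the last underscore, e.g. 'MRS_103_005_010_BG_001'."""
    return version.rsplit('_', 1)[0]

# Sort first so equal item codes are adjacent, then keep the largest entry per group.
latest = [max(group) for _, group in groupby(sorted(data), key=item_code)]
print(latest)
```

Here max(group) plays the same role as sorted(group, reverse=True)[0]. Both rely on the version suffix being zero-padded to a fixed width, so that string comparison agrees with numeric order.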
Try this:
text = """MRS_103_005_010_BG_001_v001
MRS_103_005_010_BG_001_v002
MRS_103_005_010_FG_001_v001
MRS_103_005_010_FG_001_v002
MRS_103_005_010_FG_001_v003
MRS_103_005_020_BG_001_v001
MRS_103_005_020_BG_001_v002
MRS_103_005_020_BG_001_v003
"""
result = {}
versions = text.splitlines()
for item in versions:
    v = item.split('_')
    num = int(v.pop()[1:])
    name = item[:-3]
    if result.get(name, 0) < num:
        result[name] = num
# Zero-pad the number to three digits so the original 'v002' format is preserved.
filteredVersions = [k + '%03d' % v for k, v in result.items()]
print(filteredVersions)
Output:
['MRS_103_005_010_BG_001_v002', 'MRS_103_005_010_FG_001_v003', 'MRS_103_005_020_BG_001_v003']

Remove duplicates from user input

I want to ignore any duplicate entries given by the user as input. I have the code below:
def pITEMName():
    global ITEMList,fITEMList
    pITEMList = []
    fITEMList = []
    ITEMList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = ITEMList.split("|")
    count = len(items)
    print 'Total Distint ITEM Count : ', count
    pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in ITEMList.split('|')]
    filepath = '/location/data.txt'
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            pITEMList=split_pipe[0]+"|"
            fITEMList.append(pITEMList)
            del pipelst[index]
    for lns in pipelst:
        print bcolors.red + lns,' is wrong ITEM Name' + bcolors.ENDC
    f.close()
When I execute above code it prompts me for user input as :
Enter pipe separated list of items :
And if I provide the input as :
Enter pipe separated list of items : AAA|IFA|AAA
After pressing enter I am getting the result as :
Enter pipe separated list of Items : AAA|IFA|AAA
Total Distint Item Count : 3
AAA is wrong Item Name
Items Belonging to other Centers :
Other Centers :
Item Count From Other Center = 0
Items Belonging to Current Centers :
Active Items in US1:
^IFA$
Active Items in US2 :
^AAA$|^AAA$
Ignored Item Count From Current Center = 0
You Have Entered ItemList belonging to this Center as:
^IFA$|^AAA$|^AAA$
Active Item Count : 3
Do You Want To Continue [YES|Y|NO|N] :
In the above result you can see that I entered AAA twice, so the second one is counted as a wrong item. I want duplicate entries to be ignored instead. I also want the comparison to be case-insensitive: if I enter AAA|aaa|ifa, one 'aaa' should be ignored.
Please help me work out how to implement this.
First, you're doing ITEMList.split("|") several times. You should just use your already calculated items.
Second, you probably want:
items = set(ITEMList.lower().split("|"))
This way you get a set with unique, all lowercase elements.
I assume this doesn't matter since you can discard either uppercase or lowercase.
If item order is not important, then a set will do this very well.
items = set(ITEMList.split("|"))
Lots of great answers here; throwing my hat into the ring as well. One straightforward way to do this:
items = list(set(ITEMList.split("|")))
items.sort()
This preserves your items object as a list and orders it (which is something you may or may not prefer in this case).
If you decide later that you want to return an element of your items list in your code, you will be able to do it by referring to the list index (this functionality doesn't exist with sets).
If you want to preserve the value of the variable count, you could implement the code as:
items = ITEMList.split("|")
count = len(items)
items = list(set(ITEMList.split("|")))
items.sort()
You will also want to adjust this line:
pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in ITEMList.split('|')]
to this:
pipelst = [i.replace('-mc','').replace('-MC','').replace('$','').replace('^','') for i in items]
If order is important, this is one way to do it (on Python 3.7+, where a Counter keeps its keys in first-seen order):
import collections
my_list = "^IFA$|^AAA$|^AAA$"
"|".join(collections.Counter(my_list.upper().split("|")).keys())
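If you would rather not depend on dictionary ordering, here is a minimal sketch that keeps the first occurrence of each item and ignores case. The function name unique_items is made up for illustration, and the code is written in Python 3 syntax:

```python
def unique_items(raw):
    """Split a pipe-separated string and drop case-insensitive duplicates,
    keeping the first occurrence of each item in its original position."""
    seen = set()
    result = []
    for item in raw.strip().split("|"):
        key = item.upper()          # normalise case before comparing
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

print(unique_items("AAA|aaa|ifa"))
```

The set gives a fast membership test, while the list records the order in which items were first seen.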

Error in sorting operation on dictionary

I am trying to sort a file of sequences according to a certain parameter. The data looks as follows:
ID1 ID2 32
MVKVYAPASSANMSVGFDVLGAAVTP ...
ID1 ID2 18
MKLYNLKDHNEQVSFAQAVTQGLGKN ...
....
There are about 3000 sequences like this, i.e. the first line contains two ID fields and one rank field (the sorting key), while the second one contains the sequence. My approach is to open the file, convert the file object to a list, separate the annotation lines (ID1, ID2, rank) from the actual sequences (annotation lines always occur at even indices, sequence lines at odd indices), merge them into a dictionary, and sort the dictionary using the rank field. The code reads like so:
#!/usr/bin/python
with open("unsorted.out","rb") as f:
    f = f.readlines()
    assert type(f) == list, "ERROR: file object not converted to list"
annot=[]
seq=[]
for i in range(len(f)):
    # IDs
    if i%2 == 0:
        annot.append(f[i])
    # Sequences
    elif i%2 != 0:
        seq.append(f[i])
# Make dictionary
ids_seqs = {}
ids_seqs = dict(zip(annot,seq))
# Solub rankings are the third field of the annot list, i.e. annot[i].split()[2]
# Use this index notation to rank sequences according to solubility measurements
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: val[0].split()[2], reverse=False)
# Save to file
with open("sorted.out","wb") as out:
    out.write("".join("%s %s" % i for i in sorted_niwa))
The problem I have encountered is that when I open the sorted file to inspect manually, as I scroll down I notice that some sequences have been wrongly sorted. For example, I see the rank 9 placed after rank 89. Up until a certain point the sorting is correct, but I don't understand why it hasn't worked throughout.
Many thanks for any help!
Sounds like you're comparing strings instead of numbers. "9" > "89" because the character '9' comes lexicographically after the character '8'. Try converting to integers in your key.
sorted_niwa = sorted(ids_seqs.items(), key = lambda val: int(val[0].split()[2]), reverse=False)
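A tiny illustration of the difference, using a few made-up rank values in place of the real file:

```python
ranks = ["9", "89", "18", "32"]

# Compared as strings, character by character: "9" sorts after "89".
as_text = sorted(ranks)

# Compared as integers via the key function: numeric order.
as_numbers = sorted(ranks, key=int)

print(as_text)
print(as_numbers)
```

The key function only affects how items are compared; the sorted list still contains the original strings.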

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify, my current code is below. It is used as a method within a class.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2', you would use == rather than in, which correctly returns False.
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings; one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error, as lists are mutable and therefore not hashable.
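Since the end goal is to obtain the value that goes with each matched name, a dictionary lookup may be simpler than a set intersection. The following is only a sketch in Python 3 syntax: the numeric values are invented, and it assumes specie_name_and_initial_values is a list of [bracketed name, value] pairs as described in the question.

```python
# Hypothetical sample data in the shape described in the question.
specie_name_and_initial_values = [
    ['[Cyp26_G_R1]', 2.0],
    ['[Cyp26_G_R1R2]', 5.0],
    ['[Cyp26_G_rep1]', 0.5],
]
all_keys = ['Cyp26_G_R1', 'Cyp26_G_rep1', 'Time']

# Map each species name (brackets stripped) to its initial value.
initial_values = {name.strip('[]'): value
                  for name, value in specie_name_and_initial_values}

# Exact dictionary lookup: 'Cyp26_G_R1' does not match 'Cyp26_G_R1R2'.
matched = {key: initial_values[key] for key in all_keys if key in initial_values}
print(matched)
```

Because dictionary keys are compared for equality, a header only matches a species with exactly the same name, which avoids the 'xyz' versus 'xyz2' problem.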

Mapping out a third list based on two list values

I have two lists. One is a subject list, which can contain from 2 to 4 subjects. The second is a reporting list, which says whether we need a report for each subject.
Possible values for the reporting list are:
All_Subjects, which means we require a report for all subjects.
No_Subjects, which means we don't require a report for any subject.
Lastly, the format SubjectName_(All|No)_Report, which says whether or not we want a report for a particular subject.
subject_list = ["Subject", "Chemistry", "Physics" , "Mathematics" , "Bio"] #sequence always remains same.
reporting_list can be ["All_Subjects", "No_Subjects", "Chemistry_No_Report","Chemistry_All_Report"] #sequence does not matter
The function reportRequired returns a list saying whether we want a report for each subject. If the list has all None values, no report is required.
For example, I have:
reporting_list = ["Chemistry_No_Report", "Mathematics_All_Report"]
subject_list = ["Subject", "Chemistry", "Physics" , "Mathematics"]
My subject_list always starts with the value Subject, which I ignore when returning mapped values.
My return value should be ["No", None, "Yes"].
My current function below works. Is there a more efficient way of mapping out a third list based on the values of two lists?
def reportRequired( reporting_list , subject_list):
    report_list = [None]*4
    for value in reporting_list:
        # subject_list starts with a header value "Subject", thats why iterating from index 1
        if value.startswith("All"):
            for idx in range(1, len(subject_list)):
                report_list[idx-1] = "Yes"
        if value.startswith("No"):
            for idx in range(1, len(subject_list)):
                report_list[idx-1] = "No"
        if value.split("_")[1].lower() == "no":
            for idx in range(1, len(subject_list)):
                if value.split("_")[0].strip() == subject_list[idx]:
                    report_list[idx-1] = "No"
        if value.split("_")[1].lower() == "all":
            for idx in range(1, len(subject_list)):
                if value.split("_")[0].strip() == subject_list[idx]:
                    report_list[idx-1] = "Yes"
    return report_list
Build a dictionary that maps a subject name to an index and use that to access an element of the report_list. This way, you avoid the quadratic complexity of the 3rd and 4th cases.
For the 1st and 2nd case: prepare a list filled with Yes-es and a list filled with No-s. Then you can use them, regardless how often that case appears in the reporting_list.
Note: You can use ['Yes']*4, as you already do in the initialization of the report_list.
Overall complexity is "almost" linear (assuming O(1) dictionary access...)
Edit: If a subject can appear multiple times in the subject list, this doesn't work.
But you can build a dictionary in which you store the answer for every subject, and in the second phase, walk through the subject list and output the answer for every subject.
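The two-phase idea can be sketched as follows. This is one possible reading of it, not a drop-in replacement: the name report_required is made up, and unlike the original function it returns one entry per subject instead of a fixed list of four.

```python
def report_required(reporting_list, subject_list):
    """Return 'Yes', 'No' or None per subject, skipping the leading 'Subject' header."""
    subjects = subject_list[1:]
    # Phase 1: store the answer for every subject in a dictionary.
    answers = dict.fromkeys(subjects)            # every subject starts as None
    for value in reporting_list:
        if value == "All_Subjects":
            answers = dict.fromkeys(subjects, "Yes")
        elif value == "No_Subjects":
            answers = dict.fromkeys(subjects, "No")
        else:
            name, flag = value.split("_")[:2]    # e.g. "Chemistry", "No"
            if name in answers:
                answers[name] = "Yes" if flag.lower() == "all" else "No"
    # Phase 2: walk the subject list and output the stored answer.
    return [answers[s] for s in subjects]

print(report_required(["Chemistry_No_Report", "Mathematics_All_Report"],
                      ["Subject", "Chemistry", "Physics", "Mathematics"]))
```

Each reporting entry costs one dictionary access instead of a scan over the subject list, and a subject that appears more than once in subject_list simply receives the same stored answer each time.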
