Mapping out a third list based on two list values - python

I have two lists one is a subject list, it can vary from 2 to 4 subjects at max. Second list is reporting list which provides information,whether we need a report for the subject.
Possible values for reporting list are:
All_Subjects which means we require report for all subjects.
No_Subject which means we dont require report for any subject
Lastly the format, SubjectName_(All|NO)_Report which means if for a particular subject we want a report or not.
-subject_list = ["Subject", "Chemistry", "Physics" , "Mathematics" , "Bio"] #sequence always remains same.
reporting_list can be ["All_Subjects", "No_Subjects", "Chemistry_No_Report","Chemistry_All_Report"] #sequence does not matter
Function report_required returns a list whether we want a report or not, and returns a list. If list has all "None" values it means no report required.
For example: I have:
reporting_list = ["Chemistry_No_Report", "Mathematics_All_Report]
subject_list = ["Subject", "Chemistry", "Physics" , "Mathematics"]
My subject_list always starts with a value Subject, which I ignore when returning mapped values
my return value should be ["No", None, "Yes"]
My current function below works, is there a more efficient way of mapping out a third list based on two list values.
def reportRequired( reporting_list , subject_list):
report_list = [None]*4
for value in reporting_list:
# subject_list starts with a header value "Subject", thats why iterating from index 1
if value.startswith("All"):
for idx in range(1, len(subject_list)):
report_list[idx-1] = "Yes"
if value.startswith("No"):
for idx in range(1, len(subject_list)):
report_list[idx-1] = "No"
if value.split("_")[1].lower() == "no":
for idx in range(1, len(subject_list)):
if value.split("_")[0].strip() == subject_list[idx]:
report_list[idx-1] = "No"
if value.split("_")[1].lower() == "all":
for idx in range(1, len(subject_list)):
if value.split("_")[0].strip() == subject_list[idx]:
report_list[idx-1] = "Yes"
return report_list

Build a dictionary that maps a subject name to an index and use that to access an element of the report_list. This way, you avoid quadratic complexity of the 3rd and 4rth case.
For the 1st and 2nd case: prepare a list filled with Yes-es and a list filled with No-s. Then you can use them, regardless how often that case appears in the reporting_list.
Note: You can use ['Yes']*4, as you already do in the initialization of the report_list.
Overall complexity is "almost" linear (assuming O(1) dictionary access...)
Edit: If a subject can appear multiple times in the subject list, this doesn't work.
But you can build a dictionary in which you store the answer for every subject, and in the second phase, walk through the subject list and output the answer for every subject.

Related

python comparing two lists and retaining second list index

I am taking a user input of "components" splitting it into a list and comparing those components to a list of available components generated from column A of a google sheet. Then what I am attempting to do is return the cell value from column G corresponding the Column A index. Then repeat this for all input values.
So far I am getting the first value just fine but I'm obviously missing something to get it to cycle back and to the remaining user input components. I tried some stuff using itertools but wasn't able to get the results I wanted. I have a feeling I will facepalm when I discover the solution to this through here or on my own.
mix = select.split(',') # sets user input to string and sparates elements
ws = s.worksheet("Details") # opens table in google sheet
c_list = ws.col_values(1) # sets column A to a list
modifier = [""] * len(mix) # sets size of list based on user input
list = str(c_list).lower()
for i in range(len(mix)):
if str(mix[i]).lower() in str(c_list).lower():
for j in range(len(c_list)):
if str(mix[i]).lower() == str(c_list[j]).lower():
modifier[i] = ws.cell(j+1,7).value # get value of cell from Column G corresponding to Column A for component name
print(mix)
print(modifier)
You are over complicating the code by writing C like code.
I have changed all the loops you had to a simpler single loop, I have also left comments above each code line to explain what it does.
# Here we use .lower() to lower case all the values in select
# before splitting them and adding them to the list "mix"
mix = select.lower().split(",")
ws = s.worksheet("Details")
# Here we used a list comprehension to create a list of the "column A"
# values but all in lower case
c_list = [cell.lower() for cell in ws.col_values(1)]
modifier = [""] * len(mix)
# Here we loop through every item in mix, but also keep a count of iterations
# we have made, which we will use later to add the "column G" element to the
# corresponding location in the list "modifier"
for i, value in enumerate(mix):
# Here we check if the value exists in the c_list
if value in c_list:
# If we find the value in the c_list, we get the index of the value in c_list
index = c_list.index(value)
# Here we add the value of column G that has an index of "index + 1" to
# the modifier list at the same location of the value in list "mix"
modifier[i] = ws.cell(index + 1, 7).value

Python Code line by line meaning

I have got a code and need to get the line by line meaning of this python code.
marksheet = []
for i in range(0,int(input())):
marksheet.append([raw_input(), float(input())])
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))
I highly recommend you to go through the python tutorial
Just for your understanding of this code, I've added the comments.
#initialising an empty list!
marksheet = []
#iterating through a for loop starting from zero, to some user input(default type string) - that is converted to int
for i in range(0,int(input())):
#appending user input(some string) and another user input(a float value) as a list to marksheet
marksheet.append([raw_input(), float(input())])
#[marks for name, marks in marksheet] - get all marks from list
#set([marks for name, marks in marksheet]) - getting unique marks
#list(set([marks for name, marks in marksheet])) - converting it back to list
#sorting the result in decending order with reverse=True and getting the value as first index which would be the second largest.
second_highest = sorted(list(set([marks for name, marks in marksheet])),reverse=True)[1]
#printing the name and mark of student that has the second largest mark by iterating through the sorted list.
#If the condition matches, the result list is appended to tuple -`[a for a,b in sorted(marksheet) if b == second_highest])`
#now join the list with \n - newline to print name and mark of student with second largest mark
print('\n'.join([a for a,b in sorted(marksheet) if b == second_highest]))
Hope it helps!
Would do this in a comment but I don't have 50 reputation yet:
You don't need to use sorted on second_highest but apparently it is not a good habit to rely on this so you can keep the sorted. Calling sorted on an already sorted list doesn't use a lot of resources anyway.
second_highest = sorted(list(set([marks for name, marks in marksheet])))[1]
Also if the list contains something like [1,3,2,5,3,2,1] it will give 2 as result and not 1 since a set removes all duplicates.
If you want to keep duplicates use:
second_highest = sorted([marks for name, marks in marksheet]))[1]

How to replace strings in a list using tuples of strings and their replacements?

I have a list of player ranks attributed to players. In addition to this, I have a list of tuples of ranks...
rank_database = [("Unprocessed Rank", "Processed Rank"), ("Unprocessed Rank 2", "Processed Rank 2")]
What I would like to do is, for every item in the player ranks list, process them through the rank database --- like find and replace.
So, before replacement:
player_ranks = ["Unprocessed Rank", "Unprocessed Rank 2"]
After replacement:
player_ranks = ["Processed Rank", "Processed Rank 2"]
Essentially, I would like to use rank_database to perform a find and replace operation on the player_ranks list.
Proposed Solution
My idea was to try to use the tuples with the str.replace method as follows...
player_ranks = ["Unprocessed Rank", "Unprocessed Rank 2"]
rank_database = [("Unprocessed Rank", "Processed Rank"), ("Unprocessed Rank 2", "Processed Rank 2")]
for x in player_ranks:
for y in rank_database:
print("Changed "+x+" to")
if x == y[0]:
player_ranks[x].replace(rank_database[y]) #Line 5
print (x)
break
else:
continue
print("Finished!")
When I execute the code, since ("Unprocessed Rank", "Processed Rank") is a tuple found at rank_database[i], I'm hoping this will sort of "inject" the tuple as the replacement strings in the str.replace method.
So, when executing the code, Line 5 should look like...
rank.replace(("Unprocessed Rank", "Processed Rank"))
Would this be a possible solution, or is this not possible, and would other solutions be more appropriate? This is for a personal project, so I would prefer to get my own solution working.
I'm making these assumptions:
The "unprocessed" ranks in your database are unique, because otherwise you'll have to add a way to determine which tuple is the "correct" mapping from that unprocessed rank to a processed one.
Returning a new list of processed ranks is as good as mutating the original list.
Your data will all fit in memory easily, because this will take at least twice the memory your database already uses.
Your database should be stored as a dict, or at least should be converted into one for the kind of work you're doing, since all you're doing is mapping unique(?) keys to values. The dict initializer can take an iterable of key-value pairs, which you already have.
Below, I've created a stand-alone function to do the work.
#!/usr/bin/env python3
def process_ranks(player_ranks, rank_database):
rank_map = dict(rank_database)
return [rank_map[rank] for rank in player_ranks]
def main():
# Sample data.
player_ranks = ['old' + str(n) for n in range(4)]
# Database contains more rank data than we will use.
rank_database = [
('old' + str(n), 'new' + str(n)) for n in range(40)
]
print("Original player ranks:")
print(player_ranks)
processed_ranks = process_ranks(player_ranks, rank_database)
print("Processed player ranks:")
print(processed_ranks)
return
if "__main__" == __name__:
main()
Output:
Original player ranks:
['old0', 'old1', 'old2', 'old3']
Processed player ranks:
['new0', 'new1', 'new2', 'new3']
If you really do need to mutate the original list, you could replace its contents with a slightly different call to process_ranks in main with:
player_ranks[:] = process_ranks(player_ranks, rank_database)
In general, though, I find preserving the original list and creating a new processed_ranks list is easier to code and especially to debug.

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.

Python data structure recommendation?

I currently have a structure that is a dict: each value is a list that contains numeric values. Each of these numeric lists contain what (to borrow a SQL idiom) you could call a primary key containing the first three values which are: a year, a player identifier, and a team identifier. This is the key for the dict.
So you can get a unique row by passing the a value in for the year, player ID, and team ID like so:
statline = stats[(2001, 'SEA', 'suzukic01')]
Which yields something like
[305, 20, 444, 330, 45]
I'd like to alter this data structure to be quickly summed by either of these three keys: so you could easily slice the totals for a given index in the numeric lists by passing in ONE of year, player ID, and team ID, and then the index. I want to be able to do something like
hr_total = stats[year=2001, idx=3]
Where that idx of 3 corresponds to the third column in the numeric list(s) that would be retrieved.
Any ideas?
Read up on Data Warehousing. Any book.
Read up on Star Schema Design. Any book. Seriously.
You have several dimensions: Year, Player, Team.
You have one fact: score
You want to have a structure like this.
You then want to create a set of dimension indexes like this.
years = collections.defaultdict( list )
players = collections.defaultdict( list )
teams = collections.defaultdict( list )
Your fact table can be this a collections.namedtuple. You can use something like this.
class ScoreFact( object ):
def __init__( self, year, player, team, score ):
self.year= year
self.player= player
self.team= team
self.score= score
years[self.year].append( self )
players[self.player].append( self )
teams[self.team].append( self )
Now you can find all items in a given dimension value. It's a simple list attached to a dimension value.
years['2001'] are all scores for the given year.
players['SEA'] are all scores for the given player.
etc. You can simply use sum() to add them up. A multi-dimensional query is something like this.
[ x for x in players['SEA'] if x.year == '2001' ]
Put your data into SQLite, and use its relational engine to do the work. You can create an in-memory database and not even have to touch the disk.
The syntax stats[year=2001, idx=3] is invalid Python and there is no way you can make it work with those square brackets and "keyword arguments"; you'll need to have a function or method call in order to accept keyword arguments.
So, say we make it a function, to be called like wells(stats, year=2001, idx=3). I imagine the idx argument is mandatory (which is very peculiar given the call, but you give no indication of what could possibly mean to omit idx) and exactly one of year, playerid, and teamid must be there.
With your current data structure, wells can already be implemented:
def wells(stats, year=None, playerid=None, teamid=None, idx=None):
if idx is None: raise ValueError('idx must be specified')
specifiers = [(i, x) for x in enumerate((year, playerid, teamid)) if x is not None]
if len(specifiers) != 2:
raise ValueError('Exactly one of year, playerid, teamid, must be given')
ikey, keyv = specifiers[0]
return sum(v[idx] for k, v in stats.iteritems() if k[ikey]==keyv)
of course, this is O(N) in the size of stats -- it must examine every entry in it. Please measure correctness and performance with this simple implementation as a baseline. An alternative solutions (much speedier in use, but requiring much time for preparation) is to put three dicts of lists (one each for year, playerid, teamid) to the side of stats, each entry indicating (or copying, but I think indicating by full key may suffice) all entries of stats that match that that ikey / keyv pair. But it's not clear at this time whether this implementation may not be premature, so please try first with the simple-minded idea!-)
def getSum(d, year, idx):
sum = 0
for key in d.keys():
if key[0] == year:
sum += d[key][idx]
return sum
This should get you started. I have made the assumption in this code, that ONLY year will be asked for, but it should be easy enough for you to manipulate this to check for other parameters as well
Cheers

Categories

Resources