Python, complex looping calculations with lists or arrays

I am converting old pseudo-Fortran code into python and am struggling to create a framework within which I can perform some complex iterative calculations.
As a beginner, my first instinct is to use lists as I find them easier to work with, but I understand that arrays would probably be a more suitable approach.
I already have all the input channels as lists and am hoping for a good explanation of how to set up loops for such calculations.
This is an example of the pseudo-Fortran I am replicating. Each (t) indicates a 'time-series channel' that I currently have stored as a list (i.e. ECART2(t) and NNNN(t) are lists). All lists have the same number of entries.
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. ) ;
    mmm(t)=nnnn(t)+1.;
    if YRPVBPO(t).ge.0.1 .and. YRPVBPO(t).le.0.999930338 .and. YAEVBPO(t).ge.0.000015 .and. YAEVBPO(t).le.0.000615 then do;
        YM5(t) = customFunction(YRPVBPO,YAEVBPO);*
    end;
    YUEVBO(t) = YU0VBO(t) * YM5(t) ;*m/s
    YHEVBO(t) = YCPEVBO(t)*TPO_TGETO1(t)+0.5*YUEVBO(t)*YUEVBO(t);*J/kg
    YAVBO(t) = ddnn2(t)*(YUEVBO(t)**2);*
    YDVBO(t) = YCPEVBO(t)**2 + 4*YHEVBO(t)*YAVBO(t) ;*
    YTSVBPO(t) = (sqrt(YDVBO(t))-YCPEVBO(t))/2./YAVBO(t);*K
    YUSVBO(t) = ddnn(t)*YUEVBO(t)*YTSVBPO(t);*m/s
    YM7(t) = YUSVBO(t)/YU0VBO(t);*
    YPHSVBPOtot(t) = (YPHEVBPO(t) - YPDHVBPO(t))/(1.+((YGAMAEVBO(t)-1)/2)*(YM7(t)**2))**(YGAMAEVBO(t)/(1-YGAMAEVBO(t)));*bar
    YPHEVBPOtot(t) = YPHEVBPO(t) / (1.+rss0(t)*YM5(t)*YM5(t))**rss1(t);*bar
    YDPVBPOtot(t) = YPHEVBPOtot(t) - YPHSVBPOtot(t) ;*bar
    iter(t) = (YPHEVBPOtot(t) - YDPVBPOtot(t))/YPHEVBPOtot(t);*
    ecart2(t)= ABS(iter(t)-YRPVBPO(t));*
    aa(t)=YRPVBPO(t)+0.0001;
    YRPVBPO(t)=aa(t);*
    nnnn(t)=mmm(t);*
end;
Understanding the pseudo-Fortran: with 'time-series data' there is an implicit loop iterating through the individual values in each list, as well as looping over each of those values until the conditions are met.
It carries out the loop calculations on the first list values until the conditions are met. It then moves on to the second value in the lists and performs the same looping calculations until the conditions are met...
ECART2 = [2,0,3,5,3,4]
NNNN = [6,7,5,8,6,7]
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. )
MMM = NNNN + 1
This looks at the first values in each list (2 and 6). Because the conditions are met, subsequent calculations are performed on the first values in the new lists, such as MMM = [6+1, ...]
Once the rest of the calculations have been performed (looping multiple times if the conditions are not met) only then does the second value in every list get considered. The second values (0 and 7) do not meet the conditions and therefore the second entry for MMM is 0.
MMM=[6+1, 0...]
Because 0 must be entered if conditions are not met, I am considering setting up all the 'new lists' in advance and populating them with 0s.
NB: 'customFunction()' is a separate function that returns a value computed from two input values.

MY CURRENT SOLUTION
First, set up all the empty lists:
nPts = len(ECART2)
MMM = [0]*nPts
YM5 = [0]*nPts
etc...
Then start performing the calculations:
for i in range(nPts):
    while (ECART2[i] > 0.0002) and (NNNN[i] < 2000):
        MMM[i] = NNNN[i]+1
        if YRPVBPO[i]>=0.1 and YRPVBPO[i]<=0.999930338 and YAEVBPO[i]>=0.000015 and YAEVBPO[i]<=0.000615:
            YM5[i] = MACH_LBP_DIA30(YRPVBPO[i],YAEVBPO[i])
        YUEVBO[i] = YU0VBO[i]*YM5[i]
        YHEVBO[i] = YCPEVBO[i]*TGETO1[i] + 0.5*YUEVBO[i]**2
        YAVBO[i] = DDNN2[i]*YUEVBO[i]**2
        YDVBO[i] = YCPEVBO[i]**2 + 4*YHEVBO[i]*YAVBO[i]
etc etc...
But I'm guessing that there are better ways of doing this, such as the suggestion to use NumPy arrays (something I plan on learning in the near future).
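As a starting point before moving to NumPy, the control flow can be written with plain lists: an outer for over positions and an inner while that repeats until the conditions fail. A minimal, self-contained sketch follows; the input lists and the convergence update are invented here purely so the loop terminates, and the real body would be the line-by-line translation of the Fortran above:

```python
# Hypothetical inputs standing in for the real channel lists.
ECART2 = [2.0, 0.0, 3.0]
NNNN = [6.0, 7.0, 5.0]
YRPVBPO = [0.5, 0.2, 0.8]

nPts = len(ECART2)      # number of entries (len, not range)
MMM = [0.0] * nPts      # output lists pre-populated with zeros
ITER = [0.0] * nPts

for i in range(nPts):   # outer loop: one position at a time
    # inner loop: iterate this position until the conditions fail
    while ECART2[i] > 0.0002 and NNNN[i] < 2000:
        MMM[i] = NNNN[i] + 1
        # ... the translated body of the Fortran loop goes here ...
        # stand-in convergence update so this sketch terminates:
        ITER[i] = YRPVBPO[i] + 0.0001
        ECART2[i] = abs(ITER[i] - YRPVBPO[i])
        YRPVBPO[i] = ITER[i]
        NNNN[i] = MMM[i]
```

Note that positions whose conditions fail on entry (the second one here) are simply skipped, leaving the pre-populated 0 in place, which matches the pseudo-Fortran behaviour.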

Related

How to implement a for loop to iterate over this dataset

I'm implementing the following code to get match history data from an API:
my_matches = watcher.match.matchlist_by_puuid(
    region=my_region,
    puuid=me["puuid"],
    count=100,
    start=1)
The maximum number of items I can display per page is 100 (count). I would like my_matches to contain the first 1000 matches, which means looping over ten pages.
Is there any way to effectively do this?
Based on the documentation (see page 17), this function returns a list of strings and can return at most 100 matches per call. It also accepts a start parameter for where to begin returning matches (which defaults to 0). A possible solution for your problem would look like this:
allMatches = []  # will become a list containing 10 lists of matches
for match_page in range(10):  # remember that indices start at 0!
    countNum = match_page * 100  # first will be 0, second 100, third 200, etc.
    my_matches = watcher.match.matchlist_by_puuid(
        region=my_region,
        puuid=me["puuid"],
        count=100,
        start=countNum)
    # ^ Notice how we use countNum as the start for returning
    allMatches.append(my_matches)
If you want to remain concise, and you want your matches to be a single 1000-long list of results, you can concatenate all the outputs of size 100 directly:
import itertools

matches = list(itertools.chain.from_iterable(
    watcher.match.matchlist_by_puuid(
        region=my_region,
        puuid=me["puuid"],
        count=100,
        start=i*100) for i in range(10)))
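The same pagination pattern, with the API call replaced by a stub (fetch_matches is invented here purely so the flattening logic can be run and checked without the real service):

```python
import itertools

# Stub standing in for watcher.match.matchlist_by_puuid; it just
# fabricates match ids so the pagination logic can be exercised.
def fetch_matches(start, count=100):
    return ["match_{}".format(start + j) for j in range(count)]

# Ten pages of 100 results each, flattened into one 1000-item list.
matches = list(itertools.chain.from_iterable(
    fetch_matches(start=i * 100) for i in range(10)))
```

The chain.from_iterable call flattens the ten page lists lazily, so no intermediate list-of-lists is built.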

Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings, player_file_list, and an int value, total_players. For the sake of example I have reduced the list of strings and set the int value to a very small number.
Each string in the player_file_list either has a year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me and I am wondering if I can speed up my function get_player_path in any way?
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
player_file_list = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
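Putting the fixed-length idea together into a complete, runnable function, assuming a 10-character date and a 14-character id as in the example list (note this version omits the trailing slash that the original output included):

```python
DATE_LENGTH = 10  # e.g. '2020-10-27'
ID_LENGTH = 14    # e.g. '31001804320549'
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # +1 for the slash between date and id

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        # a fixed-length slice replaces split/find entirely
        player_files_to_process.add(player_file[:LENGTH])
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)

files = [
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
]
```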
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
I'll leave this solution here, which can be further improved; I hope it helps.
player_file_list = (
    "2020-10-27/31001804320549/31001804320549.json",
    "2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
    "2020-10-28/31001804320548/31001804320549.json",
    "2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
    "2020-10-29/31001804320547/31001804320549.json",
    "2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
    "2020-10-30/31001804320546/31001804320549.json",
    "2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
    "2020-10-31/31001804320545/31001804320549.json",
    "2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)

def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return

print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort: since your list is already sorted and dicts preserve insertion order, the keys come out in order. If you still need to sort, you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]

Create arrays of fixed size within a while loop in python

I am trying to create arrays of fixed size within a while loop. Since I do not know how many arrays I have to create, I am using a loop to initiate them. The problem I am facing is with the array declaration: I would like the name of each array to end with the index of the while loop, so it will be useful later for my calculations. I do not expect to find an easy way out, but it would be great if someone could point me in the right direction.
I tried using arrayname + str(i). This returns the error 'Can't assign to operator'.
# parse through the Load vector sheet to load the values of the stress vector into the dataframe
Loadvector = x2.parse('Load_vector')
Lvec_rows = len(Loadvector.index)
Lvec_cols = len(Loadvector.columns)
i = 0
while i < Lvec_cols:
    y_values + str(i) = np.zeros(Lvec_rows)  # <-- raises "Can't assign to operator"
    i = i + 1
I expect arrays with names arrayname1, arrayname2 ... to be created.
I think the title is somewhat misleading.
An easy way to do this would be using a dictionary:
dict_of_array = {}
i = 0
while i < Lvec_cols:
    dict_of_array['y_values' + str(i)] = np.zeros(Lvec_rows)
    i = i + 1
and you can access each array by its generated name, e.g. dict_of_array['y_values1'].
If you want to create a batch of arrays, try:
i = 0
while i < Lvec_cols:
    exec('y_values{} = np.zeros(Lvec_rows)'.format(i))
    i = i + 1
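The dictionary approach as a self-contained sketch (the dimensions are made up, and the 'y_values' prefix is written as a literal string, since y_values in the question looks like it was meant as a name rather than an existing variable):

```python
import numpy as np

Lvec_rows, Lvec_cols = 4, 3  # hypothetical sheet dimensions

# one zero-array per column, keyed by a generated name
arrays = {"y_values" + str(i): np.zeros(Lvec_rows) for i in range(Lvec_cols)}

# each array is then reachable by its generated name:
arrays["y_values0"][2] = 1.5
```

Unlike exec, this keeps all the arrays in one namespace you control, which is generally considered the safer and more maintainable choice.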

How to write if statement from SAS to python

I am a SAS user trying to translate SAS code into Python.
I have created the SAS code below and have some issues applying it in Python. Suppose I have a data table which contains the fields aging1 to aging60, and I want to create two new fields named 'life_def' and 'obs_time'. These two fields start at 0 and are changed based on conditions on the other fields, aging1 to aging60.
data want;
    set have;
    array aging_array(*) aging1--aging60;
    life_def=0;
    obs_time=0;
    do i=1 to 60;
        if life_def=0 and aging_array[i] ne . then do;
            if aging_array[i]>=4 then do;
                obs_time=i;
                life_def=1;
            end;
            if aging_array[i]<4 then do;
                obs_time=i;
            end;
        end;
    end;
    drop i;
run;
I have tried to re-create the SAS code above in Python, but it doesn't work the way I thought. Below is the code I am currently working on.
df['life_def'] = 0
df['obs_time'] = 0
for i in range(1, lag+1):
    if df['life_def'].all()==0 and pd.notnull(df[df.columns[i+4]].all()):
        condition = df[df.columns[i+4]]>=4
        df['life_def'] = np.where(condition, 1, df['life_def'])
        df['obs_time'] = np.where(condition, i, df['obs_time'])
Suppose df[df.columns[i+4]] corresponds to my aging columns in SAS. With the code above, the loop keeps going as i increases. However, the SAS logic stops updating at the first i where aging>=4.
For example, if aging7>=4 for the first time, life_def will be 1 and obs_time will be 7, and the loop moves on to 8.
Thank you!
Your objective is to get, for each row, the index x of the first agingx column whose value is >= 4. The snippet below does the same thing.
Note - I am using python 2.7
mydf['obs_time'] = 0
agingcols_len = len([k for k in mydf.columns.tolist() if 'aging' in k])
rowcnt = mydf['aging1'].fillna(0).count()
for k in xrange(rowcnt):
    isFirst = True
    for i in xrange(1, agingcols_len):
        if isFirst and mydf['aging' + str(i)][k] >= 4:
            mydf['obs_time'][k] = i
            isFirst = False
        elif isFirst and mydf['aging' + str(i)][k] < 4:
            pass
The snippet iterates over all the agingx columns (e.g. aging1, aging2) and keeps updating obs_time until a value greater than or equal to 4 is found. The whole thing iterates over the DataFrame rows with k.
FYI - However, this is super slow when you have million rows to loop through.
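For those large inputs, the same logic can be vectorised with pandas/NumPy, removing the row loop entirely. A sketch on an invented 3-column example (the real data would use aging1..aging60): life_def flags rows with any value >= 4, and obs_time is the position of the first such value, or of the last non-missing value when there is none, matching the SAS behaviour.

```python
import numpy as np
import pandas as pd

# Invented example with 3 aging columns instead of 60.
df = pd.DataFrame({
    "aging1": [1.0, 5.0, 2.0],
    "aging2": [4.0, 1.0, 3.0],
    "aging3": [2.0, 1.0, np.nan],
})

aging = df.filter(like="aging")
ge4 = aging.ge(4)  # NaN compares as False

df["life_def"] = ge4.any(axis=1).astype(int)

# 1-based position of the first value >= 4 in each row
first_hit = ge4.to_numpy().argmax(axis=1) + 1
# 1-based position of the last non-missing value in each row
col_idx = np.arange(1, aging.shape[1] + 1)
last_valid = (aging.notna().to_numpy() * col_idx).max(axis=1)

df["obs_time"] = np.where(df["life_def"] == 1, first_hit, last_valid)
```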

Python - masking in a for loop?

I have three arrays, r_vals, Tgas_vals, and n_vals. They are all numpy arrays of shape (9998,). The arrays have repeated values, and I want to iterate over the unique values of r_vals and find the corresponding values of Tgas_vals and n_vals, so I can use the last two arrays to calculate a weighted average. This is what I have right now:
def calc_weighted_average(r_vals, Tgas_vals, n_vals):
    for r in r_vals:
        mask = r == r_vals
        count = 0
        count += 1
        for t in Tgas_vals[mask]:
            print(count, np.average(Tgas_vals[mask]*n_vals[mask]))
weighted_average = calc_weighted_average (r_vals,Tgas_vals,n_vals)
The problem I am running into is that the function is only looping through once. Did I implement mask incorrectly, or is the problem somewhere else in the for loop?
I'm not sure exactly what you plan to do with all the averages, so I'll toss this out there and see if it's helpful. The following code calculates one weighted average per unique value of r_vals and stores them in a dictionary (which is then printed out).
def calc_weighted_average(r_vals, z_vals, Tgas_vals, n_vals):
    weighted_vals = {}  # new variable to store rval => weighted average
    for r in np.unique(r_vals):
        mask = r_vals == r  # boolean mask selecting entries equal to r
        weighted_vals[r] = np.average(Tgas_vals[mask]*n_vals[mask])
    return weighted_vals

weighted_averages = calc_weighted_average(r_vals, z_vals, Tgas_vals, n_vals)
for rval in weighted_averages:
    print('%i : %0.4f' % (rval, weighted_averages[rval]))  # assuming rval is an integer
Alternatively, you may want to factor z_vals in somehow; your question was not clear on this.
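If the goal is just one average per unique r value, the Python-level loop can also be removed entirely with np.unique and np.bincount. A small self-contained sketch (the arrays are invented; np.average(Tgas*n) over each mask is reproduced as the per-group mean of the product):

```python
import numpy as np

# Invented sample data.
r_vals = np.array([1, 1, 2, 2, 2])
Tgas_vals = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
n_vals = np.array([1.0, 3.0, 1.0, 1.0, 2.0])

uniq, inv = np.unique(r_vals, return_inverse=True)
# per-group sum of Tgas*n, divided by group sizes -> per-group mean
sums = np.bincount(inv, weights=Tgas_vals * n_vals)
counts = np.bincount(inv)
weighted_vals = dict(zip(uniq, sums / counts))
```

return_inverse gives, for every element, the index of its unique value, which bincount then uses as the group label, so the whole grouping runs in C rather than in a Python loop.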
