How to implement a for loop to iterate over this dataset

How to implement a for loop to iterate over this dataset - python

I'm implementing the following code to get match history data from an API:
my_matches = watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=1)
The max items I can display per page is 100 (count), I would like my_matches to equal the first 1000 matches, thus looping start from 1 - 10.
Is there any way to effectively do this?

Based on the documentation (see page 17), this function returns a list of strings. The function can only return a 100 count max. Also, it accepts a start for where to start returning these matches (which defaults at 0). A possible solution for your problem would look like this:
allMatches = [] # will become a list containing 10 lists of matches
for match_page in range(9): # remember arrays start at 0!
countNum = match_page * 100 # first will be 0, second 100, third 200 etc...
my_matches = watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=countNum)
# ^ Notice how we use countNum as the start for returning
allMatches.append(my_matches)

If you want to remain concise, and you want your matchesto be a 1000 long list of results, you can concatenate direclty all the outputs of size 100 as:
import itertools
matches = list(itertools.chain.from_iterable(watcher.match.matchlist_by_puuid(
region=my_region,
puuid=me["puuid"],
count=100,
start=i*100) for i in range(10)))

Related

Parse list of strings for speed

Background
I have a function called get_player_path that takes in a list of strings player_file_list and a int value total_players. For the sake of example i have reduced the list of strings and also set the int value to a very small number.
Each string in the player_file_list either has a year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What i am essentially trying to achieve here is go through this list and store all unique year-date/player_id path in a set until it's length reaches the value of total_players
My current approach does not seem the most efficient to me and i am wondering if i can speed up my function get_player_path in anyway??
Code
def get_player_path(player_file_list, total_players):
player_files_to_process = set()
for player_file in player_file_list:
player_file = player_file.split("/")
file_path = f"{player_file[0]}/{player_file[1]}/"
player_files_to_process.add(file_path)
if len(player_files_to_process) == total_players:
break
return sorted(player_files_to_process)
player_file_list = [
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']

Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed perfomances with huge lists, you would probably be using some other language. Nonetheless, if you really care about performances (or just to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
player_files_to_process = set()
for player_file in player_file_list:
end = player_file.find('/', 12) # <--- len(date) + len('/') + 1
file_path = player_file[:end] # <---
player_files_to_process.add(file_path)
if len(player_files_to_process) == total_players:
break
return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1 # 1 is for the slash between date and id
...
for player_file in player_file_list:
file_path = player_file[:LENGTH]
...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1

I'll leave this solution here which can be further improved, hope it helps.
player_file_list = (
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)
def get_player_path(l, n):
pfl = set()
for i in l:
i = "/".join(i.split("/")[0:2])
if i not in pfl:
pfl.add(i)
if len(pfl) == n:
return pfl
if n > len(pfl):
print("not enough matches")
return
print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Python Demo

Use dict so that you don't have to sort it since your list is already sorted. If you still need to sort you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
dct = {re.search('^\w+-\w+-\w+/\w+',pf).group(): 1 for pf in player_file_list}
return [k for i,k in enumerate(dct.keys()) if i < total_players]

Format a python list and search for patterns

I am getting rows from a spreadsheet with mixtures of numbers, text and dates
I want to find elements within the list, some numbers and some text
for example
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg) #option?
if any("-1" in s for s in sg):
#do something if matched
I don't feel this is the correct way to do this, I am also trying to match stuff like -1.5 and -1.5C and other unexpected characters like OPEN15 compared to 15
I have also looked at
sg.index("-1")
If positive then its a match (Only good for direct matches)
Some help would be appreciated

If you want to call a function for each case, I would do it this way:
def stub1(elem):
#do something for match of type '-1'
return
def stub2(elem):
#do something for match of type 'SD4'
return
def stub3(elem):
#do something for match of type 'OPEN15'
return
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1":stub1, u"SD4": stub2, u"OPEN15": stub3} # add more if you want
for elem in sg:
for k, stub in patterns.iteritems():
if k in elem:
stub(elem)
break
Where stub1, stub2, ... are the fonctions that contains the code for each case.
It will be called (max 1 time per strings) if the string contains a matching substring.

What do you mean by "I don't feel this is the correct way to do this" ? Are you not getting the result you expect ? Is it too slow ?
Maybe, you can organize your data by columns instead of rows and have a more specific filters. If you are looking for speed, I'd suggest using the numpy module which has a very intersting function called select()
Scipy select example
By transforming all your rows in a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful ! Basically it's used like this:
import numpy as np
a = array(...)
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default = 0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Bot conds and actions are lists, and must have the same number of arguments. The first element in conds has its action set as the first element of actions.
It could be used to determine the index in a vector for a particular value (eventhough this should be done using the nonzero() numpy function).
a = array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)

Python, complex looping calculations with lists or arrays

I am converting old pseudo-Fortran code into python and am struggling to create a framework within which I can perform some complex iterative calculations.
As a beginner, my first instinct is to use lists as I find them easier to work with, but i understand that arrays would probably be a more suitable method.
I already have all the input channels as lists and am hoping for a good explanation of how to set up loops for such calculations.
This is an example of the pseudo-Fortran i am replicating. Each (t) indicates a 'time-series channel' that I currently have stored as lists (ie. ECART2(t) and NNNN(t) are lists) All lists have the same number of entries.
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. ) ;
mmm(t)=nnnn(t)+1.;
if YRPVBPO(t).ge.0.1 .and. YRPVBPO(t).le.0.999930338 .and. YAEVBPO(t).ge.0.000015 .and. YAEVBPO(t).le.0.000615 then do;
YM5(t) = customFunction(YRPVBPO,YAEVBPO);*
end;
YUEVBO(t) = YU0VBO(t) * YM5(t) ;*m/s
YHEVBO(t) = YCPEVBO(t)*TPO_TGETO1(t)+0.5*YUEVBO(t)*YUEVBO(t);*J/kg
YAVBO(t) = ddnn2(t)*(YUEVBO(t)**2);*
YDVBO(t) = YCPEVBO(t)**2 + 4*YHEVBO(t)*YAVBO(t) ;*
YTSVBPO(t) = (sqrt(YDVBO(t))-YCPEVBO(t))/2./YAVBO(t);*K
YUSVBO(t) = ddnn(t)*YUEVBO(t)*YTSVBPO(t);*m/s
YM7(t) = YUSVBO(t)/YU0VBO(t);*
YPHSVBPOtot(t) = (YPHEVBPO(t) - YPDHVBPO(t))/(1.+((YGAMAEVBO(t)-1)/2)*(YM7(t)**2))**(YGAMAEVBO(t)/(1-YGAMAEVBO(t)));*bar
YPHEVBPOtot(t) = YPHEVBPO(t) / (1.+rss0(t)*YM5(t)*YM5(t))**rss1(t);*bar
YDPVBPOtot(t) = YPHEVBPOtot(t) - YPHSVBPOtot(t) ;*bar
iter(t) = (YPHEVBPOtot(t) - YDPVBPOtot(t))/YPHEVBPOtot(t);*
ecart2(t)= ABS(iter(t)-YRPVBPO(t));*
aa(t)=YRPVBPO(t)+0.0001;
YRPVBPO(t)=aa(t);*
nnnn(t)=mmm(t);*
end;
Understanding the pseudo-fortran: With 'time-series data' there is an impicit loop iterating through the individual values in each list - as well as looping over each of those values until the conditions are met.
It will carry out the loop calculations on the first list values until the conditions are met. It then moves onto the second value in the lists and perform the same looping calculations until the conditions are met...
ECART2 = [2,0,3,5,3,4]
NNNN = [6,7,5,8,6,7]
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. )
MMM = NNNN + 1
this looks at the first values in each list (2 and 6). Because the conditions are met, subsequent calculations are performed on the first values in the new lists such as MMM = [6+1,...]
Once the rest of the calculations have been performed (looping multiple times if the conditions are not met) only then does the second value in every list get considered. The second values (0 and 7) do not meet the conditions and therefore the second entry for MMM is 0.
MMM=[6+1, 0...]
Because 0 must be entered if conditons are not met, I am considering setting up all the 'New lists' in advance and populating them with 0s.
NB: 'customFunction()' is a separate function that is called, returning a value from two input values

MY CURRENT SOLUTION
set up all the empty lists
nPts = range(ECART2)
MMM = [0]*nPts
YM5 = [0]*nPts
etc...
then start performing calculations
for i in ECART2:
while (ECART2[i] > 0.0002) and (NNNN[i] < 2000):
MMM[i] = NNNN[i]+1
if YRPVBPO[i]>=0.1 and YRPVBPO[i]<=0.999930338 and YAEVBPO[i]>=0.000015 and YAEVBPO[i]<=0.000615:
YM5[i] = MACH_LBP_DIA30(YRPVBPO[i],YAEVBPO[i])
YUEVBO[i] = YU0VBO[i]*YM5[i]
YHEVBO[i] = YCPEVBO[i]*TGETO1[i] + 0.5*YUEVBO[i]^2
YAVBO[i] = DDNN2[i]*YUEVBO[i]^2
YDVBO[i] = YCPEVBO[i]^2 + 4*YHEVBO[i]*YAVBO[i]
etc etc...
but i'm guessing that there are better ways of doing this - such as the suggestion to use numpy arrays (something i plan on learning in the near future)

Importing big tecplot block files in python as fast as possible

I want to import in python some ascii file ( from tecplot, software for cfd post processing).
Rules for those files are (at least, for those that I need to import):
The file is divided in several section
Each section has two lines as header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variable given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new at python and I have written a code to import the data to a dictionary, writing the variables as 3D numpy.array . Those files could be very big, (up to Gb). How can I make this code faster? (or more generally, how can I import such files as fast as possible)?
import re
from numpy import zeros, array, prod
def vectorr(I, J, K):
"""function"""
vect = []
for k in range(0, K):
for j in range(0, J):
for i in range(0, I):
vect.append([i, j, k])
return vect
a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
strVARIABLES = re.findall('VARIABLES', filelist[count])
variables = re.findall(r'"(.*?)"', filelist[count])
countzone = countzone+1
data[countzone] = {key:[] for key in variables}
count = count+1
strI = re.findall('I=....', filelist[count])
strI = re.findall('\d+', strI[0])
I = int(strI[0])
##
strJ = re.findall('J=....', filelist[count])
strJ = re.findall('\d+', strJ[0])
J = int(strJ[0])
##
strK = re.findall('K=....', filelist[count])
strK = re.findall('\d+', strK[0])
K = int(strK[0])
data[countzone]['indmax'] = array([I, J, K])
pr = prod(data[countzone]['indmax'])
lin = pr // NumberCol
if pr%NumberCol != 0:
lin = lin+1
vect = vectorr(I, J, K)
for key in variables:
init = zeros((I, J, K))
for ii in range(0, lin):
count = count+1
temp = map(float, filelist[count].split())
for iii in range(0, len(temp)):
init.itemset(tuple(vect[ii*6+iii]), temp[iii])
data[countzone][key] = init
count = count+1
Ps. In python, no cython or other languages

Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring
# replace the nested for-loop with:
count += 1
for key in variables:
str_vector = ' '.join(filelist[count:count+lin])
ar = fromstring(str_vector, sep=' ')
ar = ar.reshape((I, J, K), order='F')
data[countzone][key] = ar
count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail..
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather bad with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).

If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...

Append Rows of Different Lengths to the Same Variable

I am trying to append a lengthy list of rows to the same variable. It works great for the first thousand or so iterations in the loop (all of which have the same lengths), but then, near the end of the file, the rows get a bit shorter, and while I still want to append them, I am not sure how to handle it.
The script gives me an out of range error, as expected.
Here is what the part of code in question looks like:
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii], nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
ii = ii + 1
print NNCatelogue, ii
Any help on this would be greatly appreciated!

I'll answer the question you didn't ask first ;) : how can this code be more pythonic?
Instead of
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii], nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
ii = ii + 1
you should do
NNCat = []
NNCatelogue = []
for ii, line in enumerate(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
During each pass ii will be incremented by one for you and line will be the current line.
As for your short lines, you have two choices
Use a special value (such as None) to fill in when you don't have a real value
check the length of nn1, nn2, ..., nn11 to see if they are large enough
The second solution will be much more verbose, hard to maintain, and confusing. I strongly recommend using None (or another special value you create yourself) as a placeholder when there is no data.

def gvop(vals,indx): #get values or padding
return vals[indx] if indx<len(vals) else None
NNCatelogue = [(gvop(ev_id,ii), gvop(nn1,ii), gvop(nn2,ii), gvop(nn3,ii), gvop(nn4,ii),
gvop(nn5,ii), gvop(nn6,ii), gvop(nn7,ii), gvop(nn8,ii), gvop(nn9,ii),
gvop(nn10,ii), gvop(nn11,ii)) for ii in xrange(0, len(lines))]
By defining this other function to return either the correct value or padding, you can ensure rows are the same length. You can change the padding to anything, if None is not what you want.
Then the list comp creates a list of tuples as before, except containing padding in cases where some of the lines in the input are shorter.

from itertools import izip_longest
NNCatelogue = list(izip_longest(ev_id, nn1, nn2, ... nn11, fillvalue=None))
See here for documentation of izip. Do yourself a favour and skip the list around the iterator, if you don't need it. In many cases you can use the iterator as well as the list, and you save a lot of memory. Especially if you have long lists, that you're grouping together here.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to implement a for loop to iterate over this dataset - python

Related

Parse list of strings for speed

Format a python list and search for patterns

Python, complex looping calculations with lists or arrays

Importing big tecplot block files in python as fast as possible

Append Rows of Different Lengths to the Same Variable

Categories

Resources