Parse list of strings for speed

Parse list of strings for speed - python

Background
I have a function called get_player_path that takes in a list of strings player_file_list and a int value total_players. For the sake of example i have reduced the list of strings and also set the int value to a very small number.
Each string in the player_file_list either has a year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What i am essentially trying to achieve here is go through this list and store all unique year-date/player_id path in a set until it's length reaches the value of total_players
My current approach does not seem the most efficient to me and i am wondering if i can speed up my function get_player_path in anyway??
Code
def get_player_path(player_file_list, total_players):
player_files_to_process = set()
for player_file in player_file_list:
player_file = player_file.split("/")
file_path = f"{player_file[0]}/{player_file[1]}/"
player_files_to_process.add(file_path)
if len(player_files_to_process) == total_players:
break
return sorted(player_files_to_process)
player_file_list = [
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']

Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed perfomances with huge lists, you would probably be using some other language. Nonetheless, if you really care about performances (or just to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
player_files_to_process = set()
for player_file in player_file_list:
end = player_file.find('/', 12) # <--- len(date) + len('/') + 1
file_path = player_file[:end] # <---
player_files_to_process.add(file_path)
if len(player_files_to_process) == total_players:
break
return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1 # 1 is for the slash between date and id
...
for player_file in player_file_list:
file_path = player_file[:LENGTH]
...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1

I'll leave this solution here which can be further improved, hope it helps.
player_file_list = (
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)
def get_player_path(l, n):
pfl = set()
for i in l:
i = "/".join(i.split("/")[0:2])
if i not in pfl:
pfl.add(i)
if len(pfl) == n:
return pfl
if n > len(pfl):
print("not enough matches")
return
print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Python Demo

Use dict so that you don't have to sort it since your list is already sorted. If you still need to sort you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
dct = {re.search('^\w+-\w+-\w+/\w+',pf).group(): 1 for pf in player_file_list}
return [k for i,k in enumerate(dct.keys()) if i < total_players]

Related

Check if files in dir are the same

I have a folder of 5000+ images in jpeg/png etc. How can I check if any of the images are the same. The images were collected through web scraping and have been sequentially renamed so I cannot compare file names.
I am currently checking if the hashes are the same however this is a very long process. I am currently using:
def sameIm(file_name1,file_name2):
hash = imagehash.average_hash(Image.open(path + file_name1))
otherhash = imagehash.average_hash(Image.open(path + file_name2))
return (hash == otherhash)
Then nested loops. Comparing 1 image to 5000+ others takes about 5mins so comparing each to each would take days to compute.
Is there a faster way to do this in python. I was thinking parallel processing but would that still take a long time?
or is there another way to compare files which is faster?
Thanks

There is indeed a much faster way of doing this:
import collections
import glob
import os
def dupDetector(dirpath, ext):
hashes = collections.defaultdict(list)
for fpath in glob.glob(os.path.join(dirpath, "*.{}".format(ext))):
h = imagehash.average_hash(Image.open(fpath))
hashes[h].append(fpath)
for h,fpaths in hashes.items():
if len(fpaths) == 1:
print(fpaths[0], "is one of a kind")
continue
print("The following files are duplicates of each other (with the hash {}): \n\t{}".format(h, '\n\t'.join(fpaths)))
Using the dictionary with the file hash as a key gives you O(1) lookups, which means you don't need to do the pair-wise comparisons. You therefore go from a quadratic runtime, to a linear runtime (yay!)

Why not compute hash only once?
hashes = [imagehash.average_hash(Image.open(path + fn)) for fn in file_names]
def compare_hashes(hash1, hash2):
return hash1 == hash2

One solution is to keep using the hash but stocking it in a list of tuple (or a dic, i don't know wich is more efficient here) where the first element is the name of the image and the second is the hash. It should take aproximatively the same 5 mins.
If you have 5000 images,
You compare the value of the first element of the list to the 4999 others
Then the second to the 4998 others (as you already checked the first one)
Then the third ...
This "just" make you do n²/2 comparisons (where n is the number of images)

Just use map structure to calculate hashes for each image,then store hashes as a key and name of the image as a value.
As a result you would have unique images names array.
def get_hash(filename):
return imagehash.average_hash(Image.open(path + filename))
def get_unique_images(filenames):
hashes = {}
for filename in filenames:
image_hash = get_hash(filename)
hashes[image_hash] = filename
return hashes.values()

Importing big tecplot block files in python as fast as possible

I want to import in python some ascii file ( from tecplot, software for cfd post processing).
Rules for those files are (at least, for those that I need to import):
The file is divided in several section
Each section has two lines as header like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variable given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continous block of values.
There are a fixed number of values per row (6).
When a variable ends, the next one starts in a new line.
Variables are "IJK ordered data".The I-index varies the fastest; the J-index the next fastest; the K-index the slowest. The I-index should be the inner loop, the K-index shoould be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #fin first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new at python and I have written a code to import the data to a dictionary, writing the variables as 3D numpy.array . Those files could be very big, (up to Gb). How can I make this code faster? (or more generally, how can I import such files as fast as possible)?
import re
from numpy import zeros, array, prod
def vectorr(I, J, K):
"""function"""
vect = []
for k in range(0, K):
for j in range(0, J):
for i in range(0, I):
vect.append([i, j, k])
return vect
a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
strVARIABLES = re.findall('VARIABLES', filelist[count])
variables = re.findall(r'"(.*?)"', filelist[count])
countzone = countzone+1
data[countzone] = {key:[] for key in variables}
count = count+1
strI = re.findall('I=....', filelist[count])
strI = re.findall('\d+', strI[0])
I = int(strI[0])
##
strJ = re.findall('J=....', filelist[count])
strJ = re.findall('\d+', strJ[0])
J = int(strJ[0])
##
strK = re.findall('K=....', filelist[count])
strK = re.findall('\d+', strK[0])
K = int(strK[0])
data[countzone]['indmax'] = array([I, J, K])
pr = prod(data[countzone]['indmax'])
lin = pr // NumberCol
if pr%NumberCol != 0:
lin = lin+1
vect = vectorr(I, J, K)
for key in variables:
init = zeros((I, J, K))
for ii in range(0, lin):
count = count+1
temp = map(float, filelist[count].split())
for iii in range(0, len(temp)):
init.itemset(tuple(vect[ii*6+iii]), temp[iii])
data[countzone][key] = init
count = count+1
Ps. In python, no cython or other languages

Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring
# replace the nested for-loop with:
count += 1
for key in variables:
str_vector = ' '.join(filelist[count:count+lin])
ar = fromstring(str_vector, sep=' ')
ar = ar.reshape((I, J, K), order='F')
data[countzone][key] = ar
count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail..
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather bad with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).

If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)", and grabbing the part matched inside the parentheses using regex.group(1). Taking this further, you can define a single regex to capture all three variables in one step.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...

Append Rows of Different Lengths to the Same Variable

I am trying to append a lengthy list of rows to the same variable. It works great for the first thousand or so iterations in the loop (all of which have the same lengths), but then, near the end of the file, the rows get a bit shorter, and while I still want to append them, I am not sure how to handle it.
The script gives me an out of range error, as expected.
Here is what the part of code in question looks like:
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii], nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
ii = ii + 1
print NNCatelogue, ii
Any help on this would be greatly appreciated!

I'll answer the question you didn't ask first ;) : how can this code be more pythonic?
Instead of
ii = 0
NNCat = []
NNCatelogue = []
while ii <= len(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii], nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
ii = ii + 1
you should do
NNCat = []
NNCatelogue = []
for ii, line in enumerate(lines):
NNCat = (ev_id[ii], nn1[ii], nn2[ii], nn3[ii], nn4[ii], nn5[ii], nn6[ii],
nn7[ii], nn8[ii], nn9[ii], nn10[ii], nn11[ii])
NNCatelogue.append(NNCat)
During each pass ii will be incremented by one for you and line will be the current line.
As for your short lines, you have two choices
Use a special value (such as None) to fill in when you don't have a real value
check the length of nn1, nn2, ..., nn11 to see if they are large enough
The second solution will be much more verbose, hard to maintain, and confusing. I strongly recommend using None (or another special value you create yourself) as a placeholder when there is no data.

def gvop(vals,indx): #get values or padding
return vals[indx] if indx<len(vals) else None
NNCatelogue = [(gvop(ev_id,ii), gvop(nn1,ii), gvop(nn2,ii), gvop(nn3,ii), gvop(nn4,ii),
gvop(nn5,ii), gvop(nn6,ii), gvop(nn7,ii), gvop(nn8,ii), gvop(nn9,ii),
gvop(nn10,ii), gvop(nn11,ii)) for ii in xrange(0, len(lines))]
By defining this other function to return either the correct value or padding, you can ensure rows are the same length. You can change the padding to anything, if None is not what you want.
Then the list comp creates a list of tuples as before, except containing padding in cases where some of the lines in the input are shorter.

from itertools import izip_longest
NNCatelogue = list(izip_longest(ev_id, nn1, nn2, ... nn11, fillvalue=None))
See here for documentation of izip. Do yourself a favour and skip the list around the iterator, if you don't need it. In many cases you can use the iterator as well as the list, and you save a lot of memory. Especially if you have long lists, that you're grouping together here.

Only index needed: enumerate or (x)range?

If I want to use only the index within a loop, should I better use the range/xrange function in combination with len()
a = [1,2,3]
for i in xrange(len(a)):
print i
or enumerate? Even if I won't use p at all?
for i,p in enumerate(a):
print i

I would use enumerate as it's more generic - eg it will work on iterables and sequences, and the overhead for just returning a reference to an object isn't that big a deal - while xrange(len(something)) although (to me) more easily readable as your intent - will break on objects with no support for len...

Using xrange with len is quite a common use case, so yes, you can use it if you only need to access values by index.
But if you prefer to use enumerate for some reason, you can use underscore (_), it's just a frequently seen notation that show you won't use the variable in some meaningful way:
for i, _ in enumerate(a):
print i
There's also a pitfall that may happen using underscore (_). It's also common to name 'translating' functions as _ in i18n libraries and systems, so beware to use it with gettext or some other library of such kind (thnks to #lazyr).

That's a rare requirement – the only information used from the container is its length! In this case, I'd indeed make this fact explicit and use the first version.

xrange should be a little faster, but enumerate will mean you don't need to change it when you realise that you need p afterall

I ran a time test and found out range is about 2x faster than enumerate. (on python 3.6 for Win32)
best of 3, for len(a) = 1M
enumerate(a): 0.125s
range(len(a)): 0.058s
Hope it helps.
FYI: I initialy started this test to compare python vs vba's speed...and found out vba is actually 7x faster than range method...is it because of my poor python skills?
surely python can do better than vba somehow
script for enumerate
import time
a = [0]
a = a * 1000000
time.perf_counter()
for i,j in enumerate(a):
pass
print(time.perf_counter())
script for range
import time
a = [0]
a = a * 1000000
time.perf_counter()
for i in range(len(a)):
pass
print(time.perf_counter())
script for vba (0.008s)
Sub timetest_for()
Dim a(1000000) As Byte
Dim i As Long
tproc = Timer
For i = 1 To UBound(a)
Next i
Debug.Print Timer - tproc
End Sub

I wrote this because I wanted to test it.
So it depends if you need the values to work with.
Code:
testlist = []
for i in range(10000):
testlist.append(i)
def rangelist():
a = 0
for i in range(len(testlist)):
a += i
a = testlist[i] + 1 # Comment this line for example for testing
def enumlist():
b = 0
for i, x in enumerate(testlist):
b += i
b = x + 1 # Comment this line for example for testing
import timeit
t = timeit.Timer(lambda: rangelist())
print("range(len()):")
print(t.timeit(number=10000))
t = timeit.Timer(lambda: enumlist())
print("enum():")
print(t.timeit(number=10000))
Now you can run it and will get most likely the result, that enum() is faster.
When you comment the source at a = testlist[i] + 1 and b = x + 1 you will see range(len()) is faster.
For the code above I get:
range(len()):
18.766527627612255
enum():
15.353173553868345
Now when commenting as stated above I get:
range(len()):
8.231641875551514
enum():
9.974262515773656

Based on your sample code,
res = [[profiel.attr[i].x for i,p in enumerate(profiel.attr)] for profiel in prof_obj]
I would replace it with
res = [[p.x for p in profiel.attr] for profiel in prof_obj]

Just use range(). If you're going to use all the indexes anyway, xrange() provides no real benefit (unless len(a) is really large). And enumerate() creates a richer datastructure that you're going to throw away immediately.

Translate ruby to python

I'm rewriting some code from Ruby to Python. The code is for a Perceptron, listed in section 8.2.6 of Clever Algorithms: Nature-Inspired Programming Recipes. I've never used Ruby before and I don't understand this part:
def test_weights(weights, domain, num_inputs)
correct = 0
domain.each do |pattern|
input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
output = get_output(weights, input_vector)
correct += 1 if output.round == pattern.last
end
return correct
end
Some explanation: num_inputs is an integer (2 in my case), and domain is a list of arrays: [[1,0,1], [0,0,0], etc.]
I don't understand this line:
input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
It creates an array with 2 values, every values |k| stores pattern[k].to_f, but what is pattern[k].to_f?

Try this:
input_vector = [float(pattern[i]) for i in range(num_inputs)]

pattern[k].to_f
converts pattern[k] to a float.

I'm not a Ruby expert, but I think it would be something like this in Python:
def test_weights(weights, domain, num_inputs):
correct = 0
for pattern in domain:
output = get_output(weights, pattern[:num_inputs])
if round(output) == pattern[-1]:
correct += 1
return correct
There is plenty of scope for optimising this: if num_inputs is always one less then the length of the lists in domain then you may not need that parameter at all.
Be careful about doing line by line translations from one language to another: that tends not to give good results no matter what languages are involved.
Edit: since you said you don't think you need to convert to float you can just slice the required number of elements from the domain value. I've updated my code accordingly.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse list of strings for speed - python

Related

Check if files in dir are the same

Importing big tecplot block files in python as fast as possible

Append Rows of Different Lengths to the Same Variable

Only index needed: enumerate or (x)range?

Translate ruby to python

Categories

Resources