Edit: To the people who downvoted: I was perfectly clear that I did not want code and that I had already tried it myself. All I was looking for was an explanation of what mathematical process yielded the sample results.
First question. I have done a lot of research and finally resorted to asking, so if I missed the answer somewhere I apologize. I have a problem I am really struggling with:
Write a Python 3 script that takes three command line arguments:
1. The name of a text file that contains n strings separated by white spaces.
2. A positive integer k.
3. The name of a text file that the script will create in order to store all possible subsequences of k unique strings out of the n strings from the input file, one subsequence per line.
For example, assume the command line is gen.py input.txt 3 output.txt and the file input.txt contains the following line:
Python Java C++ Java Java Python
Then the program should create the file output.txt containing the following lines (in any order):
Python Java C++
Python C++ Java
Java C++ Python
C++ Java Python
The combinations should be generated with your implementation of a generator function (i.e. using the keyword yield).
From my understanding, based on the sample output this doesn't quite follow the definition of a subsequence; nor are they quite permutations, so I'm at a loss for how to go about this. I know how to do the file IO and command line argument portions, I just can't get the right subsequences. I don't need a direct answer as I am supposed to solve this, but if someone could give me some helpful insight it would be much appreciated.
If you're allowed to use itertools:
import itertools
import sys

def unique_substrings(txt_lst: list, k: int) -> set:
    # keep only combinations made up of k distinct strings
    return set(' '.join(combo) for combo in itertools.combinations(txt_lst, k)
               if len(set(combo)) == k)

if __name__ == "__main__":
    infile, k, outfile = sys.argv[1:]
    k = int(k)                      # command line arguments arrive as strings
    with open(infile) as inf:
        txt_lst = inf.read().split()
    with open(outfile, 'w') as outf:
        for line in unique_substrings(txt_lst, k):
            outf.write(line + "\n")
However, from your instructor's comment:
The combinations should be generated with your implementation of a generator function (i.e. using the keyword yield).
It doesn't look like that's actually going to work.
itertools.combinations could be re-implemented with something approximating the following (from the docs):
def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)
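To satisfy the yield requirement, one possible shape for the solution (a minimal sketch, not the official answer; unique_combinations is an illustrative name) is a second generator that filters the combinations for k distinct strings:

def unique_combinations(strings, k):
    # `combinations` is the yield-based re-implementation above
    # (itertools.combinations would behave the same way)
    for combo in combinations(strings, k):
        if len(set(combo)) == k:          # keep only groups of k distinct strings
            yield ' '.join(combo)

words = "Python Java C++ Java Java Python".split()
for line in set(unique_combinations(words, 3)):   # set() drops duplicate lines
    print(line)

On the sample input this prints exactly the four expected lines, in some order.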
MATLAB has a gsvd function to perform the generalised SVD. Since 2013, I think, there has been a lot of discussion on the GitHub pages regarding putting it in scipy, and some pages have code that I can use, such as here, which is super complicated for a novice like me (to get it running).
I also found LJWilliams' GitHub page with an implementation. It is not much use as it has a lot of bugs when run under Python 3. I attempted to correct the simple ones, such as assert and print, but it quickly gets complicated.
Can someone help me with gsvd code for Python, or show me how to use the implementations that are online?
Also, this is what I get with the LJWilliams implementation once the print and assert statements are corrected. The code looks complicated and I am not sure spending time on it is the best thing to do! Some people have reported issues on the same GitHub page, which I am not sure are fixed or related.
import numpy as np
from gsvd import gsvd   # the gsvd function from LJWilliams' gsvd.py

n = 10
m = 6
p = 6
A = np.random.rand(m, n)
B = np.random.rand(p, n)
gsvd(A, B)
  File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 260, in gsvd
    U, V, Z, C, S = csd(Q[0:m,:], Q[m:m+n,:])
  File "/home/eghx/agent18/master_thesis/AMfe/amfe/gsvd.py", line 107, in csd
    Q, R = scipy.linalg.qr(S[q:n,m:p])
  File "/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py", line 141, in qr
    overwrite_a=overwrite_a)
  File "/home/eghx/anaconda3/lib/python3.5/site-packages/scipy/linalg/decomp_qr.py", line 19, in safecall
    ret = f(*args, **kwargs)
ValueError: failed to create intent(cache|hide)|optional array -- must have defined dimensions but got (0,)
If you want to work from the LJWilliams implementation on GitHub, there are a couple of bugs. However, to understand the technique fully, I'd probably recommend having a go at implementing it yourself. I looked up what Octave (the free-software MATLAB equivalent) does, and its "code is a wrapper to the corresponding Lapack dggsvd and zggsvd routines", which is what scipy should do too, IMHO.
I'll post up the bugs I found, but I'm not going to post the code in full working order, because I'm not sure how that stands with regard to copyright, given the copyrighted MATLAB implementation from which it is translated.
Caveat : I am not an expert on the Generalised SVD and have approached this only from the perspective of debugging, not whether the underlying algorithm is correct. I have had this working on your original random arrays and the test case already present in the Python file.
Bugs
Setting k
Around line 63, the conditions for setting k and a misunderstanding of numpy.argwhere (particularly in comparison to MATLAB's find) seem to set k wrong in some circumstances. Change that code to:
if q == 1:
    k = 0
elif m < p:
    k = n
else:
    k = max([0, sum((np.diag(C) <= 1/np.sqrt(2)))])
line 79
S[1,1] should be S[0,0], I think (Python 0-indexed arrays)
lines 83 onwards
The numpy matrix slicing around here seems wrong. I got the code working by changing lines 83-95 to read:
UT, ST, VT = scipy.linalg.svd(slice_matrix(S,i,j))
ST = add_zeros(ST,np.zeros([n-k,r-k]))
if k > 0:
    print('Zeroing elements of S in row indices > r, to be replaced by ST')
    S[0:k,k:r] = 0
S[k:n,k:r] = ST
C[:,j] = np.dot(C[:,j],VT)
V[:,i] = np.dot(V[:,i],UT)
Z[:,j] = np.dot(Z[:,j],VT)
i = np.arange(k,q)
Q,R = scipy.linalg.qr(C[k:q,k:r])
C[i,j] = np.diag(diagf(R))
U[:,k:q] = np.dot(U[:,k:q],Q)
in diagp()
There are two matrix multiplications written as X*Y that should be np.dot(X, Y) instead (note that * is element-wise multiplication in numpy, not matrix multiplication).
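For anyone unsure of the distinction, a quick illustration with plain NumPy arrays (nothing here is specific to gsvd.py):

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[5.0, 6.0],
              [7.0, 8.0]])

print(X * Y)         # element-wise:   [[ 5. 12.] [21. 32.]]
print(np.dot(X, Y))  # matrix product: [[19. 22.] [43. 50.]]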
My question is fairly basic yet might need a challenging solution.
Essentially, I have an arbitrary function which we will call some_function.
def some_function(n):
    for i in range(n):
        i + i
    r = 1
    r = r + 1
And I want to count the number of operations that take place when an arbitrary call to this function is executed (e.g. for some_function(5), 7 operations take place).
How would one count the number of operations that took place within a function call? I cannot modify some_function.
I think you're really after what others already told you - the big O notation.
But if you really want to know the actual number of instructions executed you can use this on linux:
perf stat -e instructions:u python yourscript.py
Which will output:
Performance counter stats for 'python yourscript.py':
22,260,577 instructions:u
0.014450363 seconds time elapsed
Note though that it includes all the instructions for executing python itself. So you'd have to find your own reference.
Using byteplay:
Example:
#!/usr/bin/env python
from byteplay import Code

def some_function(n):
    for i in range(n):
        i + i
    r = 1
    r = r + 1

def no_of_bytecode_instructions(f):
    code = Code.from_code(f.func_code)
    return len(code.code)

print(no_of_bytecode_instructions(some_function))
Output:
$ python -i foo.py
28
>>>
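If you are on Python 3, where byteplay and func_code are not available, the standard-library dis module gives a similar count; a minimal sketch:

import dis

def some_function(n):
    for i in range(n):
        i + i
    r = 1
    r = r + 1

# dis.get_instructions (Python 3.4+) yields one Instruction per bytecode operation
print(len(list(dis.get_instructions(some_function))))

The exact number differs between Python versions, since the compiler emits different bytecode.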
NB:
This still gives you no idea how complex f is here.
"Number of Instructions" != "Algorithm Complexity" (not by itself)
See: Big O
Algorithm complexity is a measure of the no. of instructions executed
relative to the size of your input data set(s).
Some naive examples of "complexity" and Big O:
def func1(just_a_list):
    """O(n)"""
    for i in just_a_list:
        ...

def func2(list_of_lists):
    """O(n^2)"""
    for i in list_of_lists:
        for j in i:
            ...

def func3(a_dict, a_key):
    """O(1)"""
    return a_dict[a_key]
I want to import into Python some ASCII files (from Tecplot, software for CFD post-processing).
Rules for those files are (at least, for those that I need to import):
The file is divided into several sections.
Each section has two lines as a header, like:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
Each section has a set of variables given by the first line. When a section ends, a new section starts with two similar lines.
For each variable there are I*J*K values.
Each variable is a continuous block of values.
There is a fixed number of values per row (6).
When a variable ends, the next one starts on a new line.
Variables are "IJK ordered data". The I-index varies the fastest, the J-index the next fastest, and the K-index the slowest. The I-index should be the inner loop, the K-index should be the outer loop, and the J-index the loop in between.
Here is an example of data:
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen"
ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK
-3.9999999E+00 -3.3327306E+00 -2.7760824E+00 -2.3117116E+00 -1.9243209E+00 -1.6011492E+00
[...]
0.0000000E+00 #end of first variable
-4.3532482E-02 -4.3584235E-02 -4.3627592E-02 -4.3663762E-02 -4.3693815E-02 -4.3718831E-02 #second variable, 'y'
[...]
1.0738781E-01 #end of second variable
[...]
[...]
VARIABLES = "x" "y" "z" "ro" "rovx" "rovy" "rovz" "roE" "M" "p" "Pi" "tsta" "tgen" #next zone
ZONE T="Window(s) : E_W_Block0003_ALL", I=17, J=17, K=25, F=BLOCK
I am quite new to Python and I have written code to import the data into a dictionary, storing the variables as 3D numpy arrays. These files can be very big (up to GB). How can I make this code faster? (Or, more generally, how can I import such files as fast as possible?)
import re
from numpy import zeros, array, prod

def vectorr(I, J, K):
    """function"""
    vect = []
    for k in range(0, K):
        for j in range(0, J):
            for i in range(0, I):
                vect.append([i, j, k])
    return vect

a = open('E:\u.dat')
filelist = a.readlines()
NumberCol = 6
count = 0
data = dict()
leng = len(filelist)
countzone = 0
while count < leng:
    strVARIABLES = re.findall('VARIABLES', filelist[count])
    variables = re.findall(r'"(.*?)"', filelist[count])
    countzone = countzone+1
    data[countzone] = {key:[] for key in variables}
    count = count+1
    strI = re.findall('I=....', filelist[count])
    strI = re.findall('\d+', strI[0])
    I = int(strI[0])
    ##
    strJ = re.findall('J=....', filelist[count])
    strJ = re.findall('\d+', strJ[0])
    J = int(strJ[0])
    ##
    strK = re.findall('K=....', filelist[count])
    strK = re.findall('\d+', strK[0])
    K = int(strK[0])
    data[countzone]['indmax'] = array([I, J, K])
    pr = prod(data[countzone]['indmax'])
    lin = pr // NumberCol
    if pr%NumberCol != 0:
        lin = lin+1
    vect = vectorr(I, J, K)
    for key in variables:
        init = zeros((I, J, K))
        for ii in range(0, lin):
            count = count+1
            temp = map(float, filelist[count].split())
            for iii in range(0, len(temp)):
                init.itemset(tuple(vect[ii*6+iii]), temp[iii])
        data[countzone][key] = init
    count = count+1
PS: in Python only, no Cython or other languages.
Converting a large bunch of strings to numbers is always going to be a little slow, but assuming the triple-nested for-loop is the bottleneck here maybe changing it to the following gives you a sufficient speedup:
# add this line to your imports
from numpy import fromstring

# replace the nested for-loop with:
count += 1
for key in variables:
    str_vector = ' '.join(filelist[count:count+lin])
    ar = fromstring(str_vector, sep=' ')
    # Fortran order: the I index varies fastest, matching the IJK block layout
    ar = ar.reshape((I, J, K), order='F')
    data[countzone][key] = ar
    count += lin
Unfortunately at the moment I only have access to my smartphone (no pc) so I can't test how fast this is or even if it works correctly or at all!
Update
Finally I got around to doing some testing:
My code contained a small error, but it does seem to work correctly now.
The code with the proposed changes runs about 4 times faster than the original
Your code spends most of its time on ndarray.itemset and probably on loop overhead and float conversion. Unfortunately cProfile doesn't show this in much detail.
The improved code spends about 70% of time in numpy.fromstring, which, in my view, indicates that this method is reasonably fast for what you can achieve with Python / NumPy.
Update 2
Of course even better would be to iterate over the file instead of loading everything all at once. In this case this is slightly faster (I tried it) and significantly reduces memory use. You could also try to use multiple CPU cores to do the loading and conversion to floats, but then it becomes difficult to have all the data under one variable. Finally a word of warning: the fromstring method that I used scales rather badly with the length of the string. E.g. from a certain string length it becomes more efficient to use something like np.fromiter(itertools.imap(float, str_vector.split()), dtype=float).
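For reference, itertools.imap no longer exists on Python 3 (the built-in map is already lazy there), so that alternative would look roughly like this:

import numpy as np

str_vector = "1.0 2.5 -3.0E-01 4.2"   # illustrative data, not taken from the file
ar = np.fromiter(map(float, str_vector.split()), dtype=float)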
If you use regular expressions here, there's two things that I would change:
Compile REs which are used more often (which applies to all REs in your example, I guess). Do regex=re.compile("<pattern>") on them, and use the resulting object with match=regex.match(), as described in the Python documentation.
For the I, J, K REs, consider reducing two REs to one, using the grouping feature (also described above), by searching for a pattern of the form "I=(\d+)" and grabbing the part matched inside the parentheses using regex.group(1); a sketch is shown at the end of this answer.
At least for starting the sections, REs seem a bit overkill: There's no variation in the string you need to look for, and string.find() is sufficient and probably faster in that case.
EDIT: I just saw you use grouping already for the variables...
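A minimal sketch of the first two suggestions combined (the patterns and variable names are illustrative, not taken from the original script):

import re

# compiled once, reused for every ZONE header line
i_re = re.compile(r'I=(\d+)')
j_re = re.compile(r'J=(\d+)')
k_re = re.compile(r'K=(\d+)')

line = 'ZONE T="Window(s) : E_W_Block0002_ALL", I=29, J=17, K=25, F=BLOCK'
I = int(i_re.search(line).group(1))   # 29
J = int(j_re.search(line).group(1))   # 17
K = int(k_re.search(line).group(1))   # 25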
If I want to use only the index within a loop, is it better to use range/xrange in combination with len():
a = [1, 2, 3]
for i in xrange(len(a)):
    print i
or enumerate? Even if I won't use p at all?
for i, p in enumerate(a):
    print i
I would use enumerate as it's more generic - e.g. it will work on any iterable or sequence, and the overhead of just returning a reference to an object isn't that big a deal - while xrange(len(something)), although (to me) more easily readable as your intent, will break on objects with no support for len().
Using xrange with len is quite a common use case, so yes, you can use it if you only need to access values by index.
But if you prefer to use enumerate for some reason, you can use an underscore (_); it's just a frequently seen notation that shows you won't use the variable in any meaningful way:
for i, _ in enumerate(a):
    print i
There's also a pitfall that may happen when using underscore (_). It's also common to name 'translating' functions _ in i18n libraries and systems, so be careful when using it with gettext or another library of that kind (thanks to @lazyr).
That's a rare requirement – the only information used from the container is its length! In this case, I'd indeed make this fact explicit and use the first version.
xrange should be a little faster, but enumerate will mean you don't need to change it when you realise that you need p after all.
I ran a time test and found out range is about 2x faster than enumerate (on Python 3.6 for Win32).
best of 3, for len(a) = 1M
enumerate(a): 0.125s
range(len(a)): 0.058s
Hope it helps.
FYI: I initially started this test to compare Python's speed with VBA's ... and found out VBA is actually 7x faster than the range method ... is it because of my poor Python skills?
Surely Python can do better than VBA somehow.
script for enumerate
import time

a = [0]
a = a * 1000000
start = time.perf_counter()
for i, j in enumerate(a):
    pass
print(time.perf_counter() - start)   # elapsed time in seconds
script for range
import time

a = [0]
a = a * 1000000
start = time.perf_counter()
for i in range(len(a)):
    pass
print(time.perf_counter() - start)   # elapsed time in seconds
script for vba (0.008s)
Sub timetest_for()
    Dim a(1000000) As Byte
    Dim i As Long
    tproc = Timer
    For i = 1 To UBound(a)
    Next i
    Debug.Print Timer - tproc
End Sub
I wrote this because I wanted to test it.
So it depends on whether you need the values to work with.
Code:
testlist = []
for i in range(10000):
    testlist.append(i)

def rangelist():
    a = 0
    for i in range(len(testlist)):
        a += i
        a = testlist[i] + 1  # Comment this line for example for testing

def enumlist():
    b = 0
    for i, x in enumerate(testlist):
        b += i
        b = x + 1  # Comment this line for example for testing

import timeit
t = timeit.Timer(lambda: rangelist())
print("range(len()):")
print(t.timeit(number=10000))
t = timeit.Timer(lambda: enumlist())
print("enum():")
print(t.timeit(number=10000))
Now you can run it and will most likely get the result that enum() is faster.
When you comment out the lines a = testlist[i] + 1 and b = x + 1 you will see that range(len()) is faster.
For the code above I get:
range(len()):
18.766527627612255
enum():
15.353173553868345
Now when commenting as stated above I get:
range(len()):
8.231641875551514
enum():
9.974262515773656
Based on your sample code,
res = [[profiel.attr[i].x for i,p in enumerate(profiel.attr)] for profiel in prof_obj]
I would replace it with
res = [[p.x for p in profiel.attr] for profiel in prof_obj]
Just use range(). If you're going to use all the indexes anyway, xrange() provides no real benefit (unless len(a) is really large). And enumerate() creates a richer datastructure that you're going to throw away immediately.
How can I read n lines from a file instead of just one when iterating over it? I have a file which has a well-defined structure and I would like to do something like this:
for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)
but it doesn't work:
ValueError: too many values to unpack
For now I am doing this:
for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
    ... etc.
which sucks, because I end up writing endless 'newline = file.readline()' lines which clutter the code.
Is there any smart way to do this ? (I really want to avoid reading whole file at once because it is huge)
Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
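A minimal sketch of that approach (the file name is a placeholder, and it assumes a short trailing group can simply be skipped):

from itertools import islice

with open('data.txt') as f:
    while True:
        chunk = list(islice(f, 3))   # read up to three lines at a time
        if len(chunk) < 3:           # end of file (or an incomplete final group)
            break
        line1, line2, line3 = chunk
        # do_something(line1), do_something_different(line2), ...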
If it's XML, why not just use lxml?
You could use a helper function like this:
def readnlines(f, n):
    lines = []
    for x in range(0, n):
        lines.append(f.readline())
    return lines
Then you can do something like you want:
while True:
    line1, line2, line3 = readnlines(file, 3)
    if not line1:   # readline() returns '' at end of file, so stop here
        break
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)
That being said, if you are using xml files, you will probably be happier in the long run if you use a real xml parser...
itertools to the rescue:
import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)

fobj = open(yourfile, "r")
for line1, line2, line3 in grouper(3, fobj):
    pass
for i in file produces a str, so you can't just do for i, j, k in file and read it in batches of three (try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get the "too many values to unpack").
It's not clear entirely what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:
for line in file_handle:
    do_something(line)
    if some_condition:
        break  # Don't want to read anything else
(Also, don't use file as a variable name, you're shadowing a builtin.)
If you're doing the same thing, why do you need to process multiple lines per iteration?
for line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.
Do you know something about the length of the lines/format of the data? If so, you could read in the first n bytes (say 80*3) and f.read(240).split("\n")[0:3].
If you want to be able to use this data over and over again, one approach might be to do this:
lines = []
for line in file_handle:
    lines.append(line)
This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size is most likely not a problem, because Python can process thousands of lines very quickly.
why can't you just do:
ctr = 0
for line in file:
    if ctr == 0:
        ...
    elif ctr == 1:
        ...
    ctr = ctr + 1
if you find the if/elif construct ugly you could just create a hash table or list of function pointers and then do:
for line in file:
    function_list[ctr]()
or something similar
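A slightly fleshed-out sketch of that dispatch idea (the handler names, the file name and the modulo cycling are illustrative additions, not from the answer above):

def handle_first(line): ...
def handle_second(line): ...
def handle_third(line): ...

function_list = [handle_first, handle_second, handle_third]

with open('data.txt') as f:
    for ctr, line in enumerate(f):
        function_list[ctr % len(function_list)](line)   # cycle through the handlers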
It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective it is; if the code is messy, you can tidy it up, but don't look for a whole new method of doing something just because you don't like how one way of doing it looks in code.
As for running out of memory, you may want to check out pickle.
It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.
zip(*[iter(inputfile)] * 3)
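To make the voodoo a little less opaque, here is the same trick on a small list instead of a file:

lines = ['a\n', 'b\n', 'c\n', 'd\n', 'e\n', 'f\n', 'g\n']
it = iter(lines)                # one shared iterator
groups = list(zip(*[it] * 3))   # zip(it, it, it): each tuple pulls three consecutive items
print(groups)                   # [('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', 'f\n')] -- 'g' is dropped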
Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:
def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...
N.B. If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.