Iterate over an HDF file with groups and subgroups with python itertools

I would like to parse an HDF file that has the following format:
HDFFile/
    Group1/Subgroup1/DataNDArray
    ...
          /SubgroupN/DataNDArray
    ...
    GroupM/Subgroup1/DataNDArray
    ...
          /SubgroupN/DataNDArray
I am trying to use itertools.product, but I am stuck on what to use for the second iterator.
MWE:
from itertools import *
import h5py
hfilename = 'data.hdf'
with h5py.File(hfilename, 'r') as hfile:
    for group, subgroup, dim in product(hfile.itervalues(), ????, range(10)):
        parse(group, subgroup, dim)
Obviously, my problem is that the second iterator depends on the value extracted from the first iterator, which is not available within the same one-liner.
I know that I can do it with nested for loops, or with the following example:
with h5py.File(hfilename, 'r') as hfile:
    for group in hfile.itervalues():
        for subgroup, dim in product(group.itervalues(), range(10)):
            parse(group, subgroup, dim)
but I was wondering if there is a way to do it in a single itertools call.

Does the second iterator depend on the extracted value of the first iterator? From your example it seems like there are N subgroups in every group.
A solution with list comprehensions and generator expressions (instead of product) would look like:
M = 3
N = 2
a = ['Group' + str(m) for m in range(1, M + 1)]
b = ['Subgroup' + str(n) for n in range(1, N + 1)]
c = ('{}/{}/DataNDArray'.format(ai, bi) for ai in a for bi in b)
for key in c:
    print(key)
which prints:
Group1/Subgroup1/DataNDArray
Group1/Subgroup2/DataNDArray
Group2/Subgroup1/DataNDArray
Group2/Subgroup2/DataNDArray
Group3/Subgroup1/DataNDArray
Group3/Subgroup2/DataNDArray
which should be what you want.
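If the subgroups do differ from group to group, a nested generator expression can still express the dependent iteration in a single statement, because a later for clause may refer to the variable of an earlier one. The sketch below is only an illustration: walk is a hypothetical helper name, and plain nested dicts stand in for an open h5py file, on the assumption that h5py groups expose a dict-like values() method.

```python
def walk(hfile, n_dims=10):
    """Yield (group, subgroup, dim) for every subgroup of every group."""
    return (
        (group, subgroup, dim)
        for group in hfile.values()
        for subgroup in group.values()  # this clause may depend on `group`
        for dim in range(n_dims)
    )

# Nested dicts stand in for an open h5py.File (an assumption for this demo).
fake_file = {
    'Group1': {'Subgroup1': 'a', 'Subgroup2': 'b'},
    'Group2': {'Subgroup1': 'c'},
}
triples = list(walk(fake_file, n_dims=2))
for group, subgroup, dim in triples:
    print(subgroup, dim)
```

Unlike itertools.product, this does not require the second iterable to be known up front, which is exactly the limitation described in the question.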

Related

Create words by combining all elements in a set of lists with each other

I have 6 different lists, which I need to combine and create sequences/words for all possible combinations.
These are my lists:
product = ['SOC', 'SOCCAMP', 'ETI', 'CARDASS', 'VRS', 'PRS', 'INT', 'GRS', 'VeloVe', 'HOME']
fam = ['IND', 'FAM']
area = ['EUR', 'WOR']
type = ['STD', 'PLU']
assist = ['MOT', 'NMOT']
All of the elements in all of the lists need to be combined into words.
The result will be a list of all possible combinations.
Among them I will have, for example, the following words:
('SOC', 'SOCIND', 'SOCFAM', 'SOCFAMMOT', 'SOCMOTFAM'...)
I thus combine each element of a given list with all the elements of the other lists.
Up to now I have managed to combine them with a series of loops:
combos = []
##1##
for i in range(len(product)):
    combos.append(str(product[i]))
##2##
for a in range(len(product)):
    for b in range(len(fam)):
        combos.append(str(product[a]) + str(fam[b]))
##3##
for a in range(len(product)):
    for b in range(len(fam)):
        for c in range(len(area)):
            combos.append(str(product[a]) + str(fam[b]) + str(area[c]))
##4##
for a in range(len(product)):
    for b in range(len(fam)):
        for c in range(len(area)):
            for d in range(len(type)):
                combos.append(str(product[a]) + str(fam[b]) + str(area[c]) + str(type[d]))
##5##
for a in range(len(product)):
    for b in range(len(fam)):
        for c in range(len(area)):
            for d in range(len(type)):
                for e in range(len(assist)):
                    combos.append(str(product[a]) + str(fam[b]) + str(area[c]) + str(type[d]) + str(assist[e]))
This code manages to combine the words in a list of combinations but solely in the precise order the lists are mentioned:
['SOC', 'SOCCAMP', 'ETI', 'CARDASS', 'VRS', 'PRS', 'INT', 'GRS', 'VeloVe', 'HOME', 'SOCIND', 'SOCFAM', 'SOCCAMPIND', 'SOCCAMPFAM', 'ETIIND', 'ETIFAM', 'CARDASSIND', 'CARDASSFAM', 'VRSIND', 'VRSFAM', 'PRSIND', 'PRSFAM', 'INTIND', 'INTFAM', 'GRSIND', 'GRSFAM', 'VeloVeIND', 'VeloVeFAM', 'HOMEIND', 'HOMEFAM', ...]
So, 'SOCINDEUR' is a combination in this list but 'SOCEURIND' is not.
Is there a smart way to avoid writing down another 100 loops to look for all the possible combinations?

You can make use of various tools provided by itertools.
Overall, one solution is:
import itertools
source_lists = product, fam, area, type, assist
combos = [
    ''.join(prod)
    for l in range(1, len(source_lists) + 1)
    for subset in itertools.permutations(source_lists, l)
    for prod in itertools.product(*subset)
]
This code is effectively equivalent to:
combos = []
source_lists = product, fam, area, type, assist
for l in range(1, len(source_lists) + 1):
    for subset in itertools.permutations(source_lists, l):
        for prod in itertools.product(*subset):
            combos.append(''.join(prod))
The first loop selects how many source lists to combine. It first produces the results from a single list, i.e. the "original" input, then moves on to combining two lists, and so on. The upper bound is len(source_lists) + 1 so that words built from all of the lists are included.
The second loop selects which source lists to combine and in which order, going over all possible permutations.
The third and last loop takes the selected source lists and produces all possible combinations (or, better said, the "product") of them.
Each result is a tuple, and since you want a single string per result, we need to join its elements.
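To see the effect of the three loops on a small input, here is a minimal sketch with two short lists. The function name ordered_words and the sample data are made up for illustration, and the range runs up to the number of lists so that words using every list are included.

```python
import itertools

def ordered_words(*source_lists):
    """Join one element from each chosen list, for every choice and order of lists."""
    return [
        ''.join(parts)
        for k in range(1, len(source_lists) + 1)               # how many lists to use
        for subset in itertools.permutations(source_lists, k)  # which lists, in which order
        for parts in itertools.product(*subset)                # one element from each list
    ]

words = ordered_words(['A', 'B'], ['x'])
print(words)
# ['A', 'B', 'x', 'Ax', 'Bx', 'xA', 'xB']
```

Both 'Ax' and 'xA' appear, which is the order-independence the question asks for.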

The magic of itertools!
from itertools import product, permutations
def concat_combinations(sequence):
    new_seq = []
    for i, j in enumerate(sequence):
        if i == 0:
            new_str = j
            new_seq.append(new_str)
        else:
            new_str += j
            new_seq.append(new_str)
    return new_seq
def final_list(*iterables) -> list:
    s = set()
    for seq in product(*iterables):
        for i in permutations(seq):
            s.update(concat_combinations(i))
    return sorted(s, key=len)

Getting union of files after the combination generated from command line in Python

I want to write a Python program that takes n command line arguments.
For example: python3 myProgram.py 3 A B C
In the above example n = 3 and the 3 arguments are A, B, C.
First I want to generate all the combinations of those n arguments except for the empty one. For the above example these are: A, B, C, AB, AC, BC, ABC
So I am going to get 2^n - 1 combinations.
For this part I am trying:
import sys
import itertools
from itertools import combinations
number = int(sys.argv[1])
a_list = list(sys.argv[2:number+2])
all_combinations = []
for r in range(len(a_list) + 1):
    combinations_object = itertools.combinations(a_list, r)
    combinations_list = list(combinations_object)
    all_combinations += combinations_list
print(all_combinations)
But here I am unable to remove the empty combination.
Initially I have n files in the same directory. In the above case, for example, I have 3 files: A.txt, B.txt, C.txt
For each combination I then want to generate an output file like:
When it is only A then the outputfile_1 = A.txt
When it is only B then the outputfile_2 = B.txt
When it is only C then the outputfile_3 = C.txt
When it is AB then the outputfile_4 = union (A.txt, B.txt)
...
so on
When it is ABC then the outputfile_7 = union (A.txt, B.txt, C.txt)
So for this above example if I run the code like : python3 myProgram.py 3 A B C then I am going to get 7 output files as output.
And if it is python3 myProgram.py 4 A B C D then I am going to get 15 output files as output.
To use the concept of Union, I am trying to use the logic:
with open("A.txt") as fin1: lines = set(fin1.readlines())
with open("B.txt") as fin2: lines.update(set(fin2.readlines()))
with open("outputfile_4.txt", 'w') as fout: fout.write('\n'.join(list(lines)))
But I am unable to understand how to merge these 2 things and get my desired outcome. Please help me out.

I think this is probably two separate questions. The first is how to get all of the non-empty combinations. @timus was on the right track there. To make their answer more complete:
Use a list comprehension to generate a list of itertools.combinations objects, one per combination size starting at 1.
Use a nested list comprehension to flatten it into a one-dimensional list of tuples.
matrix = [itertools.combinations(a_list, r) for r in range(1, len(a_list) + 1)]
combinations = [c for group in matrix for c in group]
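The same flattening can also be written with itertools.chain.from_iterable. The following is a small sketch of that alternative, not the answer's own code; nonempty_combinations is a hypothetical helper name.

```python
from itertools import chain, combinations

def nonempty_combinations(items):
    """Return every combination of `items` that has at least one element."""
    items = list(items)
    return list(chain.from_iterable(
        combinations(items, r) for r in range(1, len(items) + 1)  # r starts at 1, so no empty tuple
    ))

combos = nonempty_combinations(['A', 'B', 'C'])
print(len(combos))  # 7, i.e. 2**3 - 1
print(combos)
```

Starting the range at 1 instead of 0 is what drops the empty combination.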
The second question seems a bit less clear. I'm not sure if it's how to iterate the combinations, how to get filenames from the combinations, or something else. I've provided a sample implementation below (python3.6+).
import sys
import itertools
def union(files):
    lines = set()
    for file in files:
        with open(file) as fin:
            lines.update(fin.readlines())
    return lines
def main():
    number = int(sys.argv[1])
    a_list = sys.argv[2:number+2]
    matrix = [itertools.combinations(a_list, r) for r in range(1, len(a_list) + 1)]
    combinations = [c for group in matrix for c in group]
    for combination in combinations:
        filenames = [f'{name}.txt' for name in combination]
        output = f'{"".join(combination)}_output.txt'
        print(f'Writing union of {filenames} to {output}')
        with open(output, 'w') as fout:
            fout.writelines(union(filenames))
if __name__ == '__main__':
    main()

How can I print strings in a list by the number in a different list, on different lines depending on a third list?

For example, given:
On_A_Line = [2,2,3]
Lengths_Of_Lines = [5,2,4,3,2,3,2]
Characters = ['a','t','i','e','u','w','x']
I want it to print:
aaaaatt
iiiieee
uuwwwxx
So far I have tried:
iteration = 0
for number in Lengths_Of_Lines:
    s = Lengths_Of_Lines[iteration]*Characters[iteration]
    print(s, end = "")
    iteration += 1
which prints what I want without the line spacing:
aaaaattiiiieeeuuwwwxx
I just don't have the python knowledge to know what to do from there.

Solution using a generator and itertools:
import itertools
def repeat_across_lines(chars, repetitions, per_line):
    gen = ( c * r for c, r in zip(chars, repetitions) )
    return '\n'.join(
        ''.join(itertools.islice(gen, n))
        for n in per_line
    )
Example:
>>> repeat_across_lines(Characters, Lengths_Of_Lines, On_A_Line)
'aaaaatt\niiiieee\nuuwwwxx'
>>> print(_)
aaaaatt
iiiieee
uuwwwxx
The generator gen yields each character repeated the appropriate number of times. These are joined together n at a time with itertools.islice, where n comes from per_line. Those results are then joined with newline characters. Because gen is a generator, the next call to islice yields the next n of them that haven't been consumed yet, rather than the first n.
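The point about islice resuming where the previous call stopped can be checked in isolation. This is a minimal sketch with a hypothetical helper name (split_by_sizes) and made-up input:

```python
from itertools import islice

def split_by_sizes(iterable, sizes):
    """Cut `iterable` into consecutive lists whose lengths are given by `sizes`."""
    it = iter(iterable)  # one shared iterator, so each islice resumes where the last one stopped
    return [list(islice(it, n)) for n in sizes]

parts = split_by_sizes('abcdefg', [2, 2, 3])
print(parts)
# [['a', 'b'], ['c', 'd'], ['e', 'f', 'g']]
```

If a list were passed to islice directly instead of a shared iterator, every call would start again from the first element.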

You need to loop over the On_A_Line list. This tells you how many iterations of the inner loop to perform before printing a newline.
iteration = 0
for count in On_A_Line:
    for _ in range(count):
        s = Lengths_Of_Lines[iteration]*Characters[iteration]
        print(s, end = "")
        iteration += 1
    print("") # Print newline

Loop through multiple lists in specific intervals

I have two lists. One with names, and one with numbers that correspond with a name in the first list (corresponding name and number are at the same index point in each list). I need to reference each name and number in a url that can only handle 25 different names & points at a time.
pointNames = ['name1', 'name2', 'name3']
points = ['1', '2', '3'] #yes, the numbers are meant to be strings
My actual lists have roughly 600 values in each. What I'm trying to do is loop through each list at the same time, but in increments of 25. I'm able to do this with a single list using the following:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
for group in chunker(pointNames, 25):
    print (group)
This prints multiple groups of 25 values from the list until it has gone through the entire list. I want to do exactly this, but with two lists. I'm able to print each list entirely with for (point, name) in zip(points, pointNames): but I need it in groups of 25.
I've also tried combining the two lists into a dictionary:
dictionary = dict(zip(points, pointNames))
for group in chunker(dictionary, 25):
    print (group)
but I get the following error:
TypeError: unhashable type: 'slice'

A generator would be more efficient:
import itertools
def chunker(size, *seq):
    seq = zip(*seq)
    while True:
        val = list(itertools.islice(seq, size))
        if not val:
            break
        yield val
for group in chunker(2, pointNames, points):
    print(group)
gen_groups = chunker(2, pointNames, points, pointNames, points)
group = next(gen_groups)
print(group)
Using *seq allows you to give any number of lists as parameters.

How about this relatively minimal change to your first function:
def chunker(seq1, seq2, size):
    seq = list(zip(seq1, seq2))
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
Call as follows:
for group in chunker(pointNames, points, 25):
    print(group)
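As for the dictionary attempt in the question, the TypeError arises because a dict cannot be sliced. One way around it, sketched here with a hypothetical helper name (chunk_items), is to materialize the items as a list first. Note that a dict keeps only one entry per key, so this assumes the point values are unique.

```python
def chunk_items(mapping, size):
    """Yield lists of (key, value) pairs, with at most `size` pairs per list."""
    items = list(mapping.items())  # dicts do not support slicing, lists do
    for pos in range(0, len(items), size):
        yield items[pos:pos + size]

points = ['1', '2', '3']
pointNames = ['name1', 'name2', 'name3']
dictionary = dict(zip(points, pointNames))

groups = list(chunk_items(dictionary, 2))
for group in groups:
    print(group)
```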

Itertools can slice an iterator (or generator) into chunks, together with a small helper function to keep going until it is done:
import itertools
# helper function, https://stackoverflow.com/a/24527424
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield itertools.chain([first], itertools.islice(iterator, size - 1))
# 600 points and pointNames
points = (str(i) for i in range(600))
pointNames = ('name ' + str(i) for i in range(600))
# work with the chunks
for chunk in chunks(zip(pointNames, points), 25):
    print('-' * 40)
    print(list(chunk))

Iterate through lines changing only one character python

I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets, set 1 is N1-Cl5, 2 is N2-Cl10, and set three is N3 - end.
I want every combination of N's and Cl's. For example the first output would be
Cl1
N1
Cl2
then everything else stays the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code, but it won't do what I want because it wouldn't know that there are three individual data sets. Should I put the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1','Cl2','Cl3','Cl4', 'Cl5']
for line in file1:
    line.replace('N1', list1(0))
    list1.pop(0)
    print >> file.txt, line,
or is there a better way? Thanks in advance.

This should do the trick:
from itertools import permutations
def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # Skip blank lines.
            # Split the Nx / Clx label from the numbers.
            value = fields[0]
            if value in separators:
                # Found a new Nx. Print the permutations of the previous set.
                if cur_separator is not None:
                    for perm in permutations([cur_separator] + related_elements):
                        print(perm)
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the current set.
                related_elements.append(value)
    # Print the permutations of the last set, which no later Nx triggers.
    if cur_separator is not None:
        for perm in permutations([cur_separator] + related_elements):
            print(perm)

You could use regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re
n_pat = re.compile(r'N\d')
N_matches = []
# `sample` is assumed to hold the path of the input file.
with open(sample, 'r') as f:
    for num, line in enumerate(f):
        match = n_pat.match(line)
        if match:
            N_matches.append((num, match.group()))
>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools
first = N_matches[0][0]
final = N_matches[-1][0]
# Assumes the first N line is line 0 and every data set spans the same number of lines.
step = N_matches[1][0]
dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
    dataset.append(locallist)
    locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations to find permutations on various sublists of dataset:
for item in itertools.permutations(dataset[0]):
print(item)
etc.
Final note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items. Just 10 items (10!) already produce 3,628,800 permutations.
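A quick way to check these counts before generating anything is math.factorial. The sizes below are only illustrative: 6 matches one data set from the question (one N plus five Cl entries).

```python
import math

# The number of orderings of n distinct items is n!.
counts = {n: math.factorial(n) for n in (6, 10, 12)}
for n, count in counts.items():
    print(n, count)
# 6 720
# 10 3628800
# 12 479001600
```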
