Python: validate a generated powerset

I want to validate generated combinations based only on the data within "< >".
I have an Excel sheet containing all the possible combinations generated from the "<>" condition.
Below is a sample:
[<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[dress0]-C(D0)-lbr-]
[<Pen(x)>-C(A2)-C(60)-NULL-xy1-[dress0]-C(D0)-lbr-]
[NULL-C(A2)-C(60)-<jack(c)>-xy1-[dress0]-C(D0)-lbr-]
[NULL-C(A2)-C(60)-NULL-xy1-[dress0]-C(D0)-lbr-]
I want to check whether the generated combinations are valid.
For example, the original string from which the above list was generated is:
<Pen(x)>-C(A2)-C(60)--<jack(c)>-xy1-[address0]-C(D0)-lbr-
Kindly help me find a generic method to validate all the powersets generated based on <>.
To give a simple example:
I have the below list1.
[<A><B>-CAT-DOG]
[NULL-<B>-CAT-DOG]
[<A>-NULL-CAT-DOG]
[NULL-NULL-CAT-DOG]
list1 contains all possible combinations of:
<A><B>-CAT-DOG
I want to check whether the above list1 is valid.

We can build the desired combinations using itertools.product, which generates the Cartesian product of its iterable arguments. But first we need to split the input string into its components. We can do that by replacing each '-' with a space, padding each <...> item with spaces, and then calling the .split method.
We can then transform each string in the list returned by .split into a tuple. Items enclosed by < and > become 2-tuples containing the item and the 'NULL' string; all other items become 1-tuples.
from itertools import product
def make_powerset(base):
    # Add some spaces to make splitting easier
    s = base.replace('-', ' ').replace('<', ' <').replace('>', '> ')
    # Convert items enclosed in <> into 2-tuples and make other items 1-tuples
    elements = [(u, 'NULL') if u.startswith('<') else (u,) for u in s.split()]
    # Create all the subsets by finding the Cartesian product of all the tuples
    return {'-'.join(t).replace('>-<', '><') for t in product(*elements)}
# Tests
# Make a powerset from base
base = '<Pen(x)>-C(A2)-C(60)--<jack(c)>-xy1-[address0]-C(D0)-lbr-'
powerset = make_powerset(base)
for t in powerset:
    print(t)
print()
# Test if the following data are in the powerset
data = (
    '<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr-',
    '<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
    'NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr-',
    'NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
    '<Pen(y)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr-',
)
for s in data:
    print(s, s.rstrip('-') in powerset)
print('\n', '- ' * 20, '\n')
# Make another powerset
for t in make_powerset('<A><B>-CAT-DOG<C>'):
    print(t)
output
<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr
NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr
<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr
NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr
<Pen(x)>-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr- True
<Pen(x)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- True
NULL-C(A2)-C(60)-<jack(c)>-xy1-[address0]-C(D0)-lbr- True
NULL-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- True
<Pen(y)>-C(A2)-C(60)-NULL-xy1-[address0]-C(D0)-lbr- False
- - - - - - - - - - - - - - - - - - - -
NULL-NULL-CAT-DOG-NULL
NULL-<B>-CAT-DOG-NULL
<A>-NULL-CAT-DOG-<C>
<A>-NULL-CAT-DOG-NULL
NULL-NULL-CAT-DOG-<C>
<A><B>-CAT-DOG-<C>
NULL-<B>-CAT-DOG-<C>
<A><B>-CAT-DOG-NULL
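To validate a whole list at once, rather than testing rows one by one, the generated set can be compared with the candidate rows. The following is only a sketch: validate_combinations and normalise are hypothetical helper names, and it assumes each row may be wrapped in [ ] and may end with '-', as in the question's sample.

```python
from itertools import product


def make_powerset(base):
    """Every variant of base where each <...> item is kept or replaced by 'NULL'."""
    s = base.replace('-', ' ').replace('<', ' <').replace('>', '> ')
    elements = [(u, 'NULL') if u.startswith('<') else (u,) for u in s.split()]
    return {'-'.join(t).replace('>-<', '><') for t in product(*elements)}


def normalise(row):
    """Assumption: rows may carry a [ ] wrapper and a trailing '-', as in the sample."""
    row = row.strip()
    if row.startswith('[') and row.endswith(']'):
        row = row[1:-1]
    return row.rstrip('-')


def validate_combinations(base, candidates):
    """Hypothetical helper: return (missing, unexpected) rows as two sets.

    The candidate list is valid when both sets are empty.
    """
    expected = make_powerset(base)
    seen = {normalise(row) for row in candidates}
    return expected - seen, seen - expected


list1 = [
    '[<A><B>-CAT-DOG]',
    '[NULL-<B>-CAT-DOG]',
    '[<A>-NULL-CAT-DOG]',
    '[NULL-NULL-CAT-DOG]',
]
missing, unexpected = validate_combinations('<A><B>-CAT-DOG', list1)
print('valid' if not missing and not unexpected else 'invalid')
```

Returning the two difference sets instead of a bare boolean makes it easier to see which rows of the sheet are wrong or absent.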

Related

Python: Unify multiple lists into one

Could you help me with the following challenge I am currently facing:
I have multiple lists, each of which contains multiple strings. Each string has the following format:
"ID-Type" - where ID is a number and type is a common Python type. One such example can be found here:
["1-int", "2-double", "1-string", "5-list", "5-int"],
["3-string", "1-int", "1-double", "5-double", "5-string"]
Before calculating further, I want to preprocess these lists to unify them in the following way:
Count how often each type appears in each list
Generate a new list combining both results
Create a mapping from each initial list to that new list
As an example
In the above lists, we have the following types:
List 1: 2 int, 1 double, 1 string, 1 list
List 2: 2 string, 2 double, 1 int
The resulting table should now contain:
2 int, 2 double, 2 string, 1 list (in order to be able to contain both lists), like this:
[
"int_1-int",
"int_2-int",
"double_1-double",
"double_2-double",
"string_1-string",
"string_2-string",
"list_1-list"
]
And lastly, in order to map input to output, the idea is to have a corresponding dictionary to map this transformation, e.g., for list_1:
{
"1-int": "int_1-int",
"2-double": "double_1-double",
"1-string": "string_1-string",
"5-list": "list_1-list",
"5-int": "int_2-int"
}
I want to avoid doing this with nested loops and multiple iterations. Are there any libraries, or is there perhaps a smart vectorized solution, to address this challenge?
Just add them:
Example :
['it'] + ['was'] + ['annoying']
You should read the Python tutorial to learn basic info like this.
Just another method....
import itertools
ab = itertools.chain(['it'], ['was'], ['annoying'])
list(ab)
In general, this approach doesn't really make sense unless you specifically need to have the items in the resulting list and dict in this exact format. But here's how you can do it:
def process_type_list(type_list):
    mapping = dict()
    for i in type_list:
        i_type = i.split('-')[1]
        n_occur = 1
        map_val = f'{i_type}_{n_occur}-{i_type}'
        while map_val in mapping.values():
            n_occur += 1
            map_val = f'{i_type}_{n_occur}-{i_type}'
        mapping[i] = map_val
    return mapping
l1 = ["1-int", "2-double", "1-string", "5-list", "5-int"]
l2 = ["3-string", "1-int", "1-double", "5-double", "5-string"]
l1_mapping = process_type_list(l1)
l2_mapping = process_type_list(l2)
Additionally, Python does not have a double type. Python's float is usually implemented as a C double; use decimal.Decimal if you need fine control over the precision.
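The function above produces the per-list mappings but not the combined list from the question. One possible way to get both without rescanning mapping.values() on every item is collections.Counter. This is a sketch, not the answer's method; type_of, unify and mapping_for are made-up names.

```python
from collections import Counter


def type_of(item):
    # "5-int" -> "int"
    return item.split('-', 1)[1]


def unify(*type_lists):
    """Combined list holding, for each type, the largest count seen in any one list."""
    needed = Counter()
    for type_list in type_lists:
        # In-place union of Counters keeps the larger count for each key
        needed |= Counter(type_of(item) for item in type_list)
    return [f'{t}_{n}-{t}' for t, count in needed.items() for n in range(1, count + 1)]


def mapping_for(type_list):
    """Map each input string to its slot in the combined list."""
    seen = Counter()
    mapping = {}
    for item in type_list:
        t = type_of(item)
        seen[t] += 1
        mapping[item] = f'{t}_{seen[t]}-{t}'
    return mapping


l1 = ["1-int", "2-double", "1-string", "5-list", "5-int"]
l2 = ["3-string", "1-int", "1-double", "5-double", "5-string"]
combined = unify(l1, l2)
l1_mapping = mapping_for(l1)
```

As with the function above, an input list that contains the same string twice would keep only the last mapping for it.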
I am pretty sure that this is what you want to do:
To make a joint list:
['any item'] + ['any item 2']
If you want to turn the list into a dictionary:
dict(zip(['key 1', 'key 2'], ['value 1', 'value 2']))
Another method of joining 2 lists:
a = ['list item', 'another list item']
a.extend(['another list item', 'another list item'])

pyhamcrest - Compare two list

I've just started learning Python. I'm currently writing a unit test to assert that the elements in the expected list are present in the actual list:
def test_compare_list_of_items():
    actual_list_of_items = ['a','b']
    expected_list_of_items = ['a']
    assert_that(actual_list_of_items, has_item(has_items(expected_list_of_items)))
but I'm getting errors like
E Expected: a sequence containing (a sequence containing <['a']>)
E but: was <['a', 'b']>
Which sequence matcher should I use, and how, to assert that item 'a' in the expected list is present in the actual list?
You are using has_item when you should only be using has_items. According to the docs, has_items takes multiple matchers, which is what you want. Your function then becomes:
from hamcrest import assert_that, has_items
def test_compare_list_of_items():
    actual_list_of_items = ['a','b']
    expected_list_of_items = ['a']
    assert_that(actual_list_of_items, has_items(*expected_list_of_items))
We use iterable unpacking to pass the elements of the list as separate arguments, so the test should no longer error out when you run it.
I don't know the has_items function, but can you just use something like this? It checks that every expected item appears in the actual list:
assert all(item in actual_list_of_items for item in expected_list_of_items)
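The direction of the check matters: it is the expected items that must be found in the actual list, not the other way round. A small library-free sketch of that check follows; contains_all is a hypothetical helper, and it assumes the items are hashable.

```python
def contains_all(actual, expected):
    """Hypothetical helper: True when every expected item occurs in actual."""
    return set(expected).issubset(actual)


def test_compare_list_of_items():
    actual_list_of_items = ['a', 'b']
    expected_list_of_items = ['a']
    assert contains_all(actual_list_of_items, expected_list_of_items)


test_compare_list_of_items()
```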

Mapping Python for loops through an SPSS regression

I need to run two loops through my regression, one over the independent variable and the other over a suffix for the prediction I need to save with each round of independent variables. I can do either of these loops separately and it works fine, but not when I combine them in the same regression. I think this has something to do with the loop mapping at the end of my regression, after the %. I get the error "TypeError: list indices must be integers, not str", presumably because my dependent variables are read as strings to get the values from the SPSS data frame. Is there any way to map a for loop in a regression that includes string variables?
I have tried using the map() function, but I got an error saying that iteration is not supported.
begin program.
import spss,spssaux
dependent = ['dv1', 'dv2', 'dv3', 'dv4', 'dv5']
spssSyntax = ''
depList = spssaux.VariableDict(caseless = True).expand(dependent)
varSuffix = [1,2,3,4,5]
for dep in depList:
    for var in varSuffix:
        spssSyntax += '''
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT %(dep)s
  /METHOD=FORWARD iv1 iv2 iv3
  /SAVE PRED(PRE_%(var)d).
'''%(depList[dep],varSuffix[var])
end program.
I get the error code 'TypeError: list indices must be integers, not str'
with the code above. How do I map the loop while also including a string?
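In Python, when you loop directly over an iterable, the loop variable holds the current value, so there is no need to index the original lists with depList[dep] and varSuffix[var]; use the variables dep and var directly. Indexing a list with a string is what raises the TypeError. Note also that named placeholders such as %(dep)s expect a dictionary on the right-hand side of %, not a tuple.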
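The points about looping by value and indexing can be seen in isolation, without SPSS. The names below are placeholders standing in for the expanded variable list.

```python
dep_list = ['dv1', 'dv2']   # stand-ins for the expanded SPSS variable names
var_suffix = [1, 2]

# Looping by value: dep and var already hold the items themselves
pairs = [(dep, var) for dep in dep_list for var in var_suffix]

# Indexing a list with one of its own string items raises TypeError
try:
    dep_list['dv1']
    error = None
except TypeError as exc:
    error = str(exc)

# Named %-placeholders take a mapping on the right-hand side
line = '/DEPENDENT %(dep)s /SAVE PRED(PRE_%(var)d).' % {'dep': 'dv1', 'var': 1}

print(pairs)
print(error)
print(line)
```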
Additionally, consider str.format for string interpolation which is the Python 3 preferred method rather than the outmoded, de-emphasized (not yet deprecated) string modulo % operator:
for dep in depList:
    for var in varSuffix:
        spssSyntax += '''REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT {0}
  /METHOD=FORWARD iv1 iv2 iv3
  /SAVE PRED(PRE_{1}).
'''.format(dep, var)
Alternatively, consider combining the two lists into one loop using itertools.product, then use a list comprehension and join to build the string instead of concatenating on each iteration with +=:
from itertools import product
import spss,spssaux
dependent = ['dv1', 'dv2', 'dv3', 'dv4', 'dv5']
depList = spssaux.VariableDict(caseless = True).expand(dependent)
varSuffix = [1,2,3,4,5]
base_string = '''REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT {0}
  /METHOD=FORWARD iv1 iv2 iv3
  /SAVE PRED(PRE_{1}).
'''
# LIST COMPREHENSION UNPACKING TUPLES TO FORMAT BASE STRING
# JOIN RESULTING LIST WITH LINE BREAKS SEPARATING ITEMS
spssSyntax = "\n".join([base_string.format(*dep_var)
                        for dep_var in product(depList, varSuffix)])
Now, if you need to iterate over the two equal-length lists in parallel, element by element, consider zip instead of product:
spssSyntax = "\n".join([base_string.format(d,v)
                        for d,v in zip(depList, varSuffix)])
Or use enumerate for the index number:
spssSyntax = "\n".join([base_string.format(d,i+1)
                        for i,d in enumerate(depList)])
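The difference between the three variants is easiest to see on a shortened template. This sketch uses placeholder variable names and a one-line stand-in for the REGRESSION command, so it runs without SPSS.

```python
from itertools import product

dep_list = ['dv1', 'dv2', 'dv3']   # placeholder variable names
var_suffix = [1, 2, 3]
template = '/DEPENDENT {0} /SAVE PRED(PRE_{1}).'   # shortened stand-in

# product: every dependent variable paired with every suffix (3 x 3 commands)
all_pairs = [template.format(d, v) for d, v in product(dep_list, var_suffix)]

# zip: first with first, second with second, and so on (3 commands)
in_step = [template.format(d, v) for d, v in zip(dep_list, var_suffix)]

# enumerate: the suffix is derived from the position, no second list needed
numbered = [template.format(d, i) for i, d in enumerate(dep_list, start=1)]

print(len(all_pairs), len(in_step), len(numbered))
```

With product, each suffix is reused for every dependent variable, so the saved prediction names repeat; zip or enumerate gives one distinct suffix per dependent variable.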

Lists and indexes gives error because list out of range

I get an error when I try to run my function.
I know the reason, but I am looking for a way to fix it.
list2=['name','position','salary','bonus']
list3=['name','position','salary']
def funtionNew(list):
    print(len(list))
    po= '{} {} {} {}'.format(list[0],list[1],list[2],list[3])
    print(po)
funtionNew(list3)
I want it to build this for list2:
po='{}{}{}{}'.format(list[0],list[1],list[2],list[3])
and this for list3:
po='{}{}{}'.format(list[0],list[1],list[2])
From the function implementation, it seems you are trying to concatenate the list items with spaces in between, so you can try this instead:
po=' '.join(list)
This is independent of the list length; however, you have to make sure that all the items in the list are strings. To be safe, you can do the following:
po = ' '.join(str(s) for s in list)
Try the following:
def funtionNew(list):
    print(len(list))
    string_for_formatting = '{} ' * len(list)
    po = string_for_formatting.format(*list)
    print(po)
This creates a string with a variable number of {} terms according to the length of your list, and then calls format on it. The asterisk (*) unpacks the elements of your list as separate arguments to format. Note that the resulting string ends with a trailing space.
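Both suggestions can be compared side by side. In this sketch the parameter is called items rather than list, to avoid shadowing the built-in; join_items and format_items are illustrative names only.

```python
def join_items(items):
    # str() each item so that non-string values are accepted too
    return ' '.join(str(item) for item in items)


def format_items(items):
    # Build '{} {} ... {}' with one placeholder per item and no trailing space
    template = ' '.join(['{}'] * len(items))
    return template.format(*items)


list2 = ['name', 'position', 'salary', 'bonus']
list3 = ['name', 'position', 'salary']
print(join_items(list2))
print(format_items(list3))
```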

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify, my current code is below. It is used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements; set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also, if you wanted to compare 'xyz' to 'xyz2', you would use == rather than in, and it would then correctly return False.
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings. One option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
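The points above (substring test versus equality, set intersection, stripping the brackets, and unhashable lists) can be checked with a few lines. This sketch is written for Python 3 and uses a shortened subset of the species names from the question.

```python
list1 = ['xyz', 'xyz2', 'other_randoms']
list2 = ['xyz']

# 'in' on strings is a substring test, so 'xyz2' matches as well
substring_hits = [name for name in list1 if 'xyz' in name]
# == compares whole strings, so only the exact name matches
exact_hits = [name for name in list1 if name == 'xyz']
# Set intersection is also an exact, whole-string comparison
common = set(list2) & set(list1)

# Shortened subset of the bracketed species names from the question
bracketed = {'[Cyp26_G_R1]', '[Cyp26_G_rep1]', '[Cyp26]'}
headers = {'Cyp26_G_R1', 'Cyp26_G_rep1', 'Time'}
matched = headers & {name.strip('[]') for name in bracketed}

# A list cannot be a set member because it is not hashable
try:
    {['xyz']}
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(substring_hits, exact_hits, common, matched, list_is_hashable)
```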
