Python, adding two values from data input - python

Straight to the point:
my input:
d = {'Key1': [('aaaaa', '834M', '118G'),
('bbbbb', '216G', '220.3M')],
'Key2': [('ccccc', '790M', '162G'),
('ddddd', '203G', '34.8G'),
('eeeee', '126M', '112G')],
'Key3': [('fffff', '921G', '30.8M'),
('ggggg', '235G', '2.63G')]}
I have this so far and it works but only for G (Gb) values:
for p, vl in pools.items():
    alloc = ('{}G'.format(round(sum([float(j[1].split('G')[0]) for j in vl]))))
    free = ('{}G'.format(round(sum([float(j[2].split('G')[0]) for j in vl]))))
I need to add values accordingly:
from key1 aaaaa value 834M +
from key1 bbbbb value 216G
then
from key1 aaaaa value 118G +
from key1 bbbbb value 220.3M
and so on for every key
so the output will look like this:
216.8G 118.2G
and so on.

I'll refactor this a bit to make the lines stay within 80 characters long and to improve readability:
def split(j, i):
    # Return the size in gigabytes; 'M' values are divided by 1000
    if 'G' in j[i]:
        return float(j[i].split('G')[0])
    return float(j[i].split('M')[0]) / 1000
for p, vl in pools.items():
    alloc = '{}G'.format(round(sum(split(j, 1) for j in vl)))
    free = '{}G'.format(round(sum(split(j, 2) for j in vl)))
You could also write the split function as:
def split(j, i):
    suffix = j[i][-1]
    value = float(j[i][:-1])
    return value if suffix == 'G' else value / 1000
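Note that `round` in the loop above rounds to whole gigabytes, whereas the expected output in the question keeps one decimal place. Here is a minimal sketch that keeps one decimal, assuming decimal units (1G = 1000M) as in the `split` function above. The helper name `to_gb` and the `totals` dictionary are hypothetical names introduced only for this illustration:

```python
d = {'Key1': [('aaaaa', '834M', '118G'),
              ('bbbbb', '216G', '220.3M')],
     'Key2': [('ccccc', '790M', '162G'),
              ('ddddd', '203G', '34.8G'),
              ('eeeee', '126M', '112G')],
     'Key3': [('fffff', '921G', '30.8M'),
              ('ggggg', '235G', '2.63G')]}

def to_gb(size):
    """Convert a string such as '834M' or '118G' to gigabytes (assumes 1G = 1000M)."""
    value, suffix = float(size[:-1]), size[-1]
    return value if suffix == 'G' else value / 1000

totals = {}
for key, rows in d.items():
    alloc = sum(to_gb(row[1]) for row in rows)   # second column of each tuple
    free = sum(to_gb(row[2]) for row in rows)    # third column of each tuple
    totals[key] = ('{:.1f}G'.format(alloc), '{:.1f}G'.format(free))

print(totals['Key1'])  # ('216.8G', '118.2G')
```

Only the `G` and `M` suffixes are handled here; any other suffix would need its own branch.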

I suggest you code some functions to parse your input and to create the output the way you want.
A simple example could be:
import math
def toFloat(s):
    return float(s.replace("G","e9").replace("M","e6").replace("k","e3"))
def toPString(num):
    lv = math.log(num,1000)
    prefs = ['','k','M','G','T']  # '' covers values below 1000
    return "{:.1f}{}".format(num/(1000**int(lv)),prefs[int(lv)])
Then you can do:
for p, vl in d.items():
    alloc = toPString(sum([toFloat(l[1]) for l in vl]))
    free = toPString(sum([toFloat(l[2]) for l in vl]))
    print(p, alloc, free)
It gives me:
Key3 1.2T 2.7G
Key2 203.9G 308.8G
Key1 216.8G 118.2G
Hope that is what you are looking for.

Related

What is the cleanest way to iterate over multiple outcomes when order matters (and assign each to a unique key)?

Currently I am attempting to write to an empty page in Python, using Python's built-in dictionary. I am reading each line, and if the line doesn't exist in the dictionary, I add it to the dictionary.
However I do not have the time to manually write every possible combination, but it is very important I am able to assign every possible combination its own unique key.
To solve this problem I attempted something like this:
while x <= number_of_outcomes:
print("{0}\t{1}\t{2}\t{3}".format(value1, value2, value3, value4),file=outfile)
nvalue1 = value1
nvalue2 = value2
nvalue3 = value3
nvalue4 = value4
nvalue1 = 0
print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
while nvalue1 == 0:
###print("{0}\t{1}\t{2}\t{3}".format(value1, value2, value3, value4),file=outfile)
if nvalue4 == 2:
break
else:
for i in setoftwo:
###nvalue2 = nvalue2 + 1
###print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
mvalue1 = nvalue1
mvalue2 = nvalue2
mvalue3 = nvalue3
mvalue4 = nvalue4
if mvalue2 == 1:
for i in setoftwo:
mmvalue1 = mvalue1
mmvalue2 = mvalue2
mmvalue3 = mvalue3
mmvalue4 = mvalue4
if mmvalue3 == 1:
for i in setoftwo:
mmvalue4 = mmvalue4 + 1
print("{0}\t{1}\t{2}\t{3}".format(mmvalue1, mmvalue2, mmvalue3, mmvalue4),file=outfile)
mvalue3 = mvalue3 + 1
print("{0}\t{1}\t{2}\t{3}".format(mvalue1, mvalue2, mvalue3, mvalue4),file=outfile)
for i in setoftwo:
mvalue4 = mvalue4 + 1
print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
nvalue2 = nvalue2 + 1
print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
for i in setoftwo:
nvalue3 = nvalue3 + 1
print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
for i in setoftwo:
nvalue4 = nvalue4 + 1
print("{0}\t{1}\t{2}\t{3}".format(nvalue1, nvalue2, nvalue3, nvalue4),file=outfile)
I kept running into issues with the results printing the first value "0" for everything. I solved this by giving each nesting level its own "new version" of the value, then iterating that value.
However, at 4 values deep I realized there was no way I could possibly use this method for 256 values, at least not in my lifetime.
The goal is to end up with a dictionary where each key refers to a string of numbers/letters. Each character has only 1 of 3 options, and if the third option is chosen, no other character may have that option.
So at 20 numbers long, there may be a definition like "01010201111110100000" or "ABABACABBBBBBABAAAAA" and it would be unique and defined by 1 key.
At the simple length of 4 I began to notice I was missing values (like 0112). Using my method I could get 0221, but I would have to add an if statement after every assignment of 1, so that if the value is a 1, the next value is iterated, and if that is a 1, the one after it, and so on. That's already pretty wordy, and I knew right then that this couldn't work going forward.
This has me mind-blown, because I have a hand-drawn reference going out to 8 (around 20 pages) and decided maybe coding would be easier, but I am lost on how to easily move the "unique" value around so that for each unique value there is 0, 1 or a, b in every possible combo.
2 values would print out
CA (01)
CB (02)
AA (11)
AB (12)
BB (22)
BA (21)
BC (20)
AC (10)
3 values would look like
CAA (011)
CAB (012)
CBB (022)
CBA (021)
AAA (111)
ABB (122)
ABA (121)
AAB (112)
... ( so forth) ...
BAC (210)
AAC (110)
Finally, in a perfect world, if the key 11 or 111 or 1111 exists then the key 22, 222, or 2222 does not exist (or vice versa), as they are equivalent. In the worst-case scenario, I can manually remove those from the dictionary after generating it.
I don't know how long the key length will end up being, I was actually trying to predict that by using smaller lengths and python, as I was drawing it out in hopes to find a pattern, but began missing key sequences and having to erase half a page to put it in the correct place (to attempt to visually see some kind of pattern )
For future purposes being able to have 2 or 3 unique characters in a long strand would be amazing, so any suggestions that allow for future modifications to easily simulate how 2 or 3 unique characters play would be amazing.
I have a python text book from college in front of me, I have been up and down the search engines trying to see if some kind of probability and stat page would enlighten me, and I know my lack of experience is probably why I am failing to find the answer I am looking for, so if I missed an easy obvious solution , thank you for being kind enough to let me know.
I think that this solves your base problem (one unique value):
from itertools import product
base_values = ['A', 'B'] # = 'AB' is also possible
unique_value = 'C'
key_len = n
keys = {
    ''.join(key)
    for key in product(base_values, repeat=key_len)
}
keys.remove(base_values[-1] * key_len)
keys |= {
    ''.join(key[:i] + (unique_value,) + key[i:])
    for key in product(base_values, repeat=key_len-1)
    for i in range(key_len)
}
Result for key_len = 2
{'AA', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB'}
and for key_len = 3
{'AAA', 'AAB', 'AAC', 'ABA', 'ABB', 'ABC', 'ACA', 'ACB', 'BAA', 'BAB',
'BAC', 'BBA', 'BBC', 'BCA', 'BCB', 'CAA', 'CAB', 'CBA', 'CBB'}
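One way to cross-check these results is a brute-force filter: generate every string over the full alphabet, keep those that contain the unique value at most once, and drop the all-`B` key. This is only a sketch for small key lengths, since it enumerates 3**key_len candidates. The function name `brute_force_keys` is hypothetical and not part of the answer above:

```python
from itertools import product

def brute_force_keys(key_len, base_values='AB', unique_value='C'):
    """Enumerate all keys and keep those with at most one unique_value."""
    alphabet = list(base_values) + [unique_value]
    keys = {
        ''.join(key)
        for key in product(alphabet, repeat=key_len)
        if key.count(unique_value) <= 1
    }
    # The all-'B' key is treated as equivalent to the all-'A' key, so drop it.
    keys.discard(base_values[-1] * key_len)
    return keys

print(sorted(brute_force_keys(2)))
# ['AA', 'AB', 'AC', 'BA', 'BC', 'CA', 'CB']
print(len(brute_force_keys(3)))
# 19
```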
The more general question is a bit trickier. Here's a solution (at least I think it is one):
from itertools import product, combinations, combinations_with_replacement, permutations
base_values = ['A', 'B'] # = 'AB' is also possible
unique_values = ['C', 'D', 'E'] # = 'C...' is also possible
key_len = n
keys = set()
for unique_len in range(min(key_len, len(unique_values)) + 1):
    uniques = combinations(unique_values, unique_len)
    base = combinations_with_replacement(base_values, key_len - unique_len)
    keys |= {
        ''.join(key)
        for ukey, bkey in product(uniques, base)
        for key in permutations(ukey + bkey)
    }
keys.remove(base_values[-1] * key_len)
But: the use of permutations is suboptimal. It produces the same key multiple times whenever bkey contains repeated values, and those duplicates are only eliminated by the set building. I know how to solve this in a messy way. I'll update when I come up with a better solution (provided you're interested).
This version is much faster but unfortunately not super readable:
from itertools import product, combinations, permutations
base_values = ['A', 'B'] # = 'AB' is also possible
unique_values = ['C', 'D', 'E'] # = 'C...' is also possible
key_len = n
def positioning(u_pos, k_len):
    b_pos = sorted(set(range(k_len)).difference(u_pos))
    u_len = len(u_pos)
    return {
        **{p: i for i, p in enumerate(u_pos)},
        **{b_pos[j]: i for i, j in zip(range(u_len, k_len), range(k_len - u_len))}
    }
keys = set()
for u_len in range(min(key_len, len(unique_values)) + 1):
    u_keys = [
        u_key + b_key
        for u_key, b_key in product(
            combinations(unique_values, u_len),
            product(base_values, repeat=key_len-u_len)
        )
    ]
    for u_pos in permutations(range(key_len), u_len):
        pos = positioning(u_pos, key_len)
        keys |= {
            ''.join(key[pos[i]] for i in range(key_len))
            for key in u_keys
        }
keys.remove(base_values[-1] * key_len)

create all possible combinations with multiple variants from list

OK, so the problem is as follows:
Let's say I have a list like this: [12R,102A,102L,250L]. What I want is a list of all possible combinations, but with only one variant per number. So for the example above, the output I would like is:
[12R,102A,250L]
[12R,102L,250L]
My actual problem is a lot more complex, with many more sites. Thanks for your help.
Edit: after reading some comments I guess this is slightly unclear. I have 3 unique numbers here, [12, 102, and 250], and for some numbers I have different variations, for example [102A, 102L]. What I need is a way to combine the different positions [12,102,250] and all possible variations within them, just like the lists I presented above. They are the only valid solutions: [12R] is not, and neither is [12R,102A,102L,250L]. So far I have done this with nested loops, but I have a LOT of variation within these numbers, so I can't really do that anymore.
I'll edit this again: it seems there is still some confusion, so I will extend the point I made before. What I am dealing with is DNA. 12R means the 12th position in the sequence was changed to an R, so the solution [12R,102A,250L] means that the amino acid at position 12 is R, at 102 is A, and at 250 is L.
This is why a solution like [102L, 102R, 250L] is not usable: the same position cannot be occupied by 2 different amino acids.
Thank you.
You can use a recursive generator function:
from itertools import groupby as gb
import re
def combos(d, c = []):
    if not d:
        yield c
    else:
        for a, b in d[0]:
            yield from combos(d[1:], c + [a+b])
d = ['12R', '102A', '102L', '250L']
# Split each entry into [position, residue]; a raw string avoids invalid-escape warnings
vals = [re.findall(r'^\d+|\w+$', i) for i in d]
new_d = [list(b) for _, b in gb(sorted(vals, key=lambda x:x[0]), key=lambda x:x[0])]
print(list(combos(new_d)))
Output:
[['102A', '12R', '250L'], ['102L', '12R', '250L']]
So it works with ["10A","100B","12C","100R"] (case 1) and ['12R','102A','102L','250L'] (case 2)
import itertools as it
liste = ['12R','102A','102L','250L']
comb = []
for e in it.combinations(range(4), 3):
    e1 = liste[e[0]][:-1]
    e2 = liste[e[1]][:-1]
    e3 = liste[e[2]][:-1]
    if e1 != e2 and e2 != e3 and e3 != e1:
        comb.append([e1+liste[e[0]][-1], e2+liste[e[1]][-1], e3+liste[e[2]][-1]])
print(list(comb))
# case 1 : [['10A', '100B', '12C'], ['10A', '12C', '100R']]
# case 2 : [['12R', '102A', '250L'], ['12R', '102L', '250L']]
import re
def get_grouped_options(input):
    # Map each position (int) to the list of residues seen at that position
    options = {}
    for option in input:
        m = re.match(r'(\d+)([A-Z])$', option)
        if m:
            position = int(m.group(1))
            acid = m.group(2)
        else:
            continue
        if position not in options:
            options[position] = []
        options[position].append(acid)
    return options
def yield_all_combos(options):
    # Count through all index tuples like an odometer with mixed radix
    n = len(options)
    positions = list(options.keys())
    indices = [0] * n
    while True:
        yield ["{}{}".format(position, options[position][indices[i]])
               for i, position in enumerate(positions)]
        j = 0
        indices[j] += 1
        while indices[j] == len(options[positions[j]]):
            # carry
            indices[j] = 0
            j += 1
            if j == n:
                # overflow
                return
            indices[j] += 1
input = ['12R', '102A', '102L', '250L']
options = get_grouped_options(input)
for combo in yield_all_combos(options):
    print("[{}]".format(",".join(combo)))
Gives:
[12R,102A,250L]
[12R,102L,250L]
Try this:
from itertools import groupby
import re
def __genComb(arr, res=[]):
    for i in range(len(res), len(arr)):
        el=arr[i]
        if(len(el[1])==1):
            # build a new list rather than using +=, which would mutate the shared default argument
            res=res+el[1]
        else:
            for el_2 in el[1]:
                yield from __genComb(arr, res+[el_2])
            break
    if(len(res)==len(arr)): yield res
def genComb(arr):
    res=[(k, list(v)) for k,v in groupby(sorted(arr), key=lambda x: re.match(r"(\d*)", x).group(1))]
    yield from __genComb(res)
Sample output (using the input you provided):
test=["12R","102A","102L","250L"]
for el in genComb(test):
    print(el)
# returns:
['102A', '12R', '250L']
['102L', '12R', '250L']
I believe this is what you're looking for!
This works by
generating a collection of all the postfixes each prefix can have
finding the total count of positions (multiply the length of each sublist together)
rotating through each postfix by basing the read index off of both its member postfix position in the collection and the absolute result index (known location in final results)
import collections
import functools
import operator
import re
# initial input
starting_values = ["12R","102A","102L","250L"]
d = collections.defaultdict(list) # use a set if duplicates are possible
for value in starting_values:
    numeric, postfix = re.match(r"(\d+)(.*)", value).groups()
    d[numeric].append(postfix) # .* matches ""; consider (postfix or "_") to give value a size
# d is now a dictionary of lists where each key is the prefix
# and each value is a list of possible postfixes
# each set of postfixes multiplies the total combinations by its length
total_combinations = functools.reduce(
    operator.mul,
    (len(sublist) for sublist in d.values())
)
results = collections.defaultdict(list)
for results_pos in range(total_combinations):
    for index, (prefix, postfix_set) in enumerate(d.items()):
        results[results_pos].append(
            "{}{}".format( # recombine the values
                prefix, # numeric prefix
                postfix_set[(results_pos + index) % len(postfix_set)]
            ))
# results is now a dictionary mapping { result index: unique list }
displaying
# set width of column by longest prefix string
# need a collection for intermediate cols, but beyond scope of Q
col_width = max(len(str(k)) for k in results)
for k, v in results.items():
    print("{:<{w}}: {}".format(k, v, w=col_width))
0: ['12R', '102L', '250L']
1: ['12R', '102A', '250L']
with a more advanced input
["12R","102A","102L","250L","1234","1234A","1234C"]
0: ['12R', '102L', '250L', '1234']
1: ['12R', '102A', '250L', '1234A']
2: ['12R', '102L', '250L', '1234C']
3: ['12R', '102A', '250L', '1234']
4: ['12R', '102L', '250L', '1234A']
5: ['12R', '102A', '250L', '1234C']
You can confirm the values are indeed unique with a set
final = set(",".join(x) for x in results.values())
for f in final:
    print(f)
12R,102L,250L,1234
12R,102A,250L,1234A
12R,102L,250L,1234C
12R,102A,250L,1234
12R,102L,250L,1234A
12R,102A,250L,1234C
notes
In CPython, regexes are cached after their first compile
list member multiplier from "How can I multiply all items in a list together with Python?"
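As far as I can tell, the modular indexing above visits every combination only when the group sizes share no common factor (as with 2 and 3 in the larger example). With two positions that each have two variants it would repeat some combinations and skip others. A minimal alternative sketch uses `itertools.product`, which enumerates the Cartesian product directly. The function name `all_variants` is hypothetical, and the sketch assumes every entry starts with a numeric position:

```python
import re
from collections import defaultdict
from itertools import product

def all_variants(mutations):
    """Return every combination that picks exactly one variant per position."""
    groups = defaultdict(list)
    for m in mutations:
        position, residue = re.match(r"(\d+)(.*)", m).groups()
        groups[position].append(residue)
    # product(*lists) yields one residue per position, in first-seen order
    return [
        [position + residue for position, residue in zip(groups, choice)]
        for choice in product(*groups.values())
    ]

print(all_variants(["12R", "102A", "102L", "250L"]))
# [['12R', '102A', '250L'], ['12R', '102L', '250L']]
```

Because `product` is lazy, the list comprehension could be swapped for a generator if the number of combinations is too large to hold in memory.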

What is a good Pythonic way to handle complicated for loops with many functions?

I have a relatively complicated script that requires functions to be executed within a for loop, and in some cases the result of one function is passed into the next function. I can handle this relatively easily with a for loop, but the execution is significantly slower than with a list comprehension. I am not sure how to express this problem as a list comprehension. Is there a better, vectorized way to do this in Python? I am attaching an example that is significantly simpler than my actual problem, but I think it highlights the issue. Any thoughts would be appreciated.
def func1(i):
    return i + 1
def func2(j):
    return j + 2
def func3(k):
    return k + 3
class test:
    def __init__(self, one, two, three):
        self.one = one
        self.two = two
        self.three = three
if __name__ == "__main__":
    obj = []
    for i in range(10):
        if i !=3 and i != 7:
            value1 = func1(i)
            value2 = func2(i)
            value3 = func3(value2)
            one1 = value1 + value2
            two1 = value1 + value2 + value3
            three1 = value1 + value3
            obj.append(test(one1, two1, three1))
Just cram the inside of your loop into its own function.
def loop_interior(i):
    value1 = func1(i)
    value2 = func2(i)
    value3 = func3(value2)
    one1 = value1 + value2
    two1 = value1 + value2 + value3
    three1 = value1 + value3
    return test(one1, two1, three1)
Now the loop populating obj is short and sweet. You could even use a list comprehension if you like:
obj = [loop_interior(i) for i in range(10) if (i != 3 and i != 7)]
I dunno if this is quite what you're looking for, but you can do it slightly more elegantly in two lines, if you first create a comprehension for values and then populate your obj list.
values = [(func1(i), func2(i), func3(func2(i))) for i in range(10) if (i != 3 and i != 7)]
obj = [test(v[0]+v[1], v[0]+v[1]+v[2], v[0]+v[2]) for v in values]
The downside is more memory usage, since the values have to be kept in memory, and func2 is called twice per element rather than once as in your code above. If you use a generator expression instead of a list comprehension for values, the intermediate list is never built, which avoids that extra memory.
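A sketch of that generator-based variant, chaining two generator expressions so that `func2` is evaluated once per element and no intermediate list is built. The names `pairs` and `triples` are hypothetical; the functions and the `test` class are copied from the question:

```python
def func1(i):
    return i + 1

def func2(j):
    return j + 2

def func3(k):
    return k + 3

class test:
    def __init__(self, one, two, three):
        self.one = one
        self.two = two
        self.three = three

# Stage 1: lazily compute value1 and value2 for every index that is kept.
pairs = ((func1(i), func2(i)) for i in range(10) if i not in (3, 7))
# Stage 2: derive value3 from the already computed value2 (no second func2 call).
triples = ((v1, v2, func3(v2)) for v1, v2 in pairs)
# Stage 3: build the objects; only this final list is materialised.
obj = [test(v1 + v2, v1 + v2 + v3, v1 + v3) for v1, v2, v3 in triples]

print(len(obj), obj[0].one, obj[0].two, obj[0].three)  # 8 3 8 6
```

Whether this is actually faster than the plain loop depends on the real functions, so it is worth timing both on the actual workload.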

Loop through dictionary values and print subsequently

I am trying to print the following dictionary in a hierarchy format
fam_dict = {'6081740103':['60817401030000','60817401030100','60817401030200',
'60817401030300','60817401030400','60817401030500','60817401030600']}
as shown here:
60817401030000
60817401030100
60817401030200
60817401030400
60817401030500
60817401030600
So far I have the following code which works but I'm having to manually input the i'th index in each line. How can I readjust this code in a recursive format instead of having to count how many lines of code and manually put the index value each time
my_p = node(fam_dict['6081740103'][0], None)
my_c = node(fam_dict['6081740103'][1], my_p)
my_d = node(fam_dict['6081740103'][2], my_c)
my_e = node(fam_dict['6081740103'][4], my_d)
my_f = node(fam_dict['6081740103'][5], my_e)
my_g = node(fam_dict['6081740103'][6], my_f)
print (my_p.name)
print_children(my_p)
You can try this:
fam_dict = {'6081740103':['60817401030000','60817401030100','60817401030200',
'60817401030300','60817401030400','60817401030500','60817401030600']}
for i, val in enumerate(fam_dict['6081740103']):
    print(' ' * i * 4 + val)
Which outputs your desired hierarchy:
60817401030000
    60817401030100
        60817401030200
            60817401030300
                60817401030400
                    60817401030500
                        60817401030600
You can create a variable that counts the line you are currently on and increment it each time through the loop. You can then multiply '\t', the tab character, by that variable to control how many tabs are printed. Here is an example:
lines = 0
fam_dict = {'6081740103': ['60817401030000','60817401030100','60817401030200',
'60817401030300','60817401030400','60817401030500','60817401030600']}
for k, val in fam_dict.items():
    for v in val:
        lines += 1
        t = '\t'
        t = t * lines
        print(t + str(v))
Here is your output:
60817401030000
60817401030100
60817401030200
60817401030300
60817401030400
60817401030500
60817401030600
You can do it this way too.
for key in fam_dict.keys():
    for i in range(len(fam_dict[key])):
        print(i*"\t"+ fam_dict[key][i])
Here is an example:
fam_dict = {'6081740103':['60817401030000','60817401030100','60817401030200','60817401030300','60817401030400','60817401030500','60817401030600']}
for k, v in fam_dict.items():
    for i, s in enumerate(v):
        print("%s%s"% ("\t"*i, s))
In case you want to make nodes for it:
fam_dict = {'6081740103':['60817401030000','60817401030100','60817401030200','60817401030300','60817401030400','60817401030500','60817401030600']}
node_list = []
for k, v in fam_dict.items():
    last_parent = None
    for i, s in enumerate(v):
        print("%s%s"% ("\t"*i, s))
        # pass the individual name s (not the whole list v) to the node
        node_list.append(node(s, last_parent))
        last_parent=node_list[-1]
The parent node will be node_list[0].
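The question does not show its `node` class, so here is a self-contained sketch of the same chaining idea with a hypothetical stand-in class `Node` (a name and a parent reference are assumed to be all it needs) and a hypothetical helper `depth`:

```python
class Node:
    """Hypothetical stand-in for the question's `node` class."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def depth(n):
    """Count how many ancestors a node has by walking up the parent links."""
    count = 0
    while n.parent is not None:
        n = n.parent
        count += 1
    return count

fam_dict = {'6081740103': ['60817401030000', '60817401030100', '60817401030200',
                           '60817401030300', '60817401030400', '60817401030500',
                           '60817401030600']}

node_list = []
for key, names in fam_dict.items():
    last_parent = None
    for name in names:
        # Each new node becomes the parent of the next one in the list.
        last_parent = Node(name, last_parent)
        node_list.append(last_parent)

for n in node_list:
    print('\t' * depth(n) + n.name)
```

Here the indentation is derived from the node structure itself rather than from the loop index, so it stays correct if the chain is later built in a different order.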
Try this:
fam_dict = {'6081740103':['60817401030000','60817401030100','60817401030200',
'60817401030300','60817401030400','60817401030500','60817401030600']}
l = fam_dict['6081740103']
for i in l:
    print(' '*l.index(i)*4+i)
Output:
60817401030000
    60817401030100
        60817401030200
            60817401030300
                60817401030400
                    60817401030500
                        60817401030600

Finding motif in a sequence of characters

I also have a dictionary in which the keys are ids and the values are long sequences made up not only of K and M but also of other characters that are not important to me.
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
'id2': "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
I want to find motifs of the form "KMKMKM" with a length of at least 6. The length can be even or odd, as long as it is 6 or more. The result should be a dictionary with the same keys, but instead of the whole sequence, each value must be the list of motifs, as in the following example:
results = {'id1': ["KMKMKMK"], 'id2': ["KMKMKMKMKM", "KMKMKM"]}
I have written this code, but it does not return the motifs I am interested in:
{k: re.findall(r'(?:KM){6,1000}', v) for k, v in li.items()}
This one does the job:
((?:KM){3,}K?)
Explanation:
( : group 1
(?:KM){3,} : non capture group, 3 or more times KM
K? : optional K
) : end group 1
In action:
import re
li = {'id1': "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM",
'id2':"MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"}
res = {k: re.findall(r'((?:KM){3,}K?)', v) for k, v in li.items()}
print(res)
Output:
{'id2': ['KMKMKMKMKM', 'KMKMKM'], 'id1': ['KMKMKMK']}
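To see why the pattern from the question finds nothing, note that `{6,1000}` counts repetitions of the two-character group `KM`, so it demands at least 12 characters, while a motif of length 6 is only 3 repetitions. A small sketch contrasting the two patterns on the second sequence (the variable names `too_strict` and `fixed` are hypothetical):

```python
import re

seq = "MMKFJDFKFGKJKMKMKMKMKMJKJHFKMKMKM"

# At least 6 repetitions of "KM" means at least 12 characters: no match here.
too_strict = re.findall(r'(?:KM){6,1000}', seq)
# At least 3 repetitions (6 characters), plus an optional trailing K for odd lengths.
fixed = re.findall(r'(?:KM){3,}K?', seq)

print(too_strict)  # []
print(fixed)       # ['KMKMKMKMKM', 'KMKMKM']
```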
Is this what you are looking for:
import re
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
m = re.search(motifs, stringA)
if m:
    print(motifs)
In reply to your comment below:
stringA = "KKMKMKMKJASGKKKMOOGBMMMMMMMMMMMMMMMMMM"
motifs = "KMKMKM"
i = 0
while True:
    seq = stringA[i:]
    i = i + 1
    if seq.startswith(motifs):
        print(seq)
    if len(stringA) == i:
        break
