how to print dict values based on key containing delimiters - python

Actually i have a dict
x1={'b;0':'A1;B2;C3','b;1':'aa1;aa2;aa3','a;1': 'a1;a2;a3', 'a;0': 'A;B;C'}
Actually here my convention is 'a;0','b;0' will contain tags and 'a;1','b;1' will have corresponding values, based on this i have to group and print.
From this dict what output i want is
<a> #this is group name
<A>a1</A> # this are tags n values
<B>a2</B>
<C>a3</C>
</a>
<b>
<A1>aa1</A1>
<B2>aa2</B2>
<C1>aa3</C1>
</b>
This is the sample dict which i given like this many groups may come like c;0:.... d;0.....
I am using code like
a=[]
b=[]
c=[]
d=[]
e=[]
for k,v in x1.iteritems():
if k.split(";").count('0')==1: # i am using this bcoz a;0,b;0 contains tag so i am checking if they contain zero split it.
a=k.split(";") # this contains a=['a','0','b','0']
b=v.split(";") # this contains 'a;0','b;0' values
else:
c=v.split(";") # this contains 'a;1','b;1' values
for i in range(0,len(b)):
d=b[i]
e=c[i]
print "<%s>%s<%s>"%(c,e,c)
Actually this code is working only 50% when single group is their in
dict('a;1': 'a1;a2;a3', 'a;0': 'A;B;C') and when multiple groups r their in
dict ('b;0':'A1;B2;C3','b;1':'aa1;aa2;aa3','a;1': 'a1;a2;a3', 'a;0': 'A;B;C')
in both cases it prints
aa1
aa2
aa3
its printing only recent value not all values

Be aware: dictionaries have no order. So the iteritems() loop does not necessarily start with 'b;0'. Try for example
for k,v in x1.iteritems():
print k
to see. On my computer it gives
a;1
a;0
b;0
b;1
This gives a problem since your code assumes the keys to come in the order they appear in the definition of x1 [edit: or rather that they come in order]. You can e.g. iterate over sorted keys instead:
for k in sorted(x1.keys()):
v = x1[k]
print k, v
Then the problem with the order is solved. But I think you have more problems in your code.
Edit: Data structures:
it might be better to store your data in some way like
x1 = {'a': [('A','a1'),('B','a2'),('C','a3')], 'b': ... }
if you cannot change the format, this is how you could convert your data:
x1f = {}
for k in x1.iterkeys():
tag, id = k.split(';')
if int(id) == 0:
x1f[tag] = zip(x1[k].split(';'), x1[tag+';'+'1'].split(';'))
print x1f
From there it should be easier to convert to the desired output.
And depending if you want extend the complexity of the output in future,
you might want to consider using pyxml:
from xml.dom import minidom
doc = minidom.Document()
then you can use the createElement and appendChild methods.

Related

Dynamically building a dictionary based on variables

I am trying to build a dictionary based on a larger input of text. From this input, I will create nested dictionaries which will need to be updated as the program runs. The structure ideally looks like this:
nodes = {}
node_name: {
inc_name: inc_capacity,
inc_name: inc_capacity,
inc_name: inc_capacity,
}
Because of the nature of this input, I would like to use variables to dynamically create dictionary keys (or access them if they already exist). But I get KeyError if the key doesn't already exist. I assume I could do a try/except, but was wondering if there was a 'cleaner' way to do this in python. The next best solution I found is illustrated below:
test_dict = {}
inc_color = 'light blue'
inc_cap = 2
test_dict[f'{inc_color}'] = inc_cap
# test_dict returns >>> {'light blue': 2}
Try this code, for Large Scale input. For example file input
Lemme give you an example for what I am aiming for, and I think, this what you want.
File.txt
Person1: 115.5
Person2: 128.87
Person3: 827.43
Person4:'18.9
Numerical Validation Function
def is_number(a):
try:
float (a)
except ValueError:
return False
else:
return True
Code for dictionary File.txt
adict = {}
with open("File.txt") as data:
adict = {line[:line.index(':')]: line[line.index(':')+1: ].strip(' \n') for line in data.readlines() if is_number(line[line.index(':')+1: ].strip('\n')) == True}
print(adict)
Output
{'Person1': '115.5', 'Person2': '128.87', 'Person3': '827.43'}
For more explanation, please follow this issue solution How to fix the errors in my code for making a dictionary from a file
As already mentioned in the comments sections, you can use setdefault.
Here's how I will implement it.
Assume I want to add values to dict : node_name and I have the keys and values in two lists. Keys are in inc_names and values are in inc_ccity. Then I will use the below code to load them. Note that inc_name2 key exists twice in the key list. So the second occurrence of it will be ignored from entry into the dictionary.
node_name = {}
inc_names = ['inc_name1','inc_name2','inc_name3','inc_name2']
inc_ccity = ['inc_capacity1','inc_capacity2','inc_capacity3','inc_capacity4']
for i,names in enumerate(inc_names):
node = node_name.setdefault(names, inc_ccity[i])
if node != inc_ccity[i]:
print ('Key=',names,'already exists with value',node, '. New value=', inc_ccity[i], 'skipped')
print ('\nThe final list of values in the dict node_name are :')
print (node_name)
The output of this will be:
Key= inc_name2 already exists with value inc_capacity2 . New value= inc_capacity4 skipped
The final list of values in the dict node_name are :
{'inc_name1': 'inc_capacity1', 'inc_name2': 'inc_capacity2', 'inc_name3': 'inc_capacity3'}
This way you can add values into a dictionary using variables.

Remove duplicate data from an array in python

I have this array of data
data = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
I remove the duplicate data with list(set(data)), which gives me
data = [20001202.05, 20001202.50, 20001215.75, 20021215.75]
But I would like to remove the duplicate data, based on the numbers before the "period"; for instance, if there is 20001202.05 and 20001202.50, I want to keep one of them in my array.
As you don't care about the order of the items you keep, you could do:
>>> {int(d):d for d in data}.values()
[20001202.5, 20021215.75, 20001215.75]
If you would like to keep the lowest item, I can't think of a one-liner.
Here is a basic example for anybody who would like to add a condition on the key or value to keep.
seen = set()
result = []
for item in sorted(data):
key = int(item) # or whatever condition
if key not in seen:
result.append(item)
seen.add(key)
Generically, with python 3.7+, because dictionaries maintain order, you can do this, even when order matters:
data = {d:None for d in data}.keys()
However for OP's original problem, OP wants to de-dup based on the integer value, not the raw number, so see the top voted answer. But generically, this will work to remove true duplicates.
data1 = [20001202.05, 20001202.05, 20001202.50, 20001215.75, 20021215.75]
for i in data1:
if i not in ls:
ls.append(i)
print ls

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.

Pandas Dataframe to Dictionary with Multiple Keys

I am currently working with a dataframe consisting of a column of 13 letter strings ('13mer') paired with ID codes ('Accession') as such:
However, I would like to create a dictionary in which the Accession codes are the keys with values being the 13mers associated with the accession so that it looks as follows:
{'JO2176': ['IGY....', 'QLG...', 'ESS...', ...],
'CYO21709': ['IGY...', 'TVL...',.............],
...}
Which I've accomplished using this code:
Accession_13mers = {}
for group in grouped:
Accession_13mers[group[0]] = []
for item in group[1].iteritems():
Accession_13mers[group[0]].append(item[1])
However, now I would like to go back through and iterate through the keys for each Accession code and run a function I've defined as find_match_position(reference_sequence, 13mer) which finds the 13mer in in a reference sequence and returns its position. I would then like to append the position as a value for the 13mer which will be the key.
If anyone has any ideas for how I can expedite this process that would be extremely helpful.
Thanks,
Justin
I would suggest creating a new dictionary, whose values are another dictionary. Essentially a nested dictionary.
position_nmers = {}
for key in H1_Access_13mers:
position_nmers[key] = {} # replicate key, val in new dictionary, as a dictionary
for value in H1_Access_13mers[key]:
position_nmers[key][value] = # do something
To introspect the dictionary and make sure it's okay:
print position_nmers
You can iterate over the groupby more cleanly by unpacking:
d = {}
for key, s in df.groupby('Accession')['13mer']:
d[key] = list(s)
This also makes it much clearer where you should put your function!
... However, I think that it might be better suited to an enumerate:
d2 = {}
for pos, val in enumerate(df['13mer']):
d2[val] = pos

Urlencode dictionary using Python - naming key and value in the url

I am attempting to generate a URL link in the following format using urllib and urlencode.
<img src=page.psp?KEY=%28SpecA%2CSpecB%29&VALUE=1&KEY=%28SpecA%2C%28SpecB%2CSpecC%29%29&VALUE=2>
I'm trying to use data from my dictionary to input into the urllib.urlencode() function however, I need to get it into a format where the keys and values have a variable name, like below. So the keys from my dictionary will = NODE and values will = VALUE.
wanted = urllib.urlencode( [("KEY",v1),("VALUE",v2)] )
req.write( "<a href=page.psp?%s>" % (s) );
The problem I am having is that I want the URL as above and instead I am getting what is below, rather than KEY=(SpecA,SpecB) NODE=1, KEY=(SpecA,SpecB,SpecC) NODE=2 which is what I want.
KEY=%28SpecA%2CSpecB%29%2C%28%28SpecA%2CSpecB%29%2CSpecC%29&VALUE=1%2C2
So far I have extracted keys and values from the dictionary, extracted into tuples, lists, strings and also tried dict.items() but it hasn't helped much as I still can't get it to go into the format I want. Also I am doing this using Python server pages which is why I keep having to print things as a string due to constant string errors. This is part of what I have so far:
k = (str(dict))
ver1 = dict.keys()
ver2 = dict.values()
new = urllib.urlencode(function)
f = urllib.urlopen("page.psp?%s" % new)
I am wondering what I need to change in terms of extracting values from the dictionary/converting them to different formats in order to get the output I want? Any help would be appreciated and I can add more of my code (as messy as it has become) if need be. Thanks.
This should give you the format you want:
data = {
'(SpecA,SpecB)': 1,
'(SpecA,SpecB,SpecC)': 2,
}
params = []
for k,v in data.iteritems():
params.append(('KEY', k))
params.append(('VALUE', v))
new = urllib.urlencode(params)
Note that the KEY/VALUE pairings may not be the order you want, given that dicts are unordered.

Categories

Resources