use slice in for loop to build a list - python

I would like to build up a list using a for loop and am trying to use a slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i=0
for record in NCBIXML.parse(open(knowns)):
    known_results[i] = record.query_id
    known_results[i][1] = (align.title for align in record.alignment)
    i+=1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
thanks zach cp
(crossposted at Biostar)

You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
should probably be changed to
element=(record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
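The append approach can be tried without an XML file. The following is a minimal sketch, assuming hypothetical `Record` and `Alignment` stand-ins for the objects that `NCBIXML.parse` yields; only the `query_id`, `alignment` and `title` attribute names are taken from the question.

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed BLAST records (not the BioPython classes).
Alignment = namedtuple("Alignment", ["title"])
Record = namedtuple("Record", ["query_id", "alignment"])

records = [
    Record("query_1", [Alignment("hit A"), Alignment("hit B")]),
    Record("query_2", [Alignment("hit C")]),
]

known_results = []  # start empty; append() grows the list as needed
for record in records:
    titles = tuple(align.title for align in record.alignment)
    known_results.append((record.query_id, titles))

print(known_results)
```

No index counter is needed, because each element is added at the end of the list.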

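Use:
known_results = []
for record in NCBIXML.parse(open(knowns)):
    # append a mutable [query_id, titles] pair; assigning past the end of the list or into a tuple would fail
    known_results.append([record.query_id, None])
    known_results[-1][1] = tuple(align.title for align in record.alignment)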

If I understand you correctly, you want to assign one or more matching align.title values to every record.query_id. So I guess your query_ids are unique and those unique ids are related to some titles. If so, I would suggest a dictionary instead of a list.
A dictionary consists of a key (e.g. record.query_id) and value(s) (e.g. a list of align.title)
catalog = {}
for record in NCBIXML.parse(open(knowns)):
    catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through it:
for query_id in catalog:
    print(catalog[query_id]) # prints the title list for the current key
or you could access entries directly if you know what you're looking for.
query_id = XYZ_Whatever
print(catalog[query_id])
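As a small illustration of working with such a dictionary, here is a sketch that assumes a hand-written `catalog` with made-up ids and titles. It shows iterating over keys and values together, and using `.get()` so that an unknown id does not raise a KeyError.

```python
# Made-up example data in the same shape as the catalog above.
catalog = {
    "query_1": ["hit A", "hit B"],
    "query_2": ["hit C"],
}

# Iterate over keys and values together.
for query_id, titles in catalog.items():
    print(query_id, titles)

# .get() returns the given default instead of raising KeyError for an unknown key.
missing = catalog.get("query_3", [])
print(missing)
```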

Related

How to iterate to create variables in a list

Suppose I have the following code:
classifiers_name_all = [('AdaBoostClassifier', AdaBoostClassifier(), 'AdaBoost'),
                        ('BernoulliNB', BernoulliNB(), 'Bernoulli Naive Bayes'),
                        ('DummyClassifier', DummyClassifier(), 'Dummy Classifier')]
clf_values = []
for clf_na in classifiers_name_all:
    clf_values.append((locals()['score_'+clf_na[0]+'_mean'], locals()['score_'+clf_na[0]+'_stddev']))
clf_values
The code above doesn't quite work.
I want to get a list which contains the variables:
clf_values = [(score_AdaBoostClassifier_mean, score_AdaBoostClassifier_stddev),
              (score_BernoulliNB_mean, score_BernoulliNB_stddev),
              (score_DummyClassifier_mean, score_DummyClassifier_stddev)]
How do I do this? Many thanks.
From the information you have given so far, I infer that there are no KeyErrors and that the resulting list contains only None values.
This can only mean that your code works but the variables you are trying to access have None assigned to them. Find out why they are None; once that is fixed, the list will contain the desired values.
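A different technique from the `locals()` lookup in the question is to keep the scores in a dictionary keyed by classifier name, so that no variable names need to be built from strings. The sketch below assumes a hypothetical `scores` dictionary with made-up numbers; in real code the values would come from wherever the means and standard deviations are computed.

```python
# Hypothetical scores with made-up values, keyed by classifier name.
scores = {
    "AdaBoostClassifier": {"mean": 0.91, "stddev": 0.02},
    "BernoulliNB": {"mean": 0.84, "stddev": 0.03},
    "DummyClassifier": {"mean": 0.50, "stddev": 0.01},
}
classifier_names = ["AdaBoostClassifier", "BernoulliNB", "DummyClassifier"]

# Build the list of (mean, stddev) pairs in the order of classifier_names.
clf_values = [(scores[name]["mean"], scores[name]["stddev"])
              for name in classifier_names]
print(clf_values)
```

A missing classifier then shows up as a KeyError at the lookup rather than as a silent None.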

Parsing and arranging text in python

I'm having some trouble figuring out the best implementation.
I have data in a file in this format:
|serial #|machine_owner|machine_name|
If a machine_owner has multiple machines, I'd like the machines displayed in a comma-separated list in the field, so that:
|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble|mach3|
|3838|Barney Rubble|mach4|
|1212|Betty Rubble|mach5|
Looks like this:
|Fred Flinstone|mach1|
|Barney Rubble|mach2,mach3,mach4|
|Betty Rubble|mach5|
Any hints on how to approach this would be appreciated.
You can use dict as temporary container to group by name and then print it in desired format:
import re
s = """|1234|Fred Flinstone|mach1|
|5678|Barney Rubble|mach2|
|1313|Barney Rubble||mach3|
|3838|Barney Rubble||mach4|
|1212|Betty Rubble|mach5|"""
results = {}
for line in s.splitlines():
    _, name, mach = re.split(r"\|+", line.strip("|"))
    if name in results:
        results[name].append(mach)
    else:
        results[name] = [mach]
for name, mach in results.items():
    print(f"|{name}|{','.join(mach)}|")
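The same grouping can be written with `collections.defaultdict`, which removes the `if name in results` branch. This is a sketch under the assumption that every line has exactly three fields separated by single pipes, as in the question's sample; the variable names are illustrative.

```python
from collections import defaultdict

lines = [
    "|1234|Fred Flinstone|mach1|",
    "|5678|Barney Rubble|mach2|",
    "|1313|Barney Rubble|mach3|",
    "|3838|Barney Rubble|mach4|",
    "|1212|Betty Rubble|mach5|",
]

# A missing owner automatically starts with an empty list.
machines_by_owner = defaultdict(list)
for line in lines:
    _, owner, machine = line.strip("|").split("|")
    machines_by_owner[owner].append(machine)

output = ["|{}|{}|".format(owner, ",".join(machines))
          for owner, machines in machines_by_owner.items()]
print("\n".join(output))
```

On Python 3.7 and later, dictionaries keep insertion order, so owners appear in the order they were first seen.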
You need to store all the machine names in a list. Each time you are about to append a machine name, check that it is not already in the list, so that it is not added twice.
After storing them in a list called data, iterate over the names and call:
data[i].append([])
to add an empty list after the machine name stored at index i.
Once you're done, iterate over the names, find them in the file, and append the owner.
All of this can be done in two steps.

Regular expressions matching words which contain the pattern but also the pattern plus something else

I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
    if all_keys[i]!='Time':
        #print all_keys[i]
        pattern = re.compile(all_keys[i])
        for j in range(len(specie_name_and_initial_values)):
            print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify: my current code is below. It's used within a class/method-like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
    copasi_tool = MineParamEstTools()
    data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
    #uses custom class and method to get the list of lists from a file
    specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
    if time=='minutes':
        data['Time']=data['Time']*60
    elif time=='hour':
        data['Time']=data['Time']*3600
    elif time=='seconds':
        print 'Time is already in seconds.'
    else:
        print 'Not a valid time unit'
    all_keys = list(data.keys())
    species=[]
    for i in range(len(specie_name_and_initial_values)):
        species.append(specie_name_and_initial_values[i][0])
    for i in range(len(all_keys)):
        for j in range(len(specie_name_and_initial_values)):
            if all_keys[i] in species[j]:
                print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
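To make the difference concrete, here is a short sketch comparing a substring test (`in`), an equality test (`==`) and an anchored regular expression (`re.fullmatch`, available from Python 3.4) on the lists from the question. The variable names are illustrative.

```python
import re

key = "xyz"
candidates = ["xyz", "xyz2", "other_randoms"]

# Substring test: 'xyz' is contained in 'xyz2', so both match.
substring_hits = [c for c in candidates if key in c]

# Equality test: only the identical string matches.
exact_hits = [c for c in candidates if key == c]

# Anchored regex: fullmatch requires the whole string to match;
# re.escape guards against regex metacharacters in the key.
pattern = re.compile(re.escape(key))
regex_hits = [c for c in candidates if pattern.fullmatch(c)]

print(substring_hits, exact_hits, regex_hits)
```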
You can also rewrite your own code a lot more succinctly:
for key in data:
    if key != 'Time':
        pattern = re.compile(key)
        for name, _ in specie_name_and_initial_values:
            print re.findall(pattern, name)
Based on your edit, you have somehow managed to turn lists into strings; one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
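For the stated goal of looking up each header's initial value and multiplying the column by it, a dictionary keyed by the stripped species name gives exact matching for free. The following is a rough sketch with made-up numbers, and it uses a plain dict of lists as a stand-in for the pandas table; the names `initial_values` and `scaled` are assumptions for illustration.

```python
# Made-up values; names follow the bracketed form shown in the question.
specie_name_and_initial_values = [
    ["[Cyp26_G_R1]", 2.0],
    ["[Cyp26_G_rep1]", 0.5],
    ["[Cyp26_G_R1R2]", 10.0],
]
# Plain-dict stand-in for the data table (column name -> column values).
data = {
    "Time": [0, 60],
    "Cyp26_G_R1": [1.0, 3.0],
    "Cyp26_G_rep1": [4.0, 8.0],
}

# Strip the brackets once and map each name to its initial value.
initial_values = {name.strip("[]"): value
                  for name, value in specie_name_and_initial_values}

scaled = {}
for key, column in data.items():
    if key == "Time":
        scaled[key] = list(column)      # leave the time column unchanged
    elif key in initial_values:         # dict membership is an exact key match
        scaled[key] = [x * initial_values[key] for x in column]

print(scaled)
```

Because the lookup is by key, 'Cyp26_G_R1' does not pick up the value stored for 'Cyp26_G_R1R2'.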

Splitting json data in python

I'm trying to manipulate a list of items in Python but I'm getting the error "AttributeError: 'list' object has no attribute 'split'"
I understand that a list does not support .split, but I don't know what else to do. Below is a copy-paste of the relevant part of my code.
tourl = 'http://data.bitcoinity.org/chart_data'
tovalues = {'timespan':'24h','resolution':'hour','currency':'USD','exchange':'all','mining_pool':'all','compare':'no','data_type':'price_volume','chart_type':'line_bar','smoothing':'linear','chart_types':'ccacdfcdaa'}
todata = urllib.urlencode(tovalues)
toreq = urllib2.Request(tourl, todata)
tores = urllib2.urlopen(toreq)
tores2 = tores.read()
tos = json.loads(tores2)
tola = tos["data"]
for item in tola:
    ting = item.get("values")
    ting.split(',')[2]  # <----- ERROR
    print(ting)
To understand what I'm trying to do you will also need to see the JSON data. ting outputs this:
[
[1379955600000L, 123.107310846774], [1379959200000L, 124.092526428571],
[1379962800000L, 125.539504822835], [1379966400000L, 126.27024617931],
[1379970000000L, 126.723474983766], [1379973600000L, 126.242406356837],
[1379977200000L, 124.788410570987], [1379980800000L, 126.810084904632],
[1379984400000L, 128.270580796748], [1379988000000L, 127.892411269036],
[1379991600000L, 126.140579640523], [1379995200000L, 126.513705084746],
[1379998800000L, 128.695124951923], [1380002400000L, 128.709738051044],
[1380006000000L, 125.987767097378], [1380009600000L, 124.323433535528],
[1380013200000L, 123.359378559603], [1380016800000L, 125.963250678733],
[1380020400000L, 125.074618194444], [1380024000000L, 124.656345088853],
[1380027600000L, 122.411303435449], [1380031200000L, 124.145747100372],
[1380034800000L, 124.359452274881], [1380038400000L, 122.815357211394],
[1380042000000L, 123.057706915888]
]
[
[1379955600000L, 536.4739135], [1379959200000L, 1235.42506637],
[1379962800000L, 763.16329656], [1379966400000L, 804.04579319],
[1379970000000L, 634.84689741], [1379973600000L, 753.52716718],
[1379977200000L, 506.90632968], [1379980800000L, 494.473732950001],
[1379984400000L, 437.02095093], [1379988000000L, 176.25405034],
[1379991600000L, 319.80432715], [1379995200000L, 206.87212398],
[1379998800000L, 638.47226435], [1380002400000L, 438.18036666],
[1380006000000L, 512.68490443], [1380009600000L, 904.603705539997],
[1380013200000L, 491.408088450001], [1380016800000L, 670.275397960001],
[1380020400000L, 767.166941339999], [1380024000000L, 899.976089609997],
[1380027600000L, 1243.64963909], [1380031200000L, 1508.82429811],
[1380034800000L, 1190.18854705], [1380038400000L, 546.504592349999],
[1380042000000L, 206.84883264]
]
And ting[0] outputs this:
[1379955600000L, 123.187067936508]
[1379955600000L, 536.794013499999]
What I'm really trying to do is add up the values from ting[0-24] that come AFTER the second comma. This made me try a split, but that does not work.
You already have a list; the commas are put there by Python to delimit the values only when printing the list.
Just access element 2 directly:
print ting[2]
This prints:
[1379962800000, 125.539504822835]
Each of the entries in item['values'] (so ting) is a list of two values, an integer timestamp and a float, so you can address each of those with index 0 and 1:
>>> print ting[2][0]
1379962800000
>>> print ting[2][1]
125.539504822835
To get a list of all the second values, you could use a list comprehension:
second_vals = [t[1] for t in ting]
When you load the data with json.loads, it is already parsed into a real list that you can slice and index as normal. If you want the data starting with the third element, just use ting[2:]. (If you just want the third element by itself, just use ting[2].)
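Since the stated goal is to add up the second value of each pair, the list can be passed straight to `sum`. A minimal sketch, assuming `ting` holds the first three pairs from the question's output:

```python
# First three [timestamp, value] pairs from the question's data.
ting = [
    [1379955600000, 123.107310846774],
    [1379959200000, 124.092526428571],
    [1379962800000, 125.539504822835],
]

# Pick the second element of each pair, then add them up.
second_vals = [pair[1] for pair in ting]
total = sum(second_vals)
print(total)
```

Inside the original loop, this would be done once per item, giving one total for each series.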

changing file name and extension using loop

I need to change the name and extension of a series of files. The names are currently 'tmax.##.txt', but I need it to be 'tmax_##.txt'. Then, I want to change the .txt extension to .asc. I've tried the below code and the first loop works as expected to produce 'tmax_01'. The second loop runs, but produces unexpected results, 't'.
list_raw = 'tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt'
for i in list_raw:
    list_conv = i.replace('.','_')
for i in list_conv:
    list_final = i.replace('_txt','.asc')
Any suggestions?
You are just assigning new values to a variable in each iteration of the loop. What you want to do is create a new list from the modified elements of an existing list, which is best done with a list comprehension:
list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
list_final = [i.replace(".", "_").replace("_txt", ".asc") for i in list_raw]
Note that you can do this, as in my example, in one step - there is no reason to iterate over the list twice, and produce an intermediate list, which is inefficient.
You could also do i.replace(".", "_", 1) to only replace the first ., and avoid having to do the awkward hack with the file extension. However, I would personally use i[:-4].replace(".", "_") + ".asc" - that is, cut off the existing extension with a slice, replace the .s, and then add the new extension.
If the extensions are likely to vary in length, you may want to look into the os.path module, as suggested by sotapme.
Because you're talking of files it may be worth using os.path as it's likely that the next part of your code will be to manipulate these or other files. (just guessing)
os.path.splitext('afile.txt')[0] + '.asc'
Gives
'afile.asc'
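Applied to the file names from the question, the os.path approach might look like the sketch below. The helper name `convert_name` and its `new_ext` parameter are made up for illustration.

```python
import os.path

def convert_name(filename, new_ext=".asc"):
    """Replace dots in the stem with underscores and swap the extension."""
    # splitext splits at the last dot: 'tmax.01.txt' -> ('tmax.01', '.txt')
    stem, _old_ext = os.path.splitext(filename)
    return stem.replace(".", "_") + new_ext

list_raw = ["tmax.01.txt", "tmax.02.txt", "tmax.03.txt"]
list_final = [convert_name(name) for name in list_raw]
print(list_final)
```

This works for extensions of any length, because the extension is split off before the dots are replaced.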
In the first loop: -
for i in list_raw:
    list_conv = i.replace('.','_')
Your list_conv contains a str object. And it will contain the last element in the list with the appropriate replacement.
Then in your 2nd loop: -
for i in list_conv:
    list_final = i.replace('_txt','.asc')
You are just iterating over string sequence, which will give you 1 character at a time. And list_final will contain the last character with the appropriate replacement done.
Since the last character in tmax_03_txt is t, that is why you got t.
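The behaviour described here can be reproduced step by step. This sketch reruns the question's two loops with different loop variable names, and the comments mark what each variable holds afterwards.

```python
list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']

# Each iteration rebinds list_conv, so only the last result survives.
for name in list_raw:
    list_conv = name.replace('.', '_')
# list_conv is now the single string 'tmax_03_txt', not a list.

# Iterating over a string yields one character at a time.
for ch in list_conv:
    list_final = ch.replace('_txt', '.asc')
# No single character contains '_txt', so list_final is the last character, 't'.

print(list_conv, list_final)
```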
If you want to do the replacement on each element of the list, then you can use a list comprehension and chaining of method invocations: -
>>> list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
>>> [elem.replace('.', '_').replace('_txt', '.asc') for elem in list_raw]
['tmax_01.asc', 'tmax_02.asc', 'tmax_03.asc']
Alternately you could use the string method rsplit.
list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
list_final = [filename.rsplit('.',1)[0] + '.ext' for filename in list_raw]
Where ext is the new extension. The 1 in rsplit() indicates that only the rightmost '.' will act as split point.
