I have a function that annotates some genomic variants with multiple items (details not important). For every variant, it stores all the information in a list. All variant lists are added to one big list, which ultimately looks something like this:
[['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '',
0, '', 0, 0, 13, 13], ['chr3', 5577129, 'chr3', 5708227, 'DUP', 0,
131098, 0, 0, '', '', 0, '', 0, 0, 13, 13],...]
This big list is returned by the annotator function and then I would like to convert it to a numpy array which goes fine:
annotated_tn = np.array(annotated_tn, dtype="object")
However, the result is not as expected:
array([list(['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '', 0, '', 0, 0, 13, 13]),
list(['chr3', 5577129, 'chr3', 5708227, 'DUP', 0, 131098, 0, 0, '', '', 0, '', 0, 0, 13, 13]),... ],dtype=object)
For some reason it adds an extra list() to all the lists in the array making them not indexable:
annotated_tn[:,1]
IndexError: too many indices for array
I believe the output should look like this:
array([['chr9', 11849076, 'chr9', 12028629, 'DEL', 0, 179553, 0, 0, '', '', 0, '', 0, 0, 13, 13], ['chr3', 5577129, 'chr3', 5708227, 'DUP', 0, 131098, 0, 0, '', '', 0, '', 0, 0, 13, 13],..], dtype=object)
Any idea what is happening here?
My best guess is that there's a row in your data that doesn't have the same number of columns as the other rows.
If all rows had the same length, your code would work and give you a 2-D array. But as soon as one row has a different length, NumPy falls back to a 1-D array of list objects, which is exactly the result you're getting.
Since you only posted 2 rows of your data and both have 17 columns, I can't say this for sure. But I'm pretty sure this is your problem.
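To illustrate the point above, here is a minimal sketch with shortened, made-up rows (4 columns instead of 17). The names `ragged`, `expected_len`, and `bad_rows` are only for illustration. It shows how a single short row changes the shape of the result, and one possible way to locate the offending rows.

```python
from collections import Counter

import numpy as np

# Shortened stand-ins for the variant rows (illustrative values only)
rows = [
    ['chr9', 11849076, 'DEL', 13],
    ['chr3', 5577129, 'DUP', 13],
]
ragged = rows + [['chr1', 100, 'DEL']]  # this row is one column short

regular = np.array(rows, dtype=object)   # all rows equal length -> 2-D array
broken = np.array(ragged, dtype=object)  # unequal lengths -> 1-D array of lists

print(regular.shape)  # (2, 4)
print(broken.shape)   # (3,)
print(regular[:, 1])  # column indexing works on the 2-D array

# Locate rows whose length differs from the most common length
lengths = Counter(len(r) for r in ragged)
expected_len = lengths.most_common(1)[0][0]
bad_rows = [i for i, r in enumerate(ragged) if len(r) != expected_len]
print(bad_rows)  # [2]
```

Running the same length check on the real `annotated_tn` list before converting it should show whether this is what is happening.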
I need to run a for loop over 2 columns of a dataframe and return a dict. But when I use zip, the loop variable only receives part of the string it should be iterating over.
import pandas as pd
def split(owner, cost):
    split_bill = {'ads': 0, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 0, 'xms': 0, 'cc': 0}
    for owner_in, cost in zip(owner, cost):  # need to know what type of loop can work here
        split_bill[owner_in] += cost
        continue
    return split_bill
data = {
"owner": ['ads', 'cs', 'edu'],
"cost": [2.3, 4.30, 45]
}
df = pd.DataFrame(data)
df['metric'] = df.apply(lambda x: split(x.owner, {x.cost}), axis=1)
Expected output
df['metric'] =
metric
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 0, 'xms': 0, 'cc': 0}
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 4.3, 'edu': 0, 'xms': 0, 'cc': 0}
{'ads': 2.3, 'qaweb': 0, 'ovt': 0, 'cs': 0, 'edu': 45, 'xms': 0, 'cc': 0}
In the for loop, owner_in only receives the character a from ads, when it should receive the whole string ads.
Can you help with what type of loop could work?
zip combines several iterables into a sequence of tuples. The length of the result is determined by the shortest input.
In your example, owner is the string ads and cost is a set containing one float value. In zip(owner, cost), the string is treated as a sequence of three characters, so the result has length 1, set by the shortest input (the one-element set), and owner_in is only a.
I guess you may want to do df.groupby('owner')['cost'].apply(sum).
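As a rough sketch of the explanation above: the first part reproduces the truncation, and the second part zips two whole sequences instead of one row's values. The helper `split_row` and the names `KEYS`, `owners`, `costs`, and `metrics` are hypothetical, and plain lists stand in for the two dataframe columns. It assumes each row should get its own dict with only that row's cost filled in.

```python
# zip stops at the shortest input, and a string is iterated one character at a time
pairs = list(zip('ads', {2.3}))
print(pairs)  # [('a', 2.3)]

KEYS = ['ads', 'qaweb', 'ovt', 'cs', 'edu', 'xms', 'cc']

def split_row(owner, cost):
    """Return a fresh bill with one owner's cost filled in."""
    bill = dict.fromkeys(KEYS, 0)
    bill[owner] += cost
    return bill

owners = ['ads', 'cs', 'edu']  # same values as df['owner']
costs = [2.3, 4.30, 45]        # same values as df['cost']

# Zipping the two whole columns pairs each owner with its own cost
metrics = [split_row(o, c) for o, c in zip(owners, costs)]
print(metrics[1]['cs'])  # 4.3
```

With a real dataframe, the same pattern would be `zip(df['owner'], df['cost'])`, with the resulting list assigned to `df['metric']`.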
I built three nested dictionaries to analyze my big data. I am trying to analyze the values inside them to make a scatter plot, so I am creating a list, appending my data to it, and then plotting it with matplotlib. My problem is that I get an error when I try to append: TypeError: unhashable type: 'list'. I am not sure whether I should change the structure of my dictionaries or whether it is possible to handle this with what I have already created.
my dictionaries structure are respectively like:
data_geo1:
'ENSG00000268358': {'Sample_19-leish_023_v2': 0, 'Sample_4-leish_012_v3': 0, 'Sample_25-leish027_v2': 0, 'Sample_6-leish_015_v3': 0, 'Sample_23-leish026_v2': 1, 'Sample_20-leish_023_v3': 0, 'Sample_18-leish_022_v3': 0, 'Sample_10-leish_017_v3': 0, 'Sample_13-leish_019_v2': 0, 'Sample_1-Leish_011_v2': 0, 'Sample_11-leish_018_v2': 0, 'Sample_3-leish_012_v2': 0, 'Sample_2-leish_011_v3': 0, 'Sample_29-leish032_v2': 0, 'Sample_8-leish_016_v3': 0, 'Sample_28-leish028_v3': 0, 'Sample_27-leish028_v2': 1, 'Sample_26-leish027_v3': 0, 'Sample_12-leish_018_v3': 0, 'Sample_5-leish_015_v2': 0, 'Sample_16-leish_021_v3': 0, 'Sample_21-leish_024_v2': 0, 'Sample_9-leish_017_v2': 0, 'Sample_24-leish026_v3': 1, 'Sample_22-leish_024_v3': 0, 'Sample_14-leish_019_v3': 0, 'Sample_30-leish032_v3': 0, 'Sample_7-leish_016_v2': 0, 'Sample_15-leish_021_v2': 0, 'Sample_17-leish_022_v2': 1}
data_ali:
{'ENSG00000268358': {'Sample_19-leish_023_v2': 0, 'Sample_16-leish_021_v3': 2, 'Sample_20': 0, 'Sample_24-leish026_v3': 1, 'Sample_6-leish_015_v3': 0, 'Sample_12-leish_018_v3': 0, 'Sample_22-leish_024_v3': 0, 'Sample_23-leish026_v2': 2, 'Sample_25-leish027_v2': 0, 'Sample_18-leish_022_v3': 1, 'Sample_14': 0, 'Sample_2-leish_011_v3': 0, 'Sample_13-leish_019_v2': 0, 'Sample_1-Leish_011_v2': 0, 'Sample_11-leish_018_v2': 0, 'Sample_20-leish_023_v3': 0, 'Sample_3-leish_012_v2': 0, 'Sample_10-leish_017_v3': 1, 'Sample_7': 0, 'Sample_29-leish032_v2': 1, 'Sample_8-leish_016_v3': 0, 'Sample_6': 0, 'Sample_7-leish_016_v2': 0, 'Sample_9': 0, 'Sample_8': 0, 'Sample_27-leish028_v2': 0, 'Sample_26-leish027_v3': 0, 'Sample_5': 1, 'Sample_4': 0, 'Sample_3': 0, 'Sample_19': 0, 'Sample_1': 0, 'Sample_2': 0, 'Sample_9-leish_017_v2': 0, 'Sample_5-leish_015_v2': 0, 'Sample_4-leish_012_v3': 0, 'Sample_21-leish_024_v2': 0, 'Sample_18': 0, 'Sample_13': 0, 'Sample_12': 0, 'Sample_11': 0, 'Sample_10': 1, 'Sample_17': 0, 'Sample_16': 0, 'Sample_15': 1, 'Sample_14-leish_019_v3': 0, 'Sample_30-leish032_v3': 0, 'Sample_28-leish028_v3': 1, 'Sample_15-leish_021_v2': 0, 'Sample_17-leish_022_v2': 0}
Here is all of my code from the beginning. As you can see in the last lines, I tried to create a list and append my values to it, but I was not successful.
import os
import numpy as np
import matplotlib.pyplot as plt
path = "/home/ali/Desktop/data/"
root = "/home/ali/Desktop/SAMPLES/"
data_geo1={}
with open(path+"GSE98212_H_DE_genes_count.txt","rt") as fin: #data for sample 1-30
    h = fin.readline()
    sample1 = h.split()
    sample_names = [s.strip('"') for s in sample1[1:31]]
    for l in fin.readlines():
        l = l.strip().split()
        if l:
            gene1= l[0].strip('"')
            data_geo1[gene1] = {}
            for i, x in enumerate(l[1:31]):
                data_geo1[gene1][sample_names[i]] = int(x)
#print(data_geo1)
data_geo2={}
with open (path+"GSE98212_L_DE_genes_count.txt","rt") as fin:
    h= fin.readline()
    sample2=h.split()
    sample_names=sample2[1:21]
    for l in fin.readlines():
        l = l.strip().split()
        if l:
            gene2= l[0].strip()
            data_geo2[gene2]={}
            for i,x in enumerate (l[1:21]):
                data_geo2[gene2][sample_names[i]]= int(x)
#print(data_geo2)
data_ali={}
for sample_name in os.listdir(root):
    with open(os.path.join(root, sample_name, "counts.txt"), "r") as fin:
        for line in fin.readlines():
            gene, reads = line.split()
            reads = int(reads)
            if gene.startswith('ENSG'):
                data_ali.setdefault(gene, {})[sample_name] = reads
gene = l[0].strip()
#print(data_ali)
list_samples= data_ali[gene].keys()
#print(list_samples)
for sample in list_samples:
    reads_data_ali = []
    for gene in data_ali.keys():
        reads_data_ali.append(data_ali[gene][sample_name])
I expect output like this:
[[0, 0], [0, 2], [11, 12], [4, 4], [18, 17], [2, 2], [381, 383], [1019, 1020], [198, 194], [66, 65], [2223, 2230], [30, 30], [0, 0], [33, 34], [0, 0], [411, 409], [804, 803], [11829, 7286], [137, 139], [277, 278], [3475, 3482], [5, 5], [2, 1], [70, 70], [48, 48], [234, 232], [121, 120], [928, 925], [220, 159], [165, 165], [702, 700], [1645, 1643], [79, 78], [1064, 1067], [971, 972], [0, 0]]
You can try to avoid the KeyError by checking whether the key exists in your dictionary before calling .append(...). Have a look at the dictionary .get() method; it is good for preventing this type of error.
From your description, I suppose the code that builds the dictionaries data_ali and data_geo1 produces the right output, so the problem may be in the last part of the code, where you build the list.
I found two issues:
1. In the loop for gene in data_ali.keys():, the line reads_data_geo1.append(data_geo1[gene1][sample_names]) indexes with [gene1] instead of the loop variable.
2. In the loop for sample in list_samples:, you should probably use reads_data_ali.append(data_ali[gene][sample]).
You may want to revise the names of these variables and see if it works.
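The two suggestions above could be combined roughly as follows. This is only a sketch on a tiny made-up stand-in for data_ali: the read counts are invented, `reads_by_sample` is a hypothetical name, and treating a missing sample as 0 reads is an assumption that may not suit the real analysis. Note also that indexing a dict with a list (for example the whole sample_names list) is what raises TypeError: unhashable type: 'list'.

```python
# Small stand-in for data_ali: gene -> {sample -> read count} (made-up numbers)
data_ali = {
    'ENSG00000268358': {'Sample_1': 0, 'Sample_2': 2},
    'ENSG00000000003': {'Sample_1': 11},  # Sample_2 is missing for this gene
}

# Collect every sample name that appears for any gene
samples = sorted({s for counts in data_ali.values() for s in counts})

reads_by_sample = {}
for sample in samples:
    # Use the loop variable `sample` as the key, and .get() so that a
    # missing sample yields 0 instead of raising KeyError (an assumption)
    reads_by_sample[sample] = [data_ali[gene].get(sample, 0) for gene in data_ali]

print(reads_by_sample)
# {'Sample_1': [0, 11], 'Sample_2': [2, 0]}
```

Each list holds one value per gene, in the dictionary's iteration order, so two such lists can be passed directly to a scatter plot.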
I wrote some code to download the synonyms of the words in a list, locations. Since a word can have multiple meanings, I used another list, meaning, to point to the serial number of the meaning I want for each word. I then calculate similarities between the words based on the synonyms found and save them in a file.
from nltk.corpus import wordnet as wn
from textblob import Word
from textblob.wordnet import Synset
locations = ['access', 'airport', 'amenity', 'area', 'atm', 'barrier', 'bay', 'bench', 'boundary', 'bridge', 'building', 'bus', 'cafe', 'car', 'coast', 'continue', 'created', 'defibrillator', 'drinking', 'embankment', 'entrance', 'ferry', 'foot', 'fountain', 'fuel', 'gate', 'golf', 'gps', 'grave', 'highway', 'horse', 'hospital', 'house', 'land', 'layer', 'leisure', 'man', 'market', 'marketplace', 'height', 'name', 'natural', 'exit', 'way', 'park', 'parking', 'place', 'worship', 'playground', 'police', 'station', 'post', 'mail', 'power', 'private', 'public', 'railway', 'ref', 'residential', 'restaurant', 'road', 'route', 'school', 'shelter', 'shop', 'source', 'sport', 'toilet', 'tourism', 'unknown', 'vehicle', 'vending', 'machine', 'village', 'wall', 'waste', 'waterway'];
meaning = [0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 5, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 11, 0, 1, 0, 0, 3, 0, 4, 0, 0, 3, 4, 0, 0, 0, 10, 0, 9, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
ncols = len(locations)
nrows = len(locations)
matrix = [[0] * ncols for i in range(nrows)]
for i in range(0,len(locations)):
    word1 = Word(locations[i])
    SS1 = word1.synsets[meaning[i]]
    for j in range(0,len(locations)):
        word2 = Word(locations[j])
        SS2 = word1.synsets[meaning[j]]
        matrix[i][j] = SS1.path_similarity(SS2)
f = open('Similarities.csv', 'w')
print(matrix, file=f)
But the code gives the following error:
SS2 = word1.synsets[meaning[j]]
IndexError: list index out of range
When I printed out the values of i and j, I found that it prints till i=0 and j=36. That means that when j=36, the error arises. The word in the list at index 36 is man, and the value at index 36 of meaning is 11.
So, why is this error occurring and how do I fix it?
EDIT: The mistake was in SS2 = word1.synsets[meaning[j]]. It should have been SS2 = word2.synsets[meaning[j]]. Sorry.
len(word1.synsets) returns 8 and type(word1.synsets) returns list, so it is a list with indexes 0 to 7.
Your list meaning contains 11 at index 36, so when your loop reaches word1.synsets[11] you get the index out of range error.
Like Jose said, 7 is the largest value from meaning that word1.synsets can accept.
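The failure can be reproduced without WordNet. In the sketch below, plain lists of placeholder strings stand in for Word(w).synsets. The 8 senses for access match the length reported above, while the 12 senses for man is a made-up count chosen only so that index 11 is valid. `pick_sense` is a hypothetical helper, not part of textblob or nltk.

```python
# Hypothetical stand-in for Word(w).synsets: word -> list of senses
senses = {
    'access': [f'access_sense_{k}' for k in range(8)],
    'man': [f'man_sense_{k}' for k in range(12)],  # made-up count
}
locations = ['access', 'man']
meaning = [0, 11]

def pick_sense(word, index):
    """Return the requested sense, with a descriptive error if it is out of range."""
    options = senses[word]
    if not 0 <= index < len(options):
        raise IndexError(f"{word!r} has {len(options)} senses, index {index} is out of range")
    return options[index]

# The bug: the sense index meant for 'man' is applied to the senses of 'access'
try:
    pick_sense(locations[0], meaning[1])
    wrong_word_error = None
except IndexError as exc:
    wrong_word_error = str(exc)
print(wrong_word_error)

# The fix: each index is applied to its own word
chosen = [pick_sense(w, m) for w, m in zip(locations, meaning)]
print(chosen)  # ['access_sense_0', 'man_sense_11']
```

A check like this before the nested loop makes it easier to see whether an index is wrong for its own word or is being applied to a different word.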
I am doing something fairly simple, I'm trying to convert a list of lists to an np.array:
rows = [[1, 'None', 'None', 0, 0, 'None', 0, 0],
[2, 'None', 'None', 0, 0, 'None', 0, 0],
[3, 'None', 'None', 0, 0, 'None', 0, 0],
[4, 'None', 'None', 0, 0, 'None', 0, 0],
[5, 'None', 'None', 0, 0, 'None', 0, 0]]
dtypes = np.dtype(
    [
        ('_ID', int),
        ('SOURCE_LILST', '|S1024'),
        ('PRI_SOURCE', '|S256'),
        ('PRI_SOURCE_CNT', np.int32),
        ('PRI_SOURCE_PER', np.float64),
        ('SEC_SOURCE', '|S256'),
        ('SEC_SOURCE_CNT', np.int32),
        ('SEC_SOURCE_PER', np.float64)
    ]
)
array = np.array(rows, dtypes) # error raised
Can anyone see what the issue is?
Thank you
Your rows object should be a list of tuples, not a list of lists:
import numpy as np

rows = [(1, 'None', 'None', 0, 0, 'None', 0, 0),
        (2, 'None', 'None', 0, 0, 'None', 0, 0),
        (3, 'None', 'None', 0, 0, 'None', 0, 0),
        (4, 'None', 'None', 0, 0, 'None', 0, 0),
        (5, 'None', 'None', 0, 0, 'None', 0, 0)]
dtypes = np.dtype(
    [
        ('_ID', int),  # np.int was removed in NumPy 1.24; the builtin int is equivalent
        ('SOURCE_LILST', '|S1024'),
        ('PRI_SOURCE', '|S256'),
        ('PRI_SOURCE_CNT', np.int32),
        ('PRI_SOURCE_PER', np.float64),
        ('SEC_SOURCE', '|S256'),
        ('SEC_SOURCE_CNT', np.int32),
        ('SEC_SOURCE_PER', np.float64)
    ]
)
array = np.array(rows, dtypes) # no error raised
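If the rows already exist as a list of lists, they do not have to be retyped. A minimal sketch, using a reduced three-field dtype for brevity (the reduced field set and the sample values are illustrative only), converts each row with tuple() and then reads the result back by field name:

```python
import numpy as np

# Rows as they might arrive from elsewhere: a list of lists
rows = [[1, 'None', 0.0], [2, 'None', 0.5]]

# Reduced dtype for illustration (three of the original field names)
dtypes = np.dtype([
    ('_ID', np.int64),
    ('PRI_SOURCE', 'S256'),
    ('PRI_SOURCE_PER', np.float64),
])

# Each record of a structured array must be a tuple, so convert every row
records = np.array([tuple(r) for r in rows], dtype=dtypes)

print(records.shape)              # (2,)
print(records['_ID'])             # [1 2]
print(records['PRI_SOURCE_PER'])  # the float column
```

The result is a 1-D array of records, so columns are selected by field name rather than by a second index.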
A few extra hints:
Try to simplify your problem as much as possible to figure out where the issue is. In this case, you could have deleted all but one of the columns, and hence dtypes, and still seen the problem. This helps eliminate possible errors like whether or not you specified all those strings correctly, etc.
Google your problem before bothering to create a question.
When asking a question about why you're getting an error, include the error message — or at least some shortened version of it.
More generally, read this. It's not just helpful for asking better questions, it's also a helpful way to figure out problems before you even get around to asking.
Is there some way to simplify this code? Maybe using def or a for loop or lists or something? Thank you!
c=0
c2=0
c3=0
c4=0
c5=0
c6=0
c7=0
c8=0
c9=0
c10=0
c11=0
c12=0
c13=0
c14=0
c15=0
c16=0
c17=0
c18=0
c19=0
c20=0
c21=0
c22=0
c23=0
c24=0
c25=0
c26=0
I would much rather use a dictionary here:
>>> d = {"c{}".format(val): 0 for val in range(27)}
>>> d
{'c19': 0, 'c18': 0, 'c13': 0, 'c12': 0, 'c11': 0, 'c10': 0, 'c17': 0, 'c16': 0, 'c15': 0, 'c14': 0, 'c9': 0, 'c8': 0, 'c3': 0, 'c2': 0, 'c1': 0, 'c0': 0, 'c7': 0, 'c6': 0, 'c5': 0, 'c4': 0, 'c22': 0, 'c23': 0, 'c20': 0, 'c21': 0, 'c26': 0, 'c24': 0, 'c25': 0}
>>> d.get('c15')
0
>>> print(d.get('c10000'))
None
There are a couple of solutions. What are you using all these variables for? The best solution is probably to put them in a list. So:
c = [0 for _ in range(number_of_variables)]
Then you access them by index: c[0], c[26], etc.
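Both suggestions can be sketched side by side. The value 27 for `number_of_variables` is an assumption (slots 0 through 26), and the names `counters` and `by_name` are only illustrative.

```python
number_of_variables = 27  # assumed: one slot each for indexes 0..26

# List approach: one counter per position, addressed by integer index
counters = [0] * number_of_variables
counters[3] += 1
counters[26] += 5

# Dictionary approach: counters addressed by name, as in c0 ... c26
by_name = {f"c{i}": 0 for i in range(number_of_variables)}
by_name["c15"] += 2

print(counters[3], counters[26])  # 1 5
print(by_name["c15"])             # 2
print(sum(counters))              # 6
```

The list is the simpler choice when the counters are naturally numbered, while the dictionary is convenient if the existing names need to be kept.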