What's going on with my dictionary comprehension here?
I am parsing a BLAST file and want to create objects for each line in the file. Ideally each object will be stored in a dictionary for use later in the program.
Parsing works fine but I end up with a blank transSwiss dictionary.
Here are a few lines of output as an example:
c0_g1_i1|m.1 gi|74665200|sp|Q9HGP0.1|PVG4_SCHPO 100.00 372 0 0 1 372 1 372 0.0 754
c1000_g1_i1|m.799 gi|48474761|sp|O94288.1|NOC3_SCHPO 100.00 747 0 0 5 751 1 747 0.0 1506
c1001_g1_i1|m.800 gi|259016383|sp|O42919.3|RT26A_SCHPO 100.00 268 0 0 1 268 1 268 0.0 557
c1002_g1_i1|m.801 gi|1723464|sp|Q10302.1|YD49_SCHPO 100.00 646 0 0 1 646 1 646 0.0 1310
I'm trying to make each BLAST line a parse_blast object.
class parse_blast(object):
    def __init__(self, line):
        # Strip end-of-line and split on tabs
        self.fields = line.strip("\n").split("\t")
        self.transcriptId, self.isoform = self.fields[0].split("|")
        self.swissStuff = self.fields[1].split("|")
        self.swissProtId = self.swissStuff[3]
        self.percentId = self.fields[2]

    def filterblast(self):
        return float(self.percentId) > 95
blastmap = map(parse_blast, blast_output.readlines())
filtered = filter(parse_blast.filterblast, blastmap)
transSwiss = {blastmap.transcriptId:blastmap for blastmap.transcriptId in filtered}
When you do this:
for blastmap.transcriptId in filtered
you're trying to assign each element of filtered to blastmap.transcriptId in sequence. blastmap is either a list or an instance of the map type, depending on your Python version, so it has no transcriptId attribute, and your code fails with an AttributeError.
Use a variable. A new variable:
transSwiss = {pb.transcriptId: pb for pb in filtered}
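For completeness, a minimal sketch of the fixed pipeline (assuming the file really is tab-separated, as the class expects, and that blast_output is the file handle from the question):

blastmap = map(parse_blast, blast_output.readlines())
filtered = filter(parse_blast.filterblast, blastmap)
transSwiss = {pb.transcriptId: pb for pb in filtered}

# With the sample lines above, every hit is at 100.00% identity, so e.g.:
transSwiss['c0_g1_i1'].swissProtId  # -> 'Q9HGP0.1'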
Context: I have the following dataframe:
gene_id Control_3Aligned.sortedByCoord.out.gtf Control_4Aligned.sortedByCoord.out.gtf ... NET_101Aligned.sortedByCoord.out.gtf NET_103Aligned.sortedByCoord.out.gtf NET_105Aligned.sortedByCoord.out.gtf
0 ENSG00000213279|Z97192.2 0 0 ... 3 2 7
1 ENSG00000132680|KHDC4 625 382 ... 406 465 262
2 ENSG00000145041|DCAF1 423 104 ... 231 475 254
3 ENSG00000102547|CAB39L 370 112 ... 265 393 389
4 ENSG00000173826|KCNH6 0 0 ... 0 0 0
And I'd like to get a nested dictionary as this example:
{Control_3Aligned.sortedByCoord.out.gtf:
{ENSG00000213279|Z97192.2:0,
ENSG00000132680|KHDC4:625,...},
Control_4Aligned.sortedByCoord.out.gtf:
{ENSG00000213279|Z97192.2:0,
ENSG00000132680|KHDC4:382,...}}
So the general format would be:
{column_name : {row_name:value,...},...}
I was trying something like this:
sample_dict = {}
for column in df.columns[1:]:
    for index in range(0, len(df.index)+1):
        sample_dict.setdefault(column, {row_name: value for row_name, value in zip(df.iloc[index, 0], df.loc[index, column])})
        sample_dict[column] += {row_name: value for row_name, value in zip(df.iloc[index, 0], df.loc[index, column])}
But I keep getting TypeError: 'numpy.int64' object is not iterable. The problem seems to be in the zip() call, since zip() only takes iterables and I'm not passing it iterables here, and most likely also in the way I'm populating the dictionary.
Any help is very welcome! Thank you in advance
Managed to do it like this:
sample_dict = {}
gene_list = []
for index in range(0, len(df.index)):
    temp_data = df.loc[index, 'gene_id']
    gene_list.append(temp_data)

for column in df.columns[1:]:
    column_list = df.loc[:, column]
    gene_dict = {}
    for index in range(0, len(df.index)):
        if gene_list[index] not in gene_dict:
            gene_dict[gene_list[index]] = df.loc[index, column]
    sample_dict[column] = gene_dict
dict_pairs = sample_dict.items()
pairs_iterator = iter(dict_pairs)
first_pair = next(pairs_iterator)
first_pair
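For what it's worth, pandas can build this nested structure directly: DataFrame.to_dict() with the default orient returns {column_name: {index_label: value, ...}, ...}, so the whole loop above collapses to one line once gene_id is made the index:

# equivalent one-liner (default orient='dict')
sample_dict = df.set_index('gene_id').to_dict()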
I have a data set where the null-value counts are:
df.isnull().sum()
country 0
country_long 0
name 0
gppd_idnr 0
capacity_mw 0
latitude 46
longitude 46
primary_fuel 0
other_fuel1 0
other_fuel2 0
other_fuel3 908
commissioning_year 380
owner 0
source 0
url 0
geolocation_source 0
wepp_id 908
year_of_capacity_data 388
generation_gwh_2013 524
generation_gwh_2014 507
generation_gwh_2015 483
generation_gwh_2016 471
generation_gwh_2017 465
generation_data_source 0
estimated_generation_gwh 908
I tried mean, mode, max, min and std, all of those methods, but the null values are not being removed.
When I try
df['wepp_id'] = df['wepp_id'].replace(np.NAN, df['wepp_id'].mean())
it's not working; the same thing happens with median, std, min and max as well.
Try df['wepp_id'] = df['wepp_id'].fillna(df['wepp_id'].mean())
If this does not work, then it means that your column is not a numeric type. If it is a string type, then you need to do this first:
df['wepp_id'] = df['wepp_id'].astype(float)
Then run the command in the first line.
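If several numeric columns need the same treatment, a sketch along these lines should work (assuming df is the DataFrame from the question):

import numpy as np

# fill the NaNs in every numeric column with that column's mean
num_cols = df.select_dtypes(include=np.number).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
df.isnull().sum()  # the numeric columns should now report 0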
I'm having a pretty simple issue. I have a dataset (small sample shown below)
22 85 203 174 9 0 362 40 0
21 87 186 165 5 0 379 32 0
30 107 405 306 25 0 756 99 0
6 5 19 6 2 0 160 9 0
21 47 168 148 7 0 352 29 0
28 38 161 114 10 3 375 40 0
27 218 1522 1328 114 0 1026 310 0
21 78 156 135 5 0 300 27 0
The first issue I needed to cover was replacing each space with a comma. I did that with the following code:
import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        line = line.split(None, 8)
        f.write(','.join(line))
The result was the following
22,85,203,174,9,0,362,40,0
21,87,186,165,5,0,379,32,0
30,107,405,306,25,0,756,99,0
6,5,19,6,2,0,160,9,0
21,47,168,148,7,0,352,29,0
28,38,161,114,10,3,375,40,0
27,218,1522,1328,114,0,1026,310,0
21,78,156,135,5,0,300,27,0
My next step is to grab the values from the last column, check if they are less than 2 and replace it with the string 'nfp'.
I'm able to separate the last column with the following:
for line in open("Data_Sorted.txt"):
    columns = line.split(',')
    print columns[8]
My issue is implementing the conditional that replaces the value with the string; I'm also not sure how to put the modified column back into the original dataset.
There's no need to do this in two loops through the file. Also, you can use -1 to index the last element in the line.
import fileinput

with open('Data_Sorted.txt', 'w') as f:
    for line in fileinput.input('DATA.dat'):
        # strip newline character and split on whitespace
        line = line.strip().split()
        # check condition for last element (assuming you're using ints)
        if int(line[-1]) < 2:
            line[-1] = 'nfp'
        # write out the line, but you have to add the newline back in
        f.write(','.join(line) + "\n")
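With the sample above, every row's final value is 0, so each output line ends in ,nfp, e.g. 22,85,203,174,9,0,362,40,nfp.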
Further Reading:
Negative list index?
Understanding Python's slice notation
You need to convert columns[8] to an int and compare if it is less than 2.
for line in open("Data_Sorted.txt"):
    columns = line.split(',')
    if int(columns[8]) < 2:
        columns[8] = "nfp"
    print columns
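To get the modified rows back into a file rather than just printing them, one option is a sketch like this (Data_Modified.txt is a hypothetical output name, used so the input isn't overwritten while it's being read):

with open("Data_Sorted.txt") as src, open("Data_Modified.txt", "w") as dst:
    for line in src:
        columns = line.strip().split(',')
        # replace the last value with 'nfp' when it is below 2
        if int(columns[-1]) < 2:
            columns[-1] = 'nfp'
        dst.write(','.join(columns) + '\n')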
Following the docs on line_profiler, I am able to profile my code just fine, but when I view the output with python -m line_profiler script_to_profile.py.lprof, I only see 27 lines of code. I expect to see about 250, because that's the length of the function that I added the #profile decorator to. My output looks like this:
Timer unit: 1e-06 s
Total time: 831.023 s
File: /Users/will/Repos/Beyond12/src/student_table.py
Function: aggregate_student_data at line 191
Line # Hits Time Per Hit % Time Line Contents
==============================================================
191 # load files
192 files = os.listdir(datadir)
193
194 tabs = ['Contact', 'Costs_Contributions', 'Course', 'EducationalHistory', 'Events', 'FinancialAidScholarship',
195 1 764 764.0 0.0 'PreCollegiateExam', 'QualitativeInformation', 'Student_Performance', 'Task', 'Term']
196 special_contact_id = {'Contact': 'Id',
197 1 7 7.0 0.0 'Events': 'WhoId',
198 1 6 6.0 0.0 'QualitativeInformation': 'Alumni',
199 1 6 6.0 0.0 'Student_Performance': 'Contact',
200 1 6 6.0 0.0 'PreCollegiateExam': 'Contact ID'}
201 1 6 6.0 0.0
202 1 5 5.0 0.0 # todo: make dictionary of the columns that we'll be using, along with their types?
203
204 df = {}
205 for tab in tabs:
206 1 6 6.0 0.0 # match tab titles to files by matching first 5 non-underscore characters
207 12 115 9.6 0.0 filename = filter(lambda x: tab.replace('_', '')[:5] in x.replace('_', '')[:5], files)[0]
It's cut off in the middle of a for loop.
You might have modified the source file after creating the profile result file; try a re-run.
This is because the source code line_profiler prints is read from the file on disk, ref:
https://github.com/rkern/line_profiler/blob/master/line_profiler.py#L190
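That is, regenerate the .lprof file and view it again; kernprof is the driver script that ships with line_profiler:

kernprof -l script_to_profile.py
python -m line_profiler script_to_profile.py.lprof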
I got all my data into an HDFStore (yeah!), but how do I get it out again?
I've saved 6 DataFrames as frame_table in my HDFStore. Each of these tables looks like the following, but the length varies (date is a Julian date).
>>> a = store.select('var1')
>>> a.head()
var1
x_coor y_coor date
928 310 2006257 133
932 400 2006257 236
939 311 2006257 253
941 312 2006257 152
942 283 2006257 68
Then I select from all my tables the values where the date is, e.g., > 2006256.
>>> b = store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector= 'var1')
>>> b.head()
var1 var2 var3 var4 var5 var6
x_coor y_coor date
928 310 2006257 133 14987 7045 18 240 171
2006273 136 0 7327 30 253 161
2006289 125 0 -239 83 217 168
2006305 95 14604 6786 13 215 57
2006321 84 0 4548 13 133 88
This works, but only for the relatively small .h5 files. So for my normal .h5 files I would like to store the selection temporarily in an HDFStore using chunksize (since I have to add a new column based on this selection as well). I thought of something like this (using this):
for df in store.select_as_multiple(['var1','var2','var3','var4','var5','var6'], where=(pd.Term('date','>',date)), selector='var1', chunksize=15):
    tempstore.put('test', pd.DataFrame(df))
But then only one chunk is added to the store. But with:
tempstore.append('test',pd.DataFrame(df))
I get ValueError: Can only append to Tables. What am I doing wrong?
When you tried to do this with put, it kept overwriting the store with the latest chunk; you then got the error when you appended, because you can't append to a storer / non-table.
That is:
put writes a single, non-appendable fixed format (called a storer), which is fast to write, but you cannot append, nor query (only get it in its entirety).
append creates a table format, which is what you want here (and what a frame_table is).
Note: you don't need to do pd.DataFrame(df) as df is already a frame.
So, first do this (delete the store) if it's there:
if 'test' in tempstore:
    tempstore.remove('test')
Then append each DataFrame:
for df in store.select_as_multiple(.....):
    tempstore.append('test', df)
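Putting it together with the call from the question (a sketch; date and the six table names are as defined above):

if 'test' in tempstore:
    tempstore.remove('test')

for df in store.select_as_multiple(
        ['var1', 'var2', 'var3', 'var4', 'var5', 'var6'],
        where=(pd.Term('date', '>', date)),
        selector='var1', chunksize=15):
    # append each chunk so the rows accumulate in one table-format node
    tempstore.append('test', df)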