I was about to create a matrix like :
33 12 23 42 11 32 43 22
33 − 1 1 1 0 0 1 1
12 1 − 1 1 0 0 1 1
23 1 1 − 1 1 1 0 0
42 1 1 1 − 1 1 0 0
11 0 0 1 1 − 1 1 1
32 0 0 1 1 1 − 1 1
43 1 1 0 0 1 1 − 1
22 1 1 0 0 1 1 1 −
I want to query by horizontal or vertical titles, so I created the matrix by:
a = np.matrix('99 33 12 23 42 11 32 43 22;33 99 1 1 1 0 0 1 1;12 1 99 1 1 0 0 1 1;23 1 1 99 1 1 1 0 0;42 1 1 1 99 1 1 0 0;11 0 0 1 1 99 1 1 1;32 0 0 1 1 1 99 1 1;43 1 1 0 0 1 1 99 1;22 1 1 0 0 1 1 1 99')
I want to have the certain data if I query a[23][11] = 1
so is there a way we can create a 2D dictionary, so that a[23][11] = 1?
Thanks
You're clearly asking for something outside of numpy.
A defauldict with the default_factory as dict gives a sense of the 2D dictionary you want:
>>> from collections import defaultdict
>>> a = defaultdict(dict)
>>> a[23][11] = 1
>>> a[23]
{11: 1}
>>> a[23][11]
1
Another possibility is to use tuples as the dictionary keys
dict((33,12):1, (23,12):1, ...]
scipy.sparse has a sparse matrix format that stores it's values in such a dictionary. With your values such a matrix would represent a 50x50 matrix with mostly 0 values, and just 1's at these selected coordinates.
Keep in mind that the keys of a dictionary (ordinary at least) are not ordered
What are going to be doing with this data? A dictionary, whether type or nested, is good for one kind of usage, but bad for others. A matrix such as you sample is better for other things, like operations along rows or columns. The dictionary format largely obscures that kind of ordered layout.
Are you looking for a dictionary with pairs as keys?
d = {}
d[33, 12] = 1
d[33, 23] = 1
# etc
Note that in python d[a, b] is just syntactic sugar for d[(a, b)]
If I understand correctly you just want to label your row/columns. To stay within the numpy array framework, a simple solution would be to create a mapping between the labels and the array order. I am also going to assume that it is OK to convert the labels into strings as they can be anything (though integers would also be fine).
l = {str(x) : ind for ind , x in enumerate((33,12,23,42,11,32,43,22))}
a = sp.linalg.circulant([99,1,1,1,0,0,1,1])
a[l['32'],l['23']]
Related
I'm trying to flag some price data as "stale" if the quoted price of the security hasn't changed over lets say 3 trading days. I'm currently trying it with:
firm["dev"] = np.std(firm["Price"],firm["Price"].shift(1),firm["Price"].shift(2))
firm["flag"] == np.where(firm["dev"] = 0, 1, 0)
But I'm getting nowhere with it. This is what my dataframe would look like.
Index
Price
Flag
1
10
0
2
11
0
3
12
0
4
12
0
5
12
1
6
11
0
7
13
0
Any help is appreciated!
If you are okay with other conditions, you can first check if series.diff equals 0 and take cumsum to check if you have a cumsum of 2 (n-1). Also check if the next row is equal to current, when both these conditions suffice, assign a flag of 1 else 0.
n=3
firm['Flag'] = (firm['Price'].diff().eq(0).cumsum().eq(n-1) &
firm['Price'].eq(firm['Price'].shift())).astype(int)
EDIT, to make it a generalized function with consecutive n, use this:
def fun(df,col,n):
c = df[col].diff().eq(0)
return (c|c.shift(-1)).cumsum().ge(n) & df[col].eq(df[col].shift())
firm['flag_2'] = fun(firm,'Price',2).astype(int)
firm['flag_3'] = fun(firm,'Price',3).astype(int)
print(firm)
Price Flag flag_2 flag_3
Index
1 10 0 0 0
2 11 0 0 0
3 12 0 0 0
4 12 0 1 0
5 12 1 1 1
6 11 0 0 0
7 13 0 0 0
I have a dataset df of object components in 3-d space - each ID represents an object which has various components:
ID Comp x y z
A 1 2 2 1
A 2 2 1 -1
A 3 -10 1 -10
A 4 -1 3 -5
B 1 3 0 0
B 2 3 0 -5
...
I would like to loop through each ID, using a clustering technique in sklearn to create clusters of components (Comp) based on each component's (x,y,z) co-ordinates - to achieve something like this:
ID Comp x y z cluster
A 1 2 2 1 1
A 2 2 1 -1 1
A 3 -10 1 -10 2
A 4 -1 3 -5 3
B 1 3 0 0 1
B 2 3 0 -5 1
...
As an example - ID:A,Comp:1 is incluster1, whereasID:A, Comp:4 is in cluster 3. (I plan to then concatenate ID and cluster later).
I'm having no luck with the following groupby + apply:
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation()
df['cluster']=df.groupby(['ID','Comp']).apply(lambda x: ap.fit_predict(np.array([x.x,x.y,x.z]).T))
I could brute-force it by using a for loop over the ID but my dataset is large (~ 150k ID) and I'm worried about resource and time constraints. Any help would be great!
IIUC, I think you could try something like this:
def ap_fit_pred(x):
ap = AffinityPropagation()
return pd.Series(ap.fit_predict(x.loc[:,['x','y','z']]))
df['cluster'] = df.groupby('ID').apply(ap_fit_pred).reset_index(drop=True)
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination1 to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
1 See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df)))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].as_matrix()
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
a = numpy.asarray(array)
a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
return a
i understand that to create dynamic for loops, recursive or itertools module in python is the way to go. Lets say I am doing it in recursive.
What I want is
for var1 in range(var1_lowerlimit, var1_upperlimit, var1_stepsize):
for var2 in range(var2_lowerlimit, var2_upperlimit, var2_stepsize):
:
:
# do_whatever()
repeat for n loops where n is the number of variables
What I have now is I have 2 lists
variable_list = [ var1, var2, var3, ... ]
boundaries_list = [ [var1_lowerlimit, var1_upperlimit, var1_stepsize],
[var2_lowerlimit, var2_upperlimit, var2_stepsize], ...]
def dynamic_for_loop(variable_list , boundaries_list, no_of_loops, list_index = 0):
if no_of_loops <= 0:
# do_whatever()
else:
lower_bound = boundaries_list[list_index][0]
upper_bound = boundaries_list[list_index][1]
step_size = boundaries_list[list_index][2]
for index in range(lower_bound, upper_bound, step_size):
list_index += 1
try:
dynamic_for_loop(variable_list , boundaries_list, no_of_loops - 1, list_index)
except:
list_index = 0
dynamic_for_loop(variable_list , boundaries_list, no_of_loops - 1, list_index)
I did a reset on list_index as it gets out of range, but i couldn't get the result I want. Can someone enlighten me what went wrong?
Use the itertools.product() function to generate the values over a variable number of ranges:
for values in product(*(range(*b) for b in boundaries_list)):
# do things with the values tuple, do_whatever(*values) perhaps
Don't try to set a variable number of variables; just iterate over the values tuple or use indexing as needed.
Using * in a call tells Python to take all elements of an iterable and apply them as separate arguments. So each b in your boundaries_list is applied to range() as separate arguments, as if you called range(b[0], b[1], b[2]).
The same applies to the product() call; each range() object the generator expression produces is passed to product() as a separate argument. This way you can pass a dynamic number of range() objects to that call.
Just for fun, I thought that I would implement this using recursion to perhaps demonstrate the pythonic style. Of course, the most pythonic way would be to use the itertools package as demonstrated by Martijn Pieters.
def dynamic_for_loop(boundaries, *vargs):
if not boundaries:
print(*vargs) # or whatever you want to do with the values
else:
bounds = boundaries[0]
for i in range(*bounds):
dynamic_for_loop(boundaries[1:], *(vargs + (i,)))
Now we can use it like so:
In [2]: boundaries = [[0,5,1], [0,3,1], [0,3,1]]
In [3]: dynamic_for_loop(boundaries)
0 0 0
0 0 1
0 0 2
0 1 0
0 1 1
0 1 2
0 2 0
0 2 1
0 2 2
1 0 0
1 0 1
1 0 2
1 1 0
1 1 1
1 1 2
1 2 0
1 2 1
1 2 2
2 0 0
2 0 1
2 0 2
2 1 0
2 1 1
2 1 2
2 2 0
2 2 1
2 2 2
3 0 0
3 0 1
3 0 2
3 1 0
3 1 1
3 1 2
3 2 0
3 2 1
3 2 2
4 0 0
4 0 1
4 0 2
4 1 0
4 1 1
4 1 2
4 2 0
4 2 1
4 2 2
I'm working on a python program to compute a numerical coding of mutated residues and positions of a set of strings (protein sequences), stored in fasta format file with each protein sequence is separated by comma. I'm trying to find the position and sequences which are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another set of fasta file) will explain the algorithm behind this. In this figure first box represents alignment of input file sequences. The last box represents the output file. How can I do this with my fasta file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
positions 1 2 3 4 5 6 1 2 3 4 5 6
protein sequence1 M T A Q D D T A D
protein sequence2 M T A Q D D T A D
protein sequence3 M T S Q E D T S E
protein sequence4 M T A Q D D T A D
protein sequence5 M K A Q H D K A H
PROTEIN SEQUENCE ALIGNMENT DISCARD NON-VARIABLE REGION
positions 2 2 3 3 5 5 5
protein sequence1 T A D
protein sequence2 T A D
protein sequence3 T S E
protein sequence4 T A D
protein sequence5 K A H
MUTATED RESIDUE IS SPLITED TO SEPARATE COLUMN
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print '\t'.join(str(x) + y for x, y in alle)
for p in pos:
print '\t'.join('1' if key in p else '0' for key in alle)
(here I'm getting columns of mutated as well as non-mutated residues, but I want only columns for mutated residues)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print df.select(lambda x: not df[x].all(), axis = 1)
(here it is giving output ,but not in orderly ie, first 2K then 2T then 3A like that.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.ix[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.ix[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows, however this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))