Add new column with existing column names - python

I'm dealing with a dataframe which looks like:
FID geometry Code w1 w2
0 12776 POLYGON ((-1.350000000000025 53.61540813717482... 12776 0 1
1 13892 POLYGON ((6.749999999999988 52.11964001623148,... 13892 1 0
2 14942 POLYGON ((-3.058896639907732e-14 51.3958198431... 14942 1 1
3 18964 POLYGON ((8.549999999999974 45.26941059233587,... 18964 0 1
4 19863 POLYGON ((-0.4500000000000305 44.6337746953077... 19863 0 1
My objective is to add a column, labeled 'Max', that records which of w1 and w2 has the higher frequency.
So far I've only managed to add a column that holds the maximum frequency itself, instead of the name of the column where it appears.
The desired output would be something like this:
FID geometry Code w1 w2 Max
0 12776 ... 12776 0 1 w2
1 13892 ... 13892 1 0 w1
2 14942 ... 14942 1 1 0
3 18964 ... 18964 0 1 w2
4 19863 ... 19863 0 1 w2
Furthermore, I'd like to fill in a zero whenever the frequencies are the same, in the same step if that's possible.
Any help would be appreciated! :-)

Use np.where to choose 0 when the two columns are equal and idxmax(axis=1) when they are not.
df['Max'] = np.where(df.w1 == df.w2, 0, df[['w1', 'w2']].idxmax(axis=1))
df
FID geometry Code w1 w2 Max
0 12776 ... 12776 0 1 w2
1 13892 ... 13892 1 0 w1
2 14942 ... 14942 1 1 0
3 18964 ... 18964 0 1 w2
4 19863 ... 19863 0 1 w2

Something like this should work:
(df['w1'] == df['w2']).map({True: 0}).fillna(df[['w1', 'w2']].idxmax(axis=1))
Out[26]:
0 w2
1 w1
2 0
3 w2
4 w2
dtype: object
How it works:
The main part is with idxmax:
df[['w1', 'w2']].idxmax(axis=1)
Out[27]:
0 w2
1 w1
2 w1
3 w2
4 w2
dtype: object
This first selects the relevant columns and returns, for each row, the label of the column that holds the maximum (axis=1 searches across the columns). However, it returns the first label in case of ties.
(df['w1'] == df['w2']).map({True: 0}) fills a series with 0 when w1==w2. Remaining values are NaN. So those are filled with idxmax values.
Note: np.where is definitely the more logical (and probably faster) choice here. I just like to experiment with other alternatives.
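For reference, the np.where approach can be checked end to end on the w1/w2 values from the question. This is only a minimal sketch: it rebuilds just those two columns (FID, geometry and Code are left out), and the cast to object is an extra precaution that is not part of either answer, added so that the integer 0 and the column labels can share one column.

```python
import numpy as np
import pandas as pd

# Only the two frequency columns from the question; the other columns are omitted.
df = pd.DataFrame({"w1": [0, 1, 1, 0, 0], "w2": [1, 0, 1, 1, 1]})

# Label of the larger column per row; on a tie idxmax returns the first column (w1).
winner = df[["w1", "w2"]].idxmax(axis=1).astype(object)

# Overwrite the tied rows with 0 and keep the idxmax label everywhere else.
df["Max"] = np.where(df["w1"] == df["w2"], 0, winner)

print(df)
```

Row 2 is the only tie, so it is the only row where idxmax alone (w1) and the final column (0) disagree.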

Related

np where with two conditions and met first

I am trying to create a target variable based on two conditions. I have X values that are binary and X2 values that are also binary. The rule is: whenever X changes from 1 to 0, y becomes 1, but only if that change is followed by a change from 0 to 1 in X2. If it is followed by a change from 0 to 1 in X first, then we don't set y at all. I attached a picture from Excel.
I also did the following to account for the change in X
df['X-prev']=df['X'].shift(1)
df['Change-X']=np.where(df['X-prev']+df['X']==1,1,0)
# this is the data frame
X=[1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2=[0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df=pd.DataFrame()
df['X']=X
df['X2']=X2
However, this is not enough, as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.
Keep the rows that match your two transitions: X is 1 and the next X is 0 (X=1, X+1=0), and X2 is 1 and the previous X2 is 0 (X2=1, X2-1=0). Then merge all selected rows into one series in which a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
This series can contain consecutive starts or consecutive ends, so you need to filter it again to keep only complete (0, 1) cycles. After that, reindex the filtered series by your original dataframe index and back fill, so that the rows leading up to each end get a 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)
sr1 = pd.Series(0, df.index[x1])
sr2 = pd.Series(1, df.index[x2])
sr = pd.concat([sr2, sr1]).sort_index()
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
.reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0
1 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0
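As a cross-check, the vectorized answer can be compared against a plain loop that tracks the most recent unmatched start. The sketch below is self-contained and uses the X and X2 lists from the question; mark_cycles is a hypothetical helper written for this comparison, not part of the answer, and it assumes a start and an end never fall on the same row (the question's data has no such row).

```python
import pandas as pd

X = [1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2 = [0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df = pd.DataFrame({"X": X, "X2": X2})

# Vectorized version, following the answer above.
x1 = df["X"].sub(df["X"].shift(-1)).eq(1)   # X is 1 and the next X is 0
x2 = df["X2"].sub(df["X2"].shift(1)).eq(1)  # X2 is 1 and the previous X2 is 0
sr1 = pd.Series(0, index=df.index[x1.to_numpy()])  # 0 marks a start
sr2 = pd.Series(1, index=df.index[x2.to_numpy()])  # 1 marks an end
sr = pd.concat([sr2, sr1]).sort_index()
cycles = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))]  # keep (start, end) pairs
df["Y"] = cycles.reindex(df.index).bfill().fillna(0).astype(int)


def mark_cycles(x, x2):
    """Plain-loop reference (hypothetical helper, not from the answer)."""
    y = [0] * len(x)
    start = None  # row of the latest X drop that has not been closed yet
    for i in range(len(x)):
        # An X2 rise closes the open cycle: fill the rows after the start.
        if i > 0 and x2[i - 1] == 0 and x2[i] == 1 and start is not None:
            for j in range(start + 1, i + 1):
                y[j] = 1
            start = None
        # An X drop (1 here, 0 on the next row) opens or replaces a cycle.
        if i + 1 < len(x) and x[i] == 1 and x[i + 1] == 0:
            start = i
    return y


loop_y = mark_cycles(X, X2)
print(df["Y"].tolist() == loop_y)
```

The loop is slower than the vectorized version but makes the rule explicit: a later start overwrites an earlier unmatched one, and an end without an open start is ignored.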

Conditional sum of non zero values

I have a dataframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the sum of continuous non-zero values of column Fn.
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
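One detail worth noting: because the cumsum runs over df.Fn.ne(0), the result is the length of each run, which equals the sum only while Fn holds 0s and 1s. The sketch below uses made-up values (not the question's data) to show the difference, and assumes that dropping .ne(0) is what you would want if Fn could hold larger numbers and a true running sum were needed.

```python
import pandas as pd

# Made-up values for illustration; the question's Fn column only holds 0 and 1.
fn = pd.Series([0, 2, 3, 0, 1], name="Fn")

# Same pseudo-groups as above: every zero starts a new group.
groups = fn.eq(0).cumsum()

# Counts the non-zero entries seen so far in the current run.
run_length = fn.ne(0).groupby(groups).cumsum()

# Adds up the actual values seen so far in the current run.
run_total = fn.groupby(groups).cumsum()

print(pd.DataFrame({"Fn": fn, "run_length": run_length, "run_total": run_total}))
```

For a strictly binary Fn column the two results are identical, which is why the answer's version produces the requested Sum column.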
How about using cumsum and resetting it whenever the value is 0:
nonzero = df['Fn'].ne(0)
running = nonzero.cumsum()
# subtract the running total seen at the most recent zero (0 if no zero has been seen yet)
df['Fn2'] = running - running.where(~nonzero).ffill().fillna(0).astype(int)
df
You can store the Fn column in a list and build a new list while iterating over it: if both the current and the previous value are greater than zero, add the current value to the previous running total; otherwise start again from the current value. Afterwards, turn the list into a dataframe and concatenate it column-wise to the existing dataframe.
fn = df['Fn'].tolist()
sum_list = [fn[0]]
for i in range(1, len(fn)):
    if fn[i] > 0 and fn[i - 1] > 0:
        # still inside a run of non-zero values: extend the running total
        sum_list.append(sum_list[i - 1] + fn[i])
    else:
        # a zero, or the first non-zero value of a new run
        sum_list.append(fn[i])
dfsum = pd.DataFrame({'Sum': sum_list}, index=df.index)
df = pd.concat([df, dfsum], axis=1)
Hope this helps.
Try this:
sum_arr = [0]
for val in df['Fn']:
    if val > 0:
        sum_arr.append(sum_arr[-1] + 1)
    else:
        sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Cluster Rows in Data Subgroups

I have a dataset df of object components in 3-d space - each ID represents an object which has various components:
ID Comp x y z
A 1 2 2 1
A 2 2 1 -1
A 3 -10 1 -10
A 4 -1 3 -5
B 1 3 0 0
B 2 3 0 -5
...
I would like to loop through each ID, using a clustering technique in sklearn to create clusters of components (Comp) based on each component's (x,y,z) co-ordinates - to achieve something like this:
ID Comp x y z cluster
A 1 2 2 1 1
A 2 2 1 -1 1
A 3 -10 1 -10 2
A 4 -1 3 -5 3
B 1 3 0 0 1
B 2 3 0 -5 1
...
As an example - ID:A, Comp:1 is in cluster 1, whereas ID:A, Comp:4 is in cluster 3. (I plan to then concatenate ID and cluster later).
I'm having no luck with the following groupby + apply:
from sklearn.cluster import AffinityPropagation
ap = AffinityPropagation()
df['cluster']=df.groupby(['ID','Comp']).apply(lambda x: ap.fit_predict(np.array([x.x,x.y,x.z]).T))
I could brute-force it by using a for loop over the ID but my dataset is large (~ 150k ID) and I'm worried about resource and time constraints. Any help would be great!
IIUC, I think you could try something like this:
def ap_fit_pred(x):
    ap = AffinityPropagation()
    return pd.Series(ap.fit_predict(x.loc[:,['x','y','z']]))
df['cluster'] = df.groupby('ID').apply(ap_fit_pred).reset_index(drop=True)

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to flag the positions that fall within the first 2 elements of each consecutive run (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python’s itertools.groupby"
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(df['A'][x] == 1 and not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy(copy=True)
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
def trim_runs(array, cutoff):
    a = numpy.array(array)  # copy, so the caller's data is left untouched
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
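To see that the convolution trick and the groupby/cumcount answer agree, both can be run on the column from the question. This is a sketch under a few assumptions: the values are strictly 0 or 1, the slice [:len(a)] is used in place of [:-cutoff] so that a cutoff of 0 would also work, and the names by_convolution and by_groupby are only illustrative.

```python
import numpy as np
import pandas as pd


def trim_runs(values, cutoff):
    """Zero out every element of a 0/1 sequence that extends a run of 1s past `cutoff`."""
    a = np.array(values, dtype=int)  # copy, so the input is not modified
    window = np.ones(cutoff + 1, dtype=int)
    # counts[i] = number of 1s among a[i - cutoff], ..., a[i]
    counts = np.convolve(a, window)[: len(a)]
    a[counts > cutoff] = 0
    return a


df = pd.DataFrame({"A": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
limit = 2

by_convolution = trim_runs(df["A"], limit)

# groupby/cumcount version: number the elements inside each run, keep the first `limit`.
run_id = (df["A"] != df["A"].shift()).cumsum()
within_limit = df.groupby(run_id).cumcount() < limit
by_groupby = (within_limit & df["A"].astype(bool)).astype(int)

print(by_convolution.tolist())
print(by_groupby.tolist())
```

Comparing the convolution count against the original values is what keeps the rest of a long run at zero: elements 6 and 7 both have three 1s in their window.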

protein sequence coding

I'm working on a Python program to compute a numerical coding of mutated residues and positions for a set of strings (protein sequences), stored in a FASTA-format file in which the sequences are separated by commas. I'm trying to find the positions and residues that are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another set of fasta file) will explain the algorithm behind this. In this figure first box represents alignment of input file sequences. The last box represents the output file. How can I do this with my fasta file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT
positions          1 2 3 4 5 6
protein sequence1  M T A Q D D
protein sequence2  M T A Q D D
protein sequence3  M T S Q E D
protein sequence4  M T A Q D D
protein sequence5  M K A Q H D
DISCARD NON-VARIABLE REGION
positions          2 3 5
protein sequence1  T A D
protein sequence2  T A D
protein sequence3  T S E
protein sequence4  T A D
protein sequence5  K A H
MUTATED RESIDUE IS SPLIT INTO SEPARATE COLUMNS
positions          2 2 3 3 5 5 5
protein sequence1  T   A   D
protein sequence2  T   A   D
protein sequence3  T     S   E
protein sequence4  T   A   D
protein sequence5    K A       H
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
(here I'm getting columns of mutated as well as non-mutated residues, but I want only columns for mutated residues)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print(df.loc[:, ~df.all()])
(Here it gives output, but not in order, i.e. 2K comes first, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.iloc[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.iloc[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows, however this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))
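Putting the pieces together for the layout requested in the question, here is a self-contained sketch on the small example input. It makes a few assumptions beyond the answer: positions are renumbered from 1 to match the desired labels, dtype=int is passed because recent pandas versions return boolean dummy columns by default, and the columns for each position are reordered by first appearance so that 2T comes before 2K as in the desired output. The names variable_positions, blocks and encoded are only illustrative.

```python
import pandas as pd

seqs = "MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD".split(",")

# One row per sequence, one column per position, numbered from 1.
df = pd.DataFrame([list(s) for s in seqs])
df.columns = range(1, df.shape[1] + 1)

# Keep only the positions where more than one residue occurs.
variable_positions = [pos for pos in df.columns if df[pos].nunique() > 1]

blocks = []
for pos in variable_positions:
    dummies = pd.get_dummies(df[pos], dtype=int)
    # get_dummies sorts the residues; restore the order of first appearance.
    dummies = dummies[list(df[pos].unique())]
    dummies.columns = [f"{pos}{residue}" for residue in dummies.columns]
    blocks.append(dummies)

encoded = pd.concat(blocks, axis=1)
encoded.index = [f"sequence{i}" for i in range(1, len(seqs) + 1)]
print(encoded)
```

For the full comma-separated file, the same steps should apply after reading the file contents into a string and splitting on commas, as long as all sequences have the same length.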
