I have a dataframe:
A B C v1 v2 v3
q 2 3 4 9 1
8 f 2 7 2 4
I want to calculate a new column that will hold the RMS (sqrt(sum(x^2))) of all the v columns.
So the new df will be:
A B C v1 v2 v3 v_rms
q 2 3 4 9 1 9.9
8 f 2 7 2 4 8.3
since sqrt(4^2 + 9^2 + 1^2) = 9.9, sqrt(7^2 + 2^2 + 4^2) = 8.3
What is the best way to do so?
Use DataFrame.filter to select the v columns, square with DataFrame.pow, sum across rows, and take the square root with pow(1/2):
df['v_rms'] = df.filter(like='v').pow(2).sum(axis=1).pow(1/2)
print (df)
A B C v1 v2 v3 v_rms
0 q 2 3 4 9 1 9.899495
1 8 f 2 7 2 4 8.306624
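Equivalently, if NumPy is available, np.linalg.norm computes the root-sum-of-squares directly; a minimal sketch (the regex avoids also matching the new v_rms column):
import numpy as np

# Row-wise L2 norm of the v columns = sqrt(sum(x^2))
df['v_rms'] = np.linalg.norm(df.filter(regex=r'^v\d'), axis=1)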
I want to build a data frame with m columns and n rows. Each row starts at 1 and increments by 1 up to m.
I've tried to find a solution, but only found one for the columns.
I have also added a figure of a simple case.
Using assign to broadcast the rows in an empty DataFrame:
df = (
    pd.DataFrame(index=range(3))
      .assign(**{f'c{i}': i + 1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
You can use np.tile:
import numpy as np
m = 4
n = 3
out = pd.DataFrame(np.tile(np.arange(1, m + 1), (n, 1)),
                   columns=[f'c{num}' for num in range(m)])
Output:
   c0  c1  c2  c3
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4
Try this (no additional libraries needed):
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4, n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
We can just use np.ones:
m = 4
n = 3
out = pd.DataFrame(np.ones((n, m)) * (np.arange(m) + 1))
Output:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
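For completeness, one more equivalent sketch (not from any answer above) using np.broadcast_to, which expands the single row without tiling intermediates; the .copy() makes the result writable for pandas:
import numpy as np
import pandas as pd

m, n = 4, 3
row = np.arange(1, m + 1)  # the values 1..m for one row
out = pd.DataFrame(np.broadcast_to(row, (n, m)).copy(),
                   columns=[f'c{i}' for i in range(m)])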
I have a dataframe:
A B C V
1 4 7 T
2 6 8 T
3 9 9 F
and I want to create a new column that sums the rows where V is 'T'.
So I want
A B C V D
1 4 7 T 12
2 6 8 T 16
3 9 9 F
Is there any way to do this without iteration?
Mask the values before summing:
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
# Or,
df.select_dtypes(np.number).mask(df['V'] != 'T').sum(axis=1, skipna=False)
0 12.0
1 16.0
2 NaN
dtype: float64
df['D'] = df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
If you actually wanted blanks, use
df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T', '')
0    12
1    16
2
dtype: object
Which returns an object column (not recommended).
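If the blanks are only needed for display, a hedged alternative (assuming pandas >= 1.0): keep the numeric NaN column and blank it out at render time with the Styler:
df['D'] = df.select_dtypes(np.number).sum(axis=1).mask(df['V'] != 'T')
df.style.format(na_rep='', subset=['D'])  # D stays float; only the rendered view shows blanks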
Alternatively, using np.where:
np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
# array([12., 16., nan])
df['D'] = np.where(
df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), np.nan)
df
A B C V D
0 1 4 7 T 12.0
1 2 6 8 T 16.0
2 3 9 9 F NaN
Use numpy.where:
import numpy as np
df['D'] = np.where(df['V'] == 'T', df.select_dtypes(np.number).sum(axis=1), None)
Note that filling with None makes D an object column; use np.nan instead to keep it numeric.
df['D'] = df[['A', 'B', 'C']][df['V'] == 'T'].sum(axis=1)
In [51]: df
Out[51]:
   A  B  C  V     D
0  1  4  7  T  12.0
1  2  6  8  T  16.0
2  3  9  9  F   NaN
I have a huge dataframe, hundreds of thousands of rows and columns.
My data looks like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
I want to calculate the speed using this equation: v_n = sqrt((X_{n+1} - X_n)^2 + (Y_{n+1} - Y_n)^2) / (T_{n+1} - T_n), with v_n set to 0 whenever T_{n+1} - T_n <= 0.
I used this code:
import math
import pandas as pd

df = pd.read_csv("data.csv")
df_result = pd.DataFrame()

def v_2(i):
    # squared distance between point i and point i+1
    return ((df.iloc[x, 5 + 3*(i-1)] - df.iloc[x, 2 + 3*(i-1)])**2
            + (df.iloc[x, 6 + 3*(i-1)] - df.iloc[x, 3 + 3*(i-1)])**2)

def v(i):
    dt = df.iloc[x, 4 + 3*(i-1)] - df.iloc[x, 1 + 3*(i-1)]
    if dt <= 0:
        return 0
    return math.sqrt(v_2(i)) / dt

for i in range(1, int((len(df.columns) - 1) / 3)):
    v_result = []
    for x in range(len(df.index)):
        v_result.append(v(i))
    df_result[i] = v_result
My expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
but this code takes a huge amount of time.
Could you suggest a simpler and faster approach, perhaps using the multiprocessing module?
Thank you.
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
First, reshape the data from the wide format to a long format so that there are only 3 columns: T, X and Y. The column suffixes, i.e. _1, _2, etc., are split out into a new index level.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(list(zip(*df.columns.str.split('_'))))
df = df.stack()
this produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next, calculate del_X^2, del_Y^2 & del_T, where the del prefix denotes the forward difference. This is easiest done with two utility functions, to avoid repetition.
def f(x):
    return x.shift(-1) - x

def f2(x):
    return f(x)**2
update: description of functions
The first function calculates F(W, n) = W(n+1) - W(n) for all n, where n is the index of the array W. The second squares that forward difference. These functions are composed to calculate the distance squared. See the documentation for pd.Series.shift for more information & examples.
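A tiny illustration of the forward difference on a toy Series; note the NaN at the end, which later marks the last row of each group:
s = pd.Series([1, 1, 2, 3])
print(s.shift(-1) - s)  # 0.0, 1.0, 1.0, NaN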
Using lower-case column names for the del quantities above, and the suffix 2 to mean squared:
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
df['t'] = df.groupby(level=0)['T'].transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
produces the following, which is similar to your output, but transposed.
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
you can filter out the last row (where the computed columns t, x2 & y2 are null), fill the np.nan in v with 0, transpose, rename the columns & reset the index to get your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
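If that is still not fast enough, here is a hedged sketch of the fully vectorized NumPy route mentioned at the start, assuming df is still in the original wide layout with MAC as the first column:
import numpy as np
import pandas as pd

vals = df.set_index('MAC').to_numpy().reshape(len(df), -1, 3)  # (rows, points, [T, X, Y])
dt = np.diff(vals[:, :, 0], axis=1)                            # T(n+1) - T(n)
dist = np.hypot(np.diff(vals[:, :, 1], axis=1),                # sqrt of squared X deltas
                np.diff(vals[:, :, 2], axis=1))                # plus squared Y deltas
with np.errstate(divide='ignore', invalid='ignore'):
    v = np.where(dt > 0, dist / dt, 0)                         # 0 where time does not advance
out = pd.DataFrame(v, index=df['MAC'],
                   columns=[f'V{i}' for i in range(1, v.shape[1] + 1)])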
I suggest you use Apache Spark if you want real speed.
You can do that by passing your function to Spark, as described in this documentation:
Passing function to Spark
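For context, a minimal hedged sketch of that approach; row_speeds here is a hypothetical function wrapping the per-row logic from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("speed").getOrCreate()
sdf = spark.createDataFrame(df)             # df is the pandas frame from the question
result = sdf.rdd.map(row_speeds).collect()  # row_speeds: hypothetical per-row function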
I have a requirement where I need the mean of common columns (per row) from two dataframes.
I cannot think of a pythonic way of doing this. I know I can loop through the two data frames, find the common columns, and then take the mean of the rows where the key matches.
Assuming I have below Data Frames:
DF1:
Key A B C D E
K1 2 3 4 5 8
K2 2 3 4 5 8
K3 2 3 4 5 8
K4 2 3 4 5 8
DF2:
Key A B C D
K1 4 7 4 7
K2 4 7 4 7
K3 4 7 4 7
K4 4 7 4 7
The result DF should hold the mean values of the two DFs, per column and per row, where Key matches.
ResultDF:
Key A B C D
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
I know I should put sample code here, but I haven't been able to come up with any logic for this so far.
Use DataFrame.add with Key as the index:
df1.set_index('Key').add(df2.set_index('Key')).dropna(axis=1) / 2
A B C D
Key
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
An alternative with concat + groupby:
pd.concat([df1, df2], axis=0).dropna(axis=1).groupby('Key').mean()
A B C D
Key
K1 3 5 4 6
K2 3 5 4 6
K3 3 5 4 6
K4 3 5 4 6
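One caveat: dropna(axis=1) also drops any shared column that contains real NaNs. If that matters, a hedged sketch that restricts to the shared columns explicitly:
# Keep only the columns present in both frames (Key included), then average per Key.
common = df1.columns.intersection(df2.columns)
result = pd.concat([df1[common], df2[common]]).groupby('Key').mean()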
Try adding the two data frames together, then use the pandas apply function with a lambda that divides x by two:
import pandas as pd
df1 = pd.DataFrame({'A': [2,2]})
df2 = pd.DataFrame({'A': [4,4]})
print((df1+df2).apply(lambda x: x/2))
Output:
A
0 3.0
1 3.0
Note: this is just with a dummy data frame
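By the way, the apply is not strictly needed; plain arithmetic on the frames gives the same result:
print((df1 + df2) / 2)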
I have a table of data with a multi-index. The first level of the multi-index is a name corresponding to a given sequence (DNA); the second level corresponds to a specific type of sequence variant: wt, m1, m2, m3 in the example below. Not all given wt sequences will have all types of variants (see seqA and seqC below).
df = pd.DataFrame(data={'A': range(1, 9), 'B': range(1, 9), 'C': range(1, 9)},
                  index=pd.MultiIndex.from_tuples([
                      ('seqA', 'wt'), ('seqA', 'm1'), ('seqA', 'm2'),
                      ('seqB', 'wt'), ('seqB', 'm1'), ('seqB', 'm2'),
                      ('seqB', 'm3'), ('seqC', 'wt')]))
df.index.rename(['seq_name', 'type'], inplace=True)
print(df)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
m3 7 7 7
seqC wt 8 8 8
I want to perform subsequent analyses on the data for only the sequences that have specific type(s) of variants (wt, m1 and m2 in this example). So I want to filter my data frame to require that a given seq_name has all the variant types specified in a list.
My current solution is pretty clunky, and not very aesthetically pleasing IMO.
var_l = ['wt', 'm1', 'm2']
df1 = df[df.index.get_level_values('type').isin(var_l)]  # Filter out variants not of interest
set_l = []
for v in var_l:  # Filter for each variant individually, and store the seq_names
    df2 = df[df.index.get_level_values('type').isin([v])]
    set_l.append(set(df2.index.get_level_values('seq_name')))
seq_s = set.intersection(*set_l)  # Get seq_names that have all three variants
df3 = df1[df1.index.get_level_values('seq_name').isin(seq_s)]  # Filter based on seq_name
print(df3)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
I feel like there must be a one-liner that can do this. Something like:
var_l = ['wt', 'm1', 'm2']
filtered_df = filterDataframe(df1, var_l)
print(filtered_df)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
I've tried searching this site, and have only found answers that let you filter by any item in a list.
You can use query with filter:
var_l = ['wt', 'm1', 'm2']
filtered_df = df.query('type in @var_l').groupby(level=0).filter(lambda x: len(x) == len(var_l))
print (filtered_df)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
Another solution: transform with size, then filter by boolean indexing:
filtered_df = df.query('type in @var_l')
filtered_df = filtered_df[filtered_df.groupby(level=0)['A']
                          .transform('size')
                          .eq(len(var_l))
                          .rename(None)]
print (filtered_df)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
It works because:
print (filtered_df.groupby(level=0)['A'].transform('size'))
seq_name type
seqA wt 3
m1 3
m2 3
seqB wt 3
m1 3
m2 3
seqC wt 1
Name: A, dtype: int32
print (filtered_df.groupby(level=0)['A']
.transform('size')
.eq(len(var_l))
.rename(None))
seq_name type
seqA wt True
m1 True
m2 True
seqB wt True
m1 True
m2 True
seqC wt False
dtype: bool
option 1
using query + stack
As @jezrael pointed out, this depends on no NaN existing in the rows to be analyzed.
df.query('type in @var_l').unstack().dropna().stack()
A B C
seq_name type
seqA m1 2.0 2.0 2.0
m2 3.0 3.0 3.0
wt 1.0 1.0 1.0
seqB m1 5.0 5.0 5.0
m2 6.0 6.0 6.0
wt 4.0 4.0 4.0
Preserve the dtypes
df.query('type in @var_l').unstack().dropna().stack().astype(df.dtypes)
A B C
seq_name type
seqA m1 2 2 2
m2 3 3 3
wt 1 1 1
seqB m1 5 5 5
m2 6 6 6
wt 4 4 4
option 2
using filter
it checks whether each group's sub-index, intersected with var_l, is the same as var_l
def correct_vars(df, v):
    x = set(v)
    n = df.name
    y = set(df.xs(n).index.intersection(v))
    return x == y
df.groupby(level=0).filter(correct_vars, v=var_l)
A B C
seq_name type
seqA wt 1 1 1
m1 2 2 2
m2 3 3 3
seqB wt 4 4 4
m1 5 5 5
m2 6 6 6
m3 7 7 7
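Note that filter keeps whole groups, which is why seqB's m3 row survives here. If you want to drop it, as in option 1, chain the same query (a small hedged addition):
df.groupby(level=0).filter(correct_vars, v=var_l).query('type in @var_l')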