I have a huge dataframe, with about a hundred thousand rows and columns.
My data looks like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
I want to calculate the speed using this equation:
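The formula itself is not shown here. Reading it back from the code below (an assumption), it is the Euclidean distance between consecutive positions divided by the elapsed time, with the speed taken as 0 when the time difference is zero or negative:

```latex
v_i =
\begin{cases}
\dfrac{\sqrt{(X_{i+1}-X_i)^2 + (Y_{i+1}-Y_i)^2}}{T_{i+1}-T_i}, & T_{i+1}-T_i > 0,\\[1ex]
0, & \text{otherwise.}
\end{cases}
```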
I used this code:
import math
import pandas as pd

df = pd.read_csv("data.csv")
df_result = pd.DataFrame(index=df.index)

def v_2(i):
    # squared distance between point i and point i+1 of row x
    return (df.iloc[x, 5+3*(i-1)] - df.iloc[x, 2+3*(i-1)])**2 + (df.iloc[x, 6+3*(i-1)] - df.iloc[x, 3+3*(i-1)])**2

def v(i):
    if (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)]) == 0:
        return 0
    else:
        if (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)]) < 0:
            return 0
        else:
            return math.sqrt(v_2(i)) / (df.iloc[x, 4+3*(i-1)] - df.iloc[x, 1+3*(i-1)])

for i in range(1, int((len(df.columns)-1)/3)):
    v_result = list()
    for x in range(len(df.index)):
        v_2(i)
        v(i)
        v_result.append(v(i))
    df_result[i] = v_result
my expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
But this code takes a huge amount of time.
Could you suggest a simpler and faster approach, or one that uses the multiprocessing module?
Thank you.
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
First, reshape the data from wide format to long format so that there are only 3 columns: T, X and Y. The column suffixes (_1, _2, etc.) are split out into a new index level.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df = df.stack()
This produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next, calculate del_X^2, del_Y^2 and del_T, where the prefix del denotes the difference between consecutive rows. Two utility functions avoid repetition:
def f(x):
    return x.shift(-1) - x
def f2(x):
    return f(x)**2
Update: description of the functions
The first function calculates F(W,n) = W(n+1) - W(n), for all n, where n is the index of the array W. The second function squares the result of the first. Together they give the squared step in each coordinate, which is then summed to get the distance squared. See the documentation for pd.Series.shift for more information and examples.
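As a small illustration of what the two helpers return, here is a sketch that applies them to the Y values of ID1 from the question. The variable names y, step and step_sq are only for this example.

```python
import pandas as pd

def f(x):
    # forward difference: next value minus current value
    return x.shift(-1) - x

def f2(x):
    # squared forward difference
    return f(x) ** 2

y = pd.Series([1, 1, 2, 3, 3, 4, 5])  # Y_1 .. Y_7 of ID1

step = f(y)      # 0, 1, 1, 0, 1, 1, NaN
step_sq = f2(y)  # same values squared; the last entry is still NaN
print(step.tolist())
print(step_sq.tolist())
```

The last entry is NaN because the final row has no successor to subtract from, which is why that row is filtered out later.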
Using lower-case column names for the del-prefixed quantities and the suffix 2 to mean squared:
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
df['t'] = df.groupby(level=0)['T'].transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
This produces the following, which is similar to your output but transposed:
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
You can filter out the last row (where the computed columns t, x2 and y2 are null), fill the NaN values in v with 0, transpose, rename the columns and reset the index to get your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
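The answer mentions going down to the NumPy array if pandas is not fast enough. A minimal sketch of that idea follows, assuming the columns are ordered MAC, T_1, X_1, Y_1, T_2, ... exactly as in the question. The names wide, speed and speeds are illustrative, and the rule of returning 0 for a zero or negative time step is taken from the question's own code.

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's layout: MAC, then T_k, X_k, Y_k repeated.
cols = ["MAC"] + [f"{c}_{k}" for k in range(1, 8) for c in "TXY"]
rows = [
    ["ID1", 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5],
    ["ID2", 6, 2, 5, 6, 2, 5, 7, 3, 5, 7, 3, 5, 8, 4, 5, 9, 5, 5, 10, 5, 4],
    ["ID3", 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5, 6, 2, 5],
]
wide = pd.DataFrame(rows, columns=cols)

# Split the numeric block into T, X and Y matrices (one row per MAC).
values = wide.drop(columns="MAC").to_numpy(dtype=float)
t, x, y = values[:, 0::3], values[:, 1::3], values[:, 2::3]

# Differences between consecutive time steps, for all rows at once.
dt = np.diff(t, axis=1)
dist = np.hypot(np.diff(x, axis=1), np.diff(y, axis=1))

# Speed stays 0 wherever the time step is zero or negative.
speed = np.zeros_like(dist)
np.divide(dist, dt, out=speed, where=dt > 0)

speeds = pd.DataFrame(
    speed,
    index=wide["MAC"],
    columns=[f"V{k}" for k in range(1, speed.shape[1] + 1)],
)
print(speeds)
```

Because the division is only carried out where dt > 0, no NaN or inf values appear and no fillna step is needed.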
I suggest you use Apache Spark if you want real speed.
You can do that by passing your function to Spark, as described in this part of the Spark documentation:
Passing Functions to Spark
Related
Let's say I have the following dataset
A B
0 1 1
1 NaN 3
2 5 2
3 7 4
4 NaN 3
5 3 3
I want to fill the NaNs in A with some function (for example, the mean) of the A values of the nearest neighbors, where neighbors are the rows whose B value is within a given threshold. For example, if the threshold is 1, then
for object 0, the neighbor is object 2 (because 2 - 1 <= 1)
for object 1, the neighbors are objects 2, 3, 4, 5
and so on
so the result would be
A B
0 1 1
1 5 3 # 5 = (5 + 7 + 3) / 3
2 5 2
3 7 4
4 5 3 # 5 = (5 + 7 + 3) / 3
5 3 3
If NaNs are encountered in the computation, they are ignored. How can I do that?
THRE = 1
b = df.B.to_numpy()
isnan_a = df.A.isna()
# outer difference helps in whether neighbor or not in B
neigh_mask = np.abs(np.subtract.outer(b, b)) <= THRE
# but interested in those whose A value is NaN
neigh_mask_for_nans = neigh_mask[isnan_a]
# get neighbors to take mean to via A times the neighbor mask of True/False's
neighs_for_nan_a = df.A.to_numpy()[None] * neigh_mask_for_nans
# False's gave 0 above in mul but we want them NaN to discard in mean
neighs_for_nan_a[~neigh_mask_for_nans] = np.nan
# take the mean ignoring NaNs and fill
df.loc[isnan_a, "A"] = np.nanmean(neighs_for_nan_a, axis=1)
to get
A B
0 1.0 1
1 5.0 3
2 5.0 2
3 7.0 4
4 5.0 3
5 3.0 3
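The same idea can be wrapped in a reusable function so that the aggregation can be swapped, since the question asks for "some function, for example mean". This is only a sketch: the name fill_by_neighbors and its parameters are hypothetical, and it builds the distance mask only for the rows with missing A instead of for all pairs.

```python
import numpy as np
import pandas as pd

def fill_by_neighbors(df, threshold=1, agg=np.nanmean):
    """Fill NaNs in A with agg of A over the rows whose B is within threshold."""
    a = df["A"].to_numpy(dtype=float)
    b = df["B"].to_numpy()
    missing = np.isnan(a)
    # rows: objects with missing A, columns: candidate neighbors
    near = np.abs(b[missing, None] - b[None, :]) <= threshold
    # keep A where the row is a neighbor, NaN elsewhere so agg can skip it
    candidates = np.where(near, a[None, :], np.nan)
    out = df.copy()
    out.loc[missing, "A"] = agg(candidates, axis=1)
    return out

df = pd.DataFrame({"A": [1, np.nan, 5, 7, np.nan, 3],
                   "B": [1, 3, 2, 4, 3, 3]})
filled = fill_by_neighbors(df, threshold=1)
print(filled)
```

Passing agg=np.nanmedian or agg=np.nanmax changes the fill rule without touching the masking logic. A row with no non-NaN neighbors stays NaN.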
Let's say we want to compute the variable D in the dataframe below based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes,
the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value for the first row after the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
# Compute delta
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
.dt.total_seconds().div(60))
# Fill nan
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
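You can also use the difference between datetime columns in pandas.
First create datetime versions of B and C:
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
Then compute the shifted difference within each group:
df['D'] = (df.groupby('A')
             .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
             .reset_index(drop=True))
This relies on df being sorted by A and having the default integer index, because reset_index(drop=True) puts the grouped result back by position.
You can always drop these new columns later.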
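Another way to avoid the cross-category difference is to shift B within each group, so the first row of every category has no predecessor at all. The sketch below uses a shortened copy of the question's data and is an alternative to the answers above, not a restatement of them.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": ["5:43:00", "6:19:00", "6:53:00", "6:07:00", "6:42:00", "7:15:00"],
    "C": ["5:24:00", "5:47:00", "6:23:00", "5:40:00", "6:11:00", "6:45:00"],
})

b = pd.to_timedelta(df["B"])
c = pd.to_timedelta(df["C"])

# Shift B inside each category; the first row of each group becomes NaT.
prev_b = b.groupby(df["A"]).shift()

# Difference in minutes; NaT propagates to NaN.
df["D"] = (c - prev_b).dt.total_seconds() / 60
print(df)
```

Because the shift happens per group, no separate step is needed to blank out the first row of each category, and the result does not depend on the row order between groups.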
I want to build a data frame with m columns and n rows.
Each row starts with 1 and increments by 1 until m.
I've tried to find a solution, but I found only this solution for the columns.
I have also added a figure of a simple case.
Using assign to broadcast the rows in an empty DataFrame:
df = (
pd.DataFrame(index=range(3))
.assign(**{f'c{i}': i+1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
You can use np.tile:
import numpy as np
m = 4
n = 3
out = pd.DataFrame(np.tile(np.arange(1, m+1), (n, 1)), columns=[f'c{num}' for num in range(m)])
Output:
   c0  c1  c2  c3
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4
Try this (no additional libraries needed):
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4 and n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
We can just use np.ones with broadcasting:
n = 3  # rows
m = 4  # columns
out = pd.DataFrame(np.ones((n, m)) * (np.arange(m) + 1))
Output:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
I have this dataframe:
a b c d
4 7 5 12
3 8 2 8
1 9 3 5
9 2 6 4
I want column 'd' to become the difference between the n-th value of column 'a' and the (n+1)-th value of column 'a'.
I tried this but it doesn't run:
for i in data.index-1:
    data.iloc[i]['d']=data.iloc[i]['a']-data.iloc[i+1]['a']
Can anyone help me?
Basically what you want is diff.
df = pd.DataFrame.from_dict({"a":[4,3,1,9]})
df["d"] = df["a"].diff(periods=-1)
print(df)
Output
a d
0 4 1.0
1 3 2.0
2 1 -8.0
3 9 NaN
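To see what diff(periods=-1) does, it can be compared with an explicit shift. The following sketch uses the same column a as above; via_diff and via_shift are names chosen just for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 3, 1, 9]})

# diff(periods=-1) subtracts the next row from the current row ...
via_diff = df["a"].diff(periods=-1)

# ... which is the same as subtracting the column shifted up by one.
via_shift = df["a"] - df["a"].shift(-1)

print(via_diff.equals(via_shift))
```

The last row is NaN in both cases because there is no following value to subtract.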
Let's try a simple way, with an explicit loop:
df = pd.DataFrame.from_dict({'a': [2, 4, 8, 15]})
diff = []
for i in range(len(df) - 1):
    # n-th value minus (n+1)-th value, as asked in the question
    diff.append(df['a'][i] - df['a'][i+1])
diff.append(np.nan)
df['d'] = diff
print(df)
    a    d
0   2 -2.0
1   4 -4.0
2   8 -7.0
3  15  NaN
I have this dataframe.
from pandas import DataFrame
import pandas as pd
df = pd.DataFrame({'name': ['A','D','M','T','B','C','D','E','A','L'],
'id': [1,1,1,2,2,3,3,3,3,5],
'rate': [3.5,4.5,2.0,5.0,4.0,1.5,2.0,2.0,1.0,5.0]})
>> df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 C 3 1.5
6 D 3 2.0
7 E 3 2.0
8 A 3 1.0
9 L 5 5.0
df = df.groupby('id')['rate'].mean()
What I want is this:
1) Find the mean of every 'id'.
2) Give the number of ids (length) that have mean >= 3.
3) Give back all rows of the dataframe where the mean of the id is >= 3.
Expected output:
Number of ids (length) where mean >= 3: 3
>> dataframe where (mean(id) >=3)
>>df
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
5 L 5 5.0
Use GroupBy.transform to get the group means as a Series with the same size as the original DataFrame, so that you can filter by boolean indexing:
df = df[df.groupby('id')['rate'].transform('mean') >=3]
print (df)
name id rate
0 A 1 3.5
1 D 1 4.5
2 M 1 2.0
3 T 2 5.0
4 B 2 4.0
9 L 5 5.0
Detail (computed on the original DataFrame, before filtering):
print (df.groupby('id')['rate'].transform('mean'))
0 3.333333
1 3.333333
2 3.333333
3 4.500000
4 4.500000
5 1.625000
6 1.625000
7 1.625000
8 1.625000
9 5.000000
Name: rate, dtype: float64
Alternative solution with DataFrameGroupBy.filter:
df = df.groupby('id').filter(lambda x: x['rate'].mean() >=3)
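The question also asks for the number of ids whose mean is at least 3. A short sketch that covers all three steps from the per-id means follows. The names means, good_ids, n_good and selected are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'D', 'M', 'T', 'B', 'C', 'D', 'E', 'A', 'L'],
                   'id': [1, 1, 1, 2, 2, 3, 3, 3, 3, 5],
                   'rate': [3.5, 4.5, 2.0, 5.0, 4.0, 1.5, 2.0, 2.0, 1.0, 5.0]})

# 1) mean rate per id
means = df.groupby('id')['rate'].mean()

# 2) ids whose mean is at least 3, and how many there are
good_ids = means[means >= 3].index
n_good = len(good_ids)

# 3) all rows that belong to those ids
selected = df[df['id'].isin(good_ids)]

print("Number of ids where mean >= 3:", n_good)
print(selected)
```

This selects the same rows as the transform and filter solutions, and the intermediate means Series makes the count available as well.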