I've been working on a DataFrame like the following extract, and I want to know when the value changes:
A M C
0 2.0 1 C1
1 2.0 1 C1
2 2.0 2 C1
3 2.0 2 C1
4 2.0 3 C1
5 2.0 3 C1
6 2.0 1 C2
7 2.0 1 C2
8 2.0 2 C2
9 2.0 2 C3
10 2.0 3 C3
11 2.0 3 C3
13 2.1 1 C3
14 2.1 1 C3
15 2.1 2 C3
16 2.1 2 C3
17 2.1 3 C3
18 2.1 3 C3
I know that A or C always changes when M starts again at 1. The question is: how can I get the position every time the M value restarts at 1?
I don't know whether your entire data set is built the same way as the extract you are showing, but from what I can see you are looking for the places where the M column goes from 3 to 1, which gives a difference of -2:
df[df['M'].diff()==-2].index
Out[101]: Int64Index([6, 13], dtype='int64')
If your M column always increases but can go higher than 3, you could instead look for every negative difference, such as:
df[df['M'].diff()<0].index
Out[103]: Int64Index([6, 13], dtype='int64')
If there is no such pattern, you could simply do:
df[(df['M'].diff()!=0) & (df['M']==1)].index
Out[104]: Int64Index([0, 6, 13], dtype='int64')
This adds index 0 because .diff() returns NaN for the first row of the DataFrame, which is != 0, and df['M'] == 1 on that row.
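The variants above can be tried on a rebuilt copy of the extract. This is only a sketch: the DataFrame is retyped by hand from the question (including the missing label 12), and the names `drops` and `starts` are introduced here for illustration.

```python
import pandas as pd

# Rebuild the extract from the question; note that label 12 is absent there.
idx = list(range(12)) + list(range(13, 19))
df = pd.DataFrame(
    {
        "A": [2.0] * 12 + [2.1] * 6,
        "M": [1, 1, 2, 2, 3, 3] * 3,
        "C": ["C1"] * 6 + ["C2"] * 3 + ["C3"] * 9,
    },
    index=idx,
)

# Rows where M decreases compared with the previous row.
drops = df.index[df["M"].diff() < 0]

# Rows where M changed (or is the first row) and equals 1.
starts = df.index[(df["M"].diff() != 0) & (df["M"] == 1)]

print(list(drops))   # [6, 13]
print(list(starts))  # [0, 6, 13]
```

Indexing `df.index` with the boolean mask returns the same labels as `df[mask].index` without building an intermediate DataFrame.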
Another way to determine when a new M set starts is to find where M is 1 and the previous M isn't:
In [18]: (df['M'] == 1) & (df["M"].shift() != 1)
Out[18]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 False
[.. and so on]
Name: M, dtype: bool
This includes the first element, which often makes sense. Once you have this, you can take its cumulative sum to get a group number associated with each group (because True == 1 and False == 0):
In [19]: df["group_index"] = ((df['M'] == 1) & (df["M"].shift() != 1)).cumsum()
In [20]: df
Out[20]:
A M C group_index
0 2.0 1 C1 1
1 2.0 1 C1 1
2 2.0 2 C1 1
3 2.0 2 C1 1
4 2.0 3 C1 1
5 2.0 3 C1 1
6 2.0 1 C2 2
7 2.0 1 C2 2
[.. and so on]
which is convenient because then you can use groupby to perform operations on the different clusters.
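As a small illustration of that last point, here is a sketch that groups on the cumulative-sum label. The shortened M column and the names `is_start`, `summary` and `first_rows` are assumptions made for this example only.

```python
import pandas as pd

# Two complete 1-2-3 cycles, shortened from the question's data.
df = pd.DataFrame({"M": [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3]})

# True on the first row of every new run that starts with M == 1.
is_start = (df["M"] == 1) & (df["M"].shift() != 1)
df["group_index"] = is_start.cumsum()

# Per-group operations become plain groupby calls.
summary = df.groupby("group_index")["M"].agg(["size", "max"])
first_rows = df.index[is_start]

print(summary)
print(list(first_rows))  # [0, 6]
```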
Let's say we want to compute the variable D in the DataFrame below, based on the time values in variables B and C.
Here, the second row of D is C2 - B1, a difference of 4 minutes, the third row is C3 - B2 = 4 minutes, and so on.
There is no reference value for the first row of D, so it is NA.
Issue:
We also want an NA value on the first row after the category value in variable A changes from 1 to 2. In other words, the value -183 must be replaced by NA.
A B C D
1 5:43:00 5:24:00 NA
1 6:19:00 5:47:00 4
1 6:53:00 6:23:00 4
1 7:29:00 6:55:00 2
1 8:03:00 7:31:00 2
1 8:43:00 8:05:00 2
2 6:07:00 5:40:00 -183
2 6:42:00 6:11:00 4
2 7:15:00 6:45:00 3
2 7:53:00 7:17:00 2
2 8:30:00 7:55:00 2
2 9:07:00 8:32:00 2
2 9:41:00 9:09:00 2
2 10:17:00 9:46:00 5
2 10:52:00 10:20:00 3
You can use:
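import numpy as np
import pandas as pd

# Compute delta in minutes: C of the current row minus B of the previous row
df['D'] = (pd.to_timedelta(df['C']).sub(pd.to_timedelta(df['B'].shift()))
             .dt.total_seconds().div(60))
# Set D to NaN on the first row of each run of A (where A differs from the previous row)
df.loc[df['A'].ne(df['A'].shift()), 'D'] = np.nan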
Output:
>>> df
A B C D
0 1 5:43:00 5:24:00 NaN
1 1 6:19:00 5:47:00 4.0
2 1 6:53:00 6:23:00 4.0
3 1 7:29:00 6:55:00 2.0
4 1 8:03:00 7:31:00 2.0
5 1 8:43:00 8:05:00 2.0
6 2 6:07:00 5:40:00 NaN
7 2 6:42:00 6:11:00 4.0
8 2 7:15:00 6:45:00 3.0
9 2 7:53:00 7:17:00 2.0
10 2 8:30:00 7:55:00 2.0
11 2 9:07:00 8:32:00 2.0
12 2 9:41:00 9:09:00 2.0
13 2 10:17:00 9:46:00 5.0
14 2 10:52:00 10:20:00 3.0
You can also use the difference between datetime columns in pandas.
Given
df['B_dt'] = pd.to_datetime(df['B'])
df['C_dt'] = pd.to_datetime(df['C'])
the following becomes possible (this assumes the rows are already sorted by A and use the default integer index, since the result is realigned by position):
>>> df['D'] = (df.groupby('A')
...            .apply(lambda s: (s['C_dt'] - s['B_dt'].shift()).dt.seconds / 60)
...            .reset_index(drop=True))
You can always drop these new columns later.
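A possible variant of the two answers above shifts B within each category, so the first row of every A group comes out as NaN without a separate masking step. This is a sketch on a shortened copy of the question's data; the names `b`, `c` and `prev_b` are illustrative.

```python
import pandas as pd

# Shortened copy of the question's data: three rows per category.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": ["5:43:00", "6:19:00", "6:53:00", "6:07:00", "6:42:00", "7:15:00"],
    "C": ["5:24:00", "5:47:00", "6:23:00", "5:40:00", "6:11:00", "6:45:00"],
})

b = pd.to_timedelta(df["B"])
c = pd.to_timedelta(df["C"])

# Previous B within the same category; NaT on the first row of each group.
prev_b = b.groupby(df["A"]).shift()

# Difference in minutes; NaT propagates to NaN.
df["D"] = (c - prev_b).dt.total_seconds() / 60
print(df)
```

Because the shift happens per group, this does not depend on the rows of one category being contiguous.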
I want to build a DataFrame with m columns and n rows.
Each row starts at 1 and increments by 1 up to m.
I tried to find a solution, but I only found one that does this for the columns.
I have also added a figure of a simple case.
Using assign to broadcast the rows in an empty DataFrame:
import pandas as pd

df = (
    pd.DataFrame(index=range(3))
    .assign(**{f'c{i}': i+1 for i in range(4)})
)
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
You can use np.tile:
import numpy as np
import pandas as pd
m = 4
n = 3
# np.arange(1, m + 1) yields 1..m, so the frame has m columns
out = pd.DataFrame(np.tile(np.arange(1, m + 1), (n, 1)), columns=[f'c{num}' for num in range(m)])
Output:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
Try this (no libraries needed beyond pandas):
df = pd.DataFrame({f'c{i}': [i + 1] * n for i in range(m)})
Result with m = 4 and n = 3:
c0 c1 c2 c3
0 1 2 3 4
1 1 2 3 4
2 1 2 3 4
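We can just use np.ones (note that here m is the number of rows and n the number of columns, and the result is float):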
m = 3
n = 4
out = pd.DataFrame(np.ones((m,n))*(np.arange(n)+1))
Out[139]:
0 1 2 3
0 1.0 2.0 3.0 4.0
1 1.0 2.0 3.0 4.0
2 1.0 2.0 3.0 4.0
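The answers above differ in which of m and n counts rows and whether the result is integer or float. A minimal sketch that makes both explicit is given below; the helper name `repeated_rows` and its parameters are hypothetical, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

def repeated_rows(n_rows: int, n_cols: int) -> pd.DataFrame:
    """Return n_rows identical rows holding 1..n_cols (hypothetical helper)."""
    row = np.arange(1, n_cols + 1)      # [1, 2, ..., n_cols], integer dtype
    data = np.tile(row, (n_rows, 1))    # repeat the row n_rows times
    return pd.DataFrame(data, columns=[f"c{i}" for i in range(n_cols)])

out = repeated_rows(3, 4)
print(out)
```

Keeping the integer dtype avoids the 1.0, 2.0, ... values that the np.ones multiplication produces.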
df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
The dataframe looks like:
M
Tissues
a1 1
x2 2
y3 a
b 4
c1 b
v2 a
w3 7
How can I replace all 'a' values in column M with a specific value, say 2, and all 'b' values with 3?
I tried:
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].map(replace_values)
but that changed the values that are not among the keys of replace_values to NaN:
Tissues M
0 a1 NaN
1 x2 NaN
2 y3 2.0
3 b NaN
4 c1 3.0
5 v2 2.0
6 w3 NaN
I see that I can do
df.loc[df['M'] == 'a', 'M'] = 2
but can I do this efficiently for a, b and so on, instead of one by one?
Use df.replace:
df = pd.DataFrame({'Tissues':['a1','x2','y3','b','c1','v2','w3'], 'M':[1,2,'a',4,'b','a',7]})
df.set_index('Tissues')
replace_values = {'a':2, 'b':3}
df['M'] = df['M'].replace(replace_values)
Output:
>>> df
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7
Fix your code by adding fillna:
df['M'] = df['M'].map(replace_values).fillna(df.M)
df
Tissues M
0 a1 1.0
1 x2 2.0
2 y3 2.0
3 b 4.0
4 c1 3.0
5 v2 2.0
6 w3 7.0
Use df.replace with a nested dict, which restricts the replacement to column M:
replace_values = {'a':2, 'b':3}
df = df.replace({"M": replace_values})
Results:
Tissues M
0 a1 1
1 x2 2
2 y3 2
3 b 4
4 c1 3
5 v2 2
6 w3 7
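To see side by side why map alone produced NaN while replace did not, here is a short sketch on the question's data. The variable names `mapped`, `with_fallback` and `replaced` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Tissues": ["a1", "x2", "y3", "b", "c1", "v2", "w3"],
    "M": [1, 2, "a", 4, "b", "a", 7],
})
replace_values = {"a": 2, "b": 3}

# map looks every value up in the dict; values without a key become NaN.
mapped = df["M"].map(replace_values)

# Passing a function with a fallback keeps values that have no key.
with_fallback = df["M"].map(lambda v: replace_values.get(v, v))

# replace only touches values that appear as keys.
replaced = df["M"].replace(replace_values)

print(mapped.isna().sum())      # 4
print(with_fallback.tolist())   # [1, 2, 2, 4, 3, 2, 7]
print(replaced.tolist())        # [1, 2, 2, 4, 3, 2, 7]
```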
I have a huge DataFrame, with hundreds of thousands of rows and columns.
My data looks like this:
df
MAC T_1 X_1 Y_1 T_2 X_2 Y_2 T_3 X_3 Y_3 T_4 X_4 Y_4 T_5 X_5 Y_5 T_6 X_6 Y_6 T_7 X_7 Y_7
ID1 1 1 1 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5
ID2 6 2 5 6 2 5 7 3 5 7 3 5 8 4 5 9 5 5 10 5 4
ID3 1 1 1 2 1 2 3 1 3 3 1 3 4 1 4 5 1 5 6 2 5
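I want to calculate the speed between consecutive points. As implemented in the code below, that is:
V_i = sqrt((X_{i+1} - X_i)^2 + (Y_{i+1} - Y_i)^2) / (T_{i+1} - T_i), with V_i = 0 when T_{i+1} - T_i <= 0
I used this code: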
import math
import pandas as pd

df = pd.read_csv("data.csv")

def v_2(x, i):
    # Squared distance between points i and i+1 of row x
    # (.iloc replaces the removed .ix indexer)
    dx = df.iloc[x, 5 + 3*(i-1)] - df.iloc[x, 2 + 3*(i-1)]
    dy = df.iloc[x, 6 + 3*(i-1)] - df.iloc[x, 3 + 3*(i-1)]
    return dx**2 + dy**2

def v(x, i):
    # Time elapsed between points i and i+1 of row x
    dt = df.iloc[x, 4 + 3*(i-1)] - df.iloc[x, 1 + 3*(i-1)]
    if dt <= 0:
        return 0
    return math.sqrt(v_2(x, i)) / dt

df_result = pd.DataFrame(index=df.index)
for i in range(1, int((len(df.columns)-1)/3)):
    v_result = list()
    for x in range(len(df.index)):
        v_result.append(v(x, i))
    df_result[i] = v_result
my expected result:
MAC V1 V2 V3 V4 V5 V6
ID1 0 1 1 0 1 1
ID2 0 1 0 1 1 1
ID3 1 1 0 1 1 1
But this code takes a huge amount of time.
Would you mind suggesting a simpler and faster approach, or one that uses the multiprocessing module?
Thank you.
The calculation can be sped up quite a bit through reshaping the data first, so that efficient pandas methods can be used. If that is not fast enough, you can then go down to the numpy array and apply the functions there.
First, reshape the data from wide format to long format so that there are only 3 columns: T, X and Y. The column suffixes (_1, _2, etc.) are split out into a new index level.
df = df.set_index('MAC')
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df = df.stack()
this produces the following data frame:
T X Y
MAC
ID1 1 1 1 1
2 1 1 1
3 2 1 2
4 3 1 3
5 3 1 3
6 4 1 4
7 5 1 5
ID2 1 6 2 5
2 6 2 5
3 7 3 5
4 7 3 5
5 8 4 5
6 9 5 5
7 10 5 4
ID3 1 1 1 1
2 2 1 2
3 3 1 3
4 3 1 3
5 4 1 4
6 5 1 5
7 6 2 5
Next, calculate del_X^2, del_Y^2 and del_t (where the prefix del denotes the difference between consecutive rows). This is easier with these two utility functions, which avoid repetition.
def f(x):
    return x.shift(-1) - x
def f2(x):
    return f(x)**2
update: description of functions
The first function calculates F(W,n) = W(n+1) - W(n) for all n, where n is the index of the array W. The second function squares the result of the first. Together they give the squared components of the distance. See the documentation for pd.Series.shift for more information and examples.
Using lower-case column names for the del prefix above, and the suffix 2 to mean squared:
df['x2'] = df.groupby(level=0).X.transform(f2)
df['y2'] = df.groupby(level=0).Y.transform(f2)
df['t'] = df.groupby(level=0).T.transform(f)
df['v'] = np.sqrt(df.x2 + df.y2) / df.t
df.v.unstack(0)
This produces the following, which is similar to your output but transposed.
MAC ID1 ID2 ID3
1 NaN NaN 1.0
2 1.0 1.0 1.0
3 1.0 NaN NaN
4 NaN 1.0 1.0
5 1.0 1.0 1.0
6 1.0 1.0 1.0
7 NaN NaN NaN
You can filter out the last row (where the computed columns t, x2 and y2 are null), fill the np.nan in v with 0, transpose, rename the columns and reset the index to get your desired result.
result = df[pd.notnull(df.t)].v.unstack(0).fillna(0).T
result.columns = ['V'+x for x in result.columns]
result.reset_index()
# outputs:
MAC V1 V2 V3 V4 V5 V6
0 ID1 0.0 1.0 1.0 0.0 1.0 1.0
1 ID2 0.0 1.0 0.0 1.0 1.0 1.0
2 ID3 1.0 1.0 0.0 1.0 1.0 1.0
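The answer above mentions dropping down to the NumPy array if pandas is not fast enough. One way that might look is sketched below, assuming the columns come in repeating T, X, Y order after MAC exactly as in the question. The sample frame is retyped from the question, and the names `vals`, `steps`, `dist` and `speed` are illustrative.

```python
import numpy as np
import pandas as pd

cols = ["MAC"] + [f"{p}_{k}" for k in range(1, 8) for p in ("T", "X", "Y")]
rows = [
    ["ID1", 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5],
    ["ID2", 6, 2, 5, 6, 2, 5, 7, 3, 5, 7, 3, 5, 8, 4, 5, 9, 5, 5, 10, 5, 4],
    ["ID3", 1, 1, 1, 2, 1, 2, 3, 1, 3, 3, 1, 3, 4, 1, 4, 5, 1, 5, 6, 2, 5],
]
df = pd.DataFrame(rows, columns=cols)

# Shape (n_rows, n_points, 3): the last axis holds T, X, Y.
vals = df.drop(columns="MAC").to_numpy(dtype=float).reshape(len(df), -1, 3)

# Differences between consecutive points, shape (n_rows, n_points - 1, 3).
steps = np.diff(vals, axis=1)
dt, dx, dy = steps[..., 0], steps[..., 1], steps[..., 2]

dist = np.hypot(dx, dy)
speed = np.zeros_like(dist)
# Divide only where time moved forward; other entries keep the initial 0.
np.divide(dist, dt, out=speed, where=dt > 0)

result = pd.DataFrame(speed, columns=[f"V{k}" for k in range(1, speed.shape[1] + 1)])
result.insert(0, "MAC", df["MAC"].to_numpy())
print(result)
```

Everything here is computed on whole arrays, so there is no Python-level loop over rows or over column triples.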
I suggest you use Apache Spark if you want real speed.
You can do that by passing your function to Spark, as described in the "Passing Functions to Spark" section of the Spark documentation.
In this case, I have two DataFrames, A (left) and B (right):
    c1 c2 c3        c1 c2 c3
r0   7  6  4    r0   0  0  1
r1   6  2  5    r1   1  1  0
r2   3  5  9    r2   1  0  1
Basically my goal is to find the top 2 values in each row of A, and the corresponding row values in B, and then take the sum of the products of these pairs.
So for example in the first row, the top values in A are 7, and 6, which correspond to 0, 0 in the first row of B. I then want to return 7 * 0 + 6 * 0 = 0. I'd like to do this over every row and return something like:
d1 0
d2 6
d3 9
I'm currently using numpy argsort to find the indices of the top n values in each row of A, and then a map with a self-defined function to go over the rows and compute the product-sum.
This method has ended up being really slow for me, so I was wondering if there are any faster alternatives. Thank you.
Use rank to get the top 2 values and use that as a mask for B (with three columns, rank >= 2 selects the two largest values in each row).
In [1311]: (A*B.where(A.rank(axis=1) >= 2)).sum(axis=1)
Out[1311]:
r0 0.0
r1 6.0
r2 9.0
dtype: float64
Details
In [1314]: A.rank(axis=1)
Out[1314]:
c1 c2 c3
r0 3.0 2.0 1.0
r1 3.0 1.0 2.0
r2 1.0 2.0 3.0
In [1315]: A.rank(axis=1) >=2
Out[1315]:
c1 c2 c3
r0 True True False
r1 True False True
r2 False True True
In [1317]: B.where(A.rank(axis=1) >= 2)
Out[1317]:
c1 c2 c3
r0 0.0 0.0 NaN
r1 1.0 NaN 0.0
r2 NaN 0.0 1.0
In [1318]: (A*B.where(A.rank(axis=1) >= 2))
Out[1318]:
c1 c2 c3
r0 0.0 0.0 NaN
r1 6.0 NaN 0.0
r2 NaN 0.0 9.0
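The `rank >= 2` test relies on there being exactly three columns. A sketch that generalizes it to the top k values of any width ranks in descending order instead. The parameter `k` and the choice of `method="first"` to break ties are assumptions of this example, not part of the original answer.

```python
import pandas as pd

A = pd.DataFrame({"c1": [7, 6, 3], "c2": [6, 2, 5], "c3": [4, 5, 9]},
                 index=["r0", "r1", "r2"])
B = pd.DataFrame({"c1": [0, 1, 1], "c2": [0, 1, 0], "c3": [1, 0, 1]},
                 index=["r0", "r1", "r2"])

k = 2  # number of largest values to keep per row

# Rank 1 is the largest value in the row; method="first" breaks ties by position.
top_k = A.rank(axis=1, ascending=False, method="first") <= k

# Keep only the products at the top-k positions, then sum each row.
result = (A * B).where(top_k).sum(axis=1)
print(result.tolist())  # [0.0, 6.0, 9.0]
```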