I have a wide dataset:
id x0 x1 x2 x3 x4 x5 ... x10000 Type
1 40 31.05 25.5 25.5 25.5 25 ... 33 1
2 35 35.75 36.5 26.5 36.5 36.5 ... 29 0
3 35 35.70 36.5 36.5 36.5 36.5 ... 29 1
4 40 31.50 23.5 24.5 26.5 25 ... 33 1
...
900 40 31.05 25.5 25.5 25.5 25 ... 23 0
with each row being a time series. I would like to standardise in place all values except for the last column, with each row/time series as an independent distribution. I am thinking about appending 2 columns mean and std(standard deviation) to the rightmost of the dataframe, and standardise using apply. But it sounds cumbersome and might make mistakes in the process. How can I do this and is there an easier way? Thanks
Method 1:
We can use sklearn.preprocessing.scale! Set axis = 1 to scale data on each row!
This kind of data cleaning can be done nicely using sklearn.preprocessing.Here is an official doc
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter columns and then standardlize inplace
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Method 2:
Just use lambda function if your dataset is not huge.
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter columns and than standardlize inplace
data.loc[:,~data.columns.isin(['type'])] = data.loc[:,~data.columns.isin(['type'])].\
apply(lambda x: (x - np.mean(x))/np.std(x), axis = 1)
data.head()
Output:
A B C type
0 12 8 2 0
1 5 2 9 1
2 14 5 2 1
3 5 7 6 0
4 8 1 4 0
A B C type
0 1.135550 0.162221 -1.297771 0
1 -0.116248 -1.162476 1.278724 1
2 1.372813 -0.392232 -0.980581 1
3 -1.224745 1.224745 0.000000 0
4 1.278724 -1.162476 -0.116248 0
Speed compare:
Method 1 is faster then method 2.
Method 1: 2.03 ms ± 205 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
Method 2: 3.06 ms ± 153 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])].apply(lambda x: (x - np.mean(x))/np.std(x), axis = 0)
You could compute mean and std manually:
stats = df.iloc[:,1:-1].agg(['mean','std'], axis=1) # axis=1 apply on rows
df.iloc[:, 1:-1] = (df.iloc[:, 1:-1]
.sub(stats['mean'], axis='rows') # axis='rows' apply on rows
.div(stats['std'],axis='rows')
)
output:
id x0 x1 x2 x3 x4 x5 x10000 Type
0 1 1.87515 0.297204 -0.681302 -0.681302 -0.681302 -0.769456 0.641003 1
1 2 0.31841 0.499129 0.679848 -1.72974 0.679848 0.679848 -1.12734 0
2 3 -0.0363456 0.218074 0.508839 0.508839 0.508839 0.508839 -2.21708 1
3 4 1.81012 0.392987 -0.940787 -0.774066 -0.440622 -0.690705 0.64307 1
Related
I want to sum the different values for each column. i think that i should use a special aggregation using apply() but i don't know the correct code
A B C D E F G
1 2 3 4 5 6 7
1 3 3 4 8 7 7
2 2 3 5 8 1 1
2 1 3 5 7 5 1
#i want to have this result
for each value in column A
A B C D E F G
1 5 3 4 13 13 7
2 3 3 5 15 6 1
You can vectorize this by dropping duplicates per index positions. You can then re-create the origin matrix conveniently using a sparse matrix.
You could accomplish the same thing create a zero array and adding, but this way you avoid the large memory requirement if your A column is very sparse.
from scipy import sparse
def non_dupe_sums_2D(ids, values):
v = np.unique(ids)
x, y = values.shape
r = np.arange(y)
m = np.repeat(a, y)
n = np.tile(r, x)
u = np.unique(np.column_stack((m, n, values.ravel())), axis=0)
return sparse.csr_matrix((u[:, 2], (u[:, 0], u[:, 1])))[v].A
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
non_dupe_sums_2D(a, b)
array([[ 5, 3, 4, 13, 13, 7],
[ 3, 3, 5, 15, 6, 1]], dtype=int64)
Performance
df = pd.DataFrame(np.random.randint(1, 100, (100, 100)))
a = df.iloc[:, 0].to_numpy()
b = df.iloc[:, 1:].to_numpy()
%timeit pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby(0) ], axis=1)
1.09 s ± 9.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique)
1.05 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit non_dupe_sums_2D(a, b)
7.95 ms ± 30.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Validation
>>> np.array_equal(non_dupe_sums_2D(a, b), df.iloc[:, 1:].groupby(df.iloc[:, 0]).apply(sum_unique).values)
True
I'd do something like:
def sum_unique(x):
return x.apply(lambda x: x.unique().sum())
df.groupby('A')[df.columns ^ {'A'}].apply(sum_unique).reset_index()
which gives me:
A B C D E F G
0 1 5 3 4 13 13 7
1 2 3 3 5 15 6 1
which seems to be what you're expecting
Not so ideal, but here's one way with apply:
pd.concat([g.apply(lambda x: x.unique().sum()) for v,g in df.groupby('A') ], axis=1)
Output:
0 1
A 1 2
B 5 3
C 3 3
D 4 5
E 13 15
F 13 6
G 7 1
You can certainly transpose the dataframe to obtain the expected output.
I have a time series dataframe where there is 1 or 0 in it (true/false). I wrote a function that loops through all rows with values 1 in them. Given user defined integer parameter called n_hold, I will set values 1 to n rows forward from the initial row.
For example, in the dataframe below I will be loop to row 2016-08-05. If n_hold = 2, then I will set both 2016-08-08 and 2016-08-09 to 1 too.:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 0
2016-08-09 0
2016-08-10 0
The resulting df will then is
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 1
2016-08-09 1
2016-08-10 0
The problem I have is this is being run 10s of thousands of times and my current solution where I am looping over rows where there are ones and subsetting is way too slow. I was wondering if there are any solutions to the above problem that is really fast.
Here is my (slow) solution, x is the initial signal dataframe:
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
Completely changed answer, because working differently with consecutive 1 values:
Explanation:
Solution remove each consecutive 1 first by where with chained boolean mask by comparing with ne (not equal !=) with shift to NaNs, forward filling them by ffill with limit parameter and last replace 0 back:
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
Timings and comparing outputs:
np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)
def orig(x):
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
return final_signal
#print (orig(x))
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)
df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
input orig new
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 1 1
7 0 1 1
8 0 1 1
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 0 0 0
#check outputs
#print (s.values == orig(x).values)
Timings:
%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have a data frame as:
Time InvInstance
5 5
8 4
9 3
19 2
20 1
3 3
8 2
13 1
Time variable is sorted and InvInstance variable denotes the number of rows to the end of a Time block. I want to create another column showing whether a crossover condition is met within the Time column. I can do it with a for loop like that:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
df["10mMark"] = 0
for i in range(1,len(df)):
r = int(df.InvInstance.iloc[i])
rprev = int(df.InvInstance.iloc[i-1])
m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
And the desired output is:
Time InvInstance 10mMark
5 5 0
8 4 0
9 3 0
19 2 1
20 1 0
3 3 0
8 2 1
13 1 0
To be more specific; there are 2 sorted time blocks in the Time column, and going row by row we know the distance (in terms of rows) to the end of each block by the value of InvInstance. The question is whether the time difference between a row and the end of the block is less than 10 minutes and it was greater than 10 in the previous row. Is it possible to do this without loops such as shift() etc, so that it runs much faster?
I don't see/know how to use internal vectorized Pandas/Numpy methods for shifting Series/Array using a non-scalar / vector step, but we can use Numba here:
from numba import jit
#jit
def dyn_shift(s, step):
assert len(s) == len(step), "[s] and [step] should have the same length"
assert isinstance(s, np.ndarray), "[s] should have [numpy.ndarray] dtype"
assert isinstance(step, np.ndarray), "[step] should have [numpy.ndarray] dtype"
N = len(s)
res = np.empty(N, dtype=s.dtype)
for i in range(N):
res[i] = s[i+step[i]-1]
return res
mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
df['10mMark'] = np.where(mask1 & mask2,1,0)
result:
In [6]: df
Out[6]:
Time InvInstance 10mMark
0 5 5 0
1 8 4 0
2 9 3 0
3 19 2 1
4 20 1 0
5 3 3 0
6 8 2 1
7 13 1 0
Timing for 8.000 rows DF:
In [13]: df = pd.concat([df] * 10**3, ignore_index=True)
In [14]: df.shape
Out[14]: (8000, 3)
In [15]: %%timeit
...: df["10mMark"] = 0
...: for i in range(1,len(df)):
...: r = int(df.InvInstance.iloc[i])
...: rprev = int(df.InvInstance.iloc[i-1])
...: m = df['Time'].iloc[i+r-1] - df['Time'].iloc[i]
...: mprev = df['Time'].iloc[i-1+rprev-1] - df['Time'].iloc[i-1]
...: df["10mMark"].iloc[i] = np.where((m < 10) & (mprev >= 10),1,0)
...:
3.06 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [16]: %%timeit
...: mask1 = dyn_shift(df.Time.values, df.InvInstance.values) - df.Time < 10
...: mask2 = (dyn_shift(df.Time.values, df.InvInstance.values) - df.Time).shift() >= 10
...: df['10mMark'] = np.where(mask1 & mask2,1,0)
...:
1.02 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
speed-up factor:
In [17]: 3.06 * 1000 / 1.02
Out[17]: 3000.0
Actually, your m is the time delta between the time of a row and the time at the end of the 'block' and the mprev is the same thing but with the time at the previous row (so it's actually shift of m). My idea is to create a column containing the time at the end of the block, by first identifying each block, then merge with the last time when using groupby on block . Then calculate the difference for creating a column 'm' and use the np.where and shift to finally fill the column 10mMark.
# a column with incremental value for each block end
df['block'] = df.InvInstance[df.InvInstance ==1].cumsum()
#to back fill the number to get all block with same value of block
df['block'] = df['block'].bfill() #to back fill the number
# now merge to create a column time_last with the time at the end of the block
df = df.merge(df.groupby('block', as_index=False)['Time'].last(), on = 'block', suffixes=('','_last'), how='left')
# create column m with just a difference
df['m'] = df['Time_last'] - df['Time']
# now you can use np.where and shift on this column to create the 10mMark column
df['10mMark'] = np.where((df['m'] < 10) & (df['m'].shift() >= 10),1,0)
#just drop the useless column
df = df.drop(['block', 'Time_last','m'],1)
your final result before dropping, to see what as been created, looks like
Time InvInstance block Time_last m 10mMark
0 5 5 1.0 20 15 0
1 8 4 1.0 20 12 0
2 9 3 1.0 20 11 0
3 19 2 1.0 20 1 1
4 20 1 1.0 20 0 0
5 3 3 2.0 13 10 0
6 8 2 2.0 13 5 1
7 13 1 2.0 13 0 0
in which the column 10mMark has the expected result
It is not as efficient as with the solution of #MaxU with Numba, but with a df of 8000 rows as he used, I get speed up factor of about 350.
I have a dataframe (ev), and I would like to read it and whenever the value of the 'trig' column is 64, I need to update the value of the critical column that is 4 rows above, and change it to 999. I tried the code below but it does not change anything, though seems it should work.
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] == 999
Try this, you were close: Learn about single "=" vs double "=="
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] = 999
I think you can mask by mask with fillna False, because after shift you get NaN:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 5 4
5 7 6
6 8 8
7 9 9
8 0 64
9 7 6
10 6 7
11 4 8
12 3 6
13 5 64
mask = (ev.trig == 64).shift(-4).fillna(False)
print (mask)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 True
10 False
11 False
12 False
13 False
Name: trig, dtype: bool
ev['critical'] = ev.critical.mask(mask, 999)
print (ev)
critical trig
0 4 1
1 5 2
2 6 3
3 3 2
4 999 4
5 7 6
6 8 8
7 9 9
8 0 64
9 999 6
10 6 7
11 4 8
12 3 6
13 5 64
EDIT:
Timings:
I think better is avoiding iteration in pandas, because in large dataframe it is very slow:
len(df)=1400:
In [66]: %timeit (jez(ev))
1000 loops, best of 3: 1.29 ms per loop
In [67]: %timeit (mer(ev1))
10 loops, best of 3: 49.9 ms per loop
len(df)=14k:
In [59]: %timeit (jez(ev))
100 loops, best of 3: 2.49 ms per loop
In [60]: %timeit (mer(ev1))
1 loop, best of 3: 501 ms per loop
len(df)=140k:
In [63]: %timeit (jez(ev))
100 loops, best of 3: 15.8 ms per loop
In [64]: %timeit (mer(ev1))
1 loop, best of 3: 6.32 s per loop
Code for timings:
import pandas as pd
ev = pd.DataFrame({'trig':[1,2,3,2,4,6,8,9,64,6,7,8,6,64],
'critical':[4,5,6,3,5,7,8,9,0,7,6,4,3,5]})
print (ev)
ev = pd.concat([ev]*100).reset_index(drop=True)
#ev = pd.concat([ev]*1000).reset_index(drop=True)
#ev = pd.concat([ev]*10000).reset_index(drop=True)
ev1 = ev.copy()
def jez(df):
ev['critical'] = ev.critical.mask((ev.trig == 64).shift(-4).fillna(False), 999)
return (ev)
def mer(df):
for i in range(0,len(ev)):
if ev['trig'][i] == 64:
ev['critical'][i-4] = 999
return (ev)
print (jez(ev))
print (mer(ev1))
I have a pandas DataFrame df like this
mat time
0 101 20
1 102 7
2 103 15
I need to divide the rows so the column of time doesn't have any values higher than t=10 to have something like this
mat time
0 101 10
2 101 10
3 102 7
4 103 10
5 103 5
the index doesn't matter
If I'd use groupby('mat')['time'].sum() on this df I would have the original df, but I need like an inverse of the groupby func.
Is there any way to get the ungrouped DataFrame with the condition of time <= t?
I'm trying to use a loop here but it's kind of 'unPythonic', any ideas?
Use an apply function that loops until all are less than 10.
def split_max_time(df):
new_df = df.copy()
while new_df.iloc[-1, -1] > 10:
temp = new_df.iloc[-1, -1]
new_df.iloc[-1, -1] = 10
new_df = pd.concat([new_df, new_df])
new_df.iloc[-1, -1] = temp - 10
return new_df
print df.groupby('mat', group_keys=False).apply(split_max_time)
mat time
0 101 10
0 101 10
1 102 7
2 103 10
2 103 5
You could .groupby('mat') and .apply() a combination of integer division and modulo operation using the cutoff (10) to decompose each time value into the desired components:
cutoff = 10
def decompose(time):
components = [cutoff for _ in range(int(time / cutoff))] + [time.iloc[0] % cutoff]
return pd.Series([c for c in components if c > 0])
df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
to get:
mat
101 10
101 10
102 7
103 10
103 5
In case you care about performance:
%timeit df.groupby('mat', group_keys=False).apply(split_max_time)
100 loops, best of 3: 4.21 ms per loop
%timeit df.groupby('mat').time.apply(decompose).reset_index(-1, drop=True)
1000 loops, best of 3: 1.83 ms per loop