I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every row starts with 1, so I have to shift the values of each row to the left until the first value is 1.
For example:
0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1
The result has to be this:
1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0
I don't want to use for, while, etc., because I need some faster methods with pandas or numpy.
Do you have any ideas for this problem?
You can use cummax to mask the positions before the first 1 as NaN, then sort each row so the NaNs move to the end:
df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]:
1 2 3 4 5 6 7
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
Or we can use the justify function written by Divakar (much faster than the apply with sorted):
pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]:
0 1 2 3 4 5 6
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
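justify is not a pandas or numpy built-in; it is Divakar's small array utility. For completeness, here is a sketch of it (minor details may differ from the original):
import numpy as np
def justify(a, invalid_val=0, axis=1, side='left'):
    # push all valid (non-invalid) values to one side along `axis`,
    # filling the remainder with `invalid_val`
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out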
You can make use of numpy.ogrid here:
a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]
array([[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0]], dtype=int64)
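The trick is that c - s[:, None] produces negative column indices exactly where a row has to wrap around, and NumPy counts negative indices from the end of the row, so each row is effectively rolled left by its argmax. A tiny illustration of that wrap-around (not part of the original answer):
import numpy as np
row = np.array([0, 1, 0, 1])        # argmax is 1
shift = row.size - row.argmax()     # 3, the per-row shift stored in s
cols = np.arange(row.size) - shift  # [-3, -2, -1, 0]: negative indices wrap around
print(row[cols])                    # [1 0 1 0] -> the row rolled left by its argmax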
Timings
In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %%timeit
...: a = df.values
...: s = a.argmax(1) * - 1
...: m, n = a.shape
...: r, c = np.ogrid[:m, :n]
...: s[s < 0] += n
...: c = c - s[:, None]
...: pd.DataFrame(a[r, c])
...:
...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For performance, you can use numba. An elementary loop, but effective given JIT-compilation and use of more basic objects at C-level:
from numba import njit
@njit
def shifter(A):
res = np.zeros(A.shape)
for i in range(res.shape[0]):
start, end = 0, 0
for j in range(res.shape[1]):
if A[i, j] != 0:
start = j
break
res[i, :res.shape[1]-start] = A[i, start:]
return res
Performance benchmarking
def jpp(df):
return pd.DataFrame(shifter(df.values).astype(int))
def user348(df):
a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
return pd.DataFrame(a[r, c])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
assert np.array_equal(jpp(df).values, user348(df).values)
%timeit jpp(df) # 9.2 ms per loop
%timeit user348(df) # 18.5 ms per loop
Here is a stride_tricks solution, which is fast because it enables slice-wise copying.
def pp(x):
n, m = x.shape
am = x.argmax(-1)
mam = am.max()
xx = np.empty((n, m + mam), x.dtype)
xx[:, :m] = x
xx[:, m:] = 0
xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
return xx[np.arange(x.shape[0]), am]
It pads the input with the required number of zeros and then creates a sliding window view using as_strided. The view is addressed using fancy indexing, but because the last dimension is not indexed, copying of rows is optimized and fast.
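To make the windowing concrete, here is what the as_strided view looks like for a single padded row (a small illustration, not part of the benchmark):
import numpy as np
row = np.array([[0, 0, 1, 0, 1]])                      # argmax of the row is 2
padded = np.concatenate([row, np.zeros((1, 2), dtype=row.dtype)], axis=1)
windows = np.lib.stride_tricks.as_strided(
    padded, (1, 3, 5), (*padded.strides, padded.strides[-1]))
print(windows[0])
# [[0 0 1 0 1]
#  [0 1 0 1 0]
#  [1 0 1 0 0]]   <- window 2 (the argmax) is the left-justified row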
How fast? For large enough inputs, it is on par with numba:
x = np.random.randint(0, 2, (10000, 10))
from timeit import timeit
shifter(x) # that should compile it, right?
print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))
Sample output:
0.8630472810036736
0.7336142909916816
Related
I'm trying to simulate financial trades using a vectorized approach in python. Part of this includes removing duplicate signals.
To elaborate, I've developed a buy_signal column and a sell_signal column. These columns contain booleans in the form of 1s and 0s.
Looking at the signals from the top-down, I don't want to trigger a second buy_signal before a sell_signal triggers, AKA if a 'position' is open. Same thing with sell signals, I do not want duplicate sell signals if a 'position' is closed. If a sell_signal and buy_signal are 1, set them both to 0.
What is the best way to remove these irrelevant signals?
Here's an example:
import pandas as pd
df = pd.DataFrame(
{
"buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
"sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
}
)
print(df)
buy_signal sell_signal
0 1 0
1 1 0
2 1 1
3 1 1
4 0 1
5 0 0
6 1 0
7 1 0
8 1 1
9 0 0
Here's the result I want:
buy_signal sell_signal
0 1 0
1 0 0
2 0 1
3 0 0
4 0 0
5 0 0
6 1 0
7 0 0
8 0 1
9 0 0
As I said earlier (in a comment about a response since then deleted), one must consider the interaction between buy and sell signals, and cannot simply operate on each independently.
The key idea is to consider a quantity q (or "position") that is the amount currently held, which the OP would like bounded to [0, 1]. That quantity is cumsum(buy - sell) after cleaning.
Therefore, the problem reduces to "cumulative sum with limits", which unfortunately cannot be done in a vectorized way with numpy or pandas, but that we can code quite efficiently using numba. The code below processes 1 million rows in 37 ms.
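To see why a plain cumsum followed by a single clip is not equivalent, consider this small counterexample (an illustration only):
import numpy as np
deltas = np.array([1, 1, -1])             # buy, buy, sell
print(np.clip(np.cumsum(deltas), 0, 1))   # [1 1 1]: the limit is applied only at the end
# step-wise clipping (what cumsum_clip below does) gives [1, 1, 0]:
# the second buy is capped at 1, so the sell correctly brings the position back to 0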
import numpy as np
from numba import njit
@njit
def cumsum_clip(a, xmin=-np.inf, xmax=np.inf):
res = np.empty_like(a)
c = 0
for i in range(len(a)):
c = min(max(c + a[i], xmin), xmax)
res[i] = c
return res
def clean_buy_sell(df, xmin=0, xmax=1):
# model the quantity held: cumulative sum of buy-sell clipped in
# [xmin, xmax]
# note that, when buy and sell are equal, there is no change
q = cumsum_clip(
(df['buy_signal'] - df['sell_signal']).values,
xmin=xmin, xmax=xmax)
# derive actual transactions: positive for buy, negative for sell, 0 for hold
trans = np.diff(np.r_[0, q])
df = df.assign(
buy_signal=np.clip(trans, 0, None),
sell_signal=np.clip(-trans, 0, None),
)
return df
Now:
df = pd.DataFrame(
{
"buy_signal": [1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
"sell_signal": [0, 0, 1, 1, 1, 0, 0, 0, 1, 0],
}
)
new_df = clean_buy_sell(df)
>>> new_df
buy_signal sell_signal
0 1 0
1 0 0
2 0 0
3 0 0
4 0 1
5 0 0
6 1 0
7 0 0
8 0 0
9 0 0
Speed and correctness
n = 1_000_000
np.random.seed(0) # repeatable example
df = pd.DataFrame(np.random.choice([0, 1], (n, 2)),
columns=['buy_signal', 'sell_signal'])
%timeit clean_buy_sell(df)
37.3 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Correctness tests:
z = clean_buy_sell(df)
q = (z['buy_signal'] - z['sell_signal']).cumsum()
# q is quantity held through time; must be in {0, 1}
assert q.isin({0, 1}).all()
# we should not have introduced any new buy signal:
# check that any buy == 1 in z was also 1 in df
assert not (z['buy_signal'] & ~df['buy_signal']).any()
# same for sell signal:
assert not (z['sell_signal'] & ~df['sell_signal']).any()
# finally, buy and sell should never be 1 on the same row:
assert not (z['buy_signal'] & z['sell_signal']).any()
Bonus: other limits, fractional buys and sells
For fun, we can consider the more general case where buy and sell values are fractional (or any float value), and the limits are not [0, 1]. There is nothing to change to the current version of clean_buy_sell, which is general enough to handle these conditions.
np.random.seed(0)
df = pd.DataFrame(
np.random.uniform(0, 1, (100, 2)),
columns=['buy_signal', 'sell_signal'],
)
# set limits to -1, 2: we can sell short (borrow) up to 1 unit
# and own up to 2 units.
z = clean_buy_sell(df, -1, 2)
(z['buy_signal'] - z['sell_signal']).cumsum().plot()
I have a problem I've been trying to think through. Say I have a numpy array that looks like this (in the actual implementation, len(array) will be around 4500):
array = np.repeat([0, 1, 2], 2)
array >> [0, 0, 1, 1, 2, 2]
From this, I'm trying to generate a new (shuffled) array where the proportion of values that randomly agree with array is a particular proportion p. So let's say p = .5. Then, an example new array would be something like
array = [0, 0, 1, 1, 2, 2]
new_array = [0, 1, 2, 1, 0, 2]
where you can see that exactly 50% of the values in new_array agree with the values in array. The requirements of the output array are:
np.count_nonzero(array - new_array) / len(array) = p, and
set(np.unique(array)) == set(np.unique(new_array)).
By "agree" I mean array[i] == new_array[i] for agreeing indices i. All values in new_array should be the same as array, just shuffled.
I'm sure there's an elegant way of doing this -- can anybody think of something?
Thanks!
You can try something like
import random
p = 0.5
arr = np.array([0, 0, 1, 1, 2, 2])
# number of similar elements required
num_sim_element = round(len(arr)*p)
# creating indices of similar elements
hp = {}
for i,e in enumerate(arr):
if(e in hp):
hp[e].append(i)
else:
hp[e] = [i]
#print(hp)
out_map = []
k = list(hp.keys())
v = list(hp.values())
index = 0
while(len(out_map) != num_sim_element):
if(len(v[index]) > 0):
k_ = k[index]
random.shuffle(v[index])
v_ = v[index].pop()
out_map.append((k_,v_))
index += 1
index %= len(k)
#print(out_map)
out_unique = set([i[0] for i in out_map])
out_indices = [i[-1] for i in out_map]
out_arr = arr.copy()
#for i in out_map:
# out_arr[i[-1]] = i[0]
for i in set(range(len(arr))).difference(out_indices):
out_arr[i] = random.choice(list(out_unique.difference([out_arr[i]])))
print(arr)
print(out_arr)
assert 1 - (np.count_nonzero(arr - out_arr) / len(arr)) == p
assert set(np.unique(arr)) == set(np.unique(out_arr))
[0 0 1 1 2 2]
[1 0 1 0 0 2]
Here's a version that might be a little easier to follow:
import math, random
# generate array of random values
a = np.random.rand(4500)
# make a utility list of every position in that array, and shuffle it
indices = [i for i in range(0, len(a))]
random.shuffle(indices)
# set the proportion you want to keep the same
proportion = 0.5
# make two lists of indices, the ones that stay the same and the ones that get shuffled
anchors = indices[0:math.floor(len(a)*proportion)]
not_anchors = indices[math.floor(len(a)*proportion):]
# get values of non-anchor indices, and shuffle them
not_anchor_values = [a[i] for i in not_anchors]
random.shuffle(not_anchor_values)
# loop original array, if an anchor position, keep original value
# if not an anchor, draw value from shuffle non-anchor value list and increment the count
final_list = []
count = 0
for e,i in enumerate(a):
    if e in anchors:
final_list.append(i)
else:
final_list.append(not_anchor_values[count])
count +=1
# test proportion of matches and non-matches in output
match = []
not_match = []
for e,i in enumerate(a):
if i == final_list[e]:
match.append(True)
else:
not_match.append(True)
len(match)/(len(match)+len(not_match))
Comments in the code explain the approach.
(EDITED to include a different and more accurate approach)
One should note that not all values of the shuffled fraction p (number of shuffled elements divided by the total number of elements) are accessible.
The possible values of p depend on the size of the input and on the number of repeated elements.
That said, I can suggest two possible approaches:
split your input into pinned and unpinned indices of the correct size and then shuffle the unpinned indices.
import numpy as np
def partial_shuffle(arr, p=1.0):
n = arr.size
k = round(n * p)
shuffling = np.arange(n)
shuffled = np.random.choice(n, k, replace=False)
shuffling[shuffled] = np.sort(shuffled)
return arr[shuffling]
The main advantage of approach (1) is that it can be easily implemented in a vectorized form using np.random.choice() and advanced indexing.
On the other hand, this works well as long as you are willing to accept that some shufflings may leave some elements unshuffled, because of repeated values or simply because the shuffled indices accidentally coincide with the unshuffled ones.
This causes the requested value of p to be typically larger than the actually observed value.
If one needs a relatively more accurate value of p, one could perform a search on the p parameter until the output reaches the desired value, or go by trial and error; a rough sketch of such a calibration follows.
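A rough sketch of such a calibration (calibrate_p is a hypothetical helper, not part of the answer; it averages the observed shuffling fraction over a few random trials for each candidate p and relies on partial_shuffle() defined above):
import numpy as np
def calibrate_p(arr, target, candidates=np.linspace(0, 1, 21), n_rep=10):
    # pick the requested p whose average observed fraction of changed
    # positions is closest to the target
    best_p, best_err = 0.0, np.inf
    for p in candidates:
        observed = np.mean([np.mean(arr != partial_shuffle(arr, p))
                            for _ in range(n_rep)])
        err = abs(observed - target)
        if err < best_err:
            best_p, best_err = p, err
    return best_p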
implement a variation of the Fisher-Yates shuffle where you: (a) reject swaps of positions whose values are identical and (b) pick only random positions to swap that were not already visited.
def partial_shuffle_eff(arr, p=1.0, inplace=False, tries=2.0):
if not inplace:
arr = arr.copy()
n = arr.size
k = round(n * p)
tries = round(n * tries)
seen = set()
i = l = t = 0
while i < n and l < k:
seen.add(i)
j = np.random.randint(i, n)
while j in seen and t < tries:
j = np.random.randint(i, n)
t += 1
if arr[i] != arr[j]:
arr[i], arr[j] = arr[j], arr[i]
l += 2
seen.add(j)
while i in seen:
i += 1
return arr
While this approach gets to a more accurate value of p, it is still limited by the fact that the target number of swaps must be even.
Also, for inputs with lots of unique values, the second while loop (while j in seen ...) is potentially an infinite loop, so a cap on the number of tries should be set.
Finally, you would need to go with explicit looping, resulting in a much slower execution speed, unless you can use Numba's JIT compilation, which would speed up your execution significantly.
import numba as nb
partial_shuffle_eff_nb = nb.njit(partial_shuffle_eff)
partial_shuffle_eff_nb.__name__ = 'partial_shuffle_eff_nb'
To test the accuracy of the partial shuffling we may use the (percent) Hamming distance:
def hamming_distance(a, b):
    assert(a.shape == b.shape)
    return np.count_nonzero(a != b)
def percent_hamming_distance(a, b):
    return hamming_distance(a, b) / len(a)
def shuffling_fraction(a, b):
    return percent_hamming_distance(a, b)
And we may observe a behavior similar to this:
funcs = (
partial_shuffle,
partial_shuffle_eff,
partial_shuffle_eff_nb
)
n = 12
m = 3
arrs = (
np.zeros(n, dtype=int),
np.arange(n),
np.repeat(np.arange(m), n // m),
np.repeat(np.arange(3), 2),
np.repeat(np.arange(3), 3),
)
np.random.seed(0)
for arr in arrs:
print(" " * 24, arr)
for func in funcs:
shuffled = func(arr, 0.5)
print(f"{func.__name__:>24s}", shuffled, shuffling_fraction(arr, shuffled))
# [0 0 0 0 0 0 0 0 0 0 0 0]
# partial_shuffle [0 0 0 0 0 0 0 0 0 0 0 0] 0.0
# partial_shuffle_eff [0 0 0 0 0 0 0 0 0 0 0 0] 0.0
# partial_shuffle_eff_nb [0 0 0 0 0 0 0 0 0 0 0 0] 0.0
# [ 0 1 2 3 4 5 6 7 8 9 10 11]
# partial_shuffle [ 0 8 2 3 6 5 7 4 9 1 10 11] 0.5
# partial_shuffle_eff [ 3 8 11 0 4 5 6 7 1 9 10 2] 0.5
# partial_shuffle_eff_nb [ 9 10 11 3 4 5 6 7 8 0 1 2] 0.5
# [0 0 0 0 1 1 1 1 2 2 2 2]
# partial_shuffle [0 0 2 0 1 2 1 1 2 2 1 0] 0.33333333333333337
# partial_shuffle_eff [1 1 1 0 0 1 0 0 2 2 2 2] 0.5
# partial_shuffle_eff_nb [1 2 1 0 1 0 0 1 0 2 2 2] 0.5
# [0 0 1 1 2 2]
# partial_shuffle [0 0 1 1 2 2] 0.0
# partial_shuffle_eff [1 1 0 0 2 2] 0.6666666666666667
# partial_shuffle_eff_nb [1 2 0 1 0 2] 0.6666666666666667
# [0 0 0 1 1 1 2 2 2]
# partial_shuffle [0 0 1 1 0 1 2 2 2] 0.2222222222222222
# partial_shuffle_eff [0 1 2 1 0 1 2 2 0] 0.4444444444444444
# partial_shuffle_eff_nb [0 0 1 0 2 1 2 1 2] 0.4444444444444444
or, for an input closer to your use-case:
n = 4500
m = 3
arr = np.repeat(np.arange(m), n // m)
np.random.seed(0)
for func in funcs:
shuffled = func(arr, 0.5)
print(f"{func.__name__:>24s}", shuffling_fraction(arr, shuffled))
# partial_shuffle 0.33777777777777773
# partial_shuffle_eff 0.5
# partial_shuffle_eff_nb 0.5
Finally some small benchmarking:
n = 4500
m = 3
arr = np.repeat(np.arange(m), n // m)
np.random.seed(0)
for func in funcs:
print(f"{func.__name__:>24s}", end=" ")
%timeit func(arr, 0.5)
# partial_shuffle 213 µs ± 6.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# partial_shuffle_eff 10.9 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# partial_shuffle_eff_nb 172 µs ± 1.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I have a time series dataframe with values of 1 or 0 in it (true/false). I wrote a function that loops through all rows with value 1. Given a user-defined integer parameter called n_hold, I set the n_hold rows following the initial row to 1 as well.
For example, in the dataframe below the loop reaches row 2016-08-05. If n_hold = 2, then I will set both 2016-08-08 and 2016-08-09 to 1 too:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 0
2016-08-09 0
2016-08-10 0
The resulting df is then:
2016-08-03 0
2016-08-04 0
2016-08-05 1
2016-08-08 1
2016-08-09 1
2016-08-10 0
The problem is that this is being run tens of thousands of times, and my current solution, which loops over the rows containing ones and subsets, is way too slow. I was wondering if there are any really fast solutions to this problem.
Here is my (slow) solution, x is the initial signal dataframe:
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
Completely changed answer, because consecutive 1 values need to be handled differently:
Explanation:
The solution first keeps only the first 1 of each run: where is used with a boolean mask that compares each value with its shifted neighbour via ne (not equal, !=) and requires the value to be 1, turning everything else into NaN. Those 1s are then forward-filled with ffill and its limit parameter, and the remaining NaNs are replaced with 0:
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
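To see the intermediate steps, here is a small worked illustration on a toy Series (not part of the timings below):
import numpy as np
import pandas as pd
x = pd.Series([0, 0, 1, 1, 0, 0, 1, 0, 0, 0])
n_hold = 2
first_one = x.ne(x.shift()) & (x == 1)   # True only at the first 1 of each run
kept = x.where(first_one)                # those 1s stay, everything else becomes NaN
filled = kept.ffill(limit=n_hold)        # propagate each kept 1 forward by at most n_hold rows
result = filled.fillna(0).astype(int)
print(pd.concat([x, first_one, result], axis=1, keys=['input', 'first_one', 'result']))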
Timings and comparing outputs:
np.random.seed(123)
x = pd.Series(np.random.choice([0,1], p=(.8,.2), size=1000))
x1 = x.copy()
#print (x)
def orig(x):
n_hold = 2
entry_sig_diff = x.diff()
entry_sig_dt = entry_sig_diff[entry_sig_diff == 1].index
final_signal = x * 0
for i in range(0, len(entry_sig_dt)):
row_idx = entry_sig_diff.index.get_loc(entry_sig_dt[i])
if (row_idx + n_hold) >= len(x):
break
final_signal[row_idx:(row_idx + n_hold + 1)] = 1
return final_signal
#print (orig(x))
n_hold = 2
s = x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
#print (s)
df = pd.concat([x,orig(x1), s], axis=1, keys=('input', 'orig', 'new'))
print (df.head(20))
input orig new
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 1 1 1
7 0 1 1
8 0 1 1
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
18 0 0 0
19 0 0 0
#check outputs
#print (s.values == orig(x).values)
Timings:
%timeit (orig(x))
24.8 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x.where(x.ne(x.shift()) & (x == 1)).ffill(limit=n_hold).fillna(0, downcast='int')
1.36 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I have a dataset with binary values. I want to find the most frequent value in each row. This dataset has a couple of million records. What would be the most efficient way to do it? Following is a sample of the dataset.
import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1 bit2 bit2 bit4 bit5 frequent freq_count
0 0 0 1 1 0 3
1 1 1 0 0 1 3
1 0 1 1 1 1 4
I want to create the frequent and freq_count columns as in the sample above. These are not part of the original dataset and will be created after looking at all rows.
Here's one approach -
def freq_stat(df):
a = df.values
zero_c = (a==0).sum(1)
one_c = a.shape[1] - zero_c
df['frequent'] = (zero_c<=one_c).astype(int)
df['freq_count'] = np.maximum(zero_c, one_c)
return df
Sample run -
In [305]: df
Out[305]:
bit1 bit2 bit2.1 bit4 bit5
0 0 0 0 1 1
1 1 1 1 0 0
2 1 0 1 1 1
In [308]: freq_stat(df)
Out[308]:
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Benchmarking
Let's test this one against the fastest approach from @jezrael's solution:
from scipy import stats
def mod(df): # #jezrael's best soln
a = df.values.T
b = stats.mode(a)
df['a'] = b[0][0]
df['b'] = b[1][0]
return df
Also, let's use the same setup from the other post and get the timings -
In [323]: np.random.seed(100)
...: N = 10000
...: #[10000 rows x 20 columns]
...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
...:
# #jezrael's soln
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop
# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
You can use scipy.stats.mode:
from scipy import stats
a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))
df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Use Counter.most_common:
from collections import Counter
def f(x):
a, b = Counter(x).most_common(1)[0]
return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Another solution:
def f(x):
counts = np.bincount(x)
a = np.argmax(counts)
b = np.max(counts)
return pd.Series([a,b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Alternative:
from collections import defaultdict
def f(x):
d = defaultdict(int)
for i in x:
d[i] += 1
return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Timings:
np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))
In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop
In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop
In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop
In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop
In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop
from collections import Counter
from collections import defaultdict
from scipy import stats
def f1(x):
a, b = Counter(x).most_common(1)[0]
return pd.Series([a, b])
def f2(x):
counts = np.bincount(x)
a = np.argmax(counts)
b = np.max(counts)
return pd.Series([a,b])
def f3(x):
d = defaultdict(int)
for i in x:
d[i] += 1
return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
def mod(df):
a = df.values.T
b = stats.mode(a)
df['a'] = b[0][0]
df['b'] = b[1][0]
return df
def mod1(df):
a = df.values
b = stats.mode(a, axis=1)
df['a'] = b[0][:, 0]
df['b'] = b[1][:, 0]
return df
Background
I got a data frame with integers. These integers represent a series of features that are either present or not present for that row.
I want these features to be named columns in my data frame.
Problem
My current solution explodes in memory and is crazy slow. How do I improve the memory efficiency of this?
import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).apply(pd.Series).rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"])))
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
It seems to be the .apply(pd.Series) that is slowing this down. Everything else is quite fast until I add this.
I cannot skip it because a simple list will not make a dataframe.
You can use the numpy.binary_repr method:
In [336]: df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=4)))) \
.add_prefix('f')
Out[336]:
f0 f1 f2 f3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
or
In [346]: pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
...: columns=np.arange(1,5)) \
...: .add_prefix('f')
...:
Out[346]:
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Here's a vectorized NumPy approach -
def num2bin(nums, width):
return ((nums[:,None] & (1 << np.arange(width-1,-1,-1)))!=0).astype(int)
Sample run -
In [70]: df
Out[70]:
some_int
0 1
1 5
2 3
3 8
4 4
In [71]: pd.DataFrame( num2bin(df.some_int.values, 4), \
columns = [["f1", "f2", "f3", "f4"]])
Out[71]:
f1 f2 f3 f4
0 0 0 0 1
1 0 1 0 1
2 0 0 1 1
3 1 0 0 0
4 0 1 0 0
Explanation
1) Inputs :
In [98]: nums = np.array([1,5,3,8,4])
In [99]: width = 4
2) Get the 2 powered range numbers :
In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])
3) Convert nums to a 2D array, as we later want to do element-wise bit-ANDing between it and the 2-powered numbers in a vectorized manner following the rules of broadcasting:
In [101]: nums[:,None]
Out[101]:
array([[1],
[5],
[3],
[8],
[4]])
In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]:
array([[0, 0, 0, 1],
[0, 4, 0, 1],
[0, 0, 2, 1],
[8, 0, 0, 0],
[0, 4, 0, 0]])
To understand the bit-ANDing, let's consider the number 5 from nums and bit-AND it against all the 2-powered numbers [8, 4, 2, 1]:
In [103]: 5 & 8 # 0101 & 1000
Out[103]: 0
In [104]: 5 & 4 # 0101 & 0100
Out[104]: 4
In [105]: 5 & 2 # 0101 & 0010
Out[105]: 0
In [106]: 5 & 1 # 0101 & 0001
Out[106]: 1
Thus, we see that there is no intersection with [8, 2], whereas for the others we get non-zeros.
4) In the final stage, look for matches (non-zeros) and simply convert those to 1s and the rest to 0s by comparing against 0, which gives a boolean array, and then convert to int dtype:
In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))
In [108]: matches!=0
Out[108]:
array([[False, False, False, True],
[False, True, False, True],
[False, False, True, True],
[ True, False, False, False],
[False, True, False, False]], dtype=bool)
In [109]: (matches!=0).astype(int)
Out[109]:
array([[0, 0, 0, 1],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
Runtime test
In [58]: df = pd.DataFrame({'some_int':range(100000)})
# #jezrael's soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop
# #jezrael's soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop
# #jezrael's soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop
# #MaxU's soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop
# #MaxU's soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop
# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop
I think you need:
a = pd.DataFrame(df['some_int'].astype(int)
.apply(bin)
.str[2:]
.str.zfill(4)
.apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Another solution, thanks to Jon Clements and ayhan:
a = pd.DataFrame(df['some_int'].apply(lambda x: list('{:04b}'.format(x))).values.tolist(),
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
A slightly changed version:
a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values],
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Timings:
df = pd.DataFrame({'some_int':range(100000)})
In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop
In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop
In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop
In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop
In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop
In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop