I have a dataset with binary values. I want to find the most frequent value in each row. This dataset has a couple of million records. What would be the most efficient way to do it? The following is a sample of the dataset.
import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1 bit2 bit2 bit4 bit5 frequent freq_count
0 0 0 1 1 0 3
1 1 1 0 0 1 3
1 0 1 1 1 1 4
I want to create the frequent and freq_count columns as in the sample above. These are not part of the original dataset; they will be created by looking at the bit values in each row.
Here's one approach -
import numpy as np

def freq_stat(df):
    a = df.values
    zero_c = (a == 0).sum(1)           # count of zeros per row
    one_c = a.shape[1] - zero_c        # count of ones per row
    df['frequent'] = (zero_c <= one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df
Sample run -
In [305]: df
Out[305]:
bit1 bit2 bit2.1 bit4 bit5
0 0 0 0 1 1
1 1 1 1 0 0
2 1 0 1 1 1
In [308]: freq_stat(df)
Out[308]:
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Benchmarking
Let's test this one out against the fastest approach from @jezrael's solution:
from scipy import stats
def mod(df):  # @jezrael's best soln
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df
Also, let's use the same setup from the other post and get the timings -
In [323]: np.random.seed(100)
...: N = 10000
...: #[10000 rows x 20 columns]
...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
...:
# @jezrael's soln
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop
# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
You can use scipy.stats.mode:
from scipy import stats
a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))
df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Use Counter.most_common:
from collections import Counter
def f(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Another solution:
def f(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Alternative:
from collections import defaultdict
def f(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Timings:
np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))
In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop
In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop
In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop
In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop
In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop
from collections import Counter
from collections import defaultdict
from scipy import stats
def f1(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])

def f2(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a, b])

def f3(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])

def mod(df):
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df

def mod1(df):
    a = df.values
    b = stats.mode(a, axis=1)
    df['a'] = b[0][:, 0]
    df['b'] = b[1][:, 0]
    return df
I have to handle a huge amount of data. Every row starts with 1 or 0. I need a dataframe where every row starts with 1, so I have to shift each row's values to the left until the first value is 1.
For example:
0 1 0 0 1 0 0
1 0 0 0 0 1 1
0 0 0 1 0 0 1
0 0 0 0 0 1 1
The result has to be this:
1 0 0 1 0 0 0
1 0 0 0 0 1 1
1 0 0 1 0 0 0
1 1 0 0 0 0 0
I don't want to use for, while, etc., because I need something faster, using pandas or numpy methods.
Do you have an idea for this problem?
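For reference, the sample frame used by the answers below can be built like this (a setup sketch with default integer column labels, not part of the original question):
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0, 1, 1],
                   [0, 0, 0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0, 1, 1]])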
You can use cummax to mask all positions that need to shift as NaN, then sort each row so the NaNs move to the end:
df[df.cummax(1).ne(0)].apply(lambda x : sorted(x,key=pd.isnull),1).fillna(0).astype(int)
Out[310]:
1 2 3 4 5 6 7
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
Or we can use the justify function written by Divakar (much faster than the apply with sorted); a sketch of it follows the output below.
pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
Out[314]:
0 1 2 3 4 5 6
0 1 0 0 1 0 0 0
1 1 0 0 0 0 1 1
2 1 0 0 1 0 0 0
3 1 1 0 0 0 0 0
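Note that justify is not defined in this post; the following is a sketch along the lines of Divakar's helper (an assumption, not a verbatim quote). It builds a validity mask and sorts it along the axis, which pushes valid values to one side:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # Mark valid entries; NaN needs a special check.
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting booleans pushes True values to the high end of each row/column.
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out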
You can make use of numpy.ogrid here:
a = df.values
s = a.argmax(1) * - 1
m, n = a.shape
r, c = np.ogrid[:m, :n]
s[s < 0] += n
c = c - s[:, None]
a[r, c]
array([[1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 1, 1],
[1, 0, 0, 1, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0]], dtype=int64)
Timings
In [35]: df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
In [36]: %timeit pd.DataFrame(justify(df[df.cummax(1).ne(0)].values, invalid_val=np.nan, axis=1, side='left')).fillna(0).astype(int)
116 ms ± 640 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %%timeit
...: a = df.values
...: s = a.argmax(1) * - 1
...: m, n = a.shape
...: r, c = np.ogrid[:m, :n]
...: s[s < 0] += n
...: c = c - s[:, None]
...: pd.DataFrame(a[r, c])
...:
...:
11.3 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For performance, you can use numba: an elementary loop, but effective thanks to JIT compilation and the use of more basic objects at the C level:
from numba import njit

@njit
def shifter(A):
    res = np.zeros(A.shape)
    for i in range(res.shape[0]):
        start = 0
        for j in range(res.shape[1]):
            if A[i, j] != 0:
                start = j
                break
        res[i, :res.shape[1]-start] = A[i, start:]
    return res
Performance benchmarking
def jpp(df):
    return pd.DataFrame(shifter(df.values).astype(int))

def user348(df):
    a = df.values
    s = a.argmax(1) * -1
    m, n = a.shape
    r, c = np.ogrid[:m, :n]
    s[s < 0] += n
    c = c - s[:, None]
    return pd.DataFrame(a[r, c])
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
assert np.array_equal(jpp(df).values, user348(df).values)
%timeit jpp(df) # 9.2 ms per loop
%timeit user348(df) # 18.5 ms per loop
Here is a stride_tricks solution, which is fast because it enables slice-wise copying.
def pp(x):
    n, m = x.shape
    am = x.argmax(-1)
    mam = am.max()
    xx = np.empty((n, m + mam), x.dtype)
    xx[:, :m] = x
    xx[:, m:] = 0
    xx = np.lib.stride_tricks.as_strided(xx, (n, mam+1, m), (*xx.strides, xx.strides[-1]))
    return xx[np.arange(x.shape[0]), am]
It pads the input with the required number of zeros and then creates a sliding-window view using as_strided. The view is addressed using fancy indexing, but because the last dimension is not indexed, the copying of rows is optimized and fast.
How fast? For large enough inputs, it is on par with numba:
x = np.random.randint(0, 2, (10000, 10))
from timeit import timeit
shifter(x) # that should compile it, right?
print(timeit(lambda:shifter(x).astype(x.dtype), number=1000))
print(timeit(lambda:pp(x), number=1000))
Sample output:
0.8630472810036736
0.7336142909916816
Given the following table
vals
0 20
1 3
2 2
3 10
4 20
I'm trying to find a clean solution in pandas to subtract away a value, say 30 for example, to end with the following result.
vals
0 0
1 0
2 0
3 5
4 20
I was wondering if pandas had a solution to performing this that didn't require looping through all the rows in a dataframe, something that takes advantage of pandas's bulk operations.
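For reference, the sample frame can be reproduced like this (a setup sketch, not part of the original question):
import pandas as pd

df = pd.DataFrame({'vals': [20, 3, 2, 10, 20]})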
1. identify where the cumsum is greater than or equal to 30
2. mask the rows where it isn't
3. reassign the one crossing row to be the cumsum less 30
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Same thing, but numpyfied
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Timing
%%timeit
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
10000 loops, best of 3: 168 µs per loop
%%timeit
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
1000 loops, best of 3: 853 µs per loop
Here's one using NumPy with four lines of code -
v = df.vals.values
a = v.cumsum()-30
idx = (a>0).argmax()+1
v[:idx] = a.clip(min=0)[:idx]
Sample run -
In [274]: df # Original df
Out[274]:
vals
0 20
1 3
2 2
3 10
4 20
In [275]: df.iloc[3,0] = 7 # Bringing in some variety
In [276]: df
Out[276]:
vals
0 20
1 3
2 2
3 7
4 20
In [277]: v = df.vals.values
...: a = v.cumsum()-30
...: idx = (a>0).argmax()+1
...: v[:idx] = a.clip(min=0)[:idx]
...:
In [278]: df
Out[278]:
vals
0 0
1 0
2 0
3 2
4 20
A one-liner solution:
df['vals'] = df.assign(res=30 - df.vals.cumsum()).apply(
    lambda x: 0 if x.res > 0 else x.vals if abs(x.res) > x.vals else x.vals - abs(x.res),
    axis=1)
df
Out[96]:
vals
0 0
1 0
2 0
3 5
4 20
Background
I got a data frame with integers. These integers represent a series of features that are either present or not present for that row.
I want these features to be named columns in my data frame.
Problem
My current solution explodes in memory and is crazy slow. How do I improve the memory efficiency of this?
import pandas as pd
df = pd.DataFrame({'some_int':range(5)})
(df['some_int'].astype(int)
   .apply(bin).str[2:].str.zfill(4)
   .apply(list).apply(pd.Series)
   .rename(columns=dict(zip(range(4), ["f1", "f2", "f3", "f4"]))))
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
It seems to be the .apply(pd.Series) that is slowing this down. Everything else is quite fast until I add this.
I cannot skip it because a simple list will not make a dataframe.
You can use the numpy.binary_repr method:
In [336]: df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=4)))) \
.add_prefix('f')
Out[336]:
f0 f1 f2 f3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
or
In [346]: pd.DataFrame([list(np.binary_repr(x, width=4)) for x in df.some_int.values],
...: columns=np.arange(1,5)) \
...: .add_prefix('f')
...:
Out[346]:
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Here's a vectorized NumPy approach -
def num2bin(nums, width):
    return ((nums[:, None] & (1 << np.arange(width-1, -1, -1))) != 0).astype(int)
Sample run -
In [70]: df
Out[70]:
some_int
0 1
1 5
2 3
3 8
4 4
In [71]: pd.DataFrame(num2bin(df.some_int.values, 4), \
    ...:              columns=["f1", "f2", "f3", "f4"])
Out[71]:
f1 f2 f3 f4
0 0 0 0 1
1 0 1 0 1
2 0 0 1 1
3 1 0 0 0
4 0 1 0 0
Explanation
1) Inputs:
In [98]: nums = np.array([1,5,3,8,4])
In [99]: width = 4
2) Get the powers-of-2 numbers, highest first:
In [100]: (1 << np.arange(width-1,-1,-1))
Out[100]: array([8, 4, 2, 1])
3) Convert nums to a 2D array, as we later want to do element-wise bit-ANDing between it and the powers of 2 in a vectorized manner following the rules of broadcasting:
In [101]: nums[:,None]
Out[101]:
array([[1],
[5],
[3],
[8],
[4]])
In [102]: nums[:,None] & (1 << np.arange(width-1,-1,-1))
Out[102]:
array([[0, 0, 0, 1],
[0, 4, 0, 1],
[0, 0, 2, 1],
[8, 0, 0, 0],
[0, 4, 0, 0]])
To understand the bit-ANDing, let's consider the number 5 from nums and its bit-ANDing against all the powers of 2 [8,4,2,1]:
In [103]: 5 & 8 # 0101 & 1000
Out[103]: 0
In [104]: 5 & 4 # 0101 & 0100
Out[104]: 4
In [105]: 5 & 2 # 0101 & 0010
Out[105]: 0
In [106]: 5 & 1 # 0101 & 0001
Out[106]: 1
Thus, we see that there is no intersection with [8, 2], whereas for the others we have non-zeros.
4) In the final stage, look for matches (non-zeros) and convert them to 1s and the rest to 0s by comparing against 0, giving a boolean array, then convert to int dtype:
In [107]: matches = nums[:,None] & (1 << np.arange(width-1,-1,-1))
In [108]: matches!=0
Out[108]:
array([[False, False, False, True],
[False, True, False, True],
[False, False, True, True],
[ True, False, False, False],
[False, True, False, False]], dtype=bool)
In [109]: (matches!=0).astype(int)
Out[109]:
array([[0, 0, 0, 1],
[0, 1, 0, 1],
[0, 0, 1, 1],
[1, 0, 0, 0],
[0, 1, 0, 0]])
Runtime test
In [58]: df = pd.DataFrame({'some_int':range(100000)})
# @jezrael's soln-1
In [59]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(4).apply(list).values.tolist())
1 loops, best of 3: 198 ms per loop
# @jezrael's soln-2
In [60]: %timeit pd.DataFrame([list('{:20b}'.format(x)) for x in df['some_int'].values])
10 loops, best of 3: 154 ms per loop
# @jezrael's soln-3
In [61]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:20b}'.format(x))).values.tolist())
10 loops, best of 3: 132 ms per loop
# @MaxU's soln-1
In [62]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loops, best of 3: 193 ms per loop
# @MaxU's soln-2
In [64]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loops, best of 3: 11.8 s per loop
# Proposed in this post
In [65]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 5.64 ms per loop
I think you need:
a = pd.DataFrame(df['some_int'].astype(int)
.apply(bin)
.str[2:]
.str.zfill(4)
.apply(list).values.tolist(), columns=["f1","f2","f3","f4"])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Another solution, thanks to Jon Clements and ayhan:
a = pd.DataFrame(df['some_int'].apply(lambda x: list('{:04b}'.format(x))).values.tolist(),
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
A bit changed:
a = pd.DataFrame([list('{:04b}'.format(x)) for x in df['some_int'].values],
columns=['f1', 'f2', 'f3', 'f4'])
print (a)
f1 f2 f3 f4
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
Timings:
df = pd.DataFrame({'some_int':range(100000)})
In [80]: %timeit pd.DataFrame(df['some_int'].astype(int).apply(bin).str[2:].str.zfill(20).apply(list).values.tolist())
1 loop, best of 3: 231 ms per loop
In [81]: %timeit pd.DataFrame([list('{:020b}'.format(x)) for x in df['some_int'].values])
1 loop, best of 3: 232 ms per loop
In [82]: %timeit pd.DataFrame(df['some_int'].apply(lambda x: list('{:020b}'.format(x))).values.tolist())
1 loop, best of 3: 222 ms per loop
In [83]: %timeit pd.DataFrame([list(np.binary_repr(x, width=20)) for x in df.some_int.values])
1 loop, best of 3: 343 ms per loop
In [84]: %timeit df.some_int.apply(lambda x: pd.Series(list(np.binary_repr(x, width=20))))
1 loop, best of 3: 16.4 s per loop
In [87]: %timeit pd.DataFrame( num2bin(df.some_int.values, 20))
100 loops, best of 3: 11.4 ms per loop
I have a pandas df with a time series in column1, and a boolean condition in column2. This describes continuous time intervals that meet a specific condition. Note that the time intervals are of unequal length.
Timestamp Boolean_condition
1 1
2 1
3 0
4 1
5 1
6 1
7 0
8 0
9 1
10 0
How to count the total number of time intervals within the whole series that meet this condition?
The desired output should look like this:
Timestamp Boolean_condition Event_number
1 1 1
2 1 1
3 0 NaN
4 1 2
5 1 2
6 1 2
7 0 NaN
8 0 NaN
9 1 3
10 0 NaN
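For reference, the sample frame can be reproduced like this (a setup sketch, not part of the original question):
import pandas as pd

df = pd.DataFrame({'Timestamp': range(1, 11),
                   'Boolean_condition': [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})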
You can create a Series with the cumsum of two masks and then set NaNs with Series.mask:
mask0 = df.Boolean_condition.eq(0)
mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
print ((mask2 & mask0).cumsum().add(1))
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
Name: Boolean_condition, dtype: int32
df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
print (df)
Timestamp Boolean_condition Event_number
0 1 1 1.0
1 2 1 1.0
2 3 0 NaN
3 4 1 2.0
4 5 1 2.0
5 6 1 2.0
6 7 0 NaN
7 8 0 NaN
8 9 1 3.0
9 10 0 NaN
Timings:
#[100000 rows x 2 columns
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()
def nick(df):
    isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
    idx = isone.index
    grp = (isone != idx.to_series().diff().eq(1)).cumsum()
    df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
    return df

def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return df

def jez1(df):
    mask0 = ~df.Boolean_condition
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().add(1).mask(mask0)
    return df
In [68]: %timeit (jez1(df))
100 loops, best of 3: 6.45 ms per loop
In [69]: %timeit (nick(df1))
100 loops, best of 3: 12 ms per loop
In [70]: %timeit (jez(df2))
100 loops, best of 3: 5.34 ms per loop
You could try the following:
1) Get all values of the True instances (here, 1), which comprise isone.
2) Take its corresponding set of indices and convert them to a Series whose index and values are both those indices. Compute the difference between successive rows and check whether it equals 1. This becomes our boolean mask.
3) Compare isone with the obtained boolean mask; whenever they are not equal, take the cumulative sum (effectively an adjacency check between elements). This helps for grouping purposes.
4) Using loc with the indices of isone, assign the codes computed after converting the grp array to Categorical format to a newly created column, Event_number.
isone = df.Boolean_condition[df.Boolean_condition.eq(1)]
idx = isone.index
grp = (isone != idx.to_series().diff().eq(1)).cumsum()
df.loc[idx, 'Event_number'] = pd.Categorical(grp).codes + 1
Faster approach:
Using only numpy:
1) Get its array representation.
2) Compute the indices of the non-zero (here, 1) entries.
3) Insert NaN at the beginning of this array, which acts as a starting point for computing differences between successive entries.
4) Initialize a new array filled with NaNs, of the same shape as the original array.
5) Whenever the difference between successive indices is not equal to 1, take the cumulative sum; otherwise the elements fall in the same group. These values get imputed at the indices where there were 1s before.
6) Assign these back to the new column.
def nick(df):
    b = df.Boolean_condition.values
    slc = np.flatnonzero(b)
    slc_pl_1 = np.append(np.nan, slc)
    nan_arr = np.full(b.size, fill_value=np.nan)
    nan_arr[slc] = np.cumsum(slc_pl_1[1:] - slc_pl_1[:-1] != 1)
    df['Event_number'] = nan_arr
    return df
Timings:
For a DF of 10,000 rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(10000),
    Boolean_condition=np.random.choice(np.array([0, 1]), 10000, p=[0.4, 0.6])))
df1.shape
# (10000, 2)
def jez(df):
    mask0 = df.Boolean_condition.eq(0)
    mask2 = df.Boolean_condition.ne(df.Boolean_condition.shift(1))
    df['Event_number'] = (mask2 & mask0).cumsum().mask(mask0)
    return df
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
1000 loops, best of 3: 362 µs per loop
%%timeit
jez(df1)
100 loops, best of 3: 1.56 ms per loop
For a DF containing 1 million rows:
np.random.seed(42)
df1 = pd.DataFrame(dict(
    Timestamp=np.arange(1000000),
    Boolean_condition=np.random.choice(np.array([0, 1]), 1000000, p=[0.4, 0.6])))
df1.shape
# (1000000, 2)
nick(df1).equals(jez(df1))
# True
%%timeit
nick(df1)
10 loops, best of 3: 34.9 ms per loop
%%timeit
jez(df1)
10 loops, best of 3: 50.1 ms per loop
This should work but might be a bit slow for a very long df.
df = pd.concat([df, pd.Series([0]*len(df), name='2')], axis=1)
if df.iloc[0, 1] == 1:
    counter = 1
    df.iloc[0, 2] = counter
else:
    counter = 0
    df.iloc[0, 2] = 0
previous = df.iloc[0, 1]
for y, x in df.iloc[1:, ].iterrows():
    if x[1] == 1 and previous == 1:
        previous = x[1]
        df.iloc[y, 2] = counter
    if x[1] == 0:
        previous = x[1]
        df.iloc[y, 2] = 0
    if x[1] == 1 and previous == 0:
        counter += 1
        previous = x[1]
        df.iloc[y, 2] = counter
A custom function does the trick. Here is a solution in Matlab code:
Boolean_condition = [1 1 0 1 1 1 0 0 1 0];
Event_number = nan(1, 10);
loop_event_number = 1;
last_event_number = 1;  % initialize in case the series starts with 0
for timestamp = 1:10
    if Boolean_condition(timestamp) == 1
        Event_number(timestamp) = loop_event_number;
        last_event_number = loop_event_number;
    else
        loop_event_number = last_event_number + 1;
    end
end
% Event_number = 1 1 NaN 2 2 2 NaN NaN 3 NaN
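For readers following along in Python, here is a rough translation of the same loop (a sketch, not from the original answer; label_events is a hypothetical helper name):
import numpy as np
import pandas as pd

def label_events(boolean_condition):
    # Label each contiguous run of 1s with an increasing event number;
    # positions where the condition is 0 get NaN.
    out = np.full(len(boolean_condition), np.nan)
    event = 0
    prev = 0
    for i, val in enumerate(boolean_condition):
        if val == 1:
            if prev == 0:  # a new run of 1s starts here
                event += 1
            out[i] = event
        prev = val
    return out

df = pd.DataFrame({'Boolean_condition': [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]})
df['Event_number'] = label_events(df['Boolean_condition'].values)
# Event_number: 1 1 NaN 2 2 2 NaN NaN 3 NaN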
I have a word list like the following.
wordlist = ['p1','p2','p3','p4','p5','p6','p7']
And the dataframe is like following.
df = pd.DataFrame({'id' : [1,2,3,4],
'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]})
output:
id path
1 p1,p2,p3,p4
2 p1,p2,p1
3 p1,p5,p5,p7
4 p1,p2,p3,p3
I want to count the path data to get the following output. Is this kind of transformation possible?
id p1 p2 p3 p4 p5 p6 p7
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
I think this would be efficient:
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
update
Another way to do this:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
update 2
Some rough tests for performance:
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
update 3
After reading this topic, I've found that Counter is really slow. You can optimize it a bit by using defaultdict:
>>> def create_dict(x):
...     d = defaultdict(int)
...     for c in x:
...         d[c] += 1
...     return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
and some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089
You can use the vectorized string method str.count() (see docs and reference), feeding it into a new dataframe once for each element in wordlist:
In [4]: pd.DataFrame({name : df["path"].str.count(name) for name in wordlist})
Out[4]:
p1 p2 p3 p4 p5 p6 p7
id
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
UPDATE: some answers to the comments. Indeed, this will not work if the strings can be substrings of each other (the OP should clarify that, then). If that is the case, this would work (and is also faster):
splitted = df["path"].str.split(",")
pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
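To see the pitfall, suppose the word list contained both 'p1' and 'p10' (a hypothetical example, not from the original data); substring counting would then over-count:
import pandas as pd

s = pd.Series(["p1,p10,p1"])
print(s.str.count("p1"))                                # 3 -- the 'p1' inside 'p10' is counted too
print(s.str.split(",").apply(lambda x: x.count("p1")))  # 2 -- exact token matches only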
And some tests to back up my claim of being faster :-)
Of course, I don't know what the realistic use case is, but I made the dataframe a bit larger (just repeated it 1000 times; the differences are bigger then):
In [37]: %%timeit
....: splitted = df["path"].str.split(",")
....: pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
....:
100 loops, best of 3: 17.9 ms per loop
In [38]: %%timeit
....: pd.DataFrame({name:df["path"].str.count(name) for name in wordlist})
....:
10 loops, best of 3: 23.6 ms per loop
In [39]: %%timeit
....: c = df["path"].str.split(',').apply(Counter)
....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
....:
10 loops, best of 3: 42.3 ms per loop
In [40]: %%timeit
....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
....:
1 loops, best of 3: 715 ms per loop
I also did the test with more elements in wordlist, and the conclusion is: if you have a larger dataframe with a relatively small number of elements in wordlist, my approach is faster; if you have a large wordlist, the approach with Counter from @RomanPekar can be faster (but only the last one).
Something similar to this:
df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']],columns=['p1','p2','p3','p4','p5','p6','p7'])
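For the sample frame above, this produces the desired counts; note that path.count(p) counts substrings, which is harmless here since no token in wordlist is a substring of another:
print(df1)
#    p1  p2  p3  p4  p5  p6  p7
# 0   1   1   1   1   0   0   0
# 1   2   1   0   0   0   0   0
# 2   1   0   0   0   2   0   1
# 3   1   1   2   0   0   0   0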