Having a simple pandas DataFrame with two columns, e.g. id and value, where value is either 0 or 1, I would like to randomly replace 10% of all value == 1 entries with 0.
How can I achieve this behaviour with pandas?
pandas answer
use query to get filtered df with only value == 1
use sample(frac=.1) to take 10% of those
use the index of the result to assign zero
df.loc[
df.query('value == 1').sample(frac=.1).index,
'value'
] = 0
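If the replacement needs to be reproducible, DataFrame.sample also accepts a random_state argument; a minimal variation of the same snippet (the seed value 0 is just an example):
df.loc[df.query('value == 1').sample(frac=.1, random_state=0).index, 'value'] = 0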
alternative numpy answer
get a boolean array of where df['value'] is 1
assign a random array of 10% zeros and 90% ones (note: this flips each 1 with probability 10%, so roughly 10% in expectation rather than exactly 10%)
v = df.value.values == 1
df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))
Here's a NumPy approach with np.random.choice -
a = df.value.values # get a view into value col
idx = np.flatnonzero(a) # get the nonzero indices
# Finally select unique 10% from those indices and set 0s there
a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0
Sample run -
In [237]: df = pd.DataFrame(np.random.randint(0,2,(100,2)),columns=['id','value'])
In [238]: (df.value==1).sum() # Original Count of 1s in df.value column
Out[238]: 53
In [239]: a = df.value.values
In [240]: idx = np.flatnonzero(a)
In [241]: a[np.random.choice(idx,size=int(0.1*len(idx)),replace=0)] = 0
In [242]: (df.value==1).sum() # New count of 1s in df.value column
Out[242]: 48
Alternatively, a bit more pandas-style approach (note: .ix is deprecated and removed in recent pandas; with the default RangeIndex, .loc is equivalent here) -
idx = np.flatnonzero(df['value'])
df.loc[np.random.choice(idx, size=int(0.1*len(idx)), replace=0), 'value'] = 0
Runtime test
All approaches posted thus far -
def f1(df):  ##piRSquared's soln1
    df.loc[df.query('value == 1').sample(frac=.1).index, 'value'] = 0

def f2(df):  ##piRSquared's soln2
    v = df.value.values == 1
    df.loc[v, 'value'] = np.random.choice((0, 1), v.sum(), p=(.1, .9))

def f3(df):  ##Roman Pekar's soln (size cast to int; 'value' selected inside .loc so the assignment is not chained)
    idx = df.index[df.value == 1]
    df.loc[np.random.choice(idx, size=idx.size // 10, replace=False), 'value'] = 0

def f4(df):  ##Mine soln1
    a = df.value.values
    idx = np.flatnonzero(a)
    a[np.random.choice(idx, size=int(0.1*len(idx)), replace=0)] = 0

def f5(df):  ##Mine soln2 (.ix replaced with .loc; equivalent here with the default RangeIndex)
    idx = np.flatnonzero(df['value'])
    df.loc[np.random.choice(idx, size=int(0.1*len(idx)), replace=0), 'value'] = 0
Timings -
In [2]: # Setup inputs
...: df = pd.DataFrame(np.random.randint(0,2,(10000,2)),columns=['id','value'])
...: df1 = df.copy()
...: df2 = df.copy()
...: df3 = df.copy()
...: df4 = df.copy()
...: df5 = df.copy()
...:
In [3]: # Timings
...: %timeit f1(df1)
...: %timeit f2(df2)
...: %timeit f3(df3)
...: %timeit f4(df4)
...: %timeit f5(df5)
...:
100 loops, best of 3: 3.96 ms per loop
1000 loops, best of 3: 844 µs per loop
1000 loops, best of 3: 1.62 ms per loop
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 663 µs per loop
you can probably use numpy.random.choice:
>>> idx = df.index[df.value == 1]
>>> df.loc[np.random.choice(idx, size=idx.size // 10, replace=False), 'value'] = 0
Suppose I have pandas dataframe, where first column is threshold:
threshold,value1,value2,value3,...,valueN
5,12,3,4,...,20
4,1,7,8,...,3
7,5,2,8,...,10
And for each row I want to set the elements in columns value1..valueN to zero if they are less than the threshold:
threshold,value1,value2,value3,...,valueN
5,12,0,0,...,20
4,0,7,8,...,0
7,0,0,8,...,10
How can I do this without explicit for loops?
You can try it this way:
df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x > df.threshold, x, 0), axis=0)
(Note that this also zeroes values equal to the threshold; use x >= df.threshold if those should be kept.)
Use DataFrame.lt for the comparison, combined with mask:
df = df.mask(df.lt(df['threshold'], axis=0), 0)
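For intuition, the same one-liner split in two steps (a small sketch of what the axis=0 comparison does):
below = df.lt(df['threshold'], axis=0)  # True wherever a cell is strictly below its row's threshold
df = df.mask(below, 0)                  # write 0 exactly there; the threshold column compares False against itself, so it is left untouched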
Or set_index and reset_index:
df = df.set_index('threshold')
df = df.mask(df.lt(df.index, axis=0), 0).reset_index()
For improved performance, a NumPy solution:
arr = df.values
df = pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)
print (df)
threshold value1 value2 value3 valueN
0 5 12 0 0 20
1 4 0 7 8 0
2 7 0 0 8 10
Timings:
In [294]: %timeit set_reset_sol(df)
1 loop, best of 3: 376 ms per loop
In [295]: %timeit numpy_sol(df)
10 loops, best of 3: 59.9 ms per loop
In [296]: %timeit df.mask(df.lt(df['threshold'], axis=0), 0)
1 loop, best of 3: 380 ms per loop
In [297]: %timeit df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: np.where(x > df.threshold, x, 0), axis=0)
1 loop, best of 3: 449 ms per loop
np.random.seed(234)
N = 100000
#[100000 rows x 100 columns]
df = pd.DataFrame(np.random.randint(100, size=(N, 100)))
df.columns = ['threshold'] + df.columns[1:].tolist()
print (df)
def set_reset_sol(df):
    df = df.set_index('threshold')
    return df.mask(df.lt(df.index, axis=0), 0).reset_index()

def numpy_sol(df):
    arr = df.values
    return pd.DataFrame(np.where(arr < arr[:, 0][:, None], 0, arr), columns=df.columns)
The title is confusing.
So, say I have a dataframe with a column, id, whose values occur multiple times throughout the dataframe. Then I have another column, let's call it cumulativeOccurrences.
How do I select all unique occurrences of id such that the other column fulfills a certain condition, say cumulativeOccurrences > 20 for each and every instance of that id?
The beginning of the code is probably something like this:
dataframe.groupby('id')
But I can't figure out the rest.
Here is a small sample dataset for which the result should be empty (no id fulfils the condition):
id cumulativeOccurrences
5494178 136
5494178 71
5494178 18
5494178 83
5494178 57
5494178 181
5494178 13
5494178 10
5494178 90
5494178 4484
Okay, here is the result I got after more muddling around:
res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
ids = res[res.cumulativeOccurrences['<lambda>']==True].index
This gives me a list of ids which fulfill the condition. There probably is a better way than the list comprehension lambda function for the agg function, though. Any ideas?
First build the boolean comparison and then use GroupBy.all:
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')
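If you also need the matching rows rather than just the ids, a boolean isin filter on the original frame is a natural follow-up (a minimal sketch using the ids computed above):
df_subset = df[df['id'].isin(ids)]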
EDIT1:
The first timings are for unsorted id, the second for sorted id.
np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
columns=['id','cumulativeOccurrences'])
print (df.head())
In [125]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 1.22 s per loop
In [126]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.69 s per loop
In [128]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.63 s per loop
np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())
In [130]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 795 ms per loop
In [131]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.23 s per loop
In [132]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.15 s per loop
Conclusion - sorting by id (with a fresh default index) can improve performance. The data was tested with pandas 0.20.3 under Python 3.
Suppose I have two dataframes X and Y, and I want to get the rows of X whose values also appear in Y. For example,
X is like this:
user_id sku_id time model_id type cate brand
0 27630 37957 2016-02-01 07:43:14 NaN 6 8 489
1 65377 165713 2016-02-01 11:09:34 NaN 6 5 124
2 10396 65823 2016-02-01 08:20:59 NaN 6 6 78
……
and Y is like this:
user_id sku_id
8489 58104 79636
9043 99179 113675
9330 101391 39778
9468 65786 73834
……
The (user_id, sku_id) pairs are not unique in X and Y. I want to select from X all rows whose (user_id, sku_id) pair appears in Y. It is not just isin(), because user_id and sku_id have to satisfy the requirement at the same time.
I would also like to find a more efficient way than merge().
I think you need an inner join with merge:
df = pd.merge(X, Y)
Or:
X.set_index(['user_id', 'sku_id'], inplace=True)
df = Y.join(X, how='inner', on=['user_id', 'sku_id'])
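One caveat, since the (user_id, sku_id) pairs are not unique in Y: a plain inner merge or join repeats each matching X row once per duplicate pair in Y. If every matching X row should appear only once, a small sketch is to deduplicate Y on the key columns first:
df = pd.merge(X, Y.drop_duplicates(['user_id', 'sku_id']))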
Another solution is isin with boolean indexing, but it only works if user_id is unique:
X = X.set_index('user_id')
df = X[X['sku_id'].isin(Y.set_index('user_id')['sku_id'])].reset_index()
Generally, the best and fastest option is to use merge in pandas:
In [143]: %%timeit
...: (Y1.join(X1.set_index(['user_id', 'sku_id']),how='inner',on=['user_id','sku_id']))
...:
1 loop, best of 3: 583 ms per loop
In [144]: %%timeit
...: (pd.merge(X2,Y2))
...:
1 loop, best of 3: 487 ms per loop
In [145]: %%timeit
...: x = pd.MultiIndex.from_arrays([X['user_id'], X['sku_id']])
...: y = pd.MultiIndex.from_arrays([Y['user_id'], Y['sku_id']])
...: inter = x.intersection(y)
...: a = X.set_index(['user_id', 'sku_id']).loc[inter].reset_index()
...:
1 loop, best of 3: 1.26 s per loop
#another solution
In [146]: %%timeit
...: X[(X['user_id'].astype(str) +"_" +X['sku_id'].astype(str)).isin((Y['user_id'].astype(str)+"_"+Y['sku_id'].astype(str)))]
...:
1 loop, best of 3: 6.34 s per loop
If all values are strings (X = X.astype(str), Y = Y.astype(str)):
In [147]: %%timeit
...: X[(X['user_id'] +"_" +X['sku_id']).isin((Y['user_id']+"_"+Y['sku_id']))]
...:
1 loop, best of 3: 953 ms per loop
Code for timings:
np.random.seed(123)
N = 1000000
X = pd.DataFrame({'user_id':np.random.randint(10000, size=N),
'sku_id': np.random.randint(10000, size=N),
'brand': np.random.randint(10000, size=N)})
X = X.drop_duplicates(subset=['user_id', 'sku_id'])
print (X)
X1,X2 = X.copy(), X.copy()
Y = pd.DataFrame({'user_id':np.random.randint(10000, size=N),
'sku_id': np.random.randint(10000, size=N)})
print (Y)
Y = Y.drop_duplicates(subset=['user_id', 'sku_id'])
Y1,Y2 = Y.copy(), Y.copy()
Not sure how fast this solution will be, but you can try to concatenate the two columns and then use isin() (the columns are integers here, so they need astype(str) before concatenation):
X['key'] = X['user_id'].astype(str) + "_" + X['sku_id'].astype(str)
Y['key'] = Y['user_id'].astype(str) + "_" + Y['sku_id'].astype(str)
Now just use key to filter X by Y. Please let me know if this improves performance.
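A sketch of that final filtering step, using the key columns built above:
result = X[X['key'].isin(Y['key'])]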
I have a csv file with transaction data of the following form
import pandas as pd
df = pd.DataFrame({'OrderID':[1,1,1,1,2,2], 'ItemID':[1,2,3,4,1,2]})
print(df)
ItemID OrderID
0 1 1
1 2 1
2 3 1
3 4 1
4 1 2
5 2 2
I want to obtain a list that contains, for every OrderID, the set of its items.
This can be obtained with
df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
[{1, 2, 3, 4}, {1, 2}]
However, on a csv file with 9 million rows this takes some time. Thus, I'm wondering if there is a faster way?
I'm interested in any solution, either with pandas or one that operates directly on the .csv file.
First of all I want to thank you guys, for your awesome input!
I took a sample of 50000 OrderIDs (and the corresponding items) from my real data and applied several of the methods from the answers to this data set.
And here are the results.
Note that I used the updated version of the pir program.
So the winner is divakar, even if we only consider the list-of-sets output.
On my whole data set, his faster set-based approach takes 5.05 seconds and his faster list-based approach only 2.32 seconds.
That is a huge gain from the initial 115 seconds!
Thanks again!
new method
defaultdict
from collections import defaultdict
def pir(df):
    d = defaultdict(set)
    for n, g in df.groupby('OrderID').ItemID:
        d[n].update(g.values.tolist())
    return list(d.values())
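As a quick sanity check, this should reproduce the result from the question on the small example frame (in recent Python versions the dict keeps the groupby key order, so the list is ordered by OrderID):
pir(pd.DataFrame({'OrderID': [1, 1, 1, 1, 2, 2], 'ItemID': [1, 2, 3, 4, 1, 2]}))
# [{1, 2, 3, 4}, {1, 2}]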
sample
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 1000, 10000000),
ItemID=np.random.randint(0, 1000, 10000000)))
old method
uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)
def gu(i):
    return set(ui[ii[io == i]].tolist())
[gu(i) for i in range(len(uo))]
[{1, 2, 3, 4}, {1, 2}]
old timing
code:
def pir(df):
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)
    def gu(i):
        return set(ui[ii[io == i]].tolist())
    return [gu(i) for i in range(len(uo))]

def jez(df):
    arr = df.groupby('OrderID')['ItemID'].unique().values
    return [set(v) for v in arr]

def div(df):
    a = df.values
    sidx = a[:,1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))

def quik(df):
    return df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
with sample data
with more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000),
ItemID=np.random.randint(0, 10, 10000)))
even more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000000),
ItemID=np.random.randint(0, 10, 10000000)))
Approach #1 : Using array splitting and set -
def divakar_v1(df):
    a = df.values
    sidx = a[:,1].argsort()  # Use .argsort(kind='mergesort') to keep order
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))
Approach #2 : Using iterative array slicing and set -
def divakar_v2(df):
    data = df.values
    a = data[data[:,1].argsort()]  # Use .argsort(kind='mergesort') to keep order
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, a.size)
    start = np.append(0, stop[:-1])
    out_set = [set(a[start[i]:stop[i],0]) for i in range(len(start))]
    return out_set
Given that, per 'OrderID', we would have unique/distinct elements in 'ItemID', there are two more approaches that skip the use of set() and thus give us a list of lists as output. These are listed next.
Approach #3 : Using array splitting and list of lists as o/p -
def divakar_v3(df):
    a = df.values
    sidx = a[:,1].argsort()  # Use .argsort(kind='mergesort') to keep order
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(list,out))
Approach #4 : Using iterative array slicing and list of lists as o/p -
def divakar_v4(df):
    data = df.values
    a = data[data[:,1].argsort()]  # Use .argsort(kind='mergesort') to keep order
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, a.size)
    start = np.append(0, stop[:-1])
    a0 = a[:,0].tolist()
    return [a0[start[i]:stop[i]] for i in range(len(start))]
Runtime test -
In [145]: np.random.seed(123)
...: N = 100000
...: df = pd.DataFrame(np.random.randint(30,size=(N,2)))
...: df.columns = ['ItemID','OrderID']
...:
In [146]: %timeit divakar_v1(df)
...: %timeit divakar_v2(df)
...: %timeit divakar_v3(df)
...: %timeit divakar_v4(df)
...:
10 loops, best of 3: 21.1 ms per loop
10 loops, best of 3: 21.7 ms per loop
100 loops, best of 3: 16.7 ms per loop
100 loops, best of 3: 12.3 ms per loop
You can try SeriesGroupBy.unique, then convert to a NumPy array and finally to sets with a list comprehension:
arr = df.groupby('OrderID')['ItemID'].unique().values
print (arr)
[array([1, 2, 3, 4], dtype=int64) array([1, 2], dtype=int64)]
print ([set(v) for v in arr])
[{1, 2, 3, 4}, {1, 2}]
EDIT: It is faster to use unique inside apply:
print (df.groupby('OrderID').apply(lambda x: set(x['ItemID'].unique())).tolist())
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30,size=(N,2)))
df.columns = ['OrderID','ItemID']
def pir(df):
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)
    def gu(i):
        return set(ui[ii[io == i]].tolist())
    return [gu(i) for i in range(len(uo))]

def divakar(df):
    a = df.values
    sidx = a[:,1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))
In [120]: %timeit (df.groupby('OrderID')
.apply(lambda x: set(x['ItemID'].unique())).tolist())
10 loops, best of 3: 92.7 ms per loop
In [121]: %timeit (df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist())
10 loops, best of 3: 168 ms per loop
In [122]: %timeit ([set(v) for v in df.groupby('OrderID')['ItemID'].unique().values])
10 loops, best of 3: 125 ms per loop
In [123]: %timeit (list(map(set,df.groupby('OrderID')['ItemID'].unique().values)))
10 loops, best of 3: 125 ms per loop
In [124]: %timeit (pir(df))
1 loop, best of 3: 276 ms per loop
In [125]: %timeit (divakar(df))
1 loop, best of 3: 190 ms per loop
I take 3 values from a column (the third) and put these values into a row across 3 new columns, then merge the new and old columns into a new matrix A.
Input: a time series in column nr 3; other values in columns nr 1 and 2
[x x 1]
[x x 2]
[x x 3]
output : matrix A
[x x 1 0 0 0]
[x x 2 0 0 0]
[x x 3 1 2 3]
[x x 4 2 3 4]
So, for brevity, the code first generates a 6-row / 3-column matrix. I want to use the last column to fill 3 extra columns and merge everything into a new matrix A. Matrix A is prefilled with 2 rows of zeros to offset the starting position.
I have implemented this idea in the code below, and it takes a really long time to process large data sets.
How can I improve the speed of this conversion?
import numpy as np

matrix = np.arange(18).reshape((6, 3))
nr = 3
A = np.zeros((nr-1, nr))
for x in range(matrix.shape[0] - nr + 1):
    newrow = np.transpose(matrix[x:x+nr, 2:3])
    A = np.vstack([A, newrow])
total = np.column_stack((matrix, A))
print(total)
Here's an approach using broadcasting to get those sliding windowed elements and then just some stacking to get A -
col2 = matrix[:,2]
nrows = col2.size-nr+1
out = np.zeros((nr-1+nrows,nr))
col2_2D = np.take(col2,np.arange(nrows)[:,None] + np.arange(nr))
out[nr-1:] = col2_2D
Here's an efficient alternative using NumPy strides to get col2_2D -
n = col2.strides[0]
col2_2D = np.lib.stride_tricks.as_strided(col2, shape=(nrows,nr), strides=(n,n))
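As a side note (not part of the timed code below), newer NumPy versions (1.20+) expose the same windowed view through a safer helper:
col2_2D = np.lib.stride_tricks.sliding_window_view(col2, nr)   # shape (nrows, nr), same as the as_strided view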
It would be even better to initialize an output array of zeros of the same size as total and then assign values into it from col2_2D and finally from the input array matrix.
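A minimal sketch of that suggestion, assuming the same matrix, nr and col2_2D as above, so total is built in one shot instead of via vstack/column_stack:
ncols = matrix.shape[1]
total = np.zeros((matrix.shape[0], ncols + nr))
total[:, :ncols] = matrix          # copy the original columns
total[nr-1:, ncols:] = col2_2D     # place the sliding windows, offset by nr-1 rows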
Runtime test
Approaches as functions -
def org_app1(matrix, nr):
    A = np.zeros((nr-1, nr))
    for x in range(matrix.shape[0] - nr + 1):
        newrow = np.transpose(matrix[x:x+nr, 2:3])
        A = np.vstack([A, newrow])
    return A

def vect_app1(matrix, nr):
    col2 = matrix[:,2]
    nrows = col2.size - nr + 1
    out = np.zeros((nr-1+nrows, nr))
    col2_2D = np.take(col2, np.arange(nrows)[:,None] + np.arange(nr))
    out[nr-1:] = col2_2D
    return out

def vect_app2(matrix, nr):
    col2 = matrix[:,2]
    nrows = col2.size - nr + 1
    out = np.zeros((nr-1+nrows, nr))
    n = col2.strides[0]
    col2_2D = np.lib.stride_tricks.as_strided(col2, shape=(nrows, nr), strides=(n, n))
    out[nr-1:] = col2_2D
    return out
Timings and verification -
In [18]: # Setup input array and params
...: matrix = np.arange(1800).reshape((60, 30))
...: nr=3
...:
In [19]: np.allclose(org_app1(matrix,nr),vect_app1(matrix,nr))
Out[19]: True
In [20]: np.allclose(org_app1(matrix,nr),vect_app2(matrix,nr))
Out[20]: True
In [21]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 646 µs per loop
In [22]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 20.6 µs per loop
In [23]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 21.5 µs per loop
In [28]: # Setup input array and params
...: matrix = np.arange(7200).reshape((120, 60))
...: nr=30
...:
In [29]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 1.19 ms per loop
In [30]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 45 µs per loop
In [31]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 27.2 µs per loop