Replace specific values in multiindex dataframe - python

I have a multiindex dataframe with 3 index levels and 2 numerical columns:
                  col1        col2
A 1 2017-04-01    14.0   87.346878
    2017-06-01     4.0   87.347504
  2 2014-08-01     1.0  123.110001
    2015-01-01     4.0  209.612503
B 3 2014-07-01     1.0   68.540001
    2014-12-01     1.0   64.370003
  4 2015-01-01     3.0   75.000000
I want to replace the values in the first row of the 3rd index level wherever a new second-level index begins.
For example, every such first row:
(A,1,2017-04-01)->0.0 0.0
(A,2,2014-08-01)->0.0 0.0
(B,3,2014-07-01)->0.0 0.0
(B,4,2015-01-01)->0.0 0.0
The dataframe is too big, and doing it slice by slice with df.xs(('A', 1)), df.xs(('A', 2)), ... gets time consuming. Is there some way I can build a mask and replace the values at these positions?

Use DataFrame.reset_index on level=2, then DataFrame.groupby on level=[0, 1] and aggregate level_2 with first. Then create a multilevel index from those arrays with pd.MultiIndex.from_arrays, and finally use this index to set the values in the dataframe:
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
Result:
# print(df)
                  col1        col2
A 1 2017-04-01     0.0    0.000000
    2017-06-01     4.0   87.347504
  2 2014-08-01     0.0    0.000000
    2015-01-01     4.0  209.612503
B 3 2014-07-01     0.0    0.000000
    2014-12-01     1.0   64.370003
  4 2015-01-01     0.0    0.000000
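Since the question explicitly asks for a mask, note that the same first-row positions can also be expressed as a boolean mask with GroupBy.cumcount; this is a small sketch of that idea (not part of the original answer), assuming df is the example frame above:
# cumcount numbers the rows inside each (level 0, level 1) group,
# so 0 marks the first row of every group
mask = df.groupby(level=[0, 1]).cumcount() == 0
df.loc[mask, :] = 0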

We can extract the values of the second-level index with:
df.index.get_level_values(1)
# output: Int64Index([1, 1, 2, 2, 3, 3, 4], dtype='int64')
And check where it changes with:
idx = df.index.get_level_values(1)
np.where(idx != np.roll(idx, 1))[0]
# output: array([0, 2, 4, 6])
So we can use the positions returned by the second statement with iloc to select the first row of every second-level group and modify its values like this:
idx = df.index.get_level_values(1)
df.iloc[np.where(idx != np.roll(idx, 1))[0]] = 0
output:
                  col1        col2
A 1 2017-04-01     0.0    0.000000
    2017-06-01     4.0   87.347504
  2 2014-08-01     0.0    0.000000
    2015-01-01     4.0  209.612503
B 3 2014-07-01     0.0    0.000000
    2014-12-01     1.0   64.370003
  4 2015-01-01     0.0    0.000000

You can use the grouper indices in a simple iloc:
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
Example:
df = pd.DataFrame(
    {'col1': [14., 4., 1., 4., 1., 1., 3.],
     'col2': [87.346878, 87.347504, 123.110001, 209.612503, 68.540001, 64.370003, 75.]},
    index=pd.MultiIndex.from_tuples([('A', 1, '2017-04-01'), ('A', 1, '2017-06-01'),
                                     ('A', 2, '2014-08-01'), ('A', 2, '2015-01-01'),
                                     ('B', 3, '2014-07-01'), ('B', 3, '2014-12-01'),
                                     ('B', 4, '2015-01-01')]))
Result:
                  col1        col2
A 1 2017-04-01     0.0    0.000000
    2017-06-01     4.0   87.347504
  2 2014-08-01     0.0    0.000000
    2015-01-01     4.0  209.612503
B 3 2014-07-01     0.0    0.000000
    2014-12-01     1.0   64.370003
  4 2015-01-01     0.0    0.000000
Timings:
%%timeit
idx = df.reset_index(level=2).groupby(level=[0, 1])['level_2'].first()
idx = pd.MultiIndex.from_arrays(idx.reset_index().to_numpy().T)
df.loc[idx, :] = 0
#6.7 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.iloc[[a[0] for a in df.groupby(level=[0, 1]).indices.values()]] = 0
#897 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So this is about 7 times faster than the accepted answer.
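For context, GroupBy.indices returns a dict mapping each group key to the integer positions of its rows, which is why taking element [0] of every value yields exactly the first-row positions. For the example frame above it looks roughly like this:
df.groupby(level=[0, 1]).indices
# {('A', 1): array([0, 1]), ('A', 2): array([2, 3]),
#  ('B', 3): array([4, 5]), ('B', 4): array([6])}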

I think you can use something like this:
import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
You can create a list of the labels of interest from your first index level, find the row position where each group starts, and then replace the value in your column at those positions.
lst = ['bar', 'foo', 'qux']
ls = []
for i in lst:
    base = df.index.get_loc(i)        # slice covering this first-level label
    a = base.indices(len(df))[0]      # integer position of the group's first row
    ls.append(a)

for ii in ls:
    df.iloc[ii, 0] = 0                # same as df[0][ii] = 0, without chained indexing
df
Hopefully this helps.
Cheers!

Related

Python/Pandas: use one column's value to be the suffix of the column name from which I want a value

I have a pandas dataframe. From multiple columns therein, I need to select the value from only one into a single new column, according to the ID (bar in this example) of that row.
I need the fastest way to do this.
Dataframe for application is like this:
foo bar ID_A ID_B ID_C ID_D ID_E ...
1 B 1.5 2.3 4.1 0.5 6.6 ...
2 E 3 4 5 6 7 ...
3 A 9 6 3 8 1 ...
4 C 13 5 88 9 0 ...
5 B 6 4 6 9 4 ...
...
An example of a way to do it (my fastest at present) is thus - however, it is too slow for my purposes.
df.loc[df.bar=='A', 'baz'] = df.ID_A
df.loc[df.bar=='B', 'baz'] = df.ID_B
df.loc[df.bar=='C', 'baz'] = df.ID_C
df.loc[df.bar=='D', 'baz'] = df.ID_D
df.loc[df.bar=='E', 'baz'] = df.ID_E
df.loc[df.bar=='F', 'baz'] = df.ID_F
df.loc[df.bar=='G', 'baz'] = df.ID_G
Result will be like this (after dropping used columns):
foo baz
1 2.3
2 7
3 9
4 88
5 4
...
I have tried with .apply() and it was very slow.
I tried np.where(), but it was still much slower than the example shown above (which was roughly 1000% faster than np.where()).
Would appreciate recommendations!
Many thanks
EDIT: after the first few answers, I think I need to add this:
"whilst I would appreciate runtime estimate relative to the example, I know it's a small example so may be tricky.
My actual data has 280000 rows and an extra 50 columns (which I need to keep along with foo and baz). I have to reduce 13 columns to the single column per the example.
The speed is the only reason for asking, & no mention of speed thus far in first few responses. Thanks again!"
You can use a variant of the indexing lookup:
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
foo baz
0 1 2.3
1 2 7.0
2 3 9.0
3 4 88.0
4 5 4.0
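For illustration (not part of the original answer), this is what the two pieces returned by factorize look like on the small sample frame: the codes index row by row into the block of reindexed ID_ columns.
idx, cols = pd.factorize('ID_' + df['bar'])
# idx  -> array([0, 1, 2, 3, 0])                      codes, one per row
# cols -> Index(['ID_B', 'ID_E', 'ID_A', 'ID_C'], dtype='object')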
testing speed
Setting up a test dataset (280k rows, 54 columns in total, 52 of them ID columns):
from string import ascii_uppercase, ascii_lowercase

letters = list(ascii_lowercase + ascii_uppercase)
N = 280_000
np.random.seed(0)
df = (pd.DataFrame({'foo': np.arange(1, N+1),
                    'bar': np.random.choice(letters, size=N)})
      .join(pd.DataFrame(np.random.random(size=(N, len(letters))),
                         columns=[f'ID_{l}' for l in letters]))
      )
speed testing:
%%timeit
idx, cols = pd.factorize('ID_' + df['bar'])
out = pd.DataFrame({'foo': df['foo'],
                    'baz': df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]})
output:
54.4 ms ± 3.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can try this. It should generalize to an arbitrary number of columns.
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 'B', 1.5, 2.3, 4.1, 0.5, 6.6],
                   [2, 'E', 3, 4, 5, 6, 7],
                   [3, 'A', 9, 6, 3, 8, 1],
                   [4, 'C', 13, 5, 88, 9, 0],
                   [5, 'B', 6, 4, 6, 9, 4]])
df.columns = ['foo', 'bar', 'ID_A', 'ID_B', 'ID_C', 'ID_D', 'ID_E']

for val in np.unique(df['bar'].values):
    df.loc[df.bar == val, 'baz'] = df[f'ID_{val}']
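Since the question mentions dropping the used columns afterwards, a possible follow-up sketch (the column selection here is my assumption, not part of the original answer):
df = df.drop(columns=[c for c in df.columns if c.startswith('ID_')] + ['bar'])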
To show an alternative approach, you can perform a combination of melting your data and reindexing. In this case I used wide_to_long (instead of melt/stack) because of the patterned nature of your column names:
out = (
    pd.wide_to_long(
        df, stubnames=['ID'], i=['foo', 'bar'], j='', sep='_', suffix=r'\w+'
    )
    .loc[lambda d:
         d.index.get_level_values('bar') == d.index.get_level_values(level=-1),
         'ID'
    ]
    .droplevel(-1)
    .rename('baz')
    .reset_index()
)
print(out)
foo bar baz
0 1 B 2.3
1 2 E 7.0
2 3 A 9.0
3 4 C 88.0
4 5 B 4.0
An alternative leverages .melt and .query to shorten the syntax.
out = (
    df.melt(id_vars=['foo', 'bar'], var_name='id', value_name='baz')
      .assign(id=lambda d: d['id'].str.get(-1))
      .query('bar == id')
)
print(out)
foo bar id baz
2 3 A A 9.0
5 1 B B 2.3
9 5 B B 4.0
13 4 C C 88.0
21 2 E E 7.0
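If you then want only the foo/baz columns in the original row order, as in the desired output, a small follow-up sketch (my addition, not part of the original answer):
out = out.sort_values('foo')[['foo', 'baz']].reset_index(drop=True)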

Update column values in a group based on one row in that group

I have a dataframe from source data that resembles the following:
In [1]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan,
                                         np.nan, 'auditory', np.nan]})
Out[1]:
test_group test_type
0 1 NaN
1 1 memory
2 1 NaN
3 2 NaN
4 2 visual
5 2 NaN
6 3 NaN
7 3 auditory
8 3 NaN
test_group represents the grouping of the rows, which represent a test. I need to replace the NaNs in column test_type in each test_group with the value of the row that is not a NaN, e.g. memory, visual, etc.
I've tried a variety of approaches including isolating the "real" value in test_type such as
In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1 [nan, memory]
2 [nan, visual]
3 [nan, auditory]
Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:
In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1 memory
2 visual
3 auditory
I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):
In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])
In [17]: df
Out[17]:
test_group test_type
0 1 NaN
1 1 memory
2 1 visual
3 2 auditory
4 2 NaN
5 2 NaN
6 3 NaN
7 3 NaN
8 3 NaN
I'm sure I could get this done with a loop, but loops are too slow, as the data set is millions of records per file.
You can use GroupBy.size to get the size of each group, filter out the NaN rows with boolean indexing via Series.isna, and then use Index.repeat with DataFrame.reindex:
repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
timeit analysis:
Benchmarking dataframe:
df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
                   'test_type': [np.nan]*10_000 + ['memory'] +
                                [np.nan]*10_000 + ['visual'] +
                                [np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
Results:
# Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Almost 4x faster. I believe that's because boolean indexing is very fast, and reindex + repeat is lightweight compared to the two fillna calls.
Under the assumption that there's a unique non-NaN value per group, the following should satisfy your request.
>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill()
>>> df
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
edit:
The original answer used
df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
but it looks like, according to schwim's timings, ffill/bfill is significantly faster (for some reason).
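For completeness, another common idiom for this problem is a grouped transform; this is only a sketch (not among the answers above) and, like them, it assumes one non-NaN value per group:
df['test_type'] = df.groupby('test_group')['test_type'].transform('first')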

Create a sliding window of data index positions

I am trying to write a function that returns the index positions of a sliding window over a Pandas DataFrame as a list of (train, test) tuples.
Example:
df.head(10)
col_a col_b
0 20.1 6.0
1 19.1 7.1
2 19.1 8.9
3 16.5 11.0
4 16.0 11.1
5 17.4 8.7
6 19.3 9.7
7 22.8 12.6
8 21.4 11.9
9 23.0 12.8
def split_function(df, train_length, test_length):
    # some logic to split the dataframe
    split_indices = [(train_idx, test_idx) for index_tuples in split_dataframe_logic]
    return split_indices
Desired outcome:
train_length = 2
test_length = 1
split_indices = split_function(df, train_length, test_length)
split_indices
output:
[((0,1), (2)), ((1,2),(3)),...,((7,8), (9)) etc]
The function's loop/generator expression would also need to terminate once test_index reaches the last observation.
All help very much appreciated
I would suggest using the rolling method offered by pandas.
split_indices = []

def split(x):
    split_indices.append((x.index[:train_length], x.index[-test_length:]))
    return np.nan

df['col_a'].rolling(train_length + test_length).apply(split)
This code will create the following split_indices
>>> split_indices
[(Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64')),
(Int64Index([1, 2], dtype='int64'), Int64Index([3], dtype='int64')),
(Int64Index([2, 3], dtype='int64'), Int64Index([4], dtype='int64')),
(Int64Index([3, 4], dtype='int64'), Int64Index([5], dtype='int64')),
(Int64Index([4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
(Int64Index([5, 6], dtype='int64'), Int64Index([7], dtype='int64')),
(Int64Index([6, 7], dtype='int64'), Int64Index([8], dtype='int64')),
(Int64Index([7, 8], dtype='int64'), Int64Index([9], dtype='int64'))]
Afterwards, you can easily get the data of your dataframe for a given index:
>>> df.loc[split_indices[3][0]]
   col_a  col_b
3   16.5   11.0
4   16.0   11.1
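If plain integer positions (rather than index labels) are enough, a simple comprehension over row positions is another possible sketch, assuming the same train_length and test_length as above:
train_length, test_length = 2, 1
window = train_length + test_length
# one (train, test) tuple of positions per window, stopping at the last row
split_indices = [(tuple(range(i, i + train_length)),
                  tuple(range(i + train_length, i + window)))
                 for i in range(len(df) - window + 1)]
# [((0, 1), (2,)), ((1, 2), (3,)), ..., ((7, 8), (9,))]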

Fastest way to calculate difference in all columns

I have a dataframe of all float columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12.0).reshape(3,4), columns=list('ABCD'))
# A B C D
# 0 0.0 1.0 2.0 3.0
# 1 4.0 5.0 6.0 7.0
# 2 8.0 9.0 10.0 11.0
I would like to calculate column-wise differences for all combinations of columns (e.g., A-B, A-C, B-C, etc.).
E.g., the desired output would be something like:
A_B A_C A_D B_C B_D C_D
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
-1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Since the number of columns may be large, I'd like to do the calculations as efficiently/quickly as possible. I assume I'll get a big speed bump by converting the dataframe to a numpy array first so I'll do that, but I'm wondering if there are any other strategies that might result in large performance gains. Maybe some matrix algebra or multidimensional data format trick that results in not having to loop through all unique combinations. Any suggestions are welcome. This project is in Python 3.
Listed in this post are two NumPy approaches for performance: one fully vectorized, and another with one loop.
Approach #1
def numpy_triu1(df):
    a = df.values
    r, c = np.triu_indices(a.shape[1], 1)
    cols = df.columns
    nm = [cols[i] + "_" + cols[j] for i, j in zip(r, c)]
    return pd.DataFrame(a[:, r] - a[:, c], columns=nm)
Sample run -
In [72]: df
Out[72]:
A B C D
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
In [78]: numpy_triu1(df)
Out[78]:
A_B A_C A_D B_C B_D C_D
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
Approach #2
If we are okay with an array as output, or a dataframe without specialized column names, here's another one -
def pairwise_col_diffs(a):  # a would be df.values
    n = a.shape[1]
    N = n*(n-1)//2
    idx = np.concatenate(([0], np.arange(n-1, 0, -1).cumsum()))
    start, stop = idx[:-1], idx[1:]
    out = np.empty((a.shape[0], N), dtype=a.dtype)
    for j, i in enumerate(range(n-1)):
        out[:, start[j]:stop[j]] = a[:, i, None] - a[:, i+1:]
    return out
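A quick usage sketch (my addition) for the 4-column sample frame; the columns come out in the order A-B, A-C, A-D, B-C, B-D, C-D:
arr_out = pairwise_col_diffs(df.values)
# arr_out.shape -> (3, 6)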
Runtime test
Since the OP has mentioned that a multi-dimensional array output would work for them as well, here are the array-based approaches from the other authors -
# @Allen's soln
def Allen(arr):
    n = arr.shape[1]
    idx = np.asarray(list(itertools.combinations(range(n), 2))).T
    return arr[:, idx[0]] - arr[:, idx[1]]

# @DYZ's soln
def DYZ(arr):
    return np.concatenate([(arr.T - arr.T[x])[x+1:]
                           for x in range(arr.shape[1])]).T
The pandas-based solution from @Gerges Dib's post wasn't included, as it came out very slow compared to the others.
Timings -
We will use three dataset sizes - 100, 500 and 1000 columns:
In [118]: df = pd.DataFrame(np.random.randint(0,9,(3,100)))
...: a = df.values
...:
In [119]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
1000 loops, best of 3: 258 µs per loop
1000 loops, best of 3: 1.48 ms per loop
1000 loops, best of 3: 284 µs per loop
In [121]: df = pd.DataFrame(np.random.randint(0,9,(3,500)))
...: a = df.values
...:
In [122]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 2.56 ms per loop
10 loops, best of 3: 39.9 ms per loop
1000 loops, best of 3: 1.82 ms per loop
In [123]: df = pd.DataFrame(np.random.randint(0,9,(3,1000)))
...: a = df.values
...:
In [124]: %timeit DYZ(a)
...: %timeit Allen(a)
...: %timeit pairwise_col_diffs(a)
...:
100 loops, best of 3: 8.61 ms per loop
10 loops, best of 3: 167 ms per loop
100 loops, best of 3: 5.09 ms per loop
I think you can do it with NumPy. Let arr=df.values. First, let's find all two-column combinations:
from itertools import combinations
column_combos = combinations(range(arr.shape[1]), 2)
Now, subtract columns pairwise and convert a list of arrays back to a 2D array:
result = np.array([(arr[:,x[1]] - arr[:,x[0]]) for x in column_combos]).T
#array([[1., 2., 3., 1., 2., 1.],
# [1., 2., 3., 1., 2., 1.],
# [1., 2., 3., 1., 2., 1.]])
Another solution is somewhat (~15%) faster because it subtracts whole 2D arrays rather than columns, and has fewer Python-side iterations:
result = np.concatenate([(arr.T - arr.T[x])[x+1:] for x in range(arr.shape[1])]).T
#array([[ 1., 2., 3., 1., 2., 1.],
# [ 1., 2., 3., 1., 2., 1.],
# [ 1., 2., 3., 1., 2., 1.]])
You can convert the result back to a DataFrame if you want:
columns = list(map(lambda x: x[1]+x[0], combinations(df.columns, 2)))
#['BA', 'CA', 'DA', 'CB', 'DB', 'DC']
pd.DataFrame(result, columns=columns)
# BA CA DA CB DB DC
#0 1.0 2.0 3.0 1.0 2.0 1.0
#1 1.0 2.0 3.0 1.0 2.0 1.0
#2 1.0 2.0 3.0 1.0 2.0 1.0
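Note that this computes B-A, C-A, and so on; if you want the exact A_B-style differences from the question, one possible adjustment (my sketch, not part of the original answer) is to swap the operands and keep the names in the original order:
result = np.array([arr[:, x[0]] - arr[:, x[1]] for x in combinations(range(arr.shape[1]), 2)]).T
columns = [a + '_' + b for a, b in combinations(df.columns, 2)]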
import itertools

df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))
df_cols = df.columns.tolist()

# build an index array of all the pairs that need the subtraction
idx = np.asarray(list(itertools.combinations(range(len(df_cols)), 2))).T

# build a new DF using the pairwise differences and column names
df_new = pd.DataFrame(data=df.values[:, idx[0]] - df.values[:, idx[1]],
                      columns=[''.join(e) for e in itertools.combinations(df_cols, 2)])
df_new
Out[43]:
AB AC AD BC BD CD
0 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
1 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
2 -1.0 -2.0 -3.0 -1.0 -2.0 -1.0
I am not sure how fast this is compared to other possible methods, but here it is:
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))

# get the columns as a list
cols = list(df.columns)
# define the output dataframe
out = pd.DataFrame()

# loop over the possible periods
for period in range(1, df.shape[1]):
    names = [l1 + l2 for l1, l2 in zip(cols, cols[period:])]
    out[names] = df.diff(periods=period, axis=1).dropna(axis=1, how='all')

print(out)
# column name shows which two columns are subtracted
AB BC CD AC BD AD
0 1.0 1.0 1.0 2.0 2.0 3.0
1 1.0 1.0 1.0 2.0 2.0 3.0
2 1.0 1.0 1.0 2.0 2.0 3.0

Renaming values in a column from lists within a dataframe

I have a data frame which looks like this,
df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
df
and I have two lists,
lis1 = ['A']
lis2 = ['S', 'O']
I need to replace the values in col2 based on lis1 and lis2, so I used np.where like this:
df['col2'] = np.where(df.col2.isin(lis1),'PC',df.col2.isin(lis2),'Ln','others')
But it's throwing me the following error:
TypeError: function takes at most 3 arguments (5 given)
Any suggestion is much appreciated!
In the end, I am aiming to have the values replaced in col2 of my data frame as follows:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Use double numpy.where:
lis1=['A']
lis2=['S','O']
df['col2'] = np.where(df.col2.isin(lis1), 'PC',
                      np.where(df.col2.isin(lis2), 'Ln', 'others'))
print (df)
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Timings:
#[60000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
In [257]: %timeit np.where(df.col2.isin(lis1),'PC',np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 8.15 ms per loop
In [258]: %timeit in1d_based(df, lis1, lis2)
100 loops, best of 3: 4.98 ms per loop
Here's one approach -
a = df.col2.values
df.col2 = np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
Sample step-by-step run -
# Input dataframe
In [206]: df
Out[206]:
col1 col2
0 1 A
1 2 A
2 3 S
3 4 O
4 5 S
5 6 P
# Extract out col2 values
In [207]: a = df.col2.values
# Form an indexing array based on where we have matches in lis1 or lis2 or neither
In [208]: idx = np.in1d(a,lis1) + 2*np.in1d(a,lis2)
In [209]: idx
Out[209]: array([1, 1, 2, 2, 2, 0])
# Index into a list of new strings with those indices
In [210]: newvals = np.take(['others','PC','Ln'], idx)
In [211]: newvals
Out[211]:
array(['PC', 'PC', 'Ln', 'Ln', 'Ln', 'others'],
dtype='|S6')
# Finally assign those into col2
In [212]: df.col2 = newvals
In [213]: df
Out[213]:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Runtime test -
In [251]: df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
In [252]: df = pd.concat([df]*10000).reset_index(drop=True)
In [253]: lis1
Out[253]: ['A']
In [254]: lis2
Out[254]: ['S', 'O']
In [255]: def in1d_based(df, lis1, lis2):
...: a = df.col2.values
...: return np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
...:
# #jezrael's soln
In [256]: %timeit np.where(df.col2.isin(lis1),'PC', np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 3.78 ms per loop
In [257]: %timeit in1d_based(df, lis1, lis2)
1000 loops, best of 3: 1.89 ms per loop
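For reference, np.select expresses the same two-condition mapping without nesting; this is only a sketch (not among the original answers), using the same lis1/lis2 as above:
df['col2'] = np.select([df.col2.isin(lis1), df.col2.isin(lis2)],
                       ['PC', 'Ln'],
                       default='others')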
