pandas dataframe index match - python

I'm wondering if there is a more efficient way to do an "INDEX & MATCH"-style lookup, like the one popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:
import pandas as pd

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})
I'm working with data sets that have ~80,000 rows in both df_1 and df_2, and my goal is to create two new columns in df_1: "name_a" and "name_b".
Below is the most efficient method I could come up with. There has to be a better way!
name_a = []
name_b = []
for i in range(len(df_1)):
    # for each row of df_1, find the row in df_2 whose num matches and take its name
    name_a.append(df_2.name.iloc[df_2[
        df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
        df_2.num == df_1.num_b.iloc[i]].index[0]])
df_1['name_a'] = name_a
df_1['name_b'] = name_b
Resulting in:
>>> df_1.head()
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

High Level
Create a dictionary to use in a replace; then replace, rename the columns, and join.
m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c
Breakdown
replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2; as a matter of fact, we could have used a pd.Series. I chose to build with dict and zip because I find that it's faster.
Building m
Option 1
m = df_2.set_index('num').name
Option 2
m = df_2.set_index('num').name.to_dict()
Option 3
m = dict(zip(df_2.num, df_2.name))
Option 4 (My Choice)
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
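For reference, with the sample df_2 above, Options 2-4 all build the same dict (Option 1 yields a pd.Series that maps identically):
m = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}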
m build times (in the same order as the options above)
%timeit df_2.set_index('num').name
1000 loops, best of 3: 325 µs per loop
%timeit df_2.set_index('num').name.to_dict()
1000 loops, best of 3: 376 µs per loop
%timeit dict(zip(df_2.num, df_2.name))
10000 loops, best of 3: 32.9 µs per loop
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
100000 loops, best of 3: 10.4 µs per loop
Replacing num
Again, we have choices; here are a few and their times.
%timeit df_1.replace(m)
1000 loops, best of 3: 792 µs per loop
%timeit df_1.applymap(lambda x: m.get(x, x))
1000 loops, best of 3: 959 µs per loop
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()
1000 loops, best of 3: 925 µs per loop
I choose...
df_1.replace(m)
  num_a num_b
0     a     b
1     b     d
2     c     a
3     d     b
4     e     c
Rename columns
df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))
  name_a name_b    <-- note the column name change
0      a      b
1      b      d
2      c      a
3      d      b
4      e      c
Join
df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c

I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic VLOOKUP. You can simulate it in pandas by using Series.map.
name_map = dict(df_2.set_index('num').name)
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)
df_1
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c
All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
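As a side note, Series.map also accepts a Series directly, so the dict conversion is optional. A minimal sketch using the same frames:
lookup = df_2.set_index('num')['name']      # index: num, values: name
df_1['name_a'] = df_1['num_a'].map(lookup)  # each num_a is looked up in lookup's index
df_1['name_b'] = df_1['num_b'].map(lookup)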

Just try direct positional indexing:
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})

df_1["name_a"] = df_2["name"]  # works because num_a equals the row position + 1
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"] - 1])  # pick name_a at positions num_b - 1
print(df_1)
   num_a  num_b name_a name_b
0      1      2      a      b
1      2      4      b      d
2      3      1      c      a
3      4      2      d      b
4      5      3      e      c
Note that this relies on df_2 being sorted so that num equals the row position plus one; for arbitrary keys, one of the map-based approaches above is safer.

Related

python Pandas replace the word in the string

Given a dataframe like:
A B C
1 a yes
2 b yes
3 a no
I would like to change the dataframe to:
A B C
1 a yes
2 b no
3 a no
which means that if column B has the value 'b', I want to change column C to 'no'. This can be represented by df[df['B']=='b']['C'].str.replace('yes','no'), but using this will not change the dataframe df itself. Even when I tried df[df['B']=='b']['C'] = df[df['B']=='b']['C'].str.replace('yes','no'), it didn't work. I am wondering how to solve this problem.
Solutions that set a value by mask:
df.loc[df.B == 'b', 'C'] = 'no'
print (df)
A B C
0 1 a yes
1 2 b no
2 3 a no
df['C'] = df['C'].mask(df.B == 'b','no')
print (df)
A B C
0 1 a yes
1 2 b no
2 3 a no
Solutions that replace only the 'yes' string:
df.loc[df.B == 'b', 'C'] = df['C'].replace('yes', 'no')
print (df)
A B C
0 1 a yes
1 2 b no
2 3 a no
df['C'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))
print (df)
A B C
0 1 a yes
1 2 b no
2 3 a no
The difference is easier to see with a modified df:
print (df)
A B C
0 1 a yes
1 2 b yes
2 3 b another
3 4 a no
df['C_set'] = df['C'].mask(df.B == 'b','no')
df['C_replace'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))
print (df)
A B C C_set C_replace
0 1 a yes yes yes
1 2 b yes no no
2 3 b another no another
3 4 a no no no
EDIT:
In your solution it is only necessary to add loc - chained indexing like df[df['B']=='b']['C'] returns a copy, so assigning to it never modifies df:
df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b', 'C'].str.replace('yes','no')
print (df)
A B C
0 1 a yes
1 2 b no
2 3 b another
3 4 a no
EDIT1:
I was really curious which method is fastest:
#[40000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)
In [37]: %timeit df.loc[df['B']=='b', 'C'] = df['C'].str.replace('yes','no')
10 loops, best of 3: 79.5 ms per loop
In [38]: %timeit df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b','C'].str.replace('yes','no')
10 loops, best of 3: 48.4 ms per loop
In [39]: %timeit df.loc[df['B']=='b', 'C'] = df.loc[df['B']=='b', 'C'].replace('yes','no')
100 loops, best of 3: 14.1 ms per loop
In [40]: %timeit df['C'] = df['C'].mask(df.B == 'b', df['C'].replace('yes', 'no'))
100 loops, best of 3: 10.1 ms per loop
# piRSquared solution with replace
In [53]: %timeit df.C = np.where(df.B.values == 'b', df.C.replace('yes', 'no'), df.C.values)
100 loops, best of 3: 4.74 ms per loop
EDIT2:
It is better to tighten the condition - add df.C == 'yes' (or df.C.values == 'yes' if you need the fastest solution):
df.loc[(df.B == 'b') & (df.C == 'yes'), 'C'] = 'no'
df.C = np.where((df.B.values == 'b') & (df.C.values == 'yes'), 'no', df.C.values)
np.where
df.C = np.where(df.B == 'b', 'no', df.C)
Or
df.C = np.where(df.B.values == 'b', 'no', df.C.values)
pd.Series.mask
df.C = df.C.mask(df.B == 'b', 'no')
All of these change df in place and yield:
A B C
0 1 a yes
1 2 b no
2 3 a no

How to convert row string values to features

I have a dataframe which looks like the one below. It has an ID column and the product history of each customer.
ID 1 2 3 4
1 A B C D
2 E C B D
3 F B C D
Instead of listing the products for each customer, I would like to convert the products to features (columns), so that the data frame looks like this:
ID  A  B  C  D  E  F
1   1  1  1  1  0  0
2   0  1  1  1  1  0
3   0  1  1  1  0  1
I tried using the get_dummies function; however, this renders columns such as 1-A, 1-E, 1-F, 2-B, 2-C, etc., which is not what I need.
Any advice on getting this done?
This would yield the dataframe you want:
df = pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int)
print (df)
Output:
A B C D E F
ID
1 1 1 1 1 0 0
2 0 1 1 1 1 0
3 0 1 1 1 0 1
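Broken into steps (the intermediate names s, d and out are just for illustration):
s = df.set_index('ID').T.unstack()           # long Series: MultiIndex (ID, position) -> product
d = pd.get_dummies(s)                        # one 0/1 indicator column per distinct product
out = d.groupby(level=0).sum().astype(int)   # collapse the positions: one row per ID
Note that .sum() counts duplicates, so a product listed twice for one ID would show 2; the .max() variants in the next answer cap the value at 1.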
You can use get_dummies and aggregate with max:
print (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='')
         .groupby(axis=1, level=0).max())
Or:
print (pd.get_dummies(df.set_index('ID').stack())
         .groupby(level=0).max().astype(int))
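A crosstab over the stacked frame is another route (a sketch; the clip call caps repeated products at 1):
s = df.set_index('ID').stack()
print (pd.crosstab(s.index.get_level_values('ID'), s).clip(upper=1))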
You can use a custom function, but it is slow:
df = (df.set_index('ID')
        .apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1)
        .fillna(0)
        .astype(int))
print (df)
A B C D E F
ID
1 1 1 1 1 0 0
2 0 1 1 1 1 0
3 0 1 1 1 0 1
I was curious about the timings:
np.random.seed(123)
N = 10000
L = list('ABCDEFGHIJKLMNOPQRST')
#[10000 rows x 6 columns]
df = pd.DataFrame(np.random.choice(L, size=(N,5)))
df = df.rename_axis('ID').reset_index()
print (df.head())
#Alex Fung solution
In [160]: %timeit (pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int))
10 loops, best of 3: 27.9 ms per loop
In [161]: %timeit (pd.get_dummies(df.set_index('ID').stack()).groupby(level=0).max().astype(int))
10 loops, best of 3: 26.3 ms per loop
In [162]: %timeit (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='').groupby(axis=1, level=0).max())
10 loops, best of 3: 26.4 ms per loop
In [163]: %timeit (df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int))
1 loop, best of 3: 3.95 s per loop

adding column in pandas dataframe containing the same value

I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
 'newcol2': 2,
 'newcol3': 1}
For each key in the dictionary I would like to create a new column in the dataframe A with the value from the dictionary (the same value for all rows of that column), so that at the end A is of size (1500, 8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
     'newcol2': 2,
     'newcol3': 1}

df = pd.DataFrame({'A': [1, 2],
                   'B': [4, 5],
                   'C': [7, 8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
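For comparison, a plain loop over the dict does the same thing in place and is easy to read:
for col, val in D.items():
    df[col] = val  # the scalar is broadcast to every row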
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
    'newcol1': 'a',
    'newcol2': 2,
    'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
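One caveat: assign returns a new DataFrame rather than modifying A in place, so rebind the result if you want to keep the new columns:
A = A.assign(**d)  # A is now of shape (10, 8)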

Join unique values into new data frame (python, pandas)

I have two DataFrames, from which I extract the unique values of a column into a and b:
a = df1.col1.unique()
b = df2.col2.unique()
Now a and b are something like this:
['a','b','c','d'] #a
[1,2,3] #b
and are of type numpy.ndarray.
I want to join them to have a DataFrame like this:
  col1 col2
0    a    1
1    a    2
2    a    3
3    b    1
4    b    2
5    b    3
6    c    1
. . .
Is there a way to do it without using a loop?
With numpy tools:
pd.DataFrame({'col1': np.repeat(a, b.size), 'col2': np.tile(b, a.size)})
Here np.repeat repeats each element of a b.size times, while np.tile repeats the whole of b a.size times, so the two columns line up as the full Cartesian product.
UPDATE:
B. M.'s solution utilizing numpy is much faster - I would recommend using his approach:
In [88]: %timeit pd.DataFrame({'col1':np.repeat(aa,bb.size),'col2':np.tile(bb,aa.size)})
10 loops, best of 3: 25.4 ms per loop
In [89]: %timeit pd.DataFrame(list(product(aa,bb)), columns=['col1', 'col2'])
1 loop, best of 3: 1.28 s per loop
In [90]: aa.size
Out[90]: 1000
In [91]: bb.size
Out[91]: 1000
try itertools.product:
In [56]: a
Out[56]:
array(['a', 'b', 'c', 'd'],
dtype='<U1')
In [57]: b
Out[57]: array([1, 2, 3])
In [63]: pd.DataFrame(list(product(a,b)), columns=['col1', 'col2'])
Out[63]:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
9 d 1
10 d 2
11 d 3
You can't do this task without using at least one for loop. The best you can do is hide the for loop or make use of implicit yield calls to make a memory-efficient generator.
itertools exports efficient functions for this task that use yield implicitly to return generators:
from itertools import product
import pandas

products = product(['a', 'b', 'c', 'd'], [1, 2, 3])
col1_items, col2_items = zip(*products)
result = pandas.DataFrame({'col1': col1_items, 'col2': col2_items})
itertools.product creates a Cartesian product of two iterables. The zip(*products) call simply unpacks the resulting sequence of tuples into two separate tuples. Note that products is a generator, so it is consumed by this single zip pass.
You can do this with pandas merge and it will be faster than itertools or a loop:
df_a = pd.DataFrame({'a': a, 'key': 1})
df_b = pd.DataFrame({'b': b, 'key': 1})
result = pd.merge(df_a, df_b, how='outer')
result:
a key b
0 a 1 1
1 a 1 2
2 a 1 3
3 b 1 1
4 b 1 2
5 b 1 3
6 c 1 1
7 c 1 2
8 c 1 3
9 d 1 1
10 d 1 2
11 d 1 3
then if need be you can always do
del result['key']
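On pandas 1.2 or newer (an assumption about your environment), merge supports a cross join directly, so the dummy key is unnecessary:
df_a = pd.DataFrame({'a': a})
df_b = pd.DataFrame({'b': b})
result = pd.merge(df_a, df_b, how='cross')  # full Cartesian product, no key column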

python pandas wildcard? Replace all values in df with a constant

I have a df and want to make a new_df of the same size but with all 1s. Something in the spirit of: new_df = df.replace("*", "1"). I think this is faster than creating a new df from scratch, because I would need to get the dimensions, fill it with 1s, and copy all the headers over. Unless I'm wrong about that.
df_new = pd.DataFrame(np.ones(df.shape), columns=df.columns)
import numpy as np
import pandas as pd

d = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
cols = ["A", "B", "C", "D", "E"]
df = pd.DataFrame(d, columns=cols)  # the frame the timings below run against
%timeit df1 = pd.DataFrame(np.ones(df.shape), columns=df.columns)
10000 loops, best of 3: 94.6 µs per loop
%timeit df2 = df.copy(); df2.loc[:, :] = 1
1000 loops, best of 3: 245 µs per loop
%timeit df3 = df * 0 + 1
1000 loops, best of 3: 200 µs per loop
It's actually pretty easy.
import pandas as pd

d = [
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 3, 3, 3],
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
]
cols = ["A", "B", "C", "D", "E"]
df = pd.DataFrame(d, columns=cols)

print(df)
print("------------------------")
df.loc[:, :] = 1  # assign 1 to every row across every column
print(df)
Result:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
------------------------
A B C D E
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
Obviously, df.loc[:,:] means you target all rows across all columns. Just use df2 = df.copy() or something if you want a new dataframe.
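One more option worth knowing: passing a scalar to the DataFrame constructor broadcasts it to every cell, which keeps the original index and gives integer 1s rather than the floats that np.ones produces (a minimal sketch):
df_new = pd.DataFrame(1, index=df.index, columns=df.columns)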
