I have a df and want to make a new_df of the same size but with all 1s. Something in the spirit of: new_df = df.replace("*", "1"). I suspect this would be faster than creating a new df from scratch, because otherwise I would need to get the dimensions, fill it with 1s, and copy all the headers over. Unless I'm wrong about that.
df_new = pd.DataFrame(np.ones(df.shape), columns=df.columns)
import numpy as np
import pandas as pd
d = [
[1,1,1,1,1],
[2,2,2,2,2],
[3,3,3,3,3],
[4,4,4,4,4],
[5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
%timeit df1 = pd.DataFrame(np.ones(df.shape), columns=df.columns)
10000 loops, best of 3: 94.6 µs per loop
%timeit df2 = df.copy(); df2.loc[:, :] = 1
1000 loops, best of 3: 245 µs per loop
%timeit df3 = df * 0 + 1
1000 loops, best of 3: 200 µs per loop
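One caveat with the np.ones constructor: it gives float 1.0s and a fresh RangeIndex. If the original index or an integer dtype matters, a minimal sketch with the same df as above (both extra arguments are optional):
# dtype=int forces integer 1s; index=df.index keeps the original row labels
df1 = pd.DataFrame(np.ones(df.shape, dtype=int), columns=df.columns, index=df.index)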
It's actually pretty easy.
import pandas as pd
d = [
[1,1,1,1,1],
[2,2,2,2,2],
[3,3,3,3,3],
[4,4,4,4,4],
[5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
df = pd.DataFrame(d, columns=cols)
print(df)
print("------------------------")
df.loc[:,:] = 1
print(df)
Result:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
------------------------
A B C D E
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
Obviously, df.loc[:,:] means you target all rows across all columns. Just use df2 = df.copy() first if you want a new dataframe rather than mutating the original.
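For completeness, a minimal sketch of that copy-first variant, using the same df as above:
df2 = df.copy()      # independent copy; the original df is untouched
df2.loc[:, :] = 1    # overwrite every cell with 1, keeping shape, index and columns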
I am trying to access the index of a row in a function applied across an entire DataFrame in Pandas. I have something like this:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df
a b c
0 1 2 3
1 4 5 6
and I'll define a function that accesses elements of a given row:
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
I can apply it like so:
df['d'] = df.apply(rowFunc, axis=1)
>>> df
a b c d
0 1 2 3 7
1 4 5 6 34
Awesome! Now what if I want to incorporate the index into my function?
Inside the function, row.index holds the column labels, Index([u'a', u'b', u'c', u'd'], dtype='object') once d has been added, but I want the 0 and 1. So I can't just access row.index.
I know I could create a temporary column in the table where I store the index, but I'm wondering if it is stored in the row object somewhere.
To access the index in this case you access the name attribute:
In [182]:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']
def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
Note that if this is really all you are trying to do, the following works and is much faster:
In [198]:
df['d'] = df['a'] + df['b'] * df['c']
df
Out[198]:
a b c d
0 1 2 3 7
1 4 5 6 34
In [199]:
%timeit df['a'] + df['b'] * df['c']
%timeit df.apply(rowIndex, axis=1)
10000 loops, best of 3: 163 µs per loop
1000 loops, best of 3: 286 µs per loop
EDIT
Looking at this question 3+ years later, you could just do:
In[15]:
df['d'],df['rowIndex'] = df['a'] + df['b'] * df['c'], df.index
df
Out[15]:
a b c d rowIndex
0 1 2 3 7 0
1 4 5 6 34 1
but assuming it isn't as trivial as this, then whatever your rowFunc is really doing, you should look to use the vectorised functions and apply them against the df index:
In[16]:
df['newCol'] = df['a'] + df['b'] + df['c'] + df.index
df
Out[16]:
a b c d rowIndex newCol
0 1 2 3 7 0 6
1 4 5 6 34 1 16
Either:
1. with row.name inside the apply(..., axis=1) call:
df = pandas.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'], index=['x','y'])
a b c
x 1 2 3
y 4 5 6
df.apply(lambda row: row.name, axis=1)
x x
y y
2. with iterrows() (slower)
DataFrame.iterrows() allows you to iterate over rows, and access their index:
for idx, row in df.iterrows():
    ...
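A short runnable sketch of the iterrows() route, reusing the df above (the column name row_label is just for illustration):
for idx, row in df.iterrows():
    # idx is the row label; row is that row's values as a Series
    df.loc[idx, 'row_label'] = idx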
To answer the original question: yes, you can access the index value of a row in apply(). It is available as the row's name attribute, and it requires that you specify axis=1 (because then the lambda receives each row as a Series rather than each column).
Working example (pandas 0.23.4):
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
>>> df.set_index('a', inplace=True)
>>> df
b c
a
1 2 3
4 5 6
>>> df['index_x10'] = df.apply(lambda row: 10*row.name, axis=1)
>>> df
b c index_x10
a
1 2 3 10
4 5 6 40
I'm wondering if there is a more efficient way to do an "index & match" type lookup, as is popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:
import pandas as pd
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".
Below is the most efficient method that I could come up with. There has to be a better way!
name_a = []
name_b = []
for i in range(len(df_1)):
    name_a.append(df_2.name.iloc[df_2[
        df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[
        df_2.num == df_1.num_b.iloc[i]].index[0]])
df_1['name_a'] = name_a
df_1['name_b'] = name_b
Resulting in:
>>> df_1.head()
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
High Level
1. Create a dictionary to use in a replace
2. replace, rename the columns, and join
m = dict(zip(
df_2.num.values.tolist(),
df_2.name.values.tolist()
))
df_1.join(
df_1.replace(m).rename(
columns=lambda x: x.replace('num', 'name')
)
)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
Breakdown
replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2. As a matter of fact, we could have used a pd.Series. I chose to build with dict and zip because I find that it's faster (zipping plain Python lists avoids creating a NumPy scalar for every element).
Building m
Option 1
m = df_2.set_index('num').name
Option 2
m = df_2.set_index('num').name.to_dict()
Option 3
m = dict(zip(df_2.num, df_2.name))
Option 4 (My Choice)
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
m build times
%timeit df_2.set_index('num').name
1000 loops, best of 3: 325 µs per loop
%timeit df_2.set_index('num').name.to_dict()
1000 loops, best of 3: 376 µs per loop
%timeit dict(zip(df_2.num, df_2.name))
10000 loops, best of 3: 32.9 µs per loop
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
100000 loops, best of 3: 10.4 µs per loop
Replacing num
Again, we have choices; here are a few and their times.
%timeit df_1.replace(m)
%timeit df_1.applymap(lambda x: m.get(x, x))
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 959 µs per loop
1000 loops, best of 3: 925 µs per loop
I choose...
df_1.replace(m)
num_a num_b
0 a b
1 b d
2 c a
3 d b
4 5 c
Rename columns
df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))
name_a name_b <-- note the column name change
0 a b
1 b d
2 c a
3 d b
4 5 c
Join
df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic vlookup. You can simulate this in pandas by using Series.map.
name_map = dict(df_2.set_index('num').name)
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)
df_1
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
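As a side note, map also accepts a Series directly, so the intermediate dict is optional; a sketch with the same df_1 and df_2 (lookup is just an illustrative name):
lookup = df_2.set_index('num')['name']   # Series mapping num -> name
df_1['name_a'] = df_1['num_a'].map(lookup)
df_1['name_b'] = df_1['num_b'].map(lookup)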
Just try direct positional indexing (this works here only because num_a is exactly 1 to 5 in order, so df_2's rows line up with df_1 by position):
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
df_1["name_a"] = df_2["num_b"]
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1])
print(df_1)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
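A quick sanity check for that positional assumption (a hypothetical guard, not part of the original answer):
# The positional trick silently breaks if df_2 is reordered or filtered,
# so assert the alignment before relying on it.
assert (df_2['num'] == list(range(1, len(df_2) + 1))).all()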
I have a dataframe which looks like the one below. It has an ID column and the product history of each customer.
ID 1 2 3 4
1 A B C D
2 E C B D
3 F B C D
Instead of listing products for each customer, I would like to convert the products to features (columns), so that the data frame will look like this:
ID A B C D E F
1 1 1 1 1 0 0
2 0 1 1 1 1 0
3 0 1 1 1 0 1
I tried using the get_dummies function; however, it renders prefixed columns such as 1-A, 1-E, 1-F, 2-B, 2-C, etc., which is not what I need.
Any advice on getting this done?
This would yield the dataframe you want.
df = pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int)
print (df)
Output:
A B C D E F
ID
1 1 1 1 1 0 0
2 0 1 1 1 1 0
3 0 1 1 1 0 1
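One caveat: sum counts duplicates, so if the same product could appear twice in one row you would get values above 1. A minimal sketch of the difference (dup, p1, p2 and s are hypothetical names), which the max-based answers below avoid:
dup = pd.DataFrame({'ID': [1], 'p1': ['B'], 'p2': ['B']})
s = dup.set_index('ID').stack()                   # long form: one row per (ID, slot)
print (pd.get_dummies(s).groupby(level=0).sum())  # B == 2: duplicates are counted
print (pd.get_dummies(s).groupby(level=0).max())  # B == 1: presence only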
You can use get_dummies and aggregate with max:
print (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='')
.groupby(axis=1, level=0).max())
Or:
print (pd.get_dummies(df.set_index('ID').stack())
.groupby(level=0).max().astype(int))
You can use a custom function, but it is slow:
df = (df.set_index('ID')
        .apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1)
        .fillna(0)
        .astype(int))
print (df)
A B C D E F
ID
1 1 1 1 1 0 0
2 0 1 1 1 1 0
3 0 1 1 1 0 1
I was interested in the timings:
np.random.seed(123)
N = 10000
L = list('ABCDEFGHIJKLMNOPQRST')
#each solution below yields [10000 rows x 20 columns] (one column per letter in L)
df = pd.DataFrame(np.random.choice(L, size=(N,5)))
df = df.rename_axis('ID').reset_index()
print (df.head())
#Alex Fung solution
In [160]: %timeit (pd.get_dummies(df.set_index('ID').T.unstack()).groupby(level=0).sum().astype(int))
10 loops, best of 3: 27.9 ms per loop
In [161]: %timeit (pd.get_dummies(df.set_index('ID').stack()).groupby(level=0).max().astype(int))
10 loops, best of 3: 26.3 ms per loop
In [162]: %timeit (pd.get_dummies(df.set_index('ID'), prefix_sep='', prefix='').groupby(axis=1, level=0).max())
10 loops, best of 3: 26.4 ms per loop
In [163]: %timeit (df.set_index('ID').apply(lambda x: pd.Series(dict(zip(x, [1]*len(df.columns)))), axis=1).fillna(0).astype(int))
1 loop, best of 3: 3.95 s per loop
I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary, I would like to create a new column in the dataframe A filled with that key's value (the same value for all rows of the column).
At the end, A should be of size (1500, 8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
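Why the index=df.index argument matters: with a dict of scalars the DataFrame constructor has no way to infer the number of rows, so it needs an index to broadcast against (extra is just an illustrative name):
# pd.DataFrame(D) alone raises:
#   ValueError: If using all scalar values, you must pass an index
extra = pd.DataFrame(D, index=df.index)   # repeats 'a', 2, 1 down every row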
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
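One usage note: assign does not modify A in place; it returns a new DataFrame, so rebind the name if you want to keep the new columns:
A = A.assign(**d)   # assign never mutates its caller; capture the result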
Sometimes I manipulate some columns of a dataframe and want to assign the result back.
For example, one dataframe df has 6 columns like this:
A, B1, B2, B3, C, D
And I want to transform the values in the columns (B1, B2, B3) into (B1*A, B2*A, B3*A).
Compared with a slow loop over the columns, df.filter(like = 'B') accelerates things a lot.
df.filter(like = "B").mul(df.A, axis = 0) produces the right answer. But I can't change the B-like columns in df using:
df.filter(like = "B") = df.filter(like = "B").mul(df.A, axis = 0)
How can I achieve this? I know using pd.concat to create a new dataframe would get it done, but when the number of columns is huge that method may be inefficient. What I want to do is assign new values to columns that already exist.
Any advice would be appreciated!
Use str.contains with boolean indexing:
cols = df.columns[df.columns.str.contains('B')]
df[cols] = df[cols].mul(df.A, axis = 0)
Sample:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'C':[5,3,6],
'D':[7,4,3]})
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
cols = df.columns[df.columns.str.contains('B')]
print (cols)
Index(['B1', 'B2', 'B3'], dtype='object')
df[cols] = df[cols].mul(df.A, axis = 0)
print (df)
A B1 B2 B3 C D
0 1 4 7 1 5 7
1 2 10 16 6 3 4
2 3 18 27 15 6 3
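A small caveat: str.contains('B') matches a 'B' anywhere in the name, so a hypothetical column like 'SUB1' would be caught too. Anchoring the pattern is stricter and mirrors the filter(regex=r'^B') variant further down:
cols = df.columns[df.columns.str.startswith('B')]   # names beginning with 'B' only
# or equivalently with an anchored regex:
cols = df.columns[df.columns.str.contains('^B')]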
Timings:
len(df)=3:
In [17]: %timeit (a(df))
1000 loops, best of 3: 1.36 ms per loop
In [18]: %timeit (b(df1))
100 loops, best of 3: 2.39 ms per loop
len(df)=30k:
In [14]: %timeit (a(df))
100 loops, best of 3: 2.89 ms per loop
In [15]: %timeit (b(df1))
100 loops, best of 3: 4.71 ms per loop
Code:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3],
'B1':[4,5,6],
'B2':[7,8,9],
'B3':[1,3,5],
'C':[5,3,6],
'D':[7,4,3]})
print (df)
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()
def a(df):
    cols = df.columns[df.columns.str.contains('B')]
    df[cols] = df[cols].mul(df.A, axis = 0)
    return (df)
def b(df):
    df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
    return (df)
print (a(df))
print (b(df1))
You have almost done it (the output below uses this answer's own sample frame, which has B4 and F columns):
In [136]: df.loc[:, df.filter(regex=r'^B').columns] = df.loc[:, df.filter(regex=r'^B').columns].mul(df.A, axis=0)
In [137]: df
Out[137]:
A B1 B2 B3 B4 F
0 1 4 7 1 5 7
1 2 10 16 6 6 4
2 3 18 27 15 18 3
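For completeness: the reason the original attempt fails is that df.filter(...) returns a new DataFrame, and a function call is not a valid assignment target in Python. Writing through df[cols] or df.loc[:, cols], as both answers do, targets the original frame instead. A minimal contrast:
# df.filter(like='B') = ...    # SyntaxError: cannot assign to function call
cols = df.columns[df.columns.str.contains('B')]
df[cols] = df[cols].mul(df.A, axis=0)   # OK: writes into df itself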