I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary I would like to create a new column in DataFrame A containing the corresponding value (the same value for every row of that column), so that at the end A is of size (1500, 8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with the DataFrame constructor:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame({'A':[1,2],
'B':[4,5],
'C':[7,8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
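Passing index=df.index is what makes this work: D contains only scalars, so the constructor needs an explicit index, and reusing df's index also keeps the rows aligned for concat. A quick check (a minimal sketch):
try:
    pd.DataFrame(D)   # all-scalar dict with no index
except ValueError as e:
    print(e)   # roughly: "If using all scalar values, you must pass an index"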
Timings:
D = {'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
'newcol1': 'a',
'newcol2': 2,
'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
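Note that assign returns a new DataFrame and leaves A unchanged, so rebind the result if you want to keep the new columns:
A = A.assign(**d)   # assign does not modify A in place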
Related
I am trying to create a new column 'ratioA' in a dataframe df whereby the values are related to column A:
For a given row, df['ratioA'] is the ratio between df['A'] in that row and in the next row.
I iterated over the index column as a reference, but I'm not sure why the values appear as NaN; technically only the last row should be NaN.
import numpy as np
import pandas as pd
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()
for i in df['index']:
df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]
print (df)
The output is:
index A B ratioA
0 0 1 2 NaN
1 1 3 4 NaN
2 2 5 6 NaN
3 3 7 8 NaN
The desired output should be:
index A B ratioA
0 0 1 2 0.33
1 1 3 4 0.60
2 2 5 6 0.71
3 3 7 8 NaN
You can use a vectorized solution: divide column A by its shifted values with div:
print (df['A'].shift(-1))
0 3.0
1 5.0
2 7.0
3 NaN
Name: A, dtype: float64
df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
In pandas, loops are very slow, so it's best to avoid them (Jeff, a pandas developer, explains it better). For comparison, here is the loop version with iterrows:
for i, row in df.iterrows():
if i != df.index[-1]:
df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
print (df)
index A B ratioA
0 0 1 2 0.333333
1 1 3 4 0.600000
2 2 5 6 0.714286
3 3 7 8 NaN
Timings:
series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})
df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
#[4000 rows x 3 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()
In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop
In [50]: %%timeit
...: for i, row in df.iterrows():
...: if i != df.index[-1]:
...: df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
...:
1 loop, best of 3: 2.15 s per loop
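As a quick sanity check (a sketch, assuming both ratio columns were filled as in the timed cells above), the loop and the vectorized version agree:
print(df['ratioA'].equals(df['ratioA1']))   # expected: True (NaN in the same spot counts as equal)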
I'm wondering if there is a more efficient way to do an "index & match" type function that is popular in Excel. For example - given two pandas DataFrames, update the df_1 with information found in df_2:
import pandas as pd
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".
Below is the most efficient method that I could come up with. There has to be a better way!
name_a = []
name_b = []
for i in range(len(df_1)):
name_a.append(df_2.name.iloc[df_2[
df_2.num == df_1.num_a.iloc[i]].index[0]])
name_b.append(df_2.name.iloc[df_2[
df_2.num == df_1.num_b.iloc[i]].index[0]])
df_1['name_a'] = name_a
df_1['name_b'] = name_b
Resulting in:
>>> df_1.head()
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
High Level
Create a dictionary to use in replace
Replace, rename the columns, and join
m = dict(zip(
df_2.num.values.tolist(),
df_2.name.values.tolist()
))
df_1.join(
df_1.replace(m).rename(
columns=lambda x: x.replace('num', 'name')
)
)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
Breakdown
replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2. As a matter of fact, we could have used a pd.Series. I chose to build with dict and zip because I find it's faster.
Building m
Option 1
m = df_2.set_index('num').name
Option 2
m = df_2.set_index('num').name.to_dict()
Option 3
m = dict(zip(df_2.num, df_2.name))
Option 4 (My Choice)
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
m build times
%timeit df_2.set_index('num').name
1000 loops, best of 3: 325 µs per loop
%timeit df_2.set_index('num').name.to_dict()
1000 loops, best of 3: 376 µs per loop
%timeit dict(zip(df_2.num, df_2.name))
10000 loops, best of 3: 32.9 µs per loop
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
100000 loops, best of 3: 10.4 µs per loop
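All four options encode the same num -> name mapping (Option 1 yields a pd.Series rather than a plain dict). A quick equivalence check, as a sketch:
m1 = dict(df_2.set_index('num').name)   # convert the Series from Option 1 to a dict
m4 = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
assert m1 == m4   # identical num -> name mappings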
Replacing num
Again, we have choices; here are a few and their times.
%timeit df_1.replace(m)
1000 loops, best of 3: 792 µs per loop
%timeit df_1.applymap(lambda x: m.get(x, x))
1000 loops, best of 3: 959 µs per loop
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()
1000 loops, best of 3: 925 µs per loop
I choose...
df_1.replace(m)
num_a num_b
0 a b
1 b d
2 c a
3 d b
4 5 c
Rename columns
df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))
name_a name_b <-- note the column name change
0 a b
1 b d
2 c a
3 d b
4 5 c
Join
df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 5 c
I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic vlookup. You can simulate this in pandas by using Series.map.
name_map = dict(df_2.set_index('num').name)
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)
df_1
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
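If there were many num_* columns, the same lookup extends naturally. A small sketch, starting from the original df_1 before the two columns above were added:
# Map every num_* column in one pass, then attach the results
name_cols = df_1[['num_a', 'num_b']].apply(lambda s: s.map(name_map))
name_cols.columns = ['name_a', 'name_b']
df_1 = df_1.join(name_cols)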
Just try direct indexing:
import pandas as pd
import numpy as np
df_1 = pd.DataFrame({'num_a':[1, 2, 3, 4, 5],
'num_b':[2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num':[1, 2, 3, 4, 5],
'name':['a', 'b', 'c', 'd', 'e']})
df_1["name_a"] = df_2["num_b"]
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"]-1])
print(df_1)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
What's the most efficient way to select the second to last of each duplicated set in a pandas dataframe?
For instance I basically want to do this operation:
df = df.drop_duplicates(['Person','Question'],take_last=True)
But this:
df = df.drop_duplicates(['Person','Question'],take_second_last=True)
Abstracted question: how do you choose which duplicate to keep if the one you want is neither the max nor the min?
With groupby.apply:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': np.arange(10), 'C': np.arange(10)})
df
Out:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 7
8 3 8 8
9 4 9 9
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
2 1 2 2
5 2 5 5
7 3 7 7
9 4 9 9
With a different DataFrame, subset two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
A B C
0 1 1 0
1 1 1 1
2 1 2 2
3 1 1 3
4 2 2 4
5 2 2 5
6 2 2 6
7 3 3 7
8 3 3 8
9 4 4 9
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
.reset_index(level=0, drop=True))
Out:
A B C
1 1 1 1
2 1 2 2
5 2 2 5
7 3 3 7
9 4 4 9
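If dropping single-row groups is acceptable, GroupBy.nth is a terser alternative (a sketch; unlike the lambda above it returns nothing for groups with fewer than two rows, and whether the group keys end up in the index varies across pandas versions):
# Second-to-last row per group; note the single-row groups (1, 2) and (4, 4) disappear
df.groupby(['A', 'B']).nth(-2)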
You could groupby/tail(2) to take the last 2 items, then groupby/head(1) to take the first item from the tail:
df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
If there is only one item in the group, tail(2) returns just the one item.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
The built-in groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
I have two DataFrames, from which I extract the unique values of a column into a and b:
a = df1.col1.unique()
b = df2.col2.unique()
Now a and b are something like this:
['a','b','c','d'] #a
[1,2,3] #b
They are now of type numpy.ndarray.
I want to join them to have a DataFrame like this:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
. . .
Is there a way to do it without using a loop?
With numpy tools:
pd.DataFrame({'col1':np.repeat(a,b.size),'col2':np.tile(b,a.size)})
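For intuition, repeat stretches each element of a while tile cycles the whole of b, so the two columns line up into the full cross product:
np.repeat(a, b.size)   # ['a' 'a' 'a' 'b' 'b' 'b' 'c' 'c' 'c' 'd' 'd' 'd']
np.tile(b, a.size)     # [1 2 3 1 2 3 1 2 3 1 2 3]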
UPDATE:
B. M.'s solution utilizing numpy is much faster; I would recommend using his approach:
In [88]: %timeit pd.DataFrame({'col1':np.repeat(aa,bb.size),'col2':np.tile(bb,aa.size)})
10 loops, best of 3: 25.4 ms per loop
In [89]: %timeit pd.DataFrame(list(product(aa,bb)), columns=['col1', 'col2'])
1 loop, best of 3: 1.28 s per loop
In [90]: aa.size
Out[90]: 1000
In [91]: bb.size
Out[91]: 1000
Try itertools.product:
In [56]: a
Out[56]:
array(['a', 'b', 'c', 'd'],
dtype='<U1')
In [57]: b
Out[57]: array([1, 2, 3])
In [63]: pd.DataFrame(list(product(a,b)), columns=['col1', 'col2'])
Out[63]:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
9 d 1
10 d 2
11 d 3
You can't do this task without using at least one for loop. The best you can do is hide the for loop or make use of implicit yield calls to make a memory-efficient generator.
itertools exports efficient functions for this task that use yield implicitly to return generators:
from itertools import product
products = product(['a','b','c','d'], [1,2,3])
col1_items, col2_items = zip(*products)
result = pandas.DataFrame({'col1':col1_items, 'col2': col2_items})
itertools.product creates a Cartesian product of two iterables. The zip(*products) call simply unpacks the resulting tuples into two separate tuples.
You can do this with pandas merge and it will be faster than itertools or a loop:
df_a = pd.DataFrame({'a': a, 'key': 1})
df_b = pd.DataFrame({'b': b, 'key': 1})
result = pd.merge(df_a, df_b, how='outer')
result:
a key b
0 a 1 1
1 a 1 2
2 a 1 3
3 b 1 1
4 b 1 2
5 b 1 3
6 c 1 1
7 c 1 2
8 c 1 3
9 d 1 1
10 d 1 2
11 d 1 3
Then, if need be, you can always drop the helper column:
del result['key']
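As a side note (a sketch, assuming a reasonably recent pandas, 1.2 or newer if I recall correctly), merge supports a built-in cross join, which removes the need for the dummy key:
result = pd.merge(pd.DataFrame({'a': a}), pd.DataFrame({'b': b}), how='cross')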
I have a df and want to make a new_df of the same size but with all 1s. Something in the spirit of new_df = df.replace("*", "1"). I think this is faster than creating a new df from scratch, because I would need to get the dimensions, fill it with 1s, and copy all the headers over. Unless I'm wrong about that.
df_new = pd.DataFrame(np.ones(df.shape), columns=df.columns)
import numpy as np
import pandas as pd
d = [
[1,1,1,1,1],
[2,2,2,2,2],
[3,3,3,3,3],
[4,4,4,4,4],
[5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
%timeit df1 = pd.DataFrame(np.ones(df.shape), columns=df.columns)
10000 loops, best of 3: 94.6 µs per loop
%timeit df2 = df.copy(); df2.loc[:, :] = 1
1000 loops, best of 3: 245 µs per loop
%timeit df3 = df * 0 + 1
1000 loops, best of 3: 200 µs per loop
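If you also want to keep the original index, and integer rather than float ones, the scalar form of the constructor is another option (a sketch):
df_new = pd.DataFrame(1, index=df.index, columns=df.columns)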
It's actually pretty easy.
import pandas as pd
d = [
[1,1,1,1,1],
[2,2,2,2,2],
[3,3,3,3,3],
[4,4,4,4,4],
[5,5,5,5,5],
]
cols = ["A","B","C","D","E"]
df = pd.DataFrame(d, columns=cols)
print(df)
print("------------------------")
df.loc[:,:] = 1
print(df)
Result:
A B C D E
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
------------------------
A B C D E
0 1 1 1 1 1
1 1 1 1 1 1
2 1 1 1 1 1
3 1 1 1 1 1
4 1 1 1 1 1
df.loc[:,:] targets all rows across all columns. Just use df2 = df.copy() first if you want a new DataFrame instead of modifying df in place, as in the short example below.
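df2 = df.copy()
df2.loc[:, :] = 1   # df2 is all 1s; the original df is untouched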