Join unique values into a new DataFrame (Python, pandas)

I have two DataFrames, from which I extract the unique values of a column into a and b:
a = df1.col1.unique()
b = df2.col2.unique()
Now a and b look like this:
['a','b','c','d'] #a
[1,2,3] #b
They are of type numpy.ndarray.
I want to join them into a DataFrame like this:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
. . .
Is there a way to do it without using a loop?

With NumPy tools:
import numpy as np
import pandas as pd

pd.DataFrame({'col1': np.repeat(a, b.size), 'col2': np.tile(b, a.size)})
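To see why this yields the Cartesian product, here is a minimal sketch of what repeat and tile produce on the sample arrays from the question:
import numpy as np

a = np.array(['a', 'b', 'c', 'd'])
b = np.array([1, 2, 3])

np.repeat(a, b.size)  # each element repeated 3 times: a a a b b b c c c d d d
np.tile(b, a.size)    # the whole array tiled 4 times:  1 2 3 1 2 3 1 2 3 1 2 3
Lined up element-wise, these two arrays are exactly col1 and col2 of the desired frame.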

UPDATE:
B. M.'s solution using NumPy is much faster; I would recommend his approach:
In [88]: %timeit pd.DataFrame({'col1':np.repeat(aa,bb.size),'col2':np.tile(bb,aa.size)})
10 loops, best of 3: 25.4 ms per loop
In [89]: %timeit pd.DataFrame(list(product(aa,bb)), columns=['col1', 'col2'])
1 loop, best of 3: 1.28 s per loop
In [90]: aa.size
Out[90]: 1000
In [91]: bb.size
Out[91]: 1000
Try itertools.product:
In [55]: from itertools import product
In [56]: a
Out[56]: array(['a', 'b', 'c', 'd'], dtype='<U1')
In [57]: b
Out[57]: array([1, 2, 3])
In [63]: pd.DataFrame(list(product(a,b)), columns=['col1', 'col2'])
Out[63]:
col1 col2
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 c 3
9 d 1
10 d 2
11 d 3

You can't produce a Cartesian product without a loop running somewhere; the best you can do is push the loop down into compiled code (as NumPy does) or hide it behind a memory-efficient generator.
itertools exports efficient functions for this task that use yield implicitly to return generators:
import pandas as pd
from itertools import product

products = product(['a', 'b', 'c', 'd'], [1, 2, 3])
col1_items, col2_items = zip(*products)
result = pd.DataFrame({'col1': col1_items, 'col2': col2_items})
itertools.product creates the Cartesian product of two iterables, and zip(*products) simply unpacks the resulting sequence of tuples into two separate tuples.
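For clarity, here is the unpacking step on a tiny standalone example:
pairs = [('a', 1), ('a', 2), ('b', 1)]
firsts, seconds = zip(*pairs)
# firsts  == ('a', 'a', 'b')
# seconds == (1, 2, 1)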

You can do this with pandas merge, and it will be faster than itertools or a loop:
df_a = pd.DataFrame({'a': a, 'key': 1})
df_b = pd.DataFrame({'b': b, 'key': 1})
result = pd.merge(df_a, df_b, how='outer')
result:
a key b
0 a 1 1
1 a 1 2
2 a 1 3
3 b 1 1
4 b 1 2
5 b 1 3
6 c 1 1
7 c 1 2
8 c 1 3
9 d 1 1
10 d 1 2
11 d 1 3
Then, if need be, you can always drop the helper column:
del result['key']
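Note: on pandas 1.2 or newer (an assumption about your version), merge supports a cross join directly, so the dummy key is unnecessary; a minimal sketch:
import pandas as pd

df_a = pd.DataFrame({'a': ['a', 'b', 'c', 'd']})
df_b = pd.DataFrame({'b': [1, 2, 3]})
result = pd.merge(df_a, df_b, how='cross')  # Cartesian product, no helper key needed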

Related

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
From reading other forum posts, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set param axis=1 to sum across the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
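Note that newer pandas versions are stricter about row-wise sums over mixed dtypes (depending on your version, this may raise instead of silently skipping the string column), so it is safer to select the numeric columns explicitly. A minimal sketch answering the question directly:
df['e'] = df[['a', 'b', 'd']].sum(axis=1)  # sum only the listed columns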
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list = list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' with the rows you need.
This is a simpler way using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
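One way that does work (a sketch using NumPy's np.r_ index helper, not from the original answer) is to build the integer positions first:
import numpy as np

# np.r_ concatenates slices and scalars into one index array: [0, 1, 3]
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)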
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) with several award columns (shown as a screenshot in the original post), and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, specifying the name of the new column and a list of the column names to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate, or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')
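By default eval returns a new DataFrame and leaves df unchanged; pass inplace=True to add the column to df itself:
df.eval('e = a + b + d', inplace=True)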

pandas dataframe index match

I'm wondering if there is a more efficient way to do an "index & match" type lookup that is popular in Excel. For example, given two pandas DataFrames, update df_1 with information found in df_2:
import pandas as pd
df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})
I'm working with data sets that have ~80,000 rows in both df_1 and df_2 and my goal is to create two new columns in df_1, "name_a" and "name_b".
Below is the most efficient method that I could come up with. There has to be a better way!
name_a = []
name_b = []
for i in range(len(df_1)):
    name_a.append(df_2.name.iloc[df_2[df_2.num == df_1.num_a.iloc[i]].index[0]])
    name_b.append(df_2.name.iloc[df_2[df_2.num == df_1.num_b.iloc[i]].index[0]])
df_1['name_a'] = name_a
df_1['name_b'] = name_b
Resulting in:
>>> df_1.head()
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
High level:
Create a dictionary to use in a replace; then replace, rename the columns, and join.
m = dict(zip(
    df_2.num.values.tolist(),
    df_2.name.values.tolist()
))

df_1.join(
    df_1.replace(m).rename(
        columns=lambda x: x.replace('num', 'name')
    )
)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
Breakdown
replace with a dictionary should be pretty quick. There are a bunch of ways to build a dictionary from df_2; as a matter of fact, we could have used a pd.Series. I chose to build with dict and zip because I find that it's faster.
Building m
Option 1
m = df_2.set_index('num').name
Option 2
m = df_2.set_index('num').name.to_dict()
Option 3
m = dict(zip(df_2.num, df_2.name))
Option 4 (My Choice)
m = dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
m build times:
%timeit df_2.set_index('num').name
1000 loops, best of 3: 325 µs per loop
%timeit df_2.set_index('num').name.to_dict()
1000 loops, best of 3: 376 µs per loop
%timeit dict(zip(df_2.num, df_2.name))
10000 loops, best of 3: 32.9 µs per loop
%timeit dict(zip(df_2.num.values.tolist(), df_2.name.values.tolist()))
100000 loops, best of 3: 10.4 µs per loop
Replacing num
Again, we have choices; here are a few and their times:
%timeit df_1.replace(m)
1000 loops, best of 3: 792 µs per loop
%timeit df_1.applymap(lambda x: m.get(x, x))
1000 loops, best of 3: 959 µs per loop
%timeit df_1.stack().map(lambda x: m.get(x, x)).unstack()
1000 loops, best of 3: 925 µs per loop
I choose...
df_1.replace(m)
num_a num_b
0 a b
1 b d
2 c a
3 d b
4 e c
Rename columns
df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name'))
name_a name_b <-- note the column name change
0 a b
1 b d
2 c a
3 d b
4 e c
Join
df_1.join(df_1.replace(m).rename(columns=lambda x: x.replace('num', 'name')))
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
I think there's a more straightforward solution than those already offered. Since you mentioned Excel, this is a basic VLOOKUP. You can simulate it in pandas using Series.map.
name_map = dict(df_2.set_index('num').name)
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)
df_1
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c
All we do is convert df_2 to a dict with 'num' as the keys. The map function looks up each value from a df_1 column in the dict and returns the corresponding letter. No complicated indexing required.
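As a side note, map also accepts a Series directly and aligns on its index, so the dict step can be skipped (a minor variation, not from the original answer):
name_map = df_2.set_index('num')['name']
df_1['name_a'] = df_1.num_a.map(name_map)
df_1['name_b'] = df_1.num_b.map(name_map)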
Just try positional indexing (this only works here because num in df_2 is exactly 1..5, matching the row positions):
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'num_a': [1, 2, 3, 4, 5],
                     'num_b': [2, 4, 1, 2, 3]})
df_2 = pd.DataFrame({'num': [1, 2, 3, 4, 5],
                     'name': ['a', 'b', 'c', 'd', 'e']})

df_1["name_a"] = df_2["name"]  # num_a is already 1..5 in order
df_1["name_b"] = np.array(df_1["name_a"][df_1["num_b"] - 1])  # look up by position
print(df_1)
num_a num_b name_a name_b
0 1 2 a b
1 2 4 b d
2 3 1 c a
3 4 2 d b
4 5 3 e c

adding column in pandas dataframe containing the same value

I have a pandas dataframe A of size (1500,5) and a dictionary D containing:
D
Out[121]:
{'newcol1': 'a',
'newcol2': 2,
'newcol3': 1}
For each key in the dictionary, I would like to create a new column in dataframe A holding that key's value (the same value for all rows of the column).
At the end, A should be of size (1500, 8).
Is there a "pythonic" way to do this? Thanks!
You can use concat with the DataFrame constructor:
D = {'newcol1': 'a',
     'newcol2': 2,
     'newcol3': 1}
df = pd.DataFrame({'A': [1, 2],
                   'B': [4, 5],
                   'C': [7, 8]})
print (df)
A B C
0 1 4 7
1 2 5 8
print (pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1))
A B C newcol1 newcol2 newcol3
0 1 4 7 a 2 1
1 2 5 8 a 2 1
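The key piece is pd.DataFrame(D, index=df.index): when given scalar values and an explicit index, the constructor broadcasts each scalar down the whole column:
pd.DataFrame(D, index=df.index)
#   newcol1  newcol2  newcol3
# 0       a        2        1
# 1       a        2        1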
Timings:
D = {'newcol1': 'a',
     'newcol2': 2,
     'newcol3': 1}
df = pd.DataFrame(np.random.rand(10000000, 5), columns=list('abcde'))
In [37]: %timeit pd.concat([df, pd.DataFrame(D, index=df.index)], axis=1)
The slowest run took 18.06 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 875 ms per loop
In [38]: %timeit df.assign(**D)
1 loop, best of 3: 1.22 s per loop
setup
A = pd.DataFrame(np.random.rand(10, 5), columns=list('abcde'))
d = {
    'newcol1': 'a',
    'newcol2': 2,
    'newcol3': 1
}
solution
Use assign
A.assign(**d)
a b c d e newcol1 newcol2 newcol3
0 0.709249 0.275538 0.135320 0.939448 0.549480 a 2 1
1 0.396744 0.513155 0.063207 0.198566 0.487991 a 2 1
2 0.230201 0.787672 0.520359 0.165768 0.616619 a 2 1
3 0.300799 0.554233 0.838353 0.637597 0.031772 a 2 1
4 0.003613 0.387557 0.913648 0.997261 0.862380 a 2 1
5 0.504135 0.847019 0.645900 0.312022 0.715668 a 2 1
6 0.857009 0.313477 0.030833 0.952409 0.875613 a 2 1
7 0.488076 0.732990 0.648718 0.389069 0.301857 a 2 1
8 0.187888 0.177057 0.813054 0.700724 0.653442 a 2 1
9 0.003675 0.082438 0.706903 0.386046 0.973804 a 2 1
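Note that assign returns a new DataFrame rather than modifying A in place, so rebind the result if you want to keep the columns:
A = A.assign(**d)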

in Pandas, how to create a variable that is n for the nth observation within a group?

Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column of ones and using cumsum, but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'a', 'b'],
                   'C': [5, 3, 2]})
print df
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print df
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print df
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print df
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print df
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case you don't really need the ['C'], but it makes the ranking a little more explicit, and if you had other unrelated columns in the dataframe you would need it. If you are ranking by more than one column, though, you should use Jezrael's method.
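One caveat worth noting (my addition, not from the original answer): rank returns floats and averages ties by default, so if you want sequential integers even when C values tie within a group, break ties by order of appearance:
df['order'] = df.groupby('B')['C'].rank(method='first').astype(int)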
