Dataframe columns from Dataframe rows in Pandas - python

I have two Pandas DataFrames. I would like to add the rows of one dataframe as columns in the other. I've tried reading through the "Merge, join, and concatenate" documentation, but I can't get my head around how to do this in Pandas.
Here's how I've managed to do it by converting to numpy arrays, but surely there is a smarter way to do this in Pandas.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,3,4],columns=['a','b'])
df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
ar = np.concatenate((df1.values,df2.values.T),axis=1)
df = pd.DataFrame(ar,columns=['a','b','c','d'],index=[1,2,3,4])

If df1.index has no duplicate values, then you could use df1.join:
In [283]: df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,3,4],columns=['a','b'])
In [284]: df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
In [285]: df1.join(df2.T.set_index(df1.index))
Out[285]:
a b c d
1 -1.196281 0.222283 1.247750 -0.121309
2 1.188098 0.384871 -1.324419 -1.610255
3 -0.928642 -0.618491 0.171215 -1.545479
4 -0.832756 -0.491364 0.100428 -0.525689
If df1 has duplicate entries in its index, then df1.join(...) may return more rows than desired. For example, if df1 has non-unique index [1,2,1,4] then:
In [4]: df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,1,4],columns=['a','b'])
In [5]: df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
In [8]: df1.join(df2.T.set_index(df1.index))
Out[8]:
a b c d
1 -1.087152 -0.828800 -1.129768 -0.579428
1 -1.087152 -0.828800 0.320756 0.297736
1 0.198297 0.277456 -1.129768 -0.579428
1 0.198297 0.277456 0.320756 0.297736
2 1.529188 1.023568 -0.670853 -0.466754
4 -0.393748 0.976632 0.455129 1.230298
The 2 rows with index 1 in df1 are being joined to the 2 rows with index 1 in df2 resulting in 4 rows with index 1 -- probably not what you want.
So, if df1.index does contain duplicate values, use pd.concat to guarantee a simple side-by-side juxtaposition of the two DataFrames:
In [7]: pd.concat([df1, df2.T.set_index(df1.index)], axis=1)
Out[7]:
a b c d
1 -1.087152 -0.828800 -1.129768 -0.579428
2 1.529188 1.023568 -0.670853 -0.466754
1 0.198297 0.277456 0.320756 0.297736
4 -0.393748 0.976632 0.455129 1.230298
One reason you might want to use df1.join, however, is that
if you know df1.index has no duplicate values, then using it
is faster than using pd.concat:
In [13]: df1 = pd.DataFrame(np.random.normal(size=8000).reshape(-1,2), columns=['a','b'])
In [14]: df2 = pd.DataFrame(np.random.normal(size=8000).reshape(2,-1),index=['c','d'])
In [15]: %timeit df1.join(df2.T.set_index(df1.index))
1000 loops, best of 3: 600 µs per loop
In [16]: %timeit pd.concat([df1, df2.T.set_index(df1.index)], axis=1)
1000 loops, best of 3: 1.18 ms per loop
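For completeness, the rows of df2 can also be assigned directly as new columns with DataFrame.assign. A minimal sketch (assuming, as above, that the rows of df2.T line up positionally with the rows of df1; df3 is just an illustrative name):
df3 = df1.assign(**{name: df2.loc[name].to_numpy() for name in df2.index})
# df3 now has columns a, b, c, d and keeps df1's index, like the join/concat results above.
The .to_numpy() call matters: passing a raw Series to assign would align on df2's column labels 5-8 and produce NaN.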

Related

Pandaic reasoning behind a way to conditionally update new value from other values in same row in DataFrame

What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?
Given
df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
a b
0 1 2
1 3 4
I want
a b c
0 1 2 NaN
1 3 4 3.0
Where the values in column 'c' are set from 'a' if 'b' >= 4.
(1) I tried:
df['c']=df[df['b']>=4]['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(2) I also tried How can I conditionally update multiple columns in a panda dataframe which sets values from other row values:
df.loc[df['b'] >= 4, 'c'] = df['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(3) jp also showed another way:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
Which of the above is the most pandaic? How does loc work?
Answers to the following did not work:
Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal
Another possible way is to use apply:
df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)
Result:
a b c
0 1 2 NaN
1 3 4 3.0
Comparing the timings, np.where seems to perform best here among different methods:
%timeit df.loc[df['b'] >= 4, 'c'] = df['a']
1000 loops, best of 3: 1.54 ms per loop
%timeit df['c']=df[df['b']>=4]['a']
1000 loops, best of 3: 869 µs per loop
%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
1000 loops, best of 3: 440 µs per loop
%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
1000 loops, best of 3: 359 µs per loop
This will not work, because df['c'] is not yet defined and, even if it were, the left side here is a dataframe while the right side is a series:
df[df['b'] >= 4] = df['c']
You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:
df.loc[df['b'] >= 4, 'c'] = df['a']
This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
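If you prefer to stay within pandas, Series.where gives the same result as the np.where version above: it keeps the value from 'a' where the condition holds and fills NaN elsewhere. A minimal sketch:
df['c'] = df['a'].where(df['b'] >= 4)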

Get column names of specific columns

I get a dataframe from an interface with cryptically named columns, of which I know some substrings which are mutually exclusive over all columns.
A simplified example looks like this:
df = pandas.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df
d10432first34sf d10432second34sf
0 1 4
1 2 5
2 3 6
Since I know the column substrings, I can access individual columns in the following way:
df.filter(like='first')
d10432first34sf
0 1
1 2
2 3
df.filter(like='second')
d10432second34sf
0 4
1 5
2 6
But now, I also need to get the exact name of each column, which is unknown to me. How can I achieve that?
Add .columns:
cols = df.filter(like='first').columns
print (cols)
Index(['d10432first34sf'], dtype='object')
Or better, use boolean indexing with str.contains:
cols = df.columns[df.columns.str.contains('first')]
print (cols)
Index(['d10432first34sf'], dtype='object')
The timings are not the same:
df = pd.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df = pd.concat([df]*10000, axis=1).reset_index(drop=True)
df = pd.concat([df]*1000).reset_index(drop=True)
df.columns = df.columns + pd.Series(range(10000 * 2)).astype('str')
print (df.shape)
(3000, 20000)
In [267]: %timeit df.filter(like='first').columns
10 loops, best of 3: 117 ms per loop
In [268]: %timeit df.columns[df.columns.str.contains('first')]
100 loops, best of 3: 11.9 ms per loop
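As a small follow-up sketch, if each substring matches exactly one column you can pull the exact name out as a plain string by indexing the result (first_name is just an illustrative variable name):
first_name = df.columns[df.columns.str.contains('first')][0]
print(first_name)  # 'd10432first34sf'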

How to expand/flatten pandas dataframe efficiently

I have a dataset in which, in one of the columns, each element is a list.
I would like to flatten it, such that every list element gets a row of its own.
I managed to solve it with iterrows, dict and append (see below), but it is too slow with my real DataFrame, which is large.
Is there a way to make things faster?
I am open to replacing the list-per-element column with another format (maybe a hierarchical df?) if that would make more sense.
EDIT: I have many columns, and some might change in the future. The only thing I know for sure is that I have the fields column. That's why I used dict in my solution.
A minimal example, creating a df to play with:
import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO

df = pd.read_csv(StringIO("""
id|name|fields
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]
"""), sep='|')
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
print(df)
resulting df:
id name fields
0 1 abc [qq, ww, rr]
1 2 efg [zz, xx, rr]
my (slow) solution:
new_df = pd.DataFrame(index=[], columns=df.columns)
for _, i in df.iterrows():
    flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
    new_df = new_df.append(flattened_d)
Resulting with
id name fields
0 1.0 abc qq
1 1.0 abc ww
2 1.0 abc rr
0 2.0 efg zz
1 2.0 efg xx
2 2.0 efg rr
You can use numpy for better performance:
Both solutions use mainly numpy.repeat.
from itertools import chain
import numpy as np

vals = df.fields.str.len()
df1 = pd.DataFrame({
    "id": np.repeat(df.id.values, vals),
    "name": np.repeat(df.name.values, vals),
    "fields": list(chain.from_iterable(df.fields))})
df1 = df1.reindex(columns=df.columns)  # reindex_axis(df.columns, axis=1) in older pandas
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
Another solution:
df[['id','name']].values converts the columns to a numpy array and duplicates them with numpy.repeat, then the values in the lists are stacked with numpy.hstack and combined with numpy.column_stack.
df1 = pd.DataFrame(np.column_stack((df[['id', 'name']].values
                                      .repeat(list(map(len, df.fields)), axis=0),
                                    np.hstack(df.fields))),
                   columns=df.columns)
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
A more general solution is to filter out the fields column and then add it in the DataFrame constructor, because it is always the last column:
cols = df.columns[df.columns != 'fields'].tolist()
print (cols)
['id', 'name']
df1 = pd.DataFrame(np.column_stack((df[cols].values
                                      .repeat(list(map(len, df.fields)), axis=0),
                                    np.hstack(df.fields))),
                   columns=cols + ['fields'])
print (df1)
id name fields
0 1 abc qq
1 1 abc ww
2 1 abc rr
3 2 efg zz
4 2 efg xx
5 2 efg rr
If your CSV is many thousands of lines long, then using_string_methods (below)
may be faster than using_iterrows or using_repeat:
With
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
In [210]: %timeit using_string_methods(csv)
10 loops, best of 3: 100 ms per loop
In [211]: %timeit using_itertuples(csv)
10 loops, best of 3: 119 ms per loop
In [212]: %timeit using_repeat(csv)
10 loops, best of 3: 126 ms per loop
In [213]: %timeit using_iterrows(csv)
1 loop, best of 3: 1min 7s per loop
So for a 10000-line CSV, using_string_methods is over 600x faster than using_iterrows, and marginally faster than using_repeat.
import numpy as np
import pandas as pd
try: from cStringIO import StringIO # for Python2
except ImportError: from io import StringIO # for Python3
def using_string_methods(csv):
    df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
    other_columns = df.columns.difference(['fields']).tolist()
    fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
                          .str.split(r',', expand=True))
    df = pd.concat([df.drop('fields', axis=1), fields], axis=1)
    result = (pd.melt(df, id_vars=other_columns, value_name='field')
                .drop('variable', axis=1))
    result = result.dropna(subset=['field'])
    return result
def using_iterrows(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    new_df = pd.DataFrame(index=[], columns=df.columns)
    for _, i in df.iterrows():
        flattened_d = [dict(i.to_dict(), fields=c) for c in i.fields]
        new_df = new_df.append(flattened_d)
    return new_df
def using_repeat(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    cols = df.columns[df.columns != 'fields'].tolist()
    df1 = pd.DataFrame(np.column_stack(
        (df[cols].values.repeat(list(map(len, df.fields)), axis=0),
         np.hstack(df.fields))), columns=cols + ['fields'])
    return df1
def using_itertuples(csv):
    df = pd.read_csv(StringIO(csv), sep='|')
    df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
    other_columns = df.columns.difference(['fields']).tolist()
    data = []
    for tup in df.itertuples():
        data.extend([[getattr(tup, col) for col in other_columns] + [field]
                     for field in tup.fields])
    return pd.DataFrame(data, columns=other_columns + ['field'])
csv = 'id|name|fields'+("""
1|abc|[qq,ww,rr]
2|efg|[zz,xx,rr]"""*10000)
Generally, fast NumPy/Pandas operations are possible only when the data is in a
native NumPy dtype (such as int64, float64, or strings). Once you place
lists (a non-native NumPy dtype) in a DataFrame the jig is up -- you are forced
to use Python-speed loops to process the lists.
So to improve performance, you need to avoid placing lists in a DataFrame.
using_string_methods loads the fields data as strings:
df = pd.read_csv(StringIO(csv), sep='|', dtype=None)
and avoids using the apply method (which is generally as slow as a plain Python loop):
df.fields = df.fields.apply(lambda s: s[1:-1].split(','))
Instead, it uses faster vectorized string methods to break the strings up into
separate columns:
fields = (df['fields'].str.extract(r'\[(.*)\]', expand=False)
.str.split(r',', expand=True))
Once you have the fields in separate columns, you can use pd.melt to reshape
the DataFrame into the desired format.
pd.melt(df, id_vars=['id', 'name'], value_name='field')
By the way, you might be interested to see that with a slight modification using_iterrows can be just as fast as using_repeat. I show the changes in using_itertuples.
df.itertuples tends to be slightly faster than df.iterrows, but the difference is minor. The majority of the speed gain is achieved by avoiding calling df.append in a for-loop since that leads to quadratic copying.
You can break the lists in the fields column into multiple columns by applying pandas.Series to fields and then joining the result back to id and name like so:
cols = df.columns[df.columns != 'fields'].tolist() # adapted from #jezrael
df = df[cols].join(df.fields.apply(pandas.Series))
Then you can melt the resulting new columns using set_index and stack, and then reset the index:
df = df.set_index(cols).stack().reset_index()
Finally, drop the redundant column generated by reset_index and rename the generated column to "field":
df = df.drop(df.columns[-2], axis=1).rename(columns={0: 'field'})
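For reference, pandas 0.25 and later ship a built-in DataFrame.explode that covers this case directly once the fields column already holds lists (as it does after the split in the question). A minimal sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'name': ['abc', 'efg'],
                   'fields': [['qq', 'ww', 'rr'], ['zz', 'xx', 'rr']]})
flat = df.explode('fields').reset_index(drop=True)  # one row per list element
print(flat)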

How to check whether a pandas DataFrame is empty?

How to check whether a pandas DataFrame is empty? In my case I want to print some message in terminal if the DataFrame is empty.
You can use the attribute df.empty to check whether it's empty or not:
if df.empty:
print('DataFrame is empty!')
Source: Pandas Documentation
I use the len function. It's much faster than empty. len(df.index) is even faster.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10000, 4), columns=list('ABCD'))
def empty(df):
    return df.empty

def lenz(df):
    return len(df) == 0

def lenzi(df):
    return len(df.index) == 0
'''
%timeit empty(df)
%timeit lenz(df)
%timeit lenzi(df)
10000 loops, best of 3: 13.9 µs per loop
100000 loops, best of 3: 2.34 µs per loop
1000000 loops, best of 3: 695 ns per loop
len on index seems to be faster
'''
To see if a dataframe is empty, I argue that one should test for the length of a dataframe's columns index:
if len(df.columns) == 0:
Reason:
According to the Pandas Reference API, there is a distinction between:
an empty dataframe with 0 rows and 0 columns
an empty dataframe with rows containing NaN hence at least 1 column
Arguably, they are not the same. The other answers are imprecise in that df.empty, len(df), and len(df.index) make no distinction: they report a length of 0 and empty == True in both cases.
Examples
Example 1: An empty dataframe with 0 rows and 0 columns
In [1]: import pandas as pd
df1 = pd.DataFrame()
df1
Out[1]: Empty DataFrame
Columns: []
Index: []
In [2]: len(df1.index) # or len(df1)
Out[2]: 0
In [3]: df1.empty
Out[3]: True
Example 2: A dataframe which is emptied to 0 rows but still retains n columns
In [4]: df2 = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df2
Out[4]: AA BB
0 1 11
1 2 22
2 3 33
In [5]: df2 = df2[df2['AA'] == 5]
df2
Out[5]: Empty DataFrame
Columns: [AA, BB]
Index: []
In [6]: len(df2.index) # or len(df2)
Out[6]: 0
In [7]: df2.empty
Out[7]: True
Now, building on the previous examples, in which the index length is 0 and empty is True: reading the length of the columns index for the first loaded dataframe df1 returns 0 columns, proving that it is indeed empty.
In [8]: len(df1.columns)
Out[8]: 0
In [9]: len(df2.columns)
Out[9]: 2
Critically, while the second dataframe df2 contains no data, it is not completely empty, because its (empty) columns persist.
Why it matters
Let's add a new column to these dataframes to understand the implications:
# As expected, the empty column displays 1 series
In [10]: df1['CC'] = [111, 222, 333]
df1
Out[10]: CC
0 111
1 222
2 333
In [11]: len(df1.columns)
Out[11]: 1
# Note the persisting series with rows containing `NaN` values in df2
In [12]: df2['CC'] = [111, 222, 333]
df2
Out[12]: AA BB CC
0 NaN NaN 111
1 NaN NaN 222
2 NaN NaN 333
In [13]: len(df2.columns)
Out[13]: 3
It is evident that the original columns in df2 have re-surfaced. Therefore, it is prudent to instead read the length of the columns index with len(df.columns) to see whether a dataframe is truly empty.
Practical solution
# New dataframe df
In [1]: df = pd.DataFrame({'AA' : [1, 2, 3], 'BB' : [11, 22, 33]})
df
Out[1]: AA BB
0 1 11
1 2 22
2 3 33
# This data manipulation approach results in an empty df
# because of a subset of values that are not available (`NaN`)
In [2]: df = df[df['AA'] == 5]
df
Out[2]: Empty DataFrame
Columns: [AA, BB]
Index: []
# NOTE: the df is empty, BUT the columns are persistent
In [3]: len(df.columns)
Out[3]: 2
# And accordingly, the other answers on this page
In [4]: len(df.index) # or len(df)
Out[4]: 0
In [5]: df.empty
Out[5]: True
# SOLUTION: conditionally check for empty columns
In [6]: if len(df.columns) != 0:  # <--- here
            # Do something, e.g.
            # drop any columns containing rows with `NaN`
            # to make the df really empty
            df = df.dropna(how='all', axis=1)
        df
Out[6]: Empty DataFrame
Columns: []
Index: []
# Testing shows it is indeed empty now
In [7]: len(df.columns)
Out[7]: 0
Adding a new data series now works as expected, without the re-surfacing of empty columns (that is, without any columns whose rows contained only NaN):
In [8]: df['CC'] = [111, 222, 333]
df
Out[8]: CC
0 111
1 222
2 333
In [9]: len(df.columns)
Out[9]: 1
I prefer going the long route. These are the checks I follow to avoid using a try-except clause:
check that the variable is not None,
then check that it's a DataFrame, and
make sure it's not empty.
Here, DATA is the suspect variable -
DATA is not None and isinstance(DATA, pd.DataFrame) and not DATA.empty
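Wrapped up as a small reusable helper (a sketch; is_usable is just an illustrative name):
import pandas as pd

def is_usable(data):
    # True only for a non-None, non-empty DataFrame
    return data is not None and isinstance(data, pd.DataFrame) and not data.empty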
If a DataFrame contains both NaN and non-null values and you want to find out whether the DataFrame
actually holds data, then try this code.
When can this situation happen?
This situation happens when a single function is used to plot more than one DataFrame,
with the DataFrames passed as parameters. In such a situation the function tries to plot the data even
when a DataFrame is empty and thus plots an empty figure!
It would make more sense to simply display a 'DataFrame has no data' message.
Why?
If a DataFrame is empty (i.e. contains no data at all; mind you, a DataFrame with only NaN values
is considered non-empty here), then it is desirable not to plot but to put out a message instead:
Suppose we have two DataFrames df1 and df2.
The function myfunc takes any DataFrame (df1 and df2 in this case) and prints a message if the DataFrame is empty (instead of plotting):
       df1                      df2
  col1   col2              col1   col2
  NaN    2                 NaN    NaN
  2      NaN               NaN    NaN
and the function:
def myfunc(df):
    if df.count().sum() > 0:  # count the total number of non-NaN values; equal to 0 if the DataFrame has no data
        print('not empty')
        df.plot(kind='barh')
    else:
        # display a message instead of plotting if it is empty
        print('empty')

How to apply a function to two columns of Pandas dataframe

Suppose I have a df which has columns of 'ID', 'col_1', 'col_2'. And I define a function :
f = lambda x, y : my_function_expression.
Now I want to apply f to the df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like:
df['col_3'] = df[['col_1','col_2']].apply(f)
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'
How can I do this?
*** Added a detailed sample below ***
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
ID col_1 col_2 col_3
0 1 0 1 ['a', 'b']
1 2 2 4 ['c', 'd', 'e']
2 3 3 5 ['d', 'e', 'f']
There is a clean, one-line way of doing this in Pandas:
df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.
Example with data (based on original question):
import pandas as pd
df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
Output of print(df):
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:
df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)
Here's an example using apply on the dataframe, which I am calling with axis = 1.
Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.
In [49]: df
Out[49]:
0 1
0 1.000000 0.000000
1 -0.494375 0.570994
2 1.000000 0.000000
3 1.876360 -0.229738
4 1.000000 0.000000
In [50]: def f(x):
   ....:     return x[0] + x[1]
   ....:
In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]:
0 1.000000
1 0.076619
2 1.000000
3 1.646622
4 1.000000
Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.
A simple solution is:
df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
An interesting question! My answer is below:
import pandas as pd
def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(sublst, axis=1)
print(df)
Output:
ID J1 J2
0 1 0 1
1 2 2 4
2 3 3 5
ID J1 J2 J3
0 1 0 1 [a]
1 2 2 4 [c, d]
2 3 3 5 [d, e]
I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so the columns display in the right order.
One more brief version:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(lambda row: lst[row['J1']:row['J2']], axis=1)
print(df)
The method you are looking for is Series.combine.
However, it seems some care has to be taken around datatypes.
In your example, you would (as I did when testing the answer) naively call
df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)
However, this throws the error:
ValueError: setting an array element with a sequence.
My best guess is that it expects the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:
df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let's walk through a simple example:
df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
columns=['a', 'b', 'c'])
df
a b c
0 4 0 0
1 2 0 1
2 2 2 2
3 1 2 2
4 3 0 0
There are three possible outcomes with returning a list from apply
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
df.apply(lambda x: list(range(2)), axis=1) # returns a Series
0 [0, 1]
1 [0, 1]
2 [0, 1]
3 [0, 1]
4 [0, 1]
dtype: object
2) When the length of the returned list is equal to the number of
columns then a DataFrame is returned and each column gets the
corresponding value in the list.
df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
a b c
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
3) If the length of the returned list equals the number of columns for the first row but has at least one row where the list has a different number of elements than number of columns a ValueError is raised.
i = 0

def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1)
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
Answering the problem without apply
Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.
Create larger dataframe
df1 = df.sample(100000, replace=True).reset_index(drop=True)
Timings
# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip - similar to #Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#Thomas answer
%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
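For reference, the zip-based version from the timings above, written out as a plain assignment (using the same mylist, col_1 and col_2 as in the question):
df['col_3'] = [mylist[v1:v2 + 1] for v1, v2 in zip(df['col_1'], df['col_2'])]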
I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data -
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list
We could pass as many arguments as we wanted into the function this way. The output is what we wanted
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
I'm going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it's great for functions you don't control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
df.loc[:, 'col_3'] = np.vectorize(get_sublist, otypes=["O"])(df['col_1'], df['col_2'])
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Here is a faster solution:
def func_1(a, b):
    return a + b

df["C"] = func_1(df["A"].to_numpy(), df["B"].to_numpy())
This is 380 times faster than df.apply(f, axis=1) from #Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from #ajrwhite.
I add some benchmarks too:
Results:
FUNCTIONS           TIMINGS    GAIN
apply lambda        0.7        x 1
apply               0.56       x 1.25
map                 0.3        x 2.3
np.vectorize        0.01       x 70
f3 on Series        0.0026     x 270
f3 on np arrays     0.0018     x 380
f3 numba            0.0018     x 380
In short:
Using apply is slow. We can speed things up very simply, just by using a function that operates directly on Pandas Series (or, better, on numpy arrays). And because we operate on Pandas Series or numpy arrays, the operations can be vectorized. The function returns a Pandas Series or numpy array that we assign as a new column.
And here is the benchmark code:
import timeit
timeit_setup = """
import pandas as pd
import numpy as np
import numba
np.random.seed(0)
# Create a DataFrame of 10000 rows with 2 columns "A" and "B"
# containing integers between 0 and 100
df = pd.DataFrame(np.random.randint(0,10,size=(10000, 2)), columns=["A", "B"])
def f1(a,b):
# Here a and b are the values of column A and B for a specific row: integers
return a + b
def f2(x):
# Here, x is pandas Series, and corresponds to a specific row of the DataFrame
# 0 and 1 are the indexes of columns A and B
return x[0] + x[1]
def f3(a,b):
# Same as f1 but we will pass parameters that will allow vectorization
# Here, A and B will be Pandas Series or numpy arrays
# with df["C"] = f3(df["A"],df["B"]): Pandas Series
# with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
return a + b
@numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a, b):
    # Here a and b are 2 numpy arrays with dtype int64
    # This function must return a numpy array with dtype int64
    return a + b
"""
test_functions = [
    'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
    'df["C"] = df.apply(f2, axis=1)',
    'df["C"] = list(map(f3,df["A"],df["B"]))',
    'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
    'df["C"] = f3(df["A"],df["B"])',
    'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
    'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]
for test_function in test_functions:
    print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))
Output:
0.7
0.56
0.3
0.01
0.0026
0.0018
0.0018
Final note: things could be optimized further with Cython and other numba tricks too.
The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct.
The mismatch is because df[['col1','col2']] returns a single dataframe with two columns, not two separate columns.
You need to change your f so that it takes a single input; keep the above dataframe as the input, then break it up into x and y inside the function body. Then do whatever you need and return a single value.
You need this function signature because the syntax is .apply(f),
so f needs to take the single thing (the dataframe) and not the two things your current f expects.
Since you haven't provided the body of f, I can't help in any more detail, but this should provide a way out without fundamentally changing your code or switching to some method other than apply.
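A minimal sketch of the restructuring described above, using the question's column names and axis=1 so each row arrives as a single object (the x + y body is only a toy stand-in for whatever my_function_expression computes):
def f(row):
    x, y = row['col_1'], row['col_2']  # break the single input up inside the function body
    return x + y                       # toy stand-in for my_function_expression(x, y)

df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)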
Another option is df.itertuples() (generally faster and recommended over df.iterrows() by docs and user testing):
import pandas as pd
df = pd.DataFrame([range(4) for _ in range(4)], columns=list("abcd"))
df
a b c d
0 0 1 2 3
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
df["e"] = [sum(row) for row in df[["b", "d"]].itertuples(index=False)]
df
a b c d e
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
Since itertuples returns an Iterable of namedtuples, you can access tuple elements both as attributes by column name (aka dot notation) and by index:
b, d = row
b = row.b
d = row[1]
My example for your question:
def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]

df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')
It can be done in two simple ways:
Let's say we want the sum of col1 and col2 in an output column named col_sum.
Method 1
f = lambda x : x.col1 + x.col2
df['col_sum'] = df.apply(f, axis=1)
Method 2
def f(x):
    x['col_sum'] = x.col1 + x.col2
    return x
df = df.apply(f, axis=1)
Method 2 should be used when some complex function has to be applied to the dataframe. Method 2 can also be used when output in multiple columns is required.
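For instance, a sketch of Method 2 producing two output columns at once (col_diff is just an illustrative second output):
def f(x):
    x['col_sum'] = x.col1 + x.col2
    x['col_diff'] = x.col1 - x.col2  # a second derived column
    return x

df = df.apply(f, axis=1)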
I suppose you don't want to change the get_sublist function, and just want to use the DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the function names suggest, the first gets the list of sublists and the second extracts that sublist from the list. Finally, we call apply to apply those two functions to the df[['col_1','col_2']] DataFrame in sequence.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta, end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0], cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]
df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)
df
If you don't use [] to enclose the get_sublist function, then the get_sublist_list function will return a plain list, and it'll raise ValueError: could not broadcast input array from shape (3) into shape (2), as #Ted Petrou mentioned.
If you have a huge dataset, then you can use an easy but faster (in execution time) way of doing this using swifter:
import pandas as pd
import swifter
def fnc(m, x, c):
    return m*x + c
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)
