Suppose I have a df with columns 'ID', 'col_1' and 'col_2', and I define a function:
f = lambda x, y : my_function_expression
Now I want to apply f to df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like:
df['col_3'] = df[['col_1','col_2']].apply(f)
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'
How can I do this?
A detailed sample is added below:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
ID col_1 col_2 col_3
0 1 0 1 ['a', 'b']
1 2 2 4 ['c', 'd', 'e']
2 3 3 5 ['d', 'e', 'f']
There is a clean, one-line way of doing this in Pandas:
df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)
This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.
Example with data (based on original question):
import pandas as pd
df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)
Output of print(df):
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
If your column names contain spaces or share a name with an existing dataframe attribute, you can index with square brackets:
df['col_3'] = df.apply(lambda x: f(x['col 1'], x['col 2']), axis=1)
Here's an example using apply on the dataframe, which I am calling with axis = 1.
Note the difference is that instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.
In [49]: df
Out[49]:
0 1
0 1.000000 0.000000
1 -0.494375 0.570994
2 1.000000 0.000000
3 1.876360 -0.229738
4 1.000000 0.000000
In [50]: def f(x):
   ....:     return x[0] + x[1]
   ....:
In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]:
0 1.000000
1 0.076619
2 1.000000
3 1.646622
4 1.000000
Depending on your use case, it is sometimes helpful to create a pandas group object, and then use apply on the group.
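As a rough sketch of the group-based variant mentioned above: the function receives each group as a DataFrame and combines two of its columns. The column names `key`, `a`, `b` and the helper `weighted_sum` are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data: 'key', 'a' and 'b' are illustrative names only.
df = pd.DataFrame({
    "key": ["x", "x", "y", "y"],
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
})

def weighted_sum(group):
    # 'group' is a DataFrame holding the rows that share one key
    return (group["a"] * group["b"]).sum()

# Select the two columns explicitly so only they are passed to the function
result = df.groupby("key")[["a", "b"]].apply(weighted_sum)
print(result)
```

The result is a Series indexed by `key`, with one value per group rather than one per row.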
A simple solution is:
df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
An interesting question! My answer is below:
import pandas as pd
def sublst(row):
    return lst[row['J1']:row['J2']]
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(sublst,axis=1)
print(df)
Output:
ID J1 J2
0 1 0 1
1 2 2 4
2 3 3 5
ID J1 J2 J3
0 1 0 1 [a]
1 2 2 4 [c, d]
2 3 3 5 [d, e]
I changed the column names to ID, J1, J2, J3 so that ID < J1 < J2 < J3 and the columns display in the right order.
One more brief version:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print(df)
The method you are looking for is Series.combine.
However, it seems some care has to be taken around datatypes.
In your example, you would (as I did when testing the answer) naively call
df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)
However, this throws the error:
ValueError: setting an array element with a sequence.
My best guess is that it seems to expect the result to be of the same type as the series calling the method (df.col_1 here). However, the following works:
df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Returning a list from apply is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let's walk through a simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
a b c
0 4 0 0
1 2 0 1
2 2 2 2
3 1 2 2
4 3 0 0
There are three possible outcomes when returning a list from apply:
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
df.apply(lambda x: list(range(2)), axis=1) # returns a Series
0 [0, 1]
1 [0, 1]
2 [0, 1]
3 [0, 1]
4 [0, 1]
dtype: object
2) When the length of the returned list is equal to the number of columns, a DataFrame is returned and each column gets the corresponding value in the list.
df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
a b c
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
3) If the length of the returned list equals the number of columns for the first row, but at least one later row returns a list with a different number of elements, a ValueError is raised.
i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))
df.apply(f, axis=1)
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
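The outcomes above may depend on the pandas version. As far as I know, since pandas 0.23 a list returned from apply with axis=1 is stored as one object per row by default, and the result_type parameter controls whether it is expanded into columns. A minimal sketch, using a small made-up frame:

```python
import pandas as pd

# Small illustrative frame (values are arbitrary)
df = pd.DataFrame({"a": [4, 2, 2], "b": [0, 0, 2], "c": [0, 1, 2]})

# Default: each returned list is kept as a single object, giving a Series
as_series = df.apply(lambda row: list(range(3)), axis=1)

# result_type="expand": each list element becomes its own column
as_frame = df.apply(lambda row: list(range(3)), axis=1, result_type="expand")

print(type(as_series).__name__, type(as_frame).__name__)
```

Passing result_type explicitly makes the shape of the result independent of how long the returned lists happen to be.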
Answering the problem without apply
Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.
Create larger dataframe
df1 = df.sample(100000, replace=True).reset_index(drop=True)
Timings
# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip - similar to @Thomas's answer
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# @Thomas's answer
%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data -
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list
We could pass as many arguments as we wanted into the function this way. The output is what we wanted:
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
I'm going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it's great for functions you don't control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
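To illustrate the "two columns and a constant" case mentioned above, here is a small sketch. The function `join_sublist` and its `sep` argument are hypothetical and only stand in for a function you do not control.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_1": [0, 2, 3], "col_2": [1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def join_sublist(sta, end, sep):
    # Hypothetical three-argument function: two column values plus a constant
    return sep.join(mylist[sta:end + 1])

# The scalar "-" is broadcast against the two columns, one call per row
df["col_3"] = np.vectorize(join_sublist, otypes=[object])(df["col_1"], df["col_2"], "-")
print(df)
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so the gain comes from avoiding the per-row Series construction, not from true vectorization.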
Here is a faster solution:
def func_1(a,b):
    return a + b
df["C"] = func_1(df["A"].to_numpy(),df["B"].to_numpy())
This is 380 times faster than df.apply(f, axis=1) from @Aman and 310 times faster than df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1) from @ajrwhite.
I add some benchmarks too:
Results:
FUNCTIONS          TIMINGS (s)   GAIN
apply lambda       0.7           x 1
apply              0.56          x 1.25
map                0.3           x 2.3
np.vectorize       0.01          x 70
f3 on Series       0.0026        x 270
f3 on np arrays    0.0018        x 380
f3 numba           0.0018        x 380
In short:
Using apply is slow. We can speed up things very simply, just by using a function that will operate directly on Pandas Series (or better on numpy arrays). And because we will operate on Pandas Series or numpy arrays, we will be able to vectorize the operations. The function will return a Pandas Series or numpy array that we will assign as a new column.
And here is the benchmark code:
import timeit
timeit_setup = """
import pandas as pd
import numpy as np
import numba
np.random.seed(0)
# Create a DataFrame of 10000 rows with 2 columns "A" and "B"
# containing integers between 0 and 9
df = pd.DataFrame(np.random.randint(0,10,size=(10000, 2)), columns=["A", "B"])
def f1(a,b):
    # Here a and b are the values of column A and B for a specific row: integers
    return a + b
def f2(x):
    # Here, x is pandas Series, and corresponds to a specific row of the DataFrame
    # 0 and 1 are the indexes of columns A and B
    return x[0] + x[1]
def f3(a,b):
    # Same as f1 but we will pass parameters that will allow vectorization
    # Here, a and b will be Pandas Series or numpy arrays
    # with df["C"] = f3(df["A"],df["B"]): Pandas Series
    # with df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy()): numpy arrays
    return a + b
@numba.njit('int64[:](int64[:], int64[:])')
def f3_numba_vectorize(a,b):
    # Here a and b are 2 numpy arrays with dtype int64
    # This function must return a numpy array with dtype int64
    return a + b
"""
test_functions = [
'df["C"] = df.apply(lambda row: f1(row["A"], row["B"]), axis=1)',
'df["C"] = df.apply(f2, axis=1)',
'df["C"] = list(map(f3,df["A"],df["B"]))',
'df["C"] = np.vectorize(f3) (df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3(df["A"],df["B"])',
'df["C"] = f3(df["A"].to_numpy(),df["B"].to_numpy())',
'df["C"] = f3_numba_vectorize(df["A"].to_numpy(),df["B"].to_numpy())'
]
for test_function in test_functions:
    print(min(timeit.repeat(setup=timeit_setup, stmt=test_function, repeat=7, number=10)))
Output:
0.7
0.56
0.3
0.01
0.0026
0.0018
0.0018
Final note: things could be optimized with Cython and other numba tricks too.
As you have written it, f needs two inputs. The error message says you are providing only one input to f, not two, and the error message is correct.
The mismatch arises because df[['col_1','col_2']] is a single dataframe with two columns, not two separate columns.
You need to change f so that it takes a single input, keep the above dataframe as the input, and then break that input up into x and y inside the function body. Then do whatever you need and return a single value.
You need this function signature because the syntax is .apply(f).
So f needs to take a single argument (the piece of the dataframe that apply passes in), not the two arguments your current f expects.
Since you haven't provided the body of f I can't help in any more detail, but this should provide a way out without fundamentally changing your code or using some method other than apply.
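A minimal sketch of that idea, borrowing the sample data and the get_sublist logic from the question. It assumes apply is called with axis=1 so that f receives one row at a time, which the description above does not spell out.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3"], "col_1": [0, 2, 3], "col_2": [1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def f(row):
    # 'row' is a Series with one entry per selected column;
    # split it into the two values the original f expected
    x, y = row["col_1"], row["col_2"]
    return mylist[x:y + 1]

# axis=1 passes each row (not each column) to f
df["col_3"] = df[["col_1", "col_2"]].apply(f, axis=1)
print(df)
```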
Another option is df.itertuples() (generally faster and recommended over df.iterrows() by docs and user testing):
import pandas as pd
df = pd.DataFrame([range(4) for _ in range(4)], columns=list("abcd"))
df
a b c d
0 0 1 2 3
1 0 1 2 3
2 0 1 2 3
3 0 1 2 3
df["e"] = [sum(row) for row in df[["b", "d"]].itertuples(index=False)]
df
a b c d e
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
Since itertuples returns an Iterable of namedtuples, you can access tuple elements both as attributes by column name (aka dot notation) and by index:
b, d = row
b = row.b
d = row[1]
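Applied to the data from the original question, the same pattern might look like the sketch below. It reuses get_sublist unchanged and reads the values by attribute name.

```python
import pandas as pd

df = pd.DataFrame({"ID": ["1", "2", "3"], "col_1": [0, 2, 3], "col_2": [1, 4, 5]})
mylist = ["a", "b", "c", "d", "e", "f"]

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# Each 'row' is a namedtuple with fields col_1 and col_2
df["col_3"] = [
    get_sublist(row.col_1, row.col_2)
    for row in df[["col_1", "col_2"]].itertuples(index=False)
]
print(df)
```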
My example for your question:
def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')
It can be done in two simple ways.
Let's say we want the sum of col_1 and col_2 in an output column named col_sum.
Method 1
f = lambda x : x.col_1 + x.col_2
df['col_sum'] = df.apply(f, axis=1)
Method 2
def f(x):
    x['col_sum'] = x.col_1 + x.col_2
    return x
df = df.apply(f, axis=1)
Method 2 should be used when a more complex function has to be applied to the dataframe. It can also be used when output in multiple columns is required.
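For the multiple-column case, one possible variant is to return a labelled pd.Series from the function and join the result back. This is a sketch only; `sum_and_diff` and the `col_diff` column are hypothetical names, not part of the question.

```python
import pandas as pd

df = pd.DataFrame({"col_1": [0, 2, 3], "col_2": [1, 4, 5]})

def sum_and_diff(row):
    # Each labelled entry of the returned Series becomes a column
    return pd.Series({
        "col_sum": row.col_1 + row.col_2,
        "col_diff": row.col_2 - row.col_1,
    })

# apply returns a DataFrame with columns col_sum and col_diff;
# join attaches them to the original frame by index
df = df.join(df.apply(sum_and_diff, axis=1))
print(df)
```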
I suppose you don't want to change the get_sublist function and just want to use DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the names suggest, the first gets a list containing the sublist, and the second extracts the sublist from that list. Finally, we call apply twice to run those two functions in turn, starting from the df[['col_1','col_2']] DataFrame.
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
    return mylist[sta:end+1]
def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]
def unlist(list_of_lists):
    return list_of_lists[0]
df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)
df
If you don't use [] to enclose the get_sublist call, then get_sublist_list will return a plain list, which raises ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.
If you have a huge dataset, an easy but faster (in execution time) way of doing this is to use swifter:
import pandas as pd
import swifter
def fnc(m,x,c):
    return m*x+c
df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)
Related
I have a dataframe as follows:
df=pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'],size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)
.apply() is slow over a large dataframe because it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also take each column name (or a list of possible names), use it as a mask to filter the dataframe, and assign the values from that column to all rows matching the mask. Then repeat this for the next column.
for column_name in df: #or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name']==column_name, 'output'] = df[column_name]
Rows that do not match any mask will have NaN values.
PS. According to my test with 10000000 random rows, the method with .apply() takes 2 min 24 s to finish while my method takes only 4.3 s.
Use melt to flatten your dataframe and keep the rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
Update
Alternative with set_index and stack for @Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row, just to know where it breaks when it breaks. Here are two loops, just to satisfy myself. The column is created ahead of time with NaN values because the loops need it to exist.
# each row by index
for i in range(len(df)):
    df.loc[i, 'output'] = df.loc[i, df.loc[i, 'Col_name']]
# each row, but by column name
for col in df["Col_name"].unique():
    mask = df["Col_name"] == col
    df.loc[mask, 'output'] = df.loc[mask, col]
Here are some "non-loop" ways to do so.
# DataFrame.lookup is deprecated since pandas 1.2 and removed in 2.0
df["output"] = df.lookup(df.index, df.Col_name)
df['output'] = np.select([df['Col_name'] == c for c in 'ABC'], [df[c] for c in 'ABC'], default=np.nan)
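For pandas versions without DataFrame.lookup, a commonly suggested replacement is factorize plus NumPy integer indexing. The sketch below assumes the same A/B/C layout as the question; the seeded generator `rng` is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only for reproducibility
df = pd.DataFrame(rng.random((10, 3)), columns=["A", "B", "C"])
df["Col_name"] = rng.choice(["A", "B", "C"], size=10)

# Turn each row's column label into an integer code
codes, labels = pd.factorize(df["Col_name"])

# Reorder the value columns to match the codes, then pick one cell per row
values = df.reindex(labels, axis=1).to_numpy()
df["output"] = values[np.arange(len(df)), codes]
print(df)
```

This avoids any Python-level loop over rows: the selection is a single fancy-indexing operation on a NumPy array.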
I have a pandas DataFrame whose values I want to conditionally change into strings without looping over every value.
Example input:
In [1]: df = pd.DataFrame(data = [[1,2], [4,5]], columns = ['a', 'b'])
Out[2]:
a b
0 1 2
1 4 5
This is my best attempt which doesn't work properly
df['a'] = np.where(df['a'] < 3, f'string-{df["a"]}', df['a'])
In [1]: df
Out[2]:
a b
0 string0 1\n1 4\nName: a, dtype: int64 2
1 4 5
Desired output:
Out[2]:
A B
0 string-1 2
1 4 5
I am using np.where() since looping is not feasible due to the size of the actual DataFrame. The actual f-string I am using is also more complex and has two variables that include column names, but the problem is the same.
Are there other ways to conditionally change pandas values into f-strings without looping over each value?
You can use .map() together with f-string, as follows:
df['a'] = df['a'].map(lambda x: f'string-{x}' if x < 3 else x)
Alternatively, you can also use .loc together with string concatenation, as follows:
df.loc[df['a'] < 3, 'a'] = 'string-' + df['a'].astype(str)
#OR
df['a']=np.where(df['a'] < 3, 'string-'+df['a'].astype(str), df['a'])
Result:
print(df)
a b
0 string-1 2
1 4 5
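The question mentions that the real f-string involves two columns. A sketch of how the string-concatenation approach might extend to that case; the label format "string-{a}-{b}" is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=["a", "b"])

# Build the label for every row with vectorised string concatenation
# (hypothetical format combining columns 'a' and 'b')
label = "string-" + df["a"].astype(str) + "-" + df["b"].astype(str)

# Keep the label where the condition holds, the original value elsewhere
df["a"] = np.where(df["a"] < 3, label, df["a"].astype(object))
print(df)
```

The label is computed for all rows, including those that do not need it, which is the price of avoiding a Python-level loop.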
I have n variables; suppose n equals 3 in this case. I want to apply one function to all of the combinations (or permutations, depending on how you want to solve this) of the variables and store each result in the corresponding row and column of a dataframe.
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
df = pd.DataFrame({x:np.nan for x in indexes}, index=indexes)
If I apply sum (the function can be anything), then the result that I want to get looks like this:
a b c
a 2 3 4
b 3 4 5
c 4 5 6
I can only think of iterating all the variables, apply the function one by one, and use the index of the iterators to set the value in the dataframe. Is there any better solution?
You can use apply and return a pd.Series for that effect. In such cases, pandas uses the series indices as columns in the resulting dataframe.
s = pd.Series({"a": 1, "b": 2, "c": 3})
s.apply(lambda x: x+s)
Just note that the operation you do is between an element and a series.
I believe you need a broadcast sum of an array created from the variables, if performance is important:
a = 1
b = 2
c = 3
indexes = ['a', 'b', 'c']
arr = np.array([a,b,c])
df = pd.DataFrame(arr + arr[:, None], index=indexes, columns=indexes)
print (df)
a b c
a 2 3 4
b 3 4 5
c 4 5 6
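Broadcasting works when the function is an arithmetic operation. For an arbitrary Python function, one fallback is a nested dict comprehension. This is only a sketch: `func` is a placeholder for whatever two-argument function you have.

```python
import pandas as pd

values = {"a": 1, "b": 2, "c": 3}

def func(x, y):
    # Placeholder for any two-argument function
    return x * 10 + y

# Outer keys become columns, inner keys become the row index
df = pd.DataFrame(
    {col: {row: func(values[row], values[col]) for row in values} for col in values}
)
print(df)
```

This still calls func once per pair, so it is a convenience for building the frame rather than a performance improvement.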
I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how to select a column by integer?
My dataframe:
df=pandas.DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})
Two approaches that come to mind:
>>> df
A B C D
0 0.424634 1.716633 0.282734 2.086944
1 -1.325816 2.056277 2.583704 -0.776403
2 1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025 1.325853 -2.513373
4 1.366180 -1.265185 -2.184617 0.881514
>>> df.iloc[:, 2]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
>>> df[df.columns[2]]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
You could use label-based selection with .loc or position-based selection with .iloc to slice columns, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
a b c d
0 0.806811 0.187630 0.978159 0.317261
1 0.738792 0.862661 0.580592 0.010177
2 0.224633 0.342579 0.214512 0.375147
3 0.875262 0.151867 0.071244 0.893735
In [54]: df.loc[:, ["a", "b", "d"]] ### Selective columns based slicing
Out[54]:
a b d
0 0.806811 0.187630 0.317261
1 0.738792 0.862661 0.010177
2 0.224633 0.342579 0.375147
3 0.875262 0.151867 0.893735
In [55]: df.loc[:, "a":"c"] ### Selective label based column ranges slicing
Out[55]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
In [56]: df.iloc[:, 0:3] ### Selective index based column ranges slicing
Out[56]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
You can access multiple columns by passing a list of column positions to DataFrame.iloc (the older .ix indexer has been removed from pandas).
For example:
>>> df = pandas.DataFrame({
        'a': np.random.rand(5),
        'b': np.random.rand(5),
        'c': np.random.rand(5),
        'd': np.random.rand(5)
    })
>>> df
a b c d
0 0.705718 0.414073 0.007040 0.889579
1 0.198005 0.520747 0.827818 0.366271
2 0.974552 0.667484 0.056246 0.524306
3 0.512126 0.775926 0.837896 0.955200
4 0.793203 0.686405 0.401596 0.544421
>>> df.iloc[:,[1,3]]
b d
0 0.414073 0.889579
1 0.520747 0.366271
2 0.667484 0.524306
3 0.775926 0.955200
4 0.686405 0.544421
The method .transpose() converts columns to rows and rows to columns, hence you could even write
df.transpose().iloc[3]
Most of the answers show how to take a range of columns starting from an index. But there may be scenarios where you need to pick columns from in between, or at specific positions, in which case you can use the solution below.
Say that you have columns A, B and C. If you need to select only columns A and C, you can use the code below.
df = df.iloc[:, [0,2]]
where 0,2 specifies that you need to select only the 1st and 3rd columns.
You can use the method take. For example, to select first and last columns:
df.take([0, -1], axis=1)
I have a df with many columns. I would like to group by id and transform a subset of those columns leaving the rest untouched. What is the optimal way to do this? In particular, I have a df with a bunch of id's and I would like to z-score columns a and b within each id. Column c should remain untouched. In my actual problem I have many more columns.
The best I can think of is passing a dict of {col_name: function_name} to transform. For some reason this raises a TypeError.
MWE:
import pandas as pd
import numpy as np
np.random.seed(123) #reproducible ex
df = pd.DataFrame(data = {"a": np.arange(10), "b": np.arange(10)[::-1], "c": np.random.choice(a = np.arange(10), size = 10)}, index = pd.Index(data = np.random.choice(a = [1,2,3], size = 10), name = "id"))
#create a dict for all columns other than "c" and the function to do the transform
fmap = {k: lambda x: (x - x.mean()) / x.std() for k in df.columns if k != "c"}
df.groupby("id").transform(fmap) #yields error that "dict" is unhashable
Turns out this is a known bug: https://github.com/pandas-dev/pandas/issues/17309
One possible solution is to filter the column names first with difference, because a dict does not work with transform yet:
cols = df.columns.difference(['c'])
print (cols)
Index(['a', 'b'], dtype='object')
fmap = lambda x: (x - x.mean()) / x.std()
df[cols] = df.groupby("id")[cols].transform(fmap)
print (df)
a b c
id
3 -1.000000 1.000000 2
2 -1.091089 1.091089 2
1 -1.134975 1.134975 6
3 0.000000 0.000000 1
1 -0.529655 0.529655 3
2 0.218218 -0.218218 9
3 1.000000 -1.000000 6
2 0.872872 -0.872872 1
1 0.680985 -0.680985 0
1 0.983645 -0.983645 1