Replacing a column with a function of itself in Pandas? - python

I'm currently lost deep inside the pandas documentation. My problem is this:
I have a simple dataframe
col1 col2
1 A
4 B
5 X
My aim is to apply something like:
df['col1'] = df['col1'].apply(square)
where square is a cleanly defined function.
But this operation throws an error warning (and produces incorrect results)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I can't make sense of this nor the documentation it points to. My workflow is linear (in case this makes a wider range of solutions viable).
Pandas 0.17.1 and Python 2.7
All help much appreciated.

it works properly for me (pandas 0.18.1):
In [31]: def square(x):
....: return x ** 2
....:
In [33]: df
Out[33]:
col1 col2
0 1 A
1 4 B
2 5 X
In [35]: df.col1 = df.col1.apply(square)
In [36]: df
Out[36]:
col1 col2
0 1 A
1 16 B
2 25 X
PS it also might depend on the implementation of your function...

You can use .loc command to get rid of the warning
df.loc[:'col1'] = df['col1'].apply(square)

Related

Filter pandas index by function

I want to filter a pandas dataframe by a function along the index. I can't seem to find a built-in way of performing this action.
So essentially, I have a function that through some arbitrarily complicated means determines whether a particular index should be included, I'll call it filter_func for this example. I wish to apply exactly what the below code does, but to the index:
new_index = filter(filter_func, df.index)
And only include the values that the filter_func allows. The index could also be any type.
This is a pretty important factor of data manipulation, so I imagine there's a built-in way of doing this action.
ETA:
I found that indexing the dataframe by a list of booleans will do what I want, but still requires double the space of the index in order to apply the filter. So my question still remains if there's a built-in way of doing this that does not require twice the space.
Here's an example:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]})
def filter_func(ind, n=0):
if n > 200: return False
if ind % 79 == 0: return True
return filter_func(ind+ind-1, n+1)
new_index = filter(filter_func, df)
And I want to do this:
mask = []
for i in df.index:
mask.append(filter_func(i))
df = df[mask]
But in a way that doesn't take twice the space of the index to do so
You can use map instead of filter and then do a boolean indexing:
df.loc[map(filter_func,df.index)]
value
0 12
4 6
7 2
8 35
Have you tried using df.apply?
>>> df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df[df.apply(lambda x: x['c']%2 == 0, axis = 1)]
a b c
0 0 1 2
2 6 7 8
You can customize the lambda function in any way you want, let me know if this isn't what you're looking for.
If you want to avoid referencing df explicitly inside the filtering condition, you can use the following:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]}, dtype=object)
df.apply(lambda x: x if filter_func(x.name) else None, axis=1, result_type='broadcast').dropna()

Using looping through data frames and objects [duplicate]

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.
The filters should be additive (aka each one applied should narrow results).
I'm currently using reindex() (as below) but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying as it will be really inefficient when filtering a big Series or DataFrame.
I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything.
Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for a Series can be easily expanded to a DataFrame.
TL;DR
I want to take a dictionary of the following form and apply each operation to a given Series object and return a 'filtered' Series object.
relops = {'>=': [1], '<=': [1]}
Long Example
I'll start with an example of what I have currently and just filtering a single Series object. Below is the function I'm currently using:
def apply_relops(series, relops):
"""
Pass dictionary of relational operators to perform on given series object
"""
for op, vals in relops.iteritems():
op_func = ops[op]
for val in vals:
filtered = op_func(series, val)
series = series.reindex(series[filtered])
return series
The user provides a dictionary with the operations they want to perform:
>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
>>> print df
col1 col2
0 0 10
1 1 11
2 2 12
>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1 1
2 2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1 1
Name: col1
Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.
Pandas (and numpy) allow for boolean indexing, which will be much more efficient:
In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1 1
2 2
Name: col1
In [12]: df[df['col1'] >= 1]
Out[12]:
col1 col2
1 1 11
2 2 12
In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]:
col1 col2
1 1 11
If you want to write helper functions for this, consider something along these lines:
In [14]: def b(x, col, op, n):
return op(x[col],n)
In [15]: def f(x, *b):
return x[(np.logical_and(*b))]
In [16]: b1 = b(df, 'col1', ge, 1)
In [17]: b2 = b(df, 'col1', le, 1)
In [18]: f(df, b1, b2)
Out[18]:
col1 col2
1 1 11
Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):
In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
Chaining conditions creates long lines, which are discouraged by PEP8.
Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.
Once each of the filters is in place, one approach could be:
import numpy as np
import functools
def conjunction(*conditions):
return functools.reduce(np.logical_and, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[conjunction(c_1,c_2,c_3)]
np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.
Note that this still has some redundancies:
Shortcutting does not happen on a global level
Each of the individual conditions runs on the whole initial data
Still, I expect this to be efficient enough for many applications and it is very readable. You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:
import numpy as np
import functools
def disjunction(*conditions):
return functools.reduce(np.logical_or, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[disjunction(c_1,c_2,c_3)]
Simplest of All Solutions:
Use:
filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]
Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code
filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
Since pandas 0.22 update, comparison options are available like:
gt (greater than)
lt (less than)
eq (equals to)
ne (not equals to)
ge (greater than or equals to)
and many more. These functions return boolean array. Let's see how we can use them:
# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})
# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']
1 1
2 2
3 3
4 4
5 5
# where co11 values is between 0 and 2
df.loc[df['col1'].between(0,2)]
col1 col2
0 0 10
1 1 11
2 2 12
# where col1 > 1
df.loc[df['col1'].gt(1)]
col1 col2
2 2 12
3 3 13
4 4 14
5 5 15
Why not do this?
def filt_spec(df, col, val, op):
import operator
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec
Demo:
df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')
Result:
a b
1 2 4
2 3 3
3 4 2
4 5 1
You can see that column 'a' has been filtered where a >=2.
This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.
e can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front.
For example
list = [1, 0]
df[df.col1.isin(list)]
If you want to check any/all of multiple columns for a value, you can do:
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]

Reorder DataFrame rows inplace

Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Where idx is not equal to but is the same type of df.index, it contains the same elements, but in another order. The output is df reordered before new idx.
I have not found anything in documentation and I fail to use reindex_axis. Which made me hoping it was possible because:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.
Try using the reindex function (note that this is not inplace):
>>import pandas as pd
>>df = pd.DataFrame({'col1':[1,2,3],'col2':['test','hi','hello']})
>>df
col1 col2
0 1 test
1 2 hi
2 3 hello
>>df = df.reindex([2,0,1])
>>df
col1 col2
2 3 hello
0 1 test
1 2 hi

Changing values in a dataframe column based off a different column (python)

Col1 Col2
0 APT UB0
1 AK0 UUP
2 IL2 PB2
3 OIU U5B
4 K29 AAA
My data frame looks similar to the above data. I'm trying to change the values in Col1 if the corresponding values in Col2 have the letter "B" in it. If the value in Col2 has "B", then I want to add "-B" to the end of the value in Col1.
Ultimately I want Col1 to look like this:
Col1
0 APT-B
1 AK0
2 IL2-B
.. ...
I have an idea of how to approach it... but I'm somewhat confused because I know my code is incorrect. In addition there are NaN values in my actual code for Col1... which will definitely give an error when I'm trying to do val += "-B" since it's not possible to add a string and a float.
for value in dataframe['Col2']:
if "Z" in value:
for val in dataframe['Col1']:
val += "-B"
Does anyone know how to fix/solve this?
Rather than using a loop, lets use pandas directly:
import pandas as pd
df = pd.DataFrame({'Col1': ['APT', 'AK0', 'IL2', 'OIU', 'K29'], 'Col2': ['UB0', 'UUP', 'PB2', 'U5B', 'AAA']})
df.loc[df.Col2.str.contains('B'), 'Col1'] += '-B'
print(df)
Output:
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA
You have too many "for" loops in your code. You just need to iterate over the rows once, and for any row satisfying your condition you make the change.
for idx, row in df.iterrows():
if 'B' in row['Col2']:
df.loc[idx, 'Col1'] = str(df.loc[idx, 'Col1']) + '-B'
edit: I used str to convert the previous value in Col1 to a string before appending, since you said you sometimes have non-string values there. If this doesn't work for you, please post your test data and results.
You can use a lambda expression. If 'B' is in Col2, then '-B' get appended to Col1. The end result is assigned back to Col1.
df['Col1'] = df.apply(lambda x: x.Col1 + ('-B' if 'B' in x.Col2 else ''), axis=1)
>>> df
Col1 Col2
0 APT-B UB0
1 AK0 UUP
2 IL2-B PB2
3 OIU-B U5B
4 K29 AAA

Efficient way to apply multiple filters to pandas DataFrame or Series

I have a scenario where a user wants to apply several filters to a Pandas DataFrame or Series object. Essentially, I want to efficiently chain a bunch of filtering (comparison operations) together that are specified at run-time by the user.
The filters should be additive (aka each one applied should narrow results).
I'm currently using reindex() (as below) but this creates a new object each time and copies the underlying data (if I understand the documentation correctly). I want to avoid this unnecessary copying as it will be really inefficient when filtering a big Series or DataFrame.
I'm thinking that using apply(), map(), or something similar might be better. I'm pretty new to Pandas though so still trying to wrap my head around everything.
Also, I would like to expand this so that the dictionary passed in can include the columns to operate on and filter an entire DataFrame based on the input dictionary. However, I'm assuming whatever works for a Series can be easily expanded to a DataFrame.
TL;DR
I want to take a dictionary of the following form and apply each operation to a given Series object and return a 'filtered' Series object.
relops = {'>=': [1], '<=': [1]}
Long Example
I'll start with an example of what I have currently and just filtering a single Series object. Below is the function I'm currently using:
def apply_relops(series, relops):
"""
Pass dictionary of relational operators to perform on given series object
"""
for op, vals in relops.iteritems():
op_func = ops[op]
for val in vals:
filtered = op_func(series, val)
series = series.reindex(series[filtered])
return series
The user provides a dictionary with the operations they want to perform:
>>> df = pandas.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
>>> print df
>>> print df
col1 col2
0 0 10
1 1 11
2 2 12
>>> from operator import le, ge
>>> ops ={'>=': ge, '<=': le}
>>> apply_relops(df['col1'], {'>=': [1]})
col1
1 1
2 2
Name: col1
>>> apply_relops(df['col1'], relops = {'>=': [1], '<=': [1]})
col1
1 1
Name: col1
Again, the 'problem' with my above approach is that I think there is a lot of possibly unnecessary copying of the data for the in-between steps.
Pandas (and numpy) allow for boolean indexing, which will be much more efficient:
In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1 1
2 2
Name: col1
In [12]: df[df['col1'] >= 1]
Out[12]:
col1 col2
1 1 11
2 2 12
In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]:
col1 col2
1 1 11
If you want to write helper functions for this, consider something along these lines:
In [14]: def b(x, col, op, n):
return op(x[col],n)
In [15]: def f(x, *b):
return x[(np.logical_and(*b))]
In [16]: b1 = b(df, 'col1', ge, 1)
In [17]: b2 = b(df, 'col1', le, 1)
In [18]: f(df, b1, b2)
Out[18]:
col1 col2
1 1 11
Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):
In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
Chaining conditions creates long lines, which are discouraged by PEP8.
Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.
Once each of the filters is in place, one approach could be:
import numpy as np
import functools
def conjunction(*conditions):
return functools.reduce(np.logical_and, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[conjunction(c_1,c_2,c_3)]
np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.
Note that this still has some redundancies:
Shortcutting does not happen on a global level
Each of the individual conditions runs on the whole initial data
Still, I expect this to be efficient enough for many applications and it is very readable. You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:
import numpy as np
import functools
def disjunction(*conditions):
return functools.reduce(np.logical_or, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[disjunction(c_1,c_2,c_3)]
Simplest of All Solutions:
Use:
filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]
Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code
filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
Since pandas 0.22 update, comparison options are available like:
gt (greater than)
lt (less than)
eq (equals to)
ne (not equals to)
ge (greater than or equals to)
and many more. These functions return boolean array. Let's see how we can use them:
# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})
# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']
1 1
2 2
3 3
4 4
5 5
# where co11 values is between 0 and 2
df.loc[df['col1'].between(0,2)]
col1 col2
0 0 10
1 1 11
2 2 12
# where col1 > 1
df.loc[df['col1'].gt(1)]
col1 col2
2 2 12
3 3 13
4 4 14
5 5 15
Why not do this?
def filt_spec(df, col, val, op):
import operator
ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec
Demo:
df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')
Result:
a b
1 2 4
2 3 3
3 4 2
4 5 1
You can see that column 'a' has been filtered where a >=2.
This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.
e can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front.
For example
list = [1, 0]
df[df.col1.isin(list)]
If you want to check any/all of multiple columns for a value, you can do:
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]

Categories

Resources