I need to make a column in my pandas dataframe that relies on other items in that same row. For example, here's my dataframe.
df = pd.DataFrame(
[['a',],['a',1],['a',1],['a',2],['b',2],['b',2],['c',3]],
columns=['letter','number']
)
  letter  number
0      a       1
1      a       1
2      a       1
3      a       2
4      b       2
5      b       2
6      c       3
I need a third column that is 1 if 'a' and 2 are both present in the row, and 0 otherwise. So it would be `[0,0,0,1,0,0,0]`.
How can I use Pandas `apply` or `map` to do this? Iterating over the rows is my first thought, but this seems like a clumsy way of doing it.
You can use apply with axis=1. Suppose you wanted to call your new column c:
df['c'] = df.apply(
lambda row: (row['letter'] == 'a') and (row['number'] == 2),
axis=1
).astype(int)
print(df)
# letter number c
#0 a NaN 0
#1 a 1.0 0
#2 a 1.0 0
#3 a 2.0 1
#4 b 2.0 0
#5 b 2.0 0
#6 c 3.0 0
But apply is slow and should be avoided if possible. In this case, it would be much better to use boolean logic operations, which are vectorized.
df['c'] = ((df['letter'] == "a") & (df['number'] == 2)).astype(int)
This has the same result as using apply above.
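As a quick sanity check, here is a minimal sketch comparing the two approaches. It assumes a cleaned-up version of the example frame with no missing values, so the `number` column stays integer:

```python
import pandas as pd

# Assumed example data: same shape as the question, but with no missing value.
df = pd.DataFrame(
    [['a', 1], ['a', 1], ['a', 1], ['a', 2], ['b', 2], ['b', 2], ['c', 3]],
    columns=['letter', 'number'],
)

# Row-wise version: calls the lambda once per row.
via_apply = df.apply(
    lambda row: (row['letter'] == 'a') and (row['number'] == 2),
    axis=1,
).astype(int)

# Vectorized version: two column comparisons combined with &.
via_mask = ((df['letter'] == 'a') & (df['number'] == 2)).astype(int)

print(via_mask.tolist())
```

Both Series should hold the same values; the vectorized one avoids the Python-level loop over rows.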
You can try pd.Series.where() or np.where(). If you are only interested in the int representation of the boolean values, you can pick the other solution. If you want more freedom in choosing the if/else values, you can use np.where().
import pandas as pd
import numpy as np
# create example
values = ['a', 'b', 'c']
df = pd.DataFrame()
df['letter'] = np.random.choice(values, size=10)
df['number'] = np.random.randint(1,3, size=10)
# condition
df['result'] = np.where((df['letter'] == 'a') & (df['number'] == 2), 1, 0)
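To illustrate the "more freedom" part, here is a small sketch where the two branches are strings instead of 1 and 0. The column name `label` and the values `'match'` / `'other'` are made up for the example:

```python
import numpy as np
import pandas as pd

# Small fixed frame so the result is reproducible.
df = pd.DataFrame({'letter': ['a', 'a', 'b'], 'number': [2, 1, 2]})

# np.where(condition, value_if_true, value_if_false) works element-wise.
df['label'] = np.where(
    (df['letter'] == 'a') & (df['number'] == 2),
    'match',   # hypothetical value for rows that satisfy the condition
    'other',   # hypothetical value for all remaining rows
)
print(df)
```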
I just wanted to ask the community and see if there is a more efficient way to do this.
I have several rows in a data frame and I am using .loc to filter values in row A so I can perform calculations on row B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on row B that I need.
But there are many conditions I must go through, so I was wondering if I could make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
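For reference, one way this idea could look is sketched below. The frame, the column names `Condition` and `B`, and the condition names are all assumptions made up for illustration; each condition is stored as a boolean mask and passed to .loc in a loop:

```python
import pandas as pd

# Hypothetical data standing in for the real frame.
df = pd.DataFrame({'Condition': [1, 2, 1, 3], 'B': [10, 20, 30, 40]})

# Each entry maps a (made-up) name to a boolean mask over the rows.
conditions = {
    'equals_1': df['Condition'] == 1,
    'greater_than_1': df['Condition'] > 1,
}

# Apply the same calculation (here a sum of B) to every filtered subset.
results = {name: df.loc[mask, 'B'].sum() for name, mask in conditions.items()}
print(results)
```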
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I change the iteration so it shows the results only for the unique values in column 'a'?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a,b):
    list_1.append([i,j])
df1 = pd.DataFrame(list_1, columns= col)
for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print (aa1)
Solution using set
set_a = set(a)
for i in set_a:
    # df1 is the dataframe built in the question
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print (aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
   a   b
0  1   5
1  2   1
2  1   3
3  2   5
4  1   7
5  2  20
6  1   9
7  2   5
8  1   8
9  2   4
The mean of column b for each value of column a is:
     b
a
1  6.4
2  7.0
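If only the bare numbers are wanted (6.4 and 7.0, as in the question), a minimal sketch is to select column `b` after grouping, which gives a Series that can be iterated. The variable name `means` is just for illustration:

```python
import pandas as pd

a = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]
b = [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]
df = pd.DataFrame({'a': a, 'b': b})

# Selecting 'b' after groupby yields a Series indexed by the unique values of 'a'.
means = df.groupby('a')['b'].mean()

# Print one mean per unique value of 'a'.
for key, value in means.items():
    print(key, value)
```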
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
I want to filter a pandas dataframe by a function along the index. I can't seem to find a built-in way of performing this action.
So essentially, I have a function that through some arbitrarily complicated means determines whether a particular index should be included, I'll call it filter_func for this example. I wish to apply exactly what the below code does, but to the index:
new_index = filter(filter_func, df.index)
And only include the values that the filter_func allows. The index could also be any type.
This is a pretty important factor of data manipulation, so I imagine there's a built-in way of doing this action.
ETA:
I found that indexing the dataframe by a list of booleans will do what I want, but still requires double the space of the index in order to apply the filter. So my question still remains if there's a built-in way of doing this that does not require twice the space.
Here's an example:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]})
def filter_func(ind, n=0):
    if n > 200: return False
    if ind % 79 == 0: return True
    return filter_func(ind+ind-1, n+1)
new_index = filter(filter_func, df.index)
And I want to do this:
mask = []
for i in df.index:
    mask.append(filter_func(i))
df = df[mask]
But in a way that doesn't take twice the space of the index to do so
You can use map instead of filter and then do a boolean indexing:
df.loc[map(filter_func,df.index)]
value
0 12
4 6
7 2
8 35
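On the memory concern, one possible variation is to build the mask as a NumPy boolean array, which stores one byte per element. This is only a sketch: `keep_label` is a hypothetical stand-in for filter_func, chosen to be simple enough to check by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [12, 34, 2, 23, 6, 23, 7, 2, 35, 657, 1, 324]})

def keep_label(label):
    # Hypothetical predicate standing in for filter_func: keep every third label.
    return label % 3 == 0

# np.fromiter consumes the map lazily and fills a compact boolean array.
mask = np.fromiter(map(keep_label, df.index), dtype=bool, count=len(df))
filtered = df[mask]
print(filtered)
```

Whether this is enough of a saving depends on the index type; the mask still has one entry per row.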
Have you tried using df.apply?
>>> df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df[df.apply(lambda x: x['c']%2 == 0, axis = 1)]
a b c
0 0 1 2
2 6 7 8
You can customize the lambda function in any way you want, let me know if this isn't what you're looking for.
If you want to avoid referencing df explicitly inside the filtering condition, you can use the following:
import pandas as pd
df = pd.DataFrame({"value":[12,34,2,23,6,23,7,2,35,657,1,324]}, dtype=object)
df.apply(lambda x: x if filter_func(x.name) else None, axis=1, result_type='broadcast').dropna()
Imagine that I have a Dataframe and the columns are [A,B,C]. There are some different values for each of these columns. And I want to produce one more column D which can be received with the following function:
def produce_column(i):
    # Extract current row by index
    raw = df.loc[i]
    # Extract previous 3 values for the same sub-df which are before i
    df_same = df[
        (df['A'] == raw.A)
        & (df['B'] == raw.B)
    ].loc[:i].tail(3)
    # Check that we have enough values
    if df_same.shape[0] != 3:
        return False
    # Doesn't matter which function is in use, I just need to apply it on the column / columns
    diffs = df_same['C'].map(lambda x: x <= 10 and x > 0)
    return all(diffs)
df['D'] = df.index.map(lambda x: produce_column(x))
So on each step, I need to get the sub-Dataframe that has the same set of properties as the current row and perform some operations on its columns. I have a few hundred thousand rows, so this code takes a long time to run. I think that a good idea is to vectorize the operation, but I don't know how to do that. Maybe there's another way to do this?
Thanks in advance!
UPD Here's an example
df = pd.DataFrame([(1,2,3), (4,5,6), (7,8,9)], columns=['A','B','C'])
A B C
0 1 2 3
1 4 5 6
2 7 8 9
df['D'] = df.index.map(lambda x: produce_column(x))
A B C D
0 1 2 3 True
1 4 5 6 True
2 7 8 9 False
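One possible direction, sketched under some assumptions: the index is sorted so that `.loc[:i]` follows row order, the window includes the current row (as `.loc[:i]` does), and the check is the `0 < C <= 10` test from produce_column. The idea is to compute the per-row test once, then take a rolling sum of 3 inside each (A, B) group and compare it with 3. The sample data below is made up.

```python
import pandas as pd

# Made-up data: two (A, B) groups with interleaved rows.
df = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 1],
    "B": [1, 1, 1, 1, 1, 1, 1, 1],
    "C": [5, 3, 10, 11, 1, 2, 3, 4],
})

# Per-row check, computed once for the whole column.
in_range = (df["C"] > 0) & (df["C"] <= 10)

# Rolling count of passing rows over the last 3 rows of each (A, B) group.
# Windows with fewer than 3 rows give NaN, which compares unequal to 3.
hits = (
    in_range.astype(int)
    .groupby([df["A"], df["B"]])
    .transform(lambda s: s.rolling(3).sum())
)
df["D"] = hits.eq(3)
print(df)
```

This replaces the per-row filtering of the whole frame with one pass per group. A different check on the window would need a different rolling aggregation.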
I want to sum up all values that I select based on some function of column and row.
Another way of putting it is that I want to use a function of the row index and column index to determine if a value should be included in a sum along an axis.
Is there an easy way of doing this?
Columns can be selected using the syntax dataframe[<list of columns>]. The index (row labels) can be used for filtering through the dataframe.index attribute.
import pandas as pd
df = pd.DataFrame({'a': [0.1, 0.2], 'b': [0.2, 0.1]})
odd_a = df['a'][df.index % 2 == 1]
even_b = df['b'][df.index % 2 == 0]
# odd_a:
# 1 0.2
# Name: a, dtype: float64
# even_b:
# 0 0.2
# Name: b, dtype: float64
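If the condition depends on the row position and the column position together, one option is to build a 2-D boolean mask with NumPy broadcasting and sum only the selected cells. This is a sketch; the rule `(row + col) % 2 == 0` is an arbitrary example, not something from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Column vector of row positions and row vector of column positions.
rows = np.arange(df.shape[0])[:, None]
cols = np.arange(df.shape[1])[None, :]

# Arbitrary example rule: keep cells where row + column position is even.
mask = (rows + cols) % 2 == 0

# Cells where the mask is False become NaN and are skipped by sum().
per_column = df.where(mask).sum()
total = per_column.sum()
print(per_column, total)
```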
If df is your dataframe :
In [477]: df
Out[477]:
A s2 B
0 1 5 5
1 2 3 5
2 4 5 5
You can access the odd rows like this :
In [478]: df.loc[1::2]
Out[478]:
A s2 B
1 2 3 5
and the even ones like this:
In [479]: df.loc[::2]
Out[479]:
A s2 B
0 1 5 5
2 4 5 5
To answer your question, getting even rows and column B would be :
In [480]: df.loc[::2,'B']
Out[480]:
0 5
2 5
Name: B, dtype: int64
and odd rows and column A can be done as:
In [481]: df.loc[1::2,'A']
Out[481]:
1 2
Name: A, dtype: int64
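Note that `.loc` slices by label, so the examples above rely on the default integer index. When the index is something else, a position-based sketch with `.iloc` may be closer to "odd and even rows". The string index below is made up to show the difference:

```python
import pandas as pd

# Made-up frame with a non-integer index.
df = pd.DataFrame({"A": [1, 2, 4], "B": [5, 5, 5]}, index=["x", "y", "z"])

# .iloc works on positions, so ::2 means rows 0, 2, ... regardless of labels.
even_b = df.iloc[::2, df.columns.get_loc("B")]
odd_a = df.iloc[1::2, df.columns.get_loc("A")]

print(even_b.sum(), odd_a.sum())
```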
I think this should be fairly general if not the cleanest implementation. This should allow applying separate functions for rows and columns depending on conditions (that I defined here in dictionaries).
import numpy as np
import pandas as pd
ran = np.random.randint(0,10,size=(5,5))
df = pd.DataFrame(ran,columns = ["a","b","c","d","e"])
# A dictionary to define what function is passed
d_col = {"high":["a","c","e"], "low":["b","d"]}
d_row = {"high":[1,2,3], "low":[0,4]}
# Generate list of Pandas boolean Series
i_col = [df[i].apply(lambda x: x>5) if i in d_col["high"] else df[i].apply(lambda x: x<5) for i in df.columns]
# Pass the series as a matrix
df = df[pd.concat(i_col,axis=1)]
# Now do this again for rows
i_row = [df.T[i].apply(lambda x: x>5) if i in d_row["high"] else df.T[i].apply(lambda x: x<5) for i in df.T.columns]
# Return back the DataFrame in original shape
df = df.T[pd.concat(i_row,axis=1)].T
# Perform the final operation such as sum on the returned DataFrame
print(df.sum().sum())
I have the following pd.DataFrame:
Name 0 1 ...
Col A B A B ...
0 0.409511 -0.537108 -0.355529 0.212134 ...
1 -0.332276 -1.087013 0.083684 0.529002 ...
2 1.138159 -0.327212 0.570834 2.337718 ...
It has MultiIndex columns with names=['Name', 'Col'] and hierarchical levels. The Name label goes from 0 to n, and for each label, there are two A and B columns.
I would like to subselect all the A (or B) columns of this DataFrame.
There is a get_level_values method that you can use in conjunction with boolean indexing to get the intended result.
In [13]:
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
print(df)
1 2
A B A B
0 0.543980 0.628078 0.756941 0.698824
1 0.633005 0.089604 0.198510 0.783556
2 0.662391 0.541182 0.544060 0.059381
3 0.841242 0.634603 0.815334 0.848120
In [14]:
print(df.iloc[:, df.columns.get_level_values(1)=='A'])
1 2
A A
0 0.543980 0.756941
1 0.633005 0.198510
2 0.662391 0.544060
3 0.841242 0.815334
Method 1:
df.xs('A', level='Col', axis=1)
For more, refer to http://pandas.pydata.org/pandas-docs/stable/advanced.html#cross-section
Method 2:
df.loc[:, (slice(None), 'A')]
Caveat: this method requires the labels to be sorted. For more, refer to http://pandas.pydata.org/pandas-docs/stable/advanced.html#the-need-for-sortedness-with-multiindex
EDIT*
The best way now is to use pd.IndexSlice for MultiIndex selections:
idx = pd.IndexSlice
A = df.loc[:,idx[:,'A']]
B = df.loc[:,idx[:,'B']]
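Here is a small self-contained sketch that puts the approaches side by side. The frame is made up, with the level names `Name` and `Col` taken from the question. One difference worth noticing: `xs` drops the selected level by default, while the other two keep the MultiIndex.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same column layout as in the question.
cols = pd.MultiIndex.from_product([[0, 1], ["A", "B"]], names=["Name", "Col"])
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=cols)

idx = pd.IndexSlice
by_slice = df.loc[:, idx[:, "A"]]                               # keeps both levels
by_mask = df.loc[:, df.columns.get_level_values("Col") == "A"]  # keeps both levels
by_xs = df.xs("A", level="Col", axis=1)                         # drops the 'Col' level

print(by_slice.columns.tolist())
print(by_xs.columns.tolist())
```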