Get indices of columns satisfying multiple conditions in new column with pandas - python

With the following dataframe as an example :
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7]})
I would like to add a new column df["indices"] containing the labels of columns df["A"] and/or df["C"], provided the values satisfy two conditions:
The value must be greater than 5
df["A"]/df["Base"] or df["C"]/df["Base"] must be greater than or equal to 1
The resulting dataframe would be:
df = pd.DataFrame({'Sample':['X', 'Y', 'Z'], 'Base':[2, 10, 3], 'A':[0,5,100], 'C':[0,10,7], 'indices': ['','C','A,C']})
I can get True or False values for my first condition with df[['A','C']] > 5, but I cannot get condition 2 to work, since it is based on another column of my dataframe. Getting the indices where the result is True into a new column is yet another story. I imagine something with apply and get_loc or index, but I cannot get it to work no matter what I try.

Let's create a boolean mask satisfying the two given conditions, then use DataFrame.dot on this mask to get the indices:
m = df[['A', 'C']].gt(5) & df[['A', 'C']].div(df['Base'], axis=0).ge(1)
df['indices'] = m.dot(m.columns + ',').str.rstrip(',')
  Sample  Base    A   C indices
0      X     2    0   0
1      Y    10    5  10       C
2      Z     3  100   7     A,C
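How the dot trick works: matrix-multiplying the boolean mask by the string labels m.columns + ',' concatenates a column label wherever the mask is True, and str.rstrip removes the trailing comma. Printing the intermediate mask makes this concrete:
print(m)
       A      C
0  False  False
1  False   True
2   True   True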

You can use df.loc to assign values back to the column when any number of conditions are met. A simple approach is three such assignments, one per combination of conditions. You could also chain together np.where to achieve the same thing; a single np.select is sketched after the output below.
import pandas as pd

df = pd.DataFrame({'Sample': ['X', 'Y', 'Z'],
                   'Base': [2, 10, 3],
                   'A': [0, 5, 100],
                   'C': [0, 10, 7]})

# each column qualifies when its value exceeds 5 and its ratio to Base is >= 1
m_a = (df['A'] > 5) & (df['A'] / df['Base'] >= 1)
m_c = (df['C'] > 5) & (df['C'] / df['Base'] >= 1)

df.loc[m_a & m_c, 'indices'] = 'A,C'
df.loc[m_a & ~m_c, 'indices'] = 'A'
df.loc[~m_a & m_c, 'indices'] = 'C'
Output

  Sample  Base    A   C indices
0      X     2    0   0     NaN
1      Y    10    5  10       C
2      Z     3  100   7     A,C
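The chained np.where alluded to above collapses into a single np.select, which picks the first matching condition; using '' as the default also avoids the NaN in the first row (a sketch reusing the m_a and m_c masks defined above):
import numpy as np
df['indices'] = np.select([m_a & m_c, m_a, m_c], ['A,C', 'A', 'C'], default='')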

Related

Assign multiple columns different values based on conditions in Pandas dataframe

I have a dataframe where new columns need to be added based on conditions on existing column values, and I am looking for an efficient way of doing this.
For Ex:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})
I need to create 2 new columns ['c','d'] based on the following conditions:
if df['b'] == 'x':
    df['c'] = df['s']
    df['d'] = df['r']
elif df['b'] == 'y':
    # assign different values to the c, d columns
We can use numpy.where and apply conditions to the new columns like:
df['c'] = np.where(condition, value)
df['d'] = np.where(condition, value)
But I am looking for a way to do this in a single statement, without a for loop or multiple numpy/pandas apply calls.
The exact output is unclear, but you can use numpy.where with 2D data.
For example:
cols = ['c', 'd']
df[cols] = np.where(df['b'].eq('x').to_numpy()[:, None],
                    df[['s', 'r']], np.nan)
output:
   a  b      s    r      c    d
0  1  x  proda  oz1  proda  oz1
1  2  y  prodb  0z2    NaN  NaN
2  3  x  prodc  oz3  prodc  oz3
If you want multiple conditions, use np.select:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq('x').to_numpy()[:, None],
                      df['b'].eq('y').to_numpy()[:, None]],
                     [df[['s', 'r']],
                      df[['r', 'a']]],
                     np.nan)
It is, however, easier to use a loop over the conditions if you have many:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq(c).to_numpy()[:, None] for c in ['x', 'y']],
                     [df[repl] for repl in (['s', 'r'], ['r', 'a'])],
                     np.nan)
output:
   a  b      s    r      c    d
0  1  x  proda  oz1  proda  oz1
1  2  y  prodb  0z2    0z2    2
2  3  x  prodc  oz3  prodc  oz3
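Why the [:, None]? np.select (and np.where) broadcasts each condition against the two target columns, so the flat boolean mask of shape (3,) has to become a (3, 1) column vector first:
df['b'].eq('x').to_numpy().shape           # (3,)
df['b'].eq('x').to_numpy()[:, None].shape  # (3, 1), broadcasts across both columns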

Pandas dataframe selecting with index and condition on a column

I have been trying for a while to solve this problem:
I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]), columns=['a', 'b', 'c'])
j = [0, 2]
But when I try to select just part of it, filtering by a list of indices and a condition on a column, I get an error:
df[df.loc[j]['a']=='A']
Something is wrong, but I don't see what the problem is here. Can you help me?
This is the error message:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
The boolean mask is built from the filtered DataFrame but used to index the original, so the indices differ and the error is raised.
You need to compare within the filtered DataFrame:
df1 = df.loc[j]
print(df1)
   a  b  c
0  A  2  3
2  C  8  9

out = df1[df1['a']=='A']
print(out)
   a  b  c
0  A  2  3
Your original approach is possible if you reindex the filtered mask back to the original index with Series.reindex:
out = df[(df.loc[j, 'a']=='A').reindex(df.index, fill_value=False)]
print(out)
   a  b  c
0  A  2  3
Or a nicer solution:
out = df[(df['a'] == 'A') & (df.index.isin(j))]
print(out)
   a  b  c
0  A  2  3
A boolean indexer and the dataframe should be the same length. Here your df has length 3, but the boolean array df.loc[j]['a']=='A' has length 2.
You should do:
>>> df.loc[j][df.loc[j]['a']=='A']
   a  b  c
0  A  2  3

How to conditionally change pandas DataFrame values into f-strings?

I have a pandas DataFrame whose values I want to conditionally change into strings without looping over every value.
Example input:
In [1]: df = pd.DataFrame(data = [[1,2], [4,5]], columns = ['a', 'b'])
Out[2]:
   a  b
0  1  2
1  4  5
This is my best attempt, which doesn't work properly:
df['a'] = np.where(df['a'] < 3, f'string-{df["a"]}', df['a'])
In [1]: df
Out[2]:
                                               a  b
0  string-0    1\n1    4\nName: a, dtype: int64  2
1                                              4  5
Desired output:
Out[2]:
          a  b
0  string-1  2
1         4  5
I am using np.where() since looping is not feasible due to the size of the actual DataFrame. The actual f-string I am using is also more complex and has two variables that include column names, but the problem is the same.
Are there other ways to conditionally change pandas values into f-strings without looping over each value?
You can use .map() together with f-string, as follows:
df['a'] = df['a'].map(lambda x: f'string-{x}' if x < 3 else x)
Alternatively, you can also use .loc together with string concatenation, as follows:
df.loc[df['a'] < 3, 'a'] = 'string-' + df['a'].astype(str)
# or
df['a'] = np.where(df['a'] < 3, 'string-' + df['a'].astype(str), df['a'])
Result:
print(df)
          a  b
0  string-1  2
1         4  5
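The question notes that the real format string draws on two columns. The same .loc pattern extends with ordinary string concatenation; a minimal sketch, assuming a hypothetical second column 'b' feeds the template:
mask = df['a'] < 3
df.loc[mask, 'a'] = ('string-' + df.loc[mask, 'a'].astype(str)
                     + '-' + df.loc[mask, 'b'].astype(str))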

Getting all other columns based on a value Pandas Dataframe

Let's say I have the following df:
   Name    A    B    C    D
   John  NaN    1    2  NaN
   Mike    2  NaN  NaN  NaN
   Fred  NaN    5    6    7
    Ana    3  NaN    3    2
   Fran    2  NaN    1    1
What I want to do is filter on some columns, so I want everyone who has only column A filled (in this case, Mike):
df_1 = df[(df['A'] > 0)&(~(df['A'] == 0))]
or I want only two columns filled (in this case, none):
df_1 = df[(df['A','B'] > 0)&(~(df['A','B'] == 0))]
I am really struggling with this.
Thanks
isnull + all
Your syntax is incorrect. You can use pd.DataFrame.isnull:
mask1 = df['A'] > 0
mask2 = df[['B', 'C', 'D']].isnull().all(1)
df_1 = df[mask1 & mask2]
Similarly, for your second query:
mask1 = (df[['A', 'B']] > 0).all(1)
mask2 = df[['C', 'D']].isnull().all(1)
df_1 = df[mask1 & mask2]
This assumes you wish to filter explicitly for values greater than 0 in mask1. If any non-null number suffices, you can use pd.DataFrame.notnull.
Don't be afraid to split your masks across multiple lines in this way. It will make your code clearer and easier to manage.
pipe + isnull + all
More generically, you can write a function to calculate and apply your Boolean series mask:
def masker(df, cols_required):
    """Supply list cols_required. These must be > 0; all others must be null."""
    mask1 = (df[cols_required] > 0).all(1)
    mask2 = df[df.columns.difference(cols_required)].isnull().all(1)
    return df[mask1 & mask2]
df = df.pipe(masker, cols_required=['A', 'B'])

Element-wise average and standard deviation across multiple dataframes

Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
A B C
0 -1.624722 -1.160731 0.016726
1 -1.565694 0.989333 1.040820
2 -0.484945 0.718596 -0.180779
3 0.388798 -0.997036 1.211787
4 -0.249211 1.604280 -1.100980
5 0.062425 0.925813 -1.810696
6 0.793244 -1.860442 -1.196797
A B C
0 1.016386 1.766780 0.648333
1 -1.101329 -1.021171 0.830281
2 -1.133889 -2.793579 0.839298
3 1.134425 0.611480 -1.482724
4 -0.066601 -2.123353 1.136564
5 -0.167580 -0.991550 0.660508
6 0.528789 -0.483008 1.472787
You can create a panel of your DataFrames and then compute the mean and SD along the items axis:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
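Note that pd.Panel was deprecated in pandas 0.20 and removed in pandas 1.0, so the above no longer runs on current versions. A minimal modern equivalent, assuming the frames share the same shape, index, and columns, stacks the values into a 3-D NumPy array:
import numpy as np
import pandas as pd

dfs = [df1, df2, df3]
arr = np.stack([d.to_numpy() for d in dfs])  # shape: (n_frames, n_rows, n_cols)

mean_df = pd.DataFrame(arr.mean(axis=0), index=df1.index, columns=df1.columns)
# ddof=1 matches pandas' sample standard deviation; NumPy defaults to ddof=0
std_df = pd.DataFrame(arr.std(axis=0, ddof=1), index=df1.index, columns=df1.columns)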
One simple solution here is to concatenate the existing dataframes into a single dataframe while adding an ID variable to track the original source:
dfa = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='a')
dfb = pd.DataFrame(np.random.randn(2, 2), columns=['a', 'b']).assign(id='b')
df = pd.concat([dfa, dfb])
          a         b id
0 -0.542652  1.609213  a
1 -0.192136  0.458564  a
0 -0.231949 -0.000573  b
1  0.245715 -0.083786  b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do the same things you would to any single dataframe, e.g. df[ df['id'] == 'a' ].
But now you can also use groupby to apply any pandas method such as mean() or std() on an element-by-element basis, by grouping on the shared row index rather than on 'id':
df.groupby(level=0).mean(numeric_only=True)
          a         b
0  0.198164 -0.811475
1  0.639529  0.812810
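The same grouping gives the element-wise standard deviation:
df.groupby(level=0).std(numeric_only=True)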
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2) / 2
Or, if you have more than two dataframes, say a list dataframes of length n, then:
total = dataframes[0]
for other in dataframes[1:]:
    total = total + other
average_data_frame = total / n
Once you have the average, you can go for the standard deviation. If you are looking for a "true Pythonic" approach, you should follow other answers. But if you are looking for a working and quick solution, this is it.
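Following the same hand-rolled approach, the standard deviation falls out of the averaged squared deviations; a sketch under the same assumption of a dataframes list of length n (population standard deviation, i.e. ddof=0):
mean_df = sum(dataframes) / n
std_df = (sum((d - mean_df) ** 2 for d in dataframes) / n) ** 0.5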
