pandas assign value based on mean - python

Let's say I have a dataframe column. I want to create a new column where the value for a given observation is 1 if the corresponding value in the old column is above average. But the value should be 0 if the value in the other column is average or below.
What's the fastest way of doing this?

Say you have the following DataFrame:
df = pd.DataFrame({'A': [1, 4, 6, 2, 8, 3, 7, 1, 5]})
df['A'].mean()
Out: 4.111111111111111
Comparison against the mean will get you a boolean vector. You can cast that to integer:
df['B'] = (df['A'] > df['A'].mean()).astype(int)
or use np.where:
df['B'] = np.where(df['A'] > df['A'].mean(), 1, 0)
df
Out:
A B
0 1 0
1 4 0
2 6 1
3 2 0
4 8 1
5 3 0
6 7 1
7 1 0
8 5 1

Related

Pandas compare columns and drop rows based on values in another column

Is there a way to drop values in one column based on comparison with another column? Assuming the columns are of equal length
For example, iterate through each row and drop values in col1 greater than values in col2? Something like this:
df['col1'].drop.where(df['col1']>=df['col2']
Pandas compare columns and drop rows based on values in another column
import pandas as pd
d = {
'1': [1, 2, 3, 4, 5],
'2': [2, 4, 1, 6, 3]
}
df = pd.DataFrame(d)
print(df)
dfd = df.drop(df[(df['1'] >= df['2'])].index)
print('update')
print(dfd)
Output
1 2
0 1 2
1 2 4
2 3 1
3 4 6
4 5 3
update
1 2
0 1 2
1 2 4
3 4 6

Compare two rows on a loop for on Pandas

I have the following dataframe where I want to determinate if the column A is greater than column B and if column C is greater of column B. In case it is smaller, I want to change that value for 0.
d = {'A': [6, 8, 10, 1, 3], 'B': [4, 9, 12, 0, 2], 'C': [3, 14, 11, 4, 9] }
df = pd.DataFrame(data=d)
df
I have tried this with the np.where and it is working:
df[B] = np.where(df[A] > df[B], 0, df[B])
df[C] = np.where(df[B] > df[C], 0, df[C])
However, I have a huge amount of columns and I want to know if there is any way to do this without writing each comparation separately. For example, a loop for.
Thanks
Solution with different ouput, because is compared original columns with DataFrame.diff and set less like 0 values to 0 by DataFrame.mask:
df1 = df.mask(df.diff(axis=1).lt(0), 0)
print (df1)
A B C
0 6 0 0
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
If use list comprehension with zip shifted columns names output is different, because is compared already assigned columns B, C...:
for a, b in zip(df.columns, df.columns[1:]):
df[b] = np.where(df[a] > df[b], 0, df[b])
print (df)
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9
To use a vectorial approach, you cannot simply use a diff as the condition depends on the previous value being replaced or not by 0. Thus two consecutive diff cannot happen.
You can achieve a correct vectorial replacement using a shifted mask:
m1 = df.diff(axis=1).lt(0) # check if < than previous
m2 = ~m1.shift(axis=1, fill_value=False) # and this didn't happen twice
df2 = df.mask(m1&m2, 0)
output:
A B C
0 6 0 3
1 8 9 14
2 10 12 0
3 1 0 4
4 3 0 9

Removing certain Rows from subset of df

I have a pandas dataframe. All the columns right of column#2 may only contain the value 0 or 1. If they contain a value that is NOT 0 or 1, I want to remove that entire row from the dataframe.
So I created a subset of the dataframe to only contain columns right of #2
Then I found the indices of the rows that had values other than 0 or 1 and deleted it from the original dataframe.
See code below please
#reading data file:
data=pd.read_csv('MyData.csv')
#all the columns right of column#2 may only contain the value 0 or 1. So "prod" is a subset of the data df containing these columns:
prod = data.iloc[:,2:]
index_prod = prod[ (prod!= 0) & (prod!= 1)].dropna().index
data = data.drop(index_prod)
However when I run this, the index_prod vector is empty and so does not drop anything at all.
okay so my friend just told me that the data is not numeric and he fixed it by making it numeric. Can anyone please advise how I can find that out? Because all the columns were numeric it seemed like to me. All numbers
You can check dtypes by DataFrame.dtypes.
print (data.dtypes)
Or:
print (data.columns.difference(data.select_dtypes(np.number).columns))
And then convert all values without first 2 to numeric:
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Or all columns:
data = data.apply(lambda x: pd.to_numeric(x, errors='coerce'))
And last apply solution:
subset = data.iloc[:,2:]
data1 = data[subset.isin([0,1]).all(axis=1)]
Let's say you have this dataframe:
data = {'A': [1, 2, 3, 4, 5], 'B': [0, 1, 4, 3, 1], 'C': [2, 1, 0, 3, 4]}
df = pd.DataFrame(data)
A B C
0 1 0 2
1 2 1 1
2 3 4 0
3 4 3 3
4 5 1 4
And you want to delete rows based on column B that don't contain 0 or 1, we could accomplish by:
subset = df.iloc[:,1:]
index = subset[ (subset!= 0) & (subset!= 1)].dropna().index
df.drop(index)
A B C
0 1 0 2
1 2 1 1
4 5 1 4
df.reset_index(drop=True)
A B C
0 1 0 2
1 2 1 1
2 5 1 4

Pandas merge duplicate DataFrame columns preserving column names

How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
df = pd.DataFrame({"col1" : [0, 0, 1, 2, 5, 3, 7],
"col2" : [0, 1, 2, 3, 3, 3, 4],
"col3" : [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column
col1 col2
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
How can I keep the information on which columns were merged? e.g. something like
[col1] [col2, col3]
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
Thanks!
# group columns by their values
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one column from each group of the columns
unique_df = df.loc[:, grouped_columns.str[0]]
# make a new column name for each group, don't think the list can work as a column name, you need to join them
unique_df.columns = grouped_columns.apply("-".join)
unique_df
I also used T and tuple to groupby
def f(x):
d = x.iloc[[0]]
d.index = ['-'.join(x.index.tolist())]
return d
df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T

Use a list to conditionally fill a new column based on values in multiple columns

I am trying to populate a new column within a pandas dataframe by using values from several columns. The original columns are either 0 or '1' with exactly a single 1 per series. The new column would correspond to df['A','B','C','D'] by populating new_col = [1, 3, 7, 10] as shown below. (A 1 at A means new_col = 1; if B=1,new_col = 3, etc.)
df
A B C D
1 1 0 0 0
2 0 0 1 0
3 0 0 0 1
4 0 1 0 0
The new df should look like this.
df
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3
I've tried to use map, loc, and where but can't seem to formulate an efficient way to get it done. Problem seems very close to this. A couple other posts I've looked at 1 2 3. None of these show how to use multiple columns conditionally to fill a new column based on a list.
I can think of a few ways, mostly involving argmax or idxmax, to get either an ndarray or a Series which we can use to fill the column.
We could drop down to numpy, find the maximum locations (where the 1s are) and use those to index into an array version of new_col:
In [148]: np.take(new_col,np.argmax(df.values,1))
Out[148]: array([ 1, 7, 10, 3])
We could make a Series with new_col as the values and the columns as the index, and index into that with idxmax:
In [116]: pd.Series(new_col, index=df.columns).loc[df.idxmax(1)].values
Out[116]: array([ 1, 7, 10, 3])
We could use get_indexer to turn the column idxmax results into integer offsets we can use with new_col:
In [117]: np.array(new_col)[df.columns.get_indexer(df.idxmax(axis=1))]
Out[117]: array([ 1, 7, 10, 3])
Or (and this seems very wasteful) we could make a new frame with the new columns and use idxmax directly:
In [118]: pd.DataFrame(df.values, columns=new_col).idxmax(1)
Out[118]:
0 1
1 7
2 10
3 3
dtype: int64
It's not the most elegant solution, but for me it beats the if/elif/elif loop:
d = {'A': 1, 'B': 3, 'C': 7, 'D': 10}
def new_col(row):
k = row[row == 1].index.tolist()[0]
return d[k]
df['new_col'] = df.apply(new_col, axis=1)
Output:
A B C D new_col
1 1 0 0 0 1
2 0 0 1 0 7
3 0 0 0 1 10
4 0 1 0 0 3

Categories

Resources