Anonymizing column names - python

I have a dataframe like so:

IsCool  IsTall  IsHappy  Target
0       1       0        1
1       1       0        0
0       1       0        0
1       0       1        1
I want to anonymize the column names except for Target.
How can I do this?
Expected output:
col1  col2  col3  Target
0     1     0     1
1     1     0     0
0     1     0     0
1     0     1     1
Source dataframe:

import pandas as pd

df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
                   "IsTall": [1, 1, 1, 0],
                   "IsHappy": [0, 0, 0, 1],
                   "Target": [1, 0, 0, 1]})

What about:

cols = {
    col: f"col{i + 1}" if col != "Target" else col
    for i, col in enumerate(df.columns)
}
out = df.rename(columns=cols)
   col1  col2  col3  Target
0     0     1     0       1
1     1     1     0       0
2     0     1     0       0
3     1     0     1       1
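For reference, this is the mapping the comprehension builds for the sample frame:

print(cols)
# {'IsCool': 'col1', 'IsTall': 'col2', 'IsHappy': 'col3', 'Target': 'Target'}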
You can also do it in place:

cols = [
    f"col{i + 1}" if col != "Target" else col
    for i, col in enumerate(df.columns)
]
df.columns = cols

You can use:

# get all columns except excluded ones (here "Target"), keeping the
# original order (by default, difference() sorts the result alphabetically)
cols = df.columns.difference(['Target'], sort=False)
# build a mapper: old name -> new name
names = 'col' + pd.Series(range(1, len(cols) + 1), index=cols).astype(str)
out = df.rename(columns=names)
Output:

   col1  col2  col3  Target
0     0     1     0       1
1     1     1     0       0
2     0     1     0       0
3     1     0     1       1
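For reference, the names mapper built above is a Series whose index holds the old names and whose values the new ones; rename() accepts any dict-like, so a Series works:

print(names)
# IsCool     col1
# IsTall     col2
# IsHappy    col3
# dtype: object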

Proposed code:
You can pass a dict of old-to-new names to the pandas rename() function, like this:

columns={'IsCool': 'col0', 'IsTall': 'col1', 'IsHappy': 'col2'}

This dict can be built with the zip function: dict(zip(keys, values)).
import pandas as pd

df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
                   "IsTall": [1, 1, 1, 0],
                   "IsHappy": [0, 0, 0, 1],
                   "Target": [1, 0, 0, 1]})

df = df.rename(columns=dict(zip(df.columns.drop('Target'),
                                ["col%s" % i for i in range(len(df.columns) - 1)])))
print(df)
Result:

   col0  col1  col2  Target
0     0     1     0       1
1     1     1     0       0
2     0     1     0       0
3     1     0     1       1

Related

Doing group by on one hot encoded columns PANDAS

I have the following dataframe called df. For each sector column (sector_*) I want to do a group by and get the number of unique ids per sector. A sector column is 1 in a row if that id is part of the sector. How can I do this group by when the columns are one-hot encoded?
id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
1   1       1            0                0                 0
1   0       1            0                0                 0
2   1       0            0                0                 0
2   0       1            0                0                 0
3   1       0            0                0                 1
Expected output:

sector            unique_id
sector_food       2
sector_learning   0
sector_parenting  0
sector_consumer   1
You can do something like this:

out = df.drop(columns=["id", "winner"]).multiply(df["id"], axis=0).nunique().subtract(1)
# sector_food         2
# sector_learning     0
# sector_parenting    0
# sector_consumer     1
# dtype: int64
To get your exact expected output you can add:

out = out.rename_axis("sector").to_frame("unique_id")
#                   unique_id
# sector
# sector_food               2
# sector_learning           0
# sector_parenting          0
# sector_consumer           1
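To see what that chain does, here is a quick look (using the DDL from the answer below) at the intermediate frame after the multiply step: each sector column then holds the id wherever its flag was 1, and 0 elsewhere, so nunique() counts the distinct ids plus the 0, and subtract(1) removes that 0 (note this assumes every column contains at least one 0):

tmp = df.drop(columns=["id", "winner"]).multiply(df["id"], axis=0)
print(tmp)
#    sector_food  sector_learning  sector_parenting  sector_consumer
# 0            1                0                 0                0
# 1            1                0                 0                0
# 2            0                0                 0                0
# 3            2                0                 0                0
# 4            0                0                 0                3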
Try this:

df.drop('winner', axis=1).groupby('id').sum().gt(0).sum().to_frame('unique_id')
Given

import pandas as pd

ids = [1, 1, 2, 2, 3]
winner = [1, 0, 1, 0, 1]
sector_food = [1, 1, 0, 1, 0]
sector_learning = [0, 0, 0, 0, 0]
sector_parenting = [0, 0, 0, 0, 0]
sector_consumer = [0, 0, 0, 0, 1]

df = pd.DataFrame({
    'id': ids,
    'winner': winner,
    'sector_food': sector_food,
    'sector_learning': sector_learning,
    'sector_parenting': sector_parenting,
    'sector_consumer': sector_consumer
})
print(df)
output

   id  winner  sector_food  sector_learning  sector_parenting  sector_consumer
0   1       1            1                0                 0                 0
1   1       0            1                0                 0                 0
2   2       1            0                0                 0                 0
3   2       0            1                0                 0                 0
4   3       1            0                0                 0                 1
You can do

_df = (df
       # drop unused cols
       .drop('winner', axis=1)
       # melt with 'id' as index
       .melt(id_vars='id')
       # drop all duplicates
       .drop_duplicates(['id', 'variable', 'value'])
       # sum unique values
       .groupby('variable').value.sum()
)
print(_df)
output

variable
sector_consumer     1
sector_food         2
sector_learning     0
sector_parenting    0
Name: value, dtype: int64
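If you need the question's exact layout and original column order, a small follow-up could be (a sketch; the sector/unique_id names come from the expected output above):

out = (_df.reindex(df.columns.drop(['id', 'winner']))
          .rename_axis('sector')
          .reset_index(name='unique_id'))
print(out)
#              sector  unique_id
# 0       sector_food          2
# 1   sector_learning          0
# 2  sector_parenting          0
# 3   sector_consumer          1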

Python Pandas Iterate over columns and also update columns based on apply conditions

I am trying to update dataframe columns based on consecutive column values.
If two consecutive columns, say col1 and col2, hold values > 0 and < 0 respectively, then they should be updated as col2 = col1 + col2 and col1 = 0, and a counter should be incremented by 1 (giving how many fixes were done across the row).
The dataframe looks like:

id  col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10
1   0     5     -5    5     -5    0     0     1     4     3     -3
2   0     0     0     0     0     0     0     4     -4    0     0
3   0     0     1     2     3     0     0     0     5     6     0
Required dataframe after applying the logic:

id  col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  fix
1   0     0     0     0     0     0     0     1     4     0     0      3
2   0     0     0     0     0     0     0     0     0     0     0      1
3   0     0     1     2     3     0     0     0     5     6     0      0
I tried:

def fix_count(row):
    row['fix_cnt'] = 0
    for i in range(0, 10):
        if ((row['col' + str(i)] > 0) &
                (row['col' + str(i + 1)] < 0)):
            row['col' + str(i + 1)] = row['col' + str(i)] + row['col' + str(i + 1)]
            row['col' + str(i)] = 0
            row['fix_cnt'] += 1
    return (row['col' + str(i)],
            row['col' + str(i + 1)],
            row['fix_cnt'])

df.apply(fix_count, axis=1)
It failed with IndexError: index 11 is out of bounds for axis 0 with size 11.
I also looked into df.iteritems, but I couldn't figure out a way.
DDL to generate the DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'col0': [0, 0, 0],
                   'col1': [5, 0, 0],
                   'col2': [-5, 0, 1],
                   'col3': [5, 0, 2],
                   'col4': [-5, 0, 3],
                   'col5': [0, 0, 0],
                   'col6': [0, 0, 0],
                   'col7': [1, 4, 0],
                   'col8': [4, -4, 5],
                   'col9': [3, 0, 6],
                   'col10': [-3, 0, 0]})

Thanks!
Here is an approach without loops:

c = df.gt(0) & df.shift(-1, axis=1).lt(0)
out = (df.mask(c.shift(axis=1, fill_value=False), df + df.shift(axis=1))
         .mask(c, 0)
         .assign(Fix=c.sum(1)))
print(out)
   id  col0  col1  col2  col3  col4  col5  col6  col7  col8  col9  col10  Fix
0   1     0     0     0     0     0     0     0     1     4     0      0    3
1   2     0     0     0     0     0     0     0     0     0     0      0    1
2   3     0     0     1     2     3     0     0     0     5     6      0    0
Details:
c checks whether the current column is > 0 and the next column is < 0.
Where c is True, the current column's value is added into the next column.
The current column is set to 0 where c is True.
The row-wise sum of c gives the number of changes done (see the sketch below).
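A minimal sketch of the intermediate mask c, on a trimmed-down frame (assuming the same column pattern as in the question):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'col1': [5, 0, 0],
                   'col2': [-5, 0, 1]})

# True where the current column is > 0 and the next column is < 0
c = df.gt(0) & df.shift(-1, axis=1).lt(0)
print(c)
#       id   col1   col2
# 0  False   True  False
# 1  False  False  False
# 2  False  False  False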
Your code logic is fine. The only correction needed is to return the whole row from your function:
def fix_count(row):
    row['fix_cnt'] = 0
    for i in range(0, 10):
        if ((row['col' + str(i)] > 0) &
                (row['col' + str(i + 1)] < 0)):
            row['col' + str(i + 1)] = row['col' + str(i)] + row['col' + str(i + 1)]
            row['col' + str(i)] = 0
            row['fix_cnt'] += 1
    return row

df.apply(fix_count, axis=1)
Try this and let me know if this works!

Pandas - Does row fall below a row with a column value and same id

I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2 that indicates whether a row falls below another row with the same id where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but these only go by row. And from my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby.

df['val2'] = df.groupby('id', group_keys=False).val1.apply(
    lambda x: x.shift().cumsum()
).ge(1).astype(int)

Or, in an attempt to avoid the lambda,

df['val2'] = (
    df.groupby('id')
      .val1.shift()
      .groupby(df.id)
      .cumsum()
      .ge(1)
      .astype(int)
)
df
   id  val1  val2
0   1     0     0
1   1     1     0
2   1     0     1
3   2     0     0
4   2     1     0
5   2     0     1
6   2     0     1
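To see why this works, here is a quick look at the intermediates for the id == 1 group: the shift pushes each flag down one row, and the cumsum then carries it to every later row in the group.

g = df.loc[df.id == 1, 'val1']
print(g.shift().tolist())           # [nan, 0.0, 1.0]
print(g.shift().cumsum().tolist())  # [nan, 0.0, 1.0]
# .ge(1) then marks only the rows strictly below the first val1 == 1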
Using groupby + transform. Similar to coldspeed's, but using bool conversion for non-zero cumsum values.

df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
               .fillna(0).astype(bool).astype(int)
print(df)
print(df)
   id  val1  val2
0   1     0     0
1   1     1     0
2   1     0     1
3   2     0     0
4   2     1     0
5   2     0     1
6   2     0     1

pandas row operation to keep only the rightmost non-zero value per row

How do I keep only the rightmost non-zero number in each row of a dataframe?

a = [[1, 2, 0], [1, 3, 0], [1, 0, 0]]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])
df

      col1  col2  col3
row0     1     2   NaN
row1     1     3     0
row2     1     0     0
Then after the transformation:

      col1  col2  col3
row0     0     2     0
row1     0     3     0
row2     1     0     0
Based on the suggestion by Divakar, I've come up with the following:

import pandas as pd
import numpy as np

a = [[1, 2, 0, None],
     [1, 3, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1]]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3', 'col4'])
df.fillna(value=0, inplace=True)  # get rid of non-numeric items

a
[[1, 2, 0, None],
 [1, 3, 0, 0],
 [1, 0, 0, 0],
 [1, 0, 0, 0],
 [1, 0, 0, 0],
 [0, 0, 0, 1]]
# Return index of first occurrence of maximum over requested axis.
# 0 or 'index' for row-wise, 1 or 'columns' for column-wise
df.idxmax(1)

0    col2
1    col2
2    col1
3    col1
4    col1
5    col4
dtype: object
Create a matrix to mask values:

numberOfRows = df.shape[0]
df_mask = pd.DataFrame(columns=df.columns, index=np.arange(0, numberOfRows))
df_mask.fillna(value=0, inplace=True)  # start from an all-zero mask

# Add mask entries
for row, col in enumerate(df.idxmax(1)):
    df_mask.loc[row, col] = 1

df_result = df * df_mask
df_result
   col1  col2  col3  col4
0     0     2     0   0.0
1     0     3     0   0.0
2     1     0     0   0.0
3     1     0     0   0.0
4     1     0     0   0.0
5     0     0     0   1.0
Here is a workaround that requires the use of helper functions:

import pandas as pd

# Helper functions
def last_number(lst):
    if all(map(lambda x: x == 0, lst)):
        return 0
    elif lst[-1] != 0:
        return len(lst) - 1
    else:
        return last_number(lst[:-1])

def fill_others(lst):
    new_lst = [0] * len(lst)
    new_lst[last_number(lst)] = lst[last_number(lst)]
    return new_lst

# Data
a = [[1, 2, 0], [1, 3, 0], [1, 0, 0]]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])
df.fillna(0, inplace=True)
print(df)
   col1  col2  col3
0     1     2     0
1     1     3     0
2     1     0     0
# Application (wrapping the result in a Series keeps the column labels)
print(df.apply(lambda x: pd.Series(fill_others(x.tolist()), index=x.index), axis=1))

   col1  col2  col3
0     0     2     0
1     0     3     0
2     1     0     0
As their names suggest, the functions get the last number in a given row and fill the other values with zeros.
I hope this helps.
Working at the NumPy level, here's one vectorized approach using broadcasting -

np.where(((a != 0).cumsum(1).argmax(1))[:, None] == np.arange(a.shape[1]), a, 0)
Sample run -

In [7]: a  # NumPy array
Out[7]:
array([[1, 2, 0],
       [1, 3, 0],
       [1, 0, 0]])

In [8]: np.where(((a != 0).cumsum(1).argmax(1))[:, None] == np.arange(a.shape[1]), a, 0)
Out[8]:
array([[0, 2, 0],
       [0, 3, 0],
       [1, 0, 0]])
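Why this works: along each row, (a != 0).cumsum(1) is non-decreasing and reaches its maximum at the last non-zero entry, and argmax returns the first position of that maximum. A tiny standalone sketch:

import numpy as np

row = np.array([1, 0, 3, 0])
print((row != 0).cumsum())           # [1 1 2 2]
print((row != 0).cumsum().argmax())  # 2 -> index of the rightmost non-zero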
Porting it to pandas, we would have an implementation like so -

idx = (df != 0).values.cumsum(1).argmax(1)
df_out = df * (idx[:, None] == np.arange(df.shape[1]))
Sample run -

In [19]: df
Out[19]:
   col1  col2  col3  col4
0     1     2     0   0.0
1     1     3     0   0.0
2     2     2     2   0.0
3     1     0     0   0.0
4     1     0     0   0.0
5     0     0     0   1.0

In [20]: idx = (df != 0).values.cumsum(1).argmax(1)

In [21]: df * (idx[:, None] == np.arange(df.shape[1]))
Out[21]:
   col1  col2  col3  col4
0     0     2     0   0.0
1     0     3     0   0.0
2     0     0     2   0.0
3     1     0     0   0.0
4     1     0     0   0.0
5     0     0     0   1.0
You can back-fill the null values and then take the values of the resulting last column:

In [49]: df.bfill(axis=0)['col3']
Out[49]:
0    0.0
1    0.0
2    0.0
Name: col3, dtype: float64

Full example:

In [50]: a = [[1, 2, None], [1, 3, 0], [0, 0, 0]]

In [51]: df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

In [52]: df.bfill(axis=0)['col3']
Out[52]:
0    0.0
1    0.0
2    0.0
Name: col3, dtype: float64

pandas: Use if-else to populate new column

I have a DataFrame like this:

col1  col2
1     0
0     1
0     0
0     0
3     3
2     0
0     4
I'd like to add a column that is 1 if col2 is > 0, and 0 otherwise. If I were using R, I'd do something like
df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)
How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):
df['col3'] = (df.col2 > 0).astype('int')
(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.)
This produces col3 as:

   col2  col3
0     0     0
1     1     1
2     0     0
3     0     0
4     3     1
5     0     0
6     4     1
Another way to create the column could be to use np.where, which lets you specify the values to use for the true and false cases and is perhaps closer to the syntax of the R function ifelse. For example:
>>> np.where(df['col2'] > 0, 4, -1)
array([-1, 4, -1, -1, 4, -1, 4])
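Applied to the question's exact 1/0 case (assuming numpy is imported as np), that would be:

df['col3'] = np.where(df['col2'] > 0, 1, 0)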
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.
df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                    'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1
>>> df1
Out[70]:
   col1  col2  col3
0     1     0     0
1     0     1     1
2     0     0     0
3     0     0     0
4     3     3     1
5     2     0     0
6     0     4     1
You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.
df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)
