I have a dataframe as shown below:
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'Label':[1,1,1,0,0]})
My objective is to replace 0s with 1s AND 1s with 0s in the Label column.
I was trying something like this:
cdf.assign(invert_label=cdf.Label.loc[::-1].reset_index(drop=True))  # does not work
cdf['invert_label'] = np.where(cdf['Label']==0, '1', '0')
but neither works: the first just reverses the row order, and the second fills the column with the strings '1' and '0' rather than integers.
I expect my output to look like this:
Id Label
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
You can compare with 0, so rows equal to 0 give True and other rows give False; then convert the booleans to integers, which maps True/False to 1/0:
print (cdf['Label'].eq(0))
0 False
1 False
2 False
3 True
4 True
Name: Label, dtype: bool
cdf['invert_label'] = cdf['Label'].eq(0).astype(int)
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
Another idea is to use mapping:
cdf['invert_label'] = cdf['Label'].map({1:0, 0:1})
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
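One caveat worth noting (my addition, not from the original answer): map returns NaN for any value missing from the dictionary, so this assumes Label contains only 0 and 1:
print (pd.Series([1, 0, 2]).map({1:0, 0:1}))
0 0.0
1 1.0
2 NaN
dtype: float64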
One perhaps obvious answer is to use 1 - value:
cdf['Label2'] = 1-cdf['Label']
output:
Id Label Label2
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1
You could map operator.not_ (the functional form of not) as well:
import operator
cdf['Label'].map(operator.not_).astype('int')
Another way (added as a separate answer, since it is probably not "pythonic" enough, in the sense that it is not very explicit) is to use the bitwise xor:
cdf['Label'] ^ 1
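As with the other answers, the result can be assigned back to a new column; a quick sketch using the same cdf as above:
cdf['invert_label'] = cdf['Label'] ^ 1
print (cdf)
Id Label invert_label
0 1 1 0
1 2 1 0
2 3 1 0
3 4 0 1
4 5 0 1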
I want to pair up my values so that the sum of the two values in a pair comes to a certain number (here 6).
For example, I want to put together (1+5) and (2+4), and leave the rest by themselves. For this, I need to be able to search for the value that completes a pair, and skip the step if no such value exists. In the "Grouped" column I keep track of whether a row has already been grouped; an index can only be grouped once.
I have:
df_1= pd.DataFrame({'Rest_after_division': [1,2,3,4,5,5,5],
'Grouped_with_index': ["-","-","-","-","-","-","-"],
'Grouped': [0,0,0,0,0,0,0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 2 - 0
2 3 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 5 - 0
I have example 2:
df_1= pd.DataFrame({'Rest_after_division': [1,1,1,4,5,5,5],
'Grouped_with_index': ["-","-","-","-","-","-","-"],
'Grouped': [0,0,0,0,0,0,0]})
Rest_after_division Grouped_with_index Grouped
0 1 - 0
1 1 - 0
2 1 - 0
3 4 - 0
4 5 - 0
5 5 - 0
6 5 - 0
I want example 2:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 1 5 1
2 1 6 1
3 4 - 0
4 5 0 1
5 5 1 1
6 5 2 1
I have tried (I know I need to loop this eventually, but I can't get the index):
df_1 = df_1.sort_values('Grouped')
index_group_buddy= df_1[df_1['Rest_after_division']==5].head(1).index[0]
print(index_group_buddy)
This almost works, but not when no row matches the condition. How do I skip that case? I also think it will be problematic once everything is grouped...
I have also tried:
#index_group_buddy = df_1.loc[((df_1['Rest_after_division'] == 5) & (df_1['Grouped'] != 1)) ].idxmin(axis=1)
#index_group_buddy =df_1.query("Rest_after_division==5 and Grouped!=1")
index_group_buddy = df_1[(df_1['Rest_after_division']==5) & (df_1['Grouped']!=1)].index[0]
df_1.at[index_group_buddy, 'Grouped'] = 1
df_1.at[index_group_buddy, 'Grouped_with_index'] = index_group_buddy
print(index_group_buddy)
I want to find the first index that has the right conditions.
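A sketch of how the missing-match case could be skipped (my addition; the candidates name is made up): take the filtered index and test whether it is empty before using it:
candidates = df_1[(df_1['Rest_after_division']==5) & (df_1['Grouped']!=1)].index
if len(candidates) > 0:
    index_group_buddy = candidates[0]
    df_1.at[index_group_buddy, 'Grouped'] = 1
    df_1.at[index_group_buddy, 'Grouped_with_index'] = index_group_buddy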
Rework your df_1 to map each unique "Rest_after_division" value to its first index
Map each value's complement to 6 onto those keys
Calculate which values should be grouped (not complemented with itself, and only the first value of each group)
Insert the values with a mask
# map each unique Rest_after_division value to the index of its first occurrence
keys = (df_1['Rest_after_division']
        .drop_duplicates()
        .reset_index()
        .set_index('Rest_after_division')
        ['index']
)
# index of each row's complement to 6 (NaN if the complement value never occurs)
compl_index = (6-df_1['Rest_after_division']).map(keys)
# grouped if the complement is a different row and this row is the first of its group
df_1['Grouped'] = (compl_index.ne(df_1.index)
                   & df_1.groupby('Rest_after_division').cumcount().eq(0)
                   ).astype(int)
df_1['Grouped_with_index'] = compl_index.where(df_1['Grouped'].eq(1),
                                               df_1['Grouped_with_index'])
output:
Rest_after_division Grouped_with_index Grouped
0 1 4 1
1 2 3 1
2 3 - 0
3 4 1 1
4 5 0 1
5 5 - 0
6 5 - 0
My input:
index frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
I also have two objects, start_frame and end_frame, which are pandas objects that look like this for start_frame:
index frame
3 3
and for end_frame:
index frame
4 5
My problem is applying a function to a specific column (user1) over a specific range of rows, where the row bounds come from start_frame and end_frame.
I expect output like this:
frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 0
4 4 1 0
5 5 1 0
I tried this, but it sets the whole column to ones, or produces some other output that is not what I want:
def my_func(x):
x=x+1
return x
df['user1']=df['user1'].between(df['frame']==3, df['frame']==5, inclusive=False).apply(lambda x: my_func(x))
I tried another approach:
df['user1']=df.apply(lambda row: 1 if row['frame'] in (3,5) else 0, axis=1)
But it returns 1 only in rows 3 and 5; how do I turn (3,5) into a range here?
So I have two questions: first and most important, how do I apply my_func exactly to the rows I need? And second, how do I use my objects end_frame and start_frame instead of inserting the values manually?
Thank you
Updated:
arr_rang = range(3,6)
df['user1']=df.apply(lambda row: 1 if row['frame'] in arr_rang else 0, axis=1)
Now it returns 1 for frames 3, 4 and 5, which is what I need. But I still don't understand how to use my objects end_frame and start_frame.
Let's concatenate start_frame and end_frame, since they have common columns, then check values using isin(), and finally change values using a boolean mask with the loc accessor:
s = pd.concat([start_frame, end_frame])  # DataFrame.append was removed in pandas 2.0
mask=(df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask,'user1']=df.loc[mask,'user1']+1
#you can also use np.where() in place of loc accessor
output of df:
index frame user1 user2
0 0 0 0 0
1 1 1 0 0
2 2 2 0 0
3 3 3 1 0
4 4 4 1 0
5 5 5 1 0
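The np.where variant mentioned in the comment could look like this (a sketch, reusing the mask from above):
import numpy as np
df['user1'] = np.where(mask, df['user1'] + 1, df['user1'])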
Update:
use:
mask=df['frame'].between(3,5)
df.loc[mask,'user1']=df.loc[mask,'user1']+1
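To avoid hard-coding 3 and 5, the bounds can be read from the two objects instead (a sketch, assuming each holds a single row with a frame value):
start = start_frame['frame'].iloc[0]  # 3
end = end_frame['frame'].iloc[0]      # 5
mask = df['frame'].between(start, end)
df.loc[mask, 'user1'] = df.loc[mask, 'user1'] + 1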
Did you try:
def putHello(row):
    row["hello"] = "world"
    return row

data.iloc[5:7].apply(putHello, axis=1)
The output would be the selected rows of data with an added hello column. See the pandas documentation for DataFrame.iloc and DataFrame.apply.
I have a question regarding a complex .loc I'd like to run on my df to mutate a new column. Let's say I have the following df:
x y z
0 1 0 2
1 0 0 0
2 1 1 2
3 0 0 0
4 1 1 2
5 0 0 2
6 1 1 0
7 0 0 2
And I want to use both an & and a | in my .loc. I know very well how to use one or the other, but a certain task I'm trying to complete involves the use of both. Basically, I want to find the rows that meet the following condition:
(x == 1 or y == 1) and z == 2
and make a new column that consists of a 1 if these conditions are met, and a 0 if they aren't. Like so:
x y z test
0 1 0 2 1
1 0 0 0 0
2 1 1 2 1
3 0 0 0 0
4 1 1 2 1
5 0 0 2 0
6 1 1 0 0
7 0 0 2 0
Again, I know how to run a loc consisting of one or the other, but not both & and |. Before posting, I tried the following code out to no avail:
df['test'] = 0
df.loc[((df['x'] == 1) | (df['y'] == 1)) & (df['z'] == 2),'test'] = 1
I thought I was super clever by including an extra set of () around the | condition, but alas it did not work. However, this code does work just fine when I'm using one operator or the other, just not both. I would really appreciate any help. Thanks!
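For reference, a self-contained sketch of the combined mask (reconstructing the df above); the parentheses around each comparison matter, because & and | bind more tightly than ==:
import pandas as pd
df = pd.DataFrame({'x': [1,0,1,0,1,0,1,0],
                   'y': [0,0,1,0,1,0,1,0],
                   'z': [2,0,2,0,2,2,0,2]})
mask = ((df['x'] == 1) | (df['y'] == 1)) & (df['z'] == 2)
df['test'] = mask.astype(int)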
I have two dataframes (pandas.DataFrame), each looking as follows. Let's call the first one df_A:
code1 code2 code3 code4 code5
0 1 4 2 0 0
1 3 2 1 5 0
2 2 3 0 0 0
and the second one:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
The objects (rows) are each given up to 5 codes, shown in the five columns of the first df.
I instead want a binary representation of which codes each object has, as shown in the second df.
The functions in pandas and scikit-learn for dummy values take into account which position a code is written in; that is unimportant here.
The attempts I have made with my own code have not worked, due to my inexperience with python and pandas.
This case is different from others I have seen on stack overflow as all the columns represent the same thing.
Thank you!
Edit:
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        if int(colname) in df_codes.iloc[[row]]:
            df_bin[colname][row]=1
This is one of the attempts I made so far.
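For what it's worth, a sketch of a fixed version of that loop (my guess at the intent, assuming df_bin starts as zeros and its column names are the code values): the expression int(colname) in df_codes.iloc[[row]] tests column labels rather than cell values, which is why it never matches.
for colname in df_bin.columns:
    for row in range(len(df_codes)):
        # .iloc[row].values gives the row's cell values
        if int(colname) in df_codes.iloc[row].values:
            df_bin.loc[row, colname] = 1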
You can try stack, then str.get_dummies:
s = (df.stack()
       .loc[lambda x: x != 0]
       .astype(str)
       .str.get_dummies()
       .groupby(level=0).sum()  # .sum(level=0) was removed in pandas 2.0
       .add_prefix('Has'))
Has1 Has2 Has3 Has4 Has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Let's try:
(df.stack().groupby(level=0)
.value_counts()
.unstack(fill_value=0)
[range(1,6)]
.add_prefix('has')
)
Output:
has1 has2 has3 has4 has5
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
Here's another way using pd.crosstab:
df_out = df.reset_index().melt('index')
df_out = pd.crosstab(df_out['index'], df_out['value']).drop(0, axis=1).add_prefix('has')
Output:
value has1 has2 has3 has4 has5
index
0 1 1 0 1 0
1 1 1 1 0 1
2 0 1 1 0 0
I'm trying to replace all numbers greater than 1 with 1, while keeping the original 1s and 0s untouched, across the entire dataframe with minimal effort. Any support is appreciated!
My dataframe looks something like this but contains way more columns and rows.
Report No Apple Orange Lemon Grape Pear
One 5 0 2 1 1
Two 1 1 0 3 2
Three 0 0 2 1 3
Four 1 1 3 0 0
Five 4 0 0 1 1
Six 1 3 1 2 0
Desired Output:
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
You can try this.
Using a boolean mask
df.set_index('Report No',inplace=True)
df[df > 1] = 1
df.reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
Or use this if you have some non-numeric columns; there is no need for set_index and reset_index. _get_numeric_data is equivalent to df.select_dtypes('number'):
val = df._get_numeric_data()
val[val > 1] = 1
df
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
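A related caveat (my addition): under the copy-on-write behaviour of pandas 2.x, mutating the frame returned by _get_numeric_data may no longer write through to df. A sketch using the public API and assigning the result back explicitly:
val = df.select_dtypes('number')
df[val.columns] = val.mask(val > 1, 1)  # assumes the same df as above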
df.mask
df.set_index('Report No',inplace=True)
df.mask(df>1,1).reset_index()
Report No Apple Orange Lemon Grape Pear
One 1 0 1 1 1
Two 1 1 0 1 1
Three 0 0 1 1 1
Four 1 1 1 0 0
Five 1 0 0 1 1
Six 1 1 1 1 0
np.where
df[df.columns[1:]] = np.where(df.iloc[:, 1:] > 1, 1, df.iloc[:, 1:])
np.select
This can be helpful when dealing with multiple conditions, for example if you want to convert values less than or equal to 0 to 0 and values greater than or equal to 1 to 1.
df.set_index('Report No', inplace=True)
condlist = [df >= 1, df <= 0] #you can have more conditions and add choices accordingly.
choice = [1, 0] #len(condlist) should be equal to len(choice).
df.loc[:] = np.select(condlist, choice)
As Jan mentioned, you can also use df.clip (shown further below).
Not recommended, but you can try this for fun, using df.astype:
df.set_index('Report No',inplace=True)
df.astype('bool').astype('int')
NOTE: This converts falsy values to False and truthy values to True, i.e. 0 becomes False and anything other than 0 (even negative numbers) becomes True:
s = pd.Series([1,-1,0])
s.astype('bool')
0 True
1 True
2 False
dtype: bool
s.astype('bool').astype('int')
0 1
1 1
2 0
dtype: int32
np.sign
Use this when the values are in [0, n], i.e. there are no negative values:
df.loc[:] = np.sign(df)
Use pandas.DataFrame.clip:
new_df = df.clip(0, 1)
EDIT: To exclude the first column by name (this will edit the DataFrame in-place)
mask = df.columns != "Report No"
df.loc[:, mask] = df.loc[:, mask].clip(0, 1)
A simple way is to go through all the keys of the dataframe and change the values using numpy's where function (numpy has to be imported). We pass that function the condition and the values to use when the condition is or is not satisfied. In your example it would look like this:
for x in df.keys()[1:]:
    df[x] = np.where(df[x] > 1, 1, df[x])
Note that in the loop I have skipped the first key, because its values are not numeric.