I have a number of pandas dataframes that each have a column 'speaker' containing one of two labels. Typically the labels are 0 and 1; however, in some cases they are 1-2, 1-3, or 0-2. I am trying to find a way to iterate through all of my dataframes and standardize them so that they share the same labels (0-1).
The one consistent feature between them is that the first label to appear (i.e. in the first row of the dataframe) should always be mapped to '0', whereas the second should always be mapped to '1'.
Here is an example of one of the dataframes I would need to change - being mindful that others will have different labels:
import pandas as pd
data = [1,2,1,2,1,2,1,2,1,2]
df = pd.DataFrame(data, columns = ['speaker'])
I would like to be able to change it so that it appears as [0,1,0,1,0,1,0,1,0,1].
Thus far, I have tried inserting the following code within a bigger for loop that iterates through each dataframe. However it is not working at all:
for label in data['speaker']:
    if label == data['speaker'][0]:
        label = '0'
    else:
        label = '1'
Hopefully, what the above makes clear is that I am attempting to create a rule akin to: "find all instances in 'Speaker' that match the label in the first index position and change this to '0'. For all other instances change this to '1'."
Method 1:
We can use iat + np.where here for conditional creation of your column:
import numpy as np
first_val = df['speaker'].iat[0] # same as df['speaker'].iloc[0]
df['speaker'] = np.where(df['speaker'].eq(first_val), 0, 1)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
Method 2:
We can also make use of booleans, since we can cast them to integers:
first_val = df['speaker'].iat[0]
df['speaker'] = df['speaker'].ne(first_val).astype(int)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
If, and only if, your values are actually 1 and 2, we can use floor division:
df['speaker'] = df['speaker'] // 2
# same as: df['speaker'] = df['speaker'].floordiv(2)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
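Since the question is about many dataframes, Method 2 can be wrapped in a small helper and applied in a loop. This is only a sketch: the names standardize_speaker and frames are made up for illustration, and it assumes every dataframe is non-empty and has a 'speaker' column.

```python
import pandas as pd

def standardize_speaker(df, col="speaker"):
    """Return a copy where the first label seen in `col` becomes 0 and every other label becomes 1."""
    out = df.copy()
    first_val = out[col].iat[0]                    # label in the first row
    out[col] = out[col].ne(first_val).astype(int)  # False -> 0, True -> 1
    return out

# Hypothetical input: three frames with different label pairs
frames = [
    pd.DataFrame({"speaker": [1, 2, 1, 2]}),
    pd.DataFrame({"speaker": [0, 2, 2, 0]}),
    pd.DataFrame({"speaker": [3, 1, 3, 3]}),
]
standardized = [standardize_speaker(f) for f in frames]
```

The helper returns copies, so the original frames are left unchanged.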
You can use iloc to get the value in the first row of the 'speaker' column, and then a boolean mask to set the values:
zero_map = df["speaker"].iloc[0]
mask_zero = df["speaker"] == zero_map
# name the column explicitly so that other columns are left untouched
df.loc[mask_zero, "speaker"] = 0
df.loc[~mask_zero, "speaker"] = 1
print(df)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
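Another option, not mentioned in the answers above, might be pd.factorize, which numbers labels in order of first appearance, so the first label seen becomes 0 and the second becomes 1. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame whose labels are 2 and 0, with 2 appearing first
df = pd.DataFrame({"speaker": [2, 0, 2, 0, 0, 2]})

# codes holds the new integer labels, uniques the original labels in order of appearance
codes, uniques = pd.factorize(df["speaker"])
df["speaker"] = codes
```

Unlike the mask-based approaches, a third distinct label would get code 2 rather than being folded into 1, and missing values get code -1, so this fits best when each frame really has exactly two labels.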
Related
I am having trouble with Pandas.
I try to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1 or 0 if lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method, but this doesn't work.
It returns a pandas Series (attached below):
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0
data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out ?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    # the reference column is left as it is
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    # 1 where the stock beats the CAC 40, 0 otherwise
    df[column] = np.select(condition, value, default=0)
Which gives this result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
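The same comparison can probably be done without a loop by comparing all stock columns against the 'CAC 40' column at once. A sketch using the sample data above; the names stocks, beats_cac and counts are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": [2, 3, 4, 5, 7],
    "CAC 40": [9, 9, 1, 2, 2],
})

# Every column except the reference column
stocks = df.drop(columns="CAC 40")

# Row-wise comparison against the CAC 40 column; True/False cast to 1/0
beats_cac = stocks.gt(df["CAC 40"], axis=0).astype(int)

# Number of days on which each stock beat the CAC 40
counts = beats_cac.sum()
```

Use ge instead of gt if ties should count as 1, as in the >= of the original attempt.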
my input:
index frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
I also have two pandas Series objects, start_frame and end_frame. start_frame looks like this:
index frame
3 3
and for end frame:
index frame
4 5
My problem is applying a function to a specific column (user1) and only to specific rows, where the row bounds come from start_frame and end_frame.
I expect output like this:
frame user1 user2
0 0 0 0
1 1 0 0
2 2 0 0
3 3 1 0
4 4 1 0
5 5 1 0
I tried this, but it either sets the whole column to ones or gives some other output that is not what I want:
def my_func(x):
    x=x+1
    return x
df['user1']=df['user1'].between(df['frame']==3, df['frame']==5, inclusive=False).apply(lambda x: add_one(x))
I also tried this:
df['user1']=df.apply(lambda row: 1 if row['frame'] in (3,5) else 0, axis=1)
But it returns 1 only in rows 3 and 5. How can I pass a range here instead of (3,5)?
So I have two questions. First, and most important: how do I apply my_func to exactly the rows I need? Second: how do I use my start_frame and end_frame objects instead of typing the values into the function by hand?
Thank you
Updated:
arr_rang = range(3,6)
df['user1']=df.apply(lambda row: 1 if row['frame'] in (arr_rang) else 0, axis=1)
Now it returns 1 in frames 3, 4 and 5, which is what I need. But I still don't understand how to use my start_frame and end_frame objects.
Let's concatenate start_frame and end_frame, since they have the same columns, then check the values using isin(), and finally change the values using boolean masking and the loc accessor:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
s=pd.concat([start_frame, end_frame])
mask=(df['index'].isin(s['index'])) | (df['frame'].isin(s['frame']))
df.loc[mask,'user1']=df.loc[mask,'user1']+1
#you can also use np.where() in place of loc accessor
output of df:
index frame user1 user2
0 0 0 0 0
1 1 1 0 0
2 2 2 0 0
3 3 3 1 0
4 4 4 1 0
5 5 5 1 0
Update:
use:
mask=df['frame'].between(3,5)
df.loc[mask,'user1']=df.loc[mask,'user1']+1
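To avoid hard-coding 3 and 5, the bounds can be read from start_frame and end_frame and passed to between. This is a sketch under an assumption: it treats both objects as one-row dataframes with a 'frame' column, which is how they appear in the question. If they are plain Series, start_frame.iloc[0] would be the equivalent lookup.

```python
import pandas as pd

df = pd.DataFrame({
    "frame": [0, 1, 2, 3, 4, 5],
    "user1": [0, 0, 0, 0, 0, 0],
    "user2": [0, 0, 0, 0, 0, 0],
})

# Assumed shape of the two objects from the question
start_frame = pd.DataFrame({"frame": [3]}, index=[3])
end_frame = pd.DataFrame({"frame": [5]}, index=[4])

# Pull the scalar bounds out of the one-row objects
start = start_frame["frame"].iloc[0]
end = end_frame["frame"].iloc[0]

# between() includes both endpoints by default
mask = df["frame"].between(start, end)
df.loc[mask, "user1"] += 1
```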
Did you try
def putHello(row):
    row["hello"] = "world"
    return row
data.iloc[5:7].apply(putHello,axis=1)
The output would look something like this
The documentation for the pandas functions used: iloc and apply.
Column 'signal' is populated with 0 or 1, and I want column 'reversal' to tell me when there is a change in this column (i.e., from 0 to 1 or 1 to 0).
Issue: the code below gives me this information correctly for all the rows except the first one. The reason is that it tries to look at the value before the first one for column 'signal', and, since it does not find any value (of course - it is the first one!), it says that there is a change (as it does when the column's value changes from 0 to 1 or 1 to 0).
How can I fix it? I would like the code to disregard the first discrepancy, basically.
import pandas as pd
import numpy as np
d= {'signal':[0,0,0,1,1,0]}
df_zinc = pd.DataFrame(data=d)
df_zinc['reversal'] = np.where(df_zinc['signal']!=df_zinc['signal'].shift(),1,0)
print(df_zinc)
OUTPUT
signal reversal
0 0 1
1 0 0
2 0 0
3 1 1
4 1 0
5 0 1
If you are looking for the changes, I'd suggest using diff instead:
df_zinc['signal'].diff().fillna(0)!=0
If you prefer it as an int instead of a boolean:
bool_s = df_zinc['signal'].diff().fillna(0)!=0
int_s = bool_s.astype(int)
Testing:
df_zinc['reversal'] = (df_zinc['signal'].diff().fillna(0)!=0).astype(int)
Output
signal reversal
0 0 0
1 0 0
2 0 0
3 1 1
4 1 0
5 0 1
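If the column might hold non-numeric labels, where diff would not apply, one alternative is to keep the shift comparison from the question and fill the gap in the first row with the first value itself, so that the first row never counts as a change. A minimal sketch; the variable name previous is only illustrative:

```python
import pandas as pd

df_zinc = pd.DataFrame({"signal": [0, 0, 0, 1, 1, 0]})

# Shift down by one row; the empty first slot is filled with the first value,
# so row 0 compares equal to its "previous" value
previous = df_zinc["signal"].shift(fill_value=df_zinc["signal"].iat[0])

df_zinc["reversal"] = df_zinc["signal"].ne(previous).astype(int)
```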
In a pandas dataframe, how can I drop a random subset of rows that obey a condition?
In other words, if I have a Pandas dataframe with a Label column, I'd like to drop 50% (or some other percentage) of rows where Label == 1, but keep all of the rest:
Label   A     ->    Label   A
0       1           0       1
0       2           0       2
0       3           0       3
1       10          1       11
1       11          1       12
1       12
1       13
I'd love to know the simplest and most pythonic/panda-ish way of doing this!
Edit: This question provides part of an answer, but it only talks about dropping rows by index, disregarding the row values. I'd still like to know how to drop only from rows that are labeled a certain way.
Use the frac argument
df.sample(frac=.5)
If you define the amount you want to drop in a variable n
n = .5
df.sample(frac=1 - n)
To include the condition, use drop
df.drop(df.query('Label == 1').sample(frac=.5).index)
Label A
0 0 1
1 0 2
2 0 3
4 1 11
6 1 13
Using drop with sample
df.drop(df[df.Label.eq(1)].sample(2).index)
Label A
0 0 1
1 0 2
2 0 3
3 1 10
5 1 12
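The two answers above can be folded into a small reusable helper. This is a sketch: drop_fraction and its seed parameter are hypothetical names, and it assumes the index is unique, because drop removes rows by index label.

```python
import pandas as pd

def drop_fraction(df, mask, frac, seed=None):
    """Drop a random `frac` of the rows where `mask` is True and keep all other rows."""
    to_drop = df[mask].sample(frac=frac, random_state=seed).index
    return df.drop(to_drop)

df = pd.DataFrame({
    "Label": [0, 0, 0, 1, 1, 1, 1],
    "A": [1, 2, 3, 10, 11, 12, 13],
})

# Drop half of the rows with Label == 1; the seed makes the choice repeatable
result = drop_fraction(df, df["Label"].eq(1), frac=0.5, seed=0)
```

Which two of the Label == 1 rows are dropped depends on the seed.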
Someone asked to select the first observation per group in pandas df, I am interested in both first and last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify his example to tell you what I am looking for
basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
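If the DataFrame is not guaranteed to be sorted, one possible alternative is duplicated, which can flag both the first and the last occurrence of each group_id without a loop. A sketch with the example data; the column names is_first and is_last are made up:

```python
import pandas as pd

df = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# duplicated(keep="first") is False only at the first occurrence of each id,
# duplicated(keep="last") is False only at the last one; negate to get the flags
df["is_first"] = (~df.duplicated("group_id", keep="first")).astype(int)
df["is_last"] = (~df.duplicated("group_id", keep="last")).astype(int)
```

This marks the first and last row in which each id occurs, wherever those rows are in the frame.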
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df=df.groupby('group_id').tail(1)
You can groupby 'group_id' and call nth(-1) to get the last entry for each group, then use its index to set 'indicator' to 1 for those rows, and fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
# assign the result back instead of calling fillna(inplace=True) on a column selection
df['indicator'] = df['indicator'].fillna(0).astype(int)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount()==data.groupby('group_id')['any_other_column'].transform('size') -1 ).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to the "size of the group - 1", which we calculate using transform so that it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() variable, but this can be literally any other column, and it won't be affected since it is only used in calculating the new indicator. Use .astype(int) to make it binary and you are done.
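A variant of the same idea that does not need a second column is to count backwards within each group, so that the last row of every group gets count 0. A minimal sketch with the example data:

```python
import pandas as pd

data = pd.DataFrame({"group_id": [1, 1, 1, 2, 2, 2, 3, 3, 3]})

# cumcount(ascending=False) numbers the rows of each group from size-1 down to 0,
# so the last row of every group is the one with count 0
data["indicator"] = (
    data.groupby("group_id").cumcount(ascending=False).eq(0).astype(int)
)
```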