Someone asked how to select the first observation per group in a pandas DataFrame. I am interested in both the first and the last observation, and I don't know an efficient way of doing it other than writing a for loop.
I am going to modify that example to show what I am looking for.
Basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
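For reference, a minimal construction of the example frame (an assumed setup, containing only the group_id column; the indicator is what we want to compute):
import pandas as pd

df = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})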
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note: for large datasets, this should be much faster than a list comprehension (not to mention explicit loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
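Since the question also asks for the first observation per group, the same trick works in the other direction with shift(1) (a small sketch on the sorted example frame; the column names are just illustrative):
# last row of each group: the next row belongs to a different group
df['last_in_group'] = (df.group_id != df.group_id.shift(-1)).astype(int)
# first row of each group: the previous row belongs to a different group
df['first_in_group'] = (df.group_id != df.group_id.shift(1)).astype(int)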
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df = df.groupby('group_id').tail(1)
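Note that this returns the last row of each group rather than adding an indicator column. The same pattern gives the first rows with .head(1) (a small sketch):
first_rows = df.groupby('group_id').head(1)  # first observation per group
last_rows = df.groupby('group_id').tail(1)   # last observation per group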
You can group by 'group_id' and call nth(-1) to get the last entry of each group, then use its index to mask the df, set 'indicator' to 1 for those rows, and fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
We check whether the cumulative count within each group (a vector the same length as the dataframe) equals the group size minus 1, which we compute with transform so that it is also a vector the same length as the dataframe.
We need some other column for the transform because it won't let you transform the .groupby() column itself, but this can literally be any other column; it is not affected, since it's only used to calculate the new indicator. Use .astype(int) to turn the boolean into a 0/1 value and you're done.
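Applied to the example frame (a sketch; since that frame only has group_id, a throwaway column val stands in for 'any_other_column'):
data = pd.DataFrame({'group_id': [1, 1, 1, 2, 2, 2, 3, 3, 3]})
data['val'] = 0  # placeholder column, only used to carry the transform
data['indicator'] = (data.groupby('group_id').cumcount()
                     == data.groupby('group_id')['val'].transform('size') - 1).astype(int)
# indicator is 1 on rows 2, 5 and 8 -- the last row of each group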
[image: sample and expected data]
Block one is the current data and block two is the expected data. That is, when I encounter a 1 within a group, the counter should start there and each following row should be incremented by one; the same should happen for the next country, b.
First replace all values after the first 1 in each group with 1, so that GroupBy.cumsum can be used:
df = pd.DataFrame({'c':['a']*3 + ['b']*3+ ['c']*3, 'v':[1,0,0,0,1,0,0,0,1]})
s = df.groupby('c')['v'].cumsum()
df['new'] = s.where(s.eq(0), 1).groupby(df['c']).cumsum()
print (df)
c v new
0 a 1 1
1 a 0 2
2 a 0 3
3 b 0 0
4 b 1 1
5 b 0 2
6 c 0 0
7 c 0 0
8 c 1 1
Another solution is to replace all non-1 values with missing values and forward fill the 1 within each group; the remaining missing values at the start of each group are replaced with 0, so the cumulative sum again works as expected:
s = df['v'].where(df['v'].eq(1)).groupby(df['c']).ffill().fillna(0).astype(int)
df['new'] = s.groupby(df['c']).cumsum()
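As a quick check on the same example frame, the intermediate series s flags every row at or after the first 1 in its group, so the grouped cumulative sum reproduces the new column from the first solution:
print(s.tolist())
# [1, 1, 1, 0, 1, 1, 0, 0, 1]
print(df['new'].tolist())
# [1, 2, 3, 0, 1, 2, 0, 0, 1]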
I have a number of pandas DataFrames that each have a column 'speaker' containing one of two labels. Typically this is 0-1; however, in some cases it is 1-2, 1-3, or 0-2. I am trying to find a way to iterate through all of my DataFrames and standardize them so that they share the same labels (0-1).
The one consistent feature between them is that the first label to appear (i.e. in the first row of the DataFrame) should always be mapped to '0', whereas the other label should always be mapped to '1'.
Here is an example of one of the dataframes I would need to change - being mindful that others will have different labels:
import pandas as pd
data = [1,2,1,2,1,2,1,2,1,2]
df = pd.DataFrame(data, columns = ['speaker'])
I would like to be able to change this so that it appears as [0,1,0,1,0,1,0,1,0,1].
Thus far, I have tried inserting the following code within a bigger for loop that iterates through each DataFrame; however, it is not working at all:
for label in data['speaker']:
    if label == data['speaker'][0]:
        label = '0'
    else:
        label = '1'
Hopefully, what the above makes clear is that I am attempting to create a rule akin to: "find all instances in 'Speaker' that match the label in the first index position and change this to '0'. For all other instances change this to '1'."
Method 1
We can use iat + np.where here for conditional creation of your column:
# import numpy as np
first_val = df['speaker'].iat[0] # same as df['speaker'].iloc[0]
df['speaker'] = np.where(df['speaker'].eq(first_val), 0, 1)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
Method 2:
We can also make use of booleans, since we can cast them to integers:
first_val = df['speaker'].iat[0]
df['speaker'] = df['speaker'].ne(first_val).astype(int)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
Only if your values are actually 1 and 2 can we use floor division:
df['speaker'] = df['speaker'] // 2
# same as: df['speaker'] = df['speaker'].floordiv(2)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
You can use iloc to get the value in the first row of the column, and then a mask to set the values:
zero_map = df["speaker"].iloc[0]
mask_zero = df["speaker"] == zero_map
df.loc[mask_zero] = 0
df.loc[~mask_zero] = 1
print(df)
speaker
0 0
1 1
2 0
3 1
4 0
5 1
6 0
7 1
8 0
9 1
I am looking for code that detects when the values in one column BECOME bigger than the values in another column. In the example below, B becomes bigger than A at row index 1 and A becomes bigger than B at row index 3. I would like to get a DataFrame that highlights rows 1 and 3 and also indicates which column became bigger than which.
In [1]: df
Out[1]:
A B
0 3 2
1 5 6
2 3 7
3 8 2
Desired result:
In [1]: df_result
Out[1]:
RES
0 0
1 -1
2 0
3 1
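For reference, the example frame can be reproduced with (an assumed construction):
import pandas as pd

df = pd.DataFrame({'A': [3, 5, 3, 8], 'B': [2, 6, 7, 2]})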
You could check where A is greater than B, cast the result to int8 with view, and take the diff:
df.A.gt(df.B).view('i1').diff().fillna(0, downcast = 'i1')
0 0
1 -1
2 0
3 1
dtype: int8
You can use np.sign on the column difference to get the sign of the difference, and then replace the values with the column labels:
df['who_is_bigger'] = (pd.DataFrame(np.sign(df['col1'] - df['col2']))
.replace({-1: 'col2', 1: 'col1', 0:'same_size'}))
This puts in the who_is_bigger column the label of the larger column, or the string same_size if the two columns hold the same value.
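Adapted to the question's A and B columns (a sketch of the same idea; the extra diff step that recovers the requested RES column assumes the two columns are never equal, as in the example):
import numpy as np

sign = np.sign(df['A'] - df['B'])
df['who_is_bigger'] = sign.replace({1: 'A', -1: 'B', 0: 'same_size'})
# diffing the sign is non-zero only where the ordering flips
df['RES'] = sign.diff().fillna(0).div(2).astype(int)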
If the comparison is specifically between two columns, you can take the difference between them. Let's say we've got this dataframe:
>>> data = pd.DataFrame.from_dict({'a': [5, 4, 3, 2, 1], 'b': [5, 3, 4, 1, 2]})
>>> data
a b
0 5 5
1 4 3
2 3 4
3 2 1
4 1 2
Then by doing data['a'] - data['b'] we get the delta for each row, like this:
>>> data['a'] - data['b']
0 0
1 1
2 -1
3 1
4 -1
Here 0 means that both columns in that row hold the same number, a positive number means that the left column (a) is greater than the right column (b), and a negative number means the opposite.
I have a pandas dataframe like this:
a b c
0 1 1 1
1 1 1 0
2 2 4 1
3 3 5 0
4 3 5 0
where the first two columns ('a' and 'b') are IDs and the last one ('c') is a validation flag (0 = neg, 1 = pos). I know how to remove duplicates based on the values of the first two columns, but in this case I would also like to get rid of inconsistent data, i.e. duplicated rows validated as both positive and negative. For example, the first two rows are duplicated but inconsistent, so I should remove the whole record, while the last two rows are both duplicated and consistent, so I'd keep one of them. The expected result should be:
a b c
0 2 4 1
1 3 5 0
The real dataframe can have more than two duplicates per group, and as you can see the index has also been reset. Thanks.
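For reference, the example frame can be built with (an assumed construction):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3, 3],
                   'b': [1, 1, 4, 5, 5],
                   'c': [1, 0, 1, 0, 0]})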
First filter the rows with GroupBy.transform and SeriesGroupBy.nunique to keep only the groups with a single unique value (boolean indexing), and then apply DataFrame.drop_duplicates:
df = (df[df.groupby(['a','b'])['c'].transform('nunique').eq(1)]
.drop_duplicates(['a','b'])
.reset_index(drop=True))
print (df)
a b c
0 2 4 1
1 3 5 0
Detail:
print (df.groupby(['a','b'])['c'].transform('nunique'))
0 2
1 2
2 1
3 1
4 1
Name: c, dtype: int64
I have a dataframe on which I'm using pandas.groupby on a specific column and then running aggregate statistics on the produced groups (mean, median, count). I want to treat certain column values as members of the same group rather than getting a distinct group per distinct value in the grouping column. I was wondering how I would accomplish such a thing.
For example:
>> my_df
ID SUB_NUM ELAPSED_TIME
1 1 1.7
2 2 1.4
3 2 2.1
4 4 3.0
5 6 1.8
6 6 1.2
So instead of the typical behavior:
>> my_df.groupby(['SUB_NUM']).agg(['count'])
ID SUB_NUM Count
1 1 1
2 2 2
4 4 1
5 6 2
I want certain values (SUB_NUM in [1, 2]) to be computed as one group so instead something like below is produced:
>> # Some mystery pandas function calls
ID SUB_NUM Count
1 1, 2 3
4 4 1
5 6 2
Any help would be much appreciated, thanks!
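For reference, the example frame can be constructed as follows (an assumed construction; the first answer below refers to the frame as df):
import pandas as pd

my_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                      'SUB_NUM': [1, 2, 2, 4, 6, 6],
                      'ELAPSED_TIME': [1.7, 1.4, 2.1, 3.0, 1.8, 1.2]})
df = my_df  # the first answer below uses the name df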
This works for me:
#for join values convert values to string
df['SUB_NUM'] = df['SUB_NUM'].astype(str)
#create mapping dict by dict comprehension
L = ['1','2']
d = {x: ','.join(L) for x in L}
print (d)
{'2': '1,2', '1': '1,2'}
#replace values by dict
a = df['SUB_NUM'].replace(d)
print (a)
0 1,2
1 1,2
2 1,2
3 4
4 6
5 6
Name: SUB_NUM, dtype: object
#groupby by mapping column and aggregating `first` and `size`
print (df.groupby(a)
.agg({'ID':'first', 'ELAPSED_TIME':'size'})
.rename(columns={'ELAPSED_TIME':'Count'})
.reset_index())
SUB_NUM ID Count
0 1,2 1 3
1 4 4 1
2 6 5 2
See also: What is the difference between size and count in pandas?
You can create another column mapping the SUB_NUM values to actual groups and then group by it.
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
my_df.groupby(['SUB_GROUP']).agg(['count'])
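On the example frame, SUB_NUM values 1 and 2 end up in a single group before aggregating (a quick check, assuming my_df as defined in the question and the SUB_GROUP column created above; the aggregation here is just one illustrative choice):
print(my_df.groupby('SUB_GROUP')['ID'].agg(['first', 'count']))
#            first  count
# SUB_GROUP
# 1              1      3
# 4              4      1
# 6              5      2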