I am wondering about code that detect when values in one column BECOME bigger than values in another column. So in the example below in row index 1 B becomes bigger than A and in row index 3 A becomes bigger than B. I would like to get a DataFrame that highlights row 1 and 2 and also which column that became bigger than which.
In [1]: df
Out[1]:
A B
0 3 2
1 5 6
2 3 7
3 8 2
Desired result:
In [1]: df_result
Out[1]:
RES
0 0
1 -1
2 0
3 1
You could check where A is greater than B cast to int8 with view and take the diff:
df.A.gt(df.B).view('i1').diff().fillna(0, downcast = 'i1')
0 0
1 -1
2 0
3 1
dtype: int8
You can use np.sign with a column difference to know sign of the difference and then replace values with column label:
df['who_is_bigger'] = (pd.DataFrame(np.sign(df['col1'] - df['col2']))
.replace({-1: 'col2', 1: 'col1', 0:'same_size'}))
will return in who_is_bigger column the column label of the biggest column and same_size string if the two columns have same value
If the comparison is specifically between two columns, you can take the difference between both, let's say we've got this dataframe:
>>> data = pd.DataFrame.from_dict({'a': [5, 4, 3, 2, 1], 'b': [5, 3, 4, 1, 2]})
>>> data
a b
0 5 5
1 4 3
2 3 4
3 2 1
4 1 2
Then by doing data['a'] - data['b'] we get the deltas between each, just like this:
>>> data['a'] - data['b']
0 0
1 1
2 -1
3 1
4 -1
Where 0 means that both columns in that row possess the same number, a number higher than 0 means that the left column (a) is greater than the right column (b) and a negative number, the opposite.
Related
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before ;
df.loc[df.groupby('A')['B'].idxmin()]
If the solution but then if you get an error;
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also boolean indexing the rows where B column is minimal value
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before ;
df.loc[df.groupby('A')['B'].idxmin()]
If the solution but then if you get an error;
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also boolean indexing the rows where B column is minimal value
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I ve been around several trials, nothing seems to work so far.
I have tried df.insert(0, "XYZ", 555) which seemed to work until it did not for some reasons i am not certain.
I understand that the issue is that Index is not considered a Series and so, df.iloc[0] does not allow you to insert data directly above the Index column.
I ve also tried manually adding in the list of indices part of the definition of the dataframe a first index with the value "XYZ"..but nothing has work.
Thanks for your help
A B C D are my columns. range(5) is my index. I am trying to obtain this below, for an arbitrary row starting with type, and then a list of strings..thanks
A B C D
type 'string1' 'string2' 'string3' 'string4'
0
1
2
3
4
If you use Timestamps as Index adding a row and a custom single row with its own custom index will throw an error:
ValueError: Cannot add integral value to Timestamp without offset. I am guessing it's due to the difference in the operands, if i substract an Integer from a Timestamp for example.. ? how could i fix this in a generic manner? thanks! –
if you want to insert a row before the first row, you can do it this way:
data:
In [57]: df
Out[57]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
adding one row:
In [58]: df.loc[df.index.min() - 1] = ['z', -1]
In [59]: df
Out[59]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
-1 z -1
sort index:
In [60]: df = df.sort_index()
In [61]: df
Out[61]:
id var
-1 z -1
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
optionally reset your index :
In [62]: df = df.reset_index(drop=True)
In [63]: df
Out[63]:
id var
0 z -1
1 a 1
2 a 2
3 a 3
4 b 5
5 b 9
Someone asked to select the first observation per group in pandas df, I am interested in both first and last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify his example to tell you what I am looking for
basically there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
for large datasets, this should be much faster than list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df=df.groupby('group_id').tail(1)
You can groupby the 'id' and call nth(-1) to get the last entry for each group, then use this to mask the df and set the 'indicator' to 1 and then the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount()==data.groupby('group_id')['any_other_column'].transform('size') -1 ).astype(int)`
What we do is check if the cumulative count (which returns a vector the same size as the dataframe) is equal to the "size of the group - 1" which we calculate using transform so it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() variable but this can literally any other column and it won't be affected since its only used in calculating the new indicator. Use .astype(int) to make it a binary and done.
Basically, I have a data frame in which b could contain just 1,just 2 or a combination of 1 and 2. In case it has only one of the elements (eg 1) then the missing element (eg 2) should get a value of, say, 0.
For example if df = pd.DataFrame({'value':np.random.randn(3), 'b':[1,1,1]})
the resulting data frame should look like:
value b
-0.160580 1
0.100649 1
1.402768 1
0 2
However, if df = pd.DataFrame({'value':np.random.randn(3), 'b':[2,2,2]})
value b
0 1
-0.390148 2
0.843670 2
-0.199137 2
If df = pd.DataFrame({'value':np.random.randn(3), 'b':[1,2,2]})
value b
-0.912213 1
-1.827496 2
0.995711 2
I though of initiating a data frame:
df_init = pd.DataFrame({'value':[0,0],'b':[1,2]})
and then updating it with whatever values df has and placing them according to whether b is 1 or 2, but don't know how to do this...
You could just append if 2 isn't in the column:
In [11]: df.append(pd.Series({'value': 0, 'b': 2}), ignore_index=True)
Out[11]:
b value
0 1 1.601810
1 1 1.483431
2 1 -0.781733
3 2 0.000000
[4 rows x 2 columns]
To check, use set on the column first (more efficient if you're reusing and have a smaller number of possible values):
In [12]: b_unique = df.b.unique()
In [13]: b_unique
Out[13]: array([1])
That is,
In [14]: if 2 in s: # equivalently use if 2 in df['b'].unique()
df.append(pd.Series({'value': 0, 'b': 2}), ignore_index=True)
In [15]: df
Out[15]:
b value
0 1 1.601810
1 1 1.483431
2 1 -0.781733
3 2 0.000000
[4 rows x 2 columns]
You can do the same check for 1.