I have two dataframes:
df_1 = pd.DataFrame({'a' : [7,8, 2], 'b': [6, 6, 11], 'c': [4, 8, 6]})
df_1
and
df_2 = pd.DataFrame({'d' : [8, 4, 12], 'e': [16, 2, 1], 'f': [9, 3, 4]})
df_2
My goal is something like:
so that 'in one shot' I can subtract each pair of columns.
I'm trying a for loop, but I'm stuck!
You can subtract them as numpy arrays (using .values) and then put the result in a dataframe:
df_3 = pd.DataFrame(df_1.values - df_2.values, columns=list('xyz'))
# x y z
# 0 -1 -10 -5
# 1 4 4 5
# 2 -10 10 2
Or rename df_1.columns and df_2.columns to ['x','y','z'] and you can subtract them directly:
df_1.columns = df_2.columns = list('xyz')
df_3 = df_1 - df_2
# x y z
# 0 -1 -10 -5
# 1 4 4 5
# 2 -10 10 2
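If you'd rather not rename the original frames in place, one sketch of an alternative is to relabel a copy with set_axis (which returns a new frame) and subtract label-wise:

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [7, 8, 2], 'b': [6, 6, 11], 'c': [4, 8, 6]})
df_2 = pd.DataFrame({'d': [8, 4, 12], 'e': [16, 2, 1], 'f': [9, 3, 4]})

# align df_2's columns to df_1's without mutating either frame,
# then subtract label-wise
df_3 = df_1.sub(df_2.set_axis(df_1.columns, axis=1))
print(df_3)
```

The result keeps df_1's column names ('a', 'b', 'c') instead of introducing new ones.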
zed = pd.DataFrame(data = { 'date': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04', '2022-03-05'], 'a': [1, 5, 7, 3, 4], 'b': [3, 4, 9, 12, 5] })
How can the following dataframe be filtered to keep the earliest row (earliest == lowest date) for each of the 3 values 1, 5, 4 appearing in either column a or column b? In this example, the rows with dates '2022-03-01' and '2022-03-02' would be kept, as those are the earliest dates at which each of the 3 values appears.
We have tried zed[zed.isin({'a': [1, 5, 4], 'b': [1, 5, 4]}).any(axis=1)].sort_values(by=['date']), but this returns an incorrect result: 3 rows.
Without reshaping your dataframe, you can use:
idx = max([zed[['a', 'b']].eq(i).sum(axis=1).idxmax() for i in [1, 5, 4]])
out = zed.loc[:idx]
Output:
>>> out
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
You can reshape with DataFrame.stack, which makes it possible to filter by the list and remove duplicates:
s = zed.set_index('date')[['a','b']].stack()
idx = s[s.isin([1, 5, 4])].drop_duplicates().index.remove_unused_levels().levels[0]
print (idx)
Index(['2022-03-01', '2022-03-02'], dtype='object', name='date')
out = zed[zed['date'].isin(idx)]
print (out)
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
Or take the first index value matching each condition, get the unique values, and select the rows with DataFrame.loc:
L = [1, 5, 4]
idx = pd.unique([y for x in L for y in zed[zed[['a', 'b']].eq(x).any(axis=1)].index[:1]])
df = zed.loc[idx]
print (df)
date a b
0 2022-03-01 1 3
1 2022-03-02 5 4
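For completeness, a melt-based sketch of the same idea (the variable names here are my own): go to long format, sort by date, and keep the first occurrence of each target value.

```python
import pandas as pd

zed = pd.DataFrame({'date': ['2022-03-01', '2022-03-02', '2022-03-03',
                             '2022-03-04', '2022-03-05'],
                    'a': [1, 5, 7, 3, 4],
                    'b': [3, 4, 9, 12, 5]})
targets = [1, 5, 4]

# long format: one row per (date, column, value), sorted so the
# earliest date for each target value comes first
long = zed.melt('date', value_vars=['a', 'b']).sort_values('date')
first = long[long['value'].isin(targets)].drop_duplicates('value')
out = zed[zed['date'].isin(first['date'])]
print(out)
```

Here drop_duplicates('value') keeps only the earliest-dated row per target value, and the final isin pulls the matching original rows.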
I want to select the whole row in which the minimal value of 3 selected columns is found, in a dataframe like this:
it is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
obviously it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
'A': [30, 14, 100, 67, 1, 20],
'B': [10, 1, 2, 5, 100, 3],
'C': [1, 2, 3, 4, 5, 6],
})
# set "ID" as the index
df = df.set_index('ID')
# get the per-"ID"-group min for each column
mindf = df[['A','B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.apply(min, axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
A B C min
ID
1 14 1 2 1
2 100 2 3 2
3 1 100 5 1
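Another sketch of the same goal, keeping "ID" as a regular column: compute the row-wise minimum of the selected columns, then take the index of the smallest value within each group with idxmin.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})

# row-wise min over the selected columns, then the index label of
# the smallest such value within each "ID" group
idx = df[['A', 'B']].min(axis=1).groupby(df['ID']).idxmin()
out = df.loc[idx]
print(out)
```

This returns exactly one row per group (the row holding the overall minimum of A and B for that "ID").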
One method to filter the initial DataFrame based on a groupby conditional is to use transform to find the minimum for each "ID" group, then use loc to keep the rows where the `any(axis=1)` condition (checked row-wise) is met.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
'A': [30, 14, 100, 67, 1, 20],
'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
A B
ID
1 30 10
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
Use groupby and transform to find the minimum value for each "ID" group.
Then use loc to filter the initial df to the rows where any(axis=1) is True:
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
A B
ID
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
In this example only the first row is removed, since neither of its values is a minimum for its "ID" group.
I have the following dataframe:
import pandas as pd
data = {0: [-1, -14], 1: [-3, 2], 2: [7, 10], 4: [-10, 15]}
df = pd.DataFrame(data)
I know how to sort the columns by a specific row:
df.sort_values(by=0, ascending=False, axis=1)
How is it possible to sort the dataframe by the absolute value of the first row?
In this case I will have something like:
sorted_data = {0: [-10, 15], 1: [7, 10], 2: [-3, 2], 4: [-1, -14]}
Take row 0 as a Series, sort it by absolute value, and use the resulting index to reorder the columns of the original df:
df_sorted = df[df.iloc[0].abs().sort_values(ascending=False).index]
Output:
4 2 1 0
0 -10 7 -3 -1
1 15 10 2 -14
Pandas 1.1+ supports a key argument in sort_values:
import numpy as np

df.sort_values(0, axis=1, key=np.abs, ascending=False)
4 2 1 0
0 -10 7 -3 -1
1 15 10 2 -14
Let us try argsort:
df = df.iloc[:,(-df.loc[0].abs()).argsort()]
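To see the argsort route end to end, here is a self-contained run with the sample data above:

```python
import pandas as pd

df = pd.DataFrame({0: [-1, -14], 1: [-3, 2], 2: [7, 10], 4: [-10, 15]})

# negate the absolute values so that an ascending argsort yields
# column positions ordered by descending |row 0|
order = (-df.loc[0].abs()).argsort()
df_sorted = df.iloc[:, order]
print(df_sorted.columns.tolist())  # → [4, 2, 1, 0]
```

Negating before argsort avoids needing a descending sort option, since argsort always sorts ascending.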
I was going through this link: Return top N largest values per group using pandas
and found multiple ways to get the top-N values per group.
However, I prefer the dictionary method with the agg function, and I would like to know whether an equivalent of the dictionary method exists for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [1, 1, 2, 2, 1],
'C': [10, 20, 30, 40, 50],
'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A',as_index=False).agg(
{'C': lambda ser: ser.nlargest(2) # something like this
})
Is it possible to use the dictionary here?
If you want a dictionary mapping each A to its top 2 values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
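If the goal is specifically to keep the dictionary form of agg, one sketch (tested against recent pandas; older versions may reject an aggregation lambda that returns a list) is to return a list from the lambda and then expand it back into rows with explode:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

# the dictionary form works when the lambda returns a list,
# which explode('C') then expands back into one row per value
df1 = (df.groupby('A', as_index=False)
         .agg({'C': lambda ser: ser.nlargest(2).tolist()})
         .explode('C'))
print(df1)
```

This reproduces the same rows as the nlargest + droplevel approach shown earlier.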
I want to select rows from a dataframe based on values in the index combined with values in a specific column:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
index=[4, 5, 6, 7], columns=['A', 'B', 'C'])
A B C
4 0 2 3
5 0 4 1
6 0 20 30
7 40 20 30
with
df.loc[df['A'] == 0, 'C'] = 99
I can select all rows with column A = 0 and replace the value in column C with 99, but how can I select all rows with column A = 0 and index < 6? (I want to combine selection on the index with selection on the column.)
You can use multiple conditions in your loc statement:
df.loc[(df.index < 6) & (df.A == 0), 'C'] = 99
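Putting it together with the sample data (note the parentheses around each condition, which are required because & binds tighter than == and <):

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [0, 20, 30], [40, 20, 30]],
                  index=[4, 5, 6, 7], columns=['A', 'B', 'C'])

# both conditions must hold: index below 6 AND column A equal to 0
df.loc[(df.index < 6) & (df['A'] == 0), 'C'] = 99
print(df)
```

Rows 4 and 5 satisfy both conditions and get C = 99; row 6 has A = 0 but its index is not below 6, so it is left unchanged.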