I have a data frame like this.
mydf = pd.DataFrame({'a':[1,1,3,3],'b':[np.nan,2,3,6],'c':[1,3,3,9]})
a b c
0 1 NaN 1
1 1 2.0 3
2 3 3.0 3
3 3 6.0 9
I would like to have a resulting dataframe like this.
myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()), mydf.groupby('a').apply(lambda x: (x.b/x.c).min())], axis =1)
myResults.columns = ['max','min']
max min
a
1 0.666667 0.666667
3 1.000000 0.666667
Basically, I would like to have the max and min of the ratio of column b to column c for each group (grouped by column a).
Is it possible to achieve this with agg?
I tried mydf.groupby('a').agg([lambda x: (x.b/x.c).max(), lambda x: (x.b/x.c).min()]). It does not work; the column names b and c are not recognized inside the lambdas, since agg passes each column to the function one at a time.
Another way I can think of is to add the ratio column to mydf first, i.e. mydf['ratio'] = mydf.b/mydf.c, and then use agg on the updated mydf, like mydf.groupby('a')['ratio'].agg(['max', 'min']).
Is there a better way to achieve this through agg or another function? In summary, I would like to apply a customized function to a grouped DataFrame, where the customized function needs to read multiple columns from the original DataFrame.
You can use a customized function to achieve this. With the function below you can create any number of new columns from any combination of input columns.
def f(x):
    t = {}
    t['max'] = (x['b'] / x['c']).max()
    t['min'] = (x['b'] / x['c']).min()
    return pd.Series(t)

mydf.groupby('a').apply(f)
Output:
max min
a
1 0.666667 0.666667
3 1.000000 0.666667
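For completeness, the "add a ratio column first" idea from the question also works without mutating mydf, via assign; a minimal sketch, assuming the same mydf as above:
# compute the ratio once, then let agg produce both statistics per group
myResults = (mydf.assign(ratio=mydf.b / mydf.c)
                 .groupby('a')['ratio']
                 .agg(['max', 'min']))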
I want to do a group-by aggregation, but for each group I want to use a function based on the values of a special column that stores which function needs to be used. It's easier to show with an example:
id  group  val  func
0   0      0    "avg"
1   0      2    "avg"
2   0      2    "avg"
3   1      0    "med"
4   1      2    "med"
So in this example the expected behaviour would be the "avg" aggregation for group 0 and "median" for group 1. How can I make agg choose the function based on the "func" column values? I know that I could compute every aggregation function for every group and then use func as a mask to pick the right values, but that would do a lot of unneeded calculations; there should be a better approach...
P.S. It's guaranteed that func is the same within each group so I don't have to worry about that.
I've written my own solution for my specific case and added it to the question; the answer below is fine too.
So, my approach was:
1. Use a dict to translate the table-provided names into the proper pandas names, as suggested in the answer:
func_dict = {"avg": "mean", "med": "median", "min": "min","max": "max", "rnk": "first"}
2. Write a custom function to pass to apply later:
def pick_price(subframe: pd.DataFrame) -> float:
    # take the function name from the first row of the subframe
    # and translate it to the real pandas name via the dict;
    # the "if" block below then applies it across the subframe
    func_name = subframe["agg"].iloc[0]
    func_name = func_dict[func_name]
    if func_name != "first":
        ans = subframe["comp_price"].agg(func_name)
        return 1.0 * ans
    else:
        idx = subframe["rank"].idxmin()
        return 1.0 * subframe["comp_price"].loc[idx]
That function takes a subframe for one group (where a single function applies throughout) and, well, applies it.
3. Finally, use that function: group by the groups that need different functions, and apply it with the apply() method:
grouped = X.groupby("sku")
grouped.apply(pick_price)
I would use a dictionary of group: function:
f = {0: 'mean', 1: 'median'}
df['out'] = df.groupby('group')['val'].transform(lambda s: s.agg(f.get(s.name)))
Output:
id group val out
0 0 0 0 1.333333
1 1 0 2 1.333333
2 2 0 2 1.333333
3 3 1 0 1.000000
4 4 1 2 1.000000
variant using a column as source
NB. it's a bit hacky; I prefer the dictionary. It extracts the function name from the first row of each group. The names must be valid pandas names, like mean/median, not avg/med.
df['out'] = (df.groupby('group')['val']
.transform(lambda s: s.agg(df.loc[s.index[0], 'func']))
)
Output:
id group val func out
0 0 0 0 mean 1.333333
1 1 0 2 mean 1.333333
2 2 0 2 mean 1.333333
3 3 1 0 median 1.000000
4 4 1 2 median 1.000000
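Since the question's table stores "avg"/"med" rather than valid pandas names, a small preprocessing step makes the variant work; a sketch, reusing the func_dict idea from the self-answer above:
func_dict = {"avg": "mean", "med": "median"}
# translate the table-provided names into pandas aggregation names in place
df['func'] = df['func'].map(func_dict)
df['out'] = (df.groupby('group')['val']
             .transform(lambda s: s.agg(df.loc[s.index[0], 'func'])))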
I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason is that I'm using time series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. I do not, however, want data whose most recent value is NaN, as that means it's defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try dropna, using the last row's label as the subset so only that row decides which columns to drop:
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean mask over the last row to select the columns to drop:
df.drop(columns=df.columns[df.iloc[-1].isna()])
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if pd.isna(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)
This will work for you.
Basically what I'm doing here is looping over the column labels and checking whether the last entry of each column is NaN, then dropping that column.
list(temp_df.columns)
takes a copy of the column labels, so the loop stays safe while columns are being dropped.
temp_df.drop(columns=col)
drops the column whose label is col.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that you can compare against anything you wish.
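For instance, a hypothetical sketch dropping columns whose last entry equals an arbitrary sentinel string rather than NaN ('defunct' here is made up for illustration):
# drop columns whose last entry equals a chosen sentinel value
for col in list(temp_df.columns):
    if temp_df[col].iloc[-1] == 'defunct':  # 'defunct' is a hypothetical marker
        temp_df = temp_df.drop(columns=col)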
Another method I found is pd.isnull(), a function in the pandas library, which works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)
How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values, and I need to backfill that information by using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in another table, but I feel like there should be an easier method.
Eg: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN
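A compact variant of the same idea, as a sketch assuming the same df: build the coalesced column first, then overwrite the conflicting positions with Series.mask.
# positions where both columns are present but disagree
mask = df.A.notna() & df.B.notna() & (df.A != df.B)
df['C'] = df.A.combine_first(df.B).mask(mask, 'different')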
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN
I have a DataFrame A in Jupyter that looks like the following:
Index Var1.A.1 Var1.B.1 Var1.CA.1 Var2.A.1 Var2.B.1 Var2.CA.1
0 1 21 3 3 4 4
1 3 5 4 9 5 1
....
100 9 75 2 4 8 2
I'd like to assess the mean value based on the extension of the name, i.e.
Mean value of .A.1
Mean Value of .B.1
Mean value of .CA.1
For example, to assess the mean value of the variables with extension .A.1, I've tried the following, which doesn't return what I'm looking for:
List = ['.A.1', '.B.1', '.CA.1']
A[A.columns[A.columns.str.contains('.A.1')]].mean()
However, this way I get the mean values of too many variables: .CA.1 is picked up as well, because str.contains treats the pattern as a regex and the . matches any character. That is not what I'm looking for.
Any advice?
Thanks
If you want the mean per row over all columns sharing the same suffix after the first ., use groupby with a lambda function and mean:
df = df.groupby(lambda x: x.split('.', 1)[-1], axis=1).mean()
print (df)
A.1 B.1 CA.1
0 2.0 12.5 3.5
1 6.0 5.0 2.5
100 6.5 41.5 2.0
Here is a third option:
columns = A.columns
A[[s for s in columns if ".A.1" in s]].stack().mean()
A.filter(like='.A.1') gives you the columns containing the '.A.1' substring (like is a literal match, not a regex, so .CA.1 is not picked up).
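A short usage sketch building on filter, assuming the DataFrame A from the question:
A.filter(like='.A.1').mean()        # mean of each matching column
A.filter(like='.A.1').mean(axis=1)  # row-wise mean across the matching columns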
I have a dataframe with an observation number, and id, and a number
Obs# Id Value
--------------------
1 1 5.643
2 1 7.345
3 2 0.567
4 2 1.456
I want to calculate a new column that is the running mean of the values so far (up to and including the current row) for each Id.
I am trying to use something like this but it only acquires the previous value:
df.groupby('Id')['Value'].apply(lambda x: x.shift(1) ...
My question is how do I acquire the range of previous values filtered by the Id so I can calculate the mean ?
So the new column based on this example should be
5.643
6.494
0.567
1.0115
It seems that you want expanding, then mean
df.groupby('Id').Value.expanding().mean()
Id
1.0 1 5.6430
2 6.4940
2.0 3 0.5670
4 1.0115
Name: Value, dtype: float64
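To attach this as a new column, one option is to drop the group level so the values align back on the original index; a sketch assuming the df above:
# drop the 'Id' level so the expanding means align on the original row index
df['Mean'] = df.groupby('Id')['Value'].expanding().mean().droplevel(0)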
You can also do it like this:
df = pd.DataFrame({'Obs': [1, 2, 3, 4], 'Id': [1, 1, 2, 2],
                   'Value': [5.643, 7.345, 0.567, 1.456]})
df.groupby('Id')['Value'].apply(lambda x: x.cumsum() / np.arange(1, len(x) + 1))
It gives this output:
5.643
6.494
0.567
1.0115