I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 1, 2, 2],
                   'min_max': ['max_val', 'max_val', 'min_val', 'min_val', 'max_val', 'max_val', 'min_val', 'min_val'],
                   'value': [1, 20, 20, 10, 12, 3, -10, -5]})
id min_max value
0 1 max_val 1
1 2 max_val 20
2 1 min_val 20
3 1 min_val 10
4 2 max_val 12
5 1 max_val 3
6 2 min_val -10
7 2 min_val -5
Each id has several maximal and minimal values associated with it. My desired output looks like this:
max min
id
1 3 10
2 20 -10
It contains the maximal max_val and the minimal min_val for each id.
Currently I implement that as follows:
gdf = df.groupby(by=['id', 'min_max'])['value']
max_max = gdf.max().loc[:, 'max_val']
min_min = gdf.min().loc[:, 'min_val']
final_df = pd.concat([max_max, min_min], axis=1)
final_df.columns = ['max', 'min']
What I don't like is that I have to call .max() and .min() on the grouped data gdf separately, where I throw away 50% of the information (since I am not interested in the maximal min_val or the minimal max_val).
Is there a way to do this in a more straightforward manner by e.g. passing the function that should be applied to a group directly to the groupby call?
EDIT:
df.groupby('id')['value'].agg(['max','min'])
is not sufficient, as there can be the case that a group has a min_val that is higher than all max_val entries for that group, or a max_val that is lower than all min_val entries. Thus, one also has to group on the min_max column.
Result for
df.groupby('id')['value'].agg(['max','min'])
max min
id
1 20 1
2 20 -10
Result for the code from above:
max min
id
1 3 10
2 20 -10
Here's a slightly tongue-in-cheek solution:
>>> df.groupby(['id', 'min_max'])['value'].apply(lambda g: getattr(g, g.name[1][:3])()).unstack()
min_max max_val min_val
id
1 3 10
2 20 -10
This applies a function that grabs the name of the real function to apply from the group key.
Obviously this wouldn't work so simply if there weren't such a simple relationship between the string "max_val" and the function name "max". It could be generalized by having a dict mapping column values to functions to apply, something like this:
func_map = {'min_val': min, 'max_val': max}
df.groupby(['id', 'min_max'])['value'].apply(lambda g: func_map[g.name[1]](g)).unstack()
Note that this is slightly less efficient than the version above, since it calls the plain Python max/min rather than the optimized pandas versions. But if you want a more generalizable solution, that's what you have to do, because there aren't optimized pandas versions of everything. (This is also more or less why there's no built-in way to do this: for most data, you can't assume a priori that your values can be mapped to meaningful functions, so it doesn't make sense to try to determine the function to apply based on the values themselves.)
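For instance, here is a sketch of a middle ground that keeps the optimized pandas implementations by mapping column values to method names rather than to plain functions (name_map is an assumption for this example's data):
name_map = {'min_val': 'min', 'max_val': 'max'}
# dispatch to the optimized Series.min/Series.max via getattr
df.groupby(['id', 'min_max'])['value'].apply(lambda g: getattr(g, name_map[g.name[1]])()).unstack()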
One option is to do the customized aggregation with groupby.apply, since it doesn't fit the built-in aggregation scenarios well:
(df.groupby('id')
   .apply(lambda g: pd.Series({'max': g.value[g.min_max == "max_val"].max(),
                               'min': g.value[g.min_max == "min_val"].min()})))
# max min
#id
# 1 3 10
# 2 20 -10
Solution with pivot_table:
import numpy as np

df1 = df.pivot_table(index='id', columns='min_max', values='value', aggfunc=[np.min, np.max])
df1 = df1.loc[:, [('amin','min_val'), ('amax','max_val')]]
df1.columns = df1.columns.droplevel(1)
print (df1)
amin amax
id
1 10 3
2 -10 20
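As a side note, current pandas versions warn when numpy functions are passed to aggfunc; a sketch of the same pivot using the string names 'min' and 'max' avoids that and yields plain min/max column labels:
df1 = df.pivot_table(index='id', columns='min_max', values='value', aggfunc=['min', 'max'])
df1 = df1.loc[:, [('min', 'min_val'), ('max', 'max_val')]]
df1.columns = df1.columns.droplevel(1)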
Related
How do I get this 2^value in another column of a df?
I need to calculate 2^value. Is there an easy way to do this?
   Value  2^Value
0      0        1
1      1        2
You can use numpy.power :
import numpy as np
df["2^Value"] = np.power(2, df["Value"])
Or simply, 2 ** df["Value"], as suggested by @B Remmelzwaal.
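Spelled out, that one-liner is:
# broadcasts the exponentiation over the whole column, same result as np.power
df["2^Value"] = 2 ** df["Value"]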
Output :
print(df)
Value 2^Value
0 0 1
1 1 2
2 3 8
3 4 16
Using rpow:
df['2^Value'] = df['Value'].rpow(2)
Output:
Value 2^Value
0 0 1
1 1 2
2 2 4
3 3 8
4 4 16
You can use .apply with a lambda function:
df["new_column"] = df["Value"].apply(lambda x: 2**x)
In Python the power operator is **.
You can apply a function to each row in a dataframe by using the df.apply method. See this documentation to learn how the method is used. Here is some untested code to get you started.
# a simple function that takes a number n and returns 2^n
def calculate_2_n(n):
    return 2**n

# use the df.apply method to apply that function to each of the
# cells in the 'Value' column of the DataFrame
df['2_n_value'] = df.apply(lambda row: calculate_2_n(row['Value']), axis=1)
This code is a modified version of the code from this G4G example
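Since the function only needs the single Value cell, a lighter sketch applies it to the column directly instead of row-wise:
# Series.apply avoids building a row object for every row
df['2_n_value'] = df['Value'].apply(calculate_2_n)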
How do I pass the whole dataframe and the index of the row being operated upon when using the apply() method on a dataframe?
Specifically, I have a dataframe correlation_df with the following data:
id  scores  cosine
 1     100    0.80
 2      75    0.70
 3      50    0.40
 4      25    0.05
I want to create an extra column where each row value is the correlation of scores and cosine without that row's values included.
My understanding is that I should do this with a custom function and the apply method, i.e. correlation_df.apply(my_fuct). However, I need to pass in the whole dataframe and the index of the row in question so that I can ignore it in the correlation calculation.
NB. Problem code:
import numpy as np
import pandas as pd
score = np.array([100, 75, 50, 25])
cosine = np.array([.8, 0.7, 0.4, .05])
correlation_df = pd.DataFrame(
    {
        "score": score,
        "cosine": cosine,
    }
)
corr = correlation_df.corr().values[0, 1]
[Edit] Roundabout solution that I'm sure can be improved:
def my_fuct(row):
    i = int(row["index"])
    r = list(range(correlation_df.shape[0]))
    r.remove(i)
    subset = correlation_df.iloc[r, :].copy()
    subset = subset.set_index("index")
    return subset.corr().values[0, 1]

correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
Your problem can be simplified to:
>>> df["diff_correlations"] = df.apply(lambda x: df.drop(x.name).corr().iat[0,1], axis=1)
>>> df
score cosine diff_correlations
0 100 0.80 0.999015
1 75 0.70 0.988522
2 50 0.40 0.977951
3 25 0.05 0.960769
A more sophisticated method would be:
df.apply(lambda x: (tmp_df := df.drop(x.name)).score.corr(tmp_df.cosine), axis=1)
This way the whole correlation matrix isn't made every time.
The index can be accessed in an apply with .name or .index, depending on the axis:
>>> correlation_df.apply(lambda x: x.name, axis=1)
0 0
1 1
2 2
3 3
dtype: int64
>>> correlation_df.apply(lambda x: x.index, axis=0)
score cosine
0 0 0
1 1 1
2 2 2
3 3 3
Using
correlation_df = correlation_df.reset_index()
gives you a new column index holding what was previously your row index. Now, when using apply row-wise, access it via:
correlation_df.apply(lambda r: r["index"], axis=1)
After you are done you could do:
correlation_df = correlation_df.set_index("index")
to get your previous format back.
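Putting the pieces together, a sketch reusing my_fuct from the question's edit:
correlation_df = correlation_df.reset_index()  # former index becomes column "index"
correlation_df["diff_correlations"] = correlation_df.apply(my_fuct, axis=1)
correlation_df = correlation_df.set_index("index")  # restore the original index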
I'm trying to find the sum of price data in a data frame.
My current code looks like this, but there must be a better way to do it:
i = 5
a = 0
while i < 10:
    a = a + df.loc[i]["Price"]
    i = i + 1
averg = a/5
print(averg)
First note you should avoid chained indexing. It's ambiguous and explicitly discouraged in the docs, and instead you can use pd.DataFrame.at. In addition, you can use the += operator for incrementing a value. So you can rewrite as:
i = 5
a = 0
while i < 10:
    a += df.at[i, 'Price']
    i += 1
avg = a/5
print(avg)
However, note you can use pd.DataFrame.loc to combine row and column labelling and get a pd.Series object. You can then use pd.Series.mean to calculate the average. Since .loc slicing is inclusive of both endpoints, 5:9 covers the same five rows as the loop:
avg = df.loc[5:9, 'Price'].mean()
This way you are also taking advantage of vectorised computations as opposed to using a Python-level loop.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["Price", "Weight", "Size"])
>>> df
Price Weight Size
0 1 2 3
1 4 5 6
2 7 8 9
>>> df.mean()
Price     4.0
Weight    5.0
Size      6.0
dtype: float64
>>> df["Price"].mean()
4.0
Using loops on dataframes is very inefficient. Try using vectorised calculations whenever possible. Pandas already has a mean() function for this.
If the name of that column is "Price", then you could just do the following -
df['Price'].mean()
I have a panda dataframe, it is used for a heatmap. I would like the minimal value of each column to be along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the values such that the index of the min value in the first column is row 0, then the min value of second column is row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5,1,9],
[7,8,6],
[5,3,2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1: columns sorted by min value 1, 2, 5
data = [[1,9,5],
[8,6,7],
[3,2,5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2, Rows sorted by min value 1,2,7
data = [[1,9,5],
[3,2,5],
[8,6,7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
1 2 0
0 1 9 5
1 8 6 7
2 3 2 5
Step 2
Use np.argsort with np.diag (with numpy imported as np):
data.iloc[np.argsort(np.diag(data))]
1 2 0
0 1 9 5
2 3 2 5
1 8 6 7
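Both steps together, as a sketch:
import numpy as np

# Step 1: order columns by their minimum value
data = data.loc[:, data.min().sort_values().index]
# Step 2: reorder rows so the diagonal entries are ascending
data = data.iloc[np.argsort(np.diag(data))]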
I'm not quite sure, but you've already done the following to sort the columns
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
To move some values around so that the min value within each column is placed along the diagonal you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically just swap the min with the diagonal.
I want to implement a calculation for a simple scenario:
the value is computed as the sum of the daily data over the previous N days (set N = 3 in the following example).
Dataframe df: (df.index is 'date')
date value
20140718 1
20140721 2
20140722 3
20140723 4
20140724 5
20140725 6
20140728 7
......
and calculate a new column like:
date value new
20140718 1 0
20140721 2 0
20140722 3 0
20140723 4 6 (3+2+1)
20140724 5 9 (4+3+2)
20140725 6 12 (5+4+3)
20140728 7 15 (6+5+4)
......
Now I have done this using a for loop:
N = 3
df['new'] = [0]*len(df)
for idx in df.index:
    loc = df.index.get_loc(idx)
    if (loc - N) >= 0:
        tmp = df.iloc[loc-N:loc]  # the previous N rows
        s = tmp['value'].sum()
    else:
        s = 0
    df.loc[idx, 'new'] = s
But when the dataframe is long or N is big, this calculation becomes very slow. How can I implement this faster, e.g. with a built-in function or some other approach?
Besides, what if the scenario is more complex? Thanks.
Since you want the sum of the previous three values excluding the current one, you can apply a function over a rolling window of four and sum up all but the last value (the old rolling_apply function is spelled .rolling(...).apply(...) in current pandas):
new = df.rolling(4, min_periods=4).apply(lambda x: x[:-1].sum())
This is the same as shifting afterwards with a window of three:
new = df.rolling(3, min_periods=3).sum().shift()
Then
df["new"] = new["value"].fillna(0)