I'm trying to find the average of price data in a DataFrame.
My current code looks like this, but there must be a better way to do it:
i = 5
a = 0
while i < 10:
    a = a + df.loc[i]["Price"]
    i = i + 1
averg = a / 5
print(averg)
First, note that you should avoid chained indexing: it's ambiguous and explicitly discouraged in the docs. You can use pd.DataFrame.at instead. In addition, you can use the += operator to increment a value. So you can rewrite your code as:
i = 5
a = 0
while i < 10:
    a += df.at[i, 'Price']
    i += 1
avg = a/5
print(avg)
However, note you can use pd.DataFrame.loc to combine row and column labelling and give a pd.Series object. You can then use pd.Series.mean to calculate the average. Note that .loc slicing is label-based and inclusive of the end label, so 5:9 covers the same five rows as your loop:
avg = df.loc[5:9, 'Price'].mean()
This way you are also taking advantage of vectorised computations as opposed to using a Python-level loop.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["Price", "Weight", "Size"])
>>> df
Price Weight Size
0 1 2 3
1 4 5 6
2 7 8 9
>>> df.mean()
Price     4.0
Weight    5.0
Size      6.0
dtype: float64
>>> df["Price"].mean()
4.0
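If you want to verify the speed difference yourself, here is a minimal, self-contained timing sketch (the exact numbers will vary by machine, but the vectorised version should win by a wide margin):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"Price": np.random.rand(100_000)})

def loop_mean():
    # Python-level loop, one scalar lookup per row
    a = 0
    for i in range(len(df)):
        a += df.at[i, "Price"]
    return a / len(df)

def vectorised_mean():
    # single vectorised call
    return df["Price"].mean()

print(timeit.timeit(loop_mean, number=10))
print(timeit.timeit(vectorised_mean, number=10))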
Using loops on DataFrames is very inefficient; try using vectorised calculations whenever possible. Pandas already has a built-in mean() function for this.
If the name of that column is "Price", then you could just do the following:
df['Price'].mean()
Related
How do I get this 2^value in another column of a df
I need to calculate the 2^value
Is there an easy way to do this?
Value  2^Value
0      1
1      2
You can use numpy.power:
import numpy as np
df["2^Value"] = np.power(2, df["Value"])
Or simply, 2 ** df["Value"], as suggested by @B Remmelzwaal.
Output:
print(df)
Value 2^Value
0 0 1
1 1 2
2 3 8
3 4 16
Here are some stats/timings you can collect to compare the two approaches (a sketch; the exact numbers will vary with your machine and data size):
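import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": np.random.randint(0, 50, size=1_000_000)})

# both are vectorised, so the timings should be in the same ballpark
print(timeit.timeit(lambda: np.power(2, df["Value"]), number=10))
print(timeit.timeit(lambda: 2 ** df["Value"], number=10))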
Using rpow:
df['2^Value'] = df['Value'].rpow(2)
Output:
Value 2^Value
0 0 1
1 1 2
2 2 4
3 3 8
4 4 16
You can use .apply with a lambda function:
df["new_column"] = df["Value"].apply(lambda x: 2**x)
In Python, the power operator is **.
You can apply a function to each row in a dataframe by using the df.apply method. See this documentation to learn how the method is used. Here is some untested code to get you started.
# a simple function that takes a number and returns
# 2^n of that number
def calculate_2_n(n):
    return 2**n

# use the df.apply method to apply that function to each of the
# cells in the 'Value' column of the DataFrame
df['2_n_value'] = df.apply(lambda row: calculate_2_n(row['Value']), axis=1)
This code is a modified version of the code from this GeeksforGeeks example.
I have a dataframe (df) with 2 columns:
Out[2]:
0 1
0 1 2
1 4 5
2 3 6
3 10 12
4 1 2
5 4 5
6 3 6
7 10 12
I would like to calculate, for each element of df[0], a function of itself and the df[1] column:
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
I get the following error: TypeError:
("'numpy.float64' object is not callable", u'occurred at index 0')
Here is the full code:
from __future__ import division
import pandas as pd
import sys
from scipy import stats
def custom_fct_2(x, y):
    res = stats.percentileofscore(y.values, x.iloc[-1])
    return res
df= pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])
df['perc']=df.rolling(3).apply(custom_fct_2(df[0],df[1]))
Can someone help me with that? (I am new to Python.)
Out[2]:
0 1
...
5 4 5
6 3 6
7 10 12
I want the percentile ranking of [10] in [12,6,5]
I want the percentile ranking of [3] in [6,5,2]
I want the percentile ranking of [4] in [5,2,12]
...
The problem here is that the rolling().apply() function cannot give you a segment of 3 rows across all the columns. Instead, it gives you a series for column 0 first, then for column 1.
Maybe there are better solutions, but I will show mine, which at least works.
df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    score = df[0][s.index.values[1]]  # use .values[-1] if you want the last element
    a = s.values
    return stats.percentileofscore(a, score)
I'm using the same data you provided, but I modified your custom_fct_2() function. Here, s is a series of 3 rolling values from column 1. Fortunately, we have indexes in this series, so we can get the score from column 0 via the "middle" index of the series. BTW, in Python [-1] means the last element of a collection, but from your explanation, I believe you actually want the middle one.
Then, apply the function.
# remove the shift() function if you want the value align to the last value of the rolling scores
df['prec'] = df[1].rolling(3).apply(custom_fct_2).shift(periods=-1)
The shift function is optional. It depends on your requirements whether prec needs to be aligned with column 0 (where the middle score comes from) or with the rolling scores of column 1. I would assume you need it.
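Putting the pieces together, here is a minimal runnable version of this approach (a sketch assembled from the snippets above; raw=False is passed explicitly so that apply receives a Series with an index rather than a bare array):
from scipy import stats
import pandas as pd

df = pd.DataFrame([[1,2],[4,5],[3,6],[10,12],[1,2],[4,5],[3,6],[10,12]])

def custom_fct_2(s):
    # s holds 3 rolling values from column 1; its index tells us
    # which row of column 0 supplies the score (the middle one)
    score = df[0][s.index.values[1]]
    return stats.percentileofscore(s.values, score)

# shift(-1) aligns each result with the middle row of its window
df['prec'] = df[1].rolling(3).apply(custom_fct_2, raw=False).shift(periods=-1)
print(df)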
So I have a dataframe that looks something like this:
df1 = pd.DataFrame([[1,2, 3], [5,7,8], [2,5,4]])
0 1 2
0 1 2 3
1 5 7 8
2 2 5 4
I then have a function called add5 that adds 5 to a number. I'm trying to create a new column in df1 that adds 5 to all the numbers in column 2 that are greater than 3. I want to use vectorization, not apply, as this concept is going to be expanded to a dataset with hundreds of thousands of entries, and speed will be important. I can do it without the greater-than-3 constraint like this:
df1['3'] = add5(df1[2])
But my goal is to do something like this:
df1['3'] = add5(df1[2]) if df1[2] > 3
Hoping someone can point me in the right direction on this. Thanks!
With Pandas, a function applied explicitly to each row typically cannot be vectorised. Even implicit loops such as pd.Series.apply will likely be inefficient. Instead, you should use true vectorised operations, which lean heavily on NumPy in both functionality and syntax.
In this case, you can use numpy.where:
df1[3] = np.where(df1[2] > 3, df1[2] + 5, df1[2])
Alternatively, you can use pd.DataFrame.loc in a couple of steps:
df1[3] = df1[2]
df1.loc[df1[2] > 3, 3] = df1[2] + 5
In each case, the term df1[2] > 3 creates a Boolean series, which is then used to mask another series.
Result:
print(df1)
0 1 2 3
0 1 2 3 3
1 5 7 8 13
2 2 5 4 9
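For illustration, the mask df1[2] > 3 is just a Boolean series (shown here with the df1 defined above), which both methods use to select rows:
>>> df1[2] > 3
0    False
1     True
2     True
Name: 2, dtype: bool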
Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that holds the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
Which returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x, y):
    return x + y

df.apply(lambda x: someThingSpecial(x.A, x.B), axis=1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
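As a quick sanity check, using the toy someThingSpecial from the answer above (which just returns x + y), the vectorised call reproduces the same results:
>>> v = np.vectorize(someThingSpecial)
>>> v(df.A, df.B)
array([ 3, 10, 16])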
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 1, 1, 2, 1, 2, 2],
                   'min_max': ['max_val', 'max_val', 'min_val', 'min_val',
                               'max_val', 'max_val', 'min_val', 'min_val'],
                   'value': [1, 20, 20, 10, 12, 3, -10, -5]})
id min_max value
0 1 max_val 1
1 2 max_val 20
2 1 min_val 20
3 1 min_val 10
4 2 max_val 12
5 1 max_val 3
6 2 min_val -10
7 2 min_val -5
Each id has several maximal and minimal values associated with it. My desired output looks like this:
max min
id
1 3 10
2 20 -10
It contains the maximal max_val and the minimal min_val for each id.
Currently I implement that as follows:
gdf = df.groupby(by=['id', 'min_max'])['value']
max_max = gdf.max().loc[:, 'max_val']
min_min = gdf.min().loc[:, 'min_val']
final_df = pd.concat([max_max, min_min], axis=1)
final_df.columns = ['max', 'min']
What I don't like is that I have to call .max() and .min() on the grouped dataframe gdf separately, throwing away 50% of the information each time (since I am not interested in the maximal min_val or the minimal max_val).
Is there a way to do this in a more straightforward manner by e.g. passing the function that should be applied to a group directly to the groupby call?
EDIT:
df.groupby('id')['value'].agg(['max','min'])
is not sufficient, as a group can have a min_val that is higher than all of its max_val entries, or a max_val that is lower than all of its min_val entries. Thus, one also has to group on the min_max column.
Result for
df.groupby('id')['value'].agg(['max','min'])
max min
id
1 20 1
2 20 -10
Result for the code from above:
max min
id
1 3 10
2 20 -10
Here's a slightly tongue-in-cheek solution:
>>> df.groupby(['id', 'min_max'])['value'].apply(lambda g: getattr(g, g.name[1][:3])()).unstack()
min_max max_val min_val
id
1 3 10
2 20 -10
This applies a function that grabs the name of the real function to apply from the group key.
Obviously this wouldn't work so simply if there weren't such a simple relationship between the string "max_val" and the function name "max". It could be generalized by having a dict mapping column values to functions to apply, something like this:
func_map = {'min_val': min, 'max_val': max}
df.groupby(['id', 'min_max'])['value'].apply(lambda g: func_map[g.name[1]](g)).unstack()
Note that this is slightly less efficient than the version above, since it calls the plain Python max/min rather than the optimized pandas versions. But if you want a more generalizable solution, that's what you have to do, because there aren't optimized pandas versions of everything. (This is also more or less why there's no built-in way to do this: for most data, you can't assume a priori that your values can be mapped to meaningful functions, so it doesn't make sense to try to determine the function to apply based on the values themselves.)
One option is to do the customized aggregation with groupby.apply, since it doesn't fit the built-in aggregation scenario well:
(df.groupby('id')
   .apply(lambda g: pd.Series({'max': g.value[g.min_max == "max_val"].max(),
                               'min': g.value[g.min_max == "min_val"].min()})))
# max min
#id
# 1 3 10
# 2 20 -10
Solution with pivot_table:
import numpy as np

df1 = df.pivot_table(index='id', columns='min_max', values='value', aggfunc=[np.min, np.max])
df1 = df1.loc[:, [('amin','min_val'), ('amax','max_val')]]
df1.columns = df1.columns.droplevel(1)
print (df1)
amin amax
id
1 10 3
2 -10 20
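The amin/amax labels come from passing the NumPy functions as aggregators. As a sketch (assuming a reasonably recent pandas version), passing the aggregator names as strings yields friendlier labels and lets you rename the columns to match the desired output:
df1 = df.pivot_table(index='id', columns='min_max', values='value',
                     aggfunc=['min', 'max'])
df1 = df1.loc[:, [('max', 'max_val'), ('min', 'min_val')]]
df1.columns = ['max', 'min']
print(df1)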