I want to set a cell of pandas dataframe equal to another. For example:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude']
However, when I checked
station_dim.loc[station_dim.nlc==573,'longitude']
It returns NaN
Beside directly set the station_dim.loc[station_dim.nlc==573,'longitude']to a number, what else choice do I have? And why can't I use this method?
Take a look at get_value, or use .values:
station_dim.loc[station_dim.nlc==573,'longitude']=station_dim.loc[station_dim.nlc==5152,'longitude'].values[0]
For the assignment to work - .loc[] will return a pd.Series, the index of that pd.Series would need to align with your df, which it probably doesn't. So either extract the value directly using .get_value() - where you need to get the index position first - or use .values, which returns a np.array, and take the first value of that array.
Related
I'm learning Python and want to use the "apply" function. Reading around the manual I found that if a I have a simple dataframe like this:
df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
A B
0 4 9
1 4 9
2 4 9
and then I use something like this:
df.apply(lambda x:x.sum(),axis=0)
output works because according to theory x receives every column and apply the sum to each so the the result is correctly this:
A 12
B 27
dtype: int64
When instead I issue something like:
df['A'].apply(lambda x:x.sum())
result is: 'int' object has no attribute 'sum'
question is: why is that working on a dataframe by column, it's ok and working on a single column is not ? In the end the logic should be the same. x should receive in input one column instead of two.
I know that for this simple example I should use other functions like df.agg or even df['A'].sum() but the question is to understand the logic of apply.
if you look at a specific column of a pandas.DataFrame object, you working with a pandas.Series with (in your case) integers as values. Well and integers don't have a sum() method.
(Run type(df['A']) to see that you are working with a series and not a data frame anymore when slicing a single column).
The irritating part is that if you work with an actual pandas.DataFrame object, every column is a pandas.Series object and they have a sum() method.
So there are two ways to fix your problem
Work with a pandas.DataFrame and not with a pandas.Series: df[['A']]. The additional brackets force pandas to return a pandas.DataFrame object. (Verify by type(df[['A']])) and use the lambda function just as you did before
use a function rather than a method when using lambda: df['A'].apply(lambda x: np.sum(x)) (assuming that you have imported numpy as np)
I would recommend to go with the second option as it seems to me the more generic and clearer way
However, this is only relevant if you want to apply a certain function to ever element in a pandas.Series or pandas.DataFrame. In your specific case, there is no need to take the detour that your are currently using. Just use df.sum(axis=0).
The approach with apply is over complicating things. The reason why this works is that every element of a pandas.DataFrame is a pandas.Series, which as a sum method. But so does a pandas.DataFrame has, so you can use this right away.
The only way, where you actually need to take the way with apply is if you had arrays in every cell of the pandas.DataFrame
I am wondering why pandas assign function cannot handle returned lists.
For example
df = pd.DataFrame({
"id" : [1,2,3,4,5],
"val" : [10,20,30,30,40]
})
def squareMe(x):
return x**2
df = df.assign(val2 = lambda x: squareMe(x.val))
# Out > Works fine : Returns a DataFrame with squared values
But if we return a list,
def squareMe(x):
return [x**2]
df = df.assign(val2 = lambda x: squareMe(x.val))
#Out > ValueError: Length of values (1) does not match length of index (5)
However pandas apply function works fine when returning a list
def squareMe(x):
return [x**2]
df["val2"] = df.val.apply(lambda x: squareMe(x))
Any particular reason why this is or am I doing something wrong?
Since you reference x.val in the call to squareMe, that function is passed a list (you can easily verify this by adding a debug statement to print type(x) inside the function).
Thus, x ** 2 returns a Series (since the expression is vectorized) and the assignment works correctly.
But when you return [x ** 2] you're returning the Series inside a list, which doesn't make Sense to apply since all it sees is an iterable of size "1" (the series inside it) and it deems this to be the incorrect length for performing a column assignment to a DataFrame of size 5 (which is exactly what ValueError: Length of values (1) does not match length of index (5) means).
The difference is with apply is that the function receives a number, not a series. And so you still return a single item (a list) which apply accepts, but is still technically wrong since you shouldn't need to wrap the result in a list.
More information: df.assign, df.apply
P.S.: you probably already understand this, but you can simplify this to df['val'] = df['x'] ** 2
assign isn't particularly meant for this, it is for assigning columns already returned sequences as the arguments.
Docs:
**Parameters : kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not
change input DataFrame (though pandas doesn’t check it). If the values
are not callable, (e.g. a Series, scalar, or array), they are simply
assigned.
Doing [x ** 2] returns a series of lists which would be treated like a matrix (or dataframe), and therefore as the error mentions:
ValueError: Length of values (1) does not match length of index (5)
The length of values wouldn't match to the index.
This has been driving me nuts all day in my own work, but I've got it now.
cs95 is almost correct, but not quite. If you follow their advice and put a print(f"{type(x)}") in your squareMe function you'll see that it's a Series, not a list.
That's the catch, x.val is always a Series (the entire column of values), and squareMe returns a Series.
In contrast, apply, if you specify axis=1, will iterate over each row in the column, so each value of x.val and pass each one to squareMe, building a new Series for your new column in the process.
The reason it confused you (and me!) is that, when it works in your first example, it looks like squareMe is operating on integers and returning an integer for each row. But in fact, it's taking advantage of operator overloading to square the Series, not individual values: It's using the pow function, which is aliased as **, which like the other overloaded operators on Series, works element-wise.
Now, when you change squareMe to return the list of the result: [x**2], it's again squaring the entire Series to get a new Series of squares, but then making a list of that Series. That is, a list of a single element, the element being a Series.
Now assign was expecting a Series back from squareMe of the same length as the index of the dataframe, which is 5, and you returned it a list with a single element - hence the error: expected length 5, got 1.
Your apply, in the meantime, is working on the Series val because that's what you called it on, and it's iterating over the values in that series. Another way to do the apply, which is closer to your assign is this:
df["val2"] = df.apply(lambda x: squareMe(x.val), axis=1)
Pandas loc method when used with row filter throws an error
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)]
IndexingError: Unalignable boolean Series provided as indexer (index
of the boolean Series and of the indexed object do not match).
whereas the same code without row filter works fine
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)]
steps to reproduce
test=pd.DataFrame({"holiday":[0,0,0],"weekday":[1,2,3],"workingday":[1,1,1]})
test[test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)] ##works fine
test[test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)] ##fails
I am trying to understand what is the difference between these two which makes one fail whereas the other one succeed
So the basic syntax is DataFrame[things to look for, e.g row slices or columns]
With that in mind, you are trying to filter your dataframe test with the following commands (these are the code snippets in the brackets):
test.loc[:,['holiday','weekday']].apply(lambda x:True,axis=1)
This returns True for every row in the dataframe and therefore the "filter" returns the entire dataframe
test.loc[0:1,['holiday','weekday']].apply(lambda x:True,axis=1)
This part itself is working and it is doing so by slicing the rows 0 and 1 and then applying the lambda function. Therefore, the "filter" consists of True in only 2 rows. Now the point is, that there is no value for the third row and this causes your error: The indices of the dataframe that has to be sliced (3 rows) and the boolean Series used to slice it (2 values) don't match.
Solving this problem depends on what you actually want as your output, i.e. whether the lambda function is supposed to be applied only to a subset of the data or whether you want only a subset of the results being retrieved to work with.
For my thesis I need the implied volatility of options, I already created the following function for it:
#Implied volatility solver
def Implied_Vol_Solver(s_t,K,t,r_f,option,step_size):
#s_t=Current stock price, K=Strike price, t=time until maturity, r_f=risk-free rate and option=option price,stepsize=is precision in stepsizes
#sigma set equal to steps to make a step siz equal to the starting point
sigma=step_size
while sigma < 1:
#Regualar BlackScholes formula (current only call option, will also be used to calculate put options)
d_1=(np.log(s_t/K)+(r_f+(sigma**2)/2)*t)/(sigma*(np.sqrt(t)))
d_2=d_1-np.square(t)*sigma
P_implied=s_t*norm.cdf(d_1)-K*np.exp(-r_f*t)*norm.cdf(d_2)
if option-(P_implied)<step_size:
#convert stepts to a string to find the decimal point (couldn't be done with a float)
step_size=str(step_size)
#rounds sigma equal to the stepsize
return round(sigma,step_size[::-1].find('.'))
sigma+=step_size
return "Could not find the right volatility"
The variables I need are located in a Pandas DataFrame and I already created a loop for it, to test if it works (I will add the other variables when it works correctly):
for x in df_option_data['Settlement_Price']:
df_option_data['Implied_Volatility']=Implied_Vol_Solver(100,100,1,0.01,x,0.001)
However when I run this loop I will get 0.539 for the whole Implied_Voltality Column and these numbers needs to be different, what do I wrong? Or are there any easier solutions?
I also tried the following:
df_option_data['Implied_Volatility']=Implied_Vol_Solver(100,100,1,0.01,np.array(df_option_data['Settlement_Price']),0.001)
But than I get the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Essentially what I need is the following: A dataframe with 5 columns for the input variables and 1 column with the output variables (the implied volatility) which is calculated by the function.
You are replacing the result from Implied_Vol_Solver to the entire column instead of a specific cell.
Try the following:
df_option_data['Implied_Volatility'] = df_option_data['Settlement_Price'].apply(lambda x: Implied_Vol_Solver(100,100,1,0.01,x,0.001))
The apply function can apply a function to all the elements in a data column so that you don't need to do the for loop yourself.
Instead of having the input variables passed into the function, you could pass in the row (as a series) and pluck off the values from that. Then, use an apply function to get your output frame. This would look something like this:
def Implied_Vol_Solver(row):
s_t = row['s_t'] # or whatever the column is called in the dataframe
k = row['k'] # and so on and then leave the rest of your logic as is
Once you've modified the function, you can use it like this:
df_option_data['Implied_Volatility'] = df_option_data.apply(Implied_Vol_Solver, axis=1)
I have this Pandas DataFrame and I have to convert some of the items into coordinates, (meaning they have to be floats) and it includes the indexes while trying to convert them into floats. So I tried to set the indexes to the first thing in the DataFrame but it doesn't work. I wonder if it has anything to do with the fact that it is a part of the whole DataFrame, only the section that is "Latitude" and "Longitude".
df = df_volc.iloc(axis = 0)[0:, 3:5]
df.set_index("hello", inplace = True, drop = True)
df
and I get the a really long error, but this is the last part of it:
KeyError: '34.50'
if I don't do the set_index part I get:
Latitude Longitude
0 34.50 131.60
1 -23.30 -67.62
2 14.50 -90.88
I just wanna know if its possible to get rid of the indexes or set them.
The parameter you need to pass to set_index() function is keys : column label or list of column labels / arrays. In your scenario, it seems like "hello" is not a column name.
I just wanna know if its possible to get rid of the indexes or set them.
It is possible to replace the 0, 1, 2 index with something else, though it doesn't sound like it's necessary for your end goal:
to convert some of the items into [...] floats
To achieve this, you could overwrite the existing values by using astype():
df['Latitude'] = df['Latitude'].astype('float')