In-place shift operation in pandas - python

How do I save the result of a shift operation in pandas to the same column? The line below returns a new shifted column.
df.groupby('uid')['top'].shift(periods=2)
I want to apply the shift to the top column itself. Is there a way to do an in-place shift?

No, DataFrameGroupBy.shift has no in-place option; you need to assign the result back to the same column:
df['top'] = df.groupby('uid')['top'].shift(periods=2)
Or:
df = df.assign(top = df.groupby('uid')['top'].shift(periods=2))
Also, I think inplace is not good practice.
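As a minimal sketch of the assign-back approach described above (the sample data is made up for illustration; only the uid and top column names come from the question):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "uid": [1, 1, 1, 2, 2, 2],
    "top": [10, 20, 30, 40, 50, 60],
})

# shift() returns a new Series aligned to df's index, so assign it back.
df["top"] = df.groupby("uid")["top"].shift(periods=2)

print(df)
```

The first two rows of each uid group become NaN because there is nothing to shift into them, which also turns the column into a float column.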

Related

Calling pandas apply function on results of a mask

I want to apply a function to a subset of rows in my dataframe based on some condition described in a mask. E.g:
mask = (n.city=='No City Found')
n[mask].city = n[mask].address.apply(lambda x: find_city(x))
When I do this, pandas warns me that I'm trying to set a value on a copy of a Dataframe slice. When I inspect the Dataframe, I see that my changes have not persisted.
If I create a new Dataframe slice x using mask and apply the function to x, the results of the apply function are correctly stored in x.
x = n[mask]
x.city = x.address.apply(lambda x: find_city(x))
Is there a way to map this data back to my original Dataframe such that it only affects rows that meet the conditions described in my original mask?
Or is there an easier way altogether to perform such an operation?
The right way to update values is to use loc:
n.loc[mask, 'city'] = n[mask].address.apply(lambda x: find_city(x))
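You can also do it without the mask, if you would rather not keep the mask variable in memory. The function then has to be applied row-wise to the DataFrame (not to the address column), so that both city and address are available on each row:
n['city'] = n.apply(
    lambda row: find_city(row.address)
    if row.city == 'No City Found' else row.city, axis=1
)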
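Here is a small self-contained sketch of the loc-based update. The find_city function below is a hypothetical stand-in (the asker's real implementation is not shown), and the sample rows are invented:

```python
import pandas as pd

def find_city(address):
    # Hypothetical stand-in for the asker's function:
    # takes whatever follows the last comma in the address.
    return address.split(",")[-1].strip()

# Invented sample data using the column names from the question.
n = pd.DataFrame({
    "address": ["1 Main St, Springfield", "2 Oak Ave, Shelbyville"],
    "city": ["No City Found", "Shelbyville"],
})

mask = n["city"] == "No City Found"

# A single .loc call on the original frame, so the assignment persists.
n.loc[mask, "city"] = n.loc[mask, "address"].apply(find_city)

print(n)
```

Because the assignment goes through one .loc call on n itself, there is no intermediate copy for pandas to warn about.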

Dataframe sorting does not apply when using .loc

I need to sort a pandas DataFrame df by a datetime column my_date. Whenever I use .loc, the sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference of these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows sorted dataframe.
In the first snippet, df.loc[...] and df.sort_values(...) are two separate statements, and the DataFrame returned by sort_values() is never stored, so df stays unsorted.
In the second snippet, the calls are chained as df.loc[...].sort_values(...) and the result is assigned to df, which is the correct way. You don't have to write df. twice.
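The difference can be reproduced with a small made-up frame (the data is invented; the column names follow the question):

```python
import pandas as pd

# Invented sample data using the column names from the question.
df = pd.DataFrame({
    "some_column": ["filter", "other", "filter", "filter"],
    "my_date": pd.to_datetime(
        ["2021-03-01", "2021-01-15", "2021-01-01", "2021-02-01"]
    ),
})

filtered = df.loc[df["some_column"] == "filter"]

# The sorted copy is returned and immediately discarded.
filtered.sort_values(by="my_date")
unsorted_order = filtered.index.tolist()

# Keeping the return value is what makes the sort "stick".
sorted_df = filtered.sort_values(by="my_date")
sorted_order = sorted_df.index.tolist()

print(unsorted_order, sorted_order)
```

The first call leaves filtered in its original row order; only the second, whose result is stored, yields rows ordered by my_date.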

Which way is better to reset index in a pandas dataframe?

There are 3 ways to reset the index: reset_index(), reset_index(inplace=True), and manually setting the index:
df.index = list(range(len(df)))
Since inplace is going to be deprecated in pandas 2, which way is better - reset_index() or manual setting and why?
When assigning to the index, the rest of the data in your DataFrame is not changed, just the index.
If you call reset_index, it creates a copy of your original DataFrame, modifies its index, and returns that. You may prefer this if you're chaining method calls (df.reset_index().method2().method3() as opposed to df.index = ...; df.method2().method3()), but for larger DataFrames this becomes inefficient memory-wise.
Direct assignment is preferred in terms of performance, but what you should prefer depends on the situation.
There are several ways:
df = df.reset_index(drop=True)
df.reset_index(drop=True, inplace=True)  # returns None, so do not assign the result back to df
The solutions below are faster:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
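A short sketch comparing the two styles on a toy frame (the data is invented, and no timing is measured here; the speed claims above are the answerers' own):

```python
import pandas as pd

# Invented sample frame with a non-default index.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[5, 7, 9])

# Option 1: reset_index returns a new DataFrame; the original is untouched.
df_reset = df.reset_index(drop=True)
original_index_after_reset = df.index.tolist()

# Option 2: replace the index on the existing object.
df.index = pd.RangeIndex(len(df.index))

print(df_reset.index, df.index)
```

Both end up with a 0..n-1 index; the difference is whether a new DataFrame object is created along the way.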

Assignment through chained indexers

I would like to be able to assign to a DataFrame through chained indexers. Notionally like this:
subset = df.loc[mask]
... # much later
subset.loc[mask2, 'column'] += value
This does not work because, as I understand it, the second .loc triggers a copy-on-write. Is there a way to do this?
I could pass df and mask around so that the later code could combine mask and mask2 before making an assignment, but it feels much cleaner to pass around the subset view instead, so that the later code only has to worry about its own mask.
When you get to:
subset.loc[mask2, 'column']
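select the column with a list, so that the result is a DataFrame rather than a Series (a Series has no columns attribute), and assign it to another variable so you can access its index and columns attributes.
subsubset = subset.loc[mask2, ['column']]
Then you can access df with subsubset's index and columns:
df.loc[subsubset.index, subsubset.columns] += value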
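A runnable sketch of the same idea, assuming df has a unique index so that the subset's index labels identify rows of df unambiguously (the data, the group column, and both masks are invented for illustration):

```python
import pandas as pd

# Invented sample data; assumes df has a unique index.
df = pd.DataFrame({"group": ["a", "b", "a", "a"], "column": [1, 2, 3, 4]})

mask = df["group"] == "a"
subset = df.loc[mask]              # rows 0, 2, 3 of df

# ... much later, code that only knows about `subset`
mask2 = subset["column"] > 1       # rows 2 and 3
value = 10

# Translate the second mask into index labels, then write to df directly.
target_index = subset.loc[mask2].index
df.loc[target_index, "column"] += value

print(df)
```

The later code still only needs its own mask; the subset's index carries the link back to the original frame.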

Pandas: Use iterrows on Dataframe subset

What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import datetime as DT
import pandas as pd
df = pd.DataFrame({
'Product': list('AAAABBAA'),
'Quantity': [5,2,5,10,1,5,2,3],
'Start' : [
DT.datetime(2013,1,1,9,0),
DT.datetime(2013,1,1,8,5),
DT.datetime(2013,2,5,14,0),
DT.datetime(2013,2,5,16,0),
DT.datetime(2013,2,8,20,0),
DT.datetime(2013,2,8,16,50),
DT.datetime(2013,2,8,7,0),
DT.datetime(2013,7,4,8,0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the iterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
    row_i['Product'] = 'A1' # actually a more complex calculation
However, the changes do not persist.
Is there any way (other than a manual lookup using the index i) to make persistent changes to the original DataFrame?
Why do you need iterrows() for this? I think it's always preferable to use vectorized operations in pandas (or numpy). Note that .ix has been removed from pandas, so use .loc:
df.loc[df['Product'] == 'A', "Product"] = 'A1'
The best way that comes to mind is to build a new vector with the desired result, looping as much as you want, and then assign it back to the column:
#make a copy of the column
P = df.Product.copy()
#do the operation or loop if you really must
P[ P=="A" ] = "A1"
#reassign to original df
df["Product"] = P
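If the per-row calculation really is more complex than a constant, one possible sketch is to run it with apply on the masked rows and write the result back through .loc. The relabel function is a hypothetical placeholder for that calculation, the frame is a shortened version of the one in the question, and the Start index is assumed to be unique:

```python
import datetime as DT
import pandas as pd

# Shortened version of the question's frame; the Start index is unique here.
df = pd.DataFrame({
    "Product": list("AAB"),
    "Quantity": [5, 2, 1],
    "Start": [
        DT.datetime(2013, 1, 1, 9, 0),
        DT.datetime(2013, 1, 1, 8, 5),
        DT.datetime(2013, 2, 8, 20, 0),
    ],
}).set_index("Start")

def relabel(row):
    # Hypothetical placeholder for the "more complex calculation".
    return f"{row['Product']}{row['Quantity']}"

is_a = df["Product"] == "A"

# Compute new values for the subset, then assign them back to df itself.
df.loc[is_a, "Product"] = df.loc[is_a].apply(relabel, axis=1)

print(df)
```

The row-wise apply is still a Python-level loop, so a fully vectorized expression is preferable whenever the calculation allows it.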
