There are 3 ways to reset the index: reset_index(), inplace, and manually setting the index as
df.index = list(range(len(df)))
Since inplace is going to be deprecated in pandas 2, which way is better, reset_index() or manual assignment, and why?
When assigning to the index, the rest of the data in your DataFrame is not changed, just the index.
If you call reset_index, it creates a copy of your original DataFrame, modifies its index, and returns that. You may prefer this if you're chaining method calls (df.reset_index().method2().method3() as opposed to df.index = ...; df.method2().method3()), but for larger DataFrames this becomes inefficient memory-wise.
Direct assignment is preferred in terms of performance, but what you should prefer depends on the situation.
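To make the difference concrete, here is a minimal sketch. The frame and its gapped index are made up for illustration; the point is only that reset_index() hands back a new object while assignment changes the existing one.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, e.g. after filtering rows.
df = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 5])

# reset_index returns a new DataFrame and leaves the original untouched.
new_df = df.reset_index(drop=True)
original_index_after_reset = list(df.index)  # df still has its old labels

# Direct assignment replaces the index on the existing object.
df.index = range(len(df))
```

After this runs, new_df and df are two separate objects that both carry a fresh 0..n-1 index.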
There are several ways:
df = df.reset_index(drop=True)
df.reset_index(drop=True, inplace=True)  # returns None, so do not assign the result back to df
The solutions below are faster:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
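As a quick check of the variants listed above, the sketch below (with made-up data) shows that inplace=True returns None and that assigning a plain range ends up as a RangeIndex as well.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]}, index=[3, 1, 7])

# With inplace=True the method mutates df and returns None,
# so writing df = df.reset_index(..., inplace=True) would lose the data.
result = df.reset_index(drop=True, inplace=True)

# Assigning a plain range is converted to a RangeIndex by pandas.
other = pd.DataFrame({"a": [10, 20, 30]}, index=[3, 1, 7])
other.index = range(len(other.index))
```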
How do I save the result of a shift operation in pandas back to the same column? The line below returns a new shifted column.
df.groupby('uid')['top'].shift(periods=2)
I want to apply the shift operation to the top column itself. Is there any way to do an in-place shift?
No, there is no in-place version of DataFrameGroupBy.shift; you need to assign the result back to the same column:
df['top'] = df.groupby('uid')['top'].shift(periods=2)
Or:
df = df.assign(top = df.groupby('uid')['top'].shift(periods=2))
Also, I think inplace is not good practice.
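For a concrete picture of the assign-back approach, here is a small sketch. The uid and top values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "uid": ["a", "a", "a", "b", "b"],
    "top": [1, 2, 3, 4, 5],
})

# shift returns a new Series aligned to df's index, so assign it back.
# Each group is shifted on its own; positions without a source become NaN.
df["top"] = df.groupby("uid")["top"].shift(periods=2)
```

Only the third row of group "a" keeps a value here, because group "b" has just two rows and a shift of 2 leaves nothing to carry over.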
I need to sort a pandas DataFrame df by a datetime column my_date. Whenever I use .loc, the sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference between these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows the sorted DataFrame.
In the first snippet, df.loc[...] and df.sort_values(...) are two separate statements, and the result of sort_values() is never stored anywhere, so df stays unsorted.
In the second snippet, you chain the calls as df.loc[...].sort_values(...) and assign the result, which is the correct way. You don't have to write df. twice.
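Here is a small sketch of both cases side by side. The column names come from the question; the rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "some_column": ["filter", "other", "filter", "filter"],
    "my_date": pd.to_datetime(
        ["2021-03-01", "2021-01-15", "2021-01-01", "2021-02-01"]
    ),
})

filtered = df.loc[df.some_column == "filter"]

# Result discarded: sort_values returns a new frame, filtered is unchanged.
filtered.sort_values(by=["my_date"])
unsorted_order = list(filtered.index)

# Result kept: chain the calls and assign the outcome.
sorted_df = df.loc[df.some_column == "filter"].sort_values(by=["my_date"])
```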
I have a pandas DataFrame df with different types of columns; some values of df are NaN.
To test some assumption, I create a copy of df and transform the copy to (0, 1) with pandas.isnull():
df_copy = df
for column in df_copy:
    df_copy[column] = df_copy[column].isnull().astype(int)
but after that BOTH df and df_copy consist of 0 and 1.
Why does this code transform df to 0 and 1, and is there a way to prevent it?
You can prevent it by declaring:
df_copy = df.copy()
This creates a new object. Before that, you essentially had two names pointing to the same object. Note also that DataFrames are mutable.
Btw, you could obtain the desired result simply with:
df_copy = df.isnull().astype(int)
Or, even better memory-wise:
for column in df:
    df[column + 'flag'] = df[column].isnull().astype(int)
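The aliasing behaviour discussed in this thread can be checked with a short sketch. The data and the names alias, independent and mask are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", None]})

alias = df               # a second name for the same object
independent = df.copy()  # a new object with its own data

# Build the 0/1 missing-value mask without touching df.
mask = df.isnull().astype(int)

# Changing the copy leaves df alone.
independent["x"] = 0
```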
This has been killing me!
Any idea how to convert this to a list comprehension?
for x in dataframe:
    if dataframe[x].value_counts().sum() <= 1:
        dataframe.drop(x, axis=1, inplace=True)
[dataframe.drop(x, axis=1, inplace=True) for x in dataframe if dataframe[x].value_counts().sum() <= 1]
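I have not used pandas yet, but the documentation says dataframe.drop returns a new object unless inplace=True is passed, in which case it modifies dataframe and returns None. So the comprehension should work through its side effects, and the list it builds is just a list of None values.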
I would probably suggest going the other way and filtering instead. I don't know your dataframe, but something like this should work:
counts_valid = df.T.apply(pd.Series.value_counts).sum() > 1
df = df[counts_valid]
Or, if I see what you are doing, you may be better with
counts_valid = df.T.nunique() > 1
df = df[counts_valid]
That will just keep rows that have more than one unique value.
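Note that the loop in the question drops columns, while the snippets above filter rows. If the columns are what you are after, a column-wise filter might look like the sketch below. This assumes the intent is to drop columns with at most one non-null value; the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "keep": [1, 2, 3],
    "one_value": [np.nan, 5, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

# count() is the number of non-null entries per column, which equals
# value_counts().sum() because value_counts() skips NaN by default.
enough_values = df.count() > 1
filtered = df.loc[:, enough_values]
```

This avoids dropping columns one by one and keeps the original df intact.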
I'm not sure how to reset index after dropna(). I have
df_all = df_all.dropna()
df_all.reset_index(drop=True)
but after running my code, the row index skips steps. For example, it becomes 0,1,2,4,...
The code you've posted already does what you want, but does not do it "in place." Try adding inplace=True to reset_index() or else reassigning the result to df_all. Note that you can also use inplace=True with dropna(), so:
df_all.dropna(inplace=True)
df_all.reset_index(drop=True, inplace=True)
Does it all in place. Or,
df_all = df_all.dropna()
df_all = df_all.reset_index(drop=True)
to reassign df_all.
You can chain methods and write it as a one-liner:
df = df.dropna().reset_index(drop=True)
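A tiny sketch with made-up data shows the gap that dropna() leaves and how the reassigned reset_index() closes it. The intermediate name dropped is only there to make the two steps visible.

```python
import numpy as np
import pandas as pd

df_all = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan, 5.0]})

# dropna removes the row labelled 3 but keeps the other labels as they were.
dropped = df_all.dropna()

# reset_index returns a new frame with a fresh index; keep the result.
df_all = dropped.reset_index(drop=True)
```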
You can reset the index to default using set_axis() as well. Note that set_axis() no longer accepts inplace=True as of pandas 2.0, so assign the result back:
df = df.dropna()
df = df.set_axis(range(len(df)))
set_axis() is especially useful if you want to reset the index to something other than the default, because as long as the lengths match, you can change the index to literally anything with it. For example, you can change it to first row, second row etc.
df = df.dropna()
df = df.set_axis(['first row', 'second row'])
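Since the labels must match the number of rows, it can help to generate them from len(df). A minimal sketch, with made-up data and a hypothetical "row N" naming scheme:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df = df.dropna()

# Default integer labels.
default = df.set_axis(range(len(df)))

# Custom labels generated to match the row count.
labelled = df.set_axis([f"row {i + 1}" for i in range(len(df))])
```

set_axis() returns a new frame in both cases, so df itself keeps the labels left over from dropna().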