I'm not sure how to reset index after dropna(). I have
df_all = df_all.dropna()
df_all.reset_index(drop=True)
but after running my code, row index skips steps. For example, it becomes 0,1,2,4,...
The code you've posted already does what you want, but does not do it "in place." Try adding inplace=True to reset_index() or else reassigning the result to df_all. Note that you can also use inplace=True with dropna(), so:
df_all.dropna(inplace=True)
df_all.reset_index(drop=True, inplace=True)
Does it all in place. Or,
df_all = df_all.dropna()
df_all = df_all.reset_index(drop=True)
to reassign df_all.
You can chain methods and write it as a one-liner:
df = df.dropna().reset_index(drop=True)
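To make the difference visible, here is a minimal sketch on made-up data (the column name a and the values are just for illustration). It shows the gap left by dropna(), that calling reset_index() without assigning changes nothing, and that the chained version gives a clean 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in the row labelled 3.
df_all = pd.DataFrame({"a": [1.0, 2.0, 3.0, np.nan, 5.0]})

dropped = df_all.dropna()
print(list(dropped.index))  # [0, 1, 2, 4] -> label 3 is gone

# reset_index() returns a new frame; the result is discarded here,
# so `dropped` keeps its gapped index.
dropped.reset_index(drop=True)
still_gapped = list(dropped.index)

# Assigning the chained result gives a contiguous index.
cleaned = df_all.dropna().reset_index(drop=True)
print(list(cleaned.index))  # [0, 1, 2, 3]
```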
You can reset the index to default using set_axis() as well.
df.dropna(inplace=True)
df = df.set_axis(range(len(df)))  # set_axis() no longer accepts inplace as of pandas 2.0
set_axis() is especially useful if you want to reset the index to something other than the default: as long as the lengths match, you can change the index to literally anything with it. For example, you can change it to "first row", "second row", and so on.
df = df.dropna()
df = df.set_axis(['first row', 'second row'])
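A small sketch of both uses of set_axis(), again on made-up data. It also shows what happens when the number of labels does not match the number of rows; the frame and labels here are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; one row is dropped, leaving two.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
df = df.dropna()

# Default-style integer labels.
default = df.set_axis(range(len(df)))

# Custom labels; the list length must equal the number of rows.
labelled = df.set_axis(["first row", "second row"])

# A mismatched length is rejected with a ValueError.
error = None
try:
    df.set_axis(["only one label"])
except ValueError as exc:
    error = exc
```

Note that set_axis() returns a new DataFrame, so the result has to be assigned, same as with reset_index().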
I'm trying to set the index of a df, except it doesn't work:
def save_to_csv(timestamps, values, lower, upper, query):
    df = pd.DataFrame({'Time': timestamps, f'Q50-{format_filename(query)}': values})
    df.set_index('Time')
    df['Time'] = df['Time'].dt.strftime('%Y-%m-%dT%H:%M:%S')
    print(df.tail())
    df.to_csv(f"predictions/pred-Q50-{format_filename(query)}.csv")
and here is the output:
Time query_value
149007 2023-05-15T15:55:00 0.301318
149008 2023-05-15T15:56:00 0.301318
149009 2023-05-15T15:57:00 0.301318
149010 2023-05-15T15:58:00 0.301318
149011 2023-05-15T15:59:00 0.301318
I still have the original index and not the Time column set as index.
Any fix for that?
Actually, when saving to CSV, passing index=False fixed everything!
Pandas is designed for chaining. This means that most operations return a modified version of the dataframe. Rather than reassigning at each line, the chaining format is used. Your call of set_index is returning the dataframe you want, but you aren't reassigning it to df.
This is how it would look with chaining.
def save_to_csv(timestamps, values, lower, upper, query):
    df = (pd.DataFrame({'Time': timestamps, f'Q50-{format_filename(query)}': values})
          # format the timestamps first, while 'Time' is still a regular column
          .assign(Time=lambda x: x['Time'].dt.strftime('%Y-%m-%dT%H:%M:%S'))
          # then move 'Time' into the index
          .set_index('Time')
         )
    print(df.tail())
    df.to_csv(f"predictions/pred-Q50-{format_filename(query)}.csv")
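Here is a self-contained sketch of the same idea that can be run without the asker's helpers. The column name value and the two timestamps are made up, and the CSV is written to an in-memory buffer instead of a file. It shows that a bare set_index() call leaves df untouched, and that once Time is the index, to_csv() writes it as the first column (to_csv() includes the index by default).

```python
import io

import pandas as pd

# Hypothetical data standing in for the timestamps/values arguments.
df = pd.DataFrame({
    "Time": pd.to_datetime(["2023-05-15 15:58:00", "2023-05-15 15:59:00"]),
    "value": [0.301318, 0.301318],
})

df.set_index("Time")        # result discarded, df is unchanged
unchanged = list(df.index)  # still the default integer labels

# Format the column first, then move it into the index, and keep the result.
indexed = (
    df.assign(Time=lambda x: x["Time"].dt.strftime("%Y-%m-%dT%H:%M:%S"))
      .set_index("Time")
)

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
indexed.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```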
I am new to data science and have recently been working with pandas, but I cannot figure out what the following line means:
df1=df1.rename(columns=df1.iloc[0,:]).iloc[1:,:]
The problem states that this is used to make the columns with index 11 as the header, but I can't understand how.
I know what rename does, but I cannot understand what's happening here with the multiple iloc calls.
Just dissect the line, one method at a time:
df1 = # reassign df1 to ...
df1.rename( # the renamed frame of df1 ...
columns = # where column names will use mapper of ...
df1.iloc[0,:] # slice of df1 on row 0, include all columns ...
)
.iloc[1:,:] # the slice of the renamed frame from row 1 forward, include all columns...
Effectively, it uses the first row as the column names and then removes that row from the data. The same thing can be done like this:
df1.columns = df1.iloc[0, :]
df1.drop(0, inplace=True)
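A runnable sketch of the one-liner on a tiny made-up frame (the names and scores are invented). The frame starts with default integer column labels 0 and 1, and its first row holds the real header.

```python
import pandas as pd

# Hypothetical raw data: the header ended up as the first data row.
raw = pd.DataFrame([["name", "score"], ["alice", 90], ["bob", 85]])

# raw.iloc[0, :] is a Series mapping old column labels (0, 1) to the
# values in the first row; rename() uses it as the mapper.
# .iloc[1:, :] then drops that first row from the data.
promoted = raw.rename(columns=raw.iloc[0, :]).iloc[1:, :]

print(list(promoted.columns))  # ['name', 'score']
print(list(promoted.index))    # [1, 2]
```

The remaining rows keep their old labels (1, 2), so a reset_index(drop=True) at the end may be wanted if a 0-based index matters.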
I need to sort a pandas dataframe df by a datetime column my_date. Whenever I use .loc, the sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(df)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference of these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows the sorted dataframe.
In the first snippet, df = df.loc[(df.some_column == 'filter'),] and df.sort_values(by=['my_date']) are two separate statements. The second one returns a sorted copy that is never assigned, so the df you print is still unsorted.
In the second snippet, you chain the calls as df.loc[...].sort_values(...) and assign the result, which is the correct way. You don't have to write df. twice.
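The two cases side by side, as a sketch on invented data (the dates and the "other" value are assumptions; the column names follow the question).

```python
import pandas as pd

# Hypothetical data using the column names from the question.
df = pd.DataFrame({
    "some_column": ["filter", "other", "filter", "filter"],
    "my_date": pd.to_datetime(
        ["2023-03-01", "2023-01-15", "2023-01-01", "2023-02-01"]
    ),
})

# Case 1: the sorted copy is thrown away.
filtered = df.loc[df.some_column == "filter"]
filtered.sort_values(by=["my_date"])
order_before = list(filtered.index)  # [0, 2, 3], original row order

# Case 2: chain and assign, so the sorted result is kept.
sorted_df = df.loc[df.some_column == "filter"].sort_values(by=["my_date"])
order_after = list(sorted_df.index)  # [2, 3, 0], ordered by my_date
```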
There are three ways to reset the index: reset_index(), reset_index() with inplace=True, and manually setting the index as
df.index = list(range(len(df)))
Since inplace is going to be deprecated in pandas 2, which way is better - reset_index() or manual setting and why?
When assigning to the index, the rest of the data in your DataFrame is not changed, just the index.
If you call reset_index, it creates a copy of your original DataFrame, modifies its index, and returns that. You may prefer this if you're chaining method calls (df.reset_index().method2().method3() as opposed to df.index = ...; df.method2().method3()), but for larger DataFrames this becomes inefficient memory-wise.
Direct assignment is preferred in terms of performance, but what you should prefer depends on the situation.
There are several ways:
df = df.reset_index(drop=True)
df.reset_index(drop=True, inplace=True)  # modifies df and returns None, so don't assign the result back to df
The solutions below are faster:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
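A minimal sketch contrasting the two approaches on a made-up frame with a gapped index. It only shows the behavioural difference (new object versus modifying the existing one) and makes no timing claims.

```python
import pandas as pd

# Hypothetical frame whose index has gaps.
df = pd.DataFrame({"a": [10, 20, 30]}, index=[0, 2, 5])

# reset_index() hands back a new DataFrame; df itself is untouched.
fresh = df.reset_index(drop=True)
same_object = fresh is df          # False
old_labels = list(df.index)        # [0, 2, 5]

# Direct assignment replaces the index on the existing object.
df.index = pd.RangeIndex(len(df.index))
new_labels = list(df.index)        # [0, 1, 2]
```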
I have a dataframe that is generated by appending multiple dataframes together into one long dataframe. As shown in the figure, the default index loops over 0~7 because each original df has this index. The total number of rows is 240. So how can I reindex the new df to 0~239 instead of 30 x 0~7?
I tried df.reset_index(drop=True), but it doesn't seem to work. I also tried df.reindex(np.arange(240)), but it returned an error:
ValueError: cannot reindex from a duplicate axis
It seems you forgot to assign the output, because by default reset_index does not work in place:
df = df.reset_index(drop=True)
Or:
df.reset_index(drop=True, inplace=True)
But a better solution (if you use concat) is to add the parameter ignore_index=True:
df = pd.concat([df1, df2, ..., df7], ignore_index=True)
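If you are on an older pandas version that still has DataFrame.append() (it was deprecated in 1.4 and removed in 2.0), you could tell it to ignore the index and assign the result:
df = df1.append(df2, ignore_index=True)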
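To reproduce the situation from the question, here is a sketch that builds 30 made-up frames of 8 rows each (the column name x is an assumption) and concatenates them with and without ignore_index=True.

```python
import pandas as pd

# 30 hypothetical frames, each with the default index 0..7.
parts = [pd.DataFrame({"x": range(8)}) for _ in range(30)]

# Plain concat keeps each piece's labels, so 0..7 repeats 30 times.
stacked = pd.concat(parts)
has_duplicates = not stacked.index.is_unique  # True

# ignore_index=True discards the old labels and numbers rows 0..239.
combined = pd.concat(parts, ignore_index=True)
print(combined.index[0], combined.index[-1])  # 0 239
```

The duplicate labels in stacked are also why reindex() complained in the question: it cannot decide which of the 30 rows labelled 0 is meant.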