I need to sort panda dataframe df, by a datetime column my_date. IWhenever I use .loc sorting does not apply.
df = df.loc[(df.some_column == 'filter'),]
df.sort_values(by=['my_date'])
print(dfolc)
# ...
# Not sorted!
# ...
df = df.loc[(df.some_column == 'filter'),].sort_values(by=['my_date'])
# ...
# sorting WORKS!
What is the difference of these two uses? What am I missing about dataframes?
In the first case, you didn't perform an operation in-place: you should have used either df = df.sort_values(by=['my_date']) or df.sort_values(by=['my_date'], inplace=True).
In the second case, the result of .sort_values() was saved to df, hence printing df shows sorted dataframe.
In the code df = df.loc[(df.some_column == 'filter'),] df.sort_values(by=['my_date']) print(dfolc), you are using df.loc() df.sort_values(), I'm not sure how that works.
In the seconf line, you are calling it correctly df.loc().sort_values(), which is the correct way. You don't have to use the df. notation twice.
Related
I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data to dataframe and selected columns that I need. I've also narrowed it down to row index I need. Now I also need to select rows in other column where objects START with letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df[‘Code’] == pl*
but it won't work due to row indexing. Advise appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df[‘Code’].str.startswith('pl')].copy()
The condition is just a filter, then you need to apply it to the dataframe. as filter you may use the method Series.str.startswith and do
df_pl = df[df['Code'].str.startswith('pl')]
Assume I have a dataframe df which has 4 columns col = ["id","date","basket","gender"] and a function
def is_valid_date(df):
idx = some_scalar_function(df["basket") #returns an index
date = df["date"].values[idx]
return (date>some_date)
I have always understood the groupby as a "creation of a new dataframe" when splitting in the "split-apply-combine" (losely speaking) thus if I want to apply is_valid_date to each group of id, I would assume I could do
df.groupby("id").agg(get_first_date)
but it throws KeyError: 'basket' in the idx=some_scalar_function(df["basket"])
If use GroupBy.agg it working with each column separately, so cannot selecting like df["basket"], df["date"].
Solution is use GroupBy.apply with your custom function:
df.groupby("id").apply(get_first_date)
I have pandas DataFrame df with different types of columns, some values of df are NaN.
To test some assumption, I create copy of df, and transform copied df to (0, 1) with pandas.isnull():
df_copy = df
for column in df_copy:
df_copy[column] = df_copy[column].isnull().astype(int)
but after that BOTH df and df_copy consist of 0 and 1.
Why this code transforms df to 0, 1 and is there way to prevent it?
You can prevent it declaring:
df_copy = df.copy()
This creates a new object. Prior to that you essentially had two pointers to the same object. You also might want to check this answer and note that DataFrames are mutable.
Btw, you could obtain the desired result simply by:
df_copy = df.isnull().astype(int)
even better memory-wise
for column in df:
df[column + 'flag'] = df[column].isnull().astype(int)
I'm not sure how to reset index after dropna(). I have
df_all = df_all.dropna()
df_all.reset_index(drop=True)
but after running my code, row index skips steps. For example, it becomes 0,1,2,4,...
The code you've posted already does what you want, but does not do it "in place." Try adding inplace=True to reset_index() or else reassigning the result to df_all. Note that you can also use inplace=True with dropna(), so:
df_all.dropna(inplace=True)
df_all.reset_index(drop=True, inplace=True)
Does it all in place. Or,
df_all = df_all.dropna()
df_all = df_all.reset_index(drop=True)
to reassign df_all.
You can chain methods and write it as a one-liner:
df = df.dropna().reset_index(drop=True)
You can reset the index to default using set_axis() as well.
df.dropna(inplace=True)
df.set_axis(range(len(df)), inplace=True)
set_axis() is especially useful, if you want to reset the index to something other than the default because as long as the lengths match, you can change the index to literally anything with it. For example, you can change it to first row, second row etc.
df = df.dropna()
df = df.set_axis(['first row', 'second row'])
What is the best way to do iterrows with a subset of a DataFrame?
Let's take the following simple example:
import pandas as pd
df = pd.DataFrame({
'Product': list('AAAABBAA'),
'Quantity': [5,2,5,10,1,5,2,3],
'Start' : [
DT.datetime(2013,1,1,9,0),
DT.datetime(2013,1,1,8,5),
DT.datetime(2013,2,5,14,0),
DT.datetime(2013,2,5,16,0),
DT.datetime(2013,2,8,20,0),
DT.datetime(2013,2,8,16,50),
DT.datetime(2013,2,8,7,0),
DT.datetime(2013,7,4,8,0)]})
df = df.set_index(['Start'])
Now I would like to modify a subset of this DataFrame using the itterrows function, e.g.:
for i, row_i in df[df.Product == 'A'].iterrows():
row_i['Product'] = 'A1' # actually a more complex calculation
However, the changes do not persist.
Is there any possibility (except a manual lookup using the index 'i') to make persistent changes on the original Dataframe ?
Why do you need iterrows() for this? I think it's always preferrable to use vectorized operations in pandas (or numpy):
df.ix[df['Product'] == 'A', "Product"] = 'A1'
I guess the best way that comes to my mind is to generate a new vector with the desired result, where you can loop all you want and then reassign it back to the column
#make a copy of the column
P = df.Product.copy()
#do the operation or loop if you really must
P[ P=="A" ] = "A1"
#reassign to original df
df["Product"] = P